A Platform for Interactive Data Science with Apache Spark for On-premises Infrastructure.

Rafal Lokuciejewski, Dominik Schüssele, Florian Wilhelm, Sven Groppe


Various cloud providers offer integrated platforms for interactive development in notebooks for processing and
analysis of Big Data on large compute clusters. Such platforms enable users to easily leverage frameworks
like Apache Spark as well as to manage cluster resources. However, Data Scientists and Engineers lack
a similar holistic solution when working with on-premises infrastructure. In particular, a central point
of administration to access a notebook's UI, manage notebook kernels, allocate resources for frameworks like
Apache Spark, or monitor cluster workloads is currently missing for on-premises infrastructure. To
overcome these issues and provide on-premises users with a platform for interactive development, we propose
a cross-cluster architecture resulting from an extensive requirements engineering process. Based on
open-source components, the designed platform provides an intuitive Web-UI that enables users to easily access
notebooks, manage custom kernel-environments as well as monitor cluster resources and current workloads.
Besides an admin panel for user restrictions, the platform provides isolation of user workloads and scalability
by design. The designed platform is evaluated against prior on-premises solutions as well as from a user
perspective by utilizing the User Experience Questionnaire, an independent benchmark tool for interactive

Original language: English
Publication status: Published - 2021

Research Areas and Centers

  • Centers: Center for Artificial Intelligence Luebeck (ZKIL)
  • Research Area: Intelligent Systems

