Spark on Hadoop integration with Jupyter

For several years, Jupyter Notebook has established itself as the reference notebook solution in the Python ecosystem. Historically, it is the go-to tool for data scientists who develop primarily in Python. Over the years, Jupyter has gained a wide range of features thanks to its plugins. One of its main advantages is also its ease of deployment.

More and more Spark developers favor Python over Scala in order to develop their jobs more rapidly.

In this article, we will see how to connect a Jupyter server to a Spark cluster running on Hadoop YARN and secured with Kerberos.

How to install Jupyter?

We cover two methods of connecting Jupyter to a Spark cluster:

  1. Configure a script to launch a Jupyter instance that will have a Python Spark interpreter.
  2. Connect Jupyter notebook to a Spark cluster via the Sparkmagic extension.

Method 1: Create a startup script

Prerequisites:

  • Have access to a Spark cluster machine, usually a head node or an edge node;
  • Have an environment (Conda, Mamba, virtualenv, ...) with the jupyter package. Example with Conda: conda create -n pysparktest python=3.7 jupyter.

Create a script in your /home directory and insert the following code, adjusting the paths to match your environment:

#!/bin/bash

# Python interpreter used by the Spark executors
export PYSPARK_PYTHON=/home/adaltas/.conda/envs/pysparktest/bin/python
# Python interpreter used by the Spark driver (IPython here)
export PYSPARK_DRIVER_PYTHON=/home/adaltas/.conda/envs/pysparktest/bin/ipython3
# Launch the driver as a Jupyter notebook server instead of an interactive shell
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=10.10.20.11 --port=8888"

# Start PySpark on YARN with static resource allocation
pyspark \
  --master yarn \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=false \
  --driver-cores 2 --driver-memory 11136m \
  --executor-cores 3 --executor-memory 7424m --num-executors 10

Running this script creates a Jupyter server that can be used to develop your Spark jobs.
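
Once the notebook is open, a quick sanity check is to run a trivial job from a cell. A minimal sketch (the spark session and sc context are created automatically by the PySpark shell):

# Run in a notebook cell: PySpark already provides `spark` and `sc`
df = spark.range(0, 1000)                            # small DataFrame computed on the cluster
print(df.selectExpr("sum(id) AS total").collect())   # [Row(total=499500)]
print(spark.sparkContext.master)                     # should print "yarn"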

The main advantages of this solution:

  • Quick to set up;
  • No need to modify the cluster configuration;
  • Customization of the Spark environment by each user;
  • Access to the local environment of the edge node from which the server was launched;
  • A dedicated environment per user, which avoids problems related to server overload.

The main disadvantages of this solution:

  • Risk of misuse (using too many resources, poor configuration, etc.);
  • Need to have access to a cluster edge node;
  • The user has only one Python interpreter available (PySpark);
  • Only one environment (Conda or other) available per server.

Method 2: Connect Jupyter to a Spark cluster via Sparkmagic

What is Sparkmagic?

Sparkmagic is a Jupyter extension that allows you to launch Spark jobs through Livy.
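
Under the hood, Sparkmagic never talks to Spark directly: each notebook cell is sent to Livy's REST API, which creates and drives a remote Spark session on the cluster. A minimal sketch of that exchange (assuming the Kerberized Livy endpoint used later in this article and the requests-kerberos package):

import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

livy = "http://master02.cdp.local:8998"   # Livy endpoint, also used in config.json below
auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

# Create a PySpark session, which is what Sparkmagic does when a kernel starts
session = requests.post(livy + "/sessions",
                        json={"kind": "pyspark", "proxyUser": "jupyter"},
                        auth=auth).json()

# Each notebook cell then becomes a statement posted to that session
requests.post(livy + "/sessions/" + str(session["id"]) + "/statements",
              json={"code": "print(sc.version)"},
              auth=auth)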

Prerequisites:

  • Have a cluster with Livy and Spark available (for example: HDP, CDP, or TDP);
  • Have a Jupyter server. JupyterHub is used for this demonstration;
  • Have user impersonation configured on the cluster.

Create the jupyter user within the cluster

In this test, users are managed via FreeIPA on a Kerberized HDP cluster.

Creation of the jupyter user:

ipa user-add

Creation of the password:

ipa passwd jupyter

Verify that the user can authenticate from one of the cluster's edge nodes and that identification works as expected:

kinit jupyter
curl --negotiate -u : -i -X PUT "http://edge01.local:9870/webhdfs/v1/user/vagrant/test?doas=vagrant&op=MKDIRS"

Note that the command above creates a /user/vagrant/test directory in HDFS. It requires administrator-level permissions, granted through the impersonation described in the next section.
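
The same impersonation check can be scripted in Python, for instance to validate the setup later from the JupyterHub host. A sketch assuming the requests and requests-kerberos packages and a valid ticket in the Kerberos cache (kinit jupyter):

import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Same call as the curl command above: the jupyter principal asks WebHDFS
# to create a directory on behalf of the vagrant user via the doas parameter.
resp = requests.put("http://edge01.local:9870/webhdfs/v1/user/vagrant/test",
                    params={"doas": "vagrant", "op": "MKDIRS"},
                    auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL))
print(resp.status_code, resp.json())   # expect 200 and {"boolean": true}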

Finally, check that the jupyter user is indeed part of the sudo group on the server where Jupyter will be installed.

User impersonation for the Jupyter user

Since we are working with a Kerberized HDP cluster, we need to enable impersonation for the jupyter user.

To do this, edit the core-site.xml file (and restart the impacted HDFS services for the change to take effect):

<property>
    <name>hadoop.proxyuser.jupyter.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.jupyter.groups</name>
    <value>*</value>
</property>

Install and activate the Sparkmagic extension

As stated in the documentation, you can use the following commands to install the extension:

pip install sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
pip3 show sparkmagic
cd /usr/local/lib/python3.6/site-packages
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

This example uses pip but it also works with other Python package managers.
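
Once the kernels are registered, the extension can also be used from a regular Python kernel. A short example of what a first notebook cell could look like (both magics are provided by the sparkmagic package):

# In a notebook running the plain Python kernel:
# %load_ext loads the %spark line and cell magics,
# %manage_spark opens a widget to create and manage Livy sessions.
%load_ext sparkmagic.magics
%manage_spark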

Sparkmagic configuration

Sparkmagic requires each user to have the following:

  • A .sparkmagic directory at the root of each user's /home/ directory;
  • A custom config.json file in the user's .sparkmagic directory.

Here is an example config.json file:

{
   "kernel_python_credentials":{
      "username":"{{ username }}",
      "url":"http://master02.cdp.local:8998",
      "auth":"Kerberos"
   },
   "kernel_scala_credentials":{
      "username":"{{ username }}",
      "url":"http://master02.cdp.local:8998",
      "auth":"Kerberos"
   },
   "kernel_r_credentials":{
      "username":"{{ username }}",
      "url":"http://master02.cdp.local:8998",
      "auth":"Kerberos"
   },
   "logging_config":{
      "version":1,
      "formatters":{
         "magicsFormatter":{
            "format":"%(asctime)s\t%(levelname)s\t%(message)s",
            "datefmt":""
         }
      },
      "handlers":{
         "magicsHandler":{
            "class":"hdijupyterutils.filehandler.MagicsFileHandler",
            "formatter":"magicsFormatter",
            "home_path":"~/.sparkmagic"
         }
      },
      "loggers":{
         "magicsLogger":{
            "handlers":[
               "magicsHandler"
            ],
            "level":"DEBUG",
            "propagate":0
         }
      }
   },
   "authenticators":{
      "Kerberos":"sparkmagic.auth.kerberos.Kerberos",
      "None":"sparkmagic.auth.customauth.Authenticator",
      "Basic_Access":"sparkmagic.auth.basic.Basic"
   },
   "wait_for_idle_timeout_seconds":15,
   "livy_session_startup_timeout_seconds":60,
   "fatal_error_suggestion":"The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Sparkmagic library is configured correctly.\nc) Restart the kernel.",
   "ignore_ssl_errors":false,
   "session_configs":{
      "driverMemory":"1000M",
      "executorCores":2,
      "conf":{
         "spark.master":"yarn-cluster"
      },
      "proxyUser":"jupyter"
   },
   "use_auto_viz":true,
   "coerce_dataframe":true,
   "max_results_sql":2500,
   "pyspark_dataframe_encoding":"utf-8",
   "heartbeat_refresh_seconds":30,
   "livy_server_heartbeat_timeout_seconds":0,
   "heartbeat_retry_seconds":10,
   "server_extension_default_kernel_name":"pysparkkernel",
   "custom_headers":{

   },
   "retry_policy":"configurable",
   "retry_seconds_to_sleep_list":[
      0.2,
      0.5,
      1,
      3,
      5
   ]
}
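
The session_configs block only defines defaults. Individual notebooks can override them before the Livy session is created with the %%configure cell magic provided by Sparkmagic, for example (values here are illustrative):

%%configure -f
{"driverMemory": "2000M", "executorCores": 4}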

Edit /etc/jupyterhub/jupyterhub_config.py – SparkMagic

In my case, I decided to modify the /etc/jupyterhub/jupyterhub_config.py file to automate some processes related to Sparkmagic:

  • The creation of the .sparkmagic folder in the /home/ directory of each new user;
  • The generation of the config.json file.

import os
import jinja2
from pathlib import Path

# Create the user's home directory at first login (needed for .sparkmagic)
c.LDAPAuthenticator.create_user_home_dir = True

def config_spark_magic(spawner):
    """Render a per-user Sparkmagic config.json before the notebook server starts."""
    username = spawner.user.name

    # Load the Jinja2 template stored next to the JupyterHub configuration
    templateLoader = jinja2.FileSystemLoader(searchpath="/etc/jupyterhub/")
    templateEnv = jinja2.Environment(loader=templateLoader)
    tm = templateEnv.get_template("config.json.template")
    msg = tm.render(username=username)

    # Write the rendered configuration into the user's ~/.sparkmagic directory
    path = "/home/" + username + "/.sparkmagic/"
    Path(path).mkdir(mode=0o777, parents=True, exist_ok=True)
    with open(path + "config.json", "w") as outfile:
        outfile.write(msg)

    # Additional per-user setup handled by a custom script
    os.popen('sh /etc/jupyterhub/install_jupyterhub.sh ' + username)

# Run the hook above each time JupyterHub spawns a user's server
c.Spawner.pre_spawn_hook = config_spark_magic

# LDAP authentication against FreeIPA
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_hosts = ['ipa.cdp.local']
c.LDAPAuthenticator.server_port = 636
c.LDAPAuthenticator.server_use_ssl = True
c.LDAPAuthenticator.server_pool_strategy = 'FIRST'
c.LDAPAuthenticator.bind_user_dn = 'uid=admin,cn=users,cn=accounts,dc=cdp,dc=local'
c.LDAPAuthenticator.bind_user_password = 'passWord1'
c.LDAPAuthenticator.user_search_base = 'cn=users,cn=accounts,dc=cdp,dc=local'
c.LDAPAuthenticator.user_search_filter = '(&(objectClass=person)(uid={username}))'
c.LDAPAuthenticator.user_membership_attribute = 'memberOf'
c.LDAPAuthenticator.group_search_base = 'cn=groups,cn=accounts,dc=cdp,dc=local'
c.LDAPAuthenticator.group_search_filter = '(&(objectClass=ipausergroup)(memberOf={group}))'
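
Since the config.json example shown earlier already contains the {{ username }} Jinja2 placeholders, it can serve as /etc/jupyterhub/config.json.template. A small standalone sketch to check that the template renders valid JSON before wiring it into the pre_spawn_hook:

import json
import jinja2

# Render the template for a test user and make sure the result parses as JSON
# (a stray trailing comma, for instance, would break Sparkmagic at startup).
env = jinja2.Environment(loader=jinja2.FileSystemLoader("/etc/jupyterhub/"))
rendered = env.get_template("config.json.template").render(username="jupyter")
json.loads(rendered)
print("config.json.template renders valid JSON")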

The main advantages of this solution:

  • Three interpreters available via Sparkmagic (Python, Scala, and R);
  • Customization of Spark resources via the config.json file;
  • No need for users to have physical access to the cluster;
  • Ability to have multiple Python environments available;
  • Integration of JupyterHub with an LDAP directory.

The main disadvantages of this solution:

  • Risk of misuse (using too many resources, poor configuration, etc.);
  • Modification of the HDP/CDP configuration;
  • More complex deployment;
  • A notebook has only one interpreter.

Conclusion

If you develop Spark jobs and a legacy solution such as Zeppelin no longer suits you, or you are limited by its version, you can now set up a Jupyter server at little cost to develop your jobs while taking advantage of your cluster's resources.
