
Spark on Hadoop integration with Jupyter
For several years, the Jupyter notebook has established itself as the go-to notebook solution in the Python ecosystem. Historically, Jupyter is the tool of choice for data scientists who develop primarily in Python. Over the years, Jupyter has evolved and now offers a wide range of features thanks to its extensions. Another of Jupyter's main advantages is its ease of deployment.
More and more Spark developers favor Python over Scala for the rapid development of their jobs.
In this article, we will see how to connect a Jupyter server to a Spark cluster running on Hadoop YARN secured with Kerberos.
How to install Jupyter?
We cover two methods of connecting Jupyter to a Spark cluster:
- Configure a script to launch a Jupyter instance that will have a Python Spark interpreter.
- Connect Jupyter notebook to a Spark cluster via the Sparkmagic extension.
Method 1: Create a startup script
Prerequisites:
- Have access to a Spark cluster machine, usually a head node or an edge node;
- Have an environment (Conda, Mamba, virtualenv, ...) with the jupyter package. Example with Conda:
conda create -n pysparktest python=3.7 jupyter
Create a script in your /home directory and insert the following code, changing the paths to match your environment:
#! /bin/bash
export PYSPARK_PYTHON=/home/adaltas/.conda/envs/pysparktest/bin/python
export PYSPARK_DRIVER_PYTHON=/home/adaltas/.conda/envs/pysparktest/bin/ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=10.10.20.11 --port=8888"
pyspark \
--master yarn \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=false \
--driver-cores 2 --driver-memory 11136m \
--executor-cores 3 --executor-memory 7424m --num-executors 10
Running this script creates a Jupyter server that can be used to develop your Spark jobs.
The main advantages of this solution:
- Quick to set up;
- No need to edit the cluster configuration;
- Customization of the Spark environment per user;
- Access to the local environment of the edge node from which the server was launched;
- A dedicated environment per user, avoiding problems related to server overload.
The main disadvantages of this solution:
- Learning curve for users (over-allocating resources, poor configuration, etc.);
- Need to have access to a cluster edge node;
- The user has only one Python interpreter (which is PySpark);
- Only one environment (Conda or other) available per server.
Method 2: Connect Jupyter to a Spark cluster via Sparkmagic
What is Sparkmagic?
Sparkmagic is a Jupyter extension that allows you to launch Spark jobs through Livy.
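Under the hood, Sparkmagic drives Livy's REST API: starting a kernel amounts to a POST to Livy's /sessions endpoint. The sketch below shows a plausible request body; the user name and resource values are illustrative assumptions, not values required by Livy.

```python
import json

# Sketch of the request body Sparkmagic sends to Livy's POST /sessions
# endpoint. The user name and resource values are illustrative.
payload = {
    "kind": "pyspark",                       # pyspark, spark (Scala) or sparkr
    "proxyUser": "alice",                    # end user to impersonate (example)
    "driverMemory": "1000M",
    "executorCores": 2,
    "conf": {"spark.master": "yarn-cluster"},
}

body = json.dumps(payload)
# A real client would POST `body` to http://<livy-host>:8998/sessions with
# Content-Type: application/json, then poll GET /sessions/<id> until the
# session state becomes "idle".
print(body)
```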
Prerequisites:
- Have a Spark cluster with Livy and Spark available (for reference: HDP, CDP or TDP);
- Have a Jupyter server. JupyterHub is used for this demonstration;
- Have configured user impersonation on the cluster.
Create the jupyter user within the cluster
In this test, users are managed via FreeIPA on a Kerberized HDP cluster.
Creation of the jupyter user:
ipa user-add jupyter
Creation of the password:
ipa passwd jupyter
Verify that the user can obtain a Kerberos ticket on one of the cluster edge nodes and that authentication works as expected:
kinit jupyter
curl --negotiate -u : -i -X PUT
"http://edge01.local:9870/webhdfs/v1/user/vagrant/test?doas=vagrant&op=MKDIRS"
Note that the command above creates a /user/vagrant/test directory in HDFS. It requires administrator-type permissions via the impersonation described in the next section.
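The curl command above can be decomposed programmatically. As a sketch, the helper below builds the same WebHDFS URL; the host, path, and user are the ones used in this demo, and the function name is hypothetical.

```python
from urllib.parse import urlencode

def webhdfs_mkdirs_url(host, path, doas):
    # The doas parameter asks the NameNode to run the operation as
    # another user; this only succeeds if the authenticated caller is
    # allowed to impersonate, per the cluster's proxyuser settings.
    query = urlencode({"doas": doas, "op": "MKDIRS"})
    return "http://{}:9870/webhdfs/v1{}?{}".format(host, path, query)

url = webhdfs_mkdirs_url("edge01.local", "/user/vagrant/test", "vagrant")
print(url)
```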
Finally, check that the jupyter user is part of the sudo group on the server where Jupyter will be installed.
User impersonation for the Jupyter user
Since we are dealing with a Kerberized HDP cluster, we need to enable impersonation for the jupyter user. To do this, edit the core-site.xml file:
<property>
  <name>hadoop.proxyuser.jupyter.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.jupyter.groups</name>
  <value>*</value>
</property>
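A quick way to sanity-check these properties is to parse the fragment with a few lines of stdlib Python, wrapped in the <configuration> root element Hadoop expects. This only verifies the XML is well-formed; it does not confirm Hadoop has reloaded it.

```python
import xml.etree.ElementTree as ET

# Parse a core-site.xml fragment and collect the property name/value pairs.
fragment = """
<configuration>
  <property>
    <name>hadoop.proxyuser.jupyter.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.jupyter.groups</name>
    <value>*</value>
  </property>
</configuration>
"""

root = ET.fromstring(fragment)
props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
print(props)
```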
Install and activate the Sparkmagic extension
As stated in the documentation, you can use the following commands to install the extension:
pip install sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
pip3 show sparkmagic
cd /usr/local/lib/python3.6/site-packages
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
This example uses pip, but it also works with other Python package managers.
Sparkmagic configuration
Sparkmagic requires each user to have the following:
- A .sparkmagic directory in each user's /home/ directory;
- A custom config.json file in the user's .sparkmagic directory.
Here is an example config.json
file:
{
"kernel_python_credentials":{
"username":"{{ username }}",
"url":"http://master02.cdp.local:8998",
"auth":"Kerberos"
},
"kernel_scala_credentials":{
"username":"{{ username }}",
"url":"http://master02.cdp.local:8998",
"auth":"Kerberos"
},
"kernel_r_credentials":{
"username":"{{ username }}",
"url":"http://master02.cdp.local:8998",
"auth":"Kerberos"
},
"logging_config":{
"version":1,
"formatters":{
"magicsFormatter":{
"format":"%(asctime)s\t%(levelname)s\t%(message)s",
"datefmt":""
}
},
"handlers":{
"magicsHandler":{
"class":"hdijupyterutils.filehandler.MagicsFileHandler",
"formatter":"magicsFormatter",
"home_path":"~/.sparkmagic"
}
},
"loggers":{
"magicsLogger":{
"handlers":[
"magicsHandler"
],
"level":"DEBUG",
"propagate":0
}
}
},
"authenticators":{
"Kerberos":"sparkmagic.auth.kerberos.Kerberos",
"None":"sparkmagic.auth.customauth.Authenticator",
"Basic_Access":"sparkmagic.auth.basic.Basic"
},
"wait_for_idle_timeout_seconds":15,
"livy_session_startup_timeout_seconds":60,
"fatal_error_suggestion":"The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Sparkmagic library is configured correctly.\nc) Restart the kernel.",
"ignore_ssl_errors":false,
"session_configs":{
"driverMemory":"1000M",
"executorCores":2,
"conf":{
"spark.master":"yarn-cluster"
},
"proxyUser":"jupyter"
},
"use_auto_viz":true,
"coerce_dataframe":true,
"max_results_sql":2500,
"pyspark_dataframe_encoding":"utf-8",
"heartbeat_refresh_seconds":30,
"livy_server_heartbeat_timeout_seconds":0,
"heartbeat_retry_seconds":10,
"server_extension_default_kernel_name":"pysparkkernel",
"custom_headers":{
},
"retry_policy":"configurable",
"retry_seconds_to_sleep_list":[
0.2,
0.5,
1,
3,
5
]
}
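Sparkmagic cannot start a kernel when ~/.sparkmagic/config.json is malformed, and JSON notably forbids trailing commas. The small helper below is a hypothetical sanity check built on the stdlib json module, not part of Sparkmagic itself.

```python
import json

# Validate a config.json candidate: parse it and report missing top-level
# keys. json.loads raises ValueError on malformed input.
def validate_sparkmagic_config(text, required=("kernel_python_credentials",)):
    cfg = json.loads(text)
    missing = [key for key in required if key not in cfg]
    return cfg, missing

good = '{"kernel_python_credentials": {"auth": "Kerberos"}}'
cfg, missing = validate_sparkmagic_config(good)
print(missing)

bad = '{"retry_seconds_to_sleep_list": [0.2, 0.5,],}'  # trailing commas
try:
    validate_sparkmagic_config(bad)
except ValueError as err:
    print("rejected:", err)
```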
Edit /etc/jupyterhub/jupyterhub_config.py
– SparkMagic
In my case, I decided to change the /etc/jupyterhub/jupyterhub_config.py file to automate some processes related to SparkMagic:
- Creating the .sparkmagic folder in the /home/ directory of each new user;
- Generating the config.json file.
c.LDAPAuthenticator.create_user_home_dir = True
import os
import jinja2
from pathlib import Path

def config_spark_magic(spawner):
    username = spawner.user.name
    # Render /etc/jupyterhub/config.json.template with the user's name
    templateLoader = jinja2.FileSystemLoader(searchpath="/etc/jupyterhub/")
    templateEnv = jinja2.Environment(loader=templateLoader)
    TEMPLATE_FILE = "config.json.template"
    tm = templateEnv.get_template(TEMPLATE_FILE)
    msg = tm.render(username=username)
    # Write the rendered config into the user's .sparkmagic directory
    path = "/home/" + username + "/.sparkmagic/"
    Path(path).mkdir(mode=0o777, parents=True, exist_ok=True)
    with open(path + "config.json", "w") as outfile:
        outfile.write(msg)
    os.popen('sh /etc/jupyterhub/install_jupyterhub.sh ' + username)

c.Spawner.pre_spawn_hook = config_spark_magic
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_hosts = ['ipa.cdp.local']
c.LDAPAuthenticator.server_port = 636
c.LDAPAuthenticator.server_use_ssl = True
c.LDAPAuthenticator.server_pool_strategy = 'FIRST'
c.LDAPAuthenticator.bind_user_dn = 'uid=admin,cn=users,cn=accounts,dc=cdp,dc=local'
c.LDAPAuthenticator.bind_user_password = 'passWord1'
c.LDAPAuthenticator.user_search_base = 'cn=users,cn=accounts,dc=cdp,dc=local'
c.LDAPAuthenticator.user_search_filter = '(&(objectClass=person)(uid={username}))'
c.LDAPAuthenticator.user_membership_attribute = 'memberOf'
c.LDAPAuthenticator.group_search_base = 'cn=groups,cn=accounts,dc=cdp,dc=local'
c.LDAPAuthenticator.group_search_filter = '(&(objectClass=ipausergroup)(memberOf={group}))'
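The pre-spawn hook above can be exercised outside JupyterHub. The sketch below replaces jinja2 with the stdlib string.Template ($username standing in for the {{ username }} placeholder) and writes into a temporary directory instead of /home; the template content is a hypothetical subset of config.json.template.

```python
import json
import tempfile
from pathlib import Path
from string import Template

# Hypothetical stand-in for /etc/jupyterhub/config.json.template.
TEMPLATE = '{"kernel_python_credentials": {"username": "$username", "auth": "Kerberos"}}'

def render_sparkmagic_config(username, home_root):
    # Mimic config_spark_magic: create <home_root>/<user>/.sparkmagic/config.json
    target = Path(home_root) / username / ".sparkmagic"
    target.mkdir(parents=True, exist_ok=True)
    out = target / "config.json"
    out.write_text(Template(TEMPLATE).substitute(username=username))
    return out

with tempfile.TemporaryDirectory() as root:
    cfg_path = render_sparkmagic_config("alice", root)
    cfg = json.loads(cfg_path.read_text())
    print(cfg["kernel_python_credentials"]["username"])
```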
The main advantages of this solution:
- Three interpreters via Sparkmagic (Python, Scala and R);
- Customization of Spark resources via the config.json file;
- No need for users to have physical access to the cluster;
- Ability to have multiple Python environments available;
- Connection of JupyterHub with an LDAP directory.
The main disadvantages of this solution:
- Learning curve for users (over-allocating resources, poor configuration, etc.);
- Modification of the HDP/CDP configuration;
- More complex deployment;
- A notebook has only one interpreter.
Conclusion
If you develop Spark jobs and legacy solutions such as Zeppelin no longer suit you, or you are limited by their versions, you can now set up a Jupyter server at low cost to develop your jobs while taking advantage of the resources of your clusters.