I’ve recently started working quite a bit with Spark and have found that there’s not much guidance on best practices for packaging and deploying libraries and apps on Spark. I’m planning to write a series of posts on Spark packaging and app deployment as we find patterns that work for the data platform at Mozilla.
Spark is written in Scala, but provides client libraries for Scala, Java, Python, and a few other languages. At Mozilla, we tend to write our large-scale ETL jobs in Scala, but most of the folks interacting with Spark are doing so in Python via notebooks like Jupyter, so we generally need to support libraries in both Scala and Python. This post focuses on how Python bindings can be packaged and deployed alongside Java/Scala code.
A Java jar file is just a ZIP archive
To make a Java or Scala library available to a Spark application, you can either bundle that dependency into the “uberjar” file containing your application code using a dependency management tool like sbt or mvn (which is Spark’s recommended method for packaging an application), or you can place a jar on the Spark server nodes and make sure that jar is on the classpath. But keep in mind that the Java jar format is simply a ZIP archive with a specified structure. The Java distribution comes with a jar executable that can be used for looking inside or extracting a jar, but it’s not strictly necessary: unzip mylib.jar works just as well as jar xf mylib.jar.
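As a quick illustration of that point (the filename mylib.jar is just a placeholder for this sketch), Python’s standard zipfile module will happily open a jar as an ordinary ZIP archive:

import zipfile

# A jar is just a ZIP archive: open it with zipfile and list its entries.
# Assumes a file named mylib.jar in the current directory (illustrative).
with zipfile.ZipFile('mylib.jar') as jar:
    for name in jar.namelist():
        print(name)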
Python libraries can be deployed as zips
The usual method for installing a Python library is via pip, which can handle downloading and installing a lib to an appropriate place on the filesystem such that the Python interpreter’s import mechanism will find it. We can certainly do this for distributing the Python bindings of a Spark package, but that means we end up having to maintain the dependency in two places, keep them in sync, and make sure both the jar and the Python binding are properly installed and available on all nodes. We found this difficult to accomplish in practice, so let’s consider what other options we have.
The more manual way to install a Python package is to add entries to sys.path in your Python session (usually via the PYTHONPATH environment variable) to point at directories in your filesystem containing Python code. Since Python 2.3, this sys.path mechanism for import searching has been extended with the zipimport module to allow reaching into ZIP archives as well as directories.
So you can create a file lib.zip with the following contents:

mylib/
    __init__.py

and if you add that to sys.path, you can now import mylib:
import sys
sys.path.append('/path/to/lib.zip')
import mylib
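If you need to build such an archive in the first place, a minimal sketch (assuming a local mylib/ directory containing an __init__.py) could use the same zipfile module:

import os
import zipfile

# Bundle the mylib package into lib.zip.
# Assumes a local mylib/ directory with an __init__.py inside (illustrative).
with zipfile.ZipFile('lib.zip', 'w') as zf:
    for root, _dirs, files in os.walk('mylib'):
        for filename in files:
            zf.write(os.path.join(root, filename))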
Do you see where this is leading?
Embedding a Python package in a jar
Since both Java libraries and Python packages can be distributed as ZIP archives, there’s nothing stopping us from using a single file as both. If we have a working lib.jar containing our Java code, we can unzip it, add mylib/ at the root, and zip it back up. We can use this method to create a single file containing both Java/Scala code and relevant Python bindings. That single file can be added both to Spark’s classpath and to Python’s sys.path.
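In a PySpark session, that looks roughly like this (the path below is a placeholder; the jar also has to be on Spark’s classpath, for example via the --jars option when launching):

import sys

# One archive, two roles: the jar is on Spark's classpath (e.g. passed via
# --jars at launch time), and the same file is appended to sys.path so the
# bundled Python package is importable. The path is a placeholder.
sys.path.append('/path/to/lib.jar')
import mylib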
We now have a simple method for deploying all the code relevant to a Spark package as a single file. There’s no longer an opportunity for version skew between the Python bindings and the Java or Scala implementation. Nor do we have to worry about confused PySpark users downloading a package via pip only to find that it raises errors without having the corresponding Java jar in place.
Building and Distributing a Spark Package
This method of shipping Python bindings by packaging them into the jar is exactly the approach taken by Databricks’ sbt-spark-package. If you’re building a Spark library in Scala using sbt, that plugin will look for a python directory and include its contents when packaging a jar. The output can be uploaded to spark-packages.org and pulled into a Spark or PySpark session via the --packages command line option.
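When building a session from scratch, the same thing can be expressed in code via the spark.jars.packages configuration property; the Maven coordinates below are placeholders, not a real package:

from pyspark.sql import SparkSession

# Equivalent of the --packages command line option, set from Python code.
# The coordinates below are placeholders for a published Spark package.
spark = (SparkSession.builder
         .config("spark.jars.packages", "com.example:mylib_2.11:0.1.0")
         .getOrCreate())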
In Mozilla’s case, we didn’t want to rely on the uptime of spark-packages.org, so we instead elected to build an uberjar of Spark packages (see telemetry-spark-packages-assembly), deploy that under /usr/lib/spark/jars to all nodes when bootstrapping a Spark cluster, and also add that jar path to the PYTHONPATH environment variable.
Enabling Spark packages on EMR
We run Spark via Amazon’s EMR product, and one additional issue we ran into is that while the --packages command line option on EMR adds the downloaded jar to sys.path in pyspark sessions, that path is not valid on the master node, so attempts to import any embedded Python packages will fail. We worked around this by adding a PYTHONSTARTUP script that corrects the path.
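Purely as an illustration (the details of our actual startup script differ), such a PYTHONSTARTUP script might rewrite jar entries in sys.path that don’t exist on the local filesystem to point at the directory where the assembly jar is actually deployed:

# Illustrative PYTHONSTARTUP sketch: rewrite jar entries in sys.path that
# don't exist on this node to point at the locally deployed jar directory.
# The /usr/lib/spark/jars location matches the deployment described above;
# the rewriting logic itself is hypothetical.
import os
import sys

JAR_DIR = '/usr/lib/spark/jars'

sys.path = [
    os.path.join(JAR_DIR, os.path.basename(entry))
    if entry.endswith('.jar') and not os.path.exists(entry)
    else entry
    for entry in sys.path
]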