lib.jar: Java library? Python package? Both?

I’ve recently started working quite a bit with Spark and have found that there’s not much guidance on best practices for packaging and deploying libraries and apps on Spark. I’m planning to write a series of posts on Spark packaging and app deployment as we find patterns that work for the data platform at Mozilla.

Spark is written in Scala, but provides client libraries for Scala, Java, Python, and a few other languages. At Mozilla, we tend to write our large-scale ETL jobs in Scala, but most of the folks interacting with Spark are doing so in Python via notebooks like Jupyter, so we generally need to support libraries in both Scala and Python. This post focuses on how Python bindings can be packaged and deployed alongside Java/Scala code.

A Java jar file is just a ZIP archive

To make a Java or Scala library available to a Spark application, you can either bundle that dependency into the “uberjar” file containing your application code using a dependency management tool like sbt or mvn (which is Spark’s recommended method for pacakaging an application) or you can place a jar on the Spark server nodes and make sure that jar is on the classpath. But keep in mind that the Java jar format is simply a ZIP archive with a specified structure.

The Java distribution comes with a jar executable that can be used for looking inside or extracting a jar, but it’s not strictly necessary. unzip mylib.jar works just as well as jar xf mylib.jar.

Python libraries can be deployed as zips

The usual method for installing a Python library is via pip which can handle downloading and installing a lib to an appropriate place on the filesystem such that the Python interpreter’s import mechanism will find it. We can certainly do this for distributing the Python bindings of a Spark package, but that means we end up having to maintain the dependency in two places, keep them in sync, and make sure both the jar and the Python binding are properly installed and available on all nodes. We found this difficult to accomplish in practice, so let’s consider what other options we have.

The more manual way to install a Python package is to add entries to sys.path in your Python session (usually via the PYTHONPATH environment variable) to point at directories in your filesystem containing python code.

Since Python 2.3, this sys.path mechanism for import searching has been extended with the zipimport module to allow reaching into ZIP archives as well as directories.

So you can create a file lib.zip with the following contents:

mylib/
  __init__.py

and if you add that to sys.path, you can now import mylib:

import sys
sys.path.append('/path/to/lib.zip')

import mylib

Do you see where this is leading?

Embedding a Python package in a jar

Since both Java libraries and Python packages can be distributed as ZIP archives, there’s nothing stopping us from using a single file as both.

If we have a working lib.jar containing our Java code, we can unzip it, add mylib/ at the root, and zip it back up. We can use this method to create a single file containing both Java/Scala code and relevant Python bindings. That single file can be added both to Spark’s classpath and to Python’s sys.path.

We now have a simple method for deploying all the code relevant to a Spark package as a single file. There’s no longer an opportunity for version skew between the python bindings and the Java or Scala implementation. Nor do we have to worry about confused PySpark users downloading a package via pip only to find that it raises errors without having the corresponding Java jar in place.

Building and Distributing a Spark Package

This method of shipping Python bindings by packaging into the jar is exactly the approach taken by Databricks' sbt-spark-package. If you’re building a spark library in Scala using sbt, that plugin will look for a python directory and include its contents when packaging a jar. The output can be uploaded to spark-packages.org and pulled into a Spark or PySpark session via the --packages command line option.

In Mozilla’s case, we didn’t want to rely on uptime of spark-packages.org so we instead elected to build an uberjar of spark packages (see telemetry-spark-packages-assembly), deploy that under /usr/lib/spark/jars to all nodes when bootstrapping a Spark cluster, and also add that the jar path to the PYTHONPATH environment variable.

Enabling Spark packages on EMR

We run Spark via Amazon’s EMR product and one additional issue we ran into is that while the --packages command line option on EMR adds the downloaded jar to sys.path in pyspark sessions, that path is not valid on the master node, so attempts to import any embedded Python packages will fail.

We worked around this by adding a PYTHONSTARTUP script that corrects the path.