lib.jar: Java library? Python package? Both?
I’ve recently started working quite a bit with Spark and have found that there’s not much guidance on best practices for packaging and deploying libraries and apps on Spark. I’m planning to write a series of posts on Spark packaging and app deployment as we find patterns that work for the data platform at Mozilla.
Spark is written in Scala, but provides client libraries for Scala, Java, Python, and a few other languages. At Mozilla, we tend to write our large-scale ETL jobs in Scala, but most of the folks interacting with Spark are doing so in Python via notebooks like Jupyter, so we generally need to support libraries in both Scala and Python. This post focuses on how Python bindings can be packaged and deployed alongside Java/Scala code. [Read More]
A Change Data Capture Pipeline From PostgreSQL to Kafka
We previously wrote about a pipeline for replicating data from multiple siloed PostgreSQL databases to a data warehouse in Building Analytics at Simple, but we knew that pipeline was only the first step. This post details a rebuilt pipeline that captures a complete history of data-changing operations in near real-time by hooking into PostgreSQL’s logical decoding feature. The new pipeline powers not only a higher-fidelity warehouse, but also user-facing features.
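Logical decoding lets a client stream a decoded log of row-level changes out of PostgreSQL. As a rough illustration only (not the actual pipeline described in the post), here is a minimal sketch of flattening a change event in the JSON style emitted by the wal2json output plugin; `parse_change` is a hypothetical helper, and the event payload is a made-up example following wal2json's documented shape:

```python
import json

def parse_change(raw_event):
    """Flatten a wal2json-style change record into (op, table, row) tuples.

    Each entry in the "change" array carries the operation kind
    (insert/update/delete), the table name, and parallel lists of
    column names and values, which we zip into a row dict.
    """
    results = []
    for change in json.loads(raw_event)["change"]:
        row = dict(zip(change.get("columnnames", []),
                       change.get("columnvalues", [])))
        results.append((change["kind"], change["table"], row))
    return results

# An illustrative event, as a decoding plugin might emit it:
event = '''{"change": [{"kind": "update", "schema": "public",
            "table": "accounts",
            "columnnames": ["id", "balance"],
            "columnvalues": [42, "100.00"]}]}'''

print(parse_change(event))
```

A consumer like this would sit downstream of a replication slot, turning the decoded stream into records for both the warehouse and user-facing features.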
Building Analytics at Simple
Early in 2014, Simple was a mid-stage startup with only a single analytics-focused employee. When we wanted to answer a question about customer behavior or business performance, we would have to query production databases. Everybody in the company wanted to make informed decisions, from engineering to product strategy to business development to customer relations, so it was clear that we needed to build a data warehouse and a team to support it. [Read More]
Safe Migrations With Redshift
A Search for Exotic Particles
My Ph.D. dissertation, completed at the University of Wisconsin-Madison.
Abstract: A search for exotic particles decaying via WZ to final states with electrons and muons is performed using a data sample of pp collisions collected at 7 TeV center-of-mass energy by the CMS experiment at the LHC, corresponding to an integrated luminosity of 4.98 inverse femtobarns. A cross section measurement for the production of WZ is also performed on a subset of the collision data. No significant excess is observed over the Standard Model background, so lower bounds at the 95% confidence level are set on the production cross sections of hypothetical particles decaying to WZ in several theoretical scenarios. Assuming the Sequential Standard Model, W’ bosons with masses below 1143 GeV are excluded. New limits are also set for several configurations of Low-Scale Technicolor. [Read More]
Some LHC Calculations
Influencing Dynamics in Neural Networks
This research was performed under an NSF-funded Research Experience for Undergraduates program at Indiana University and then extended at Wittenberg University to serve as an undergraduate honors thesis. The work was overseen by John Beggs. It was later published in the 2006 issue of Wittenberg University’s non-fiction literary magazine, Spectrum. [Read More]
Changing Concepts of Ethnic and National Identity in Dakar
This research was performed as a capstone project for the Senegal: Arts & Culture academic program of the School for International Training in Spring of 2005 under the direction of Souleye Diallo. The project was advised by Mamadou Aliou Diallo, Université Cheikh Anta Diop. It was published in the 2006 issue of Wittenberg University’s non-fiction literary magazine, Spectrum. [Read More]