Deduplication: Where Apache Beam Fits In

Summary of a talk delivered at Apache Beam Digital Summit on August 4, 2021.

Title slide

This session will start with a brief overview of the problem of duplicate records and the different options available for handling them. We’ll then explore two concrete approaches to deduplication within a Beam streaming pipeline implemented in Mozilla’s open source codebase for ingesting telemetry data from Firefox clients.

We’ll compare the robustness, performance, and operational experience of using the deduplication built into PubsubIO vs. storing IDs in an external Redis cluster, and explain why Mozilla switched from one approach to the other.

Finally, we’ll compare streaming deduplication to a much stronger end-to-end guarantee that Mozilla achieves via nightly scheduled queries to serve historical analysis use cases.
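
As a minimal sketch of the PubsubIO approach (not Mozilla’s actual implementation, which lives in the codebase discussed in the talk), the Beam Python SDK exposes Pub/Sub ID-based deduplication through the id_label argument of ReadFromPubSub; the subscription path and attribute name below are hypothetical.

```python
# Minimal sketch: Pub/Sub-level deduplication in a streaming Beam pipeline.
# The subscription path and the "document_id" attribute name are hypothetical.
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    payloads = (
        pipeline
        # id_label asks the runner to drop messages whose "document_id"
        # attribute has already been seen within its deduplication window.
        | ReadFromPubSub(
            subscription="projects/my-project/subscriptions/telemetry",
            id_label="document_id",
            with_attributes=True,
        )
        | beam.Map(lambda message: message.data)
    )
```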

The Nitty-Gritty of Moving Data with Apache Beam

Summary of a talk delivered at Apache Beam Digital Summit on August 24, 2020.

Title slide

In this session, you won’t learn about joins or windows or timers or any other advanced features of Beam. Instead, we will focus on the real-world complexity that comes from simply moving data from one system to another safely. How do we model data as it passes from one transform to another? How do we handle errors? How do we test the system? How do we organize the code to make the pipeline configurable for different source and destination systems?

We will explore how each of these questions is addressed in Mozilla’s open source codebase for ingesting telemetry data from Firefox clients. By the end of the session, you’ll be equipped to explore the codebase and documentation on your own to see how these concepts are composed together.
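
As a small taste of the error-handling question (a generic sketch, not the pattern exactly as implemented in Mozilla’s codebase), a common Beam idiom is to route records that fail processing to a tagged side output instead of failing the pipeline; the parsing logic and tag name below are illustrative.

```python
# Generic sketch: send records that fail to parse to an "errors" side output
# so the main pipeline keeps running and failures can be inspected later.
import json

import apache_beam as beam
from apache_beam import pvalue


class ParseJson(beam.DoFn):
    ERROR_TAG = "errors"

    def process(self, element):
        try:
            yield json.loads(element)
        except Exception as exc:
            # Emit the bad record plus the error message on a separate tag;
            # downstream this can feed a dead-letter table or topic.
            yield pvalue.TaggedOutput(self.ERROR_TAG, (element, str(exc)))


with beam.Pipeline() as pipeline:
    results = (
        pipeline
        | beam.Create(['{"ok": 1}', "not json"])
        | beam.ParDo(ParseJson()).with_outputs(ParseJson.ERROR_TAG, main="parsed")
    )
    parsed, errors = results.parsed, results[ParseJson.ERROR_TAG]
```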

Encoding Usage History in Bit Patterns

Originally published as a cookbook on docs.telemetry.mozilla.org to instruct data users within Mozilla how to take advantage of the usage history stored in our BigQuery tables.

DAU windows in a bit pattern

Monthly active users (MAU) is a windowed metric that requires joining data per client across 28 days. Calculating this from individual pings or daily aggregations can be computationally expensive, which motivated the creation of the clients_last_seen dataset for desktop Firefox and similar datasets for other applications.

A powerful feature of the clients_last_seen methodology is that it doesn’t record specific metrics like MAU and WAU directly, but rather each row stores a history of the discrete days on which a client was active in the past 28 days. We could calculate active users in a 10-day or 25-day window just as efficiently as a 7-day (WAU) or 28-day (MAU) window. But we can also define completely new metrics based on these usage histories, such as various retention definitions.
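
The datasets themselves are built with scheduled BigQuery queries, but the core bit-pattern idea can be sketched in a few lines of Python, assuming (as in clients_last_seen) that bit 0 marks activity on the row’s date and bit i marks activity i days earlier; the helper names and values are illustrative.

```python
# Illustrative sketch of the bit-pattern idea (the real implementation is
# BigQuery SQL): bit i of days_seen_bits is set if the client was active
# i days before the row's date.
def was_active(days_seen_bits: int, window_days: int) -> bool:
    """True if the client was active on any of the last `window_days` days."""
    mask = (1 << window_days) - 1  # keep only the lowest `window_days` bits
    return (days_seen_bits & mask) != 0


def days_active(days_seen_bits: int, window_days: int) -> int:
    """Number of distinct active days within the window."""
    mask = (1 << window_days) - 1
    return bin(days_seen_bits & mask).count("1")


# A client active on the row's date (bit 0) and nine days earlier (bit 9):
bits = (1 << 0) | (1 << 9)
assert was_active(bits, 7)         # counts toward WAU (7-day window)
assert days_active(bits, 28) == 2  # two active days in the MAU window
```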

[Read More]

The Dashboard Problem and Data Shapes

Cross-posted on the Data@Mozilla blog.

Memphis Shapes by AnnaliseArt

The data teams at Mozilla have put a great deal of effort into building a robust data ingestion pipeline and reliable data warehouse that can serve a wide variety of needs. Yet we keep coming back to conversations about the dashboard problem, or about how we’re missing last-mile tooling that makes data accessible for use in data products that we can release to different customers within Mozilla.

[Read More]

Guiding Principles for Data Infrastructure

Originally published as an article on docs.telemetry.mozilla.org. So you want to build a data lake… Where do you start? What building blocks are available? How can you integrate your data with the rest of the organization? This document is intended for a few different audiences. Data consumers within Mozilla will gain a better understanding of the data they interact with by learning how the Firefox telemetry pipeline functions. Mozilla teams outside of Firefox will get some concrete guidance about how to provision and lay out data in a way that will let them integrate with the rest of Mozilla. [Read More]

Triggering trusted CI jobs on untrusted forks

Originally published as a guest post on the CircleCI Blog. On public code repositories, arbitrary users are generally allowed to make forks and issue pull requests (which we will refer to as “forked PRs”). These users could be from outside your organization and are generally considered untrusted for the purposes of running automated builds. This causes a problem if you want a continuous integration service to run tests that need access to credentials or sensitive data, as a malicious developer could propose code that exposes the credentials to CI logs. [Read More]

Deploying documentation to GitHub Pages with continuous integration

Originally published as a guest post on the CircleCI Blog. Continuous integration (CI) tools have been evolving towards flexible, general-purpose computing environments. They aren’t just used for running tests and reporting results, but often run full builds and send artifacts to external systems. If you’re already relying on a CI system for these other needs, it can be convenient to build and deploy your documentation using the same platform rather than pulling in an additional tool or service. [Read More]

Managing secrets when you have pull requests from outside contributors

Originally published as a guest post on the CircleCI Blog. Mozilla likes to work in the open as much as possible, which means we primarily do our development in publicly accessible code repositories, whether we expect outside collaborators or not. Those repositories, however, still need to hook into other systems, which sometimes involves managing sensitive credentials. How can we enable those connections to provide rich workflows for maintainers while also providing a great experience for outside contributors? [Read More]

lib.jar: Java library? Python package? Both?

I’ve recently started working quite a bit with Spark and have found that there’s not much guidance on best practices for packaging and deploying libraries and apps on Spark. I’m planning to write a series of posts on Spark packaging and app deployment as we find patterns that work for the data platform at Mozilla.

Spark is written in Scala, but provides client libraries for Scala, Java, Python, and a few other languages. At Mozilla, we tend to write our large-scale ETL jobs in Scala, but most of the folks interacting with Spark are doing so in Python via notebooks like Jupyter, so we generally need to support libraries in both Scala and Python. This post focuses on how Python bindings can be packaged and deployed alongside Java/Scala code.
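
The trick this builds on, sketched loosely below, is that a jar is just a zip archive and Python’s import machinery can load pure-Python packages directly from zip files on sys.path; the file and package names here are hypothetical.

```python
# Loose sketch: a jar is a zip archive, and Python can import pure-Python
# packages straight from zip files on sys.path (via zipimport).
# "lib.jar" and "mylib" are hypothetical names.
import sys

sys.path.insert(0, "lib.jar")  # assumes the jar contains mylib/__init__.py at its root

import mylib  # resolved from inside the jar

# PySpark's --py-files option and SparkContext.addPyFile serve the same
# purpose of shipping such archives out to executors.
```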

Zip-Eat!

[Read More]

A Change Data Capture Pipeline From PostgreSQL to Kafka

Originally posted on the Simple engineering blog; also presented at PGConf US 2017 and Ohio LinuxFest 2017.

We previously wrote about a pipeline for replicating data from multiple siloed PostgreSQL databases to a data warehouse in Building Analytics at Simple, but we knew that pipeline was only the first step. This post details a rebuilt pipeline that captures a complete history of data-changing operations in near real-time by hooking into PostgreSQL’s logical decoding feature. The new pipeline powers not only a higher-fidelity warehouse, but also user-facing features.
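
As a loose, generic sketch of the idea (not Simple’s actual implementation), psycopg2 can stream a replication slot’s logical-decoding output and forward each change to Kafka; the connection details, slot name, output plugin, and topic below are hypothetical.

```python
# Loose sketch: stream PostgreSQL logical-decoding output into Kafka.
# Connection details, slot name, output plugin, and topic are hypothetical.
import psycopg2
import psycopg2.extras
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

conn = psycopg2.connect(
    "dbname=app user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Assumes a logical replication slot was created ahead of time, e.g.:
#   SELECT pg_create_logical_replication_slot('cdc_slot', 'test_decoding');
cur.start_replication(slot_name="cdc_slot", decode=True)


def forward(msg):
    # Each payload describes one data-changing operation (INSERT/UPDATE/DELETE).
    producer.send("pg.changes", msg.payload.encode("utf-8"))
    # Acknowledge progress so the slot does not retain WAL indefinitely.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)


cur.consume_stream(forward)
```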

change data capture diagram

[Read More]