Deduplication: Where Apache Beam Fits In

Summary of a talk delivered at Apache Beam Digital Summit on August 4, 2021.

Title slide

This session will start with a brief overview of the problem of duplicate records and the different options available for handling them. We’ll then explore two concrete approaches to deduplication within a Beam streaming pipeline implemented in Mozilla’s open source codebase for ingesting telemetry data from Firefox clients.

We’ll compare the robustness, performance, and operational experience of the deduplication built into PubsubIO vs. storing IDs in an external Redis cluster, and explain why Mozilla switched from one approach to the other.

Finally, we’ll compare streaming deduplication to a much stronger end-to-end guarantee that Mozilla achieves via nightly scheduled queries to serve historical analysis use cases.
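The Redis-backed approach mentioned above boils down to remembering each document ID for a bounded window and dropping replays. Here is a minimal sketch of that idea; the `SeenIds` class and `is_duplicate` method are illustrative names, and an in-memory dict stands in for the external Redis cluster the real pipeline would consult from a Beam DoFn.

```python
import time

class SeenIds:
    """In-memory stand-in for an external Redis cluster: remember each
    document ID for a fixed TTL and flag any message whose ID reappears
    within that window. Illustrative only -- not Mozilla's actual code."""

    def __init__(self, ttl_seconds=24 * 60 * 60):
        self.ttl = ttl_seconds
        self._seen = {}  # document_id -> expiry timestamp

    def is_duplicate(self, document_id, now=None):
        now = time.time() if now is None else now
        expiry = self._seen.get(document_id)
        if expiry is not None and expiry > now:
            return True
        # First sighting (or the TTL has elapsed): record it and pass through.
        self._seen[document_id] = now + self.ttl
        return False

seen = SeenIds(ttl_seconds=60)
first = seen.is_duplicate("doc-1", now=0)    # False: first sighting
replay = seen.is_duplicate("doc-1", now=30)  # True: replay within the TTL
```

With Redis, the dict lookup and insert would typically collapse into a single atomic `SET key value NX EX ttl` call, which is what makes the approach safe under parallel workers.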

The Nitty-Gritty of Moving Data with Apache Beam

Summary of a talk delivered at Apache Beam Digital Summit on August 24, 2020.

Title slide

In this session, you won’t learn about joins or windows or timers or any other advanced features of Beam. Instead, we will focus on the real-world complexity that comes from simply moving data from one system to another safely. How do we model data as it passes from one transform to another? How do we handle errors? How do we test the system? How do we organize the code to make the pipeline configurable for different source and destination systems?

We will explore how each of these questions is addressed in Mozilla’s open source codebase for ingesting telemetry data from Firefox clients. By the end of the session, you’ll be equipped to explore the codebase and documentation on your own to see how these concepts are composed together.
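One of the questions above, error handling, usually comes down to never letting a bad message crash the pipeline: route failures to a separate error output alongside the successes. The sketch below illustrates that pattern in plain Python under assumed names (`partition_errors` is hypothetical); in Beam this would be done with tagged outputs from a DoFn rather than a loop.

```python
import json

def partition_errors(messages, transform):
    """Apply `transform` to each message; successes go to the main
    output, while failures are captured with their exception and
    routed to an error output for later inspection or replay.
    Illustrative stand-in for Beam's tagged/side outputs."""
    successes, errors = [], []
    for msg in messages:
        try:
            successes.append(transform(msg))
        except Exception as exc:
            errors.append({"payload": msg, "error": repr(exc)})
    return successes, errors

ok, bad = partition_errors(['{"a": 1}', "not json"], json.loads)
# ok holds the parsed document; bad holds the unparseable payload plus its error
```

Keeping the original payload in the error record is the key design choice: it lets failed messages be re-processed once the bug that broke them is fixed.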

Encoding Usage History in Bit Patterns

Originally published as a cookbook on docs.telemetry.mozilla.org to instruct data users within Mozilla how to take advantage of the usage history stored in our BigQuery tables.

DAU windows in a bit pattern

Monthly active users (MAU) is a windowed metric that requires joining data per client across 28 days. Calculating this from individual pings or daily aggregations can be computationally expensive, which motivated the creation of the clients_last_seen dataset for desktop Firefox and similar datasets for other applications.

A powerful feature of the clients_last_seen methodology is that it doesn’t record specific metrics like MAU and WAU directly, but rather each row stores a history of the discrete days on which a client was active in the past 28 days. We could calculate active users in a 10-day or 25-day window just as efficiently as a 7-day (WAU) or 28-day (MAU) window. But we can also define completely new metrics based on these usage histories, such as various retention definitions.
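The mechanics of the bit pattern can be sketched in a few lines: each day the history shifts left by one bit, the low bit records today’s activity, and any windowed metric is just a mask over the low bits. Function names here are illustrative; the real dataset implements this in BigQuery SQL.

```python
DAYS = 28  # width of the usage-history window

def shift_and_record(days_seen_bits, active_today):
    """Advance the window by one day: shift history left one bit,
    set bit 0 if the client was active today, and truncate to 28 bits."""
    return ((days_seen_bits << 1) | int(active_today)) & ((1 << DAYS) - 1)

def active_in_last(days_seen_bits, n_days):
    """Was the client active on any of the last n_days? Mask the low
    n_days bits: n_days=28 gives MAU, n_days=7 gives WAU, and any
    other window (10 days, 25 days, ...) costs exactly the same."""
    return (days_seen_bits & ((1 << n_days) - 1)) != 0

# A client active today and 9 days ago: bit 0 and bit 9 are set.
bits = (1 << 9) | 1
in_wau = active_in_last(bits, 7)    # True: active within the last 7 days
in_mau = active_in_last(bits, 28)   # True: active within the last 28 days
```

A client active only 9 days ago (`1 << 9`) would count toward MAU but not WAU, which is exactly the distinction the bit pattern makes cheap to compute.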

[Read More]

The Dashboard Problem and Data Shapes

Cross-posted on the Data@Mozilla blog.

Memphis Shapes by AnnaliseArt

The data teams at Mozilla have put a great deal of effort into building a robust data ingestion pipeline and reliable data warehouse that can serve a wide variety of needs. Yet, we keep coming back to conversations about the dashboard problem, or how we’re missing last-mile tooling that makes data accessible for use in data products that we can release to different customers within Mozilla.

[Read More]

Guiding Principles for Data Infrastructure

Originally published as an article on docs.telemetry.mozilla.org.

So you want to build a data lake… Where do you start? What building blocks are available? How can you integrate your data with the rest of the organization?

This document is intended for a few different audiences. Data consumers within Mozilla will gain a better understanding of the data they interact with by learning how the Firefox telemetry pipeline functions. Mozilla teams outside of Firefox will get some concrete guidance about how to provision and lay out data in a way that will let them integrate with the rest of Mozilla.

[Read More]