Building the Air data stack

tl;dr — It’s amazing how productive analysis can be with modern tools and a well-designed data warehouse, and it’s amazing how much of the time required to get there has been just a function of writing SQL in dbt.

The start: Many tools, none of them trusted

Air’s data capabilities were typical of a 20-person startup when I joined in late 2020: Event tracking in Mixpanel, a few folks writing ad hoc queries against our production database, and a lot of reliance on out-of-the-box reporting from the tools where work happened — Intercom, Stripe, etc. No one was dedicated to working with data full time.


This setup led to a few predictable problems:

  1. No one knew where to find anything. For a question as simple as “How many new subscribers did we get last week?” it wasn’t clear whether to go to Mixpanel, to rely on the Stripe interface, to write a custom query in Stripe’s SQL tool, or to look into our product backend.

  2. No one knew what to trust. “I heard we shouldn’t rely on Mixpanel — is that true?” “Someone told me there’s something tricky about internal users — how do I exclude them again?”

  3. Data couldn’t be combined and remixed. We tried to solve this by pushing additional properties to Mixpanel, but that approach is full of shortcomings. If I wanted to know what share of users who invited collaborators in trial (Mixpanel) later subscribed (Stripe), there was no way to do it. (The sketch after this list shows how a shared warehouse turns that question into a single query.)

  4. We couldn’t agree on a unified data model. This is a bit more nuanced, but without one place to combine all our data, it was really hard to push and pull data between our various tools. The customer view in Intercom was a messy amalgamation of scraps from Mixpanel and Stripe, for instance.
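To make the third point concrete: once event data and billing data land in the same warehouse, that trial-to-subscription question collapses into one join. A minimal sketch, with hypothetical table and column names standing in for synced Mixpanel and Stripe data:

    -- What share of users who invited collaborators during trial later
    -- subscribed? All table and column names here are hypothetical.
    with inviters as (
        select distinct user_id
        from mixpanel_events
        where event_name = 'collaborator_invited'
          and occurred_during_trial
    ),

    subscribers as (
        select distinct user_id
        from stripe_subscriptions
        where status = 'active'
    )

    select
        count(subscribers.user_id) / count(inviters.user_id)
            as invite_to_subscribe_rate
    from inviters
    left join subscribers
        on subscribers.user_id = inviters.user_id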

The Band-Aid: Google Sheets Operating Report

I knew the long-term solution was going to take months to build, but the lack of basic business visibility and storytelling was killing us. The solution: A weekly report generated in Google Sheets, with data sourced from several different places. I’d run queries against Stripe, Mixpanel, and our product database, paste the results into Sheets, and then generate tables and charts for reporting.


For a small set of key metrics, the report resolved questions of trust and provided a venue for us to talk about the business. But it was also annoying to produce and very limited in scope. What we saw in the Operating Report would point us in the right direction, and then we’d be back where we started: Unable to answer our second-order questions.

Next: Building a data stack

What we wanted

  1. A clearinghouse for all critical data. This is where we’d enforce rules, combine different sources into higher-order data sets, etc.

  2. A single source for all critical reporting. No more cobbling together different reports in different tools. One tool that is understood as the ground truth for everything that matters to the business.

  3. Exploration and analysis capabilities for all. Make it easy to slice and dice data across the business, to share the results with others, and to enable anyone to do their own analysis.

Who we hired

The first and most important step was hiring a consulting partner to help us get going. I spoke to a bunch of different folks and opted for Fishtown Analytics. We worked with two analytics engineers for three months, getting half of each engineer’s time. We checked in weekly on progress and shared a Slack channel. I cannot speak highly enough of the team there, and the tool they’ve built for data modeling, dbt, has been a revelation. They moved much faster, at much higher quality, and at lower cost than we would have if we’d dedicated a full-time engineer who was learning dbt for the first time. By the time our engagement ended, Fishtown had built almost all of the core data models we needed and had set up a system we could easily extend.

We didn’t hire a dedicated data person right away. The thinking was: Get the foundations in place first, then hire someone to build on top. We started the search for a senior analyst in the last month of the Fishtown engagement.

What we built

As soon as we hired Fishtown, we had to decide on the stack. Based on conversations with lots of other practitioners, the shape of our stack was very clear:

  • Data ingestion. One tool to manage the pipeline of data flowing from our various tools into the data warehouse. The two best options are Fivetran and Stitch. Fivetran is slightly more expensive, but was universally described as more reliable. We picked Fivetran and have not been disappointed.

  • Data warehousing. The relational database for all our data. Snowflake and BigQuery are the clear leaders, while Redshift is seen as one generation behind them. Snowflake is more expensive, and only really differentiates itself at massive scale. We picked BigQuery. (We also had GCP credits via our investors, which made BigQuery and Fivetran essentially free for the first year.)

  • Data modeling. Tooling for analysts and engineers to combine, scrub, and manipulate data into higher-order reporting tables. dbt is the clear leader here, and, especially given that we hired Fishtown, it was the obvious choice. (A sketch of what a dbt model looks like follows this list.)

  • BI tool. The place for charts, dashboards, and exploration. BI is where you’ll see the most complexity and nuance. Looker is the most widely known and used of the modern tools, but also the most expensive. Tableau is deep but also somewhat antiquated. There are a bunch of really inexpensive tools, like Redash, that should work well if you’re cost sensitive. We went with Mode, which is in many ways comparable to Looker but roughly half the cost. On the plus side, it hosts Python and R notebooks, which means we won’t sprawl into people running local Jupyter notebooks. In addition, Looker’s main pitch is that much of the data modeling work that we’re doing in dbt could instead be done in LookML. But that scared the heck out of me — we wanted data modeling under source control and visible only to a few very technical folks, and we wanted to avoid confusion about what gets manipulated in the BI tool and what gets manipulated in dbt.
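To give a flavor of the modeling layer: a dbt model is just a SQL select statement kept under source control, with ref() wiring models together so dbt knows what depends on what and builds everything in order. A minimal sketch; the model and column names are made up, not our actual project:

    -- models/marts/subscriptions.sql (hypothetical)
    -- A dbt model: plain SQL plus ref(), which declares dependencies
    -- between models so dbt can build them in the right order.
    with customers as (
        select * from {{ ref('stg_stripe__customers') }}
    ),

    subscriptions as (
        select * from {{ ref('stg_stripe__subscriptions') }}
    )

    select
        subscriptions.subscription_id,
        customers.email,
        subscriptions.plan_name,
        subscriptions.status,
        subscriptions.started_at
    from subscriptions
    left join customers
        on customers.customer_id = subscriptions.customer_id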

Speaking of costs: Ours aren’t high. We’re a small paid SaaS tool. As a result we don’t have a ton of data, and our costs associated with data scale are low. Fivetran is going to end up costing a few thousand dollars a year; same for BigQuery. dbt is free for the first seat, but once we make a data hire we’ll probably get a couple of seats at $50 per seat per month. Mode is the biggest expense, with a platform fee in the thousands of dollars plus $20 per seat per month.

It’s important to think about exactly what your data needs are — and are not.

  • Air was going to be using data primarily to make deliberative decisions, not to power product experiences in real time.

  • Air wasn’t focused on complex real-time customer messaging that more transactional products require.

  • Air has a relatively small data set, not petabytes and petabytes.

  • Air was largely working with slow-moving, explicitly structured data, not infinitely variable user and event features.

Here’s where everything shook out:

[diagram of the full stack: sources synced by Fivetran into BigQuery, modeled with dbt, surfaced in Mode]

The power of dbt has been the refinement of data into distinct stages. As we’ve onboarded Air team members to Mode, I’ve had to produce a bunch of documentation explaining how to think about this. Here’s the basic summary:

[diagram: the distinct stages of data refinement in dbt]
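The gist of those stages, sketched with illustrative names (this follows the common dbt convention, not our literal project): staging models do nothing but rename and lightly clean a single raw source, and downstream reporting models combine them.

    -- models/staging/stripe/stg_stripe__subscriptions.sql (hypothetical)
    -- Staging: one model per raw table. Rename, cast, and lightly clean;
    -- no joins or business logic yet.
    select
        id as subscription_id,
        customer as customer_id,
        status,
        plan as plan_name,
        created as started_at
    from {{ source('stripe', 'subscription') }}

Reporting models then join these cleaned staging models together via ref(), as in the earlier sketch, and those final tables are what everyone sees in Mode.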

My impressions of the tools

  • BigQuery has been solid so far. We’re an easy use case — not a ton of data, not a ton of users, not a ton of complexity — but it has passed with flying colors so far.

  • Fivetran has made life so easy. Coming from Squarespace, where we built all our pipelines within data engineering, to a world where we can just plug all our tools in and not have to worry about data flowing properly has been mind-expanding. To appreciate just how easy life has been with Fivetran, consider this email they sent after a BigQuery outage (which produced downstream problems that Fivetran saw):

    Hello Andrew,

    Due to the recent Google BigQuery incident between 01-18 16:27:48 UTC and 01-21 02:30:01 UTC, you may have several tables in your destination(s) with missing data from this timeframe.

    Google has resolved the issue starting 01-21 02:30:01 UTC, so your connectors have begun to sync again, but data starting from 01-18 16:27:48 UTC may be missing.

    Disregard any instructions you may have received from Google on how to fix the issue. The instructions are not sufficient to restore your impacted tables & data due to how Fivetran’s replication process works with BigQuery.

    In order to resolve this, Fivetran will perform the following:

    1. Automatically create a copy of the correct data a few hours before the incident began in your destinations. Note: This will incur some storage cost for your BigQuery instance.

    2. We will pause your connectors while we restore to the correct data, and then unpause once the restore process is complete.


    Note: Customers with database connectors whose retention periods cleared logs within the incident timeframe may need to re-sync. Fivetran will issue refunds for any re-syncs incurred by this incident.

    What a reassuring email to get. Even with a full-time data engineering team you couldn’t ask for a better response.

  • dbt is incredible if you have a full-time data person. It puts a ton of structure around data changes and replaces lots of boilerplate with templates to ensure consistency. It makes scheduling jobs and tracking data provenance really easy. With multiple team members I imagine it would be essential; I can’t imagine working without it now. That said, it is too much overhead if you just have one person hacking around in the data part-time. And we relied on Fishtown for the initial setup and buildout, so I can’t speak to how easy it is to start from scratch.
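    As a small, hypothetical example of the boilerplate-to-templates trade: dbt’s Jinja templating can generate repetitive SQL from a list instead of someone hand-writing and maintaining it. The event and model names here are made up:

        -- Pivot a list of event names into one count column each, via a
        -- Jinja loop instead of hand-written, copy-pasted SQL.
        {% set events = ['asset_uploaded', 'collaborator_invited', 'board_created'] %}

        select
            user_id
            {% for event in events %}
            , countif(event_name = '{{ event }}') as {{ event }}_count
            {% endfor %}
        from {{ ref('stg_mixpanel__events') }}
        group by user_id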

  • Mode has been excellent. It still struggles with some of the basic things that all dashboard tools struggle with — why is it so hard to just automatically give week-over-week change calculations, dammit! — but it looks good and works as expected. They brag about supporting Python notebooks natively, but the truth is their notebooks are way clunkier than something like Jupyter. The only benefit of Mode notebooks is that they’re easier to share. A data scientist doing daily Python analysis is going to want something more responsive.
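    For what it’s worth, the workaround is to compute the change in the query itself with a window function. A minimal sketch against a hypothetical weekly metrics table:

        -- Week-over-week change computed in SQL, since the BI layer won't
        -- calculate it automatically. weekly_metrics is hypothetical.
        select
            week,
            new_subscribers,
            new_subscribers - lag(new_subscribers) over (order by week)
                as wow_change
        from weekly_metrics
        order by week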

At last (so far): Nirvana

We recently wrapped up our engagement with Fishtown and have onboarded other members of the organization into BigQuery and Mode. So far so good.

Once the core tables were refined — a process that took most of the three months we had with Fishtown — building dashboards was practically instantaneous. One weekend of focused work from me produced core dashboards covering all the metrics we care about.

There are still a few key steps in the rollout phase:

  • Onboarding the rest of the Air team to allow them to explore data on their own

  • Building out the second tier of exploratory Mode dashboards, one-off analyses, etc.

  • Hiring our first data analyst to extend the dbt workflow and drive day-to-day investigations

  • Adding more data sources, such as Intercom and Salesforce

  • Pushing data from the warehouse into other tools, especially Salesforce

I’m really not worried about these remaining steps given the foundations we’ve built.

Overall: It feels so good to be building on a 2020 stack instead of a 2010 stack.


Helpful links

A few posts I’ve read recently that really helped me think through our approach: