ingestr: Ingest Data From Any Source Into Any Destination, Without Code
Why is data ingestion so hard? This post explores the challenges of data ingestion and introduces ingestr, an open-source solution to simplify the process.
Burak Karakan
Co-founder & CEO
One of the first issues companies run into when it comes to analyzing their data is having to move the data off of their production databases: production databases are tuned for transactional workloads rather than analytical queries, and running heavy analytics against them puts customer-facing traffic at risk.
Due to these contributing factors, once they reach a certain size and scale, companies usually move their data into an analytical database such as Google BigQuery or Snowflake.
While moving the data to a database that is fit for purpose sounds good, it comes with its own challenges: the copies need to be kept in sync with the source, source schemas change over time, and the pipelines that move the data have to be deployed, scheduled, and monitored.
All of these reasons add up to the problem of data ingestion, and there are already plenty of tools in the market that aim to solve it.
The moment the data ingestion/copy problem is acknowledged, the first reaction across many teams is to build a tool that does the ingestion for them, and then schedule it via cronjobs or more advanced solutions. The problem sounds simple on the surface: read the rows from the source, convert them into insert statements or some other platform-specific way to load the data into the database, and schedule the whole thing. However, this on-the-surface analysis forgets quite a few crucial questions: How do you load only the new rows instead of the full table on every run? What happens when the source schema changes? What happens when a load fails halfway through? Who monitors all of these jobs, and where do they run?
As you can see, there are many open points, and they all require a solid understanding of the problem at hand, along with the investment to make the overall initiative a success. Otherwise, the engineering team builds quick hacks to get up and running, and these "hacks" start to become the backbone of the analytical use-cases, making it very hard, if not impossible, to evolve the architecture as the business evolves.
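To make this concrete, here is a minimal sketch of the kind of hand-rolled copy script teams often start with; the connection strings, table, and columns below are entirely hypothetical:

import psycopg2  # assuming a Postgres source
from google.cloud import bigquery  # assuming a BigQuery destination

# read everything from the production table on every run:
# no incremental loading, no schema handling, no retries
src = psycopg2.connect("postgresql://user:pass@prod-db:5432/app")
cur = src.cursor()
cur.execute("SELECT id, email, created_at FROM users")
rows = [
    {"id": r[0], "email": r[1], "created_at": r[2].isoformat()}
    for r in cur.fetchall()
]

bq = bigquery.Client()
# a full overwrite of the destination table, every single run
job = bq.load_table_from_json(
    rows,
    "my-project.raw.users",
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
)
job.result()  # blocks until the load finishes; errors surface here

It works on day one, but every one of the open questions above is left unanswered, and each answer tends to get bolted on as another quick hack.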
Some smart people saw the problem at hand and came up with various solutions to make this process easier.
Over the years, some teams have decided that data ingestion can be handled entirely via UI-driven solutions with pre-built connectors for various platforms, which means non-technical people can also ingest data. Two major players that come to mind are Fivetran and Airbyte, both giant companies trying to tackle the long tail of the data ingestion problem.
Even though there are a few differences between these no-code platforms, the primary approach is the same: you use their UI to set up connectors and then forget about the problem, without needing any technical person; e.g. a marketing person can set up a data ingestion task from Postgres to BigQuery.
While these tools bring a great deal of convenience, they still pose some challenges: usage-based pricing gets expensive as data volumes grow, the pre-built connectors are hard to customize when your needs deviate from the defaults, and the ingestion logic lives outside your codebase and deployment workflows.
All in all, while UI-driven data ingestion tools like Fivetran or Airbyte allow teams to get going from zero, there are still issues that cause teams to stay away from them and resort to writing code due to the flexibility it provides.
There is an emerging open-source Python library called dlt, from the company dltHub, which focuses on the use cases where there will still be code written to ingest the data, but where that code can be a lot smaller and more maintainable. dlt has built-in open-source connectors, but it also allows teams to build custom sources & destinations for their specific needs. It is flexible, yet allows quick iteration when it comes to ingestion.
There are a couple of things dlt takes care of very nicely: it infers and evolves schemas automatically, it normalizes nested data into relational tables, and it keeps track of incremental loading state across runs.
dlt is quite a powerful library and has a very vibrant, growing community. It might be the perfect companion for engineers who want to write code for their custom ingestion requirements.
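To give a feel for the API, here is a minimal, hypothetical dlt pipeline that merges a small set of rows into BigQuery; the resource name, dataset, and data are made up for illustration:

import dlt

@dlt.resource(primary_key="id", write_disposition="merge")
def users():
    # in a real pipeline this would read from an API or a database
    yield [
        {"id": 1, "email": "jane@example.com"},
        {"id": 2, "email": "joe@example.com"},
    ]

# dlt infers the schema, creates the destination table if needed,
# and tracks load state across runs
pipeline = dlt.pipeline(
    pipeline_name="users_to_bigquery",
    destination="bigquery",
    dataset_name="raw",
)
load_info = pipeline.run(users())
print(load_info)

Notice how the open questions from earlier, such as incremental state and schema handling, are absorbed by the library rather than by hand-rolled glue code.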
However, we felt that there might be a middle ground for simpler use-cases that don't require coding, but also don't lock us into a UI-driven workflow.
While we like dlt a lot at Bruin, we felt that there were quite a few simpler scenarios that we couldn't justify writing, maintaining, and deploying code for, such as copying a table from a production database into a data warehouse as-is, or incrementally loading only the new rows based on an updated_at column. While all of these are possible with dlt, they still require people to write code and figure out a way to deploy and monitor it. It is not incredibly hard, but it is also not trivial. Seeing all these patterns, we have decided to take a stab at the problem in an open-source fashion.
ingestr is a command-line application that allows you to ingest data from any source into any destination using simple command-line flags, without writing any code, while still keeping the ingestion as part of your tech stack.
pip install ingestr
ingestr takes away the complexity of managing any backend or writing any code for ingesting data: simply run the command and watch the magic.
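For instance, copying a Postgres table into BigQuery is a single command; the URIs and table names below are placeholders, and the exact flags are documented in the ingestr README:

ingestr ingest \
    --source-uri 'postgresql://admin:admin@localhost:5432/app' \
    --source-table 'public.users' \
    --dest-uri 'bigquery://my-project?credentials_path=/path/to/service_account.json' \
    --dest-table 'raw.users'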
ingestr makes a couple of opinionated decisions about how the ingestion should work. One of them is that incremental loads are limited to a small set of well-understood strategies: replace, which wipes the destination table and reloads it from the source; append, which only adds the new rows; merge, which updates existing rows and inserts new ones based on a key; and delete+insert, which deletes the matching rows in the destination and inserts their fresh versions.
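As a sketch of what this looks like in practice, here is a hypothetical incremental run using the merge strategy keyed on an updated_at column; the flag names follow the ingestr documentation, and the URIs are placeholders:

ingestr ingest \
    --source-uri 'postgresql://admin:admin@localhost:5432/app' \
    --source-table 'public.orders' \
    --dest-uri 'bigquery://my-project' \
    --dest-table 'raw.orders' \
    --incremental-strategy merge \
    --incremental-key updated_at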
While there will be quite a few scenarios where teams would benefit from the flexibility dlt provides, we believe that 80% of the real-life scenarios out there fall into these presets, and for those, ingestr can simplify things quite a bit.
🌟 Give it a look and give us a star on GitHub! We'd love to hear your feedback, and feel free to join our Slack community here.