
Building an AI Data Analyst Sucks

I'll teach you how to do this, and you'll get mad at me for it.

Burak Karakan

Co-founder & CEO

We have been building data infrastructure for years, trying to make it intuitive, easy to manage, and fast for human beings. Surprisingly, while trying to make humans more productive, we ended up making agents more productive with data as well.

Over the past few months we have doubled down on building an AI data analyst. We built it on top of the same principles we use to build Bruin: code-first, open-source, no lock-in. We are now open-sourcing all of it so that others can build their own.

You might not love it, but you asked for it.

Why don't we see data agents everywhere?

Coding has been taken over by AI agents: we all use Claude Code, Codex, OpenCode, Pi, or whatever the new cool agent in town is this week. Everyone and their mother is using Claude Code now.

There are a few reasons why agents are a good fit for coding:

  • Agents need text: Code is essentially just text.
  • Agents need tools: Coding tools are everywhere, be it compilers, interpreters, or editors.
  • Agents need metadata: Code kind of is its own metadata. You can mostly figure out the behavior by reading the code.
  • Agents need feedback: You can build, run, test, and observe the code.

These factors make coding the perfect fit for agents. They can read it, write it, and run it.

Unfortunately, these factors are not very applicable to data:

  • Agents need text: Most of the data is state. Pipelines are distributed across many tools, driven by UI workflows, and not accessible anywhere.
  • Agents need tools: they need to be able to run queries, refresh datasets, run backfills, and do this across platforms. They need to produce artifacts such as dashboards, Slack messages, and emails.
  • Agents need metadata: Data catalogs lock in all the metadata around the data, and they are not accessible without special MCP servers or custom integrations. Metadata gets lost across different tools.
  • Agents need feedback: You can't safely run data workloads in isolated development environments, and you need to be able to do that to make agents productive with data.

Considering these factors, it becomes a humongous challenge to bring AI agents into data workloads using traditional tools. What worked yesterday doesn't work today.

This is also why building an AI data analyst is not a trivial task if you want it to be accurate.

What does an AI data analyst need?

Like everybody else, we quickly discovered that context is king.

The agent:

  • Needs to know about the available tables and data.
  • Needs to understand the metadata about these tables.
  • Needs to know the queries that generate the tables.
  • Needs to understand the lineage of the data.
  • Needs to be able to explore the data progressively.
  • Needs to be able to access the data securely.
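
Concretely, these requirements amount to a context layer the agent can read: tables, their metadata, the queries that produce them, and their lineage. A minimal sketch in Python of what such a layer could look like (all names and fields here are illustrative assumptions, not Bruin's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical context-layer records an agent could consume; the field
# names are illustrative, not Bruin's actual schema.

@dataclass
class Column:
    name: str
    type: str
    description: str = ""

@dataclass
class TableContext:
    name: str                      # e.g. "analytics.daily_revenue"
    columns: list                  # list of Column
    query: str                     # the SQL that produces the table
    upstreams: list                # lineage: names of tables this one reads from
    description: str = ""

def lineage(tables, name):
    """Walk upstream dependencies so the agent can trace where data comes from."""
    seen, stack = [], [name]
    while stack:
        for up in tables[stack.pop()].upstreams:
            if up not in seen:
                seen.append(up)
                stack.append(up)
    return seen

orders = TableContext("raw.orders", [Column("id", "INT64")],
                      "SELECT * FROM src.orders", [])
revenue = TableContext(
    "analytics.daily_revenue",
    [Column("day", "DATE"), Column("revenue", "FLOAT64")],
    "SELECT order_date AS day, SUM(amount) AS revenue FROM raw.orders GROUP BY 1",
    ["raw.orders"],
)
tables = {t.name: t for t in (orders, revenue)}
print(lineage(tables, "analytics.daily_revenue"))  # prints ['raw.orders']
```

Because everything here is plain data in regular files, an agent can read it the same way it reads code, which is exactly the property that makes coding agents work.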

Bruin has naturally been a good fit for these requirements. Our open-source tooling already had built-in cataloging capabilities; it can contain SQL, Python, and non-executable assets; it tracks their lineage; and it does all of this in regular files. It also exposes quite a few tools to the agent so that it can query the data securely, connect to multiple different systems, compare tables, and more.

In addition, it can:

  • connect to tens of different systems: BigQuery, Snowflake, ClickHouse, Databricks, Redshift, and more
  • run queries and ingest data to and from any of these systems
  • run quality checks on the data
  • access the catalog via plain text files
  • support business glossaries natively

This means that by building the first steps of an AI data analyst, you get not only a better, faster, and more accurate analyst, but also the beginning of an AI data engineer.

Context Rot

Is it all smooth sailing? No. Managing the context is still a challenge.

You need to be able to generate the initial metadata and keep it up to date. You need to bring your business definitions into it. You need to be able to pull in context from external sources. You need to introduce your metric definitions into it. The list is endless, and it is still a lot of work.

Did we solve all these problems? No. Not yet.

What we are open-sourcing today is our toolset that allows you to build your own context layer and put your AI agent on top of it.

  • It contains a built-in bruin import database command that allows you to import your database schema into Bruin.
  • It has a built-in bruin ai enhance command that allows you to enhance the metadata with AI.
  • It allows you to represent your data assets, including your dashboards, in a version-controlled way.
  • It has a built-in MCP server that allows your agents to connect to your warehouse easily.
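
Put together, the workflow looks roughly like this (a sketch of the commands named above; the ordering and comments are ours, so check the Bruin docs for exact usage and flags):

```shell
# Import your database schema into a Bruin project as plain files
bruin import database

# Enrich the imported metadata (descriptions, glossary links, etc.) with AI
bruin ai enhance

# From here, your assets, dashboards, and metadata live as
# version-controlled files, and the built-in MCP server lets your
# agent connect to the warehouse and query it.
```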

We will continue improving the capabilities here.

Build your own AI data analyst: open-source

We are open-sourcing this toolset under the Bruin Academy. It takes just a few steps, and should give you a solid baseline to build and improve your own AI data analyst.

Give it a try, and let us know what you think about it. We'll continue expanding the capabilities here, and we are looking forward to seeing what you build with it.

You can also join our Slack community to get help and share your experiences.

Godspeed

For those of you who just want a version of this that works automatically, we also have Bruin Cloud, our managed platform that lets you build your own AI data analyst in a few clicks.