Why It's Reasonable to Be Skeptical About AI in Data - and Why It's Fixable
A practical framework for building an AI context layer using open-source tools, turning skepticism about AI in data engineering into a working solution with self-healing pipelines and iterative team adoption.
Arsalan Noorafkan
Developer Advocate
I was quite skeptical about AI coding agents, and honestly, for a while I think I was right to be. I'd try them, the agent would hallucinate something that didn't exist or generate 400 lines of code I couldn't meaningfully review, and every time I'd walk away thinking "this just created more work for me". What changed was realizing the problem wasn't the agent, it was me. I was throwing it at a codebase with zero context and expecting magic - something I would never do with a real engineer joining the team.
This is the key difference between AI in data and software engineering: in software, the code breaks or the result is visibly wrong, but in data the agent returns a confident response and validating it takes longer, so people either blindly trust it or become too skeptical to even try it. The problem is rooted in the inability to create and maintain a context layer - and that is usually a time constraint, not a money one.
Most teams don't have proper documentation or data governance, so when a new hire joins they read the pipeline code, query the tables, and ask questions. That's exactly what an LLM needs to do too. Your existing pipelines, models, and data warehouse already contain most of the context you need - you just have to organize it. This framework proposes giving agents access to existing open-source tools plus a set of specific instructions, so they run a constant loop of scanning pipelines, querying tables, and reading metadata and logs, then updating the model/asset files inside the pipelines with descriptions, definitions, tags, quality checks, and additional metadata. This way the context lives right inside the SQL, Python, and YAML files that already make up your pipeline - instead of in a separate system, database, knowledge base, or some other vendor-locked format.
In this post I'll break down why the skepticism is valid, why the root cause is missing context and not model quality, and then walk through how to actually build that context layer using open-source CLI tools, SQL, Python, and LLMs. We'll also get into self-healing pipeline tasks that keep the data and context healthy, and how to manage the transition with your team and stakeholders - because the technical solution is only half of it.
Trust is essential when it comes to working with data - if you lose the trust of the data consumers, then all the pipelines, tables, and reports lose their value. This was already an issue even when only humans were involved, but now with the introduction of AI agents trying to make changes to pipelines and tables or creating dashboards and reports, the issue has reached a critical point.
Here's how my journey started... 8 months ago I wanted to troubleshoot some missing data. I finally set up Bruin MCP in Cursor (after resisting Cursor itself in the first place) and asked it to query every table in the downstream lineage of the raw table to find where the data went missing. It almost found the problem, but without me asking, it changed an inner join to a left join and completely mangled all the existing inline comments and asset definitions. I was so frustrated I immediately deleted the branch, opened VS Code, created a new branch, and went back to doing it myself. Every time I tried it, I was disappointed. But to be honest, I was testing its limits by giving a vague prompt and expecting magic - and quite frankly, that's exactly how it goes with interns and juniors too: I give them a task like "investigate the missing data" without any context, and then they submit a PR where they ALSO changed an inner join to a right join.
Around 2 months after my first attempt with AI, I was completing a major redesign of a business-critical, operational pipeline. It also coincided with my knowledge transfer and offboarding from the company. So I finally did it - I locked myself in my home office for a week and, as I built the new pipeline (this time with some help from Cursor), I put real effort into writing down all the business context in a pipeline readme file, including a full explanation of the schedule and orchestration logic, inline comments on each CTE explaining the what and why of the logic, and custom quality checks that acted as unit tests - so that if a future person or agent made a change that fundamentally altered the table (e.g. the aggregation level), the check would fail. This was not easy, but by the end of it I was able to ask Cursor to query the data for me, enhance the table- and column-level descriptions, run the pipeline, and even help with investigations and troubleshooting.
Granted, the models and MCP servers have improved since then, but I learned that patience is key - the same way it would be when hiring and onboarding a new engineer. A week or two of handholding goes a long way.
My journey from extreme skepticism to now using Conductor to have a dozen different agents working like minions on different things has been transformational and quite shocking to myself even. Now let's dive deeper into the details.
Missing or weak context is not the root problem but rather a symptom. The real problem is the inability to create and maintain the context layer - usually a time constraint, not a money one. Even expensive and sophisticated services require a substantial investment of time and effort to deploy and maintain, and even when that is done, new models and frameworks fundamentally disrupt previous implementations and make them obsolete.
There have been attempts at building context managers for data stacks, but they are quite often made obsolete by new models that handle context better. These attempts normally involve complex processes to extract information from the pipelines, the warehouse, and people's heads, and load it into isolated systems (very often vendor-locked) set up solely for context management - think vector DBs, RAG setups, MCP servers, etc. Separate systems are hard to integrate into the existing workflows, processes, and architecture of data teams, which demands a significant commitment of time and effort - especially when such systems are constantly threatened by disruptive new models.
There's also a contrary belief that new models with bigger context windows can simply sift through all the pipelines and warehouse metadata to obtain the context. But setting aside the needle-in-a-haystack problem, this doesn't solve the problem of stale or wrong information - and if the agents were to validate everything by querying every table, they would have to do it every time a session starts, and the warehouse costs would explode (not even considering how slow it would be).
The "inability" to create and maintain a context layer is also a symptom of what data engineers' roles have become. Between ad-hoc requests, fixing broken pipelines, integrating new data sources, and so on, it becomes difficult to allocate time to such things. But I still think the "inability" is tied to the complexity, effort, and prioritization constraints caused by using the wrong tools - and even more importantly, by weak change management and implementation planning. Which is why we will dive deeper into "what" tools to use and "how" to implement them.
Here's the classic onboarding process of a new data engineer, scientist, or analyst:
they start with getting access to the repo, pipelines, warehouse, and any available documentation
they ask "what should I focus on" - in other words, what are the current active/operational pipelines and tables
they start with opening a table and running a `select * from xyz limit 10` to see what the table looks like
they cross-reference the table with the pipeline and lineage to understand where the data comes from and how it is processed
they will start investigating the why/how - so they start running a few more queries, checking the Python & SQL code, and reading table/column descriptions and any inline comments and documentations
lastly, they will talk to data consumers and read the queries coming from users and dashboards to understand where/how the data is actually used
The framework, as outlined above, gives agents access to tools and a set of specific instructions so they run a constant loop of scanning pipelines, querying tables, and reading metadata and logs, then updating the model/asset files inside the pipelines with descriptions, definitions, tags, quality checks, and additional metadata. This task can be built into the pipeline itself so the context is constantly kept up to date. It is essentially the same process a new hire follows - they don't read the documentation (there usually isn't any), they read the pipeline code, query the tables, and then ask questions.
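The gap-detection half of that loop can be sketched in a few lines. This is a hedged illustration, not Bruin's implementation: the `find_context_gaps` helper and the dict-shaped asset are hypothetical stand-ins for parsing the real asset files.

```python
# Hypothetical sketch: scan an asset's metadata and list every missing piece
# of context as a question for the agent to resolve or escalate.

def find_context_gaps(asset: dict) -> list:
    """Return human-readable questions for each gap in the asset's context."""
    gaps = []
    if not asset.get("description"):
        gaps.append(f"{asset['name']}: table has no description - what does one row represent?")
    for col in asset.get("columns", []):
        if not col.get("description"):
            gaps.append(f"{asset['name']}.{col['name']}: column has no description")
        if not col.get("checks"):
            gaps.append(f"{asset['name']}.{col['name']}: no quality checks defined")
    return gaps

orders = {
    "name": "analytics.orders",
    "description": "Customer orders, one row per order.",
    "columns": [
        {"name": "order_id", "description": "Unique order id", "checks": ["not_null", "unique"]},
        {"name": "total_amount", "description": "", "checks": []},
    ],
}

for question in find_context_gaps(orders):
    print(question)
```

Each question then either gets answered by the agent itself (by querying the data) or surfaced to a human, as described next.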
I'll be honest: anytime I hired someone (or even when I joined a company myself), the first task was to create or update the docs in the pipeline. It's a great way to create and maintain docs, but if it takes a new hire to do it, it's definitely not a reliable solution. And let's be honest, after the initial onboarding everyone is too busy to really maintain the docs, which is why they go stale - and stale docs are very dangerous for AI, because it assumes they are correct and relies on them for text-to-SQL, resulting in garbage output.
There are a few core principles of this framework:
keep it simple (no new tech)
keep it inside existing systems (no extra configs)
Building these tasks into the pipelines lets the agents get 90% of the metadata and context under control, while delivering the remaining 10% as a set of questions and tasks for someone on the data team to resolve. It is also important to build this step into existing workflows - for example, sending a message in Slack and tagging the team that can answer. It will look something like:
@marketing-emea is the Q2 ARPDAU target increase for Germany really 500%?
marketing-emea: yes that is normal, we launched a new marketing campaign (ID: 1234) and we increased the target
@data-eng is it normal that table xyz_f hasn't been updated since last month?
data-eng: yes that is normal, that is a legacy table and the correct table is xyz_t
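Surfacing such a question to Slack is a single webhook call. A minimal sketch, assuming a Slack incoming-webhook integration; the `build_question_payload` helper and the asset name are made up for illustration:

```python
import json

def build_question_payload(team: str, question: str, asset: str) -> str:
    """Format an agent question as a Slack incoming-webhook payload, tagging the owning team."""
    text = f"@{team} {question} (asset: `{asset}`)"
    return json.dumps({"text": text})

payload = build_question_payload(
    "marketing-emea",
    "is the Q2 ARPDAU target increase for Germany really 500%?",
    "analytics.kpi_targets",  # hypothetical asset name
)
print(payload)
# Delivery is one HTTP POST of this payload to your webhook URL
# (e.g. with urllib.request) - omitted here to keep the sketch offline.
```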
The ultimate goal is not to eliminate human involvement in managing the context, but to reduce it to a digestible amount. It helps to break the tasks/questions down into 2 categories:
ones that are answered naturally through conversations, like in Slack
ones that go into a backlog of tasks (in Jira, Linear, GitHub Issues) for the data team to address
The key here is that the agent should be instructed to ask questions when it encounters ambiguity rather than guessing - and those questions need to be surfaced to the right people, whether that's a Slack message, a PR comment, or a ticket in Linear. This is exactly what a good new hire does, they ask a lot of questions early on and that's a good thing. The difference is that every time someone answers one of these questions, the answer gets written back into the asset metadata or the glossary as a permanent record. So the context layer doesn't just get built once, it gets built continuously through every interaction and every question that gets answered.
I'm sorry for being blunt, but AI is not lazy like us humans - if you instruct it to always follow the same process and update the docs then it will do it, there's no "oh I forgot" or "I'll do that later".
Here's a concrete example of this Socratic method:
the agent is analyzing the orders table and sees a column called total_amount
it doesn't know if this is gross revenue or net revenue after refunds
instead of assuming, it flags it: "the column total_amount in analytics.orders has no description - is this gross or net revenue? should refunded orders be excluded?"
data team answers in Slack "net revenue after refunds, exclude cancelled orders"
that answer then gets written back into the asset file as the column description and into the glossary under the Revenue entity
based on the permissions and instructions, it will push it or make a PR for review
next time any agent or any new hire encounters that column, the answer is already there
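The write-back step is deliberately boring - the answer just becomes the column description. A sketch with a hypothetical `record_answer` helper operating on an in-memory asset; in practice the agent would edit the asset file and push or open a PR, as described above:

```python
def record_answer(asset: dict, column: str, answer: str, answered_by: str) -> None:
    """Write a human answer back into the asset metadata as permanent context."""
    for col in asset["columns"]:
        if col["name"] == column:
            col["description"] = answer
            # Keep a trail of who confirmed the definition.
            col.setdefault("meta", {})["answered_by"] = answered_by
            return
    raise KeyError(f"column {column!r} not found in {asset['name']}")

orders = {"name": "analytics.orders",
          "columns": [{"name": "total_amount", "description": ""}]}

record_answer(orders, "total_amount",
              "Net revenue after refunds, in USD. Cancelled orders excluded.",
              answered_by="#finance-data")

print(orders["columns"][0]["description"])
```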
I know it is easier said than done, and I'd be lying if I said it's magic - it definitely is not. The goal here is to make it easier and faster to get started on building the context layer, and more importantly, to make it easier to maintain.
In the example below I will utilize Bruin's open-source CLI features but the same principles and methodology apply to any tool out there.
If you're already using Bruin for your data pipelines, steps 1 and 2 will be irrelevant since you already have pipelines and tables/assets.
Set up your connection to your database/warehouse - keep in mind that this connection and its permissions will be used by the agent so make sure it is secure.
Let's get started by bringing in the basic schema of our tables. This will create a skeleton "pipeline" consisting of empty assets (no SQL code) that represent each table.
The ai enhance command analyzes each table the way a data analyst traditionally would - everything from checking min/max/unique values for each column to finding the aggregation level - then adds all that metadata to the context files it creates.
At its core, this ai enhance feature is just a long prompt to the agent instructing it to analyze the data - you can check out the repo to see the prompt: github.com/bruin-data/bruin
This function will generate table and column level descriptions, tags, quality checks, and table metadata.
bruin ai enhance my-analytics
# assets/analytics/orders.asset.yml (after enhancement)
name: analytics.orders
type: bq.source
description: "Customer orders with purchase details and fulfillment status. One row per order. Granularity: order-level."
tags:
  - ecommerce
  - transactions
columns:
  - name: order_id
    type: INTEGER
    description: "Unique identifier for each order"
    checks:
      - name: not_null
      - name: unique
  - name: customer_id
    type: INTEGER
    description: "Foreign key to analytics.customers"
    checks:
      - name: not_null
  - name: status
    type: VARCHAR
    description: "Current fulfillment status of the order"
    checks:
      - name: accepted_values
        value: ["pending", "shipped", "delivered", "refunded"]
  - name: total_amount
    type: FLOAT
    description: "Net order total in USD after discounts, before tax"
    checks:
      - name: not_null
      - name: positive
  - name: created_at
    type: TIMESTAMP
    description: "Timestamp when the order was placed (UTC)"
    checks:
      - name: not_null
What surprised me most was how much the metadata mattered. The first version of the agent answered poorly because it had no context. After running bruin ai enhance, the descriptions, column tags, and quality checks made the agent far more reliable. I spent more time cleaning metadata than tweaking prompts, and it paid off.
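To make the profiling step less abstract, here is roughly the kind of per-column analysis involved - my own approximation, not Bruin's actual `ai enhance` code:

```python
def profile_column(values: list) -> dict:
    """The first-pass stats an analyst (or agent) checks on an unfamiliar column."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_fraction": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
    }

amounts = [19.99, 42.50, None, 19.99, 120.00]
print(profile_column(amounts))
# A distinct count equal to the row count hints at a key column; an unexpected
# min/max or a high null_fraction hints at missing filters or upstream issues.
```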
Bruin also has a feature called Glossary, and right now we're working on integrating it into this workflow, along with built-in integrations into your knowledge base (Confluence, Notion, etc.) so it can create a glossary/dictionary of all your key metrics, terminology, and definitions.
Here's what the glossary looks like:
# glossary.yml
domains:
  finance:
    description: Revenue, costs, and margin analysis
    owners:
      - "Finance Team"
    contact:
      - type: "slack"
        address: "#finance-data"
entities:
  Revenue:
    description: Net revenue after refunds and discounts, in USD.
    domains:
      - finance
    attributes:
      Monthly:
        type: decimal
        description: "Sum of total_amount for delivered orders in a calendar month"
      Gross:
        type: decimal
        description: "Sum of total_amount before refunds"
  Customer:
    description: An individual or business that has completed at least one purchase.
    domains:
      - finance
    attributes:
      ID:
        type: integer
        description: The unique identifier of the customer in our systems.
      LTV:
        type: decimal
        description: "Lifetime value: total revenue attributed to this customer"
Inside an asset you can then extend the glossary definitions and attributes, so a column inherits its meaning from the corresponding glossary entity instead of duplicating the definition.
It is best practice to set up general instructions, rules, restrictions, and context that the agent can follow - and the best way to do this is by creating an AGENTS.md file.
In this file, you give the agent general instructions on which tools to use and how to use them, plus some general business context - although it is best to keep business context inside the pipeline itself, either in the glossary, the asset files, or the pipeline readme.
# AGENTS.md
## Data access
- Utilize Bruin MCP and Bruin CLI to connect to the warehouse and query the data.
- Use `bruin query --connection my-dwh --query "<SQL>"` for all data access
- Always show the SQL query and explain your reasoning before executing it
- Use LIMIT 10 when exploring unfamiliar tables or testing complex queries
- Read the `assets/` directory to understand available tables and their schemas before querying
- This is a **read-only** environment - never run INSERT, UPDATE, DELETE, or DROP statements
## Domain context
- "Revenue" always means net revenue after refunds unless explicitly stated otherwise
- All timestamps are stored in UTC
- The `status` field in orders uses: pending, shipped, delivered, refunded
- Q1 = Jan-Mar, Q2 = Apr-Jun, Q3 = Jul-Sep, Q4 = Oct-Dec (calendar year)
## When you're unsure
- If a metric definition is ambiguous, check the glossary.yml first
- If the glossary doesn't have the answer, ASK - do not guess
- Format questions as: "I found X in the data but expected Y - should I interpret this as Z?"
A context layer is only as good as the data underneath it - if the pipelines are broken or the data is stale, the context is wrong and the agents reach the wrong conclusions. The classic garbage-in & garbage-out pitfall is in full effect here.
Since the method introduced above requires clean and healthy data for the agent to build the context layer on top of, it is essential that your pipelines and tables are accurate and reliable. Hence all the buzz around "self-healing pipelines" - but let's look at what that actually means in practice.
It is important to build processes into pipelines that trigger agents to run specific tasks checking the health of the pipelines, the models, and the data itself. A set of rule-based, scope-restricted tasks lets agents automatically address common issues, or flag, triage, and quarantine them for human review.
Data contracts can be useful here too. Sometimes you need to define specific behaviour based on a quality check, along with a fallback plan. For example: if table xyz averages fewer than 100 data points per minute per location, don't insert, and use table abc instead (say, to prevent ML pipelines from running inference on low-quality data). Most of the time you can implement such dynamic pipelines with your existing orchestrator, whether it supports this out of the box or needs to be handled manually.
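The gate itself can be tiny. A sketch of the per-location contract just described, keeping the table names xyz/abc from the example; how you wire it into your orchestrator's branching is tool-specific:

```python
def choose_source(avg_points_per_minute: dict, threshold: float = 100.0) -> dict:
    """Data-contract gate: per location, read from xyz only while quality holds;
    otherwise fall back to abc so downstream ML never infers on thin data."""
    return {
        location: "xyz" if rate >= threshold else "abc"
        for location, rate in avg_points_per_minute.items()
    }

routing = choose_source({"berlin": 240.0, "paris": 65.0})
print(routing)  # {'berlin': 'xyz', 'paris': 'abc'}
```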
Let's explore some examples of how self-healing pipelines work - what the agent investigates and triages, how it communicates it to the team, and what steps are taken to resolve the issue.
Keep in mind, in all these examples the agent is responsible for the initial investigation and proposed solution(s), but depending on the permissions/instructions the agent will either send a message to the team, create a ticket or PR, or just merge the fix. This all depends on your risk appetite and how much freedom and autonomy you want to give to the agent.
In terms of monitoring, one thing I've done before and I'm not sure if it's the best way, is to create tasks in between other SQL tasks (or inside a python task) that logs specific things inside a table (e.g. pipeline status, error logs, row counts, quality check status, etc), and then connect those tables to Grafana to create monitoring dashboards and alerts.
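The logging tasks can be as simple as inserting one row per task run into a monitoring table. A sketch using sqlite as a stand-in for the warehouse; the table name and columns are a hypothetical choice:

```python
import sqlite3
from datetime import datetime, timezone

# sqlite stands in for the warehouse here; only the table shape matters.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pipeline_runs (
        pipeline TEXT, task TEXT, status TEXT,
        row_count INTEGER, ran_at TEXT
    )
""")

def log_run(pipeline: str, task: str, status: str, row_count: int) -> None:
    """One row per task run - the raw material for dashboards and agent checks."""
    conn.execute(
        "INSERT INTO pipeline_runs VALUES (?, ?, ?, ?, ?)",
        (pipeline, task, status, row_count,
         datetime.now(timezone.utc).isoformat()),
    )

log_run("orders_daily", "load_orders", "success", 18342)
log_run("orders_daily", "dedup_customers", "failed", 0)

failed = conn.execute(
    "SELECT task FROM pipeline_runs WHERE status = 'failed'"
).fetchall()
print(failed)  # [('dedup_customers',)]
```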
One way of utilizing these monitoring tables is to create scheduled agent tasks (e.g. daily) that check these tables and if there is something that needs to be escalated, it will send a message in Slack with charts and graphs visualizing the logs - this is much more efficient than setting a reminder to check the monitoring dashboards. This is similar to classic alerting systems but it goes a step further because the agent comes to you with not just an alert, but the data visualized and analyzed, and even with a proposed solution and a PR ready for review.
Here's an example:
a weekly scheduled agent task runs every Monday morning to check for anomalies in website traffic
this week it notices that paid_social traffic has been declining day over day for 7 days
at the same time, a new UTM appeared (utm_source=fb_spring25, utm_medium=NULL) and its traffic has been growing at roughly the same rate
because utm_medium is null, this traffic gets bucketed into the not set channel instead of paid_social - so the traffic didn't actually drop, it was just misattributed
the agent sends a message in #marketing-data with a chart showing both trends side by side, identifies that the campaign links were likely updated around April 14 with the new UTMs missing utm_medium and utm_campaign, and proposes a fix: either update the campaign links or add a mapping rule in the pipeline that maps fb_spring25 to paid_social
without this agent, the marketing team would notice the drop in their weekly report, open a ticket, the data team would investigate and it would take days to figure out it was just a UTM tagging issue
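The detection half of that weekly task doesn't need anything fancy - a trailing-decline check on the two daily series is enough to raise the flag. A sketch with made-up numbers mirroring the example:

```python
def declining_streak(series: list) -> int:
    """Length of the trailing strictly-decreasing run in a daily series."""
    streak = 0
    for prev, cur in zip(series, series[1:]):
        streak = streak + 1 if cur < prev else 0
    return streak

# Made-up daily sessions mirroring the example above.
paid_social = [980, 930, 870, 800, 720, 650, 560, 470]
not_set = [12, 60, 120, 190, 270, 340, 430, 520]

falling = declining_streak(paid_social)
rising = declining_streak([-x for x in not_set])  # negate to reuse the same check
if falling >= 6 and rising >= 6:
    print("paid_social falling while (not set) grows - possible UTM misattribution")
```

The agent's real value-add is everything after the flag: charting both trends, tracing the misattribution to the new UTMs, and proposing the fix.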
Most databases and warehouses support some form of table-valued functions (TVFs) - in BigQuery it's called TABLE FUNCTION, in Snowflake it's UDTF, in PostgreSQL it's a function with RETURNS TABLE, and SQL Server has had TVFs for a long time. The naming is different but the concept is the same, you define a function that takes specific parameters and returns a table as a result.
One interesting method I found effective is using TVFs as the interface between your agent and the data - you define a function that is used to query the tables; think of it as an API for the table(s). You then expose only these TVFs to the agents, and they have to provide the specific parameters to get a result back.
In other words, instead of trusting that the agent will query the table the correct way, you offload some of that responsibility to a built-in function that enforces it. This is especially important when querying large tables that must be filtered by partition, otherwise you'll have agents querying TBs of data just to answer a simple question.
The reason this works so well for agents is that some tables have very specific query logic that needs to be followed - certain filters that must always be applied, certain joins that need to happen in a specific order, certain aggregation levels that shouldn't be changed. If you let an agent write a raw SQL query against these tables, there's a real chance it will miss a critical filter or join incorrectly and return results that look right but are wrong. By wrapping the table in a TVF, you bake in the correct logic and the agent just has to provide the parameters - it literally cannot query the data the wrong way.
Here's a concrete example:
you have an analytics.orders table that should always be filtered by date range, should always exclude test orders (is_test = false), and should always join with analytics.customers to get the customer region
without a TVF, the agent might run SELECT * FROM analytics.orders WHERE created_at > '2025-01-01' and get results that include test orders and have no region info
with a TVF, the agent calls SELECT * FROM analytics.fn_orders(start_date, end_date, region) and the function handles the test order filter, the customer join, and the date range logic internally
-- BigQuery example
-- Note: the parameter is named region_filter rather than "region" so it can't
-- collide with the customers.region column inside the function body.
CREATE TABLE FUNCTION analytics.fn_orders(
  start_date DATE,
  end_date DATE,
  region_filter STRING
)
AS (
  SELECT
    o.order_id,
    o.customer_id,
    c.region,
    o.status,
    o.total_amount,
    o.created_at
  FROM analytics.orders o
  JOIN analytics.customers c ON o.customer_id = c.customer_id
  WHERE o.is_test = false
    AND o.created_at BETWEEN start_date AND end_date
    AND c.region = region_filter
);
Then in your AGENTS.md you just tell the agent to use analytics.fn_orders instead of querying the raw table directly - the agent doesn't need to know about the test order filter or the customer join, it just provides the date range and region and gets clean results back.
Everything above is the technical solution, but that's only half the battle. The other half is getting your team and stakeholders on board with this new way of working. You can build the best context layer in the world but if your data engineers don't trust the agents and your stakeholders don't understand what's changing then it won't matter.
The pattern I've seen work well is a gradual ramp-up. It usually starts with MCP integrations in Cursor or Claude Code, used just to look up the data platform's documentation and query the warehouse to find the specific part of a query or script causing an issue. Then, after writing extensive agent rules/instructions and a readme for each pipeline containing some business context, the team can start relying on agents to build pipelines or make major changes.
The next step is usually a more structured workflow that involves AI agents completing some tasks - such as linting, automated pipeline runs, code review, and most importantly updating the documentation.
The data engineering team must act as role models. No other team will trust AI if the team responsible for creating the data infrastructure doesn't. This is also the time to test things internally before giving data consumers access.
Here's an overview of what an iterative roll-out internally within the data eng team looks like:
set up agents and MCPs with access to your pipelines and repos
ask the agents to help with basic auto-fill, syntax, and documentation lookup tasks
ask the agents to query the data and investigate reported issues, only helping with the preliminary investigation
ask the agents to write inline comments, table/column level descriptions, and create readme files for each pipeline
create an AGENTS.md file that acts as your Data Engineering Bible - a consolidation of your data architecture, style guide, best practices, and design principles
ask the agents to help with proposing solutions to problems, optimize queries, and create the skeleton of new pipelines/assets
create the 1.0 version of the context layer - run the ai enhance or similar functions to complete any missing documentation, metadata, and context
build more strict and consistent tasks for constantly updating/maintaining your documentation and context (don't hold back here, be relentless and keep going until every pipeline, table, column, and even CTE is documented)
start putting everything to the test - ask the agent to query and analyze, investigate and troubleshoot, propose ideas and solutions, and even build a pipeline end-to-end (this is the only way to truly put it to the test)
important reminder: during this stage, you will be frustrated, impatient, and feel like giving up but try to take the failures as feedback that needs to be addressed - the same way when a new hire makes a mistake, you don't immediately fire them, it is a signal that there's a lack of context or tools to do their job
This is not a step-by-step guide, but a general overview of the journey I've been through and have seen other teams go through as well. It's also not the finish line - just the first checkpoint before you get the data consumers involved.
After the internal data engineering effort is done, you will reach the point where domain expertise really matters.
One important thing to keep in mind when you reach this phase is to avoid investing in a costly, vendor-locked, effort-heavy solution. You still haven't proven the efficacy of an AI agent enough to justify the investment - that's why using free open-source tools is the answer.
This might require you to find a few champions in the analytics, marketing, product, and other teams: hop on a call with them, install Cursor or Claude Code, set up the MCP and warehouse connection, and show them how to ask questions and get an answer.
There are many tools you can use to implement a simple solution like this, for Bruin I've put together this tutorial to set up the whole thing end-to-end: getbruin.com/learn/ai-data-analyst
Get these early adopters and champions to put it to the test, share feedback, and (again) most importantly improve the context - but this time from a business domain perspective.
For example, a product marketing analyst might:
ask the agent "what was our conversion rate from the spring campaign by region?"
the agent queries the data and returns a result, but the analyst notices something off - the agent included free trial signups in the conversion calculation
the analyst corrects it: "conversions should only count paid signups, free trials are tracked separately"
that feedback gets written back into the glossary under the Conversion entity: "Conversion: a completed paid signup. Free trial signups are excluded and tracked under the Trial entity."
from this point on, every agent and every new team member knows the correct definition
This is the kind of context that the data engineering team would never have on their own - it lives in the heads of the domain experts. The whole point of involving these champions early is to extract that knowledge and bake it into the context layer so that the agents can actually be useful to the broader team.
At this point you've got the context layer built, the data engineering team is on board, and a few champions from other teams have been testing it. Now you want to roll this out to the rest of the company - but it doesn't make sense to set up Cursor or Claude Code locally for every data consumer; that's not scalable, and most of them won't bother with the setup anyway. You need something that lives where people already work, like Slack, Teams, or even WhatsApp.
The agent should be self-learning, meaning every conversation it has with a data consumer is an opportunity to improve the context. When someone corrects the agent or provides a clarification, that gets written back into the documentation. But here's where it gets interesting: the agent also needs to detect conflicts. If person A from marketing says "conversion means paid signups only" but person B from product says "conversion includes free trials", the agent should detect the conflict and escalate it rather than just overwriting the previous definition. Not every correction should be treated equally either - you need specific permissions and rules for whose response is actionable. Asset owners should have the authority to change the context for their assets, but a random person in sales shouldn't be able to override a definition set by the finance team.
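Conflict detection plus owner-based permissions can be expressed as one small rule. A sketch with a hypothetical in-memory glossary and `propose_update` helper; a real implementation would act on glossary.yml through a PR:

```python
# Hypothetical in-memory glossary; the real context would live in glossary.yml.
glossary = {"Conversion": {"definition": "a completed paid signup", "owner": "marketing"}}

def propose_update(term: str, new_definition: str, author_team: str) -> str:
    """Apply a correction only if it comes from the owning team; else escalate."""
    entry = glossary.get(term)
    if entry is None:
        glossary[term] = {"definition": new_definition, "owner": author_team}
        return "created"
    if author_team == entry["owner"]:
        entry["definition"] = new_definition
        return "updated"
    if new_definition != entry["definition"]:
        return "conflict: escalate to the owner instead of overwriting"
    return "no-op"

print(propose_update("Conversion", "includes free trials", "product"))
```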
Access management is also critical here. Each team or department's agent should only have access to the data they are allowed to access - marketing shouldn't be able to query finance and HR tables, and the product team shouldn't have access to raw customer PII unless they have the right permissions. This is where the connection and warehouse level permissions come into play, you can create separate connections with different access levels and assign them to the agents for each team.
Beyond just asking questions and getting responses, teams will also want traditional dashboards and scheduled reports - but at this point they want an "agentic" way to create them and ask follow-up questions. Imagine a product manager creates an entire dashboard just by asking the agent questions, and then one day they notice a specific trend and want to analyze it further, so they tag that specific chart (essentially adding the chart's query and metadata to the context) and ask a follow-up question - this mimics the exact interaction a product manager and a product analyst would normally have.
Bruin Cloud supports this out of the box: getbruin.com/dashboards - but regardless of the tool you use, the important thing is that data consumers can go from asking a question to saving it as a recurring report without needing to involve the data engineering team every time.
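The chart-tagging interaction above can be as simple as prepending the tagged chart's query and metadata to the follow-up question before it reaches the agent. A hedged sketch - the field names and the example query are made-up placeholders:

```python
def build_followup_prompt(question: str, tagged_chart: dict) -> str:
    """Attach a tagged chart's query and metadata to a follow-up question,
    mimicking handing an analyst the exact chart you're asking about."""
    return (
        f"Chart: {tagged_chart['title']}\n"
        f"Source query:\n{tagged_chart['query']}\n"
        f"Tables: {', '.join(tagged_chart['tables'])}\n\n"
        f"Follow-up question: {question}"
    )

# hypothetical chart metadata captured when the dashboard was created
chart = {
    "title": "Weekly signups by channel",
    "query": "SELECT week, channel, COUNT(*) FROM marts.signups GROUP BY 1, 2",
    "tables": ["marts.signups"],
}
prompt = build_followup_prompt("Why did organic dip in March?", chart)
```

Because the chart's exact query travels with the question, the agent reasons about the same numbers the user is looking at, instead of re-deriving them from scratch.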
And inevitably, teams will find issues - whether it's a wrong number in a response, a broken chart in a dashboard, or a report that doesn't look right. The agent should be able to take the context from the conversation and create a detailed ticket in Jira, Linear, or GitHub Issues. Not a vague "data looks wrong" ticket, but an actual detailed issue with the query that ran, the expected vs. actual result, the tables involved, and even a preliminary investigation into what might be causing it (e.g. in the asset reports.sales_emea, the CTE "dedup_customers" is missing "transaction_id" in the QUALIFY statement). This alone will save a ton of back-and-forth between data consumers and the data engineering team, because normally the first 3-4 messages in a ticket are just the data engineer asking "which table? which column? what date range? what did you expect?" - the agent already has all of that context from the conversation.
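A sketch of how the agent could assemble such a ticket from conversation context - `build_issue` and its field names are hypothetical, and the resulting payload would need to be adapted to your tracker's actual API (e.g. GitHub's create-issue endpoint):

```python
def build_issue(context: dict) -> dict:
    """Turn conversation context into a structured ticket payload.

    Field names are illustrative; map them to your tracker's schema.
    """
    body = "\n".join([
        f"**Query that ran:**\n```sql\n{context['query']}\n```",
        f"**Expected:** {context['expected']}",
        f"**Actual:** {context['actual']}",
        f"**Tables involved:** {', '.join(context['tables'])}",
        f"**Preliminary investigation:** {context['hypothesis']}",
    ])
    return {
        "title": f"Data issue in {context['asset']}",
        "body": body,
        "labels": ["data-quality"],
    }

# example context the agent would carry over from the conversation
issue = build_issue({
    "asset": "reports.sales_emea",
    "query": "SELECT SUM(amount) FROM reports.sales_emea WHERE region = 'EMEA'",
    "expected": "~1.2M for March",
    "actual": "2.4M (roughly doubled)",
    "tables": ["reports.sales_emea", "raw.transactions"],
    "hypothesis": "CTE dedup_customers may be missing transaction_id in its QUALIFY clause",
})
```

Every field here answers one of the questions the data engineer would otherwise have to ask in the first few ticket messages.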
The ultimate question that remains is how to convince the C-suite, executives, and management to get on board. That's exactly why I propose this iterative, "start small, free, and minimal-effort" solution: it is much more convincing to say "here's what we've done so far, here's the impact, here's the adoption rate" than to propose a hypothetical plan with arbitrary targets. Setting aside IT restrictions for now, a proof of concept and minimum viable product is the easiest way to prove the value of such a solution before investing more time and money.
Using AI in data engineering and data analysis is inevitable, but it is also facing the most resistance because agents can return results that look plausible when they are completely wrong - data is used to make decisions, often decisions that can make or break a company. That's why the skepticism and hesitation from data teams about relying on AI is understandable.
Every company wants to solve this problem because they understand the value, but the data team is saying "context is broken" and doesn't have the time or resources to invest in fixing it.
That's why a ground-up, iterative, and patient approach is necessary:
start small and onboard your AI agent like a new hire, get it to analyze the data, read your pipeline code, and explore - focus on the existing knowledge, not that 10% of context that lives outside the pipelines
stress test the agent internally in the data engineering team using free open-source tools that require minimal setup - focus on improving data quality and documentation, not the smartest or fastest model
roll out the agent to a few champions across different teams - focus on translating their business domain into context (the 10% left from before), not some fancy expensive "semantic tool"
integrate the agent inside the communication channels and workflows already in use (Slack, Teams, WhatsApp, etc.) - focus on the feedback loop that feels "human", not a cold robot that no one wants to talk to
In this article, we've gone through the journey from an individual data engineer to the entire data engineering team, to the early adopters in your organization, to all the data consumers, and finally to the executives and decision makers.
The problem isn't the model, it's the inability to create and maintain a context layer - and that inability is a time and tooling problem, not a money one.
The problem:
agents return confident but wrong answers because they don't have the context to know, for example, how "revenue" is defined across teams, which table is stale, or which join is required - they guess instead of asking
existing solutions like vector DBs, RAG pipelines, and other vendor-locked context managers are complex to set up, hard to maintain, and made obsolete every time a new model drops - but more importantly, they require too much effort to get started and prove their value
The solution:
start simple with open-source free tools to improve the context using the knowledge that already exists inside your pipelines and infrastructure - asset definitions, glossaries, inline comments, quality checks, all living directly in your pipeline files so it stays close to the code and gets maintained as part of the normal workflow
roll it out step-by-step, at each step enhancing the context using business domain knowledge from the people who actually use the data - this way you prove the value using an MVP before approaching execs with a big proposal
Onboard the agent like a new hire - give it access to your existing pipelines and warehouse and let it explore.
map out the tables and schema, write basic descriptions, improve docs and inline comments - get the agent to do the heavy lifting here by analyzing the data and generating the initial metadata
iterate back and forth with the data eng team to improve the context and metadata - treat every wrong answer as a signal that something is missing and fill in the gap (reminder: BE PATIENT)
set up self-healing tasks inside the pipelines to make sure the underlying data is healthy and accurate - the context layer is useless if the data underneath it is broken
roll it out to champions who have the most domain expertise (and data literacy) - this will further close the gap in missing context that the data engineering team wouldn't have on their own
integrate the agent inside existing communication channels where data consumers can ask questions, create dashboards, and schedule reports
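The self-healing tasks mentioned above can be sketched as a simple check-remediate-alert loop; the `check` and `remediate` callables below are placeholders for real quality checks (freshness, uniqueness, nulls) and known-safe remediations like re-running a backfill:

```python
def run_with_self_healing(check, remediate, max_attempts=2):
    """Run a data quality check; on failure, attempt a known-safe
    remediation before alerting. Callables are placeholders."""
    for attempt in range(max_attempts):
        if check():
            return "healthy"
        if attempt < max_attempts - 1:
            # e.g. re-run a backfill or refresh a materialization, then re-check
            remediate()
    return "alert: check still failing after remediation"

# toy example: a staleness check that passes after one remediation
state = {"fresh": False}
result = run_with_self_healing(
    check=lambda: state["fresh"],
    remediate=lambda: state.update(fresh=True),
)
```

The important constraint is that remediations are bounded and pre-approved - if the check still fails after the allowed attempts, the task alerts a human instead of retrying forever, because the context layer is only trustworthy if the data underneath it is.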
The next step is to take this proof-of-concept to the execs and show them the value.
Start with the data engineering team, expand to domain champions, then roll it out to the rest of the company through the channels they already use.
get the data eng team comfortable first - start with documentation lookup and investigation, then gradually expand to pipeline building and code review until the 1.0 context layer is solid enough to put in front of other teams
find champions in marketing, product, finance, etc. and let them break it - their corrections and feedback must be treated gracefully and patiently, involve them as design partners and not just data consumers
roll it out to the rest of the company via the common communication channels - focus on proper access management, conflict detection, self-service dashboards and reports, and automatic ticket creation and troubleshooting
Don't go to the C-suite with a hypothetical plan and arbitrary targets - go with a working proof-of-concept, real adoption numbers, and concrete examples of time saved.
Start small, start free, prove the value, then scale.