Step · 5-10 min

Teach the AI Your Domain

Use bruin ai enhance to document every column, then add a pipeline-scoped AGENTS.md with marketing context so the agent reasons about attribution, CAC, and LTV like a real analyst.

What you'll do

  1. Run bruin ai enhance so the agent auto-documents every table and column
  2. Create a pipeline-specific AGENTS.md inside the marketing-analyst-101/ folder with marketing-specific domain knowledge, attribution rules, and query guidelines

Why this step matters

Data alone is not enough. An AI agent looking at a column called cost_micros will guess "cost, in some unit" - fine for casual questions, wrong for CAC math (the stored value is dollars × 1,000,000, so $5.00 is stored as 5000000). Your job in this step is to make sure the agent never has to guess about units, time grains, or attribution logic.
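To see why the unit matters, here's a minimal sketch with made-up numbers ($5.00 of spend, 2 new customers) showing how skipping the micros conversion inflates CAC by a factor of one million:

```python
# Google Ads stores cost in micros: millionths of a dollar.
MICROS_PER_DOLLAR = 1_000_000

cost_micros = 5_000_000                   # how $5.00 arrives from the API
cost_usd = cost_micros / MICROS_PER_DOLLAR

new_customers = 2
cac_right = cost_usd / new_customers      # $2.50 - correct
cac_wrong = cost_micros / new_customers   # 2,500,000 - off by 1,000,000x

print(cost_usd, cac_right, cac_wrong)
```

Nothing in the schema alone tells the agent which of those two CAC numbers is right - that's what the context layers below are for.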

You're giving the agent three layers of context now:

  • Workspace-level rules - the root AGENTS.md from Step 2 tells the agent how to work with Bruin (CLI, validate, test limits, etc.) across any pipeline in this workspace.
  • Schema context - what each table and column is. Auto-generated in seconds with bruin ai enhance.
  • Pipeline-specific domain context - how a marketing analyst thinks about attribution, CAC, LTV, and cross-channel blending for this pipeline. Lives next to the pipeline as marketing-analyst-101/AGENTS.md so it travels with it and doesn't leak into other pipelines.

AI coding tools (Claude Code, Cursor, Codex) read nested AGENTS.md files when they work inside that subfolder, so the pipeline-scoped file automatically layers on top of the workspace-level one. Together these turn the agent from "eager intern" into "actual analyst."

1. Auto-enhance schema context

Prompt:

AI Prompt

Run bruin ai enhance marketing-analyst-101/ and let me know when it's done. It should enhance assets across all three layers - raw.*, staging.*, and reports.*. Then show me the changes it made to google_ads_campaigns.asset.yml (raw) and reports/channel_daily.sql (reports) so I can spot-check the descriptions.

This command looks at each table, asks an LLM to describe what it sees, and writes column descriptions directly into the asset YAML - across all three layers. After it runs, your google_ads_campaigns.asset.yml will have a columns: block like:

columns:
  - name: campaign_id
    type: bigint
    description: "Google Ads campaign ID"
  - name: date
    type: date
    description: "Date of the campaign stats (UTC)"
  - name: cost_micros
    type: bigint
    description: "Cost in micros - divide by 1,000,000 for USD"
  # ...many more

These descriptions become part of the MCP context the agent reads before every query. You don't need to memorize any of this - the agent will look it up when it needs to.

If the agent got a description wrong (e.g. it misread a column meaning), ask it to fix that specific line. The YAML is yours to edit.
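If you want to verify a description empirically rather than trust it - e.g. that dividing `cost_micros` by 1,000,000 really reproduces the normalized `cost_usd` downstream - a short script does it. This is a sketch only: it uses Python's built-in `sqlite3` as a stand-in for DuckDB, with flattened table names and invented rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- stand-ins for raw.google_ads_campaigns and staging.google_ads
    CREATE TABLE raw_google_ads_campaigns (campaign_id INTEGER, date TEXT, cost_micros INTEGER);
    CREATE TABLE staging_google_ads      (campaign_id INTEGER, date TEXT, cost_usd REAL);

    INSERT INTO raw_google_ads_campaigns VALUES (1, '2024-01-01', 5000000), (2, '2024-01-01', 1250000);
    INSERT INTO staging_google_ads       VALUES (1, '2024-01-01', 5.0),     (2, '2024-01-01', 1.25);
""")

# Rows where the micros -> USD conversion disagrees with staging (should be none)
mismatches = con.execute("""
    SELECT r.campaign_id
    FROM raw_google_ads_campaigns r
    JOIN staging_google_ads s USING (campaign_id, date)
    WHERE ABS(r.cost_micros / 1e6 - s.cost_usd) > 0.005
""").fetchall()

print(mismatches)  # empty list if the conversion holds
```

In practice you'd run the same query against the real DuckDB file via your agent or the DuckDB CLI; the point is that descriptions are claims you can test.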

2. Create a pipeline-specific AGENTS.md

The root AGENTS.md you seeded in Step 2 covers how the agent should work with Bruin in general. This new file covers what this specific pipeline's data means - glossary, attribution rules, unit caveats. It lives inside the pipeline folder (marketing-analyst-101/AGENTS.md) so the marketing-specific context stays scoped to the pipeline it belongs to - if you later add a different pipeline in the same workspace, you won't pollute it with marketing glossary entries.

Prompt the agent:

AI Prompt

Create a new file at marketing-analyst-101/AGENTS.md (inside the pipeline folder - not the workspace root) with the content below. Do not modify the root AGENTS.md; the pipeline-specific file layers on top of it automatically when the agent works inside the pipeline folder. After you're done, show me the file.

# AGENTS.md - marketing-analyst-101

## Pipeline overview
This is a marketing-analyst research pipeline. Data lives in DuckDB
(`./marketing-analyst-101/marketing.duckdb`), organized in three layers:

**Raw layer** (`raw.*`) - unmodified from source:
- `raw.google_ads_campaigns` - daily campaign-level ad spend and conversions (from Google Ads). `cost_micros` is in **millionths of a dollar**.
- `raw.klaviyo_campaigns` - campaign-level email metrics and attributed revenue (from Klaviyo)
- `raw.ga4_traffic` - daily sessions by source/medium with conversion events (from GA4)

**Staging layer** (`staging.*`) - cleaned, typed, deduplicated, unit-normalized:
- `staging.google_ads` - adds `cost_usd`, `ctr`, `cvr`, `cpc_usd`; deduped on (campaign_id, date)
- `staging.klaviyo` - adds `open_rate`, `click_rate`, `revenue_per_recipient`; deduped on campaign_id
- `staging.ga4` - adds a normalized `channel` column (Paid Search / Organic / Email / Direct / Other) and `cvr`

**Reports layer** (`reports.*`) - analysis-ready cross-channel surface:
- `reports.channel_daily` - one row per (channel, date) for the last 90 days, with `cost_usd`, `clicks`, `conversions`, `attributed_revenue`, `cac_usd`, `roas`, and a `data_caveat` column documenting per-channel attribution assumptions

The DuckDB connection name is `duckdb-default`.

## Layer usage rules (IMPORTANT)
**Default to `reports.*` for all analysis.** It's pre-cleaned, unit-normalized, and the cross-channel join is already done.

Only drop to lower layers with an explicit reason:
- **`staging.*`** - when reports doesn't surface a column you need (e.g. `campaign_name`, `subject_line`, `sessionSource`), or when validating a transform
- **`raw.*`** - for **data validation** (does cost_micros / 1e6 in raw match cost_usd in staging?), **troubleshooting** (where did a wrong number come from?), or when you genuinely need a column that hasn't been promoted upstream yet

When you do drop down, **say so in your response**: "I'm using `staging.google_ads` here because `reports.channel_daily` doesn't break out by `campaign_name`." This keeps the user aware of which surface you're querying.

## Domain glossary (marketing)
- **CAC** - Customer Acquisition Cost: marketing spend ÷ new customers acquired, over the same period
- **Blended CAC** - CAC that includes spend from all channels (paid + organic email + organic social). Divide total marketing cost by total new customers.
- **LTV** - Lifetime Value: total net revenue attributed to a customer over their lifetime
- **LTV:CAC** - ratio; healthy e-commerce is typically >3
- **ROAS** - Return on Ad Spend: revenue attributed to ads ÷ ad spend
- **CTR** - Click-through rate: clicks ÷ impressions
- **CVR** - Conversion rate: conversions ÷ clicks (for ads) or conversions ÷ sessions (for web)
- **CPM** - Cost per 1,000 impressions
- **CPC** - Cost per click
- **Open rate / Click rate** - Klaviyo: opens ÷ recipients, clicks ÷ recipients (unique by default)
- **Attribution window** - the time after an ad click / email click during which a conversion is credited to it. GA4 default is 30 days click / 1 day view; Klaviyo default is 5 days.

## Units and time caveats
- `raw.google_ads_campaigns.cost_micros` is in **micros** - divide by 1,000,000 for USD dollars
- `raw.klaviyo_campaigns.revenue` is in USD (dollars, not cents)
- `raw.ga4_traffic.purchaseRevenue` is in the GA4 property's reporting currency (usually USD)
- All `date` columns are UTC. If your business runs on US-Eastern time, you may see one-day offsets vs. the source platforms' native UIs.
- GA4 data has **24–48 hour delay** for full accuracy. Last two days of `ga4_traffic` are estimates.
- Klaviyo `opens` are unreliable on iOS (Apple Mail Privacy Protection inflates them). Prefer `clicks` as engagement signal.

## Attribution rules of thumb
- **Don't double-count revenue.** Google Ads, Klaviyo, and GA4 each have their own attribution model. If you sum attributed revenue across all three it will exceed your actual revenue.
- **Prefer last-non-direct** for cross-channel attribution when available. GA4's `sessionSource` / `sessionMedium` already apply this.
- For **paid vs. owned** comparisons, treat Google Ads revenue and Klaviyo revenue as separate, and use GA4's `source/medium` breakdown to sanity-check overlap.

## Query guidelines
- For all channel-level analytical questions, start with `reports.channel_daily`. CAC, ROAS, and cost are pre-computed; you usually only need to filter and aggregate.
- For campaign-level breakdowns, drop to `staging.google_ads` (paid) or `staging.klaviyo` (email). Note that `reports.channel_daily` aggregates by channel, not campaign.
- Always specify the time window explicitly. State which channels count as "paid" if computing blended CAC.
- For cross-channel attribution comparisons, surface the `data_caveat` column from `reports.channel_daily` or restate the assumptions in plain English.
- Never sum `attributed_revenue` across channels and call it "total revenue" - each channel uses its own attribution model. Present per-channel.

When the agent is done, your workspace has two AGENTS.md files working in tandem: the root one with Bruin rules (applies everywhere), and marketing-analyst-101/AGENTS.md with marketing domain context (applies when the agent is working inside this pipeline). Restart your AI coding tool (new Claude Code session / reopen Cursor) so it picks up the new file.
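Before moving on, it's worth seeing the glossary math in action. A toy calculation (all numbers invented) applying the blended-CAC and LTV:CAC definitions from the file above:

```python
# Hypothetical monthly figures per channel (invented for illustration)
spend_usd = {"Paid Search": 12_000, "Email": 1_500, "Organic": 0}
new_customers = 450
avg_ltv_usd = 110.0

# Blended CAC: total spend across ALL channels / total new customers
blended_cac = sum(spend_usd.values()) / new_customers
ltv_to_cac = avg_ltv_usd / blended_cac   # glossary: healthy e-commerce is typically >3

print(f"Blended CAC: ${blended_cac:.2f}, LTV:CAC = {ltv_to_cac:.1f}")
# → Blended CAC: $30.00, LTV:CAC = 3.7
```

This is exactly the kind of arithmetic the agent will now do unprompted - including the "all channels" detail in the blended-CAC denominator that a context-free agent would likely miss.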

3. Sanity-check the context

Ask your agent a scoping question that should only be answerable if it read AGENTS.md:

AI Prompt

In this project, what unit is cost_micros stored in, and how should I convert it to USD?

The agent should answer with "micros - divide by 1,000,000". If it fumbles or asks you instead, the pipeline-level AGENTS.md wasn't loaded - restart the session, and make sure the agent is working inside the marketing-analyst-101/ folder.

What just happened

Your agent now has three layers of context: workspace-level Bruin rules (root AGENTS.md), per-column schema descriptions (auto-generated into the asset YAML), and pipeline-specific marketing knowledge (marketing-analyst-101/AGENTS.md). Every question you ask from here on out is reasoned against this stack - so you won't get CAC numbers off by 1,000,000x, or revenue double-counted across channels. You've just done what takes a new human analyst weeks of onboarding.