Step 4 · 12 min

Build Staging + Report Layers

Add a clean middle layer on top of the raw ad, email, and web data, then build a final report table that unifies channel-level performance - the analysis-ready table your AI agent will query.

What you'll do

Ask the agent to add two new layers of SQL assets on top of the raw tables from Step 3:

  1. staging.* views - one per raw source, cleaning and typing the data, converting units (micros → USD), computing derived columns
  2. reports.* table - the cross-channel analysis-ready surface that the AI will query in Step 6

Why this step matters

Your raw tables are the source of truth - but they're messy: cost_micros is in millionths of a dollar, GA4 has 24–48 hour reporting delays, Klaviyo opens are inflated by Apple Mail Privacy Protection, and each source reports in its own attribution model. Asking analytical questions directly against raw means re-deriving the same transforms every time, in every prompt.

The layered pattern fixes that once:

  • raw.* - untouched copy of what each source returned (Step 3 output)
  • staging.* - cleaned, typed, deduplicated, unit-normalized. One view per source.
  • reports.* - business-ready cross-channel joins with pre-computed KPIs (CAC, ROAS, CTR, CVR, blended CAC by day)

Once this layer exists, you ask questions against reports.* and the AI doesn't have to remember to divide cost_micros by 1,000,000 or pick an attribution model every session. Faster, cheaper, less room for mistakes.
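For instance, once the report layer exists, a question like "what is our blended CAC by day?" becomes a one-line query with no unit conversion and no attribution juggling (a sketch against the reports.channel_daily columns defined later in this step):

```sql
-- Blended CAC by day: total spend across all channels / total conversions.
-- No cost_micros division, no per-source cleanup - the staging layer did it once.
SELECT
    date,
    SUM(cost_usd) / NULLIF(SUM(conversions), 0) AS blended_cac_usd
FROM reports.channel_daily
GROUP BY date
ORDER BY date;
```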

1. Scaffold the staging views

Prompt the agent:

AI Prompt

In marketing-analyst-101/assets/, create three Bruin SQL assets for the staging layer. All should materialize as views (cheap, always fresh) with schema staging.

staging/google_ads.sql → staging.google_ads

  • Source: raw.google_ads_campaigns
  • Cast date to DATE, trim strings, and dedupe on (campaign_id, date) using QUALIFY ROW_NUMBER() OVER (PARTITION BY campaign_id, date ORDER BY cost_micros DESC) = 1 (any deterministic tiebreaker works; ordering by date inside a partition keyed on date would be a no-op)
  • Convert cost_micros → cost_usd = cost_micros / 1e6
  • Derived metrics per row: ctr = clicks / NULLIF(impressions, 0), cvr = conversions / NULLIF(clicks, 0), cpc_usd = cost_usd / NULLIF(clicks, 0)
  • Keep: campaign_id, campaign_name, date, impressions, clicks, conversions, conversion_value, cost_usd, ctr, cvr, cpc_usd

staging/klaviyo.sql → staging.klaviyo

  • Source: raw.klaviyo_campaigns
  • Cast send_time to TIMESTAMP and derive send_date = send_time::DATE
  • Dedupe on campaign_id (keep the most recent snapshot)
  • Derived metrics: open_rate = opens / NULLIF(recipients, 0), click_rate = clicks / NULLIF(recipients, 0), revenue_per_recipient = revenue / NULLIF(recipients, 0)
  • Add a comment noting that opens are unreliable on iOS; prefer clicks or revenue for engagement signals
  • Keep: campaign_id, send_time, send_date, recipients, opens, clicks, unsubscribes, revenue, open_rate, click_rate, revenue_per_recipient

staging/ga4.sql → staging.ga4

  • Source: raw.ga4_traffic
  • Cast date to DATE
  • Dedupe on (date, sessionSource, sessionMedium)
  • Classify channel from medium: cpc → 'Paid Search', organic → 'Organic', email → 'Email', (none) / direct → 'Direct', everything else → 'Other'
  • Derived: cvr = conversions / NULLIF(sessions, 0)
  • Keep: date, sessionSource, sessionMedium, channel, sessions, totalUsers, conversions, purchaseRevenue, cvr

For each asset, add a top-level description:, column descriptions, and at least one quality check (e.g. not_null on the grain columns, unique on the dedupe key).

Show me the SQL for each asset before writing the files. Don't run anything yet - I'll approve first.

Read each SQL file the agent produces. Look specifically for the unit conversion (cost_micros → cost_usd), the dedup logic, and the channel classification. If anything looks off, push back - it's cheaper to correct here than downstream.
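What the agent produces will vary, but a reasonable staging.google_ads view, following the spec above, might look roughly like this (the @bruin header is a sketch - check your agent's output against the asset format your Bruin version expects):

```sql
/* @bruin
name: staging.google_ads
type: duckdb.sql
materialization:
  type: view
@bruin */

SELECT
    campaign_id,
    TRIM(campaign_name) AS campaign_name,
    CAST(date AS DATE) AS date,
    impressions,
    clicks,
    conversions,
    conversion_value,
    cost_micros / 1e6 AS cost_usd,                       -- micros → USD
    clicks / NULLIF(impressions, 0) AS ctr,
    conversions / NULLIF(clicks, 0) AS cvr,
    (cost_micros / 1e6) / NULLIF(clicks, 0) AS cpc_usd
FROM raw.google_ads_campaigns
-- Keep one row per (campaign_id, date); the tiebreaker is arbitrary
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY campaign_id, CAST(date AS DATE)
    ORDER BY cost_micros DESC
) = 1
```

The key things to verify are exactly the three review points above: the / 1e6 conversion, the QUALIFY dedupe, and (in the GA4 asset) the channel CASE expression.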

2. Scaffold the reports table

Prompt:

AI Prompt

Now create one Bruin SQL asset for the reports layer, materialized as a table with schema reports.

reports/channel_daily.sql → reports.channel_daily

  • One row per (channel, date) for the last 90 days
  • Union three channel streams:
    Paid Search (from staging.google_ads):
    • channel = 'Paid Search', date, cost_usd, clicks, conversions, attributed_revenue = conversion_value, sessions = clicks (proxy)

    Email (from staging.klaviyo, using send_date as date):
    • channel = 'Email', cost_usd = 0, clicks, conversions = NULL (Klaviyo doesn't report conversions this way), attributed_revenue = revenue, sessions = recipients (proxy for audience size)

    Organic (from staging.ga4 where channel = 'Organic'):
    • channel = 'Organic', cost_usd = 0, clicks = NULL, conversions, attributed_revenue = purchaseRevenue, sessions
  • After the union, compute per row:
    • cac_usd = cost_usd / NULLIF(conversions, 0) - NULL for channels without conversion counts
    • roas = attributed_revenue / NULLIF(cost_usd, 0) - set to NULL whenever cost_usd <= 0, so zero-cost channels don't produce infinite or meaningless ROAS
  • Add a data_caveat column with a short string noting the attribution assumptions (e.g. "Klaviyo revenue uses Klaviyo's 5-day click attribution")

Add asset-level description that makes the attribution model explicit, column descriptions, tags (tier:reports, domain:marketing), and quality checks (not_null on channel + date, accepted_values on channel = 'Paid Search', 'Email', 'Organic').

Show me the SQL before writing. I'll approve before you run anything.

This is the analysis-ready cross-channel surface. Every question in Step 6 starts here.
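The shape of the asset the agent should hand back is a union of the three channel streams plus the derived KPIs - something roughly like this sketch (again, the @bruin header and the caveat string are illustrative, not the exact output):

```sql
/* @bruin
name: reports.channel_daily
type: duckdb.sql
materialization:
  type: table
@bruin */

WITH unioned AS (
    SELECT 'Paid Search' AS channel, date, cost_usd, clicks,
           conversions, conversion_value AS attributed_revenue,
           clicks AS sessions                 -- proxy: no session data for ads
    FROM staging.google_ads
    UNION ALL
    SELECT 'Email', send_date, 0, clicks,
           NULL, revenue, recipients          -- recipients as audience proxy
    FROM staging.klaviyo
    UNION ALL
    SELECT 'Organic', date, 0, NULL,
           conversions, purchaseRevenue, sessions
    FROM staging.ga4
    WHERE channel = 'Organic'
)
SELECT
    channel, date, cost_usd, clicks, conversions, attributed_revenue, sessions,
    cost_usd / NULLIF(conversions, 0) AS cac_usd,
    CASE WHEN cost_usd > 0
         THEN attributed_revenue / cost_usd END AS roas,   -- NULL for free channels
    'see asset description for attribution assumptions' AS data_caveat
FROM unioned
WHERE date >= CURRENT_DATE - INTERVAL 90 DAY
```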

3. Run and spot-check

Prompt:

AI Prompt

Run bruin validate marketing-analyst-101/ first. If it passes, run bruin run marketing-analyst-101/ and stream the output. Tell me which assets succeeded and which failed.

After it completes, do spot checks across all three layers using bruin query --connection duckdb-default:

  1. Unit sanity: SELECT SUM(cost_micros)/1e6 FROM raw.google_ads_campaigns vs SELECT SUM(cost_usd) FROM staging.google_ads - should match within a penny
  2. Dedup audit: SELECT campaign_id, date, COUNT(*) FROM staging.google_ads GROUP BY 1,2 HAVING COUNT(*) > 1 - should return 0 rows
  3. Channel coverage: SELECT channel, COUNT(*), MIN(date), MAX(date), SUM(cost_usd), SUM(attributed_revenue) FROM reports.channel_daily GROUP BY channel - all three channels should appear with sensible totals
  4. CAC sanity: SELECT channel, AVG(cac_usd) FROM reports.channel_daily WHERE cac_usd IS NOT NULL GROUP BY channel - CAC should be positive and in the "tens to hundreds" USD range for most consumer businesses

Show me the results before moving on.

You should see three channels in reports.channel_daily, spanning roughly 90 days, with coherent CAC and ROAS numbers.
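One extra check worth adding, tied to the ROAS rule above: zero-cost channels should never carry a ROAS value. A quick query to confirm:

```sql
-- Should return 0 rows: roas must be NULL wherever cost_usd <= 0
SELECT channel, date, cost_usd, roas
FROM reports.channel_daily
WHERE cost_usd <= 0 AND roas IS NOT NULL;
```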

4. Why the layering matters

Now you have a clean three-tier pipeline:

raw.*                      staging.*                   reports.*
──────────────────         ──────────────────          ──────────────────
google_ads_campaigns →     google_ads          ┐
klaviyo_campaigns    →     klaviyo             ├──→   channel_daily
ga4_traffic          →     ga4                 ┘
(source truth)             (cleaned + typed)           (analysis-ready)

Every question you ask in Step 6 goes against reports.channel_daily first. The AI agent doesn't have to remember that cost_micros needs dividing, or that Klaviyo revenue uses a different attribution window than GA4. Raw and staging exist for validation and troubleshooting, not for everyday analysis.

In the next step, you'll enhance each layer with column-level descriptions and update your pipeline-scoped AGENTS.md so the agent knows to default to reports.*.

What just happened

You built a proper layered pipeline - raw → staging → reports - without writing any SQL by hand. The agent wrote every asset, you validated the unit conversions and dedup logic, and Bruin ran the whole pipeline. Your data is cleaned, typed, deduped, unit-normalized, and joined into one cross-channel surface. Next step: make sure the AI knows which layer to query when.