Step 2 · 5 min

Scaffold Your Project

Create an empty Bruin project with DuckDB as the local database - the workspace your ad, email, and web analytics data will land in.

What you'll do

Ask your AI agent to create a Bruin project called marketing-analyst-101 with a local DuckDB database as the destination, then seed an AGENTS.md file at the project root with the Bruin rules the agent should follow in every future session.

Why this step matters

A Bruin project is a folder on your laptop that holds:

  • A config file (.bruin.yml) - where your API keys and database connection live
  • An assets folder - where the rules for pulling data will go (Google Ads, Klaviyo, GA4 assets)
  • A DuckDB file - the database itself, which is just a single .duckdb file on your disk

DuckDB is the key choice here. It's a database that lives entirely in a file - no server to run, no cloud account to create, no credit card. It behaves like Postgres or BigQuery but all the data sits on your laptop. Perfect for learning, perfect for ad-hoc marketing analysis without going through the data team.

Prompt the agent

Open your AI coding tool in an empty folder you'd like to use as your workspace, and paste this prompt:

AI Prompt

Using Bruin MCP, run bruin init empty marketing-analyst-101 to scaffold a new pipeline from the empty template. Then add a DuckDB connection called duckdb-default pointing to ./marketing-analyst-101/marketing.duckdb. Finally, run bruin connections test --name duckdb-default and show me the output.

The agent will:

  1. Run bruin init empty marketing-analyst-101
  2. Edit .bruin.yml to add a duckdb connection block pointing at ./marketing-analyst-101/marketing.duckdb
  3. Run the test command to confirm the connection works

What the agent just created

Your folder now looks like this:

./                                # current folder
├── .bruin.yml                    # project config - holds API keys + connections
└── marketing-analyst-101/        # the pipeline
    ├── pipeline.yml              # pipeline config
    ├── assets/                   # where ingestion rules will live (just a placeholder for now)
    │   └── empty.sql             # placeholder - safe to delete
    └── marketing.duckdb          # the database (created on first write)
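
While you're at it, have the agent open pipeline.yml too. From the empty template it holds little more than the pipeline's name - roughly the sketch below, though the exact fields depend on your Bruin version:

name: marketing-analyst-101
# a schedule, start date, and default connections can be added here later (see the Bruin docs)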

Ask the agent to delete the placeholder:

AI Prompt

Remove marketing-analyst-101/assets/empty.sql - I'll add real assets next step.

Peek inside .bruin.yml

Have the agent show you .bruin.yml. You should see something like:

environments:
  default:
    connections:
      duckdb:
        - name: "duckdb-default"
          path: "./marketing-analyst-101/marketing.duckdb"

This file is your control panel. Every API key and OAuth token you add (Google Ads, Klaviyo, GA4) will live here too. It's local, git-ignored by default, and never leaves your machine.
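
As a preview of where those credentials will sit, here's a sketch of the same file with a second connection added. The Klaviyo entry is purely illustrative - every source has its own connection type and fields, so check the Bruin connections docs for the real schema before adding one:

environments:
  default:
    connections:
      duckdb:
        - name: "duckdb-default"
          path: "./marketing-analyst-101/marketing.duckdb"
      # illustrative placeholder - the exact type name and fields come from the Bruin docs
      klaviyo:
        - name: "klaviyo-default"
          api_key: "pk_xxxxxxxx"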

Seed AGENTS.md with Bruin rules

AI coding tools like Claude Code, Cursor, and Codex automatically read an AGENTS.md at the project root when they start a session in the workspace. This is where you tell the agent how to work in this project - before it ever touches your data.

Prompt the agent:

AI Prompt

Create an AGENTS.md at the root of this workspace (next to .bruin.yml) with the content below, then show me the file after creation. Also ask me whether I want you to scaffold separate dev and prod environments in .bruin.yml, or keep a single default environment for now - wait for my answer before making any changes to environments.

# AGENTS.md

## How you (the AI agent) should work in this project

### Ground rules
- This is a Bruin project. Use the **Bruin MCP** tools when available - they're the fastest, most reliable path
- Use the **Bruin CLI** for all pipeline operations: `bruin init`, `bruin run`, `bruin validate`, `bruin query`, `bruin ai enhance`, `bruin connections test`
- Reference the **[Bruin docs](https://getbruin.com/docs)** when unsure about asset types, materializations, connection configs, or CLI flags. Don't guess - the docs are authoritative
- **Ask me when unclear.** If a request is ambiguous (time range, columns, grain, attribution model, which environment, etc.), ask before guessing. A clarifying question now beats a wrong answer later

### Environments
- Before adding or changing connections, confirm with me whether this project uses a single `default` environment or separate `dev` / `prod` environments
- Never copy secrets between environments without asking
- Keep env-specific values (DB paths, API keys, schemas) scoped to the right environment block

### Cap data volume when testing
- For exploratory queries, use `LIMIT 20` (or fewer) until you've confirmed the shape of the result
- For ingestr assets during testing, use Bruin's **interval dates** (`interval_start`, `interval_end`) to cap the backfill window - never pull unbounded history on a first run
- Prefer **narrow, explicit date windows** (e.g. last 7 days) over open-ended scans

### Validate before you run
- Run `bruin validate <path>` before `bruin run` - it catches YAML errors, missing connections, broken refs, and type mismatches without burning compute
- If validation fails, read the error, fix the root cause at the source, then re-validate. Do not chain `run` attempts hoping they work
- After running, do **spot checks across layers**: row counts, date min/max, null counts, and a handful of sample rows at each tier (raw → staging → marts). A pipeline that "succeeded" can still have silently wrong data

### Document as you add
- Every asset: **top-level description** explaining what it produces and why
- Every meaningful column: **column-level description**. Don't let `cost_micros`, `status_code`, or `price_adj` go undocumented
- Add **tags** to group assets (e.g. `tier:raw`, `domain:marketing`, `owner:analyst`)
- Add **quality checks** (`not_null`, `unique`, `accepted_values`, `positive`) on columns where the invariant matters
- Add **custom checks** (SQL-backed assertions) for business rules that can't be expressed at the column level
- Add **metadata** (owner, source system, refresh cadence, SLA) so future-you and future-agents have context

### Keep this file current
- When you learn a non-obvious fact about the data, a convention the user prefers, or a mistake worth avoiding - **append it to this file**. That's how the agent gets smarter across sessions

### Naming and structure
- Schemas: `raw.*` for ingestion output, `staging.*` for cleaned/typed, `marts.*` for business-ready tables
- Asset file names should match the table they produce (e.g. `raw.google_ads_campaigns` → `google_ads_campaigns.asset.yml`)
- Keep SQL assets under `<pipeline>/assets/`, same folder as ingestr assets
- SQL assets should be **idempotent** - re-running produces the same output

### Safety
- **Never commit `.bruin.yml`.** It holds secrets. It's git-ignored by default; verify before staging
- **Never run `DROP`, `DELETE`, `TRUNCATE`, or `UPDATE`** on raw tables - they're the ingestion output and should be rebuilt via `bruin run`, not mutated
- When I say "reset" or "start over", ask me exactly what to drop before dropping anything

### Show your work
- Before running a command that modifies files or data, show the plan and wait for approval
- Print the SQL before executing it
- When editing YAML, show the final file or a diff - not just "done"

When the agent's done, you have two foundations in place: the project structure itself, and a root AGENTS.md that will travel with the workspace and keep every future session consistent. In Step 4 you'll add a second, pipeline-specific AGENTS.md inside the marketing-analyst-101/ folder for the marketing-specific domain knowledge - keeping general Bruin rules at the workspace level and domain context scoped to the pipeline.
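
To make the documentation and quality-check rules in AGENTS.md concrete, here's a rough sketch of what a documented asset definition could look like once assets exist. The table, columns, and values are illustrative, and the exact asset schema (including the ingestion parameters omitted here) should be confirmed against the Bruin docs:

name: raw.klaviyo_events        # hypothetical table name
type: ingestr                   # ingestion asset - source/destination parameters omitted for brevity
description: Raw event stream pulled from Klaviyo, one row per event.
tags:
  - tier:raw
  - domain:marketing
columns:
  - name: event_id
    type: string
    description: Unique identifier of the event.
    checks:
      - name: not_null
      - name: unique
  - name: occurred_at
    type: timestamp
    description: When the event happened (UTC).
    checks:
      - name: not_null

The exact keys matter less than the habit: every table, column, and invariant the rules call for gets a concrete home in the asset file.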

What just happened

You have a Bruin project with a working DuckDB connection and an AGENTS.md that tells the AI agent exactly how to behave inside this project. No data in it yet, but the plumbing and the playbook are both in place. Next step: plug in Google Ads, Klaviyo, and GA4 as data sources and let Bruin pull real data into DuckDB.