# DuckDB

DuckDB is an in-process analytical database designed to be fast and reliable.
Bruin supports using a local DuckDB database.
## Connection
```yaml
connections:
  duckdb:
    - name: "connection_name"
      path: "/path/to/your/duckdb/database.db"
```

The `path` field is the only one you need. It can point to an existing database, or to the full path of the database you want Bruin to create, where your queries will be materialized.
> **WARNING**
> DuckDB does not allow concurrency between different processes, which means other clients should not be connected to the database while Bruin is running.
## Assets
DuckDB assets should use the type `duckdb.sql`, and if you specify a connection it must be of the `duckdb` type. For detailed parameters, see the Definition Schema page.
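For instance, an asset that pins a specific connection might look like this minimal sketch (the asset name is illustrative; `connection_name` refers to the connection defined above):

```sql
/* @bruin
name: my_schema.my_asset
type: duckdb.sql
connection: connection_name
@bruin */

SELECT 1 AS placeholder
```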
### Examples
#### Create a view with orders per country
```sql
/* @bruin
name: orders_per_country
type: duckdb.sql
materialization:
  type: view
@bruin */

SELECT COUNT(*) AS orders, country
FROM events.orders
WHERE status = 'paid'
GROUP BY country
```

#### Materialize new customers per region and append them to an existing table
```sql
/* @bruin
name: new_customers_per_region
type: duckdb.sql
materialization:
  type: table
  strategy: append
@bruin */

SELECT COUNT(*) AS customers, region
FROM events.customers
WHERE created_at >= '{{ start_date }}'
  AND created_at < '{{ end_date }}'
GROUP BY region
```

## duckdb.sensor.query
Checks whether a query returns any results in DuckDB. The sensor runs every 5 minutes until the query returns at least one row.
```yaml
name: string
type: string
parameters:
  query: string
```

Parameters:

- `query`: the query you expect to return results
### Example: Partitioned upstream table
Checks whether data is available in the upstream table for the end date of the run.
```yaml
name: analytics_123456789.events
type: duckdb.sensor.query
parameters:
  query: select exists(select 1 from upstream_table where dt = '{{ end_date }}')
```

### Example: Streaming upstream table
Checks whether any data has arrived after the end timestamp, assuming that older data is never appended to the table.
```yaml
name: analytics_123456789.events
type: duckdb.sensor.query
parameters:
  query: select exists(select 1 from upstream_table where inserted_at > '{{ end_timestamp }}')
```

## duckdb.seed
`duckdb.seed` is a special asset type that represents CSV files containing data prepared outside of your pipeline, to be loaded into your DuckDB database. Bruin supports seed assets natively: you simply drop a CSV file into your pipeline, and Bruin ensures the data is loaded into the DuckDB database.
You can define seed assets in a file ending with `.asset.yml` or `.asset.yaml`:
```yaml
name: dashboard.hello
type: duckdb.seed

parameters:
  path: seed.csv
```

Parameters:

- `path`: the path to the CSV file that will be loaded into the data platform. This can be a file path relative to the asset definition file, or an HTTP/HTTPS URL to a publicly accessible CSV file.
> **WARNING**
> When using a URL path, column validation is skipped during `bruin validate`. Column mismatches will be caught at runtime.
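For example, a seed that pulls from a public URL might look like this sketch (the URL and asset name are illustrative):

```yaml
name: dashboard.remote_seed
type: duckdb.seed

parameters:
  # hypothetical URL -- replace with a publicly accessible CSV
  path: https://example.com/data/contacts.csv
```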
### Examples: Load a CSV into a DuckDB database
The examples below show how to load a CSV into a DuckDB database.
```yaml
name: dashboard.hello
type: duckdb.seed

parameters:
  path: seed.csv
```

Example CSV:
```csv
name,networking_through,position,contact_date
Y,LinkedIn,SDE,2024-01-01
B,LinkedIn,SDE 2,2024-01-01
```
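Once the pipeline runs, the seed lands in a table named after the asset, so you can sanity-check it with a regular query (assuming the defaults above):

```sql
SELECT name, position
FROM dashboard.hello
WHERE contact_date = '2024-01-01';
```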
## Lakehouse Support (beta)

DuckDB can query Iceberg and DuckLake tables through its native extensions. DuckLake supports DuckDB, SQLite, or Postgres catalogs with S3-backed storage.
### Connection
Add the `lakehouse` block to your DuckDB connection in `.bruin.yml`:
```yaml
connections:
  duckdb:
    - name: "example-conn"
      path: "./path/to/duckdb.db"
      lakehouse:
        format: <iceberg|ducklake>
        catalog:
          type: <glue|postgres|duckdb|sqlite>
          auth: { ... } # optional
        storage:
          type: <s3>
          auth: { ... } # optional
```

| Field | Type | Required | Description |
|---|---|---|---|
| `format` | string | Yes | Table format: `iceberg` or `ducklake` |
| `catalog` | object | Yes | Catalog configuration (Glue for Iceberg; DuckDB/SQLite/Postgres for DuckLake) |
| `storage` | object | No | Storage configuration (required for DuckLake) |
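Putting it together, a complete DuckLake connection with a SQLite catalog and S3-backed storage might look like the following sketch (the connection name, bucket, and paths are illustrative):

```yaml
connections:
  duckdb:
    - name: "example-conn"
      path: "./path/to/duckdb.db"
      lakehouse:
        format: ducklake
        catalog:
          type: sqlite
          path: "metadata.sqlite"
        storage:
          type: s3
          path: "s3://my-ducklake-warehouse/path" # illustrative bucket
          region: "us-east-1"
          auth:
            access_key: "${AWS_ACCESS_KEY_ID}"
            secret_key: "${AWS_SECRET_ACCESS_KEY}"
```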
### Supported Lakehouse Formats
#### DuckLake

| Catalog | S3 |
|---|---|
| DuckDB | ✅ |
| SQLite | ✅ |
| Postgres | ✅ |
| MySQL | Planned |
#### Iceberg

| Catalog | S3 |
|---|---|
| Glue | ✅ |
For background, see DuckDB's lakehouse format overview.
### Catalog Options
For guidance, see DuckLake's choosing a catalog database.
#### Glue
```yaml
catalog:
  type: glue
  catalog_id: "123456789012"
  region: "us-east-1"
  auth:
    access_key: "${AWS_ACCESS_KEY_ID}"
    secret_key: "${AWS_SECRET_ACCESS_KEY}"
    session_token: "${AWS_SESSION_TOKEN}" # optional
```

#### Postgres
```yaml
catalog:
  type: postgres
  host: "localhost"
  port: 5432 # optional - default: 5432
  database: "ducklake_catalog"
  auth:
    username: "ducklake_user"
    password: "ducklake_password"
```

#### DuckDB
```yaml
catalog:
  type: duckdb
  path: "metadata.ducklake"
```

`catalog.path` should point to the DuckLake metadata file. Note that if you use DuckDB as your catalog database, you are limited to a single client.
#### SQLite

```yaml
catalog:
  type: sqlite
  path: "metadata.sqlite"
```

### Storage Options
#### S3
Bruin currently supports only explicit AWS credentials in the `auth` block. Session tokens are supported for temporary credentials (AWS STS).
```yaml
storage:
  type: s3
  path: "s3://my-ducklake-warehouse/path" # required for DuckLake, optional for Iceberg
  region: "us-east-1"
  auth:
    access_key: "${AWS_ACCESS_KEY_ID}"
    secret_key: "${AWS_SECRET_ACCESS_KEY}"
    session_token: "${AWS_SESSION_TOKEN}" # optional
```

### Usage
Bruin makes the lakehouse catalog active for your session and ensures a default `main` schema is available. (Bruin cannot create schemas or tables for Iceberg on S3, so those must already exist.) You can query tables with or without a schema:
```sql
SELECT * FROM my_table;
```

You can also use the fully qualified path:
```sql
SELECT * FROM iceberg_catalog.main.my_table;
```

> **NOTE**
> Unqualified table names resolve to the `main` schema of the active catalog. Use `<schema>.<table>` to target non-main schemas.
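For example, to read from a hypothetical `analytics` schema instead of `main`:

```sql
SELECT * FROM analytics.my_table;
```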
### Example Asset
```sql
/* @bruin
name: lakehouse_example
type: duckdb.sql
connection: example-conn
@bruin */

SELECT SUM(amount) AS total_sales
FROM orders;
```