Bruin Academy

End-to-End Pipeline: NYC Taxi

Build a complete data pipeline from scratch using real NYC taxi data - from ingestion to staging to reports, all orchestrated with Bruin and DuckDB.

What is this?

A hands-on tutorial where you build a real data pipeline end-to-end using NYC taxi trip data. You'll go from raw API data to clean, aggregated reports - learning ingestion, transformation, quality checks, and AI-assisted development along the way.

What you'll use: Bruin CLI for pipeline orchestration, DuckDB as a local data warehouse, Python for ingestion, SQL for transformations, and the Bruin MCP with an AI agent to accelerate development.

What you'll build

A three-layer pipeline:

  1. Ingestion - Python asset that pulls taxi trip data from the NYC TLC API, plus seed files for lookup tables
  2. Staging - SQL asset that cleans, deduplicates, and joins the raw data with lookup tables
  3. Reports - SQL asset that aggregates trips by date, taxi type, and payment method
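To make the three layers concrete, here is a minimal sketch of what the staging and reports logic does, using a handful of synthetic rows and Python's stdlib `sqlite3` as a stand-in for DuckDB so it runs anywhere. The table and column names (`raw_trips`, `payment_lookup`, `stg_trips`, `fare`) are illustrative, not the actual TLC schema or the assets you'll build in the tutorial.

```python
import sqlite3

# Synthetic raw trips (stand-in for data pulled in the ingestion layer).
# Note the intentional duplicate row, which staging will remove.
raw_trips = [
    ("t1", "2024-01-01", 1, 12.50),
    ("t1", "2024-01-01", 1, 12.50),  # duplicate
    ("t2", "2024-01-01", 2, 8.00),
    ("t3", "2024-01-02", 1, 20.00),
]
payment_lookup = [(1, "credit_card"), (2, "cash")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_trips (trip_id TEXT, trip_date TEXT, payment_type INTEGER, fare REAL)")
con.executemany("INSERT INTO raw_trips VALUES (?, ?, ?, ?)", raw_trips)
con.execute("CREATE TABLE payment_lookup (payment_type INTEGER, payment_method TEXT)")
con.executemany("INSERT INTO payment_lookup VALUES (?, ?)", payment_lookup)

# Staging layer: deduplicate the raw data and join it with the lookup table.
con.execute("""
    CREATE TABLE stg_trips AS
    SELECT DISTINCT t.trip_id, t.trip_date, p.payment_method, t.fare
    FROM raw_trips t
    JOIN payment_lookup p USING (payment_type)
""")

# Reports layer: aggregate trips by date and payment method.
report = con.execute("""
    SELECT trip_date, payment_method, COUNT(*) AS trips, SUM(fare) AS total_fare
    FROM stg_trips
    GROUP BY trip_date, payment_method
    ORDER BY trip_date, payment_method
""").fetchall()
print(report)
```

In the tutorial itself, each layer lives in its own Bruin asset and DuckDB executes the SQL; the shape of the transformations is the same.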

By the end, you'll have a fully orchestrated pipeline with dependencies, quality checks, and materialization strategies - and you'll know how to use an AI agent to build it faster.
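The quality checks mentioned above boil down to queries that count offending rows: zero means the check passes. A hedged sketch of two common checks (not-null and uniqueness), again on synthetic data with stdlib `sqlite3` standing in for DuckDB; in the tutorial you'll declare such checks in Bruin asset definitions rather than writing them by hand.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_trips (trip_id TEXT, fare REAL)")
con.executemany("INSERT INTO stg_trips VALUES (?, ?)",
                [("t1", 12.5), ("t2", 8.0), ("t3", None)])

# not_null check on fare: count rows that violate it.
null_fares = con.execute(
    "SELECT COUNT(*) FROM stg_trips WHERE fare IS NULL").fetchone()[0]

# unique check on trip_id: total count must equal distinct count.
total, distinct = con.execute(
    "SELECT COUNT(trip_id), COUNT(DISTINCT trip_id) FROM stg_trips").fetchone()

print(null_fares, total == distinct)  # one null fare; trip_id is unique
```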

Before you start