End-to-End Pipeline: NYC Taxi
Build a complete data pipeline from scratch with real NYC taxi data - ingestion, staging, and reports, all orchestrated with Bruin and DuckDB.
What is this?
A hands-on tutorial where you build a real data pipeline end-to-end using NYC taxi trip data. You'll go from raw API data to clean, aggregated reports - learning ingestion, transformation, quality checks, and AI-assisted development along the way.
What you'll use: Bruin CLI for pipeline orchestration, DuckDB as a local data warehouse, Python for ingestion, SQL for transformations, and the Bruin MCP with an AI agent to accelerate development.
What you'll build
A three-layer pipeline:
- Ingestion - Python asset that pulls taxi trip data from the NYC TLC API, plus seed files for lookup tables
- Staging - SQL asset that cleans, deduplicates, and joins the raw data with lookup tables
- Reports - SQL asset that aggregates trips by date, taxi type, and payment method
By the end, you'll have a fully orchestrated pipeline with dependencies, quality checks, and materialization strategies - and you'll know how to use an AI agent to build it faster.
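To give a sense of what these assets look like, a Bruin SQL asset carries its metadata (name, materialization, dependencies, column quality checks) in a comment block above the query. A minimal sketch of a reports-layer asset - asset and column names here are illustrative, not the tutorial's exact ones:

```sql
/* @bruin
name: reports.trips_by_day
type: duckdb.sql

materialization:
  type: table

depends:
  - staging.trips

columns:
  - name: trip_date
    type: date
    checks:
      - name: not_null
  - name: trip_count
    type: integer
    checks:
      - name: positive
@bruin */

select
    trip_date,
    taxi_type,
    payment_method,
    count(*) as trip_count
from staging.trips
group by 1, 2, 3
```

Bruin parses the `@bruin` block to build the dependency graph and run the column checks after materialization; the query itself is plain DuckDB SQL.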
Before you start
- Bruin CLI installed
- VS Code or Cursor with the Bruin extension
- Familiarity with Bruin Core Concepts (recommended)
Introduction to Bruin
Learn what Bruin is, how it replaces five separate tools with one platform, and get an overview of the NYC taxi pipeline we'll build.
Install Bruin & Create Your First Pipeline
Install Bruin CLI, set up the VS Code extension, initialize a project with DuckDB, and run your first assets to understand the basics.
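The steps in this chapter boil down to a handful of CLI commands. A rough sketch - the install URL and template name follow Bruin's docs, but verify them against the current release:

```
# Install the Bruin CLI (macOS/Linux)
curl -LsSf https://getbruin.com/install/cli | sh

# Scaffold a new project from the DuckDB starter template
bruin init duckdb my-pipeline
cd my-pipeline

# Validate asset definitions, then run the pipeline
bruin validate
bruin run .
```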
Build the NYC Taxi Pipeline
Create a three-layer pipeline with Python ingestion, SQL staging, and report assets - complete with materialization, dependencies, and quality checks.
Build with the Bruin MCP and AI
Use the Bruin MCP with an AI agent to generate the entire pipeline from a single prompt, query your data, and ask questions about your pipeline.