Chicago Crash Data Analysis
This project is an end-to-end data pipeline that ingests, transforms, and stores Chicago traffic crash data for analysis. It pulls three public datasets from the City of Chicago Data Portal — Crashes, People, and Vehicles — via the Socrata Open Data API (SODA), loads the raw records into a local DuckDB database, and joins them into a single analytics-ready table.

The pipeline runs on a daily schedule and fetches only new records on each run by filtering the API on the Bruin-provided start and end dates, so it avoids re-downloading the entire dataset (1M+ crash records) every time. Bruin's built-in environment variables (BRUIN_START_DATE, BRUIN_END_DATE) make this incremental loading work out of the box.

The pipeline is built with Bruin, an open-source data pipeline framework, and uses:

- Python assets for API ingestion, with automatic dependency management via uv and pyproject.toml
- merge materialization on the People and Vehicles assets to upsert records by primary key, and append materialization on Crashes
- DuckDB as a first-class connection for serverless local analytics
- asset dependencies (depends) to enforce sequential execution and avoid DuckDB write-lock conflicts
- a SQL asset with create+replace materialization that rebuilds the merged table from the three raw sources on each run
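The merged-table SQL asset could look roughly like the sketch below. The asset and column names (`raw.crashes`, `crash_record_id`, and so on) are placeholders, and the exact header fields should be checked against the Bruin documentation; the shape shown — a `@bruin` comment header declaring the materialization strategy and `depends` list, followed by the query — matches Bruin's SQL asset convention:

```sql
/* @bruin
name: crashes.merged
type: duckdb.sql
materialization:
  type: table
  strategy: create+replace
depends:
  - raw.crashes
  - raw.people
  - raw.vehicles
@bruin */

-- Rebuilt from scratch on every run (create+replace), so the merged table
-- always reflects the current state of the three raw tables.
SELECT
    c.crash_record_id,
    c.crash_date,
    p.person_id,
    p.injury_classification,
    v.vehicle_id,
    v.make
FROM raw.crashes c
LEFT JOIN raw.people p USING (crash_record_id)
LEFT JOIN raw.vehicles v USING (crash_record_id)
```

Declaring all three raw assets in `depends` is what lets Bruin schedule this query only after the ingestion assets have finished writing to DuckDB.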
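The incremental ingestion step can be sketched as follows. This is a minimal illustration, not the project's actual asset code: the dataset ID, the `crash_date` field, and the `fetch_window` helper are assumptions, while `$where`, `$order`, `$limit`, and `$offset` are standard SODA query parameters. Bruin injects `BRUIN_START_DATE` and `BRUIN_END_DATE` as environment variables, so the window comes from there:

```python
import json
import os
import urllib.parse
import urllib.request

# Hypothetical constants -- look up the real dataset IDs on the Data Portal.
BASE_URL = "https://data.cityofchicago.org/resource"
PAGE_SIZE = 50_000


def build_params(start_date: str, end_date: str, offset: int = 0) -> dict:
    """Build SODA query parameters restricting results to the run window."""
    return {
        "$where": f"crash_date >= '{start_date}' AND crash_date < '{end_date}'",
        "$order": "crash_date",  # stable ordering makes paging deterministic
        "$limit": PAGE_SIZE,
        "$offset": offset,
    }


def fetch_window(dataset_id: str) -> list:
    """Page through one dataset for the window Bruin provides via env vars."""
    start = os.environ["BRUIN_START_DATE"]
    end = os.environ["BRUIN_END_DATE"]
    rows, offset = [], 0
    while True:
        query = urllib.parse.urlencode(build_params(start, end, offset))
        with urllib.request.urlopen(f"{BASE_URL}/{dataset_id}.json?{query}") as resp:
            page = json.load(resp)
        rows.extend(page)
        if len(page) < PAGE_SIZE:  # short page means we've reached the end
            return rows
        offset += PAGE_SIZE
```

Because each run only requests rows inside its date window, a failed run can simply be retried for the same window without touching the rest of the table.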