Community Project Showcase
Browse projects built with Bruin. Explore real-world pipelines, ingestion workflows, and analytics solutions from the community.
41 Projects
CryptoFlow Analytics
Crypto markets generate massive amounts of data across hundreds of exchanges, thousands of tokens, and multiple sentiment indicators. Individual investors and analysts face three core challenges:
- Data fragmentation — prices, volumes, sentiment, and trending data live in separate APIs with different formats
- Signal noise — raw price changes alone are misleading without context (volume confirmation, market breadth, sentiment)
- Regime blindness — most dashboards show what happened, but fail to classify where we are in the market cycle

CryptoFlow Analytics solves this by building a unified intelligence layer that ingests, cleans, enriches, and analyzes crypto data to produce actionable signals, not just charts.

🛠️ Bruin Features Used
- Python assets: 5 ingestion scripts fetching from the CoinGecko and Alternative.me APIs, plus a CSV seed
- SQL assets: 9 BigQuery SQL transformations across staging (3) and analytics (6) layers
- Seed assets: CSV-based reference data for coin categories
- Materialization: table strategy for all assets; merge for incremental ingestion
- Dependencies: explicit depends declarations creating a proper DAG
- Quality checks: built-in (not_null, unique, positive, accepted_values) on every asset
- Custom checks: business-logic validations (e.g., "Bitcoin must exist in the data", "dominances sum to ~100%")
- Glossary: structured business-term definitions for crypto concepts
- Pipeline schedule: daily schedule via pipeline.yml
- Bruin Cloud: deployment, monitoring, and the AI analyst
- AI Data Analyst: conversational analysis on all analytics tables
- Lineage: full column-level lineage via bruin lineage
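A custom check like "dominances sum to ~100%" can be sketched in plain Python; the sample rows and the 1-point tolerance below are illustrative assumptions (in the pipeline itself this would be expressed as a Bruin custom SQL check):

```python
# Plain-Python sketch of a "dominances sum to ~100%" custom check.
# Sample rows and the 1-point tolerance are illustrative assumptions.
rows = [
    ("2024-01-01", "BTC", 52.1),
    ("2024-01-01", "ETH", 17.4),
    ("2024-01-01", "OTHER", 30.5),
]

def check_dominance_sums(rows, tolerance=1.0):
    """Return the dates whose dominance shares do not sum to ~100%."""
    totals = {}
    for day, _coin, pct in rows:
        totals[day] = totals.get(day, 0.0) + pct
    return [day for day, total in totals.items() if abs(total - 100.0) > tolerance]

failures = check_dominance_sums(rows)  # empty when every day sums to ~100%
```

The same shape works for the "Bitcoin must exist in data" check: query for the expected key and fail the run when it is absent.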
GitHub Repository Insights
A production-style data pipeline built with Bruin for the Bruin Zoomcamp challenge. This pipeline ingests GitHub repository metadata from the GitHub API, transforms the data through staging, and produces an analytics report — all orchestrated locally using DuckDB.
GitHub Activity Analytics Dashboard
GitHub generates millions of public events every day — pushes, pull requests, issues, forks, stars — across thousands of repositories and contributors worldwide. This raw activity stream is publicly available via gharchive.org, but it is not pre-aggregated or directly queryable in a useful analytical form. This project builds an end-to-end batch data pipeline that answers:
- Which event types dominate GitHub activity on any given day or hour?
- Which repositories attract the most contributors and drive the most events?
- How does activity vary across the day (UTC), and what are the peak hours?
- What is the daily mix of event types — is it push-heavy, or driven by issues and PRs?
- Which programming language ecosystems (inferred from repo naming patterns) are most active?
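The "peak hours" question above boils down to a group-by over event timestamps; a toy sketch, where the (event_type, utc_hour) record shape is an assumption rather than GH Archive's actual schema:

```python
from collections import Counter

# Toy sketch of the "peak hours" question: count events per UTC hour.
# The (event_type, utc_hour) record shape is an assumption, not GH Archive's schema.
events = [
    ("PushEvent", 9), ("PushEvent", 9), ("ForkEvent", 9),
    ("IssuesEvent", 14), ("PushEvent", 14),
]
by_hour = Counter(hour for _etype, hour in events)
peak_hour, peak_count = by_hour.most_common(1)[0]
```

In the pipeline this aggregation would run as a SQL asset over the full event stream rather than in Python.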
edgar daily hub
EDGAR Daily Hub is a full-stack data platform that brings transparency to SEC filing activity. By automatically ingesting the EDGAR daily index every business day, it tracks filing volumes across all form types and flags unusual spikes — like surges in insider ownership disclosures — through a clean, interactive dashboard. Users can build a personal watchlist of stock tickers to monitor filings for companies they care about, turning a tedious manual research process into a seamless daily workflow. Built with React/TypeScript, Python FastAPI, and MotherDuck as the analytical data warehouse, with Supabase handling auth and user data. The pipeline runs on a daily automated schedule via GitHub Actions and is deployed on Fly.io with Docker.
NZ Electricity Generation Pipeline
This project tracks New Zealand's electricity generation mix across 8 years (2018–2026). It pulls monthly CSVs from the Electricity Authority's public API through a three-layer transformation pipeline (staging, core, mart) that feeds a Looker Studio dashboard, revealing an ~85% renewable grid driven mostly by hydro. The pipeline is built entirely with Bruin, an open-source CLI tool that replaces the usual Airflow + dbt + Great Expectations stack with a single binary: SQL and Python assets coexist in the same pipeline with automatic dependency resolution, incremental materialisation, and quality checks embedded directly in asset definitions rather than maintained as a separate test suite. That "one tool, one config format" design meant I could focus on the data logic (unpivoting 50 trading-period columns, deduplicating records, and building partitioned and clustered fact tables) rather than writing glue code between tools.
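The unpivot step mentioned above, turning 50 wide trading-period columns into long rows, can be sketched in plain Python; the TP1..TPn column names and record shape are assumptions about the source CSVs:

```python
# Sketch of unpivoting wide trading-period columns (TP1..TPn) into long rows.
# Column names and the record shape are assumptions about the source CSVs.
def unpivot(record, period_prefix="TP", n_periods=50):
    """Turn one wide row {..., TP1..TPn} into one row per trading period."""
    base = {k: v for k, v in record.items() if not k.startswith(period_prefix)}
    out = []
    for i in range(1, n_periods + 1):
        value = record.get(f"{period_prefix}{i}")
        if value is not None:
            out.append({**base, "trading_period": i, "gwh": value})
    return out

wide = {"date": "2024-06-01", "site": "Benmore", "TP1": 12.5, "TP2": 13.0}
long_rows = unpivot(wide, n_periods=2)
```

In the actual pipeline this would typically be an UNPIVOT in SQL inside the staging layer; the Python version just makes the transformation explicit.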
Kickstarter Campaign Analytics Pipeline
Kickstarter campaign analytics pipeline that answers which factors (category, goal, country, staff pick, duration) best predict Kickstarter success. It ingests ~203k campaigns from Hugging Face (Parquet), lands data in Google Cloud Storage, loads and models it in BigQuery (raw → staging → analytics marts), and feeds a Looker Studio dashboard. Infrastructure (GCS, BigQuery) is provisioned with Terraform. Bruin orchestrates the workflow: GCP connection and environments in .bruin.yml, pipeline definition in pipeline.yml (variables), Python assets for download/load to GCS and BigQuery, BigQuery SQL assets with dependencies (depends), materialization (views/tables), column metadata and data quality checks in asset headers, and Jinja variables (var.*) in SQL.
Kenya Renewable Energy Data Pipeline
This project builds a fully automated data pipeline that ingests Kenya's renewable energy data from four open sources — Ember Climate, IRENA, Our World in Data, and EnergyData.info — transforms it through a Bronze → Silver → Gold medallion architecture stored in a single DuckDB file, and serves it as a live Evidence.dev dashboard deployed via GitHub Actions to GitHub Pages. The pipeline tracks Kenya's progress toward its 2030 target of 100% renewable electricity across five analytical dimensions: generation mix, installed capacity, carbon intensity, electricity access, and geospatial grid infrastructure. It rebuilds automatically every day, requiring zero manual intervention to keep the data current. Bruin features used include: Python assets, DuckDB SQL assets, the asset dependency graph, materialization strategies, built-in data quality checks, a multi-language pipeline, CLI validation, a single connection config, downstream execution, and checks-only mode.
Project_Customer_Churn_Bank
Overview: An end-to-end ELT pipeline built to analyze ABC Bank's customer retention by integrating internal demographics with external 2022 Eurostat market benchmarks. The project focuses on identifying "Premium Segment" churn, discovering that 80% of churned customers are high earners.

How I used Bruin: I leveraged Bruin to move away from traditional script-based workflows to a modern, declarative asset-based architecture:
- Infrastructure as Code (IaC): every BigQuery table and view was defined as a Bruin asset, including physical-layer optimizations like clustering on country and gender to boost query performance.
- Data lineage & dependencies: using Bruin's DAG capabilities, I ensured a clean Medallion-like flow: Staging (cleaning) → Reference (Eurostat data) → Fact (salary benchmarking logic).
- Automated data quality: I integrated built-in quality checks (not_null, unique) directly into the asset definitions, ensuring that only validated data reached my Looker Studio dashboard.
- Seamless deployment: Bruin managed the entire lifecycle from Kaggle ingestion through GCS to BigQuery materialization with a single --force execution command.

Key findings: The pipeline revealed that churned customers earn an average of 5,048 EUR more than the national benchmark, and identified a 70.45% churn rate among the 46-60 age group in Germany, providing the bank with clear targets for retention programs.
Github Trends
End-to-end data pipeline tracking GitHub developer activity trends using GCP, Bruin, dbt and Looker Studio
Jobs Analytics Project
Belarus IT Job Market Analytics — a batch pipeline that tracks Data Engineer, Data Analyst and Data Scientist vacancies on rabota.by (powered by the HH.ru API). Monitors vacancy dynamics over time, required experience levels, and top-5 in-demand skills per role.
Closer-Every-Year
A batch pipeline tracking **gender gap indicators and relationship trends** (marriage, divorce, age at first marriage, pay gap) across European countries from 2005 to 2024, all sourced from the Eurostat API.

**Bruin features I used:**
- `type: python` assets for ingestion (Eurostat API → Parquet on GCS)
- `type: bq.sql` and `type: duckdb.sql` assets for staging + analytics SQL transformations
- `strategy: merge` on `(country, year)` — fully idempotent, no duplicate runs
- Dependency resolution via asset references — no manual DAG wiring
- Dual-environment setup: `local-pipeline` runs on DuckDB, `gcp-pipeline` runs on BigQuery — **same asset code, different connections**
- `bruin run --environment cloud` for the GCP pipeline, `--workers 1` for local (DuckDB doesn't support concurrent writes)
- Docker-based setup with the Bruin container and the Terraform container side by side
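The idempotency claim behind `strategy: merge` on `(country, year)` can be illustrated with a minimal sketch: re-running the same batch hits the same keys, so nothing duplicates. Field names here are assumptions:

```python
# Minimal sketch of an idempotent merge keyed on (country, year): re-running
# the same batch leaves the table unchanged. Field names are assumptions.
def merge(table, batch):
    """Upsert batch rows into table, keyed on (country, year)."""
    for row in batch:
        table[(row["country"], row["year"])] = row
    return table

table = {}
batch = [{"country": "FR", "year": 2020, "pay_gap_pct": 15.8}]
merge(table, batch)
merge(table, batch)  # second run hits the same keys: no duplicate rows
```

The warehouse-side merge does the same thing with a MERGE statement, which is why repeated runs over the same window are safe.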
Thalassa-Analytics
Thalassa is a production-style batch data engineering project for Greek maritime traffic analytics. It uses Bruin to orchestrate a fully scheduled pipeline that ingests public sailing traffic data from the data.gov.gr sailing_traffic API, lands raw records in BigQuery, transforms them into curated analytics tables, and serves the results through a Streamlit dashboard covering operational KPIs, route patterns, and port analysis.
GitHub Activity Dashboard
GitHub Activity Dashboard transforms raw GitHub event data into structured tables and generates interactive dashboards showing repository engagement and hourly activity. Using Bruin, the pipeline demonstrates end-to-end orchestration, incremental staging, and daily scheduling.
Automated Ads Reporting Suite
A fully automated data pipeline that consolidates, transforms, and visualizes digital advertising performance data across multiple platforms (Google Ads, Meta Ads, TikTok Ads, LinkedIn Ads).
Trading Helper Pipeline
A daily ETL pipeline that fetches end-of-day OHLCV data for QQQ, NQ futures, VIX, and VVIX from yfinance, stores it in a local DuckDB database, and serves an interactive Streamlit dashboard for pre-market analysis.
Olist Ecommerce Analytics Pipeline
This cloud-native ELT pipeline transforms 100k+ Brazilian e-commerce orders into actionable logistics and customer satisfaction insights by migrating raw data from Kaggle into Google BigQuery using a multi-layered Medallion architecture (Raw → Staging → Mart). Powered by Bruin, the project utilizes Python-to-Cloud Ingestion for automated downloads, Native SQL Modeling for scalable transformations, and Automated Data Quality Checks (e.g., unique, not_null, positive_value) to ensure end-to-end data integrity for its interactive visualization dashboard.
AI Economic Index
This project implements the end-to-end data pipeline: from raw Anthropic dataset ingestion to warehouse transformation (dbt), analytics outputs, and finally reporting via Evidence and Bruin. The purpose is to deliver insights on AI's role in today's economy. In this project, I use Bruin as an AI data analyst. To enable this, I first set up a Bruin pipeline that passes through key intermediate and mart dbt models, such as task-to-SOC mappings, enriched datasets, and reporting tables. Every Bruin asset and its table columns were fully documented so Bruin can more easily parse the data and generate queries. Once the pipeline was in place, I connected Bruin as an app in Discord, so the Bruin dataset could be consumed directly for AI-driven data analysis by users.
min-fin
A minimalist, metadata-based on-budget finance tracker using Databricks and Bruin. Bruin handles the ETL pipelines while Databricks hosts the data/schema/dashboard.
US Flights Data Engineering Project (2015)
📝 Problem Description
The aviation industry generates massive amounts of data daily. This project analyzes a dataset of 5.8 million US flights from 2015 to identify patterns in delays and cancellations. The goal is to provide actionable insights for operational management through a robust data pipeline and interactive dashboards. Source: https://www.kaggle.com/datasets/usdot/flight-delays

Key questions addressed:
- Punctuality: which airlines and airports are the most/least punctual (OTP)?
- Correlation: how do flight distance and time of day affect the probability of delay?
- Seasonality: what are the seasonal trends in flight reliability?

🏗️ Project Architecture
Since the official DOT Bureau of Transportation Statistics does not provide a public API, the data is sourced from Kaggle. The project follows a modern ELT (Extract, Load, Transform) approach using the Medallion Architecture (Bronze, Silver, Gold layers), moving data from raw CSVs to structured analytical reports.

🛠️ Technologies & Infrastructure
- Cloud: storage (Google Cloud Storage, GCS), data warehouse (Google BigQuery)
- Infrastructure: batch processing architecture
- Workflow orchestration & transformation: Bruin
- Languages: Python (ingestion), SQL (transformations)
- Data visualization: Power BI (Desktop & Service)
Chicago Crash data analysis
This project is an end-to-end data pipeline that ingests, transforms, and stores Chicago traffic crash data for analysis. It pulls from three public datasets on the City of Chicago Data Portal — Crashes, People, and Vehicles — via the SODA2 REST API, loads the raw records into a local DuckDB database, and then joins them into a single analytics-ready table. The pipeline runs on a daily schedule, fetching only new records each run by filtering the API on the Bruin-provided start and end dates, so it avoids re-downloading the entire dataset (~1M+ crash records) every time. The pipeline is built with Bruin, an open-source data pipeline framework. It uses Python assets with automatic dependency management via uv and pyproject.toml to handle API ingestion, materialization with the merge strategy on the People and Vehicles assets to upsert records by primary key and append on Crashes, and DuckDB as a first-class connection for serverless local analytics. Asset dependencies (depends) enforce sequential execution to avoid DuckDB write-lock conflicts, while the SQL asset leverages Bruin's create+replace materialization to rebuild the merged table from the three raw sources on each run. The daily schedule and built-in environment variables (BRUIN_START_DATE, BRUIN_END_DATE) enable incremental loading out of the box.
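The incremental-loading pattern described above can be sketched as follows: Bruin injects the run window via environment variables, and a SODA `$where` filter restricts each API call to that window. The `crash_date` field name and the `$limit` value are assumptions, not the dataset's confirmed schema:

```python
import os

# Sketch of the incremental window: Bruin injects BRUIN_START_DATE and
# BRUIN_END_DATE, and a SODA $where filter restricts each run to that window.
# The crash_date field name and the $limit value are assumptions.
def build_soda_params(start, end, date_field="crash_date"):
    return {
        "$where": f"{date_field} between '{start}' and '{end}'",
        "$limit": 50000,
    }

start = os.environ.get("BRUIN_START_DATE", "2024-01-01")
end = os.environ.get("BRUIN_END_DATE", "2024-01-02")
params = build_soda_params(start, end)
```

These params would then be passed to the SODA endpoint with an HTTP client, so each daily run fetches only the new window instead of the full ~1M-row dataset.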
Skypulse Streaming Pipeline
SkyPulse is an end-to-end real-time data pipeline that integrates heterogeneous global streams—flight positions, weather conditions, and seismic events—into a unified operational view of airspace. It continuously ingests data from OpenSky, Open-Meteo, and USGS, processes it through Redpanda and Apache Flink using 5-minute tumbling windows, and stores it in a Supabase landing zone. The core strength of the system lies in its use of Bruin, which structures and transforms raw data into a layered analytical model (staging → intermediate → marts), enabling consistent, cross-stream enrichment. This allows the generation of composite geospatial risk scores per grid cell, combining aircraft density, seismic activity, and weather conditions, all visualized in near real-time through an interactive Streamlit dashboard.
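The 5-minute tumbling windows mentioned above amount to mapping each event timestamp to the start of its bucket; a toy sketch that mirrors the windowing idea only, not the Flink API:

```python
from datetime import datetime, timezone

# Toy sketch of 5-minute tumbling windows: map each event timestamp to the
# start of its window. This mirrors the windowing idea only, not the Flink API.
def window_start(ts, width_sec=300):
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % width_sec, tz=timezone.utc)

event_time = datetime(2024, 1, 1, 12, 7, 42, tzinfo=timezone.utc)
bucket = window_start(event_time)  # all events in the same 5-minute bucket aggregate together
```

Aggregating per (bucket, grid cell) is what makes the cross-stream risk scores line up across flight, weather, and seismic events.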
Swiss Traffic Pipeline
Swiss Traffic Pipeline is a Bruin-orchestrated data pipeline that ingests, stages, and transforms the Swiss Federal Roads Office (FEDRO) 2025 annual traffic bulletin into a Looker Studio dashboard.
Chat Analysis
The goal of this project was to understand how users interact with a chatbot—analyzing patterns, behaviors, and engagement to uncover meaningful insights.
Global Urbanization & Mobility Intelligence Pipeline
Production-grade Bruin pipeline that models 126 years of urbanization, density, land expansion, and mobility pressure across 217 countries and areas. The project uses Bruin CLI for orchestration, PostgreSQL for storage, Grafana for visualization, and GitHub Actions for CI/CD.
Airflow-Studio
A visual DAG builder for Apache Airflow. Drag, drop, and connect operators on a canvas to generate valid, idiomatic Python DAG files in both traditional and TaskFlow API syntax.
Global Energy Transition Pipeline
This pipeline ingests, transforms, and analyses 125 years of global energy data (1900–2026) across 200+ countries to track the worldwide transition from fossil fuels to renewables. The project uncovered surprising stories, like Denmark jumping from 15% to 91% renewable electricity since 2000, and DR Congo and Ethiopia already running on 100% renewable power. The pipeline is built entirely with Bruin on DuckDB, with an interactive Evidence.dev dashboard for visualization and GitHub Actions for CI/CD.

Bruin features used:
- Seed assets: ingested the OWID (Our World in Data) CSV dataset directly into DuckDB
- SQL assets: built 4 mart tables (global trends, country trends, renewables leaderboard, Africa energy spotlight)
- Data quality checks: 54 automated checks across all assets — column-level (not_null, non_negative) and custom SQL checks
- bruin ai enhance: AI-powered documentation, column descriptions, domain tags, and data-driven quality checks on all mart assets
- pipeline.yml orchestration: daily schedule with dependency management
- bruin lineage: full DAG visualisation of asset dependencies
- bruin validate: integrated into the GitHub Actions CI/CD workflow
Return Analysis
Returns are expensive for ecommerce businesses because they reduce revenue, increase operational costs, and can reveal product or customer behavior issues. The goal of this project is to build an end-to-end data pipeline that answers questions such as:
- How do return rates change over time?
- Which product categories have the highest return rates?
- Which SKUs drive the most revenue loss from returns?
- Which customers return the most items?
The output is a set of analytics tables in BigQuery and a dashboard in Looker Studio for reporting.
audio-trend-data-project
This project aimed to explore content consumption patterns across music and podcasts on streaming platforms, and see if there were any trends amongst them.
research-intel-pipeline
Real-time scientific research analytics pipeline for AI, ML, and Computational Biology. Ingests papers from arXiv and OpenAlex, transforms them with Bruin into a layered analytical model, and serves insights through an interactive Streamlit dashboard.
EV Market Intelligence Pipeline
Built an end-to-end serverless data lakehouse on AWS to analyze global electric vehicle (EV) sales trends — all at near-$0 cost. I used the Bruin CLI for orchestration to tie the stack together. Bruin allowed me to:
1. Unify ingestion & transformation: sequence the Kaggle-to-S3 Python asset and the dbt-run asset within a single pipeline.
2. Manage environments: seamlessly handle local DuckDB processing while targeting AWS S3 for the final "Gold" mart tables.
3. Move fast: achieve rapid development within a GitHub Codespaces environment using a lightweight CLI.
Afrofinance Pulse
A Bruin + DuckDB pipeline tracking Africa's fintech ecosystem, currency volatility, startup funding flows, and fintech readiness across 12 African countries.
UK Retail Analytics Pipeline
An end-to-end data engineering pipeline built with Bruin and DuckDB for the UK Online Retail II dataset (1,067,371 real transactions). The pipeline ingests raw Excel data through a Python asset, cleans it in a staging layer, and produces 5 analytical mart tables covering monthly revenue trends, product performance with return-rate analysis, RFM customer segmentation, country analysis, and cancellation tracking. It features 66 automated quality checks across all 7 assets, GitHub Actions CI/CD, and AI analysis via the Bruin AI Data Analyst. Key finding: one product ranks 4th by revenue at £168K but has a 100% return rate.
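The RFM segmentation mentioned above rests on three per-customer inputs: recency in days, purchase frequency, and monetary total. A rough sketch, where the row shape and dates are illustrative assumptions rather than the pipeline's actual logic:

```python
from datetime import date

# Rough sketch of the R/F/M inputs behind the segmentation: recency in days,
# frequency, and monetary total per customer. Row shape is an assumption.
def rfm(transactions, today):
    """Map customer -> (recency_days, frequency, monetary) from (customer, date, amount) rows."""
    out = {}
    for cust, d, amount in transactions:
        rec, freq, mon = out.get(cust, (None, 0, 0.0))
        days = (today - d).days
        rec = days if rec is None else min(rec, days)
        out[cust] = (rec, freq + 1, mon + amount)
    return out

tx = [
    ("c1", date(2011, 12, 1), 100.0),
    ("c1", date(2011, 11, 1), 50.0),
    ("c2", date(2011, 6, 15), 20.0),
]
scores = rfm(tx, today=date(2011, 12, 9))
```

Segment labels then come from binning each of the three values into quantile scores, which a SQL mart handles with NTILE-style window functions.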
Air Quality Around Italy SCAN
End-to-end batch pipeline to monitor air quality across Italy. Pulls real-time data from [AQICN](https://aqicn.org), stores it on **GCS**, transforms it with **dbt + DuckDB**, and visualizes it in an interactive **Streamlit** dashboard.
Hospital Facility Management BI Learning Project
Built a scenario-based BI app using NHS ERIC data + synthetic data to model hospital facility operations.
Stack: DuckDB, Bruin, Streamlit
Focus:
- Maintenance backlog
- Compliance
- Sterile services
- Patient flow
Goal: bridge real-world operations with BI modeling.
Mexico Biomass Analytics: End-to-End Data Pipeline
This project solves a critical data logistics problem in Mexico's renewable energy sector. It combines fragmented agricultural waste data (SIAP) and existing infrastructure records (SEMARNAT) to identify the national biomass "Opportunity Gap." Designed with an EtLT architecture, the pipeline uses Terraform for IaC deployment on Google Cloud Platform. Bruin serves as the unified orchestrator handling data ingestion via Python, and heavy business transformations using SQL natively in BigQuery. The optimized data warehouse powers an interactive Looker Studio dashboard and integrates with Bruin Cloud's AI Data Analyst to provide actionable, natural-language insights for infrastructure investment. Tech Stack: Bruin, GCP (Cloud Storage, BigQuery), Terraform, Python, SQL, Looker Studio.
Global "Net Zero" Energy Transition Tracker
An end-to-end, production-grade data platform designed to monitor and analyze the structural decoupling of economic growth from carbon intensity. The system processes 25 years of historical data (2000–2024) across 231 countries to track progress toward global Net Zero targets, with a specialized strategic deep dive into the ASEAN region and Indonesia.
🔍 OtakuLens
A fully cloud-native, end-to-end data engineering pipeline that ingests metadata for 500+ anime titles from MyAnimeList, transforms it through a production-grade ELT process, and serves interactive analytics through a live Streamlit dashboard, including semantic anime recommendations powered by sentence embeddings.
U.S. Crude Oil Production Analytics Pipeline
The primary goal of this project is to build a robust and scalable data pipeline that tracks crude oil production across different U.S. states using publicly available data from the Energy Information Administration (EIA). The pipeline handles end-to-end data processing — from ingestion of raw datasets to transformation into analytics-ready tables — enabling analysis of production trends across regions and over time to support data-driven insights in the energy sector.

Bruin features used:
- Data ingestion: integrating raw EIA datasets directly into the data warehouse as part of the pipeline
- Data transformation: SQL-based processing to clean, structure, and aggregate crude oil production data
- Pipeline execution (bruin run): running the full pipeline with automatic dependency handling
- Bruin AI: assisting in SQL development and accelerating pipeline implementation
NYC Citi Bike Data Pipeline
This project is an end-to-end data engineering pipeline built around NYC Citi Bike trip data. It ingests monthly trip records from the public source, stores them in Google Cloud Storage, transforms them in BigQuery, and publishes the results through an interactive Plotly Dash dashboard. The goal is to turn raw trip-level data into clean, analysis-ready tables and visual insights about ridership patterns, bike type usage, trip duration, distance, and seasonality. The project uses several Bruin features to manage the workflow: - Asset definitions and dependencies to organize the raw, staging, and report layers. - Connection configuration for BigQuery through a named Bruin connection. - Materialization strategies including incremental `delete+insert` and full rebuild `create+replace`. - Built-in data quality checks such as `not_null` and `accepted_values` to improve reliability and consistency.
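The incremental `delete+insert` strategy listed above can be illustrated with a minimal sketch: drop the target rows for the incoming partition (here, a month), then insert the fresh batch. The row shape is an assumption:

```python
# Sketch of the delete+insert idea: drop target rows for the incoming partition
# (here, a month), then insert the fresh batch. Row shape is an assumption.
def delete_insert(table, batch, partition_key="month"):
    incoming = {row[partition_key] for row in batch}
    kept = [row for row in table if row[partition_key] not in incoming]
    return kept + batch

table = [{"month": "2024-05", "trips": 900}, {"month": "2024-06", "trips": 100}]
batch = [{"month": "2024-06", "trips": 120}]  # corrected re-load of June
table = delete_insert(table, batch)
```

This is why re-running a month is safe: stale rows for that month are replaced wholesale, while untouched months are left alone, unlike `create+replace`, which rebuilds the whole table.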
Hong Kong Transit Pulse — 香港交通脈搏
Hong Kong Transit Pulse is an end-to-end batch data engineering pipeline that ingests raw GTFS feeds and MTR open data, transforms them into analytics-ready models, and surfaces insights via an interactive Streamlit dashboard. The pipeline runs daily, pulling from two open data sources — HK Transport (GTFS) and MTR Corporation — loading them into Google Cloud Storage, transforming through BigQuery layers (raw → staging → marts), and visualising in a 4-tab dashboard with a real-time streaming layer on top.