Community Project Showcase
Browse projects built with Bruin. Explore real-world pipelines, ingestion workflows, and analytics solutions from the community.
27 Projects
CryptoFlow Analytics
Crypto markets generate massive amounts of data across hundreds of exchanges, thousands of tokens, and multiple sentiment indicators. Individual investors and analysts face three core challenges:
- Data fragmentation — prices, volumes, sentiment, and trending data live in separate APIs with different formats
- Signal noise — raw price changes alone are misleading without context (volume confirmation, market breadth, sentiment)
- Regime blindness — most dashboards show what happened, but fail to classify where we are in the market cycle

CryptoFlow Analytics solves this by building a unified intelligence layer that ingests, cleans, enriches, and analyzes crypto data to produce actionable signals, not just charts.

Bruin features used:
- Python assets: 5 ingestion scripts fetching from the CoinGecko and Alternative.me APIs, plus a CSV seed
- SQL assets: 9 BigQuery SQL transformations across staging (3) and analytics (6) layers
- Seed assets: CSV-based reference data for coin categories
- Materialization: table strategy for all assets; merge for incremental ingestion
- Dependencies: explicit depends declarations creating a proper DAG
- Quality checks: built-in (not_null, unique, positive, accepted_values) on every asset
- Custom checks: business-logic validations (e.g., "Bitcoin must exist in the data", "dominances sum to ~100%")
- Glossary: structured business-term definitions for crypto concepts
- Pipeline schedule: daily schedule via pipeline.yml
- Bruin Cloud: deployment, monitoring, and AI analyst
- AI Data Analyst: conversational analysis on all analytics tables
- Lineage: full column-level lineage via bruin lineage
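The "dominances sum to ~100%" custom check mentioned above reduces to a tolerance test over per-coin market-cap shares. A minimal Python sketch of that logic (the function name, the 1% tolerance, and the figures are illustrative assumptions, not the project's actual check, which runs as SQL inside Bruin):

```python
def dominance_check(market_caps: dict[str, float], tolerance: float = 1.0) -> bool:
    """Validate that per-coin dominance percentages sum to ~100%.

    Dominance is a coin's market cap as a share of total market cap,
    so the shares should sum to 100% up to rounding tolerance.
    """
    total = sum(market_caps.values())
    if total <= 0:
        return False
    dominances = {coin: 100 * cap / total for coin, cap in market_caps.items()}
    # Business-logic checks: Bitcoin must exist, and dominances must sum to ~100%
    return "bitcoin" in market_caps and abs(sum(dominances.values()) - 100) <= tolerance

caps = {"bitcoin": 1200e9, "ethereum": 400e9, "solana": 80e9}
assert dominance_check(caps)                      # shares sum to 100%
assert not dominance_check({"ethereum": 400e9})   # Bitcoin missing
```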
GitHub Repository Insights
A production-style data pipeline built with Bruin for the Bruin Zoomcamp challenge. This pipeline ingests GitHub repository metadata from the GitHub API, transforms the data through staging, and produces an analytics report — all orchestrated locally using DuckDB.
GitHub Activity Analytics Dashboard
GitHub generates millions of public events every day — pushes, pull requests, issues, forks, stars — across thousands of repositories and contributors worldwide. This raw activity stream is publicly available via gharchive.org, but it is not pre-aggregated or directly queryable in a useful analytical form. This project builds an end-to-end batch data pipeline that answers:
- Which event types dominate GitHub activity on any given day or hour?
- Which repositories attract the most contributors and drive the most events?
- How does activity vary across the day (UTC), and what are the peak hours?
- What is the daily mix of event types — is it push-heavy, or driven by issues and PRs?
- Which programming language ecosystems (inferred from repo naming patterns) are most active?
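Most of the questions above are grouped counts once events are parsed. A toy Python sketch of those aggregations (the event tuples are illustrative; real gharchive records are richer JSON objects):

```python
from collections import Counter
from datetime import datetime

# Toy gharchive-style events: (ISO-8601 timestamp, event type, repo name).
events = [
    ("2024-03-01T09:12:00+00:00", "PushEvent", "org/repo-a"),
    ("2024-03-01T09:45:00+00:00", "PullRequestEvent", "org/repo-b"),
    ("2024-03-01T10:05:00+00:00", "PushEvent", "org/repo-a"),
    ("2024-03-01T10:30:00+00:00", "PushEvent", "org/repo-c"),
]

# Events per UTC hour (the "peak hours" question).
per_hour = Counter(datetime.fromisoformat(ts).hour for ts, _, _ in events)
# Daily mix of event types (push-heavy vs. issue/PR-driven).
per_type = Counter(etype for _, etype, _ in events)
# Events per repository (which repos drive the most activity).
per_repo = Counter(repo for _, _, repo in events)
```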
EDGAR Daily Hub
EDGAR Daily Hub is a full-stack data platform that brings transparency to SEC filing activity. By automatically ingesting the EDGAR daily index every business day, it tracks filing volumes across all form types and flags unusual spikes — like surges in insider ownership disclosures — through a clean, interactive dashboard. Users can build a personal watchlist of stock tickers to monitor filings for companies they care about, turning a tedious manual research process into a seamless daily workflow. Built with React/TypeScript, Python FastAPI, and MotherDuck as the analytical data warehouse, with Supabase handling auth and user data. The pipeline runs on a daily automated schedule via GitHub Actions and is deployed on Fly.io with Docker.
NZ Electricity Generation Pipeline
This project tracks New Zealand's electricity generation mix across 8 years (2018–2026), pulling monthly CSVs from the Electricity Authority's public API through a three-layer transformation pipeline (staging, core, mart) that feeds a Looker Studio dashboard, revealing an ~85% renewable grid driven mostly by hydro. The pipeline is built entirely with Bruin, an open-source CLI tool that replaces the usual Airflow + dbt + Great Expectations stack with a single binary: SQL and Python assets coexist in the same pipeline with automatic dependency resolution, incremental materialisation, and quality checks embedded directly in asset definitions rather than maintained as a separate test suite. That "one tool, one config format" design meant I could focus on the data logic (unpivoting 50 trading-period columns, deduplicating records, and building partitioned and clustered fact tables) rather than writing glue code between tools.
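The unpivot step described above turns one wide row per site and month (50 trading-period columns) into one long row per trading period. A minimal sketch under assumed column names (TP1..TP50 holding GWh values; the real Electricity Authority schema differs):

```python
# Unpivot a wide row with trading-period columns into long format:
# one output row per (site, month, trading_period).
def unpivot(row: dict, id_cols: tuple = ("site", "month")) -> list[dict]:
    ids = {c: row[c] for c in id_cols}
    return [
        {**ids, "trading_period": int(col[2:]), "gwh": val}
        for col, val in row.items()
        if col.startswith("TP") and val is not None  # drop empty periods
    ]

wide = {"site": "Benmore", "month": "2024-01", "TP1": 0.42, "TP2": 0.40, "TP3": None}
long_rows = unpivot(wide)  # 2 rows: TP1 and TP2; the null TP3 is dropped
```

In the actual pipeline this reshaping is done in SQL inside a staging asset; the Python version just makes the transformation explicit.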
Kickstarter Campaign Analytics Pipeline
Kickstarter campaign analytics pipeline that answers which factors (category, goal, country, staff pick, duration) best predict Kickstarter success. It ingests ~203k campaigns from Hugging Face (Parquet), lands data in Google Cloud Storage, loads and models it in BigQuery (raw → staging → analytics marts), and feeds a Looker Studio dashboard. Infrastructure (GCS, BigQuery) is provisioned with Terraform. Bruin orchestrates the workflow: GCP connection and environments in .bruin.yml, pipeline definition in pipeline.yml (variables), Python assets for download/load to GCS and BigQuery, BigQuery SQL assets with dependencies (depends), materialization (views/tables), column metadata and data quality checks in asset headers, and Jinja variables (var.*) in SQL.
Kenya Renewable Energy Data Pipeline
This project builds a fully automated data pipeline that ingests Kenya's renewable energy data from four open sources — Ember Climate, IRENA, Our World in Data, and EnergyData.info — transforms it through a Bronze → Silver → Gold medallion architecture stored in a single DuckDB file, and serves it as a live Evidence.dev dashboard deployed via GitHub Actions to GitHub Pages. The pipeline tracks Kenya's progress toward its 2030 target of 100% renewable electricity across five analytical dimensions: generation mix, installed capacity, carbon intensity, electricity access, and geospatial grid infrastructure. It rebuilds automatically every day, requiring zero manual intervention to keep the data current. Bruin features used include: Python assets, DuckDB SQL assets, asset dependency graph, materialization strategies, built-in data quality checks, multi-language pipeline, CLI validation, single connection config, downstream execution, and checks-only mode.
Project_Customer_Churn_Bank
Overview: An end-to-end ELT pipeline built to analyze ABC Bank's customer retention by integrating internal demographics with external 2022 Eurostat market benchmarks. The project focuses on identifying "Premium Segment" churn, discovering that 80% of churned customers are high earners.

How I used Bruin: I leveraged Bruin to move away from traditional script-based workflows to a modern, declarative asset-based architecture:
- Infrastructure as Code (IaC): Every BigQuery table and view was defined as a Bruin asset, including physical-layer optimizations like clustering on country and gender to boost query performance.
- Data lineage & dependencies: Using Bruin's DAG capabilities, I ensured a clean Medallion-like flow: Staging (cleaning) → Reference (Eurostat data) → Fact (salary benchmarking logic).
- Automated data quality: I integrated built-in quality checks (not_null, unique) directly into the asset definitions, ensuring that only validated data reached my Looker Studio dashboard.
- Seamless deployment: Bruin managed the entire lifecycle from Kaggle ingestion through GCS to BigQuery materialization with a single --force execution command.

Key findings: The pipeline revealed that churned customers earn on average 5,048 EUR more than the national benchmark, and identified a 70.45% churn rate among the 46–60 age group in Germany, giving the bank clear targets for retention programs.
GitHub Trends
End-to-end data pipeline tracking GitHub developer activity trends using GCP, Bruin, dbt and Looker Studio
Jobs Analytics Project
Belarus IT Job Market Analytics — a batch pipeline that tracks Data Engineer, Data Analyst and Data Scientist vacancies on rabota.by (powered by the HH.ru API). Monitors vacancy dynamics over time, required experience levels, and the top-5 in-demand skills per role.
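The top-5-skills-per-role metric above is essentially a grouped frequency count over parsed vacancies. A toy Python sketch (the vacancy tuples and skill lists are illustrative; the HH.ru API returns richer JSON, including a key_skills field):

```python
from collections import Counter

# Hypothetical parsed vacancies: (role, list of required skills).
vacancies = [
    ("Data Engineer", ["SQL", "Python", "Airflow"]),
    ("Data Engineer", ["SQL", "Spark", "Python"]),
    ("Data Analyst", ["SQL", "Excel", "Power BI"]),
    ("Data Analyst", ["SQL", "Python"]),
]

def top_skills(vacancies: list, role: str, n: int = 5) -> list[str]:
    """Return the n most frequently required skills for a given role."""
    counts = Counter(s for r, skills in vacancies for s in skills if r == role)
    return [skill for skill, _ in counts.most_common(n)]

print(top_skills(vacancies, "Data Engineer"))
```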
Closer-Every-Year
A batch pipeline tracking **gender gap indicators and relationship trends** (marriage, divorce, age at first marriage, pay gap) across European countries from 2005 to 2024, all sourced from the Eurostat API.

**Bruin features I used:**
- `type: python` assets for ingestion (Eurostat API → Parquet on GCS)
- `type: bq.sql` and `type: duckdb.sql` assets for staging + analytics SQL transformations
- `strategy: merge` on `(country, year)` — fully idempotent, no duplicate runs
- Dependency resolution via asset references — no manual DAG wiring
- Dual-environment setup: `local-pipeline` runs on DuckDB, `gcp-pipeline` runs on BigQuery — **same asset code, different connections**
- `bruin run --environment cloud` for the GCP pipeline, `--workers 1` for local (DuckDB doesn't support concurrent writes)
- Docker-based setup with the Bruin container + Terraform container side by side
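The merge-on-`(country, year)` behaviour can be pictured as a keyed upsert: rerunning the same batch overwrites rows instead of appending them. A minimal in-memory sketch of those semantics (in practice Bruin issues warehouse-side MERGE statements, not Python; the row shape here is illustrative):

```python
# Keyed upsert on (country, year): repeated runs produce no duplicates.
def merge(table: dict, batch: list) -> dict:
    for row in batch:
        table[(row["country"], row["year"])] = row  # insert or replace by key
    return table

table = {}
batch = [{"country": "FR", "year": 2020, "pay_gap_pct": 15.8}]
merge(table, batch)
merge(table, batch)      # rerun of the same window: a no-op in effect
assert len(table) == 1   # idempotent: still one row per (country, year)
```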
Thalassa-Analytics
Thalassa is a production-style batch data engineering project for Greek maritime traffic analytics. It uses Bruin to orchestrate a fully scheduled pipeline that ingests public sailing traffic data from the data.gov.gr sailing_traffic API, lands raw records in BigQuery, transforms them into curated analytics tables, and serves the results through a Streamlit dashboard covering operational KPIs, route patterns, and port analysis.
GitHub Activity Dashboard
GitHub Activity Dashboard transforms raw GitHub event data into structured tables and generates interactive dashboards showing repository engagement and hourly activity. Using Bruin, the pipeline demonstrates end-to-end orchestration, incremental staging, and daily scheduling.
Automated Ads Reporting Suite
A fully automated data pipeline that consolidates, transforms, and visualizes digital advertising performance data across multiple platforms (Google Ads, Meta Ads, TikTok Ads, LinkedIn Ads).
Trading Helper Pipeline
A daily ETL pipeline that fetches end-of-day OHLCV data for QQQ, NQ futures, VIX, and VVIX from yfinance, stores it in a local DuckDB database, and serves an interactive Streamlit dashboard for pre-market analysis.
Olist Ecommerce Analytics Pipeline
This cloud-native ELT pipeline transforms 100k+ Brazilian e-commerce orders into actionable logistics and customer satisfaction insights by migrating raw data from Kaggle into Google BigQuery using a multi-layered Medallion architecture (Raw → Staging → Mart). Powered by Bruin, the project utilizes Python-to-Cloud Ingestion for automated downloads, Native SQL Modeling for scalable transformations, and Automated Data Quality Checks (e.g., unique, not_null, positive_value) to ensure end-to-end data integrity for its interactive visualization dashboard.
Olist E-Commerce Analytics Pipeline
This cloud-native ELT pipeline transforms 100k+ Brazilian e-commerce orders into actionable logistics and customer satisfaction insights by migrating raw data from Kaggle into Google BigQuery using a multi-layered Medallion architecture (Raw → Staging → Mart). Powered by Bruin, the project utilizes Python-to-Cloud Ingestion for automated downloads, Native SQL Modeling for scalable transformations, and Automated Data Quality Checks (e.g., unique, not_null, positive_value) to ensure end-to-end data integrity for its interactive visualization dashboard.
AI Economic Index
This project implements an end-to-end data pipeline: from raw Anthropic dataset ingestion to warehouse transformation (dbt), analytics outputs, and finally reporting via Evidence and Bruin. The purpose is to deliver insights on AI’s role in today’s economy. In this project, I use Bruin as an AI data analyst. To enable this, I first set up a Bruin pipeline that passes through key intermediate and mart dbt models, such as task-to-SOC mappings, enriched datasets, and reporting tables. Every Bruin asset and its table columns were fully documented so Bruin can more easily parse the data and generate queries. Once the pipeline was in place, I connected Bruin as an app in Discord, so the Bruin dataset could be consumed directly for AI-driven data analysis by users.
min-fin
A minimalist, metadata-based on-budget finance tracker using Databricks and Bruin. Bruin handles the ETL pipelines while Databricks hosts the data/schema/dashboard.
US Flights Data Engineering Project (2015)
Problem description: The aviation industry generates massive amounts of data daily. This project analyzes a dataset of 5.8 million US flights from 2015 to identify patterns in delays and cancellations. The goal is to provide actionable insights for operational management through a robust data pipeline and interactive dashboards. Source: https://www.kaggle.com/datasets/usdot/flight-delays

Key questions addressed:
- Punctuality: Which airlines and airports are the most/least punctual (OTP)?
- Correlation: How do flight distance and time of day affect the probability of delay?
- Seasonality: What are the seasonal trends in flight reliability?

Project architecture: Since the official DOT Bureau of Transportation Statistics does not provide a public API, the data is sourced from Kaggle. The project follows a modern ELT (Extract, Load, Transform) approach using the Medallion Architecture (Bronze, Silver, Gold layers), moving data from raw CSVs to structured analytical reports.

Technologies & infrastructure:
- Cloud: Google Cloud Storage (GCS) for storage, Google BigQuery as the data warehouse
- Infrastructure: batch processing architecture
- Workflow orchestration & transformation: Bruin
- Languages: Python (ingestion), SQL (transformations)
- Data visualization: Power BI (Desktop & Service)
Chicago Crash data analysis
This project is an end-to-end data pipeline that ingests, transforms, and stores Chicago traffic crash data for analysis. It pulls from three public datasets on the City of Chicago Data Portal — Crashes, People, and Vehicles — via the SODA2 REST API, loads the raw records into a local DuckDB database, and then joins them into a single analytics-ready table. The pipeline runs on a daily schedule, fetching only new records each run by filtering the API on the Bruin-provided start and end dates, so it avoids re-downloading the entire dataset (~1M+ crash records) every time. The pipeline is built with Bruin, an open-source data pipeline framework. It uses Python assets with automatic dependency management via uv and pyproject.toml to handle API ingestion, materialization with the merge strategy on the People and Vehicles assets to upsert records by primary key and append on Crashes, and DuckDB as a first-class connection for serverless local analytics. Asset dependencies (depends) enforce sequential execution to avoid DuckDB write-lock conflicts, while the SQL asset leverages Bruin's create+replace materialization to rebuild the merged table from the three raw sources on each run. The daily schedule and built-in environment variables (BRUIN_START_DATE, BRUIN_END_DATE) enable incremental loading out of the box.
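The incremental-loading trick described above amounts to templating the SODA2 $where clause with the run window Bruin injects via BRUIN_START_DATE and BRUIN_END_DATE. A hedged sketch (the crash_date field name and the dataset id in the URL are placeholders, not the portal's actual identifiers):

```python
import os

def incremental_url(base_url: str) -> str:
    # Bruin injects the run window into these variables on each scheduled run;
    # the fallbacks here exist only so the sketch runs outside Bruin.
    start = os.environ.get("BRUIN_START_DATE", "2024-01-01")
    end = os.environ.get("BRUIN_END_DATE", "2024-01-02")
    # SODA2 supports SQL-like filtering via $where, so only the window is fetched
    # instead of re-downloading the full ~1M-record dataset.
    return f"{base_url}?$where=crash_date between '{start}' and '{end}'&$limit=50000"

url = incremental_url("https://data.cityofchicago.org/resource/<dataset-id>.json")
```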
Skypulse Streaming Pipeline
SkyPulse is an end-to-end real-time data pipeline that integrates heterogeneous global streams — flight positions, weather conditions, and seismic events — into a unified operational view of airspace. It continuously ingests data from OpenSky, Open-Meteo, and USGS, processes it through Redpanda and Apache Flink using 5-minute tumbling windows, and stores it in a Supabase landing zone. The core strength of the system lies in its use of Bruin, which structures and transforms raw data into a layered analytical model (staging → intermediate → marts), enabling consistent, cross-stream enrichment. This allows the generation of composite geospatial risk scores per grid cell, combining aircraft density, seismic activity, and weather conditions, all visualized in near real-time through an interactive Streamlit dashboard.
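The per-grid-cell, per-window keying behind those risk scores can be sketched in a few lines. The 5-minute window matches the description above; the 1-degree cell size and function name are assumptions for illustration:

```python
from datetime import datetime, timezone

def window_and_cell(ts: datetime, lat: float, lon: float, window_s: int = 300):
    """Assign an event to a 5-minute tumbling window and an integer-degree grid cell."""
    epoch = int(ts.timestamp())
    window_start = epoch - (epoch % window_s)  # floor to the window boundary
    cell = (int(lat // 1), int(lon // 1))      # 1-degree x 1-degree grid cell
    return window_start, cell

ts = datetime(2024, 5, 1, 12, 7, 30, tzinfo=timezone.utc)
w, cell = window_and_cell(ts, 37.9, 23.7)  # an event over the Athens area
# w is the 12:05:00 window start; cell == (37, 23)
```

Events from all three streams that share a `(window_start, cell)` key can then be aggregated into one composite score per cell.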
Swiss Traffic Pipeline
Swiss Traffic Pipeline is a Bruin-orchestrated data pipeline that ingests, stages, and transforms the Swiss Federal Roads Office (FEDRO) 2025 annual traffic bulletin into a Looker Studio dashboard.
Chat Analysis
The goal of this project was to understand how users interact with a chatbot—analyzing patterns, behaviors, and engagement to uncover meaningful insights.
Global Urbanization & Mobility Intelligence Pipeline
Production-grade Bruin pipeline that models 126 years of urbanization, density, land expansion, and mobility pressure across 217 countries and areas. The project uses Bruin CLI for orchestration, PostgreSQL for storage, Grafana for visualization, and GitHub Actions for CI/CD.
Airflow-Studio
A visual DAG builder for Apache Airflow. Drag, drop, and connect operators on a canvas to generate valid, idiomatic Python DAG files in both traditional and TaskFlow API syntax.
Global Energy Transition Pipeline
This pipeline ingests, transforms, and analyses 125 years of global energy data (1900–2026) across 200+ countries to track the worldwide transition from fossil fuels to renewables. The project uncovered surprising stories, like Denmark jumping from 15% to 91% renewable electricity since 2000, and DR Congo and Ethiopia already running on 100% renewable power. The pipeline is built entirely with Bruin on DuckDB, with an interactive Evidence.dev dashboard for visualization and GitHub Actions for CI/CD.

Bruin features used:
- Seed assets: ingested the OWID (Our World in Data) CSV dataset directly into DuckDB
- SQL assets: built 4 mart tables (global trends, country trends, renewables leaderboard, Africa energy spotlight)
- Data quality checks: 54 automated checks across all assets — column-level (not_null, non_negative) and custom SQL checks
- bruin ai enhance: AI-powered documentation, column descriptions, domain tags, and data-driven quality checks on all mart assets
- pipeline.yml orchestration: daily schedule with dependency management
- bruin lineage: full DAG visualisation of asset dependencies
- bruin validate: integrated into the GitHub Actions CI/CD workflow
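The renewable-share metric behind those mart tables is a guarded ratio per country-year. A toy Python sketch (column names and figures are illustrative, chosen to loosely echo the Denmark story above; the real computation happens in the SQL assets):

```python
def renewable_share(rows: list) -> dict:
    """Renewable share of electricity generation (%) per (country, year)."""
    return {
        (r["country"], r["year"]): round(100 * r["renewables_twh"] / r["total_twh"], 1)
        for r in rows
        if r["total_twh"] > 0  # guard against division by zero; mirrors a non_negative check
    }

rows = [
    {"country": "Denmark", "year": 2000, "renewables_twh": 5.4, "total_twh": 36.0},
    {"country": "Denmark", "year": 2023, "renewables_twh": 31.9, "total_twh": 35.0},
]
shares = renewable_share(rows)
# → {("Denmark", 2000): 15.0, ("Denmark", 2023): 91.1}
```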