AWS EMR Serverless + Bruin


Ingest data from AWS EMR Serverless or push enriched data back — with quality checks, lineage, and scheduling. Defined in YAML, version-controlled in Git.

For business teams

What you get

  • 100+ sources into AWS EMR Serverless

    Pull from any tool, database, or API directly into AWS EMR Serverless. One YAML file per source, all managed by Bruin.

  • Data quality you can trust

    Column-level and custom SQL checks on any AWS EMR Serverless table. Bad data gets blocked before it reaches dashboards.

  • Full lineage visibility

    Trace data from ingestion through transforms to final reports. When something breaks, find the cause in seconds.

  • SQL + Python in one pipeline

    Build transforms in AWS EMR Serverless with both SQL and Python. Bruin resolves dependencies across languages automatically.

For data & engineering teams

How it works

  • 100+ managed connectors

    Ingest from any source directly into AWS EMR Serverless with one YAML file per source. Bruin manages connections and scheduling.

  • YAML-defined, Git-versioned

    Every pipeline is a YAML file. Review in PRs, deploy with CI/CD, roll back with git revert.

  • SQL + Python assets

    Build transformation layers in AWS EMR Serverless with SQL and Python. Bruin resolves dependencies and handles materialization.

  • Quality gates between stages

    Quality checks run between ingestion and transformation. Bad data gets blocked before it reaches downstream models.
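As a sketch of what a SQL asset looks like in practice: Bruin SQL assets carry their configuration in a comment header at the top of the file. The asset name, type, and query below are illustrative, not specific to EMR Serverless — adjust the `type` to match your destination.

```sql
/* @bruin
name: mart.daily_spend                      -- hypothetical asset name
type: bq.sql                                -- destination-specific type (illustrative)
materialization:
  type: table
depends:
  - raw.emr_serverless_spark_tables
@bruin */

-- Aggregate the ingested raw table into a reporting model
SELECT
  campaign_id,
  SUM(spend) AS total_spend
FROM raw.emr_serverless_spark_tables
GROUP BY campaign_id
```

Because the dependency on `raw.emr_serverless_spark_tables` is declared in the header, Bruin can order this asset after ingestion and draw it into the lineage graph automatically.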

Before you start

AWS credentials
EMR Serverless application

Step 1

Add your AWS EMR Serverless connection

Connect using AWS credentials and EMR Serverless application configuration. Add this to your Bruin environment file — credentials are stored securely and referenced by name in your pipeline YAML.

connections:
  emr_serverless:
    type: emr-serverless
    uri: "emr-serverless://access_key:secret_key@application_id?region=us-east-1"

Step 2

Create your pipeline

Define a YAML asset that tells Bruin what to pull from AWS EMR Serverless and where to land it. This file lives in your Git repo — reviewable, version-controlled, and deployable with CI/CD.

Available tables

spark_tables, hive_tables, iceberg_tables

name: raw.emr_serverless_spark_tables
type: ingestr

parameters:
  source_connection: emr_serverless
  source_table: 'spark_tables'
  destination: bigquery

Step 3

Add quality checks

Add column-level and custom SQL checks to your AWS EMR Serverless data. If a check fails, the pipeline stops — bad data never reaches downstream models or dashboards.

Validate data freshness on every sync
Ensure IDs are unique across tables
Block bad data before it reaches downstream models
columns:
  - name: id
    checks:
      - name: not_null
      - name: unique

custom_checks:
  - name: freshness check
    query: |
      SELECT MAX(updated_at) >
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
      FROM raw.emr_serverless_spark_tables

Step 4

Run it

One command. Bruin connects to AWS EMR Serverless, pulls data incrementally, runs your quality checks, and lands clean data in your warehouse. If a check fails, the pipeline stops — bad data never reaches downstream models.

Backfill historical data with --start-date
Schedule with cron or trigger from CI/CD
Full lineage from AWS EMR Serverless to your dashboards
$ bruin run .
Running pipeline...

  emr_serverless_spark_tables
    ✓ Fetched 2,847 new records
    ✓ Quality: id not_null              PASSED
    ✓ Quality: id unique                PASSED
    ✓ Quality: freshness check          PASSED
    ✓ Loaded into bigquery

  Completed in 12s
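One way to wire up the cron/CI scheduling mentioned above is a CI workflow that runs the pipeline on a timer. A sketch as a GitHub Actions workflow — the workflow name, cron expression, and install step are assumptions, not something Bruin prescribes:

```yaml
# .github/workflows/bruin-pipeline.yml — illustrative CI/CD schedule
name: bruin-pipeline
on:
  schedule:
    - cron: "0 6 * * *"          # daily at 06:00 UTC (example cadence)
  workflow_dispatch: {}           # allow manual runs too

jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install the Bruin CLI here, per Bruin's installation docs
      - name: Run the pipeline
        run: bruin run .
```

Credentials for the run would come from your CI secret store rather than being committed alongside the workflow.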

Ready to connect AWS EMR Serverless?

Start for free, or book a demo to see how Bruin handles ingestion, quality, lineage, and scheduling for your entire data stack.