How do you migrate from Pentaho PDI to Bruin?

Start by inventorying jobs, sources, destinations, schedules, owners, and downstream reports. Then migrate one critical flow into Bruin assets, add quality checks, run Pentaho and Bruin in parallel, compare outputs, and retire old jobs only after parity is proven.

Should we migrate every Pentaho job at once?

No. The safer path is one business-critical pipeline at a time. Large rewrites usually hide business logic mistakes. Bruin works well for incremental migration because it can run alongside existing Pentaho jobs.

Can Bruin handle custom Pentaho steps or legacy sources?

Usually yes, through ingestr sources, SQL assets, or Python materializations. Python is useful when the source is a custom API, file drop, mainframe export, FTP process, or proprietary system.

Migrating from Pentaho Data Integration to Bruin

Migrating from Pentaho is rarely a clean "old tool out, new tool in" project.

It is usually more awkward than that. You have PDI jobs that have been edited for years, Spoon transformations that nobody wants to touch, scheduled flows that write into reporting tables, a few custom scripts nearby, and at least one downstream dashboard that finance will notice if it breaks.

If you are evaluating alternatives to an older Pentaho estate, the Bruin team can help with onboarding and migration planning so the first pass focuses on inventory, mapping, parity checks, and a realistic cutover plan instead of a blank-page rewrite.

So the migration plan needs to be boring.

Not heroic. Not a six-month rewrite. Boring.

The goal is to move one flow at a time from Pentaho Data Integration or Kettle into Bruin, prove the output, add governance that probably did not exist before, and only then retire the old job.

Step 1: inventory the Pentaho estate

Start with a spreadsheet if you must. The format does not matter at first. The columns do.

Capture this for every job and transformation:

Field	Why it matters
Job or transformation name	The thing you are migrating
Owner	Someone needs to approve parity
Source systems	Databases, SaaS tools, files, APIs, FTP drops
Destination tables or files	What downstream users actually consume
Schedule	When the job runs and what it depends on
Runtime	How long it takes today
Failure mode	What usually breaks
Downstream reports	What will complain if the output changes
Business rules	Logic hidden inside steps, filters, joins, lookups
Data quality assumptions	Row counts, uniqueness, null checks, freshness

This sounds obvious, but it is where many migrations go wrong. Teams convert the easy transformations and miss the hidden business rule that has been sitting in a filter step since 2018.
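
If the spreadsheet gets unwieldy, the same inventory can live in a small script. Here is a minimal sketch, assuming one record per job or transformation. `InventoryRow`, its field names, and `to_csv` are hypothetical names chosen for illustration; they are not part of Pentaho or Bruin.

```python
import csv
import io
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class InventoryRow:
    """One inventory entry; the fields mirror the table above."""
    name: str
    owner: str
    sources: List[str] = field(default_factory=list)
    destinations: List[str] = field(default_factory=list)
    schedule: str = ""
    runtime_minutes: Optional[float] = None
    failure_mode: str = ""
    downstream_reports: List[str] = field(default_factory=list)
    business_rules: List[str] = field(default_factory=list)
    quality_assumptions: List[str] = field(default_factory=list)

    def gaps(self) -> List[str]:
        """Return the names of fields nobody has filled in yet."""
        return [k for k, v in asdict(self).items() if v in ("", None, [])]


def to_csv(rows: List[InventoryRow]) -> str:
    """Serialize the inventory to CSV, joining list fields with '; '."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(InventoryRow.__dataclass_fields__))
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {k: "; ".join(v) if isinstance(v, list) else v
             for k, v in asdict(row).items()}
        )
    return buf.getvalue()


# Example entry with made-up values.
row = InventoryRow(
    name="load_daily_revenue",
    owner="finance-analytics",
    sources=["postgres.public.orders"],
)
print(row.gaps())      # fields that still need an answer
print(to_csv([row]))
```

The `gaps()` helper is the useful part: an entry with an empty `business_rules` or `downstream_reports` field is a job you do not understand well enough to migrate yet.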

Step 2: pick one flow

Do not start with the biggest job.

Pick a pipeline that is important enough to matter and small enough to finish. A good first candidate has:

Two to five source tables or files
Clear downstream consumers
Known business logic
A measurable output
A friendly owner who can validate the result

Bad first candidates: the huge job that touches 60 tables, a finance close process nobody understands, or a job that writes into a report nobody owns.

You can get to those later. The first migration is about learning the pattern.
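
If you have the inventory in structured form, the criteria above can be turned into a quick screen. This is only a sketch: `Candidate` and `first_flow_blockers` are made-up names, and the thresholds simply restate the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    source_count: int
    has_known_consumers: bool
    logic_documented: bool
    has_measurable_output: bool
    owner_available: bool


def first_flow_blockers(c: Candidate) -> List[str]:
    """Return reasons a job is a poor first migration; an empty list means it qualifies."""
    blockers = []
    if not 2 <= c.source_count <= 5:
        blockers.append(f"{c.source_count} sources (want two to five)")
    if not c.has_known_consumers:
        blockers.append("no clear downstream consumers")
    if not c.logic_documented:
        blockers.append("business logic is not understood")
    if not c.has_measurable_output:
        blockers.append("no measurable output to compare")
    if not c.owner_available:
        blockers.append("no owner available to validate the result")
    return blockers


# Made-up examples: a small revenue job and a sprawling finance close.
small = Candidate("daily_revenue", 3, True, True, True, True)
huge = Candidate("finance_close", 60, True, False, True, False)
for job in (small, huge):
    print(job.name, first_flow_blockers(job) or "good first candidate")
```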

Step 3: map Pentaho concepts to Bruin concepts

Here is the mental model:

Pentaho concept	Bruin concept
Transformation step	SQL or Python asset logic
Job dependency	`depends` relationship
Database input	ingestr asset or SQL asset
File input	ingestr, Python materialization, or warehouse external table
Lookup step	SQL join or Python enrichment
Filter rows	SQL `WHERE` clause or Python transform
Output table	Asset materialization
Job schedule	Bruin Cloud schedule or CI/orchestrated run
Manual validation	Asset quality checks
Operational notes	Metadata, owners, tiers, documentation

The migration is not about recreating every visual step one-to-one. That is how you carry old complexity into the new system.

The better move is to recreate the business intent.
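
A first pass over the mapping can be scripted. The sketch below assumes a transformation file (`.ktr`) is XML with `<step>` elements that each carry a `<name>` and a `<type>` child, and that step type IDs look like `TableInput` or `FilterRows`. Treat both as assumptions and check them against your own exports. `STEP_TO_BRUIN` and `map_steps` are hypothetical names, and the lookup only restates the table above.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed PDI step type IDs mapped to the Bruin concepts from the table above.
STEP_TO_BRUIN = {
    "TableInput": "ingestr asset or SQL asset",
    "CsvInput": "ingestr, Python materialization, or warehouse external table",
    "TextFileInput": "ingestr, Python materialization, or warehouse external table",
    "StreamLookup": "SQL join or Python enrichment",
    "DBLookup": "SQL join or Python enrichment",
    "FilterRows": "SQL WHERE clause or Python transform",
    "TableOutput": "asset materialization",
}


def map_steps(ktr_xml: str) -> List[Tuple[str, str, str]]:
    """Return (step name, step type, suggested Bruin concept) for each step."""
    root = ET.fromstring(ktr_xml)
    mapped = []
    for step in root.findall("step"):
        name = step.findtext("name", default="")
        step_type = step.findtext("type", default="")
        # Anything not in the lookup needs a human to read the step.
        mapped.append((name, step_type, STEP_TO_BRUIN.get(step_type, "review manually")))
    return mapped


# A tiny hand-written sample, not a real export.
sample = """<transformation>
  <step><name>Read orders</name><type>TableInput</type></step>
  <step><name>Keep completed</name><type>FilterRows</type></step>
  <step><name>Write revenue</name><type>TableOutput</type></step>
</transformation>"""

for name, step_type, suggestion in map_steps(sample):
    print(f"{name} ({step_type}) -> {suggestion}")
```

The output is a starting list, not a design. Several steps often collapse into a single SQL asset once you write down what they were meant to do.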

Step 4: rebuild ingestion first

Move source extraction before transformation logic.

In Bruin, common ingestion jobs use ingestr:

name: raw.postgres_orders
type: ingestr
parameters:
  source_connection: postgres
  source_table: public.orders
  destination: snowflake
  incremental_strategy: merge
  incremental_key: updated_at

columns:
  - name: id
    type: integer
    primary_key: true
  - name: updated_at
    type: timestamp

For custom sources, use Python materialization instead. This is where Bruin is useful for old enterprise systems that do not fit a neat connector catalogue.

"""@bruin
name: raw.partner_export
type: python
connection: snowflake
materialization:
  type: table
  strategy: replace
@bruin"""

import pandas as pd

def materialize(**kwargs):
    export_path = kwargs["secrets"]["partner_export_path"]
    return pd.read_csv(export_path)

That might replace a Pentaho file input, FTP step, custom shell wrapper, or a weird export process around the edge of the PDI job.

Step 5: move transformations into SQL or Python

Most Pentaho transformations become SQL. Joins, filters, aggregations, date logic, standardization, deduplication, and reporting tables are usually clearer in SQL than in a visual canvas.

/* @bruin
name: marts.daily_revenue
type: sf.sql
depends:
  - raw.postgres_orders
owner: finance-analytics
materialization:
  type: table
meta:
  tier: gold
  migrated_from: pentaho
columns:
  - name: revenue_date
    type: date
    checks:
      - name: not_null
  - name: order_count
    type: integer
    checks:
      - name: non_negative
  - name: gross_revenue
    type: float
    checks:
      - name: non_negative
@bruin
*/

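-- One row per calendar day of completed orders.
-- The cast keeps revenue_date a DATE, matching the column definition above.
SELECT
  CAST(DATE_TRUNC('day', created_at) AS DATE) AS revenue_date,
  COUNT(*) AS order_count,
  SUM(amount) AS gross_revenue
FROM raw.postgres_orders
WHERE status = 'completed'
GROUP BY 1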

Use Python when the logic is actually Python-shaped: custom API calls, ML scoring, fuzzy matching, file parsing, complicated enrichment, or a proprietary library that already exists in your company.

The mistake is forcing everything into one language. Bruin lets SQL and Python depend on each other, so use the right tool for each part.

Step 6: add checks before comparing outputs

Do not wait until production to add quality checks.

Every migrated asset should have checks inside the asset definition:

columns:
  - name: id
    type: integer
    description: "Primary key"
    checks:
      - name: unique
      - name: not_null

This is the point of migrating to Bruin instead of just another ETL tool. The pipeline should say what healthy means.
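
To make "healthy" concrete, here is what the two checks above assert, written as plain Python over a list of rows. This is an illustration of the idea only, not how Bruin implements its checks; `not_null_failures` and `unique_failures` are hypothetical helper names.

```python
from collections import Counter
from typing import Any, Dict, List


def not_null_failures(rows: List[Dict[str, Any]], column: str) -> int:
    """Count rows where the column is missing or None."""
    return sum(1 for r in rows if r.get(column) is None)


def unique_failures(rows: List[Dict[str, Any]], column: str) -> List[Any]:
    """Return the values that appear more than once (None is ignored)."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(v for v, n in counts.items() if n > 1)


# Made-up rows with one duplicate id and one missing id.
rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": None}]
print(not_null_failures(rows, "id"))
print(unique_failures(rows, "id"))
```

A healthy asset returns zero and an empty list. Anything else should fail the run before a dashboard picks up the data.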

Step 7: run Pentaho and Bruin in parallel

Parallel runs are non-negotiable for anything important.

Compare:

Row counts by table
Freshness timestamps
Null rates on key fields
Primary key uniqueness
Aggregated metrics by day, region, product, customer segment, or whatever matters
Downstream dashboard numbers
Runtime and failure frequency

You are looking for two kinds of differences.

First, migration bugs. Maybe a filter moved incorrectly. Maybe a lookup joined on the wrong key. Fix those.

Second, old bugs. This is awkward, but it happens. You may discover the Pentaho job was wrong and everyone got used to the wrong output. Do not hide that. Write it down, get the business owner to approve the corrected logic, and add a check so it does not come back.
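
The comparison itself is easy to script. Below is a minimal sketch that uses an in-memory SQLite database as a stand-in for your warehouse; the table names, column names, `profile`, and `compare` are all hypothetical. It covers row counts, null keys, key uniqueness, and one daily aggregate, and it assumes the date column is populated.

```python
import sqlite3
from typing import Any, Dict, List


def profile(conn: sqlite3.Connection, table: str, key: str,
            date_col: str, metric: str) -> Dict[str, Any]:
    """Collect parity metrics for one table.

    Identifiers are interpolated into the SQL, so pass trusted names only.
    """
    row_count, null_keys, distinct_keys = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {key} IS NULL THEN 1 ELSE 0 END), "
        f"COUNT(DISTINCT {key}) FROM {table}"
    ).fetchone()
    null_keys = null_keys or 0  # SUM over an empty table is NULL
    daily = dict(conn.execute(
        f"SELECT {date_col}, ROUND(SUM({metric}), 2) FROM {table} GROUP BY {date_col}"
    ).fetchall())
    return {
        "row_count": row_count,
        "null_keys": null_keys,
        # Non-null rows minus distinct keys = rows that repeat a key.
        "duplicate_keys": row_count - null_keys - distinct_keys,
        "daily": daily,
    }


def compare(old: Dict[str, Any], new: Dict[str, Any],
            tolerance: float = 0.01) -> List[str]:
    """Return human-readable differences; an empty list means parity."""
    diffs = []
    for name in ("row_count", "null_keys", "duplicate_keys"):
        if old[name] != new[name]:
            diffs.append(f"{name}: pentaho={old[name]} bruin={new[name]}")
    for day in sorted(set(old["daily"]) | set(new["daily"]), key=str):
        a, b = old["daily"].get(day), new["daily"].get(day)
        if a != b and (a is None or b is None or abs(a - b) > tolerance):
            diffs.append(f"{day}: pentaho={a} bruin={b}")
    return diffs


# Made-up data: the Bruin output differs on one day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pentaho_orders (id INTEGER, order_date TEXT, amount REAL);
CREATE TABLE bruin_orders (id INTEGER, order_date TEXT, amount REAL);
INSERT INTO pentaho_orders VALUES (1, '2024-01-01', 10.0), (2, '2024-01-02', 7.0);
INSERT INTO bruin_orders VALUES (1, '2024-01-01', 10.0), (2, '2024-01-02', 8.0);
""")
old = profile(conn, "pentaho_orders", "id", "order_date", "amount")
new = profile(conn, "bruin_orders", "id", "order_date", "amount")
for line in compare(old, new):
    print(line)
```

Every line this prints is either a migration bug or an old bug. Keep the list, because it is also the evidence the business owner signs off on.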

Step 8: cut over one consumer at a time

Once Bruin matches or intentionally corrects the old output, cut over one downstream consumer.

Not all of them.

One report, one table, one team. Let it run. Watch it. Then expand.

The safest sequence is:

Bruin writes to a parallel schema.
Analysts validate the output.
One dashboard or consumer switches to Bruin output.
The old Pentaho output stays available for rollback.
After an agreed window, Bruin becomes the source of truth.
The Pentaho job is disabled, documented, and removed later.
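
One possible way to keep rollback cheap during that sequence is to put a view between the consumer and the data, then repoint the view. This is a technique added here for illustration, not something Bruin prescribes. The sketch uses SQLite as a stand-in, and the table and view names plus `point_consumer_at` are hypothetical. In a warehouse that supports it, a single `CREATE OR REPLACE VIEW` would do the same job.

```python
import sqlite3


def point_consumer_at(conn: sqlite3.Connection, view: str, target_table: str) -> None:
    """Repoint the consumer-facing view at target_table (cutover or rollback).

    Identifiers are interpolated into the SQL, so pass trusted names only.
    """
    conn.execute(f"DROP VIEW IF EXISTS {view}")
    conn.execute(f"CREATE VIEW {view} AS SELECT * FROM {target_table}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pentaho_daily_revenue (revenue_date TEXT, gross_revenue REAL);
CREATE TABLE bruin_daily_revenue (revenue_date TEXT, gross_revenue REAL);
INSERT INTO pentaho_daily_revenue VALUES ('2024-01-01', 100.0);
INSERT INTO bruin_daily_revenue VALUES ('2024-01-01', 100.0);
""")

# The dashboard always reads daily_revenue; only the view definition changes.
point_consumer_at(conn, "daily_revenue", "pentaho_daily_revenue")  # before cutover
point_consumer_at(conn, "daily_revenue", "bruin_daily_revenue")    # cutover
point_consumer_at(conn, "daily_revenue", "pentaho_daily_revenue")  # rollback if needed
print(conn.execute("SELECT * FROM daily_revenue").fetchall())
```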

Do the boring work. Future you will be grateful.

What Bruin Cloud adds after migration

Bruin CLI and ingestr handle the developer workflow. Bruin Cloud adds the enterprise layer:

Scheduling and orchestration
Run history and observability
Catalog and lineage
Asset tiers and meta-keys
SSO and RBAC
Audit logs
Cost visibility
DAC dashboards
MCP access for governed cloud operations from supported assistants
AI data analyst workflows in Slack, Microsoft Teams, browser, and other channels

This is the part many Pentaho migrations miss. They move ETL logic but do not improve governance. Then six months later they have the same operational mess in a newer tool.

The whole point is to leave with a better system.

A simple first-week plan

If you want to start this week:

Pick one Pentaho job.
Export its sources, destinations, schedule, and owner.
Create a Bruin project.
Rebuild ingestion with ingestr or Python.
Recreate the transformation in SQL or Python.
Add checks.
Run side by side.
Compare outputs with the business owner.

That is enough. Do not turn the first week into a platform strategy exercise.

Bottom line

Migrating from Pentaho to Bruin is not about making old jobs look modern. It is about making the data platform easier to understand, safer to change, and useful for AI-driven analysis.

Start with one flow. Preserve the business logic. Add checks. Prove parity. Then expand.

For the side-by-side comparison, read Pentaho vs Bruin. If you are still choosing the broader category, start with best data pipeline tools in 2026.