Score XGBoost models in BigQuery, no Python required

Most batch ML scoring pipelines waste data scientists' time pulling data into Python containers to do arithmetic the warehouse already does. Here is the trick, the receipts, and a copy-paste Jinja macro that translates an XGBoost model to SQL, validated on DuckDB.

Sabri Karagonen

Data & Product

For a long time I was overcomplicating model deployment in my head. A batch scoring run looked like this: pull millions of rows out of BigQuery into a Python container, call model.predict, write the scores back. Wire the container into the orchestrator. Add retries. Add monitoring. Get ML eng to review. The story I had absorbed was that deploying ML models requires ML engineering and more infrastructure knowledge than I had.

Then I noticed the obvious. ML models are just math. You can calculate an XGBoost prediction by hand, given enough patience. The same goes for any tree-based model and most regressions: they all reduce to arithmetic, and arithmetic translates to SQL.

If you can do the math on paper, you can do it in SQL. And for batch scoring, you almost always should.

You don't need a container to generate scores

Walk into most companies running ML in production and the batch scoring pipeline looks the same. There is a SQL job that builds the feature table in BigQuery. A scheduled Python container reads the feature table out of BigQuery, runs model.predict, and writes the scores back. A bit of orchestration glue. Maybe a Cloud Run service. Sometimes Vertex AI for fancier teams.

You already pay BigQuery to scan the feature table. Then you pay to move millions of rows over the network into a Python container. The container loads a model, runs arithmetic, and writes the predictions back. Three services and one network round trip on the critical path, for what should be one query. Why add extra cost and extra infrastructure?

Simpler infra is easier to maintain. So why is a Python service in the loop at all?

Trees are CASE WHENs

An XGBoost regressor with 50 trees is a sum of 50 functions plus a constant. Each function is a binary tree where every internal node says feature < some_number and every leaf is a number. The prediction is base_score + sum(tree_outputs).

A binary tree of inequalities has a one-to-one translation to nested CASE WHENs:

CASE WHEN credit_score < 650
  THEN CASE WHEN income < 50000 THEN -0.04 ELSE 0.07 END
  ELSE CASE WHEN employment_years < 2 THEN -0.02 ELSE 0.09 END
END

That is one tree. You write 50 of them, sum them, add the base score, and you have the regression prediction.
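
A minimal sketch of that sum for a toy two-tree regressor; the base score, leaf values, thresholds, and table name are made up:

SELECT
  user_id,
  ( 0.5                                                        -- base_score
    + CASE WHEN credit_score < 650 THEN -0.04 ELSE 0.07 END    -- tree 1
    + CASE WHEN employment_years < 2 THEN -0.02 ELSE 0.09 END  -- tree 2
  ) AS prediction
FROM analytics.user_features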

For a binary classifier with binary:logistic, each tree adds a contribution in logit space. Wrap the sum with a sigmoid and you get the probability:

1.0 / (1.0 + EXP(-margin)) AS prediction

The only subtle part is missing values. XGBoost stores a default direction for each split: when the feature is NULL, the model goes left or right depending on what was learned during training. That maps to at most one extra clause in the WHEN: OR feature IS NULL when missing values follow the THEN branch, nothing when they follow the ELSE branch.
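
With made-up leaf values, a split whose default direction is the low-score branch renders as:

CASE WHEN credit_score < 650 OR credit_score IS NULL
  THEN -0.04
  ELSE 0.07
END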

The macro

Three pieces. A Bruin macro that owns all the SQL-rendering logic. A small Python helper that extracts the tree structure from a trained XGBoost model. A Bruin asset that holds only the tree data and a one-line macro call. It supports reg:squarederror regression and binary:logistic classification. I validated it end-to-end on DuckDB, rendering through real gonja, against model.predict and model.predict_proba; the predictions match to about 1e-7.

The macro lives in macros/xgb.sql. A recursive walker emits nested CASE WHENs, then one macro per objective sums trees and adds the intercept (and a sigmoid for binary):

{%- macro parse_xgb_tree(n) -%}
{%- if n.v is defined -%}{{ n.v }}
{%- else -%}CASE WHEN {{ n.f }} < {{ n.t }}{% if n.m %} OR {{ n.f }} IS NULL{% endif %} THEN {{ parse_xgb_tree(n.c[0]) }} ELSE {{ parse_xgb_tree(n.c[1]) }} END
{%- endif -%}
{%- endmacro %}

{%- macro xgb_predict_regression(trees, intercept, primary_key, feature_table) -%}
SELECT
  {{ primary_key }},
  ({{ intercept }}{% for t in trees %} + ({{ parse_xgb_tree(t) }}){% endfor %}) AS prediction
FROM {{ feature_table }}
{%- endmacro %}

{%- macro xgb_predict_binary(trees, intercept, primary_key, feature_table) -%}
SELECT
  {{ primary_key }},
  1.0 / (1.0 + EXP(-({{ intercept }}{% for t in trees %} + ({{ parse_xgb_tree(t) }}){% endfor %}))) AS prediction
FROM {{ feature_table }}
{%- endmacro %}

The Python helper extracts each tree as a small nested dict with only the fields the macro needs (f feature, t threshold, m missing-goes-left, c children, v leaf value):

import json, math

def simplify(node):
    # Collapse a dumped tree node to just the fields the macro needs.
    if "leaf" in node:
        return {"v": float(node["leaf"])}
    return {
        "f": node["split"],
        "t": float(node["split_condition"]),
        "m": node["missing"] == node["yes"],  # True when missing values follow the THEN branch
        "c": [simplify(node["children"][0]), simplify(node["children"][1])],
    }

def extract(model, objective):
    booster = model.get_booster()
    trees = [simplify(json.loads(t)) for t in booster.get_dump(dump_format="json")]
    # base_score lives in the saved booster config
    raw = json.loads(booster.save_config())["learner"]["learner_model_param"]["base_score"]
    if isinstance(raw, str) and raw.startswith("["):
        raw = json.loads(raw)[0]  # newer XGBoost returns base_score as a JSON array
    p = float(raw)
    # for binary:logistic the base score is a probability, so convert it to logit space
    intercept = p if objective == "regression" else math.log(p / (1 - p))
    return trees, intercept

# usage
trees, intercept = extract(model, "binary")
print(f"{{% set trees = {json.dumps(trees, separators=(',', ':'))} %}}")
print(f"{{% set intercept = {intercept} %}}")

The asset has two {% set %} lines and one macro call:

/* @bruin
name: scored_users
type: bigquery.sql
materialization:
  type: table
@bruin */

{% set trees = [{"f":"credit_score","t":650,"m":true,"c":[{"v":-0.04},{"v":0.07}]},{"f":"income","t":50000,"m":false,"c":[{"v":-0.02},{"v":0.09}]}] %}
{% set intercept = 0.4636 %}

{{ xgb_predict_binary(trees, intercept, primary_key='user_id', feature_table='analytics.user_features') }}
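
With those two toy trees, the macro call renders to one query, roughly (the real output is a single line):

SELECT
  user_id,
  1.0 / (1.0 + EXP(-(0.4636
    + (CASE WHEN credit_score < 650 OR credit_score IS NULL THEN -0.04 ELSE 0.07 END)
    + (CASE WHEN income < 50000 THEN -0.02 ELSE 0.09 END)
  ))) AS prediction
FROM analytics.user_features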

When the model retrains, run python build_model_sql.py | pbcopy and paste the two {% set %} lines into the asset. The macro stays untouched, the asset gets fresh data, the rest of the pipeline does not change.

30 minutes to 30 seconds

At my previous company, the recommendation team had a batch scoring pipeline that took about 30 minutes per run. Pull features out of BigQuery, score in a Python container, write scores back. After we replaced the Python step with a SQL query generated from the trained model, the same job took 30 seconds. The NLP team picked up the same approach. So did the fraud team.

Speed was nice. The bigger change was that everything stayed in the warehouse. One SQL asset replaced a feature export, a Python service, and a write-back step. Onboarding a new model went from "stand up a container, wire up Vertex AI, get ML eng to review" to "regenerate a SQL file."

The real win for me is shorter development and deployment time. A full-stack data scientist can ship a scoring pipeline alone, because the only artifact in production is SQL. The model becomes a file, the pipeline becomes a query, and there is no service to maintain.

A nice side effect: most warehouses let you wrap a scoring expression as a SQL UDF. Once you do that, an analyst in Looker or Mode can call predict_default(user_id) from a dashboard. The model has gone from "ML eng asset" to "function anyone can use."
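
A minimal sketch in BigQuery, using the toy two-tree model above; here the UDF takes the feature columns as arguments rather than a user_id, and the function and column names are hypothetical:

CREATE OR REPLACE FUNCTION analytics.predict_default(credit_score FLOAT64, income FLOAT64)
RETURNS FLOAT64
AS (
  1.0 / (1.0 + EXP(-(0.4636
    + (CASE WHEN credit_score < 650 OR credit_score IS NULL THEN -0.04 ELSE 0.07 END)
    + (CASE WHEN income < 50000 THEN -0.02 ELSE 0.09 END)
  )))
);

-- a dashboard query can then call it directly
SELECT user_id, analytics.predict_default(credit_score, income) AS default_probability
FROM analytics.user_features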

Where this breaks

  • Big models break long queries. BigQuery caps resolved query length around 12 MB. A forest with thousands of deep trees will hit that. The fix is to split trees across multiple SQL views and sum them, or to register the scoring expression as a persistent UDF, but at that point you are doing real engineering and a serving stack might be cheaper.
  • Multiclass classification needs softmax. XGBoost emits n_estimators × n_classes trees, indexed by class. You sum per class, then apply softmax across classes. Same idea, more code; a minimal sketch follows this list.
  • Categorical features (enable_categorical=True) are not handled. XGBoost emits set-membership splits for those, not < thresholds. The macro needs an extra branch that renders feature IN (...) splits.
  • Real-time inference is the wrong workload for this. If you need a sub-100ms prediction on a single row in response to a user action, keep your serving stack.
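
For the multiclass case, a sketch of the shape, assuming three classes and made-up per-class trees (each margin is that class's intercept plus the sum of its trees):

WITH margins AS (
  SELECT
    user_id,
    0.1 + CASE WHEN credit_score < 650 THEN 0.3 ELSE -0.2 END AS margin_0,     -- class 0 trees
    0.1 + CASE WHEN income < 50000 THEN 0.2 ELSE -0.1 END AS margin_1,         -- class 1 trees
    0.1 + CASE WHEN employment_years < 2 THEN -0.3 ELSE 0.4 END AS margin_2    -- class 2 trees
  FROM analytics.user_features
)
SELECT
  user_id,
  EXP(margin_0) / (EXP(margin_0) + EXP(margin_1) + EXP(margin_2)) AS prob_class_0,
  EXP(margin_1) / (EXP(margin_0) + EXP(margin_1) + EXP(margin_2)) AS prob_class_1,
  EXP(margin_2) / (EXP(margin_0) + EXP(margin_1) + EXP(margin_2)) AS prob_class_2
FROM margins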

If you want multiclass or categorical splits packaged as a Bruin template, contact me and I will prepare it.

Delete the Python step

Look at the batch scoring pipeline you maintain. If it pulls features out of the warehouse, runs a Python container, and writes scores back, paste the macro above into your repo, point it at your trained model, and run the generated SQL against your feature table. Compare the predictions. Confirm they match. Then delete the Python service.
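
One way to do that comparison in the warehouse, assuming you have written the Python predictions to a table (here a hypothetical analytics.python_predictions keyed by user_id):

SELECT MAX(ABS(s.prediction - p.prediction)) AS max_abs_diff
FROM scored_users AS s
JOIN analytics.python_predictions AS p USING (user_id)
-- anything much above 1e-6 deserves a closer look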

The point is shorter development and deployment time. A full-stack data scientist ships the whole pipeline alone, without ML eng review or a separate serving stack. Your warehouse is your model server.