R Assets
Bruin brings R statistical computing capabilities to your data pipelines:
- Run R scripts with full access to R's powerful statistical and data analysis packages
- Automatic dependency management with renv integration
- Access to connection credentials and secrets via environment variables
- Execute complex statistical computations alongside SQL and Python assets
R assets allow you to leverage R's extensive ecosystem for statistical analysis, machine learning, data visualization, and more within your Bruin pipelines.
"@bruin
name: statistical_analysis
type: r
depends:
- raw_user_data
@bruin"
library(dplyr)
cat("Running R statistical analysis\n")
# Your R code here
results <- data.frame(
  metric = c("mean", "median", "sd"),
  value = c(42.5, 40.0, 5.2)
)
print(results)
Dependency Management
R assets support dependency management through renv, a widely used dependency management tool for R. Bruin searches for the closest renv.lock file in the file tree and automatically restores the environment with the specified packages.
For example, assume you have a file tree such as:
* folder1/
  * folder2/
    * analysis.r
    * renv.lock
  * folder3/
    * report.r
  * renv.lock
* folder4/
  * folder5/
    * folder6/
      * model.r
* renv.lock

- When Bruin runs analysis.r, it will use folder1/folder2/renv.lock since they are in the same folder
- For report.r, since there is no renv.lock in the same folder, Bruin goes up one level and finds folder1/renv.lock
- Similarly, renv.lock in the main folder is used for model.r since none of folder6, folder5, or folder4 have any renv.lock files
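
The lookup rule above can be sketched in a few lines of R. This is only an illustration of the "closest renv.lock" idea, not Bruin's actual implementation; the helper name `find_closest_lockfile` and the temporary demo directory are assumptions made for the example.

```r
# Hypothetical helper (not part of Bruin): walk upward from the asset's
# folder until an renv.lock is found or the search root is reached.
find_closest_lockfile <- function(asset_path, root) {
  root <- normalizePath(root, mustWork = TRUE)
  dir <- normalizePath(dirname(asset_path), mustWork = TRUE)
  repeat {
    candidate <- file.path(dir, "renv.lock")
    if (file.exists(candidate)) return(candidate)
    parent <- dirname(dir)
    # Stop at the search root or at the filesystem root.
    if (dir == root || parent == dir) return(NULL)
    dir <- parent
  }
}

# Rebuild the example tree in a temporary directory.
root <- file.path(tempdir(), "lockfile_demo")
for (d in c("folder1/folder2", "folder1/folder3", "folder4/folder5/folder6")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
files <- c(
  "folder1/folder2/analysis.r", "folder1/folder2/renv.lock",
  "folder1/folder3/report.r", "folder1/renv.lock",
  "folder4/folder5/folder6/model.r", "renv.lock"
)
invisible(file.create(file.path(root, files)))

# Print which lock file each asset resolves to.
assets <- c(
  "folder1/folder2/analysis.r",
  "folder1/folder3/report.r",
  "folder4/folder5/folder6/model.r"
)
for (asset in assets) {
  lock <- find_closest_lockfile(file.path(root, asset), root)
  cat(asset, "->", lock, "\n")
}
```

Running the sketch resolves analysis.r to the lock file in folder2, report.r to the one in folder1, and model.r to the one in the main folder, matching the three cases described above.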
Using renv
To create an renv.lock file for your R assets:
# In your R console, navigate to your asset directory
renv::init() # Initialize renv for the project
renv::install("dplyr") # Install packages you need
renv::install("ggplot2")
renv::snapshot() # Create renv.lock file
Manual Dependency Management
If you don't use renv.lock, you can manage dependencies directly in your R script using install.packages():
"@bruin
name: manual_deps_example
type: r
@bruin"
# Install the package only if it is not already available
if (!requireNamespace("jsonlite", quietly = TRUE)) {
  install.packages("jsonlite", repos = "https://cloud.r-project.org")
}
library(jsonlite)
# Your code here
Asset Definition
R assets use a multiline string with @bruin markers to define metadata in YAML format. This is similar to Python's approach but uses R's native string syntax:
"@bruin
name: asset_name
type: r
depends:
- upstream_asset1
- upstream_asset2
secrets:
  - key: MY_SECRET
    inject_as: R_SECRET
@bruin"
# Your R code starts here
cat("Hello from R!\n")
The configuration block must:
- Start with "@bruin on its own line (can also use single quotes: '@bruin)
- End with @bruin" on its own line (matching quote type)
- Contain valid YAML configuration between the markers
- Preserve proper YAML indentation
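
To make the marker rules concrete, here is a minimal sketch that checks a script for a well-formed block and returns the lines between the markers. It is not Bruin's parser; the function name `extract_bruin_block` is hypothetical, and it only checks the markers, not whether the enclosed text is valid YAML.

```r
# Hypothetical helper (not part of Bruin): return the lines between the
# @bruin markers, or NULL when no well-formed block is found.
extract_bruin_block <- function(lines) {
  trimmed <- trimws(lines, which = "right")
  # The opening marker must sit on its own line, with either quote type.
  open <- which(trimmed %in% c('"@bruin', "'@bruin"))
  if (length(open) == 0) return(NULL)
  open <- open[1]
  # The closing marker must use the same quote character as the opening one.
  quote_char <- substr(trimmed[open], 1, 1)
  close <- which(trimmed == paste0("@bruin", quote_char))
  close <- close[close > open]
  if (length(close) == 0) return(NULL)
  if (close[1] - open <= 1) return(character(0))
  lines[(open + 1):(close[1] - 1)]
}

script <- c(
  '"@bruin',
  "name: asset_name",
  "type: r",
  '@bruin"',
  'cat("Hello from R!\\n")'
)
print(extract_bruin_block(script))

# Mismatched quote types are rejected.
print(extract_bruin_block(c("'@bruin", "name: asset_name", '@bruin"')))
```

The second call returns NULL because the block opens with a single quote and closes with a double quote.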
All standard asset parameters are supported. See the SQL asset documentation for a complete list of available configuration options including:
- Dependencies (depends)
- Secrets and connections (secrets)
- Parameters (parameters)
- Columns and quality checks (columns)
- Custom checks (custom_checks)
- And more
Secrets and Connections
Secrets and connections are injected as environment variables in JSON format. See the secrets documentation for more details on how to define and use secrets.
"@bruin
name: r_with_secrets
secrets:
  - key: postgres_connection
    inject_as: DB_CONN
@bruin"
library(jsonlite)
# Access the secret from environment variable
connection_json <- Sys.getenv("DB_CONN")
conn_details <- fromJSON(connection_json)
# Use connection details
cat(sprintf("Connecting to: %s\n", conn_details$host))
Environment Variables
Bruin injects a set of environment variables into every R asset by default.
Built-in
The following environment variables are available in every R asset execution:
| Environment Variable | Description |
|---|---|
| BRUIN_START_DATE | The start date of the pipeline run in YYYY-MM-DD format (e.g. 2024-01-15) |
| BRUIN_START_DATETIME | The start date and time of the pipeline run in YYYY-MM-DDThh:mm:ss format (e.g. 2024-01-15T13:45:30) |
| BRUIN_START_TIMESTAMP | The start timestamp of the pipeline run in RFC3339 format with timezone (e.g. 2024-01-15T13:45:30.000000Z) |
| BRUIN_END_DATE | The end date of the pipeline run in YYYY-MM-DD format (e.g. 2024-01-15) |
| BRUIN_END_DATETIME | The end date and time of the pipeline run in YYYY-MM-DDThh:mm:ss format (e.g. 2024-01-15T13:45:30) |
| BRUIN_END_TIMESTAMP | The end timestamp of the pipeline run in RFC3339 format with timezone (e.g. 2024-01-15T13:45:30.000000Z) |
| BRUIN_RUN_ID | The unique identifier for the pipeline run |
| BRUIN_PIPELINE | The name of the pipeline being executed |
| BRUIN_FULL_REFRESH | Set to 1 when the pipeline is running with the --full-refresh flag, empty otherwise |
| BRUIN_THIS | The name of the R asset |
| BRUIN_ASSET | The name of the R asset (same as BRUIN_THIS) |
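
As a rough illustration of how these variables might be consumed, the sketch below parses the run window and the full-refresh flag. The `Sys.setenv()` call is an assumption added only so the snippet can be tried outside a pipeline; inside a Bruin run the values would already be present.

```r
# For a local dry run only: simulate values that Bruin would normally inject.
if (!nzchar(Sys.getenv("BRUIN_START_DATE"))) {
  Sys.setenv(
    BRUIN_START_DATE = "2024-01-15",
    BRUIN_END_DATE = "2024-01-16"
  )
}

# The date variables use the YYYY-MM-DD format, so they parse directly.
start_date <- as.Date(Sys.getenv("BRUIN_START_DATE"), format = "%Y-%m-%d")
end_date <- as.Date(Sys.getenv("BRUIN_END_DATE"), format = "%Y-%m-%d")

# BRUIN_FULL_REFRESH is "1" for a full refresh and empty otherwise.
full_refresh <- identical(Sys.getenv("BRUIN_FULL_REFRESH"), "1")

window_days <- as.integer(end_date - start_date)
cat(sprintf(
  "Processing %s to %s (%d days), full refresh: %s\n",
  format(start_date), format(end_date), window_days, full_refresh
))
```

Passing an explicit `format` to `as.Date()` means an unset variable yields `NA` instead of an error, which makes a missing value easier to detect and report.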
Pipeline
Bruin supports user-defined variables at a pipeline level. These become available as a JSON document in your R asset as BRUIN_VARS. When no variables exist, BRUIN_VARS is set to {}. See pipeline variables for more information on how to define and override them.
Here's an example:
"@bruin
name: r_with_variables
@bruin"
library(jsonlite)
# Access pipeline variables
vars_json <- Sys.getenv("BRUIN_VARS")
vars <- fromJSON(vars_json)
cat(sprintf("Environment: %s\n", vars$environment))
cat(sprintf("Region: %s\n", vars$region))
Examples
Basic R Script
The simplest R asset with no dependencies:
"@bruin
name: hello_r
type: r
@bruin"
cat("Hello from R!\n")
result <- 2 + 2
cat(sprintf("2 + 2 = %d\n", result))
Statistical Analysis with Dependencies
Using R packages for statistical analysis:
"@bruin
name: statistical_summary
depends:
- raw_data
@bruin"
library(dplyr)
cat("Performing statistical analysis\n")
# Generate sample data
data <- data.frame(
  value = rnorm(1000, mean = 50, sd = 10),
  category = sample(c("A", "B", "C"), 1000, replace = TRUE)
)
# Statistical summary
summary_stats <- data %>%
  group_by(category) %>%
  summarise(
    mean = mean(value),
    median = median(value),
    sd = sd(value),
    min = min(value),
    max = max(value)
  )
print(summary_stats)
Working with Database Connections
Accessing database credentials via environment variables:
"@bruin
name: db_analysis
secrets:
  - key: postgres-default
    inject_as: PG_CONN
@bruin"
library(jsonlite)
library(DBI)
library(RPostgres)
# Parse connection details
conn_json <- Sys.getenv("PG_CONN")
conn_details <- fromJSON(conn_json)
# Connect to database
con <- dbConnect(
  RPostgres::Postgres(),
  host = conn_details$host,
  port = conn_details$port,
  dbname = conn_details$database,
  user = conn_details$username,
  password = conn_details$password
)
# Query data; cast to int because RPostgres returns bigint columns as
# 64-bit integers by default, which sprintf("%d") does not handle
result <- dbGetQuery(con, "SELECT COUNT(*)::int AS count FROM users")
cat(sprintf("Total users: %d\n", result$count))
# Clean up
dbDisconnect(con)
Time-Series Analysis
Using R's extensive time-series capabilities:
"@bruin
name: time_series_forecast
depends:
- historical_data
@bruin"
library(forecast)
cat("Running time-series forecast\n")
# Example time series
ts_data <- ts(rnorm(100), frequency = 12, start = c(2020, 1))
# Fit ARIMA model
model <- auto.arima(ts_data)
# Forecast next 12 periods
forecast_result <- forecast(model, h = 12)
cat("Forecast complete!\n")
print(summary(forecast_result))
Installation
R assets require R to be installed on your system. Install R using one of these methods:
- macOS: brew install r
- Ubuntu/Debian: sudo apt-get install r-base
- Windows: Download from CRAN
- Other platforms: See CRAN installation guides
To verify R is installed correctly:
R --version
Advanced Configuration Example
Here's an example showing a more complex configuration:
"@bruin
name: comprehensive_analysis
type: r
depends:
- raw_user_data
- product_catalog
secrets:
  - key: postgres-analytics
    inject_as: DB_CONN
columns:
  - name: user_id
    type: integer
    checks:
      - name: not_null
      - name: unique
@bruin"
library(dplyr)
library(jsonlite)
# Access environment variables
db_json <- Sys.getenv("DB_CONN")
db <- fromJSON(db_json)
cat("Running comprehensive analysis\n")
# Your analysis code here
Best Practices
- Use renv for reproducibility: Create an renv.lock file to ensure consistent package versions across environments
- Use the string-based multiline format: The multiline string format (using "@bruin ... @bruin") makes complex configurations with dependencies, secrets, and parameters much easier to read and maintain
- Quote choice: Use double quotes " or single quotes '; both work, as long as the opening and closing quotes match
- Handle errors gracefully: Use R's error handling (tryCatch) to provide clear error messages
- Log progress: Use cat() or print() statements to provide visibility into your R script's execution
- Clean up resources: Always close database connections and file handles when done
- Test locally: Run your R scripts locally before integrating them into pipelines