SemiLayerDocs

Data Mapping — Transforms

After from resolves a value, transform can reshape it. Transforms are declarative, composable, and run once per row during ingest.

The shape

type TransformSpec = BuiltinTransform | BuiltinTransform[]

A single transform or an array. Arrays chain left-to-right — the output of one becomes the input of the next.

// Single
transform: { type: 'round', decimals: 2 }

// Chain
transform: [
  { type: 'trim' },
  { type: 'toNumber' },
  { type: 'round', decimals: 0 },
]
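The chaining rule can be sketched as a left-to-right reducer. This is a minimal illustration, not SemiLayer's implementation: `Step`, `applyStep`, and `applyChain` are hypothetical names, only three transforms are modeled, and rounding uses plain `Math.round` for brevity.

```typescript
// Hypothetical model of three built-in transforms; not SemiLayer's real types.
type Step =
  | { type: 'trim' }
  | { type: 'toNumber' }
  | { type: 'round'; decimals?: number };

// Apply one step to a value (simplified: no rounding modes, no logging).
function applyStep(value: unknown, step: Step): unknown {
  switch (step.type) {
    case 'trim':
      return value == null ? '' : String(value).trim();
    case 'toNumber':
      return Number(value);
    case 'round': {
      const factor = Math.pow(10, step.decimals ?? 0);
      return Math.round(Number(value) * factor) / factor;
    }
  }
}

// A spec is a single step or an array. Arrays run in order,
// and each step's output becomes the next step's input.
function applyChain(value: unknown, spec: Step | Step[]): unknown {
  const steps = Array.isArray(spec) ? spec : [spec];
  return steps.reduce(applyStep, value);
}

const result = applyChain('  41.7 ', [
  { type: 'trim' },
  { type: 'toNumber' },
  { type: 'round', decimals: 0 },
]);
// result is 42: '  41.7 ' → '41.7' → 41.7 → 42
```

Normalizing a single transform into a one-element array keeps one code path for both shapes of `TransformSpec`.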

The 14 built-in transforms

Type coercion

{ type: 'toString' }       // any → string. null → ''
{ type: 'toNumber' }       // string → number. non-numeric → NaN
{ type: 'toBoolean' }      // 'true' | '1' | 'yes' → true; anything else → false
{ type: 'toDate' }         // ISO string | epoch ms → Date; invalid → null

Use these when the source stores values in the wrong type. Common cases: a 'true' string from an old schema where booleans weren't native, or price_cents stored as an integer string by a JSON import.
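The toBoolean and toDate rules above can be sketched as plain functions. This is an illustration under assumptions: matching is taken to be exact and case-sensitive, a literal boolean true is assumed to stay true, and the function names are hypothetical. The real transforms may differ on those points.

```typescript
// Assumption: exact, case-sensitive matching; a literal `true` stays true.
function toBoolean(value: unknown): boolean {
  return value === true || value === 'true' || value === '1' || value === 'yes';
}

// ISO string or epoch milliseconds → Date; anything unparseable → null.
function toDate(value: unknown): Date | null {
  if (typeof value !== 'string' && typeof value !== 'number') return null;
  const parsed = new Date(value);
  return isNaN(parsed.getTime()) ? null : parsed;
}

const fromIso = toDate('2024-01-15T00:00:00Z'); // a valid Date
const fromEpoch = toDate(0);                    // 1970-01-01T00:00:00.000Z
const invalid = toDate('not a date');           // null
const no = toBoolean('no');                     // false, like any unlisted value
```

Note that under these rules a value such as 'Y' or 'TRUE' would map to false, so it is worth checking which spellings the source actually uses.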

Numeric

{ type: 'round' }                                    // default 0 decimals, banker's round
{ type: 'round', decimals: 2 }                       // 3.14159 → 3.14
{ type: 'round', decimals: 0, mode: 'ceil' }         // 3.01 → 4
{ type: 'round', decimals: 0, mode: 'floor' }        // 3.99 → 3

mode defaults to 'round', which uses banker's rounding (ties go to the nearest even digit). 'ceil' and 'floor' always round up and down, respectively.
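To show what banker's rounding means in practice, here is a minimal sketch of the three modes. `roundValue` is a hypothetical helper, and the tie tolerance `EPS` is an assumption added to absorb floating-point noise; the built-in transform may handle edge cases differently.

```typescript
type RoundMode = 'round' | 'ceil' | 'floor';

// Tolerance for detecting a tie (x.5) despite floating-point error. Assumption.
const EPS = 1e-9;

function roundValue(x: number, decimals = 0, mode: RoundMode = 'round'): number {
  const factor = Math.pow(10, decimals);
  const scaled = x * factor;
  if (mode === 'ceil') return Math.ceil(scaled) / factor;
  if (mode === 'floor') return Math.floor(scaled) / factor;

  const lower = Math.floor(scaled);
  const fraction = scaled - lower;
  if (Math.abs(fraction - 0.5) < EPS) {
    // Tie: pick whichever neighbor is even.
    return (lower % 2 === 0 ? lower : lower + 1) / factor;
  }
  return Math.round(scaled) / factor;
}

roundValue(2.5);              // 2  (tie goes to the even neighbor)
roundValue(3.5);              // 4
roundValue(3.14159, 2);       // 3.14
roundValue(3.01, 0, 'ceil');  // 4
roundValue(3.99, 0, 'floor'); // 3
```

The difference from everyday rounding only shows on exact ties: 2.5 becomes 2 rather than 3, which avoids a systematic upward bias when many values are summed.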

String — whitespace and casing

{ type: 'trim' }           // '  hello  ' → 'hello'. null → ''
{ type: 'lowercase' }      // 'HELLO' → 'hello'. null → ''
{ type: 'uppercase' }      // 'hello' → 'HELLO'. null → ''

String — length and replacement

{ type: 'truncate', length: 200 }
// 'very long description ...' → 'very long descripti...' (first 200 chars)

{ type: 'replace', pattern: 'http://', replacement: 'https://' }
// 'http://example.com' → 'https://example.com'

  • truncate requires length: number > 0.
  • replace is a global regex replace. pattern is a regex source string (no flags — g is implicit); escape backslashes if you need literal ones.
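Because pattern is a regex source rather than a literal string, regex metacharacters need escaping. The sketch below models replace with the standard RegExp constructor and the g flag; `replaceGlobal` is a hypothetical name, and the real transform may treat special replacement sequences differently.

```typescript
// Sketch: compile the pattern as a regex source with the global flag.
function replaceGlobal(value: string, pattern: string, replacement: string): string {
  return value.replace(new RegExp(pattern, 'g'), replacement);
}

// Every occurrence is replaced, not just the first.
const urls = replaceGlobal('http://a.com http://b.com', 'http://', 'https://');
// 'https://a.com https://b.com'

// '\\$' in source code is the two-character regex \$, a literal dollar sign.
const price = replaceGlobal('$19.99', '\\$', ''); // '19.99'

// An unescaped dot matches any character.
const unescaped = replaceGlobal('v1.2', '.', '-'); // '----'
const escaped = replaceGlobal('v1.2', '\\.', '-'); // 'v1-2'
```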

Array ↔ string

{ type: 'split', separator: ',' }
// 'tag1,tag2,tag3' → ['tag1', 'tag2', 'tag3']. null → []

{ type: 'join', separator: ' · ' }
// ['a', 'b', 'c'] → 'a · b · c'. non-array → String(value)

Both require separator: string.
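The edge cases noted in the comments above (null input for split, non-array input for join) can be sketched as follows. `splitValue` and `joinValue` are hypothetical names used only for illustration.

```typescript
// string → string[]; null or undefined → []
function splitValue(value: unknown, separator: string): string[] {
  return value == null ? [] : String(value).split(separator);
}

// array → string; anything else falls back to String(value)
function joinValue(value: unknown, separator: string): string {
  return Array.isArray(value) ? value.join(separator) : String(value);
}

const tags = splitValue('tag1,tag2,tag3', ','); // ['tag1', 'tag2', 'tag3']
const none = splitValue(null, ',');             // []
const spaced = splitValue('a, b', ',');         // ['a', ' b'] (parts are not trimmed)
const joined = joinValue(['a', 'b', 'c'], ' · '); // 'a · b · c'
const scalar = joinValue(42, ' · ');            // '42'
```

Since the parts are not trimmed in this sketch, a separator of ', ' (comma plus space) may be the better choice when the source data is formatted that way.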

Default value

{ type: 'default', value: 'Unknown' }
// null → 'Unknown'. undefined → 'Unknown'. 'actual' → 'actual'

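On its own, this is functionally equivalent to setting nullAs: 'Unknown' and undefinedAs: 'Unknown' on the field; use whichever reads better in context. default is most convenient inside a chain. Place it before string transforms such as trim, which turn null into '' and leave default nothing to replace: [{type: 'default', value: '—'}, {type: 'trim'}].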
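Where default sits in a chain matters, because the string transforms documented above map null to ''. The sketch below models just trim and default with hypothetical helper functions to show the two orderings.

```typescript
// Simplified models of two transforms, following the documented null handling.
const trimValue = (v: unknown): unknown => (v == null ? '' : String(v).trim());
const withDefault = (fallback: unknown) => (v: unknown): unknown =>
  v == null ? fallback : v;

// default first, then trim: the fallback is applied to the null input.
const defaultFirst = trimValue(withDefault('Unknown')(null)); // 'Unknown'

// trim first, then default: trim has already turned null into '',
// so default sees a non-null value and leaves it alone.
const trimFirst = withDefault('Unknown')(trimValue(null)); // ''
```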

Custom JavaScript

{
  type: 'custom',
  body: 'return value == null ? null : value.trim().replace(/\\s+/g, " ")',
}

A JS function body that receives (value, field, row) and returns the new value:

  • value — the post-resolution, post-prior-transforms value being mapped.
  • field — the output field name (string).
  • row — the entire raw source row (after null-sentinel replacement, before mapping). Useful when a transform needs to peek at another column.

// Example: compose a slug from another column (row.title) via custom
slug: {
  type: 'text',
  transform: {
    type: 'custom',
    body: `
      const base = (row.title || 'untitled').toLowerCase()
      return base.replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '')
    `,
  },
}
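One way to picture how a body string becomes a callable is the standard Function constructor. This is an assumption about the mechanism, not a description of the SemiLayer worker; `compileCustom` and `CustomFn` are hypothetical names.

```typescript
type CustomFn = (
  value: unknown,
  field: string,
  row: Record<string, unknown>,
) => unknown;

// Assumption: the body is wrapped as function (value, field, row) { ...body }.
function compileCustom(body: string): CustomFn {
  return new Function('value', 'field', 'row', body) as CustomFn;
}

// Inside a quoted string, '\\s' is needed to get the regex \s in the body.
const collapse = compileCustom(
  'return value == null ? null : value.trim().replace(/\\s+/g, " ")',
);
const collapsed = collapse('  a   b ', 'title', {}); // 'a b'

const makeSlug = compileCustom(`
  const base = (row.title || 'untitled').toLowerCase()
  return base.replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '')
`);
const slug = makeSlug(undefined, 'slug', { title: 'Hello, World!' }); // 'hello-world'
const fallback = makeSlug(undefined, 'slug', {});                     // 'untitled'
```

Trying a body locally like this is a cheap way to catch syntax errors before ingest, where a failure would only produce a log entry.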

Chaining left-to-right

Arrays run in order: left-to-right, or top-to-bottom when written one step per line. Each transform's output is the next one's input.

{
  type: 'number',
  from: 'raw_price',
  transform: [
    { type: 'toString' },                   // 19.995 (number) → '19.995'; strings pass through
    { type: 'trim' },                       // ' $19.995 ' → '$19.995'; no-op if already trimmed
    { type: 'replace', pattern: '\\$', replacement: '' },  // '$19.995' → '19.995'; no-op without a $
    { type: 'toNumber' },                   // '19.995' → 19.995
    { type: 'round', decimals: 2 },         // 19.995 → 20.00
  ],
}

This is a real-world chain: messy input (money that sometimes has a $, sometimes doesn't, and arrives sometimes as a string, sometimes as a number) is normalized to a rounded number.
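To see how the chain treats different messy inputs, the five steps can be written out as one plain function. `normalizePrice` is a hypothetical stand-in that mirrors the steps with standard JavaScript calls; it uses plain `Math.round` and so does not model tie handling.

```typescript
function normalizePrice(raw: string | number): number {
  const asString = String(raw);                               // toString
  const trimmed = asString.trim();                            // trim
  const noDollar = trimmed.replace(new RegExp('\\$', 'g'), ''); // replace
  const asNumber = Number(noDollar);                          // toNumber
  return Math.round(asNumber * 100) / 100;                    // round, decimals: 2
}

const a = normalizePrice('$19.50');    // 19.5
const b = normalizePrice(' 19.5 ');    // 19.5
const c = normalizePrice(7);           // 7
const d = normalizePrice('$4.256');    // 4.26
const e = normalizePrice('$1,299.00'); // NaN: the comma survives and breaks parsing
```

The last case is a reminder that the chain only handles the messiness it was written for; thousands separators would need an additional replace step.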

When transforms run

Per row, during ingest, after from resolution and nullAs/undefinedAs substitution:

source_row
    ↓
null-sentinel replacement
    ↓
from resolution  → raw_value
    ↓
undefinedAs / nullAs  → resolved_value
    ↓
transform chain  → final_value  ← runs here
    ↓
embedding + index

So:

  • A custom transform that reads row.some_other_column sees the raw source value (before that other field's mapping ran).
  • A transform acting on a merge: 'concat' result sees the joined string, not the parts.
  • Whether a transform can still receive null or undefined depends on whether nullAs / undefinedAs are set on the field. Most transforms are null-tolerant (they return '', [], or null as appropriate).
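The ordering above can be sketched as a small per-field function. Everything here is hypothetical (`FieldMapping`, `mapField`, and the function-valued transform), and null-sentinel replacement is assumed to have already happened to the row; the point is only to show which value each stage sees.

```typescript
type Row = Record<string, unknown>;

interface FieldMapping {
  from: string;
  nullAs?: unknown;
  undefinedAs?: unknown;
  transform?: (value: unknown, field: string, row: Row) => unknown;
}

function mapField(field: string, mapping: FieldMapping, row: Row): unknown {
  let value = row[mapping.from];                                   // from resolution
  if (value === null && 'nullAs' in mapping) value = mapping.nullAs;
  if (value === undefined && 'undefinedAs' in mapping) value = mapping.undefinedAs;
  // The transform runs last and receives the raw row, not other mapped fields.
  return mapping.transform ? mapping.transform(value, field, row) : value;
}

const sourceRow: Row = { first: ' Ada ', last: null };

const firstName = mapField(
  'first',
  { from: 'first', transform: (v) => String(v).trim() },
  sourceRow,
);
// firstName is 'Ada', but sourceRow.first is still ' Ada '

const label = mapField(
  'label',
  {
    from: 'last',
    nullAs: 'n/a',
    transform: (v, _field, r) => `${String(r['first'])}/${String(v)}`,
  },
  sourceRow,
);
// label is ' Ada /n/a': the raw, untrimmed first column plus the nullAs substitute
```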

Never-throws contract

Transforms never throw. If a transform encounters something it can't handle (e.g. { type: 'round' } on a string that doesn't parse), the original value is returned unchanged and an internal log entry is emitted.

This is deliberate. Ingest processes millions of rows; one bad value should not kill the batch. The tradeoff is that silently-wrong data becomes your job to catch — see Mapping recipes for patterns like "coerce-then-check-with-a-second-field."
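The contract can be pictured as a try/catch wrapper around each transform. This is a sketch under assumptions: `safeApply`, `strictRound`, and the in-memory `logs` array are hypothetical, and the real log destination is internal to SemiLayer.

```typescript
type Transform = (value: unknown) => unknown;

const logs: string[] = [];

// Run a transform; on failure, record a log entry and return the input unchanged.
function safeApply(value: unknown, transform: Transform): unknown {
  try {
    return transform(value);
  } catch (err) {
    logs.push(`transform failed: ${String(err)}`);
    return value;
  }
}

// A deliberately strict transform that throws on unparseable input.
const strictRound: Transform = (v) => {
  const n = Number(v);
  if (isNaN(n)) throw new Error(`cannot round ${JSON.stringify(v)}`);
  return Math.round(n);
};

const good = safeApply('3.7', strictRound); // 4
const bad = safeApply('abc', strictRound);  // 'abc' passes through; one log entry

// Downstream guard: the field was meant to be numeric, so check the type.
const isSuspect = typeof bad !== 'number';  // true
```

A type check like the last line is one simple way to surface pass-through values after ingest instead of relying on the log alone.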

Validation

semilayer push validates transform shapes:

  • type must be one of the 14 built-in values.
  • split / join / replace / truncate / custom require their required params.
  • round.decimals must be a non-negative integer; round.mode must be 'round' | 'ceil' | 'floor'.
  • custom.body must be a string. (Its contents are not parsed or checked at push time; the body runs in the worker.)

Not validated: whether custom.body is well-formed JS, whether a replace pattern is a valid regex, or whether a transform chain ends up producing the right type. These problems surface only at ingest time, where the transform logs an entry quietly and passes the original value through.
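The shape checks listed above can be approximated locally with a small validator. This is a sketch, not the logic of semilayer push: `validateTransform` is a hypothetical function, and treating both pattern and replacement as required for replace is an assumption.

```typescript
const BUILTIN_TYPES = [
  'toString', 'toNumber', 'toBoolean', 'toDate', 'round', 'trim', 'lowercase',
  'uppercase', 'truncate', 'replace', 'split', 'join', 'default', 'custom',
];
const ROUND_MODES = ['round', 'ceil', 'floor'];

// Returns a list of shape errors; an empty list means the shape is acceptable.
function validateTransform(t: Record<string, unknown>): string[] {
  const type = t['type'];
  if (typeof type !== 'string' || BUILTIN_TYPES.indexOf(type) === -1) {
    return [`unknown transform type: ${String(type)}`];
  }
  const errors: string[] = [];
  const needString = (key: string): void => {
    if (typeof t[key] !== 'string') errors.push(`${type}.${key} must be a string`);
  };
  if (type === 'split' || type === 'join') needString('separator');
  if (type === 'replace') {
    needString('pattern');
    needString('replacement'); // assumption: replacement is also required
  }
  if (type === 'custom') needString('body');
  if (type === 'truncate') {
    const len = t['length'];
    if (typeof len !== 'number' || !(len > 0)) {
      errors.push('truncate.length must be a number > 0');
    }
  }
  if (type === 'round') {
    const decimals = t['decimals'];
    if (
      decimals !== undefined &&
      (typeof decimals !== 'number' || decimals < 0 || Math.floor(decimals) !== decimals)
    ) {
      errors.push('round.decimals must be a non-negative integer');
    }
    const mode = t['mode'];
    if (mode !== undefined && (typeof mode !== 'string' || ROUND_MODES.indexOf(mode) === -1)) {
      errors.push("round.mode must be 'round' | 'ceil' | 'floor'");
    }
  }
  return errors;
}

validateTransform({ type: 'round', decimals: -1 });       // one error
validateTransform({ type: 'split' });                     // one error: missing separator
validateTransform({ type: 'custom', body: 'return (' });  // no errors: body is not parsed
```

The last call shows the gap described above: a malformed body passes shape validation, so testing custom bodies and regex patterns before pushing remains the author's responsibility.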

Next: Recipes — common mapping patterns, production-tested.