
How Thyme Works

The Thyme data path from Python to query results.

This page explains the data path from the perspective of someone using the platform, without going into implementation details such as RocksDB internals or Kafka partition assignment.


The big picture

Your Python file
      ↓  thyme commit
 Control Plane          ← stores definitions, creates topics
      ↓  (on startup / change)
   Engine               ← reads from source, computes aggregations
      ↓  (continuous)
 Feature Store          ← keyed by entity + timestamp
      ↓  (on query)
 Query Server           ← runs extractors, returns JSON
      ↓  (response)
 Your model / app

Step 1: You write Python

You define datasets, pipelines, featuresets, and sources in a Python file. Importing the file does no data processing; the decorators only register metadata.

@dataset(index=True)
class Transaction: ...

@dataset(index=True)
class UserStats:
    @pipeline(version=1)
    @inputs(Transaction)
    def compute(cls, t): ...

@featureset
class UserFeatures: ...
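The registration-only pattern these decorators rely on can be sketched as follows. This is an illustrative sketch, not Thyme's actual implementation — the registry structure and decorator internals here are invented:

```python
# Hypothetical sketch of registration-only decorators: importing the
# module records metadata but runs no data-path code.
REGISTRY = {"datasets": [], "featuresets": []}

def dataset(index=False):
    def wrap(cls):
        REGISTRY["datasets"].append({"name": cls.__name__, "index": index})
        return cls  # the class is returned unchanged; nothing executes yet
    return wrap

def featureset(cls):
    REGISTRY["featuresets"].append({"name": cls.__name__})
    return cls

@dataset(index=True)
class Transaction: ...

@featureset
class UserFeatures: ...

print([d["name"] for d in REGISTRY["datasets"]])    # ['Transaction']
print([f["name"] for f in REGISTRY["featuresets"]]) # ['UserFeatures']
```

Decorating a class adds an entry to the registry and hands the class back untouched, which is why importing the file is side-effect-free apart from bookkeeping.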

Step 2: thyme commit sends definitions to the control plane

When you run thyme commit features.py, the CLI:

  1. Imports your Python file, triggering all the decorators
  2. Serializes the registered definitions into a structured payload
  3. POSTs the payload to the control plane at :8080

The control plane stores your definitions in Postgres and creates a Kafka topic for each dataset. From this point on, the engine knows what to run.
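Conceptually, the commit step boils down to serializing the registered metadata and POSTing it. The payload shape below is an invented illustration, not Thyme's actual wire format:

```python
import json

# Hypothetical commit payload: field names are assumptions for illustration.
def build_commit_payload(registry):
    return json.dumps({
        "datasets": registry["datasets"],
        "featuresets": registry["featuresets"],
    }, sort_keys=True)

registry = {
    "datasets": [{"name": "Transaction", "index": True}],
    "featuresets": [{"name": "UserFeatures"}],
}
payload = build_commit_payload(registry)
# The CLI would then send this to the control plane, e.g.:
#   POST http://localhost:8080/... with `payload` as the body
print(payload)
```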


Step 3: The engine picks up the job and starts streaming

The engine runs continuously in the background. When it sees new job definitions from the control plane, it:

  1. Spins up a source connector for each dataset that has a @source attached; the connector polls your external system (e.g., an Iceberg table) on the configured interval and publishes new rows to Kafka
  2. Spins up a pipeline runner for each @pipeline definition; the runner consumes from Kafka, applies windowed aggregations, and writes results to the feature store

The engine keeps running, continuously updating feature values as new events arrive. You don't manage this process; it runs until its definition changes.
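The windowed aggregation a pipeline runner performs can be sketched as a keyed fold over tumbling windows. The window size, event shape, and store layout here are all illustrative assumptions, not Thyme's internals:

```python
from collections import defaultdict

# Hypothetical sketch: maintain a per-entity sum in 1-hour tumbling windows.
WINDOW = 3600  # window width in seconds (an assumed configuration)

def window_start(ts):
    # Align an event timestamp to the start of its tumbling window.
    return ts - (ts % WINDOW)

store = defaultdict(float)  # (entity_id, window_start) -> running sum

def consume(event):
    key = (event["uid"], window_start(event["ts"]))
    store[key] += event["amount"]

for e in [{"uid": "user_42", "ts": 100,  "amount": 10.0},
          {"uid": "user_42", "ts": 200,  "amount": 5.0},
          {"uid": "user_42", "ts": 3700, "amount": 1.0}]:
    consume(e)

print(store[("user_42", 0)])     # 15.0  (both events in the first hour)
print(store[("user_42", 3600)])  # 1.0   (the event in the second hour)
```

Each incoming event updates only the window it falls into, keyed by entity, which is what lets the feature store serve values "keyed by entity + timestamp."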


Step 4: You query features

When your model or application needs feature values, it queries the query server at :8081:

GET /features?featureset=UserFeatures&uid=user_42

The query server:

  1. Looks up the featureset definition from Postgres
  2. Resolves the extractor DAG — which extractors need to run and in what order
  3. Reads aggregated values from the feature store (keyed by entity ID)
  4. Runs the extractor code in Python, passing the raw aggregated values
  5. Returns the composed feature values as JSON
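Step 2 above — resolving the extractor DAG — amounts to a topological sort over extractor dependencies. A sketch with invented extractor names, using the standard library rather than Thyme's own resolver:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each extractor lists the extractors
# whose outputs it consumes (names invented for illustration).
deps = {
    "txn_count_1h": [],
    "avg_amount_1h": [],
    "risk_score": ["txn_count_1h", "avg_amount_1h"],
}

order = list(TopologicalSorter(deps).static_order())
# Dependencies always come before the extractors that consume them,
# so risk_score runs last.
assert order.index("risk_score") > order.index("txn_count_1h")
print(order)
```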

Online vs offline queries

The same query endpoint supports both modes:

Online (latest value):

GET /features?featureset=UserFeatures&uid=user_42

Offline / point-in-time (value as of a specific timestamp):

GET /features?featureset=UserFeatures&uid=user_42&ts=2024-01-15T12:00:00Z

Because Thyme uses event time throughout, point-in-time queries return exactly the feature value that would have been served at that moment. This makes offline training datasets consistent with online serving — eliminating training/serving skew.
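A point-in-time lookup reduces to finding the latest value at or before the requested timestamp in an entity's event-time history. A minimal sketch — the sorted-list storage layout is an assumption, not a description of the feature store:

```python
from bisect import bisect_right

# Hypothetical per-entity history: (event_time, value) pairs in event-time order.
history = [(100, 1), (200, 3), (300, 7)]

def value_as_of(history, ts):
    """Return the latest value whose event time is <= ts, or None if
    the query timestamp predates the first event."""
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i else None

print(value_as_of(history, 250))  # 3     (the value live at ts=250)
print(value_as_of(history, 300))  # 7     (events at exactly ts count)
print(value_as_of(history, 50))   # None  (before any event)
```

Because lookups are by event time rather than processing time, replaying the same query with a historical timestamp reproduces exactly what online serving would have returned.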


What you don't manage

  • Kafka consumers and consumer groups
  • State store compaction and eviction
  • Checkpoint recovery after restarts
  • Topic creation and partition assignment
  • Watermark advancement and late-event handling

These are handled by the engine. Your job is to define what the features are, not how to compute them at scale.


For implementers

If you're contributing to Thyme or want to understand the internal architecture in depth, see docs/architecture-walkthrough.md and the ADRs in docs/decisions/ in the repository.
