Read & Write Paths
Two paths, one definition. How Thyme keeps online and offline features consistent.
Thyme runs two paths over the same feature definition. The write path continuously ingests events, applies windowed aggregations, and stores state keyed by entity and event time. The read path answers queries - online and point-in-time - by reading that same state and running extractors on top.
Both paths share a single source of truth: the feature store, an event-time-keyed state written transactionally by the engine and read non-blocking by the query server. Because online and offline queries read the same materialized state - not a copy, not a reconciled view - they return the same numbers.
Write path
The write path is always running. It is not a batch job; there is no schedule.
- Sources. Every dataset has a source - streaming (Kafka, Kinesis) or polling (Postgres, Iceberg, S3, BigQuery, Snowflake). Polling sources track a cursor field and publish new rows onto a per-dataset Kafka topic as they arrive.
- Raw Dataset. The landing zone for each source. Event time and entity key are extracted here so all downstream operators see canonical records.
- Windowed Pipeline. Your
@pipelinedecorator definesgroupby+aggregateover one or more time windows. Supported operators areSum,Count,Avg,Min,Max, andApproxPercentile. Sum/Count/Avg are invertible - old records falling out of a window can be subtracted without a recompute. ApproxPercentile uses tiled t-digest sketches, so percentile rank is ready at query time, not computed on the fly. - Aggregated Dataset. The pipeline's output is itself a queryable, event-time-keyed dataset. This is what your featureset reads from.
- Commit. Each pipeline tick commits atomically - output state, internal accumulator state, and source progress advance together, paired with a transactional Kafka produce - giving end-to-end exactly-once semantics.
You do not manage Kafka consumers, topic partitions, watermark advancement, late-event handling, or checkpoint recovery. You declare the feature; the engine runs it.
Read path
The read path runs on demand, against the same state the write path produced.
- Query Server. An HTTP service receives a featureset query. Optional
ts=makes the query point-in-time; omitted, it serves the latest value. - Extractor DAG. Your
@extractorfunctions form a DAG. The server resolves dependencies, topologically sorts them, and runs the Python extractors in-process - no network hop to a separate Python worker. - Featureset response. Composed features are returned as JSON. P50 server-side latency is well under 1 millisecond in production, because all state is read from a local replica of the feature store with no network reads.
- Query-run record. After responding, the server spawns a detached task that records a metadata-only audit row and returns the run's UUID on the
X-Query-Run-Idresponse header. The SDK surfaces this asThymeResult.query_run_id; the CLI prints aResults:link to the web UI. Persistence is best-effort - a failed audit insert never fails the user's query. See Query Runs for what's captured and what isn't.
Why this shape
The write path and the read path share one source of truth. Neither is a projection of the other. That is the structural reason training/serving skew doesn't exist in Thyme:
- A training job issuing a point-in-time query hits the same state, through the same extractor code, as an online request issued during inference.
- There is no batch snapshot to reconcile against a streaming job; there is no streaming job to reconcile against a batch snapshot. There is one pipeline, two modes of query.
- New features go live as soon as the pipeline advances. No backfill coordination, no offline-online deployment dance.
For the contracts this shape gives you - exactly-once processing, watermark and lateness behaviour, online/offline parity - see Durability & Consistency.