Real-Time Price Anomaly Detection

Distribution-based anomaly detection using streaming sketches and the ApproxPercentile operator.

The problem

Online marketplaces have a pricing integrity problem. When a seller suddenly lists a $25 product at $200, is it scarcity, a compromised account, or ranking manipulation?

The hard part isn't spotting a $200 price. It's knowing that $200 is abnormal for that specific product, given its own pricing history. A $200 listing for a luxury item is normal. A $200 listing for a phone case is a red flag. You need that answer in milliseconds, at checkout.

The feature answers one question: "Where does this product's recent pricing sit relative to its own historical distribution?" Concretely, it is the percentile rank of the product's 7-day maximum price within its 180-day price distribution.
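
As a point of reference, the same quantity can be computed exactly over a raw price history. The sketch below is plain Python, not Thyme code. The function name `percentile_rank` and the sample prices are hypothetical, and it assumes rank means "fraction of observations at or below the value", which may differ from the definition ApproxPercentile uses.

```python
from bisect import bisect_right

def percentile_rank(history, value):
    """Fraction of prices in `history` that are <= `value` (0.0 to 1.0)."""
    if not history:
        return None
    ordered = sorted(history)
    return bisect_right(ordered, value) / len(ordered)

# Hypothetical 180-day price history for a phone case, plus the last 7 days.
prices_180d = [24.0, 25.0, 25.0, 26.0, 25.5, 24.5, 25.0, 27.0, 25.0, 200.0]
prices_7d = [25.0, 200.0]

max_price_7d = max(prices_7d)
rank = percentile_rank(prices_180d, max_price_7d)   # rank of the 7-day max
typical = percentile_rank(prices_180d, 25.0)        # rank of an ordinary price
```

Sorting the full history on every request is exactly the cost that rules this approach out at checkout time.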

Why this feature is hard

  • Pre-computed batch percentiles go stale - won't catch a spike from 10 minutes ago
  • Simple thresholds (flag anything above 2× average) produce false positives on high-variance products
  • Min/max heuristics assume uniform distributions - real prices are skewed and multimodal

What you need is a compact, incrementally-updatable summary of the distribution - the kind of structure provided by streaming sketches like t-digest.
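
To make "compact, incrementally-updatable summary" concrete, here is a minimal sketch using a fixed-bin histogram. It is a deliberately simpler stand-in: it is not t-digest and not Thyme's implementation, and the class name, bin layout, and price range are assumptions chosen for illustration.

```python
class HistogramSketch:
    """Fixed-size summary of a value distribution (a simple stand-in, not t-digest)."""

    def __init__(self, lo, hi, bins=500):
        self.lo, self.hi, self.bins = lo, hi, bins
        self.counts = [0] * bins
        self.total = 0

    def _bin(self, value):
        # Clamp out-of-range values into the first or last bin.
        index = int((value - self.lo) * self.bins / (self.hi - self.lo))
        return min(max(index, 0), self.bins - 1)

    def add(self, value):
        # O(1) update per event; memory stays at `bins` counters.
        self.counts[self._bin(value)] += 1
        self.total += 1

    def rank(self, value):
        # Approximate fraction of observed values at or below `value`.
        if self.total == 0:
            return None
        return sum(self.counts[: self._bin(value) + 1]) / self.total


sketch = HistogramSketch(lo=0.0, hi=500.0, bins=500)
for price in [24.0, 26.0, 25.5, 24.5, 27.0, 23.0, 26.5, 25.5, 24.0]:
    sketch.add(price)
sketch.add(200.0)  # the suspicious listing

typical_rank = sketch.rank(25.5)
spike_rank = sketch.rank(200.0)
```

The summary never grows beyond 500 counters however many events arrive. Unlike a windowed operator, this toy version never expires old events, and its accuracy is limited by the bin width, which is the kind of trade-off sketches like t-digest are designed to handle better.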

The Thyme solution

from datetime import datetime
from thyme import (
    ApproxPercentile, Config, Max, dataset, extractor, extractor_inputs,
    extractor_outputs, feature, featureset, field, inputs, pipeline, source,
)
from thyme.dataset import Field

config = Config.load()
bookings_source = config.postgres_source(table="product_bookings")

@source(bookings_source, cursor="timestamp", every="5s", max_lateness="1h")
@dataset(version=1)
class ProductBooking:
    product_id: Field[str]      = field(key=True)
    price:      Field[float]    = field()
    timestamp:  Field[datetime] = field(timestamp=True)

@dataset(version=1, index=True)
class ProductPriceStats:
    product_id:          Field[str]      = field(key=True)
    max_price_7d:        Field[float]    = field()
    price_pct_rank_180d: Field[float]    = field()
    timestamp:           Field[datetime] = field(timestamp=True)

    @pipeline(version=1)
    @inputs(ProductBooking)
    def compute_stats(cls, bookings):
        return bookings.groupby("product_id").aggregate(
            max_price_7d=Max(of="price", window="7d"),
            price_pct_rank_180d=ApproxPercentile(of="price", window="180d"),
        )

@featureset
class PriceFeatures:
    product_id:          str   = feature()
    max_price_7d:        float = feature(ref=ProductPriceStats.max_price_7d)
    price_pct_rank_180d: float = feature(ref=ProductPriceStats.price_pct_rank_180d)
    price_decile:        int   = feature()

    @extractor
    @extractor_inputs("price_pct_rank_180d")
    @extractor_outputs("price_decile")
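    def compute_decile(cls, ts, pct_rank):
        # No rank available yet: fall back to a mid-range decile.
        if pct_rank is None:
            return 5
        # Map a rank in [0, 1] to deciles 1..10; min() keeps a rank of 1.0 in decile 10.
        return min(int(pct_rank * 10) + 1, 10)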
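
The decile mapping in compute_decile is easy to check in isolation. The standalone function below mirrors that logic in plain Python with no Thyme imports (`decile_from_rank` is a hypothetical name) and applies it to the rank values reported in the materialized features table.

```python
def decile_from_rank(pct_rank):
    """Map a percentile rank in [0, 1] to a decile in 1..10; None falls back to 5."""
    if pct_rank is None:
        return 5
    return min(int(pct_rank * 10) + 1, 10)

deciles = {
    "p_spike": decile_from_rank(0.9977),
    "p_cheap": decile_from_rank(0.5688),
    "p_premium": decile_from_rank(0.5082),
    "no_history": decile_from_rank(None),
    "rank_of_one": decile_from_rank(1.0),  # would be 11 without the min() cap
}
```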

How it works

The ApproxPercentile operator maintains a streaming sketch of the value distribution per entity. As new price events arrive, they update the sketch incrementally - no full recomputation, no historical data scans. The percentile rank is pre-computed at write time and stored alongside the other aggregates, so the read path returns a single float per entity. In the benchmark below, end-to-end reads take 0.66 ms at P50 and 2.83 ms at P99.
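
The write-time versus read-time split can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not how Thyme is implemented: `PriceRankStore` is a hypothetical class, an exact sorted list stands in for the streaming sketch, there is no window expiry, and the rank is taken for the latest price rather than the 7-day max.

```python
from bisect import bisect_right, insort

class PriceRankStore:
    """Toy write-time materialization: rank is computed on write, read is a lookup."""

    def __init__(self):
        self._history = {}  # product_id -> sorted prices (stand-in for a sketch)
        self._rank = {}     # product_id -> latest precomputed percentile rank

    def write(self, product_id, price):
        prices = self._history.setdefault(product_id, [])
        insort(prices, price)  # incremental update, no rescan of raw events
        self._rank[product_id] = bisect_right(prices, price) / len(prices)

    def read(self, product_id):
        # Read path: one dictionary lookup returning a single float (or None).
        return self._rank.get(product_id)


store = PriceRankStore()
for price in [24.0, 25.0, 26.0, 25.0]:
    store.write("p_case", price)
rank_before = store.read("p_case")

store.write("p_case", 200.0)  # the spike is reflected as soon as it is written
rank_after = store.read("p_case")
```

All the work happens in write, so the cost of read does not depend on how much history the product has.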

Production results

Materialized features

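Product   | max_price_7d | pct_rank_180d | Decile | Interpretation
----------|--------------|---------------|--------|-----------------------------------------------
p_spike   | $220.81      | 0.9977        | 10     | 99.8th percentile - clear anomaly
p_cheap   | $31.87       | 0.5688        | 6      | Near median - normal
p_premium | $620.58      | 0.5082        | 6      | Near median of its own distribution - normal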

p_premium at $620 is perfectly normal (close to the 50th percentile of its own distribution), even though $620 would trigger any naive "high price" threshold.

AWS production results (500k events, Graviton c7g.xlarge)

Metric                | Value
----------------------|----------------------------------------
Feed rate             | 1,672 events/sec (sustained over 5 min)
E2E throughput        | 1,110 events/sec
Read P50 (E2E)        | 0.66 ms
Read P95 (E2E)        | 1.95 ms
Read P99 (E2E)        | 2.83 ms
Sustained QPS         | 1,625
Online/offline parity | 100%
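
For readers who want to produce the same P50/P95/P99 figures from their own measurements, the sketch below uses the nearest-rank method. The helper name and the latency samples are made up for illustration, and the source does not say which percentile method the benchmark used.

```python
import math

def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    ordered = sorted(samples)
    k = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[k - 1]

# Synthetic read latencies in milliseconds (not the benchmark data).
latencies_ms = [0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 1.2, 1.9, 2.8]

p50 = nearest_rank_percentile(latencies_ms, 50)
p95 = nearest_rank_percentile(latencies_ms, 95)
p99 = nearest_rank_percentile(latencies_ms, 99)
```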

Performance charts

[Charts not reproduced here: events processed over time; sustained write throughput; query latency distribution; read latency vs SLA targets; write latency; extractor execution time.]

Comparison with other platforms

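Capability             | Thyme            | Tecton                 | Fennel        | Zipline (Airbnb)
-----------------------|------------------|------------------------|---------------|---------------------
Approximate percentile | ApproxPercentile | approx_percentile      | Not supported | ApproxPercentile
Pre-computed rank      | Yes (write-time) | Via on-demand features | N/A           | Via derived features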

Reproducing this on your own data

Point ProductBooking at your own bookings or pricing source, commit, and the per-product percentile rank is live within seconds.

thyme commit features.py
curl -H "Authorization: Bearer $THYME_API_KEY" \
    "$THYME_BASE_URL/features?entity_id=p_spike&featureset=PriceFeatures"

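The same read can be issued from Python with only the standard library. This is a sketch that mirrors the curl command above; the helper name and the fallback placeholder values are hypothetical, and the response format is not documented here, so the example stops at building the request.

```python
import os
import urllib.parse
import urllib.request

def build_feature_request(base_url, api_key, entity_id, featureset):
    """Build the same GET request as the curl command, without sending it."""
    query = urllib.parse.urlencode({"entity_id": entity_id, "featureset": featureset})
    return urllib.request.Request(
        f"{base_url}/features?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )

request = build_feature_request(
    # Placeholder defaults so the snippet runs without the environment set.
    os.environ.get("THYME_BASE_URL", "https://thyme.example.com"),
    os.environ.get("THYME_API_KEY", "dummy-key"),
    entity_id="p_spike",
    featureset="PriceFeatures",
)
# To send it against a real deployment: urllib.request.urlopen(request).read()
```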