Monitoring
The Grafana dashboards and metrics surfaced on the in-product Monitoring page.
Thyme's web UI has a /monitoring page that embeds a set of Grafana dashboards covering the running platform. This guide describes what each dashboard shows so you can quickly find the right view when something looks off.
Dashboards
| Dashboard | What it shows |
|---|---|
| System Health Overview | High-level status of all services, error rates, uptime |
| Engine Performance | Events processed per second, write latency, backfill progress |
| Query Server Performance | Query latency percentiles (P50 / P95 / P99), QPS, extractor execution time |
| Definition Service | Commit history, definition counts, topic-creation events |
| Load Test | Write throughput, read latency vs SLA targets, online/offline parity |
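For readers unfamiliar with the P50 / P95 / P99 notation used in the table above, the following is a minimal sketch of a nearest-rank percentile over raw latency samples. It is only an illustration of what the numbers mean: the sample values are made up, and the dashboards may well derive percentiles differently (for example, from aggregated histogram buckets).

```python
import math


def percentile(samples, p):
    """Return the nearest-rank p-th percentile (0 < p <= 100) of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value such that at least p% of the
    # samples are less than or equal to it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up query latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)
```

With only ten samples, the single outlier dominates both P95 and P99 while P50 stays near the typical latency, which is why the high percentiles are the ones to watch for tail regressions.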
When to use which dashboard
- You committed something and queries return 404 or null: start with the Definition Service dashboard, then check Engine Performance to see whether the new pipeline has produced output.
- Latency regressed: start with Query Server Performance, then drill into specific featuresets via Query Runs.
- Throughput looks low: check Engine Performance for events/sec per worker, then check the Sources page for stuck cursors.
- Something is on fire and you don't know what: start with System Health Overview and follow the red.
Where the metrics come from
Every Thyme service exposes Prometheus metrics on /metrics:
| Service | Path | Key metrics |
|---|---|---|
| Definition Service | :8080/metrics | Commit counts, commit latency, topic creation |
| Engine | :8081/metrics | Events processed, aggregation latency, write latency |
| Query Server | :8081/metrics | Query latency (P50/P95/P99), extractor execution time, QPS |
These endpoints do not require authentication. Your platform team scrapes them into Prometheus and renders the dashboards above; you view the results through the web UI.
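If you want to look at the raw numbers behind a dashboard, the endpoints can be read directly. The sketch below parses the Prometheus text exposition format into a dictionary. It is a rough illustration, not Thyme tooling: the metric names in `SAMPLE` are hypothetical, and the host in the usage note is an assumption (only the ports and the `/metrics` path come from the table above).

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition format into {sample: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "<name>{<labels>} <value> [timestamp]".
        # Split after the closing brace so spaces inside label values
        # do not confuse the parser.
        if "}" in line:
            end = line.rindex("}") + 1
            key, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            key, rest = parts[0], parts[1:]
        samples[key] = float(rest[0])
    return samples


def fetch_metrics(url: str, timeout: float = 5.0) -> dict[str, float]:
    """Fetch a /metrics endpoint and parse it (no auth header needed)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical metric names, for illustration only.
SAMPLE = """\
# HELP thyme_events_processed_total Hypothetical counter.
# TYPE thyme_events_processed_total counter
thyme_events_processed_total{worker="0"} 1200
thyme_events_processed_total{worker="1"} 950
thyme_write_latency_seconds_sum 3.5
"""

for name, value in parse_metrics(SAMPLE).items():
    print(name, value)
```

Against a live service the call would look like `fetch_metrics("http://<engine-host>:8081/metrics")`, where `<engine-host>` is a placeholder for wherever your platform team runs the Engine.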