Time Series Rollups and Aggregation at Scale

Many sites provide statistics for their users’ content. Play counts, page views, API calls. Usually presented as a graph over time. This is easy when aggregating over the raw events data. But that’s neither particularly fast nor scalable. As soon as the number of events or the time window gets too big this approach is no longer feasible.

Today there are purpose-built time series databases that handle much of the heavy lifting. But underneath, they all rely on the same fundamental patterns. Understanding those patterns helps you pick the right tool, configure it properly, and know when to build something custom.

The landscape

The number of time series databases has exploded over the past decade. Here’s a rough map.

TimescaleDB extends PostgreSQL with continuous aggregates that automatically maintain pre-computed rollups. As new data arrives, only affected time buckets are recalculated. If you’re already running Postgres, it’s the easiest on-ramp.

ClickHouse takes a different angle. Its materialized views trigger on insert, shifting aggregation work from query time to write time. You define a raw table, a rollup table, and a view that connects them. Fast column-oriented storage makes it popular for analytics workloads.

InfluxDB was built specifically for time series from the start. Version 3.x includes a downsampler plugin that aggregates measurements over configurable time intervals using functions like avg, sum, min, max. It also supports external stream processing through Kafka-based pipelines.

VictoriaMetrics positions itself as a faster, more scalable drop-in replacement for Prometheus. While Prometheus itself has no built-in downsampling or long-term storage, extensions like Thanos add exactly that: automatic downsampling (5-minute resolution after 40 hours, 1-hour after 10 days) plus object storage for retention beyond what a single Prometheus instance can hold.

For large-scale batch processing of time series, Apache Spark and libraries like Flint offer distributed computation across clusters. This is the territory where you’re processing terabytes of historical data rather than serving real-time dashboards.

On the managed side, AWS Timestream offers a serverless time series database that automatically moves data between in-memory and cold storage tiers based on age.

Different tools, same underlying ideas.

The fundamentals

Every one of these systems implements some variation of three core concepts: pre-aggregation, rollup strategies, and resolution tradeoffs.

Pre-aggregation

Querying raw events is O(n) where n is the number of events. For a billion events, that’s a billion rows to scan. Pre-aggregation collapses events into time buckets ahead of time. Instead of counting a million page views for March 2024, you store a single counter: 2024-03 = 1000000. Query time becomes nearly constant regardless of how many events actually occurred.

Rollup strategies

A natural choice is to roll up data into multiple time bucket granularities. By creating counters per different time units, the number of operations can be kept at a minimum while still providing fast aggregation over large time windows.

Calculating an aggregate then becomes as simple as adding a few counters. It’s the sum of the most covering aggregates that fit into the given time interval.

Getting the total count for the interval from 2008-11-30 until 2011-02-03 means getting 8 values and adding them.

Choosing the time units defines the read pattern and the number of counters needed. A natural combination would be year+month+day but year+week+day could work too. Even just rolling up into day buckets might be enough. For an informed decision, check against real user queries and calculate the required storage.

Resolution tradeoffs

More granularity means more storage. Fewer rollup levels mean faster writes but slower reads for large time ranges. Most systems let you define retention policies: keep full-resolution data for 30 days, hourly rollups for a year, daily rollups forever. This is exactly what Thanos does for Prometheus, and what TimescaleDB’s continuous aggregate policies automate.

The right balance depends entirely on the use case. Real-time dashboards need second-level granularity for the last hour but nobody needs per-second data from three years ago.

The fundamentals cover the read path, but the events still need to be rolled up. How that happens has evolved significantly.

Lambda architecture

The simplest approach is a batch job that periodically recomputes all rollups from raw events. But pure batch creates lag. The Lambda architecture closes this gap by running two parallel systems. A batch layer provides rollups up until a certain point in time. Everything past that is served from a fast in-memory layer.

Combining Rollups and Real Time Counters

The in-memory layer can even be transient since the eventual source of truth is always the batch layer. SoundCloud used a similar architecture for their play count statistics, backed by Cassandra for storage and Memcached for caching. Their system called “Stitch” would regenerate aggregated data buckets through daily and monthly batch jobs while a speed layer handled real-time updates using distributed counters.

The downside of Lambda is maintaining two separate code paths for the same business logic. The batch pipeline and the streaming pipeline need to produce consistent results, and keeping them in sync is tricky to get right.

The batch layer in practice

A naive approach is to increment counters as events arrive. A single event fans out into multiple writes:

event at 2011-03-14 =>
    2011 += 1
    2011-03 += 1
    2011-03-14 += 1

This puts a lot of write pressure on the store. An increment is not just a write but also a read. And all increments should ideally happen in a single transaction.

Another option is to generate the counters from log files using a MapReduce job. The mapper emits data for per-track and global counters:

def mapper(lines):
    result = []
    for date, track in lines:
        result.append((track, date))   # per-track counter
        result.append(("*", date))     # global counter
    return result

After mapping, the data gets sorted. The reducer detects changes in the incoming data, does the rollups and emits the final counts:

def reducer(sorted_lines):
    prev = {"d": "", "m": "", "y": "", "t": ""}
    count = {"d": 0, "m": 0, "y": 0}

    for track, date in sorted_lines:
        d, m, y = date[:10], date[:7], date[:4]

        if prev["t"]:
            if d != prev["d"]:
                emit(prev["t"], prev["d"], count["d"])
                count["d"] = 0
            if m != prev["m"]:
                emit(prev["t"], prev["m"], count["m"])
                count["m"] = 0
            if y != prev["y"]:
                emit(prev["t"], prev["y"], count["y"])
                count["y"] = 0
            if track != prev["t"]:
                count = {"d": 0, "m": 0, "y": 0}

        count["d"] += 1
        count["m"] += 1
        count["y"] += 1
        prev = {"d": d, "m": m, "y": y, "t": track}

    if prev["t"]:
        emit(prev["t"], prev["d"], count["d"])
        emit(prev["t"], prev["m"], count["m"])
        emit(prev["t"], prev["y"], count["y"])

Ideally this job creates data ready for bulk import into a fast store that can then serve web requests in a timely manner.

The partitioning problem

In a distributed MapReduce setup, a partitioner splits data across machines. For this to work correctly, all events for a given track within a full year must reach the same reducer. Otherwise you can’t calculate complete yearly rollups.

This is fine for per-track counters. But for global counters (the "*" key), it means all events of an entire year funnel through a single machine. That’s a hard bottleneck. Adding more machines to the cluster won’t help because one reducer still has to see everything.

The fix is a secondary job. The first pass computes partial rollups per partition. The second pass combines them. It adds complexity, so whether it’s worth it depends on the data volume and how fast the batch needs to complete.

Kappa architecture

The Kappa architecture, proposed by Jay Kreps, simplifies things by eliminating the batch layer entirely. Instead of maintaining two pipelines, all data flows through a single stream processing system. Historical reprocessing is handled by replaying the event log from the beginning. One codebase, one set of semantics. With modern stream processing frameworks like Apache Flink or Kafka Streams, this has become practical at significant scale.

But Kappa has real limitations. Reprocessing from raw events works well for recent data and low-cardinality dimensions. It breaks down when you need aggregations over long time ranges or across high-cardinality keys like per-user statistics across millions of users. Good luck deriving yearly rollups from raw events when the event log holds billions of entries. The throughput and parallelism requirements for a Kappa system can be substantial, and replaying months or years of history through a stream processor is expensive if it’s even feasible.

The right choice depends on the use case. Kappa works well for systems where the time window is bounded, the cardinality is manageable, and you can afford the infrastructure to keep up with replay demands. Lambda remains the better fit when you need deep historical aggregations over high-cardinality data, where a batch layer can churn through the volume efficiently in a way that stream replay cannot.

Choosing your approach

For most teams, a purpose-built time series database is the right answer. TimescaleDB if you want to stay in the PostgreSQL ecosystem, ClickHouse for heavy analytics, InfluxDB or VictoriaMetrics for metrics and monitoring.

Custom rollup pipelines still make sense when your data doesn’t fit neatly into the time series model, when you need full control over the aggregation logic, or when you’re operating at a scale where managed solutions become prohibitively expensive.

Either way, the fundamentals don’t change. Pre-aggregate. Roll up into time buckets. Trade resolution for speed and storage. These patterns were solid in 2011 and they’re solid now. The tools just got a lot better at hiding the machinery.