Notes on roll-up effectiveness in Apache Druid

I thought I’d share some lessons I learned over the last few years about ingestion-time aggregation in Druid, and how cardinality affects it. It’s a little process I go through mentally, and I hope it’s helpful!

What is roll-up?

In case you don't know what roll-up is: it takes incoming rows in Druid and aggregates them, emitting metrics from the measures that you have. There's native roll-up, which you turn on in your streaming ingestion spec, and then there's the modern Druid approach to roll-up, which is – in plain English – a GROUP BY in the INSERT statement for SQL-based batch ingestion.

You might choose to emit the usual MAX, MIN, and so on as metrics, or something more hypercool like a data sketch to speed up approximation operations. In native ingestion, all of that goes in the metricsSpec section of your ingestion specification; with INSERT you just use the usual SQL aggregate functions.
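
To make that concrete, here's a rough sketch of the SQL-based flavour – the table and column names are invented, and in practice you'd often be reading from an EXTERN source rather than an existing datasource:

INSERT INTO "network-traffic-rollup"
SELECT TIME_FLOOR("__time", 'PT1M') AS "__time",  -- truncate timestamps to one-minute buckets (more on this in the next section)
   "deviceType",
   "appName",
   COUNT(*) AS "events",
   SUM("bytesSent") AS "totalBytesSent",
   MAX("responseMs") AS "maxResponseMs"
FROM "network-traffic-raw"
GROUP BY 1, 2, 3
PARTITIONED BY DAY

A sketch aggregator from the druid-datasketches extension (DS_HLL, DS_THETA, DS_QUANTILES_SKETCH and friends) could sit alongside those aggregates if approximation is what you're after.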

Awesome! Now you can reduce your 10m rows per second of people surfing TikTok on your office WiFi network (I will not enter into the debate as to whether TikTok connects to WiFi) to just 10 per second, giving you, ahead of time, the aggregates that you would otherwise have computed afresh with every query.

The efficiency of this operation is governed by the same thing as any GROUP BY: the cardinality of the dimensions in your SELECT. In Druid’s case, that’s the source data columns that you list in your ingestion specification or INSERT statement. So be cautious!

Timestamp truncation

A discrete piece of functionality in Druid is to automatically truncate incoming timestamps. That’s done by specifying a queryGranularity in the ingestion spec or by using a suitable time function in your INSERT.
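
As a quick sanity check, you can see what that truncation does to a single timestamp with TIME_FLOOR in the Druid console (the timestamp here is just an example):

SELECT TIME_FLOOR(TIMESTAMP '2023-06-27 09:03:42', 'PT5M') AS "truncatedTime"

That 09:03:42 comes back floored to 09:00:00.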

Here’s an example data set where queryGranularity is set to FIVE_MINUTE. For every incoming event, Druid has truncated the timestamp (just like a TIME_FLOOR).

Time  | Name  | Dance
09:00 | Peter | Fandango
09:00 | John  | Fandango
09:00 | Peter | Fandango
09:05 | John  | Fandango
09:05 | John  | Fandango
09:05 | John  | Fandango
09:05 | Peter | Waltz
09:05 | Peter | Waltz

This is a necessary thing for effective roll-up – if you did a GROUP BY on the raw timestamp, you'd end up with a row for every millisecond (worst case).

Now let’s use our eye-minds and think about the roll-up.

Periodic single-dimension cardinality

Imagine that we're going to add a COUNT metric. Each column has low cardinality within each time period, so we get a nice aggregation: 8 rows went in, 4 came out.

Time  | Name  | Dance    | Count
09:00 | Peter | Fandango | 2
09:00 | John  | Fandango | 1
09:05 | John  | Fandango | 3
09:05 | Peter | Waltz    | 2
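
Expressed as a GROUP BY over a hypothetical raw table, that roll-up amounts to something like this:

SELECT TIME_FLOOR("__time", 'PT5M') AS "Time",
   "Name",
   "Dance",
   COUNT(*) AS "Count"
FROM "dances-raw"
GROUP BY 1, 2, 3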

But what about this one:

Time  | Name  | Dance
09:00 | Peter | Fandango
09:00 | Mary  | Fandango
09:00 | Tom   | Fandango
09:05 | Brian | Fandango
09:05 | Peter | Fandango
09:05 | Mary  | Fandango
09:05 | Tom   | Fandango
09:05 | Terry | Fandango

Notice that, within the five-minute buckets (our queryGranularity-truncated timestamp), every single event relates to a different dancer. When the GROUP BY kicks in, those 8 incoming rows get emitted as 8 rows: the GROUP BY is across all of the dimensions.

Time  | Name  | Dance    | Count
09:00 | Peter | Fandango | 1
09:00 | Mary  | Fandango | 1
09:00 | Tom   | Fandango | 1
09:05 | Brian | Fandango | 1
09:05 | Peter | Fandango | 1
09:05 | Mary  | Fandango | 1
09:05 | Tom   | Fandango | 1
09:05 | Terry | Fandango | 1

Periodic multi-dimension cardinality

And there’s a second scenario: lots of combinations of values.

Time  | Name   | Dance
09:00 | Peter  | Fandango
09:00 | Mary   | Polka
09:00 | Mary   | Vogue
09:05 | Brian  | Fandango
09:05 | Lucy   | Waltz
09:05 | Claire | Fandango
09:05 | Sian   | Waltz
09:05 | Terry  | Waltz

Here there are just too many combinations of values in each five-minute interval: every row is a unique combination of dancer and dance.

Time  | Name   | Dance    | Count
09:00 | Peter  | Fandango | 1
09:00 | Mary   | Polka    | 1
09:00 | Mary   | Vogue    | 1
09:05 | Brian  | Fandango | 1
09:05 | Lucy   | Waltz    | 1
09:05 | Claire | Fandango | 1
09:05 | Sian   | Waltz    | 1
09:05 | Terry  | Waltz    | 1

Periodic hierarchy

One cause of this combined cardinality problem could be a data hierarchy. Let’s imagine that Peter is King of the Fandango and Voguing. (Well done, Peter.) John, meanwhile, is King of the Foxtrot, Waltz, and Paso Doble. (That is, a parent-child relationship.)

Time  | Teacher | Dance
09:00 | Peter   | Fandango
09:00 | John    | Foxtrot
09:00 | Peter   | Vogue
09:05 | Peter   | Fandango
09:05 | John    | Foxtrot
09:05 | John    | Waltz
09:05 | Peter   | Fandango
09:05 | John    | Paso Doble

The roll-up ends up looking like this:

Time  | Teacher | Dance      | Count
09:00 | Peter   | Fandango   | 1
09:00 | John    | Foxtrot    | 1
09:00 | Peter   | Vogue      | 1
09:05 | Peter   | Fandango   | 2
09:05 | John    | Foxtrot    | 1
09:05 | John    | Waltz      | 1
09:05 | John    | Paso Doble | 1

Here, the roll-up is less effective because each teacher (the parent) has their own distinct set of dances (the children), and it’s very unlikely that they’d repeat the same dance in the same roll-up period.

You can look at data you have ingested already to get a feel for its profile.

Some lovely SQL

If you've got your data in Druid already, you can find the number of rows in a one-hour period simply by using the Druid console:

SELECT COUNT(*) AS rowCount
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR

Finding cardinality is very easy as well:

SELECT COUNT (DISTINCT "your-column") AS columnCardinality
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR

Of course, you can do more than one at once, but just be cautious - on large datasets this can swamp your cluster…

SELECT COUNT (DISTINCT "your-column-1") AS column1Cardinality,
   COUNT (DISTINCT "your-column-2") AS column2Cardinality,
   :
   :
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR

Even more useful is the ratio of unique values to total rows for a column. If you have the wikipedia edits sample data loaded, try this query, which gives you the ratio for just a few of the columns in the data set. (Notice that the __time WHERE clause is absent this time.)

SELECT CAST(COUNT (DISTINCT channel) AS FLOAT) / COUNT(*) AS channelRowratio,
   CAST(COUNT (DISTINCT cityName) AS FLOAT) / COUNT(*) AS cityNameRowratio,
   CAST(COUNT (DISTINCT comment) AS FLOAT) / COUNT(*) AS commentRowratio,
   CAST(COUNT (DISTINCT countryIsoCode) AS FLOAT) / COUNT(*) AS countryIsoCodeRowratio,
   CAST(COUNT (DISTINCT countryName) AS FLOAT) / COUNT(*) AS countryNameRowratio,
   CAST(COUNT (DISTINCT diffUrl) AS FLOAT) / COUNT(*) AS diffUrlRowratio,
   CAST(COUNT (DISTINCT flags) AS FLOAT) / COUNT(*) AS flagsRowratio,
   CAST(COUNT (DISTINCT isAnonymous) AS FLOAT) / COUNT(*) AS isAnonymousRowratio,
   CAST(COUNT (DISTINCT isMinor) AS FLOAT) / COUNT(*) AS isMinorRowratio,
   CAST(COUNT (DISTINCT isNew) AS FLOAT) / COUNT(*) AS isNewRowratio,
   CAST(COUNT (DISTINCT isRobot) AS FLOAT) / COUNT(*) AS isRobotRowratio
FROM "wikipedia"

Columns with a ratio approaching 1 are the main cause of poor roll-up – in the wikipedia dataset, that’s clearly the diffUrl column. At the other end of the scale are the columns whose queries suffer the most because of that poor roll-up – like the wikipedia sample data columns that start with is.

The next step – working out whether the data has high compound cardinality – is trickier. I adapted the queries above to look at row counts and cardinality per time bucket, and then at combinations of dimensions – say, the dancer and the dance.

SELECT TIME_FLOOR("__time", 'PT5M') AS "timeBucket",
    COUNT(*) AS rowCount,
    COUNT (DISTINCT columnName) AS columnCardinality
FROM "datapipePMRawEvents"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1
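
To look at compound cardinality directly – say, the dancer and the dance together – one rough option is to concatenate the dimensions and count the distinct combinations per time bucket. A sketch, using my made-up dancing table and columns:

SELECT TIME_FLOOR("__time", 'PT5M') AS "timeBucket",
   COUNT(*) AS rowCount,
   COUNT (DISTINCT "Name" || '|' || "Dance") AS comboCardinality
FROM "dances-raw"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1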

In my case there were two ways forward. One was to shrink the number of dimensions – perhaps by having two tables, one with all the fields and one with a commonly-used subset designed to take advantage of roll-up – so I could query whichever table suited a particular use case. The other was to attack the cardinality problem itself: maybe with a data sketch (if the query pattern is COUNT DISTINCT, set operations, or quantiles), though I’ve also heard of people using clever expressions, from simple truncation to an if-then-else that creates a new, lower-cardinality dimension.
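
Here's a rough illustration of that second idea, with invented names again. A CASE expression collapses the high-cardinality Name column into a handful of buckets, and a sketch keeps an approximate distinct count of the original values – the sketch line assumes the druid-datasketches extension is loaded, and with SQL-based ingestion you'd typically set finalizeAggregations to false in the query context so the sketch itself gets stored:

INSERT INTO "dances-rollup"
SELECT TIME_FLOOR("__time", 'PT1H') AS "__time",
   -- collapse the high-cardinality Name into a handful of buckets
   CASE WHEN "Name" IN ('Peter', 'John') THEN "Name" ELSE 'Other' END AS "dancerGroup",
   "Dance",
   COUNT(*) AS "Count",
   -- keep an approximate distinct count of the original values
   APPROX_COUNT_DISTINCT_DS_HLL("Name") AS "uniqueDancers"
FROM "dances-raw"
GROUP BY 1, 2, 3
PARTITIONED BY DAY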

In my case I could then increase the queryGranularity to HOUR, and I ended up with just one row instead of hundreds. That was particularly important as I was working on a clickstream project – and that has tonnes of data!

There was one more option – creating different tables with some common filters already applied – which reduced the row count and had a big impact on cardinality too.
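
A sketch of that idea, once more with invented names – a table that only ever holds the waltzes, so both the row count and the cardinality the roll-up has to cope with are much smaller:

INSERT INTO "dances-waltz-only"
SELECT TIME_FLOOR("__time", 'PT1H') AS "__time",
   "Name",
   COUNT(*) AS "Count"
FROM "dances-raw"
WHERE "Dance" = 'Waltz'
GROUP BY 1, 2
PARTITIONED BY DAY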

So there you have it: remember to conceptualise and test your GROUP BY on the raw data, and remember cardinality and hierarchies.

Huzzah!

Ingestion parallelism

There's another important effect on this operation, and it requires thinking about how the ingestion is actually executed.

Ingestion tasks can be run in a single thread – but that would kinda defeat the purpose of a shared-nothing, microservices way of doing things. Instead, Druid lets you set the number of sub-tasks that will actually go and do the ingestion (maxNumConcurrentSubTasks for parallel batch, taskCount for streaming).

Now, in MSQ-land, things are slightly different (there's a shuffle stage) - but with other ingestion types, splits of the incoming data are assigned to different task workers by the overlord.

Whether it's individual files in S3 or a set of partitions of an Apache Kafka topic, each worker gets its own slice of the data. And each worker then does roll-up on its own data.

Some community members have found that this makes the roll-up itself less efficient – each task only sees part of the rows in the tables above, so rows that could have been combined end up in separate segments.

Of course, compaction can help sort that out after the fact, but it might be in your interest to think about preventing it from happening in the first place.

One option people have applied is to hash on the roll-up dimensions upstream of Druid's Kafka consumer to choose the partition each event goes to. The result is that workers receive data that is much more likely to GROUP BY efficiently than if something like round-robin event distribution were being used.