I thought I’d share some lessons I learned over the last few years about ingestion-time aggregation in Druid, and how cardinality affects it. It’s a little process I go through mentally that I thought would be helpful to share!
What is roll-up?
In case you don't know what roll-up is: it takes incoming rows in Druid and aggregates them, spitting out metrics from the measures that you have. There's native roll-up, which you turn on with streaming ingestion, and then there's the modern Druid approach to roll-up, which is – in plain English – a `GROUP BY` in the `INSERT` statement for batch ingestion.
You might choose to emit your usual `MAX`, `MIN`, and so on as metrics, or something more hypercool like a data sketch to speed up approximation operations. In native ingestion, you put all of that in the `metricsSpec` section of your ingestion specification – with `INSERT` you just use the usual aggregate functions.
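To make that concrete, here's a minimal sketch of SQL-based roll-up at ingestion time. The table and column names (`dance-raw`, `dance-rollup`, `sessionId`) are purely illustrative, and the `DS_HLL` sketch assumes the DataSketches extension is loaded:

```sql
-- Roll raw events up into five-minute buckets at ingestion time.
-- "dance-raw" and "dance-rollup" are hypothetical table names.
INSERT INTO "dance-rollup"
SELECT
  TIME_FLOOR("__time", 'PT5M') AS "__time",  -- truncate the timestamp (more on this below)
  "name",
  "dance",
  COUNT(*) AS "count",                       -- how many raw rows were rolled up
  DS_HLL("sessionId") AS "sessions_sketch"   -- data sketch for fast approximate COUNT DISTINCT later
FROM "dance-raw"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
```

Everything in the `GROUP BY` becomes a dimension; everything else has to be wrapped in an aggregate.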
Awesome! Now you can reduce your 10m rows per second of people surfing TikTok on your office WiFi network (I will not enter into the debate as to whether TikTok connects to WiFi) to just 10 per second, giving you, ahead of time, the aggregates that you would otherwise have computed with every query.
The efficiency of this operation is the same as for any `GROUP BY`: it comes down to the cardinality of the dimensions in your `SELECT`. In Druid’s case, that’s the source data columns that you list in your ingestion specification or `INSERT` statement. So be cautious!
Timestamp truncation
A discrete piece of functionality in Druid is to automatically truncate incoming timestamps. That’s done by specifying a `queryGranularity` in the ingestion spec, or by using a suitable time function in your `INSERT`.
Here’s an example data set where `queryGranularity` is set to `FIVE_MINUTE`. Druid has truncated the timestamp of every incoming event (like a `TIME_FLOOR`).
| Time  | Name  | Dance    |
|-------|-------|----------|
| 09:00 | Peter | Fandango |
| 09:00 | John  | Fandango |
| 09:00 | Peter | Fandango |
| 09:05 | John  | Fandango |
| 09:05 | John  | Fandango |
| 09:05 | John  | Fandango |
| 09:05 | Peter | Waltz    |
| 09:05 | Peter | Waltz    |
This is a necessary thing for effective roll-up – if you did a `GROUP BY` on the raw timestamp, you'd end up with a row for every millisecond (worst case).
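If you're doing SQL-based ingestion, by the way, that truncation is just a `TIME_FLOOR` over the timestamp. Here's a tiny standalone example you can paste into the console (the literal timestamp is mine, purely for illustration):

```sql
-- 09:03:21 falls in the 09:00–09:05 bucket, so it becomes 09:00:00
SELECT TIME_FLOOR(TIMESTAMP '2023-06-01 09:03:21', 'PT5M') AS bucketedTime
```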
Now let’s use our eye-minds and think about the roll-up.
Periodic single-dimension cardinality
Imagine that we're going to add a `COUNT` metric. Each column has low cardinality within each time period, so we get a nice aggregation: 8 rows went in, 4 came out.
| Time  | Name  | Dance    | Count |
|-------|-------|----------|-------|
| 09:00 | Peter | Fandango | 2     |
| 09:00 | John  | Fandango | 1     |
| 09:05 | John  | Fandango | 3     |
| 09:05 | Peter | Waltz    | 2     |
But what about this one:
| Time  | Name  | Dance    |
|-------|-------|----------|
| 09:00 | Peter | Fandango |
| 09:00 | Mary  | Fandango |
| 09:00 | Tom   | Fandango |
| 09:05 | Brian | Fandango |
| 09:05 | Peter | Fandango |
| 09:05 | Mary  | Fandango |
| 09:05 | Tom   | Fandango |
| 09:05 | Terry | Fandango |
Notice that, within the five-minute buckets (our `queryGranularity`-truncated timestamp), every single event relates to a different dancer. When the `GROUP BY` kicks in, those 8 incoming rows get emitted as 8 rows: the `GROUP BY` is across all of the dimensions.
| Time  | Name  | Dance    | Count |
|-------|-------|----------|-------|
| 09:00 | Peter | Fandango | 1     |
| 09:00 | Mary  | Fandango | 1     |
| 09:00 | Tom   | Fandango | 1     |
| 09:05 | Brian | Fandango | 1     |
| 09:05 | Peter | Fandango | 1     |
| 09:05 | Mary  | Fandango | 1     |
| 09:05 | Tom   | Fandango | 1     |
| 09:05 | Terry | Fandango | 1     |
Periodic multi-dimension cardinality
And there’s a second scenario: lots of combinations of values.
| Time  | Name   | Dance    |
|-------|--------|----------|
| 09:00 | Peter  | Fandango |
| 09:00 | Mary   | Polka    |
| 09:00 | Mary   | Vogue    |
| 09:05 | Brian  | Fandango |
| 09:05 | Lucy   | Waltz    |
| 09:05 | Claire | Fandango |
| 09:05 | Sian   | Waltz    |
| 09:05 | Terry  | Waltz    |
Here there are just too many combinations of values in each five-minute interval. Every dancer dances a different dance.
| Time  | Name   | Dance    | Count |
|-------|--------|----------|-------|
| 09:00 | Peter  | Fandango | 1     |
| 09:00 | Mary   | Polka    | 1     |
| 09:00 | Mary   | Vogue    | 1     |
| 09:05 | Brian  | Fandango | 1     |
| 09:05 | Lucy   | Waltz    | 1     |
| 09:05 | Claire | Fandango | 1     |
| 09:05 | Sian   | Waltz    | 1     |
| 09:05 | Terry  | Waltz    | 1     |
Periodic hierarchy
One cause of this combined-cardinality problem can be a data hierarchy. Let’s imagine that Peter is King of the Fandango and Voguing (well done, Peter). John, meanwhile, is King of the Foxtrot, Waltz, and Paso Doble. That is, a parent-child relationship.
| Time  | Teacher | Dance      |
|-------|---------|------------|
| 09:00 | Peter   | Fandango   |
| 09:00 | John    | Foxtrot    |
| 09:00 | Peter   | Vogue      |
| 09:05 | Peter   | Fandango   |
| 09:05 | John    | Foxtrot    |
| 09:05 | John    | Waltz      |
| 09:05 | Peter   | Fandango   |
| 09:05 | John    | Paso Doble |
The roll-up ends up looking like this:
| Time  | Teacher | Dance      | Count |
|-------|---------|------------|-------|
| 09:00 | Peter   | Fandango   | 1     |
| 09:00 | John    | Foxtrot    | 1     |
| 09:00 | Peter   | Vogue      | 1     |
| 09:05 | Peter   | Fandango   | 2     |
| 09:05 | John    | Foxtrot    | 1     |
| 09:05 | John    | Waltz      | 1     |
| 09:05 | John    | Paso Doble | 1     |
Here, the roll-up is less effective because each teacher (the parent) knows a distinct set of dances (the children), and it’s very unlikely that they’d repeat the same dance in the same roll-up period.
You can look at data you have ingested already to get a feel for its profile.
Some lovely SQL
If you've got your data in Druid already, you can find the number of rows in a one-hour period simply by using the Druid console:

```sql
SELECT COUNT(*) AS rowCount
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
```
Finding cardinality is very easy as well:
```sql
SELECT COUNT(DISTINCT "your-column") AS columnCardinality
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
```
Of course, you can do more than one at once, but just be cautious - on large datasets this can swamp your cluster…
```sql
SELECT COUNT(DISTINCT "your-column-1") AS column1Cardinality,
  COUNT(DISTINCT "your-column-2") AS column2Cardinality
  -- ...and so on for each column you want to check
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
```
Even more useful is the ratio of rows to unique values of a column. If you have the wikipedia edits sample data loaded, try this query, which gives you the ratio for just a few of the columns in the data set. (Notice that there's no `WHERE` clause on the `__time` column this time.)
```sql
SELECT CAST(COUNT(DISTINCT channel) AS FLOAT) / COUNT(*) AS channelRowRatio,
  CAST(COUNT(DISTINCT cityName) AS FLOAT) / COUNT(*) AS cityNameRowRatio,
  CAST(COUNT(DISTINCT comment) AS FLOAT) / COUNT(*) AS commentRowRatio,
  CAST(COUNT(DISTINCT countryIsoCode) AS FLOAT) / COUNT(*) AS countryIsoCodeRowRatio,
  CAST(COUNT(DISTINCT countryName) AS FLOAT) / COUNT(*) AS countryNameRowRatio,
  CAST(COUNT(DISTINCT diffUrl) AS FLOAT) / COUNT(*) AS diffUrlRowRatio,
  CAST(COUNT(DISTINCT flags) AS FLOAT) / COUNT(*) AS flagsRowRatio,
  CAST(COUNT(DISTINCT isAnonymous) AS FLOAT) / COUNT(*) AS isAnonymousRowRatio,
  CAST(COUNT(DISTINCT isMinor) AS FLOAT) / COUNT(*) AS isMinorRowRatio,
  CAST(COUNT(DISTINCT isNew) AS FLOAT) / COUNT(*) AS isNewRowRatio,
  CAST(COUNT(DISTINCT isRobot) AS FLOAT) / COUNT(*) AS isRobotRowRatio
FROM "wikipedia"
```
Columns whose ratio approaches 1 are the main cause of your low roll-up. In the `wikipedia` dataset, it’s clearly the `diffUrl` column. At the other end of the scale, columns with very low ratios – like the `wikipedia` sample data columns that start with `is` – are indicators of queries that are suffering because of the poor roll-up.
The next step – working out whether the data has high compound cardinality – is trickier. So I used the query above as a basis to assess combinations of dimensions: say, the dancer and the dance.
```sql
SELECT TIME_FLOOR("__time", 'PT5M') AS timeBucket,
  COUNT(*) AS rowCount,
  COUNT(DISTINCT "columnName") AS columnCardinality
FROM "datapipePMRawEvents"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1
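And to measure how well two dimensions roll up *together*, you can concatenate them inside the `COUNT DISTINCT`. A sketch, reusing the dancer columns from the tables earlier (the table name is hypothetical):

```sql
-- Per five-minute bucket: if comboCardinality is close to rowCount,
-- roll-up across these two dimensions won't save you much.
SELECT TIME_FLOOR("__time", 'PT5M') AS timeBucket,
  COUNT(*) AS rowCount,
  COUNT(DISTINCT "name" || '|' || "dance") AS comboCardinality
FROM "dance-raw"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1
```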
What I decided to do in my instance was one of two things. The first was to shrink the number of dimensions – maybe have two tables, one with all the fields, and one with a commonly-used subset of fields designed to take advantage of roll-up – then I could query the appropriate dataset for a particular use case. The other was to attack the cardinality problem itself: maybe with a data sketch (if the use case is COUNT DISTINCT, set operations, or quantiles), though I’ve heard of people using clever functions too, from simple truncation to using if-then-else to create a new, lower-cardinality dimension.
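As a sketch of what that second table could look like (the table and column names here are illustrative, not my real schema): a commonly-used subset of dimensions, an if-then-else-style bucketed dimension in place of a high-cardinality one, and a sketch so `COUNT DISTINCT` is still possible at query time:

```sql
-- A roll-up-friendly companion table: fewer dimensions, a derived
-- low-cardinality dimension, and a sketch instead of the raw user id.
INSERT INTO "clicks-rollup"
SELECT
  TIME_FLOOR("__time", 'PT1H') AS "__time",
  "channel",
  CASE WHEN "responseTimeMs" < 500 THEN 'fast' ELSE 'slow' END AS "responseBucket",
  COUNT(*) AS "count",
  DS_HLL("userId") AS "users_sketch"   -- approximate COUNT DISTINCT of users later
FROM "clicks-raw"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
```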
In my instance I could then increase the `queryGranularity` to `HOUR`, and I ended up with just one row instead of hundreds. That was particularly important because I was working on a clickstream project – and that generates tonnes of data!
There was another option as well – creating different tables with some common filters already applied. That reduced the row count, and it had a big impact on cardinality too.
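A hedged sketch of that idea, using the `wikipedia` sample data (the target table name is made up): the filter simply goes in the `WHERE` clause of the ingestion query, so the filtered-out rows never make it into this table at all.

```sql
-- Human edits only, rolled up by hour and channel.
INSERT INTO "wikipedia-human-hourly"
SELECT
  TIME_FLOOR("__time", 'PT1H') AS "__time",
  "channel",
  COUNT(*) AS "edits"
FROM "wikipedia"
WHERE "isRobot" = 'false'
GROUP BY 1, 2
PARTITIONED BY DAY
```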
So there you have it: remember to conceptualise and test your `GROUP BY` on the raw data, and keep cardinality and hierarchies in mind.
Huzzah!
Ingestion parallelism
There's another important effect on this operation, and it requires thinking about how the ingestion is actually executed.
Ingestion tasks can be run in a single thread – but that would kinda defeat the purpose of a shared-nothing, microservices way of doing things. Instead, with Druid, you can set a number of sub-tasks that will actually go and do the ingestion.
Now, in MSQ-land, things are slightly different (there's a shuffle stage) - but with other ingestion types, splits of the incoming data are assigned to different task workers by the overlord.
Whether it's individual files in S3 or a collection of topics in Apache Kafka, each worker will get its own data. And each worker will then do roll-up on its own data.
Some community members have found that this means the roll-up itself is not as efficient – each task only sees part of the rows in the tables above, so rows with identical dimension values can end up in separate segments instead of being combined.
Of course, compaction can help sort that out after the fact, but it might be in your interest to think about preventing it from happening in the first place.
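If you'd rather do that re-aggregation with SQL than with a compaction spec, a sketch looks like this – re-grouping the already-ingested rows so that identical dimension combinations from different workers collapse into one (reusing the hypothetical `dance-rollup` table from earlier and summing its existing `count` metric):

```sql
-- Re-roll-up the table in place: rows with the same dimensions that
-- landed on different ingestion workers are merged together.
REPLACE INTO "dance-rollup" OVERWRITE ALL
SELECT
  "__time",
  "name",
  "dance",
  SUM("count") AS "count",
  DS_HLL("sessions_sketch") AS "sessions_sketch"  -- merges the existing sketches
FROM "dance-rollup"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
```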
One option people have applied is to hash on dimension values upstream of Druid's Kafka consumer to determine the partition each event should go to. The result is that workers receive data that is much more likely to `GROUP BY` efficiently than if something like round-robin event distribution were used.