Apache Druid® adoption essentials

Some things you just gotta know...

There are some things our brains need to dwell on when it comes to Druid - things to think about before we even start a POC or pilot. Druid is different from warehouses, data lakes plus query engines, time series databases, search systems - or any other type of NoSQL database you can shake a Metacritic rating at. There's just some things you gotta know...

1 – Druid is distributed

Apache Druid has a shared-nothing, microservices architecture: each process type can be deployed and scaled independently. That's what helps us make robust, scalable deployments.

2 - Some queries just work better

Druid's fan-out, fan-in (aka scatter / gather) query engine (notwithstanding anything that happens with MSQ) means you need to know how queries execute. When you do, you'll get why some queries are awesomely fast, and some are ... tricky.
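To make that concrete, here's a sketch of the two kinds of query (the "wikipedia" datasource and its columns are hypothetical - swap in your own). The first fans out to only the segments inside the time filter; the second forces every data process to ship a big pile of intermediate results back to the Broker for the final merge:

```sql
-- Fast: the time filter prunes segments, so only recent ones get scanned,
-- and the per-hour grouping merges cheaply on the Broker
SELECT TIME_FLOOR(__time, 'PT1H') AS "hour", COUNT(*) AS "events"
FROM "wikipedia"
WHERE __time > CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1

-- Trickier: no time filter, and grouping on a high-cardinality column
-- means lots of intermediate results funneled through the fan-in step
SELECT "user", COUNT(*) AS "edits"
FROM "wikipedia"
GROUP BY 1
```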

3 - Know if you gotta approximate, mate.

When you're querying big data sets at speed, your UI will thank you for approximation. Druid does some of it out of the box, so it's good to know what it is - and how you can help it out.
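For example, distinct counts (again, the "wikipedia" datasource is just an assumed example). Note that, by default, Druid's SQL planner quietly treats COUNT(DISTINCT) as approximate anyway - that's the "out of the box" part:

```sql
-- Exact distinct count: accurate, but memory-hungry at scale
-- (and only truly exact if approximate count distinct is disabled)
SELECT COUNT(DISTINCT "user") AS "unique_users"
FROM "wikipedia"

-- Explicitly approximate, using a HyperLogLog-style sketch:
-- fast and light on memory, at the cost of a small, bounded error
SELECT APPROX_COUNT_DISTINCT("user") AS "unique_users_approx"
FROM "wikipedia"
```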

4 - Design each Druid TABLE

A TABLE in Druid has gotta be thought about - you'll want to consider how fast it needs to be, what level of detail it holds, where it'll be stored, retention periods...

(Side note: there are other types of Druid Datasource too - not just tables.)

  • Here's my video on it

  • There are some real important principles of data modelling for Druid - you can run through all of them in the Ingestion and Data Modeling course

  • Get to know the detail of segments in the docs

  • Watch this notebook series from the awesome Sergio Ferragut about how Druid shards incoming data into segments using the PARTITIONED BY and CLUSTERED BY clauses

  • You can read more on changing data through reindexing in the docs

  • For info about retention and tiering, check Load and Drop rules - there's even a retention tutorial

  • Read about UNION ALL in SQL

  • There's a cool doc on considerations for multi-tenancy in Druid

  • Also remember to read about how Druid updates data
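Pulling a few of those threads together, here's a hypothetical MSQ ingestion that makes the partitioning and clustering decisions explicit (the table name, source URL, and columns are all made up for illustration):

```sql
-- Ingest external JSON into a Druid TABLE, with deliberate design choices:
-- one time chunk per day, rows inside each segment clustered by channel
INSERT INTO "wikipedia_edits"
SELECT
  TIME_PARSE("timestamp") AS __time,
  "channel",
  "page"
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/wikipedia.json.gz"]}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"channel","type":"string"},{"name":"page","type":"string"}]'
  )
)
PARTITIONED BY DAY
CLUSTERED BY "channel"
```

PARTITIONED BY sets the segment time chunking (and so what retention rules can act on), while CLUSTERED BY controls how rows are distributed and sorted within each chunk - which is what makes filters on that column fast.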

5 - Some fish upstream

Yep, you can twist and turn your data at ingestion time in Druid. But there's some stuff you may want to think about doing upstream instead.

  • Check out my video on this very topic.

  • Read about using the native dimensionsSpec for inclusions, exclusions, and schemaless ingestion, and to enable indexes.

  • Check the native transformSpec for expressions and filtering - and obvs in MSQ that's just your usual SQL functions in your INSERT statement

  • Read about how Druid can aggregate at ingestion time using rollup - the MSQ equivalent being a GROUP BY in your INSERT
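Rollup with MSQ looks like this - a sketch only, with a made-up table name, source URL, and columns. The GROUP BY is what collapses raw events into pre-aggregated rows:

```sql
-- Hypothetical rollup at ingestion time: instead of storing every raw event,
-- store one row per channel per hour, plus a count of the raw rows it absorbed
INSERT INTO "edits_hourly"
SELECT
  TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS __time,
  "channel",
  COUNT(*) AS "edit_count"
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/edits.json.gz"]}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"channel","type":"string"}]'
  )
)
GROUP BY 1, 2
PARTITIONED BY DAY
```

The trade-off is the usual one: you lose row-level detail but gain a much smaller table and faster queries - which is exactly why it's worth deciding upstream what granularity you actually need.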

6 - Druid's not an island

There might be other things that you need to put around Druid to successfully integrate it into your pipeline.