Apache Druid® adoption essentials
Some things you just gotta know...
There are some things our brains need to dwell on when it comes to Druid. Things we need to think about before we even start our POC or pilot. Druid is different from data warehouses, data lakes with query engines, time-series databases, search systems - or any other type of NoSQL database you can shake a Metacritic rating at. There's just some things you gotta know...
1 – Druid is distributed
Apache Druid has a shared-nothing, microservices architecture. That's what helps us make robust, scalable deployments.
Check this video where I speak and have pretty slides on this very topic.
Take this course where you get your hands on a cluster irl
Read about Druid’s processes, clustered deployments, use of Apache Zookeeper, the Metadata Database, and Deep Storage
Check docs on ingestion (and Apache Kafka ingestion) and the load rules that actually get the data pre-loaded onto Druid's processes to be queried. Also worth reading a little about memory mapping in Druid.
And for extra beans, check the docs on task affinity
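One quick way to see that distribution for yourself: Druid exposes sys metadata tables you can query over SQL. Here's a tiny sketch - it assumes nothing beyond a running cluster:

-- One row per process in the cluster: Brokers, Historicals, and friends,
-- with the tier each data server belongs to and how much data it's serving
SELECT server, server_type, tier, curr_size
FROM sys.servers

Run that in the web console and the "microservices" part stops being abstract - you can see every process laid out in front of you.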
2 - Some queries just work better
Druid's fan-out, fan-in (aka scatter / gather) query engine (notwithstanding anything that happens with MSQ) means you need to know how queries execute. When you do, you'll get why some queries are awesomely fast, and some are ... tricky.
I've got a video on it that's based on one from Gian Merlino (PMC Chair) - check it out
Read the detail on the query engines in Druid: Native scan queries, search queries and GROUP BY queries
Read more about the role of the Broker
Check out more about JOIN operations in Druid in the docs, too.
On the segment side, there's a great page on segment optimisation in the docs and another video from me, walking through that
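To ground all that, here's the kind of query shape that plays nicely with scatter / gather (assuming the tutorial wikipedia datasource): the time filter lets the Broker prune segments up front, each data process aggregates its own segments in parallel, and the Broker just merges partial results.

-- Fan-out: each Historical aggregates its local segments for the time range
-- Fan-in: the Broker merges the partial results and applies the LIMIT
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY channel
ORDER BY edits DESC
LIMIT 10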
3 - Know if you gotta approximate, mate.
When you're working at speed on big data sets, your UI will thank you for using approximation. Druid does some of it out of the box, so it's good to know what it is. And how you can help it out.
Guess what - I've got a video on it!
Get to know how to use Sketches by looking at the docs on the SQL functions, and get the super-technical lowdown from the Apache Datasketches project
Don't forget you can use rollup / GROUP BY to generate sketches at ingestion time and give those functions a speed boost.
Check out some videos on Datasketches being used with Druid – visit the Imply developer resource center and hit "datasketches", like:
Not Exactly! Approximate Algorithms For Big Data by Fangjin Yang and Nelson Ray
Combining Druid and DataSketches for Real-time, Robust Behavioral Analytics by Himanshu Gupta and Mike Sefanov
Fast Approximate Counting Using Druid and DataSketch, Elan Helfin and Aravind Sethurathnam
Read about the approximate TopN native query in the docs
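As a minimal sketch of what this looks like in SQL (assuming the DataSketches extension is loaded, and using the tutorial wikipedia datasource again):

-- Returns an HLL-sketch *estimate* of distinct users, not an exact count -
-- far cheaper than COUNT(DISTINCT ...) at scale
SELECT channel, APPROX_COUNT_DISTINCT_DS_HLL("user") AS approx_unique_users
FROM wikipedia
GROUP BY channel
ORDER BY approx_unique_users DESC
LIMIT 5

And note that a one-dimension GROUP BY ... ORDER BY ... LIMIT shape like this can be planned as that approximate TopN query under the hood - another place approximation sneaks in by default.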
4 - Design each Druid TABLE
A TABLE in Druid has gotta be thought about – you'll want to consider how fast it needs to be, what the level of detail is, where it'll be stored, retention periods...
(Side note: there are other types of Druid Datasource too - not just tables.)
Here's my video on it
There are some real important principles of data modelling for Druid - you can run through all of them in the Ingestion and Data Modeling course
Get to know the detail of segments in the docs
Watch this notebook series from the awesome Sergio Ferragut about how Druid shards incoming data into segments through the PARTITIONED BY and CLUSTERED BY clauses – there's a sketch using both after this list
You can read more on changing data through reindexing in the docs
For info about retention and tiering, check Load and Drop rules - there's even a retention tutorial
Read about UNION ALL in SQL
There's a cool doc on considerations for multi-tenancy in Druid
Also remember to read about how Druid updates data
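Pulling those design levers together, here's a hedged MSQ sketch - the datasource, URL, and column names are all made up for illustration - showing time partitioning and secondary clustering chosen at ingestion:

-- PARTITIONED BY sets segment time granularity (here, one time chunk per day);
-- CLUSTERED BY sorts and shards rows within each chunk for locality and pruning
REPLACE INTO "events" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS __time,
  country,
  product,
  revenue
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/events.json"]}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"country","type":"string"},{"name":"product","type":"string"},{"name":"revenue","type":"double"}]'
  )
)
PARTITIONED BY DAY
CLUSTERED BY country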
5 - Some fish upstream
Yep you can twist and turn your data at ingestion time in Druid. But some stuff you may want to think about doing upstream.
Check out my video on this very topic.
Read about using the native dimensionsSpec for inclusions, exclusions, and schemaless ingestion, and to enable indexes.
Check the native transformSpec for expressions and filtering - and obvs in MSQ that's just your usual SQL functions in your INSERT.
Read about how Druid can aggregate at ingestion time using rollup - the equivalent being GROUP BY with an MSQ INSERT.
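A small sketch of that equivalence, reusing the hypothetical events table from earlier: the WHERE clause stands in for the native filter, plain SQL expressions stand in for transforms, and GROUP BY plus aggregates perform the rollup as rows land.

-- Rollup at ingestion: one output row per hour/country instead of one per event
INSERT INTO "events_rollup"
SELECT
  TIME_FLOOR(__time, 'PT1H') AS __time,   -- truncate to hourly grain
  UPPER(country) AS country,              -- a transform, as an ordinary expression
  COUNT(*) AS row_count,
  SUM(revenue) AS total_revenue
FROM "events"
WHERE country IS NOT NULL                 -- filtering, as an ordinary WHERE clause
GROUP BY 1, 2
PARTITIONED BY DAY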
6 - Druid's not an island
There might be other things that you need to put around Druid to successfully integrate it into your pipeline.
Here's my video (you should have known this was coming by now)
And there's a tonne of videos on the Druid ecosystem over on the developer resource center from other people, cooler than me! Here's a selection:
Automating CI/CD for Druid Clusters at Athena Health, Shyam Mudambi / Ramesh Kempanna / Karthik Urs
Druid and Spot Instances at SuperAwesome, Nicolas Trésegnie
Building an Enterprise-Scale Dashboarding/Analytics Platform at Target, Jeremy Woelfel
Building a Real-Time Gaming Analytics Service with Apache Druid at Game Analytics, Ramón Latres Guerrero
Interactive time-series analysis with Apache Druid, Apache Superset, and Facebook Prophet by Robert Stolz
One event to rule them all at Outbrain, Daria Litvinov
Simon Späti’s Open-source data warehousing article
In the docs, check out:
Task logs, the general logging overview, and Druid metrics and emitters - and check out the learn.imply.io course on Logs and Metrics, too
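And if you'd rather pull than push, the sys.tasks metadata table gives you a quick SQL view of ingestion health - a tiny sketch:

-- One row per recent task: handy for spotting failures before digging into logs
SELECT task_id, "type", datasource, status, duration, error_msg
FROM sys.tasks
WHERE status = 'FAILED'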