Apache Druid® adoption essentials
Some things you just gotta know...
There are some things our brains need to dwell on when it comes to Druid. Things we need to think about before we even start our POC or pilot. Druid is different from data warehouses, data lakes with query engines, time-series databases, search systems - or any other type of NoSQL database you can shake a Metacritic rating at. There's just some things you gotta know...
1 – Druid is distributed
Apache Druid has a shared-nothing, microservices architecture. That's what helps us make robust, scalable deployments.
Check this video where I speak and have pretty slides on this very topic.
Take this course where you get your hands on a cluster irl
Read about Druid’s processes, clustered deployments, use of Apache Zookeeper, the Metadata Database, and Deep Storage
Check docs on ingestion (and Apache Kafka ingestion) and the load rules that actually get the data pre-loaded onto Druid's processes to be queried. Also worth reading a little about memory mapping in Druid.
And for extra beans, check the docs on task affinity
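One quick way to see that distribution for yourself: Druid exposes sys metadata tables you can query over SQL. Here's a tiny sketch - it assumes nothing beyond a running cluster:

-- One row per process in the cluster: Brokers, Historicals, and friends,
-- with the tier each data server belongs to and how much data it's serving
SELECT server, server_type, tier, curr_size
FROM sys.servers

Run that in the web console and the "microservices" part stops being abstract - you can see every process laid out in front of you.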
2 - Some queries just work better
Druid's fan-out, fan-in (aka scatter / gather) query engine (notwithstanding anything that happens with MSQ) means you need to know how queries execute. When you do, you'll get why some queries are awesomely fast, and some are ... tricky.
I've got a video on it that's based on one from Gian Merlino (PMC Chair) - check it out
Read the detail on the query engines in Druid: Native scan queries, search queries and GROUP BY queries
Read more about the role of the Broker
Check out more about JOIN operations in Druid in the docs, too.
On the segment side, there's a great page on segment optimisation in the docs and another video from me, walking through that
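To ground all that, here's the kind of query shape that plays nicely with scatter / gather (assuming the tutorial wikipedia datasource): the time filter lets the Broker prune segments up front, each data process aggregates its own segments in parallel, and the Broker just merges partial results.

-- Fan-out: each Historical aggregates its local segments for the time range
-- Fan-in: the Broker merges the partial results and applies the LIMIT
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY channel
ORDER BY edits DESC
LIMIT 10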
3 - Know if you gotta approximate, mate.
When you're working at speed on big data sets, your UI will thank you for using approximation. Druid does some of it out of the box, so it's good to know what it is. And how you can help it out.
Guess what - I've got a video on it!
Get to know how to use Sketches by looking at the docs on the SQL functions, and get the super-technical lowdown from the Apache Datasketches project
Don't forget you can use rollup / GROUP BY to generate sketches at ingestion time and give those functions a speed boost.
Check out some videos on Datasketches being used with Druid – visit the Imply developer resource center and hit "datasketches", like:
Not Exactly! Approximate Algorithms For Big Data by Fangjin Yang and Nelson Ray
Combining Druid and DataSketches for Real-time, Robust Behavioral Analytics by Himanshu Gupta and Mike Sefanov
Fast Approximate Counting Using Druid and DataSketch, Elan Helfin and Aravind Sethurathnam
Read about the approximate TopN native query in the docs
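As a minimal sketch of what this looks like in SQL (assuming the DataSketches extension is loaded, and using the tutorial wikipedia datasource again):

-- Returns an HLL-sketch *estimate* of distinct users, not an exact count -
-- far cheaper than COUNT(DISTINCT ...) at scale
SELECT channel, APPROX_COUNT_DISTINCT_DS_HLL("user") AS approx_unique_users
FROM wikipedia
GROUP BY channel
ORDER BY approx_unique_users DESC
LIMIT 5

And note that a one-dimension GROUP BY ... ORDER BY ... LIMIT shape like this can be planned as that approximate TopN query under the hood - another place approximation sneaks in by default.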
4 - Design each Druid TABLE
A TABLE in Druid has gotta be thought about – you'll want to consider how fast it needs to be, what the level of detail is, where it'll be stored, retention periods...
(Side note: there are other types of Druid Datasource too - not just tables.)
Here's my video on it
There are some real important principles of data modelling for Druid - you can run through all of them in the Ingestion and Data Modeling course
Get to know the detail of segments in the docs
Watch this notebook series from the awesome Sergio Ferragut about how Druid shards incoming data into segments through the PARTITIONED BY and CLUSTERED BY clauses – there's a sketch using both after this list
You can read more on changing data through reindexing in the docs
For info about retention and tiering, check Load and Drop rules - there's even a retention tutorial
Read about UNION ALL in SQL
There's a cool doc on considerations for multi-tenancy in Druid
Also remember to read about how Druid updates data
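Pulling those design levers together, here's a hedged MSQ sketch - the datasource, URL, and column names are all made up for illustration - showing time partitioning and secondary clustering chosen at ingestion:

-- PARTITIONED BY sets segment time granularity (here, one time chunk per day);
-- CLUSTERED BY sorts and shards rows within each chunk for locality and pruning
REPLACE INTO "events" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS __time,
  country,
  product,
  revenue
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/events.json"]}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"country","type":"string"},{"name":"product","type":"string"},{"name":"revenue","type":"double"}]'
  )
)
PARTITIONED BY DAY
CLUSTERED BY country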
5 - Some fish upstream
Yep you can twist and turn your data at ingestion time in Druid. But some stuff you may want to think about doing upstream.
Check out my video on this very topic.
Read about using the native dimensionsSpec for inclusions, exclusions, and schemaless ingestion, and to enable indexes.
Check the native transformSpec for expressions and filtering - and obvs in MSQ that's just your usual SQL functions in your INSERT.
Read about how Druid can aggregate at ingestion time using rollup - the equivalent being GROUP BY with an MSQ INSERT.
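A small sketch of that equivalence, reusing the hypothetical events table from earlier: the WHERE clause stands in for the native filter, plain SQL expressions stand in for transforms, and GROUP BY plus aggregates perform the rollup as rows land.

-- Rollup at ingestion: one output row per hour/country instead of one per event
INSERT INTO "events_rollup"
SELECT
  TIME_FLOOR(__time, 'PT1H') AS __time,   -- truncate to hourly grain
  UPPER(country) AS country,              -- a transform, as an ordinary expression
  COUNT(*) AS row_count,
  SUM(revenue) AS total_revenue
FROM "events"
WHERE country IS NOT NULL                 -- filtering, as an ordinary WHERE clause
GROUP BY 1, 2
PARTITIONED BY DAY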
6 - Druid's not an island
There might be other things that you need to put around Druid to successfully integrate it into your pipeline.
Here's my video (you should have known this was coming by now)
And there's a tonne of videos on the Druid ecosystem over on the developer resource center from other people, cooler than me! Here's a selection:
Automating CI/CD for Druid Clusters at Athena Health, Shyam Mudambi / Ramesh Kempanna / Karthik Urs
Druid and Spot Instances at SuperAwesome, Nicolas Trésegnie
Building an Enterprise-Scale Dashboarding/Analytics Platform at Target, Jeremy Woelfel
Building a Real-Time Gaming Analytics Service with Apache Druid at Game Analytics, Ramón Latres Guerrero
Interactive time-series analysis with Apache Druid, Apache Superset, and Facebook Prophet by Robert Stolz
One event to rule them all at Outbrain, Daria Litvinov
Simon Späti’s Open-source data warehousing article
In the docs, check out:
Task logs, the general logging overview, and Druid metrics and emitters - and check out the learn.imply.io course on Logs and Metrics, too
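And if you'd rather pull than push, the sys.tasks metadata table gives you a quick SQL view of ingestion health - a tiny sketch:

-- One row per recent task: handy for spotting failures before digging into logs
SELECT task_id, "type", datasource, status, duration, error_msg
FROM sys.tasks
WHERE status = 'FAILED'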