Photo by Pablo Gentile on Unsplash
Getting to know Apache Druid®
Some useful resources to guide you along the way
I've been working with Druid now for four-and-a-bit years, joining Imply as its first pre-sales engineer in Europe, and now proud to be working with the open-source community full time as Imply's Director of Developer Relations!
This workaversary gave me a chance to think back on the people and things that helped me learn about Druid, and to compile a list of things that I would recommend to anyone who wants to check out Druid.
Get to know the Why
Take a listen to Eric Tschetter, the person who made the first commit to the apache/druid repository. You'll learn why Druid got built in the first place - and therefore what technical issues those originators were trying to fix.
For a deep dive, check out the official Druid whitepaper, written way back when to give the findings of their research into the real-time analytics landscape as it was, and why the team decided to do the mad thing of creating a new database.
It's also worth checking out the following to see how other people are using Druid:
Try it now
Get straight into it and run through the Quickstart tutorials. Some people prefer a hands-on lab (like the Druid Basics course on Druid Faculty by Imply), some just wget Druid and run it in the JRE on their laptops, some prefer Docker. Some mad people like me put it onto Raspberry Pis! But either way, get an environment up and see what it looks like.
Druid Faculty - Druid Basics
I'd recommend resisting the temptation to go all out and deploy Druid in clustered mode. Be selfish and try out Druid on your own laptop for now.
Learn what it's made of
Now you can start to pull back the onion skin. If you took Druid Basics you'll have had a peek - but now get into it and find out what the Druid architecture is like. You need to know this especially if (like me!!!) Druid is the first database you've used that has a microservices, shared-nothing architecture.
- Video - Remember! Druid's distributed
I've also put together a post on the qualities of UIs built on top of Druid that might help you think about what kind of UI Druid is great for.
Here are some key Druid docs pages for you to look at:
Find out what it's for
Take a look at these videos, where you can learn a little more about other technologies that are often used alongside Druid and why.
Video - Incoming Data
Video - Druid Ecosystem
If you're already tired of my voice (!!) look at this, where there are tonnes of cool videos from people in the community that use Druid. It's neatly carved up into aligned technologies so you can find something you're using in your real-time analytics pipeline (like Apache Kafka) and see what they're doing with it.
- Imply's Druid developer resource center
As well as videos you'll find on the resource center, the docs also have some pages on how Druid differs from other database technologies. Start your reading by looking at Druid versus Elastic.
Learn what Druid ingestion does
Segments are a fundamental concept in Druid. You need to know how they're created, where they go, and how they're used at query time.
This recording of a talk by Gian Merlino will help you understand how this special data format ends up being so important at query time.
This series of videos from Sergio Ferragut goes into detail on how source data will get split up by Druid at ingestion time, and the different options that you have.
You can't escape me, either! Here's my video on segment optimization.
- Video: Optimize your segments
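To make the idea of source data being split up at ingestion time a little more concrete, here's a minimal Python sketch. It is not Druid's code: it assumes a day-level segment granularity and simply buckets rows by the day of their timestamp. The function name bucket_by_day and the sample rows are made up for illustration, and real ingestion does much more (secondary partitioning, rollup, columnar encoding) that this ignores.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_by_day(rows):
    """Group rows into day-sized time chunks keyed by the chunk's start (UTC).

    Hypothetical helper for illustration only; this is not a Druid API.
    """
    chunks = defaultdict(list)
    for row in rows:
        # Parse the row's timestamp and normalise it to UTC.
        ts = datetime.fromisoformat(row["__time"]).astimezone(timezone.utc)
        # Truncate to midnight to find the day-sized chunk the row belongs to.
        chunk_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        chunks[chunk_start.isoformat()].append(row)
    return dict(chunks)


# Invented sample rows: two fall on 1 January, one on 2 January.
rows = [
    {"__time": "2024-01-01T09:15:00+00:00", "page": "Druid", "edits": 3},
    {"__time": "2024-01-01T23:59:59+00:00", "page": "Kafka", "edits": 1},
    {"__time": "2024-01-02T00:00:00+00:00", "page": "Druid", "edits": 7},
]

chunks = bucket_by_day(rows)
for start, chunk_rows in sorted(chunks.items()):
    print(start, len(chunk_rows))
```

A query that filters on a narrow time range only needs the chunks that overlap that range, which is one reason the choice of granularity matters.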
Here are some key docs for you to look at:
Now fold in query wisdom
At the time of writing, by default, Druid has a fan-out / fan-in approach to parallelising SELECT query execution. That makes certain queries very fast.
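As a rough mental model of fan-out / fan-in, here's a toy Python sketch. It is not how Druid is implemented: it just imagines each segment being scanned independently to produce a partial result, with those partials then merged into a final answer. The segment contents and the names scan_segment and merge_partials are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Three pretend "segments", each holding rows of (channel, edits).
SEGMENTS = [
    [("#en", 3), ("#fr", 1)],
    [("#en", 2), ("#de", 5)],
    [("#fr", 4), ("#en", 1)],
]


def scan_segment(segment):
    """Fan-out step: partial SUM(edits) GROUP BY channel for one segment."""
    partial = Counter()
    for channel, edits in segment:
        partial[channel] += edits
    return partial


def merge_partials(partials):
    """Fan-in step: merge the per-segment partial results into one answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Scan every segment in parallel, then merge the partial results.
with ThreadPoolExecutor() as pool:
    result = merge_partials(pool.map(scan_segment, SEGMENTS))

print(result)
```

A sum merges cleanly because partial sums can simply be added together. Queries whose partial results are small and easy to merge are the ones that tend to suit this pattern best.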
Learn what a good query pattern looks like in Druid. You'll begin to understand why data layout and data modelling are so important to query execution speeds - let alone providing enough infrastructure.
- Video: Aim for sub-second queries
Many people might also skip understanding how approximation works in Druid, and how you can use it to your advantage. But the sketches in Apache DataSketches are a really important computer science technique to evaluate and to employ. As well as checking out "datasketches"-tagged videos in the Developer Resource Center, check out this video:
- Video: Approximation
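If approximation feels abstract, this toy example might help. It's a minimal K-Minimum-Values (KMV) distinct-count estimator in plain Python, which is one classic sketching idea. It illustrates the general technique only: it is not the Apache DataSketches library and not the code Druid runs. The function name, the default k and the choice of hash are my own assumptions.

```python
import hashlib
import heapq


def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values with a K-Minimum-Values sketch.

    Illustrative only: remembers just the k smallest hash values seen,
    however many input values there are.
    """
    heap = []     # max-heap (via negation) of the k smallest hashes seen
    kept = set()  # the same hashes, used to skip duplicates
    for value in values:
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)


# 100,000 values, but only 10,000 distinct ones.
values = [i % 10_000 for i in range(100_000)]
estimate = approx_count_distinct(values)
print(estimate)
```

The estimate should land reasonably close to 10,000 while the sketch itself only ever holds k hashes. That trade of a little accuracy for a lot less memory is the heart of why approximation is worth understanding.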
It's also now a good time to think about how many TABLEs you may want to have in Druid, and how they get created.
Here are some useful docs links:
Set yourself up to debug and monitor
Over on the Druid Faculty you'll find two courses that will help you understand how to configure, store, and use Druid monitoring information – logs and metrics.
- Druid Faculty - Logs / Metrics courses
Join the community
The Apache Druid community welcomes you! The Slack channel is particularly active, and it's got not just other users, but actual code and docs committers present. Go over to the main Community page to get links to all the things...
Whether you've got a question about an error or just want to know whether your idea for using Druid is a fit, people there will be super happy to help you.
The main website for Druid also lists upcoming meetup events - and you can also find them by searching meetup.com for events tagged with Apache Druid.
Get a little bit of data in
It can be quite tempting to just take the plunge and ingest a tonne of data into Druid. But I'd suggest taking just a sample of your own data, or generating some sample data, to get you through a modelling exercise. The best place to understand the process is the Druid Faculty course on Data Modelling and Ingestion:
- Druid Faculty - Data Modelling and Ingestion
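If you don't have a sample of your own data to hand, one option is to generate some. Here's a minimal Python sketch that writes newline-delimited JSON events to a file. The field names, value ranges and file name are all invented for illustration, so swap in something that resembles your own data.

```python
import json
import random
from datetime import datetime, timedelta, timezone


def generate_events(count, start, seed=42):
    """Yield `count` fake click events, one per second starting at `start`.

    Hypothetical schema for illustration; replace it with your own fields.
    """
    rng = random.Random(seed)  # seeded so the sample is repeatable
    for i in range(count):
        yield {
            "timestamp": (start + timedelta(seconds=i)).isoformat(),
            "user_id": f"user-{rng.randint(1, 50)}",
            "country": rng.choice(["GB", "FR", "DE", "US"]),
            "latency_ms": rng.randint(5, 500),
        }


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
events = list(generate_events(1000, start))

# Write one JSON object per line (newline-delimited JSON).
with open("sample_events.json", "w", encoding="utf-8") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")
```

A small file like this is enough to try out different choices of dimensions, metrics and granularity before you commit to anything at scale.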
Now you're set for getting into the Slack channel and starting to build out your tables and your cluster!
And on starting the next part of the journey, take a detour onto my blog article on Druid POCs.