ClickHouse Basics: MergeTree and Tiered Storage with S3
ClickHouse Basics: MergeTree and Tiered Storage with S3
ClickHouse is a columnar OLAP database that scans hundreds of millions of rows and aggregates them absurdly fast. This post covers the bare minimum to get going, a few interesting things about MergeTree, and how to push old data to AWS S3 cold storage to save money.
A quick start with Docker
The fastest route is compose. Using a ready-made template (clickhouse_4.8, running clickhouse/clickhouse-server:24.8):
git clone https://github.com/yshengliao/docker-compose-templates
cd docker-compose-templates/clickhouse_4.8
docker compose up -d
Once it’s up there are two interfaces: 8123 for HTTP and 9000 for the native TCP protocol. Connect and run SQL:
docker exec -it clickhouse clickhouse-client
# or over the HTTP interface
curl 'http://localhost:8123/?query=SELECT%20version()'
The template mounts clickhouse-server/ as the config directory — that’s where the S3 config goes shortly.
MergeTree: parts and background merges
MergeTree is ClickHouse’s workhorse engine. Its core behaviour: an INSERT doesn’t append rows into one big file — it creates a separate part, sorted by the primary key; a background process then merges those parts into larger ones over time. That’s where the name comes from, and conceptually it’s close to an LSM-tree.
CREATE TABLE events
(
ts DateTime,
user_id UInt64,
event LowCardinality(String),
payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (user_id, ts);
ORDER BY is the sorting key, and by default it doubles as the primary key. The MergeTree family also has some interesting variants: ReplacingMergeTree (deduplicates by key on merge), and SummingMergeTree / AggregatingMergeTree (sum or aggregate same-key rows at merge time). Giving the merge step semantics is one of ClickHouse’s neat ideas.
The sparse primary index
ClickHouse’s primary index is sparse: it doesn’t index every row, but every index_granularity (8192 by default) rows as one granule, recording only each granule’s start. So even with billions of rows the index is small enough to stay in memory.
When a query’s condition lands on a prefix of the ORDER BY, ClickHouse skips the irrelevant granules (data skipping) and reads only the blocks it needs. So how you choose ORDER BY largely decides how fast queries run — put the columns you filter on most up front.
PARTITION BY and TTL
PARTITION BY toYYYYMM(ts) slices data by month. It isn’t an index; it lets you operate on whole chunks: partition pruning when you query a specific month, an instant DROP PARTITION to delete old data, and movement across storage tiers a partition at a time.
TTL has two uses. One is expiring rows so they’re deleted:
ALTER TABLE events MODIFY TTL ts + INTERVAL 90 DAY;
The other is more interesting — when the time comes, move data to another storage tier. That’s the hook for hot/cold tiering (next section).
Tiered storage with S3
Hot data on a local SSD queries fast; old data on S3 is cheap — ClickHouse makes this declarative via storage_configuration. Drop a storage.xml into clickhouse-server/config.d/ (the template already mounts that directory):
<clickhouse>
<storage_configuration>
<disks>
<s3_cold>
<type>s3</type>
<endpoint>https://YOUR_BUCKET.s3.ap-northeast-1.amazonaws.com/clickhouse/</endpoint>
<access_key_id>YOUR_KEY</access_key_id>
<secret_access_key>YOUR_SECRET</secret_access_key>
</s3_cold>
</disks>
<policies>
<hot_cold>
<volumes>
<hot><disk>default</disk></hot> <!-- local SSD -->
<cold><disk>s3_cold</disk></cold> <!-- AWS S3 -->
</volumes>
<move_factor>0.2</move_factor>
</hot_cold>
</policies>
</storage_configuration>
</clickhouse>
move_factor means “when free space on a volume drops below this fraction (0.2 = 20%), data starts moving to the next volume automatically”. Attach the policy to the table, then use TTL to say when data moves to the cold tier:
CREATE TABLE events ( /* …as above… */ )
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (user_id, ts)
TTL ts + INTERVAL 30 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold';
Data under 30 days stays on the local SSD (hot); after that the background moves it to S3 (cold), with no change to your query syntax — cold data is just a little slower. For three tiers — hot / warm / cold — add a local-HDD volume in the middle and write the TTL as … + 7 DAY TO VOLUME 'warm', … + 30 DAY TO VOLUME 'cold'.
Takeaways
- ClickHouse is a columnar OLAP store; MergeTree is the workhorse: an INSERT creates a sorted part, and the background merges parts into larger ones (close to an LSM).
- The primary index is sparse (one granule per 8192 rows), so it’s small and memory-resident; your
ORDER BYchoice decides how much data a query can skip. PARTITION BYlets you prune, drop, and move data a partition at a time; the MergeTree family (Replacing / Summing / Aggregating) gives the merge step semantics.TTL ... TO VOLUMEplusstorage_configurationgives you tiering: hot data on SSD, old data auto-moved to S3, with unchanged query syntax.- The clickhouse_4.8 template with
docker compose up -dis the fastest start; the S3 config takes effect once dropped intoclickhouse-server/config.d/.
Drafted with Claude · environment ClickHouse 24.8 (LTS) + Docker · part of the 2026-06-13 blog renovation, paint still drying.