StarRocks Deep‑Dive Cheat‑Sheet
A comprehensive technical reference for StarRocks
1. Architecture Deep‑Dive
Frontend (FE) Nodes
- • Brain of the cluster – SQL parsing, optimizer, global metadata.
- • Raft consensus on metadata edits.
- • Roles (see the sketch after this list):
- – Leader ✦ writes metadata, schedules jobs
- – Follower ✦ hot‑standby voting node
- – Observer ✦ read‑only (scale metadata reads)
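A minimal sketch of growing the FE group – host names and the default 9010 edit‑log port are placeholders:
-- Add a voting follower and a read-only observer, then verify roles and replay state
ALTER SYSTEM ADD FOLLOWER "fe2.example.com:9010";
ALTER SYSTEM ADD OBSERVER "fe3.example.com:9010";
SHOW FRONTENDS;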
Backend (BE) Nodes [Shared‑Nothing]
- • Store on‑disk columnar files + execute pipelines.
- • Each BE hosts:
- – Query Engine Fragment executors
- – Storage Engine (Rowset manager, Compaction, WAL)
- – Tablet replicas
Compute Nodes (CN) [Shared‑Data]
- • Stateless query executors.
- • Pull tablet data from remote object store via cache layer.
Internal Services per Node
- • BRPC Service – high‑perf RPC between FE/BE.
- • Heart‑beat – every 5 s BE→FE health & tablet‐version report.
- • Txn Manager – 2‑phase commit across tablets for atomic loads.
- • Routine‑Load – Kafka ingest daemon on BE.
2. Data Organization Hierarchy
StarRocks organizes data in a hierarchical structure from logical tables down to physical storage files. Understanding this hierarchy is essential for optimizing performance and data management.
Logical Organization
- • Table: Defined by schema with columns and types
- • Partition: Horizontal data division (time/region/custom)
- • Bucket: Distribution unit within partition
Physical Organization
- • Tablet: Atomic unit of storage and replication
- • Rowset: Collection of segments with delete bitmap
- • Segment: Immutable columnar file (~512 MB)
Key Relationships
- • Each partition contains multiple buckets
- • Each bucket maps to exactly one tablet
- • Each tablet has multiple replicas (typically 3) for fault tolerance
- • Tablets contain multiple rowsets representing different data versions
- • Compaction merges rowsets to optimize read performance
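The hierarchy maps directly onto DDL; a minimal sketch with illustrative table and column names:
CREATE TABLE sales (
    order_date DATE NOT NULL,
    region     VARCHAR(32),
    amount     DECIMAL(10,2)
)
DUPLICATE KEY(order_date, region)
PARTITION BY RANGE(order_date) (
    PARTITION p202504 VALUES [('2025-04-01'), ('2025-05-01'))
)
DISTRIBUTED BY HASH(region) BUCKETS 16      -- 16 buckets ⇒ 16 tablets per partition
PROPERTIES ("replication_num" = "3");       -- each tablet stored as 3 replicas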
3. Metadata, Consensus & Roles
- • Metadata stored in BDB‑JE (local) + delta logs.
- • Raft group of 3‑5 FEs keeps a single‑leader timeline.
- • A majority (⌊N/2⌋ + 1) of follower‑role FEs (leader included) must be alive to accept DDL/DML.
- • SHOW PROC '/journal_id' reveals the Raft term & last index.
- • Observer can be placed in a remote DC for disaster‑recovery reads.
4. Storage Engine Internals
(Diagram: StarRocks storage hierarchy and data‑organization flow)
Compaction Types
- • Cumulative (minor) – merges small rowsets (size tiering).
- • Base (major) – rewrites entire history into one big rowset.
- • Trigger: size ratio, write amplification, delete bitmap threshold.
Encoding & Compression
• LZ4 / ZSTD (v3.2 default) · Column encodings: RLE, DICT, DELTA, FOR, BITMAP.
WAL
- • MemTable flushed to WAL on write; made visible after FE COMMIT.
- • WAL auto‑deleted after rowset created.
5. Ingestion Pathways
- • Stream Load – HTTP PUT directly to FE; synchronous, good for ≤ 200 MB.
- • Broker Load – Hadoop‑ecosystem files; external broker process.
- • Routine Load – continuous Kafka pull; exactly‑once via 2‑phase commit (sketch below).
- • Spark Connector – spark‑starrocks; push‑down & vectorized sink.
- • INSERT … VALUES – trivial SQL; routed through the FE txn manager.
- • INSERT … SELECT – ELT; pipelines run on BEs.
- • External Table – query Parquet/ORC on S3/HDFS without ingest.
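A sketch of a Routine Load job, assuming a CSV Kafka topic named orders and illustrative broker addresses:
-- Continuous exactly-once ingest from Kafka into table orders
CREATE ROUTINE LOAD sales_db.orders_kafka ON orders
COLUMNS TERMINATED BY ","
PROPERTIES ("desired_concurrent_number" = "3")
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic"       = "orders",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);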
Transaction Semantics
- • Begin – automatic for Stream/Broker; manual via BEGIN for multi‑statement transactions.
- • 2PC – FE PREPARE across tablets → BEs COMMIT → FE publishes the version (visible).
- • Idempotency – label‑based; re‑send with the same LABEL=xxx to skip a duplicate load.
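Label‑based idempotency in practice – re‑submitting with the same label is rejected instead of loading twice (table names are illustrative):
-- A retry with the same label fails with "label already exists", so the load applies exactly once
INSERT INTO sales WITH LABEL ins_20250401_01
SELECT * FROM staging_sales;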
6. Advanced Partitioning
Expression Examples
PARTITION BY time_slice(ts, INTERVAL 1 DAY)      -- time-function expression
PARTITION BY date_trunc('month', order_date)     -- calendar truncation
PARTITION BY (bucket_id, order_status)           -- multi-column key (3.1+)
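A fuller sketch of expression (automatic) partitioning – partitions are created on demand as new days arrive; names are illustrative:
CREATE TABLE events (
    ts      DATETIME NOT NULL,
    uid     BIGINT,
    payload JSON
)
DUPLICATE KEY(ts)
PARTITION BY date_trunc('day', ts)
DISTRIBUTED BY HASH(uid) BUCKETS 16;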
Dynamic Partition (legacy) properties
"dynamic_partition.time_unit"="DAY"
"dynamic_partition.end"="-1"
"dynamic_partition.prefix"="p"
...
Time‑based retention
PROPERTIES("partition_live_number"="90") -- keep 90 latest partitions
Split Partition
ALTER TABLE t SPLIT PARTITION p202504 INTO
(PARTITION p1 VALUES [ ('2025-04-01'), ('2025-04-16') ),
PARTITION p2 VALUES [ ('2025-04-16'), ('2025-05-01') ));
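To confirm what the partition operations above produced, list the partitions and their ranges:
SHOW PARTITIONS FROM t;   -- ranges, bucket count, replication factor and data size per partition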
7. Bucketing & Colocation
Alter bucket count (3.2+ Random only)
ALTER TABLE t MODIFY RANDOM_BUCKETS = 128;
Create Colocation Group
CREATE TABLE dim_region ( … )
DISTRIBUTED BY HASH(region_id) BUCKETS 32
PROPERTIES("colocate_with"="group_geo");
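Colocation only takes effect when the joining table declares the same group, the same bucket‑column type and the same bucket count; a sketch assuming dim_region.region_id is an INT:
CREATE TABLE fact_sales (
    region_id INT,
    order_id  BIGINT,
    amount    DECIMAL(10,2)
)
DUPLICATE KEY(region_id, order_id)
DISTRIBUTED BY HASH(region_id) BUCKETS 32
PROPERTIES ("colocate_with" = "group_geo");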
Check Distribution Skew
SHOW DATA SKEW FROM TABLE fact_sales;
Tablet‑to‑Node Mapping
SHOW PROC '/cluster_balance/colocate_group/group_geo';
8. Tablet Lifecycle & Compaction
(Diagram: tablet state transitions)
Schema Change
- • Light: add a column with a default value, rename a column – metadata only.
- • Heavy: change type, drop column – triggers background tablet rebuild.
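Illustrative statements for each class (table t and the columns are placeholders):
ALTER TABLE t ADD COLUMN discount DECIMAL(5,2) DEFAULT "0";   -- light: metadata-only
ALTER TABLE t MODIFY COLUMN amount DECIMAL(18,4);             -- heavy: background tablet rebuild
SHOW ALTER TABLE COLUMN;                                      -- track the rebuild job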
Version Garbage Collection
• FE tracks the oldest active txn; BEs purge stale rowset versions <= that watermark.
9. Replication, Quorum & Recovery
Quorum Math
- replication_num = 3 ⇒ write_quorum = 2 (majority)
- replication_num = 4 ⇒ write_quorum = 3 (⌊N/2⌋ + 1)
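Both knobs are ordinary table properties; a sketch (fact_sales as above; per‑table write_quorum is assumed to be settable in newer releases):
ALTER TABLE fact_sales SET ("default.replication_num" = "3");   -- new partitions get 3 replicas
ALTER TABLE fact_sales SET ("write_quorum" = "MAJORITY");        -- assumed available; also "ONE" / "ALL"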
Replica Selection (read path)
- • LEADER replica chosen for up‑to‑date reads; followers used if versions equal &
enable_replica_preference=true
.
Failure Handling
- • BE down → replicas state UNKNOWN; FE routes around; starts CLONE when node recovers.
- • Disk full → tablet marked BAD; auto‐migration if spare capacity.
- • FE leader crash → Raft election < 5 s.
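A typical triage sequence after a BE failure (fact_sales as above):
SHOW PROC '/statistic';                                          -- unhealthy tablet counts per DB
ADMIN SHOW REPLICA STATUS FROM fact_sales WHERE STATUS != 'OK';  -- replica health for one table
ADMIN REPAIR TABLE fact_sales;                                   -- ask the scheduler to prioritise repair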
10. Shared‑Data (Cloud‑Native) Specifics
- • Object Store Layout: s3://bucket/tablet_id/rowset_uuid/segment_0.dat
- • Cache:
- – Persistent SSD Cache (NVMe) on CN (LRU by segment)
- – Memory page cache for hot stripes (adaptive).
- • Upload: BE writes to local temp, then multipart PUT to S3, then commits metadata.
- • Read: CN first checks SSD; miss → RANGE GET from S3.
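In shared‑data mode the object store location is declared as a storage volume; a sketch with a placeholder bucket and region:
CREATE STORAGE VOLUME s3_vol
TYPE = S3
LOCATIONS = ("s3://my-bucket/starrocks/")
PROPERTIES ("aws.s3.region" = "us-east-1", "aws.s3.use_instance_profile" = "true");
SET s3_vol AS DEFAULT STORAGE VOLUME;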
Tunables
"cloud_native_storage_cache_max_size"= "200GB"
"cloud_native_prefetch" = "true"
11. Query Execution & Performance Knobs
Pipeline Engine Flags
- • enable_pipeline_engine=true (default)
- • pipeline_dop – degree‑of‑parallelism per operator (auto)
- • query_chunk_size – batch rows (default 4096)
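These are ordinary session variables; a quick way to inspect and tune them (pipeline_dop = 0 means the engine picks the parallelism adaptively):
SHOW VARIABLES LIKE 'pipeline%';
SET enable_pipeline_engine = true;
SET pipeline_dop = 0;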
Vectorization
- • Fully columnar; encodes filter→projection chain as vector ops.
- • Disable per‑session:
set enable_vectorized_engine = false;
Runtime Filter Push‑down
• Bloom / IN / min‑max filters are auto‑generated from the join build side.
Important fe.conf / be.conf
- • fe.max.concurrent.txs – ingest concurrency
- • be.exec_mem_limit – per‑query memory
- • be_load_channel_timeout_second – stream‑load idle timeout
12. Operations & Monitoring
Key SHOW commands
- • SHOW BACKENDS;
- • SHOW FRONTENDS;
- • SHOW TABLET WHERE BackendId=10101 AND State!='NORMAL';
- • SHOW PROC '/sys_stat/be' – CPU, IO per BE
- • SHOW PROCESSLIST; – running queries
Metrics Endpoints
- • http://be_host:8040/metrics
- • exporters: Prometheus starrocks‑exporter, Grafana dashboards
Alert Rules
- • Tablet repair backlog > 1000
- • FE raft follower lag > 5000 ops
- • Query latency P95 > SLA
13. Maintenance Workflows
Node Management
ALTER SYSTEM ADD BACKEND "host:9050";
ALTER SYSTEM REBALANCE;
ALTER SYSTEM DECOMMISSION BACKEND "host:9050";
Re‑bucket Partition
ALTER TABLE t MODIFY PARTITION p202504 DISTRIBUTED BY HASH(user_id) BUCKETS 64;
Backup & Restore
• BACKUP SNAPSHOT to S3/HDFS; RESTORE SNAPSHOT into a new DB.
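A sketch of the round trip, assuming a repository my_s3_repo created earlier with CREATE REPOSITORY and illustrative DB/table names:
-- Back up one table into the repository
BACKUP SNAPSHOT sales_db.snapshot_20250401
TO my_s3_repo
ON (fact_sales);
SHOW SNAPSHOT ON my_s3_repo;     -- pick up the backup_timestamp needed for restore
-- Restore into a different database with 3 replicas
RESTORE SNAPSHOT dr_db.snapshot_20250401
FROM my_s3_repo
ON (fact_sales)
PROPERTIES ("backup_timestamp" = "2025-04-01-00-00-00", "replication_num" = "3");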
14. Design Checklist
- Partition key matches WHERE predicates & retention
- Auto‑partition via expression; retention via partition_live_number
- Bucket count ~= data_size / 5 GB; even modulo BE count
- Hash keys align with frequent joins; colocate = on
- replication_num=3, majority quorum; cross‑AZ placement rules
- Monitor compaction score; adjust cumulative compaction policy
- In Shared‑Data, size SSD cache ≥ 10% hot data
- Enable pipeline & vectorized engine; adjust query_chunk_size if rows wide
- Keep tablet count < 300k; split DBs if needed
- Schedule weekly REBALANCE; check SHOW DATA SKEW
- Snapshot backup daily + incremental; test RESTORE quarterly