StarRocks Deep‑Dive Cheat‑Sheet

A comprehensive technical reference for StarRocks

1. Architecture Deep‑Dive

Frontend (FE) Nodes

  • Brain of the cluster – SQL parsing, query optimization, global metadata.
  • Raft consensus on metadata edits.
  • Roles:
    – Leader ✦ writes metadata, schedules jobs
    – Follower ✦ hot-standby voting node
    – Observer ✦ read-only (scales out metadata reads)

Backend (BE) Nodes [Shared‑Nothing]

  • Store on-disk columnar files + execute query pipelines.
  • Each BE hosts:
    – Query-engine fragment executors
    – Storage engine (rowset manager, compaction, WAL)
    – Tablet replicas

Compute Nodes (CN) [Shared‑Data]

  • Stateless query executors.
  • Pull tablet data from the remote object store via a cache layer.

Internal Services per Node

  • BRPC service – high-performance RPC between FE/BE.
  • Heartbeat – every 5 s BE→FE health & tablet-version report.
  • Txn Manager – 2-phase commit across tablets for atomic loads.
  • Routine Load – Kafka ingest daemon on BE.
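
Node health and registration can be checked with the built-in SHOW commands; a minimal sketch (output columns vary by version):

-- Shared-nothing: BE liveness, last heartbeat, tablet count, disk usage
SHOW BACKENDS;

-- Shared-data: stateless compute nodes registered with the FE
SHOW COMPUTE NODES;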

2. Data Organization Hierarchy

StarRocks organizes data in a hierarchical structure from logical tables down to physical storage files. Understanding this hierarchy is essential for optimizing performance and data management.

Hierarchy (top → bottom)

  • Table – logical container for data, defined by schema
  • Partition – horizontal data division based on partition key (e.g., time, region)
  • Bucket – data distribution unit within a partition (hash/random)
  • Tablet – physical storage and replication unit
  • Rowset – collection of segments + delete bitmap

Logical Organization

  • Table: Defined by schema with columns and types
  • Partition: Horizontal data division (time/region/custom)
  • Bucket: Distribution unit within partition

Physical Organization

  • Tablet: Atomic unit of storage and replication
  • Rowset: Collection of segments with delete bitmap
  • Segment: Immutable columnar file (~512 MB)

Key Relationships

  • Each partition contains multiple buckets.
  • Each bucket maps to exactly one tablet.
  • Each tablet has multiple replicas (typically 3) for fault tolerance.
  • Tablets contain multiple rowsets representing different data versions.
  • Compaction merges rowsets to optimize read performance (see the DDL sketch below).
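
A minimal DDL sketch tying the layers together (table, column, and partition names are illustrative):

-- Table → Partition (by day) → Bucket (hash) → 3 tablet replicas
CREATE TABLE fact_sales (
    order_date  DATE        NOT NULL,
    user_id     BIGINT      NOT NULL,
    region      VARCHAR(32),
    amount      DECIMAL(18, 2)
)
DUPLICATE KEY (order_date, user_id)
PARTITION BY RANGE (order_date) (
    PARTITION p20250401 VALUES [('2025-04-01'), ('2025-04-02')),
    PARTITION p20250402 VALUES [('2025-04-02'), ('2025-04-03'))
)
DISTRIBUTED BY HASH (user_id) BUCKETS 32
PROPERTIES ("replication_num" = "3");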

3. Metadata, Consensus & Roles

  • Metadata stored in BDB-JE (local) + delta logs.
  • Raft group of 3–5 FEs keeps a single-leader timeline.
  • A majority (⌊N/2⌋ + 1) of FEs must be alive to accept DDL/DML.
  • SHOW PROC '/journal_id' reveals Raft term & last index.
  • An Observer can be placed in a remote DC for disaster-recovery reads.
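
FE roles and journal replay state can also be inspected directly; a sketch (exact column names differ across releases):

-- Shows Role (LEADER/FOLLOWER/OBSERVER), Alive, and the replayed journal id per FE
SHOW FRONTENDS;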

4. Storage Engine Internals

Storage Hierarchy

Data organization flow: Tablet → Rowset → Segment → Column Data

Compaction Types

  • Cumulative (minor) – merges small rowsets (size-tiered).
  • Base (major) – rewrites the entire history into one big rowset.
  • Triggers: size ratio, write amplification, delete-bitmap threshold (see the tablet check below).
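
Compaction backlog on a table can be spotted from per-tablet version counts; a sketch (the output columns depend on the release):

-- Tablets reporting many rowset versions are still waiting on compaction
SHOW TABLET FROM fact_sales;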

Encoding & Compression

• LZ4 / ZSTD (v3.2 default) · Column encodings: RLE, DICT, DELTA, FOR, BITMAP.
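
Compression is chosen per table; a minimal sketch assuming the table-level "compression" property (names and sizes are illustrative):

CREATE TABLE events (
    ts      DATETIME,
    payload VARCHAR(1024)
)
DUPLICATE KEY (ts)
DISTRIBUTED BY HASH (ts) BUCKETS 16
PROPERTIES ("compression" = "ZSTD");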

WAL

  • Incoming writes are logged to the WAL and buffered in the MemTable; data becomes visible only after the FE COMMIT.
  • The WAL is auto-deleted once the rowset has been created.

5. Ingestion Pathways

Stream Load

HTTP PUT directly to FE; synchronous, good for ≤ 200 MB.

Broker Load

Hadoop ecosystem files; external broker process.

Routine Load

Continuous Kafka pull; exactly‑once via 2‑phase commit.
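
A minimal Routine Load sketch (database, job, broker list, and topic names are placeholders):

CREATE ROUTINE LOAD demo_db.orders_kafka_job ON orders
COLUMNS TERMINATED BY ","
PROPERTIES (
    "desired_concurrent_number" = "3"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "orders",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);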

Spark Connector

spark‑starrocks; push‑down & vectorized sink.

INSERT … VALUES

trivial SQL; routed through the FE txn manager.

INSERT … SELECT

ELT; pipelines run on BEs.

External Table

query Parquet/ORC on S3/HDFS without ingest.
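
One way to query files in place (3.1+) is the FILES() table function; a sketch with placeholder paths and credentials:

SELECT *
FROM FILES (
    "path"              = "s3://my-bucket/events/*.parquet",
    "format"            = "parquet",
    "aws.s3.region"     = "us-east-1",
    "aws.s3.access_key" = "<ak>",
    "aws.s3.secret_key" = "<sk>"
)
LIMIT 10;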

Transaction Semantics

  • Begin – automatic for Stream/Broker Load; manual via BEGIN for multi-statement transactions.
  • 2PC – FE issues PREPARE across tablets → BEs COMMIT → FE publishes the visible version.
  • Idempotency – label-based; re-sending with the same LABEL=xxx skips the duplicate (see the labeled INSERT below).
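
For label-based idempotency on SQL loads, an explicit label can be attached to the INSERT; a sketch (the label value is arbitrary but must be unique per database):

-- Re-running the same statement with the same label is rejected as a duplicate,
-- so a client can safely retry after a timeout.
INSERT INTO fact_sales WITH LABEL load_fact_sales_20250401
SELECT * FROM staging_sales WHERE order_date = '2025-04-01';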

6. Advanced Partitioning

Expression Examples

PARTITION BY expr(time_slice(ts, INTERVAL 1 DAY, '1970-01-01'))
PARTITION BY expr(region || '_' || to_char(order_date, 'YYYYMM'))
PARTITION BY expr(bucket_id, order_status)  -- Multi-column key (3.1+)

Dynamic Partition (legacy) properties

"dynamic_partition.time_unit"="DAY"
"dynamic_partition.end"="-1"
"dynamic_partition.prefix"="p"
...
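
A fuller sketch of the legacy dynamic-partition properties (values are illustrative):

PROPERTIES (
    "dynamic_partition.enable"    = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start"     = "-30",   -- drop partitions older than 30 days
    "dynamic_partition.end"       = "3",     -- pre-create 3 future partitions
    "dynamic_partition.prefix"    = "p",
    "dynamic_partition.buckets"   = "16"
);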

Time‑based retention

PROPERTIES("partition_live_number"="90")   -- keep 90 latest partitions

Split Partition

ALTER TABLE t SPLIT PARTITION p202504 INTO
   (PARTITION p1 VALUES [ ('2025-04-01'), ('2025-04-16') ),
    PARTITION p2 VALUES [ ('2025-04-16'), ('2025-05-01') ));

7. Bucketing & Colocation

Alter bucket count (3.2+ Random only)

ALTER TABLE t MODIFY RANDOM_BUCKETS = 128;

Create Colocation Group

CREATE TABLE dim_region ( … )
  DISTRIBUTED BY HASH(region_id) BUCKETS 32
  PROPERTIES("colocate_with"="group_geo");

Check Distribution Skew

SHOW DATA SKEW FROM TABLE fact_sales;

Tablet‑to‑Node Mapping

SHOW PROC '/cluster_balance/colocate_group/group_geo';

8. Tablet Lifecycle & Compaction

State Transitions

NORMAL → ALTER (schema change) → NORMAL
NORMAL → CLONE (replica repair) → NORMAL

Schema Change

  • Light: add a column with a default, rename a column – metadata-only.
  • Heavy: change a column type, drop a column – triggers a background tablet rebuild (see the ALTER sketch below).
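
A sketch of the two classes of schema change (table and column names are illustrative):

-- Light: metadata-only, returns almost immediately
ALTER TABLE t ADD COLUMN discount DECIMAL(18, 2) DEFAULT "0";

-- Heavy: a type change (e.g., INT → BIGINT) rewrites tablets in the background;
-- progress can be followed with SHOW ALTER TABLE COLUMN;
ALTER TABLE t MODIFY COLUMN qty BIGINT;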

Version GC

• The FE tracks the oldest active txn; BEs purge rowset versions at or below that watermark.

9. Replication, Quorum & Recovery

Quorum Math

  • replication_num = 3 ⇒ write_quorum = 2 (majority)
  • replication_num = 4 ⇒ write_quorum = 3 (majority = ⌊N/2⌋ + 1)

Replica Selection (read path)

  • LEADER replica chosen for up-to-date reads; followers used if versions are equal & enable_replica_preference=true.

Failure Handling

  • BE down → its replicas are marked UNKNOWN; the FE routes around them and starts CLONE when the node recovers.
  • Disk full → tablet marked BAD; auto-migration if spare capacity exists.
  • FE leader crash → Raft re-election completes in < 5 s.
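
Replica health can be checked and repairs nudged manually; a sketch (availability of these ADMIN statements may vary slightly by version):

-- List replicas whose status is not OK (e.g., DEAD, VERSION_ERROR, MISSING)
ADMIN SHOW REPLICA STATUS FROM fact_sales WHERE STATUS != "OK";

-- Raise repair priority for a table's tablets
ADMIN REPAIR TABLE fact_sales;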

10. Shared‑Data (Cloud‑Native) Specifics

  • Object store layout: s3://bucket/tablet_id/rowset_uuid/segment_0.dat
  • Cache:
    – Persistent SSD cache (NVMe) on CN (LRU by segment)
    – In-memory page cache for hot stripes (adaptive)
  • Upload: BE writes to a local temp file, multipart-PUTs it to S3, then commits metadata.
  • Read: CN checks the SSD cache first; on a miss it issues a RANGE GET to S3.

Tunables

"cloud_native_storage_cache_max_size"= "200GB"
"cloud_native_prefetch" = "true"

11. Query Execution & Performance Knobs

Pipeline Engine Flags

  • enable_pipeline_engine=true (default)
  • pipeline_dop – degree‑of‑parallelism per operator (auto)
  • query_chunk_size – batch rows (default 4096)
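
These knobs are ordinary session variables, so they can be tried per session before touching config files; a sketch using the names listed above:

SET enable_pipeline_engine = true;
SET pipeline_dop = 0;          -- 0 lets the engine pick the degree of parallelism
EXPLAIN SELECT region, SUM(amount) FROM fact_sales GROUP BY region;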

Vectorization

  • Fully columnar; the filter→projection chain is encoded as vector ops.
  • Disable per session: SET enable_vectorized_engine = false;

Runtime Filter Push‑down

• Bloom, IN, and min/max runtime filters are auto-generated from the join build side and pushed down to the probe side.

Important fe.conf / be.conf

  • fe.max.concurrent.txs – ingest concurrency
  • be.exec_mem_limit – per-query memory
  • be_load_channel_timeout_second – stream-load idle timeout

12. Operations & Monitoring

Key SHOW commands

  • SHOW BACKENDS;
  • SHOW FRONTENDS;
  • SHOW TABLET WHERE BackendId=10101 AND State!='NORMAL';
  • SHOW PROC '/sys_stat/be' – CPU, IO per BE
  • SHOW PROCESSLIST; – running queries

Metrics Endpoints

  • http://be_host:8040/metrics
  • Exporters: Prometheus starrocks-exporter, Grafana dashboards

Alert Rules

  • Tablet repair backlog > 1000
  • FE Raft follower lag > 5000 ops
  • Query latency P95 > SLA

13. Maintenance Workflows

Node Management

ALTER SYSTEM ADD BACKEND "host:9050";
ALTER SYSTEM REBALANCE;
ALTER SYSTEM DECOMMISSION BACKEND "host:9050";

Re‑bucket Partition

ALTER TABLE t MODIFY PARTITION p202504 DISTRIBUTED BY HASH(user_id) BUCKETS 64;

Backup & Restore

BACKUP SNAPSHOT writes to a repository on S3/HDFS; RESTORE SNAPSHOT loads it into a new database (see the sketch below).
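
An end-to-end sketch (repository name, bucket, credentials, and timestamp are placeholders; the repository DDL roughly follows the documented S3 form):

-- One-time: register an S3 repository
CREATE REPOSITORY s3_repo
WITH BROKER
ON LOCATION "s3a://my-bucket/backups"
PROPERTIES (
    "aws.s3.access_key" = "<ak>",
    "aws.s3.secret_key" = "<sk>",
    "aws.s3.endpoint"   = "https://s3.us-east-1.amazonaws.com"
);

-- Back up one table into a named snapshot
BACKUP SNAPSHOT sales_db.snap_20250501
TO s3_repo
ON (fact_sales);

-- Restore it (the backup_timestamp comes from SHOW SNAPSHOT ON s3_repo)
RESTORE SNAPSHOT sales_db.snap_20250501
FROM s3_repo
ON (fact_sales)
PROPERTIES ("backup_timestamp" = "<ts-from-SHOW-SNAPSHOT>");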

14. Design Checklist

  • Partition key matches WHERE predicates & retention
  • Auto‑partition via expression; retention via partition_live_number
  • Bucket count ≈ data_size / 5 GB; keep it a multiple of the BE count for even distribution
  • Hash keys align with frequent joins; colocate = on
  • replication_num=3, majority quorum; cross‑AZ placement rules
  • Monitor compaction score; adjust cumulative compaction policy
  • In Shared‑Data, size SSD cache ≥ 10% hot data
  • Enable pipeline & vectorized engine; adjust query_chunk_size if rows wide
  • Keep tablet count < 300k; split DBs if needed
  • Schedule weekly REBALANCE; check SHOW DATA SKEW
  • Snapshot backup daily + incremental; test RESTORE quarterly