StarRocks Deep‑Dive Cheat‑Sheet

A comprehensive technical reference for StarRocks

1. Architecture Deep‑Dive

Frontend (FE) Nodes

  • Brain of the cluster – SQL parsing, query optimization, global metadata.
  • Raft consensus on metadata edits.
  • Roles:
    – Leader ✦ writes metadata, schedules jobs
    – Follower ✦ hot-standby voting node
    – Observer ✦ read-only (scales out metadata reads)

Backend (BE) Nodes [Shared‑Nothing]

  • Store on-disk columnar files + execute query pipelines.
  • Each BE hosts:
    – Query-engine fragment executors
    – Storage engine (rowset manager, compaction, WAL)
    – Tablet replicas

Compute Nodes (CN) [Shared‑Data]

  • Stateless query executors.
  • Pull tablet data from the remote object store via a cache layer.

Internal Services per Node

  • BRPC service – high-performance RPC between FE/BE.
  • Heartbeat – every 5 s BE→FE health & tablet-version report.
  • Txn Manager – 2-phase commit across tablets for atomic loads.
  • Routine Load – Kafka ingest daemon on BE.
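
Node health and registration can be checked with the built-in SHOW commands; a minimal sketch (output columns vary by version):

-- Shared-nothing: BE liveness, last heartbeat, tablet count, disk usage
SHOW BACKENDS;

-- Shared-data: stateless compute nodes registered with the FE
SHOW COMPUTE NODES;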

2. Data Organization Hierarchy

StarRocks organizes data in a hierarchical structure from logical tables down to physical storage files. Understanding this hierarchy is essential for optimizing performance and data management.

Hierarchy (top → bottom)

  • Table – logical container for data, defined by schema
  • Partition – horizontal data division based on partition key (e.g., time, region)
  • Bucket – data distribution unit within a partition (hash/random)
  • Tablet – physical storage and replication unit
  • Rowset – collection of segments + delete bitmap

Logical Organization

  • Table: Defined by schema with columns and types
  • Partition: Horizontal data division (time/region/custom)
  • Bucket: Distribution unit within partition

Physical Organization

  • Tablet: Atomic unit of storage and replication
  • Rowset: Collection of segments with delete bitmap
  • Segment: Immutable columnar file (~512 MB)

Key Relationships

  • Each partition contains multiple buckets.
  • Each bucket maps to exactly one tablet.
  • Each tablet has multiple replicas (typically 3) for fault tolerance.
  • Tablets contain multiple rowsets representing different data versions.
  • Compaction merges rowsets to optimize read performance (see the DDL sketch below).
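
A minimal DDL sketch tying the layers together (table, column, and partition names are illustrative):

-- Table → Partition (by day) → Bucket (hash) → 3 tablet replicas
CREATE TABLE fact_sales (
    order_date  DATE        NOT NULL,
    user_id     BIGINT      NOT NULL,
    region      VARCHAR(32),
    amount      DECIMAL(18, 2)
)
DUPLICATE KEY (order_date, user_id)
PARTITION BY RANGE (order_date) (
    PARTITION p20250401 VALUES [('2025-04-01'), ('2025-04-02')),
    PARTITION p20250402 VALUES [('2025-04-02'), ('2025-04-03'))
)
DISTRIBUTED BY HASH (user_id) BUCKETS 32
PROPERTIES ("replication_num" = "3");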

3. Metadata, Consensus & Roles

  • Metadata stored in BDB-JE (local) + delta logs.
  • Raft group of 3–5 FEs keeps a single-leader timeline.
  • A majority (⌊N/2⌋ + 1) of FEs must be alive to accept DDL/DML.
  • SHOW PROC '/journal_id' reveals Raft term & last index.
  • An Observer can be placed in a remote DC for disaster-recovery reads.
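
FE roles and journal replay state can also be inspected directly; a sketch (exact column names differ across releases):

-- Shows Role (LEADER/FOLLOWER/OBSERVER), Alive, and the replayed journal id per FE
SHOW FRONTENDS;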

4. Storage Engine Internals

Storage Hierarchy

Data organization flow: Tablet → Rowset → Segment → Column Data

Compaction Types

  • Cumulative (minor) – merges small rowsets (size-tiered).
  • Base (major) – rewrites the entire history into one big rowset.
  • Triggers: size ratio, write amplification, delete-bitmap threshold (see the tablet check below).
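
Compaction backlog on a table can be spotted from per-tablet version counts; a sketch (the output columns depend on the release):

-- Tablets reporting many rowset versions are still waiting on compaction
SHOW TABLET FROM fact_sales;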

Encoding & Compression

• LZ4 / ZSTD (v3.2 default) · Column encodings: RLE, DICT, DELTA, FOR, BITMAP.
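
Compression is chosen per table; a minimal sketch assuming the table-level "compression" property (names and sizes are illustrative):

CREATE TABLE events (
    ts      DATETIME,
    payload VARCHAR(1024)
)
DUPLICATE KEY (ts)
DISTRIBUTED BY HASH (ts) BUCKETS 16
PROPERTIES ("compression" = "ZSTD");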

WAL

  • Incoming writes are logged to the WAL and buffered in the MemTable; data becomes visible only after the FE COMMIT.
  • The WAL is auto-deleted once the rowset has been created.

5. Ingestion Pathways

Stream Load

HTTP PUT directly to FE; synchronous, good for ≤ 200 MB.

Broker Load

Hadoop ecosystem files; external broker process.

Routine Load

Continuous Kafka pull; exactly‑once via 2‑phase commit.
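
A minimal Routine Load sketch (database, job, broker list, and topic names are placeholders):

CREATE ROUTINE LOAD demo_db.orders_kafka_job ON orders
COLUMNS TERMINATED BY ","
PROPERTIES (
    "desired_concurrent_number" = "3"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "orders",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);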

Spark Connector

spark‑starrocks; push‑down & vectorized sink.

INSERT … VALUES

trivial SQL; routed through the FE txn manager.

INSERT … SELECT

ELT; pipelines run on BEs.

External Table

query Parquet/ORC on S3/HDFS without ingest.
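
One way to query files in place (3.1+) is the FILES() table function; a sketch with placeholder paths and credentials:

SELECT *
FROM FILES (
    "path"              = "s3://my-bucket/events/*.parquet",
    "format"            = "parquet",
    "aws.s3.region"     = "us-east-1",
    "aws.s3.access_key" = "<ak>",
    "aws.s3.secret_key" = "<sk>"
)
LIMIT 10;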

Transaction Semantics

  • Begin – automatic for Stream/Broker Load; manual via BEGIN for multi-statement transactions.
  • 2PC – FE issues PREPARE across tablets → BEs COMMIT → FE publishes the visible version.
  • Idempotency – label-based; re-sending with the same LABEL=xxx skips the duplicate (see the labeled INSERT below).
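
For label-based idempotency on SQL loads, an explicit label can be attached to the INSERT; a sketch (the label value is arbitrary but must be unique per database):

-- Re-running the same statement with the same label is rejected as a duplicate,
-- so a client can safely retry after a timeout.
INSERT INTO fact_sales WITH LABEL load_fact_sales_20250401
SELECT * FROM staging_sales WHERE order_date = '2025-04-01';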

6. Advanced Partitioning

Expression Examples

PARTITION BY expr(time_slice(ts, INTERVAL 1 DAY, '1970-01-01'))
PARTITION BY expr(region || '_' || to_char(order_date, 'YYYYMM'))
PARTITION BY expr(bucket_id, order_status)  -- Multi-column key (3.1+)

Dynamic Partition (legacy) properties

"dynamic_partition.time_unit"="DAY"
"dynamic_partition.end"="-1"
"dynamic_partition.prefix"="p"
...
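
A fuller sketch of the legacy dynamic-partition properties (values are illustrative):

PROPERTIES (
    "dynamic_partition.enable"    = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start"     = "-30",   -- drop partitions older than 30 days
    "dynamic_partition.end"       = "3",     -- pre-create 3 future partitions
    "dynamic_partition.prefix"    = "p",
    "dynamic_partition.buckets"   = "16"
);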

Time‑based retention

PROPERTIES("partition_live_number"="90")   -- keep 90 latest partitions

Split Partition

ALTER TABLE t SPLIT PARTITION p202504 INTO
   (PARTITION p1 VALUES [ ('2025-04-01'), ('2025-04-16') ),
    PARTITION p2 VALUES [ ('2025-04-16'), ('2025-05-01') ));

7. Bucketing & Colocation

Alter bucket count (3.2+ Random only)

ALTER TABLE t MODIFY RANDOM_BUCKETS = 128;

Create Colocation Group

CREATE TABLE dim_region ( … )
  DISTRIBUTED BY HASH(region_id) BUCKETS 32
  PROPERTIES("colocate_with"="group_geo");

Check Distribution Skew

SHOW DATA SKEW FROM TABLE fact_sales;

Tablet‑to‑Node Mapping

SHOW PROC '/cluster_balance/colocate_group/group_geo';

8. Tablet Lifecycle & Compaction

State Transitions

NORMAL → ALTER (schema change) → NORMAL
NORMAL → CLONE (replica repair) → NORMAL

Schema Change

  • Light: add a column with a default, rename a column – metadata-only.
  • Heavy: change a column type, drop a column – triggers a background tablet rebuild (see the ALTER sketch below).
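
A sketch of the two classes of schema change (table and column names are illustrative):

-- Light: metadata-only, returns almost immediately
ALTER TABLE t ADD COLUMN discount DECIMAL(18, 2) DEFAULT "0";

-- Heavy: a type change (e.g., INT → BIGINT) rewrites tablets in the background;
-- progress can be followed with SHOW ALTER TABLE COLUMN;
ALTER TABLE t MODIFY COLUMN qty BIGINT;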

Version GC

• The FE tracks the oldest active txn; BEs purge rowset versions at or below that watermark.

9. Replication, Quorum & Recovery

Quorum Math

  • replication_num = 3 ⇒ write_quorum = 2 (majority)
  • replication_num = 4 ⇒ write_quorum = 3 (majority = ⌊N/2⌋ + 1)

Replica Selection (read path)

  • LEADER replica chosen for up-to-date reads; followers used if versions are equal & enable_replica_preference=true.

Failure Handling

  • BE down → its replicas are marked UNKNOWN; the FE routes around them and starts CLONE when the node recovers.
  • Disk full → tablet marked BAD; auto-migration if spare capacity exists.
  • FE leader crash → Raft re-election completes in < 5 s.
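
Replica health can be checked and repairs nudged manually; a sketch (availability of these ADMIN statements may vary slightly by version):

-- List replicas whose status is not OK (e.g., DEAD, VERSION_ERROR, MISSING)
ADMIN SHOW REPLICA STATUS FROM fact_sales WHERE STATUS != "OK";

-- Raise repair priority for a table's tablets
ADMIN REPAIR TABLE fact_sales;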

10. Shared‑Data (Cloud‑Native) Specifics

  • Object store layout: s3://bucket/tablet_id/rowset_uuid/segment_0.dat
  • Cache:
    – Persistent SSD cache (NVMe) on CN (LRU by segment)
    – In-memory page cache for hot stripes (adaptive)
  • Upload: BE writes to a local temp file, multipart-PUTs it to S3, then commits metadata.
  • Read: CN checks the SSD cache first; on a miss it issues a RANGE GET to S3.

Tunables

"cloud_native_storage_cache_max_size"= "200GB"
"cloud_native_prefetch" = "true"

11. Query Execution & Performance Knobs

Pipeline Engine Flags

  • enable_pipeline_engine=true (default)
  • pipeline_dop – degree‑of‑parallelism per operator (auto)
  • query_chunk_size – batch rows (default 4096)
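
These knobs are ordinary session variables, so they can be tried per session before touching config files; a sketch using the names listed above:

SET enable_pipeline_engine = true;
SET pipeline_dop = 0;          -- 0 lets the engine pick the degree of parallelism
EXPLAIN SELECT region, SUM(amount) FROM fact_sales GROUP BY region;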

Vectorization

  • Fully columnar; the filter→projection chain is encoded as vector ops.
  • Disable per session: SET enable_vectorized_engine = false;

Runtime Filter Push‑down

• Bloom, IN, and min/max runtime filters are auto-generated from the join build side and pushed down to the probe side.

Important fe.conf / be.conf

  • fe.max.concurrent.txs – ingest concurrency
  • be.exec_mem_limit – per-query memory
  • be_load_channel_timeout_second – stream-load idle timeout

12. Operations & Monitoring

Key SHOW commands

  • SHOW BACKENDS;
  • SHOW FRONTENDS;
  • SHOW TABLET WHERE BackendId=10101 AND State!='NORMAL';
  • SHOW PROC '/sys_stat/be' – CPU, IO per BE
  • SHOW PROCESSLIST; – running queries

Metrics Endpoints

  • http://be_host:8040/metrics
  • Exporters: Prometheus starrocks-exporter, Grafana dashboards

Alert Rules

  • Tablet repair backlog > 1000
  • FE Raft follower lag > 5000 ops
  • Query latency P95 > SLA

13. Maintenance Workflows

Node Management

ALTER SYSTEM ADD BACKEND "host:9050";
ALTER SYSTEM REBALANCE;
ALTER SYSTEM DECOMMISSION BACKEND "host:9050";

Re‑bucket Partition

ALTER TABLE t MODIFY PARTITION p202504 DISTRIBUTED BY HASH(user_id) BUCKETS 64;

Backup & Restore

BACKUP SNAPSHOT writes to a repository on S3/HDFS; RESTORE SNAPSHOT loads it into a new database (see the sketch below).
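
An end-to-end sketch (repository name, bucket, credentials, and timestamp are placeholders; the repository DDL roughly follows the documented S3 form):

-- One-time: register an S3 repository
CREATE REPOSITORY s3_repo
WITH BROKER
ON LOCATION "s3a://my-bucket/backups"
PROPERTIES (
    "aws.s3.access_key" = "<ak>",
    "aws.s3.secret_key" = "<sk>",
    "aws.s3.endpoint"   = "https://s3.us-east-1.amazonaws.com"
);

-- Back up one table into a named snapshot
BACKUP SNAPSHOT sales_db.snap_20250501
TO s3_repo
ON (fact_sales);

-- Restore it (the backup_timestamp comes from SHOW SNAPSHOT ON s3_repo)
RESTORE SNAPSHOT sales_db.snap_20250501
FROM s3_repo
ON (fact_sales)
PROPERTIES ("backup_timestamp" = "<ts-from-SHOW-SNAPSHOT>");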

14. Design Checklist

  • Partition key matches WHERE predicates & retention
  • Auto‑partition via expression; retention via partition_live_number
  • Bucket count ≈ data_size / 5 GB; keep it a multiple of the BE count for even distribution
  • Hash keys align with frequent joins; colocate = on
  • replication_num=3, majority quorum; cross‑AZ placement rules
  • Monitor compaction score; adjust cumulative compaction policy
  • In Shared‑Data, size SSD cache ≥ 10% hot data
  • Enable pipeline & vectorized engine; adjust query_chunk_size if rows wide
  • Keep tablet count < 300k; split DBs if needed
  • Schedule weekly REBALANCE; check SHOW DATA SKEW
  • Snapshot backup daily + incremental; test RESTORE quarterly