Apache Iggy
Introduction

Architecture

Iggy is a persistent message streaming server, which means that messages are stored in the form of an append-only log. You can create multiple streams consisting of topics, which may have one or more partitions assigned, e.g. to achieve horizontal scalability across many independent consumers or higher system resiliency. You can think of Iggy as an alternative to Kafka or RabbitMQ Streams.

Let's discuss in-depth what these concepts are all about and how they relate to each other.

Message streaming

There's a high chance that you've already used messaging tools such as RabbitMQ or Kafka, just to name a few. While they might look similar at first glance, and you can indeed achieve similar results with all of them (e.g. publishing and consuming events across different applications built on a microservices architecture), they are quite different at their core.

The main difference is that RabbitMQ (except for the recently released Streams plugin) is a message broker, which means it is responsible for delivering messages to consumers. It works in a FIFO (First In, First Out) manner, and messages are kept in queues. For example, if you have multiple distinct consumers, each one creates its own queue, the message is replicated to each queue, and each consumer reads messages from its own queue. Once a message is processed, it's gone from the queue, so there's no built-in way to replay past messages. The more consumers you have, the more queues you have to create, which may result in more resources being used. The typical message broker follows the so-called smart pipes and dumb endpoints pattern.

On the other hand, Kafka is a message streaming platform, meaning that it's not responsible for delivering messages to consumers, but rather stores them in the form of an append-only log. Consumers are responsible for reading messages from the log and processing them. You can have multiple distinct consumers, and this doesn't affect resource usage, as there's only one log. Consumers can read messages from the beginning or from a specific offset, so you can replay messages. The typical message streaming platform follows the so-called dumb pipes and smart endpoints pattern.

There are advantages and disadvantages to both approaches, but the main difference is that a message broker is responsible for delivering messages to consumers, while a message streaming platform is not. The message broker is the more mature concept, but the message streaming platform is gaining popularity, especially in the cloud-native world. You can also achieve much higher performance and throughput with a message streaming platform, since it acts as a simple database optimized for append-only operations that can be queried very efficiently.

As you might've guessed by now, Iggy is the latter: a message streaming platform.

How a message flows through Iggy

Before diving into the architecture details, here's the complete journey of a message from client to disk:

  1. Client - the client sends messages via TCP/QUIC/WS/HTTP
  2. Listener - a transport listener receives the request on a shard thread
  3. Router - the request is routed to the owning shard via an IggyNamespace hash
  4. Stream - stream lookup by ID (metadata read from left-right)
  5. Topic - topic lookup within the stream; compression is applied
  6. Partition - messages are buffered in the MemoryMessageJournal
  7. Segment - flushed to the .log file via vectored I/O (io_uring)

Thread per core (shared nothing) + io_uring

Iggy uses a thread-per-core, shared-nothing architecture combined with io_uring for maximum performance. This design has been proven by systems like ScyllaDB and Redpanda, and is inspired by the Seastar framework.

How it works

Each CPU core runs its own shard (an instance of IggyShard), pinned to a specific core via sched_setaffinity on Linux. Each shard has its own single-threaded compio async runtime, which means there is no cross-thread synchronization needed within a shard. Memory is bound to the NUMA node of the core via hwlocality for optimal memory access latency.

Asymmetric shard roles

Shards have asymmetric roles:

  • Shard 0 (Primary) holds the exclusive MetadataWriter, runs the HTTP and QUIC servers, and acts as the VSR consensus primary
  • Shards 1..N (Workers) hold MetadataReader clones, manage local partitions, and run TCP/WebSocket servers for load distribution

Request routing

Requests are routed between shards using message passing (via crossfire bounded mpsc channels), which avoids locking entirely. The routing logic splits operations into two planes:

  • Metadata operations (create/delete stream/topic/user etc.) always go to shard 0
  • Partition operations (send_messages, store_consumer_offset) are routed to the shard owning that partition via a DashMap lookup

Partitions are distributed between shards using Murmur3 hashing with correction for small partition counts:

pub fn calculate_shard_assignment(ns: &IggyNamespace, upperbound: u32) -> u16 {
    let mut hasher = Murmur3Hasher::default();
    hasher.write_u64(ns.inner());
    let hash = hasher.finish32();
    // Murmur3 has problems with weak lower bits for small integer inputs, so we use bits from the middle.
    return ((hash >> 16) % upperbound) as u16;
}

IggyNamespace

The IggyNamespace packs stream_id (12 bits), topic_id (12 bits), and partition_id (20 bits) into a single u64 for efficient hashing and routing. This gives maximums of 4096 streams, 4096 topics per stream, and 1,000,000 partitions per topic.

IggyNamespace Bit Packing (u64)

Stream, topic, and partition IDs are packed into a single u64 for efficient hashing and shard routing:

  • bits 63..44 - unused (20 bits)
  • bits 43..32 - stream_id (12 bits, up to 4,096 streams)
  • bits 31..20 - topic_id (12 bits, up to 4,096 topics per stream)
  • bits 19..0 - partition_id (20 bits, up to 1,000,000 partitions per topic)
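The bit layout above can be sketched with a pair of pack/unpack helpers. The function names are hypothetical illustrations of the documented layout, not Iggy's actual API:

```rust
// Pack stream/topic/partition IDs into a single u64, using the bit
// positions described above (bits 43..32, 31..20, and 19..0).
fn pack_namespace(stream_id: u64, topic_id: u64, partition_id: u64) -> u64 {
    debug_assert!(stream_id < (1 << 12));
    debug_assert!(topic_id < (1 << 12));
    debug_assert!(partition_id < (1 << 20));
    (stream_id << 32) | (topic_id << 20) | partition_id
}

// Recover the three IDs from the packed value.
fn unpack_namespace(ns: u64) -> (u64, u64, u64) {
    ((ns >> 32) & 0xFFF, (ns >> 20) & 0xFFF, ns & 0xF_FFFF)
}

fn main() {
    let ns = pack_namespace(3, 7, 42);
    assert_eq!(unpack_namespace(ns), (3, 7, 42));
}
```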

CPU allocation modes

The sharding system supports multiple allocation modes via the cpu_allocation config:

  • "all" - one shard per available CPU core
  • A numeric value (e.g. 4) - exactly N shards pinned to cores 0..N
  • A range (e.g. "5..8") - shards on specified core range
  • "numa:auto" - automatically detect NUMA topology and bind accordingly
  • "numa:nodes=0,1;cores=4;no_ht=true" - fine-grained NUMA control per node with hyperthread avoidance

io_uring and compio

Traditional async runtimes like Tokio use epoll, which is readiness-based: you ask the kernel "is this file descriptor ready?" and then perform the I/O yourself. The Linux kernel considers regular files "always ready" for epoll, which means Tokio has to outsource file I/O to a blocking thread pool (up to 512 threads). This does not scale well.

io_uring is completion-based - you submit I/O requests to a submission queue (SQ), and the kernel completes them and places results in a completion queue (CQ). Both queues are lock-free ring buffers shared between user space and kernel. This is fundamentally better for disk I/O.

epoll (readiness-based)

  1. The app asks: "Is FD ready?"
  2. The kernel replies: "Yes, it's ready"
  3. The app performs the I/O itself

Files are "always ready" for epoll, so Tokio uses a blocking thread pool (up to 512 threads) for file I/O.

io_uring (completion-based)

  1. The app submits I/O to the submission queue (SQ)
  2. The kernel completes the I/O asynchronously
  3. The result is placed in the completion queue (CQ)

Both SQ and CQ are lock-free ring buffers shared between user space and kernel, with no syscall per I/O in the hot path.
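The head/tail bookkeeping of such a ring can be illustrated with a minimal single-threaded sketch. This is a deliberate simplification: the real io_uring rings are lock-free structures in memory shared between user space and the kernel, which this toy version does not attempt to model:

```rust
// Toy bounded ring buffer with monotonically increasing head/tail
// counters, the same indexing scheme io_uring's SQ/CQ rings use.
struct Ring<T> {
    slots: Vec<Option<T>>,
    head: usize, // next slot to consume
    tail: usize, // next slot to produce
}

impl<T> Ring<T> {
    fn new(capacity: usize) -> Self {
        Self { slots: (0..capacity).map(|_| None).collect(), head: 0, tail: 0 }
    }

    // Submit an entry; returns false when the ring is full.
    fn push(&mut self, item: T) -> bool {
        if self.tail - self.head == self.slots.len() {
            return false;
        }
        let idx = self.tail % self.slots.len();
        self.slots[idx] = Some(item);
        self.tail += 1;
        true
    }

    // Consume the oldest entry, if any.
    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None;
        }
        let idx = self.head % self.slots.len();
        self.head += 1;
        self.slots[idx].take()
    }
}

fn main() {
    let mut sq: Ring<&str> = Ring::new(2);
    assert!(sq.push("read segment"));
    assert!(sq.push("append batch"));
    assert!(!sq.push("overflow")); // ring full
    assert_eq!(sq.pop(), Some("read segment"));
}
```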

Iggy uses compio as its async runtime, which provides a driver-disaggregated architecture on top of io_uring (Linux) and IOCP (Windows). Each shard gets its own compio executor configured with:

  • Capacity: 4096 concurrent I/O operations
  • Event interval: 128 events per loop iteration
  • Cooperative task running enabled

Performance: Tokio vs Thread-per-Core

The migration from Tokio to thread-per-core with compio delivered significant latency improvements across the board:

Latency Improvements: Tokio vs Thread-per-Core

Relative latency comparison (lower is better), thread-per-core with io_uring vs Tokio work-stealing:

  Partitions   Percentile   Change
  8            P9999        -81%
  16           P95          -28%
  16           P99          -32%
  16           P9999        -92%
  32           P95          -57%
  32           P99          -60%

TCP socket migration

Iggy supports TCP socket migration across shards. When a client sends a request that targets a partition on a different shard, instead of forwarding the request (cross-shard hop), the TCP socket itself can be migrated to the owning shard. This eliminates the overhead of cross-shard message passing for data plane operations.

Left-right metadata

Shared metadata (streams, topics, consumer groups) uses a left-right concurrent data structure. Shard 0 is the sole writer, while all other shards hold read-only clones that are updated lock-free. This provides:

  • Lock-free reads on all shards (no contention on the hot path)
  • Strongly consistent writes serialized through shard 0

Append-only log

The append-only log is the core concept of Iggy. It's a simple data structure optimized for append-only operations: a sequence of records appended to the end of the log. The records are immutable, so they can't be changed once they are written. They are written in the order they are received, which means the log is ordered.

To navigate the log, you use the offset, which is the position of a record in the log. The offset is a simple integer that starts at 0 and is incremented by 1 for each record. When a client reads records from the log, it can specify the offset from which it wants to start reading, as well as the maximum number of records to read. The client can read from the beginning or from a specific offset, which means you can replay messages.

Append-Only Log

Messages are appended sequentially to the end of the log, while each consumer tracks its own position independently via offsets.
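These offset semantics can be captured in a minimal in-memory sketch. This is illustrative only; Iggy's real log is segmented, disk-backed, and per-partition:

```rust
// Minimal in-memory append-only log: the offset of a record is simply
// its position in the sequence, and replay is reading from an earlier
// offset again.
struct AppendOnlyLog {
    records: Vec<Vec<u8>>, // offset == index into this vector
}

impl AppendOnlyLog {
    fn new() -> Self {
        Self { records: Vec::new() }
    }

    // Append a record and return its offset.
    fn append(&mut self, payload: &[u8]) -> u64 {
        self.records.push(payload.to_vec());
        (self.records.len() - 1) as u64
    }

    // Read up to `count` records starting at `offset`.
    fn read(&self, offset: u64, count: usize) -> &[Vec<u8>] {
        let start = (offset as usize).min(self.records.len());
        let end = (start + count).min(self.records.len());
        &self.records[start..end]
    }
}

fn main() {
    let mut log = AppendOnlyLog::new();
    assert_eq!(log.append(b"first"), 0);
    assert_eq!(log.append(b"second"), 1);
    assert_eq!(log.read(0, 10).len(), 2); // replay from the beginning
}
```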

Stream Hierarchy

Stream → Topic → Partition → Segment

Stream

While we could put an equal sign between the log and the stream, they are not the same, at least in the case of the Iggy streaming server. The stream is a logical concept, and you might think of it as a namespace. For example, you could have a single stream for the whole system, or multiple streams, e.g. representing different environments such as dev, staging, and production. The stream is identified by its unique ID and can have one or more topics assigned, which means records are published to the specific topics that belong to the particular stream.

Topic

The topic is also a logical concept, which is part of the stream. The topic is identified by its unique ID. You can think of a topic as an entity responsible for storing a specific type of record. For example, you could have one topic for user events and another topic for order events, etc.

The important thing to note is that messages are not stored in the topic directly, but rather in the partitions assigned to it. A topic can have one or more partitions, which helps achieve higher parallelism and throughput. A topic can also have a retention policy assigned, which means records are deleted automatically once they are older than the specified retention period. Topics also support configurable compression (none, gzip, lz4, zstd) and maximum size limits.

Partition

The partition belongs to the topic and is responsible for storing the records. Records are distributed between partitions, and each partition acts as a simple database optimized for append-only operations. The partition is identified by its unique ID, an integer that starts at 1 and is incremented by 1 for each partition. The partition ID is unique per topic, so the same partition ID can be used in multiple topics.

Thanks to having multiple partitions, we can achieve horizontal scalability across many independent consumers, since each consumer can read messages from different partitions. This can be achieved by using more advanced concepts such as consumer groups.

Each partition in the thread-per-core architecture is owned by exactly one shard and includes:

  • A SegmentedLog with sealed segments and one active segment
  • Consumer offsets and consumer group offsets
  • Optional MessageDeduplicator for server-side dedup
  • Partition statistics and an atomic offset counter

Segment

The segment, being part of the partition, is the actual physical layer which stores records in binary format as files. Each segment has a limited size (default 1 GiB), and once it's full, a new segment is created automatically. The segment name is based on the start offset of the first record in the segment and is unique per partition.

Each segment consists of:

  • .log file - the actual message data
  • .index file - positional and time indexes for fast lookups

Index caching can be configured per strategy: "all" (all indexes in memory), "open_segment" (only the active segment cached, the default), or "none" (on-demand disk reads).

Consumer groups

Consumer groups provide horizontal scaling for message consumption. When multiple consumers join the same consumer group, the server automatically distributes partitions among group members so that each partition is consumed by exactly one member. When members join or leave, the server triggers a cooperative partition rebalancing with a pending revocation phase (configurable timeout, default 30s) to ensure smooth transitions without message loss.

Consumer Group

Each partition is assigned to exactly one consumer. When a consumer joins or leaves, partitions are rebalanced. For example, with consumers A, B, C and six partitions:

  Consumer   Partitions
  A          P1, P4
  B          P2, P5
  C          P3, P6
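A distribution like the one shown above can be produced by simple round-robin assignment. This sketch is illustrative; Iggy's actual rebalancing algorithm (with its cooperative revocation phase) is more involved:

```rust
// Round-robin assignment of partitions to consumer group members:
// partition p goes to member (p - 1) mod member_count.
fn assign_partitions(partitions: u32, consumers: &[&str]) -> Vec<(u32, String)> {
    (1..=partitions)
        .map(|p| {
            let member = consumers[((p - 1) as usize) % consumers.len()];
            (p, member.to_string())
        })
        .collect()
}

fn main() {
    let assignment = assign_partitions(6, &["A", "B", "C"]);
    // P1 -> A, P2 -> B, P3 -> C, P4 -> A, P5 -> B, P6 -> C
    assert_eq!(assignment[0], (1, "A".to_string()));
    assert_eq!(assignment[3], (4, "A".to_string()));
}
```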

Message format

Message Header (64 bytes, little-endian)

Every message starts with this fixed-size header for efficient aligned reads.


Each message has a 64-byte header stored in little-endian format:

  Field                 Bytes   Type   Description
  checksum              0-8     u64    xxHash3 integrity checksum
  id                    8-24    u128   Unique message ID (UUIDv4)
  offset                24-32   u64    Sequential offset in partition
  timestamp             32-40   u64    Server-assigned timestamp
  origin_timestamp      40-48   u64    Client-provided timestamp
  user_headers_length   48-52   u32    Length of optional headers
  payload_length        52-56   u32    Length of payload
  reserved              56-64   u64    Reserved (must be 0)

After the header comes the optional user headers bytes, followed by the payload bytes.
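A decoder for this layout can be sketched directly from the table. The struct and function names are illustrative, not Iggy's actual implementation:

```rust
// Parse the 64-byte little-endian message header described above.
const HEADER_SIZE: usize = 64;

#[derive(Debug)]
struct MessageHeader {
    checksum: u64,
    id: u128,
    offset: u64,
    timestamp: u64,
    origin_timestamp: u64,
    user_headers_length: u32,
    payload_length: u32,
    reserved: u64,
}

fn parse_header(b: &[u8; HEADER_SIZE]) -> MessageHeader {
    MessageHeader {
        checksum: u64::from_le_bytes(b[0..8].try_into().unwrap()),
        id: u128::from_le_bytes(b[8..24].try_into().unwrap()),
        offset: u64::from_le_bytes(b[24..32].try_into().unwrap()),
        timestamp: u64::from_le_bytes(b[32..40].try_into().unwrap()),
        origin_timestamp: u64::from_le_bytes(b[40..48].try_into().unwrap()),
        user_headers_length: u32::from_le_bytes(b[48..52].try_into().unwrap()),
        payload_length: u32::from_le_bytes(b[52..56].try_into().unwrap()),
        reserved: u64::from_le_bytes(b[56..64].try_into().unwrap()),
    }
}

fn main() {
    let mut raw = [0u8; HEADER_SIZE];
    raw[24..32].copy_from_slice(&42u64.to_le_bytes()); // offset = 42
    let header = parse_header(&raw);
    assert_eq!(header.offset, 42);
    assert_eq!(header.reserved, 0);
}
```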

Structure

Bearing in mind that a stream consists of topics, which may have one or more partitions assigned, and each partition consists of segments, we can visualize the structure of the Iggy data directory as follows:

local_data/
├── info.json
├── state.messages
└── streams/
    └── 1/
        └── topics/
            └── 1/
                └── partitions/
                    └── 1/
                        ├── 00000000000000000000.index
                        └── 00000000000000000000.log

The .index file is created automatically and is used to speed up the search operations by keeping track of the offsets and timestamps of the records. The state.messages file is a write-ahead log that persists all metadata operations (stream/topic/user creation, etc.) with SHA-256 checksums for integrity.

Memory pool

Iggy uses a custom memory pool with 32 buckets holding buffer sizes from 256 B to 512 MiB. The default pool size is 4 GiB with 8192 buffers per bucket. This eliminates allocation overhead on the hot path and enables zero-copy message passing between components. The pool is page-aligned (4096-byte multiples) and requires a minimum of 512 MiB.

Write pipeline

Messages flow through a multi-stage write pipeline:

  1. Messages arrive and are buffered in a MemoryMessageJournal
  2. A flush is triggered when either the message count threshold (default 1024) or size threshold (default 1 MiB) is reached
  3. The MessagesWriter uses vectored I/O with MAX_IOV_COUNT=1024 to minimize syscalls
  4. Optional fsync per partition for durability guarantees
  5. When a segment reaches its size limit (default 1 GiB), it is sealed and a new segment is created
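The flush trigger in step 2 can be sketched as follows. The struct and method names are illustrative, with the thresholds taken from the defaults stated above:

```rust
// In-memory journal that signals a flush when either the message-count
// threshold (1024) or the size threshold (1 MiB) is reached.
struct MemoryJournal {
    messages: Vec<Vec<u8>>,
    bytes: usize,
    max_messages: usize, // default 1024
    max_bytes: usize,    // default 1 MiB
}

impl MemoryJournal {
    fn new() -> Self {
        Self {
            messages: Vec::new(),
            bytes: 0,
            max_messages: 1024,
            max_bytes: 1024 * 1024,
        }
    }

    // Buffer a message; returns true when a flush should be triggered.
    fn append(&mut self, payload: Vec<u8>) -> bool {
        self.bytes += payload.len();
        self.messages.push(payload);
        self.messages.len() >= self.max_messages || self.bytes >= self.max_bytes
    }
}

fn main() {
    let mut journal = MemoryJournal::new();
    for _ in 0..1023 {
        assert!(!journal.append(vec![0u8; 16]));
    }
    assert!(journal.append(vec![0u8; 16])); // 1024th message triggers a flush
}
```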
