February 26, 2026

How OceanBase Works Under the Hood

Two years ago, in my previous article, I introduced OceanBase as a relational distributed database that supports strong consistency, horizontal scaling, and multi-datacenter deployment.

Now it supports Vector Search and has gone through a lot of improvements. So I decided to walk through its entire architecture and go one level deeper. This post explains what really happens inside OceanBase when you run a simple SQL query.

We’ll look at:

  • What happens when you write data?

  • What happens when you read data?

  • How replication works

  • How consistency is guaranteed

It's always interesting to learn how an RDBMS works as a distributed database under the hood. I will try to avoid any marketing language and stick to how it actually works.

Honestly, I am tired of hearing about OpenClaw, Claude worker, Claude skill, and so on, so I decided to learn some core technology instead. If you are like me, you are welcome ;-).

My setup is very simple as shown in the image below.

Basic concepts

The OceanBase documentation uses a lot of terms like OBServer, OBProxy, Zone, and more. These three components define how an OceanBase cluster is physically organized. Let's start with the OBServer.

  1. OBServer — The Database Node

In OceanBase, each physical machine runs a single process called observer. OBServer is not just storage; it is the entire database engine running on one node. An OBServer includes:

  • SQL engine (parser, optimizer, executor)

  • Transaction layer

  • Replication layer (Paxos)

  • Storage engine (MemTable + SSTables)

  • Log service (redo/WAL)

  • Resource management (multi-tenant isolation)

Note that OceanBase has a shared-nothing architecture. That means:

- Every OBServer is fully independent.

- No shared storage.

- No special hardware.

  2. Zone — A Logical Group of OBServers

A Zone is not a machine. It is a logical grouping of OBServers. A zone represents a set of nodes with similar availability characteristics. In practice, this usually means:

  • One datacenter

  • One availability zone

If you deploy OceanBase across 3 datacenters:

DC1 → Zone A
DC2 → Zone B
DC3 → Zone C

Each zone may contain multiple OBServers, as follows:

Zone A
- OBServer 1
- OBServer 2

Zone B
- OBServer 3
- OBServer 4

Zone C
- OBServer 5
- OBServer 6

  3. OBProxy — The Smart Router

OBProxy is separate from OBServer. It is the access layer of OceanBase. Every SQL query goes through OBProxy.

OBProxy:

  • Listens on MySQL protocol

  • Accepts client connections

  • Forwards SQL to correct OBServer

  • Automatically discovers cluster topology

When your application sends SQL:

  1. OBProxy parses the statement.

  2. Determines which Log Stream owns the data.

  3. Finds the leader of that Log Stream.

  4. Forwards the request directly to that OBServer.
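The routing steps above can be sketched in a few lines of Python. Everything here is an illustrative assumption, not OceanBase's actual implementation: the hash-based key-to-tablet mapping, the table names, and the topology dictionaries are all made up to show the shape of the lookup.

```python
# Toy OBProxy-style routing: key -> tablet -> Log Stream -> leader OBServer.
# The topology below is a hypothetical 2-Log-Stream cluster.

N_TABLETS = 4

# tablet id -> owning Log Stream
tablet_to_ls = {0: "LS1", 1: "LS1", 2: "LS2", 3: "LS2"}

# Log Stream -> current leader OBServer
ls_leader = {"LS1": "observer-1", "LS2": "observer-4"}

def route(key):
    """Return the OBServer that should receive a request for this row key."""
    tablet = hash(key) % N_TABLETS          # which partition owns the row
    ls = tablet_to_ls[tablet]               # which Log Stream owns that tablet
    return ls_leader[ls]                    # only the leader accepts writes
```

In the real system the proxy keeps this topology cache up to date by discovering the cluster, so stale leader information only costs one redirected hop.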

After the basic concepts, let's dive into the storage and replication levels of the OceanBase database. This will help us understand how the read/write paths and replication work.

  1. Replication Level: Log Stream

In OceanBase, a Log Stream is the basic unit of replication, consensus, and failover. In simple terms, a Log Stream is a group of data that shares the same replication log and leader.

Think of it as a “data container” that moves and replicates together.

Technically, a Log Stream is:

  • A collection of table partitions (called tablets)

  • With one shared write-ahead log

  • Replicated using Paxos

  • Having one leader and multiple followers

All data inside the same LS:

  • Is replicated together

  • Fails over together

  • Moves together during load balancing

You can think of every Log Stream as a small replicated database inside the cluster.

Each Log Stream:

  • Has one leader

  • Has multiple followers

  • Owns a group of partitions (called tablets)

  • Replicates data using Paxos consensus
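The relationships listed above can be captured in a minimal data model. This is an illustrative sketch, not OceanBase source code; the field names and the example tablet names are assumptions.

```python
# A toy model of a Log Stream: one leader, several followers, a group of
# tablets, and one shared write-ahead log replicated to every member.
from dataclasses import dataclass, field

@dataclass
class LogStream:
    ls_id: str
    leader: str                                   # OBServer accepting writes
    followers: list                               # OBServers replaying the WAL
    tablets: list = field(default_factory=list)   # partitions owned by this LS
    wal: list = field(default_factory=list)       # shared redo log entries

    def replica_count(self):
        return 1 + len(self.followers)

# A hypothetical Log Stream spread across three zones:
ls = LogStream("LS1", leader="observer-1",
               followers=["observer-3", "observer-5"],
               tablets=["accounts_p0", "accounts_p1"])
```

Because the tablets hang off the Log Stream rather than being replicated individually, everything in `tablets` fails over and moves together, exactly as the bullets above describe.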

  2. Storage Level: WAL (Redo Log) and SSTables (LSM Storage)

The WAL (redo log) provides durability and consensus for the data. SSTables, on the other hand, provide long-term data storage.

If you are familiar with Cassandra, you probably know how SSTables work.

SSTables are part of its LSM-tree storage engine. An SSTable is an immutable, sorted data file stored on disk. The idea behind LSM storage engines is very simple:

  • Writes go to memory first (Memtable).

  • Disk files are written sequentially and never updated in place.

Now we are ready to go through how writes and reads work in OceanBase.

What Happens When You Write Data?

Let’s say you run:

UPDATE accounts SET balance = balance - 100 WHERE id = 1;

Here’s what happens step by step.

Step 1 — Request Routing

Your application connects through OBProxy.

OBProxy figures out which Log Stream owns this row and sends the request to the leader of that Log Stream. Only the leader can accept writes.

Step 2 — Redo Log Is Created

Before changing any data, OceanBase writes a redo log entry.

This redo entry contains:

  • Transaction ID

  • Changed data

  • Log sequence number

This is written to the Write-Ahead Log (WAL). If the server (OBServer) crashes after this point, the change can be recovered.
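A redo entry with the three fields listed above can be sketched like this. The field names and the append function are hypothetical; only the idea (log before data, monotonically increasing sequence numbers) comes from the text.

```python
# Sketch of a redo log entry and WAL append. Illustrative, not OceanBase code.
from dataclasses import dataclass

@dataclass(frozen=True)
class RedoEntry:
    txn_id: int      # transaction ID
    lsn: int         # log sequence number, strictly increasing
    change: dict     # the changed data

wal = []

def append_to_wal(txn_id, change):
    """Append a redo entry BEFORE the row itself is modified."""
    entry = RedoEntry(txn_id=txn_id, lsn=len(wal) + 1, change=change)
    wal.append(entry)
    return entry
```

On crash recovery, replaying `wal` in LSN order reconstructs every change that was logged but not yet flushed.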

Step 3 — Replication to Other Datacenters

If OceanBase is deployed across 3 datacenters as shown earlier:

  • The leader sends the redo log entry to follower replicas.

  • Followers write it to their own WAL.

  • Followers send acknowledgment back.

The leader waits for a majority of votes (2 out of 3).

Only after majority confirms:

  • The transaction is considered committed.

  • The client receives success.
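The commit rule above is just quorum arithmetic: the leader counts its own local WAL write plus follower acknowledgments and commits once a majority is reached. Here is that rule in isolation; it is not a full Paxos implementation.

```python
# Majority-commit rule: a write is committed once more than half of the
# replicas (leader included) have persisted the redo entry.

def is_committed(acks_received, total_replicas=3):
    """acks_received includes the leader's own local WAL write."""
    majority = total_replicas // 2 + 1
    return acks_received >= majority
```

With 3 replicas the majority is 2, so one datacenter can be lost without losing committed data; with 5 replicas the majority is 3, tolerating two failures.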

Step 4 — Data Goes to Memory (MemTable)

After the redo log is safely replicated:

  • The change is applied to an in-memory structure called MemTable.

  • MemTable keeps rows sorted by primary key.

At this point:

  • The data is durable (because WAL is replicated).

  • The data is visible to future reads.

Step 5 — Flush to SSTable

When MemTable becomes large:

  • It is frozen.

  • Written to disk as an SSTable.

  • SSTables are immutable and sorted.

SSTables are built locally on each replica. The key point is that they are not copied between servers.

So the write path looks like this:

Client Write
   ↓
Log Stream Leader
   ↓
WAL (Redo Log)  ← replicated via Paxos
   ↓
MemTable
   ↓
SSTables
   ↓
Compaction

What Happens When You Read Data?

Now let’s look at a simple read:

SELECT balance FROM accounts WHERE id = 1;

Step 1 — Routing

For strong consistency, the request goes to the Log Stream leader. OceanBase supports snapshot reads using a global timestamp service (GTS). Each read gets a snapshot timestamp.

Step 2 — MVCC Snapshot

OceanBase uses multi-version concurrency control (MVCC). Each row version has a commit timestamp.

When reading:

  • The engine checks MemTable.

  • Then L0 SSTables.

  • Then L1 SSTables.

  • Then major SSTables.

It selects the newest row version that is less than or equal to the snapshot timestamp.

This provides consistent reads even during concurrent writes.
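The version-selection rule above ("newest version with commit timestamp ≤ snapshot timestamp") can be shown in a few lines. This is a sketch of the rule only; OceanBase's real MVCC machinery is far more involved.

```python
# MVCC snapshot read: among the committed versions of one row, return the
# value of the newest version visible at the given snapshot timestamp.

def read_at_snapshot(versions, snapshot_ts):
    """versions: list of (commit_ts, value) pairs for one row, any order."""
    visible = [(ts, v) for ts, v in versions if ts <= snapshot_ts]
    if not visible:
        return None                 # the row did not exist at this snapshot
    return max(visible)[1]          # newest visible version wins
```

A writer committing at timestamp 40 cannot disturb a reader holding snapshot 25, which is exactly why reads stay consistent during concurrent writes.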

Here is the high-level view of the read path:

Application
   ↓
OBProxy
   ↓
OBServer (Log Stream Leader)
   ↓
SQL Engine
   ↓
Transaction Layer (Snapshot / GTS)
   ↓
Storage Engine
   ├── MemTable
   ├── L0 SSTables
   ├── L1 SSTables
   └── Major SSTable
   ↓
Result returned

Understanding OceanBase under the hood is not about memorizing components. It's about understanding the read/write and replication data flows. Once you see those flows clearly, the whole system makes sense. From here, you can start discovering how distributed queries work. In the next part of this series, I plan to deploy a 3-node OceanBase cluster and run some tests to explore further.