Cache Coherence Protocols: From MSI to MOESI — A Hardware Designer's Perspective
Deep dive into cache coherence protocols for shared-memory multiprocessors. We trace the evolution from simple MSI to complex directory-based MOESI, with RTL implementation considerations.
Why Cache Coherence Matters
In a typical multi-core system, each core has its own private L1 cache. When two cores cache the same memory address, a write from one core leaves the other core's copy stale. This is the cache coherence problem, and getting it wrong means silent data corruption.
The solution is a coherence protocol: a set of rules that govern how cache lines transition between states in response to reads and writes from local and remote cores.
The Classic MSI Protocol
The simplest coherence protocol is MSI — Modified, Shared, Invalid.
State Definitions
| State | Meaning | Dirty? | Other copies? |
|-------|---------|--------|---------------|
| M (Modified) | This cache has the only valid copy, and it's dirty | Yes | No |
| S (Shared) | This cache has a clean copy; other caches may also have copies | No | Possibly |
| I (Invalid) | This cache line is not valid | — | — |
State Transitions
Consider a bus-based system where all caches snoop a shared bus.
When a core performs a PrRd (processor read):
- I → S: BusRd issued. Data fetched from memory, or from a cache holding the line in M, which flushes it and downgrades to S.
- S → S: No bus transaction. Read from local cache.
- M → M: No bus transaction. Read from local cache.
When a core performs a PrWr (processor write):
- I → M: BusRdX issued (read-exclusive). All other caches invalidate their copies. Data fetched, then modified.
- S → M: BusUpgr issued (upgrade). All other caches invalidate. Line promoted to Modified.
- M → M: No bus transaction. Write locally.
Bus Transactions
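- BusRd: Read without intent to modify. A cache holding the line in M flushes it and transitions M → S.
- BusRdX: Read with intent to modify. Other caches invalidate; a cache holding the line in M flushes it first.
- BusUpgr: Upgrade from S to M. Other caches invalidate. No data transfer needed.
- Flush: Write back dirty data to memory, either in response to a snooped BusRd or BusRdX or on eviction.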
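The MSI transitions above can be written as two lookup tables, one for processor events and one for snooped bus transactions. This is a minimal behavioral sketch in Python rather than RTL. The names `PROCESSOR`, `SNOOP`, and `access` are illustrative, and the model tracks a single cache line with no timing.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    S = "Shared"
    I = "Invalid"

# Processor side: (state, event) -> (next state, bus transaction or None)
PROCESSOR = {
    (State.I, "PrRd"): (State.S, "BusRd"),
    (State.S, "PrRd"): (State.S, None),
    (State.M, "PrRd"): (State.M, None),
    (State.I, "PrWr"): (State.M, "BusRdX"),
    (State.S, "PrWr"): (State.M, "BusUpgr"),
    (State.M, "PrWr"): (State.M, None),
}

# Snoop side: (state, observed bus transaction) -> (next state, flush dirty data?)
SNOOP = {
    (State.M, "BusRd"): (State.S, True),
    (State.M, "BusRdX"): (State.I, True),
    (State.S, "BusRd"): (State.S, False),
    (State.S, "BusRdX"): (State.I, False),
    (State.S, "BusUpgr"): (State.I, False),
}

def access(caches, core, event):
    """Apply one processor event to caches[core] and snoop it everywhere else.

    Returns (bus transaction or None, whether any other cache flushed).
    """
    next_state, bus = PROCESSOR[(caches[core], event)]
    flushed = False
    if bus is not None:
        for other in range(len(caches)):
            if other == core:
                continue
            # Pairs missing from SNOOP (for example an Invalid line) do nothing.
            snooped, flush = SNOOP.get((caches[other], bus), (caches[other], False))
            caches[other] = snooped
            flushed = flushed or flush
    caches[core] = next_state
    return bus, flushed

caches = [State.I, State.I]
for core, event in [(0, "PrWr"), (1, "PrRd"), (1, "PrWr")]:
    print(core, event, access(caches, core, event), [s.name for s in caches])
```

Keeping the two tables separate mirrors how a snooping cache controller is usually split into a processor-facing side and a bus-facing side.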
The MESI Protocol: Adding the Exclusive State
The key insight: when a core does a BusRd and no other core has the line, why go to Shared? You can go to Exclusive — meaning you have the only clean copy.
E (Exclusive): This cache has the only copy, and it's clean. It matches memory.
The advantage: transitioning E → M requires no bus transaction. You already have the only copy, so there's nothing to invalidate. This eliminates upgrade traffic on the bus.
PrWr in E-state: Transition to M-state silently — no bus traffic.
This is a real performance win for data that one core reads and then writes without sharing it (private or thread-local data), which is common in parallel code.
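To make the saving concrete, here is a small sketch that counts bus transactions for one cache line under MSI and under MESI. The function name `bus_transactions` and the trace format are assumptions made for this example. The model ignores evictions and data movement and only counts BusRd, BusRdX, and BusUpgr.

```python
def bus_transactions(protocol, trace, num_cores=2):
    """Count bus transactions for one cache line under 'MSI' or 'MESI'.

    trace is a list of (core, op) pairs with op in {'R', 'W'}.
    """
    state = ["I"] * num_cores
    count = 0
    for core, op in trace:
        others = [c for c in range(num_cores) if c != core]
        if op == "R":
            if state[core] != "I":
                continue                  # read hit: no bus traffic
            count += 1                    # BusRd
            shared = any(state[c] != "I" for c in others)
            for c in others:
                if state[c] in ("M", "E"):
                    state[c] = "S"        # remote M or E copy downgrades
            state[core] = "E" if protocol == "MESI" and not shared else "S"
        else:
            if state[core] == "M":
                continue                  # write hit in M
            if state[core] == "E":
                state[core] = "M"         # silent E -> M upgrade
                continue
            count += 1                    # BusRdX (from I) or BusUpgr (from S)
            for c in others:
                state[c] = "I"
            state[core] = "M"
    return count

private = [(0, "R"), (0, "W")]            # one core reads, then writes
shared = [(0, "R"), (1, "R"), (0, "W")]   # a second core reads in between
for name, trace in [("private", private), ("shared", shared)]:
    print(name, bus_transactions("MSI", trace), bus_transactions("MESI", trace))
```

For the private read-then-write trace MESI needs one transaction where MSI needs two. Once a second core has read the line, the two protocols behave the same.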
The MOESI Protocol: Adding the Owned State
Some systems (notably AMD processors using coherent HyperTransport) add a fifth state:
O (Owned): This cache has a dirty copy, but other caches may also have copies. This cache is responsible for eventually writing back to memory.
The O-state allows dirty sharing:
Core A: M → O on snoop read hit (provides data, keeps dirty copy)
Core B: I → S on read miss (receives data from Core A)
Memory is now stale, but Core A knows it must eventually write back. If Core A evicts, it writes back. If Core B evicts, it simply drops (no writeback needed).
Key MOESI Benefit
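Without the O-state, a snooped read that hits a Modified line must also update memory, because the line drops to Shared and Shared copies are assumed clean. The O-state defers that writeback: the owner supplies the data cache-to-cache and explicitly keeps the responsibility for updating memory when it evicts the line.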
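The difference shows up in how a cache responds when another core's read hits one of its lines. This is a sketch under simplifying assumptions: `snoop_read` and `evict` are hypothetical helper names, and the MESI branch assumes the common design in which a Modified line is flushed to memory when it is downgraded.

```python
def snoop_read(owner_state, protocol):
    """Return (next state, memory_written) for a cache whose line is hit
    by another core's read."""
    if owner_state == "M":
        if protocol == "MOESI":
            return "O", False     # supply data cache-to-cache, keep writeback duty
        return "S", True          # MESI: flush so the shared copies are clean
    if owner_state == "O":
        return "O", False         # owner keeps supplying data; memory stays stale
    if owner_state == "E":
        return "S", False         # clean line, nothing to write back
    return owner_state, False     # S or I: no change

def evict(state):
    """Return True when evicting a line in this state needs a writeback."""
    return state in ("M", "O")

print(snoop_read("M", "MESI"), snoop_read("M", "MOESI"))
print(evict("O"), evict("S"))
```

The memory write that MESI performs on every M to S downgrade is postponed under MOESI until the owner evicts the line.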
Directory-Based Coherence
Bus-based (snooping) protocols scale poorly beyond roughly 8 cores: every miss is broadcast to every cache, so bus bandwidth and snoop lookups become the bottleneck. For larger systems, we use directories.
Directory Structure
A directory entry tracks, for each memory block:
- State: Uncached, Shared, Exclusive
- Sharers: Bit vector of which caches hold the line
Directory entry for address X:
state: SHARED
sharers: 0b10001001 // 8-bit vector for an 8-core system: core0, core3, core7
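One way to hold the sharers field is as an integer bit vector with one bit per core. The sketch below assumes an 8-core system. `DirEntry` and its methods are illustrative names, not part of any particular design.

```python
from dataclasses import dataclass

NUM_CORES = 8  # assumed system size

@dataclass
class DirEntry:
    state: str = "UNCACHED"   # UNCACHED, SHARED, or EXCLUSIVE
    sharers: int = 0          # bit i set means core i holds the line

    def add_sharer(self, core):
        self.sharers |= 1 << core

    def remove_sharer(self, core):
        self.sharers &= ~(1 << core)

    def sharer_list(self):
        return [c for c in range(NUM_CORES) if (self.sharers >> c) & 1]

entry = DirEntry(state="SHARED")
for core in (0, 3, 7):
    entry.add_sharer(core)
print(f"{entry.sharers:08b}", entry.sharer_list())
```

A full bit vector costs one bit per core per tracked block, which is why larger designs often move to coarser or limited-pointer sharer encodings.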
Directory Protocol Operations
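Read miss from core i:
- Core i sends a read request to the home directory for the block.
- Uncached: the directory fetches the block from memory, replies to core i, and records core i in the sharers vector.
- Shared: the directory supplies the block from memory and adds core i to the sharers vector.
- Exclusive: the directory forwards the request to the owner, which supplies the data and downgrades. The directory then marks the block Shared with both cores as sharers.
This is significantly more complex than snooping protocols, with more message types and potential race conditions between requests.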
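A minimal sketch of a read-miss handler at the directory, using the three directory states listed earlier. The message names (`MemRead`, `FwdGetS`, `Data`) are assumptions chosen for this example, and the model sends messages atomically, so it shows none of the races a real implementation must handle.

```python
def read_miss(directory, addr, core):
    """Handle a read miss from `core` and return the messages sent.

    directory maps addr -> {'state': str, 'sharers': set of core ids}.
    """
    entry = directory.setdefault(addr, {"state": "UNCACHED", "sharers": set()})
    messages = []
    if entry["state"] == "EXCLUSIVE":
        (owner,) = entry["sharers"]             # exactly one holder in this state
        messages.append(("FwdGetS", owner))     # ask the owner to supply the data
    else:
        messages.append(("MemRead", addr))      # memory holds a valid copy
    messages.append(("Data", core))             # data reaches the requester
    entry["sharers"].add(core)
    entry["state"] = "SHARED"
    return messages

directory = {0x80: {"state": "EXCLUSIVE", "sharers": {2}}}
print(read_miss(directory, 0x40, 1), directory[0x40])
print(read_miss(directory, 0x80, 5), directory[0x80])
```

This sketch always grants Shared on a read. A MESI-style directory could grant Exclusive when the block was previously uncached.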
RTL Implementation Notes
When implementing coherence in RTL, several practical considerations arise:
1. Transient States
Real implementations need transient states between stable states:
M^D: moving to Modified, waiting for data (after BusRdX)
S^D: moving to Shared, waiting for data (after BusRd)
I^A: moving to Invalid, waiting for acks
Without transient states, your state machine can accept a new request in an inconsistent state during a pending transaction.
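The sketch below models one cache line that enters a transient state after issuing BusRdX and defers any snoop that arrives before the data does. `LineController` and the state name `M_D` (standing in for M^D) are illustrative. Deferring and replaying is only one possible policy: real designs may stall or retry the snoop instead.

```python
class LineController:
    """One cache line with a transient state between request and data arrival."""

    def __init__(self):
        self.state = "I"
        self.deferred = []              # snoops that arrived mid-transaction

    def write_miss(self):
        assert self.state == "I"
        self.state = "M_D"              # BusRdX issued, data not here yet

    def snoop(self, bus_op):
        if self.state.endswith("_D"):
            self.deferred.append(bus_op)    # do not act on a half-filled line
            return "deferred"
        if bus_op == "BusRdX":
            self.state = "I"
        elif bus_op == "BusRd" and self.state == "M":
            self.state = "S"
        return "handled"

    def data_arrived(self):
        assert self.state == "M_D"
        self.state = "M"                # the pending write can now complete
        pending, self.deferred = self.deferred, []
        for bus_op in pending:          # replay snoops in arrival order
            self.snoop(bus_op)

line = LineController()
line.write_miss()
print(line.snoop("BusRd"), line.state)
line.data_arrived()
print(line.state)
```

With only the stable states, the snooped BusRd would have found the line in I and been ignored, and the requester would then have taken the line to M with another reader unaccounted for.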
2. Writeback Buffer
Dirty evictions can't stall the pipeline. A writeback buffer decouples eviction from writeback completion:
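typedef struct packed {
  logic [31:0]  addr;   // address of the evicted line
  logic [511:0] data;   // one 64-byte cache line
  logic         valid;  // entry holds a writeback that has not completed
} wb_entry_t;

wb_entry_t wb_buffer [3:0]; // 4-entry writeback buffer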
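A behavioral model helps pin down what the buffer has to do before writing RTL. The sketch below is a Python model, not the hardware. `WritebackBuffer` and its method names are assumptions. The `snoop` method reflects one requirement that is easy to miss: while a dirty line sits in the buffer it is the only up-to-date copy, so snoops must check the buffer as well as the cache.

```python
from collections import deque

class WritebackBuffer:
    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()          # (addr, data) pairs, oldest first

    def full(self):
        return len(self.entries) == self.depth

    def push(self, addr, data):
        """Accept an evicted dirty line; False means the cache must stall."""
        if self.full():
            return False
        self.entries.append((addr, data))
        return True

    def snoop(self, addr):
        """Return the newest buffered data for addr, or None when absent."""
        for a, d in reversed(self.entries):
            if a == addr:
                return d
        return None

    def drain_one(self, memory):
        """Write the oldest entry to memory (modelled as a dict)."""
        if self.entries:
            addr, data = self.entries.popleft()
            memory[addr] = data

wb = WritebackBuffer()
wb.push(0x1000, b"dirty line")
print(wb.snoop(0x1000), wb.full())
```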
3. Ordering Requirements
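Coherence protocols must also support the memory consistency model. Coherence alone orders writes to a single address; the consistency model decides whether a write to address A followed by a write to address B must become visible to all cores in that order. Strong models such as sequential consistency and x86-TSO require it, while weaker models (Arm, RISC-V) require it only across a memory barrier (fence). In both cases the cache has to know when a write has completed, which requires invalidation acknowledgements and careful handling of fence instructions.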
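The ordering requirement can be stated as a message-passing litmus test. The sketch below assumes sequential consistency and enumerates every interleaving of a writer (store data, then store flag) with a reader (load flag, then load data). The instruction tuple format and the function names are made up for this example.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

WRITER = [("st", "data", 1), ("st", "flag", 1)]
READER = [("ld", "flag", "r1"), ("ld", "data", "r2")]

def outcomes():
    """Return the set of (r1, r2) results reachable under sequential consistency."""
    seen = set()
    for schedule in interleavings(WRITER, READER):
        mem = {"data": 0, "flag": 0}
        regs = {}
        for kind, addr, arg in schedule:
            if kind == "st":
                mem[addr] = arg         # store: arg is the value written
            else:
                regs[arg] = mem[addr]   # load: arg is the destination register
        seen.add((regs["r1"], regs["r2"]))
    return seen

print(sorted(outcomes()))
```

The result (r1, r2) = (1, 0), where the reader sees the flag but stale data, never appears. A design that lets the flag store become visible before the data store's invalidations are acknowledged can produce it.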
4. Verification Strategy
Coherence is notoriously hard to verify. Key techniques:
- Litmus tests: Small concurrent programs that test specific ordering scenarios
- Randomized testing: Random sequences of reads/writes from multiple cores with a checker that monitors invariants
- Formal verification: Model checking of the protocol state machine for deadlock and safety properties
// Example invariant: a line held in M has no other sharers
assert property (
  @(posedge clk) disable iff (!rst_n)
  (cache_state == M) |-> (sharer_count == 0)
);
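The randomized approach can be prototyped against a behavioral model before any RTL exists. The sketch below drives a single-line MESI model with random reads and writes and checks the single-writer/multiple-reader invariant after every step. The model is deliberately simplified (atomic transactions, no evictions), and `step`, `check_invariants`, and `random_test` are hypothetical names.

```python
import random

def step(states, core, op):
    """Apply one read ('R') or write ('W') to a single line under MESI."""
    others = [c for c in range(len(states)) if c != core]
    if op == "R":
        if states[core] == "I":
            shared = any(states[c] != "I" for c in others)
            for c in others:
                if states[c] in ("M", "E"):
                    states[c] = "S"     # remote exclusive copy downgrades
            states[core] = "S" if shared else "E"
    else:
        for c in others:
            states[c] = "I"             # invalidate every other copy
        states[core] = "M"

def check_invariants(states):
    """An M or E copy must be the only valid copy of the line."""
    valid = [s for s in states if s != "I"]
    if any(s in ("M", "E") for s in valid):
        assert len(valid) == 1, f"exclusive copy coexists with others: {states}"

def random_test(num_cores=4, steps=10_000, seed=0):
    rng = random.Random(seed)
    states = ["I"] * num_cores
    for _ in range(steps):
        step(states, rng.randrange(num_cores), rng.choice("RW"))
        check_invariants(states)
    return states

random_test()
print("invariants held")
```

Fixing the seed keeps a failing run reproducible, which matters as much in simulation as it does here.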
Performance Considerations
The choice of protocol has real performance impact:
| Protocol | Bus Traffic | Latency | Complexity | Scale |
|----------|-------------|---------|------------|-------|
| MSI | High | Low | Low | 2-4 cores |
| MESI | Medium | Low | Medium | 4-8 cores |
| MOESI | Low | Low | High | 4-8 cores |
| Directory | Low | Medium | High | 64+ cores |
For modern SoCs with 4-8 cores, MESI or MOESI with a shared bus is common. For server-class chips with 64+ cores, directory-based protocols are essential.
Conclusion
Cache coherence is one of the hardest problems in computer architecture, but understanding the evolution from MSI → MESI → MOESI → Directory provides a clear mental model. The key insight is that each step eliminates a specific inefficiency:
- MESI: Eliminates upgrade bus traffic for exclusive lines
- MOESI: Allows dirty sharing, eliminating the memory writeback when a Modified line becomes shared
- Directory: Eliminates the bus bottleneck for large-scale systems
Getting these protocols right in RTL requires careful handling of transient states, proper ordering, and thorough verification — but the performance impact of correct, efficient coherence is enormous.