Cache Coherence Protocols: From MSI to MOESI, a Hardware Designer's Perspective

Why Cache Coherence Matters

In a typical multi-core system, each core has its own private L1 cache. When two cores cache the same memory address, a write from one core leaves the other core's copy stale. This is the cache coherence problem, and getting it wrong means silent data corruption.

The solution is a coherence protocol: a set of rules that govern how cache lines transition between states in response to reads and writes from local and remote cores.

The Classic MSI Protocol

The simplest coherence protocol is MSI — Modified, Shared, Invalid.

State Definitions

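| State | Meaning | Dirty? | Other cached copies? |
|-------|---------|--------|---------------|
| M (Modified) | This cache has the only valid copy, and it's dirty | Yes | No |
| S (Shared) | This cache has a clean copy that matches memory | No | Possibly |
| I (Invalid) | This cache line is not valid | — | — |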

State Transitions

Consider a bus-based system where all caches snoop a shared bus.

When a core performs a PrRd (processor read):

I → S: BusRd issued. Data fetched from memory, or from another cache that holds the line in M (that cache flushes the data and downgrades to S).
S → S: No bus transaction. Read from local cache.
M → M: No bus transaction. Read from local cache.

When a core performs a PrWr (processor write):

I → M: BusRdX issued (read-exclusive). All other caches invalidate their copies. Data fetched, then modified.
S → M: BusUpgr issued (upgrade). All other caches invalidate. Line promoted to Modified.
M → M: No bus transaction. Write locally.
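
The processor-side transitions above can be written as a small combinational function. This is a minimal behavioural sketch, not RTL for any particular bus: the package, type, and function names (`msi_proc_pkg`, `proc_access`, `bus_req_e`, and so on) are hypothetical and chosen only for illustration.

```systemverilog
// Sketch of MSI processor-side transitions. All names are illustrative assumptions.
package msi_proc_pkg;
  typedef enum logic [1:0] {I, S, M} msi_state_e;
  typedef enum logic [1:0] {BUS_NONE, BUS_RD, BUS_RDX, BUS_UPGR} bus_req_e;

  typedef struct packed {
    msi_state_e next;  // state after the access completes
    bus_req_e   req;   // bus transaction the access must issue
  } proc_result_t;

  // is_write = 0 models PrRd, is_write = 1 models PrWr.
  function automatic proc_result_t proc_access(msi_state_e cur, logic is_write);
    proc_result_t r;
    r.next = cur;
    r.req  = BUS_NONE;
    case (cur)
      I: if (is_write) begin r.next = M; r.req = BUS_RDX; end
         else          begin r.next = S; r.req = BUS_RD;  end
      S: if (is_write) begin r.next = M; r.req = BUS_UPGR; end
      M: ;  // reads and writes both hit locally
      default: ;
    endcase
    return r;
  endfunction
endpackage

// Self-checking usage example for simulation.
module msi_proc_tb;
  import msi_proc_pkg::*;
  proc_result_t r;
  initial begin
    r = proc_access(I, 1'b0);
    assert (r.next == S && r.req == BUS_RD)   else $error("I read");
    r = proc_access(S, 1'b1);
    assert (r.next == M && r.req == BUS_UPGR) else $error("S write");
    r = proc_access(M, 1'b1);
    assert (r.next == M && r.req == BUS_NONE) else $error("M write");
  end
endmodule
```

A real controller would not jump straight to the final state; it would wait for the bus transaction to complete before installing it.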

Bus Transactions

BusRd: Read without intent to modify. A cache holding the line in M supplies the data via Flush and transitions M → S.
BusRdX: Read with intent to modify. Other caches invalidate their copies; a cache holding the line in M flushes it first.
BusUpgr: Upgrade from S to M. Other caches invalidate. No data transfer needed.
Flush: Write back dirty data to memory.
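
The snooping side, meaning how a cache reacts when it observes another core's transaction, can be sketched the same way. The names below (`msi_snoop_pkg`, `snoop`, `snoop_result_t`) are hypothetical, and the sketch assumes a cache in M always answers with a Flush.

```systemverilog
// Sketch of MSI snoop-side transitions. All names are illustrative assumptions.
package msi_snoop_pkg;
  typedef enum logic [1:0] {I, S, M} msi_state_e;
  typedef enum logic [1:0] {BUS_NONE, BUS_RD, BUS_RDX, BUS_UPGR} bus_req_e;

  typedef struct packed {
    msi_state_e next;   // state after observing the transaction
    logic       flush;  // 1 when this cache must drive its dirty data
  } snoop_result_t;

  function automatic snoop_result_t snoop(msi_state_e cur, bus_req_e req);
    snoop_result_t r;
    r.next  = cur;
    r.flush = 1'b0;
    case (req)
      BUS_RD:   if (cur == M) begin r.next = S; r.flush = 1'b1; end
      BUS_RDX:  begin
                  r.flush = (cur == M);  // dirty data must not be lost
                  r.next  = I;
                end
      BUS_UPGR: if (cur == S) r.next = I;
      default: ;
    endcase
    return r;
  endfunction
endpackage

// Self-checking usage example for simulation.
module msi_snoop_tb;
  import msi_snoop_pkg::*;
  snoop_result_t r;
  initial begin
    r = snoop(M, BUS_RD);
    assert (r.next == S && r.flush)  else $error("M snoops BusRd");
    r = snoop(S, BUS_RD);
    assert (r.next == S && !r.flush) else $error("S snoops BusRd");
    r = snoop(S, BUS_UPGR);
    assert (r.next == I && !r.flush) else $error("S snoops BusUpgr");
  end
endmodule
```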

The MESI Protocol: Adding the Exclusive State

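The key insight: when a core issues a BusRd and no other cache holds the line, there is no reason to enter Shared. The other caches report whether they hold a copy (typically through a wired-OR shared signal on the bus); if none does, the requester enters Exclusive, meaning it has the only clean copy.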

E (Exclusive): This cache has the only copy, and it's clean. It matches memory.

The advantage: transitioning E → M requires no bus transaction. You already have the only copy, so there's nothing to invalidate. This eliminates upgrade traffic on the bus.

PrWr in E-state: Transition to M-state silently — no bus traffic.

This is a real performance win for read-then-write patterns on private data, which are common even in parallel code.
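
The two places where MESI differs from MSI are the state installed on a read miss and whether a write needs the bus. The sketch below captures both, assuming the bus provides a `shared_line` signal asserted by any cache that holds a copy. That signal and all names are assumptions made for illustration.

```systemverilog
// Sketch of the MESI-specific decisions. All names are illustrative assumptions.
package mesi_pkg;
  typedef enum logic [1:0] {I, S, E, M} mesi_state_e;

  // State installed after a read miss, based on the snoop response.
  function automatic mesi_state_e read_fill_state(logic shared_line);
    return shared_line ? S : E;
  endfunction

  // Does a processor write in this state require a bus transaction?
  function automatic logic write_needs_bus(mesi_state_e cur);
    return (cur == I) || (cur == S);  // E and M proceed silently
  endfunction
endpackage

// Self-checking usage example for simulation.
module mesi_tb;
  import mesi_pkg::*;
  initial begin
    assert (read_fill_state(1'b0) == E) else $error("no sharers should give E");
    assert (read_fill_state(1'b1) == S) else $error("sharers should give S");
    assert (!write_needs_bus(E))        else $error("E to M must be silent");
    assert (write_needs_bus(S))         else $error("S to M needs an upgrade");
  end
endmodule
```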

The MOESI Protocol: Adding the Owned State

Some systems (notably AMD processors with coherent HyperTransport links) add a fifth state:

O (Owned): This cache has a dirty copy, but other caches may also have copies. This cache is responsible for eventually writing back to memory.

The O-state allows dirty sharing:

Core A: M → O on snoop read hit (provides data, keeps dirty copy)
Core B: I → S on read miss (receives data from Core A)

Memory is now stale, but Core A knows it must eventually write back. If Core A evicts, it writes back. If Core B evicts, it simply drops (no writeback needed).
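
A minimal sketch of the snooped-read and eviction rules described above follows. The names (`moesi_pkg`, `snoop_read_next`, `must_supply_data`, `evict_needs_writeback`) are hypothetical. The sketch also assumes that only M and O lines are required to supply data, whereas real designs may let E or S lines respond as well.

```systemverilog
// Sketch of MOESI snoop-read and eviction behaviour. All names are illustrative assumptions.
package moesi_pkg;
  typedef enum logic [2:0] {I, S, E, O, M} moesi_state_e;

  // Next state when another core's read is snooped.
  function automatic moesi_state_e snoop_read_next(moesi_state_e cur);
    case (cur)
      M, O:    return O;  // keep the dirty copy and the writeback duty
      E, S:    return S;
      default: return I;
    endcase
  endfunction

  // The dirty holder must answer, because memory may be stale.
  function automatic logic must_supply_data(moesi_state_e cur);
    return (cur == M) || (cur == O);
  endfunction

  // Only dirty states write back on eviction; S and E simply drop the line.
  function automatic logic evict_needs_writeback(moesi_state_e cur);
    return (cur == M) || (cur == O);
  endfunction
endpackage

// Self-checking usage example for simulation: the Core A / Core B scenario.
module moesi_tb;
  import moesi_pkg::*;
  initial begin
    assert (snoop_read_next(M) == O)   else $error("Core A: M should become O");
    assert (must_supply_data(O))       else $error("owner must supply data");
    assert (evict_needs_writeback(O))  else $error("Core A eviction writes back");
    assert (!evict_needs_writeback(S)) else $error("Core B eviction just drops");
  end
endmodule
```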

Key MOESI Benefit

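Without the O-state, a snooped read of a Modified line forces an immediate writeback to memory so that the line can become Shared (clean) in both caches. The O-state defers that writeback: one cache stays explicitly responsible for the dirty data while others hold read-only copies.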

Directory-Based Coherence

Bus-based (snooping) protocols scale poorly beyond roughly 8 cores, because every miss is broadcast to every cache and the shared bus bandwidth saturates. For larger systems, we use directories.

Directory Structure

A directory entry tracks, for each memory block:

State: Uncached, Shared, Exclusive
Sharers: Bit vector of which caches hold the line

Directory entry for address X:
  state:   SHARED
  sharers: [core0, core3, core7]  // 8-bit vector for 8-core system: 8'b1000_1001
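
In SystemVerilog, such an entry could be declared as a packed struct. This is a sketch with hypothetical names (`dir_pkg`, `dir_entry_t`, `NUM_CORES`), assuming a full bit vector with one bit per core.

```systemverilog
// Sketch of a full-bit-vector directory entry. All names are illustrative assumptions.
package dir_pkg;
  localparam int unsigned NUM_CORES = 8;

  typedef enum logic [1:0] {UNCACHED, SHARED, EXCLUSIVE} dir_state_e;

  typedef struct packed {
    dir_state_e           state;
    logic [NUM_CORES-1:0] sharers;  // bit i set means core i holds the line
  } dir_entry_t;
endpackage

// Self-checking usage example for simulation.
module dir_entry_example;
  import dir_pkg::*;
  dir_entry_t entry;
  initial begin
    // Cores 0, 3, and 7 share the line.
    entry = '{state: SHARED, sharers: 8'b1000_1001};
    assert ($countones(entry.sharers) == 3) else $error("expected 3 sharers");
    assert ($bits(dir_entry_t) == 10)       else $error("expected 10 bits per entry");
  end
endmodule
```

With this layout each tracked block costs 2 state bits plus one bit per core, which is why full bit vectors become expensive at high core counts.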

Directory Protocol Operations

Read miss from core i:

Directory: if Uncached → fetch from memory, set state to Exclusive, sharer = {i}

Directory: if Shared → fetch from memory, add i to sharers

Directory: if Exclusive at core j → forward request to core j, core j supplies data, both cores set to Shared, directory updated

Write miss from core i:

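Directory: if Shared → send invalidations to all other sharers, wait for acks

Directory: if Exclusive at core j → forward request to core j, core j supplies data and invalidates its copy

Directory: grant exclusive ownership to core i, set state to Exclusive, sharer = {i}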
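
The directory's decision on a miss can be sketched as a pure function from the current entry and the request to the next entry plus the messages to send. All names here (`dir_proto_pkg`, `handle_miss`, `dir_action_t`) are hypothetical. The sketch ignores transient directory states and assumes that a write miss to a line held Exclusive elsewhere is forwarded to the owner.

```systemverilog
// Sketch of directory miss handling. All names are illustrative assumptions.
package dir_proto_pkg;
  localparam int unsigned NUM_CORES = 8;
  typedef enum logic [1:0] {UNCACHED, SHARED, EXCLUSIVE} dir_state_e;

  typedef struct packed {
    dir_state_e           state;
    logic [NUM_CORES-1:0] sharers;
  } dir_entry_t;

  typedef struct packed {
    dir_entry_t           next;         // directory entry after the request
    logic [NUM_CORES-1:0] inv_targets;  // caches that must invalidate and ack
    logic                 fwd_owner;    // 1: forward request to the exclusive owner
  } dir_action_t;

  function automatic dir_action_t handle_miss(
      dir_entry_t cur, logic [$clog2(NUM_CORES)-1:0] req, logic is_write);
    dir_action_t a;
    logic [NUM_CORES-1:0] me;
    me = '0;
    me[req] = 1'b1;
    a.next        = cur;
    a.inv_targets = '0;
    a.fwd_owner   = 1'b0;
    if (is_write) begin
      a.inv_targets  = cur.sharers & ~me;  // everyone except the requester
      a.fwd_owner    = (cur.state == EXCLUSIVE) && (cur.sharers != me);
      a.next.state   = EXCLUSIVE;
      a.next.sharers = me;
    end else begin
      case (cur.state)
        UNCACHED:  begin a.next.state = EXCLUSIVE; a.next.sharers = me; end
        SHARED:    a.next.sharers = cur.sharers | me;
        EXCLUSIVE: begin
                     a.fwd_owner    = 1'b1;  // owner supplies the data
                     a.next.state   = SHARED;
                     a.next.sharers = cur.sharers | me;
                   end
        default: ;
      endcase
    end
    return a;
  endfunction
endpackage

// Self-checking usage example for simulation.
module dir_proto_tb;
  import dir_proto_pkg::*;
  dir_entry_t  e;
  dir_action_t a;
  initial begin
    e = '{state: SHARED, sharers: 8'b1000_1001};
    a = handle_miss(e, 3'd2, 1'b1);  // write miss from core 2
    assert (a.inv_targets == 8'b1000_1001)  else $error("invalidate cores 0, 3, 7");
    assert (a.next.state == EXCLUSIVE && a.next.sharers == 8'b0000_0100)
      else $error("core 2 should become the exclusive owner");
  end
endmodule
```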

This is significantly more complex than snooping protocols, with more message types and potential race conditions between requests.

RTL Implementation Notes

When implementing coherence in RTL, several practical considerations arise:

1. Transient States

Real implementations need transient states between stable states:

IM^D → moving from Invalid to Modified, waiting for data (after BusRdX)
IS^D → moving from Invalid to Shared, waiting for data
SM^A → moving from Shared to Modified, waiting for invalidation acks

Without transient states, your state machine can accept a new request in an inconsistent state during a pending transaction.
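
One way to encode this is to put stable and transient states in a single enum and derive a "busy" predicate from it. The sketch below uses hypothetical names and a source-destination-event naming style (for example `ST_IM_D` for "I going to M, waiting for data"), which is an assumption about naming rather than a fixed standard.

```systemverilog
// Sketch of stable plus transient state encoding. All names are illustrative assumptions.
package msi_transient_pkg;
  typedef enum logic [2:0] {
    ST_I, ST_S, ST_M,  // stable states
    ST_IS_D,           // I to S, waiting for data
    ST_IM_D,           // I to M, waiting for data
    ST_SM_A            // S to M, waiting for invalidation acks
  } cache_state_e;

  // A line in a transient state must stall or retry new requests.
  function automatic logic is_transient(cache_state_e s);
    return (s == ST_IS_D) || (s == ST_IM_D) || (s == ST_SM_A);
  endfunction

  // Stable state reached once the awaited data or acks arrive.
  function automatic cache_state_e complete(cache_state_e s);
    case (s)
      ST_IS_D:          return ST_S;
      ST_IM_D, ST_SM_A: return ST_M;
      default:          return s;  // already stable
    endcase
  endfunction
endpackage

// Self-checking usage example for simulation.
module msi_transient_tb;
  import msi_transient_pkg::*;
  initial begin
    assert (is_transient(ST_IM_D))     else $error("IM_D is transient");
    assert (!is_transient(ST_M))       else $error("M is stable");
    assert (complete(ST_SM_A) == ST_M) else $error("SM_A resolves to M");
  end
endmodule
```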

2. Writeback Buffer

Dirty evictions shouldn't stall the pipeline. A writeback buffer decouples eviction from writeback completion:

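typedef struct packed {
  logic [31:0]  addr;   // address of the evicted line
  logic [511:0] data;   // one 64-byte cache line
  logic         valid;  // entry holds a writeback that has not completed
} wb_entry_t;

wb_entry_t wb_buffer [3:0];  // 4-entry writeback buffer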
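
Until a buffered writeback completes, the buffer holds the only up-to-date copy of that line, so incoming snoops generally need to be checked against it. The module below is a sketch of that lookup plus a full flag. The module and port names are hypothetical, and it assumes 64-byte lines so that address bits [31:6] identify a line.

```systemverilog
// Sketch of writeback-buffer status and snoop lookup. All names are illustrative assumptions.
module wb_buffer_lookup #(
  parameter int unsigned DEPTH = 4
) (
  input  logic [DEPTH-1:0]       valid,       // per-entry valid bits
  input  logic [DEPTH-1:0][31:0] addr,        // per-entry line addresses
  input  logic [31:0]            snoop_addr,  // address seen on the bus
  output logic                   full,        // no free entry: eviction must wait
  output logic                   snoop_hit    // buffer holds the snooped line
);
  always_comb begin
    full      = &valid;
    snoop_hit = 1'b0;
    for (int unsigned i = 0; i < DEPTH; i++) begin
      // Compare at cache-line granularity (64-byte lines).
      if (valid[i] && (addr[i][31:6] == snoop_addr[31:6]))
        snoop_hit = 1'b1;
    end
  end
endmodule
```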

3. Ordering Requirements

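Coherence protocols must support the memory ordering the architecture promises. Coherence by itself only orders writes to the same address; ordering across addresses comes from the memory consistency model. Under a strong model such as sequential consistency or TSO, a write to address A followed by a write to address B from the same core must become visible to all cores in that order. This requires invalidation acknowledgements and careful handling of the memory barrier (fence) instruction.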

4. Verification Strategy

Coherence is notoriously hard to verify. Key techniques:

Litmus tests: Small concurrent programs that test specific ordering scenarios
Randomized testing: Random sequences of reads/writes from multiple cores with a checker that monitors invariants
Formal verification: Model checking of the protocol state machine for deadlock and safety properties

// Example invariant assertion
assert property (
  @(posedge clk) disable iff (!rst_n)
  (cache_state == M) |-> (sharer_count == 0)
);
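
The single-cache assertion above generalizes to a checker that watches one line across all caches and enforces the single-writer, multiple-reader invariant. This is a sketch with hypothetical module and port names, assuming the testbench can expose one "is in M" and one "is in S" bit per cache for the line under observation.

```systemverilog
// Sketch of a single-writer / multiple-reader checker. All names are illustrative assumptions.
module swmr_checker #(
  parameter int unsigned NUM_CACHES = 4
) (
  input logic                  clk,
  input logic                  rst_n,
  input logic [NUM_CACHES-1:0] is_m,  // bit i: cache i holds the line in M
  input logic [NUM_CACHES-1:0] is_s   // bit i: cache i holds the line in S
);
  // At most one cache may hold the line in M.
  a_single_writer: assert property (
    @(posedge clk) disable iff (!rst_n)
    $onehot0(is_m)
  );

  // If any cache is in M, no cache may hold the line in S.
  a_no_readers_with_writer: assert property (
    @(posedge clk) disable iff (!rst_n)
    (|is_m) |-> (is_s == '0)
  );
endmodule
```

A checker like this is typically bound to the design in simulation, and the same properties can be reused as targets for formal tools.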

Performance Considerations

The choice of protocol has real performance impact:

|----------|-------------|---------|------------|-------|

| MSI | High | Low | Low | 2-4 cores |

For modern SoCs with 4-8 cores, MESI or MOESI with snooping is common. For server-class chips with 64 or more cores, directory-based protocols are effectively required.

Conclusion

Cache coherence is one of the hardest problems in computer architecture, but understanding the evolution from MSI → MESI → MOESI → Directory provides a clear mental model. The key insight is that each step eliminates a specific inefficiency:

MESI: Eliminates upgrade bus traffic for exclusive lines
MOESI: Allows dirty sharing, eliminating memory writebacks when a modified line is read by another core
Directory: Eliminates the bus bottleneck for large-scale systems

Getting these protocols right in RTL requires careful handling of transient states, proper ordering, and thorough verification — but the performance impact of correct, efficient coherence is enormous.

References

Sorin, D. J., Hill, M. D., & Wood, D. A. (2011). A Primer on Memory Consistency and Cache Coherence
Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach, 6th Edition
ARM AMBA 5 CHI Architecture Specification