Ashley Stevens and Bruce Mathewson, ARM
Abstract:
AMBA 4 ACE adds system-level coherency support to the AMBA 4 specifications. By enabling cache coherency between the high-performance ARM Cortex-A15 MPCore processor and the software-compatible, high-efficiency Cortex-A7 MPCore processor, it enables energy savings through heterogeneous multiprocessing, termed ‘big.LITTLE’ by ARM.
Heterogeneous Multi-Processing
The continual requirement for more processing performance whilst maintaining or reducing energy consumption, to increase battery life and/or reduce energy use, drives the requirement for power-efficient processing. It is well known that multiprocessing is a power-efficient way to add performance, as opposed to pushing a single processor to ever higher performance through increased use of low-Vt transistors, high-performance but leaky process technology and higher, over-drive power-supply voltages. Provided software can make use of parallel hardware, it is more efficient in terms of both power and area to add parallel processor units. SMP (Symmetric Multi-Processing) is already well established. Today most ARM Cortex™-A9 implementations are multicore, either dual or quad core. In future, expect to see ARM-based SoCs with more than four cores. The move to ARM Cortex-A15 increases the performance potential significantly, but at the cost of more transistors. Combined with the move to smaller geometry processes such as 32 or 28nm, this results in increased leakage current during periods of low activity.
To improve battery life of mobile products whilst enabling the extremely high performance of Cortex-A15 when required, ARM introduced the Cortex-A7 MPCore processor. Cortex-A7 is an extremely small and efficient core able to provide adequate performance for many tasks at very low power and with very high energy-efficiency. The Cortex-A7/Cortex-A15 energy-efficient compute model has been termed ‘big.LITTLE’, where one or more ‘big’ processors with very high performance (in this case Cortex-A15) are paired with one or more ‘LITTLE’ processors which are architecturally compatible with the ‘big’ core, with lower performance but significantly greater energy efficiency. The ‘LITTLE’ processor can be used for many low-intensity tasks, and the ‘big’ processor is powered up only for high-intensity tasks that demand more performance than the ‘LITTLE’ processor can provide.
The CoreLink CCI-400 Cache Coherent Interconnect (the CCI-400) creates a cache-coherent interconnect between two processor clusters, either two Cortex-A15 clusters or one Cortex-A15 cluster and one Cortex-A7 cluster. Each cluster can contain up to four CPUs. It supports all big.LITTLE operating models, i.e.:
1. big.LITTLE switching model
2. big.LITTLE Multiprocessing (MP) model
3. Asymmetric big.LITTLE (Fixed task allocation)
The Coherency Challenge
Cache coherency is an issue in any system that contains one or more caches and more than one entity sharing data in a single cached area. There are two potential problems with a system that contains caches. Firstly, memory may be updated (by another master) after a cached master has taken a copy. At this point, the data within the cache is out-of-date and no longer contains the most up-to-date data. This issue occurs with any system that contains any kind of cache. Secondly, systems that contain write-back caches must deal with the case where the master writes to the local cached copy, at which point the memory no longer contains the most up-to-date data. A second master reading memory will see out-of-date (stale) data.
Software-Based Coherency Approach
Cache coherency may be addressed with software-based techniques. In the case where the cache contains stale data, the cached copy may be invalidated and re-read from memory when needed again. When memory contains stale data due to a write-back cache containing dirty data, the cache may be cleaned forcing write-back to memory. Any other cached copies that may exist in other caches must be invalidated.
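Both problems, and the software remedies for them, can be illustrated with a toy model. The following is a minimal Python sketch, not ARM code: the class `WriteBackCache` and its `clean` and `invalidate` methods are hypothetical names chosen for illustration, and memory is modelled as a plain dictionary.

```python
class WriteBackCache:
    """Toy write-back cache in front of a shared dict that stands in for memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> cached value
        self.dirty = set()  # addresses whose cached value is newer than memory

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch the line from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value        # write-back policy: memory is not updated
        self.dirty.add(addr)

    def clean(self, addr):
        """Write a dirty line back to memory and keep the cached copy."""
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        """Drop the cached copy so that the next read refetches from memory."""
        self.lines.pop(addr, None)
        self.dirty.discard(addr)


memory = {0x100: 1}
producer, consumer = WriteBackCache(memory), WriteBackCache(memory)

consumer.read(0x100)                 # consumer caches the old value
producer.write(0x100, 2)             # new value exists only in producer's cache
stale_memory = memory[0x100]         # memory is now stale
stale_cached = consumer.read(0x100)  # consumer's cached copy is stale too

producer.clean(0x100)                # software clean forces the write-back
consumer.invalidate(0x100)           # software invalidate drops the stale copy
fresh = consumer.read(0x100)         # refetched from up-to-date memory
```

In this sketch the software has to know exactly which addresses to clean and invalidate, and when, which is the burden that hardware coherency removes.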
Hardware-Based Coherency Approaches
Snooping cache coherency protocols
Hardware-based coherency techniques can be divided into two main approaches. Cache coherency systems based on Cache Snooping rely on all masters ‘listening in’ to all shared-data transactions originating from other masters. Each master has an address input as well as an address output to monitor all other transactions that occur. When the master detects a read transaction for which it has the most up-to-date data it provides the data to the master requesting it, or in the case of a write it will invalidate its local copy.
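The snooping behaviour described above can be sketched as a toy model. This is an illustrative Python sketch under simplifying assumptions (one value per address, no write-back on eviction); `SnoopingCache` and `SnoopBus` are hypothetical names and do not correspond to any ACE signal or component.

```python
class SnoopingCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> (value, dirty flag)

    def snoop_read(self, addr):
        """Supply our copy if it is dirty, i.e. we hold the latest data."""
        entry = self.lines.get(addr)
        if entry is not None and entry[1]:
            return entry[0]
        return None

    def snoop_write(self, addr):
        """Another master is writing this line, so our copy becomes stale."""
        self.lines.pop(addr, None)


class SnoopBus:
    def __init__(self, memory, caches):
        self.memory = memory
        self.caches = caches

    def read(self, requester, addr):
        if addr in requester.lines:
            return requester.lines[addr][0]
        for other in self.caches:           # every other master listens in
            if other is not requester:
                data = other.snoop_read(addr)
                if data is not None:        # master-to-master transfer
                    requester.lines[addr] = (data, False)
                    return data
        value = self.memory[addr]           # nobody had newer data
        requester.lines[addr] = (value, False)
        return value

    def write(self, requester, addr, value):
        for other in self.caches:
            if other is not requester:
                other.snoop_write(addr)
        requester.lines[addr] = (value, True)


memory = {0x40: 10}
a, b = SnoopingCache("A"), SnoopingCache("B")
bus = SnoopBus(memory, [a, b])
bus.read(b, 0x40)              # B caches the original value
bus.write(a, 0x40, 11)         # the snoop invalidates B's copy
seen_by_b = bus.read(b, 0x40)  # supplied by A's cache, not by stale memory
```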
Directory-based cache coherency protocols
An alternative approach is the Directory-based coherency protocol. In a directory-based system there is a single ‘directory’ which contains a list of where every cached line within the system is held. A master initiating a transaction first consults the directory to find where the data is cached and then directs cache coherency traffic to only those masters containing cached copies. AMBA 4 ACE is designed to be a flexible protocol that enables the creation of systems based on either snoop or directory-based approaches and may be used for ‘hybrid’ approaches such as those with snoop filters.
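The saving from a directory comes from snooping only the masters that actually hold a line. A minimal Python sketch of that bookkeeping follows; the `Directory` class, its method names and the master names are assumptions made for illustration only.

```python
from collections import defaultdict


class Directory:
    """Toy directory: records which masters hold a copy of each line."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> names of holders
        self.snoops_sent = 0

    def record_fill(self, master, addr):
        self.sharers[addr].add(master)

    def record_evict(self, master, addr):
        self.sharers[addr].discard(master)

    def targets_for_write(self, writer, addr):
        """Return only the masters that must be snooped for this write."""
        targets = self.sharers[addr] - {writer}
        self.snoops_sent += len(targets)
        self.sharers[addr] = {writer}    # the writer becomes the sole holder
        return targets


directory = Directory()
directory.record_fill("cpu0", 0x80)
directory.record_fill("cpu1", 0x80)
directory.record_fill("gpu", 0xC0)       # holds a different line

targets = directory.targets_for_write("cpu0", 0x80)
```

Here only `cpu1` is snooped; a pure broadcast scheme would also have disturbed `gpu`, which holds no copy of the line.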
AMBA 4 ACE Cache Coherency States
The AMBA 4 ACE protocol is based on a 5-state cache model. Each cache line is either Valid or Invalid, meaning it either contains cached data or not. If it’s Valid then it can be in one of 4 states defined by two properties. The line is either Unique or Shared, meaning it’s either non-shared data or potentially shared data (the address space is marked as shared). And the line is either Clean or Dirty: Clean generally means that memory contains the latest, most up-to-date data and the cache line is merely a copy of memory, whereas Dirty means that the cache line holds the latest, most up-to-date data and must be written back to memory at some stage. When data is shared, all caches must contain the latest data value at all times, but only one copy may be in the SharedDirty state, the others being held in the SharedClean state. The SharedDirty state is thus used to indicate which cache has responsibility for writing the data back to memory, and SharedClean is more accurately described as meaning data is shared but there is no need to write it back to memory.
However, ACE is designed to support components that use a variety of internal cache state models, including MESI, MOESI, MEI and others. ACE does not prescribe the cache states a component can use. Some components may not support all ACE transactions.
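The five states follow mechanically from the three properties (Valid, Unique, Dirty). The Python sketch below encodes that; the enum, the function names and the MOESI name table are illustrative assumptions, with the table reflecting the commonly cited correspondence between the ACE state names and MOESI.

```python
from enum import Enum


class AceState(Enum):
    INVALID = "I"
    UNIQUE_CLEAN = "UC"
    UNIQUE_DIRTY = "UD"
    SHARED_CLEAN = "SC"
    SHARED_DIRTY = "SD"


def ace_state(valid, unique, dirty):
    """Derive the line state from its three properties."""
    if not valid:
        return AceState.INVALID
    if unique:
        return AceState.UNIQUE_DIRTY if dirty else AceState.UNIQUE_CLEAN
    return AceState.SHARED_DIRTY if dirty else AceState.SHARED_CLEAN


# Commonly cited correspondence with MOESI naming (assumed, for orientation).
MOESI_NAME = {
    AceState.UNIQUE_DIRTY: "Modified",
    AceState.SHARED_DIRTY: "Owned",
    AceState.UNIQUE_CLEAN: "Exclusive",
    AceState.SHARED_CLEAN: "Shared",
    AceState.INVALID: "Invalid",
}


def must_write_back(state):
    """Only a dirty line carries responsibility for updating memory."""
    return state in (AceState.UNIQUE_DIRTY, AceState.SHARED_DIRTY)
```

A MESI component simply never places a line in the SharedDirty ("Owned") state, which is why the protocol does not force all five states on every master.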
ACE Protocol Design Principles
Understanding the basic ACE protocol principles is essential to understanding ACE. Lines held in more than one cache must be held in the Shared state. Only one copy can be in the SharedDirty state, and that is the one that is responsible for updating memory. As stated previously, devices are not required to support all 5 states in the protocol internally. But beyond these fundamental rules there are many options as ACE is very flexible.
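The two fundamental rules can be written as a small consistency check over the states that different caches hold for one line. This is a hypothetical Python helper for illustration (the function name and the use of plain strings for states are assumptions), of the kind that might appear in a simulation testbench.

```python
def check_line(states):
    """Check one line's per-cache states against the two basic ACE rules.

    `states` is a list with one state name per cache, for example
    ["SharedDirty", "SharedClean", "Invalid"]. Returns a list of violations.
    """
    valid = [s for s in states if s != "Invalid"]
    problems = []
    if len(valid) > 1 and any(s.startswith("Unique") for s in valid):
        problems.append("line held in more than one cache must be Shared")
    if valid.count("SharedDirty") > 1:
        problems.append("only one copy may be SharedDirty")
    return problems


ok = check_line(["SharedDirty", "SharedClean", "Invalid"])
bad_unique = check_line(["UniqueDirty", "SharedClean"])
bad_owner = check_line(["SharedDirty", "SharedDirty"])
```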
The system interconnect is responsible for coordinating the progress of all shared (coherent) transactions and can handle these in various ways. The interconnect may choose to perform speculative reads to lower latency and improve performance, or it may choose to wait until snoop responses have been received and it is known that a memory read is required, reducing system power consumption by minimizing external memory reads. The interconnect may include a directory or snoop filter, or it may broadcast snoops to all masters.
Unlike early cache coherency protocols and systems (such as board-based MEI snoop coherency where all snoop hits resulted in a write-back), which were concerned primarily with functional correctness, ACE has been designed to enable performance and power optimizations by avoiding unnecessary external memory accesses wherever possible. ACE facilitates direct master-to-master data transfer wherever possible. Since off-chip data accesses consume roughly an order of magnitude (10x) more energy than on-chip memory accesses, this enables the system designer to minimize energy and maximize performance.
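The trade-off between speculative reads and waiting for snoop responses can be made concrete with a toy cost function. In the Python sketch below the function name and the latency values are made-up illustrative units, not figures for any real interconnect.

```python
def read_cost(snoop_hit, speculative, snoop_latency=1, memory_latency=4):
    """Return (latency, external_memory_reads) for one shared read.

    The latencies are arbitrary illustrative units.
    """
    if speculative:
        # The memory read is issued in parallel with the snoop, so it is
        # always paid for, even when a cache ends up supplying the data.
        latency = snoop_latency if snoop_hit else max(snoop_latency, memory_latency)
        return latency, 1
    if snoop_hit:
        return snoop_latency, 0              # no external access at all
    return snoop_latency + memory_latency, 1  # the read starts after the snoop


hit_speculative = read_cost(snoop_hit=True, speculative=True)
hit_waiting = read_cost(snoop_hit=True, speculative=False)
miss_speculative = read_cost(snoop_hit=False, speculative=True)
miss_waiting = read_cost(snoop_hit=False, speculative=False)
```

Under these assumed numbers the speculative policy wins on latency when the snoop misses, while the waiting policy avoids the wasted external read when the snoop hits.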
ACE Additional Signals and Channels
AMBA 4 ACE is backwards-compatible with AMBA 4 AXI, adding signals and channels to the AMBA 4 AXI interface. The AXI interface consists of 5 channels. In AXI, the read and write channels each have their own dedicated address and control channel. The BRESP channel is used to indicate the completion of write transactions.
ARADDR: Read address and command channel
RDATA: Read data channel
AWADDR: Write address and command channel
WDATA: Write data channel
BRESP: Write response channel
AXI 5 channel interface
For further information on the AXI interface download the specification from the ARM website.
ACE adds 3 more channels for coherency transactions and also adds some additional signals to existing channels. The ACADDR channel is a snoop address input to the master. The CRRESP channel is used by the master to signal the response to snoops to the interconnect. The CDDATA channel is output from the master to transfer snoop data to the originating master and/or external memory.
ACADDR: Coherent address channel. Input to master
CRRESP: Coherent response channel. Output from master
CDDATA: Coherent data channel. Output from master
ACE additional channels
AMBA 4 ACE Transactions
ACE introduces a large number of new transactions to AMBA 4. To understand them, it’s best to think of them in groups.
Non-shared
The first group is Non-Shared and this consists of ReadNoSnoop and WriteNoSnoop transactions. These are the same as existing AXI Read and Write used for non-coherent, non-snooped transactions.
Non-cached
ReadOnce is used by masters reading data that is shared but where the master will not retain a copy, meaning that other masters with a cached copy do not need to change cache state. An example use could be a display controller reading a cached frame buffer. ReadOnce is used by ACE-Lite masters for shared reads.
WriteUnique is used by uncached masters writing shared data, meaning that all other cached copies must be cleaned. Any dirty copy must be written to memory and clean copies invalidated.
WriteLineUnique is like WriteUnique except that it always operates on a whole cache line.
Note: Read, Write, ReadOnce and WriteUnique are the only ACE (external) transactions that can operate on data that is not a whole cache line.
Shareable Read
ReadShared is used for shareable reads where the master can accept cache line data in any state.
ReadClean is used for shareable reads where the master wants a clean copy of the line. It cannot accept data in either of the dirty states (for example it has a write-through cache).
ReadNotSharedDirty is used for shareable reads where the master can accept data in any cache state except the SharedDirty state (for example it uses the MESI model rather than the MOESI model). The Cortex-A15 processor core uses the MESI model and therefore uses ReadNotSharedDirty for shared read transactions.
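The choice between the three shareable reads depends only on which states the master's cache can hold. The Python sketch below expresses that selection; the table and the function name `pick_shareable_read` are illustrative assumptions, using the abbreviations UC, UD, SC and SD for the four valid states.

```python
# States in which each transaction may deliver the line to the requester.
ACCEPTED_STATES = {
    "ReadShared": {"UC", "UD", "SC", "SD"},    # any state
    "ReadNotSharedDirty": {"UC", "UD", "SC"},  # anything except SharedDirty
    "ReadClean": {"UC", "SC"},                 # clean states only
}


def pick_shareable_read(supported_states):
    """Pick the least restrictive read whose results the cache can hold."""
    candidates = [
        (len(accepted), name)
        for name, accepted in ACCEPTED_STATES.items()
        if accepted <= supported_states
    ]
    if not candidates:
        raise ValueError("cache cannot hold any shareable line state")
    return max(candidates)[1]


moesi_read = pick_shareable_read({"UC", "UD", "SC", "SD"})
mesi_read = pick_shareable_read({"UC", "UD", "SC"})
write_through_read = pick_shareable_read({"UC", "SC"})
```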
Shareable Write
Note: Transactions in the ‘shareable write’ group are all actually performed on the read channel as they are used for obtaining the right to write a line in an internal write-back cache.
MakeUnique is used to invalidate all other copies of a cache line before performing a shareable write of the whole line. Once all other copies have been invalidated, the master can allocate a line in its cache and perform the write. ReadUnique is like MakeUnique except it also reads the line from memory.
ReadUnique is used by a master prior to performing a partial line write, where it is updating only some of the bytes in the line. To do this it needs a copy of the line from memory to perform the partial line write to; unlike MakeUnique where it will write the whole line and mark it dirty, meaning it doesn’t need the prior contents of memory.
CleanUnique is used for partial line writes where the master already has a copy of the line in its cache. It’s therefore similar to ReadUnique except that it doesn’t need to read the memory contents, only ensure any dirty copy is written back and all other copies are invalidated. Note that if the data is dirty in another cache (i.e. in the SharedDirty state), the master initiating the CleanUnique transaction will be in the SharedClean state but will have a current, up-to-date copy of the data, as the ACE protocol ensures that all copies of a line in all caches are the same at all times. It’s possible to use ReadUnique instead of CleanUnique at the expense of an unnecessary external memory read. The Cortex-A15 processor doesn’t use CleanUnique but uses ReadUnique.
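The choice among the three 'shareable write' transactions can be summarized as a small decision function. This Python sketch is illustrative only; the function name and the `supports_clean_unique` flag are assumptions, the flag modelling a master that always uses ReadUnique in place of CleanUnique.

```python
def shareable_write_for(full_line, has_copy, supports_clean_unique=True):
    """Pick the transaction a master issues before writing a shareable line."""
    if full_line:
        # Every byte will be overwritten, so the old contents are not needed.
        return "MakeUnique"
    if has_copy and supports_clean_unique:
        # Partial write, and the line is already held locally.
        return "CleanUnique"
    # Partial write without a usable local copy: fetch the line as well.
    return "ReadUnique"


full_line_write = shareable_write_for(full_line=True, has_copy=False)
partial_with_copy = shareable_write_for(full_line=False, has_copy=True)
partial_without_copy = shareable_write_for(full_line=False, has_copy=False)
partial_no_clean_unique = shareable_write_for(
    full_line=False, has_copy=True, supports_clean_unique=False
)
```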
Write-back transactions
WriteBack is a transaction to write back an entire dirty line to memory and is normally the result of an eviction of a dirty line as the result of allocating a new line.
WriteClean is like WriteBack but indicates the master will retain a copy of the clean line afterwards. This could be the result of a master performing eager writeback, i.e. speculatively writing back lines when not strictly necessary in the hope that the line will not be updated again prior to eviction. WriteClean is provided in addition to WriteBack to enable an external snoop filter to track cache contents.
Evict indicates that a clean line has been replaced (‘evicted’), for example as the result of an allocation. Evict does not write back anything; it is provided purely as a mechanism for external snoop filters to track what’s in the cache.
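The reason for distinguishing WriteBack, WriteClean and Evict becomes clearer from the point of view of a snoop filter that only observes transactions. The Python sketch below is a toy model; `SnoopFilter`, `eviction_transaction` and the grouping of transaction names into `ALLOCATING` and `RELEASING` are illustrative assumptions.

```python
ALLOCATING = {"ReadShared", "ReadClean", "ReadNotSharedDirty",
              "ReadUnique", "MakeUnique"}
RELEASING = {"WriteBack", "Evict"}


def eviction_transaction(dirty):
    """A replaced dirty line is written back; a clean one is only announced."""
    return "WriteBack" if dirty else "Evict"


class SnoopFilter:
    """Tracks which master may hold which line by observing transactions."""

    def __init__(self):
        self.present = set()  # (master, address) pairs

    def observe(self, master, transaction, addr):
        if transaction in ALLOCATING:
            self.present.add((master, addr))
        elif transaction in RELEASING:
            self.present.discard((master, addr))
        # WriteClean leaves the entry alone: the master keeps its copy.

    def may_hold(self, master, addr):
        return (master, addr) in self.present


snoop_filter = SnoopFilter()
snoop_filter.observe("cpu0", "ReadShared", 0x200)
snoop_filter.observe("cpu0", "WriteClean", 0x200)
held_after_write_clean = snoop_filter.may_hold("cpu0", 0x200)
snoop_filter.observe("cpu0", eviction_transaction(dirty=False), 0x200)
held_after_evict = snoop_filter.may_hold("cpu0", 0x200)
```

Without Evict, the filter in this sketch would keep believing that `cpu0` holds the line and would go on sending it needless snoops.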
Cache Maintenance
CleanShared is a broadcast cache clean causing any cache with a dirty copy to write the line to memory. Caches can retain the data in a clean state.
CleanInvalid is similar to CleanShared except that caches must invalidate all copies after any dirty data is written back.
MakeInvalid is a broadcast invalidate. Caches are required to invalidate all copies and do not need to write back any data to memory even if it’s dirty data.
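The effect of the three maintenance operations on a single cached line can be tabulated in code. The Python sketch below is a simplification for illustration: the function name is hypothetical, and it assumes a cache that retains a line after CleanShared keeps its Unique or Shared property unchanged.

```python
def apply_maintenance(op, state):
    """Return (new_state, writes_back) for one cache's copy of the line."""
    if state == "Invalid":
        return "Invalid", False
    dirty = state.endswith("Dirty")
    if op == "CleanShared":
        # Dirty data goes to memory; the copy may be kept, now clean.
        return state.replace("Dirty", "Clean"), dirty
    if op == "CleanInvalid":
        # Dirty data goes to memory; the copy is then dropped.
        return "Invalid", dirty
    if op == "MakeInvalid":
        # The copy is dropped without any write-back, even if dirty.
        return "Invalid", False
    raise ValueError(f"unknown maintenance operation: {op}")


after_clean_shared = apply_maintenance("CleanShared", "SharedDirty")
after_clean_invalid = apply_maintenance("CleanInvalid", "UniqueDirty")
after_make_invalid = apply_maintenance("MakeInvalid", "UniqueDirty")
```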
Note that ACE-Lite can perform only the transactions within the Non-shared, Non-cached and Cache Maintenance transaction groups.
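The grouping of transactions, and the ACE-Lite restriction, can be captured in a lookup table. The Python sketch below lists the transactions under the group headings used in this section; the dictionary and function names are illustrative assumptions.

```python
TRANSACTION_GROUPS = {
    "Non-shared": ["ReadNoSnoop", "WriteNoSnoop"],
    "Non-cached": ["ReadOnce", "WriteUnique", "WriteLineUnique"],
    "Shareable Read": ["ReadShared", "ReadClean", "ReadNotSharedDirty"],
    "Shareable Write": ["MakeUnique", "ReadUnique", "CleanUnique"],
    "Write-back": ["WriteBack", "WriteClean", "Evict"],
    "Cache Maintenance": ["CleanShared", "CleanInvalid", "MakeInvalid"],
}

ACE_LITE_GROUPS = {"Non-shared", "Non-cached", "Cache Maintenance"}


def group_of(transaction):
    for group, members in TRANSACTION_GROUPS.items():
        if transaction in members:
            return group
    raise KeyError(transaction)


def allowed_on_ace_lite(transaction):
    """True if an uncached ACE-Lite master may issue this transaction."""
    return group_of(transaction) in ACE_LITE_GROUPS


ace_lite_transactions = sorted(
    t for members in TRANSACTION_GROUPS.values() for t in members
    if allowed_on_ace_lite(t)
)
```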
ACE-Lite I/O Coherency
ACE-Lite masters can perform transactions only from the Non-shared, Non-cached and Cache Maintenance transaction groups. ACE-Lite enables uncached masters to snoop ACE coherent masters. This can enable interfaces such as Gigabit Ethernet to directly read and write cached data shared within the CPU. Going forwards, ACE-Lite is the preferred technique for I/O coherency and should be used where possible rather than the Accelerator Coherency Port for best power and performance. Cortex-A15 supports an optional ACP primarily for designs including legacy IP that is not ACE-Lite compliant or designs that are upgrades from other MPCore technology-enabled processors.
Cortex-A15 / Cortex-A7 Inter-Cluster Coherency
The Cortex-A15 MPCore processor was the first ARM processor core to support AMBA 4 ACE. The CCI-400 Cache Coherent Interconnect enables expansion of the SoC beyond the 4-core cluster of the ARM MPCore technology to multiple clusters. Alternatively, CCI-400 supports Cortex-A15 with Cortex-A7 in a big.LITTLE configuration. Full coherency may be maintained between the Cortex-A15 and Cortex-A7. All shared transactions are controlled by the ACE coherent interconnect. ARM has developed the CCI-400 Cache Coherent Interconnect product to support up to two clusters of CPUs (Cortex-A15 + Cortex-A15 or Cortex-A15 + Cortex-A7) and three additional ACE-Lite masters.
Barriers
In systems with shared-memory communications and out-of-order execution, barriers are necessary to ensure correct operation. The ARM architecture defines two types of barrier, DMB and DSB. The DMB, or Data Memory Barrier, ensures that all memory transactions prior to the barrier are visible to other masters before any transaction after it. Hence the DMB prevents any re-ordering across the barrier. With the extension of coherency outside the MPCore cluster, it’s necessary to broadcast barriers on the ACE interface. DMB barriers may define a subset of masters that must be able to observe the barrier. This is indicated on the AxDOMAIN signals, which can indicate Inner, Outer, System, or Non-shareable.
The DSB, or Data Synchronization Barrier, is used to wait until all previous transactions are complete. This contrasts with DMB, where the transactions can flow on the pipelined interconnect. DSB would typically be used when interacting with hardware that has an interface in addition to the one via memory.
Distributed Virtual Memory (DVM)
The ability to build multi-cluster coherent CPU systems sharing a single set of MMU page tables in memory brings the requirement to ensure TLB coherency. A TLB or Translation Look-aside Buffer is a cache of MMU page tables in memory. When one master updates page tables it needs to invalidate TLBs that may contain a stale copy of the MMU page table entry. Distributed Virtual Memory support in ACE consists of broadcast invalidation messages. DVM messages can support TLB invalidation, branch predictor invalidation, virtual or physical instruction cache invalidation (for when a processor has written code to memory) and synchronization. DVM messages are sent on the Read channel using ARSNOOP signaling. A system MMU (SMMU) may make use of the TLB invalidation messages to ensure its entries are up-to-date. Instruction cache and branch-predictor invalidation messages are used for inter-processor communications, ensuring instruction-side invalidation occurs in all processors in the system, even across different clusters.
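The need for broadcast TLB invalidation can be shown with a toy model in which each TLB caches translations from a shared page table. The Python sketch below is illustrative only: `Tlb` and `broadcast_tlb_invalidate` are hypothetical names, and the broadcast is modelled as a simple loop rather than as DVM messages on the interconnect.

```python
class Tlb:
    """Toy TLB: caches virtual-to-physical page translations."""

    def __init__(self):
        self.entries = {}  # virtual page -> physical page

    def lookup(self, vpage, page_table):
        if vpage not in self.entries:   # miss: walk the shared page table
            self.entries[vpage] = page_table[vpage]
        return self.entries[vpage]

    def invalidate(self, vpage):
        self.entries.pop(vpage, None)


def broadcast_tlb_invalidate(tlbs, vpage):
    """Stand-in for a broadcast invalidation reaching every TLB."""
    for tlb in tlbs:
        tlb.invalidate(vpage)


page_table = {0x1: 0xA}
cluster0_tlb, cluster1_tlb = Tlb(), Tlb()
cluster0_tlb.lookup(0x1, page_table)
cluster1_tlb.lookup(0x1, page_table)

page_table[0x1] = 0xB                            # one master remaps the page
stale = cluster1_tlb.lookup(0x1, page_table)     # still the old translation

broadcast_tlb_invalidate([cluster0_tlb, cluster1_tlb], 0x1)
fresh = cluster1_tlb.lookup(0x1, page_table)     # page table is walked again
```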
Summary
AMBA 4 ACE adds system-level coherency support to the AMBA 4 specifications. It supports cache-coherent transactions, cache maintenance transactions, barrier signaling and distributed virtual memory (DVM) messages to enable TLB management and invalidation.
ACE has been designed to support a wide range of coherent masters with differing capabilities, not just processor cores such as the Cortex-A15 MPCore processor. Furthermore, the ARM CCI-400 Cache Coherent Interconnect supports not just dual Cortex-A15 clusters but also fully coherent big.LITTLE processing with Cortex-A15 and Cortex-A7. ACE supports I/O coherency for uncached masters, supports masters with differing cache line sizes, differing internal cache state models, and masters with write-back or write-through caches. ACE has been in development for more than 4 years and benefits from the input and experience of more than 30 reviewing companies.
The ACE specification is now available for download on the ARM website via a click-thru licensing agreement.