Neil Parris
Neil Parris is CCI Product Manager for the ARM processor division responsible for interconnect products.

Extended System Coherency 2 – Implementation, big.LITTLE, GPU Compute, Enterprise

April 7th, 2015 by Neil Parris

This is the second part of a series of blogs about hardware coherency. In the first blog I introduced the fundamentals of cache coherency: Extended System Coherency – Part 1 – Cache Coherency Fundamentals

This part talks about the implementation of hardware cache coherency and use cases.

Implementing Hardware Coherency

ARM’s first implementations of AMBA® 4 ACE include the ARM® CoreLink™ CCI-400 Cache Coherent Interconnect, ARM Cortex®-A15 and Cortex-A7 processors. These products were first released to our silicon partners in 2011, and we’ve seen the first big.LITTLE™ products come to market in 2013.

CoreLink CCI-400 has been licensed by over 24 partners to date for mobile and enterprise applications such as networking or micro-servers. CoreLink CCI-400 supports up to two AMBA 4 ACE processor clusters allowing up to eight processor cores to see the same view of memory and run an SMP OS.

Mobile Applications: big.LITTLE processing

CoreLink CCI-400 supports all big.LITTLE combinations including Cortex-A15 + Cortex-A7, Cortex-A17 + Cortex-A7, and Cortex-A57 + Cortex-A53, with full support for ARMv8-A including 64-bit. big.LITTLE processing is a power optimization technology from ARM where high-performance ‘big’ cores and efficiency-tuned ‘LITTLE’ cores are combined with software to dynamically transition applications to the right processor at the right time.


Hardware coherency is fundamental to big.LITTLE processing as it allows the big and LITTLE processor clusters to see the same view of memory and run the same operating system. big.LITTLE software such as Global Task Scheduling (GTS) places tasks on the appropriate core at a given time. For moderate workloads all processing may be performed on the LITTLE cores while the big cores are powered down. If a workload requires higher performance, a big core is powered up and the task is migrated to it while other moderate workloads continue to run on LITTLE cores. big.LITTLE GTS allows all the cores on an SoC to run simultaneously. For example, a device with four big and four LITTLE cores will appear to the operating system as an octa-core processor.
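The placement idea can be sketched in a few lines of Python. This is only an illustration of threshold-based migration with hysteresis, not the GTS implementation: the `Task` class, the `place` and `big_cluster_powered` functions and both threshold values are assumptions made up for this example.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
UP_THRESHOLD = 0.8    # a LITTLE task above this load moves to a big core
DOWN_THRESHOLD = 0.3  # a big task below this load moves back to LITTLE


@dataclass
class Task:
    name: str
    load: float            # tracked load, 0.0 to 1.0
    cluster: str = "LITTLE"


def place(task: Task) -> str:
    """Update and return the cluster a task should run on."""
    if task.cluster == "LITTLE" and task.load > UP_THRESHOLD:
        task.cluster = "big"
    elif task.cluster == "big" and task.load < DOWN_THRESHOLD:
        task.cluster = "LITTLE"
    return task.cluster


def big_cluster_powered(tasks) -> bool:
    """The big cluster only needs power while a task is placed on it."""
    return any(t.cluster == "big" for t in tasks)


tasks = [Task("ui", 0.2), Task("game", 0.95), Task("audio", 0.1)]
for t in tasks:
    place(t)
```

With these made-up loads only the "game" task moves to the big cluster, while the two moderate tasks stay on LITTLE cores. The gap between the two thresholds keeps a task whose load hovers around a single value from bouncing between clusters.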

Mobile Applications: GPU Compute


GPU Compute, with APIs such as OpenCL™ 1.1 Full Profile and Google RenderScript compute, unlocks the combined processing power of CPU and GPU.

The ARM Mali™-T600 series and Mali-T760 GPUs support AMBA 4 ACE-Lite for IO coherency with the CPU. This means that the GPU can read any shared data directly from the CPU caches, and writes to shared memory will automatically invalidate relevant lines in CPU caches. Hardware coherency reduces the cost of sharing data between CPU and GPU, and allows tighter coupling.

GPU Compute applications include: computational photography, computer vision, modern multimedia codecs targeting Ultra HD resolutions such as HEVC and VP9, complex image processing and gesture recognition.

ARM is one of the founding members of the Heterogeneous System Architecture (HSA) foundation. This foundation aims to provide a royalty free specification that makes it easier to take advantage of the heterogeneous CPU, GPU and DSP hardware in an SoC. This includes shared virtual memory and a roadmap to fully coherent GPU. These techniques will further reduce the cost of sharing data between processing engines.

See the HSA website for more information.

Enterprise Applications: Networking and Server


Enterprise applications such as networking and server have high-performance serial interfaces such as PCI Express, Serial ATA and Ethernet. In most applications all of this data will be marked as shared, as there will be many cases where the CPU needs to access data from these serial interfaces. The picture to the right shows a simplified example system.

Example: network interface

  • Incoming packet on Ethernet interface stored to DRAM
    • Shared writes will automatically invalidate any stale data in CPU caches
  • CPU processes packet headers
  • Ethernet interface forwards packet
    • Shared reads will look up in CPU cache and DRAM to find the latest data
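The packet flow above can be mimicked with a toy model. This is a sketch under strong assumptions, not a description of the ACE protocol: it has one CPU cache, whole-line accesses only, and no cache line states beyond a dirty flag. The class name `ToySystem`, its methods and the packet strings are all hypothetical.

```python
class ToySystem:
    """Toy model: one write-back CPU cache, DRAM, and an IO-coherent master."""

    def __init__(self):
        self.dram = {}        # address -> data
        self.cpu_cache = {}   # address -> (data, dirty)

    def cpu_read(self, addr):
        if addr not in self.cpu_cache:               # miss: fetch from DRAM
            self.cpu_cache[addr] = (self.dram[addr], False)
        return self.cpu_cache[addr][0]

    def cpu_write(self, addr, data):
        self.cpu_cache[addr] = (data, True)          # DRAM is not updated yet

    def io_shared_write(self, addr, data):
        self.cpu_cache.pop(addr, None)               # invalidate stale copy
        self.dram[addr] = data

    def io_shared_read(self, addr):
        if addr in self.cpu_cache:                   # snoop hit in CPU cache
            return self.cpu_cache[addr][0]
        return self.dram[addr]                       # otherwise read DRAM


HEADER = 0x1000
soc = ToySystem()
soc.dram[HEADER] = "old packet"
soc.cpu_read(HEADER)                     # CPU now holds the old packet
soc.io_shared_write(HEADER, "ttl=64")    # Ethernet stores an incoming packet
seen = soc.cpu_read(HEADER)              # CPU misses and fetches the new data
soc.cpu_write(HEADER, "ttl=63")          # CPU updates the header in its cache
forwarded = soc.io_shared_read(HEADER)   # Ethernet reads the latest data
```

In this model the Ethernet interface forwards the header the CPU just modified even though DRAM still holds the older value. Without hardware coherency, software would have to clean and invalidate the cache around each of these steps.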

There is a trend in networking applications to move functionality to software to allow an SoC to support multiple applications. This means that the SoC needs more processing nodes.

The CCI-400 Cache Coherent Interconnect is being designed into a range of smaller enterprise applications including residential gateways, security appliances, WLAN enterprise access points, industrial communications and micro servers. These applications use a range of ARM processors, from Cortex-A7 to Cortex-A57 depending on the performance requirements, with a maximum of 8 cores in total and no L3 cache.

ARM has a range of interconnect products to extend performance across a range of core counts:

  • CoreLink CCI-400 Cache Coherent Interconnect
    • Up to 2 clusters, 8 cores
  • CoreLink CCN-504 Cache Coherent Network
    • Up to 4 clusters, 16 cores
    • Integrated L3 cache, 2 channel 72 bit DDR
  • CoreLink CCN-508 Cache Coherent Network
    • Up to 8 clusters, 32 cores
    • Integrated L3 cache, 4 channel 72 bit DDR
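As a rough illustration, the limits in the list above can be turned into a lookup that returns the smallest interconnect able to host a given configuration. The function name `smallest_fit` is hypothetical, and a real selection would also weigh L3 cache, memory channels and area.

```python
# Limits copied from the list above: (name, max clusters, max cores).
INTERCONNECTS = [
    ("CoreLink CCI-400", 2, 8),
    ("CoreLink CCN-504", 4, 16),
    ("CoreLink CCN-508", 8, 32),
]


def smallest_fit(clusters: int, cores: int):
    """Return the first (smallest) interconnect that fits, or None."""
    for name, max_clusters, max_cores in INTERCONNECTS:
        if clusters <= max_clusters and cores <= max_cores:
            return name
    return None


big_little = smallest_fit(clusters=2, cores=8)   # 4 big + 4 LITTLE cores
server = smallest_fit(clusters=4, cores=16)
```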

Ian Forsyth talks more about the CoreLink CCN products in this blog post: Coherent Interconnect Technology Supports Exponential Data Flow Growth

CoreLink CCI-400 Cache Coherent Interconnect

The following table details key features of the CoreLink CCI-400:

Feature | Description
Slave Interfaces | 2x ACE fully coherent interfaces, up to 8 processor cores (Cortex-A7, Cortex-A15, Cortex-A17, Cortex-A53 or Cortex-A57); 3x ACE-Lite IO coherent interfaces for GPU, accelerators and interfaces
Master Interfaces | 2x ACE-Lite for memory, with configurable interleaving memory striping option; 1x ACE-Lite for system
Quality of Service | Integrated bandwidth and latency regulators, QoS Virtual Networks
Address space | 44 bit Virtual, 40 bit Physical (1TB), supports ARMv7-A & ARMv8-A
Performance | Approximately 25GB/s sustained bandwidth at 533MHz for dual channel memory
Area | Area can be optimized for application, based on performance and frequency targets
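To make the memory striping and address space rows concrete, here is a minimal sketch of how a physical address could be steered to one of the two memory interfaces. The 4KB stripe size and the function name `memory_port` are assumptions for illustration; the table only states that striping is configurable.

```python
PHYS_ADDR_BITS = 40      # from the table: 40 bit physical address (1TB)
NUM_MEMORY_PORTS = 2     # from the table: 2x ACE-Lite for memory
STRIPE_BYTES = 4096      # hypothetical stripe size


def memory_port(addr: int) -> int:
    """Return which memory interface serves a physical address."""
    if not 0 <= addr < (1 << PHYS_ADDR_BITS):
        raise ValueError("address outside the 40-bit physical range")
    # Consecutive stripes alternate between the memory interfaces.
    return (addr // STRIPE_BYTES) % NUM_MEMORY_PORTS


ports = [memory_port(a) for a in (0x0000, 0x0FFF, 0x1000, 0x2000)]
```

Alternating stripes spreads a long sequential access over both memory channels, which is how a dual channel system gets close to the combined bandwidth of the two. The 1TB figure in the table is simply 2^40 bytes.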

Two of the most commonly asked questions are: how big is it, and how fast does it run? CoreLink CCI-400 has many configuration options, including register stages and transaction tracker sizes, which allow the interconnect area and performance to be optimized for a given application. At the low end the gate count gets down towards 100k gates. In terms of clock speed, our baseline implementation trials started at 533MHz on a CMOS 32LP process, but we see a number of partners implementing at higher speeds on smaller silicon geometries and with faster implementation techniques.

The following diagram demonstrates an example mobile applications processor with Cortex-A50 series processors, CoreLink MMU-500 System MMU and a range of CoreLink 400 system IP.


In this system the Cortex-A57 and Cortex-A53 provide the big.LITTLE processor combination and are connected to CCI-400 with AMBA 4 ACE to provide full hardware coherency. The Mali-T628 and IO Coherent masters connect to CCI-400 via AMBA 4 ACE-Lite interfaces. As described in the first blog, this IO coherency allows the IO coherent agents to read from processor caches.

The other components in the system include:

  • MMU-500 System MMU – provides stage 1 and/or stage 2 address translation to support virtualization of memory for system components.
  • TZC-400 TrustZone Address Space Controller – performs security checks on transactions to memory or peripherals and allows regions of memory to be marked as secure or protected.
  • DMC-400 Dynamic Memory Controller – provides dynamic memory scheduling and interfacing to external DDR2/3 or LPDDR2 memory.
  • NIC-400 Network Interconnect – provides a fully configurable, hierarchical, low latency connectivity for AMBA 4 AXI4, AMBA 3 AXI3, AHB-Lite and APB components.
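The region-based security check described for the TZC-400 can be pictured with a small model. This is a conceptual sketch only and does not reflect the TZC-400 register interface: the `Region` class, the `allowed` function, the addresses and the reject-by-default rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int
    size: int
    secure_only: bool   # True: only secure transactions may access it


def allowed(regions, addr: int, secure: bool) -> bool:
    """Decide whether a transaction to addr may proceed."""
    for r in regions:
        if r.base <= addr < r.base + r.size:
            return secure or not r.secure_only
    return False  # assumption: addresses outside every region are rejected


# Hypothetical memory map for the example.
regions = [
    Region(base=0x8000_0000, size=0x1000_0000, secure_only=False),
    Region(base=0x9000_0000, size=0x0100_0000, secure_only=True),
]
normal_to_secure = allowed(regions, 0x9000_0040, secure=False)
secure_to_secure = allowed(regions, 0x9000_0040, secure=True)
```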

Performance Analysis with ARM DS-5 Streamline Performance Analyzer


So how do you optimize for the best performance and power efficiency around CCI-400? One solution is to use the Streamline Performance Analyzer, which is part of the ARM DS-5 Development Studio. This brings together system performance metrics, software tracing, statistical profiling and power measurement, and presents them in a system dashboard to help you optimize the system.

The CCI-400 includes a Performance Monitoring Unit (PMU) which allows the counting of events to measure items like bandwidth, transaction stalls and cache hit rates. These counters can be visualized with the Streamline Performance Analyzer as shown in the screen shot above. This data could be shown alongside SoC power and processor activity to understand what is happening at a system level.
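As a sketch of how raw event counts become the metrics mentioned above, the following turns counter deltas sampled over an interval into a bandwidth figure and a hit rate. The function names, the sample counts and the 16 bytes per data beat are assumptions for illustration and are not taken from the CCI-400 documentation.

```python
def bandwidth_bytes_per_s(beats: int, bytes_per_beat: int,
                          cycles: int, clock_hz: int) -> float:
    """Bytes transferred divided by the sampling interval in seconds."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return beats * bytes_per_beat * clock_hz / cycles


def hit_rate(hits: int, lookups: int) -> float:
    """Fraction of lookups that hit; 0.0 when nothing was counted."""
    return hits / lookups if lookups else 0.0


# Hypothetical counter deltas for one sampling interval at 533MHz.
bw = bandwidth_bytes_per_s(beats=2_000_000, bytes_per_beat=16,
                           cycles=5_330_000, clock_hz=533_000_000)
snoop_hits = hit_rate(hits=750, lookups=1_000)
print(f"{bw / 1e9:.2f} GB/s, snoop hit rate {snoop_hits:.0%}")
```

Sampling the counters at a fixed interval and plotting these derived values over time gives the kind of view a profiling tool presents alongside processor activity.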


In the first blog I described how the AMBA 4 ACE bus interface extends hardware cache coherency outside of the processor cluster and into the system. In this blog we looked at implementations of hardware coherency and applications from mobile, like big.LITTLE processing, and enterprise. At the heart of all these applications is a cache coherent interconnect like the CoreLink CCI-400. ARM as an IP provider is in a unique position to offer the complete solution of Cortex processor, Mali graphics and CoreLink cache coherent interconnect as well as tools and physical IP. I personally look forward to seeing more products come to market in 2014 taking full advantage of hardware cache coherency and AMBA 4 ACE, and I’d be interested in your plans or views on how this technology is helping you!
