Neil Parris is CCI Product Manager for the ARM processor division responsible for interconnect products.
Extended System Coherency 2 – Implementation, big.LITTLE, GPU Compute, Enterprise
April 7th, 2015 by Neil Parris
This is the second part of a series of blogs about hardware coherency. In the first blog I introduced the fundamentals of cache coherency: Extended System Coherency – Part 1 – Cache Coherency Fundamentals
This part talks about the implementation of hardware cache coherency and use cases.
Implementing Hardware Coherency
ARM’s first implementations of AMBA® 4 ACE include the ARM® CoreLink™ CCI-400 Cache Coherent Interconnect, ARM Cortex®-A15 and Cortex-A7 processors. These products were first released to our silicon partners in 2011, and we’ve seen the first big.LITTLE™ products come to market in 2013.
CoreLink CCI-400 has been licensed by over 24 partners to date for mobile and enterprise applications such as networking or micro-servers. CoreLink CCI-400 supports up to two AMBA 4 ACE processor clusters allowing up to eight processor cores to see the same view of memory and run an SMP OS.
Mobile Applications: big.LITTLE processing
CoreLink CCI-400 supports all big.LITTLE combinations including Cortex-A15 + Cortex-A7, Cortex-A17 + Cortex-A7, and Cortex-A57 + Cortex-53 with full support for ARMv8-A including 64-bit. big.LITTLE processing is a power optimization technology from ARM where high performance ‘big’ cores and efficiency tuned ‘LITTLE’ cores are combined with software to dynamically transition applications to the right processor at the right time.
Hardware coherency is fundamental to big.LITTLE processing as it allows the big and LITTLE processor clusters to see the same view of memory and run the same operating system. big.LITTLE software such as Global Task Scheduling (GTS) places tasks on the appropriate core at a given time. For moderate workloads all processing may be performed on the LITTLE cores while the big cores are powered down. If a workload requires higher performance a big core is powered up and the task migrated while other moderate workloads continue to run on LITTLE cores. big.LITTLE GTS allows all the cores on an SoC to run simultaneously, for example a device with four big and four LITTLE will appear to the operating system as a octo core processor.
Mobile Applications: GPU Compute
GPU Compute with APIs such as OpenCL™1.1 Full Profile and Google RenderScript compute, unlock the combined processing power of CPU and GPU.
The ARM Mali™-T600 series and Mali-T760 GPUs support AMBA 4 ACE-Lite for IO coherency with the CPU. This means that the GPU can read any shared data directly from the CPU caches, and writes to shared memory will automatically invalidate relevant lines in CPU caches. Hardware coherency reduces the cost of sharing data between CPU and GPU, and allows tighter coupling.
GPU Compute applications include: computational photography, computer vision, modern multimedia codecs targeting Ultra HD resolutions such as HEVC and VP9, complex image processing and gesture recognition.
ARM is one of the founding members of the Heterogeneous System Architecture (HSA) foundation. This foundation aims to provide a royalty free specification that makes it easier to take advantage of the heterogeneous CPU, GPU and DSP hardware in an SoC. This includes shared virtual memory and a roadmap to fully coherent GPU. These techniques will further reduce the cost of sharing data between processing engines.
See the HSA website for more information:
Enterprise Applications: Networking and Server
Enterprise applications such as networking and server have high performance serial interfaces such as PCI Express, Serial ATA and Ethernet. In most applications all of this data will be marked as shared as there will be many cases where the CPU needs to access data from these serial interfaces. The picture to the right shows an simplified example system (click picture for a larger image).
Example: network interface
There is a trend in networking applications to move functionality to software to allow an SoC to support multiple applications. This means that the SoC needs more processing nodes.
The CCI-400 Cache Coherent Interconnect is being designed into a range of smaller enterprise applications including residential gateways, security appliances, WLAN enterprise access points, industrial communications and micro servers. These applications use a range of ARM processors depending on the performance requirements from Cortex-A7 to Cortex-A57 with up to a total of 8 cores maximum and no L3 cache.
ARM has a range of interconnect products to extend performance across a range of core counts:
Ian Forsyth talks more about the CoreLink CCN products in this blog post:Coherent Interconnect Technology Supports Exponential Data Flow Growth
CoreLink CCI-400 Cache Coherent Interconnect
The following table details key features of the CoreLink CCI-400:
Two of the most commonly asked questions are: how big is it, and how fast does it run? CoreLink CCI-400 has many configuration options including register stages and transaction tracker sizes which allow the interconnect area and performance to be optimized for a given application. At the low end the gate account gets down towards 100k gates. In terms of clock speed, our baseline implementation trials started at 533MHz on a CMOS 32LP process, but we see a number of partners implementing at higher speeds on smaller silicon geometries and with faster implementation techniques.
The following diagram demonstrates an example mobile applications processor with Cortex-A50 series processors, CoreLink MMU-500 System MMU and a range of CoreLink 400 system IP (click picture for a larger image).
In this system the Cortex-A57 and Cortex-A53 provide the big.LITTLE processor combination and are connected to CCI-400 with AMBA 4 ACE to provide full hardware coherency. The Mali-T628 and IO Coherent masters connect to CCI-400 via AMBA 4 ACE-Lite interfaces. As described in the first blog, this IO coherency allows the IO coherent agents to read from processor caches.
The other components in the system include:
Performance Analysis with ARM DS-5 Streamline Performance Analyzer
So how do you optimize for the best performance and power efficiency around CCI-400? One solution is to use the Streamline Performance Analyzer which is part of the ARM DS-5 Development Studio. Thisbrings together system performance metrics, software tracing, statistical profiling, and power measurement to present into a system dashboard to help you optimize the system.
The CCI-400 includes a Performance Monitoring Unit (PMU) which allows the counting of events to measure items like bandwidth, transactions stalls, cache hit rates. These counters can be visualized with theStreamline Performance Analyzer as shown in the screen shot above. This data could be shown alongside SoC power and processor activity to understand what is happening at a system level.
In the first blog I described how the AMBA 4 ACE bus interface extends hardware cache coherency outside of the processor cluster and into the system. In this blog we looked at implementations of hardware coherency and applications from mobile, like big.LITTLE processing, and enterprise. At the heart of all these applications is a cache coherent interconnect like the CoreLink CCI-400. ARM as an IP provider is in a unique position to offer the complete solution of Cortex processor, Mali graphics and CoreLink cache coherent interconnect as well as tools and physical IP. I personally look forward to seeing more products come to market in 2014 taking full advantage of hardware cache coherency and AMBA 4 ACE, and I’d be interested in your plans or views on how this technology is helping you!