EDACafe Editorial
New AI architectures in the spotlight at Linley Spring Processor Conference 2021
April 29th, 2021, by Roberto Frazzoli
Cerebras’ new 2.6-trillion-transistor wafer-scale chip is one of the announcements made during the 2021 edition of the Linley Spring Processor Conference, a virtual event organized by technology analysis firm The Linley Group from April 19 to 23. In our quick overview of the conference we will focus mainly on new product announcements, which include innovative AI intellectual property from startups Expedera and EdgeCortix, a new approach to clock distribution from Movellus, and more. But first, let’s briefly summarize the opening keynote given by Linley Gwennap – Principal Analyst of The Linley Group – who provided an updated overview of AI technology and market trends.

A variety of AI acceleration architectures

Gwennap described the different AI processing architectures that the industry has developed over the past few years. While many CPUs, GPUs, and DSPs include wide vector (SIMD) compute units, many AI accelerators use systolic arrays to break the register-file bottleneck. Convolution architectures optimized for CNNs have also been proposed: examples include processors developed by Alibaba and Kneron. Within AI-specialized architectures, many choices are possible: a processor can use many little cores or a few big cores. Extreme examples are Cerebras, with its wafer-scale chip integrating over 400,000 cores (850,000 in the latest version), and Groq, with a single mega-core. Little cores are easier to design, while big cores simplify compiler/software design and are better for real-time workloads. Another architectural choice is multicore versus dataflow: in a multicore design, each core executes the neural network from start to finish, while in a dataflow design the neural network is divided across many cores. An additional architectural style – one that goes ‘beyond cores’ – is the Coarse-Grain Reconfigurable Architecture (CGRA), which uses dataflow principles, but instead of cores the connected blocks contain pipelined compute and memory units. This approach has been adopted by SambaNova, SimpleMachines, Tsing Micro and others. The industry therefore now offers a wide range of AI-capable architectures, ranging from very generic to very specialized. In general terms, a higher degree of specialization translates into higher efficiency but lower flexibility.
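To make the multicore-versus-dataflow distinction concrete, here is a minimal Python sketch, purely illustrative and not tied to any vendor’s architecture: in the multicore style each core runs the whole (toy) network on its own slice of the batch, while in the dataflow style each core owns one layer and activations stream from core to core.

```python
# Illustrative sketch (hypothetical, not any vendor's design): contrasting the
# "multicore" style, where each core runs the whole network on part of the
# batch, with the "dataflow" style, where the network is split across cores
# and activations stream from one core to the next.
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) for _ in range(3)]   # a toy 3-layer MLP
batch = rng.standard_normal((4, 8))                        # 4 input samples

def run_network(x, layer_weights):
    for w in layer_weights:
        x = np.maximum(x @ w, 0.0)                         # matmul + ReLU
    return x

# Multicore style: each "core" executes the full network on its batch slice.
multicore_out = np.vstack([run_network(sample[None, :], layers)
                           for sample in batch])

# Dataflow style: each "core" owns one layer; the batch streams through them.
x = batch
for core_id, w in enumerate(layers):
    x = np.maximum(x @ w, 0.0)                             # core `core_id` runs only its layer
dataflow_out = x

assert np.allclose(multicore_out, dataflow_out)            # same math, different mapping
```

The arithmetic is identical in both cases; what changes is how weights and work are mapped onto cores, which drives the compiler, memory and latency trade-offs Gwennap described.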
Evolution of AI workloads

AI workloads are evolving, too, especially in the case of image classification and natural language processing tasks. As Gwennap noted, image models seem to have reached a peak in complexity with AmoebaNet-B (557 million parameters), with diminishing returns for bigger models, while NLP models continue to grow in size at a pace of 40x per year: Google’s Switch Transformer has 1.6 trillion parameters. The reason is that bigger NLP models go well beyond word classification: they can also summarize or translate texts. Inference for the biggest NLP models requires very large memories to store parameters; some solutions involve software pipelining to divide the model across multiple chips – a solution made simpler by coherent interconnects. The largest models must be trained using thousands of chips, scaling across servers or racks with high-bandwidth connections. Gwennap then recapped the solutions being investigated to reduce the AI processing workload and increase efficiency, such as smaller data types, binary neural networks, spiking neural networks, analog computation, photonic computation, and solutions taking advantage of “sparsity”.

Nvidia still leading the AI datacenter market

Gwennap pointed out that Nvidia has further extended its lead in the datacenter market. The Ampere A100 chip offers big generational improvements over Turing and Volta, and the company is also gaining ground in AI inference. According to Gwennap, the players challenging Nvidia in the training market are still falling short of Ampere. On the inference side, instead, several vendors offer better energy efficiency than the Nvidia A100. The AI market landscape includes Internet giants that have designed their own chips, such as Alibaba, Google, Amazon and Baidu. Still, the most difficult part of an AI solution is the software, and that’s where Nvidia has a big advantage with its CUDA platform. Competition is much more open on the client (edge) side, where barriers to entry are lower than in the datacenter. Many startups are jumping into this market, such as Ambient, Aspinity, BrainChip, Cambricon, Cornami, Deep Vision, Flex Logix, Grai Matter, GreenWaves, Gyrfalcon, Hailo, Horizon Robotics, Innatera, Kneron, Mythic, Perceive, SiMa.ai and Syntiant. Established vendors also see an opportunity here, among them NXP, Maxim, Ambarella, Intel, Lattice, Nvidia, Synaptics and others. Recent developments at the edge include microcontrollers adding AI to small sensors (also through solutions such as Google’s TensorFlow Lite for Microcontrollers), new Tiny AI engines offering high efficiency for IoT applications, and new AI IP (intellectual property) specifically targeted at IoT products.

Expedera introduces new edge IP reaching 18 TOPS/W at 7nm

Let’s now move to the new products announced at the conference. One of the most notable came from Expedera, a Santa Clara-based startup that has just emerged from stealth to introduce its Origin neural engine, which it claims to be “the industry’s fastest and most energy-efficient AI inference IP for edge systems.” Silicon-proven through a 7-nanometer test chip, it provides up to 18 TOPS/W while minimizing memory requirements. Expedera’s scheduler operates on metadata, which simplifies the software stack and requires only about 128 bytes of memory for control sequences per layer.
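As a rough, hypothetical illustration of how compact per-layer control metadata can be, the Python sketch below packs an invented layer descriptor into 64 bytes using the standard struct module. The field names and layout are made up for this example and do not describe Expedera’s actual metadata format; they only illustrate the general idea that a full control sequence can fit in roughly 128 bytes per layer.

```python
# Hypothetical per-layer control descriptor, packed with the standard struct
# module. Field names and layout are invented for illustration only; they do
# not describe Expedera's actual metadata format.
import struct

# 16 unsigned 32-bit fields = 64 bytes; two such records per layer (e.g. a
# compute sequence and a data-movement sequence) would still fit in ~128 bytes.
LAYER_DESC = struct.Struct("<16I")

def pack_layer_descriptor(layer_id, op_code, in_h, in_w, in_c, out_c,
                          kernel, stride, weight_base, act_base, out_base,
                          tile_rows, tile_cols, flags):
    fields = (layer_id, op_code, in_h, in_w, in_c, out_c, kernel, stride,
              weight_base, act_base, out_base, tile_rows, tile_cols, flags,
              0, 0)  # two reserved/padding words
    return LAYER_DESC.pack(*fields)

desc = pack_layer_descriptor(layer_id=3, op_code=1, in_h=56, in_w=56, in_c=64,
                             out_c=128, kernel=3, stride=2,
                             weight_base=0x0000, act_base=0x8000,
                             out_base=0xC000, tile_rows=4, tile_cols=4, flags=0)
print(len(desc), "bytes per layer descriptor")   # -> 64
```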
“Expedera has created the unique concept of native execution, which greatly simplifies the AI hardware and software stack,” commented Linley Gwennap, quoted in the company’s press release. “As a result, the architecture is much more efficient than the competition when measured in TOPS/W or, more important, IPS/W on real neural networks. On either metric, Expedera’s design outperforms other DLA blocks from leading vendors such as Arm, MediaTek, Nvidia, and Qualcomm by at least 4–5x,” Gwennap noted. The Origin product family includes a range of configurations spanning from 2.25K to 54K INT8 MACs. More details can be found in this whitepaper.

EdgeCortix: run-time reconfigurable edge IP for streaming data applications

Another announcement came from an edge IP startup: EdgeCortix, a company with offices in San Francisco and Tokyo, unveiled an extension of its offering of IPs and compilers targeting embedded and telecom edge devices that require ultra-low latency, high throughput and high energy efficiency. According to the company, streaming data use cases at the edge need specific solutions: their batch size is 1, so they cannot rely on batching to improve hardware utilization. In addition, AI hardware architectures that fully cache network parameters in large on-chip SRAM cannot easily be scaled down to sizes applicable to edge workloads. EdgeCortix’s software-first approach is based on a compiler called MERA (Multi-module Efficient Reconfigurable Accelerator) and on an architecture IP called DNA (Dynamic Neural Accelerator). DNA is a dataflow array-based architecture designed for low-latency, low-power batch-1 inference, characterized by a run-time reconfigurable interconnect and memory structure. Reconfiguration makes it possible to adapt the architecture to the different degrees of parallelism found across neural networks and network layers: early layers have small channel sizes and large row/column sizes, while late layers have large channel sizes and small row/column sizes (a simple shape walk illustrating this trade-off is sketched at the end of this section). According to EdgeCortix, reconfiguration significantly reduces latency while increasing area efficiency (FPS/mm2, in vision applications) and power efficiency. MLPerf benchmarks have been published for the existing DNA-F200 IP implemented on FPGA. The new IP announced at the conference, DNA-A, is a modular solution for AI inference on ASICs offering different performance and power-efficiency points, ranging from 1.8 TOPS at ~0.6 Watts to 54 TOPS at ~8 Watts at 800 MHz. More details can be found in this whitepaper.

More AI acceleration announcements: Cerebras, Tenstorrent

Some details about the already mentioned new Cerebras chip: called Wafer Scale Engine 2 (WSE-2), it boasts 2.6 trillion transistors and 850,000 AI-optimized cores. This increase – compared with the 400,000 cores of the first-generation WSE – has been achieved by moving to a 7-nanometer TSMC process. The WSE-2 will power the Cerebras CS-2 computer system, which will more than double the performance of the first-generation CS-1. Production is scheduled for the third quarter. Tenstorrent introduced its new Wormhole chip, an evolution of the preexisting JawBridge and GraySkull devices. Targeting AI training with a system-level focus, Wormhole offers PPA improvements, GDDR6 DRAM, and scale-out through 100Gb Ethernet. The Tenstorrent solution presents the compiler with a uniform mesh substrate across the core/chip/server/rack hierarchy, and the compiler automatically places-and-routes the model onto it.
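Returning to the EdgeCortix point about layer shapes, the short sketch below walks the activation dimensions of a generic stride-2 CNN backbone (an invented example, not EdgeCortix’s DNA architecture) to show why early and late layers expose such different forms of parallelism.

```python
# Generic shape walk through a made-up stride-2 CNN backbone, illustrating the
# pattern described above: early layers have few channels but large
# rows/columns, late layers have many channels but small rows/columns.
def conv_out(hw, stride):
    return (hw + stride - 1) // stride   # ceil division: "same"-padded output size

h = w = 224
print(f"input   : {h:>3} x {w:>3} x   3")
for i, (out_c, stride) in enumerate([(32, 2), (64, 2), (128, 2), (256, 2), (512, 2)], start=1):
    h, w = conv_out(h, stride), conv_out(w, stride)
    print(f"layer {i} : {h:>3} x {w:>3} x {out_c:<3}  (spatial positions {h * w:>5}, channels {out_c})")
```

A fixed hardware mapping tuned for one of these shapes fits the others poorly, which is the rationale EdgeCortix gives for reconfiguring the interconnect and memory structure at run time.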
Neuromorphic news: Innatera, BrainChip

Neuromorphic updates from the conference mostly concerned software tools aimed at simplifying the development of spiking neural networks starting from industry-standard languages and libraries. Dutch company Innatera – a spin-off of the Delft University of Technology – preannounced one such tool, a Spiking Neural Processor SDK that will use a model description in Python. Innatera also provided updates on its high-efficiency neuromorphic processor targeting pattern recognition in sensor-based devices. BrainChip introduced MetaTF, an ML framework that allows people working in the convolutional neural network space to develop edge AI systems based on the company’s Akida event-domain neuromorphic processor – without having to learn anything new. The MetaTF development environment leverages TensorFlow and Keras for neural network development and training; it also leverages the Python scripting language and its associated tools and libraries.

Movellus’ new approach to clock distribution in SoCs

Even though AI accelerators made up the largest part of the conference content, the event also offered presentations concerning other product categories. Movellus (San Jose, CA) announced its Maestro Intelligent Clock Network platform for clock distribution in SoC designs, including AI applications in the cloud and at the edge. According to Movellus, Maestro clocking solutions dramatically improve the entire chip’s power, performance, and area, eliminating the need to overdesign the system to compensate for on-chip variation or skew. The Maestro platform combines a clock architecture, software automation, and application-optimized IP to solve common clock distribution challenges related to on-chip variation (OCV), jitter, clock skew, peak switching current, and switching noise. “Given the big impact clock distribution has on power, performance, and area in SoC designs, it hasn’t received the attention it deserves,” commented Linley Gwennap, quoted in the company’s press release. “Movellus’ innovative solution reduces the complexity and risks inherent in today’s clock networks, including block and SoC timing closure and full chip gate-level simulation. It enables designers to recover up to 30% of the SoC’s power and performance through its pioneering smart clock modules that reduce the effect of on-chip variation and skew.” According to Movellus, Maestro creates a virtual mesh which can be thought of as a “network-on-chip (NoC) for the clock”.

New Tensilica DSP cores from Cadence

As for “traditional” processing architectures, Cadence announced at the conference the expansion of its Tensilica Vision DSP product family with two new DSP IP cores for embedded vision and AI. The Tensilica Vision Q8 DSP, a 1024-bit SIMD architecture reaching 3.8 TOPS, delivers 2X the performance and memory bandwidth of the Vision Q7 DSP; it targets high-end vision and imaging applications in the automotive and mobile markets. The Tensilica Vision P1 DSP is optimized for always-on and smart sensor applications in the consumer market, consuming one-third the power and area of the Vision P6 DSP. Both new cores feature an N-way programming model that preserves software compatibility with prior-generation Tensilica Vision DSPs despite their different SIMD widths. Both DSPs also support the Xtensa Neural Network Compiler (XNNC) and the Android Neural Networks API (NNAPI).
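As a loose Python analogy for the width-agnostic idea behind an N-way programming model (not Cadence’s actual API or toolchain), the kernel below is written against a configurable lane count, so the same source produces identical results whether it is processed 32 or 64 lanes at a time.

```python
# Conceptual illustration only: a kernel written against a configurable vector
# width N, so the same source maps to machines with different SIMD widths.
# This is a plain-Python analogy, not Cadence's programming model or API.
def saxpy_nway(a, x, y, n_lanes):
    """Compute a*x + y elementwise, processed n_lanes elements at a time."""
    out = []
    for i in range(0, len(x), n_lanes):
        xv = x[i:i + n_lanes]                 # one "vector register" worth of data
        yv = y[i:i + n_lanes]
        out.extend(a * xi + yi for xi, yi in zip(xv, yv))
    return out

x = list(range(100))
y = [1.0] * 100
# The same kernel source targets different SIMD widths by changing N only.
assert saxpy_nway(2.0, x, y, n_lanes=32) == saxpy_nway(2.0, x, y, n_lanes=64)
```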
The 2021 Linley Spring Processor Conference also included presentations from Intel, Flex Logix, Arm, GlobalFoundries, Synopsys, Deep AI, CEVA, Arteris IP, Rambus, Graphcore, Marvell, SiFive, Mythic, SimpleMachines, Hailo, and Qualcomm.