AI acceleration takes center stage at the 2019 Linley Fall Processor Conference
November 1st, 2019 by Roberto Frazzoli

Innovative computing concepts challenging traditional architectures, new papers being published at a rate of 16 per day, startups attracting investors’ money, new chips hitting the market: the energies unleashed by neural network-based AI (artificial intelligence) spell exciting times for the IT and semiconductor industries. A good example of this climate was offered by the 2019 Linley Fall Processor Conference (Santa Clara, CA, October 23rd and 24th), organized by the technology analysis firm Linley Group: the event attracted hundreds of attendees and required two parallel tracks – on day one – to accommodate all sponsors. Most speakers addressed AI-related themes, particularly the quest for new processing architectures boosting energy efficiency and speed as required by upcoming AI applications. Some speakers, however, touched on other topics such as 5G and traditional architectures. Here is a quick overview of some of the presentations.

Systolic arrays to cope with future 32K automotive cameras

Paul Master from Cornami (Campbell, CA) introduced the concept of a dynamically reconfigurable systolic array. To explain the need for a dramatic performance improvement, Master cited the ever-increasing resolution of image sensors, quoting Michael Cioni from Light Iron: resolution quadruples every eight years, and new image sensors take seven years to become low-price commodities. So, for example, in 2031 cars could be equipped with 32K sensors (531 megapixels) to improve obstacle detection. The bandwidth required by these future video cameras will be in the range of 191 GB per second. Moving such an incredible amount of data will be practically impossible, so processing will have to be performed next to the sensor – where 2.5D interposer packaging technology already allows this kind of bandwidth between the dies and the substrate. Considering that more processing will be needed in the central unit of the vehicle, Cornami proposes a single scalable compute fabric from the edges to the core, allowing the “right” number of systolic array processing units to be placed where they are needed.
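As a quick sanity check of the figures Master quoted, the snippet below recomputes the pixel count and raw bandwidth of a hypothetical 32K sensor. The aspect ratio, pixel depth and frame rate are not from the talk; they are assumptions chosen as one plausible combination that reproduces the quoted 531 megapixels and roughly 191 GB/s.

```python
# Back-of-the-envelope check of the Cornami figures quoted above.
# Aspect ratio, pixel depth and frame rate are assumptions.

width, height = 30_720, 17_280       # a "32K" frame in 16:9 (assumed)
bytes_per_pixel = 3                  # 24-bit color (assumed)
frames_per_second = 120              # high-speed automotive stream (assumed)

pixels = width * height
print(f"{pixels / 1e6:.0f} megapixels per frame")            # ~531 MP
bandwidth = pixels * bytes_per_pixel * frames_per_second
print(f"{bandwidth / 1e9:.0f} GB per second, uncompressed")  # ~191 GB/s
```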
Optimizing AI architectures for inference/W and inference/$

Cheng C. Wang from Flex Logix (Mountain View, CA) announced the company’s InferX X1 chip, which will reach mass production in the second half of 2020. According to Wang, the key metrics for edge AI processors are the number of inferences per dollar and per watt, parameters only loosely related to TOPS (tera operations per second). InferX X1 reaches a high efficiency level through various optimizations, such as high MAC utilization, a programmable interconnect, reduced memory accesses via deep layer fusion, and DRAM access time ‘hidden’ in the background. Also, the eFPGA logic included in the chip natively handles custom operators such as sigmoid and tanh. According to Flex Logix, InferX X1 has only 7% of the TOPS and 5% of the DRAM bandwidth of Nvidia’s Tesla T4, yet it delivers 75% of the T4’s inference performance running YOLOv3 (an object detection network) at two megapixels.

Combining a graph-based dataflow with an analog MAC

The approach chosen by Mythic (Redwood City, CA) for edge AI acceleration in constrained environments (e.g. surveillance cameras) leverages analog compute and a graph-based dataflow architecture. Analog compute performs matrix multiplication (multiply-accumulate) using Ohm’s law for multiplication (I = GV, where G is the variable conductance of a flash transistor representing the synaptic weight) and Kirchhoff’s current law for addition (multiple currents adding up in a single node). According to Dave Fick from Mythic, eliminating weight movement provides more than a 10x efficiency improvement over digital systems. The graph-based dataflow architecture, for its part, eliminates software overhead by automatically starting operations when their prerequisites are met.
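For readers unfamiliar with analog in-memory computing, here is a minimal numerical sketch of the multiply-accumulate principle described above: weights stored as conductances, activations applied as voltages, Ohm’s law for the products and Kirchhoff’s current law for the sum. It illustrates the general idea only, not Mythic’s actual circuits; the value ranges are arbitrary.

```python
import numpy as np

# Sketch of an analog MAC: weights as conductances G, activations as
# voltages V.  Ohm's law gives each cell current I = G * V; Kirchhoff's
# current law sums the currents flowing into the shared output wire.

rng = np.random.default_rng(0)
voltages = rng.uniform(0.0, 1.0, size=64)       # activations, in volts (arbitrary range)
conductances = rng.uniform(0.0, 1e-6, size=64)  # weights, in siemens (arbitrary range)

cell_currents = conductances * voltages         # Ohm's law: I = G * V
output_current = cell_currents.sum()            # Kirchhoff: currents add on the wire

# The analog sum equals the digital dot product of weights and activations.
print(np.isclose(output_current, conductances @ voltages))  # True
```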
Spiking neural networks from research to market

Spiking neural networks (SNNs) are the ‘trademark’ of the neuromorphic approach, which at the Linley Conference was represented by Intel with its Loihi chip and by Brainchip (Aliso Viejo, CA) with the Akida processor. As explained by Mike Davies, director of Intel’s Neuromorphic Computing Lab, Loihi is currently a research effort. Akida, instead, is now a real product, available as an SNN processing chip or as licensable technology that can be integrated with other hardware. Anil Mankar from Brainchip explained that the SNN approach enables Akida to take advantage of the inherent sparsity of events, leading to a reduced number of operations. Furthermore, Akida’s weights and activations are quantized to 1, 2 or 4 bits to reduce memory requirements. All intermediate results are stored in on-chip memory, eliminating off-chip memory accesses; the NPUs communicate over a mesh network, so there is no need for an external host CPU. Akida runs the entire neural network, with all layers executed in parallel. These and other features allow inference and incremental learning on edge devices within a power, size and computation budget.

Event-based processing without spikes

Event-based processing can also be performed without using spiking neural networks, thus avoiding the spike-to-data coding problem. This approach is used by GrAI Matter Labs (Paris/Eindhoven/San Jose) in its Neuronflow architecture. The company’s target application is “live AI at the edge”: extracting information from live streams, generating instant feedback with minimal latency, and acting autonomously. Jonathan Tapson from GrAI Matter Labs pointed out that the “GrAI One” chip exploits sparsity in space, time and connectivity to reduce power consumption. The architecture consists of a network of neuron cores connected by a proprietary packet-switched network-on-chip.

Native tensor processing

Probably the most ‘radical’ approach to AI acceleration at the edge is the one adopted by Novumind (Santa Clara, CA, and Beijing, China), a company that claims to have created “the industry’s only AI architecture for native tensor processing”. According to Chien-Ping Lu, non-native AI processing (the approach used by most accelerators) has various disadvantages, including the need to reformat tensor data; native AI processing, on the contrary, handles tensor data natively – including input and output – and offers other benefits. Chien-Ping Lu also said that other approaches to AI acceleration are based on two “myths”: that GEMM (general matrix multiplication) is at the heart of AI, and that mesh-like interconnects are essential to scalability. In his view, it is unnecessary to flatten tensor data into matrices to leverage the wide broadcasts and deep reductions that are already present in CNNs; coupling compute with a flat communication topology reduces MAC density and makes AI as much a communication problem as a compute one. Furthermore, flattening tensors into matrices to fit a flat topology loses image-space parallelism, which is critical for processing high-resolution data. In the Novumind architecture, tensor data become ‘block tensors’ whose entries are themselves tensors (tensor packets). The company has already built a test chip for its customers; production silicon is underway.
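To make the “flattening” Lu refers to concrete, the sketch below computes the same 2-D convolution twice: once directly on the image tensor, and once by unrolling patches into a matrix so the multiplication becomes a GEMM (the classic im2col formulation). This is a generic textbook illustration of the two formulations, not a description of Novumind’s architecture.

```python
import numpy as np

# A convolution computed two ways: directly on the tensor, and via the
# im2col "flattening" into a matrix product that many accelerators use.

def conv2d_direct(image, kernel):
    """Direct 2-D convolution (valid padding, stride 1) on the tensor."""
    H, W = image.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + K, j:j + K] * kernel)
    return out

def conv2d_im2col(image, kernel):
    """Same convolution, reformatted into a matrix product (im2col + GEMM)."""
    H, W = image.shape
    K = kernel.shape[0]
    patches = [image[i:i + K, j:j + K].ravel()   # each patch becomes a matrix row
               for i in range(H - K + 1)
               for j in range(W - K + 1)]
    matrix = np.stack(patches)                   # (num_patches, K*K) matrix
    flat_out = matrix @ kernel.ravel()           # the flattened multiplication
    return flat_out.reshape(H - K + 1, W - K + 1)

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0
print(np.allclose(conv2d_direct(image, kernel), conv2d_im2col(image, kernel)))  # True
```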
Ethernet networking in AI training

Challengers started emerging in inference applications, but the competition is quickly extending to AI training in the datacenter. Eitan Medina from Habana (San Jose, CA) described the Gaudi AI training processor, stressing that it is the only AI processor integrating RoCE (RDMA over Converged Ethernet, where RDMA stands for Remote Direct Memory Access). Using Ethernet avoids proprietary interfaces such as Nvidia’s NVLink. Also, Gaudi is available as PCIe cards that fit into existing servers. According to Medina, Gaudi overcomes the scalability limitations of Nvidia solutions.

AI at the datacenter: the Facebook perspective

Challenges posed by AI at the datacenter were the subject of a keynote speech by Misha Smelyanskiy, Director of AI Systems Co-Design at Facebook. Machine learning data at the social media giant has tripled since 2018, and daily inference volume has reached 200 trillion predictions and 6.5 billion translations. Smelyanskiy said that coping with this workload requires hardware/software co-design, tailoring resources to the specific types of computation needed: in Facebook’s case, most of them are ranking and recommendation, while GEMMs represent only 40% of the workload. He also stressed the importance of programmability – measured as the amount of programmer time required to reach peak performance. According to Smelyanskiy, deep learning programmability is still an unsolved problem.

More new product announcements: CPUs, GPUs, DSPs

Besides AI chips, the 2019 Linley Fall Processor Conference offered other new product announcements in the CPU, GPU and DSP areas; here we will briefly mention some of them. Synopsys introduced the new DesignWare ARC VPX DSP Processor IP. Intel announced Tremont, its new low-power x86 microarchitecture, offering advancements in ISA, security and power management. SiFive unveiled its new U8-Series, an out-of-order application processor core IP claiming improvements over previous series in terms of performance per watt, area efficiency and scalability. Arm introduced its Helium technology, an M-Profile Vector Extension (MVE) for future Cortex-M processors, aimed at enhancing DSP and ML performance for microcontrollers used in small devices. Imagination presented the new PowerVR GPU-A, addressing the future needs of mobile gaming on 5G networks.