AI acceleration trends and updates from the 2022 Linley Fall Processor Conference

November 26th, 2022 by Roberto Frazzoli

This year’s fall edition of the Linley Processor Conference – held on November 1 and 2 in Santa Clara, California – was, as usual, a good observation point to keep abreast of trends and products in neural network acceleration. In this article we will provide a very quick overview of part of the conference, focusing on the keynote given by Linley Gwennap and on the presentations from the companies that addressed AI acceleration topics. The event, of course, offered many more presentations concerning ‘conventional’ (non-AI) processors and other processing-related themes, which we will not cover here.

Linley Gwennap’s keynote: trends in AI acceleration

In his keynote, Linley Gwennap – principal analyst at TechInsights – noted that the growth of AI model size has slowed, as training has become increasingly resource-intensive: training the GPT-3 language model, for example, takes 1,024 Nvidia A100 GPUs over one month. The rapid growth of AI model size has been enabled by moving training to large processing clusters, but cluster size is topping out for cost reasons: 1,024 GPUs cost approximately $25 million. As a result, there has been essentially no growth in the largest trained models over the past year, and recent progress has focused on models with less compute per parameter. Future growth of AI model size will be paced by hardware progress, e.g. the availability of new Nvidia H100 clusters.
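
As a rough illustration of the figures quoted in the keynote (our back-of-the-envelope arithmetic, not TechInsights'), those numbers can be turned into a quick calculation; the per-GPU price below is simply an assumption implied by the $25 million / 1,024-GPU figure:

# Back-of-the-envelope check of the cluster figures quoted in the keynote.
# The per-GPU price is an assumption implied by the ~$25M / 1,024-GPU figure.
num_gpus = 1024
price_per_gpu_usd = 25_000        # assumed average A100 price
training_days = 30                # "over one month"

cluster_cost = num_gpus * price_per_gpu_usd
gpu_hours = num_gpus * training_days * 24

print(f"Cluster cost: ${cluster_cost / 1e6:.1f}M")          # ~$25.6M
print(f"GPU-hours for one training run: {gpu_hours:,}")     # 737,280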

Gwennap then observed that AI chip performance has grown incredibly rapidly, more than doubling every year: Nvidia's Hopper H100, for example, offers 100x the Tflop/s of the Pascal P100 introduced in 2016. According to the analyst, this growth rate is not sustainable, as it has been driven by factors that cannot continue scaling at this pace: the addition of matrix units (aka tensor cores), the move from FP32 to FP8/INT8 data formats (leaving little room for further progress), and the growth in power (TDP) from 300W to 700W (again, little room to grow). Future performance gains of AI chips will therefore be more like 30–40% per year, and power efficiency could improve even more slowly. Bigger gains will require a new approach, such as a move to analog or optical technologies; however, the applicability of these approaches to mass production has yet to be demonstrated in practice.
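
To put those growth rates in perspective, here is a small illustrative calculation (ours, not Gwennap's) contrasting the historical pace implied by the 100x gain from P100 (2016) to H100 (2022) with the projected 30–40% yearly improvement:

# Compound annual growth implied by "100x from P100 (2016) to H100 (2022)",
# versus the ~30-40% per year Gwennap projects going forward. Illustrative only.
years = 2022 - 2016
historical_cagr = 100 ** (1 / years) - 1
print(f"Historical pace: ~{historical_cagr:.0%} per year")   # ~115%/year, i.e. more than doubling

for annual_gain in (0.30, 0.40):
    six_year_factor = (1 + annual_gain) ** years
    print(f"At {annual_gain:.0%}/year, six more years give only ~{six_year_factor:.1f}x")
    # ~4.8x at 30%/year, ~7.5x at 40%/year -- nowhere near another 100x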

In terms of AI application requirements, Gwennap observed that large datacenter customers need flexible and easy-to-use solutions, as they train and deploy a broad range of AI models in rapidly changing markets. AI vendors must therefore deliver a flexible architecture and a broad software stack, avoiding the addition of hardware support for specific algorithms: platforms must work well for different AI models, even ones yet to be invented. According to Gwennap, large chip vendors with extensive software expertise are best suited to serve datacenter needs – for example Nvidia, Intel, Qualcomm, or AMD ("if it gets serious about AI", he added). In the datacenter market, AI startups are currently serving niche markets or enterprises, so it is difficult for them to justify multi-billion-dollar valuations without a breakthrough in market share, Gwennap maintained. The largest cloud-service providers can build their own AI chips, but they still struggle to build a broad software stack and a flexible architecture; among them, only Google has deployed in-house AI chips in high volume.

As for AI applications at the edge – that is, all use cases outside the datacenter – Gwennap maintained that all personal computers will have AI acceleration, as this feature extends battery life in laptop PCs with frequent AI usage for workloads such as voice assistants, speech-to-text software, AAA games, etc. Apple was the first to add a deep learning accelerator to a PC processor with its M1 chip in the fourth quarter of 2020; AMD's next laptop processor (Phoenix Point, to be introduced in the second quarter of 2023) will also have a DLA, as will Intel's Meteor Lake processor for desktop/laptop, scheduled for the fourth quarter of 2023.

Gwennap then reviewed the different segments of the edge AI market: high-end edge devices, such as autonomous vehicles and multicamera surveillance systems, targeted by Nvidia and other big players like Intel and Qualcomm; single-camera chips, a segment addressed by Hailo, Ambarella, Quadric, Realtek and many others; ultralow-power AI chips (down to sub-milliwatt) for IoT sensors used for presence detection, wake-word detection, gunshot detection and the like, applications targeted by Syntiant, Innatera, Ambient, BrainChip and others; scalable IP for edge SoCs, a segment addressed by startups such as EdgeCortix, Edged.AI, Expedera and Vsora; acceleration IP offered by established IP vendors such as Cadence, Arm, Ceva, Imagination, Synopsys and Think Silicon; and microcontrollers equipped with on-chip AI acceleration, with examples from Arm, GreenWaves and NXP.

According to Gwennap, automotive will be a huge market for AI chips, as almost all vehicles will be equipped with Level 2 autonomous driving capabilities by 2025, and Level 4 will be commonplace by 2030. TechInsights forecasts approximately $10 billion in automotive AI chip revenue in 2030, 60% of it from Level 4 applications, and this will be a difficult market for new vendors. Most other edge applications don't need an AI accelerator, Gwennap maintained: Internet-connected devices (speakers, cameras) can offload most AI processing to the cloud; line-powered devices can run AI workloads on their CPUs with vector (SIMD) extensions; and even microcontrollers can handle basic AI tasks. AI hardware is therefore most valuable in battery-powered devices such as laptops, smartphones, and smartwatches. These products, however, use processors equipped with an integrated (typically in-house) AI engine. As a result, there will be little volume left for licensed AI accelerators.

Licensable accelerator IP for SoCs

BrainChip described the recent advancements it has achieved with its neuromorphic (spike-based) Akida processor IP and its framework for edge applications. According to the company, a paradigm shift is now required, as AI models are becoming prohibitively compute- and power-intensive, so edge devices need to be redesigned from the ground up. The neuromorphic approach lends itself to such a paradigm shift, as it reduces data movement during inference and ensures scalability by leveraging 'multi-pass processing', i.e. splitting a model that may not fit on the hardware into multiple sequential passes. The company has also developed a replacement for MobileNet v1, called AkidaNet, which results in a 15%–30% decrease in power usage and a slight accuracy increase.
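
As a purely conceptual illustration of what 'multi-pass processing' means (a generic sketch, not BrainChip's Akida toolchain or API), a model whose layers do not all fit on the accelerator can be split into chunks that are mapped and executed in sequence:

# Schematic 'multi-pass processing': layers are grouped into chunks that fit the
# hardware, and the chunks run sequentially, handing activations from pass to pass.
# Generic sketch only -- not BrainChip's implementation.
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) for _ in range(12)]   # 12 toy dense layers
HW_CAPACITY = 4                                               # layers that fit in one pass

def run_pass(chunk, x):
    # pretend this chunk has been mapped onto the accelerator, then execute it
    for w in chunk:
        x = np.maximum(x @ w, 0.0)    # dense layer + ReLU
    return x

x = rng.standard_normal(64)
for i in range(0, len(layers), HW_CAPACITY):
    x = run_pass(layers[i:i + HW_CAPACITY], x)    # one sequential pass per chunk
print("output shape:", x.shape)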

Expedera addressed the 'in-house design versus IP licensing' decision that companies need to make when adding a neural processing unit to an SoC. According to the company, the difficulties and risks involved in self-designing an NPU are often overlooked, as real PPA is difficult to evaluate and the set of requirements includes scalability for future uses, ease of integration within the SoC, and so on. Expedera claims that its Origin NPU IP is a better option in terms of PPA, software stack, and track record in real applications. The Origin architecture is 'packet-based', where a packet is a contiguous fragment of a neural network layer together with its entire execution context.

Flex-Logix focused on the challenges posed by transformer models to vision inference applications at the edge. Transformer models are needed to enlarge the vision receptive field and provide results in context. DETR, Meta’s object detection model, combines a CNN backbone with a transformer-based detector, but this solution involves transformer encoder/decoder computation that is very different from CNNs – a heavy workload not suitable for traditional accelerators. According to Flex-Logix, the company’s X1 processor IP lends itself to vision transformers because – among other things – it has a dedicated path to load activation data into the weight memory and supports mixed-precision mode.
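
The point about mixed workloads becomes clearer when the structure of a DETR-style model is sketched out. The PyTorch snippet below is only an illustrative skeleton (not Meta's DETR code, and unrelated to Flex-Logix's X1 software): a convolutional backbone produces a feature map, which is flattened into a token sequence and fed to a transformer encoder/decoder driven by learned object queries.

# Minimal DETR-style skeleton, just to show the two very different workloads:
# CNN backbone -> flattened tokens -> transformer encoder/decoder with object queries.
# Illustrative sketch, not Meta's DETR implementation.
import torch
import torch.nn as nn

class TinyDetr(nn.Module):
    def __init__(self, d_model=128, num_queries=20, num_classes=91):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for a ResNet backbone
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=4, padding=1), nn.ReLU(),
        )
        self.transformer = nn.Transformer(d_model, nhead=8, num_encoder_layers=2,
                                          num_decoder_layers=2, batch_first=True)
        self.queries = nn.Embedding(num_queries, d_model)    # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, img):                            # img: (B, 3, H, W)
        feat = self.backbone(img)                      # (B, C, h, w)  -- CNN workload
        tokens = feat.flatten(2).transpose(1, 2)       # (B, h*w, C)   -- token sequence
        q = self.queries.weight.unsqueeze(0).expand(img.size(0), -1, -1)
        hs = self.transformer(tokens, q)               # transformer workload
        return self.class_head(hs), self.box_head(hs).sigmoid()

logits, boxes = TinyDetr()(torch.randn(1, 3, 256, 256))
print(logits.shape, boxes.shape)                       # (1, 20, 92) (1, 20, 4)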

Quadric described its recently introduced Chimera GPNPU (General Purpose Neural Processing Unit), a single IP core unifying a neural network accelerator and a conventional CPU/DSP. The assumption here is that edge vision applications still require a combination of all these hardware resources, so unification results in a simpler SoC and simpler C++ programming. Chimera is based on a proprietary ISA and employs a hybrid Von Neumann + dataflow 2D-matrix architecture. According to Quadric, Chimera excels at convolution layers, ensuring high MAC utilization.

Novel approaches

Ceremorphic focused on its recently announced QS1 chip aimed at improving the reliability of AI applications, based on a fine-grained multi-threaded architecture called ThreadArch, already used in Wi-Fi SoCs and smartwatch designs. In the QS1, multiple threads on a single core or on dual cores execute the same program; if a fault is detected, it is corrected by voting and the program is restarted where the fault was detected. According to the company, this approach enables low area overhead and fast recovery after a fault. Product release of the QS1 Hierarchical Learning Processor, built in a TSMC 5-nanometer process, is scheduled for the fourth quarter of 2024.
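
The underlying idea – redundant execution with majority voting – can be illustrated with a generic sketch (conceptual only, not Ceremorphic's ThreadArch microarchitecture):

# Redundant execution with voting: the same step runs on several threads, the voter
# spots a divergent (faulty) result and corrects it with the majority value; a real
# system would then restart the program from that point. Conceptual sketch only.
from collections import Counter
import random

def compute(x, faulty=False):
    result = x * x + 1
    return result + random.randint(1, 9) if faulty else result   # inject a bit-flip-like error

def voted_step(x, n_threads=3):
    results = [compute(x, faulty=(t == 0 and random.random() < 0.2)) for t in range(n_threads)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < n_threads:
        print(f"fault detected at x={x}: corrected by majority vote, step would be restarted")
    return value

print([voted_step(x) for x in range(5)])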

Tetramem introduced its approach to using software to overcome analog loss in in-memory computing. According to the company, analog in-memory computing is essential for camera-based edge applications, as it satisfies both cost and performance requirements. However, the analog multiply-add matrix (where multiplication is provided by Ohm's law and addition by Kirchhoff's law) is inherently plagued by low precision and variability due to parasitics (such as wire resistance), circuit nonlinearity, noise, defects, device variation, temperature, etc. Existing solutions to this problem have drawbacks: retraining the model to account for analog loss is too expensive and not always applicable, while overdesigning the circuit to compensate for the analog loss may worsen PPA. The Tetramem solution is based on a fast and accurate Spice-level simulator and on a chip calibration process: simple tests are run on the chip to detect the variation, and feedback is sent to the cloud to optimize the mapping.
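
A toy numpy model of the analog multiply-accumulate helps visualize the problem (an illustration of the general principle, not Tetramem's simulator or calibration flow): weights are stored as conductances, inputs are applied as voltages, and the bit-line currents implement the dot products – until variation and noise creep in.

# Idealized analog in-memory MAC: Ohm's law provides the multiplies and Kirchhoff's
# current law the sums (I = G @ V). The second computation adds toy device variation
# and read noise to show the 'analog loss' that calibration must compensate for.
# Illustrative sketch only -- not Tetramem's tools.
import numpy as np

rng = np.random.default_rng(42)
G = rng.uniform(1e-6, 1e-4, size=(64, 128))         # conductances (weights), in siemens
V = rng.uniform(0.0, 0.2, size=128)                 # input voltages

I_ideal = G @ V                                     # ideal bit-line currents

G_actual = G * rng.normal(1.0, 0.05, G.shape)       # ~5% device-to-device variation
read_noise = rng.normal(0.0, 1e-7, I_ideal.shape)   # additive current noise
I_real = G_actual @ V + read_noise

rel_err = np.abs(I_real - I_ideal) / np.abs(I_ideal)
print(f"mean relative error: {rel_err.mean():.2%}")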

Untether AI provided updates on its device roadmap, describing the features of its speedAI240 chip, which will be sampling in the first half of 2023. According to the company, "at-memory compute" is the sweet spot for AI acceleration – in terms of energy efficiency and bandwidth – as opposed to both near-memory/Von Neumann architectures and in-memory computation. The new speedAI240 will reach 2 PetaFLOPs at 30 TFLOPs/W. The device employs 729 dual-RISC-V memory banks and a multi-threaded custom RISC-V processor based on the standard RV32EMC instruction set with the addition of some twenty custom instructions. Another key point of the Untether AI approach is the use of the FP8 datatype, which – according to the company – provides the best balance of precision, throughput, and energy efficiency.
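
For readers unfamiliar with the format, the snippet below emulates FP8 E4M3 rounding (4 exponent bits, 3 mantissa bits) to show the trade-off involved: wide dynamic range, but coarse precision. It is a simplified illustration of the datatype itself – no subnormal or NaN handling – and is unrelated to Untether AI's hardware implementation.

# Simplified round-to-nearest emulation of FP8 E4M3: values keep only 3 mantissa
# bits and saturate at 448, the format's largest normal value. Illustrative only.
import numpy as np

def fp8_e4m3(x):
    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 2.0**-6, 448.0)     # normal range of E4M3
    exp = np.floor(np.log2(mag))
    mant = np.round(mag / 2.0**exp * 8) / 8      # keep 3 mantissa bits
    return np.where(x == 0, 0.0, sign * mant * 2.0**exp)

vals = np.array([0.017, 0.13, 1.7, 25.0, 300.0, 1000.0])
print(fp8_e4m3(vals))                            # coarse steps; 1000 saturates to 448
print(np.abs(fp8_e4m3(vals) - vals) / vals)      # relative error within a few percent,
                                                 # except where the value is clipped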

Chips for AI acceleration

Kinara – formerly known as Deep Vision – delved into its approach to edge AI acceleration, based on its Ara-1 AI processor. The chip employs a ‘Polymorphic Dataflow Architecture’. Kinara targets smart cities, smart retail, automotive, and Industry 4.0 applications.

Innatera shared additional details about its Spiking Neural Processor, unveiled in July 2021. Spun out of the Delft University of Technology (Netherlands), the company focuses on always-on edge sensing applications requiring millisecond-scale processing latencies and a milliwatt or sub-milliwatt power envelope. At the conference, Innatera also introduced its Talamo software development kit, which requires no knowledge of spiking neural networks. The company then described an application where its Spiking Neural Processor performs always-on radar classification: compared to optimized CNN inference on a GAP-8 computing engine with an ultra-low-power convolutional accelerator, Innatera claims 42x lower inference power, 166x shorter inference latency, and a 30x smaller model size.
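
The event-driven nature of spiking computation can be conveyed with a toy leaky integrate-and-fire neuron (a textbook sketch, unrelated to Innatera's silicon or the Talamo SDK): the membrane potential integrates weighted input spikes, leaks over time, and emits an output spike only when a threshold is crossed, so compute – and power – scales with activity rather than with frame rate.

# Toy leaky integrate-and-fire (LIF) neuron: sparse input spikes are integrated,
# the potential leaks each timestep, and an output spike fires on threshold crossing.
# Generic textbook sketch, not Innatera's implementation.
import numpy as np

rng = np.random.default_rng(1)
T, leak, weight, threshold = 200, 0.95, 0.5, 1.0
input_spikes = (rng.random(T) < 0.15).astype(float)   # sparse input spike train

v, out_spikes = 0.0, []
for t in range(T):
    v = leak * v + weight * input_spikes[t]   # leaky integration of weighted input
    if v >= threshold:
        out_spikes.append(t)                  # fire...
        v = 0.0                               # ...and reset
print("output spike times:", out_spikes)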

Habana, an Intel company, offered details and benchmark results about its Gaudi2 deep learning accelerator chip, launched in May 2022 and targeting datacenters. The company underlined that its software suite makes it easy to get started with TensorFlow and PyTorch models. Speaking about trends in AI datacenter workloads, Habana also pointed out that large-scale models are no longer limited to language; they now include multiple input modalities, such as vision, language, and audio.

The presentation from Qualcomm focused on ways to partition large AI models across multiple acceleration devices (the company's Cloud AI 100 boards) for inference in datacenters. Model partitioning schemes include pipeline partitioning, tensor slicing, and a hybrid blend of the two.
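
The difference between the two basic schemes can be shown with toy matrices (a generic sketch, not Qualcomm's Cloud AI 100 toolchain): pipeline partitioning assigns whole layers to different devices, while tensor slicing splits a single layer's weight matrix across devices and gathers the partial results.

# Pipeline partitioning vs. tensor slicing, on toy dense layers. The on_device()
# helper is a hypothetical stand-in for dispatching work to an accelerator card.
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal((8, 512))                   # a batch of activations
W1, W2 = rng.standard_normal((512, 512)), rng.standard_normal((512, 512))

def on_device(dev_id, fn, *args):                   # stand-in for running fn on card dev_id
    return fn(*args)

# Pipeline: device 0 runs layer 1, device 1 runs layer 2 (activations move between cards)
h = on_device(0, lambda a: np.maximum(a @ W1, 0), x)
y_pipeline = on_device(1, lambda a: a @ W2, h)

# Tensor slicing: layer 1's weights are split column-wise across the two devices
W1_a, W1_b = np.split(W1, 2, axis=1)
h_a = on_device(0, lambda a: np.maximum(a @ W1_a, 0), x)
h_b = on_device(1, lambda a: np.maximum(a @ W1_b, 0), x)
h_sliced = np.concatenate([h_a, h_b], axis=1)       # gather the partial outputs

print(np.allclose(h, h_sliced))                     # True: same math, different partitioning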

Other conference topics

The conference also offered presentations from NXP, Cadence, Ceva, Marvell, ProteanTecs, Siemens, GlobalFoundries, Arteris, Movellus, Achronix, Blue Cheetah, AMD, SiFive, Andes, Imagination, Synopsys, on a range of different topics. A second keynote was given by Cisco’s Sudhakar Ranganathan.
