
Deep learning acceleration – trends and news from the Linley Spring Processor Conference

 
May 19th, 2022 by Roberto Frazzoli

The Linley Spring Processor Conference 2022 – which took place last April 20th and 21st – saw the participation of numerous sponsor companies, many of them offering deep learning acceleration solutions. This week EDACafe takes a quick look at the conference content, mostly focusing on some technology trends and some new announcements. Full proceedings of the event can be accessed from www.linleygroup.com, the website of the technology analysis firm now owned by Canadian reverse engineering company TechInsights.

Ever-growing NLP models

In his keynote speech, TechInsights’ principal analyst Linley Gwennap pointed out that language-processing models keep growing at an impressive pace: Alibaba’s M6 has 10 trillion parameters. Model size is limited by training time (compute cycles): for example, training the GPT-3 model on one thousand A100 GPUs takes more than one month. Rapid growth has been achieved by moving to large and very expensive clusters. Recent progress focuses on adding parameters while using fewer GPU cycles: for example, Alibaba reports that training M6 required only 15% of the time needed for the smaller GPT-3. Training can be accelerated through ‘model sharding’, which divides a model across many chips; this requires complex software, possibly with manual assistance. Sharding massive models across servers and racks also requires high-bandwidth connections between chips.
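To make the idea concrete, here is a minimal sketch of column-wise tensor sharding in plain NumPy – the sizes, device count, and variable names are invented for illustration and not taken from any of the systems discussed. Each “device” computes a partial matrix multiply on its shard of the weights, and the partial outputs are gathered at the end – the step that demands the high-bandwidth connections mentioned above.

    # Minimal sketch of model sharding: a weight matrix too large for one
    # chip is split column-wise across devices; each device computes a
    # partial matmul, and the outputs are concatenated (the gather step
    # that requires high-bandwidth links between chips).
    import numpy as np

    num_devices = 4                       # stand-ins for GPUs/accelerators
    x = np.random.randn(8, 1024)          # activations (batch, hidden)
    W = np.random.randn(1024, 4096)       # weight matrix to be sharded

    shards = np.split(W, num_devices, axis=1)    # one column block per device
    partials = [x @ s for s in shards]           # run in parallel in practice
    y = np.concatenate(partials, axis=1)         # gather over the interconnect

    assert np.allclose(y, x @ W)          # sharded result matches unsharded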

FP8 and ‘structured sparsity’ in Nvidia chips

Among other things, Gwennap provided an update on the increasing use of smaller data formats. FP8, in particular, seems to be a promising format, with Nvidia Hopper being the first commercial design implementing it. FP8 comes in two versions – E5M2 and E4M3 – as there is no IEEE standard at the moment. Updates also concern the techniques used to take advantage of “sparsity” – that is, disabling a MAC unit when an input is zero. This method saves power versus performing a useless operation, but achieves no speedup. A newer method implemented in Nvidia Ampere, called “structured sparsity”, rearranges the computations to skip zero values: the model is preprocessed to remove 50% of the weights (eliminating values close to zero), and mask bits then remove the corresponding activations, doubling MAC throughput. This method greatly improves model performance but may cause some reduction in output accuracy. In general terms, smaller data types and sparsity create opportunities for “free” performance gains, but they require changes to both software and hardware designs.
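A minimal NumPy sketch of the 2:4 pruning step may help visualize structured sparsity – the array sizes are invented, and this models only the preprocessing, not the hardware skipping itself: in every group of four weights, the two smallest magnitudes are zeroed, and sparsity-aware MAC units can then skip the masked positions.

    # Sketch of 2:4 structured sparsity preprocessing: in each group of
    # four weights, keep the two largest magnitudes and zero the rest,
    # so sparsity-aware MAC units can skip half of the operands.
    import numpy as np

    def prune_2_of_4(w):
        """Zero the two smallest-magnitude values in each group of four."""
        groups = w.reshape(-1, 4)
        drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # two smallest per group
        mask = np.ones_like(groups, dtype=bool)
        np.put_along_axis(mask, drop, False, axis=1)
        return (groups * mask).reshape(w.shape)

    w = np.random.randn(8, 16)
    w_sparse = prune_2_of_4(w)
    # exactly two of every four weights survive
    assert ((w_sparse.reshape(-1, 4) != 0).sum(axis=1) == 2).all()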

DLA competitive scenario

Reviewing the current competitive scenario, Gwennap noted that the recently announced Nvidia Hopper chip raises the bar: H100 triples A100’s TOPS and flop/s, with an approximately 2.5x gain on most AI models. Along with the gains obtained by using the FP8 data format and the new system-level NVLink, the total performance gain ranges from 6x to 9x for models with more than 100 billion parameters. Chip power rises to 700W TDP, but the efficiency gain on most models is approximately 40%. Gwennap also noted that Qualcomm Cloud AI 100 can outperform Nvidia A100 in power efficiency – but it handles inference, not training. In terms of advanced hardware solutions, Gwennap mentioned Graphcore Colossus, which uses TSMC’s wafer-on-wafer (WoW) stacking technology; the wafer-scale Cerebras second-generation design; and Tesla’s D1, cramming 25 dies into a single wafer-level package. Neither Cerebras nor Tesla has disclosed benchmark results, though. Gwennap then moved on to updates on edge AI solutions. In general terms, as more AI moves to the edge, this market is fragmenting into high-end chips for camera-based systems and low-power chips for simple sensors. Another edge trend is microcontrollers adopting AI, an approach that avoids the need for a separate AI-accelerator chip or a switch to new SoC software – for applications that need simple neural networks and don’t require extremely low power.

New fault mitigation techniques from Ceremorphic

Ceremorphic – a San Jose startup that emerged from stealth last January – explained its focus on “reliable computing”. Founder and CEO Venkat Mattela recalled that companies like Amazon, Facebook, and Twitter have recently experienced “surprising outages” linked to chip faults. He then pointed out that fault mitigation techniques based on duplicating logic or processor cores are prohibitively expensive in silicon area, and that their recovery costs are high in terms of both time and application integrity. Ceremorphic’s QS 1 chip, instead, employs a number of innovative fault mitigation techniques aimed at reducing PPA overhead – such as choosing the right combination of logic cells for resiliency against early-life failures, cross-layer techniques, and more. The QS 1 chip also uses fast recovery techniques based on its multi-thread processor. Two multi-thread processors act as duplicates (multiple threads run the same program); if a fault is detected, a vote is taken across the multiple results and a correction is made; program execution then resumes from the point where the fault was detected.
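The thread-level redundancy scheme can be illustrated with a deliberately simplified sketch – a generic majority-vote pattern in Python, not Ceremorphic’s actual implementation, with an invented workload function:

    # Generic sketch of fault detection via redundant execution: the same
    # computation runs on several threads, the results are compared, and
    # a majority vote corrects a transient fault.
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def compute(x):
        return x * x + 1                  # the workload being protected

    def voted_compute(x, replicas=3):
        with ThreadPoolExecutor(max_workers=replicas) as pool:
            results = list(pool.map(compute, [x] * replicas))
        value, votes = Counter(results).most_common(1)[0]
        if votes < replicas:
            # disagreement means a fault was detected; the majority value
            # wins, and in hardware execution would resume from this point
            print("fault detected and corrected by majority vote")
        return value

    print(voted_compute(7))               # -> 50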

Neuromorphic advancements from Intel

Mike Davies, Director of Intel’s Neuromorphic Computing Lab, provided updates on Intel’s Loihi 2 neuromorphic chip and on Lava, a new software framework for neuromorphic computing. Davies introduced the concept of energy-delay product, a combined latency and energy efficiency metric best suited to comparing Loihi 2 with other AI chips. For the right workloads, orders-of-magnitude gains in latency and energy efficiency are achievable with Loihi 2 – but standard feed-forward deep neural networks yield the least compelling gains, if any. The best gains can be achieved for recurrent networks with novel bio-inspired properties, especially on optimization problems. Davies then pointed out the improvements of Loihi 2 over the previous generation, resulting in better performance on metrics such as neuron update time, synaptic op time, minimum timestep, neuron update energy, and synaptic op energy. He then described the features of Lava, maintaining that it is better than most other neuromorphic software in that it supports a wider range of functions.
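As a reminder of how the metric works, energy-delay product is simply energy multiplied by latency, so a chip must be both fast and frugal to score well; the toy comparison below uses made-up numbers purely to illustrate the trade-off:

    # Energy-delay product (EDP): energy per task times latency; lower is
    # better. The figures below are invented for illustration only.
    def edp(energy_joules, latency_seconds):
        return energy_joules * latency_seconds

    chips = {
        "chip A": edp(energy_joules=0.010, latency_seconds=0.004),
        "chip B": edp(energy_joules=0.002, latency_seconds=0.003),
    }
    best = min(chips, key=chips.get)
    print(best, chips[best])              # chip B wins the combined metric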

Memristor-based analog in-memory computing from TetraMem

Glenn Ge, co-founder and CEO of startup TetraMem (Fremont, CA), described the company’s solution based on in-memory computing with multilevel resistive switching devices, targeting edge applications. What mainly differentiates TetraMem from other companies developing solutions based on an analog MAC matrix is the choice of memristors as the memory technology. According to the company, all other current memory devices have limitations for computing applications, while the memristor is ready for computing – though not yet for memory. The two areas have different requirements, and most constraints that apply to memory applications do not apply to computing memristors. Unlike memory applications, computing applications are intrinsically more defect-tolerant, as the computing result is determined by the combined effect of a group of cells rather than by the read result of a single cell. The latest TetraMem chip is an 8-bit design, with five 256×256 crossbar arrays and a RISC-V core. Built in a 65-nanometer technology node and running at 400MHz, the chip reaches a MAC engine energy efficiency (at INT8) of up to 25 TOPS/W; a competing IMC alternative based on 40nm Flash reaches 5.2 TOPS/W.
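The principle behind the crossbar MAC can be captured in a few lines: with weights stored as conductances and inputs applied as voltages, each column sums currents into a dot product (Ohm’s and Kirchhoff’s laws), so a 256×256 array performs a full matrix-vector multiply in one step. The idealized model below uses invented values and ignores device noise and ADC quantization:

    # Idealized model of an analog MAC on a memristor crossbar: output
    # current per column is I[j] = sum_i G[i, j] * V[i], i.e. a
    # matrix-vector multiply computed directly in the memory array.
    import numpy as np

    rows, cols = 256, 256
    G = np.random.uniform(1e-6, 1e-4, (rows, cols))  # conductances (siemens)
    V = np.random.uniform(0.0, 0.2, rows)            # input voltages (volts)

    I = G.T @ V      # column currents: the MAC result, in amperes
    print(I.shape)   # (256,) -- one output per column, in a single step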

EdgeCortix shifting from IP to chips

Having previously sold AI intellectual property, Japan-based EdgeCortix is now offering its own edge-AI inference chips for line-powered systems. The new die, called Sakura, implements the company’s dynamic neural accelerator (DNA) engine, on-chip SRAM, and other hardware resources – but no host CPU. Sakura has a maximum performance of 40 TOPS; on ResNet-50, it achieves 0.4ms latency at 4.7W, yielding 533 inferences per second per watt (IPS/W). The company plans to ship samples on a development board in July, with production anticipated in 1Q23. EdgeCortix may also license Sakura as hard IP for chiplets.
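The quoted efficiency figure follows directly from the latency and power numbers, assuming back-to-back single-stream inference:

    # Sanity check on the ResNet-50 figures: 0.4 ms per inference at 4.7 W.
    latency_s = 0.4e-3
    power_w = 4.7
    ips = 1.0 / latency_s        # 2500 inferences per second
    print(ips / power_w)         # ~532 IPS/W, matching the ~533 quoted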

Other processors

Let’s now briefly list the other presentations given at the Linley Spring Processor Conference 2022, mostly concerning products that had already been introduced. Synopsys presented its Neural Processing Units IP; Cadence described its NNE110 IP; Ceva focused on its NeuPro-M AI core; Flex Logix elaborated on applications of its InferX X1, including edge vision; Aspinity delved into the theme of inferencing in analog; Intel revealed the architecture of its Gaussian and Neural Accelerator (GNA); Western Digital offered details about its SweRV Core family of Linux-compatible 64-bit in-order RISC-V CPUs; Marvell’s presentation addressed the theme of accelerating virtualized workloads with DPUs; Arteris discussed AI traceability in automotive; Movellus analyzed the advantages of large-scale synchronous clocking domains in SoCs and chiplets; GlobalFoundries focused on IoT process requirements.
