
Linley Fall Processor Conference 2020 – Part Two

 
November 3rd, 2020 by Roberto Frazzoli

Last week, EDACafe provided a quick overview of some of the presentations given during the first part of this year's Linley Fall Processor Conference, held from October 20 to 22. This week we complete our coverage of the virtual event, organized by The Linley Group, by briefly summarizing some of the presentations from the second part of the conference, held from October 27 to 29. There was no shortage of innovations in this period, which was also marked by big deals in the processor industry (Nvidia-Arm and AMD-Xilinx).

TinyML chip requirements: the Google point of view

The second part of the conference was kicked off by a keynote from Pete Warden, technical lead of the TensorFlow Lite Micro open-source framework at Google. Warden summarized the requirements that chip vendors will need to satisfy to make the vision of TinyML come true. He foresees a future of hundreds of billions of “peel-and-stick” sensors placed on everyday objects, used for industrial monitoring, environmental monitoring, building automation, agricultural and wildlife use cases and more, all of them capable of full-vocabulary speech recognition and/or person and gesture recognition.

To get there, product teams need new hardware with more arithmetic computing power, a focus on inference (as opposed to training), the ability to work well at low bit precision, and compatibility with preexisting training environments. As for arithmetic, Warden asked chip designers to shift their focus from “memory-bound” problems to the “compute-bound” problems that are typical of ML workloads. As for low precision, he considers lower bit depths (4-bit, binary) very promising, but they lack good hardware solutions beyond FPGAs, and because of this software support is also lacking.
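
As an illustration of what low precision means in practice, here is a minimal, generic Python sketch of symmetric 4-bit weight quantization; this is not Google's implementation, and real frameworks add per-channel scales, calibration and careful rounding.

```python
import numpy as np

def quantize_symmetric(weights, bits=4):
    """Map float weights onto a symmetric signed integer grid."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit signed
    scale = np.abs(weights).max() / qmax    # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
qw, s = quantize_symmetric(w, bits=4)       # 16 levels instead of 2**32
print("max abs error:", np.abs(w - dequantize(qw, s)).max())
```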

The compatibility issue, Warden explained, comes from the fact that researchers keep inventing new ops (TensorFlow, for example, has more than 1,200 operations) but product teams often don't have the ability to retrain models. Custom accelerators must therefore be able to run general-purpose code without paying a massive performance penalty; the sketch below illustrates the fallback pattern this implies. He also called for TinyML-specific performance benchmarks, and warned that raw counts of arithmetic ops can be very misleading: “Specialized accelerators might be great at running large matrix multiplications or convolutions, but have slow or non-existent support for other layers. (…) The type of compute matters, not just the raw operation count.”
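
A hypothetical sketch of that fallback pattern: ops the accelerator supports natively are placed there, and everything else drops back to the CPU, so the model runs unmodified but pays a host round-trip on every fallback. The names are illustrative, not a real runtime API.

```python
from dataclasses import dataclass, field

ACCELERATED_OPS = {"conv2d", "matmul", "relu"}   # hypothetical native op set

@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)

def placement(graph):
    """Assign each node (in topological order) to the accelerator or the CPU."""
    return {n.name: ("accelerator" if n.op in ACCELERATED_OPS else "cpu")
            for n in graph}

graph = [Node("a", "conv2d"), Node("b", "gelu", ["a"]), Node("c", "matmul", ["b"])]
print(placement(graph))
# {'a': 'accelerator', 'b': 'cpu', 'c': 'accelerator'}
# every 'cpu' entry is a host round-trip: the penalty Warden warns about
```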

Low-power AI at the microedge: Ambient Scientific

A solution that could potentially satisfy the low-power requirements of TinyML applications was proposed by Ambient Scientific. The company introduced its GPX-10 AI processor targeted at “microedge” applications such as wearables, which have severe power and cost constraints. According to Ambient, architectures based solely on off-the-shelf components cannot meet these requirements; more innovation is required, all the way down to circuits and devices. As noted by analyst Linley Gwennap in this white paper, the Ambient solution is based on DigAn, a custom architecture optimized for low-power AI operations that uses standard digital CMOS to implement an analog math engine. Other optimizations include a custom 3D SRAM design, low-voltage signaling, custom ADCs, selective wake-up of circuit blocks, and more. The result: 512 billion operations per second at just 120 mW, a performance level that supports both inference and retraining for speech recognition.

Credit: Ambient Scientific
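
As a back-of-the-envelope check of those figures, and assuming “operations” means low-precision MAC-class ops (Ambient has not broken this down publicly), the claim works out to roughly 4.3 TOPS/W:

```python
ops_per_second = 512e9                  # 512 billion operations per second, as claimed
power_watts = 0.120                     # 120 mW, as claimed
tops_per_watt = ops_per_second / power_watts / 1e12
print(f"{tops_per_watt:.1f} TOPS/W")    # ~4.3 TOPS/W
```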

Deep-AI: integrated training and inference at the edge

Integrating training and inference at the edge is also the vision of Deep-AI, but in this case the target applications are those that can be served by a Xilinx Alveo card. Having emerged from stealth mode in early October, Israel-based Deep-AI has developed technologies that enable network training at 8-bit fixed point with high sparsity ratios, as opposed to the 32-bit floating point and zero sparsity that are the norm today with GPUs. One of the challenges was that back propagation, the most common training algorithm, requires a large dynamic range to avoid vanishing and exploding gradients; this is why 32-bit floating point is commonly used, whereas 8-bit fixed point offers only 256 representable levels. Getting rid of the cloud-centric model and of GPUs, Deep-AI claims a dramatic reduction in cost and power, with the additional benefit that its training output is inference-ready.

Credit: Deep-AI
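
To see why dynamic range is the hard part, here is a small generic Python experiment: typical small gradients fall below the smallest step of an 8-bit fixed-point grid and vanish, while scaling them up first preserves them. Loss scaling is a common mitigation in low-precision training generally; Deep-AI has not disclosed whether its scheme uses this particular trick.

```python
import numpy as np

def to_fixed8(x, step):
    """Quantize to signed 8-bit fixed point with the given step size."""
    return np.clip(np.round(x / step), -128, 127) * step

step = 1.0 / 128                        # 8-bit grid covering roughly [-1, 1)
grads = np.random.randn(1000) * 1e-4    # small gradients, common late in training

print(np.count_nonzero(to_fixed8(grads, step)))          # 0: all gradients vanish
print(np.count_nonzero(to_fixed8(grads * 1024, step)))   # ~1000: scaled, they survive
```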

In-memory computation: GSI Technology, Untether AI

A promising solution to improve energy efficiency for AI workloads is ‘in-memory computation’, an approach represented at the conference by two companies.

GSI Technology explained the principles of its Gemini “Associative Processing Unit”. Standard CPUs can perform complex calculations on small datasets very quickly, but they are less efficient on large datasets due to the von Neumann memory bottleneck. For tasks requiring parallel calculations across large datasets, higher performance and energy efficiency can be achieved by an Associative Processing Unit: an interconnected array of millions of simple “boolean bit processors” using in- and near-memory processing techniques. More details can be found in this white paper from Linley Gwennap.

Credit: GSI Technology
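
A toy illustration of the associative idea, with NumPy standing in for the millions of bit processors (this is not GSI's actual programming model): every record in the array is compared against the query in parallel with boolean operations, instead of being fetched one at a time through a CPU.

```python
import numpy as np

rng = np.random.default_rng(0)
memory = rng.integers(0, 2, size=(1_000_000, 64), dtype=np.uint8)  # 1M bit records
query = memory[123_456]                                            # a known record

# XOR compares the query against every record at once; the Hamming
# distance then identifies the best match without walking the array.
distances = np.count_nonzero(memory ^ query, axis=1)
print(int(distances.argmin()))                                     # 123456
```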

Untether AI unveiled its “tsunAImi” PCIe accelerator card, powered by four of its “runAI” devices, processors that use at-memory computation. According to the company, in current architectures 90 percent of the energy for AI workloads is consumed by data movement: transferring weights and activations between external memory, on-chip caches, and finally the computing elements themselves. The tsunAImi accelerator card claims 2 PetaOps of compute, which translates into over 80,000 frames per second of ResNet-50 v1.5 throughput at batch=1, or, for natural language processing, more than 12,000 queries per second of BERT-base.

Credit: Untether AI
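
A rough sanity check on the ResNet-50 number, assuming about 8 GOPs per image for ResNet-50 v1.5 (counting each multiply-accumulate as two operations; the exact figure depends on how ops are counted), suggests roughly 30 percent utilization of the 2-PetaOps peak, a plausible value at batch=1:

```python
peak_ops = 2e15            # 2 PetaOps claimed
fps = 80_000               # claimed ResNet-50 v1.5 throughput at batch=1
ops_per_image = 8e9        # assumption: ~8 GOPs per image (MAC = 2 ops)
print(f"{fps * ops_per_image / peak_ops:.0%} utilization")   # ~32%
```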

Other IP: Arteris, Cadence, Achronix

Besides AI-specific accelerators, some of the speeches covered other key IP blocks that are required, or can be useful, in an SoC.

The presentation from Arteris IP focused on the new requirements that large SoCs with AI accelerators place on the network-on-chip backbone. The main challenge here is ‘heterogeneous coherency’: processing elements with different characteristics, possibly using different transaction protocols, working together as equal peers in a cache-coherent system. In practice this means AMBA CHI, ACE and AXI interfaces must coexist. According to Arteris IP, the appropriate NoC solution enables better IP reuse and better PPA results. Arteris R&D is currently working on automated NoC topology synthesis, on the concept of “islands of cache coherency” within the SoC, and more.

Cadence summarized its Tensilica IP offering for edge AI applications, with processors combining DSP and AI acceleration. The assumption here is that voice and vision edge AI applications always involve preprocessing and postprocessing steps that are based not on neural networks but on “traditional” DSP algorithms, as the sketch below illustrates for a voice pipeline.
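
A minimal illustration of that division of labor (generic code, not Tensilica's libraries): classic DSP, i.e. framing, windowing, FFT and log power, produces the features, and only then does a neural network take over.

```python
import numpy as np

def log_power_features(audio, frame=400, hop=160):
    """Frame the signal, apply a Hann window, take log power spectra."""
    frames = [audio[i:i + frame] * np.hanning(frame)
              for i in range(0, len(audio) - frame, hop)]
    spectra = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return np.log(spectra + 1e-8)        # these features feed the NN

audio = np.random.randn(16_000)          # one second at 16 kHz (dummy signal)
print(log_power_features(audio).shape)   # (frames, bins): the NN's input
```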

Achronix showcased its FPGA and eFPGA offering in the context of what it described as “the fourth FPGA wave”: ubiquitous compute at the edge.

New developers’ tools from Centaur, SiMa.ai, SiFive

Some companies used the conference to provide updates on their respective developers’ tools.

Centaur, which offers a server processor combining eight x86 CPUs with a deep-learning accelerator, has adopted a new compiler stack based on MLIR (Multi-Level Intermediate Representation), a novel approach to building reusable and extensible compiler infrastructure that is part of the LLVM project. The MLIR-based compiler stack simplifies software development for heterogeneous systems (x86 plus AI accelerator, in this case) and speeds up support for new models and frameworks.
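
MLIR's central idea is progressive lowering: the same program is rewritten through successive levels of abstraction, each of which can target a different backend. The Python toy below only gestures at that idea; the dialect and op names are invented, and real MLIR is a C++ infrastructure with its own textual IR.

```python
def lower_to_loops(ops):
    """Mid-level: tensor ops become explicit loop-nest ops."""
    return ["loops." + op.split(".", 1)[1] for op in ops]

def lower_to_target(ops, target):
    """Low level: loop nests become target-specific instructions."""
    prefix = "npu." if target == "accelerator" else "x86."
    return [prefix + op.split(".", 1)[1] for op in ops]

program = ["nn.conv2d", "nn.softmax"]                    # high-level dialect
print(lower_to_target(lower_to_loops(program), "accelerator"))
# ['npu.conv2d', 'npu.softmax']: same program, progressively lowered
```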

SiMa.ai, which targets edge applications such as cameras with a SoC that includes an ML accelerator and an Arm CPU, has introduced its SDK, enabling developers to quickly port existing neural networks from popular frameworks.

SiFive introduced a new platform for professional RISC-V developers, called “HiFive Unmatched”. The board, based on the new SiFive FU740 SoC, comes in the mini-ITX form factor and is equipped with industry-standard connectors, making it easy to build a RISC-V PC.

HiFive Unmatched. Credit: SiFive

DPUs: Nvidia, Fungible, Marvell

Data Processing Units, a new category of processors, were also well represented at the conference.

Nvidia described its BlueField-2X, a PCIe board aimed at datacenter servers. Introduced in early October at the Nvidia GTC event, it combines the capabilities of the Mellanox ConnectX-6 Dx SmartNIC with Arm cores and the AI capabilities of an Nvidia Ampere GPU, which can be applied to data center security, networking and storage tasks.

BlueField-2X. Credit: Nvidia

Fungible unveiled what it claims to be “the world’s fastest NVMe disaggregated storage platform”, the company’s first line of data-centric platforms. Powered by the Fungible Data Processing Unit, the Fungible storage cluster delivers 300 million IOPS in a data center rack. It represents a significant milestone toward realizing the company’s vision of data centers where compute and storage resources are ‘hyperdisaggregated’ and then composed on demand to dynamically serve application requirements. Announced last August at Hot Chips 2020, the Fungible DPU chip addresses two of the biggest challenges in scale-out data centers: inefficient data interchange between nodes and inefficient execution of data-centric computations.

Credit: Fungible

Marvell described several architectural details of its Octeon CN98xx DPU.
