Cerebras’ new 2.6-trillion-transistor wafer-scale chip is one of the announcements made during the 2021 edition of the Linley Spring Processor Conference, a virtual event organized by technology analysis firm The Linley Group from April 19 to 23. In our quick overview of the conference we will focus mainly on new product announcements, which include innovative AI intellectual property from startups Expedera and EdgeCortix, a new approach to clock distribution from Movellus, and more. But first, let’s briefly summarize the opening keynote given by Linley Gwennap – Principal Analyst of The Linley Group – who provided an updated overview of AI technology and market trends.
A variety of AI acceleration architectures
Gwennap described the different AI processing architectures that the industry has developed over the past few years. While many CPUs, GPUs, and DSPs include wide vector (SIMD) compute units, many AI accelerators use systolic arrays to break the register-file bottleneck. Convolution architectures optimized for CNNs have also been proposed: examples include processors developed by Alibaba and Kneron. Within AI-specialized architectures, many choices are possible: a processor can use many little cores or a few big cores. Extreme examples are Cerebras with its wafer-scale chip integrating over 400,000 cores (850,000 in the latest version), and Groq with a single mega-core. Little cores are easier to design, while big cores simplify compiler/software design and are better for real-time workloads. Another architectural choice is multicore versus dataflow: in a multicore design, each core executes the neural network from start to finish, while in a dataflow design the neural network is divided across many cores. An additional architectural style – one that goes ‘beyond cores’ – is the Coarse-Grain Reconfigurable Architecture (CGRA), which uses dataflow principles, but instead of cores, the connected blocks contain pipelined compute and memory units. This approach has been adopted by SambaNova, SimpleMachines, Tsing Micro and others. The industry now offers a wide range of AI-capable architectures, ranging from very generic to very specialized. In general terms, a higher degree of specialization translates into higher efficiency but lower flexibility.
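To make the systolic-array idea concrete, here is a minimal Python sketch of an output-stationary array computing a matrix product. It illustrates the principle only, and is not any vendor’s implementation: operands travel between neighboring processing elements (PEs), so each multiply-accumulate does not have to go back through a shared register file.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))     # one accumulator per PE in an n x m grid
    a_reg = np.zeros((n, m))   # A operand currently held by each PE
    b_reg = np.zeros((n, m))   # B operand currently held by each PE
    # Run until the last skewed operand has reached the bottom-right PE.
    for t in range(n + m + k - 2):
        # Operands move between neighbors: A values shift one PE to the right,
        # B values one PE down (np.roll's wrap-around is overwritten below).
        a_reg = np.roll(a_reg, 1, axis=1)
        b_reg = np.roll(b_reg, 1, axis=0)
        # Feed skewed input streams at the left and top edges of the array.
        for i in range(n):
            s = t - i
            a_reg[i, 0] = A[i, s] if 0 <= s < k else 0.0
        for j in range(m):
            s = t - j
            b_reg[0, j] = B[s, j] if 0 <= s < k else 0.0
        # Every PE performs one multiply-accumulate in place.
        acc += a_reg * b_reg
    return acc

A = np.random.rand(4, 6)
B = np.random.rand(6, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```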
Selecting the right construction for a PCB stack and meeting the tight loss budget of PCB transmission lines are major challenges for designers and manufacturers of high-frequency printed circuit boards. According to Avishtech – a young San Jose-based provider of innovative EDA solutions – traditional EDA tools fall short in these two areas, often leading to a trial-and-error development process that translates into long design cycles and increased costs.
Avishtech started addressing these problems in 2019. “That’s when me and my partners saw an opportunity to really make an impact and actually do things in a very different way,” said founder and CEO Keshav Amla in the video interview he recently gave to Sanjay Gangal from EDACafe. “We had the right backgrounds and we felt that we were the right people to do that.” After completing his master’s degree, Amla left his PhD program in 2019 to work on Avishtech full-time. One year later, in July 2020, the company launched its Gauss product line: Gauss Stack, a PCB stack-up design and simulation solution, and Gauss 2D, a field solver that improves transmission line loss modeling. Let’s now take a closer look at Avishtech and at the recently announced latest versions of its tools.
Nvidia entering the datacenter CPU market – and becoming a direct competitor of Intel in this area – is definitely this week’s top news. Unrelated to this announcement, an academic research effort adds to the debate on heterogeneous compute. More updates this week include an important EDA acquisition and EDA figures; but first, let’s meet Grace.
Grace, the new Arm-based Nvidia datacenter CPU
Intel’s recently appointed CEO Pat Gelsinger is facing an additional challenge: defending the company’s datacenter CPU market share against Grace, the new Nvidia CPU that promises 10x the performance of today’s fastest servers on the most complex AI and high-performance computing workloads. Announced at the current GTC event and expected to be available in early 2023, the new Arm-based processor is named for Grace Hopper, the U.S. computer-programming pioneer.
In his GTC keynote, Nvidia CEO Jensen Huang explained that Grace is meant to address the bottleneck that still makes it difficult to process large amounts of data, particularly for AI models. His example was based on half of a DGX system: “Each Ampere GPU is connected to 80GB of super-fast memory running at 2 TB/sec,” he said. “Together, the four Amperes process 320 GB at 8 Terabytes per second. Contrast that with CPU memory, which is 1TB large, but only 0.2 Terabytes per second. The CPU memory is three times larger but forty times slower than the GPU. We would love to utilize the full 1,320 GB of memory in this node to train AI models. So, why not something like this? Make faster CPU memories, connect four channels to the CPU, a dedicated channel to feed each GPU. Even if a package can be made, PCIe is now the bottleneck. We can surely use NVLink. NVLink is fast enough. But no x86 CPU has NVLink, not to mention four NVLinks.” Huang pointed out that Grace is Arm-based and purpose-built for accelerated computing applications that process large amounts of data, such as AI. “The Arm core in Grace is a next generation off-the-shelf IP for servers,” he said. “Each CPU will deliver over 300 SPECint with a total of over 2,400 SPECint_rate CPU performance for an 8-GPU DGX. For comparison, today’s DGX, the highest performance computer in the world, is 450 SPECint_rate.” He continued, “This powerful, Arm-based CPU gives us the third foundational technology for computing, and the ability to rearchitect every aspect of the data center for AI. (…) Our data center roadmap is now a rhythm consisting of three chips: CPU, GPU, and DPU. Each chip architecture has a two-year rhythm with likely a kicker in between. One year will focus on x86 platforms, one year will focus on Arm platforms. Every year will see new exciting products from us. The Nvidia architecture and platforms will support x86 and Arm – whatever customers and markets prefer,” Huang said.
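The arithmetic behind Huang’s example is easy to check; the short Python snippet below simply restates the figures quoted in the keynote (all numbers are Nvidia’s, not independent measurements).

```python
# Memory figures from Huang's half-DGX example (keynote numbers as quoted).
gpu_mem_gb, gpu_bw_tb_s, num_gpus = 80, 2.0, 4   # per Ampere GPU
cpu_mem_gb, cpu_bw_tb_s = 1000, 0.2              # system (CPU) memory

gpu_total_gb = gpu_mem_gb * num_gpus             # 320 GB of HBM across four GPUs
gpu_total_bw = gpu_bw_tb_s * num_gpus            # 8 TB/s aggregate GPU bandwidth

print(round(cpu_mem_gb / gpu_total_gb, 1))       # ~3.1 -> "three times larger"
print(round(gpu_total_bw / cpu_bw_tb_s))         # 40   -> "forty times slower"
print(gpu_total_gb + cpu_mem_gb)                 # 1320 -> the 1,320 GB Huang would like to use
```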
The NVLink interconnect technology provides a 900 GB/s connection between Grace and Nvidia GPUs. Grace will also utilize an LPDDR5x memory subsystem. The new architecture provides unified cache coherence with a single memory address space, combining system and HBM GPU memory.
The Swiss National Supercomputing Centre (CSCS) and the U.S. Department of Energy’s Los Alamos National Laboratory are the first to announce plans to build Grace-powered supercomputers. According to Huang, the CSCS supercomputer, called Alps, “will be 20 exaflops for AI, 10 times faster than the world’s fastest supercomputer today.” The system will be built by HPE and come online in 2023.
Google’s AI scientist Samy Bengio has reportedly resigned over a controversy with the company. Brother of Yoshua Bengio, another world-famous AI scientist, Samy joined Google in 2007 and was part of the TensorFlow team. Prior to that, Samy Bengio co-developed Torch, the ancestor of PyTorch. It will be interesting to see where he lands next. Let’s now move to some updates, catching up on some of the news from the last couple of weeks.
Cadence Palladium Z2 and Protium X2 systems
Cadence has introduced the Palladium Z2 Enterprise Emulation and Protium X2 Enterprise Prototyping systems, the next generation of the current Palladium Z1 and Protium X1. Based on new emulation processors and Xilinx UltraScale+ VU19P FPGAs, these systems provide – according to Cadence – 2X capacity and 1.5X performance improvements over their predecessors. Both platforms offer a modular compile technology capable of compiling 10 billion gates in under ten hours on the Palladium Z2 system and in under twenty-four hours on the Protium X2 system.
Cadence Palladium Z2 and Protium X2. Credit: Business Wire
Siemens’ new Veloce system
Siemens has unveiled its new Veloce hardware-assisted verification system, which combines virtual platform, hardware emulation, and FPGA prototyping technologies. The solution includes four new products: Veloce HYCON (HYbrid CONfigurable) for virtual platform/software-enabled verification; Veloce Strato+, a capacity upgrade to the Veloce Strato hardware emulator that scales up to 15 billion gates; Veloce Primo for enterprise-level FPGA prototyping; and Veloce proFPGA for desktop FPGA prototyping. Customer-built virtual SoC models can begin running real-world firmware and software on Veloce Strato+ with deep visibility down to the lowest level of hardware; the same design can then be moved to Veloce Primo to validate the software/hardware interfaces and execute application-level software while running closer to actual system speeds. Both Veloce Strato+ and Veloce Primo use the same RTL, the same virtual verification environment, and the same transactors and models. A key technology in the upgraded Veloce platform is a new, proprietary 2.5D chip which – according to Siemens – enables a 1.5x system capacity increase over the previous Strato system.
Innovative architectures, high performance targets, a competitive market: does this AI cocktail call for specially optimized EDA solutions? We asked Prith Banerjee (Ansys), Paul Cunningham (Cadence), Mike Demler (The Linley Group), Jitu Khare (SimpleMachines), Poly Palamuttam (SimpleMachines), and Anoop Saha (Siemens EDA).
Never before have silicon startups been as numerous as they are today, in this era of ‘silicon Renaissance’ driven by an insatiable hunger for neural network acceleration. Startups engaged in the development of AI accelerator chips are raising considerable venture capital funding – and attracting a lot of attention from the media, as technology champions at the forefront of innovation. Not surprisingly, most EDA vendors have updated their marketing messaging to emphasize product offerings specifically tailored to the design needs of these devices, and AI startups seem to enjoy a privileged status among EDA customers in terms of coverage from vendors’ blogs and press releases. It is therefore interesting to try to figure out whether AI accelerator chips really pose special design challenges that call for specially optimized EDA solutions.
AI chips: different or normal?
Apart from some notable exceptions – such as the devices based on analog processing, or the wafer-scale chip from Cerebras – it seems fair to assume that the vast majority of the AI accelerators being developed are digital and have a ‘normal’ die size. Is there anything special in these chips that makes them different from other complex processors from an EDA standpoint? “The short answer is no,” says Paul Cunningham, Corporate Vice President and General Manager at Cadence. “I don’t think there is anything really fundamental that makes an AI chip different from other kinds of chips. But an AI chip is usually a very big chip and it’s highly replicated. So you have a basic building block, some kind of floating point MAC, and it’s replicated thousands, tens of thousands, hundreds of thousands of times. The nature of the design will stress the scalability of EDA tools to handle high replication. So in this sense, yes, it is important to make sure that our EDA tools have good performance on this style of design, but if there was another type of design which was also highly replicated, it would stress the tools in the same way.”
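As a back-of-the-envelope illustration of Cunningham’s point (the numbers below are hypothetical, not taken from any real chip), even a modest MAC block turns into an enormous flat instance and gate count once it is tiled across a large array, and that flattened count is what place-and-route and signoff tools ultimately have to digest.

```python
# Toy replication math: hypothetical figures chosen only to illustrate scale.
gates_per_mac = 5_000        # assumed gate count of one floating-point MAC block
macs_per_tile = 256          # MACs in one compute tile
tiles_per_chip = 1_024       # tiles replicated across the die

flat_macs = macs_per_tile * tiles_per_chip
flat_gates = flat_macs * gates_per_mac
print(f"{flat_macs:,} MAC instances, roughly {flat_gates / 1e9:.1f} billion gates")
# -> 262,144 MAC instances, roughly 1.3 billion gates
```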