Nvidia entering the datacenter CPU market – and becoming a direct competitor of Intel in this area – is definitely this week’s top news. Unrelated to this announcement, new academic research adds to the debate on heterogeneous compute. More updates this week include an important EDA acquisition and EDA figures; but first, let’s meet Grace.
Grace, the new Arm-based Nvidia datacenter CPU
Intel’s recently appointed CEO Pat Gelsinger is facing an additional challenge: defending the company’s datacenter CPU market share against Grace, the new Nvidia CPU – which promises 10x the performance of today’s fastest servers on the most complex AI and high-performance computing workloads. Announced at the current GTC event and expected to be available in early 2023, the new Arm-based processor is named after Grace Hopper, the U.S. computer-programming pioneer.
In his GTC keynote, Nvidia CEO Jensen Huang explained that Grace is meant to address the bottleneck that still makes it difficult to process large amounts of data, particularly for AI models. His example was based on half of a DGX system: “Each Ampere GPU is connected to 80GB of super-fast memory running at 2 TB/sec,” he said. “Together, the four Amperes process 320 GB at 8 Terabytes per second. Contrast that with CPU memory, which is 1TB large, but only 0.2 Terabytes per second. The CPU memory is three times larger but forty times slower than the GPU. We would love to utilize the full 1,320 GB of memory in this node to train AI models. So, why not something like this? Make faster CPU memories, connect four channels to the CPU, a dedicated channel to feed each GPU. Even if a package can be made, PCIe is now the bottleneck. We can surely use NVLink. NVLink is fast enough. But no x86 CPU has NVLink, not to mention four NVLinks.”

Huang pointed out that Grace is Arm-based and purpose-built for accelerated computing on large amounts of data – such as AI. “The Arm core in Grace is a next generation off-the-shelf IP for servers,” he said. “Each CPU will deliver over 300 SPECint with a total of over 2,400 SPECint_rate CPU performance for an 8-GPU DGX. For comparison, today’s DGX, the highest performance computer in the world, is 450 SPECint_rate.”

He continued, “This powerful, Arm-based CPU gives us the third foundational technology for computing, and the ability to rearchitect every aspect of the data center for AI. (…) Our data center roadmap is now a rhythm consisting of three chips: CPU, GPU, and DPU. Each chip architecture has a two-year rhythm with likely a kicker in between. One year will focus on x86 platforms, one year will focus on Arm platforms. Every year will see new exciting products from us. The Nvidia architecture and platforms will support x86 and Arm – whatever customers and markets prefer,” Huang said.
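To make the arithmetic behind Huang’s example explicit, here is a minimal back-of-the-envelope sketch in C++, using only the numbers quoted above; the variable names are ours, not Nvidia’s.

```cpp
// Back-of-the-envelope check of the memory figures Huang quotes
// for half a DGX node (4 Ampere GPUs plus host CPU memory).
#include <iostream>

int main() {
    const double gpu_mem_gb = 80.0;    // HBM per Ampere GPU
    const double gpu_bw_tbs = 2.0;     // bandwidth per GPU, TB/s
    const int    num_gpus   = 4;
    const double cpu_mem_gb = 1000.0;  // ~1 TB of host memory
    const double cpu_bw_tbs = 0.2;     // host memory bandwidth, TB/s

    const double total_gpu_mem = num_gpus * gpu_mem_gb;       // 320 GB
    const double total_gpu_bw  = num_gpus * gpu_bw_tbs;       // 8 TB/s
    const double size_ratio    = cpu_mem_gb / total_gpu_mem;  // ~3x larger
    const double bw_ratio      = total_gpu_bw / cpu_bw_tbs;   // ~40x slower
    const double total_mem     = cpu_mem_gb + total_gpu_mem;  // 1,320 GB

    std::cout << "GPU memory: " << total_gpu_mem << " GB at "
              << total_gpu_bw << " TB/s\n"
              << "CPU memory is " << size_ratio << "x larger but "
              << bw_ratio << "x slower\n"
              << "Total memory in the node: " << total_mem << " GB\n";
}
```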
The NVLink interconnect technology provides a 900 GB/s connection between Grace and Nvidia GPUs. Grace will also utilize an LPDDR5x memory subsystem. The new architecture provides unified cache coherence with a single memory address space, combining system and HBM GPU memory.
The Swiss National Supercomputing Centre (CSCS) and the U.S. Department of Energy’s Los Alamos National Laboratory are the first to announce plans to build Grace-powered supercomputers. According to Huang, the CSCS supercomputer, called Alps, “will be 20 exaflops for AI, 10 times faster than the world’s fastest supercomputer today.” The system will be built by HPE and come online in 2023.
Google AI scientist Samy Bengio has reportedly resigned following a controversy at the company. The brother of Yoshua Bengio, another world-famous AI scientist, Samy Bengio joined Google in 2007 and was part of the TensorFlow team. Prior to that, he co-developed Torch, the ancestor of PyTorch. It will be interesting to see where he lands next. Let’s now move to some updates, catching up on some of the news from the last couple of weeks.
Cadence Palladium Z2 and Protium X2 systems
Cadence has introduced the Palladium Z2 Enterprise Emulation and Protium X2 Enterprise Prototyping systems, representing the new generation of the current Palladium Z1 and Protium X1. Based on new emulation processors and Xilinx UltraScale+ VU19P FPGAs, these systems provide – according to Cadence – 2X capacity and 1.5X performance improvements over their predecessors. Both platforms offer a modular compile technology capable of compiling 10 billion gates in under ten hours on the Palladium Z2 system and in under twenty-four hours on the Protium X2 system.
Cadence Palladium Z2 and Protium X2. Credit: Business Wire
Siemens’ new Veloce system
Siemens has unveiled its new Veloce hardware-assisted verification system, which combines virtual platform, hardware emulation, and FPGA prototyping technologies. The solution includes four new products: Veloce HYCON (HYbrid CONfigurable) for virtual platform/software-enabled verification; Veloce Strato+, a capacity upgrade to the Veloce Strato hardware emulator that scales up to 15 billion gates; Veloce Primo for enterprise-level FPGA prototyping; and Veloce proFPGA for desktop FPGA prototyping. Customer-built virtual SoC models can begin running real-world firmware and software on Veloce Strato+ for deep visibility into the lowest levels of the hardware; the same design can then be moved to Veloce Primo to validate the software/hardware interfaces and execute application-level software while running closer to actual system speeds. Both Veloce Strato+ and Veloce Primo use the same RTL, the same virtual verification environment, and the same transactors and models. A key technology in the upgraded Veloce platform is a new, proprietary 2.5D chip which – according to Siemens – enables a 1.5x system capacity increase over the previous Strato system.
Innovative architectures, high performance targets, competitive market: does this AI cocktail call for specially optimized EDA solutions? We asked Prith Banerjee (Ansys), Paul Cunningham (Cadence), Mike Demler (The Linley Group), Jitu Khare (SimpleMachines), Poly Palamuttam (SimpleMachines), Anoop Saha (Siemens EDA)
Silicon startups have never been as numerous as they are today, in this era of ‘silicon Renaissance’ driven by an insatiable hunger for neural network acceleration. Startups engaged in the development of AI accelerator chips are raising considerable venture capital funding – and attracting a lot of attention from the media as technology champions at the forefront of innovation. Not surprisingly, most EDA vendors have updated their marketing messaging to emphasize product offerings specifically tailored to the design needs of these devices, and AI startups seem to enjoy a privileged status among EDA customers in terms of coverage in vendors’ blogs and press releases. It is therefore worth trying to figure out whether AI accelerator chips really pose special design challenges that call for specially optimized EDA solutions.
AI chips: different or normal?
Apart from some notable exceptions – such as the devices based on analog processing, or the wafer-scale chip from Cerebras – it seems fair to assume that the vast majority of the AI accelerators being developed are digital and have a ‘normal’ die size. Is there anything special in these chips that makes them different from other complex processors from an EDA standpoint? “The short answer is no,” says Paul Cunningham, Corporate Vice President and General Manager at Cadence. “I don’t think there is anything really fundamental that makes an AI chip different from other kinds of chips. But an AI chip is usually a very big chip and it’s highly replicated. So you have a basic building block, some kind of floating point MAC, and it’s replicated thousands, tens of thousands, hundreds of thousands of times. The nature of the design will stress the scalability of EDA tools to handle high replication. So in this sense, yes, it is important to make sure that our EDA tools have good performance on this style of design, but if there was another type of design which was also highly replicated, it would stress the tools in the same way.”
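To make the replication point concrete, below is a minimal, purely hypothetical C++ sketch of a floating-point MAC building block instantiated many times over a regular grid, the kind of structure Cunningham describes; the dimensions and names are illustrative and not taken from any real accelerator.

```cpp
// Illustrative only: a single floating-point MAC "building block"
// replicated across a grid, mimicking the highly regular structure
// of a typical AI accelerator. Dimensions are arbitrary.
#include <cstddef>
#include <vector>

struct MacUnit {
    float acc = 0.0f;
    void mac(float a, float b) { acc += a * b; }  // multiply-accumulate
};

int main() {
    const std::size_t rows = 256, cols = 256;  // tens of thousands of copies
    std::vector<MacUnit> tile(rows * cols);    // the replicated block

    // Feed every unit the same toy operands; a real design would stream
    // activations and weights through the array instead.
    for (auto& u : tile) u.mac(1.5f, 2.0f);
    return 0;
}
```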
Just a couple of months after taking office, Intel’s new CEO Pat Gelsinger delivered on the expectations of a quick strategy change – and his plan caught many observers by surprise. Investors who had suggested that Intel embrace the fabless model will probably be disappointed, as Gelsinger – speaking at a webcast event on March 23 – announced just the opposite: not only will Intel increase its manufacturing capacity, but it will also create a foundry business of its own. Gelsinger’s bold move resonates well with the current climate, characterized by geopolitical tensions, incentives from the Biden administration, and a severe chip shortage. However, turning his plan into reality might prove difficult, according to some observers.
Intel CEO Pat Gelsinger (Credit: Walden Kirsch/Intel Corporation)
A renewed technological self-confidence
One of the key elements of Intel’s new course is a renewed confidence in its internal technological capabilities. The company expects to continue manufacturing the majority of its products internally, and – as stated in a press release – the 7nm development is progressing well, driven by increased use of extreme ultraviolet lithography. Intel expects to tape out the compute tile for its first 7nm client CPU (code-named “Meteor Lake”) in the second quarter of this year. During the webcast, Intel executives reportedly did not directly address the issue of catching up with the leading foundries as far as the 5nm and 3nm nodes are concerned. However, Gelsinger reportedly offered an explanation for Intel’s delay in moving to the most advanced process nodes: he said the company had been too cautious about EUV lithography and, to compensate, had made its designs excessively complicated, leading to production problems. Gelsinger also announced a new research collaboration with IBM focused on creating next-generation logic and packaging technologies, which should help Intel catch up more quickly. The internal capabilities on which the company is relying to fuel its new course include its expertise in advanced packaging technologies for chiplet-based devices.
News from the rapidly evolving Chinese semiconductor industry opens our roundup this week. More updates span EDA, U.S. defense research, chip manufacturing, and electric vehicles.
China semiconductors updates: Baidu, ByteDance, CSIA
Chinese Internet giant Baidu has reportedly said that its artificial intelligence chip unit Kunlun has recently completed a fundraising round, which values the business at about $2 billion. According to the same source, Baidu is considering commercializing its AI chip design capabilities, with the aim of making the Kunlun unit a standalone company.
Another important Chinese Internet player, ByteDance, has reportedly begun hiring employees for semiconductor-related job openings. The company – best known for its TikTok app – confirmed it is exploring initiatives in this area, including the development of Arm-based server chips. According to another press report, ByteDance has also established a team to explore the development of artificial intelligence chips.
Semiconductor-related initiatives from companies like Baidu and ByteDance fit into a context where the Chinese government is playing an important role. During the recent National People’s Congress, the government reportedly committed to boost spending and research in advanced chips and artificial intelligence, to reduce reliance on foreign technologies.
In fact – as noted by market research firm IC Insights – despite being the largest consuming country for ICs since 2005, China still holds a small share as a producer. Of the $143.4 billion worth of ICs sold in China in 2020, only 15.9% was produced within the country, and China-headquartered companies accounted for only 5.9% of the total.
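As a rough back-of-the-envelope exercise, the sketch below turns those rounded percentages into approximate dollar figures, assuming both shares are measured against the total $143.4 billion market; the derived values are ours, not IC Insights’ own totals.

```cpp
// Approximate dollar figures implied by the rounded IC Insights percentages.
#include <iostream>

int main() {
    const double china_ic_market_busd = 143.4;  // ICs sold in China, 2020 ($B)
    const double made_in_china_share  = 0.159;  // produced inside China
    const double china_hq_share       = 0.059;  // by China-headquartered firms

    std::cout << "Produced in China:            ~"
              << china_ic_market_busd * made_in_china_share << " $B\n"   // ~22.8
              << "By China-headquartered firms: ~"
              << china_ic_market_busd * china_hq_share << " $B\n";       // ~8.5
}
```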
Despite this backdrop of increased international competition, U.S. and Chinese industry groups have launched a collaboration initiative. The China Semiconductor Industry Association (CSIA) reportedly said in a recent statement on its website that it will form a working group with the U.S. Semiconductor Industry Association (SIA). According to the statement, ten chip companies from each nation will meet twice a year to discuss topics ranging from export policies to supply-chain safety and encryption technology.
Catching up on some of the news from the last couple of weeks or so, let’s start with a recent update concerning Apple: the company is reportedly planning to build a new semiconductor design center in Munich, Germany, as part of a $1.2 billion investment push to develop custom chips for 5G mobile and other wireless technologies. According to the report, Apple plans to move into the facility in late 2022 and to hire hundreds of people.
Xilinx Vitis HLS front-end is now open source
Xilinx has decided to open access to the front-end of Vitis HLS (high-level synthesis) on GitHub. The Vitis HLS tool allows C, C++, and OpenCL functions to be deployed onto the device logic fabric and RAM/DSP blocks. Making the Vitis HLS front-end available on GitHub enables developers to tap into the technology and modify it for the specific needs of their applications.
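For readers unfamiliar with high-level synthesis, the snippet below is a generic, hypothetical example of the kind of C++ function such a tool can map onto fabric logic and DSP blocks; the pipeline pragma is a standard Vitis HLS directive, but the kernel itself is illustrative and not taken from the open-sourced repository.

```cpp
// Illustrative HLS-style kernel: a fixed-size dot product written in plain
// C++, the kind of function a high-level synthesis tool can map onto DSP
// blocks and fabric logic. Made-up example, not code from the Xilinx repo.
extern "C" void dot_product(const float a[64], const float b[64], float* out) {
    float acc = 0.0f;
    for (int i = 0; i < 64; ++i) {
#pragma HLS PIPELINE II=1  // ask the tool to start one loop iteration per cycle
        acc += a[i] * b[i];
    }
    *out = acc;
}
```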
EDA’s love affair with neural networks is cemented by elective affinities and made unique by the depth and breadth of the challenges
From logic synthesis down to the post-tapeout flow, machine learning has already made inroads in a wide range of EDA tools, enabling shorter turnaround times for chip designs, improving PPA results, and reducing the need for hardware resources across the design cycle. While disruptive advances are hardly unexpected when neural networks come into play, there seems to be something unique about the relationship between Electronic Design Automation and machine learning. On the one hand, the EDA industry looks particularly well equipped to take advantage of the potential of neural networks; on the other, the difficulty and diversity of EDA challenges require several different machine learning solutions, contributing to a uniquely complex ML-enabled flow.
Solving hard problems is business as usual
One aspect that immediately stands out when addressing this subject is the way EDA experts have approached the innovations brought about by neural networks. While they undoubtedly consider machine learning a disruptive technology enabling exciting results, they also see neural networks as a natural continuation of what EDA companies have always done: writing advanced software to solve hard problems.
“Machine learning is changing the world of software, not just EDA, it’s the next evolution in algorithms,” says Paul Cunningham, Corporate Vice President and General Manager at Cadence. “Our business is complex software, so the math and the computer science behind neural networks and Bayesian methods, all of these deep complex techniques inside machine learning, this is all just very normal for us. We are doing this all the time anyway, this is the software that we write,” he continues. “We have all the experts, there is no problem for us to write the [machine learning] algorithms just from scratch.”
Paul Cunningham. Credit: Cadence
Another reason why machine learning – despite being such a disruptive technology – can be considered a sort of natural development for EDA is that in many tools the new ML-based algorithms are replacing pre-existing traditional heuristics in a way that is invisible to the user. In these cases, “it’s just using machine learning as a better heuristic,” says Cunningham. “We may have some expert system inside, some other way to take a decision, some rule-based method, and we are now using a neural network-based method for the heuristic, so the customer has no visibility.”
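A minimal, entirely hypothetical C++ sketch of what “invisible to the user” can mean in practice: a tool-internal decision point defined as an interface, with a rule-based heuristic and a learned model as interchangeable implementations, so the calling code never sees which one is in use.

```cpp
// Hypothetical sketch: one internal decision point, two implementations.
// A rule-based heuristic can be swapped for a learned model without the
// caller noticing. Names, rules, and numbers are illustrative only.
#include <memory>

struct Candidate { double area; double timing_slack; };

// The decision point the rest of the tool calls.
struct Heuristic {
    virtual ~Heuristic() = default;
    virtual bool accept(const Candidate& c) const = 0;
};

// Traditional rule-based version.
struct RuleBased : Heuristic {
    bool accept(const Candidate& c) const override {
        return c.timing_slack > 0.0 && c.area < 1.0e6;  // hand-tuned rules
    }
};

// ML-based replacement: same interface, different internals.
struct Learned : Heuristic {
    bool accept(const Candidate& c) const override {
        // stand-in for an inference call into a trained model
        const double score = 0.7 * c.timing_slack - 1.0e-7 * c.area;
        return score > 0.0;
    }
};

int main() {
    std::unique_ptr<Heuristic> h = std::make_unique<Learned>();  // swap here
    const Candidate c{5.0e5, 0.12};
    return h->accept(c) ? 0 : 1;  // caller code is unchanged either way
}
```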
As Alphabet-owned Waymo expands testing of its autonomous vehicles in San Francisco, automotive vision startup Recogni raises $48.9 million in a Series B financing round. The autonomous vehicle landscape seems to be evolving quickly – and it will be interesting to see if this will have an impact on Mobileye, under the new Intel CEO. Let’s now move to some weekly news updates.
Qualcomm, Google and Microsoft reportedly against Nvidia-Arm deal
According to a report from CNBC, Qualcomm has told regulators around the world that it is against Nvidia’s acquisition of Arm. Qualcomm reportedly complained to the U.S. Federal Trade Commission, the European Commission, the U.K.’s Competition and Markets Authority and China’s State Administration for Market Regulation. The FTC’s investigation has reportedly moved to a “second phase” and the U.S. regulator has asked SoftBank (Arm’s current owner), Nvidia and Arm to provide it with more information. According to a report from Bloomberg, the list of companies complaining to U.S. antitrust regulators about Nvidia’s acquisition of Arm also includes Google and Microsoft. Both the CNBC and Bloomberg reports are based on sources who asked not to be identified. Commenting on these reports in EETimes, analyst Mike Feibus noted that “not a single Big Tech exec has spoken out publicly against [the Nvidia-Arm deal]”. Feibus also pointed out that the execs he has spoken with over the past few months “say the outcome they fear most is that they might set themselves up for retribution by Nvidia if they object to the acquisition, and the deal is consummated anyway.”
New Arm records
Meanwhile, the value and importance of Arm is reflected in its recently announced results: in the third quarter of 2020 alone, Arm silicon partners shipped 6.7 billion Arm-based chips. Most of them – 4.4 billion – are based on Cortex-M; another key contributor was the Mali graphics processor, which – according to Arm – remains the number one shipping GPU. In calendar year 2020, Arm signed 175 new licenses, bringing the total to 1,910 licenses and 530 licensees.
As widely reported by the media, the Semiconductor Industry Association has sent a letter to President Biden urging him to include in his recovery and infrastructure plan “substantial funding for incentives for semiconductor manufacturing, in the form of grants and/or tax credits, and for basic and applied semiconductor research.” According to SIA, “bold action is needed to address the challenges we face. The costs of inaction are high.” Speaking of challenges, TSMC’s Board of Directors has recently approved the issuance of bonds for a total amount of nearly $9 billion to finance TSMC’s capacity expansion and/or pollution prevention related expenditures. The Board has also approved the establishment of a wholly owned subsidiary in Japan to expand TSMC’s 3DIC material research, with an investment of $186 million.
Sunnyvale-based Micro Magic claims it has developed an ultra-low-power 64-bit RISC-V core consuming only 10 mW at 1 GHz, thus achieving a record-breaking 250,000 CoreMarks/Watt. According to the company, the processor – built in a 16nm FinFET process – reaches such outstanding efficiency when running at a 350 mV supply voltage.
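Taking the reported figures at face value, a quick sanity check shows what they imply; the derived CoreMark score below is our inference, not a number stated by the company.

```cpp
// What the reported figures imply, taken at face value: 250,000 CoreMark/W
// at 10 mW corresponds to a CoreMark score of about 2,500, i.e. roughly
// 2.5 CoreMark/MHz at 1 GHz. Derived numbers, not company-stated ones.
#include <iostream>

int main() {
    const double coremark_per_watt = 250000.0;
    const double power_w           = 0.010;   // 10 mW
    const double clock_mhz         = 1000.0;  // 1 GHz

    const double coremark_score   = coremark_per_watt * power_w;  // ~2500
    const double coremark_per_mhz = coremark_score / clock_mhz;   // ~2.5

    std::cout << "Implied CoreMark score: " << coremark_score
              << " (" << coremark_per_mhz << " CoreMark/MHz)\n";
}
```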
Some architecture and performance details about the processors developed by Chinese Internet giants Alibaba and Baidu have recently been made available – along with tons of other interesting content – on the HotChips conference website (HC32 archive). Presentations from the 2020 edition of the event include slides from Alibaba Group describing the Xuantie 910 processor and the Hanguang 800 NPU, as well as slides from Baidu providing some details about the Kunlun processor.
NeuReality, an Israel-based AI hardware startup, has reportedly emerged from stealth mode, announcing an $8 million seed round and the addition of Naveen Rao – former GM of Intel’s AI Products Group and former CEO of Nervana Systems – to the company’s board of directors. NeuReality is reportedly developing a high-performance inference platform for datacenters.
DATE Conference 2021 was held as a virtual event from February 1st to 5th. As usual, it offered a wide selection of papers – mostly from academic researchers – on a range of topics including design automation, neural networks, automotive applications, quantum computing, security, and cyber-physical systems. In this article we will briefly summarize just a few papers presented at DATE 2021 – in no particular order – to give a taste of the event. For each paper, only the speaker’s affiliation is indicated here, even though the works usually involved multiple universities or research institutions.
The University of Stuttgart (Germany) discussed Negative Capacitance Field-Effect Transistors (NCFETs) as a new beyond-CMOS technology offering lower power and/or higher accuracy for neural network inference, and the Ferroelectric FET (FeFET) as a novel non-volatile, area-efficient, and ultra-low-power memory device.
TU Dresden (Germany) addressed emerging ‘reconfigurable nanotechnologies’ that allow the implementation of self-dual functions with fewer transistors than traditional CMOS technologies. The team developed methods to achieve better area results for Reconfigurable Field-Effect Transistor (RFET)-based circuits.