EDACafe Editorial Roberto Frazzoli
Roberto Frazzoli is a contributing editor to EDACafe. His interests as a technology journalist focus on the semiconductor ecosystem in all its aspects. Roberto started covering electronics in 1987. His weekly contribution to EDACafe started in early 2019.

Cutting cloud costs with Exotanium
May 13th, 2022 by Roberto Frazzoli
Saving up to 90% by leveraging the cloud ‘spot market’ and avoiding overprovisioning: that’s the promise of Exotanium, a startup enabling users to benefit from Live Virtual Machine Migration – even with stateful workloads – in a transparent way and without interruption.

Chip design teams are increasingly resorting to cloud computing, mostly as a way to reduce time-to-market. Running the EDA tools in the cloud, however, can prove extremely expensive, and skyrocketing cloud bills may prevent users from extending the benefits of cloud computing to a larger number of designs. A startup called Exotanium is now offering new solutions to optimize cloud costs, promising savings up to 90%. Cost reduction is obtained by taking advantage, as much as possible, of the cheapest cloud resources (the ones offered through the so-called “spot market”) and by avoiding overprovisioning (that is, paying for cloud resources that are larger in capacity than needed). These achievements were made possible by technologies originally developed at Cornell University (Ithaca, New York). Hakim Weatherspoon, CEO of Exotanium, described the company’s solutions in the video interview he recently gave to EDACafe’s Sanjay Gangal; in this article we will add a few details, as well as the answers provided by Rohan Prakash – Exotanium’s Senior Business Development Manager – to some additional questions.

Cloud waste and its causes

Cloud computing is not cheap, but it becomes unnecessarily expensive due to a significant waste of resources – a problem plaguing the majority of cloud users. In general terms, the two main causes of waste are idle resources and overprovisioned resources. Cloud providers are aware of the users’ need to reduce waste: see, for example, this Microsoft Azure web page. Nevertheless, according to an industry expert, in 2020 “between idle and overprovisioned resources alone, that’s $17.6 billion in cloud spend that will be completely wasted.” And a survey conducted by Flexera showed that cloud cost savings is the top cloud initiative for 2022 across all organizations.

As for overprovisioning, Exotanium explains that many legacy applications running in the cloud cannot automatically scale up and down with the workload. Examples of such applications include legacy databases and stateful streaming processors. To serve a dynamic workload with minimum impact on service latency, users often overprovision resources according to a peak workload that happens rarely, leading to a huge amount of wasted resources in the cloud when the applications are idle.

Besides idle and overprovisioned resources, a different form of waste is the users’ inability to take full advantage of the “spot market”. On this point, Exotanium notes that major cloud providers – such as Amazon AWS, Microsoft Azure, and Google Cloud – allow users to bid for unused virtual machine resources at a fraction (10-20%) of the price of their regular cloud offerings. Amazon calls them “Spot Instances”, Microsoft calls them “Low-Priority VMs”, and Google calls them “Preemptible VM Instances”. However, providers reserve the right to reclaim these instances at any time with very short notice, making the spot market useful mostly for short-running and stateless tasks that can be quickly restarted on a different virtual machine.

A quick look at Exotanium’s technology

Prior to describing the Exotanium offering, let’s take a quick look at how its technology works, summarizing the explanations provided by the company’s website. Extending the cloud spot market to arbitrary containerized applications, including stateful and legacy applications, is challenging. The key technology used by Exotanium is Live Virtual Machine Migration, a solution used by cloud providers themselves to balance the load of VMs across their compute clusters. By making Live Virtual Machine Migration available to cloud users, Exotanium lets them migrate containers into the spot market when there are plenty of unused resources and the spot market is reliable, and move them out of the spot market when resource reclamation is likely.

Exotanium makes virtual machine migration available to cloud users thanks to so-called “nested virtualization”: running a virtual machine monitor inside a virtual machine. The nested virtual machine monitor is under the control of the users, who can then leverage existing live VM migration tools to move a virtual machine to another virtual machine monitor under their control. Additionally, the company managed to make migration transparent to container runtime systems such as Docker and Kubernetes, even if the container is stateful. Lastly, the decision on when it is safe to run a container in the spot market is based on machine learning, which predicts when premature termination of a spot instance is likely.

More technical details about Exotanium technologies can be found in the following papers: “Cloud Mobility for Geographically Shifting Workloads”; “X-Containers: Breaking Down Barriers to Improve Performance and Isolation of Cloud-Native Containers”; “Smart Spot Instances for the Supercloud”. Let’s now briefly describe the three different solutions that Exotanium has developed using the concepts described above: x-Spot, x-Stack and x-Scale.
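
To make the placement decision described in this section more concrete, here is a minimal Python sketch. It is not Exotanium's implementation, which the company's material does not detail: the function name, the two thresholds and the idea of using a gap between them (so that a workload does not bounce back and forth) are all assumptions made for illustration. The reclamation probability is assumed to come from some prediction model that is not shown.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    target: str  # "spot" or "on_demand"
    reason: str


def choose_placement(reclaim_probability: float,
                     currently_on_spot: bool,
                     enter_below: float = 0.05,
                     leave_above: float = 0.20) -> Placement:
    """Decide where a container should run next.

    reclaim_probability is a hypothetical model output: the estimated
    chance that the provider reclaims the spot instance soon.
    enter_below and leave_above are illustrative thresholds, not values
    published by Exotanium.
    """
    if not 0.0 <= reclaim_probability <= 1.0:
        raise ValueError("reclaim_probability must be between 0 and 1")

    if currently_on_spot:
        # Move out of the spot market only when reclamation looks likely.
        if reclaim_probability >= leave_above:
            return Placement("on_demand", "reclamation likely")
        return Placement("spot", "spot still considered reliable")

    # Move into the spot market only when it looks clearly reliable.
    if reclaim_probability <= enter_below:
        return Placement("spot", "spot considered reliable")
    return Placement("on_demand", "risk too high to enter spot")


if __name__ == "__main__":
    for prob, on_spot in [(0.01, False), (0.10, True), (0.50, True)]:
        print(prob, on_spot, choose_placement(prob, on_spot))
```

In a real system the outcome of this decision would trigger a live migration of the nested virtual machine rather than a restart, which is what allows stateful workloads to follow the same policy.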

x-Spot: relocate containers between spot instances and regular instances

According to Exotanium, x-Spot dramatically improves the usability of spot instances for critical applications, by dynamically relocating containers between spot instances and on-demand instances. As already mentioned, x-Spot uses Cornell-patented user-level live migration technology. User virtual machines are run as nested or second-layer VMs (sVMs), while the spot instances form the first-layer VMs (fVMs). Nested virtualization allows sVMs to be migrated according to a scheduling policy. Users set a high maximum bid, which can be as high as the on-demand price. The x-Spot runtime monitors the spot price and live migrates sVMs to other, possibly cheaper fVMs – either just another spot instance type, or even spot instances in another availability zone, another region, or another cloud. In the worst case, the sVMs can be live migrated to regular on-demand instances that are always available, so the user never pays more than the on-demand price. When the price approaches the maximum bid in the middle of an instance hour, the x-Spot runtime migrates the sVMs to avoid being terminated. The x-Spot Instance Scheduler optimizes placement of groups of sVMs on different types of fVMs, either spot instances or regular instances.

x-Stack: packing idle containers to reduce VMs

As explained by the company, Exotanium’s x-Stack addresses the inability of many legacy applications (e.g., legacy databases and stateful streaming processors) running in the cloud to automatically scale with the workload. Using Cornell-patented live-migration technology, x-Stack packs idle containers onto a small number of VMs during idle periods, minimizing the number of active VMs and thus reducing the cost of keeping services online. When the workload increases, x-Stack relocates containers onto different VMs, without any service interruption.
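
The packing step described for x-Stack resembles a classic bin-packing problem. The Python sketch below uses the well-known first-fit decreasing heuristic to illustrate the idea; it is an assumption made for illustration, since the source does not say which algorithm x-Stack actually uses. The container names, the CPU figures and the `pack_containers` function are all hypothetical.

```python
from typing import Dict, List


def pack_containers(demands: Dict[str, float],
                    vm_capacity: float) -> List[List[str]]:
    """Pack containers onto as few VMs as the heuristic can manage.

    demands maps a container name to its current CPU demand (in cores).
    vm_capacity is the number of cores offered by each VM.
    Uses first-fit decreasing: largest containers are placed first,
    each on the first VM that still has room.
    """
    if vm_capacity <= 0:
        raise ValueError("vm_capacity must be positive")

    vms: List[List[str]] = []   # container names per VM
    free: List[float] = []      # remaining cores per VM

    ordered = sorted(demands.items(), key=lambda kv: kv[1], reverse=True)
    for name, need in ordered:
        if need > vm_capacity:
            raise ValueError(f"{name} does not fit on a single VM")
        for i, room in enumerate(free):
            if need <= room:
                vms[i].append(name)
                free[i] -= need
                break
        else:
            # No existing VM has room, so another VM stays active.
            vms.append([name])
            free.append(vm_capacity - need)
    return vms


if __name__ == "__main__":
    # Hypothetical idle-period demands, in CPU cores.
    idle = {"db": 0.5, "stream": 0.25, "cache": 0.25, "api": 0.5, "batch": 1.0}
    placement = pack_containers(idle, vm_capacity=2.0)
    print(f"{len(idle)} containers packed onto {len(placement)} VMs")
    for i, names in enumerate(placement):
        print(f"VM {i}: {names}")
```

With one container per VM, this hypothetical workload would keep five VMs running; packed during an idle period it needs two. When demand rises again, the same function can be rerun with the new figures and containers moved by live migration to match the new placement.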
Therefore, according to Exotanium, x-Stack solves the problem of overprovisioning in the cloud and the waste of idle resources.

x-Scale: resizing containers

The function performed by x-Scale, as Exotanium explains, is to dynamically change the size of a container in terms of the number of physical CPU cores, memory size, and I/O throughput. Using existing cloud platforms, containers can only be resized by shutting them down and starting them back up with new resource requests. Using Cornell-patented live-migration technology, instead, x-Scale can resize containers in real time without taking them down. More importantly, the solution can automatically scale containers up and down according to workload, saving significant costs.

Targeting both EDA vendors’ clouds and BYOC options

Exotanium’s solutions are targeted at all cloud providers and all computing use cases, including EDA. When it comes to Electronic Design Automation, it should be noted that some of the major EDA vendors are offering a range of different cloud options. In very general terms, they fall into two main categories: 1) SaaS (Software as a Service) or specifically optimized solutions (e.g., “Synopsys Cloud”, “Ansys Cloud”, “Cadence Cloud” etc.); and 2) BYOC (Bring Your Own Cloud) solutions. Applicability of the cost optimization technologies to the different EDA cloud options is therefore an interesting point, from a user perspective. “The Exotanium solution can be used in both scenarios,” Rohan Prakash maintained. “Our software needs to run in the cloud environment where the Synopsys, Ansys, or Cadence tools are being run.”

Transparent solutions

Assessing the net benefit provided by the Exotanium solutions also requires taking into account any process changes or complexity linked to the addition of one more piece of software in an already complex environment.
According to the company, however, these cost optimization solutions leave the chip design process unchanged: “Exotanium is another software piece but there is no difference to the end user,” Prakash replied. “We are our own runtime that is put on top of the cloud hypervisor. When it comes to starting and stopping EDA tools, the end user will start and stop a job in the same way that they do it today; manually or by using a scheduler like PBS, LFS, SOCA etc. The only difference in using the Exotanium enabled platform is that the users will submit their jobs to the Exotanium enabled queue using the same commands that they use today.” The whole mechanism of virtual machine migration, packing and resizing containers etc. is transparent to the chip design team, according to Prakash: “All the actions are done by the Exotanium software that is running in the users’ cloud environment. The software will help determine what resources are needed and optimal through our proprietary algorithms.”

Amount of cost savings for chipmakers

Lastly, in order to fully understand the significance of the savings promised by Exotanium – up to 90% – let’s take a look at the absolute cost of cloud computing for chip design. Obviously, cloud costs vary greatly from design to design; the example provided by Exotanium, however, is in the 100-million-dollar order of magnitude: “The absolute savings are based on the percentage of cloud compute costs that Exotanium saves the customer. If the customer is spending $100M on compute, we have seen instances where we save customers 80% or $80M,” Prakash maintained. Alternatively, cloud cost optimization can be leveraged to enable a further time-to-market reduction without increasing the cloud budget: “The other piece that cannot be calculated” – Prakash added – “is how the time to market can be reduced.
We have seen other chipmakers have been able to reduce the amount of time needed to test by months, by using our platform to increase the number of computations done for the same dollar. These several months are hard to quantify but there is a large, realized value that is in line with the clouds’ appeal to EDA users.”

A university startup

Exotanium is a software startup founded by faculty and researchers in the Cornell Ann S. Bowers College of Computing and Information Science (Ithaca, NY) and based on technology licensed through Cornell’s Center for Technology Licensing. Hakim Weatherspoon (CEO) founded the company in 2018 with co-founders and fellow Cornell researchers Zhiming Shen (CTO) and Robbert van Renesse (chief scientist). Exotanium’s technology is based on Shen’s doctoral research. Last year the company announced the completion of a $5 million seed funding round led by Walden International and Nepenthe Capital. The chairman of Walden International is Lip-Bu Tan, former Cadence CEO. Exotanium’s offices are in Ithaca, NY, and Santa Clara, CA.