Many computing errors have historically been blamed on bad code, faulty algorithms, or user mistakes. That makes sense: many performance issues are easily traced to software, and it has seemingly been one of the major root causes of computer errors.
Or has it?
Over the last decade or so, a sleeping giant has been uncovered, lurking in the components that undergird all computing: hardware. More specifically, a hardware problem known as Silent Data Corruption (SDC) is to blame for many performance issues. As computing scales massively and rapidly to meet the demands of AI and machine learning workloads, the problem of Silent Data Corruption has only grown more acute.
But what is Silent Data Corruption? How do we stop it? And why is it such a pervasive, difficult problem to address?
We sat down with Rama Govindaraju, principal engineer at Google, and Robert S. Chappell, partner hardware architect at Microsoft, to get to the bottom of these questions and more.
Silent Data Corruption happens when an impacted device inadvertently causes silent (unnoticed) errors in the data it processes.
For example, an affected CPU might miscalculate (such as computing 1 + 1 = 3), and there may be no indication of the error unless regular scans are conducted, hence the "silent" moniker. In short, these miscalculations are hard to detect and rectify.
If something goes wrong with the software, there are fail-stop mechanisms, user notifications, and various other alerts or indications that something needs to be fixed. With SDC incidents in hardware, there is no notification that something has been miscalculated, leading to corrupted datasets that can go completely undetected.
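To make that distinction concrete, here is a minimal sketch of a known-answer test, the basic idea behind the "regular scans" mentioned above. It is purely illustrative: the helper name and structure are our own, not any vendor's actual screening tool.

```python
# Minimal, illustrative sketch of a known-answer test (hypothetical helper,
# not a real screening tool's API). A core affected by silent data corruption
# returns a plausible-looking but wrong value, so the error becomes visible
# only when the result is explicitly compared against a precomputed answer.

def known_answer_test(compute, inputs, expected):
    """Run `compute` on `inputs` and compare against a precomputed result."""
    result = compute(*inputs)
    if result != expected:
        # In a real fleet this would flag the core or socket for quarantine;
        # raising here is only to make the failure visible in the sketch.
        raise RuntimeError(f"silent miscalculation: got {result}, expected {expected}")
    return result

# The "1 + 1 = 3" case from above: on healthy hardware this passes without
# any output, which is exactly why such errors stay silent unless checked.
known_answer_test(lambda a, b: a + b, (1, 1), expected=2)
```

Unlike a software crash, nothing interrupts the workload when the hardware miscalculates; the check itself is the only alarm.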
Due in part to SDC's stealthy nature, it's difficult to determine exactly how long it has been a phenomenon in computing. However, it has been a known problem in the industry for at least the last seven to eight years.
The challenges we currently face in solving this problem are multi-faceted.
Regarding the cost question, we are of the mind that the errors caused by SDC will cost organizations many, many times more to fix than to prevent ahead of time. Trying to debug problems caused by SDC can take many months, which is simply not scalable for most businesses. Benjamin Franklin once said, "An ounce of prevention is worth a pound of cure." That sentiment is apt here.
Let's say only one in 1,000 chips is defective. That doesn't sound like much, and in the past it might not have caused too many issues. But in today's world, machine learning workloads run across tens of thousands of chips. Over a long run time, these corruptions can derail entire datasets and demand massive expenditure to fix. As those working in the AI field know all too well, workloads occupy an ever larger footprint, and disruptions are increasing by the day.
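A rough back-of-the-envelope calculation shows why that seemingly small rate matters at scale. The numbers below are assumptions chosen for illustration (a 1-in-1,000 defect rate and a workload spanning 20,000 chips), not measured data:

```python
# Back-of-the-envelope sketch using the illustrative figures from above.
defect_rate = 1 / 1000          # assumed fraction of defective chips
fleet_size = 20_000             # assumed chips touched by one large ML workload

expected_defective = defect_rate * fleet_size
prob_at_least_one = 1 - (1 - defect_rate) ** fleet_size

print(f"Expected defective chips in the fleet: {expected_defective:.0f}")
print(f"Probability the workload touches at least one: {prob_at_least_one:.1%}")
# Roughly 20 defective chips on average, and essentially a 100% chance
# (>99.99%) that a long-running job includes at least one of them.
```

In other words, even a defect rate that sounds negligible per chip becomes a near-certainty once a single workload spans tens of thousands of parts.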
We need to ask ourselves how we can become better at screening. We need to research up and down the entire stack, from hardware to software and everything in between. More critically, we need holistic solutions starting in design or even process technology.
Rate of defect screening with DCDIAG test on third-generation Intel Xeon Scalable SoCs (Source: Intel)
Unfortunately, SDC is a problem that is getting worse as time goes on. The scale of computing needs is not slowing or even leveling out; rather, it's accelerating at an unprecedented rate. Integration is increasing and more resources are packed into single parts, leading to more complex processes, both in the production cycle and in implementation. Systems are stressed more than ever, and there's little relief in sight.
Our challenge now is explaining the scope of the problem as we currently understand it. We are still trying to get our arms around just how widespread SDC is. It's difficult to attack a problem that is not fully illuminated, much less explain to others why it should be taken seriously.
Industry leaders may be justified in not expending resources on a problem they don't fully understand. But errors at the scale SDC can introduce are far more costly to fix than to prevent. We need academics and researchers, business leaders, engineers, and everyone across the production and business operations ecosystems to come together and take this problem seriously before it gets out of control.
To learn more, check out Speaking Up About Silent Data Corruption and the second part of this blog series that highlights solutions for addressing SDC. You can also watch the full-length video of our panel session with Google and Microsoft below.