In part one of this series, we talked about what silent data corruption (SDC) is and how it impacts data across modern computing. We again sat down with Rama Govindaraju, principal engineer at Google, and Robert S. Chappell, partner hardware architect at Microsoft, to discuss what can be done to address this difficult-to-solve issue.
To refresh your memory, silent data corruption occurs when a faulty CPU inadvertently introduces errors into the data it processes. These errors can persist for long periods without detection, corrupting entire datasets without raising any flags.
As computing workloads ramp up with the adoption of memory-hungry artificial intelligence (AI) and other new technologies, SDC threatens to corrupt mountains of data, causing problems that are difficult to predict and massive in scale.
It's clear that SDC needs to be addressed, but how?
Presently, the root causes of SDC are not well understood, and solutions remain immature.
One of the biggest hurdles to overcome is that decision-makers aren't investing the resources needed to solve the issue, only mitigating symptoms as they occur. "How much will this cost?" is the party line and usually the reason SDC isn't addressed: running periodic scans, improving the chip-making process, and many other countermeasures are seen as simply too expensive.
Ultimately, the cost argument is preventing meaningful solutions from being developed. If it costs too much, why research solutions in the first place? But therein lies the paradox: if a solution were developed, ways of lowering its cost and scaling it up could then be explored.
The onus of fixing SDC should not fall only on chip designers; it extends to manufacturers and beyond. Even if every chip in existence were perfect, SDC would still occur. However, there are potential countermeasures at every step of the chip lifecycle that could make an impact.
Today, there is no incentive for manufacturers to make changes to address SDC. If a customer receives a faulty or defective chip, they simply send it back to receive a replacement. That's all well and good, but it doesn't solve the actual problem. If the incentive model changed, the behavior would almost certainly change in response. For instance, if chip designers could demonstrate to manufacturers that a chip is faulty and the manufacturer had to pay back 50X the chip's cost, there would be more reason for manufacturers to put preventative measures in place.
Additionally, screening and testing can diagnose SDC in its early stages, allowing time for remediation. For example, think about the sensors in your car. Many of those sensors are not integral to the vehicle's ability to function but rather alert the driver that there may be a problem.
With chips, a fault can go one or two years without being detected, and by that point, it's too late to do much to fix it. On-chip sensors that raise early alerts or warnings are a stopgap rather than a complete solution, but one that could help in the interim.
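To make that idea concrete, here is a minimal sketch in Python of what an early-warning check over per-core health counters might look like. The counter names and thresholds are illustrative assumptions, not any vendor's real telemetry interface:

```python
from dataclasses import dataclass

@dataclass
class CoreTelemetry:
    core_id: int
    corrected_errors: int  # errors the hardware already fixed transparently
    retried_ops: int       # operations that needed a retry to complete

def early_warnings(samples, error_limit=5, retry_limit=10):
    """Return IDs of cores whose counters suggest a fault may be developing."""
    return [s.core_id for s in samples
            if s.corrected_errors > error_limit or s.retried_ops > retry_limit]

# Example: core 3 trips the corrected-error threshold and gets flagged early,
# long before it would produce a user-visible miscompute.
fleet = [CoreTelemetry(1, 0, 2), CoreTelemetry(2, 1, 0), CoreTelemetry(3, 9, 1)]
print(early_warnings(fleet))  # -> [3]
```

The point is not the specific thresholds but the pattern: counters that trend upward give operators a warning window before silent corruption reaches user data.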
One of the most difficult aspects of solving SDC is how all-encompassing solutions need to be. From chip designers and vendors to cloud and data managers and everyone in between, a solution must span all of these parties to be truly effective.
Another aspect of SDC that makes it difficult to solve is that we simply don't know a lot about how it happens or why. If you do not know your enemy, fighting it is nearly impossible. That means we need more data that can be broadly shared, analyzed, and studied. Not only that, but the industry also needs to allow and incentivize researchers and engineers to focus on SDC.
The ability to identify outliers in that data, take corrective action, diagnose symptoms, and watch for warning signs such as time delays or data leakage, among many other diagnostic options, will help us demystify SDC. Then tactics can be adapted and perhaps a solution can be found. But we are behind in that development stage and need to unite as many stakeholders as possible to work together on a solution.
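As a small illustration of the outlier-hunting idea, the sketch below flags machines whose error telemetry sits far from the fleet's median, using a robust median-based score. The metric name (a per-machine miscompare rate) and the cutoff are assumptions chosen for the example:

```python
import statistics

def find_outliers(rates: dict[str, float], cutoff: float = 3.5) -> list[str]:
    """Return machines whose metric sits far outside the fleet's median."""
    values = list(rates.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)  # median abs. deviation
    if mad == 0:
        return [n for n, v in rates.items() if v != med]
    # 0.6745 scales MAD so the score is roughly comparable to a z-score.
    return [n for n, v in rates.items() if 0.6745 * abs(v - med) / mad > cutoff]

fleet = {"host-a": 0.010, "host-b": 0.012, "host-c": 0.011, "host-d": 0.450}
print(find_outliers(fleet))  # -> ['host-d']
```

A median-based score is used here rather than a plain mean-and-standard-deviation test because a single badly corrupted machine can distort the mean enough to hide itself.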
Something we could borrow from adjacent industries such as cybersecurity is governing standards: criteria that dictate whether a product is deemed safe or secure. We see standards like this in cybersecurity, food, and other consumer goods; a similar framework for computer components could help the cause.
Another tool in the toolbox that is currently underutilized is AI and machine learning (ML). In terms of diagnostics, periodic screening is not perfect: the same screen can be run ten times and fail five runs while passing the other five. Failures are easy to miss, and even when symptoms are identified, the why or how remains mysterious in most cases.
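The sketch below illustrates why repetition matters in screening: a known-answer workload is run many times against a trusted reference, and a single miscompare marks the part as suspect. The workload and the simulated intermittent fault are stand-ins for a real screening suite, not an actual test:

```python
import random

EXPECTED = sum(range(1000))  # trusted reference result, computed once

def workload() -> int:
    """Stand-in compute kernel; simulates a core that miscomputes ~50% of runs."""
    result = sum(range(1000))
    if random.random() < 0.5:                 # illustrative intermittent fault
        result ^= 1 << random.randrange(16)   # silently flip one result bit
    return result

def screen(runs: int = 10) -> int:
    """Run the workload repeatedly; count runs whose result miscompares."""
    return sum(workload() != EXPECTED for _ in range(runs))

failures = screen()
print(f"{failures}/10 runs miscompared -> {'SUSPECT' if failures else 'PASS'}")
```

Because the fault fires only some of the time, a single run can easily pass; only repeated runs reveal the flaky behavior described above.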
AI or ML may be able to help. In theory, algorithms could flag when conditions that show early signs of SDC are met. The challenge with this route is, again, the large amount of data required to train such models; the datasets would need careful curation and a high level of intentionality. It's an exciting option but one still in its early stages.
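As a hedged sketch of that idea, the example below trains an off-the-shelf unsupervised anomaly detector (scikit-learn's IsolationForest) on synthetic "healthy" telemetry and flags machines whose counters deviate from it. The features are assumptions; a production model would need the large, curated datasets described above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row is one machine: [corrected_errors/hr, retried_ops/hr, avg_latency_ms].
# The training data is synthetic healthy telemetry, used here purely for illustration.
rng = np.random.default_rng(0)
healthy = rng.normal(loc=[2.0, 5.0, 1.0], scale=0.5, size=(500, 3))

model = IsolationForest(contamination=0.01, random_state=0).fit(healthy)

# Score two new machines: one typical, one with elevated counters.
new = np.array([[2.1, 5.2, 1.1],    # looks like the healthy population
                [9.0, 30.0, 4.0]])  # counters far outside the norm
print(model.predict(new))  # 1 = healthy-looking, -1 = flag for deeper screening
```

An unsupervised approach is attractive here precisely because confirmed SDC labels are scarce: the model only needs examples of normal behavior to flag machines worth deeper screening.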
What is clear is that this problem is large in scale: an underlying, existential threat that must be dealt with. We need stakeholders in all areas, from chip designers and manufacturers to software and hardware engineers, vendors, and anyone who works with computing data, to collaborate and take SDC seriously. Part of that is education, and our hope is that resources like this blog series begin to explain why silent data corruption demands action and persuade decision-makers to act.
The first step to solving a problem is admitting there is a problem. We are at that point with silent data corruption, but now it's time for action.
To learn more, check out Speaking Up About Silent Data Corruption and watch the full-length video of our second panel session with Google and Microsoft below.