Cloud native EDA tools & pre-optimized hardware platforms
Manuel Mota, Senior Staff Product Marketing Manager, Synopsys
Next-generation server, AI accelerator, and networking system-on-chip (SoC) designs require progressively stronger capabilities to keep up with the demand for faster data processing and advanced artificial intelligence/machine learning workloads. Large SoC die sizes and the benefits of modularity are driving a paradigm shift in the industry toward multi-die SoCs, which offer well-known advantages such as modularity, flexibility of configuration, and improved manufacturing yield.
However, multi-die SoCs bring new challenges that designers need to overcome.
IP and design tools have evolved to help designers define and implement their SoC architecture. In previous articles we've discussed important die-to-die topics such as main applications and care-abouts, characteristics of SerDes/parallel PHY architectures, and production test of multi-die SoCs. We've also discussed design flows and advanced 3DIC designs.
This article goes beyond the die-to-die PHY interface features and benefits to describe die-to-die link attachment to the SoC fabric, implementation requirements, link error management, and the die-to-die protocol stack structure, all of which are key to achieving better modularity, flexibility, and yield for multi-die SoCs in a single package.
It is valuable for multi-die SoC designers to focus on how the die-to-die interface contributes to system performance.
Here are a few examples of die-to-die use cases:
One common use case for multi-die SoCs targeting High Performance Computing (HPC) is the assembly of multiple homogeneous dies in the same package, each die containing a cluster of similar processing units, either generic CPUs or specialized processors for AI workloads, as well as local memory and cache. The reason for adopting a multi-die approach can be flexibility of configuration and modularity (i.e., scaling compute performance) or because the monolithic die is simply too big for efficient manufacturing (i.e., split SoC).
Figure 1: Illustration of a multi-die SoC with assembly of homogeneous dies
In a multi-die SoC assembled from homogeneous dies, as shown in Figure 1, an interconnect mesh connects all the CPU clusters and shared memory banks in each die. The die-to-die link connects the mesh interconnects in the two dies as if they were part of the same interconnect.
In high performance homogeneous compute use cases, either servers or AI processing, the CPU or TPU clusters with tightly-coupled cache hierarchy are spread across several dies (as shown in Figure 2).
Figure 2: Homogeneous compute use case with CPU and TPU clusters spread across several dies
These implementations are enhanced with a unified memory architecture, which means that any CPU can access memory in another CPU cluster with similar access time, so that the software code can be agnostic to how the workloads are distributed between the different processing clusters. For these cases, it is critical that CPUs in one die can access memory in the other die with minimum latency while supporting cache coherency.
Often the link between the two dies will need to carry cache-coherent traffic, using protocols such as CXL or CCIX, while keeping link latency low.
Maintaining a single unified memory architecture domain is typically possible if the link latency is in the range of 15 to 20 nanoseconds in each direction.
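As a rough plausibility check of this budget, the following Python sketch adds up per-stage latencies for one direction of a die-to-die link. The 15 to 20 nanosecond target comes from the text above, but the individual stage values are purely illustrative assumptions, not measured numbers for any particular implementation.

# One-way die-to-die link latency budget (all stage values assumed).
link_stage_ns = {
    "tx transaction + link layer": 4.0,   # assumed
    "tx logical PHY + PHY":        5.0,   # assumed
    "channel flight time":         1.0,   # assumed, short-reach channel
    "rx PHY + logical PHY":        5.0,   # assumed
    "rx link + transaction layer": 4.0,   # assumed
}

one_way_ns = sum(link_stage_ns.values())
print(f"one-way link latency: {one_way_ns:.1f} ns")   # 19.0 ns
print("fits the 15-20 ns budget:", one_way_ns <= 20.0)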
High performance heterogeneous compute architecture may also require coherency when both sides of the link share cache memory.
Applications such as IO access, where digital processing sits in a separate die from the IO functionality for flexibility and efficiency (IO examples include electrical SerDes, optical, radio, and sensor interfaces), typically don't have coherency requirements and are more tolerant of link latency. For these cases, IO traffic is generally routed through standard protocols such as the AXI interface.
Similarly, parallel architectures such as GPUs, and some categories of heterogeneous compute where an accelerator is connected to the CPU cluster, may require only IO coherency (if the accelerator die has no cache) or no coherency at all, as shown in Figure 3.
Figure 3: Multi-die SoCs with parallel heterogeneous architecture
Any data transmission is subject to errors. Die-to-die links, by nature of their short reach and relatively clean channel characteristics, generate fewer errors than longer-reach channels that must cross different materials and connectors.
To avoid data corruption due to link errors, which can have a catastrophic impact on system operation, the die-to-die link must implement functionality that allows error detection and correction. Depending on the system requirements and the raw PHY bit error rate (BER), two major options to detect and correct transmission errors are available, which can be used separately or in conjunction: forward error correction (FEC) and data retransmission.
FEC can correct a certain number of errors without requiring retransmission, which would incur additional latency. Typically, FEC is used to bring the BER down to a sufficiently low probability level (a "reliable link"), and any remaining uncorrected error triggers a retransmission.
Depending on the system requirements and die-to-die link configuration, the raw BER threshold below which a FEC is no longer required can vary. A raw BER of 1e-15 is sometimes considered sufficient; it corresponds to a retransmission request every seven hours for a single-lane link operating at 40 Gbps. For complex systems with tens of lanes, the retransmission rate increases proportionally with the number of lanes, so a light FEC implementation that reduces the BER to lower levels may still be required to keep the interval between retransmission requests reasonably long, as the sketch below works through.
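The arithmetic behind these intervals is easy to check. The Python sketch below computes the mean time between retransmission requests from the raw BER, per-lane rate, and lane count; the 40 Gbps single-lane case reproduces the roughly seven-hour figure quoted above, while the 16-lane case is an assumed illustration of a wider link.

def mean_hours_between_retransmissions(ber, lane_rate_bps, num_lanes=1):
    """Expected hours between bit errors, assuming each uncorrected
    error triggers one retransmission request."""
    errors_per_second = ber * lane_rate_bps * num_lanes
    return 1.0 / errors_per_second / 3600.0

# Single lane at 40 Gbps with raw BER of 1e-15: ~6.9 hours between errors.
print(mean_hours_between_retransmissions(1e-15, 40e9))

# An assumed 16-lane link sees errors 16x more often (~26 minutes apart),
# which is why a light FEC may still be needed for wide links.
print(mean_hours_between_retransmissions(1e-15, 40e9, num_lanes=16))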
In common with other chip-to-chip links, the protocol stack of a die-to-die link can be partitioned into protocol layers that align with the Open Systems Interconnection (OSI) model definitions, as shown in Figure 4. The PHY layer is made up of the physical medium attachment (PMA) and physical medium dependent (PMD) sublayers and handles the electrical interface with the channel. The logical PHY layer, sitting on top of the PHY layer, isolates the signaling characteristics of the PHY layer from the link layer, assisting with data stream building and recovery.
The link layer manages the link and handles the error detection and correction mechanisms that guarantee an end-to-end error-free link. The link layer also handles flow control, regulating the amount of data communicated between sender and receiver. The transaction layer receives read and write requests from the application, creating request packets for the link layer and receiving packets from it.
Figure 4: Die-to-die protocol stack
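To make the layering concrete, here is a minimal Python sketch of a write request flowing down the stack of Figure 4. A toy checksum stands in for real CRC/FEC protection, and all class and method names are illustrative rather than taken from any specific controller IP.

class TransactionLayer:
    def __init__(self, link):
        self.link = link
    def write(self, addr, data):
        # Turn an application write request into a request packet.
        self.link.send(("WR", addr, data))

class LinkLayer:
    def __init__(self, phy):
        self.phy = phy
    def send(self, packet):
        # Append a toy checksum; a real link layer would use a CRC and
        # also manage retransmission and flow control here.
        check = sum(x for x in packet if isinstance(x, int)) & 0xFFFF
        self.phy.transmit((packet, check))

class LogicalPhy:
    def transmit(self, flit):
        # Build the data stream handed to the PMA/PMD for signaling.
        print("to PHY lanes:", flit)

stack = TransactionLayer(LinkLayer(LogicalPhy()))
stack.write(0x1000, 0xDEADBEEF)

The receive side mirrors this flow: the link layer checks each incoming packet for errors before handing it up to the transaction layer.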
The layers are best optimized when they are defined and validated together, even though each layer has pre-defined interfaces. For example, the desired characteristics of the FEC depend on the expected PHY bit error rate.
Die-to-die links have characteristics that set them apart from traditional chip-to-chip links. For example, both ends of the link are known and fixed when the multi-die SoC is assembled. Therefore, the die-to-die link characteristics can be determined in advance and set at boot time, either via register settings or software, avoiding the complexity of link discovery and negotiation steps, as the sketch below illustrates.
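A minimal Python sketch of such boot-time bring-up follows, assuming a hypothetical memory-mapped register block for the link; every register name, offset, and value here is invented for illustration.

# Hypothetical die-to-die link register map (offsets invented).
D2D_REGS = {
    "LANE_COUNT":   0x00,
    "LANE_RATE":    0x04,
    "FEC_ENABLE":   0x08,
    "RETRY_ENABLE": 0x0C,
}

def write_reg(offset, value):
    # Stand-in for a real MMIO write on the target platform.
    print(f"reg[0x{offset:02X}] <= 0x{value:08X}")

def bring_up_link():
    # No discovery or negotiation: both ends of the link are known
    # and fixed at package assembly, so settings are simply written.
    write_reg(D2D_REGS["LANE_COUNT"], 16)    # assumed lane count
    write_reg(D2D_REGS["LANE_RATE"], 112)    # Gbps per lane, assumed
    write_reg(D2D_REGS["FEC_ENABLE"], 1)     # enable light FEC
    write_reg(D2D_REGS["RETRY_ENABLE"], 1)   # enable retransmission

bring_up_link()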
Additionally, the die-to-die link is expected to be a simple "tunnel" connecting the interconnect fabric in two dies without a specific defined protocol. To reduce latency and guarantee interoperability, it is ideal to have the link closely optimized for operation with the on-die interconnect fabric. For example, the Arm Neoverse platform defines specialized interfaces that support cache coherency, which can be used for low-latency die-to-die solutions. Alternatively, more general-purpose application interfaces, like AXI, can be used to attach to any on-die interconnect fabric.
As SoCs become larger in size and more complex in functionality, designers are exploring other options to optimize yield and latency. The trend is toward multi-die SoCs in a single package for better modularity, flexibility, and yield. For this reason, implementing die-to-die interfaces becomes essential in use cases such as scaling the SoC compute power, splitting the SoC, aggregating multiple disparate functions, and disaggregating IOs. Designers must understand how to implement such use cases, how each layer in the die-to-die protocol stack can be defined and verified together to enable a more optimized and reliable die-to-die link, and how to determine the target die-to-die characteristics to lower latency and guarantee interoperability.
Synopsys has developed a portfolio of DesignWare Die-to-Die IP solutions to specifically meet the needs of each use case. The DesignWare® Die-to-Die IP portfolio includes SerDes- and parallel-based PHYs, available in FinFET processes, and a controller for 112G USR/XSR links.