Cloud native EDA tools & pre-optimized hardware platforms
Yudhan Rajoo, Technical Marketing Manager, Synopsys
With the rise of applications using machine learning, great attention is now paid to modifying computing architectures to improve neural network processing throughput. The FinFET era has nudged product architects and system-on-chip (SoC) engineers to take a closer look at the power efficiency of the computations performed in every clock cycle. As the race to deliver superior neural network architectures heats up, so do the temperatures on silicon as these complex functions execute on the chip. SoC designers are faced with the conundrum of newer computing blocks eating away at their already low power budgets. In addition, evolving architectures make schedules tighter because the RTL code keeps changing. When faced with power and time-to-market challenges, developing a full-chip layout that fits within the same die area and performs at expected throughput levels in mission mode is no small feat. Designers need to tackle the problem of meeting the power, performance, and area (PPA) targets of high-performance artificial intelligence (AI) SoCs at an elemental level, with the building blocks that make up computation circuits. These elemental blocks of Boolean logic and memory storage elements are called Foundation IP.
The most popular deep learning technique today is the convolutional neural network (CNN). CNNs take in uncompressed frames of data, such as an image of height 'h', width 'w', and depth 'd', and convolve the image with a filter to produce a two-dimensional matrix of values that is then passed through fully connected classification nets for inference. The convolution operation is essentially a multiply-accumulate (MAC) computation that can be represented with the equation shown in Figure 1.
Figure 1: Equation and use of convolution operations for CNNs
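To make the arithmetic concrete, the short Python sketch below (illustrative only; the function name, shapes, and variable names are assumptions rather than details taken from Figure 1 or any Synopsys product) shows how a single output value of a convolution reduces to a loop of multiply-accumulate operations over the filter window and the input depth.

import numpy as np

def conv_output_pixel(image, kernel, y, x):
    # image:  input frame of shape (h, w, d)
    # kernel: filter of shape (kh, kw, d)
    # (y, x): top-left corner of the filter window in the image
    kh, kw, d = kernel.shape
    acc = 0.0
    # Each loop iteration is one multiply-accumulate (MAC); a dedicated
    # CNN engine performs thousands of these per clock cycle in parallel.
    for i in range(kh):
        for j in range(kw):
            for c in range(d):
                acc += image[y + i, x + j, c] * kernel[i, j, c]
    return acc

# Example: a 3x3 filter over a 16-channel input needs 3*3*16 = 144 MACs
# for every output pixel.
frame = np.random.rand(32, 32, 16)
filt = np.random.rand(3, 3, 16)
print(conv_output_pixel(frame, filt, 0, 0))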
MAC computations were partially addressed in the early days of machine learning with conventional DSPs and GPUs, but the industry soon realized that dedicated architectural improvements were needed to strike the delicate balance between performance, power, and area. For CNN functions, some vector DSPs are area efficient for convolutions but lack the necessary speed and power efficiency. GPUs deliver high throughput but consume large amounts of die area and power as the number of shader cores increases. FPGAs have recently seen some success in adoption, but largely in data center applications with less stringent area requirements. Dedicated CNN engines, such as those in DesignWare® Embedded Vision Processors, resolve this quandary through customized logic that can perform up to 3520 MACs per cycle within small area and low power budgets.
CNN engines (neural network processors) are a new beast that only a battle-tested champion can slay. While implementing these blocks, going down the wrong path can be catastrophic to the project schedule. Therefore, designing with foundation IP blocks that offer flexibility for course correction during the design cycle is necessary for a successful product rollout.
Floorplan Iterations
Physical design implementation of machine learning blocks usually needs floorplan iterations to determine the best placement of macros and logical hierarchies within the given die area. The iterations may require modifications to the aspect ratio of both the core area and the macros themselves so that slower macros can be placed near logical hierarchies. A wide variety of memory cuts from compilers that trade off timing and aspect ratios is required to deal with this iterative churn. The relative placement of logic hierarchies, in turn, is affected by the routing track resources available to fit those hierarchies in a given space. While the top routing layer restrictions are already defined for a block by the top-level designer, adjusting the power-ground (PG) grid to the specific logic library can optimize core density. The lack of a recipe to start PG grid design can cause implementation delays.
Congestion in MACs
With the macros fixed, the challenge then becomes managing placement and routing congestion in the logic area while tuning the design for power-per-MHz targets. MAC blocks are notorious for routing congestion caused by higher-than-normal pin density and high net connectivity. When represented as schematics, MAC blocks have a naturally triangular shape as data passes through them, which is why optimal results are often achieved by hand-placing the datapath elements. EDA tools have made good progress in bridging the gap between a full custom layout of the MAC block and an algorithmically derived placement of individual cells that also honors process design rules at shrinking geometries. However, some EDA tools require compatible standard cell structures to complete the solution. Whether hand-placing the elemental blocks or relying on the tool to do so, the need for larger multi-input elemental adders, multiplexers, compressors, and sequential cells in different circuit topologies and sizes is evident.
Achieving Differentiated PPA in Time
Design timelines are long, and tapeout schedules are short. The squeeze in design cycles means there is no time to ramp up a design team on advanced process nodes. Integrating validated IP that has seen silicon success, along with a tool recipe that can start the implementation process on day one, is a "must have" to win in competitive markets. With ever-shrinking nodes, there is limited choice of foundries to use for an AI-powered SoC. Even more limited is the choice of logic library and memory design kits provided by the foundry. This raises the question: how will the PPA of a design using a foundry default design kit differentiate itself in the market? A truly unique IP solution is crucial to pushing the PPA envelope beyond what is achieved by commonly known optimization tactics and a basic toolkit for implementation.
Choice of the Building Blocks is Half the Battle
On the road to meeting PPA and time-to-market goals for an AI-powered SoC, design engineers hit several roadblocks that mandate time-sensitive decisions. A targeted solution of logic libraries and memory compilers can help designers start to address the challenges of implementing CNN engines, and hence AI-powered SoCs. For example, having multiple options of logic cells that leverage EDA tool optimization capabilities and provide the flexibility of manual intervention when standard tool algorithms are not enough is essential to overcoming implementation challenges. Synopsys' Foundation IP portfolio includes the HPC Design Kit, a collection of logic library cells and memories that have been co-optimized with EDA tools on advanced nodes to push the PPA envelope of any design and have been further optimized for AI-enabled designs. In addition to a rich silicon-validated portfolio that achieves superior PPA, Synopsys' support for customizations to meet individual design needs makes the offering more flexible than any other (Figure 2).
Figure 2: Requirements and benefits of a differentiated logic library IP
Design Flow Quick Start
The foremost advantage of using a foundation IP solution that comes from an EDA vendor is interoperability. The designer can use the scripts provided with the IP to have a working pipe-cleaning flow on the most cutting-edge process nodes, so no time is wasted ramping up. A rigorously tested PG grid is also available when designers use foundation IP from EDA vendors. The PG grid enables designers to start design exploration earlier, leverage the special provisions in the standard cell architecture, and take advantage of the additional routing track resources for signals without compromising the power integrity of the design. With these pieces in place, floorplan trials can begin in full swing.
In a DesignWare ARC® EV6x Embedded Vision Processor core, a few iterations are enough to find an aspect ratio that provides optimal routing track utilization at 500MHz. For a successive design in which the frequency is increased to 800MHz, growing the floorplan at the same aspect ratio is not sufficient for design closure. In such a scenario, one needs not only a different aspect ratio for the memory cuts but also a faster memory architecture with a larger bit size and different periphery settings. Synopsys' HPC Design Kit provides the variety of compilers needed to easily scale the performance from 500MHz to 800MHz with an optimized floorplan for each performance level. A broad offering of foundation IP at designers' fingertips provides more room for optimization, reduces floorplan iterations, and ultimately accelerates design closure.
Figure 3: Floorplan iteration examples on the DesignWare ARC EV6x with CNN Engine show the flexibility to achieve either 500MHz when area is at a premium, or 800MHz for performance-optimized designs
Rich Cell Set Co-Optimized with Tools
The richness of the cell set in a library co-optimized with EDA tools proves to be a crucial component of achieving PPA and managing congestion. When a library is co-optimized with tools, logic library architects have the ability to look under the hood of advanced synthesis and routing algorithms, which drives the library designer's decisions on which cells to include and how to lay them out. In turn, the tool's algorithm owners can introduce features that ensure superior results when the new logic cell types are utilized. The IP/EDA vendor's behind-the-scenes work reduces the effort required of the SoC designer, who can now focus on other aspects of implementation. This tool-library handshake has gained more prominence with multi-patterning FinFET nodes. SoC designers using smaller process technologies benefit from library cells whose pin layouts (especially those of high-input complex cells) have been optimized to take advantage of the latest router innovations, which ultimately leads to faster route closure at the design level.
The ratio of area to speed for cells drives the decision making within synthesis algorithms. These algorithms can derive an optimal circuit topology, as long as they have enough options within a logic library. When using Synopsys logic libraries, even large, complex cells are seamlessly integrated into EDA flows at synthesis, reducing manual intervention. Special wide multiplexers that combine several bits, together with compressor cells, can contribute to lowering the overall net length of the design. Wide multiplexers reduce the amount of routing track needed and in turn help with area and congestion. Multiple circuit topologies in the library for compressors, adders, and multiplexers ensure that synthesis always finds an optimal solution for every architecture with varying multiplier sizes. Low-power versions of multibit flops and multiplexers ensure that power is not compromised when larger device sizes are used for congestion mitigation. Wallace trees can also benefit from the use of Booth multiplexers with a higher number of bits and datapath arithmetic cells that combine multiple Boolean functions into one highly optimized cell. Designers who prefer hand-placed structured datapaths, or who use scripts to swap cells for power recovery, can benefit from the HPC Design Kit's library, which provides a wide range of options for both sequential and combinational functions in varying device sizes.
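As a rough illustration of why compressor cells and wide arithmetic cells matter, the sketch below (generic Python for explanation only; it does not model any specific Synopsys library cell) reduces a multiplier's partial products with 3:2 compressors in the style of a Wallace tree. The triangular array of partial-product columns is the same structure that gives MAC schematics their characteristic shape, and each compression step is exactly what a dedicated compressor cell implements in one level of logic.

def multiply_wallace(a, b, bits=8):
    # Partial products are kept as per-weight columns of 0/1 bits. Each
    # 3:2 compressor (full adder) turns three bits of one weight into a
    # sum bit of the same weight plus a carry bit of the next weight.
    cols = [[] for _ in range(2 * bits + 1)]
    for i in range(bits):
        for j in range(bits):
            cols[i + j].append((a >> i) & (b >> j) & 1)

    # Reduce until no column is more than two bits tall.
    while any(len(c) > 2 for c in cols):
        nxt = [[] for _ in range(2 * bits + 1)]
        for w, col in enumerate(cols):
            k = 0
            while len(col) - k >= 3:                    # one 3:2 compressor
                x, y, z = col[k], col[k + 1], col[k + 2]
                k += 3
                nxt[w].append(x ^ y ^ z)                # sum, same weight
                nxt[w + 1].append((x & y) | (y & z) | (x & z))  # carry, next weight
            nxt[w].extend(col[k:])                      # leftover bits pass through
        cols = nxt

    # Final carry-propagate addition of the two remaining rows.
    row0 = sum(c[0] << w for w, c in enumerate(cols) if len(c) > 0)
    row1 = sum(c[1] << w for w, c in enumerate(cols) if len(c) > 1)
    return row0 + row1

# Quick check against ordinary integer multiplication.
assert multiply_wallace(173, 229) == 173 * 229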
Synopsys' HPC Design Kit has been enhanced for artificial intelligence/machine learning applications (Figure 4). The new kit has been found to save up to 39% power on Synopsys Embedded Vision Processors over generic foundation IP solutions. Tradeoff tuning with the HPC Design Kit can also enable 7% higher speeds with 28% power savings on CNN blocks.
Figure 4: Enhanced HPC Design Kit for artificial intelligence and machine learning applications
A proven track record of enabling customers' silicon success in advanced process nodes, along with support for customization services that create differentiation for your design, makes the Synopsys DesignWare HPC Design Kit a compelling choice.