By Manny Wright, Senior Field Applications Engineer, Synopsys
Processor architects continually strive to improve CPU hardware performance. System architects and SoC designers can leverage advancements in CPU designs to optimize their products for performance. However, hardware "optimizations" can, from time to time, alleviate performance challenges only to create equally perplexing software-related development and performance challenges. These challenges are exacerbated by the variety of software typical SoCs are expected to process. As shown in Figure 1 below, from an application perspective, a narrowly dedicated SoC can leverage hardware optimization and customization without a major impact on the software solution. As the variety of software and the number of teams writing software for the SoC grows, the software challenges tend to multiply for hardware-optimized SoCs.
Figure 1: Software challenges related to hardware optimization
Instruction-level parallelism (ILP) refers to design techniques that enable more than one RISC operation to be executed simultaneously within a single instruction cycle. ILP takes advantage of sequences of instructions in user code that use different functional units (e.g., load unit, ALU, FP multiplier) of the processor core. There are a variety of ways in which processor architectures can leverage these sequences, but the idea is always the same: to execute independent instructions simultaneously, keeping the functional units busy as often as possible in order to maximize processor throughput. These techniques boost processor performance by increasing the amount of work done in a given time interval.
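For instance (a hedged C sketch, not taken from the article), the two statements in the loop body below are independent of each other, so an ILP-capable core can issue the integer add and the floating-point multiply, along with the loads and stores they imply, to different functional units in the same cycle:

#include <stddef.h>

// Illustrative only: the two statements have no data dependences on each
// other, so a superscalar or VLIW core can execute them in parallel.
void ilp_example(int *counts, float *samples, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++) {
        counts[i] = counts[i] + 1;      // integer ALU (plus its own load/store)
        samples[i] = samples[i] * gain; // FP multiplier (plus its own load/store)
    }
}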
When evaluating CPU architectures intended to take advantage of ILP, it is important to keep in mind the application environment, as results of improvements can and do vary greatly depending on the type of code being executed, and the potential impact on software development. For example, if the application is dominated by highly parallel code, any of a number of different architectures could improve application performance. However, if the dominant applications have little ILP, the performance improvements will be much smaller.
The balance between hardware optimization and software development can be attacked at both the micro-architectural and architectural levels.
Common micro-architectural techniques used to exploit ILP include instruction pipelining, superscalar execution, out-of-order execution, register renaming, and speculative execution with branch prediction.1
Even with these sophisticated and advancing techniques, the growing disparity between processor operating frequencies (which tend to scale nicely with process technology) and memory access times (which are struggling to keep up in a cost-effective manner) serves to reduce or limit such benefits.
For example, these techniques may be inadequate to keep the CPU from stalling when relying on off-chip memories. As an alternative, the industry is moving toward higher levels of parallelism that can be exploited through techniques such as multiprocessing and multithreading.
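As a rough illustration of this higher-level parallelism (POSIX threads are used here purely as an example; they are not mentioned in the original text), the same array-scaling work can be split across two threads so that one thread's memory stalls can overlap with useful work in the other:

#include <pthread.h>
#include <stddef.h>

// Illustrative thread-level parallelism: each thread works on its own half
// of the data, so progress continues even while one thread waits on memory.
struct slice { float *data; size_t len; float gain; };

static void *scale_slice(void *arg)
{
    struct slice *s = (struct slice *)arg;
    for (size_t i = 0; i < s->len; i++)
        s->data[i] *= s->gain;
    return NULL;
}

void scale_parallel(float *data, size_t n, float gain)
{
    pthread_t t;
    struct slice lo = { data,         n / 2,     gain };
    struct slice hi = { data + n / 2, n - n / 2, gain };

    pthread_create(&t, NULL, scale_slice, &hi); // second half on a worker thread
    scale_slice(&lo);                           // first half on the calling thread
    pthread_join(t, NULL);
}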
ILP can be explicit, where each additional operation is explicitly part of the instruction mnemonic, or implicit, where a single instruction implies a number and type of operations that the hardware carries out under the hood. While explicit parallelism is suitable in certain contexts, implicit parallelism offers inherent advantages that enhance performance and simplify coding, making it a better option in many cases.
At first glance, explicit ILP such as a very long instruction word (VLIW) architecture looks attractive, because it is obvious that during the instruction cycle, more than one operation is performed. The architecture of VLIW instructions is explicitly parallel, consisting of several RISC operations that control different resources.
VLIW machines first appeared in the early 1980s. They were used primarily as scientific supercomputers. At the time, with semiconductors still in their infancy and process geometries measured in microns, it made sense to bolt two (or more) execution units together and tolerate the software and real-time inefficiencies that this created. There were few better options at the time if you wanted to increase performance. Pushing the problem off to the compiler made sense because there were fundamental limitations on the number of transistors that could be put on a chip and, by extension, the level of functionality that could be implemented.
We have progressed from the 1980s and so has microprocessor technology. Process technology is now measured in nanometers with several orders of magnitude reduction in geometry sizes. This enables us to put billions of transistors on the chips that we are making and to take fundamentally different approaches to increase processing performance.
SoCs are increasingly being built with heterogeneous multi-processor configurations where the microprocessors are independently programmed but work together on a problem, with each doing its part. Multi-processor designs are easier to program, can be scaled to hundreds of processing elements, allow functions to be addressed in real time, make efficient use of memory (no stuffing with no-operation, or NOP, instructions), and facilitate block-level reuse. This, of course, is fundamentally different from VLIW implementations, where if one execution unit branches all of the execution units have to branch, where implementing more than a few processing elements is cumbersome, and where keeping all of the processors running at 100% of capacity to avoid stuffing memory with NOPs is impossible. It is no wonder that, given the inefficiencies of VLIW processor implementations, they are fading into the past and that multi-processor implementations have become the preferred technique for state-of-the-art SoC designs.
DSP applications are memory-access intensive, performing repeated mathematical operations on arrays of numbers. A simple example of explicit ILP is an addition calculation with load and store operations in a 3-slot instruction:
//{addition; store destination operand from previous instruction; load source operand for next instruction}
//Equivalent of 3 operations per instruction
{add a3, p5, p2; store a2, ; load p6, }
{add a4, p6, p2; store a3, ; load p7, }
{add a5, p7, p2; store a4, ; load p8, }
{add a6, p8, p2; store a5, ; load p9, }
In the above example, only one of the source operands of the addition operation needs to be loaded from memory. One can easily deduce that if the second source operand of the addition operation needs to be loaded from memory, a fourth slot is also required. Furthermore, as shown above, addresses for the load and store operations in DSP applications are not readily available, but must be derived from one instruction to the next. More parallel operations or additional cycles of scalar operations may be required to derive the addresses for the next instruction. Figure 2 depicts a pipeline with three parallel execution units that are required to process 3-slot instructions.
Figure 2: Explicit parallelism duplicates execution units
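Read as a software-pipelined loop (an interpretation, since the load and store addresses in the bundles above are left blank), the four bundles implement nothing more exotic than a vector-plus-constant loop; the rough C equivalent below shows the work that the programmer or compiler must split by hand into load, add and store slots:

#include <stddef.h>

// Hedged interpretation of the 3-slot bundles: each bundle adds a constant
// to the element loaded earlier, stores the result produced by the previous
// bundle, and loads the operand needed by the next bundle. In plain C the
// whole sequence collapses to one statement per element.
void add_constant(int *dst, const int *src, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + k;   // one load, one add, one store per element
}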
Launching multiple instructions at the same time for parallel execution can result in faster code execution speed. However, this explicit parallelism makes software difficult to optimize. It requires careful instruction ordering to avoid the need to simultaneously access the same data. It is also necessary to avoid simultaneous execution of instructions where one instruction depends on the results of the other for its operands. That brings us to yet another drawback of explicit instruction parallelism: a VLIW instruction cannot always be filled with useful operations. Some of the slots will be filled with NOP or "do nothing" instructions. Efficient scheduling of a VLIW application requires a thorough understanding of the parallel execution units as well as the capacity of the memory interfaces. Thus, adding more dedicated parallel execution units not only sacrifices programmability, it is not always energy efficient either. In general, not all DSP application software has a structure suitable for multiple-launch execution; but when it does, explicit parallelism offers a good high-performance architecture.
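To make the NOP problem concrete, consider the illustrative first-order recursive filter below (not taken from the article): each output depends on the previous output, so consecutive multiply-accumulates cannot share a bundle and the unused slots end up padded with NOPs:

#include <stddef.h>

// Illustrative sketch: the loop-carried dependence y[i-1] -> y[i] means the
// multiply-accumulate of one iteration cannot start until the previous one
// has finished, so a VLIW compiler cannot fill the other slots of each
// bundle with useful work from this loop.
void iir_first_order(float *y, const float *x, size_t n, float a)
{
    for (size_t i = 1; i < n; i++)
        y[i] = x[i] + a * y[i - 1];   // depends on the previous result
}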
Implicit ILP is similar to its explicit counterpart in that hardware is added. But that is where the similarity ends. The hardware added does not form extra execution units, but rather becomes part of the only execution unit, architected to alleviate DSP applications' need for fast memory accesses while performing repeated mathematical operations on arrays of numbers. This hardware addition, known as an XY memory system, consists of memory banks, address pointers, address update registers and address generation units. There are typically two identical memory arrays (one called X, the other called Y), or four memory arrays (two banks of XY memory, with only one bank active at a time). Each memory array has four dedicated address pointers: X0, X1, X2, X3 for the X memory array and Y0, Y1, Y2, Y3 for the Y memory array. Each address pointer has two associated address update registers: MX00 and MX01 for X0; MX10 and MX11 for X1; and so on. The address generation units provide additional addressing modes such as modulo, bit reverse and variable +/- linear offset. They perform complex address calculations independently, removing a significant overhead from the CPU and ensuring efficient memory access without cycle penalties. These address generation units are built into the instruction pipeline. One direct result is that a single instruction may embody three data moves, a mathematical operation and three address pointer updates:
//load and multiply X1 and Y2 address pointers' contents and store the result into the location of X0
//also update the X0 address pointer according to modifier MX01
//update the X1 address pointer according to modifier MX10
//update the Y2 address pointer according to modifier MY21
//Equivalent of 7 operations per cycle: 2 LD, 1 ST, MUL, 3 ADDR Update
MUL X0_M1, X1_M0, Y2_M1
This kind of single but complex RISC instruction offers code density when performing repeated mathematical calculations on arrays of data because it does not require explicit data move operations. Its high degree of parallelism is all under the hood, enabling a sustained seven operations per cycle.
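Expressed as scalar C (a behavioral sketch written for illustration; the pointer and modifier names follow the description above), this single MUL instruction does roughly the following in one cycle:

// Hedged scalar expansion of:  MUL X0_M1, X1_M0, Y2_M1
// x0, x1, y2 are XY address pointers; mx01, mx10, my21 are the address
// update (modifier) registers named in the text. The C code is only a
// behavioral sketch of what the XY address generators do in one cycle.
short *x0, *x1, *y2;   // would be set up to point into the X and Y memory banks
int mx01, mx10, my21;  // address update registers

void xy_mul_step(void)
{
    *x0 = (short)(*x1 * *y2); // two loads, one multiply, one store
    x0 += mx01;               // post-update X0 by modifier MX01
    x1 += mx10;               // post-update X1 by modifier MX10
    y2 += my21;               // post-update Y2 by modifier MY21
}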
The XY DSP memory option available on the ARC 600 and ARC 700 families of processors takes advantage of implicit ILP without pushing additional burden onto the software engineer. The solution, shown below in Figure 3, consists of separate memory banks for X and Y operands and a DMA, which moves data in and out of XY memory. This system delivers data at register speed, eliminating main memory fetch cycles. Other hardware- and software-related advantages include high-performance address generators, fast pointer accesses, a footprint roughly 10% of the size of a DSP coprocessor, a single-processor solution (it can replace a separate DSP), support for multiple memory banks, and a consolidated development environment for both CPU and DSP.
Figure 3: DesignWare ARC XY advanced DSP block diagram
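One common way to use such a DMA together with the two XY banks (a hypothetical ping-pong buffering sketch; the arrays, chunk size and helper routines below are invented stand-ins, not part of the ARC tool chain) is to let the core process one bank while the other is being refilled:

#include <stddef.h>
#include <string.h>

// Hypothetical double-buffering sketch. xy_bank, dma_fill() and
// process_bank() are stand-ins invented for illustration; a real design
// would program the actual DMA engine and XY banks through the ARC tools.
#define CHUNK 256
static short xy_bank[2][CHUNK];                   // stand-in for the two XY banks

static void dma_fill(int bank, const short *src)  // stand-in for a DMA transfer
{
    memcpy(xy_bank[bank], src, CHUNK * sizeof(short));
}

static void process_bank(int bank)                // stand-in for the DSP kernel
{
    for (size_t i = 0; i < CHUNK; i++)
        xy_bank[bank][i] = (short)(xy_bank[bank][i] >> 1);
}

void process_stream(const short *src, size_t total_chunks)
{
    int bank = 0;
    dma_fill(bank, src);                          // prime the first bank
    for (size_t c = 0; c < total_chunks; c++) {
        if (c + 1 < total_chunks)                 // refill the idle bank while computing
            dma_fill(bank ^ 1, src + (c + 1) * CHUNK);
        process_bank(bank);                       // compute out of the active bank
        bank ^= 1;                                // swap active and idle banks
    }
}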
The ARC processor cores' base ISA can readily utilize XY memories as though they are native operands. Thus, the addressing units bring DSP-like addressing capabilities to the RISC core, such that the pipeline sees the entire XY memory space as though it were a register file that is very tightly coupled with the pipeline's execution unit, as shown in Figure 4.
Figure 4: Implicit parallelism execution unit accesses the entire XY memory like a register file
The reverse is also true. ARC's DSP instruction extensions can utilize the RISC core's resources as well. The DSP instruction decode unit and processing elements all connect with the rest of the core, allowing the DSP logic to use the core's resources as well as its own. The extensions have full access to registers and operate in the same instruction stream as the RISC core. The programming model with the XY memory system is the same as that without it. There is no scheduling to ponder, and all instructions have access to core and XY resources alike.
As mentioned above, micro-architectural and architectural techniques can both be brought to bear on the performance optimization challenge. So far, the focus here has been on taking advantage of instruction-level parallelism. To find the balance between hardware optimization and software, one must also consider the architectural level. Suppliers of commonly available CPUs tend to focus on specific pieces of the performance/ease-of-use challenge. Some focus on "generic" hardware and "generic" tools. Others deliver more optimized hardware solutions (such as tools to develop custom processors), pushing the burden of software development onto customers and partners. And still others deliver a combination of performance-efficient hardware and a suite of development tools for SoC and software designers.
Synopsys' ARC processors are highly configurable, user-extendable and fully supported by a broad set of simulation and development tools. ARC processors are a family of 32-bit CPUs that SoC designers can optimize for a wide range of uses, from deeply embedded to high-performance host applications. Designers can differentiate their products by using patented configuration technology to tailor each ARC core instance to meet specific performance, power and area requirements, or select a pre-defined "out-of-the-box" processor configuration. ARC processors are also extendable, allowing designers to add their own custom instructions that dramatically increase performance.
To ease customer product and software development, ARC processors are supported by a complete suite of development tools, a variety of 3rd-party tools, operating systems and middleware from leading industry vendors, including members of the ARC Access Program.
System and SoC architects, working with software engineers, should evaluate the trade-offs between performance and ease of use specific to their application and customer requirements.
Both explicit and implicit instruction-level parallelism add hardware with the goal of improving DSP performance. Explicit instruction parallelism requires more effort to program and an in-depth understanding of the hardware units and I/O interfaces, and most DSP applications do not lend themselves readily to these rigid parallel operations. Implicit instruction parallelism using XY memory retains the RISC programming model and brings the entire XY memories into the pipeline as though they were register files: a resource-efficient and, in general, much higher-performing implementation.
1.
2. S. A. McKee, "Reflections on the Memory Wall," Proceedings of the 1st ACM Conference on Computing Frontiers, 2004.