A system-level methodology for low power design
By Wolfgang Nebel and Laila Kabous, EEdesign
May 2, 2003 (6:59 p.m. EST)
URL: http://www.eetimes.com/story/OEG20030502S0057
Designing for lower power has become a critical prerequisite for a chip's technical and commercial success. The challenge of designing chips with optimal energy density and power consumption threatens to increase design time and, therefore, time to market. Moreover, the consequence of failing to meet the power challenge can be a significant increase in the cost of ownership of both chips and systems.
This article discusses a system-level methodology that reduces or even eliminates the design delays associated with the traditional approach to designing for lower power. Power estimation and analysis are performed at the algorithmic and architectural levels, enabling significant remedial design modification without significant loss of time. Moreover, optimization at the algorithmic and architectural levels can deliver greater power savings than optimization at lower levels of abstraction.
The power challenge
Users of laptop computers have long tolerated the functional and performance constraints imposed by the imperative to conserve battery power. Moreover, the need for adequate heat dissipation requires "design for cooling," which adds to both design and material/production costs. However, the fact that Intel announced its first lower-power microprocessor for mobile devices only recently, nearly a decade after the introduction of the laptop, indicates that designing for lower power while maintaining performance is not a trivial challenge.
Functions such as broadband networking, personal area networking and multimedia have become consumer features in a remarkably short space of time. Consequently, semiconductor manufacturers must integrate these functions with a multitude of others in order to remain competitive, especially cost competitive. However, such levels of integration intensify the already acute power consumption problem.
Time-to-market pressures allow minimal time to design for lower power consumption. Yet failure to constrain power consumption often leads to more expensive chip packaging and/or increased cooling costs, increasing the overall cost of ownership of both chips and systems. According to the Financial Times, power delivery and cooling account for 25% of a server farm's operating costs. Thus, failure to constrain chip and system power consumption, or at least the growth thereof, can have a significant effect on company financials.
Compromising performance and functionality in order to remain within the power budget and to maintain chip reliability is not a competitive option. Consequently, designing for lower power has become a critical factor in the design process. Unfortunately, designers are discovering that the traditional approach to designing for lower power is not only time consuming, but also often incapable of delivering the desired results. A new methodology is required.
The traditional approach
The traditional front-end approach to designing for lower power is to estimate and analyze power consumption at the register transfer level (RTL) or the gate level, and to modify the design accordingly. In the best case, only the RTL within given functional blocks is modified, and the blocks re-synthesized. The process is iterated until the desired results are achieved (see figure 1). Often, though, the desired power reductions can be achieved only by modifying the architecture, and even the algorithms, of the design.
Modifications at this level affect not only power consumption but also other performance metrics, and may significantly alter the economics of the chip. Such modifications therefore require re-evaluation and re-verification of the entire design, and a complete re-synthesis.
Figure 1 -- Traditional design flow
Relatively simple modifications, such as datapath optimization, may achieve power reductions of the order of 30%, at a cost of several weeks' delay. The more far-reaching architectural and algorithmic re-design necessary to reduce power by up to 75% may take months.
Further power analysis is performed after physical layout to detect transistor-level 'hot spots' and to accurately verify chip-level power consumption. This analysis enables the modification of the physical layout to re-distribute the energy density. It is not, and is not intended to be, a chip-level power reduction methodology.
From experience with the traditional approach, it is clear that algorithmic and architectural design decisions have the greatest influence on power consumption (see figure 2). Therefore, any new methodology must start at this system level.
Figure 2 -- Less time to less power
The system-level solution
System-level design consists of the mapping of a high-level functional system model onto an architecture. Power estimation and optimization at this level enable designers to evaluate different processing algorithms and hardware architectures to achieve reductions in power consumption on the order of 75% before netlist synthesis (see figure 3).
System-level power optimization is thus an integral part of algorithmic and architectural design, and avoids the multiple iterations of the traditional approach. The system-level solution not only yields greater power savings than the traditional approach, but also significantly reduces design time.
Figure 3 -- System-level low power design flow
The system specification defines system requirements, and is expressed at a very high level of abstraction. This specification is usually written in a standard language, such as C/C++ or SystemC.
Using this system specification, algorithms that realize the system functionality are developed and optimized, generally in those same standard languages. The algorithmic description consists of an executable specification, or functional description. This executable specification captures the system function and enables its verification. It can be written as a behavioral description, which can then be refined into a bit-accurate, purely functional design description.
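For illustration, a minimal sketch of this refinement, assuming a SystemC installation (the multiply-accumulate function and the 12/24-bit widths are invented for this example, not taken from the article):

#include <systemc.h>

// Behavioral description: plain C types, word lengths left to the host.
int mac_behavioral(int acc, int x, int h)
{
    return acc + x * h;
}

// Bit-accurate refinement: 12-bit operands, 24-bit accumulator. The code
// is still a pure functional description, but the datapath word lengths
// are now fixed, so later power estimation can reason about them.
sc_dt::sc_int<24> mac_bit_accurate(sc_dt::sc_int<24> acc,
                                   sc_dt::sc_int<12> x,
                                   sc_dt::sc_int<12> h)
{
    return acc + x * h;
}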
Thereafter, the architecture (memory, controller and datapath structure) is developed to implement the given algorithms, while simultaneously accounting for the application-dependent design constraints, such as power, performance and area.
Design then proceeds in the usual manner, through netlist synthesis, verification and physical design. Transistor-level power analysis is still necessary to identify hot spots, and to enable modification of the layout to eliminate them.
In system-level design, power-optimal algorithms and architectures are developed according to the methodology flow in figure 4. It should be noted that system-level optimization targets dynamic power consumption, Pload, which is determined by four factors: clock frequency, the square of the supply voltage, load capacitance and average switching activity. Short-circuit power and leakage power are optimized at lower levels of abstraction.
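Written out, this is the familiar first-order CMOS dynamic power equation (the symbols below are the conventional ones, not the article's):

Pload = α · CL · VDD² · fclk

where α is the average switching activity, CL the switched load capacitance, VDD the supply voltage and fclk the clock frequency. Since supply voltage and clock frequency are often fixed by other constraints, system-level optimization works chiefly on α and CL.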
Figure 4 -- Design of power-optimal algorithms and architectures
Candidate algorithms are analyzed for their power characteristics and to identify potential function-level hot spots. The most promising algorithms are then selected and optimized, followed by the creation of a power-optimal system architecture, in which the selected functions are transformed into hardware. The process is an iterative one of power estimation and optimization, each iteration consuming minutes or hours rather than days or weeks.
Algorithm analysis and optimization
The best algorithm to implement the specification is selected from the most appropriate candidate algorithms. Power-hungry parts of each algorithm are identified and the algorithm optimized by algorithm transformation, as appropriate.
Analysis of the power consumption of a given algorithmic specification proceeds according to the flow shown in figure 5. The C/C++ or SystemC specification must first undergo compilation and instrumentation. Instrumentation is the process of inserting the logging statements necessary to derive the switching activity at each defined operation in the source code.
The algorithms are then executed, and the resulting activity profile data is used to annotate a suitable design representation, such as a control data flow (CDF) graph. Any number of power estimations may then be performed to determine the power characteristics of any given configuration.
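A rough sketch of the idea follows (hypothetical C++, not the authors' tool; the names and the toggle metric are invented for illustration). Each instrumented operation logs the Hamming distance between successive operand patterns, which approximates its switching activity:

#include <bitset>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Accumulates per-operation activity: bit toggles between successive
// operand patterns, plus an execution count.
struct ActivityMonitor {
    std::map<std::string, uint64_t> toggles;
    std::map<std::string, uint64_t> executions;
    std::map<std::string, uint32_t> last;  // previous operand pattern

    void record(const std::string& op, uint32_t a, uint32_t b) {
        uint32_t pattern = a ^ b;  // simplified joint operand pattern
        toggles[op] += std::bitset<32>(last[op] ^ pattern).count();
        last[op] = pattern;
        ++executions[op];
    }
};

static ActivityMonitor mon;

// Instrumented multiply: computes a*b exactly as before, but first
// records the activity seen at this operation.
inline int32_t mul_instr(const std::string& id, int32_t a, int32_t b) {
    mon.record(id, static_cast<uint32_t>(a), static_cast<uint32_t>(b));
    return a * b;
}

int main() {
    long checksum = 0;
    for (int32_t i = 0; i < 1000; ++i)
        checksum += mul_instr("mul@filter:42", i, i + 3);
    for (const auto& e : mon.toggles)
        std::cout << e.first << ": " << e.second << " toggles in "
                  << mon.executions[e.first] << " executions\n";
    std::cout << "checksum " << checksum << "\n";
    return 0;
}

Executing the instrumented algorithm with representative input data then yields exactly the per-operation activity profile that is annotated onto the CDF graph.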
Figure 5 -- Algorithm power estimation using control data flow graph
A power-optimized architecture can be derived from this graph without executing a complete synthesis, using power models created for each RT-level component. These models depend on the input data, component characteristics such as bit width and architecture, and the underlying technology or cell library. The power models may be generated automatically for a given technology. Using the switching activity and the power models, the power consumption of a component can be estimated.
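A minimal sketch of such a model (hypothetical; real models are fitted per component architecture and cell library) parameterizes the energy of one activation by bit width and measured activity:

#include <iostream>

// Hypothetical RT-level power model: a fitted coefficient per component
// type and library, evaluated with activity taken from the annotated
// CDF graph.
struct RTComponentModel {
    double c_eff_per_bit;  // effective switched capacitance per bit (F)
    double vdd;            // supply voltage (V)

    // Average energy of one activation: E = alpha * bits * Ceff * Vdd^2.
    double energy_joules(int bits, double alpha) const {
        return alpha * bits * c_eff_per_bit * vdd * vdd;
    }
};

int main() {
    RTComponentModel adder{25e-15, 1.8};  // invented 0.18-micron-class values
    // 16-bit adder, average activity 0.3, one million activations:
    std::cout << 1e6 * adder.energy_joules(16, 0.3) << " J\n";
    return 0;
}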
Algorithm transformation techniques include HW/SW partitioning, operator substitutions, and code transformations. Code transformations include transformations on conditional statements, loop splitting, and loop unrolling. Control statement reduction has a significant impact on the power consumption, and such transformations are often best effected via loop transformations.
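As a concrete example of the last point (hypothetical C++ in the spirit of the matrix benchmark below), unrolling lets one iteration of loop control serve several datapath operations:

#include <cstddef>

// Original loop: the index increment, comparison and branch toggle for
// every single multiplication.
void scale_simple(const int* a, const int* b, int* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a[i] * b[i];
}

// Unrolled by four (assumes n is a multiple of 4): one iteration of
// control and address logic now serves four datapath operations,
// reducing the control activity per useful computation.
void scale_unrolled(const int* a, const int* b, int* y, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        y[i]     = a[i]     * b[i];
        y[i + 1] = a[i + 1] * b[i + 1];
        y[i + 2] = a[i + 2] * b[i + 2];
        y[i + 3] = a[i + 3] * b[i + 3];
    }
}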
An example of the power reduction effects of loop unrolling and common case optimization techniques is shown in figure 6. The results show the power consumption for the original algorithm (matrix_simple), the algorithm optimized with loop unrolling (matrix_unrolled), and the matrix optimized by common case techniques (matrix_ccase_opt).
Figure 6 -- Power analysis before and after loop unrolling
An example of how the best algorithm can be selected using the initial specification and the input data stream is shown in figure 7. The benchmark consists of two JPEG algorithms, one accurate and one fast, each of which performs a decompression of the same encoded image.
The benchmark compares the energy consumption of the two algorithms resulting from the processing of two different input streams: a high-quality stream with a low compression ratio consisting of 99% of the original data, and a fast stream consisting of 30% of the original data. It can be seen that the fast algorithm processes the quality stream with less power consumption, while the accurate algorithm consumes less power when processing the fast stream. Analysis of such algorithms by the traditional method would have necessitated extensive design work, and introduced a delay of weeks.
Figure 7 -- Energy consumption of two JPEG algorithms
Architecture analysis and optimization
Algorithmic estimation and optimization are followed by the creation of a power optimal architecture. This includes memory architecture, scheduling, number and types of resources, how those resources are shared and bound to the algorithm operators, type of data encoding, controller design, floor plan and clock tree design.
The power-optimal algorithm is transformed into hardware, normally by manual means, although time-to-market pressures will increase the demand for high-level synthesis. This transformation consists of a number of complex decisions involving scheduling, allocation and binding. Scheduling determines the clock cycle during which an operation is executed; allocation determines the type and number of resources to be used; binding is the mapping of operations onto the resources determined during allocation.
Furthermore, resources can be distinguished not only by their function, but also by their internal architecture. For instance, an adder can be realized as a carry-ripple or a carry-select adder.
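One way to picture these decisions (an invented example, not taken from the article): for y = (a + b) * (c + d), allocating a single carry-ripple adder forces the two additions into consecutive clock cycles, trading a cycle of latency for less area and switched capacitance.

#include <cstdio>

// Hypothetical schedule/allocation/binding for y = (a + b) * (c + d).
struct ScheduledOp {
    int cycle;             // scheduling: the clock cycle of execution
    const char* operation;
    const char* resource;  // binding: the allocated unit that executes it
};

const ScheduledOp schedule[] = {
    {0, "t1 = a + b",   "ADD0 (carry-ripple)"},
    {1, "t2 = c + d",   "ADD0 (carry-ripple)"},  // allocation: one shared adder
    {2, "y  = t1 * t2", "MUL0"},
};

int main() {
    for (const ScheduledOp& s : schedule)
        std::printf("cycle %d: %-14s on %s\n", s.cycle, s.operation, s.resource);
    return 0;
}

Allocating a second adder, or a faster carry-select one, would shorten the schedule at the cost of more switched capacitance; each such variant is simply another point in the design space to be estimated.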
This is clearly a large design space with multiple degrees of freedom, each with its own trade-offs and power consumption characteristics. The space is bounded by decomposing the specification into a series of function calls, each of which may be seen as a pure-function view of a subtask in a complex computation. Each call is analyzed for its power consumption characteristics. The power consumption of the library elements in a given architecture can be characterized in advance.
Memories are used for both intermediate information storage and inter-block communication. Thus, they have a significant effect on chip power consumption, and often account for the majority of it (up to 80% in some SoC designs). Consequently, optimizing the memory hierarchy and structure as early as possible is a major step in meeting power consumption constraints.
Common techniques for optimizing memory access and memory system performance include basic loop transformations such as loop interchange, loop tiling, and loop unrolling; array contraction; scalar replacement; and code co-location. Most of these techniques can be effected simply by rewriting code.
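One such rewrite, sketched in hypothetical C++ (the article discusses these transformations only in general terms): scalar replacement promotes a repeatedly accessed array element into a local scalar, turning per-iteration memory traffic into register activity:

#include <cstddef>

const std::size_t N = 64;

// Before: sum[i] is read from and written to memory on every inner
// iteration (assumes sum[] is zero-initialized by the caller).
void row_sums(const int a[N][N], int sum[N]) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            sum[i] += a[i][j];
}

// After scalar replacement: one register accumulator per row and a
// single memory write per row, saving roughly 2*N memory accesses
// per row.
void row_sums_scalar(const int a[N][N], int sum[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        int s = 0;
        for (std::size_t j = 0; j < N; ++j)
            s += a[i][j];
        sum[i] = s;
    }
}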
Selection of the best optimizations is facilitated by the visual display of power analysis results, as shown in figure 8. The graphic shows an analysis of the power consumption of algorithms used in a digital signal processing application: a wavelet (signal compression) transform. The bar graph shows the power consumed, while the memory access traces show memory usage. It can be seen that intra-array optimization reduces consumption from 19.2 uWs to 12.1 uWs, or 37%. Inter-array optimization (reducing memory size by mapping arrays onto the addresses of another array) cuts a further 1.2 uWs, yielding a total reduction of 43%.
Figure 8 -- Power analysis of a wavelet transform mapped to different memories
Tool requirements
The functionality requirements of system-level tools for low power design can be largely derived from the foregoing methodology. Clearly, the tools must perform model creation and power estimation. Moreover, the complex calculations required in model creation, power estimation, design space exploration and optimization cannot be analyzed adequately in a spreadsheet-type tool. Post-processing analysis of such a large quantity of complex textual data is extremely difficult and error-prone. Tools that give instant visual feedback through versatile graphical user interfaces are indispensable.
Thus, the general requirements of such tools are:
- Create power models by automatically characterizing RT-level macros and memories, for use by the power estimation tools.
- Perform comprehensive design space exploration automatically, calculating the maximum and minimum power consumption of every combination of algorithm and architecture.
- Accept designs written in standard high-level languages, such as C/C++ and SystemC.
- Fit into a standard design flow, and communicate effectively with the other tools in that flow.
- Estimate power using actual application data.
- Perform instrumentation automatically on the initial specification.
- Efficiently execute the algorithmic description with application data.
- Automatically annotate the CDF graph with activity profile data.
- Automatically create constraint scripts for high-level synthesis tools.
- Accept standard data sheet information for I/O pads, registers, and off-chip memories.
- Provide an estimation scheme for clock power and interconnect.
- Output the analyses in graphic form.
Conclusion
A system-level design methodology, supported by the appropriate automation tools, is the fastest and most effective method of designing complex chips for lower power. Moreover, it significantly reduces the risk of missing often stringent power constraints, by identifying function-level hot spots early and by enabling the analysis and selection of alternative solutions. Such methodologies have already been adopted by designers of complex chips, and have been shown to satisfy the claims made for them.
Dr. Wolfgang Nebel is Chief Technology Advisor and Co-Founder, ChipVision Design Systems. Dr. Laila Kabous is Technical Marketing Manager, ChipVision Design Systems.