The why, where and what of low-power SoC design
Pete Bennett, Cadence Engineering Services
(12/02/2004 8:49 PM EST) Reducing on-chip power consumption has become a critical challenge for the nanotechnology era. The traditional trade-offs between performance and area are now compounded by the addition of power to the equation. Problems relating to power consumption apply not only to battery-powered, handheld and mobile applications, but to everything targeting 90nm and beyond, where power influences designs in terms of time to market, cost and reliability.
Why low power
Before going into the details of analyzing and reducing power consumption, we should first look at why it is so critical in today's designs. The continuing trend in applications toward ever-increasing functionality, performance and integration within SoCs is leading to designs with power dissipations in the hundreds of Watts. This can be seen from the latest processor variants from Intel, with the Itanium2, for example, approaching 130 Watts [1]. This class of device requires expensive packaging, heat sinks and a cooling environment. This leads to a number of additional issues that need to be addressed to maintain the feasibility of future applications.
The increased integration of mobile applications puts greater demands on the battery lifetime of the product than in previous generations. While advances in CMOS technology have seen transistor density double roughly every 18 months, the equivalent advance in battery technology takes more than five years.
Having high current on-chip decreases the lifetime and reliability of the product. With increasing frequencies, the average on-chip current required to charge (and discharge) the total load capacitance also increases, while the current surges that occur during switching cause voltage fluctuations across the power distribution network of the device.
These dynamic voltage drops are a concern because they create delay uncertainty, leading to possible functional problems and eventually to a shortened product life through complete device failure [2]. Finally, handling the power dissipated from the device as heat becomes ever more expensive as part of the overall system design.
Where power is consumed
Power dissipation within a device can be broken down into two basic types: dynamic power consumption, based on switching activity, and static power consumption, based on leakage.
The dynamic power consumption can be broken down into the switching power, due to the charging and discharging of capacitive loads driven by the circuit (including net capacitance and input loads), and the short-circuit power, which occurs momentarily during switching when pairs of PMOS and NMOS transistors conduct simultaneously.
The leakage power can also be broken down into a number of key contributors. One is the current flowing through the reverse-biased diode formed between the diffusion regions and the substrate (Idiode). Another is the current flowing through transistors that are nominally not conducting (Isubthreshold), along with current tunneling through the gate oxide. Note that the leakage of the device is dramatically affected by the operating temperature: as the chip heats up, the static power dissipation increases exponentially.
Figure 1 — CMOS power dissipation
Leakage current within a 130nm process with a 0.7V threshold is approximately 10-20 pA per transistor; reduce the threshold to 0.3V in the same process and the leakage current rockets to 10-20 nA per transistor, and it increases exponentially in smaller geometries. It can be seen, therefore, that leakage is affected by how close Vth is to Vdd, by transistor size and by temperature. The effects of varying and optimizing Vdd and Vth are discussed in depth in the papers by David J. Frank [3] and Tadahiro Kuroda [4]. The following equations define the power within the device:
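$$P_{\text{switch}} = A \cdot C \cdot V^{2} \cdot F$$
$$P_{\text{short}} = A \cdot \frac{B}{12} \cdot (V - 2V_{\text{th}})^{3} \cdot F \cdot T$$
$$P_{\text{leakage}} = (I_{\text{diode}} + I_{\text{subthreshold}}) \cdot V$$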
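A = switching activity, C = total load capacitance, V = supply voltage, F = clock frequency, B = transistor gain factor, T = rise/fall time of the gate inputs, Vth = threshold voltage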
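The three equations can be evaluated directly. The following is a minimal sketch rather than a calibrated power model: B is taken to be the transistor gain factor and T the input rise/fall time, which is an assumption of this sketch, and the operating point in the demo is hypothetical. The clamp in the short-circuit term is also an added assumption. The last helper uses the leakage figures quoted above (a 1000x increase for a 0.4V drop in threshold) to estimate how many millivolts of threshold change correspond to one decade of leakage.

```python
import math


def switching_power(a, c, v, f):
    """Pswitch = A * C * V^2 * F (watts for SI inputs)."""
    return a * c * v ** 2 * f


def short_circuit_power(a, b, v, vth, f, t):
    """Pshort = A * (B/12) * (V - 2*Vth)^3 * F * T.

    B is taken here to be the transistor gain factor and T the input
    rise/fall time (an assumption of this sketch).
    """
    # Added assumption: with V <= 2*Vth the PMOS and NMOS devices never
    # conduct together, so the term is clamped at zero instead of going negative.
    overdrive = max(v - 2.0 * vth, 0.0)
    return a * (b / 12.0) * overdrive ** 3 * f * t


def leakage_power(i_diode, i_subthreshold, v):
    """Pleakage = (Idiode + Isubthreshold) * V."""
    return (i_diode + i_subthreshold) * v


def mv_per_decade(vth_high, vth_low, leakage_ratio):
    """Threshold change in mV per 10x change in leakage between two points."""
    return (vth_high - vth_low) * 1000.0 / math.log10(leakage_ratio)


if __name__ == "__main__":
    # Hypothetical operating point, chosen only to exercise the formulas.
    for volts in (1.2, 0.9):
        p = switching_power(a=0.2, c=1e-9, v=volts, f=200e6)
        print(f"switching power at {volts} V: {p:.4f} W")
    # Figures from the text: 0.7 V -> 0.3 V raises leakage from pA to nA.
    print(f"{mv_per_decade(0.7, 0.3, 1000.0):.0f} mV of Vth per decade of leakage")
```

Because the voltage enters the switching term squared, halving V at a fixed frequency cuts that term to a quarter, which is the quadratic effect the later sections rely on.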
What we can do to minimize power
In targeting an SoC architecture for a low-power application, we must first fully understand the requirements that will define the power budget. These may be derived from standards-based requirements limiting current draw under certain conditions, or alternatively from the need to prolong battery life in a mobile application. The solution for the target application will differ in how the device is controlled and architected. Once the requirements are clearly defined, we can start to explore various architectures and determine potential trade-offs. By starting at the highest level of abstraction, where the potential for savings is greatest, and refining the design through the lower levels of abstraction, we can continually drive the power down toward the target budget.
Figure 2 — Diminishing returns through levels of abstraction
In finalizing the SoC architecture, a number of considerations and decisions will need to be made at various stages of the design abstraction to reach the optimal solution. These will include such requirements as system performance, processor and other IP selection, new modules to be designed, target technology, the number of power domains to be considered, target clock frequencies, clock distribution and structure, I/O requirements, memory requirements, analog features and voltage regulation. All of these are contributors to the power budget and therefore can be targeted for power minimization to achieve the low power goal.
In bringing all the pieces of the architecture together, we next need to look at the global control and clock features that can be used to reduce the overall power of the system. A design is likely to have many modes of operation for various application demands, such as startup, active, standby, idle and power down. In some cases multiple levels of these modes will be used to achieve the best overall power management strategy. These modes are generally controlled by a combination of software and hardware features, and need to be planned into the system development from a very early stage of the design process.
From the previously described equations, it can be seen that the most effective way to save power is to scale the voltage to the optimal level for the required performance, since switching power depends on the square of the voltage. The impact of reducing voltage levels, however, is to increase the gate delay, and beyond a certain level that becomes infeasible. The ideal solution is to have varying modes of operation, with the aim of powering down as much of the design as possible for the given application, reducing both dynamic and leakage power. In standby mode, for example, only the minimum amount of logic should be kept powered, on a low-voltage domain, to bring the device out of this state on demand from some external event; the device then moves through the modes of operation to the required performance level.
While this solution provides the maximum saving, it also carries the largest overhead in terms of complexity. The considerations include on- or off-chip switching regulation, power domain isolation, the performance impact of delays associated with switching and the resumption of stable power, and the potential loss of state in flip-flops and memory, which requires save and restore routines, along with all the additional associated test and verification requirements. In developing this type of implementation, consideration needs to be given to all of the above items and to whether the periods of time in which power can be removed can realistically be managed.
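One way to judge whether a set of operating modes is worth its complexity is to weight the power of each mode by the fraction of time spent in it. The sketch below assumes this simple time-weighted model; the mode names come from the text, but every power figure, time fraction and the battery capacity are hypothetical, and the cost of switching between modes is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    power_mw: float       # hypothetical power drawn in this mode
    time_fraction: float  # hypothetical share of time spent in this mode


def average_power_mw(modes):
    """Time-weighted average power across all operating modes."""
    total = sum(m.time_fraction for m in modes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(m.power_mw * m.time_fraction for m in modes)


def battery_life_hours(capacity_mwh, modes):
    """Battery life for a given capacity, ignoring mode-switching overhead."""
    return capacity_mwh / average_power_mw(modes)


if __name__ == "__main__":
    profile = [
        Mode("active", 200.0, 0.10),
        Mode("idle", 20.0, 0.30),
        Mode("standby", 1.0, 0.60),
    ]
    print(f"average power: {average_power_mw(profile):.1f} mW")
    print(f"battery life:  {battery_life_hours(3000.0, profile):.0f} h")
```

With numbers like these the active mode dominates the average even at a 10% duty cycle, which is why the time that can realistically be spent in the lower modes matters as much as how low their power is.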
Figure 3 — Example of multiple-domain structure
Simpler implementations containing multiple voltage domains, but no switching or scaling, still gain some of the benefit of the quadratic voltage effect. In these cases consideration needs to be given to partitioning the design into high-performance, higher-voltage domains and low-performance, lower-voltage domains.
The next level of consideration, after defining the voltage partitioning and scaling, should be the system-level clock architecture and the methods of controlling frequency and the associated switching levels. While it doesn't address power consumption through leakage, this method goes a long way towards reducing the dynamic power consumption of the device. It is not uncommon for the clock distribution and clocked elements to consume over 50% of the total power of the device [5]. Note that frequency may be scaled in direct proportion to any voltage scaling, if this is implemented to meet the required system-level performance. In a given idle or sleep mode, all the non-dependent modules can be gated off completely from the root of the clock tree, eliminating the switching in both the clock distribution and the logic within these parts of the design. The use of multiple clock domains, frequency scaling and frequency phasing to reduce peak power can all be managed from the central level of distribution.
Figure 4 — Clock phasing
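The effect of clock phasing on peak power can be illustrated with a toy model. The sketch below assumes each clock domain draws a single current pulse at its active edge and that a clock period is divided into a few discrete time slots; the pulse amplitudes and the slot model are hypothetical and stand in for a real current waveform.

```python
def current_profile(pulse_amps, offsets, slots_per_period):
    """Sum the per-domain current pulses into each time slot of one period.

    pulse_amps: peak current drawn by each clock domain at its active edge
    offsets:    slot in which each domain's active edge falls
    """
    if len(pulse_amps) != len(offsets):
        raise ValueError("need one offset per clock domain")
    profile = [0.0] * slots_per_period
    for amps, offset in zip(pulse_amps, offsets):
        profile[offset % slots_per_period] += amps
    return profile


if __name__ == "__main__":
    domains = [1.0, 0.5, 0.75, 0.25]  # hypothetical pulse amplitudes (A)
    aligned = current_profile(domains, [0, 0, 0, 0], 4)
    phased = current_profile(domains, [0, 1, 2, 3], 4)
    print("aligned peak:", max(aligned), "total:", sum(aligned))
    print("phased peak: ", max(phased), "total:", sum(phased))
```

In this model the total charge per period is the same in both cases; phasing only spreads it out, so it lowers the peak demand on the power distribution network rather than the average power.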
The clock architecture is generally controlled through the software interface available via the processor. However, dynamic hardware-controlled switching for on-demand activation can also be implemented, for example in the case of a decoder function that is required to support bursts of data traffic. These types of features reduce both the amount of software sequencing support needed and the system latency.
In all the above aspects of implementing the defined system clock architecture, detailed consideration is required to avoid all forms of clock glitching and to account for the additional overheads of multiple domains in terms of functional test, skew control, design for test and timing closure.
Once the design architecture has been captured, the RTL code can be targeted towards a low-power synthesis flow, which automatically trades off power alongside the generally accepted performance and area constraints. The main features targeted by the tools include multiple threshold leakage optimization, multiple supply voltage domains, local latch-based clock gating, de-clone and re-clone restructuring, operand isolation, and gate-level power optimization.
For multiple threshold leakage optimization, generally up to three versions of the targeted library are used: low Vth (fast, high leakage), standard Vth, and high Vth (slower, low leakage). The tool aims to use as many of the high-threshold cells as possible while maintaining the timing constraints, using the low-threshold cells only for critical paths. Selecting and targeting the appropriate library and characterization for the application's performance requirements is a key consideration that should be addressed early in the design process.
Figure 5 — Multiple voltage threshold leakage optimization
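The idea behind multiple threshold optimization can be sketched as a greedy assignment on a single timing path. This is not how any particular synthesis tool works: real tools operate on the full timing graph, and the delay and leakage multipliers below are hypothetical. The sketch starts with every cell at high Vth and speeds up the slowest cell one step at a time until the path meets the clock period.

```python
# Hypothetical library flavors: (delay multiplier, relative leakage).
FLAVORS = {"HVT": (1.3, 1.0), "SVT": (1.0, 10.0), "LVT": (0.8, 100.0)}
ORDER = ["HVT", "SVT", "LVT"]  # slowest and least leaky first


def path_delay(base_delays, flavors):
    """Path delay given nominal (SVT) cell delays and the chosen flavors."""
    return sum(d * FLAVORS[f][0] for d, f in zip(base_delays, flavors))


def total_leakage(flavors):
    """Relative leakage of the chosen cells."""
    return sum(FLAVORS[f][1] for f in flavors)


def assign_vth(base_delays, clock_period):
    """Greedy sketch: use high-Vth cells unless timing forces a faster one."""
    flavors = ["HVT"] * len(base_delays)
    while path_delay(base_delays, flavors) > clock_period:
        candidates = [i for i, f in enumerate(flavors) if f != "LVT"]
        if not candidates:
            raise ValueError("path cannot meet timing even with all LVT cells")
        # Speed up the cell that currently contributes the most delay.
        slowest = max(candidates,
                      key=lambda i: base_delays[i] * FLAVORS[flavors[i]][0])
        flavors[slowest] = ORDER[ORDER.index(flavors[slowest]) + 1]
    return flavors


if __name__ == "__main__":
    for period in (10.0, 3.5, 2.5):
        chosen = assign_vth([1.0, 1.0, 1.0], period)
        print(period, chosen, "relative leakage:", total_leakage(chosen))
```

A path with plenty of slack stays entirely on high-Vth cells, while a tighter period pulls in faster cells only as far as timing requires, at a steep cost in leakage for each one.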
To support multiple voltage domains, additional characterized libraries for the targeted voltages are required. These may also include multiple threshold variants. The power savings follow from the quadratic voltage scaling effect. Along with a consistent supporting tool flow, managing the domain partitioning requires careful design consideration in the early stages of development, and close integration between front-end design and layout to support the methodology.
If enabled, local latch-based clock gating will generally insert library-specific clock gating latches wherever possible before groups or banks of associated flops. The effect is to eliminate unnecessary clock toggles at those flops. The user can generally define the range of flops to be driven from a single clock gate to avoid unnecessary imbalance in the clock distribution network. Each clock gating cell provides a functional enable and a test-activated enable for the clock path, with observability logic optionally generated automatically if it is needed to reach the target ATPG coverage.
Figure 6 — Clock gating implementation
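The behavior of a latch-based clock gate can be modeled in a few lines. The sketch below assumes the common arrangement of a latch that is transparent while the clock is low, followed by an AND gate; the function names and sampled waveforms are hypothetical, and library cells may differ in detail.

```python
def gated_clock(clk, enable, test_enable=False):
    """Behavioral model of a latch-plus-AND clock gate on sampled waveforms.

    The latch follows (enable OR test_enable) while the clock is low and
    holds its value while the clock is high, so an enable that changes
    during the high phase cannot produce a partial pulse on the output.
    """
    latched = False
    out = []
    for c, en in zip(clk, enable):
        if c == 0:  # latch is transparent during the low phase
            latched = bool(en) or test_enable
        out.append(c & int(latched))
    return out


def rising_edges(wave):
    """Count 0 -> 1 transitions, i.e. clock edges seen by the flops."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)


if __name__ == "__main__":
    clk = [0, 1, 0, 1, 0, 1, 0, 1]
    en = [1, 1, 0, 0, 0, 1, 1, 1]  # enable rises mid-pulse at sample 5
    print("gated clock:", gated_clock(clk, en))
    print("edges without gating:", rising_edges(clk))
    print("edges with gating:   ", rising_edges(gated_clock(clk, en)))
```

The test-activated enable simply forces the latch open, so with `test_enable=True` the gated clock matches the source clock and the flops remain controllable during scan test.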
In relation to clock gating, an additional step can be added that uses physical data to restructure the clock gating, further reducing power and area. This is achieved through relative placement of the registers and gating cells, reducing fragmentation and replication. Where possible, the original logical partitioning of clock gating cells to flops is restructured to provide a structure better suited to physical layout.
Figure 7 — Clock gate restructuring
The complete process has a number of steps. In the pre-layout design the local clock gating is de-cloned to a higher common level, reducing area and creating a cleaner starting point for clock tree synthesis (CTS). Then during the detailed placement/CTS phase, local clock gating cells can be re-cloned to provide the optimal required clock tree.
The operand isolation step automatically identifies and shuts down data path elements and hierarchical combinatorial modules with a common control signal. The tool only partially commits to the restructuring, to allow optimal timing and power tradeoffs.
Figure 8 — Operand isolation
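Operand isolation can be illustrated by counting bit toggles at the inputs of a data path element. The sketch below assumes AND-style isolation, where operands are forced to zero in cycles when the result is not used; the operand values and enable pattern are hypothetical, and input toggles are used only as a rough proxy for switching power.

```python
def bit_toggles(values):
    """Total number of bits that change between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


def isolate(operands, enables):
    """AND-style isolation: pass the operand only when its result is used."""
    return [x if en else 0 for x, en in zip(operands, enables)]


if __name__ == "__main__":
    operands = [0b1010, 0b0101, 0b1111, 0b0000, 0b1010]
    enables = [1, 0, 0, 0, 1]  # result needed only in the first and last cycle
    print("toggles without isolation:", bit_toggles(operands))
    print("toggles with isolation:   ", bit_toggles(isolate(operands, enables)))
```

Forcing the operands to zero adds a transition each time isolation is entered or left, and the isolation gates add delay on the data path, which is one reason a tool may commit to the restructuring only where timing and power both benefit.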
Classical gate-level optimization resizes cells, performs pin swapping, removes unnecessary buffering, merges gates, adds buffers to reduce slew and restructures logic to provide the best possible power optimization. However, the majority of these steps are also repeated in the physical domain with real placement and wire length constraints.
Figure 9 — Gate-level optimization
Comparative numbers between a baseline flow and a low-power synthesis flow employing the above techniques show that an embedded processor device of approximately 650K gates in a 90 nm technology can achieve savings of greater than 40% in both dynamic switching power and leakage power [6].
Summary
The trends outlined in this paper clearly show that the on-chip power challenges associated with designing at the latest technology nodes are here to stay. The available data indicates that low-power design considerations will become an overriding concern for architects and designers of the next generation of SoCs. This will require a corresponding step forward in the design tools and methodologies needed to exploit the full potential of future technology.
In line with proposed silicon technology advances, engineering productivity needs to increase to further resolve and reduce the on-chip power challenge. This can only be achieved through new methodologies, tools and design concepts. These may involve a move to more power-conservative clock structures, such as low-swing flip-flops, double-edge-triggered flip-flops or conditional flip-flops, giving clock on demand, where the internal clock is only activated when the input data will change the output data [7]. Alternatively, moves to more radical approaches, such as completely asynchronous design techniques or more intelligent clocking structures, will be required to support the more traditional concepts. Tools to support these new methodologies will be required, fully integrating the process from high to low levels of design abstraction and analysis.
References
1. Intel web sourced data, http://www.intel/research
Peter Bennett is a Chief Consulting Engineer for Cadence Engineering Services, responsible for all technical aspects associated with customer design engagements, specifically targeting designs involving complex SoCs intended for low power applications. He was previously with the Mainframe Design Group of International Computers Ltd (ICL) for over 18 years.