Reducing Power Hot Spots through RTL optimization techniques

By Supriya Unnikrishnan and Sujith Mathew Iype (Ignitarium Technology Solutions)

With the tremendous increase in the intelligence added to consumer and professional devices, applications are becoming more data-centric and computation-intensive. From the IC design perspective, this intensifies the existing challenge of the power versus area trade-off. Power analysis is not a new term in the semiconductor design cycle, but opinions differ on which phase of the design cycle is the most efficient for it. The architecture phase is too early for power estimation, and the physical design phase is too late. Of late, there is a tendency to analyze power hot spots during the RTL phase of the project. RTL-based power analysis is faster and easier to perform and has a shorter iteration time than analysis at later stages. This article addresses some of the power optimization techniques applied at the RTL level.

A radical increase in battery-driven consumer appliances has made power optimization an essential feature of most systems on chip (SoCs). In the early days of the VLSI industry, power analysis was considered a back-end activity in which front-end design interaction and contribution were minimal. But as chip complexity increased, with multiple power and voltage domains, it became mandatory to move power analysis to the front-end phase to ensure proper estimation and optimization. Power targets are now so competitive that optimization at the back-end phase alone falls short of the requirement. Also, dynamic power depends heavily on the input vectors driven into the SoC; hence, analysis with functional verification vectors as input is effectively mandatory. Thus, the industry started doing power analysis at the RTL phase.

There is scope for power optimization at different stages of ASIC design.
System partitioning into power and voltage domains is done at the architecture phase. Defining power modes such as active and hibernation allows optimization even with a single voltage domain. Power gating all circuitry except the wake-up logic while the processor is in hibernation mode reduces power wastage. These are some of the conventional approaches to power reduction. In multi-core processor designs, multiple voltage domains allow the supply voltage of each core to be controlled based on its workload. A core supplied with a higher voltage can run at a higher frequency, whereas a core supplied with a lower voltage runs at a lower frequency. Memory organization and module-level clock gating are other areas where architecture-level optimization is applied.

Power reduction during the synthesis phase comes from gate transistor sizing and from cell merging to reduce net switching activity. Another approach is to assign a high-switching net to a pin with lower capacitance and a low-switching net to a logically symmetric pin with higher capacitance. Synthesis tools enable clock gate insertion by converting data enables to clock enables. Another technique, applied at the back end, optimizes leakage power by mapping the critical and non-critical paths of the design to cells with different threshold voltages.

Power optimization at the RTL level mainly focuses on reducing signal activity at the register level. This article primarily focuses on RTL optimization, which captures power control at a more granular level.

II. POWER AWARE RTL

In general, exploring RTL power optimization includes close monitoring of the following aspects of the design.
This section addresses some optimization techniques, citing user scenarios and coding examples.

A. Clock Gating

Clock gate insertion at the block level is a prevalent method of reducing power. Clock gate insertion at the register level, however, is employed only when power constraints are very stringent, because inserting clock gates at such a granular level costs area. There are two ways to get clock gates at the register level. The first is to write the RTL so that the synthesis tool infers the clock gate automatically. The second is to configure the synthesis tool manually to insert clock gates for selected registers. In a complex design the second option is not feasible, so the RTL is written in a way that lets the tool infer the clock gate from the code.

Consider the generation of the fifo_wr_err signal, which is asserted when the FIFO is full and a write is launched. The code in listing 1 does not infer a clock gate, whereas the code in listing 2 does.

Listing 1. Code without CG Inference
always @(posedge clk or negedge reset)

The code in listing 2 infers a clock gate with fifo_wr_en as its enable. The register clock then toggles only when fifo_wr_en is asserted.

Listing 2. Code with CG Insertion
always @(posedge clk or negedge reset)

1) Clock Gating Efficiency: Inserting clock gates without investigating the clock gating efficiency may increase the power figures. The following factors estimate clock gating efficiency
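To make the FIFO write-error example above concrete, the following is a minimal sketch of the two coding styles. It is not the original listing 1 or listing 2. The module name, the port names `fifo_full` and `fifo_wr_en`, and the two output names are assumptions made for illustration, and whether a given synthesis tool actually infers a clock gate also depends on its settings (for example the minimum register bank width).

```verilog
// Hypothetical module: all names are illustrative assumptions,
// not taken from the original listings.
module fifo_wr_err_gen (
    input  wire clk,
    input  wire reset,             // active-low asynchronous reset
    input  wire fifo_full,
    input  wire fifo_wr_en,
    output reg  fifo_wr_err_free,  // style A: written on every clock edge
    output reg  fifo_wr_err_gated  // style B: written only under an enable
);

    // Style A: the register is loaded on every clock edge. There is no
    // "hold" branch, so the tool has no enable to turn into a clock gate.
    always @(posedge clk or negedge reset) begin
        if (!reset)
            fifo_wr_err_free <= 1'b0;
        else
            fifo_wr_err_free <= fifo_full & fifo_wr_en;
    end

    // Style B: the register keeps its value unless fifo_wr_en is high.
    // The implied feedback path gives the tool an enable (fifo_wr_en)
    // that it can convert into a clock gate.
    always @(posedge clk or negedge reset) begin
        if (!reset)
            fifo_wr_err_gated <= 1'b0;
        else if (fifo_wr_en)
            fifo_wr_err_gated <= fifo_full;
    end

endmodule
```

The two styles are not functionally identical: in style B the error flag holds its value until the next write attempt, whereas in style A it is re-evaluated every cycle. A design that adopts style B has to accept, or otherwise handle, that behavior.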
Listing 3 shows an example of efficient clock gating, in which the clock is enabled only when the input of the register changes.

Listing 3. Improved Clock Gating Efficiency
always @(posedge clk or negedge reset)

2) Clock Gating Trade-off: When we strive for 100 percent clock gating efficiency, the area of the system increases drastically because each register needs its own clock gate. To avoid this, we can accept a trade-off and use a weaker enable for clock gating. Here, the enables of multiple registers are combined, even though this causes some redundant toggling at the flip-flop inputs. The recommended default value for this trade-off is 3-4, which means that a common enable is shared across 3-4 registers at the expense of reduced clock gating efficiency. Listings 4 and 5 show how this can be implemented for the FIFO example. The enables for write error and read error generation can be combined to reduce the area cost.

Listing 4. Combined Clock Gating Example 1
always @(posedge clk or negedge reset)

Listing 5. Combined Clock Gating Example 2
always @(posedge clk or negedge reset)

B. FSM based Control

Fig. 1. FSM Based Enable Generation

Depending on the design, FSM-based control signals are used to gate downstream or upstream paths.

Example 1: An FSM is used to control the transmit path and the receive path of a device that is enabled either in transmit mode or in receive mode. The signals generated from the FSM state can be used to gate all logic for transmission and reception. Listing 6 shows how the enables are generated based on the FSM control.

Listing 6. Enable Generation based on FSM
assign transmit_cg_en = ~state_tx[IDLE];

Example 2: In the above scenario, consider that the device is plugged into a broadcast serial bus on which all devices must listen to the bus for a particular duration. Once the listening period is over and it is confirmed that the transaction is not addressed to the device, we can stop external bus transitions from propagating to the internal logic, as shown in figure 1.
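The FSM-based enable generation of Example 1 can be sketched as follows. Only the `transmit_cg_en` assignment comes from listing 6. The one-hot encoding, the state count `NSTATE`, the `state_rx` vector, `receive_cg_en`, and the behavioral clock gate model are assumptions added for illustration; a real flow would normally let synthesis insert the clock gate or instantiate a library integrated clock gating cell.

```verilog
// Behavioral model of a latch-based clock gate, intended for simulation
// only (assumption: real designs use a library clock gating cell).
module cg_cell_model (
    input  wire clk,
    input  wire en,
    output wire gclk
);
    reg en_lat;

    // The latch is transparent while clk is low, so a change on en
    // cannot produce a glitch on gclk while clk is high.
    always @(clk or en)
        if (!clk)
            en_lat <= en;

    assign gclk = clk & en_lat;
endmodule

// Hypothetical wrapper: derives per-path enables from one-hot FSM state
// vectors and gates the transmit and receive clocks with them.
module trx_clock_control #(
    parameter IDLE   = 0,  // index of the IDLE bit in the one-hot state
    parameter NSTATE = 4   // assumed number of states
) (
    input  wire              clk,
    input  wire [NSTATE-1:0] state_tx,
    input  wire [NSTATE-1:0] state_rx,
    output wire              tx_clk,
    output wire              rx_clk
);
    wire transmit_cg_en;
    wire receive_cg_en;

    // A path is clocked only while its FSM is outside the IDLE state.
    assign transmit_cg_en = ~state_tx[IDLE];
    assign receive_cg_en  = ~state_rx[IDLE];

    cg_cell_model u_tx_cg (.clk(clk), .en(transmit_cg_en), .gclk(tx_clk));
    cg_cell_model u_rx_cg (.clk(clk), .en(receive_cg_en),  .gclk(rx_clk));
endmodule
```

In this sketch the state registers themselves must stay on the ungated clock, otherwise the FSM could never leave IDLE once its clock has been stopped.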
The start and stop detection of the bus protocol is constantly active. In contrast, the Serial In Parallel Out (SIPO) logic is active only during the listening period if the transaction is not addressed to the device.

C. Data Path Operator

Data path operator blocks such as multipliers can see unnecessary toggling at their inputs even when they are not enabled. The following techniques reduce the power dissipation due to this.
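One technique of this kind is to force the operator inputs to a constant while the operator is not enabled, so that upstream data changes do not ripple through the multiplier array. The sketch below illustrates this with AND gating. The module name, the `WIDTH` parameter, and the ports other than `mul_in1`, `data_in1` and `mul_en` are assumptions made for illustration, not the original code.

```verilog
// Hypothetical example of gating the inputs of a data path operator.
module gated_multiplier #(
    parameter WIDTH = 8              // assumed operand width
) (
    input  wire               mul_en,    // high when the product is needed
    input  wire [WIDTH-1:0]   data_in1,
    input  wire [WIDTH-1:0]   data_in2,
    output wire [2*WIDTH-1:0] mul_out
);
    // While mul_en is low both operands are held at zero, so toggling on
    // data_in1 and data_in2 does not reach the multiplier.
    wire [WIDTH-1:0] mul_in1 = data_in1 & {WIDTH{mul_en}};
    wire [WIDTH-1:0] mul_in2 = data_in2 & {WIDTH{mul_en}};

    // The operands are extended to the width of mul_out, so the full
    // 2*WIDTH-bit product is kept.
    assign mul_out = mul_in1 * mul_in2;
endmodule
```

AND gating still causes one burst of activity each time `mul_en` changes, because the operands switch between zero and the live data. Holding the previous operand values in enabled registers or latches avoids that transition at the cost of extra storage.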
Listings 7 to 9 depict the code to realize this.

Listing 7. Gating Data Operator Input Toggling Method 1
always @(posedge clk or negedge reset)

Listing 8. Gating Data Operator Input Method 2
assign mul_in1 = data_in1 &

Listing 9. Gating Data Operator Input Method 3
always@(mul_en) begin

D. Reducing activity at Combinational Logic

The power dissipated in combinational logic can be controlled by avoiding unnecessary toggling at its inputs. A multiplexer, a major building block of combinational logic, serves as an example here. In the example circuit given in figure 2, a shared data memory is accessed by two packetizers. Power is clearly wasted here, since the input to packetizer 2 toggles when packetizer 1 is accessing data, and vice versa. A more power-optimized design gates the input to the MUX in packetizer 1 when packetizer 2 is accessing the data memory, as shown in figure 3.

Fig. 2. Toggling at Mux Input

In the above example, if the memory can be split into smaller units, individual memory blocks can be gated, which makes the power gating more effective. The disadvantages are routing congestion and duplication of the DFT logic for the memories.

Fig. 3. Gated Toggling at Mux Input

III. RTL POWER ANALYSIS TOOLS

The ASIC design flow is being updated with RTL analysis tools that perform power analysis and reduction at an early stage, so that power budgeting is complete as early as possible. The Joules RTL power solution from Cadence and PowerArtist from Ansys are such tools; they do power analysis and power reduction early in the design cycle.

The RTL power analysis tool flow is shown in figure 4. RTL design files are elaborated and analyzed for clock and data activity using simulation activity files given in VCD, SAIF or FSDB format. The inputs to the power estimation stage include a library with power information (including Vt-based classification), capacitance estimation based on wire load and SPEF models, and the clock definition.
The clock definition information includes the SDC file of the design, the definition of the clock gating cell, the minimum bit width for clock gating, and the clock buffers to be used for root, branch, and leaf nodes. The power analysis results are reported in two categories [2] [3]. 'Power Reduction' looks at the implementation and analyzes the power saving, power penalty, and area impact of different implementation techniques. 'Power Linter' recommends a means of recovering the wasted power, since the method of recovery depends heavily on the design and the test pattern.

A. Power Reduction

By default, the synthesis tool uses locally available enables to clock gate register banks. This is useful but not enough for complex designs. Clock gating based on logic observability analysis generates new enables from downstream or upstream logic. There are situations in a design where the output of a register (or the output of its downstream logic) is not used, that is, not observable, downstream. These situations are used to clock gate the register.

B. Power Linters

Power linters identify power hot spots by analyzing the RTL against a checklist. The tool can be configured to examine the RTL for many guidelines including
Fig. 4. RTL Power Analysis

IV. EXPERIMENTAL ANALYSIS

A representative design, shown in figure 5, is used for the experimental analysis. The design contains the various circuit components discussed in section II. RTL is developed for the design by incrementally enabling the different power optimization techniques discussed in that section. The RTL power analysis tool Joules from Cadence is used for the analysis.
Fig. 5. Block Diagram

Experimental studies are conducted in two phases.
The phase 1 study presents the clock gate (CG) inference done by the tool based on vector activity analysis. The user scenarios are simulated with different toggling rates to generate the vector activity database. If CG inference is enabled, the tool explores additional clock gating opportunities even in the absence of multiplexer feedback into the register. The power values reported by the tool are captured in figure 6. The CG min and CG max values indicate the minimum and maximum number of flops that can be driven by one CG cell. Even if the RTL is written without the multiplexer feedback into the register that is normally used to infer a CG, the tool inserts clock gating cells at global points. In the design under consideration, the tool inserts clock gating at the following points when CG inference is enabled.
The trade-off between power saving and area overhead is presented with the help of a Figure of Merit (FoM), defined as the product of power and area (Power × Area). The optimal min-max value for the system is identified through this analysis, for a constant system frequency and switching activity. In some cases a top-down approach can be applied, in which a top-level min/max is defined and this value is overridden in the sub-blocks. As the figure shows, there is an optimal CG min and CG max value beyond which the FoM does not improve.
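The FoM comparison can be written out symbolically to show when a clock gating configuration is worth its area. The baseline values $P_0$ and $A_0$, the fractional power saving $s$, and the fractional area increase $a$ are symbols introduced here for illustration; they do not come from the original analysis.

```latex
\begin{align}
\mathrm{FoM} &= P \times A \\
P_1 &= (1 - s)\,P_0, \qquad A_1 = (1 + a)\,A_0 \\
\frac{\mathrm{FoM}_1}{\mathrm{FoM}_0} &= (1 - s)(1 + a) \\
\frac{\mathrm{FoM}_1}{\mathrm{FoM}_0} < 1 &\iff s > \frac{a}{1 + a}
\end{align}
```

As a purely hypothetical illustration, a configuration that adds 25 percent area ($a = 0.25$) improves the FoM only if it saves more than $0.25 / 1.25 = 20$ percent of the power.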
Fig. 6. Analysis of CG insertion by tool

In the second phase of the experimental studies, power-optimized RTL is provided as input to the tool as described below.

1) The incremental changes for optimization are made in the RTL as per the guidelines captured in section II.
2) Combinational activity is reduced following the guidelines in sections II C and II D.
3) Optimization is done based on the use case of the system and the metrics under consideration. An inefficient enable can inflate the area without improving the power values. Examples of some of the optimization techniques applied are cited below.
assign enable = reg_bus[7:0] ^ nxt_reg_bus[7:0];
assign enable = reg_bus[31:0] ^ nxt_reg_bus[31:0];

ANALYSIS OF CG INFERENCE BY RTL

In addition to these, tool reports can be used to get more input on the different power hot spots. Observability Don't Care (ODC) is one such technique; it derives new clock gate enables by analyzing the downstream logic for observability of the register output. There are situations in a design where the output of a register (or the output of its downstream logic) is not used, that is, not observable, downstream [2]. The ODC report (report odc) gives the wasted ODC power and suggestions for enable generation. Together with analysis based on the system use case, these techniques can be combined effectively to generate a power-optimal design.

V. SUMMARY

The target power metrics for low-power chips demand that power optimization be enabled at all stages of chip design. A simple change in the RTL can save a considerable amount of dynamic power and further reduce the time and effort spent in later design stages to reach optimal values. For a low-power design, RTL developed in adherence to the power optimization guidelines contributes significantly to the early closure of power budgeting.

REFERENCES

[1] Early Power Closure Using a Design for Power Methodology: Ansys White paper
[2] Power Artist Reference Manual Software Release 2019 R2.5
[3] Joules User Guide for Legacy UI
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved.