Low Power Design Guide


3 Low Power Design Methodologies

This chapter gives the designer practical advice for low power design. It must not be understood as a complete implementation guide. It is an overview of known techniques gathered from [ 1 ] - [ 8 ], intended to give an idea of which methodologies are applicable to a design flow. Readers should fall back on the cited literature for more details, or request more information from OFFIS if needed.
Chapter 3.1 starts at a low level of the design space, the process technology. Chapter 3.2 addresses the reduction of switching activity and presents promising techniques for it. Chapter 3.3 describes procedures that affect the architectural level and focuses on power down modes. Finally, chapter 3.4 handles the system level and introduces the advantage of SoC integration for power dissipation. Note that the higher the level of abstraction at which a methodology is applied, the greater the potential power savings (compare with figure 2).


figure 2: Power Reduction Opportunities


3.1. Adapting Process Technology

This chapter is included to give a complete picture of low power techniques. In practice it is of little relevance for designers who have no influence on the technology, since the described effects arise at a level of design abstraction that is outside their scope. Chapter 3.1.1 covers the reduction of capacitance and chapter 3.1.2 looks at leakage power. Chapter 3.1.3 handles the promising methodology of lowering the supply voltage. Power savings through a higher density of integration are described in chapter 3.1.4.

3.1.1. Reducing Capacitance

Equation 5 describes Cout as the sum of three capacitances:

    equation 5:    C_{out} = C_{fo} + C_w + C_p

Cfo is the input capacitance of the fan-out gates, Cw the wiring capacitance and Cp the parasitic capacitance. For deep sub-micron technologies Cw is the dominant component and unfortunately also the hardest one to estimate, because complex effects like cross-talk have to be considered. Only the provider of the technology, the fab, has a significant influence on this parameter.


3.1.2. Reducing Leakage Power

Ordinarily Pdynamic outweighs Pleakage, but the ratio might tip over if the design is idle most of the time and switching activity is low. A payphone is an application for which this might be true. Since only the provider of a technology can influence Pleakage, it is not discussed further here.


3.1.3. Reducing Supply Voltage

Equation 2 makes it apparent that reducing the supply voltage is the most promising way of saving power, since its influence is quadratic. The penalty for this technique is a lower switching speed, as can be deduced from equation 6, taken from [ 8 ].

   equation 6:    T_d = \frac{C_{out} \cdot V_{dd}}{h \cdot (W/L) \cdot (V_{dd} - V_t)^2}

Td is the delay of a CMOS inverter, Cout is its output capacitance (equation 5), h is a technology-dependent constant, W and L are the transistor width and length respectively, and Vt is its threshold voltage.
Usually a circuit is designed to meet certain timing constraints, which will be violated when the supply voltage is reduced. The solution is called "architecture-driven voltage scaling". The level of concurrency is raised by adding more hardware to the design. Typical methodologies are pipelining and parallelization. This relaxes the timing restrictions on each unit, so that the slower gates still meet them. Although there is more hardware consuming power, the overall power dissipation is reduced because of the quadratic influence of Vdd.
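The trade-off can be made concrete with a small numeric sketch. It assumes the usual first-order models, which may differ in detail from the guide's own equations: a gate delay proportional to Vdd / (Vdd - Vt)^2 and a dynamic power proportional to C * Vdd^2 * f. The voltages, the duplication factor and the 10% capacitance overhead are hypothetical values chosen only for illustration.

```python
def delay(vdd, vt):
    """Relative gate delay, assuming Td ~ Vdd / (Vdd - Vt)^2 (constants dropped)."""
    return vdd / (vdd - vt) ** 2


def scaled_vdd(vdd_ref, slowdown, vt, tol=1e-9):
    """Lowest supply voltage whose delay is at most `slowdown` times the delay at vdd_ref."""
    target = slowdown * delay(vdd_ref, vt)
    lo, hi = vt + 1e-6, vdd_ref      # delay decreases monotonically on this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if delay(mid, vt) > target:
            lo = mid                 # too slow, the voltage must be higher
        else:
            hi = mid                 # fast enough, try a lower voltage
    return hi


def dynamic_power(cap, vdd, freq, activity=1.0):
    """Dynamic power, assuming P = activity * C * Vdd^2 * f."""
    return activity * cap * vdd ** 2 * freq


VDD_REF, VT = 3.3, 0.7               # hypothetical supply and threshold voltages
vdd_par = scaled_vdd(VDD_REF, slowdown=2.0, vt=VT)
p_ref = dynamic_power(cap=1.0, vdd=VDD_REF, freq=1.0)
# Two units at half the clock rate; 10% extra capacitance is an assumed overhead.
p_par = dynamic_power(cap=2.2, vdd=vdd_par, freq=0.5)
print(f"Vdd: {VDD_REF:.2f} V -> {vdd_par:.2f} V, relative power: {p_par / p_ref:.2f}")
```

Under these assumptions the duplicated design delivers the same throughput at a clearly lower supply voltage, and the quadratic voltage term outweighs the added capacitance.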


3.1.4. Higher Density of Integration

By shrinking the feature size of a circuit, its capacitances and therefore its dynamic power dissipation are reduced. Normally the technology is fixed and the designer has no choice of process, so this guide does not go into details.


3.2. Reducing Switching Activity

Chapter 2.1 explains the importance of Pdynamic for Pavg. To reduce power dissipation effectively, low power methodologies should target this source. Since the typical designer has no influence on Vdd and only a minor one on Cout, switching activity is the remaining lever.
Methodologies for reducing the switching activity are given in the following chapters. Glitches are unnecessary transitions and are covered in chapter 3.2.1. Chapter 3.2.2 handles the minimization of operations in algorithms. In most cases several different calculations compute the same function, and the one that contains the fewest or the cheapest operations, in terms of power consumption, should be chosen for a low power design. Chapter 3.2.3 treats busses and shows several styles of data representation that result in lower signal activity. Chapter 3.2.4 shows how switching activity can also be reduced by intelligent scheduling and binding.


3.2.1. Minimization of Glitches

Gate delays are often assumed to be zero to simplify estimation. This leaves out an important aspect of reality, glitch power. In a static logic gate, the output or internal nodes can switch before the correct logic value is stable. Imagine an AND gate whose two inputs have different delays, and the input transition 01 -> 10. With zero delay the output would stay at a stable logic zero. In our example, however, the first input has the lower delay, so both inputs are briefly one and the output is forced to a temporary logic one before it settles on zero again. The power lost during this unnecessary switching activity is called glitching power loss.
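The AND-gate example can be reproduced with a minimal discrete-time sketch. The function names, the unit time steps and the chosen path delays are illustrative assumptions, not part of any tool.

```python
def and_gate_waveform(a_old, a_new, b_old, b_new, delay_a, delay_b, steps=6):
    """Output of a 2-input AND gate, sampled once per time step.

    Both inputs switch at t = 0, but each change reaches the gate only
    after its own path delay (given in time steps).
    """
    wave = []
    for t in range(steps):
        a = a_new if t >= delay_a else a_old
        b = b_new if t >= delay_b else b_old
        wave.append(a & b)
    return wave


def transitions(wave):
    """Number of output changes between consecutive samples."""
    return sum(x != y for x, y in zip(wave, wave[1:]))


# Input transition 01 -> 10: a rises while b falls.
balanced = and_gate_waveform(0, 1, 1, 0, delay_a=1, delay_b=1)
skewed = and_gate_waveform(0, 1, 1, 0, delay_a=1, delay_b=3)

print(balanced, transitions(balanced))   # [0, 0, 0, 0, 0, 0] 0
print(skewed, transitions(skewed))       # [0, 1, 1, 0, 0, 0] 2
```

With balanced delays the output never moves. With the faster first input it makes two transitions that carry no information but still charge and discharge the output capacitance.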
The influence of glitch power is illustrated in figure 3. If glitch power is left out, the left architecture appears to be the less power consuming choice. Once glitches are taken into account, the alternative on the right side with two adders turns out to be the more economical design.


figure 3: Alternative architectures that implement the same function: Effect of glitching ( [ 4 ], p146 )

The glitches at a signal node depend on its logic depth. Generally, nodes that are logically deeper are more prone to glitches. One reason is that the transition arrival times are spread further apart due to delay imbalances. In addition, a logically deep node is typically affected by more switching inputs and is therefore more susceptible to glitches.
One way to reduce glitches is to shorten the depth of combinational logic by adding pipeline registers. This is especially effective for data-path components such as multipliers and parity trees.


3.2.2. Minimization of the Number of Operations

Minimizing the number of operations to perform a given function is critical to reducing the overall switching activity. To illustrate the power trade-offs that can be made at the algorithmic level, consider the problem of compressing a video data stream using the vector quantization (VQ) algorithm. Detailed information on VQ can be found in [ 10 ]. The basic operation is shown in equation 7:

   equation 7:    Distortion_i = \sum_j (X_j - C_{ij})^2

where Xj are the elements of the input vector and Cij are the elements of the i-th codebook vector. Equation 7 represents the full search methodology. Two more algorithms are discussed in [ 8 ]: tree search and differential codebook tree search. Table 1 shows the number of operations needed for each algorithm.


Algorithm                 # of Memory Accesses  # of Multiplications  # of Adds  # of Subs
Full Search               4096                  4096                  3840       4096
Tree Search               256                   256                   240        264
Differential Tree Search  136                   128                   128        0

table 1: Computational complexity of VQ encoding algorithms.

It is obvious that 4096 subtractions will use a lot more power than zero. The numbers of the other operations drop by a factor of about 30 between full search and differential tree search.
The example is not discussed in more detail because the optimization techniques used in it are not generally transferable to other applications. Each algorithm has to be explored in its own domain, which requires manual engineering effort. The example shows that this effort is well spent.
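The full search row of table 1 can be reproduced by counting operations in a straightforward encoder. This is only a sketch: it assumes the distortion is the sum of squared differences between Xj and Cij, and it assumes a codebook of 256 vectors with 16 elements each. The text does not state these sizes, but they are consistent with the table. The codebook contents are arbitrary test data.

```python
def full_search(x, codebook):
    """Return (index of the closest codebook vector, operation counts)."""
    ops = {"mem": 0, "mul": 0, "add": 0, "sub": 0}
    best_i, best_d = None, None
    for i, code in enumerate(codebook):
        dist = None
        for xj, cij in zip(x, code):
            ops["mem"] += 1              # fetch C_ij from the codebook
            diff = xj - cij
            ops["sub"] += 1
            sq = diff * diff
            ops["mul"] += 1
            if dist is None:
                dist = sq                # first term needs no addition
            else:
                dist += sq
                ops["add"] += 1
        if best_d is None or dist < best_d:
            best_i, best_d = i, dist
    return best_i, ops


# Assumed sizes: 256 codebook vectors of 16 elements, filled with arbitrary values.
codebook = [[(7 * i + 3 * j) % 256 for j in range(16)] for i in range(256)]
index, ops = full_search(codebook[42], codebook)
print(index, ops)
```

Comparisons are not counted because table 1 does not list them. The tree search variants reach their lower counts by examining only a small part of the codebook, which this sketch does not implement.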


3.2.3. Low Power Bus

Busses are known for their heavy loads and long interconnects, and therefore for their large capacitances. This is due to their connections to large cores spread across the die of a SoC. Reducing the capacitance is normally not possible, so reducing the switching activity is the only way of reducing power loss. Coding the transmitted data for minimum switching activity is therefore the methodology of choice. One-hot coding, Gray coding, bus-inversion coding and two's complement versus sign magnitude are introduced in the following.
One-hot coding is a simple, redundant coding style. The original bus with a bit width of n is capable of transmitting 2^n different words. In one-hot coding each word is mapped to a single wire, so a bus using this methodology needs m = 2^n wires. m - n is the degree of redundancy of the encoding. It is ensured that exactly two bits change between the transmission of two different data words. In practice one-hot coding is of no relevance because the area (number of wires) grows exponentially with n.
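Both properties of one-hot coding, the exponential wire count and the constant two transitions per word change, can be checked with a few lines. The helper names are illustrative.

```python
def one_hot(word, n):
    """Encode an n-bit word as a 2**n-bit one-hot pattern (returned as an int)."""
    if not 0 <= word < 2 ** n:
        raise ValueError("word does not fit in n bits")
    return 1 << word


def hamming(a, b):
    """Number of bus wires that toggle when the bus changes from a to b."""
    return bin(a ^ b).count("1")


N = 3
wires = 2 ** N                                   # 8 wires instead of 3
toggles = {hamming(one_hot(a, N), one_hot(b, N))
           for a in range(2 ** N) for b in range(2 ** N) if a != b}
print(wires, toggles)                            # 8 {2}
```

Whatever pair of different words is sent, one wire falls and another rises, so the activity per transfer is fixed at two transitions.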
Another encoding strategy is Gray coding. A Gray code sequence is a set of numbers in which adjacent values differ in exactly one bit. This is of use when the data being transmitted over a bus is sequential and highly correlated. In this case only one bit changes between two consecutive words, whereas plain binary counting averages close to two transitions per word. Address bus accesses for instruction fetches are highly correlated and a good example of where Gray coding gives good results. An example is given in table 2. Notice that Gray coding does not have the high redundancy of one-hot coding.

DECIMAL VALUE  BINARY VALUE  GRAY CODE
0              000           000
1              001           001
2              010           011
3              011           010
4              100           110
5              101           111
6              110           101
7              111           100

table 2: Binary and Gray-code representation
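The Gray column of table 2 follows from the standard binary-reflected conversion, and the saving on a sequential address stream can be counted directly. This is a small sketch with illustrative helper names.

```python
def to_gray(value):
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def count_transitions(words):
    """Total number of bit toggles over a sequence of bus words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


addresses = list(range(8))                       # sequential 3-bit addresses
gray = [to_gray(a) for a in addresses]

print([format(g, "03b") for g in gray])
print(count_transitions(addresses))              # 11 toggles in plain binary
print(count_transitions(gray))                   # 7 toggles, one per step
```

For longer sequential runs the binary count approaches two toggles per step, while the Gray-coded bus stays at exactly one.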

A third example is bus-inversion coding [ 11 ]. Before a data word Si+1 is transmitted, it is compared with the previous word Si. Either Si+1 or its bit-wise inverse is transmitted, depending on which representation results in fewer transitions. An additional wire indicates how to interpret the data. For bus widths smaller than 8, switching activity can be reduced by up to 25%. Wider buses can be divided, e.g. a 32 bit bus into four parts, and each part encoded separately. This makes more than one additional line necessary.
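A minimal encoder and decoder illustrate the idea. This sketch is a simplified variant that only counts toggles on the data wires when deciding whether to invert. The initial bus state of zero and the function names are assumptions.

```python
def bus_invert_encode(words, width):
    """Yield (bus_value, invert_flag) pairs for a sequence of data words."""
    mask = (1 << width) - 1
    prev = 0                                  # assumed initial bus state
    for word in words:
        toggles = bin((prev ^ word) & mask).count("1")
        if toggles > width / 2:
            bus, flag = ~word & mask, 1       # send the inverse, raise the flag
        else:
            bus, flag = word, 0
        yield bus, flag
        prev = bus


def bus_invert_decode(pairs, width):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(~bus & mask) if flag else bus for bus, flag in pairs]


data = [0x00, 0xFF, 0x0F, 0xF0]
encoded = list(bus_invert_encode(data, width=8))
print(encoded)                                # [(0, 0), (0, 1), (15, 0), (15, 1)]
print(bus_invert_decode(encoded, width=8) == data)
```

In this example no transfer toggles more than half of the data wires, at the cost of one extra wire whose own toggles also have to be paid for.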
Another choice of data representation is two's complement versus sign magnitude. Most designs choose two's complement because additions and subtractions are easier to implement. Problems occur if the dynamic range of the signal is much smaller than the maximum possible value and values around zero are common. In this case the most significant bits (MSBs) toggle every time the sign changes. In applications like this, sign magnitude is a good alternative.
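The effect of sign changes can be counted for a short, hypothetical sample sequence that oscillates around zero. The 8-bit width and the sample values are chosen for illustration only.

```python
def twos_complement(value, width):
    """Two's complement bit pattern of a signed integer."""
    return value & ((1 << width) - 1)


def sign_magnitude(value, width):
    """Sign-magnitude bit pattern: the MSB is the sign, the rest is |value|."""
    sign = 1 << (width - 1) if value < 0 else 0
    return sign | abs(value)


def toggles(codes):
    """Total number of bit toggles over a sequence of bus words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))


WIDTH = 8
samples = [1, -1, 2, -2, 1, -1]                  # hypothetical small signal around zero
tc = toggles([twos_complement(s, WIDTH) for s in samples])
sm = toggles([sign_magnitude(s, WIDTH) for s in samples])
print(tc, sm)                                    # 35 9
```

In two's complement every sign change flips the whole run of leading sign-extension bits, while sign magnitude flips only the single sign bit.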


3.2.4. Scheduling and Binding Optimization

At the behavioral level a design is typically described as a control-data-flow graph (CDFG). A scheduling algorithm assigns each operation of the CDFG to a designated time slot, taking data dependencies into account. A power-oriented procedure is presented in [ 13 ]. The intuition behind this method is that operations must be scheduled so that resources which are not performing useful computations in a given control step can be shut down. In this way the methodologies of chapter 3.3 become more applicable.
Each operation has to be executed on a resource. The assignment of operations to resources is called binding. Operations that are not scheduled in the same control step can be bound to the same resource and executed sequentially. This is called resource sharing. Resource sharing makes it possible to implement the same functionality on a reduced set of resources, e.g. one adder instead of two. One must not expect to reduce power simply by minimizing resources. It is true that the required area is halved, but since the activity of the one adder is doubled (two additions have to be executed on it instead of only one), its dynamic power dissipation doubles as well. How can a binding be optimized for low power then? Scheduling and binding should be chosen to exploit data correlations: binding the operations of a correlated data stream to the same resource reduces switching activity.
OFFIS developed the high level synthesis tool ORINOCO (compare with [ 9 ]) to automate the determination of a low power scheduling and binding.
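The influence of binding on switching activity can be illustrated by counting the operand toggles seen by two resources under two different bindings. The operand values are hypothetical, and the sketch is not meant to reflect how ORINOCO or the method of [ 13 ] works internally.

```python
def toggles(seq):
    """Total number of bit toggles at a resource input over a sequence of operands."""
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))


# Two hypothetical operand streams, each slowly changing and thus self-correlated.
stream_a = [0x0010, 0x0011, 0x0012, 0x0013]
stream_b = [0xFF00, 0xFF01, 0xFF03, 0xFF02]

# Binding 1: each resource processes one stream.
by_stream = toggles(stream_a) + toggles(stream_b)

# Binding 2: the operations of both streams alternate on each resource.
mixed_1 = [stream_a[0], stream_b[1], stream_a[2], stream_b[3]]
mixed_2 = [stream_b[0], stream_a[1], stream_b[2], stream_a[3]]
mixed = toggles(mixed_1) + toggles(mixed_2)

print(by_stream, mixed)                          # 7 59
```

Both bindings use the same number of resources and execute the same operations, yet the input activity differs by almost an order of magnitude in this constructed example.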


3.3. Power Down Modes

Systems are designed to meet certain constraints under which they have to operate. Since these limits normally reflect worst case situations, the system typically does not run at its maximum possible performance. Parts of the chip are idle and do not contribute any functionality at the time, but still consume power. The reasons are unnecessary changes on the inputs of the unused devices and the load they add to the clock signal, which toggles continuously whether the devices are processing or not. It is reasonable to turn these parts off.
This chapter offers several procedures for this case, sorted by granularity. If large parts of the system are idle for a long time, power supply shutdown is the methodology of choice (chapter 3.3.1). For smaller components, or for those which must not lose register values, clock gating (chapter 3.3.2) might be more applicable. Chapter 3.3.3 describes the least effective but also least invasive methodology, the insertion of enabled flip-flops. Finally, the shutdown of memory is treated in its own chapter 3.3.4.


3.3.1. Power Supply Shutdown

Shutting down the power supply reduces the power dissipation of a module to zero. This is the most effective way to save power in idle modules. Several conditions have to be fulfilled to employ this methodology.
1.  The power switch has to be well designed. A real switch has a resistance and a delay. A straightforward implementation is a transistor with a low ON resistance. To achieve this its width has to be increased, which results in a large capacitance. Buffering circuitry is needed to operate the switch at a satisfactory performance.
2.  It takes a delay time DT before the supply voltage stabilizes in a module that is switched back on. This makes the methodology applicable only to components with an idle time greater than DT.
3.  The design must not contain any storage units like registers or memory, because their values would be lost during power down. It is possible to add extra logic to save and later restore the data, but the logic and power overhead of this procedure has to be examined carefully.
4.  Powering down and up results in transient noise and voltage drops, even in a carefully designed power supply grid. The rest of the design must be adequately shielded from these effects to avoid functional failures.


The points listed above suggest that power supply shutdown is applicable only at a very coarse level of granularity, and one has to realize that it is very invasive and disturbing to a design.
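Conditions 1 and 2 can be combined into a simple break-even estimate. The following sketch assumes that the module draws a constant idle power while it stays on, draws nothing while it is off, and that each off/on cycle costs a fixed switching energy. All numbers are hypothetical.

```python
def break_even_idle_time(wakeup_delay, idle_power, switch_energy):
    """Shortest idle period (s) for which a shutdown is possible and saves energy.

    The idle period must exceed the wake-up delay DT, and the energy saved
    (idle_power * idle_time) must exceed the energy spent on switching.
    """
    return max(wakeup_delay, switch_energy / idle_power)


def shutdown_pays_off(idle_time, wakeup_delay, idle_power, switch_energy):
    """True if powering the module down for `idle_time` seconds is worthwhile."""
    return idle_time > break_even_idle_time(wakeup_delay, idle_power, switch_energy)


# Hypothetical module: DT = 1 ms, 5 mW idle power, 20 uJ per off/on cycle.
DT, P_IDLE, E_SWITCH = 1e-3, 5e-3, 20e-6
print(break_even_idle_time(DT, P_IDLE, E_SWITCH))
print(shutdown_pays_off(10e-3, DT, P_IDLE, E_SWITCH))
print(shutdown_pays_off(2e-3, DT, P_IDLE, E_SWITCH))
```

In this example the switching energy, not DT, determines the break-even point, which underlines why only long idle periods justify a supply shutdown.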


3.3.2. Clock Gating

Instead of switching off the power supply, the clock signal may be halted in idle devices. This reduces the switching activity, and therefore the dynamic power consumption, of these devices to zero. Inserting clock gates interferes less with the design than power supply shutdown and can be used on components of finer granularity than those mentioned in 3.3.1. This makes clock gating suitable for applications where power shutdown is not an option. Clock gating does not reduce power dissipation to zero, however, since leakage power is unaffected.
The designer has to take into account that the gate increases clock skew and makes testing more complicated. In addition, glitches on the control signal of the gate must be prevented. For example, a glitch could cause the clock to be turned off or on falsely for a moment, which might add an extra rising edge to the clock signal behind the gate. In that case the preservation of the circuit's behavior is not guaranteed!
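The extra rising edge can be shown with sampled waveforms. The sketch assumes the simplest possible gate, a plain AND of clock and enable, and uses made-up waveforms with a fine time step.

```python
def rising_edges(signal):
    """Number of 0 -> 1 transitions in a sampled waveform."""
    return sum(a == 0 and b == 1 for a, b in zip(signal, signal[1:]))


# Sampled waveforms, several samples per clock period.
clk     = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
clean   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # enable drops after one cycle
glitchy = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # glitch while clk is high

gated_clean = [c & e for c, e in zip(clk, clean)]
gated_glitchy = [c & e for c, e in zip(clk, glitchy)]

print(rising_edges(gated_clean))     # 1
print(rising_edges(gated_glitchy))   # 2
```

The register behind the gate would capture data once in the clean case but twice in the glitchy case, which changes the behavior of the circuit.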
Synopsys advertises its tool Power Compiler as handling the insertion of clock gating automatically (compare with [ 9 ]).


3.3.3. Enabled Flip-Flops

Just as clock gating can be seen as a softer alternative to power supply shutdown, enabled flip-flops are the next less aggressive (and less effective) strategy. Registers are replaced by equivalents with an enable signal. When enabled, these behave like ordinary registers. When disabled, the flip-flops' outputs do not change, which reduces switching activity in the circuit. The most active signal, the clock, still toggles though, which causes a great deal of power dissipation.
In summary, power management based on enabled flip-flops can be beneficial, but an implementation based on gated clocks is fundamentally superior.
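The difference between the two strategies can be seen in a cycle-level behavioral model. This is a sketch with illustrative names: both registers store the same values, but they differ in how many clock edges reach the register.

```python
def simulate(data, enable, gated):
    """Return (stored value per cycle, clock edges seen by the register)."""
    q, stored, clock_edges = 0, [], 0
    for d, en in zip(data, enable):
        if gated:
            if en:                 # gated clock: an edge arrives only when enabled
                clock_edges += 1
                q = d
        else:
            clock_edges += 1       # enabled flip-flop: clocked in every cycle
            if en:
                q = d              # otherwise the old value is kept
        stored.append(q)
    return stored, clock_edges


data   = [5, 6, 7, 8, 9, 10]
enable = [1, 0, 0, 1, 0, 0]

ff_values, ff_edges = simulate(data, enable, gated=False)
cg_values, cg_edges = simulate(data, enable, gated=True)
print(ff_values == cg_values, ff_edges, cg_edges)   # True 6 2
```

The data path activity is identical, so the remaining advantage of clock gating is the clock activity that never reaches the register.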


3.3.4. Memory Partitioning

Farrahi and co-authors [ 6 ] propose a memory partitioning (also called segmentation) scheme that reduces power by exposing idleness in memory access. The function of memory is to store data when it is written and return it when it is read. Farrahi suggests viewing memory not as a monolithic resource but as a collection of independent memory segments. Each segment has its own clock and refresh signals. Whenever a memory segment is idle, it can be put into a sleep mode in which the clock is halted or no refreshes are transmitted. Memory is idle when no useful information is stored in it. Be aware that memory is not necessarily idle just because it is not being accessed: it might store vital information which would be lost if the memory were turned off, although it might equally store unimportant information. To tell the two apart, a lifetime is assigned to each variable in a memory element. It is the time interval which starts when the variable is written and ends when the variable is last read. A segment is called idle when it contains no live variables. The partitioning technique attempts to store variables which have overlapping lifetimes in the same segment. This approach increases the idle time of the memory segments and reduces power dissipation.
Memory segmentation is a binding problem that assumes knowledge of the scheduling information.
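The benefit of grouping variables with overlapping lifetimes can be checked by counting the control steps in which each segment holds no live variable. The lifetimes and the two candidate partitionings are hypothetical, and the sketch only evaluates given partitionings; it does not implement the algorithm of [ 6 ].

```python
def live_steps(lifetimes):
    """Set of control steps in which a segment holds at least one live variable."""
    steps = set()
    for written, last_read in lifetimes:
        steps.update(range(written, last_read + 1))
    return steps


def idle_steps(segments, horizon):
    """Total number of (segment, step) slots in which a segment may sleep."""
    return sum(horizon - len(live_steps(seg)) for seg in segments)


# Hypothetical lifetimes as (step written, step of last read), schedule of 10 steps.
a, b, c, d = (0, 3), (1, 4), (6, 9), (7, 9)
HORIZON = 10

overlapping_together = [[a, b], [c, d]]
overlapping_apart = [[a, c], [b, d]]

print(idle_steps(overlapping_together, HORIZON))   # 11
print(idle_steps(overlapping_apart, HORIZON))      # 5
```

Keeping the overlapping variables together more than doubles the sleep opportunities in this example, because each segment is then live during only one part of the schedule.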


3.4. System Design

According to figure 2, system level low power design techniques should be the most promising for reducing energy consumption. Two methodologies are presented in this section. Chapter 3.4.1 focuses on low power hardware-software partitioning and shows possible savings of approximately 77%. Chapter 3.4.2 handles the chip's I/O communication, which is responsible for up to 33% of overall system power consumption in typical designs.


3.4.1. HW/SW Partitioning

Services can be implemented either in software running on processor cores or in dedicated hardware. In a typical design flow this is decided during hardware-software partitioning. This step has a great influence on system power, as the following example illustrates.
Specific hardware is generally more efficient. A SPARClite processor core consumes 330 mW to perform an addition in an exemplary technology (0.32 µm / 1.8 V / 16.8 MHz). A custom adder in the same technology consumes only 2 mW plus additional communication overhead. The author of [ 14 ] presents the HDTV chromakey algorithm with 22000 lines of code. Only 15 lines, the critical loops, are implemented in hardware, which results in an energy saving of 77%. These are promising results and confirm the prediction of figure 2.
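A back-of-the-envelope calculation shows how a few lines of code can dominate the result. This is not the model used in [ 14 ]: it assumes equal execution time in hardware and software, ignores the communication overhead, and takes the 330 mW versus 2 mW adder figures as a general efficiency ratio.

```python
def energy_saving(fraction_moved, hw_gain):
    """Relative energy saving when `fraction_moved` of the software energy is
    executed by hardware that is `hw_gain` times more energy efficient."""
    return fraction_moved * (1.0 - 1.0 / hw_gain)


HW_GAIN = 330 / 2                    # assumed efficiency ratio of hardware to software

# Share of the total energy that the moved loops must account for
# in order to reach an overall saving of 77%.
needed_fraction = 0.77 / (1.0 - 1.0 / HW_GAIN)
print(f"{needed_fraction:.3f}")
```

Under these assumptions the critical loops must be responsible for a little more than three quarters of the software energy, which is plausible for the inner loops of a video algorithm.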


3.4.2. Integration of Chip Components

Implementing systems using present day technology results in a third or more of the total power being consumed at the chip's input/output (I/O) ports. The reasons are the larger capacitances at the chip's boundary compared to internal gates, and the higher voltages. Internal capacitances are typically in the range of tens of femtofarads, whereas I/O pins reach tens of picofarads. Nowadays supply voltages for chip cores tend to be lower than 2.0 V. In industrial systems, however, not all components of a design are state of the art, and some require higher voltages, or technical constraints demand them. Still, these different components have to communicate over their I/O. This makes dual voltage systems (a lower voltage for the cores, a higher voltage for I/O) quite common.
Equation 1 shows the relevance of high capacitances and voltages to dynamic power. This makes the reduction of switching activity on I/O ports an important task. The methodologies of chapter 3.2 can be applied. For the special case of external memory, implementing a cache helps to reduce I/O traffic and should raise performance as a side effect. The better way, however, is to decrease I/O as much as possible by integrating all systems on one chip (SoC). Submicron technology makes such highly integrated circuits possible.
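The orders of magnitude quoted above translate directly into energy per signal transition. The sketch assumes the common C * V^2 dependence of switching energy and picks 10 fF at 1.8 V for an internal node and 10 pF at 3.3 V for an I/O pin. The two voltages are assumed example values, not figures from the text.

```python
def switching_energy(capacitance, voltage):
    """Energy drawn from the supply to charge a node once, assuming E = C * V^2."""
    return capacitance * voltage ** 2


internal = switching_energy(10e-15, 1.8)     # internal gate: 10 fF, 1.8 V core supply
io_pin = switching_energy(10e-12, 3.3)       # I/O pin: 10 pF, 3.3 V I/O supply

ratio = io_pin / internal
print(f"one I/O transition costs about {ratio:.0f} internal transitions")
```

With these values a single toggle on an I/O pin costs as much as a few thousand internal toggles, which is why both activity reduction on the pins and on-chip integration pay off.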







Design: Ralf Beckers | Copyright © 1999 - 2004 Low Power - Last modified: 21-11-2007