An Efficient ASIP Design Methodology

Selim ZOGHLAMI*, Raphael DAVID*, Stéphane GUYETANT* and Daniel ETIEMBLE**
* CEA LIST, Embedded Computing Laboratory
** LRI - Computer Science Lab

ABSTRACT :

The processors that are used in embedded systems must fulfil a set of constraints: program execution time, power consumption, chip size, code size and so on. In this paper, we focus on the design of Application Specific Instruction-set processors, and more precisely on an efficient methodology for the Design Space Exploration of an ASIP for the audio and speech domain. Using this methodology, we designed a high performance ASIP achieving over 13GOPS/mm2 with a 350MHz clock frequency in a low-power 65-nm TSMC technology. The development time was less than two man-months.

1. CONTEXT

The Design Space Exploration of an ASIP (Application Specific Instruction-set Processor) can be very complex due to the large number of design parameters. In our design case study, we focus only on some key architectural features like the pipeline depth, the number of registers, the implementation of special operations, the number of instructions that can be executed simultaneously and so on. Finding the best trade-off for the values of all these parameters is not obvious and we need a specific design methodology to meet all requirements.

In figure 1, we present different approaches that can be used to find the best trade-off. To keep the figure readable, we consider only two design parameters, P1 and P2, which could be, for example, the pipeline depth and the number of instructions that are executed simultaneously.

Figure 1: Different approaches to find the optimal values of two design parameters

(a) The exhaustive search considers all the possible values of each parameter. Due to the large number of parameters, it is infeasible to evaluate each point of the design space and compare it to all the others. Heuristic search techniques must be used instead, leading to a suboptimal solution.

(b) To prevent a heuristic search from stopping at a local optimum, a second technique called random sampling can be used. It consists of choosing the pairs of parameter values at random, but again there is no guarantee of converging towards an acceptable result.

(c) With the guided-search approach, the designer starts with a preliminary choice of two parameters, and iterates around step by step until finding an acceptable trade-off. This approach avoids inconsistent or conflicting values for the different parameters and represents the best design solution when the entry point is well chosen.

(d) Many other approaches could also be considered, such as genetic algorithms, machine-learning-based searches, and so on.
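
As an illustration of approach (c), the following minimal sketch moves step by step from an entry point to the best neighbouring pair of values until no neighbour improves the cost. The cost function, the parameter ranges and all names are hypothetical and are not taken from our design flow; in practice the cost would come from an evaluation of the architecture.

```python
from typing import Callable, Tuple

Point = Tuple[int, int]


def guided_search(cost: Callable[[int, int], float], start: Point,
                  p1_range: range, p2_range: range) -> Point:
    """Step to the best neighbouring (P1, P2) pair until no neighbour is better."""
    current = start
    while True:
        p1, p2 = current
        # Neighbours are the (up to 8) points one step away that stay in range.
        neighbours = [(p1 + d1, p2 + d2)
                      for d1 in (-1, 0, 1) for d2 in (-1, 0, 1)
                      if (d1, d2) != (0, 0)
                      and (p1 + d1) in p1_range and (p2 + d2) in p2_range]
        best = min(neighbours, key=lambda p: cost(*p), default=current)
        if cost(*best) >= cost(*current):
            return current  # no neighbour improves the cost: stop here
        current = best


def toy_cost(depth: int, ways: int) -> float:
    # Hypothetical cost with a single minimum at depth=5, ways=3.
    return (depth - 5) ** 2 + (ways - 3) ** 2


result = guided_search(toy_cost, start=(3, 1),
                       p1_range=range(1, 11), p2_range=range(1, 9))
print(result)
```

As stated in (c), the quality of the result depends on the entry point: with a cost function that has several local minima, this loop stops at the one closest to the start.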

For our design, we use the guided search of parameters. First, we determine the most important features of our architecture. Then, we use a design tool to quantify these features and the rest of the architecture. The rest of the paper is organized as follows: section 2 presents the design methodology, section 3 details the architecture, section 4 gives the results and the validation of the designed processor, and section 5 concludes and introduces future work.

2. DESIGN METHODOLOGY

Our aim is to find a good trade-off between the time-to-design and the performance of an ASIP for a specific application domain.

2.1 Our benchmarks

For our case study, we chose audio and speech standards as a specific and widely used domain of embedded systems. Several audio and speech standards with different encoding techniques are available, from lossless to lossy coding. Table 1 summarizes the set of benchmarks that we used for the Audio ASIP design. Most of these benchmarks come from MediaBench. They cover different coding techniques as well as a range of key features such as bit-rates and computing complexities. More details on audio coding techniques are given in [1], [2], [3] and [4].

Table 1: Audio Applications Benchmark

2.2 Benchmark Profiling and Analysing

The selected benchmarks have been profiled using GPROF [5], the public GNU profiler. The outputs of the profiler provide the call graphs and the hotspots, i.e. the most time-consuming functions. For our audio-speech benchmarks, we identified 14 hotspot functions, such as the codebook best parameters filter search from the CELP (Code Excited Linear Prediction) standard or the Modified Discrete Cosine Transform of MP3 (MPEG-1 Audio, Part 3, Layer 3). These hotspots account for over 66% of the overall execution time. With this analysis of the hotspots, we cover most of the computational needs of the audio benchmarks. Their limited number makes the manual analysis feasible. The hotspots can also be analysed to determine the architectural features that could accelerate the execution. For example, we can identify the register and storage needs, the data-path widths, and so on. For instance, table 2 presents the number of registers that would be needed for an efficient execution of each audio-speech hotspot. These needs were identified by evaluating the lifetimes of the variables in the execution graph.
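
The register estimate derived from variable lifetimes can be sketched as follows. This is a minimal illustration, assuming that each variable is described by the index of the instruction that defines it and the index of its last use; the function name and the example ranges are hypothetical and do not come from our benchmarks.

```python
from typing import List, Tuple


def max_live_variables(live_ranges: List[Tuple[int, int]]) -> int:
    """Return the peak number of simultaneously live variables.

    Each range is (first_definition, last_use), both inclusive,
    expressed as instruction indices in the execution graph.
    """
    events = []
    for start, end in live_ranges:
        events.append((start, 1))     # variable becomes live
        events.append((end + 1, -1))  # variable is dead after its last use
    # At equal positions, (-1) sorts before (+1), so a register freed
    # at index i can be reused by a variable defined at index i.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


# Hypothetical lifetimes of five variables in a small kernel.
example = [(0, 4), (1, 2), (2, 6), (3, 3), (5, 7)]
print(max_live_variables(example))
```

The peak gives a lower bound on the number of registers needed to run the kernel without spilling; it does not account for registers reserved by calling conventions or address generation.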

Table 2: Estimated registers needs of audio-speech hotspots

We have also identified some specific code features that could be accelerated by specific hardware features such as a pre-arithmetic shift. Our benchmarks also intensively use loops for which optimizing both loop conditional branches and computation conditional branches is fundamental.

2.3 Architecture Sizing

2.3.1 Basic assumptions for the initial version of the architecture

The initial version of the architecture is presented here. It uses a typical RISC (Reduced Instruction-Set Computer) instruction set architecture with 1-instruction delayed branches and conditional code flags (CC flags) for conditional branches, like the SPARC ISA. The ISA (Instruction-Set Architecture) is implemented with a typical 5-stage pipeline, both for the scalar version and for the n-way superscalar or VLIW (Very Long Instruction Word) versions. Some features reduce the number of executed instructions for both the scalar and the n-way versions. The number of CC flags is one such feature; it is presented in the next section. One more fundamental feature is the number of instructions that the hardware can execute simultaneously, i.e. the value of n for the n-way approach. It will be discussed in a subsequent section.

2.3.2 Conditional Codes Flags Sizing

As previously mentioned, loops are common in our benchmarks and they combine a loop branch with one or several computation branches within the loop. Generally, the result of an entire loop computation is scaled at the end of the loop. So we need a flag for the loop branch and another one for the conditional result scaling. Having one or several CC flags affects the overall performance of the loop.

Table 3: Evaluation of Conditional Codes Flags Implementation

In the example shown in table 3, implementing two different CC flags saves one cycle per loop iteration. With only one CC flag, there is no way to fill the delay slot after the loop branch, as the CMOV instruction must follow the first SUBCC while the JMP CC must follow the second SUBCC. With two different CC flags, the CMOV instruction can be moved into the branch delay slot, removing the NOP that was needed in the previous case.

The gain can grow rapidly with n-way architectures. In this situation, the loop branch condition SUBCC1 can be evaluated in the same cycle as another instruction. In table 4, with a 2-way architecture, we save one more cycle per iteration.

Table 4: Evaluation of Conditional Codes Flags Implementation for a 2-ILP architecture

The impact of the number of CC flags can be evaluated by a metric called "instruction utilization rate" (IUR), defined as the number of useful instructions over the overall number of instructions (useful and NOP instructions). This instruction utilization rate can also be defined as 1 − NOP percentage. In table 4, if the first M instructions perfectly fitted on the architecture leading to N/2 cycles and zero NOP, an evaluation of that metric for both implementations leads to:

Using several CC flags increases performance and uses the capabilities of the architecture more efficiently. The chip area cost is relatively small and there is no issue for the instruction-set coding. Obviously, the results shown in table 4 are based on a simple 5-stage pipeline like that of the MIPS R2000 [6]. Deeper pipelines could lead to other results. For instance, the pipeline of the Alpha 21164 [7] had 2 execution stages (EX1 and EX2): the condition was evaluated during the EX1 stage, while the conditional branch was executed during EX2. In that case, both the instruction setting the condition and the conditional branch can be scheduled in the same clock cycle, removing many of the NOP instructions in table 4. Deeper pipelines will be considered in future work.
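
The instruction utilization rate defined above can be computed from a schedule as in the following minimal sketch. The two schedules only model the end of a loop on the scalar version, following the ordering constraints described for table 3; the mnemonics, the flag suffixes and the omission of the rest of the loop body are simplifying assumptions made for the illustration.

```python
from typing import List


def instruction_utilization_rate(schedule: List[List[str]]) -> float:
    """Useful instructions over all issue slots; 'NOP' marks an unused slot.

    `schedule` holds one list per clock cycle, with one entry per way.
    """
    total = sum(len(cycle) for cycle in schedule)
    useful = sum(1 for cycle in schedule for op in cycle if op != "NOP")
    return useful / total


# One CC flag: CMOV must follow the first SUBCC, so the delay slot
# after the loop branch can only hold a NOP.
one_flag = [["SUBCC"], ["CMOV"], ["SUBCC"], ["JMPCC"], ["NOP"]]

# Two CC flags (hypothetical suffixes 0 and 1): CMOV reads flag 0 and
# can be moved into the delay slot of the branch that reads flag 1.
two_flags = [["SUBCC0"], ["SUBCC1"], ["JMPCC1"], ["CMOV0"]]

print(instruction_utilization_rate(one_flag))
print(instruction_utilization_rate(two_flags))
```

For an n-way schedule, each inner list simply contains n entries, so the same function applies to the 2-way case of table 4.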

2.3.3 N-way architectures

The aim of the article is to present a design methodology based on a well-chosen driving parameter. We focus on the number of instructions executed simultaneously as the driving parameter. The main objective is to find the most efficient architecture that can exploit the ILP (Instruction Level Parallelism) available in the benchmarks with a minimal set of resources, i.e. with the best silicon efficiency.

We need to find the best triplet of number of ways (Nbways), instruction utilization rate (Tuse) and chip area. The processor frequency and the resulting Nbop/sec are derived from the design for each different n-way architecture.

As no compiler is available for each evaluated architecture, the only way to find the best triplet (number of ways, instruction utilization rate, chip area) is to manually schedule the operations of the execution kernels for each architecture. The assembly code of the identified hotspots has been written, and the corresponding execution times (in clock cycles, based on the data dependencies) and instruction utilization rates have been calculated for different parallel architectures (2, 3, 4, 6 and 8-way architectures). We considered two types of data-paths: homogeneous data-paths have the same processing resources in every way, while heterogeneous data-paths have specific processing resources for each way.

For our audio-speech benchmarks, on homogeneous data-paths, the instruction utilization rate is 87% for a 2-way VLIW, 74% for 3-way, 54% for 4-way and less than 36% for wider architectures. Obviously, the hotspot loops of the audio applications do not have enough ILP to efficiently exploit 6 or 8-way architectures. The instruction utilization rate on heterogeneous architectures is 87% for a 2-way, 72% for 3-way and 52% for 4-way architectures, as shown in figure 2.
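
A simple way to read these rates is to multiply each of them by the number of ways, which gives the average number of useful instructions issued per clock cycle. The sketch below does this for the 2, 3 and 4-way figures quoted above; the function and variable names are ours, and the result deliberately ignores chip area and clock frequency, which are handled later.

```python
from typing import Dict

# Instruction utilization rates quoted in the text, indexed by number of ways.
HOMOGENEOUS_IUR: Dict[int, float] = {2: 0.87, 3: 0.74, 4: 0.54}
HETEROGENEOUS_IUR: Dict[int, float] = {2: 0.87, 3: 0.72, 4: 0.52}


def useful_ops_per_cycle(iur_by_ways: Dict[int, float]) -> Dict[int, float]:
    """Average number of useful instructions per cycle for each n-way design."""
    return {n: n * iur for n, iur in iur_by_ways.items()}


for name, rates in (("homogeneous", HOMOGENEOUS_IUR),
                    ("heterogeneous", HETEROGENEOUS_IUR)):
    per_cycle = useful_ops_per_cycle(rates)
    best = max(per_cycle, key=per_cycle.get)
    print(name, {n: round(v, 2) for n, v in per_cycle.items()}, "best:", best)
```

On this measure alone, going from 3 to 4 ways brings no gain for either type of data-path, which is why the extra area of wider architectures has to be weighed against it.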

Heterogeneous data-paths allow a significant saving in architecture area. At the same time, the utilization rates of homogeneous and heterogeneous data-paths are quite similar. So, with silicon efficiency as the main metric, parallel architectures with heterogeneous processing resources are the most attractive option. We will only consider heterogeneous 2, 3 and 4-way architectures in the rest of the paper.

Figure 2: Instruction utilization rates for n-way architectures for audio benchmarks

The second step is to select the amount of parallelism in the architecture. This step needs a prediction of how the hardware complexity evolves when resources are duplicated. From the area distribution of a RISC processor, we estimate the chip area of each parallel architecture according to the following hypotheses:

The decoder hardware complexity is about 5% of the overall chip area.
The fetch cost is also about 5%.
The chip area of a register file of 32 32-bit registers is about 35%.
The execution units are supposed to use 40% of the overall area.
The remaining 15% are assumed for the remaining parts of the pipeline with its communication mechanisms and pipeline registers.

The evolution of the hardware complexity of the different architectural features is also estimated. For example, we consider that the program memory access cost is proportional to the number of instructions fetched per clock cycle. When n increases with n-way architectures, the decoder complexity increases, but many operations share decoding logic. Thus, we assume that the decoder area increases proportionally to the square root of n. Bypassing and communication mechanisms are also assumed to increase according to the same law.

As the register file and the execution units represent around 3/4 of the overall chip area, we made some specific investigations to estimate more precisely their evolution when n increases. For the register file, a set of gate-level syntheses based on a 2R/1W RF description has been performed. This study shows an increase of 50% when doubling the number of RF ports, an increase of 100% with a 6R/3W RF and an increase factor of over 2.5 for an 8R/4W RF versus the original 2R/1W one. In table 5, we present the evolution of the hardware complexity of n-way processors relative to the RISC area.

Table 5: Hardware complexity evolution for n-way architectures with heterogeneous data-paths relatively to RISC processor
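
The hypotheses of this section can be combined into a small area model, sketched below. The area split and the scaling laws for the fetch, decoder and communication parts, as well as the register file factors, are those stated above (2.5 being a lower bound for the 8R/4W case). The scaling of the execution units is not given in the text, so it is left as an explicit parameter; the values used in the example for it are purely hypothetical.

```python
import math

# Area split of the reference RISC processor (fractions of its total area).
BASE_AREA = {
    "decoder": 0.05,
    "fetch": 0.05,
    "register_file": 0.35,
    "execution": 0.40,
    "other": 0.15,  # remaining pipeline parts, bypassing, pipeline registers
}

# Register file growth versus the 2R/1W file, from the synthesis study:
# 4R/2W -> +50%, 6R/3W -> +100%, 8R/4W -> at least x2.5.
RF_FACTOR = {1: 1.0, 2: 1.5, 3: 2.0, 4: 2.5}


def relative_area(n: int, exec_factor: float) -> float:
    """Estimated area of an n-way design relative to the scalar RISC.

    exec_factor is the assumed growth of the execution units; it is an
    input of this sketch, not a value taken from the paper.
    """
    return (BASE_AREA["fetch"] * n                     # proportional to n
            + BASE_AREA["decoder"] * math.sqrt(n)      # square-root law
            + BASE_AREA["other"] * math.sqrt(n)        # same law as decoder
            + BASE_AREA["register_file"] * RF_FACTOR[n]
            + BASE_AREA["execution"] * exec_factor)


# Hypothetical execution-unit factors for heterogeneous 2- and 3-way designs.
for n, exec_factor in ((1, 1.0), (2, 1.5), (3, 2.0)):
    print(n, round(relative_area(n, exec_factor), 2))
```

With exec_factor set to 1 for n = 1, the model returns exactly the area of the reference RISC, which is a convenient sanity check before plugging in other values.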

Having evaluated each of the parameters presented in equation 3, we can evaluate the different n-way architectures versus the scalar implementation (i.e. leaving aside the Nbop/sec, which is not yet known). For our case study, 2 and 3-way architectures represent a good trade-off for audio-speech applications.

2.4 Development Tool

The Synopsys Processor Designer [8] is an automated design tool based on the ADL (Architecture Description Language) LISA 2.0. It provides efficient design feedback to debug and optimize the architecture. From a behavioral description of the operations, several architectures (RISC, DSP, VLIW) can be implemented. In addition, an architecture debugger gives full visibility of the parameters at execution time: registers, contents of the different memories, instruction opcodes, pipeline stages, stalls and flushes, loop iterations, current pipeline signals, and so on. It allows a micro-step execution of the LISA instructions that is neither cycle-accurate nor instruction-accurate but "LISA-line-accurate".

This tool is used to size a design criterion and to rapidly evaluate its influence on the overall system. The development flow and the tool features used are presented in figure 3. From the starting point defined previously, this tool is used during the guided search process described in figure 1(c) of section 1.

Figure 3: Audio Processor Design with Synopsys

3. ARCHITECTURE OVERVIEW

A block diagram of the designed processor is presented in figure 4.

This figure shows a five-stage pipeline architecture: Instruction Fetch (FE), Decoding (DE), Execution (EX), Memory or second execution stage (MEM) and RF Writeback (WB). It is an n-way structure with three separate data-paths. The distribution of the operators among the data-paths was obtained from the analysis of the applications and their computational patterns. The estimated instruction utilization rate gives an indication of the soundness of this choice. The distribution is given below:

Data-path 1: Arithmetic and Logic Unit, Jumps and Branches, and a 16x16→32-bit Multiplier.
Data-path 2: Arithmetic Unit with CC flag setting and a Shifter.
Data-path 3: ALU, Load/Store Unit, Data Manipulation (including Conditional Writes).

All these data-paths are 32-bit signed except the multiplication. The multiplier takes 16-bit operands and explicit signs in order to support wider software (un)signed multiplications. The 16x16→32-bit multiplication is done in the MEM (or EX2) stage so that the data hazard resolution does not extend its critical path (i.e. the processor critical path). The multiplier result can be used with a one-cycle latency.

The instruction-set coding is 96 bits wide, with mainly two source operands and one register result (Opcode destreg, src1reg, src2reg-or-imm). The second operand can be a register or an immediate value, mostly 14 bits wide. The Register File includes 32 32-bit registers. It is fully accessible by the three data-paths: it includes 6 Read and 3 Write ports.

The branch and jump unit is not represented in this figure. The corresponding instructions are implemented by the decoder and the result is given back to the fetch stage. Branches and jumps are delayed by one clock cycle, which means that the delay slot must be filled by a useful instruction or a NOP. As already presented in the example of Conditional Codes Flags implementation, a conditional move is implemented, which writes either the first or the second operand to a register according to the state of the CC flags. This technique replaces conditional branches by conditional transfers. Its use increases performance because the data-paths remain available and no time is spent waiting for a condition to be evaluated.

The Load/Store Unit allows data memory access. It has 4 access modes: ".W" for word data, ".H" for signed half-word data, ".UH" for unsigned half-word data and ".B" for 8-bit data. All these accesses are done in the MEM stage, which implies a one-cycle latency before the loaded results can be used.
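
The way a 16x16→32-bit multiplier supports wider software multiplications can be illustrated with the classical decomposition into partial products. The sketch below is only a model of the arithmetic for the unsigned 32x32 case; it does not reproduce the instruction sequence used on the processor or its handling of explicit signs, and the function names are ours.

```python
MASK16 = 0xFFFF


def mul16x16(a: int, b: int) -> int:
    """Model of one unsigned 16x16 -> 32-bit hardware multiplication."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32x32_unsigned(a: int, b: int) -> int:
    """Unsigned 32x32 -> 64-bit product built from four 16x16 products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    low = mul16x16(a_lo, b_lo)                          # weight 2^0
    mid = mul16x16(a_hi, b_lo) + mul16x16(a_lo, b_hi)   # weight 2^16
    high = mul16x16(a_hi, b_hi)                         # weight 2^32
    return (high << 32) + (mid << 16) + low


print(hex(mul32x32_unsigned(0x12345678, 0x9ABCDEF0)))
```

Each partial product maps onto one use of the hardware multiplier, while the shifts and additions are left to the shifter and the arithmetic units of the other data-paths.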

Figure 4: Architecture Overview

4. RESULTS AND VALIDATION

4.1 Architecture Design

The VHDL RTL of the designed 3-way VLIW ASIP has been generated using the Synopsys Processor Designer tool. The RTL has then been synthesized at gate level using Design Compiler from Synopsys, targeting the 65-nm Low Power TSMC technology. Under a minimum timing constraint of 2.8 ns, the overall chip area is about 0.07 mm2, with more than 45% dedicated to the Register File and 13% to the decoder. The validation process consists of executing the profiled applications and evaluating the processor performance in terms of Silicon Efficiency. Figure 5 summarizes the overall design flow from the application benchmarks to the ASIP performance evaluation.

Figure 5: Methodology Design Flow

First, we select a set of benchmarks from the application domain, which we profile and analyze. Second, we look for the best triplet (number of instructions executed in parallel, instruction utilization rate, chip area). For this, we examine how the assembly code of the different benchmark kernels executes on each n-way architecture and we evaluate the execution time and the instruction utilization rate. Third, we use a design tool to size an efficient processor. We iterate the process until we meet our requirements. Finally, we validate the designed processor with a gate-level synthesis and we execute the studied hotspot kernels.

As no compiler was available, the assembly code of three hotspots was manually written and optimized to validate the ASIP architecture. These hotspots were executed on the processor, leading to an instruction utilization rate of 86%. Note that only three of the 14 hotspot functions were manually written to evaluate our processor. They represent only about 20% of the overall execution time. The Silicon Efficiency of a processor is given by:

The silicon efficiency of the designed ASIP is then:

The designed 3-way processor delivers about 13 GOPS/mm2. The development lasted a couple of months. Its clock speed is about 357 MHz and it efficiently executes GSM (Global System for Mobile communications), CELP, ADPCM (Adaptive Differential Pulse Code Modulation) and MP3 applications.
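
As an illustration, assuming that the silicon efficiency is the number of useful operations per second divided by the chip area (number of ways x instruction utilization rate x clock frequency / area), the figures quoted in this section can be combined as follows. The function name and the exact form of the expression are assumptions of this sketch; the clock frequency is taken as the inverse of the 2.8 ns timing constraint.

```python
def silicon_efficiency_gops_per_mm2(ways: int, utilization: float,
                                    clock_hz: float, area_mm2: float) -> float:
    """Useful operations per second and per mm2, expressed in GOPS/mm2.

    Assumed form: ways * utilization * clock frequency / area.
    """
    return ways * utilization * clock_hz / area_mm2 / 1e9


cycle_time_s = 2.8e-9            # minimum timing constraint of the synthesis
clock_hz = 1.0 / cycle_time_s    # about 357 MHz

asip = silicon_efficiency_gops_per_mm2(ways=3, utilization=0.86,
                                       clock_hz=clock_hz, area_mm2=0.07)
print(round(clock_hz / 1e6), "MHz ->", round(asip, 1), "GOPS/mm2")
```

Under this assumption, the result is consistent with the reported value of about 13 GOPS/mm2, and the expression makes it clear that reducing the chip area is as effective as raising the utilization rate or the clock frequency.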

4.2 Performance Analysis

The Synopsys Processor Designer allows a fast generation of other Audio ASIP versions based on the designed one. The aim here is to show that the presented design methodology sized the design parameters correctly. A small modification of one of them leads to totally different results. In an earlier example, we showed the impact of implementing two different conditional codes flags. Now we consider the impact of smaller instructions.

Only a few modifications to the ADL description are needed to design 2-way VLIW and RISC implementations. Evaluating their performance with the audio benchmarks leads to different results in silicon efficiency, as presented in figure 6.

Figure 6: Normalized silicon efficiencies achieved by different n-way processors

For the three evaluated hotspot functions, we clearly observe that n-way architectures are better than scalar ones. At the beginning of the study, taking only these 3 hotspots, the 3-way architecture was 0.78 times less efficient than the 2-way one in terms of Silicon Efficiency. But for all the hotspots, the two versions were quite similar. The results given in figure 6 after implementation refer only to the execution of the three hotspots. So if we assume that the evolution from the preliminary results to the results after implementation will be the same for all the hotspot functions, then we expect that the 3-way processor will be 1.23 times better than the 2-way one and even more versus the scalar implementation.

SPARC v8 is an instruction set for RISC processors including load/store, arithmetic, logic and shift instructions and everything needed to execute a wide range of applications. We chose the Leon3 implementation of the SPARC v8 ISA as our reference for the performance achieved by the ASIP. The Leon3 has a seven-stage pipeline with a Harvard architecture (with separate Program and Data memories). It includes a hardware multiplier/divider and a 3-port Register File. The special register file contains 32 registers organized in windows. The three validation functions are executed on it, and its RTL implementation is synthesized at gate level with the same Low-power TSMC library. The overall design size is about 0.035 mm2 with a clock speed of 357 MHz. In table 6, we compare the results of the designed 3-way ASIP with those of the Leon3 processor executing the audio applications. With the described design methodology, the audio-speech 3-way ASIP is about 70% more efficient than the Leon3 processor.

Table 6: Audio ASIP vs Leon3 Silicon Efficiencies

5. CONCLUSION AND FUTURE WORK

The applied methodology allowed a fast Design Space Exploration and an efficient sizing of the key parameters. Our methodology has several limitations:

An entire benchmark cannot be manually coded in assembly in order to choose the correct architecture sizing. In the audio example, we modeled over 66% of the set of our benchmarks. This led us to the choice of a 3-way architecture, but we have no guarantee that the remaining 34% would not modify this choice.
Predicting the evolution of complexity is hardly possible when we are faced with complex system designs with hierarchical memories and complex network connections.

The designed VLIW ASIP was very efficient in terms of performance, but its Silicon Efficiency was significantly reduced by its chip area. We noticed that the Register File took over 45% of the overall area. In future work, we will focus on reducing the overall system silicon cost.

REFERENCES

[1] Karlheinz Brandenburg, Oliver Kunz, and Akihiko Sugiyama. MPEG-4 natural audio coding. Signal Processing: Image Communication, 15:423–444, 2000.

[2] M. Budagavi and J.D. Gibson. Speech coding in mobile radio communications. Proceedings of the IEEE, 86(7):1402–1412, July 1998.

[3] Andres Vega Garcia. Mécanismes de contrôle pour la transmission de l'audio sur l'Internet. PhD thesis, Nice-Sophia Antipolis University, October 1996.

[4] A.S. Spanias. Speech coding: a tutorial review. Proceedings of the IEEE, 82(10):1541–1582, October 1994.

[5] http://www.ibm.com/developerworks/linux/library/lgnuprof.html?ca=dgr-lnxw02gnuprofiler.

[6] N. Pinckney, T. Barr, M. Dayringer, M. McKnett, Nan Jiang, C. Nygaard, D. Money Harris, J. Stanley, and B. Phillips. A MIPS R2000 implementation. pages 102–107, June 2008.

[7] P. Bannon and J. Keller. Internal architecture of Alpha 21164 microprocessor. In Compcon '95: 'Technologies for the Information Superhighway', Digest of Papers, pages 79–87, March 1995.

[8] Karl V. Rompaey, Diederik Verkest, Ivo Bolsens, and Hugo D. Man. Coware - a design environment for heterogeneous hardware/software systems. EURO-DAC, pages 252–257, 1996.