Optimize the RISC/DSP Combo for Voice over IP

Optimize the RISC/DSP Combo for Voice over IP
By Dariush Baghbadrani , Communication Systems Design
January 3, 2003 (11:20 a.m. EST)
URL: http://www.eetimes.com/story/OEG20030103S0054

To view a PDF version of this article, click here.

The shift to packet-based networks such as voice-over-Internet Protocol and voice-over-digital subscriber line imposes a multitude of requirements on communications system design that traditional processors are hard-pressed to meet. The problem is that no single architecture can adequately perform all the networking tasks a RISC processor can, along with the signal detection and noise cancellation functions that are the bailiwick of a digital signal processor.

To date, the solution has been to provide both processor types, adding cost and complexity that is perhaps even overkill in some applications. Other options include adding DSP functionality to a RISC processor-or vice versa-as well as the expensive and often time-consuming design of an ASIC from scratch.

While each of these alternatives h as merits and drawbacks in terms of cost, flexibility and complexity, many of these trade-offs can be eliminated through the use of another, more-recent route to customized processing: the extensible processor. These devices allow designers to customize the instruction set for the application and several such processors are now available, including the Xtensa from Tensilica and the ARCtangent from ARC International.

The main advantages of the extensible approach derive from unified memory and a single software development environment, which can lead to a lower-cost alternative to the traditional dual-processor design. The extensible route also provides intellectual-property (IP) protection, since the customization is done in-house. However, before jumping on the extensible-processor bandwagon, the designer needs to fully understand what should and should not be added to the instruction set in order to take full advantage of this architecture.

Current VoIP Approaches
A typical voice-over-IP (VoIP) system uses a RISC processor to implement the networking protocols, perform bit- and byte-wide manipulation for decoding header and routing information, shape data into packets and handle sequencing tags to order packets for playback as voice. At the same time, the system needs a DSP to implement echo canceling, noise suppression and tone detection on the voice side. The DSP also performs silence detection and voice compression before forming the voice information into packets, as well as the reverse operations of decoding and decompression.

Traditionally, the answer to meeting these diverse needs has been to provide both processor types (Figure 1). The RISC processor connects to the digital network and acts as the protocol-processing engine and network manager. It can also handle main system control and the user interface in designs that have them. The DSP connects to the pulse code modulation packet interface and does the processing needed to achieve a robust link with voice compression/d ecompression capability.

While this is workable, the dual-processor approach has some drawbacks. The most obvious is the cost of having two processors, each with its own memory and peripheral chips. Also, each processor needs its own design tool chain, with nearly independent software development efforts. Data exchange between the two processors is also an issue as it requires some common memory—either dual-ported RAM or FIFO, as a buffer between the two processors, or else a complex software handshaking scheme is needed to control the data exchange.

Another drawback is that the scheme is overkill for smaller applications. Network data rates determine the lowest clock speed the RISC processor can run and still keep up, but the actual data handling for a single communications channel leaves the RISC processor idle much of the time. In large systems, the RISC can handle data streams from many chan nels, each with its own DSP section. In small systems, the excess RISC capability is wasted because it is not enough to absorb the DSP functions.

Single-processor options
Clearly, a single-processor design would provide a less-costly alternative. There are three traditional approaches to creating such a design. One is to use a RISC processor with DSP enhancements or, conversely, a DSP with RISC control enhancements. Another is to create a custom processor design, which is typically accomplished by developing an ASIC. Each solution has disadvantages.

Quite a few RISC processor architectures include DSP elements, but they don't go far enough. In most cases they add only a multiply-accumulate (MAC) unit, although a few include a barrel shifter as well. These additions will help speed some DSP tasks in control applications, but communications applications tend to need much more performance. The DSP needs of a VoIP system, for instance, require five times the processing power of the RISC. Becau se the architectures are fixed, the system designer faces the problem of balancing the DSP/RISC performance.

The RISC-enhanced DSPs tend to go too far the other way and not provide enough RISC performance. A number of enhanced DSPs are available from well-established companies but they typically target control applications requiring a minimal amount of bit manipulation. They simply can't keep up with the speeds of network protocols unless greatly oversized for their DSP needs.

In contrast to RISC, DSP processors tend to pose complex pipelined architectures that lend themselves to efficient computation of complex DSP operations. But simple control instructions don't benefit from such architectures and may be impeded, contributing to the DSP's inefficiency as a control processor. Further inefficiency stems from the tendency of DSPs to work only with full-word-width data. For communications protocol processing, where packets are typically composed of multiple 8-bit data words, this results in the DSP w orking with unpacked data, where only the last 8 bits are useful, even if the DSP can handle 16- or 32-bit words. While RISC processors can work with packed words, and manipulate each byte of a long word separately, DSPs attempting protocol processing need to unpack and repack data, adding software and memory overhead.

Standard chips do exist that have a balanced combination of RISC and DSP performance, but a look inside often reveals that such devices are simply a one-chip version of the traditional two-processor architecture; they still have independent RISC and DSP cores. Although these platforms have the benefit of unified real-time operating systems and tools, they still to some extent suffer the software development/optimization problems of the two-processor approach.

ASICs Too Risky
There is always the option of designing your own processor and having it built as an ASIC. But ASICs add considerably to the design cycle of the product for which you need the custom processor. Then ther e is the need for custom compilers, assemblers and debugging tools.

A further drawback with the ASIC approach is the difficulty in achieving the right balance of DSP and CPU processing power. The development tools available to an ASIC designer don't allow high-speed emulation of software execution. As a result, predicting the design's performance becomes difficult and iterating the design to tune performance becomes virtually impossible. Given that mask sets now cost several hundred thousand dollars to create, there is a huge penalty for getting the balance wrong.

But the ASIC approach is not the only way to get a customized processor. Extensible processors like the Xtensa and ARCtangent are core designs supported by development tools that make it possible to add custom instructions—and the hardware that executes them—to the base processor design. The design tools then automatically generate extensions to the compiler and the other software development tools, so that they will support the e nhanced core.

Other extensible architectures have also appeared, but many are user-extensible to simply support custom peripherals. Such designs have customizable memory configurations to match the target system-they do not support the addition of custom instructions.

This automatic tool generation carries a hidden, very attractive benefit: software IP protection. When you customize an extensible processor by adding instructions and the circuitry to execute them, you are adding op codes that are unique to your design. Because the software tools are automatically integrated, no one outside of the design team knows what any of the new op codes do. As a result, attempts to reverse-engineer a design by decompiling the software will fail, because the op codes are undefined in the standard processor.

The question then becomes, what do you need to add? There are basically three types of circuitry in a DSP that aren't available in a standard RISC processor: the MAC unit, the barrel shifter and saturatio n arithmetic logic. In addition, a DSP will offer unique memory-addressing schemes. Each type of circuit addresses the needs of a whole category of signal-processing tasks.

Start with a MAC
The MAC is the circuit most designers think of as the heart of a DSP, because many signal-processing algorithms have functions of the form x - Σa_i x b_i. Implemented with a standard RISC processor, such operations require a loop containing a sizable multiplication routine that burns lots of CPU time. The MAC implements the multiplication and subsequent addition in a single instruction, dramatically improving software execution speeds for such functions.

Another software accelerator is the barrel shifter. Shifting the bits of a binary word left or right serves many purposes. Logical shifts, which wrap bits around to the end of the word as they move, are useful for aligning subword fields so they can be readily evaluated. Arithmetic shifts preserve the most-significant bit, the sign bit of a binary number, while moving all the other bits left or right and inserting filler bits based on the direction of the shift. The effect is to divide or multiply a binary number by two each time it is shifted left or right, respectively. As a result, arithmetic shifts are useful for multiplication, division and normalization of binary numbers. As with the MAC, the advantage of the barrel shifter is that it saves a software loop because it performs any number of shifts in a single instruction; RISC processors can only move one bit at a time.

Saturation arithmetic logic comes into play when you want to avoid overflow or underflow while performing mathematical operations. In conventional RISC processors, addition and subtraction cause rollovers. This is fine for counters, where the rollover event signals the end of the count. For signal processing, however, it causes erroneous results if the rollover is misinterpreted as a sudden jump in numerical value. To avoid rollovers in standard RISC proce ssors, the programmer needs to add tests and corrective action to the arithmetic subroutines. With saturation logic, the corrective action is automatic and needs no additional tests or instructions.

The implementation of the saturation logic is particularly important when trying to design to international standards. The International Telecommunications Union and European Telecommunications Standards Institute offer reference vectors for determining the performance of algorithms such as vocoders. Customers expect designs to show a bit-by-bit match with the reference vector as a kind of certification that the algorithm has been implemented correctly.

To obtain this precise match, the saturation logic must correspond to the operations defined in the standards. In a 16 x 16 MAC operation, for instance, the standards expect the accumulation to limit the result to the extreme values available in a signed 32-bit number. Many MACs, however, have an accumulator wider than 32 bits to guard against over- or und erflow, then saturate the result when storing it in a destination register. This may produce a slightly different result than you get when saturating at 32 bits in the accumulator, a difference that customers may perceive as an error. An ITU-compliant MAC can perform either type of saturation.

Barrel Shifters Help
Similarly, the barrel shifter must conform to ITU/ETSI standards in order to produce bit-correct results. These shifters need to provide saturation, maintain the sign bit regardless of the shift operand's size and handle reverse shift. Software emulation of the standard implementation is a possibility if the shifter does not conform, but that ruins the advantage of a hardware implementation for the DSP instructions.

Not all DSP enhancements are instructions, though. For example, memory-addressing schemes, while not directly associated with a specific instruction, can influence how efficiently software handles some DSP operations. In DSPs, an X-Y memory allows an operation to add ress two source operands simultaneously, typically one from the X field and the other from the Y field, by using two addressing parameters. Simultaneous read/write operations are possible without bus arbitration and parallel data moves become simplified.

Having an X-Y memory system that can address both source operands from the same field (either X or Y) also offers the opportunity to save on memory access. Without this capability, software must copy data from one field to another, so that one operand can be fetched from the X field and the other from the Y field. The need to access data from within a single field often arises in the MAC operation a = Σr_i x r_k when the data sequence (r_n) is stored in a single field.

Having extra pointers-more than two pairs-further simplifies the programming task and reduces Mips consumption in large loops where a multitude of different data sequences is involved in "n" recursive computations. The results from these recursive computations may need to be written in "n" different memory segments for proper handling later in the algorithm. Storing in so many different segments becomes problematic when n is larger than two; having the additional pointers helps.

The X-Y memory also allows addressing modes that are particularly useful in DSP operations (Figure 2). Circular addressing is advantageous for creating buffers. Variable-offset updates help when implementing filters. The bit-reverse addressing simplifies a fast Fourier transform calculation by eliminating the need to move data when making the "butterfly" cross-calculation.

While memory addressing and special circuitry speed the signal-processing performance of a chip, they need to be added appropriately. There is, after all, a chip cost to pay for the additional circuitry. Developers need to understand just how much DSP capability they must have so they don't o verdesign and add unnecessary cost. Does the processor need one MAC or two, and how many bits wide? How long should the barrel shifter be if one is needed? How large an X-Y memory should be used? The answers to these questions can, at best, only be estimated at the beginning of the design. Wrong answers can result in inadequate performance on the one hand or excessive cost on the other.

This area of uncertainty is where the extensible processor can be shown to have advantages. The development tools for creating the core allow software simulation of the core's performance in running code regardless of the extensions in use. Therefore, developers have the option of running sample code on the simulation and assessing the performance impact of adding extensions.

Fine-tuning
The process is iterative. Using the simulator in conjunction with a software execution profiler, you identify the bottlenecks-the sections of code where your software spends most of its time. Then, determine if a custom ins truction can save CPU cycles. If so, add that instruction to the processor's configuration and run the simulation again.

A representative execution profile for several ITU/ETSI DSP instructions is seen in Figure 3. It shows the number of times per second each DSP instruction gets used in implementing each of two standard vocoders. The MAC functions, as might be expected, show the greatest use, several million times per second. The shift instructions L_ASL and L_ASR are the next most common. Both of these are candidates for implementation as DSP extensions to the RISC core to boost system performance. They operate on 32-bit input operand "B," and so can provide the same functionality as ASL and ASR, respectively, since the latter operate on 16-bit source operand "b." Implemented in RISC alone, each operation would require a number of steps, so the custom instructions can be the equivalent of adding hundreds of Mips to the processor's performance.

Adding a custom instruction, however, has an effect on chip cost as well as on execution speed. It is therefore important to evaluate what the cost impact of the extra instruction will be. Most extensible-processor design tools will make an estimate of die size as you add instructions. Because the cost of an IC is largely determined by its die size, this estimate can help you derive an approximate chip cost.

Once you have assessed the cost/performance trade-off, you repeat the process with the next most promising bottleneck. In the example shown in Fig. 3, high-use operations such as normalization and rounding look like promising candidates. Once the desired performance level is achieved, you can freeze the design and generate the software development tools needed for creating the final programs. You can later fine-tune the processor design using the code you have developed.

Related Articles

"FPGA/DSP Blend Tackles Telecom Apps "; www.commsdesign.com/story/OEG20020628S0097
"Handling VoIP Speech Coding Challenges"; www.commsdesign.com/story/OEG20021016S0005

About the Author
Dariush Baghbadrani (Dariush.Baghbadrani@arc.com) is principal DSP engineer at ARC International, in the U.K. He has a PhD from Manchester University, a master's from Loughborough University and a BSc EE from Newcastle Upon Tyne Polytechnic.