Solving AMR Speech Codec Porting Challenges
The adaptive multi-rate (AMR) speech codec was introduced at the end of 1998 as a new codec for the GSM wireless network and as the mandatory speech codec for third-generation (3G) wideband CDMA (W-CDMA) systems. As of the beginning of 2004, AMR was the de facto codec for existing GSM systems and a key component in the rollout of 3G wireless designs.

But while widely adopted, implementing the AMR codec is not an easy task. Because AMR was defined by the European Telecommunications Standards Institute (ETSI), it is supplied with reference C code and test vectors. At first glance, porting this code to a specific processor architecture would seem to involve simply compiling the provided code and optimizing key sections until the desired performance metrics are met, using the test vectors to ensure compatibility. However, as with many engineering projects, a more detailed study of the problem reveals a number of complex factors that must be analyzed and balanced to create a robust implementation.

In this article, we'll examine the factors that must be considered when porting the AMR codec to a digital baseband processor featuring a digital signal processor (DSP) core, an ARM7TDMI core, and a variety of memories, buses, and peripherals. While we will look at some of the steps followed in porting the AMR speech codec to this specific processor, they are general enough to apply to almost any modern DSP architecture.

Converting ETSI basicops

Instead of using native definitions for basic computational operations, ETSI precisely defines their behavior through explicit function calls. For example, the typical method of adding two 16-bit numbers together in C (held in variables val1 and val2, with the result placed in val3) would be:

    val3 = val1 + val2;

However, if this operation were called for in an algorithm defined by ETSI, it would look like this:

    val3 = add(val1, val2);

The function add is defined in basicop2.c in the AMR reference code. Inside this function, the addition is performed along with a specific form of fixed-point saturation defined by ETSI. A wide range of basic mathematical operations such as addition, subtraction, multiplication, and division, along with bit-manipulation and shifting operations, are defined; whenever one of these operations is required by the algorithm, the appropriate function call is made.

There are two major reasons for ETSI's precision in defining these basic operations. First, by explicitly defining how all mathematical operations are handled, it is possible to guarantee complete compatibility and bit-exactness across all target architectures, whether a simple fixed-point DSP or a microprocessor sitting inside a PC. Second, by placing every computational operation in a function call, ETSI provides a simple method of counting the number of mathematical operations required to process a frame of data. This makes it possible to estimate the number of cycles it takes to run the AMR speech codec on a particular architecture before an optimized port begins. ETSI refers to this figure as weighted millions of operations per second (WMOPS). Each basicop function is independently weighted according to the complexity of the underlying operation, and the weightings can be hand-tuned so that the final calculated WMOPS value more closely reflects the achievable performance on a particular processor.
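To make the saturation behavior concrete, here is a minimal sketch of an ETSI-style saturating 16-bit add. It mirrors the behavior of add in basicop2.c but is simplified: the reference function also maintains a global overflow flag, which this sketch omits.

    #include <stdint.h>

    typedef int16_t Word16;  /* ETSI 16-bit type */
    typedef int32_t Word32;  /* ETSI 32-bit type */

    /* Sketch of an ETSI-style saturating add: compute the sum with
       32-bit headroom, then clamp it to the 16-bit range. */
    Word16 add_sketch(Word16 var1, Word16 var2)
    {
        Word32 sum = (Word32)var1 + (Word32)var2;

        if (sum > 32767)
            sum = 32767;          /* saturate at the positive limit */
        else if (sum < -32768)
            sum = -32768;         /* saturate at the negative limit */

        return (Word16)sum;
    }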
One consequence of ETSI's reliance on basicops is that it is advantageous for a DSP architecture to support ETSI's definitions of the basic mathematical operations in hardware. If the hardware does not support the specific definition of a particular operation, a software routine that reproduces the basicop function's behavior is necessary, greatly decreasing overall performance.

One very important method of decreasing the processing load of the AMR speech codec is to replace the function calls to the mathematical operations with something more computationally efficient. Depending on the DSP architecture and tool set chosen, there may be hardware and software support for C compiler intrinsics. An intrinsic is a compiler-specific feature that allows direct mapping of C functions onto native assembly language. Intrinsics look like normal C function calls, but they are translated directly into assembly language by the compiler. A good method for this transformation is to use the C preprocessor to substitute the calls to ETSI basicops with the appropriate intrinsic functions, enabling the mapping of the function calls to their intrinsic nomenclature via a predefined compiler macro. If this is done, it is possible to use common C source code for both the DSP and the PC, optimizing the code that runs on the DSP while keeping the important profiling features of the basicops on the PC.
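A minimal sketch of such a mapping layer is shown below. The macro name DSP_BUILD and the intrinsic names (__sat_add16 and so on) are hypothetical placeholders; the actual names depend on the tool chain being used.

    /* basicop_map.h -- sketch of a preprocessor mapping layer.
       When DSP_BUILD (a hypothetical build flag) is defined, calls to
       the ETSI basicops are rewritten into compiler intrinsics; on the
       PC build, the reference basicop2.c functions, with their WMOPS
       counters, are used unchanged. */
    #ifdef DSP_BUILD
    #define add(a, b)   __sat_add16((a), (b))   /* hypothetical intrinsic */
    #define sub(a, b)   __sat_sub16((a), (b))   /* hypothetical intrinsic */
    #define mult(a, b)  __sat_mult16((a), (b))  /* hypothetical intrinsic */
    #endif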
Optimizing Memory Usage

Placing all temporary variables on the stack limits the ability to exploit both slow and fast memory, as all stack variables will reside in the same memory segment. Additionally, stack usage forces variables to be placed in sequence and does not allow full use of multiple memories for dual data accesses. For these reasons, it is important to minimize stack usage, especially for large and frequently accessed buffers.

One common method of reducing dependence on the stack is a dynamic memory allocator such as the ANSI C standard function malloc. The malloc function allocates memory on a heap, which is similar to a stack. One big advantage of the heap over the stack is that some implementations of malloc allow for multiple heaps, providing a method of placing variables in different memory segments based on how frequently they are accessed.

One potential problem with relying on the heap is that the programmer must manually free allocated memory when it is no longer needed. If this is not done properly, a memory leak can occur, where a portion of memory is not freed at the end of algorithm execution. One method of minimizing the possibility of memory leaks is to create a scratch memory structure by analyzing how memory is allocated in each function and then placing those allocations in a C structure. In this case, the high-level API calls malloc once to allocate enough space for the entire algorithm.

A simple example of how a program can be converted from standard stack-based memory allocation to one that uses a single allocation and a scratch structure is given in Listing 1 and Listing 2. The data memory requirements of the two listings are the same. Because the two subfunctions in Listing 2 have their scratch arrays unioned together, they take up only as much memory as the larger of the two arrays. This is similar to how the stack would overlay the local memory allocations of two functions at the same call level. Listing 2 shows how calls that reside at the same level can safely have their scratch memory unioned together.
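As a rough illustration of the Listing 2 approach, the sketch below unions the scratch areas of two sibling subfunctions; the buffer sizes and function bodies are assumptions for illustration, not taken from the original listings.

    #include <stdlib.h>

    typedef short Word16;

    /* Per-function scratch areas (sizes are illustrative). */
    typedef struct { Word16 buf[80];  } scr_sf1_t;
    typedef struct { Word16 buf[160]; } scr_sf2_t;

    /* Because subfunc1 and subfunc2 run at the same call level and never
       at the same time, their scratch areas can overlay each other in a
       union, just as the stack would overlay their local variables. */
    typedef struct {
        union {
            scr_sf1_t sf1;
            scr_sf2_t sf2;
        } u;
    } scratch_t;

    static void subfunc1(scr_sf1_t *scr) { scr->buf[0] = 0; /* ... */ }
    static void subfunc2(scr_sf2_t *scr) { scr->buf[0] = 0; /* ... */ }

    static void algorithm(scratch_t *scr)
    {
        subfunc1(&scr->u.sf1);
        subfunc2(&scr->u.sf2);
    }

    int main(void)
    {
        /* The high-level API allocates the entire scratch area once. */
        scratch_t *scr = malloc(sizeof *scr);

        if (scr != NULL) {
            algorithm(scr);
            free(scr);   /* one allocation, one free: no leak */
        }
        return 0;
    }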
If the algorithm had another level of calls within it (that is, if subfunc1 and/or subfunc2 themselves made function calls), another level of structures (and potentially unions, depending on the exact call pattern) would be placed within the scr_sf1 and scr_sf2 structures.

On many DSP architectures, multiple data accesses along with computational operations are possible within a single cycle; however, the buffers involved must reside in different memory banks. One simple method of enabling a wide range of dual data accesses is to place the scratch structure and the state structure (variables whose values must be maintained from frame to frame for as long as the algorithm is operational) in different data banks. While this covers many instances where multiple data accesses are desirable, there are cases where additional optimizations are possible if multiple elements of the scratch structure can be accessed in the same cycle. For instance, a number of correlation functions call for multiple accesses to different locations within the same scratch memory buffer. To enable dual data accesses in this situation, a secondary buffer is created and forced to reside in the opposite bank from the main scratch structure. At the start of the correlation functions, the portion of the buffer to be processed is copied from the main scratch structure to the alternate scratch structure. Once the copy completes, the correlation function begins and data is accessed in a far more efficient manner, optimizing execution time for the function (Figures 1 and 2).
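A sketch of this copy-then-correlate pattern is shown below. The function and buffer names are illustrative, and how a buffer is pinned to a particular bank (pragma, section attribute, linker command file) is tool-chain specific.

    #include <string.h>

    typedef short Word16;
    typedef long  Word32;

    /* Secondary buffer, placed in the bank opposite the main scratch
       structure (bank placement itself is done via tool-chain-specific
       pragmas or the linker command file). */
    Word16 alt_scratch[160];

    /* Correlate x[i] with x[i - lag]; x is assumed to have at least
       lag samples of valid history before it. */
    Word32 correlate(const Word16 *x, int lag, int n)
    {
        Word32 acc = 0;
        int i;

        /* Copy the delayed samples into the opposite-bank buffer. */
        memcpy(alt_scratch, x - lag, (size_t)n * sizeof(Word16));

        /* x[] and alt_scratch[] now reside in different banks, so a DSP
           supporting dual data accesses can fetch both operands of each
           multiply-accumulate in a single cycle. */
        for (i = 0; i < n; i++)
            acc += (Word32)x[i] * (Word32)alt_scratch[i];

        return acc;
    }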
Native Porting

The question of when to stop porting from C to assembly can be difficult to answer, especially if there are no hard guidelines on required performance or memory usage. If hard guidelines are available, a good approach is to run the test vectors, search for the functions that consume the most processor cycles, port those to assembly, and benchmark again. Before embarking on the optimization process, however, it is important to examine the WMOPS value to determine whether the desired performance is even reachable. If the WMOPS value is very close to the target millions of cycles per second (MCPS) value, chances are it will be extremely difficult or impossible to reach that target, as WMOPS does not account for inefficiencies due to limited register sets or instruction/data caches. In this case, it would be wise to reconsider the target DSP architecture or the original performance requirement.

If a specific performance metric is not known, it is harder to decide how far to proceed in porting the C code to assembly. Many modern DSP tool chains include sophisticated profiling features for determining which functions consume the most cycles and where the porting effort is best focused. In the case of the AMR speech codec, hand-optimizing only 15% of the code reduced total MCPS by 50%. This illustrates the value of focusing effort on the functions that show the highest cycle consumption in the profiling process. For the same reason, the gains in performance diminish rapidly as more functions are ported from C to assembly language (Figure 3).

It is also important to consider the WMOPS value when a specific desired performance metric does not exist. In this case, the WMOPS value gives an estimate of the absolute best-case performance of a full custom port of the code to assembly language. For example, if the measured value for a particular DSP architecture is 12.5 WMOPS and the currently measured performance is 14.7 MCPS, a total port to assembly language can gain at most 2.2 MCPS.

Estimating Worst-Case Cache Performance

With caches, there is no longer a simple one-to-one correspondence between the code being executed and the time it takes to execute. Additionally, interrupts or pre-emptive execution of other code can flush a previously full cache, increasing execution time in a live system. Determining the worst-case execution time is very important for algorithms with hard real-time constraints, such as the AMR speech codec, which must complete its processing in 20 ms or audible artifacts occur.

In older wireless handsets, it was common for the software modules to have a single known priority level and memory mapping at runtime, because handsets were typically single-purpose products designed to handle only voice communications. With the advent of mobile multimedia and messaging services, however, the priority level of the AMR speech codec can change based on real-time requirements. It is therefore important to benchmark the algorithm across a wide variety of memory layouts to give the system designer a full understanding of the MCPS requirement under many profiles. It is also important to mimic the effects of flushing the instruction and data caches while calculating isolated performance metrics. As the priority level of the speech codec is lowered, it becomes more likely that the codec will be pre-empted by another algorithm, potentially causing a portion of the cache to be invalidated. In testing the AMR speech codec, a baseline of a complete instruction and data cache flush was performed at the start of each frame. This provided padding in the MCPS calculations to help ensure safe performance in a live application.
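A minimal sketch of such a per-frame measurement loop is shown below. The hooks flush_icache, flush_dcache, read_cycle_counter, and amr_encode_frame are hypothetical placeholders for whatever cache-control and cycle-counting facilities the target platform provides.

    #include <stdio.h>

    /* Hypothetical platform hooks: every DSP tool chain exposes cache
       control and cycle counting differently. */
    extern void flush_icache(void);
    extern void flush_dcache(void);
    extern unsigned long read_cycle_counter(void);
    extern void amr_encode_frame(const short *pcm, unsigned char *bits);

    #define NUM_FRAMES 100
    #define FRAME_LEN  160   /* 20 ms of speech at 8 kHz */

    void benchmark_worst_case(const short *pcm, unsigned char *bits)
    {
        unsigned long worst = 0;
        int i;

        for (i = 0; i < NUM_FRAMES; i++) {
            unsigned long t0, t1;

            /* Baseline: start every frame with cold caches, as if a
               higher-priority task had just evicted everything. */
            flush_icache();
            flush_dcache();

            t0 = read_cycle_counter();
            amr_encode_frame(pcm + i * FRAME_LEN, bits);
            t1 = read_cycle_counter();

            if (t1 - t0 > worst)
                worst = t1 - t0;
        }

        printf("worst-case cycles per frame: %lu\n", worst);
    }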
Wrap Up

About the Author