Voice control enhances appliance apps

By Richard Mensik, System Application Engineer, Motorola Czech Systems Laboratories, Czech Republic, EE Times
November 15, 2001 (4:42 p.m. EST)
URL: http://www.eetimes.com/story/OEG20011115S0066

The digital processing capabilities of microcontrollers have enabled voice control to penetrate embedded systems. These "new" microcontrollers, sometimes called embedded digital signal processors or DSP controllers, have sufficient performance for real-time speech processing, and they integrate almost all needed control peripherals on one piece of silicon. Now practically any new system equipped with DSP controllers can be controlled by voice.

"Voice control" implies that the system will recognize only a limited command set, not fluent speech. That limitation markedly decreases memory and performance requirements compared with those needed for fluent-speech recognition. "Embedded," meanwhile, implies a single-chip solution.

The open question in embedded voice control is speaker dependence. Speaker independence needs more memory and more processor performance. Here we focus on speaker-dependent systems, which are more suitable for embedding; further, speaker dependency is sometimes required to improve the security of the system.

Two basic choices are available in recognition algorithms: dynamic time warping and hidden Markov models. DTW has lower requirements on hardware. HMMs are more complex; they yield better recognition scores but also require more speech data in the training phase. For the purposes of this article, we assume DTW will be the preferred choice.
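
To make the DTW option concrete, here is a minimal Python sketch of the classic dynamic-programming formulation. It assumes each utterance is a sequence of feature vectors and that a Euclidean frame-to-frame distance is acceptable. The names `dtw_distance` and `recognize` and the toy templates are illustrative only; an embedded implementation would more likely be fixed-point C with a constrained warping path.

```python
import math

def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # reuse the frame of b
                                     cost[i][j - 1],      # reuse the frame of a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

# Toy one-dimensional "features"; a real system would use e.g. cepstral vectors.
templates = {
    "light": [(0.0,), (1.0,), (2.0,), (1.0,)],
    "dark": [(2.0,), (2.0,), (0.0,), (0.0,)],
}
spoken = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,)]  # a slower "light"
print(recognize(spoken, templates))
```

The full cost matrix kept here is the "DTW matrix" whose size drives RAM use; production code often keeps only two rows of it at a time.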

Even with a limited command set and speaker dependence, memory remains the most important limiting factor. With common speech parameterization, approximately 1k x 16 of storage is usually needed per command (assuming commands about 1 second long). The speech data is created by the user during the training of the recognition system.

Since the data must be stored permanently, the use of flash memory is recommended. The recognition software could be stored in ROM, but flash is more suitable because speech-recognition algorithms are continuously under development. As recognition scores increase, occasional upgrades should be expected.

The input buffers will use RAM (with a size equal to the speech-frame size), as will the buffer for the recognition algorithm, or DTW matrix. Assuming the system will have 30 commands, one user and a sampling frequency of 8 kHz, a rough estimate of data memory consumption would be 32k x 16 of flash and 8k x 16 of RAM. A minimal implementation would need about 32k x 16 of program memory, but that figure depends strongly on the processor instruction set and compiler efficiency.

It's difficult to specify requirements on processor performance because of the well-known difficulty of measuring that performance. Traditional units like Mips or MACs (multiply-accumulates per second) do not cover all performance aspects. Generally, however, about 20 Mips is sufficient on most DSP controllers. The relationship between recognition scores and processor performance is unambiguous: Higher processor performance allows the system to reach better scores.

The most suitable method for measuring processor performance is algorithm kernel benchmarking. Relevant algorithms for speech recognition include real fast Fourier transforms, computation of linear predictive coefficients and finite/infinite impulse response (FIR/IIR) filters. Benchmarking results for those computations determine the chip's ability to perform real-time parameterization of the speech signal. That capability is pivotal for digital speech processing.
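
As an illustration of what a kernel benchmark exercises, the following Python sketch implements a direct-form FIR filter and times it on one frame. The helper names `fir_filter` and `benchmark` are hypothetical, and host timings say nothing about a DSP's cycle counts; on a target, the same kernel would be timed in clock cycles using the vendor's tools.

```python
import time

def fir_filter(samples, taps):
    """Direct-form FIR: each output costs one multiply-accumulate per tap."""
    history = [0.0] * len(taps)
    out = []
    for s in samples:
        history = [s] + history[:-1]   # shift the delay line
        acc = 0.0
        for h, x in zip(taps, history):
            acc += h * x               # the MAC that DSP hardware accelerates
        out.append(acc)
    return out

def benchmark(kernel, *args, repeats=5):
    """Best wall-clock time of several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(*args)
        best = min(best, time.perf_counter() - start)
    return best

taps = [0.25] * 4        # 4-tap moving average
frame = [1.0] * 128      # one 128-sample frame
print(fir_filter(frame, taps)[:5])
print(f"{benchmark(fir_filter, frame, taps) * 1e6:.0f} us per frame on this host")
```

The inner loop is the reason MAC throughput is quoted for DSPs: a 128-sample frame through a 4-tap filter costs 512 multiply-accumulates.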

Some chip producers, such as Motorola and Texas Instruments, have announced new cores for their original DSP controllers that markedly increase performance. Peripherals should be preserved so that application software will not need any modifications (assuming it was not written in assembler), leaving developers free to focus on improving recognition algorithms or on reusing more-demanding recognition algorithms from other platforms without time-consuming optimizations.

The data width and type of processor arithmetic are also important factors. Naturally, higher precision (greater data width, floating-point arithmetic) improves the recognition score, but if a low-cost solution is required, then the optimal data width will probably be 16 bits.

Among on-chip peripherals, an analog-to-digital converter, general-purpose I/O pins and communications interfaces (such as the I2C serial interface and IrDA) should be required. The minimum precision of the A/D converter should be 8 bits, with a minimum sampling frequency of 8 kHz. The appropriate number of A/D channels depends on the application. DSP controllers for motor control usually have more A/D channels (eight to 16) and sampling frequencies well above what speech processing needs.

Most contemporary speech-recognition software is written in C, so the presence of a C compiler for the chosen DSP controller is virtually mandatory; and considering the software complexity, an integrated debugging environment is similarly all but required. Several companies have already released software packs for embedded speech recognition. The software is optimized in size and performance for embedded platforms but will usually need some adaptation to a specific architecture. Some packages have been adapted to several architectures, but they cover only small parts of the DSP controller market, so the adaptation process is markedly accelerated when the DSP controller has its own software development kit (SDK) or a signal-processing library optimized for the specific processor.

An example of an embedded voice-control system would be one for lighting and heating control. The solution can be generalized to the control of any device with a limited command set; however, one noticeable limitation of voice control is in audio/TV systems, because it is difficult to separate the user-command signals from the other audio signals.

The proposed application is based on Motorola's DSP 56805 because that controller's low cost and configuration allow for the creation of system-on-chip solutions. The chip has eight time-multiplexed A/D channels with 12-bit precision and a 32k x 16 internal program flash memory, which is sufficient for complete voice-control software. Data memory must be external because at present no DSP controller has sufficient internal data memory for speech recognition.

The consumed performance bandwidth can be estimated by measuring the execution time of the FFT, which is the most time-consuming operation and must be performed in real time. If an 8-kHz sampling frequency is used with 12-bit quantization (stored in 16 bits) and 16-millisecond segmentation with 50 percent overlap, a new frame starts every 8 ms, so 125 frames per second will need to be processed. If common speech parameterization (such as cepstral coefficients) is used, one FFT and one inverse FFT will be processed on a 128-point frame (corresponding to 16 ms at 8 kHz). That computation consumes about 50,000 clock cycles on a DSP 56805 using Motorola's SDK signal-processing library.
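
The framing and the FFT/inverse-FFT pair described above can be sketched in Python as follows. This is an assumption-laden illustration: it computes the real cepstrum (inverse FFT of the log magnitude spectrum), which is only one of several cepstral variants; it omits the window function a real front end would normally apply; and the function names and the synthetic test tone are invented for the example.

```python
import cmath
import math

FS = 8000          # sampling frequency in Hz (from the text)
FRAME = 128        # 16 ms at 8 kHz
HOP = FRAME // 2   # 50 percent overlap, so a new frame every 8 ms

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. Unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def real_cepstrum(frame, n_coeffs=8):
    """First n_coeffs real-cepstrum values: IFFT(log|FFT(frame)|)."""
    spectrum = fft(frame)
    log_mag = [math.log(abs(s) + 1e-12) for s in spectrum]  # floor avoids log(0)
    ceps = fft(log_mag, inverse=True)
    return [c.real / len(frame) for c in ceps[:n_coeffs]]   # apply 1/N scaling

def frames(signal):
    """Split a signal into FRAME-sample frames that overlap by half."""
    return [signal[i:i + FRAME] for i in range(0, len(signal) - FRAME + 1, HOP)]

# One second of a synthetic 440 Hz tone stands in for microphone samples.
signal = [math.sin(2 * math.pi * 440 * n / FS) for n in range(FS)]
features = [real_cepstrum(f) for f in frames(signal)]
print(len(features), "frames of", len(features[0]), "coefficients")
```

An isolated one-second buffer yields 124 complete frames rather than 125, because the final hop has no full frame left; in continuous operation the rate is FS / HOP = 125 frames per second.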

If 50 percent is kept in reserve for additional computations, we will need 50,000 x 1.5 x 125 frames = 9,375,000 clock cycles per second, which means about 12 percent utilization at an 80-MHz clock frequency. The reserve can later be applied to more sophisticated speech-processing algorithms.
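
The budget arithmetic is easy to rerun for other frame sizes or clock rates. The short Python sketch below reproduces the figures from the text; the function name `cycle_budget` and its parameter names are illustrative.

```python
def cycle_budget(frame_ms=16, overlap=0.5, cycles_per_frame=50_000,
                 reserve=0.5, clock_hz=80_000_000):
    """Return (frames per second, cycles per second, clock utilization)."""
    hop_ms = frame_ms * (1 - overlap)        # 8 ms between frame starts
    frames_per_second = 1000 / hop_ms        # 125 frames/s
    cycles_per_second = cycles_per_frame * (1 + reserve) * frames_per_second
    return frames_per_second, cycles_per_second, cycles_per_second / clock_hz

fps, cycles, utilization = cycle_budget()
print(f"{fps:.0f} frames/s, {cycles:,.0f} cycles/s, {utilization:.1%} of the clock")
# prints "125 frames/s, 9,375,000 cycles/s, 11.7% of the clock"
```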

Another recommendation is to use third-party speech-recognition software because the function requires highly specialized know-how. The difficulty lies not in developing the algorithm itself but in making it resistant to noise so that it achieves a high recognition score in a wide class of environments.

A secondary issue is the measurement of the recognition score. If the developer must reliably prove that the recognition score is, for example, higher than 99 percent, then it will be necessary to perform several hundred recognitions with different speakers in different environments. That evaluation is usually performed using a speech database stored on disk or CD, so additional hardware is needed.

Memory requirements impose restrictions on the command-set size. If we apply speech parameterization as in the example, with eight coefficients per speech frame and 125 frames per second, we will need 1,000 data words per second of speech. A minimal command set of about 30 seconds of speech will need 30,000 data words. The Motorola DSP56805 can directly address 2 x 64k x 16 data words.

The command set must be designed with respect to the speaker dependence of the algorithm employed. Assuming that the proposed system will be used by four people, the command set could be the following: the words "light," "dark," "heat" and "cold," recorded individually by each user; the numbers zero through nine; and "time" and "temperature," recorded by one authorized person. Speaker dependence represents an advantage when controlling devices by phone.
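
A quick Python estimate shows how this command set maps onto the memory figures above. It assumes an average utterance of about 1 second and takes "64k" to mean 64 x 1024 words; both are assumptions, and `template_words` is a hypothetical helper.

```python
WORDS_PER_SECOND = 8 * 125          # 8 coefficients x 125 frames/s (from the text)
ADDRESSABLE_WORDS = 2 * 64 * 1024   # "2 x 64k x 16", taking k = 1024 (assumption)
SECONDS_PER_COMMAND = 1.0           # assumed average utterance length

def template_words(users, personal_words, shared_words):
    """Return (number of templates, data words needed to store them)."""
    templates = users * len(personal_words) + len(shared_words)
    return templates, int(templates * SECONDS_PER_COMMAND * WORDS_PER_SECOND)

personal = ["light", "dark", "heat", "cold"]                    # per user
shared = [str(d) for d in range(10)] + ["time", "temperature"]  # one speaker
count, words = template_words(4, personal, shared)
print(count, "templates,", words, "words,",
      f"{words / ADDRESSABLE_WORDS:.0%} of addressable data memory")
```

Four users give 28 templates and roughly 28,000 data words, in line with the 30,000-word estimate for a minimal command set.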

All controlling functions of the proposed system should be accessible both by voice input and manual switch or keypad because of the possibility of noisy environments. Each microphone has a corresponding switch or button and lamp unit. The arbitration process, which chooses the controlled device, is performed by software. The DSP continuously samples all A/D inputs; the associated software process decides whether there is speech on an input and assigns the corresponding device.
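
The article does not say how the arbitration software decides that speech is present. One simple possibility, shown here purely as an assumption, is to compare the short-term energy of each microphone channel against a threshold and pick the loudest channel. The names `frame_energy` and `arbitrate` and the threshold value are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def arbitrate(channel_frames, threshold):
    """Return the index of the loudest channel above threshold, or None."""
    best, best_energy = None, threshold
    for index, frame in enumerate(channel_frames):
        energy = frame_energy(frame)
        if energy > best_energy:
            best, best_energy = index, energy
    return best

# Toy 12-bit samples from three microphones; only channel 1 carries speech.
channels = [
    [3, -2, 4, -1],
    [900, -850, 700, -950],
    [40, -35, 30, -45],
]
print(arbitrate(channels, threshold=10_000.0))   # index of the active microphone
print(arbitrate(channels[:1], threshold=10_000.0))  # None: nothing above threshold
```

A fixed energy threshold is fragile in noisy rooms, which is one reason the manual switch for each microphone remains useful.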

The A/D converter on the DSP 56805 has two channels, each multiplexed to four pins, so that eight sources of analog signals can be connected in total. One A/D pin can be optionally connected to a phone line via a subscriber-line interface circuit (SLIC), and one pin is linked to a temperature sensor. The six remaining pins can then be connected to microphones. The A/D resolution is 12 bits with a maximal sampling frequency of 800 kHz, performance that makes it possible to do time-multiplexed sampling of all eight A/D channels. To minimize memory requirements, a sampling frequency of 8 kHz is recommended for all speech channels. Because of that low sampling frequency, low-frequency microphones or anti-aliasing filters (low-pass filters) should be used.

Heating can also optionally be controlled by phone. The recognition process will be the same, but implementing an answering function is recommended. If the speech reference is recognized, the DSP sends it back as an audio signal via the phone line to confirm the validity of the recognition. The SLIC connects the DSP to the phone line; the internal A/D converter is used on the input side and an external digital-to-analog converter on the output side.

The D/A converter and DSP are connected via SPI. Modem functions are provided by software.