SoC boosts speech-recognition systems

By Zhizuo Yang & Jia Liu, Tsinghua University, Beijing, China
Eric Chan, Lim Cheow Guan, Chen Kim Chin, Infineon Technologies AG

As speech compression/decompression and recognition technologies mature, speech processing is becoming an important form of man-machine interface. Today, most such systems are found in laboratories, are large and complex, and are usually implemented on computer platforms.

The primary focus of these speech-recognition systems is large-vocabulary continuous speech. Their speech encoding/decoding algorithms are so complex that they depend on a PC’s computing power. Such hardware requirements are incompatible with the needs of portable, low-power and low-cost embedded systems.

To date, the Uni-Speech SoC has shown good performance in accurate speech recognition and low-rate high-quality speech encoding/decoding. However, the economics and power consumption of the current solution limit its potential in portable speech-recognition applications. This opens up opportunities for Uni-Lite, an SoC that addresses cost and power consumption.

The Uni-Lite chip architecture comprises a 16bit DSP core, on-chip ROM and RAM, an embedded ΔΣ ADC/DAC with their respective I/O analog channels, and other common interfaces.

Uni-Lite is an SoC embedded with speech-processing firmware that can be developed into an application without the need for secondary on-board devices.

Hardware architecture

The Uni-Lite contains not only a DSP core and a codec, but also the analog I/O channels that connect to the microphone and speakers. It also integrates on-chip RAM and ROM, and communication with external devices is handled through an on-chip UART, GPIO lines and SPI. With the exception of the power supply, microphones and speakers, all the system hardware components are integrated into a single chip (Figure 1).

The OAK DSP Core is a 16bit (data and program buses) high-performance fixed-point DSP core. All the I/O units are connected to it through intermediary digital logic. The DSP core is designed to operate at up to 104MHz. However, because the design targets portable use, the operating speed can be programmed to various rates, from complete hibernation to anywhere in the range 1-104MHz. In terms of computational capability, the DSP core provides a throughput of 1 Dhrystone MIPS per MHz of operation.

On-chip memory consists of RAM and ROM. Most of the program code and data used by the algorithms are stored in ROM to minimize silicon area and hence the cost of the SoC. The application layer resides in RAM, which keeps the whole system flexible across applications.

The codec consists of a 12bit on-chip DAC and a 12bit ADC. The sampling rate of the ADC and DAC can be programmed to either 8kHz or 16kHz, which allows speech processing at different frequency bandwidths. The input analog channel of the ADC has a programmable gain amplifier (PGA) with a gain range of 0 to 42dB, while the output analog channel of the DAC has a PGA with a range of 0 to -18dB and a programmable band pass filter of 2.5-10kHz.
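
The gain ranges above are quoted in decibels. As a rough illustration, the sketch below converts the range limits to linear amplitude ratios, assuming the figures are voltage gains (the usual convention for a PGA, but not stated in the article). The function name is hypothetical.

```python
def db_to_amplitude_ratio(gain_db: float) -> float:
    """Convert a gain in dB to a linear voltage (amplitude) ratio."""
    return 10.0 ** (gain_db / 20.0)


# Range limits quoted for the ADC input PGA and the DAC output PGA.
adc_max_gain = db_to_amplitude_ratio(42.0)   # roughly 126x amplification
dac_min_gain = db_to_amplitude_ratio(-18.0)  # roughly 0.126x (attenuation)

# Number of distinct codes available from a 12bit converter.
codes_12bit = 2 ** 12
```

Under that assumption, the input PGA can amplify a microphone signal by up to about 126 times before it reaches the ADC.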

A power management unit provides general-purpose power-management and clock-rate control. Because power consumption is closely related to the system clock rate, the unit can control the clock inputs to all major blocks of the design. It can also put the chip into hibernation until the system is woken up by one of the defined wake-up sources.

Figure 1: The Uni-Lite integrates a DSP core, a codec, analog I/O channels for the microphone and speakers, on-chip RAM and ROM, and UART, GPIO and SPI interfaces.

Figure 2: The entire software system is partitioned into three levels—application, service and driver.

A watchdog timer is also available. It enables the system to recover from unexpected endless loops and hang-ups, which enhances the chip’s operational reliability.

An SPI interface enables the use of external SPI memories. In Uni-Lite, it serves two purposes. First, it is the source from which firmware is downloaded at system boot-up: the system loads the first block of the SPI EEPROM contents into the internal program RAM, with the aid of either a hardware bootloader or a software bootloader. This dual-bootloader design helps the system start up reliably. Second, the SPI interface can function as a general-purpose interface to external serial memories.

The GPIO provides general-purpose serial communications and control capabilities, and also acts as a wake-up source when the system is in sleep mode. The typical rate of operation is 10MHz.

Power consumption

For an SoC designed for portable systems, size and power consumption are the two most important design factors. The Uni-Speech SoC was developed specifically for advanced stationary speech-recognition applications. Uni-Lite, which was jointly developed by Infineon Technologies and Tsinghua University, instead aims at a seamless system that combines an advanced speech-recognition algorithm with a cost-efficient SoC.

Power consumption is tightly correlated to the operating clock rate in any digital system. The DSP core was designed to operate at varying rates, according to the function module in use. The real-time G.723 coder needs a high operating rate because of the computational complexity of the compression algorithm. The G.723 decoder, on the other hand, requires a much lower operating rate, and Mel Frequency Cepstral Coefficient (MFCC) feature extraction requires a lower rate still. The highest operating rate is needed by the recognition stage, to minimize the delay after the end of speech. The operating rate of the chip can be modified according to the delay requirement, the scale of the vocabulary and the complexity of the template (Table 1).

Algorithm	Operating rate (MHz)
G.723 coder	60
G.723 decoder	20
Feature extraction	10
Recognition	104

Table 1: The operating rate of the chip can be modified according to the delay requirement, scale of vocabulary and complexity of the template
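
The per-task clock rates in Table 1 can be pictured as a simple lookup that application code consults before starting each stage. The sketch below is illustrative only: the dictionary keys and function names are hypothetical, and it does not represent the actual clock-control interface of the chip. It uses the 1-104MHz programmable range and the 1 Dhrystone MIPS per MHz figure quoted for the DSP core.

```python
# Operating rates from Table 1, in MHz. The keys are hypothetical task names.
CLOCK_MHZ = {
    "g723_coder": 60,
    "g723_decoder": 20,
    "feature_extract": 10,
    "recognize": 104,
}
MIN_MHZ, MAX_MHZ = 1, 104  # programmable range quoted for the DSP core
MIPS_PER_MHZ = 1.0         # throughput figure quoted for the DSP core


def clock_for(task: str) -> int:
    """Return the clock rate for a task, clamped to the programmable range."""
    return max(MIN_MHZ, min(MAX_MHZ, CLOCK_MHZ[task]))


def available_mips(task: str) -> float:
    """Instruction throughput available to a task at its clock rate."""
    return clock_for(task) * MIPS_PER_MHZ
```

On these figures, feature extraction runs with about a tenth of the throughput made available to recognition.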

Varying the operating rate to match each stage of the semi-real-time speech application keeps peripheral usage and computation to what the task actually needs. This reduces power in parts of the chip and contributes to the overall power savings. With the correct settings in the application software, the entire Uni-Lite SoC can be put into hibernation mode, where it draws current only in the microampere range (Table 2).

Hardware of Uni-Lite	Voltage, current and power
One channel of ADC and DAC	2.5V 10mA
OAK DSP (80MHz)	45mW @ 1.8V (typical application); <2.0µW @ 1.8V (stop mode)
64 PADS	2.5V/3.3V 8mA

Table 2: With correct settings in the applications software, the entire Uni-Lite SoC can be put into hibernation mode where it only consumes current in the range of microamperes.
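
To see why lowering the clock rate saves power, a first-order estimate can be made from the typical DSP figure in Table 2. The sketch below assumes that core power scales linearly with clock rate at a fixed supply voltage and ignores static leakage; this is a common approximation, not a measured characteristic of the Uni-Lite, so the results are estimates only. All function names and the example schedule are hypothetical.

```python
BASELINE_MW = 45.0   # typical core power from Table 2
BASELINE_MHZ = 80.0  # clock rate at which that figure is quoted


def estimated_core_power_mw(clock_mhz: float) -> float:
    """First-order estimate: dynamic power proportional to clock rate."""
    return BASELINE_MW * clock_mhz / BASELINE_MHZ


def average_power_mw(schedule):
    """Time-weighted average power over (clock_mhz, seconds) pairs."""
    total_time = sum(seconds for _, seconds in schedule)
    energy = sum(estimated_core_power_mw(mhz) * s for mhz, s in schedule)
    return energy / total_time


# Hypothetical duty cycle: 3s of feature extraction, then 1s of recognition.
example_average = average_power_mw([(10, 3.0), (104, 1.0)])
```

Under these assumptions the example duty cycle averages well below the 80MHz typical figure, because most of the time is spent at the 10MHz feature-extraction rate.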

Software architecture

A full speech interface, comprising guidance prompts, speech talk-back and speech recognition, is embedded on the SoC. The software set is composed of endpoint detection, MFCC feature extraction, small-vocabulary speaker-independent recognition and speech encoding/decoding objects. Other algorithms for this chip, such as speaker-dependent recognition and speaker identification, are under development.

The entire software system is partitioned into three levels: application, service and driver (Figure 2).

The driver level mainly manipulates the hardware and peripherals of the chip, and presents them to the upper levels as software devices. With this structure, only minor revisions are needed to keep the whole system working when the external devices are changed.
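
The idea of a driver level that shields the upper levels from hardware changes can be sketched as an abstract interface. The class and function names below are hypothetical and are not taken from the Uni-Lite firmware; the point is only that service-level code depends on the interface, so a different audio source can be substituted without touching it.

```python
from abc import ABC, abstractmethod
from typing import List


class AudioSource(ABC):
    """Hypothetical driver-level interface seen by the service level."""

    @abstractmethod
    def read_frame(self, n_samples: int) -> List[int]:
        """Return up to n_samples of audio."""


class BufferedSource(AudioSource):
    """Stand-in driver that replays samples from memory (for illustration)."""

    def __init__(self, samples: List[int]) -> None:
        self._samples = list(samples)
        self._pos = 0

    def read_frame(self, n_samples: int) -> List[int]:
        frame = self._samples[self._pos:self._pos + n_samples]
        self._pos += len(frame)
        return frame


def frame_energy_from(source: AudioSource, n_samples: int) -> int:
    """Service-level routine that only knows about the AudioSource interface."""
    return sum(s * s for s in source.read_frame(n_samples))
```

Replacing `BufferedSource` with a driver for real hardware would leave `frame_energy_from` unchanged.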

The service level contains the basic speech functional objects. The performance of a template-based word-recognition system is very sensitive to variations in the detected endpoints. Accurate endpoint detection is even more difficult in a speech-interface chip, since all endpoint-detection processing must be done in real-time on the hardware.

Two-stage endpoint detection was developed to cope with this difficulty. In the first stage, endpoint detection is based on energy and zero-crossing rate. This process is simple enough to be time-synchronous, but it gives only rough active-voice boundaries. Speech frames within these boundaries are then processed, and their features are extracted and recorded. Bypassing the silence saves storage space and also lowers the DSP’s average computing burden, which allows a lower operating rate.
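
A first stage of this kind can be sketched as follows. This is a minimal illustration of frame-wise energy and zero-crossing-rate thresholding, not the algorithm implemented on the chip; the frame length, thresholds and function names are all assumptions chosen for the example.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rough_endpoints(samples, frame_len, energy_thresh, zcr_thresh):
    """Return (first, last) active frame indices, or None if all is silence."""
    active = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # A frame counts as active speech if either measure exceeds its threshold.
        if frame_energy(frame) > energy_thresh or zero_crossing_rate(frame) > zcr_thresh:
            active.append(start // frame_len)
    if not active:
        return None
    return active[0], active[-1]
```

Each frame is classified using only its own samples, which is what makes this stage cheap enough to run in step with the incoming audio.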

The second-stage endpoint detection uses more of the information generated by feature extraction, such as the energy in different frequency bands. It also lets the algorithm search both forward and backward from the rough endpoints, and update the search threshold based on the current whole word. As a result, the second stage gives a far more accurate endpoint location. This is considered a well-balanced compromise between efficiency and accuracy, and results in high-performance feature extraction.
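
One way such a refinement could work is sketched below. It assumes a per-frame band-energy value is already available from feature extraction, sets the threshold as a fraction of the peak energy of the current word, and searches a small window on either side of the rough endpoints. The threshold rule, window size and names are hypothetical; the article does not specify them.

```python
def refine_endpoints(band_energy, rough_start, rough_end, search=3, ratio=0.1):
    """Refine rough endpoints using a threshold adapted to the whole word.

    band_energy: per-frame energy, e.g. summed over selected frequency bands.
    search:      how many frames beyond the rough endpoints to examine.
    ratio:       threshold as a fraction of the word's peak energy.
    """
    word = band_energy[rough_start:rough_end + 1]
    threshold = ratio * max(word)  # adapts to the current whole word
    lo = max(0, rough_start - search)
    hi = min(len(band_energy) - 1, rough_end + search)
    # Scan forward from the window start for the first frame above threshold,
    # and backward from the window end for the last one.
    start = next((i for i in range(lo, hi + 1) if band_energy[i] > threshold),
                 rough_start)
    end = next((i for i in range(hi, lo - 1, -1) if band_energy[i] > threshold),
               rough_end)
    return start, end
```

Depending on the data, the refined boundaries can move inward (trimming low-energy frames) or outward (recovering speech the first stage missed).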

A continuous-density HMM method based on both word models and subword models is also implemented at this level to achieve text- and speaker-independent high-accuracy recognition. This means that new vocabulary can easily be entered as text on a computer and downloaded to the chip.
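
For readers unfamiliar with continuous-density HMMs, the sketch below scores a feature sequence against left-to-right word models with one diagonal-covariance Gaussian per state, using the Viterbi approximation. It is a generic textbook formulation under those assumptions, not the model topology or scoring used on the chip, and all names and the toy models in the usage are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def viterbi_score(frames, means, variances, log_self, log_next):
    """Best-path log score through a left-to-right HMM, ending in the last state."""
    n = len(means)
    score = [NEG_INF] * n
    score[0] = log_gauss(frames[0], means[0], variances[0])
    for x in frames[1:]:
        new = [NEG_INF] * n
        for j in range(n):
            stay = score[j] + log_self
            move = score[j - 1] + log_next if j > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                new[j] = best + log_gauss(x, means[j], variances[j])
        score = new
    return score[-1]


def recognize(frames, models):
    """Pick the word whose model gives the highest Viterbi score."""
    return max(models, key=lambda word: viterbi_score(frames, *models[word]))
```

Each entry in `models` is a tuple of state means, state variances and the two log transition probabilities, so adding a word amounts to adding one more entry.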

A multipass decoding algorithm is also embedded in the system. In the first pass, a simple template is used to select the most likely words in a short time. In the second pass, a much more precise template is used to decide which of these candidates is the final result. This saves both memory and recognition time, thus enhancing the recognition performance of the SoC.
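
The two-pass idea can be sketched generically: a cheap scorer produces a shortlist and an expensive scorer is run only on that shortlist. The function below takes both scorers as parameters; their names, the shortlist size and the toy scores in the usage are hypothetical and not taken from the article.

```python
def two_pass_recognize(features, vocabulary, coarse_score, fine_score, n_best=3):
    """Shortlist with a cheap scorer, then decide with a precise one.

    coarse_score, fine_score: callables (features, word) -> score,
    where a higher score means a better match.
    """
    shortlist = sorted(vocabulary,
                       key=lambda word: coarse_score(features, word),
                       reverse=True)[:n_best]
    return max(shortlist, key=lambda word: fine_score(features, word))
```

With a vocabulary of N words and a shortlist of k, the precise scorer runs k times instead of N, which is where the time saving comes from. The trade-off is that a word wrongly dropped in the first pass can never be recovered in the second.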

Also implemented on the SoC is the ITU-T G.723.1 speech-coding algorithm, which provides low-rate, good-quality speech coding and has been successfully applied in very narrow-band videoconferencing. Its high compression ratio also allows longer recordings for a given amount of data storage.
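
The storage benefit can be made concrete with a little arithmetic. G.723.1 defines two rates, 6.3kbit/s and 5.3kbit/s, with 30ms frames packed into 24 and 20 bytes respectively. The sketch below compares these with uncompressed PCM, assuming 8kHz sampling stored as 16bit words; the 64KiB memory size and the PCM storage format are assumptions made for illustration, not figures from the article.

```python
FRAME_MS = 30  # G.723.1 frame duration in milliseconds
FRAME_BYTES = {"6.3kbit/s": 24, "5.3kbit/s": 20}  # packed frame sizes


def seconds_of_speech(storage_bytes: int, frame_bytes: int) -> float:
    """Duration of coded speech that fits in the given storage."""
    whole_frames = storage_bytes // frame_bytes
    return whole_frames * FRAME_MS / 1000.0


def seconds_of_pcm(storage_bytes: int, sample_rate_hz: int = 8000,
                   bytes_per_sample: int = 2) -> float:
    """Duration of uncompressed PCM in the same storage (assumed format)."""
    return storage_bytes / (sample_rate_hz * bytes_per_sample)


storage = 64 * 1024  # hypothetical 64KiB serial memory
high_rate = seconds_of_speech(storage, FRAME_BYTES["6.3kbit/s"])
low_rate = seconds_of_speech(storage, FRAME_BYTES["5.3kbit/s"])
raw = seconds_of_pcm(storage)
```

Under these assumptions the same memory holds about 82 seconds of speech at the higher rate and about 98 seconds at the lower rate, against roughly 4 seconds of raw PCM.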

Most of the code and data at this level are stored in ROM. This significantly reduces silicon area, which translates into lower power consumption and lower hardware cost.

The application level is the most variable portion of the system. With the implemented software architecture, it can easily be changed to suit specific applications. With the support of the relevant service-level speech functional objects, new application software can be built up rapidly. Because the software system is flexible and configurable, each service-level speech functional object can be freely integrated into the application firmware.

The embedded software can be built up using the provided high-performance functional objects and is designed to enable applications to be easily assembled within a very short time.

The SoC is an optimized solution for embedded applications such as toys, voice-based remote controls and speech recorders. Future developments include real-time recognition, with the ability to recognize much longer utterances in a short time on the SoC. Speaker-dependent recognition and speaker identification will also be developed to satisfy more complex applications.