Speech-recognition core targets high-volume apps

By Stephan Ohr, EE Times
October 22, 1999 (11:36 a.m. EST)
URL: http://www.eetimes.com/story/OEG19991022S0013

LEUVEN, Belgium — Frontier Design is betting that a low-cost speech-recognition core with better than 97 percent accuracy will open new markets for control applications. The device, which makes it possible to recognize speech in a variety of languages, is for high-volume, low-cost command-and-control applications such as toys and other consumer electronics. It will be demonstrated for the first time in the United States at DSP World in Orlando, Fla., next month.

Priced at $1.40 each in 500,000-piece lots, the speech core is expected to find utility in a number of "giveaway" applications, according to Frontier Design. "Talk to your postcard, and it will respond," said Mark Bloemandael, Frontier's applications manager in the Netherlands. More likely, it will find a place in speech-controlled car radios, an application that demands noise cancellation, high-accuracy speech recognition and low cost all in one implementation, Bloemandael said.

With 30 kbytes of RAM on board, the speech core will recognize 30 to 35 commands in, among other languages, English, French, German, Dutch, Hebrew and Chinese. The core requires one-time training, but consistently delivers 97 to 100 percent accuracy (typically higher than 98 percent) in response to spoken commands. However, the device recognizes digits without training and would thus be useful for coin and currency changers, said Bloemandael.

Interest in currency changers will be heightened as European nations convert to the euro, and Frontier is working with banks in the hope of getting design wins for ATMs, he said.

On-chip synthesis

The company has already developed a voice-controlled currency translation device for Columns Ltd. of Singapore. The design includes a complete speech recognition and synthesis (SRS) system-on-a-chip (SoC), microphone and speaker. The chip's 30 kbytes of RAM store speech recognition templates, and an additional 20 kbytes of ROM hold high-quality synthesized phrases. The 15,000-gate SoC delivers 2 Mips of throughput at 2 MHz, with an average word-recognition latency of only 250 ms. It consumes 6 microamps in standby and 16 mA during speech synthesis, that is, while driving the speaker. Projected battery life is about two years with a single CR2-430 cell.
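As a rough check on the battery-life projection, the arithmetic can be sketched in a few lines of Python. The standby and synthesis currents are the figures quoted above; the cell capacity of 280 mAh is an assumption made only for illustration (the article does not state it), and the current drawn during recognition is ignored because it is not given either.

```python
HOURS_PER_YEAR = 365 * 24

STANDBY_MA = 0.006        # 6 microamps in standby, as quoted in the article
SYNTHESIS_MA = 16.0       # 16 mA while driving the speaker, as quoted
ASSUMED_CELL_MAH = 280.0  # ASSUMPTION: capacity is not stated in the article


def standby_drain_mah(years, standby_ma=STANDBY_MA):
    """Charge consumed by standby current alone over the given period."""
    return standby_ma * HOURS_PER_YEAR * years


def synthesis_hours_available(years, cell_mah=ASSUMED_CELL_MAH):
    """Hours of speech synthesis the cell could still supply after standby drain."""
    remaining = cell_mah - standby_drain_mah(years)
    return max(remaining, 0.0) / SYNTHESIS_MA


print(round(standby_drain_mah(2), 2))          # 105.12
print(round(synthesis_hours_available(2), 1))  # 10.9
```

Under that assumed capacity, standby alone would use roughly 105 mAh over two years, leaving on the order of 11 hours of total speech output. A different cell capacity changes the second figure but not the first.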

There are several other customers for the speech-recognition core, said Bloemandael. National Semiconductor Corp. (Santa Clara, Calif.) is combining the speech core, a DSP resource, with its own CR16B RISC core to enable voice-controlled dialing in digital enhanced cordless telecommunications (DECT) phones, he said.

Market researcher Frost & Sullivan (New York) has predicted that the speech control market will triple in the next three years. Some 75 percent will be in applications requiring continuous speech, such as dictating equipment, said Bloemandael. But another 15 percent will be in command and control, with the remaining 10 percent in telephony.

Frontier has its eye on these last two segments. These applications will succeed with a limited vocabulary, as long as the horsepower is there to pick out commands in an otherwise noisy environment, said Bloemandael. "Once the customers are convinced the speech applications work, they'll go faster to implement them," he said.

The amount of vocabulary the device will recognize depends on a variety of implementation factors, including the type of host it is coupled to, the types of peripherals it uses and the amount of memory. The memory requirement for speech-recognition templates is less than 1 kbyte of RAM per word. Synthesized speech takes up less than 1 kbyte/second, and 16 kbytes of ROM produce 20 seconds of high-quality synthesized speech with an added pulse-width-modulator speaker stage.
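The memory figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below treats the quoted "less than 1 kbyte" bound as a worst case; the function names are illustrative and not part of any Frontier tool.

```python
# Figures quoted in the article; "less than" bounds are used as worst cases.
TEMPLATE_KBYTES_PER_WORD = 1.0   # upper bound: under 1 kbyte of RAM per word
ROM_KBYTES = 16                  # ROM set aside for synthesized speech
ROM_SPEECH_SECONDS = 20          # speech duration that ROM is said to hold


def max_vocabulary(ram_kbytes, kbytes_per_word=TEMPLATE_KBYTES_PER_WORD):
    """Worst-case number of word templates that fit in the given RAM."""
    return int(ram_kbytes // kbytes_per_word)


def synthesis_rate_kbytes_per_s(rom_kbytes=ROM_KBYTES, seconds=ROM_SPEECH_SECONDS):
    """Average ROM consumed per second of synthesized speech."""
    return rom_kbytes / seconds


print(max_vocabulary(30))             # 30
print(synthesis_rate_kbytes_per_s())  # 0.8
```

With 30 kbytes of RAM the worst case is 30 templates, which is consistent with the 30 to 35 commands quoted for the core, and 16 kbytes over 20 seconds works out to 0.8 kbyte per second, inside the stated bound of less than 1 kbyte/second.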

The device effectively embeds speech-processing algorithms written in C. The core, 99 percent in C and 1 percent in VHDL, has a dense design, said Bloemandael. "C-language designs are much more easily optimized for power, performance or cost than are HDL implementations," said Herman Beke, Frontier's chief executive.

The SRS core is available in C-language object code for DSP or RISC processors and PC platforms. It can be compiled to run on any DSP or RISC processor, including Texas Instruments Inc.'s TMS320C62XX processor and National's CR16B DECT core. Alternatively, it can be distributed as a VHDL or Verilog core (with or without interfaces to other on-chip logic) or as a cell-based SoC including codec and amplifiers. Frontier can also market a complete OEM module that includes speaker, microphone, RF and other functionality.

Since the speech recognition algorithm requires only 5 to 10 Mips, any existing pager, mobile telephone or other system with 5 Mips of spare processing power can include the SRS core with no extra overhead.

The SRS implements several advanced recognition algorithms: the Mel Frequency Cepstrum Coefficient (MFCC) algorithm for acoustic feature extraction; continuous noise-level estimation to eliminate background noise; coarse- and fine-word boundary detection to define the word boundaries; and a Dynamic Time Warping algorithm to identify the words used.

The Dynamic Time Warping algorithm compares series of energy vectors that differ in length and in the duration of individual sounds within each series. It takes a weighted average difference between the feature vectors of the compared utterances and compares it with vectors in a template. The result is 97 to 100 percent accuracy for commands in the template.
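For readers unfamiliar with the technique, the following is a minimal textbook sketch of Dynamic Time Warping in Python, not Frontier's implementation. It assumes non-empty sequences of equal-dimension feature vectors, uses plain Euclidean distance between frames rather than any weighting scheme, and the names `dtw_distance` and `recognize` are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Length-normalized DTW cost between two non-empty vector sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # A frame may be matched, or either sequence may be stretched.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def recognize(utterance, templates):
    """Return the label of the stored template with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


# Toy one-dimensional "features"; real systems would use MFCC vectors.
templates = {
    "on": [[0.0], [1.0], [2.0], [1.0]],
    "off": [[2.0], [1.0], [0.0], [0.0]],
}
spoken = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]]  # "on", spoken more slowly
print(recognize(spoken, templates))  # on
```

The slower utterance has six frames against the template's four, yet it aligns with the "on" template at zero cost because repeated frames are absorbed by the warping path. That tolerance to timing differences is what makes the approach suitable for small, speaker-trained vocabularies.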

Accuracy measured

Frontier executives said that an overall accuracy of 98.7 percent was measured with the TI20 standard vocabulary of 3,200 words. The TI20 standard is a subset of the TI46 audio database published by the National Institute of Standards and Technology. The database, a standard set of English words and digits spoken by 16 male and female speakers, is used to test speech-recognition algorithms and products.

Additional functionality, such as echo cancellation, speech compression or caller ID detection, can be added with a minimal increase in gate count, said Bloemandael. For example, echo cancellation with the SRS core would require only 1,000 additional gates. Complete OEM systems are also available that include microphone, speaker, battery and packaging.