MIPS, ARC look to compress instruction code

By Chris Edwards, EE Times UK
June 11, 2001 (6:36 p.m. EST)
URL: http://www.eetimes.com/story/OEG20010611S0019

LONDON — MIPS Technologies Inc. and ARC International plc, leading examples of companies that license their intellectual property rather than ship it in their own silicon, are set to make announcements regarding instruction code compression at this year's Embedded Processor Forum, which begins Monday (June 11) in San Jose, Calif.

MIPS (Mountain View, Calif.) has made major changes to the synthesizable HDL code for a new version of its 32-bit processor core to reduce power consumption and improve code density. At the same time, the company has added a floating-point engine to its synthesizable 64-bit core for multimedia applications. It is set to disclose details of this development on Tuesday (June 12) at the Forum.

To reduce power consumption, MIPS has added clock gating to the 4Ke family of cores. The amount of gating in the core is selected during synthesis. The overhead for adding the fine-grain gating is around a third if all the possible gated blocks are included, the company indicated.

"You can reduce the clock gating if the design is area constrained," said Mark Pittman, director of product marketing at MIPS.

This gating can control more than 85 percent of the processing in the core, said Pittman, adding that it is also possible to shut down registers that are not being used directly to further reduce power consumption.

MIPS said the power consumption of the cores will be half as much as for the previous 4K family, at 0.35 to 0.45 mW/MHz, with the core running at 220 MHz worst case and 270 MHz to 275 MHz typical.

But this power figure does not include the requirements of cache memories, which can be configured to hold up to 64 kbytes each for instruction and data. The caches can be one- to four-way set associative. The organization of the caches will have a further, marked effect on overall power consumption, the company said.
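As a rough check on the core figures quoted above, a mW/MHz rating can be multiplied by the clock rate to get core power. The sketch below does only that arithmetic. The function and constant names are illustrative, not from MIPS, and the result leaves out the caches, as the quoted rating does.

```python
def core_power_mw(mw_per_mhz: float, clock_mhz: float) -> float:
    """Core power in milliwatts from a mW/MHz rating and a clock in MHz."""
    return mw_per_mhz * clock_mhz


# Figures quoted in the article for the 4Ke core (caches excluded).
RATINGS_MW_PER_MHZ = (0.35, 0.45)
CLOCKS_MHZ = {"worst case": 220, "typical": 275}


def power_table() -> dict:
    """Map each quoted clock point to its (low, high) core power in mW."""
    return {
        label: tuple(core_power_mw(r, mhz) for r in RATINGS_MW_PER_MHZ)
        for label, mhz in CLOCKS_MHZ.items()
    }


if __name__ == "__main__":
    for label, (low, high) in power_table().items():
        print(f"{label}: {low:.0f} to {high:.0f} mW")
```

On those numbers the core alone would draw roughly 77 to 99 mW at 220 MHz and about 96 to 124 mW at 275 MHz.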

At 275 MHz, the performance with caches is, in theory, 385 to 475 Dhrystone MIPS, or 1.4 million instructions per second (Mips) per watt, said Pittman. One cache set can be sacrificed to build a 16-kbyte scratchpad memory.

In a Tuesday morning session, MIPS engineering manager Morten Zilmer is due to explain how the company is adding a floating-point unit to its 64-bit synthesizable 5K family.

About one third of MIPS Technologies' 45 partners have taken licenses for the 5K and the hard-core 20K, but the majority of those have yet to come to market with products based on either.

With its floating-point unit, the 5Kf is aimed at multimedia processing, said Pittman. The unit takes the core up to 4.3 square millimeters.

The floating-point core has a seven-stage pipeline. A feedback loop in the unit can reduce the latency to four cycles, said Pittman. Once the pipeline is full, the unit can produce one result every cycle in single-precision mode. Double-precision mode adds one latency cycle, but the throughput is also one result per cycle.

MIPS has added some of the power-saving techniques of the 4Ke family to the 5Kf, bringing the power well under the 4 W required for the 20K family.

The power demands of the core are expected to be 1.3 to 1.5 mW/MHz, or 400 milliwatts. At the typical speed of 340 to 390 MHz, the core handles 510 to 580 Mflops, with a sustained performance of up to 390 Dhrystone Mips.

Both the 4Ke and 5Kf have 32-bit co-processor interfaces. The first of the two co-processor slots is dedicated to a floating-point unit. "The 4Ke is the first 32-bit core with that co-processor interface," said Pittman.

Instructions added

MIPS has taken the 16-bit MIPS16 core launched three years ago by LSI Logic Corp. and created its own version, called the MIPS16e. The MIPS16e is backwards compatible with the MIPS16, and has been licensed back to LSI Logic, the first licensee of the 4Kec core.

MIPS16e adds eight instructions to the MIPS16 set, providing a 10 percent increase in code density over the MIPS16.

A key difference from other approaches, including the original MIPS16, is that the location of the decompressor, which converts 16-bit instructions to their 32-bit equivalents, can be configured in RTL to trade off between speed and area.

The decompressor unit sits in the instruction decode chain. In one form, multiple decode units can sit before the instruction-selection unit, allowing high-speed decoding. Alternatively, a single unit can sit after the instruction selector to reduce the impact on area.
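The two placements can be sketched in software terms. This is a toy model under stated assumptions, not MIPS's RTL: it assumes a 32-bit fetch word holding two 16-bit instructions, and the `expand` mapping is invented purely for illustration. It shows only that both arrangements yield the same instruction while using a different number of expansion units.

```python
def expand(halfword: int) -> int:
    """Toy stand-in for a decompressor: 16-bit code in, 32-bit word out.

    The mapping is invented for illustration; real MIPS16e expansion differs.
    """
    return 0xA5000000 | (halfword & 0xFFFF)


def decode_before_select(fetch_word: int, use_upper: bool):
    """Expand both halfwords, then select one.

    Returns (instruction, number of expansion units exercised).
    """
    upper = expand(fetch_word >> 16)
    lower = expand(fetch_word & 0xFFFF)
    return (upper if use_upper else lower), 2


def decode_after_select(fetch_word: int, use_upper: bool):
    """Select one halfword first, then expand it with a single unit."""
    half = (fetch_word >> 16) if use_upper else (fetch_word & 0xFFFF)
    return expand(half), 1
```

In this model the first form needs one expansion unit per candidate halfword but keeps expansion off the path that follows selection, while the second needs only one unit, which mirrors the speed-versus-area choice described above.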

Off on a Tangent

ARC International (Elstree, England), which trades as ARC Cores, has also developed a more compact instruction set for its next-generation 32-bit configurable processor. The set is intended to help ARC compete with the code density offered by compressed formats such as ARM Ltd.'s Thumb and the MIPS16.

This development is the first detail to be released of ARC's next-generation Tangent-A5 configurable RISC processor, due to be introduced at the end of the year. The code compression features are due to be outlined by Jim Turley, ARC's senior vice president for technology strategy, on Tuesday.

ARC product manager Phil Barnard said that ARC's approach with the Tangent-A5 is not simply based on a compressed format, but on a native 16-bit encoding that can be interleaved freely between 32-bit instructions.

Instead of using a decompressor at the head of the instruction pipeline to convert 16-bit forms into their 32-bit equivalents — the technique used by MIPS16 and Thumb — the instruction decoder in the A5 will process the 16-bit instructions natively.

However, to make the two types of instruction run side by side, the company has been forced to sacrifice backwards compatibility with previous versions of the ARC processor. Barnard said that choice was preferable to facing the issues raised by compression, particularly given that the move is designed to let the company aim at more deeply embedded applications.

"You can get density [with compression] but it can have hidden 'gotchas,' " Barnard said. "Code generation is more complex and you need to work out the overhead of the mode switch versus the code size reduction."

The company has taken the opportunity to re-encode the 32-bit instructions so that the first four bits now determine the instruction mode.
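A decoder that reads the instruction size from a four-bit prefix might look like the sketch below. ARC has not said which prefix values select which mode, so the split used here (prefixes 0x8 to 0xF meaning 16-bit) and the choice of the most significant four bits are assumptions made only for illustration.

```python
# Assumption: an arbitrary split of the 16 possible prefix values.
# The actual Tangent-A5 assignment is not given in the article.
SIXTEEN_BIT_PREFIXES = frozenset(range(0x8, 0x10))


def instruction_length_bits(first_halfword: int) -> int:
    """Return 16 or 32 based on the top four bits of the first halfword."""
    prefix = (first_halfword >> 12) & 0xF
    return 16 if prefix in SIXTEEN_BIT_PREFIXES else 32


def split_stream(halfwords):
    """Group a list of 16-bit halfwords into variable-length instructions."""
    instructions = []
    i = 0
    while i < len(halfwords):
        count = instruction_length_bits(halfwords[i]) // 16
        instructions.append(tuple(halfwords[i:i + count]))
        i += count
    return instructions
```

Because the size is known from the first halfword alone, the two formats can be mixed in one stream without a separate mode switch.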

"The bonus with this is that we get more extension slots [for custom instructions]. Customers can create both 16- and 32-bit instructions," said Barnard.

The company will offer both new compilers and a translator to convert existing 32-bit machine code.

The instruction fetch in the processor remains 32-bit, which lets it fetch up to two 16-bit instructions in one cycle. To let the compiler freely mingle 16-bit and 32-bit instructions, the larger instructions can be aligned on 16-bit word boundaries, which means that half of a 32-bit instruction may be fetched in one cycle and half in the next. As a result, the instruction buffer is 48 bits wide.

However, the targets of branches have to be aligned on double-word boundaries to guarantee that the processor can process one arithmetic instruction every cycle.
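The reason a 32-bit fetch ends up needing a 48-bit buffer can be seen in a small simulation. This is a simplified model, not ARC's design: it assumes straight-line code with no branches, one 32-bit fetch per cycle whenever at most 16 bits are waiting, and one instruction issued per cycle once it is fully buffered. The function name and the refill policy are assumptions.

```python
FETCH_HALFWORDS = 2  # a 32-bit fetch brings in two 16-bit halfwords


def simulate(lengths_bits):
    """Run a straight-line stream of 16- and 32-bit instructions.

    lengths_bits lists instruction sizes in program order.
    Returns (cycles, peak buffer occupancy in bits).
    """
    if any(n not in (16, 32) for n in lengths_bits):
        raise ValueError("instruction sizes must be 16 or 32 bits")
    sizes = [n // 16 for n in lengths_bits]
    remaining = sum(sizes)  # halfwords not yet fetched
    buffered = peak = cycles = 0
    for size in sizes:
        while True:
            cycles += 1
            # Refill only when a full fetch fits beside what is left over.
            if buffered <= 1 and remaining > 0:
                got = min(FETCH_HALFWORDS, remaining)
                buffered += got
                remaining -= got
                peak = max(peak, buffered)
            if buffered >= size:  # whole instruction present: issue it
                buffered -= size
                break
    return cycles, peak * 16
```

For the stream `[16, 32, 32, 16]`, the second fetch arrives while 16 bits of a split 32-bit instruction are still waiting, so the buffer briefly holds 48 bits, and the model still issues one instruction per cycle.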

As with all 16-bit instructions for 32-bit processors, the ARC compact format is limited in how many registers it can address, and some of its instructions use implied registers.

"If you have 16 bits, you can only put so much information in there," said Barnard.

The registers that the company has made accessible to the 16-bit instructions were chosen on the basis of how the registers are used by the existing compiler.

By making those registers accessible to 16-bit instructions, the designers have tried to make it easier for the compiler to switch between 16-bit and 32-bit instructions. To get access to registers outside the normal 16-bit set, there is a 16-bit register-register move instruction that can access any of the regular 32 registers in the central file. Similarly, there are versions of the compare and add instructions that can do the same "because those operations are so common," said Barnard.

The use of implied registers has led the company to implement instructions more familiar to CISC programmers. Because the address of the stack pointer is implied in the 16-bit set, there are now push and pop instructions.
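The effect of an implied stack pointer can be shown with a toy register machine. Everything in it is an assumption made for illustration: the register names, the 4-byte stack slot, the downward-growing stack, and the explicit subtract-then-store sequence used as the point of comparison. None of it is taken from ARC's instruction set documentation.

```python
class ToyMachine:
    """Minimal register machine that counts executed instructions."""

    def __init__(self, stack_top: int = 0x1000):
        self.regs = {"sp": stack_top, "r0": 0, "r1": 0}
        self.mem = {}
        self.executed = 0

    def sub_imm(self, reg: str, value: int) -> None:
        """Explicit form: subtract an immediate from a named register."""
        self.regs[reg] -= value
        self.executed += 1

    def store(self, src: str, addr_reg: str) -> None:
        """Explicit form: store src at the address held in addr_reg."""
        self.mem[self.regs[addr_reg]] = self.regs[src]
        self.executed += 1

    def push(self, src: str) -> None:
        """Compact form: the stack pointer is implied, so only src is named."""
        self.regs["sp"] -= 4
        self.mem[self.regs["sp"]] = self.regs[src]
        self.executed += 1

    def pop(self, dst: str) -> None:
        """Compact form: load from the implied stack pointer, then release."""
        self.regs[dst] = self.mem[self.regs["sp"]]
        self.regs["sp"] += 4
        self.executed += 1
```

Running `sub_imm("sp", 4)` followed by `store("r0", "sp")` on one machine and `push("r0")` on another leaves both in the same state, with two instructions counted in the first case and one in the second.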

"Previously, we had to do this as two separate instructions," said Barnard.

Chris Edwards is the editor of Electronics Times, EE Times' sister publication in the United Kingdom.
