Breakthrough in Microprocessor Architecture and Energy Performance

Roger Sundman, Imsys AB
November 26, 2015

A radically new processor architecture, one that reduces the overhead of high-frequency switching, is needed in order to fully utilize the potential of future CMOS technology. Optimizing for energy efficiency, performance, cost, code density, adaptability and scalability is a big challenge for the microprocessor architect. Imsys has developed a runtime-reconfigurable dual-core processor with features not found in other architectures. The processor can run at 350 MHz with an active power consumption of 19.7 µW/MHz using one core, a fourfold improvement over the currently most efficient MCU.

Imsys’ unique architecture is well positioned for low-power applications. A proof-of-concept chip has been produced and verified. 97% of its transistors are used in memory blocks. It uses Imsys’ patented dual-core solution, where the pair of cores occupies 40% less space and consumes 25% less power than two single cores. The chip is manufactured by UMC using the standard 65 nm LL process. The cores share memories and a 5-port grid network-on-chip (NoC) router, thereby prototyping a tile of a many-core IC in which each core has local memory capacity sufficient for its immediate needs. Memory management is handled by microcode, and memory is closely integrated with the processor without the need for an ordinary cache controller. The active consumption of each core, executing from RAM and including the consumption of the RAM itself, is 6.9 mW at 350 MHz. Leakage is lower than for architectures with comparable performance, since both the core and the stored program are smaller. The Imsys processor uses about one fifth of the silicon area of the processors it could replace.
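The per-MHz figure in the opening paragraph follows from the 6.9 mW and 350 MHz quoted above. The following is a minimal arithmetic sketch; the function and constant names are illustrative assumptions, not part of any Imsys tool.

```python
# Figures quoted in the text for one core executing from RAM.
ACTIVE_POWER_MW = 6.9   # milliwatts per core, including RAM consumption
CLOCK_MHZ = 350.0       # clock frequency in MHz


def power_per_mhz_uw(power_mw: float, clock_mhz: float) -> float:
    """Return active power per MHz, in microwatts per MHz.

    1 uW/MHz equals 1e-6 W / 1e6 Hz = 1e-12 J, so the same number
    can also be read as picojoules per clock cycle.
    """
    return power_mw * 1000.0 / clock_mhz


if __name__ == "__main__":
    value = power_per_mhz_uw(ACTIVE_POWER_MW, CLOCK_MHZ)
    print(f"{value:.1f} uW/MHz (= {value:.1f} pJ per clock cycle)")
```

Read this way, the quoted figure corresponds to roughly 19.7 pJ of energy per clock cycle for one active core.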

Energy efficiency has, for the first time and for the foreseeable future, become the most important characteristic of processors, big and small. Every transistor switching event dissipates energy as heat. Both the size and the delay of transistors are reduced with every CMOS generation, and if these gains are exploited, i.e. if density or clock frequency is increased, the heat density increases. This has become a problem with the latest generations. If both density and speed were fully utilized over the entire area, the heat would destroy the chip. For a processor core it is of course desirable to obtain the maximum performance the technology allows, but this can be done on only a few percent of the chip area, and that percentage shrinks with every generation. The remainder has to be “dark silicon”, meaning that it has to have much lower activity.
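The heat-density argument can be illustrated with the standard first-order model of CMOS dynamic power. This is a textbook approximation added here for illustration, not a model taken from Imsys. The symbols are assumptions introduced for this sketch: α is the activity factor, C the switched capacitance, V the supply voltage, f the clock frequency, A the area and κ > 1 the linear scaling factor per generation.

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V^{2} f, \qquad
\text{power density} = \frac{P_{\mathrm{dyn}}}{A}

% Ideal scaling: C \to C/\kappa,\; V \to V/\kappa,\; f \to \kappa f,\; A \to A/\kappa^{2}
\frac{P_{\mathrm{dyn}}}{A} \;\to\;
\frac{\alpha \,(C/\kappa)\,(V/\kappa)^{2}\,(\kappa f)}{A/\kappa^{2}}
= \frac{P_{\mathrm{dyn}}}{A}

% Supply voltage no longer scales (V held fixed):
\frac{P_{\mathrm{dyn}}}{A} \;\to\;
\frac{\alpha \,(C/\kappa)\,V^{2}\,(\kappa f)}{A/\kappa^{2}}
= \kappa^{2} \, \frac{P_{\mathrm{dyn}}}{A}
```

Under these assumptions, once the supply voltage stops scaling, the heat density can only be held constant by lowering the activity factor or the frequency over most of the area, which is the dark-silicon constraint described above.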

Imsys’ processor has a different, yet well-proven, fundamental design that does not have the above-mentioned limitation and is therefore suited to the new situation in semiconductor technology. It is suitable for sensor nodes powered by energy harvesting in the Internet of Things, as well as for many-core chips for microservers, robotics and cyber-physical systems.

Simply placing 128 copies of the verified tile next to each other results in 256 cores, 42 MByte ROM and 25 MByte RAM on 320 mm² of silicon, consuming 2.8 W with all cores running at 350 MHz. The design can also be scaled to smaller process geometries: with 14 nm technology, an area of 238 mm² could hold 4096 cores, 672 MByte ROM and 400 MByte RAM and consume 31 W at 1.6 GHz.
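The two configurations above can be compared by breaking the quoted totals down per tile, per core and per unit area. The sketch below does only that arithmetic. The class and field names are hypothetical, and the derived per-tile and per-core values are computed from the totals in the text rather than stated by Imsys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipConfig:
    """Totals quoted in the text for one many-core configuration."""
    cores: int
    rom_mbyte: float
    ram_mbyte: float
    area_mm2: float
    power_w: float
    clock_mhz: float

    @property
    def tiles(self) -> int:
        # Each tile holds one dual-core pair.
        return self.cores // 2

    @property
    def rom_per_tile_mbyte(self) -> float:
        return self.rom_mbyte / self.tiles

    @property
    def ram_per_tile_mbyte(self) -> float:
        return self.ram_mbyte / self.tiles

    @property
    def power_per_core_mw(self) -> float:
        return self.power_w * 1000.0 / self.cores

    @property
    def power_density_w_per_cm2(self) -> float:
        # 1 cm^2 = 100 mm^2
        return self.power_w / (self.area_mm2 / 100.0)


# Values taken directly from the paragraph above.
CHIP_65NM = ChipConfig(cores=256, rom_mbyte=42, ram_mbyte=25,
                       area_mm2=320, power_w=2.8, clock_mhz=350)
CHIP_14NM = ChipConfig(cores=4096, rom_mbyte=672, ram_mbyte=400,
                       area_mm2=238, power_w=31, clock_mhz=1600)

if __name__ == "__main__":
    for name, chip in (("65 nm", CHIP_65NM), ("14 nm", CHIP_14NM)):
        print(f"{name}: {chip.tiles} tiles, "
              f"{chip.rom_per_tile_mbyte:.3f} MByte ROM/tile, "
              f"{chip.ram_per_tile_mbyte:.3f} MByte RAM/tile, "
              f"{chip.power_per_core_mw:.2f} mW/core, "
              f"{chip.power_density_w_per_cm2:.2f} W/cm^2")
```

The per-tile memory comes out identical in both configurations, which is consistent with the 14 nm figures being a replication of the same tile rather than a redesign.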

Each core has almost constant power consumption when active, and the heat it generates spreads across the adjacent memory areas. This allows a higher total power dissipation and simplifies both cooling-system design and power budgeting.

Microcode, as opposed to logic gates, is compact and energy efficient. Imsys uses extensive microprogramming to implement a rich set of instructions, thereby reducing the number of cycles needed without energy-inefficient speculative activity or duplicated hardware logic. Each core has two instruction sets, one of which executes Java and Python directly from the dense JVM bytecode representation. C code is compiled to the other set with unparalleled density.

Internal microcode is used for computationally intensive standard routines, such as crypto algorithms, which would otherwise be assembly-coded library routines or even special hardware blocks. Optimizing CPU-intensive tasks in microcode can reduce execution time and energy consumption by more than an order of magnitude compared to C code.

The rich instruction set optimized for the compiler reduces the memory needed for software and, just like the microcoded algorithms, it reduces the number of clock cycles needed for execution.

This platform has a certified JVM and uses an RTOS kernel certified to the ISO 26262 safety standard for automotive applications. The development tools will be enhanced with the support enabled by the LLVM infrastructure. A new instruction set optimized for an LLVM backend has been developed and is being implemented in the coming hardware generation.