Custom processors rev Java execution

By Ashish Sethi, Product Manager, ARC International, and Matt Kubiczek, CTO and Cofounder, Digital Communications Technologies Cambridge, England, EE Times
April 1, 2002 (10:48 a.m. EST)
URL: http://www.eetimes.com/story/OEG20020329S0016

Embedded Java implementations run up against performance problems because of the stack structures the language requires. Java implements a stack-based programming model, in which two stacks are required: the variable stack, which holds parameters passed between Java functions, and the operand stack, which holds parameters that are required by the method currently being executed.

When programming in C, stacks are normally maintained in main memory and data is pushed or popped between the stack and the CPU registers.

In standard software implementations of a Java Virtual Machine, or JVM, both Java stacks are also maintained in main memory. But many more memory transactions are required because data must be pushed and popped between the variable stack and the operand stack, and again between the operand stack and the CPU registers. This extra memory traffic can severely hamper system performance. Stack operations are an example of a task that dedicated Java processors do very well, but that RISC processors handle inefficiently.
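The difference in memory traffic can be made concrete with a small model. The sketch below is illustrative only: the class name, memory layout and counting scheme are assumptions made for this example, not part of any real JVM. It counts main-memory accesses for the statement c = a + b, first as the byte-code sequence iload, iload, iadd, istore with both stacks held in memory, and then as a C compiler might schedule it with the operands held in CPU registers.

```java
// Illustrative model only: names and memory layout are assumptions for this sketch.
final class StackTrafficSketch {
    static final int[] memory = new int[64]; // models main memory
    static final int VARS = 0;               // base of the variable stack
    static final int OPERANDS = 32;          // base of the operand stack
    static int sp = OPERANDS;                // operand stack pointer
    static int memoryAccesses = 0;           // counted reads and writes

    static int load(int addr) { memoryAccesses++; return memory[addr]; }
    static void store(int addr, int value) { memoryAccesses++; memory[addr] = value; }

    // iload n: copy a variable onto the operand stack (one read, one write).
    static void iload(int n) { store(sp++, load(VARS + n)); }

    // iadd: pop two operands into registers, add, push the result (two reads, one write).
    static void iadd() {
        int b = load(--sp);
        int a = load(--sp);
        store(sp++, a + b);
    }

    // istore n: pop the operand stack into a variable (one read, one write).
    static void istore(int n) { store(VARS + n, load(--sp)); }

    // c = a + b executed byte code by byte code, with both stacks in memory.
    static int runJvmStyle(int a, int b) {
        memory[VARS + 1] = a;  // test setup, not counted
        memory[VARS + 2] = b;
        memoryAccesses = 0;
        sp = OPERANDS;
        iload(1); iload(2); iadd(); istore(3);
        return memory[VARS + 3];
    }

    // c = a + b as compiled C: load both operands into registers, store the sum.
    static int runCStyle(int a, int b) {
        memory[VARS + 1] = a;  // test setup, not counted
        memory[VARS + 2] = b;
        memoryAccesses = 0;
        int r1 = load(VARS + 1);
        int r2 = load(VARS + 2);
        store(VARS + 3, r1 + r2);
        return memory[VARS + 3];
    }

    public static void main(String[] args) {
        runJvmStyle(2, 3);
        System.out.println("two stacks in memory: " + memoryAccesses + " accesses");
        runCStyle(2, 3);
        System.out.println("operands in registers: " + memoryAccesses + " accesses");
    }
}
```

Under this model the byte-code sequence needs nine memory accesses against three for the register-based version. The exact ratio depends on the model, but it shows where the extra traffic comes from.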

Also, translating Java byte codes into the native machine language of a processor must be done efficiently, both to reduce the size of the resulting native code and to reduce the complexity of the just-in-time (JIT) compiler. This is a significant challenge because RISC and Java processor architectures are vastly different. The fact that some Java byte codes are complex and map to many RISC instructions only compounds the performance loss.

Two routes have come to light that address the inefficiencies and challenges that developers face in accelerating Java execution in embedded systems.

Preprocessing concerns
The first approach attacks the translation problem. Some vendors have decided that an additional stage preceding the processor's instruction fetch can translate byte codes into RISC instructions as they arrive. This approach resembles a coprocessor. However, because it precedes the processor and does not hoard bandwidth on the bus, it does not hamper system performance as much as a coprocessor.

Indeed, it could be called a Java preprocessor. Since the hardware translation block replaces the software translation in the JVM, Java performance is improved. Developers can expect between 5 and 6 CaffeineMarks/MHz from a technique such as this one.

Although performance is enhanced and the cost of the solution at approximately 12k gates may sound appealing, there are hidden drawbacks in implementing a hardware translation mechanism. Complex Java byte codes are normally translated in schemes like these by trapping the incoming byte code and executing a routine, written in the native machine language of the RISC processor.

The dynamic translation itself requires on-chip memory, similar to microprogrammed ROMs in CISC processors. This ROM can be quite large, so the 12k-gates measurement may give a deceptive view of actual silicon area usage. In addition, the processor must switch from translating byte codes to running a RISC function.

A more efficient, highly integrated implementation can be realized if a processor architecture that has been designed for extension is used. Processors like these allow designers to add custom logic to the processor core architecture, such as new CPU instructions, supplementary CPU registers and control registers.

This is the approach that we have taken: we extended the CPU register file of the ARCtangent-A4 user-customizable processor and developed a twofold algorithm that leverages the hardware enhancements to create a high-performance RISC/Java processor at little cost.

This approach does not focus on implementing bolt-on hardware that translates incoming byte codes as they are fetched. Instead, it strives to adapt the RISC processor architecture and programming model to better reflect the needs of Java. In that way, translation can be accomplished easily and efficiently in software.

Because no large preprocessor is required, a large operand stack can be implemented instead of a small one. This operand stack is implemented as CPU registers and accessed using additional CPU register addresses that are not used by the standard RISC processor. Because the stack is larger than in the preprocessor approach, the chances of overflowing to memory are greatly reduced and performance is improved. Moreover, the variable stack can be combined with the operand stack to create a Unified Register Mapped (URM) stack. This greatly reduces the number of memory accesses required, because data is already present in registers ready to be processed, rather than having to be loaded from main memory.

All in one step
The addition of the URM stack is fundamental to accelerating Java execution. It allows RISC instructions to directly manipulate data on the stack by accessing registers instead of requiring completely separate stack operations. Data can be popped from the stack, operated on, and pushed back onto the stack in a single RISC instruction by referencing these registers.
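A minimal software model may help to picture this. The sketch below assumes a hypothetical 16-entry register window in which registers 0 to 7 hold variables and registers 8 to 15 hold operand stack entries. The article does not specify the real register layout, so these numbers, the class name and the method names are all assumptions made for illustration. Each method stands for one RISC instruction, and none of them touches main memory.

```java
// Illustrative model of a register-mapped stack; the layout is an assumption.
final class UrmStackSketch {
    static final int OPERAND_BASE = 8; // assumed: regs 0-7 variables, 8-15 operands
    final int[] regs = new int[16];    // models the extended CPU register file
    int depth = 0;                     // operand stack depth, tracked by software
    int instructionCount = 0;          // RISC instructions executed

    // iload n: one register-to-register move pushes a variable.
    void iload(int n) {
        regs[OPERAND_BASE + depth] = regs[n];
        depth++;
        instructionCount++;
    }

    // iadd: pop two entries, add, push the result, all in one
    // three-operand instruction that names the stack registers directly.
    void iadd() {
        int a = OPERAND_BASE + depth - 2;
        int b = OPERAND_BASE + depth - 1;
        regs[a] = regs[a] + regs[b];
        depth--;
        instructionCount++;
    }

    // istore n: one register-to-register move pops into a variable.
    void istore(int n) {
        depth--;
        regs[n] = regs[OPERAND_BASE + depth];
        instructionCount++;
    }

    public static void main(String[] args) {
        UrmStackSketch cpu = new UrmStackSketch();
        cpu.regs[1] = 2; // variable a
        cpu.regs[2] = 3; // variable b
        cpu.iload(1); cpu.iload(2); cpu.iadd(); cpu.istore(3);
        System.out.println("c = " + cpu.regs[3] + " in " + cpu.instructionCount
                + " register-only instructions");
    }
}
```

In this model the add itself is a single instruction, and the whole sequence for c = a + b completes in four instructions with no loads or stores.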

The next step is to implement a translation scheme that efficiently utilizes the modifications made to the RISC architecture. Because the stacks are no longer held in memory, byte codes that push and pop data can be represented as RISC instructions that move data from register to register, rather than as costly "load" and "store" instructions.

In this way, single byte codes can be mapped to single RISC "move" instructions. Moreover, because the stack combines both operands and variables and is implemented in registers, there is no need to move data from one stack to another, as long as the software keeps track of what is happening.
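Both ideas can be sketched as a toy translator. It is not the translation algorithm used on the actual processor, which the article does not describe in detail. The register names (v for variables, s for operand stack entries), the three supported byte codes and the instruction syntax are all invented for this example. The first method maps each byte code to one RISC instruction. The second lets the translator remember which register already holds each stack entry, so that pushes emit no code at all.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy translator; register names and instruction syntax are assumptions.
final class UrmTranslatorSketch {

    // One byte code becomes one RISC instruction; pushes and pops are moves.
    static List<String> translateDirect(List<String> byteCodes) {
        List<String> out = new ArrayList<>();
        int sp = 0; // operand stack depth, known at translation time
        for (String bc : byteCodes) {
            if (bc.startsWith("iload_")) {
                out.add("mov s" + sp + ", v" + bc.substring(6));
                sp++;
            } else if (bc.startsWith("istore_")) {
                sp--;
                out.add("mov v" + bc.substring(7) + ", s" + sp);
            } else if (bc.equals("iadd")) {
                sp--;
                out.add("add s" + (sp - 1) + ", s" + (sp - 1) + ", s" + sp);
            } else {
                throw new IllegalArgumentException("unsupported byte code: " + bc);
            }
        }
        return out;
    }

    // The translator tracks where each stack entry lives, so a push of a
    // variable emits nothing. Naive: it assumes a variable is not overwritten
    // while still referenced on the stack, and it never reuses temporaries.
    static List<String> translateTracked(List<String> byteCodes) {
        List<String> out = new ArrayList<>();
        Deque<String> location = new ArrayDeque<>(); // register holding each entry
        int nextTemp = 0;
        for (String bc : byteCodes) {
            if (bc.startsWith("iload_")) {
                location.push("v" + bc.substring(6));
            } else if (bc.startsWith("istore_")) {
                out.add("mov v" + bc.substring(7) + ", " + location.pop());
            } else if (bc.equals("iadd")) {
                String b = location.pop();
                String a = location.pop();
                String t = "s" + nextTemp++;
                out.add("add " + t + ", " + a + ", " + b);
                location.push(t);
            } else {
                throw new IllegalArgumentException("unsupported byte code: " + bc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = List.of("iload_1", "iload_2", "iadd", "istore_3");
        System.out.println(translateDirect(code));
        System.out.println(translateTracked(code));
    }
}
```

For the sequence iload_1, iload_2, iadd, istore_3, the direct mapping yields four register instructions and the tracking variant yields two. A production translator would also need to handle register spills, branches and the complex byte codes, none of which this sketch attempts.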