Net processors shuffle design priorities
The special demands of network protocols and packet processing are a radical departure from the applications for which general-purpose processors were designed many years ago. Some of the assumptions underlying the design of general-purpose processors, and of the microcontrollers derived from them, are 180 degrees out of phase with the modern realities of network processing. Fundamental RISC tenets such as 32-bit-aligned data and load/store architectures seem outdated in a world of variable-length packets, where torrents of data must move through a processor with minimal detours into its registers.

Besides dealing with the deterministic nature of packet processing at the lower levels of the network protocols and the real-time nature of streaming applications (MP3, voice over Internet Protocol and MPEG), another essential issue in the design of Internet-edge processors is which communication protocols to support. There are many network protocols and general-purpose communication protocols: Ethernet, Fast Ethernet, 802.11 variants, Bluetooth, HomePlug, USB, digital subscriber line (DSL) variants, Docsis (for cable modems), PC Card (PCMCIA), ISDN, General Packet Radio Service (GPRS) and more.

A significant factor that affects the design of Internet-edge processors is cost. Unlike the expensive servers and routers that manage traffic deep inside the Internet, embedded networking-infrastructure systems are often end-user devices and must be as affordable as consumer products such as MP3 players, broadband modems and video-game consoles. Others are more expensive but still cost-conscious embedded systems, such as industrial machines, motor vehicles, heating/cooling systems and cellular basestations. Some products must run on batteries, imposing an additional requirement for low power consumption.

Embedding connectivity

Those considerations led our engineers to come up with a new CPU architecture specifically for embedding network connectivity in low-cost systems at the Internet edge. We hope to introduce the architecture, code-named Mercury, in a standard-product processor next spring.

Fundamentally, the architecture is still a 32-bit RISC processor. It has a Harvard architecture, fixed-length 32-bit instructions, a RISC-like pipeline and single-cycle throughput. But from there on, it diverges. While most other CPUs have hundreds of instructions because of their PC/workstation/server heritage, we limited the instruction-set architecture (ISA) to only 39 instructions. It is optimized for Internet-edge packet processing, not for running databases, compilers, word processors, spreadsheets or games. So we can afford an ISA that performs the same very specific network-edge operations with fewer instructions, shrinking code size and allowing the reduction or elimination of external flash memory.

Perhaps the most interesting divergence from RISC in our design is its memory-to-memory architecture. Several instructions access memory twice: once to load a value from memory, and again to store the value after manipulating it. Conventional RISC architectures shun multiple memory accesses because off-chip memory latencies haven't kept pace with the rising core frequencies of CPUs. Instead, they use separate instructions to load a value from memory into a register, manipulate it in registers and then copy the register back to memory.
That still requires two memory accesses, but separating the load/store instructions allows a program to perform multiple register-to-register operations on a value before the final store.
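To make the contrast concrete, here is a minimal C sketch of the two styles, using a time-to-live (TTL) decrement as the packet operation. The function names are illustrative, not part of any Ubicom toolchain, and a C compiler for a load/store machine would of course emit the same three-instruction sequence for both.

    #include <stdint.h>

    /* Load/store RISC style: the value detours through a register
     * (modeled here as a local variable) between the load and the store. */
    void decrement_ttl_loadstore(uint8_t *packet, unsigned ttl_offset)
    {
        uint8_t r1 = packet[ttl_offset];   /* load into register           */
        r1 = r1 - 1;                       /* register-to-register ALU op  */
        packet[ttl_offset] = r1;           /* store back to memory         */
    }

    /* Memory-to-memory style: a single read-modify-write operation,
     * with no architectural register stop along the way. */
    void decrement_ttl_memtomem(uint8_t *packet, unsigned ttl_offset)
    {
        packet[ttl_offset] -= 1;           /* one instruction on a
                                            * memory-to-memory ISA         */
    }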
While the so-called load/store architecture of RISC is well-suited to the software applications that RISC processors were designed to run in the 1980s and 1990s, it's not as suitable for 21st-century Internet-edge packet processing. In fact, about 40 percent of the instructions in a typical RISC instruction set are register-oriented load/store instructions.

Different approach

Our approach is different in several fundamental ways. First is the matter of memory-to-memory instructions. A packet processor rarely needs to perform multiple operations on a fragment of packet data between loading it from memory and storing it back to memory. So it's redundant to use separate instructions that load the packet data into a register, manipulate the data in a register-to-register fashion and then store the data back to memory. Typical packet-processing operations touch the data only once, for instance to compute a cyclic redundancy check (CRC) or a TCP checksum over every byte in a packet. Why use multiple instructions that waste CPU cycles and inflate code size when a single instruction can do the job? In this architectural approach, a single optimized instruction can load some packet data from memory, perform the necessary operations on the data and then store the data directly back to memory, without stopping at a register along the way. For flexibility, the memory-to-memory instructions have multiple modes for base + index, base + offset and auto-increment memory addressing (the rough C equivalents are sketched at the end of this section).

To implement the architecture, we opted for a modified pipeline and fast local memory. The pipeline has extra stages to calculate memory addresses, read memory and write back to memory. It's deeper than a minimal RISC pipeline but still supports the single-cycle throughput that's characteristic of modern RISC processors. One potential drawback of a deeper pipeline is a greater penalty for mispredicting a conditional branch, because the processor has to flush and reprime the pipeline with instructions fetched from the correct branch address. But as will be explained in a moment, instruction-level hardware multithreading hides that penalty. We have also found that in our design, static branch prediction with compiler hints actually works better than a shorter pipeline using dynamic prediction.

Almost any other processor would pay a further performance penalty for accessing memory twice with a single instruction: the penalty of fetching data from off-chip memory after a cache miss. We've eliminated that penalty by storing packet data in fast local memory and dispensing with instruction/data caches altogether.
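Before moving on, here is the promised sketch of the three addressing modes in their nearest C idioms. The variable names and the particular offset are illustrative only:

    #include <stdint.h>

    void addressing_modes(uint8_t *base, unsigned index, uint8_t *p)
    {
        uint8_t a = base[index];     /* base + index:   address = base + index    */
        uint8_t b = *(base + 16);    /* base + offset:  address = base + constant */
        uint8_t c = *p++;            /* auto-increment: use p, then advance it    */
        (void)a; (void)b; (void)c;   /* silence unused-variable warnings          */
    }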
In place of caches, which can impair determinism in real-time applications, the new architecture has separate on-chip static RAMs for code and data. The processor can access this memory in a single cycle. It's an efficient Harvard architecture that minimizes bus conflicts and can handle most Internet-edge applications without any external memory. One thing you won't find in our new processor design is huge first-in, first-out (FIFO) buffers. That's because it doesn't need to buffer large numbers of packets or keep multiple copies of packets in various places. It's fast enough at typical Internet-edge wire speeds to handle each packet as it arrives. Doing the same work with less internal memory slashes power consumption and silicon cost.

Another fundamental differentiator is instruction-level hardware multithreading. A big advantage of such a design, from a code-development and software-support point of view, is that it doesn't require a heavyweight multithreaded or multitasking real-time operating system (RTOS). Implementing multithreading in hardware at the instruction level lets the processor handle interrupts at high speed while shrinking the size, and increasing the reliability, of the operating system. Instruction-level multithreading simplifies the programming model, too, because it's managed by the processor.

Threaded execution

While the basic architectural model our team developed can support up to 32 threads, in the first hardware implementation we chose to let the processor mingle instructions from eight different threads in its pipeline simultaneously. Each stage can be working on an instruction from a different thread. Context switching requires zero instructions and zero clock cycles. The threads can be asynchronous hard-real-time processes or non-real-time processes, in any combination. Eight-way multithreading and zero-cycle context switching are made possible by eight program counters and eight register banks for storing thread states. As each pipe stage begins processing an instruction, it simply switches to the bank that holds the corresponding state information. Other pipe stages can be accessing other banks at the same time. If a hard-real-time thread needs 50 Mips to get the job done and the processor can deliver 200 Mips, a special thread-scheduling stage in the design ensures that every fourth instruction slot goes to that thread. Instructions from non-real-time threads get the remaining slots, based on their priorities or a round-robin scheme. In a typical Internet-gateway application, the processor might use four threads for software I/O, leaving four threads available for other tasks. Non-real-time threads can use any cycles available when a hard-real-time thread is idle, so no performance is wasted.

Packet optimization

Another design difference is an instruction set optimized for packet processing. With one or a few instructions, it can perform tasks that would require several instructions on other processors. For instance, a single shift-and-merge instruction can align data to any bit position in memory and extract any 32 bits from the data. This is invaluable for manipulating data packets, which usually aren't aligned on 32-bit boundaries. Normally, the task would require several instructions to load the data from memory, mask it, shift it and then store it back to memory.
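To illustrate what shift-and-merge condenses, here is a hedged C sketch of the conventional load/shift/mask sequence it replaces. The exact semantics of the Mercury instruction aren't published here, so the big-endian bit numbering below is an assumption:

    #include <stdint.h>

    /* Extract 32 bits starting at an arbitrary bit offset in a byte
     * stream; the caller must ensure five bytes are readable from the
     * starting byte. A conventional RISC needs this whole sequence;
     * the article says one shift-and-merge instruction does the
     * equivalent alignment and extraction. */
    uint32_t extract32(const uint8_t *buf, unsigned bit_offset)
    {
        unsigned byte = bit_offset / 8;   /* first byte holding the field */
        unsigned bit  = bit_offset % 8;   /* field start within that byte */

        /* Gather the five bytes that can contain a 32-bit field. */
        uint64_t window = 0;
        for (int i = 0; i < 5; i++)
            window = (window << 8) | buf[byte + i];

        /* window holds 40 bits; shift the field to the bottom and let
         * the 32-bit cast mask off the leading bits. */
        return (uint32_t)(window >> (8 - bit));
    }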
Likewise, the CRCGEN instruction applies any CRC method to 1 byte of data at a time, an operation that would normally require four to eight instructions and a 256-byte lookup table.

The approach we've taken is also well-suited to Java interpreters. The Java virtual machine (JVM) is stack-based; most other processors copy the top of the Java stack into their registers for faster access. But unlike register files, stacks are unbounded in size, so the Java stack belongs in memory. Our design accommodates a memory-based stack of variable size because its memory-to-memory architecture allows a single instruction to pull a value off the stack, perform an operation and push the result back onto the stack (a sketch appears below). This dovetails with Java's byte-code instructions, which also operate in a memory-to-memory fashion. We've successfully tested the core of a JVM on this architecture.

The Mercury architecture supports Ubicom's Software SoC technology, which allows a single processor to support many different communication protocols by modifying the low-level software and some external hardware.
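As a closing illustration, here is the promised sketch of a memory-resident JVM operand stack in C. On a load/store RISC the iadd below compiles to two loads, an add and a store; a memory-to-memory machine can fold it into far fewer instructions. The layout and names are assumptions for illustration, not JVM internals:

    #include <stdint.h>

    /* A memory-based JVM operand stack; sp points one past the top. */
    typedef struct {
        int32_t *sp;
    } jvm_stack;

    /* Java 'iadd': pop two operands, push their sum. All traffic stays
     * in memory; no value needs a long-lived home in a register. */
    static void jvm_iadd(jvm_stack *s)
    {
        s->sp[-2] = s->sp[-2] + s->sp[-1];  /* read two slots, write one */
        s->sp -= 1;                         /* stack shrinks by one slot */
    }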