How to optimize programmability and speed in network processor design
By Robin Melnick, Director of Marketing, Applied Micro Circuits Corp., Switching and Network Processing Division, San Jose, Calif., EE Times
November 1, 2002 (4:23 p.m. EST)
URL: http://www.eetimes.com/story/OEG20021101S0055
The relationship between NPU software and the underlying hardware is critical because of the need for deterministic processing results, even at very high speeds and regardless of peaks or spikes in traffic flow. At one end of the NPU spectrum are hard-wired or only semi-programmable devices that achieve speed by making assumptions about the flow of the data-processing algorithm, limiting flexibility for system designers. At the other end are fully programmable NPUs built with multiple general-purpose cores: software-intensive processors that require a significant amount of hand-tuned code to achieve adequate performance. As is often the case, the best solutions combine aspects from both ends of the spectrum.

To deliver NPU solutions that optimize both programmability and performance, device designers need to keep several things in mind. First, maximize hardware parallelism for performance without increasing software complexity. Second, keep it simple, with an architecture built around a single-stage, single-image programming model. Keep code size and complexity down through the use, wherever possible, of kernel software (an "NPU operating system") that includes NPU "infrastructure APIs" for common operations. In addition, the architecture and tool methodology must provide transparent use of both on-chip and off-chip coprocessors. More important here than in almost any other embedded application, it must be possible to tailor the software and hardware to deliver consistent, hard real-time, deterministic results. From a productivity standpoint, the model chosen must minimize the amount of new code the programmer has to create, and the architecture and tools should maximize true reusability of code across multiple product generations.

The sequential, branch-oriented software models typically used in general-purpose environments are subject to wide variations in execution time and cannot deliver deterministic wire-speed processing as line rates climb to OC-48, OC-192 and beyond. In contrast, NPU architectures typically use multiple-core, parallel-processing models that allow simultaneous operations to be performed on multiple data streams. Even among multi-core NPU architectures, however, there are fundamental differences in the programming models.

The multi-stage pipeline approach logically partitions the software into multiple blocks executed sequentially on the multiple processor cores. The appeal of this approach is that it can be scaled over time through the addition of more cores in the pipeline. But system designers typically pay the price in significant time spent partitioning and optimizing their code. To begin with, programmers have to segment their application into multiple logical blocks, which they must then stitch back together with additional code so that the segments work across the multiple processor cores. Not only do they have to subdivide the algorithm and create that additional code; they must also wrestle with load-balancing the code segments. Programmers generally first approach the partitioning challenge by focusing on the typical logical elements within the data-flow algorithm: classification, search, statistics gathering, traffic metering and packet modification. But these logical blocks are rarely uniform in complexity or execution time.
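A minimal sketch of that stage-per-core structure makes clear where the partitioning and load-balancing burden comes from. The types and queue functions here are hypothetical illustrations, not any vendor's actual API:

/* Hypothetical stage-per-core pipeline: each core runs one
   hand-partitioned block and hands packets to the next core. */

typedef struct packet packet_t;
typedef struct ring   ring_t;            /* inter-core hand-off queue */

extern packet_t *ring_pop(ring_t *in);   /* blocks until work arrives */
extern void      ring_push(ring_t *out, packet_t *p);

/* Each core is dedicated to one stage of the algorithm. Keeping the
   stages balanced in execution time is the programmer's problem; an
   oversized stage stalls every core behind it. */
void core_run_stage(ring_t *in, ring_t *out, void (*stage)(packet_t *))
{
    for (;;) {
        packet_t *p = ring_pop(in);   /* wait on the upstream core  */
        stage(p);                     /* this core's slice of work  */
        ring_push(out, p);            /* hand off to the next core  */
    }
}

Because every packet must traverse every queue and every stage, re-balancing this structure is unavoidable whenever a stage's execution time changes or cores are added.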
As a result, the multi-stage programming model invariably has some degree of inefficiency and under-utilization of hardware resources. In an attempt to offset these inherent inefficiencies, programmers typically resort to further tuning and tweaking their multi-segment code. This generally requires a significant amount of additional work, and the code-tuning process often cannot be finalized until fairly late in the development process, when more subtle timing issues may present themselves. The overlay of higher-level mechanisms, such as "chaining" tools, can alleviate some of the effort of stitching together segments and defining inter-processor hand-offs, but programmers are still faced with tweaking the load balancing to avoid under-utilized cores, segment overruns, timing problems and the like.

In contrast, a single-stage programming model achieves hardware parallelism for performance with multiple threads executing on multiple cores, but with each packet or cell running to completion within a single thread on a single core. In this model, the entire data-flow algorithm is created as a single complete program, just as it would be on a non-multiprocessor system, allowing the same single program image to be executed identically by each thread on each core. This greatly simplifies the programming task while also optimizing performance by eliminating the under-utilization, inter-segment hand-off code, load balancing and other inter-core dependencies typical of the multi-stage pipeline model.

An additional benefit of the single-stage model is that it is not sensitive to latency problems, delivering deterministic performance under virtually any set of conditions. Regardless of the elapsed time needed to process a particular packet, no other cores are left idle, as can happen with a multi-stage model.

A single-stage parallel-processing structure also provides significantly smoother scalability for handling higher line rates. Because the same software runs across all processor cores, scaling capacity is a straightforward matter of adding more cores. In contrast, because a multi-stage architecture segments packet processing into logical blocks spread across different cores, the logical segmentation must be reallocated and re-load-balanced as cores are added.

Infrastructure APIs

Ultimately it is the unique combination of both hardware and software that makes the real difference in a network-processing environment. Compared with the restricted flexibility of configurable devices, the good news about the majority of more general-purpose NPUs is "you get to program everything"; the bad news is "you have to program everything." The ideal mix is the algorithmic flexibility of fully programmable cores combined with specialized on-chip hardware coprocessors for common or complex tasks, delivering higher throughput with less complex software structures. The key for programmers is to be able to transparently access coprocessors, whether on-chip or off-chip, via kernel software APIs. For example, in the programming model we developed for our architecture, as each frame is received by the NPU's channel coprocessor, a request is placed into the Request FIFO, even while the frame data is still moving into the Data FIFO. An nPcore task receives the request and processes the packet or cell to its completion, leveraging any mix of embedded or off-chip coprocessors.
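To make the contrast with the pipeline model concrete, here is a minimal sketch of that run-to-completion dispatch loop. The FIFO accessors and processing routines are hypothetical stand-ins, not AMCC's actual interfaces; the point is that every thread on every core executes this same image:

/* Hypothetical single-stage, run-to-completion model: one identical
   program image on every thread of every core. */

typedef struct request request_t;   /* posted by the channel coprocessor */
typedef struct packet  packet_t;

extern request_t *request_fifo_get(void);         /* next arrival        */
extern packet_t  *data_fifo_frame(request_t *r);  /* associated frame    */
extern void       transmit(packet_t *p);

extern void classify(packet_t *p);
extern void meter_and_count(packet_t *p);
extern void modify(packet_t *p);

void npu_thread_main(void)
{
    for (;;) {
        request_t *req = request_fifo_get();   /* may arrive while the
                                                  frame is still landing
                                                  in the Data FIFO      */
        packet_t  *pkt = data_fifo_frame(req);

        /* The complete data-flow algorithm, run to completion in this
           one thread: no partitioning, hand-off code or load balancing. */
        classify(pkt);
        meter_and_count(pkt);
        modify(pkt);
        transmit(pkt);
    }
}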
Accessed via kernel "infrastructure APIs," both on-chip and off-chip coprocessors essentially appear as single-instruction atomic operations to the programmer. This virtually eliminates the need for developers to write low-level code, allowing them to focus instead on the flow of their own value-added applications, while still preserving complete flexibility in the algorithmic flow of those applications.

Ultimately, the best way to speed system development is to minimize the amount of "infrastructure" code that has to be written by application developers in the first place. Comparisons have shown that in real-world implementations using kernel API-accessed, coprocessor-assisted models such as AMCC's nP7000 family, 90 to 95 percent of the total NPU-resident code is in the kernel itself, leaving just 5 to 10 percent that is specific to any given application. Use of off-the-shelf application libraries reduces this figure even further.

Instruction-set compatibility alone does not guarantee code reusability. In a multi-stage NPU model, moving to a new generation even within the same NPU family typically requires re-partitioning the application code and a new load-balancing exercise to avoid under-utilizing the new processing capacity. Even if programmers have already optimized their multi-threaded code for a previous generation, significant portions may have to be completely redone if the code modules do not still segment smoothly among the increased number of cores. It is beside the point that the "instruction set hasn't changed" when programmers still have to re-tune and reallocate major portions of their code between cores to take advantage of each new generation of performance.

In comparison, code written for a single-stage model typically can be migrated virtually intact from one NPU generation to the next. There is no re-partitioning, and the same "infrastructure APIs" can be leveraged to run existing code on new nPcores with immediate performance results. In a real-world example, in our architecture we have found that less than 10 percent of application code had to be modified from one product generation to the next.
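To close, here is a minimal sketch of what coprocessor access through such kernel infrastructure APIs might look like to an application programmer. The function names are illustrative assumptions, not AMCC's actual API; each call behaves like a single atomic operation, with the kernel deciding how the underlying on-chip or off-chip coprocessor services it:

#include <stdint.h>

typedef struct packet packet_t;

/* Hypothetical kernel infrastructure APIs fronting coprocessors. */
extern uint32_t np_search(uint32_t table_id, const void *key, int len);
extern void     np_stats_add(uint32_t counter_id, uint32_t bytes);
extern int      np_meter(uint32_t flow_id, uint32_t bytes);

/* Application code stays at the level of the data-flow algorithm;
   none of the coprocessor plumbing below these calls is visible. */
void forward(packet_t *pkt, const void *dst_key, uint32_t flow, uint32_t len)
{
    uint32_t next_hop = np_search(0, dst_key, 16);  /* search copro.  */

    np_stats_add(next_hop, len);        /* per-next-hop statistics    */
    if (np_meter(flow, len) < 0)        /* traffic-metering copro.    */
        return;                         /* out of profile: drop       */

    /* ... packet modification and transmit would follow ... */
    (void)pkt;
}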
Source: AMCC