Network processors need a new programming methodology

By Akash Deshpande,CTO and Founder, Teja Technologies, Inc., San Jose, Calif., EE Times
August 5, 2002 (10:32 a.m. EST)
URL: http://www.eetimes.com/story/OEG20020802S0040

One of the first challenges a developer faces in designing a system based on a network processor unit (NPU) is partitioning the functionality so that all the processing elements (PEs) in the complete system are used in an optimum manner.

For example, in some applications, such as a single-board network appliance, an Intel NPU could be used standalone, with the XScale core running the control and management functions and the microengines (MEs) running the mainstream forwarding functions. In this simplest of cases, the application logic needs to be partitioned only between the XScale and the MEs.

More typically, an NPU is used in conjunction with an external host CPU for control and management, and multiple NPUs are used together through a switch fabric. For example, separate ingress and egress NPUs, or coprocessors such as classifiers and CAMs, may be added to boost performance. Furthermore, OEMs designing a scalable product family, rather than a single point product, will want to use these hardware building blocks configured in various combinations to meet a range of price/performance targets.

In the absence of a systems-level development environment, OEMs would have to program each of these various PEs in separate languages - C/C++ for the XScale and host CPUs, assembler or "Micro-C" (a C subset) for the ME, and some form of configuration language for the coprocessors. This approach requires an a priori assignment of logic to PE, locking the developer into a particular system configuration. For example, a software implementation of a classifier is considerably different from the code required to configure a classifier coprocessor. Thus, moving functionality from one PE to another would require manual recoding - leading to multiple software code bases, increasing the probability of errors, and hence loss of investment leverage in developing a product family that targets different price/performance points.

How then to provide an approach that does not require coding each PE in a separate language? We think the best approach to this problem is one in which the application logic is developed in a manner that is independent of any specific target hardware. Multiple realizations are achieved from a common code base by mapping the application logic to the target hardware architecture corresponding to each price/performance point.

Such an approach requires the introduction of two new concepts: an architecture framework for parallel and distributed applications; and a state machine/state logic programming methodology that abstracts the network logic over specific hardware features.

The architecture framework must be robust enough to define the architecture of an arbitrary, distributed, multi-processor, networking application, and the programming methodology must be robust enough to define the logic of an arbitrary network application. Both must be represented in a manner that is completely independent of the target hardware, yet enables the efficient mapping of the application logic to different target PEs, thereby future-proofing application designs for new NPU generations. The framework must be supported by a run-time system that defines a standardized API across all PEs to provide common services such as scheduling, memory management, and communication. In addition, the programming methodology must be supported by language environment tools such as domain-specific editors, code generators (compilers), and debuggers.

Ultimately, the application architected using the framework elements and expressed in the programming methodology is mapped to the PEs of the targeted hardware. This mapping is not unique. Different mappings for the same target hardware produce different performance results, from which the best must be chosen. This mapping step takes on all the more significance when one considers the changing relationships between memory bandwidth, processor speed, and network media interfaces. With the mapping completed, code generators are employed to convert the application logic into high-performance machine-executable code.
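
To make the idea of choosing among mappings concrete, here is a minimal sketch in Python. It is not the tooling described in this article. It assumes a deliberately crude cost model in which each thread needs a fixed number of cycles per packet, each PE has a relative speed, and throughput is limited by the most heavily loaded PE. The function name best_mapping and the thread and PE names used with it are hypothetical.

```python
from itertools import product

def best_mapping(thread_cycles, pe_speed):
    """Try every thread-to-PE assignment and keep the one with the
    smallest bottleneck (per-packet time on the busiest PE).

    thread_cycles: {thread name: cycles needed per packet}  (assumed input)
    pe_speed:      {PE name: relative speed, cycles per time unit}
    Returns (bottleneck_time, {thread name: PE name}).
    """
    threads = sorted(thread_cycles)
    pes = sorted(pe_speed)
    best = None
    # Exhaustive search: fine for a toy example, exponential in general.
    for assignment in product(pes, repeat=len(threads)):
        load = {pe: 0.0 for pe in pes}
        for thread, pe in zip(threads, assignment):
            load[pe] += thread_cycles[thread] / pe_speed[pe]
        bottleneck = max(load.values())
        if best is None or bottleneck < best[0]:
            best = (bottleneck, dict(zip(threads, assignment)))
    return best
```

With three threads costing 400, 200, and 100 cycles and two equally fast PEs, the search places the heaviest thread alone on one PE and the other two together on the second. A real mapping tool would also have to account for memory bandwidth and channel placement, which this sketch ignores.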

The essence of this new architectural framework is that the structure of any packet-oriented, networking application can be expressed using threads, memory spaces, and communication channels. A thread provides an execution environment for program logic fragments. Memory spaces provide storage for static data structures and dynamic memory allocation. Communication channels facilitate message passing between threads. Using these, one can move from a linear programming model to a parallel one.
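
The three building blocks can be illustrated with a small Python sketch that uses only the standard threading and queue modules. This is an analogy for the concepts, not the framework's API: the function names classifier and run_pipeline, the dictionary used as a forwarding table, and the None shutdown sentinel are all assumptions made for the example.

```python
import queue
import threading

def classifier(rx_chan, tx_chan, table):
    """Thread body: take packet descriptors from one channel, tag each
    with an output port looked up in a shared table, and pass it on."""
    while True:
        desc = rx_chan.get()
        if desc is None:              # sentinel: propagate shutdown and stop
            tx_chan.put(None)
            return
        desc["port"] = table.get(desc["dst"], "drop")
        tx_chan.put(desc)

def run_pipeline(packets, table):
    """Wire one thread between two channels and push packets through it."""
    rx_to_cls = queue.Queue()         # communication channel into the thread
    cls_to_tx = queue.Queue()         # communication channel out of the thread
    worker = threading.Thread(target=classifier,
                              args=(rx_to_cls, cls_to_tx, table))
    worker.start()
    for p in packets:
        rx_to_cls.put(dict(p))        # copy so the caller's data is untouched
    rx_to_cls.put(None)
    out = []
    while True:
        desc = cls_to_tx.get()
        if desc is None:
            break
        out.append(desc)
    worker.join()
    return out
```

Here the table plays the role of a memory space holding a static data structure, the two queues are the communication channels, and the worker is the thread. Because the thread touches the outside world only through its channels and the table, nothing in its body depends on where it runs.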

The threads, memory spaces, and communication channels require a supporting run-time environment. It provides for the thread scheduling, memory management, and synchronization in the form of mutual exclusion semaphores, queues, and message passing. The run-time environment must be provided for each PE in the target hardware architecture.

The essence of this programming methodology is that the logic of any networking application can be described in terms of data structures and state machines with structured actions. Data structures are specified in an object-oriented style that supports inheritance, member variables, member functions, constructors, and destructors. For additional modularity and reusability, a template mechanism is provided for class type generation by using late-binding constants. The data structure specification abstracts over layout, alignment, and memory access protocol. Even though the application specification uses object-oriented classes, the code generators "flatten" the code into efficient C or assembly code.

Data structures can be coupled to program logic fragments described as state machines with structured actions. State machine states may be "wait" states, where the program waits for the passage of time or for external synchronization, or "transient" states, which simply break the program flow into manageable fragments of code. The states are joined by transitions that specify the conditions, called guards, under which they fire and a body of code, called the action, that must be executed when they fire. These actions provide for the abstract specification of computations on data structures, allocation of new data structures and state machines, communication of messages, and the management of the synchronization relationships among the different state machines.
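
A minimal Python sketch of states joined by guarded transitions follows. The class names Transition and StateMachine, the step method, and the toy three-state handshake are hypothetical illustrations, not the methodology's actual notation. Only event-driven wait states are modeled; timers and transient states are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Transition:
    target: str                              # state entered after firing
    guard: Callable[[dict, dict], bool]      # condition under which it fires
    action: Callable[[dict, dict], None]     # code executed when it fires

@dataclass
class StateMachine:
    state: str
    transitions: Dict[str, List[Transition]]
    data: dict = field(default_factory=dict) # data structure coupled to the logic

    def step(self, event: dict) -> bool:
        """Fire the first transition of the current state whose guard holds.
        Returns False, leaving the state unchanged, if none fires."""
        for t in self.transitions.get(self.state, []):
            if t.guard(self.data, event):
                t.action(self.data, event)
                self.state = t.target
                return True
        return False

def make_handshake() -> StateMachine:
    """Toy protocol (an assumption for illustration): idle -> syn_seen -> established."""
    def is_type(kind):
        return lambda data, ev: ev.get("type") == kind
    def record(data, ev):
        data.setdefault("log", []).append(ev["type"])
    return StateMachine(
        state="idle",
        transitions={
            "idle": [Transition("syn_seen", is_type("syn"), record)],
            "syn_seen": [Transition("established", is_type("ack"), record),
                         Transition("idle", is_type("rst"), record)],
        },
    )
```

Feeding the machine an "ack" while it is idle fires nothing. A "syn" followed by an "ack" moves it to the established state and leaves both events recorded in its data.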

Using state machines
The rationale for choosing state machines is based on the following observations:

Most networking protocols are defined using state machine representations
Configuration languages for coprocessors are typically state machine-based
Compiler and code generator internal representation is state machine-based
State machines are a natural extension of application modeling and provide an unambiguous, self-documenting description of the processing algorithm.

The object-oriented style of describing data structures and state machines promotes reusability of these classes and enables the user to extend the framework in unique ways through subclassing. The structured actions approach not only provides access to run-time APIs (abstracted from the hardware), but also enables the easy integration of existing application code and protocol stacks. In addition, Teja has utilized the flexibility of its code generators to integrate applications based on Intel's Portability Framework (or "microblocks") available from Intel and other third parties in the Intel NPU ecosystem.

Combining the software architecture framework and programming methodology results in an extremely powerful representation of a networking application described as a network of queues and servers.

In this framework, the data plane is where all packets transit from reception to transmission, the control plane is where exception packets are processed and forwarding tables are updated, and the management plane is where statistics or performance measures are collected. As mentioned earlier, the control and data planes may reside on the same NPU, as in the simplest case, or be distributed across multiple processors.

In the data plane are the receive (Rx) and transmit (Tx) servers, which interface with the physical media. In between are servers, or threads of execution, that perform classification, security, forwarding, or any number of packet-oriented tasks. The packets are handed off from one server to the next by movement of packet descriptors into and out of queues, while the packets themselves do not move. Depending on the processor characteristics, the queues may actually reside in a variety of places. The goal of the code generators is to analyze the silicon and the application sufficiently well in order to make mapping assignments automatically (while accepting user direction and providing adequate feedback) to optimize packet throughput.
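
The hand-off of descriptors between servers can be sketched in a few lines of Python. This is a single-threaded toy under stated assumptions: the function name run_data_plane, the dictionary used as packet memory, and the integer buffer handles are all hypothetical, and each server simply runs to completion in turn rather than concurrently.

```python
from collections import deque

def run_data_plane(frames, routes):
    """Pass small descriptors through Rx -> classify -> forward -> Tx queues
    while the frames themselves stay put in one buffer store."""
    buffers = {}                          # packet memory: frames live here
    rx_q, fwd_q, tx_q = deque(), deque(), deque()

    # Rx server: store each frame once and enqueue only a descriptor.
    for handle, frame in enumerate(frames):
        buffers[handle] = frame
        rx_q.append({"handle": handle})

    # Classification server: read the header through the handle.
    while rx_q:
        desc = rx_q.popleft()
        desc["dst"] = buffers[desc["handle"]]["dst"]
        fwd_q.append(desc)

    # Forwarding server: choose an output port, or drop and free the buffer.
    dropped = 0
    while fwd_q:
        desc = fwd_q.popleft()
        port = routes.get(desc["dst"])
        if port is None:
            del buffers[desc["handle"]]
            dropped += 1
            continue
        desc["port"] = port
        tx_q.append(desc)

    # Tx server: the frame leaves packet memory only at transmission.
    sent = []
    while tx_q:
        desc = tx_q.popleft()
        frame = buffers.pop(desc["handle"])
        sent.append((desc["port"], frame["payload"]))
    return sent, dropped
```

Only the descriptors travel through the queues. On real hardware the queues and the buffer store could sit in different memories, which is the kind of placement decision the mapping step is meant to make.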

The framework provides mechanisms for sending packets to and from the control plane through predefined drivers, and from the control plane to higher level functions through the standardized interfaces being defined by the Network Processing Forum.

Finally, the network application framework provides an agent interface that can pull the management information from the data plane and the control plane and present it to various management interfaces, such as a web or command line interface.

With these architecture framework and programming methodology concepts, complex, distributed, high-performance applications can be easily represented, all without reference to specific hardware elements. The last step then is to understand how these concepts are applied to create the working application.

The user starts with the class library design tool to define all the classes needed in the application, subclassing the framework classes to specify unique functionality.

The software architecture design tool is next used to lay out the instances of the classes and the channels of communication that define the desired application. The software architecture layout makes no assumptions about the target hardware, such as whether there is a single NPU or multiple CPUs, NPUs, and coprocessors, or which PEs run which functions.

The hardware architecture tool is used to describe the target hardware in terms of processor types, memory banks (SRAM, SDRAM, registers), bus connections between silicon devices (SRAM or PCI), and various other options. Also specified is the OS to be used on the RISC core of the NPU or on the host processors. The tool provides templates for commonly available chips and boards.

Once the software architecture is complete, the mapping tool is used to assign the software elements to the hardware elements - logical threads are assigned to processors, logical memory spaces to physical memories, logical channels to physical media, and so on. Once the assignments or mappings are made, the code generators produce the appropriate run-time code, and the user can test the application with the software simulators or the physical hardware.
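
The assignment step can be pictured as a table from logical elements to physical ones, checked for consistency before any code is generated. The Python sketch below is an illustration only: the REQUIRED rule table, the check_mapping function, and the element names used with it are hypothetical and do not describe the actual mapping tool.

```python
# Hypothetical rule table: the kind of hardware element that each kind of
# software element may be assigned to (names are illustrative assumptions).
REQUIRED = {"thread": "processor", "memory_space": "memory", "channel": "medium"}

def check_mapping(software, hardware, mapping):
    """Return a list of problems; an empty list means the mapping is consistent.

    software: {logical element name: "thread" | "memory_space" | "channel"}
    hardware: {physical element name: "processor" | "memory" | "medium"}
    mapping:  {logical element name: physical element name}
    """
    problems = []
    for name, kind in sorted(software.items()):
        target = mapping.get(name)
        if target is None:
            problems.append(f"{name}: not mapped")
        elif target not in hardware:
            problems.append(f"{name}: unknown hardware element '{target}'")
        elif hardware[target] != REQUIRED[kind]:
            problems.append(
                f"{name}: {kind} cannot be placed on {hardware[target]} '{target}'")
    return problems
```

Because the software description never names a physical element, retargeting the same application to a different board means supplying a new hardware table and a new mapping, while the application logic stays unchanged.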

Through the debugging tools, clues are given as to how to re-assign the software tasks, and the code generation is repeated. In this way, the iterative process of finding the optimal assignment of software architecture to hardware architecture can be cut to a fraction of the time, and the "Time to Performance" minimized.