How to analyze processor features for network use

By Markus Levy, Courtesy of Embedded Systems Programming
Apr 18 2005 (15:46 PM)
URL: http://www.embedded.com/showArticle.jhtml?articleID=160902230

EEMBC benchmarks may help you analyze and decide which processor to use in your network application. This article explains various benchmarks and how you can interpret their results. It also shows an example.

Selecting the best processor for the job requires that you balance many factors such as the chip's price, power, availability, and performance. Although performance is sometimes equated with speed, the demands put on processing subsystems differ widely depending on the task. In networking applications, processors need to move data packets and translate network addresses, among other things, and no device category is perfect when it comes to handling all of these tasks. Hardwired logic is often too inflexible to deal with many of the error conditions and interactions that can occur with Internet Protocol (IP) packets. On the other hand, flexible software-programmable processors may be too slow to support next-generation routers and switches.

To give designers a tool for comparing how these processing subsystems perform as they carry data packets across a network, the Embedded Microprocessor Benchmarking Consortium (EEMBC) developed the Networking Benchmark Suite Version 2.0. The suite is structured so that its algorithms apply maximum stress to candidate processors, exposing the strengths and weaknesses of individual processors on different types of code. This article discusses the components of that suite and presents results of tests run on two high-end, PowerPC-based processors, the Freescale MPC7447A and the IBM 750GX.

No single benchmark is adequate to address processor performance in every application, and even within networking, the tasks faced by processors differ considerably depending on where the devices are used. For example, routers typically don't need to process information in the Transmission Control Protocol (TCP) layer, so including tests associated with TCP performance would skew the results for OEMs looking to select the best processor for work at the lower IP layer.

Some of EEMBC's networking benchmarks are thus designed to reflect the performance of client and server systems, while others represent functions predominantly carried out in infrastructure equipment. Two new consolidated score types, the TCPmark and the IPmark, aggregate the results for the benchmarks in each group. The IPmark is intended for developers of infrastructure equipment, while the TCPmark, which includes the TCP benchmark, focuses on client- and server-based network hardware.
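The article does not give the formula behind the consolidated marks. As a rough illustration only, the sketch below assumes a geometric mean, a common way to combine benchmark scores so that no single test dominates. The function name and the sample numbers are hypothetical and are not EEMBC data.

```python
import math

def consolidated_score(scores):
    """Geometric mean of per-benchmark scores.

    ASSUMPTION: the aggregation rule is a plain geometric mean; the
    article does not state how the IPmark or TCPmark is computed.
    """
    if not scores or any(s <= 0 for s in scores):
        raise ValueError("scores must be a non-empty list of positive numbers")
    # Averaging the logarithms avoids overflow when many scores are multiplied.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# Hypothetical iterations/sec values for two benchmarks in one group.
print(consolidated_score([2.0, 8.0]))
```

With a geometric mean, doubling one benchmark's score raises the mark by the same factor regardless of that benchmark's absolute magnitude, which is why it is often preferred over an arithmetic mean for scores with very different scales.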

Quick look at the processors
When comparing the performance of processors, it is best to have as many system-level similarities as possible. Conveniently, the MPC7447A and 750GX are both based on the PowerPC instruction set, both have 32K instruction and data caches, and both have 64-bit external buses. Both processors ran benchmark code compiled with the Green Hills Multi 4.0 compiler.

However, this is where the similarities end. Freescale's MPC7447A contains a 1.4-GHz superscalar core with a seven-stage pipeline capable of issuing four instructions per clock cycle (three instructions plus one branch) into 11 independent execution units. IBM's 750GX contains a 1-GHz core and a four-stage pipeline. The shorter pipeline gives this chip an advantage when running branch-intensive applications and benchmarks due to its shorter load latency and smaller branch-mispredict penalty. On the other hand, the MPC7447A has a different advantage with its integrated AltiVec engine, useful for vectorizable algorithms.

The MPC7447A has a 512K, 8-way set-associative second-level (L2) cache, and the 750GX has a 1M, 4-way L2 cache. The Freescale chip's higher set associativity (8-way versus 4-way) is an advantage in multitasking systems, often delivering performance comparable to a cache that's twice as large. The 750GX's memory bus, however, runs at up to 200 MHz, compared with 133 MHz for the Freescale device. These differences will be significant factors in analyzing the benchmark results.

Routing IP packets
Whether a packet is to be forwarded to another router or processed and sent to a local machine, the router's first step in processing packets is to validate the IP header information. EEMBC's Packet Check benchmark models a subset of the IP-header validation work specified in RFC 1812, the standard that defines the requirements for packet checks carried out by IP routers. To maintain as much realism as possible, the benchmark emulates the way in which actual systems process packet headers, and therefore also includes packet headers with intentional errors that the processor must handle appropriately. The benchmark uses a scheme where descriptors are separated from the packet headers, allowing the descriptors to be managed as a linked list with each descriptor pointing to an individual IP packet header. This demonstrates the processor's ability to work with the branch- and pointer-intensive code found in typical packet-switch code bases. Further, by focusing on error handling rather than raw packet throughput, the benchmark checks realistic processor behavior and not just cache-to-memory speed.
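To make the header checks concrete, here is a minimal sketch of the kind of validation RFC 1812 describes (length, version, header length, and checksum), applied while walking a linked list of descriptors. It is not EEMBC's benchmark code; the class, function, and result names are hypothetical, and the addresses are documentation examples.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def check_ipv4_header(header: bytes, link_len: int) -> str:
    """Return 'ok' or the name of the first failed check."""
    if link_len < 20 or len(header) < 20:
        return "too_short"
    version, ihl = header[0] >> 4, header[0] & 0x0F
    if version != 4:
        return "bad_version"
    if ihl < 5:
        return "bad_ihl"
    hdr_len = ihl * 4
    if len(header) < hdr_len:
        return "too_short"
    total_len = struct.unpack("!H", header[2:4])[0]
    if total_len < hdr_len or total_len > link_len:
        return "bad_length"
    # A valid header sums to 0xFFFF, so the complemented result is 0.
    if internet_checksum(header[:hdr_len]) != 0:
        return "bad_checksum"
    return "ok"

class Descriptor:
    """Hypothetical descriptor: points at one header and at the next descriptor."""
    def __init__(self, header, link_len, next_desc=None):
        self.header, self.link_len, self.next = header, link_len, next_desc

def check_chain(head):
    """Walk the descriptor list and collect one result per header."""
    results, node = [], head
    while node is not None:
        results.append(check_ipv4_header(node.header, node.link_len))
        node = node.next
    return results

def build_header(total_len=20, version=4, ihl=5):
    """Build a 20-byte test header with a correct checksum."""
    hdr = bytearray(20)
    hdr[0] = (version << 4) | ihl
    struct.pack_into("!H", hdr, 2, total_len)
    hdr[8], hdr[9] = 64, 6                  # TTL, protocol (TCP)
    hdr[12:16] = bytes([192, 0, 2, 1])      # source address
    hdr[16:20] = bytes([198, 51, 100, 7])   # destination address
    struct.pack_into("!H", hdr, 10, internet_checksum(bytes(hdr)))
    return bytes(hdr)

good = build_header()
bad = bytearray(good)
bad[8] ^= 0xFF                              # corrupt the TTL after checksumming
print(check_chain(Descriptor(good, 20, Descriptor(bytes(bad), 20))))
```

Each check is a compare followed by an early return, and each step through the list is a pointer dereference, which is the branch- and pointer-heavy pattern the paragraph above describes.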

Tested against the Packet Check benchmarks, the "out-of-the-box" scores for the Freescale and IBM processors are within a few percent of each other, as shown in Figure 1. Despite its 1-GHz operating frequency, the 750GX does well on these benchmarks due to its large L2 cache and 200-MHz system bus.

Figure 1: Packet Check benchmark scores for the Freescale MPC7447A and IBM 750GX

For the packet-forwarding function, the router must determine which other routers are available for forwarding and find the shortest path to each. Open Shortest Path First (OSPF) is the most popular Internet routing protocol used to determine the correct route for packets within IP networks. Using Edsger Dijkstra's shortest-path-first algorithm, EEMBC's OSPF benchmark performs a series of calculations to determine the destination port for each given route. More than half of the instructions executed in the OSPF benchmark are some type of compare or branch instruction. Intuitively this should favor the 750GX with its shorter pipeline, but it's the MPC7447A that gets the higher OSPF score because of its higher frequency, greater number of functional units, and caches that are large enough to hold all the code and data for the OSPF benchmark.
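The shortest-path calculation can be sketched as follows. This is a generic textbook version of Dijkstra's algorithm using a binary heap, not the benchmark's implementation; the graph layout, the function name, and the idea of reporting the first hop as a stand-in for the destination port are assumptions made for illustration.

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra's algorithm over {node: {neighbor: cost}} with non-negative costs.

    Returns (dist, first_hop), where first_hop[n] is the neighbor of
    `source` that the shortest path to n leaves through.
    """
    dist = {source: 0}
    first_hop = {}
    heap = [(0, source, None)]
    done = set()
    while heap:
        d, node, hop = heapq.heappop(heap)
        if node in done:
            continue                        # stale entry for an already settled node
        done.add(node)
        if hop is not None:
            first_hop[node] = hop
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nbr not in dist or nd < dist[nbr]:
                dist[nbr] = nd
                # Paths that start at the source define their own first hop.
                heapq.heappush(heap, (nd, nbr, nbr if node == source else hop))
    return dist, first_hop

# Hypothetical four-router topology with symmetric link costs.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 6},
    "C": {"A": 4, "B": 2, "D": 3},
    "D": {"B": 6, "C": 3},
}
dist, first_hop = shortest_paths(topology, "A")
print(dist, first_hop)
```

Nearly every line of the inner loop is a comparison or a conditional branch, which is consistent with the instruction mix reported for the benchmark.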

Once the route tables are built using protocols such as OSPF, efficient route lookups are fundamental to the performance of network routers. Based on information found in lookup tables, the Route Lookup benchmark receives and forwards IP packets using a mechanism commonly applied to commercial network routers. It employs a data structure known as the Patricia Tree, a compact binary tree that allows fast and efficient searches with long or unbounded-length strings. The benchmark monitors the processor's ability to check the tree for the presence of a valid route and walk through the tree to find the destination node to which to forward the packet. The code and data for this benchmark fit into the L1 caches, favoring the MPC7447A because of its higher frequency and because it doesn't need to use its external memory bus. Although this benchmark consists mainly of compare and branch instructions, the 750GX's shorter pipeline doesn't appear to provide significant benefit.
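The tree walk can be illustrated with a simplified structure. The sketch below is a plain one-bit-per-level binary trie doing longest-prefix matching, not a true Patricia tree (which additionally collapses single-child chains to save memory and steps), and it is not the benchmark's code. The class names, prefixes, and port labels are hypothetical.

```python
import ipaddress

class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]        # index 0 for bit 0, index 1 for bit 1
        self.next_hop = None                # set only where a route prefix ends

class RouteTable:
    """Uncompressed binary trie for IPv4 longest-prefix match."""
    def __init__(self):
        self.root = TrieNode()

    def add(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1    # walk from the most significant bit
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(self, addr: str):
        bits = int(ipaddress.ip_address(addr))
        node = self.root
        best = node.next_hop                # default route, if one was added
        for i in range(32):
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break                       # no longer prefix exists
            if node.next_hop is not None:
                best = node.next_hop        # remember the longest match so far
        return best

table = RouteTable()
table.add("0.0.0.0/0", "default")
table.add("10.0.0.0/8", "port1")
table.add("10.1.0.0/16", "port2")
print(table.lookup("10.1.2.3"), table.lookup("10.2.0.1"), table.lookup("192.0.2.1"))
```

Each lookup is a chain of dependent pointer loads and branches, so a table small enough to stay in the L1 cache is limited mostly by core frequency, in line with the result described above.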

Figure 2: Networking Version 2.0 benchmark scores for the MPC7447A and 750GX

Figure 2 shows the "out-of-the-box" Networking Version 2.0 benchmark scores for the MPC7447A and 750GX, which demonstrate that operating frequency alone does not account for processor performance. A larger L2 cache, faster memory bus, and shorter pipeline contribute to the 750GX's high performance. Only on the computation-intensive OSPF benchmark does the advanced superscalar MPC7447A take the advantage. (The Y-axis is iterations/sec.)

Processing packets at the network boundary
An increasingly important function for IP routers that sit on the boundary between an organization's internal network and the Internet is Network Address Translation (NAT), converting packets as they pass from an internal network to the Internet at large. NAT provides a method to work around the limited number of IP addresses and ports on the Internet. Additionally, NAT is normally required when a network's internal IP addresses cannot be used outside the network, either because they aren't globally unique or because of privacy reasons.

Dynamic NAT routing processes all outgoing packets and has the additional complexity of port assignment on the incoming and outgoing packets to preserve connections between clients and servers. EEMBC's NAT benchmark uses packets with various source addresses, destination addresses, and random packet sizes. Each packet is wrapped with IP header information and the packets are assembled into a list for processing. The benchmark begins processing and rewriting the IP addresses and port numbers of packets based on the predefined NAT rules. Each rewritten packet will have a modified source IP address and source port chosen from the available ports of each IP address available to the router.
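The address and port rewriting can be sketched as a pair of lookup tables. This is a minimal illustration under assumed rules (one public address and a small port pool), not the benchmark's rule set; the class name, addresses, and port range are hypothetical.

```python
class Napt:
    """Toy network address and port translator with a single public address."""

    def __init__(self, public_ip, port_range=range(40000, 40004)):
        self.public_ip = public_ip
        self.free_ports = list(port_range)
        self.out_map = {}                   # (private_ip, private_port) -> public_port
        self.in_map = {}                    # public_port -> (private_ip, private_port)

    def translate_outgoing(self, src_ip, src_port):
        """Return the rewritten (source IP, source port) for an outgoing packet."""
        key = (src_ip, src_port)
        if key not in self.out_map:
            if not self.free_ports:
                raise RuntimeError("port pool exhausted")
            port = self.free_ports.pop(0)   # assign the next available port
            self.out_map[key] = port
            self.in_map[port] = key
        return self.public_ip, self.out_map[key]

    def translate_incoming(self, dst_port):
        """Return the private (IP, port) for a reply, or None if unmapped."""
        return self.in_map.get(dst_port)

nat = Napt("203.0.113.5")
print(nat.translate_outgoing("10.0.0.2", 5000))
print(nat.translate_outgoing("10.0.0.3", 5000))
print(nat.translate_incoming(40001))
```

Keeping both directions of the mapping is what preserves a client-server connection: the reply arriving on a public port is matched back to the private host that opened it.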

As Internet traffic passes from one part of the network to another, the packets themselves may need to be altered. Each network technology has a maximum frame size that defines the maximum transmission unit (MTU), or maximum packet size, that can be carried over the network. When an IP packet is too large to fit within the MTU of the egress (outgoing) interface, it can no longer be transmitted as a single frame. Rather, the IP packet must be split up and transmitted in multiple frames. EEMBC's IP Reassembly benchmark takes the asymmetric nature of fragmentation and reassembly into account and makes extensive use of out-of-order delivery and random source-packet sizes to stress the processor's ability to perform reassembly.
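The asymmetry between the two operations is easy to see in code: fragmentation is a simple slicing loop, while reassembly must track holes and cope with arbitrary arrival order. The sketch below is an illustration only, with hypothetical names. It tracks offsets in bytes (real IP headers store the offset in 8-byte units), assumes a 20-byte header, and does not handle overlapping fragments or timeouts.

```python
def fragment(payload: bytes, mtu: int, header_len: int = 20):
    """Split a payload into (offset, more_fragments, data) tuples."""
    chunk = (mtu - header_len) // 8 * 8     # non-final fragments carry a multiple of 8 bytes
    if chunk <= 0:
        raise ValueError("MTU too small to carry any payload")
    return [(off, off + chunk < len(payload), payload[off:off + chunk])
            for off in range(0, len(payload), chunk)]

class Reassembler:
    """Collects fragments of one datagram until no holes remain."""

    def __init__(self):
        self.fragments = {}                 # byte offset -> data
        self.total_len = None               # known once the last fragment arrives

    def add(self, offset, more_fragments, data):
        """Store a fragment; return the full payload once complete, else None."""
        self.fragments[offset] = data
        if not more_fragments:
            self.total_len = offset + len(data)
        return self._try_complete()

    def _try_complete(self):
        if self.total_len is None:
            return None
        payload = bytearray()
        for offset in sorted(self.fragments):
            if offset != len(payload):
                return None                 # a hole remains before this fragment
            payload += self.fragments[offset]
        return bytes(payload) if len(payload) == self.total_len else None

original = bytes(range(100))
frags = fragment(original, mtu=60)          # three fragments of 40, 40, and 20 bytes
r = Reassembler()
results = [r.add(*frags[i]) for i in (2, 0, 1)]   # deliver out of order
print(results[0], results[1], results[2] == original)
```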

Figure 2 shows the results of NAT and IP Reassembly benchmark tests on the MPC7447A and 750GX. About 80% of the IP Reassembly benchmark's instructions are equally divided between load/store and compare/branch. The instruction mix for the NAT benchmark is similar to that of the IP Reassembly benchmark, with the addition of a few multiply and divide instructions. The combination of the shorter pipeline and the 1M L2 cache significantly favors the 750GX on this benchmark.

One further caveat is that the system configuration used to generate these benchmark scores paired the MPC7447A with the older Tundra tsi107 system controller (133-MHz bus). The newer Discovery III system controller, which was not available at the time of this benchmark certification, supports a 167-MHz bus.

Improving quality of service
The data carried over the Internet has evolved from text and files—where timing and order of packet arrival are irrelevant—to voice, video, and multimedia, where timing and order are critical. Quality of service (QoS) processing addresses the mixed data type issues by giving the client guaranteed data transfer and error rates that are suitable to support a deterministic application. This QoS guarantee significantly reduces the network loading due to errors and retransmission and enables these new forms of data to flow over the Internet.

EEMBC's QoS benchmark simulates the processing undertaken by bandwidth management software used to "shape" traffic flows to meet QoS requirements. Based on predefined rules, the system paces the delivery of packets to the desired speed. The benchmark begins by processing packets according to the rule set, which determines the routing and addressing needed to preserve the QoS for each packet stream. As the number of packets in the system increases, port diversions occur to maintain the QoS, and queues are established to wait for available pipes. The benchmark is very compute-intensive, with about 45% of its instructions doing loads and stores and about 40% doing compares and branches (demonstrating the value of the 750GX's architecture).
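The article does not say which pacing algorithm the QoS benchmark uses. As one common way to shape a flow to a target rate, the sketch below assumes a token bucket; the class name, rate, and burst size are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Token-bucket shaper (ASSUMED technique, not taken from the benchmark)."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes           # start with a full bucket
        self.last = 0.0                     # time of the previous call, in seconds

    def allow(self, now: float, size: int) -> bool:
        """Return True if a packet of `size` bytes may be sent at time `now`."""
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False                        # the caller would queue the packet

shaper = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
print(shaper.allow(0.0, 1500), shaper.allow(0.5, 1000), shaper.allow(1.0, 1000))
```

A packet that is refused is not dropped by this sketch; in a shaper it would wait in a queue until enough tokens accumulate, which corresponds to the queues waiting for available pipes described above.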

TCP for clients and servers
Clients and servers must process higher-level protocols, such as the Transmission Control Protocol (TCP). Forming the transport layer protocol used by Internet applications such as Telnet, File Transfer Protocol (FTP), and HyperText Transfer Protocol (HTTP), TCP provides a link that looks to the application as though it is a direct connection. Because TCP uses IP services to deliver packets, and IP doesn't care about the order in which complete IP packets are delivered, TCP is designed to handle packet reordering and resending for situations where a router may have dropped packets to be able to meet its overall service level requirements.
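The reordering that TCP performs can be sketched with a small receive buffer keyed by sequence number. This is a simplified illustration, not a TCP implementation: the class name is hypothetical, and it ignores sequence-number wraparound, partially overlapping segments, acknowledgments, and retransmission timers.

```python
class ReorderBuffer:
    """Delivers byte segments to the application strictly in sequence order."""

    def __init__(self, initial_seq: int = 0):
        self.next_seq = initial_seq         # next byte the application expects
        self.pending = {}                   # sequence number -> segment data

    def receive(self, seq: int, data: bytes) -> bytes:
        """Accept one segment; return whatever is now deliverable in order."""
        if seq < self.next_seq or seq in self.pending:
            return b""                      # duplicate of data already seen
        self.pending[seq] = data
        out = bytearray()
        # Drain every segment that has become contiguous with delivered data.
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            out += chunk
            self.next_seq += len(chunk)
        return bytes(out)

rb = ReorderBuffer()
print(rb.receive(5, b"world"))              # arrives early, held back
print(rb.receive(0, b"hello"))              # fills the gap, both are delivered
```

Holding early segments and copying them out later is load- and store-heavy work, which fits the instruction mix the article reports for the TCP benchmarks.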

EEMBC's Networking Version 2.0 benchmarks include a TCP benchmark that accounts for the different behavior of TCP-based protocols by measuring the performance of a processor that handles a workload derived from several application models. The TCP benchmark has three components to reflect performance in three different network scenarios. The first is a "Gigabit Ethernet" kernel involving large packet transfers to represent the likely workload of Internet backbone equipment. The second kernel assumes a standard Ethernet network for packet delivery and concentrates on large transfers using protocols such as FTP. The last kernel uses a standard Ethernet network model for the relay of mixed traffic types, including Telnet, FTP, and HTTP.

The main part of the benchmark involves processing all of the packet queues through a server task, network channel, and client task. These simulate the data transfers through the connection to provide a realistic view of how the processor will cope with various forms of TCP-based traffic. A large portion of the instructions in the TCP benchmarks consists of data loads and stores.

Different upper-level protocols stress TCP-handling hardware in different ways. For example, Telnet consists of short, small bursts of data in small packets that result from a user typing commands and receiving results. On the other hand, FTP consists of large amounts of data in large packets moving in one direction. HTTP is somewhere in the middle with bursts of files in one direction intermixed with control and handshaking traffic in both directions. This makes the consideration of traffic type essential when analyzing the performance of a processor that will process TCP-layer traffic.

Figure 3: Scores for the Freescale MPC7447A and IBM 750GX tested with EEMBC's TCP benchmark, which uses jumbo, bulk, and mixed data sets representing Gigabit Ethernet, standard Ethernet, and a mixture of traffic types (Telnet, FTP, and HTTP), respectively

Figure 3 shows the three sets of "out-of-the-box" scores for EEMBC's TCP benchmark using jumbo, bulk, and mixed data sets representing Gigabit Ethernet, standard Ethernet, and a mixture of traffic types (Telnet, FTP, and HTTP), respectively. The MPC7447A performs well on this benchmark due to its integrated AltiVec unit. (The Y-axis is iterations/sec.)

The TCP scores on the MPC7447A are 50% to 65% better than those of the 750GX because the former's AltiVec unit was applied to the key time-consuming functions in the TCP protocol. Specifically, the memory-copy function (memcpy) was accelerated in these benchmarks solely by linking in the libmotovec AltiVec libraries available at www.freescale.com/AltiVec. The checksum and memcpy_and_checksum functions could also be accelerated with AltiVec, but not without changing the benchmark source code or the function calls in the libmotovec library, and neither optimization is allowed under EEMBC rules.

More than clocks
The first scores based on Networking Version 2.0 benchmarks help point out that clock frequency accounts for only part of the performance of a processor. In fact, system-level effects such as cache size and memory bus speed have become much more important in benchmarks, as well as in real networking applications. The EEMBC Networking Version 2.0 benchmarks provide a standard for the measurement of processors by providing a realistic representation of the client-server framework, multi-user data generation, and real-world application kernels as defined by the industry.

Markus Levy is founder and president of EEMBC. He's worked for EDN Magazine and Instat/MDR in the past and is coauthor of Designing with Flash Memory. He also worked for Intel as a senior applications engineer and customer training specialist for Intel's microprocessor and flash memory products. You can reach him at markus@eembc.org.