Managing Memory Usage in VPLS-Enabled NPU Designs
Patrick Bisson, EZchip Technologies
Jul 22, 2004 (9:00 PM)
URL: http://www.commsdesign.com/showArticle.jhtml?articleID=23905161
Layer-2 virtual private networks (VPNs) based on virtual private LAN services (VPLS) are gaining traction because of their end-user simplicity. They provide a means to extend the point-to-point Draft Martini solution to support multi-site interconnectivity across a multi-protocol label switching (MPLS) network. However, implementing VPLS presents numerous challenges that must be overcome in an NPU-based design, particularly on the memory front. Specifically, implementing VPLS on a network processor opens up issues with table lookup memory, packet frame memory, and statistics memory. Let's look at these issues and provide some solutions.
VPLS Explained
In phase one of a VPLS service, the edge networking system receives packets from an Ethernet port and performs a look-up based on the media access control (MAC) destination address (DA) in an attempt to find out how to forward the frame (Figure 1). Should the look-up miss (i.e. not return a match), the frame must be forwarded to all the ports connected to the switch (flooding). The frame must be replicated for each remote port inside the provider edge system and, for each copy, the correct Martini VC/MPLS tunnel combination header must be placed ahead of the encapsulated packet before transmission. The number of times this replication is performed equals the number of remote ports connected to the bridge.
VPLS functions as a bridge between ports that are not directly connected to one another. Some ports are local and others are connected to the bridge remotely through a virtual circuit (VC) Martini MPLS path. Therefore, packet processing follows the logic of a switch or bridge.
In the second phase of a VPLS service, the edge system learns the association between input port and MAC source address (SA) for the Ethernet frames it receives, in order to forward frames in the other direction. This is achieved by performing a look-up based on the SA MAC. If the lookup is successful, the association has already been learned. If the look-up misses, this is a new association that the system must learn.
Once the learning phase is completed, packets received from a customer Ethernet port are forwarded based on the DA MAC address look-up, which provides the Martini VC/MPLS tunnel over which the packet is forwarded (Figure 2). Note: Entries that are learned but not used for a while should be aged out, whereas other entries may be made permanent.
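As a concrete illustration, here is a minimal sketch of the forward/flood/learn decision in C. The table layout, hash, and the send_to_port helper are hypothetical simplifications for illustration, not the actual NPU microcode:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE       4096   /* illustrative table size */
#define NUM_REMOTE_PORTS 8      /* remote ports in this sub-network */

typedef struct {
    uint8_t  mac[6];   /* learned MAC address (key) */
    uint16_t port;     /* local port or Martini VC/tunnel index */
    bool     valid;
} fdb_entry_t;

static fdb_entry_t fdb[TABLE_SIZE];

extern void send_to_port(uint16_t port);   /* hypothetical transmit hook */

/* Toy hash: fold the MAC into the table index space. */
static unsigned fdb_hash(const uint8_t mac[6]) {
    unsigned h = 0;
    for (int i = 0; i < 6; i++)
        h = h * 31 + mac[i];
    return h % TABLE_SIZE;
}

/* Phase two: learn the SA-to-port association if it is new. */
void vpls_learn(const uint8_t sa[6], uint16_t in_port) {
    fdb_entry_t *e = &fdb[fdb_hash(sa)];
    if (!e->valid || memcmp(e->mac, sa, 6) != 0) {
        memcpy(e->mac, sa, 6);
        e->port  = in_port;
        e->valid = true;
    }
}

/* Phase one: forward on a DA hit, flood to every remote port on a miss. */
void vpls_forward(const uint8_t da[6]) {
    fdb_entry_t *e = &fdb[fdb_hash(da)];
    if (e->valid && memcmp(e->mac, da, 6) == 0) {
        send_to_port(e->port);                 /* known DA: single copy */
    } else {
        for (uint16_t p = 0; p < NUM_REMOTE_PORTS; p++)
            send_to_port(p);                   /* unknown DA: replicate */
    }
}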
It should also be noted that Martini is a degenerate case of VPLS: the bridge sub-network has only a single Ethernet port and a single Martini VC, so the learning of MAC addresses in ingress LSR processing is disabled and no flooding is necessary.
VPLS Lookups
The VPLS implementation shown in Figure 3 supports a load-balancing scheme for traffic across a number of MPLS tunnels. Rather than having a Martini VC associated with a single MPLS tunnel/port combination, up to 16 possible associations are supported. The original Ethernet packets are load-balanced across the set of MPLS tunnel/port combinations by performing a hash on the MAC SA and DA, ensuring that packets of the same SA-DA pair flow through the same MPLS tunnel/port combination.
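A minimal sketch of such a hash-based selection follows, assuming a hypothetical table of up to 16 tunnel/port associations per VC (the names and layout are illustrative, not the actual implementation):

#include <stdint.h>

#define MAX_ASSOC 16   /* up to 16 tunnel/port combinations per Martini VC */

typedef struct {
    uint32_t tunnel_label;   /* outer MPLS tunnel label */
    uint16_t out_port;       /* physical egress port */
} tunnel_assoc_t;

/* Pick one of the VC's tunnel/port combinations from the MAC pair.
 * Hashing SA and DA together guarantees that all packets of the same
 * SA-DA pair take the same path, preserving per-flow ordering. */
const tunnel_assoc_t *select_tunnel(const tunnel_assoc_t assoc[MAX_ASSOC],
                                    unsigned num_assoc,
                                    const uint8_t sa[6], const uint8_t da[6]) {
    uint32_t h = 0;
    for (int i = 0; i < 6; i++)
        h = h * 31 + sa[i];
    for (int i = 0; i < 6; i++)
        h = h * 31 + da[i];
    return &assoc[h % num_assoc];
}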
In Figure 3, lookups are composed of three main paths: the forwarding lookup path, the flooding lookup path, and the learning lookup path. In the ingress path, the goal of the forwarding lookup path is to determine which MPLS labels to append to the packet, based on the destination MAC address. In this implementation, it is based on three look-up chains to accommodate the load-balancing scheme. In the egress path, the goal is to use the DA MAC address to find which Ethernet port to send the packet to.
The flooding lookup path handles the case where the MAC DA is unknown and the packet needs to be broadcast to all the MPLS paths that are part of the sub-network. The learning lookup path, on the other hand, handles the learning of the MAC addresses during the egress packet processing to enable the ingress MAC DA look-up.
Lookup Memory Challenges
In order to reduce the cost of the total system for this highly cost-sensitive application, the network processor should integrate a look-up engine to eliminate the need for external content addressable memories (CAMs) and SRAMs. In order to provide adequate performance, two additional elements are necessary to complement the lookup engines:
- Embedded memory to store some of the tables and minimize access to the external memory, where the large tables are stored.
- Lookup hardware algorithms for tree look-ups to minimize the number of external memory accesses and provide deterministic performance for hash table lookups. Using such an algorithm, hash lookups can be reduced to only two memory accesses, as the sketch after this list illustrates.
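To make the "two memory accesses" claim concrete, here is a sketch of one common way to achieve it: a bucketized hash table in which the first access fetches a small bucket of keys in one burst and the second fetches the matching result. The layout is an assumption for illustration; the actual on-chip algorithm is not described here:

#include <stdint.h>
#include <string.h>

#define KEYS_PER_BUCKET 4          /* bucket sized to fit one memory burst */
#define NUM_BUCKETS     (1 << 16)

typedef struct {
    uint8_t  keys[KEYS_PER_BUCKET][6];     /* MAC keys in this bucket */
    uint32_t result_idx[KEYS_PER_BUCKET];  /* index into the result table */
    uint8_t  used;                         /* number of occupied slots */
} bucket_t;

typedef struct {
    uint32_t mpls_label;
    uint16_t out_port;
} result_t;

/* Access 1: read the whole bucket in one burst.
 * Access 2: read the result entry for the matching key.
 * Worst case is therefore two external-memory accesses per lookup. */
const result_t *hash_lookup(const bucket_t *buckets, const result_t *results,
                            uint32_t hash, const uint8_t key[6]) {
    const bucket_t *b = &buckets[hash % NUM_BUCKETS];   /* access 1 */
    for (unsigned i = 0; i < b->used; i++)
        if (memcmp(b->keys[i], key, 6) == 0)
            return &results[b->result_idx[i]];          /* access 2 */
    return NULL;   /* miss */
}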
To reduce total system cost, an NPU packet-processing pipeline can integrate a lookup engine capable of performing look-ups in direct tables, hash tables, and trees held in embedded or external memory, giving the developer the flexibility to decide where each table is placed. The lookup engine must be programmable to support flexible chaining of lookups (Figure 4).
Given the architecture shown in Figure 4, the total number of memory accesses for the ingress and egress VPLS lookups are summarized in Table 1. As this table shows, wire speed with minimum size Ethernet packets can be achieved for 10-Gbit/s applications.
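A rough sketch of how such lookup chaining might be expressed, with each step's result feeding the next lookup in the chain. The descriptor format and search primitives are purely illustrative assumptions; the real engine's programming model is not described in this article:

#include <stdint.h>

typedef enum { LK_DIRECT, LK_HASH, LK_TREE, LK_DONE } lookup_kind_t;

typedef struct {
    lookup_kind_t kind;       /* which lookup structure to consult */
    const void   *table;      /* table base, in embedded or external memory */
    int           next_step;  /* index of the next step, or -1 to stop */
} lookup_step_t;

/* Hypothetical per-structure search primitives. */
extern uint32_t direct_lookup(const void *tbl, uint32_t key);
extern uint32_t hash_lookup32(const void *tbl, uint32_t key);
extern uint32_t tree_lookup32(const void *tbl, uint32_t key);

/* Walk a chain of lookups; each step's result becomes the next key. */
uint32_t run_chain(const lookup_step_t *chain, int step, uint32_t key) {
    while (step >= 0) {
        const lookup_step_t *s = &chain[step];
        switch (s->kind) {
        case LK_DIRECT: key = direct_lookup(s->table, key); break;
        case LK_HASH:   key = hash_lookup32(s->table, key); break;
        case LK_TREE:   key = tree_lookup32(s->table, key); break;
        case LK_DONE:   return key;
        }
        step = s->next_step;
    }
    return key;
}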
MAC Address Learning
VPLS relies on "learning" the remote MAC addresses on the far side of the Martini VCs. A uni-directional network processor "sees" only one direction of the traffic, so the host is responsible for adding and aging the entries.
The learning process requires monitoring the MAC addresses on the egress path to see if they are known and potentially adding them, while the MAC address is simultaneously being used for a look-up in the other direction. With most network processors, the learning process must be handled by the host and, therefore, its speed is limited by the host interface, host CPU performance, and other factors, and is typically measured in thousands of entries per second.
To solve this problem, a network processor can be designed as a simplex or a full-duplex device. The advantage of a bi-directional network processor is that it sees both traffic directions, allowing it to observe both packet requests and replies. In addition, it can have an on-chip learning mechanism that creates and deletes look-up table entries at a very high rate (Figure 5).
A hardware mechanism that ages the entries out automatically can also be added. This hardware mechanism allows MAC address learning to be implemented at a rate of millions of entries per second, allowing the switch to converge very quickly and off-loading this critical processing from the host processor.
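One plausible way such hardware aging could work is a per-entry timestamp that a background sweeper compares against an age limit. The sketch below is an assumption about the mechanism, expressed in C for clarity; in the actual device this would be a hardware block, not software:

#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 4096
#define AGE_LIMIT  300   /* seconds of inactivity before an entry expires */

typedef struct {
    uint8_t  mac[6];
    uint16_t port;
    uint32_t last_seen;   /* refreshed by hardware on every SA hit */
    bool     valid;
    bool     permanent;   /* static entries are never aged out */
} fdb_entry_t;

/* Background ager: invalidate dynamic entries idle longer than AGE_LIMIT.
 * Running this sweep in hardware frees the host CPU from table maintenance. */
void age_sweep(fdb_entry_t table[TABLE_SIZE], uint32_t now) {
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        fdb_entry_t *e = &table[i];
        if (e->valid && !e->permanent && now - e->last_seen > AGE_LIMIT)
            e->valid = false;
    }
}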
Frame Memory Challenges
VPLS flooding requires the packet to be replicated as many times as the number of virtual leased lines (VLLs) belonging to the same sub-network. Since the header containing the Martini VC and tunnel labels is different for each copy, the original packet cannot simply be replicated verbatim.
Most NPUs cannot handle the replication process themselves and rely on an external traffic manager to replicate the packet (Figure 6). This solution requires developing external logic to handle the last look-up for each replicated packet, or adding another NPU to handle this processing.
A better solution would be achieved by handling the replication within the NPU. An NPU can provide a highly efficient on-chip packet replication mechanism.
To understand the importance of the on-chip packet replication mechanism, let's examine how packets are buffered inside a network processor. In the proposed network processor architecture, the packet memory is on-chip and divided into buffers. An on-chip data structure is used to chain the buffers together when large packets are received. A DMA hardware engine is responsible for receiving the packet from the physical interface and assembling it into one or multiple buffers in the frame memory.
When a packet is queued for transmission, the hardware DMA engine responsible for transmitting the packet walks the buffer chain structure and decrements a multicast counter, which is associated with each buffer. When the multicast counter reaches zero, the hardware DMA recycles the buffer automatically without microcode intervention and returns it to the buffer free list.
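The following sketch illustrates the buffer-chaining and multicast-counter idea in C. The structures and the transmit routine are illustrative stand-ins for what the hardware DMA engine does; the helper functions are hypothetical:

#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE 256   /* illustrative buffer size */

typedef struct buffer {
    struct buffer *next;        /* chains buffers of one packet together */
    uint16_t       len;         /* valid bytes in this buffer */
    uint16_t       mcast_count; /* copies still to be transmitted */
    uint8_t        data[BUF_SIZE];
} buffer_t;

extern void emit_bytes(const uint8_t *p, uint16_t n);  /* push to the wire */
extern void free_list_push(buffer_t *b);               /* recycle a buffer */

/* Transmit-side walk, mimicking the DMA engine: after emitting each
 * buffer, decrement its multicast counter; a buffer is recycled only
 * once every pending copy of it has been sent. No microcode runs here. */
void dma_transmit(buffer_t *first) {
    for (buffer_t *b = first; b != NULL; ) {
        buffer_t *next = b->next;           /* capture before recycling */
        emit_bytes(b->data, b->len);
        if (--b->mcast_count == 0)
            free_list_push(b);              /* no copies left: recycle */
        b = next;
    }
}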
The NPU processing pipeline can be composed of multiple stages, each performing a different packet processing function. During the packet processing stages, the actual packet is stored in the frame memory and only a pointer to the packet is passed between the pipeline stages. Packet replication can be performed at any stage of the pipeline.
The replication process begins with a logical replication, in which multiple instances of the packet are created inside the pipeline from one stage to the next. During this step, the packet is stored only once in the frame memory. If identical instances of the packet are to be sent multiple times, they are queued multiple times in different queues for transmission. However, only a single instance of the frame is stored in memory.
In the more interesting case, where each replicated packet must be modified slightly prior to transmission (e.g. VPLS), a new buffer is allocated to hold the additional part of the header, and this newly allocated buffer is chained to the packet's original first buffer. If the packet is replicated N times, then N buffers are allocated, a different header is written to each, and each is chained to the packet's original first buffer. Figure 7 illustrates the memory structure during the packet replication process.
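Continuing the previous sketch (reusing its buffer_t type), VPLS-style replication can be modeled as allocating one fresh header buffer per copy and chaining it in front of the shared payload chain, whose buffers' multicast counters are raised to N. The allocation and queueing helpers are again hypothetical:

#include <string.h>

extern buffer_t *free_list_pop(void);                    /* allocate a buffer */
extern void      enqueue_for_tx(buffer_t *b, int queue); /* queue one packet  */

/* Replicate a packet N times with a distinct MPLS header per copy.
 * The payload chain is stored once; only N small header buffers are added. */
void replicate_vpls(buffer_t *payload_first,
                    const uint8_t headers[][BUF_SIZE],
                    const uint16_t header_len[],
                    int n_copies) {
    /* Every buffer of the shared payload chain is sent n_copies times. */
    for (buffer_t *b = payload_first; b != NULL; b = b->next)
        b->mcast_count = (uint16_t)n_copies;

    for (int i = 0; i < n_copies; i++) {
        buffer_t *hdr = free_list_pop();
        memcpy(hdr->data, headers[i], header_len[i]);  /* per-copy labels */
        hdr->len         = header_len[i];
        hdr->mcast_count = 1;              /* header is sent exactly once */
        hdr->next        = payload_first;  /* chain onto shared payload */
        enqueue_for_tx(hdr, i);            /* hypothetical per-copy queue */
    }
}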
The duration of the replication process is also critical for the overall performance. The network processor must be capable of handling the replication process in only a few clock cycles, allowing multiple packet replications at 10 Gbit/s.
Statistics Memory Challenges
In addition to general accounting of the number of packets and number of bytes, it is essential to enforce customer-negotiated service level agreements (SLAs). It is also important to rate limit the amount of packet flooding going through the switch.
Since statistics updates increase memory bandwidth demand, designers must separate the statistics memory from the lookup memory while also providing a dedicated interface for the statistics memory. Furthermore, rate computation for single-rate or two-rate three-color markers can require numerous instructions, taking a bite out of the instruction budget available for packet processing.
The network processor can provide a hardware-assist mechanism whereby the programmer gets the color of the packet (green, yellow, red) and simply has to decide what action to take with the packet. All the calculation of the rate limiting token buckets can be performed via a hardware block without adversely affecting the instruction budget per packet.
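For illustration, here is a minimal single-rate, two-bucket (committed/excess) color computation of the kind such a hardware block might perform, in the spirit of RFC 2697's srTCM. The structure and routine are a software sketch under those assumptions, not the device's actual logic:

#include <stdint.h>

typedef enum { COLOR_GREEN, COLOR_YELLOW, COLOR_RED } color_t;

typedef struct {
    uint32_t cir_Bps;    /* committed information rate, bytes/second */
    uint32_t cbs, ebs;   /* committed and excess burst sizes, bytes  */
    uint32_t tc, te;     /* current committed/excess token counts    */
    uint64_t last_ns;    /* timestamp of the last bucket update      */
} srtcm_t;

/* Refill the buckets for the elapsed time (overflow from the committed
 * bucket spills into the excess bucket), then color the packet: green
 * if it fits the committed bucket, yellow if only the excess bucket,
 * red otherwise. Microcode would merely act on the returned color. */
color_t srtcm_color(srtcm_t *m, uint32_t pkt_len, uint64_t now_ns) {
    uint64_t tokens = (now_ns - m->last_ns) * m->cir_Bps / 1000000000ull;
    m->last_ns = now_ns;

    uint64_t tc = m->tc + tokens;
    if (tc > m->cbs) {
        uint64_t te = m->te + (tc - m->cbs);    /* spill into excess */
        m->te = (uint32_t)(te > m->ebs ? m->ebs : te);
        tc = m->cbs;
    }
    m->tc = (uint32_t)tc;

    if (pkt_len <= m->tc) { m->tc -= pkt_len; return COLOR_GREEN; }
    if (pkt_len <= m->te) { m->te -= pkt_len; return COLOR_YELLOW; }
    return COLOR_RED;
}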
Wrap Up
VPLS is an interesting application that is highly demanding in terms of lookup, frame, and statistics memory performance and cost. In the end, to implement VPLS effectively, designers should consider embedding lookup functions, packet replication, and policing capabilities inside the network processor in order to reduce the strain on off-chip memory.
About the Author
Patrick Bisson is senior director of technology for EZchip Technologies. Patrick holds an M.S. in Computer Science from Institut National des Sciences Appliquées and can be reached at patrick@ezchip.com.