Managing Memory Usage in VPLS-Enabled NPU Designs
Layer-2 virtual private networks (VPNs) based on virtual private LAN services (VPLS) are gaining traction because of their end-user simplicity. They provide a means to extend the point-to-point Draft Martini solution to support multi-site interconnectivity across a multi-protocol label switching (MPLS) network. However, implementing VPLS presents numerous challenges that must be overcome in an NPU-based design, particularly on the memory front. Specifically, implementing VPLS on a network processor opens up issues with table lookup memory, packet frame memory, and statistics memory. Let's look at these issues and some solutions to them.

VPLS Explained

In phase one of a VPLS service, the edge networking system receives packets from an Ethernet port and performs a look-up based on the media access control (MAC) destination address (DA) in an attempt to determine how to forward the frame (Figure 1). Should the look-up miss (i.e. not return a match), the frame must be forwarded to all the ports connected to the switch (flooding). The frame must be replicated for each remote port inside the provider edge system and, for each one, the correct Martini VC/MPLS tunnel combination header must be placed ahead of the encapsulated packet before its transmission. The number of times this replication is performed equals the number of remote ports connected to the bridge.

In the second phase of a VPLS service, the edge system learns the input port and MAC source address (SA) association for the Ethernet frames received, in order to forward frames in the other direction. This is achieved by performing a look-up based on the SA MAC. If the look-up is successful, the association has already been learned. If the look-up misses, this is a new association and must be learned by the system. Once the learning phase is completed, packets received from a customer Ethernet port are forwarded based on the DA MAC look-up, which provides the Martini VC/MPLS tunnel where a packet gets forwarded (Figure 2). Note: Entries learned and not used for a while should be aged out, whereas other entries may be made permanent.
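The phase-one decision can be sketched in a few lines of C. This is a minimal software model, not any particular NPU's microcode: the table layout and the names (fdb_lookup via fdb_hash, send_on_pw, remote_pw) are illustrative assumptions, and collision handling is reduced to "treat a mismatch as a miss," which safely degrades to flooding.

/* Phase-one ingress: look up the DA MAC; forward on a hit,
 * flood one replica per remote port on a miss. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define FDB_SIZE    4096  /* hypothetical bridge forwarding table size */
#define MAX_REMOTES 64    /* remote ports in this VPLS instance */

struct fdb_entry {
    uint8_t  mac[6];
    int      valid;
    uint16_t pw;          /* Martini VC / MPLS tunnel handle */
};

static struct fdb_entry fdb[FDB_SIZE];
static uint16_t remote_pw[MAX_REMOTES];
static size_t   n_remotes;

static unsigned fdb_hash(const uint8_t mac[6])
{
    unsigned h = 5381;
    for (int i = 0; i < 6; i++)
        h = h * 33 + mac[i];
    return h % FDB_SIZE;
}

/* Stand-in for the real transmit path, which would prepend the
 * VC/tunnel labels for 'pw' and queue the frame. */
static void send_on_pw(uint16_t pw, const uint8_t *frame, size_t len)
{
    (void)frame;
    printf("tx %zu bytes on pseudowire %u\n", len, (unsigned)pw);
}

void vpls_ingress(const uint8_t *frame, size_t len)
{
    const uint8_t *da = frame;               /* DA is the first 6 bytes */
    struct fdb_entry *e = &fdb[fdb_hash(da)];

    if (e->valid && memcmp(e->mac, da, 6) == 0) {
        send_on_pw(e->pw, frame, len);       /* known DA: one copy */
    } else {
        /* Miss: flood a replica to every remote port of the bridge. */
        for (size_t i = 0; i < n_remotes; i++)
            send_on_pw(remote_pw[i], frame, len);
    }
}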
It should also be noted that Martini is a degenerate case of VPLS: the bridge sub-network has only a single Ethernet port and a single Martini VC, so the learning of MAC addresses during ingress LSR processing is disabled and no flooding is necessary.

VPLS Lookups
As shown in Figure 3, the lookups are composed of three main paths: the forwarding lookup path, the flooding lookup path, and the learning lookup path. In the ingress path, the goal of the forwarding lookup path is to determine which MPLS labels to append to the packet, based on the destination MAC address. In this implementation, it is based on three look-up chains to accommodate the load-balancing implementation. In the egress path, the goal is to use the DA MAC address to find which Ethernet port the packet should be sent to. The flooding lookup path handles the case where the MAC DA is unknown and the packet needs to be broadcast to all the MPLS paths that are part of the sub-network. The learning lookup path, on the other hand, handles the learning of MAC addresses during egress packet processing to enable the ingress MAC DA look-up.

Lookup Memory Challenges
To reduce total system cost, an NPU packet-processing pipeline can integrate a lookup engine capable of performing look-ups in direct tables, hash tables, and trees held in embedded or external memories. This gives the developer the flexibility to place tables where they fit best. The lookup engine must also be programmable to support flexible chaining of lookups (Figure 4).
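One way to picture such chaining is as a small descriptor table that the engine walks, with each step's result feeding the next step's key. The descriptor format and the search stubs below are assumptions made for illustration; a real lookup engine would encode this in microcode or dedicated structure tables.

/* A sketch of programmable lookup chaining across table types. */
#include <stdint.h>

enum lut_kind { LUT_DIRECT, LUT_HASH, LUT_TREE };

/* One step in a chain. 'table' would point into embedded or external
 * memory; 'next' selects the following step (-1 ends the chain). */
struct lookup_step {
    enum lut_kind kind;
    const void   *table;
    int           next;
};

/* Stand-ins for the three primitive search types. */
static uint64_t direct_search(const void *t, uint64_t key) { (void)t; return key; }
static uint64_t hash_search(const void *t, uint64_t key)   { (void)t; return (key * 2654435761u) & 0xffff; }
static uint64_t tree_search(const void *t, uint64_t key)   { (void)t; return key >> 4; }

/* Walk a chain of lookups, e.g. a DA MAC hash search whose result
 * indexes a per-VPLS-instance direct table of MPLS labels. */
uint64_t run_chain(const struct lookup_step *steps, int first, uint64_t key)
{
    for (int i = first; i >= 0; i = steps[i].next) {
        switch (steps[i].kind) {
        case LUT_DIRECT: key = direct_search(steps[i].table, key); break;
        case LUT_HASH:   key = hash_search(steps[i].table, key);   break;
        case LUT_TREE:   key = tree_search(steps[i].table, key);   break;
        }
    }
    return key;
}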
Given the architecture shown in Figure 4, the total number of memory accesses for the ingress and egress VPLS lookups is summarized in Table 1. As the table shows, wire speed with minimum-size Ethernet packets can be achieved for 10-Gbit/s applications.
MAC Address Learning

The learning process requires monitoring the MAC addresses on the egress path to see if they are known, and potentially adding them to the table, while the MAC address is simultaneously being used for look-ups in the other direction. With most network processors, the learning process must be handled by the host; its speed is therefore limited by the host interface, host CPU performance, and other factors, and is typically measured in thousands of entries per second. To solve this problem, a network processor can be designed as a full-duplex rather than a simplex device. The advantage of a bi-directional network processor is that it sees both traffic directions, and thus both the packet requests and the replies. In addition, it can have an on-chip learning mechanism that creates and deletes look-up table entries at a very high rate (Figure 5).
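The learning decision itself is simple, which is what makes it amenable to hardware. The sketch below models it in C under the same hypothetical fdb[] hash table as the earlier ingress sketch; the last_seen field anticipates the aging mechanism discussed next, and, as before, the names and layout are illustrative only.

/* Egress-side learning: refresh a known SA or install a new one. */
#include <stdint.h>
#include <string.h>

#define FDB_SIZE 4096

struct fdb_entry {
    uint8_t  mac[6];
    int      valid;
    uint16_t port;       /* ingress port / pseudowire the SA arrived on */
    uint32_t last_seen;  /* timestamp consumed by the hardware ager */
};

static struct fdb_entry fdb[FDB_SIZE];

static unsigned fdb_hash(const uint8_t mac[6])
{
    unsigned h = 5381;
    for (int i = 0; i < 6; i++)
        h = h * 33 + mac[i];
    return h % FDB_SIZE;
}

/* Learn (or refresh) the SA/port association while the frame is in
 * the egress pipeline; the forwarding path reads the same table in
 * the other direction. */
void vpls_learn(const uint8_t *frame, uint16_t in_port, uint32_t now)
{
    const uint8_t *sa = frame + 6;           /* SA follows the 6-byte DA */
    struct fdb_entry *e = &fdb[fdb_hash(sa)];

    if (e->valid && memcmp(e->mac, sa, 6) == 0) {
        e->last_seen = now;                  /* already learned: refresh age */
    } else {
        memcpy(e->mac, sa, 6);               /* new association: install it */
        e->port = in_port;
        e->valid = 1;
        e->last_seen = now;
    }
}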
A hardware mechanism that ages the entries out automatically can also be added. Together, these mechanisms allow MAC address learning to be implemented at a rate of millions of entries per second, allowing the switch to converge very quickly while off-loading this critical processing from the host processor.

Frame Memory Challenges

Most NPUs cannot handle the replication process themselves and rely on an external traffic manager to replicate the packet (Figure 6). This solution requires developing external logic to handle the last look-up for each replicated packet, or adding another NPU to handle this processing.
A better solution is to handle the replication within the NPU itself, via a highly efficient on-chip packet replication mechanism. To understand the importance of this mechanism, let's examine how packets are buffered inside a network processor.

In the proposed network processor architecture, the packet memory is on-chip and divided into buffers. An on-chip data structure chains the buffers together when large packets are received. A DMA hardware engine is responsible for receiving the packet from the physical interface and assembling it into one or more buffers in the frame memory. When a packet is queued for transmission, the hardware DMA engine responsible for transmitting it walks the buffer chain structure and decrements a multicast counter associated with each buffer. When the multicast counter reaches zero, the hardware DMA recycles the buffer automatically, without microcode intervention, and returns it to the buffer free list.

The NPU processing pipeline can be composed of multiple stages, each performing a different packet processing function. During the packet processing stages, the actual packet is stored in the frame memory and only a pointer to the packet is passed between the pipeline stages. Packet replication can be performed at any stage of the pipeline. The replication process begins with a logical replication, where multiple instances of the packet are created inside the pipeline from one stage to the next. During this step, the packet is stored only once in the frame memory. If identical instances of the packet are to be sent multiple times, they are queued multiple times in different queues for transmission, but only a single instance of the frame is stored in memory.

In the more interesting case, where each replicated packet must be modified slightly prior to transmission (e.g. VPLS), a new buffer is allocated to hold the additional part of the header, and this newly allocated buffer is chained to the packet's original first buffer. If the packet is replicated N times, then N buffers are allocated, a different header is written to each, and each is chained to the packet's original first buffer. Figure 7 illustrates the memory structure during the packet replication process, and the sketch below models it in software.
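This is a software model of the buffer-chain replication just described, under assumed buffer and free-list structures (the hardware would draw from an on-chip free list rather than calloc/free). The point it demonstrates is that payload buffers are shared across replicas, each replica gets only a private header buffer chained in front, and the per-buffer multicast counter drives recycling at transmit time.

/* Buffer-chain replication with per-buffer multicast counters. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct buf {
    struct buf *next;       /* chain to the rest of the packet */
    int         mc_count;   /* replicas still referencing this buffer */
    uint8_t     data[256];
    int         len;
};

/* Create one replica for VPLS flooding: bump the shared chain's
 * counters, then chain a fresh header buffer (carrying this
 * replica's VC/tunnel labels) in front of the original first
 * buffer. The payload stays stored once. */
struct buf *replicate_with_header(struct buf *first,
                                  const uint8_t *hdr, int hdr_len)
{
    for (struct buf *b = first; b; b = b->next)
        b->mc_count++;

    struct buf *h = calloc(1, sizeof(*h));  /* hardware: buffer free list */
    if (!h)
        return NULL;                        /* sketch-level error handling */
    memcpy(h->data, hdr, hdr_len);
    h->len = hdr_len;
    h->mc_count = 1;                        /* private to this replica */
    h->next = first;
    return h;                               /* queue this for transmission */
}

/* Transmit-side DMA walk: emit each buffer, decrement its counter,
 * and recycle it only when the last replica has gone out. */
void dma_transmit(struct buf *pkt)
{
    for (struct buf *b = pkt; b; ) {
        struct buf *next = b->next;
        /* ... copy b->data[0..b->len) to the wire here ... */
        if (--b->mc_count == 0)
            free(b);                        /* return to the free list */
        b = next;
    }
}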
The duration of the replication process is also critical for the overall performance. The network processor must be capable of handling the replication process in only a few clock cycles, allowing multiple packet replications at 10 Gbit/s.

Stat Memory Challenges

Since statistics updates increase the memory bandwidth demand, designers must separate the statistics memory from the lookup memory while also providing a dedicated interface for the statistics memory. Furthermore, rate computation for single-rate or two-rate three-color markers can require numerous instructions, taking a bite out of the instruction budget available for packet processing. The network processor can provide a hardware-assist mechanism whereby the programmer gets the color of the packet (green, yellow, red) and simply has to decide what action to take with it. All the rate-limiting token bucket calculations can be performed by a hardware block without adversely affecting the instruction budget per packet.
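To make the hardware-assist concrete, here is a simplified software model of what such a metering block computes per packet, following the color-blind single-rate three-color marker of RFC 2697. The structure and field names are assumptions for illustration; the program's only job, as described above, is to act on the returned color.

/* Color-blind srTCM (RFC 2697), simplified software model. */
#include <stdint.h>

enum color { GREEN, YELLOW, RED };

struct meter {
    uint64_t cir;      /* committed information rate, bytes/sec */
    uint64_t cbs, ebs; /* committed and excess burst sizes, bytes */
    uint64_t tc, te;   /* current committed/excess token counts */
    uint64_t last_ns;  /* timestamp of the last update */
};

enum color meter_packet(struct meter *m, uint32_t len, uint64_t now_ns)
{
    /* Refill tokens for the elapsed time; overflow from the
     * committed bucket spills into the excess bucket. */
    uint64_t add = (now_ns - m->last_ns) * m->cir / 1000000000ull;
    m->last_ns = now_ns;
    m->tc += add;
    if (m->tc > m->cbs) {
        m->te += m->tc - m->cbs;
        m->tc = m->cbs;
        if (m->te > m->ebs)
            m->te = m->ebs;
    }

    /* Charge the packet against the buckets and report its color. */
    if (m->tc >= len) { m->tc -= len; return GREEN; }
    if (m->te >= len) { m->te -= len; return YELLOW; }
    return RED;
}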