Data-in-transit Protection for Application Processors

Chapter 1: Introduction

This whitepaper aims to help designers tasked with building an Application Processor based system that needs to incorporate support for what is typically called 'Data in Transit Protection'. For a given system this often translates into a requirement for high-speed cryptographic data processing. The emphasis is therefore on high-speed data processing, as opposed to high security, or operations requiring high computational loads but little data: we are going to talk about processing lots of data, fast. Specifically, we mean 'fast' in the context of the resources available to the system. The assumption is that we are dealing with a system that also has other things to do than cryptographic processing. In fact, in the majority of cases the system was designed and dimensioned with a different task in mind, and Data in Transit protection is only added after the fact, or in a second revision of the product. Thus, the challenge we are going to address here is not just about doing high-speed crypto. It is about doing high-speed crypto while minimizing its impact and footprint on the rest of the system.

Just as Application Processor based systems have evolved over time to become more powerful and more complex, so have the cryptographic coprocessors that accompany these processors. To sketch the broad range of solutions available today, we will use 'a brief history of cryptographic offloading' to build a 'timeline' of cryptographic offloading solutions, with every step along the way adding additional sophistication to the cryptographic offloading. Most systems today don't need the fullest, most comprehensive solutions that have been introduced recently, but your system is bound to be comparable with a location 'somewhere on this timeline'.

Figure 1: A history of cryptographic acceleration

Chapter 2: Cryptographic offloading – a brief history

2.1 Software only

The simplest form of cryptographic processing is obviously doing it all in software. This solution is the simplest to build and integrate in the system, but it also has the highest impact on the system. To appreciate why cryptographic processing of data is so hard on a processor system, consider the following:
2.2 Individual Crypto Engines with DMA support

The first step in cryptographic offloading is adding dedicated hardware to take care of the crypto algorithms. This already provides a significant performance boost, as the hardware crypto will be much more efficient in performing the cryptographic transformations than the processor itself. By adding DMA capability to the crypto cores, the processor only needs to set up the key material and DMA parameters, and off goes the accelerator; the processor can spend its cycles on other tasks. This is a relatively easy scheme to support in software, since it is a straightforward replacement of the crypto operation in software with a call to the crypto hardware. The resource utilization on the rest of the system is still high though:
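To make this interaction model concrete, the following is a minimal sketch of how a driver might drive such a DMA-capable crypto core: program the key and DMA parameters, start the core, and let the processor get on with other work until the completion interrupt. All register offsets, field layouts and names below are invented for illustration; real hardware will differ.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical register block of a DMA-capable AES core (illustrative only). */
struct crypto_core_regs {
    volatile uint32_t key[8];      /* AES-256 key material                 */
    volatile uint32_t src_addr;    /* physical address of the input buffer */
    volatile uint32_t dst_addr;    /* physical address of the output buffer*/
    volatile uint32_t length;      /* number of bytes to process           */
    volatile uint32_t ctrl;        /* bit 0: start                         */
    volatile uint32_t status;      /* bit 0: done (also raised as an IRQ)  */
};

/* Queue one encryption job: program the key and DMA parameters, then start.
 * The processor is free to do other work until the completion interrupt. */
static void crypto_submit(struct crypto_core_regs *regs,
                          const uint32_t key[8],
                          uint32_t src_phys, uint32_t dst_phys, size_t len)
{
    for (int i = 0; i < 8; i++)
        regs->key[i] = key[i];
    regs->src_addr = src_phys;
    regs->dst_addr = dst_phys;
    regs->length   = (uint32_t)len;
    regs->ctrl     = 1u;           /* start: "off goes the accelerator" */
}
```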
In practice this model typically suffers a heavy performance penalty if the crypto processing has to occur on small blocks of data. The high level of processor involvement causes a lot of idle time on the crypto hardware, so the crypto cores never reach their full potential. The popularity of this acceleration model is confirmed by the fact that a lot of software only supports crypto acceleration in this model; this is, for instance, how OpenSSL expects to interact with cryptographic hardware.

2.3 Protocol Transform engine

A more advanced form of cryptographic offloading takes the operation to be offloaded from the crypto level to the protocol level. In other words, instead of only accelerating an individual cipher or hash operation, the hardware takes care of a complete security protocol transformation in a single pass. This generation of crypto acceleration also brings another improvement: it takes control over its DMA capability, making it a bus master and allowing it to autonomously update state and data in system memory. Although this new bus mastering capability makes integration with software more complicated, it allows for a huge efficiency increase for cryptographic acceleration:
These points allow the crypto hardware to achieve almost 100% utilization, while still reducing the per-packet load on system resources. Because the maximum data throughput and the maximum number of packets per second the system can process improve significantly compared to the single-core acceleration scenario in the previous section, the overall resource load goes up; there simply is less waiting to be done, and more packet data to be processed.

2.4 Parallelized protocol transform engines

For some systems, even a protocol transform engine is not fast enough. The simple answer then seems to be to 'just throw more hardware at it' to speed things up. As every system architect knows, reality is hardly ever that simple. To explain why this is the case, we need to dive a little deeper into the world of cryptographic hardware and data in transit protection protocols.

Because of the way most encryption and message integrity modes are designed, it is not possible to assign multiple cipher and hash cores to work on the same packet. Almost all encryption and message integrity modes incorporate an internal feedback loop that requires the result of the current step to be used as input to the next step; there is no way of working on multiple 'steps' in parallel. It is the 'size' of this step that determines the maximum throughput a single encryption or message integrity mode can achieve. Thus, an individual protocol engine can only process a packet as fast as the encryption and message integrity mode allow in the technology used. Only by using multiple transform engines in parallel and processing multiple packets simultaneously is it possible to achieve throughputs beyond this limit. For modern technology and crypto algorithms, the limit is around 4 to 5 Gbps, implying that with the arrival of the next generation Ethernet speeds of 10, 40 or even 100 Gbps, multiple protocol transform engines will have to be deployed in parallel.

A notable exception to the limitations mentioned above are the algorithm modes specifically designed for speed. AES-GCM is a great example; this mode uses the AES algorithm in 'counter mode', which does not use data feedback and thus allows multiple AES cores to be deployed in parallel to work on the same packet. By also using an integrity mode that allows internal parallelization, AES-GCM can be built to provide throughputs far beyond the limits mentioned earlier. Obviously this also affects the latency that a packet incurs due to the crypto operation. This is one of the reasons why the designers of MACsec have chosen to only allow the use of AES-GCM as the data protection mechanism.

Unfortunately, for all of the 'older' data in transit protection schemes, such as IPsec and SSL, this restriction to a single mode of operation can't be afforded or enforced, simply because connections to legacy systems will have to be supported. For these older schemes, the use of multiple protocol engines in parallel is the only way to achieve the higher throughputs required for modern networks. Using multiple transform engines in parallel, however, brings a new challenge. As already indicated in the previous section, protocols like IPsec and SSL maintain 'state information' for a connection (or 'tunnel'). This information, typically referred to as a 'Security Association' or SA, is required before a transform engine can start processing a packet, and it is updated after processing is done.
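To make this more concrete, the structures below sketch what such an in-memory SA record and the accompanying packet descriptor could look like for a bus-mastering protocol engine: the engine fetches the SA before processing a packet and writes the updated state back afterwards, without processor involvement. The layout and field names are purely illustrative assumptions, not an actual device interface.

```c
#include <stdint.h>

/* Hypothetical in-memory Security Association (SA) record. The protocol
 * engine fetches it before processing a packet and writes the updated
 * sequence number / replay window back afterwards, without CPU involvement. */
struct sa_record {
    uint32_t spi;            /* IPsec Security Parameter Index            */
    uint8_t  key[32];        /* cipher key material                       */
    uint8_t  auth_key[32];   /* integrity key material                    */
    uint64_t seq_num;        /* updated by the engine after each packet   */
    uint64_t replay_window;  /* updated by the engine on inbound traffic  */
};

/* Hypothetical packet descriptor: software fills it in and hands the engine
 * a pointer; the engine performs the complete protocol transform in one pass. */
struct pkt_descriptor {
    uint64_t src_addr;       /* physical address of the input packet      */
    uint64_t dst_addr;       /* physical address of the output buffer     */
    uint32_t length;         /* packet length in bytes                    */
    uint64_t sa_addr;        /* physical address of the SA record above   */
    uint32_t status;         /* written by the engine on completion       */
};
```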
Processing a packet using old SA data may cause the processing to fail completely, as is the case with SSL or TLS. Or it may cause certain checks to become critical, such as the IPsec replay window check, where the currently allowed window of packet sequence numbers is kept as connection state. This 'challenge' causes a lot of systems to keep track of the security connection to which a packet belongs, so that packets for a 'single tunnel' can be scheduled for processing on the same transform engine every time. This way the system makes sure the connection state is carried correctly between packets. It obviously also negates the parallel processing capability for packets belonging to the same tunnel; parallel processing is only possible for packets belonging to different connections. In other words, a system specified to support 40 Gbps of IPsec traffic may only be capable of handling 5 Gbps of IPsec traffic per IPsec tunnel. Such a system will only be capable of achieving the full 40 Gbps if that traffic is distributed over multiple IPsec tunnels.

The good news is that for IPsec, where 'single tunnel operation' is common, this limitation can be addressed, provided the protocol acceleration hardware is designed for it. For SSL and TLS, this limitation can't be addressed as easily. Fortunately, for SSL/TLS the typical usage scenario results in lots of different, short-lived connections, so the limitation is not as serious. The one scenario that may result in a single SSL connection is that of SSL-based VPNs. For this reason, SSL-based VPNs tend to use a modified version of SSL/TLS called 'Datagram TLS' (DTLS), which is designed to allow operation over UDP instead of TCP. This means the DTLS protocol must be able to deal with datagrams that arrive out of order; the guaranteed packet ordering provided by TCP is not available. As a result, DTLS allows parallel processing of packets belonging to the same connection.

2.5 Moving on

Even the parallelized protocol transform engine from the previous section isn't always sufficient to achieve the data throughput and packets per second a system architect is looking for. Simply adding more crypto hardware doesn't always do the trick; for various reasons, other system bottlenecks may prevent the crypto hardware from reaching its potential. Examples of performance limiting effects are:

Data Bandwidth limitations

Adding data in transit protection to an existing data stream tends to multiply the amount of data that needs to be moved around on the internal bus system. Where originally packet data came in over an external interface (Ethernet, WiFi) and got stored in memory, for use by some application running on the host processor, now the packet needs to be read from memory, get decrypted, and be stored back in memory before it can be given to the application. The same obviously holds for outbound traffic. Thus, a gigabit interface that used to consume a single gigabit of internal bandwidth all of a sudden requires 3 gigabit of internal bandwidth. In addition, for every packet processed, the key material and tunnel state (SA) need to be read and updated by the crypto engine. Although in itself not a lot of data, it may still add up to a large data stream if a lot of small packets are processed. This may be alleviated by using an SA cache on systems requiring support for a limited number of tunnels; however, systems dealing with thousands of simultaneous tunnels typically can't afford to provide sufficient local memory to make caching effective.
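A rough calculation illustrates how quickly this per-packet SA traffic adds up. The packet size and SA size below are assumptions chosen purely for illustration:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions: 1 Gbps of 64-byte packets, and an SA record
     * of 128 bytes that is read before and written back after each packet. */
    const double line_rate_bps = 1e9;
    const double pkt_bytes     = 64.0;
    const double sa_bytes      = 128.0;

    double pps        = line_rate_bps / (pkt_bytes * 8.0);   /* ~1.95 Mpps */
    double sa_traffic = pps * sa_bytes * 2.0 * 8.0;          /* read+write */

    printf("Packets per second : %.2f M\n", pps / 1e6);
    printf("SA traffic         : %.2f Gbps\n", sa_traffic / 1e9);
    /* Roughly 4 Gbps of SA traffic on top of the 3 Gbps of packet data
     * movement already described above - for a nominal 1 Gbps interface. */
    return 0;
}
```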
Processor Bandwidth limitations

Every packet arriving in the system requires some attention from the system processor(s), even if the actual data movement and modification is handled by support hardware. This goes for connections that are not encrypted, too: the rule of thumb for terminating a TCP connection on a host processor used to be that for every bit per second of TCP traffic terminated, 1 Hz of processor bandwidth was required. It will be obvious that this does not improve if data in transit protection is added to a data stream.

The key here is that, assuming packet data movement is handled by DMA and cryptographic processing is handled by a crypto accelerator, the processor still has to perform all packet handling operations, such as those for TCP described above, for every packet, regardless of the size of the packet. This means that every system has an upper limit for the number of packets it can handle per second, especially if the amount of bandwidth the processor is allowed to spend on packet handling is limited. Most systems don't just move packets along; they actually need to act on them, so they reserve, or would like to reserve, the majority of their bandwidth for other tasks, putting a further limit on the maximum number of packets the system can handle per second. Obviously this upper limit will decrease if the processor is given more tasks per packet due to the addition of data in transit security. Some common causes for this are:
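Whatever the exact causes, the combined effect of a per-packet handling limit and a finite internal data bandwidth can be captured with the simple model below, which is exactly what the throughput graph in the next section visualizes. All constants are illustrative assumptions:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not measured figures. */
    const double max_pps = 500e3;   /* packets/s the processor can handle */
    const double max_bus = 10e9;    /* bit/s available to the accelerator */
    const int sizes[]    = { 64, 256, 512, 1024, 1500 };

    for (int i = 0; i < 5; i++) {
        double pkt_bits = sizes[i] * 8.0;
        double by_cpu   = max_pps * pkt_bits;   /* tangent A: c x packet size */
        double by_bus   = max_bus;              /* tangent B: flat ceiling    */
        double tput     = by_cpu < by_bus ? by_cpu : by_bus;
        printf("%4d-byte packets: %6.2f Gbps\n", sizes[i], tput / 1e9);
    }
    /* Small packets are limited by the packets-per-second budget (tangent A),
     * large packets by the available data bandwidth (tangent B). */
    return 0;
}
```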
It will be obvious that having to move data around in the system unnecessarily never helps throughput; this is true for any system, not just for cryptographic accelerators. Especially for DMA-capable peripherals, data flow and data buffer management should be optimized such that data alignment, data buffer location and management, as well as address translation (between the virtual addresses used by applications and the physical addresses used by a DMA engine) allow for optimal use and cooperation by the peripheral and the OS or application.

With the previous paragraphs in mind, we can construct a graph that shows the maximum throughput a system can achieve, as a function of the packet size used to transfer the data. This graph clearly shows the two areas where throughput is limited by processor bandwidth and data bandwidth, respectively. Tangent A shows the maximum throughput achievable due to the system's ability to process a maximum number of packets per second (c); thus maximum throughput is c x Packet Size. Tangent B shows the maximum data bandwidth available for the cryptographic accelerator. Any system deploying look-aside type cryptographic hardware will perform according to this graph, although obviously the exact slope of tangent A, and the location of tangent B, will differ.

Figure 2: Typical throughput graph for packet processing systems

Attempting to improve the throughput of a system by just adding additional cryptographic acceleration capability will obviously move tangent B up; however, without further improvements to the system, the slope of tangent A is not changed, limiting the effect of the additional cryptographic hardware, as illustrated in the following figure.

Figure 3: Effect on throughput when adding HW acceleration capable of 2x the original acceleration performance without improving processor packet handling efficiency

The slope of tangent A, dictated by coefficient c, can be improved by increasing the efficiency of the IP and cryptographic protocol stacks, and by making sure the interaction with the cryptographic hardware is as efficient as possible. The points mentioned in this section can help to achieve this, up to a certain point; if additional improvement is needed, it becomes necessary to move more functionality from software directly to hardware. For this reason the class of 'Inline Protocol Acceleration engines' was introduced.

2.6 Inline Protocol Acceleration engines

Originating from the Network Processor world, the concept of 'Inline Operation' has started to be used in the Application Processor world as well. The crypto acceleration architectures discussed so far operate in what is typically referred to as 'Look-Aside mode': packet handling is done completely under software control, and only when the actual cryptographic operation needs to be performed does the software 'look aside' to the cryptographic accelerator. After the crypto accelerator has completed its task, the packet is handed back to software and packet processing continues. The conceptual difference introduced by Inline Processing is that software is no longer involved both before and after crypto acceleration: all cryptographic operations are performed on the packet before the software 'sees' the packet for the first time (or vice versa). This form of Inline Operation is typically called the 'Bump in the Stack' processing model.
Some systems, especially those targeting networking gateway applications, take this concept one step further and allow a packet to travel from network interface to network interface completely through hardware, without involving software running on the Application Processor at all. This operational model, which is almost a hybrid between the typical Application Processor setup and a dedicated Network Processor setup, is often referred to as the 'Bump in the Wire' processing model. Since we are specifically addressing Application Processors in this whitepaper, we will focus primarily on the 'Bump in the Stack' model. After all, most Application Processors are used in a system that is required to actually use (consume) the packet data it receives (and vice versa); only network gateway applications are typically set up to 'forward' packet data without actually looking at the packet contents.

The following two figures illustrate the difference between the look-aside and inline processing models, from a protocol stack point of view. The first figure shows a 'typical' protocol stack for IP with IPsec. Typical packet flow is from Ethernet, at the bottom, through the IP stack in software, making a brief excursion to the cryptographic accelerator for decryption, and further up to the application. Outbound packets follow the same flow, in reverse.

Figure 4: Example of (data plane) packet handling operations; on the left a typical IP with IPsec protocol stack, on the right the operations executed in HW by a Flow Through accelerator

When an Inline cryptographic accelerator is used, the picture changes as shown on the right side. All packet operations 'in between' the Ethernet MAC and the cryptographic accelerator are performed in the hardware of the Inline protocol engine. The packet no longer makes an 'excursion' from the software stack to get processed by the cryptographic accelerator; rather, the software stack only 'sees' the packet after it has been decrypted. With a Bump-in-the-Stack flow, the packet travels from Ethernet to the application, and vice versa. In a Bump-in-the-Wire flow, the protocol accelerator also implements an IP forwarding function, so packets that arrive from Ethernet can be processed all the way up to the IP layer, get decrypted, and are then forwarded 'back down' to Ethernet again, so the packet never hits the software part of the IP stack. Both the Bump-in-the-Stack and the Bump-in-the-Wire operational models present some software integration challenges, as typical networking stacks and applications are not designed for use in this model. When properly integrated, however, major benefits can be achieved:
In other words, most or all of the issues raised in the previous section 'go away' when an inline crypto accelerator is deployed. Using an inline crypto accelerator in 'Bump in the Wire' mode can have an even more dramatic effect; because the crypto accelerator in this scenario comes with a built-in 'packet forwarding engine', the packet forwarding capability of the system through the crypto pipeline can outstrip the packet forwarding capability of the application processor, to such an extent that often the terms 'fast path', denoting the inline crypto accelerator, and 'slow path', denoting the application processor, are used; these are terms typically used in the world of network processors to indicate the optimized data path through the packet processing engines, versus packet handling by the slower general purpose processor.

2.7 Power

Cryptographic accelerators not only bring improved data throughput; they also reduce power consumption compared to a software-only solution. It will be obvious that an on-chip accelerator, using only the necessary amount of logic gates needed to perform the cipher and hash operations, consumes significantly less power than a general purpose application processor. The application processor, and the parts of the system it uses to perform the required cryptographic operations, will activate much more internal logic compared to a dedicated crypto accelerator. In addition, the application processor typically executes from off-chip SDRAM, increasing the combined power consumed even more. The most significant power savings are achieved by moving the cryptographic operations to dedicated hardware, preferably a protocol engine (to minimize the amount of data movement in the system). Beyond that, using Bump-in-the-Stack type acceleration provides a power optimization compared to a Look-Aside deployment, again because the packet data is moved in and out of SDRAM less often. Bump-in-the-Wire operation improves power consumption even more, because packet data does not necessarily have to enter SDRAM at all any more, combined with the fact that the processor is not spending any cycles on packet processing.

Chapter 3: Efficient Packet Engine design and integration

Up to this point we have been discussing the different cryptographic acceleration architectures found in application processors today. Having established the application and usefulness of cryptographic acceleration, we will now look at what features make an accelerator efficient. This section focuses on the protocol-level accelerators from the previous section; these are often referred to as 'packet engines', hence you'll see that term used in the following sections as well.

3.1 The 'simple things'

Any peripheral with (high throughput) DMA capability needs to provide certain features to allow easy integration with controlling software; packet engines are no exception. This means that the packet engine DMA subsystem should provide the following features:
- Rather than requiring the processor to manually program the DMA engine on a transfer-by-transfer basis, the packet engine should allow the processor to queue a number of packets, leaving it to the packet engine to set up the individual bus mastering transactions autonomously. This also allows the packet engine to pre-fetch data to hide memory access latencies.
- While software often enjoys the services of an MMU to make a buffer scattered in memory look like a contiguous virtual buffer, the lack of an IOMMU in a lot of systems means the packet engine DMA will have to be able to deal with the scattering and gathering of data itself (a minimal descriptor sketch follows after this list).
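As a minimal illustration of both points, queue-based submission and scatter/gather, the sketch below shows what a descriptor ring shared between driver and packet engine could look like. The layout, sizes and names are assumptions, not an actual device interface.

```c
#include <stdint.h>

#define RING_SIZE    256   /* illustrative queue depth          */
#define MAX_FRAGS      8   /* illustrative scatter/gather limit */

/* One fragment of a packet that is scattered in physical memory. */
struct sg_entry {
    uint64_t phys_addr;    /* physical address of this fragment */
    uint32_t length;       /* fragment length in bytes          */
};

/* One queued packet. Software fills descriptors in and advances a write
 * pointer; the engine walks the ring autonomously, gathering the fragments
 * and pre-fetching ahead to hide memory latency. */
struct pkt_desc {
    struct sg_entry frags[MAX_FRAGS];
    uint32_t        num_frags;
    uint64_t        sa_addr;      /* SA / tunnel context for this packet */
    uint32_t        status;       /* written by the engine on completion */
};

struct desc_ring {
    struct pkt_desc desc[RING_SIZE];
    volatile uint32_t sw_write;   /* advanced by software when queueing    */
    volatile uint32_t hw_read;    /* advanced by the engine as it consumes */
};
```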
One of the benefits of descriptor-based control for the packet engine is that it allows asynchronous interaction with the packet engine. This implies that the processor either gets interrupted by the packet engine when processing for a packet is completed, or the processor polls on a regular basis to determine if processed packets are available and whether new ones can be queued. While the overhead of dealing with a hardware interrupt may be acceptable for situations with a low 'packet arrival rate', this overhead becomes prohibitive if the number of interrupts rises due to a high packet arrival rate. One way of lightening the interrupt load is to apply 'Interrupt Coalescing', which simply means that an interrupt is fired by the packet engine for every n packets processed, rather than for every single packet. This mechanism works fine if the system is dealing with a continuously high packet arrival rate. If the packet arrival rate drops, however, interrupt coalescing may result in some packets not getting serviced for a long time because the 'coalescing limit' isn't reached, preventing the interrupt from being fired. In this case the packet engine must allow a time-out to be set; if processed packets are waiting and no new packets arrive during this time-out period, the packet engine triggers the interrupt to the processor anyway. This mechanism allows packet handling latency to remain under control. Finally, if the system finds itself under such a high packet arrival rate that even interrupt servicing becomes undesirable, the system may want to switch to a form of polling, similar to the behavior of the Linux NAPI ('New API') packet processing interface. The above implies that the packet engine should provide support for all of these mechanisms.

Another thing to look for in a DMA-capable (or bus mastering) peripheral is its ability to interact efficiently with the internal bus system; it must be capable of:
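Returning to the interrupt handling discussion above, a driver-side sketch of the coalescing-plus-timeout mechanism might look as follows. The register names, fields and units are hypothetical and chosen purely for illustration.

```c
#include <stdint.h>

/* Hypothetical interrupt control registers of a packet engine. */
struct irq_regs {
    volatile uint32_t coalesce_count;   /* raise IRQ after this many packets */
    volatile uint32_t coalesce_timeout; /* ...or after this many microseconds
                                           with completed packets pending    */
    volatile uint32_t irq_enable;       /* 0: polled (NAPI-style) operation  */
};

/* Low traffic: an interrupt per packet keeps latency minimal. */
static void irq_mode_low_rate(struct irq_regs *r)
{
    r->coalesce_count   = 1;
    r->coalesce_timeout = 0;
    r->irq_enable       = 1;
}

/* Sustained high traffic: one interrupt per 32 packets, but never let a
 * completed packet wait longer than 100 us if the burst dries up. */
static void irq_mode_coalesced(struct irq_regs *r)
{
    r->coalesce_count   = 32;
    r->coalesce_timeout = 100;
    r->irq_enable       = 1;
}

/* Overload: disable the interrupt entirely and poll the completion ring. */
static void irq_mode_polled(struct irq_regs *r)
{
    r->irq_enable = 0;
}
```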
Other system level considerations apply when using DMA-capable peripherals, such as the (data) cache coherency mentioned above. These, however, need to be dealt with at the system level as they can't be alleviated by the peripheral itself (alone). What will also help in this respect is having a software support environment for the device that is aware of these issues and can help deal with them.

3.2 Supporting modern application processor hardware

Application processors, even those in mobile systems, have evolved from single-processor, 32-bit, single OS or RTOS systems into multi-processor, 64-bit systems with support for virtualization, possibly running multiple OSes and definitely running more applications in parallel. In addition, the presence of MMU and IOMMU functions, and the use of higher-throughput, pipelined memory, ease the use of a DMA-capable peripheral in the system and provide much higher data throughputs. These features allow higher network throughput and make it easier for a crypto accelerator to be accessed from different applications in the system. On the downside, memory read access times have grown to a point where two or three new packets arrive in the system while the crypto accelerator is waiting for a single read access to be completed by the memory subsystem. This implies that to be effective in a modern system, a packet engine has to support a number of features that have nothing to do with the crypto operations themselves but rather allow the packet engine to achieve its maximum potential as part of a bigger, complex system. In a sense these requirements hold for any high-performance peripheral in the system:
The packet engine must be capable of dealing with the system level requirements mentioned above before it can operate efficiently, i.e. reach the maximum performance that its internal crypto algorithms allow. Next we will look at some of the requirements put on a packet engine to allow it to be used efficiently from a modern software perspective.

3.3 Supporting multiple applications and virtualized systems

As already indicated earlier, modern application processors support virtualization, either in the classical sense, running multiple operating systems, or from a security perspective, deploying a normal and a secure world, or even both at the same time. In addition, each virtualized environment can run multiple applications that require interaction with the packet engine. On top of that, some of these applications require a kernel component to control the crypto operations (which is typically the case for IPsec, for example), while other applications require access to the crypto accelerator from user space (such as SSL/TLS). This means that, for the packet engine to be used effectively in such an environment, the packet engine should provide the following features:
These are just a handful of the requirements posed on a packet engine as it gets integrated in a modern multiprocessor application processor. In the past, requirements like these used to be applicable to high-end server systems; however, there is clearly a shift in system complexity happening with the ever increasing power of application processors.

3.4 Upping the performance

As indicated in the first half of this whitepaper, it may be necessary to use multiple processing pipelines in order to exceed the single-packet throughput limitations imposed by certain cryptographic operations. In addition, we determined that it is beneficial for the packet engine to be able to work on multiple packets in parallel, so more parallel read transactions and data pre-fetch operations can be set up, allowing more efficient read latency hiding. For this reason, high-speed packet engines comprise multiple processing pipelines, with each pipeline consisting of multiple stages, each stage operating on a different packet. Doing this brings improved throughput, but it also brings some additional challenges:
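One challenge inherent to any multi-pipeline design is that packets dispatched to different pipelines may complete out of order, while the software stack, and protocols like IPsec, expect them back in order. The sketch below is a minimal illustration of a reorder buffer that restores dispatch order on completion; the structure names and sizes are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define REORDER_DEPTH 64  /* illustrative; must cover packets in flight */

/* Minimal reorder buffer: packets are tagged with a sequence number when
 * dispatched to one of the parallel pipelines, and are only handed back to
 * software in dispatch order, even if pipelines complete them out of order. */
struct reorder_buf {
    void    *slot[REORDER_DEPTH];  /* completed packets, indexed by tag       */
    uint32_t next_dispatch;        /* tag given to the next dispatched packet */
    uint32_t next_deliver;         /* tag of the next packet to hand upstream */
};

static uint32_t rob_dispatch(struct reorder_buf *rb)
{
    return rb->next_dispatch++;    /* tag travels with the packet descriptor */
}

/* Called when any pipeline finishes a packet; delivers in-order packets. */
static void rob_complete(struct reorder_buf *rb, uint32_t tag, void *pkt,
                         void (*deliver)(void *))
{
    rb->slot[tag % REORDER_DEPTH] = pkt;
    while (rb->slot[rb->next_deliver % REORDER_DEPTH] != NULL) {
        uint32_t i = rb->next_deliver % REORDER_DEPTH;
        deliver(rb->slot[i]);
        rb->slot[i] = NULL;
        rb->next_deliver++;
    }
}
```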
Chapter 4: The Software Angle

In the previous sections we have focused on the cryptographic hardware and what is needed to allow efficient interaction with software, at a fairly low level and from a predominantly hardware-based perspective. Now let's take a look at the requirements that are put on a protocol stack as a whole, to allow it to make efficient use of two of the more advanced crypto acceleration architectures mentioned above: the Look-Aside Protocol Acceleration model and the Inline Protocol Acceleration model. In general, the different offloading modes represent increasing integration challenges, but also yield significant performance and offloading improvements, as illustrated in the following figure for the IPsec scenario.

4.1 Look-Aside Model

The first and most basic item to look at is the capability of the protocol stack to support protocol-level crypto acceleration. If the protocol stack only allows cryptographic offloading at the algorithm level, the added value of a sophisticated protocol acceleration engine is going to be limited. To unlock the full potential of the look-aside protocol engine, the protocol stack must be able to keep the packet queue populated with packets at all times. It will be obvious that this requires the protocol stack to support asynchronous packet exchange with the protocol core, allowing the protocol stack to handle processed packets and set up new ones while the crypto accelerator is processing the queued packets. Even more basic, the protocol stack must be built to allow simultaneous processing of multiple packets, either by multiple invocations of the data processing path or by some other form of parallel processing.

It also requires that the protocol stack operates without accessing tunnel context on a per-packet basis. This means it needs to relinquish control over context updates, leaving those to the hardware. With context (or 'tunnel context') we mean any secure tunnel-related state data that needs to be carried between packets, such as cipher engine state or sequence number information. If the software stack 'insists' on updating tunnel context by itself, then it effectively needs to wait to submit a packet for a specific tunnel to the crypto hardware until any previous packet from the same tunnel is completed, so that the software stack can perform the context update and submit the next packet for the same tunnel. Similarly, if the protocol stack is designed to read all of the processing parameters from the tunnel context in order to submit them directly to the hardware (as part of the call to invoke hardware acceleration), the protocol stack needs to wait for any updates from the previous packet (for the same tunnel) to be completed before being able to submit the next packet. Unfortunately, most 'standardized' crypto APIs in existence today operate using this model, supplying the key material with the call to the crypto algorithm, since they were not designed for high-speed data throughput applications. Thus, chances are that if a protocol stack uses a standard crypto API for hardware offloading, it is not going to work very efficiently together with a protocol engine. This often implies that a protocol stack capable of efficient hardware acceleration comes with its own proprietary crypto acceleration API. This API should be designed to allow efficient interaction with crypto hardware in general. Consider for instance the following two potential bottlenecks:
- Interaction with hardware typically requires interaction with (kernel) drivers. Obviously, frequent switches between user and kernel mode require significant processor bandwidth. An efficient protocol stack minimizes these transitions.
- A protocol acceleration engine comes with its own DMA capability. This puts a requirement on the software stack to place data to be processed by the engine in a memory location that is accessible and usable by the packet engine DMA. If the protocol stack is unaware of this and puts packet data in memory buffers that are inaccessible for DMA transactions, are unaligned, or cause cache coherency issues, the packet engine driver may be forced to copy the data to a 'DMA safe' location (see the sketch after this list).
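As an illustration of the second bottleneck, the fragment below sketches how a Linux kernel driver would typically obtain DMA-safe memory so no bounce copy is needed. The dma_alloc_coherent() and dma_map_single() calls are standard Linux kernel DMA API; the surrounding helper functions are our own illustrative wrappers.

```c
#include <linux/dma-mapping.h>
#include <linux/device.h>

/* Allocate a buffer that is guaranteed to be DMA-able and cache-coherent,
 * so the packet engine driver never has to bounce (copy) packet data. */
static void *alloc_dma_safe(struct device *dev, size_t size, dma_addr_t *phys)
{
    return dma_alloc_coherent(dev, size, phys, GFP_KERNEL);
}

/* Alternatively, map an existing, properly aligned buffer for device access;
 * this performs any cache maintenance the architecture requires. */
static dma_addr_t map_for_engine(struct device *dev, void *buf, size_t size)
{
    return dma_map_single(dev, buf, size, DMA_TO_DEVICE);
}
```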
- In most systems, the performance difference between being able to execute operations from cache versus external memory is huge. The presence of hardware acceleration potentially helps, as the cipher and hash code no longer needs to be in the processor cache. Still, the protocol stack must make sure the interaction with the crypto accelerator doesn't prevent the cache system from functioning efficiently, for instance because data structures 'owned' by the hardware engine end up in cache, or because large amounts of packet data are 'pulled through' the cache.

In general, the protocol stack itself should be reasonably efficient in handling individual packets, even if it can offload the cryptographic transformation to hardware; if it takes the protocol stack longer to prepare a new packet for crypto processing than it takes the crypto accelerator to process it, then data throughput is still going to be limited by processor bandwidth. Processor bandwidth, protocol stack efficiency, and cryptographic accelerator throughput should be in balance.

4.2 Inline model

Any form of inline processing, either Bump-in-the-Wire or Bump-in-the-Stack, typically requires dedicated integration with the system. The reference to the complete system, as opposed to just the protocol stack, is deliberate: inline protocol accelerators can be connected directly to the Ethernet MAC interface. For that reason, the inline accelerator must take care of a number of non-cryptographic packet operations that are otherwise done by layers 2 and 3 of the IP protocol stack. Depending on the deployment, this may result in a system where the regular Ethernet driver is completely integrated with the driver for the inline protocol accelerator, with the combination acting as an 'advanced Ethernet driver' in the system. In this scenario, modifications to the protocol stack are also significant; rather than actually taking a packet, classifying it, and managing the crypto transform, the protocol stack now just needs to be aware that IPsec processing has already happened (on ingress), even before the protocol stack 'sees' the packet for the first time. Or, vice versa, on egress, that IPsec processing can be deferred until after the protocol stack hands off the packet for transmission. In this scenario, the data plane in the protocol stack is 'reduced' to maintaining statistics, error detection, and exception processing.

Because inline protocol acceleration hardware is capable of autonomous packet classification, the protocol stack needs to support functionality that is traditionally only found in network processors: it needs to be capable of interacting with hardware classification functions. This implies setting up and maintaining classification rules, shared with and used by the hardware classifiers, and synchronizing access to data structures used simultaneously and, more importantly, autonomously by the inline protocol accelerator. In the case of look-aside operation, every operation on the cryptographic accelerator is initiated and controlled by software, and operation on certain data structures can thus easily be stopped to manage the core or its associated data structures. With inline protocol acceleration, the accelerator hardware receives packets directly, without intervention from software, putting additional requirements on the synchronization between hardware and software in order to manage shared data structures.
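The sketch below illustrates one common way such synchronization can be handled: before software modifies or removes a rule that the inline classifier uses autonomously, it first disables the rule and waits until the hardware confirms no packet is still using it. All structure, field and function names below are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical classification rule shared with the inline classifier. */
struct class_rule {
    volatile uint32_t enabled;   /* cleared by software to retire the rule */
    volatile uint32_t hw_busy;   /* set by hardware while a packet uses it */
    uint64_t          sa_addr;   /* SA this rule points the hardware to    */
    /* match fields (addresses, ports, SPI, ...) omitted for brevity       */
};

/* Assumed helper provided elsewhere in the driver. */
extern void engine_flush_pipeline(void);

/* Retire a rule safely: disable it, flush packets already accepted by the
 * classifier, then wait for the hardware to drop its reference. Only after
 * that may the rule and its SA be freed or rewritten. */
static void retire_rule(struct class_rule *rule)
{
    rule->enabled = 0;           /* new packets no longer match this rule  */
    engine_flush_pipeline();     /* push out packets already in flight     */
    while (rule->hw_busy)        /* wait for the hardware to quiesce       */
        ;
}
```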
Another item often overlooked is the fact that a protocol stack supporting the use of an inline protocol accelerator must be capable of working with multiple 'data planes'. This means that the protocol stack must understand that packet data, plus the associated context data, may be handled by one (or more) hardware protocol accelerators. In addition, the protocol stack itself must implement a full data plane in order to deal with exception situations and packets that cannot be handled by hardware.

A final remark related to inline processing is the fact that not every protocol 'lends itself well' to this operational model. Security protocols that are designed for use at the application level, or 'higher up the IP stack', rely on services provided by the lower stack levels. To support inline acceleration for such higher-level protocols, the inline protocol accelerator would have to implement these services in hardware as well; a task that may not always be feasible. Alternatively, in case an operation is required that is not supported by the hardware, the packet can be processed using an 'exception path' in software, bypassing the hardware accelerator. This should of course only occur for a very small percentage of the packets processed.

An obvious example of a higher-level service that is not typically supported in inline hardware is the packet ordering feature of TCP. Protocols relying on this feature, such as SSL/TLS, are typically not fully supported by inline protocol accelerators, which are designed to process packets as they arrive. This implies that 'inline' acceleration of SSL/TLS is typically implemented asymmetrically. For packet data originating from the local host, packet ordering is guaranteed and inline acceleration is feasible. For ingress, where packets can arrive out of order, the protocol acceleration is typically implemented as a look-aside operation, to allow the packets to be ordered by the TCP stack in software before submitting them for decryption to the protocol accelerator hardware. For typical (HTTP) server deployments this works well, since ingress traffic is typically low, with clients requesting data, and egress traffic is high, containing the actual requested data. An example of a lower-level service that is not typically supported by inline accelerators is that of fragmentation/reassembly. Since this is 'not supposed to happen' in a well-configured setup anyway, the processing of fragmented packets is left to the software exception path or 'slow path'.

Chapter 5: And then there's this…

Up to this point we have only discussed 'data plane acceleration' for Data in Transit protection. Data plane acceleration assumes that the key material required to perform the encryption/decryption operation is already present. Before a tunnel is created, these keys must be exchanged with the communication partner. Systems dealing with a high connection setup/teardown rate may be limited in the number of tunnels they can create because of the cryptographic operations required during this key exchange. This is in fact a typical scenario for a web server protected using SSL/TLS. In this case, a different type of cryptographic accelerator is available, specifically designed to offload the very compute-intensive large-number modular exponentiation operations required by typical key exchange protocols. This type of cryptographic accelerator is referred to as a 'Public Key Accelerator' (PKA).
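To give a feel for why key exchange is so compute-intensive, the fragment below shows textbook square-and-multiply modular exponentiation. A real RSA or Diffie-Hellman operation runs this loop on 2048-bit or larger operands, where every multiplication is itself a multi-word operation: thousands of expensive multiplications per handshake, which is exactly the work a PKA takes off the processor. The fragment uses 64-bit operands (and a compiler with 128-bit integer support, e.g. GCC or Clang) purely to stay readable.

```c
#include <stdint.h>

/* Textbook square-and-multiply: computes (base ^ exp) mod m.
 * Shown with 64-bit words for readability; a real key exchange performs the
 * same loop on 2048-bit or larger operands, with every multiplication
 * itself being a multi-word (and therefore expensive) operation. */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t m)
{
    unsigned __int128 result = 1;
    unsigned __int128 b = base % m;

    while (exp > 0) {
        if (exp & 1)
            result = (result * b) % m;   /* multiply step */
        b = (b * b) % m;                 /* square step   */
        exp >>= 1;
    }
    return (uint64_t)result;
}
```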
In addition to a PKA, a system dealing with a high connection setup rate also tends to require access to a large amount of truly random data. True random data is used to make it hard for an attacker to guess the value of the key material; the security of most protocols relies directly on the quality of the random data used. Creating high-quality random data in a digital system is a challenging task in itself; generating a lot of it without compromising its quality is even harder. To help systems with this challenge, hardware true random number generators are available that use inherent quantum-level effects of semiconductor circuits to generate random data.

By now we have seen that a system providing Data in Transit protection deals with a lot of key material, as well as identity information used to establish a trust relationship with a communicating peer during tunnel/connection setup. The typical 'security model' for devices providing data in transit protection is that the device itself is located in a secure environment and that it is therefore not necessary to provide specific protection for the key material and identity information handled by the device. There can be situations where this security model is not valid, for instance if untrusted software is running on the device, or if the device is located in an unprotected environment. In this case, it may be required to provide hardware-based protection for the key material and identity information handled by the device; this also requires cryptographic hardware, such as a hardware key store or a trusted execution environment.

Chapter 6: Conclusion

In this whitepaper we have highlighted some of the challenges in achieving high-throughput data in transit protection for application processors. Different architectural models have been explained, showing the evolution of cryptographic offloading hardware and the effects the different architectures have on the hardware, software and performance of an application processor based system. We have also looked at the features that make for an efficient cryptographic accelerator. Finally, we looked at the requirements that modern and future systems will place on cryptographic accelerators, from both a hardware and a software perspective.

It will be clear that packet engine design and integration is no longer (primarily) related to the ability to provide high 'raw crypto throughput'. The requirements the system poses on the crypto hardware to allow the system to tap the acceleration potential have become much more important. Another ongoing trend is the fact that the crypto accelerator is pulling in more and more functionality from the surrounding system. Virtualization support in the packet engine hardware allows the software component in the virtualization layer to become smaller. Bump-in-the-Stack and Bump-in-the-Wire operational modes pull OSI layer 2 and 3 functionality, plus parts of the packet forwarding function, into the packet engine hardware. Lastly, packet engines are evolving to overcome limitations imposed by legacy cryptographic modes and protocols that were never designed to go up to the speeds offered by modern network technologies. Although today this development is perhaps of primary use to server deployments, the next generation of application processors may benefit from the lessons learned today.
AuthenTec's latest generation of packet engines, the SafeXcel-IP-97 and SafeXcel-IP-197 IP core series, are built to support all of the presented optimization, acceleration and offloading mechanisms. These IP cores are supported by the DDK-97 and DDK-197 driver development kits, as well as AuthenTec's QuickSec and Matrix toolkits.

About AuthenTec

AuthenTec is a leading provider of mobile and network security. The Company's diverse product and technology offering helps protect individuals and organizations through secure networking, content and data protection, access control and strong fingerprint security on PCs and mobile devices. AuthenTec encryption technology, fingerprint sensors and identity management software are deployed by the leading mobile device, networking and computing companies, content and service providers, and governments worldwide. AuthenTec's products and technologies provide security on hundreds of millions of devices, and the Company has shipped more than 100 million fingerprint sensors for integration in a wide range of portable electronics, including over 15 million mobile phones. Top tier customers include Alcatel-Lucent, Cisco, Fujitsu, HBO, HP, Lenovo, LG, Motorola, Nokia, Orange, Samsung, Sky, and Texas Instruments. Learn more at www.authentec.com.

AuthenTec offers an extensive selection of silicon IP cores that offer efficient HW acceleration of the IPsec, SSL, TLS, DTLS, sRTP, MACsec and HDCP protocols, in Look-Aside, Bump-in-the-Stack and Bump-in-the-Wire architectures, as well as 3DES, AES (ECB, CBC, CTR, CCM, GCM, XTS), RC4, KASUMI, SNOW3G and ZUC ciphers, RSA, ECC and DSA public key operations, and MD5, SHA-1, SHA-2 hash and HMAC cores, accompanied by Driver Development Kits and industry leading toolkits such as QuickSec/IPsec, QuickSec/MACsec, MatrixSSL, MatrixSSH and DRM Fusion/HDCP. Acceleration performance from a few hundred Mbps to 40 and even 100 Gbps can be achieved in today's 90, 65, 45, 40 and 28nm designs. Please visit AuthenTec's website for more details (http://www.authentec.com/Products/EmbeddedSecurity.aspx).