|
|||||
Implementing IKE Capabilities in FPGA Designs
Implementing IKE Capabilities in FPGA Designs Implementing Internet key exchange (IKE) capabilities in standalone ASICs housing IPSec and other security tasks may appear to be a benefit to designers on paper. However, in real implementations, this tightly coupled architecture limits a designer's flexibility in adjusting and adapting their IKE capabilities for proprietary or newer key exchanging schemes. Fortunately, designers now have another option. By splitting up mathematical operations into smaller subsections, designers can now implement Diffie-Hellman and other IKE capabilities in field programmable gate array (FPGA) device without gobbling up tons of logic cells. And, at the same time, they get the added flexibility to make changes to these algorithms over time. In this article, we'll show how the Diffie-Hellman algorithm can be implemented in FPGA designs. We'll also show the implementation and efficiency gains that can be achieved using this approach. Exchanging KeysBefore diving into the specific implementation issues, let's first provide a quick overview of how IKE works. IKE is a protocol defined under the Internet Engineering Task Force's (IETF's) RFC 2409 specification. The protocol is used for key management in IPSec cryptosystem and enables the automatic negotiation (and renegotiation) and creation of security associations (SAs) between IPSec peers. IPSec can be configured without IKE, but including an IKE implementation in IPSec will enhance its security, flexibility and simplify the creation and implementation of IPSec security policies. For IPSec to function between two network peers, a security association (SA) must be established between them that describe the connection's parameters, such as the secret keys for the secret-key ciphers, the cipher modes, and key lifetimes amongst other information. To establish an SA, data must be exchanged over an unsecured network, such as the Internet, in plain sight of potential attackers. This represents t he classic dilemma of establishing trust in cryptosystems: How do you secure the exchange of information that leads to the trust establishment without compromising the very data you wish to protect? IKE solves this dilemma by using a number of sub-protocols and standards. In essence IKE is split into a number of sub-protocols that are used in the various stages of the SA establishment. They include the Internet Security Association and Key Management Protocol (ISAKMP). A detailed discussion of ISAKMP is beyond the scope of this paper but suffice it to say that ISAKMP is responsible for the provision the following services to the IKE: ISAKMP defines mechanisms for key exchange in different modes over each session. Each protocol results in an authenticated key exchange, using the Diffie-Hellman algorithm. The security of the Diffie-Hellman is based on the difficulty of calculating the discrete logarithms of very large numbers. The Diffie-Hellman protocol is used to secure key exchange over insecure channels and is required to provide the keying material for the symmetrical, secret key ciphers such as DES and AES. These ciphers provide the data encryption information actually sent between the users of the IPSec sessions. Diffie-Hellman Fundamentals The two parties then exchange their public values (Y), and perform a further modular exponentiationthis time using the other party's public value as the generator. Both parties will then have calculated the same value. A third party listening in on the channel cannot calculate the same result since it does not have either of the secret values x, which can only be derived from transmitted public values Y by calculating the discrete logarithm of this very large number. Alice and Bob will show us how it is done, using some small numbers: generator g=2 and prime p=157.
From the above example it should be obvious that fundamental to the Diffie-Hellman exchange is the exponentiation operati on Y = gxmodp. This operation is the same fundamental operation of the RSA public-key cipher. A further simplification is to notice that the initial steps involve exponentiation with the generator 2. In hardware terms this simplifies to a modulus operation of a 2x, which can be several orders of magnitude faster than the full modular exponentiation, i.e. where the generator g is the same size as the modulus. We will examine the implementation aspects of this shortly. The Diffie-Hellman exchange is illustrated in Figure 1. This figure shows a typical arrangement of communicating peers with third party eavesdropping on their communications.
Mathematics for FPGA Implementations After looking at the problem in detail, the Montgomery algorithm appeared to be particularly suitable for FPGA implementation, especially for a low cost device that does not have special features such as multipliers. We will start by examining the general Montgomery algorithm, and proceed to describe some of the optimizations to the algorithm to reduce hardware cost. Montgomery math allows modular multiplications to be calculated without division, which are particularly ungainly in hardware. An exponentiation can be performed with Montgomery numbers identically to regular integers, by using the standard square and multiply algorithm. Using Montgomery numbers just adds two steps to the process; first, the integers must be converted to Montgomery space with all the intermediate multiplications being performed using Montgomery te chniques. Then the final result must be converted back to regular integer format. The Montgomery modular multiplicationAB mod M for Radix 2 calculations (i.e. calculating an n-bit number will require n iterations) is:
Where the subscript i refers to an individual bit position in a number, with i = 0 being the least significant bit (LSB). As a result of the optimizations in Montgomery multiplications, specifically the right shifts, there is an inherent factor present in the result; the actual calculation performed is:
Hence the need to transform the operands A and B into Montgomery space, commonly referred to as m-residue, by the modular operation:
Since both operands will be converted to m-residue, there will be a 2n factor in the final Montgomery result, which can be removed with a final Montgomery multiplication by the integer '1'. FPGA Implementation
All three numbersthe intermediate number S, the modulus M, and the operand Aare n-bits wide, or 2048 bits in a typical IPSec session. Since numbers of this size cannot be added efficiently in current FPGA technology, an indirect method has to be found. Under one indirect method typically implemented, designers use both systolic arrays and pipelined adders. However, we found that breaking down this equation into smaller subsections, and calculating the result iteratively, allowed the most efficient uses of FPGA resources, especially r outing. Too often, FPGA implementations of algorithms are compared purely based on logic resources. Routing resources also impact the quality of the final result in two ways: first in the performance of the core itself (easily measured) and second in the performance of the surrounding logic of the system in which the core is a part of. If the core uses routing out of proportion to it's logic utilization, then the rest of the system will be negatively affected, thus increasing the effective die area required to implement the system and reducing overall system performance. By implementing the calculation of the intermediate values in an iterative manner, the core logic and routing can be contained in a very small, easily routable section of the device, allowing the core to operate at the maximum performance allowable by the logic and routing fabric. This also allows the core to be scalable across a wide range of bit widths, even dynamically from multiplication to multiplication. Increased system perf ormance can then be realized by instantiating multiple cores in a single FPGA, each running at the maximum performance supported by the device, as all routing will be locally contained. Since the cores are very smalltypically 10% of the smallest, least expensive FPGAs in existencemultiple instantiations are an effective method of scaling system performance. The architecture of the core is relatively straightforward, as indicated by its very small sizeas little as 300 logic cells (LCs). There are a number of memories for S, M, B, and A as described previously. These memories are used to hold the values for each individual Montgomery multiplication. In addition, two more memories, one for the exponent, E, and one for the intermediate modular exponentiation value, are required. A state machine controls the multiple nested loops that calculate the modular exponentiation, using multiple modular multiplications. A simple word wide interface directly accesses the memo ries. The core can be parameterized by the user to fit the system bus width16-, 32-, 64-, or 128-bit widths are supportedwith the interface width also determining the processing step size. The throughput of the core is directly proportional to the word width specified by the user. Diffie-Hellman FPGA Architecture
In general, a practical Diffie-Hellman solution is composed of at least three basic components:
Figure 2 depicts a typical Diffie-Hellman accelerator block that can be used to offload modular calculations and random number generation from a host CPU into the FPGA fabric. The diagram shows a single calculation engine, however, it would be straightforward to increase the number of arithmetic engines to multiply the system's throughput linearly.
When implementing an RSA function in hardware, it is preferable to also implement a straight modulus function, which can be used to carry out the conversion to m-residue space before starting the Montgomery exponentiation. The modulus function carries out the following mathematical function:
The modulus core is also required to accelerate the calculation of the intermediate values needed to implement the Chinese Remainder Theorem (CRT)-based version of modular exponentiation. While the details of the CRT are beyond the scope of this article, a modular exponentiation can be split into two smaller modular exponentiations. As the complexity of modular exponentiations will square with word length, this will typically give a four-fold improvement in performance over a straight calculation. In the Diffie-Hellman core, this modulus function can serve another purpose besides the m-residue conversion. Consider the situat ion where the generator g = 2 (as defined for the IPSec IKE Diffie-Hellman groups described previously). The first stage of the Diffie-Hellman calculation is then Y=2xmodp. To represent a power of 2 number in binary, it is a simple matter of inserting a '1' in the appropriate position in a field of zeros, resulting in the first-stage operations becoming Y=Amodp where A=2x, which is the same equation used to describe the modulus core functionality. This simplification is an important one because it reduces the first part of the key generation from a full modular exponentiation to a single modulus. Since the modulus operation is several orders of magnitude faster than the equivalent exponentiation calculation, we can save a significant amount of calculation time by exploiting this simple shortcut. The Diffie-Hellman core above is intended to form part of an overall system that incorporates a processor, or some other mechanism for generatin g the packets and the digital signature insertion. Once the secret keys have been exchanged they will be added to the connections SA, after which the process is repeated for the next exchange. Size and Scalability
This function consumes a relatively small amount of logic resources. The logic utilization can be further reduced by omitting the RNG from the design and relying on software or data derived random numbers instead. This will save the 820 logic cells required by the RNG and produce the minimal Diffie-Hellman Core as follows:
To give an idea of the relative size of this solution, the smallest low cost FPGAs contain about 3000 logic cells, with a current high-volume price of $4. Thus a hardware acceleration solution supporting both Diffie-Hellman and RSA, is available for about $1. Implementation Efficiency Generally these devices have an efficient implementation of the modular arithmetic function as well as the important random number generator. However, the modular arithmetic functionality may be tuned to particular protocols such as IPSec or SSL, which may not be suitable for a general-purpose solution. More commonly, general-purpose processors (GPPs), and sometimes digital signal processors (DSPs), are used to implement the modular exponentiation algorithms. There are many software libraries available, which can be targeted to a wide variety of platforms, allowing a programmer to implement any combination of key exchange and public key cryptography easily. Performance results for the processor approach can vary widely, dependant on the type of processor used, as well as the quality of the library. The major advantage to the processor approach is flexibility. There are two major disadvantages:
An alternative approach, as described in this article, is to use an FPGA with its general-purpose logic fabric partitioned into an embedded processor and an a cceleration unit. For example, in the case of a 1024-bit exponent and modulus, it is possible to implement a radix-2 exponentiator that is 500 logic cells in size, and can achieve 32 key exchanges per second. This is for a very low cost, consumer product oriented device. Alternatively, using the higher performance FPGAs, which contains embedded 36x36 multipliers, a higher radix version of the modular exponentiation can be supported, that can achieve upto 640 key exchanges per secondagain using only a small fraction of the resources of the device. Programmable logic compares favourably with both the performance and cost of ASSPs, but has the flexibility of processors; but more importantly, as the FPGA can be customized to any degree by the user, the design can be optimized for the application at hand. An additional benefit of the FPGA approach is the ability to design systems for non-standard Diffie-Hellman generators and primes. For example, designers may wish to implement a special Diffie-Hellman group for use with a proprietary security implementation that uses a generator g=13. With the FPGA cores, changing generators can be as simple as changing the a few user-defined parameters in the acceleration core configuration file and recompiling the design for the same hardware platform - or even updating the hardware platform while in service. About the Authors Martin Langhammer is chief scientist at Altera's European Technology Center in the UK Martin has a B.A.Sc. in Electrical Engineering from University of Toronto and can be reached at mlangham@altera.com.
|
Home | Feedback | Register | Site Map |
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved. |