Implementing IKE Capabilities in FPGA Designs

Implementing IKE Capabilities in FPGA Designs
By Ahmed Shihab, Alcahest; and Martin Langhammer, Altera, CommsDesign.com
December 5, 2003 (9:16 a.m. EST)
URL: http://www.eetimes.com/story/OEG20031205S0005

Implementing Internet key exchange (IKE) capabilities in standalone ASICs housing IPSec and other security tasks may appear to be a benefit to designers on paper. However, in real implementations, this tightly coupled architecture limits a designer's flexibility in adjusting and adapting their IKE capabilities for proprietary or newer key exchanging schemes.

Fortunately, designers now have another option. By splitting up mathematical operations into smaller subsections, designers can now implement Diffie-Hellman and other IKE capabilities in field programmable gate array (FPGA) device without gobbling up tons of logic cells. And, at the same time, they get the added flexibility to make changes to these algorithms over time.

In this article, we'll show how the Diffie-Hellman algorithm can be implemented in FPGA designs. We'll also show the implementation and efficiency gains that can be achieved using this approach.

Exchanging KeysBefore diving into the specific implementation issues, let's first provide a quick overview of how IKE works.

IKE is a protocol defined under the Internet Engineering Task Force's (IETF's) RFC 2409 specification. The protocol is used for key management in IPSec cryptosystem and enables the automatic negotiation (and renegotiation) and creation of security associations (SAs) between IPSec peers. IPSec can be configured without IKE, but including an IKE implementation in IPSec will enhance its security, flexibility and simplify the creation and implementation of IPSec security policies.

For IPSec to function between two network peers, a security association (SA) must be established between them that describe the connection's parameters, such as the secret keys for the secret-key ciphers, the cipher modes, and key lifetimes amongst other information. To establish an SA, data must be exchanged over an unsecured network, such as the Internet, in plain sight of potential attackers.

This represents t he classic dilemma of establishing trust in cryptosystems: How do you secure the exchange of information that leads to the trust establishment without compromising the very data you wish to protect?

IKE solves this dilemma by using a number of sub-protocols and standards. In essence IKE is split into a number of sub-protocols that are used in the various stages of the SA establishment. They include the Internet Security Association and Key Management Protocol (ISAKMP). A detailed discussion of ISAKMP is beyond the scope of this paper but suffice it to say that ISAKMP is responsible for the provision the following services to the IKE:

Authenticate the remote peer
Cryptographically protect the management session
Exchange information for key exchange
Negotiate all traffic protection parameters

ISAKMP defines mechanisms for key exchange in different modes over each session. Each protocol results in an authenticated key exchange, using the Diffie-Hellman algorithm. The security of the Diffie-Hellman is based on the difficulty of calculating the discrete logarithms of very large numbers.

The Diffie-Hellman protocol is used to secure key exchange over insecure channels and is required to provide the keying material for the symmetrical, secret key ciphers such as DES and AES. These ciphers provide the data encryption information actually sent between the users of the IPSec sessions.

Diffie-Hellman Fundamentals
The Diffie-Hellman algorithm is composed of several large modular exponentiations, and can most clearly be understood by example. First, the two parties must share two public (non-secret) numbers: the generator (g), and the modulus (p), usually from a table of commonly used values. The generator g is usually a very small number (e.g. 2, 3,...), and p is a very large prime number (e.. 768 - 8192 bits long). In addition, both parties generate a secret number (x), generally the same size as the modulus. Both pa rties then calculate their public value using the following modular exponentiation:

The two parties then exchange their public values (Y), and perform a further modular exponentiation—this time using the other party's public value as the generator. Both parties will then have calculated the same value. A third party listening in on the channel cannot calculate the same result since it does not have either of the secret values x, which can only be derived from transmitted public values Y by calculating the discrete logarithm of this very large number.

Alice and Bob will show us how it is done, using some small numbers: generator g=2 and prime p=157.

From the above example it should be obvious that fundamental to the Diffie-Hellman exchange is the exponentiation operati on Y = g^xmodp. This operation is the same fundamental operation of the RSA public-key cipher.

A further simplification is to notice that the initial steps involve exponentiation with the generator 2. In hardware terms this simplifies to a modulus operation of a 2^x, which can be several orders of magnitude faster than the full modular exponentiation, i.e. where the generator g is the same size as the modulus. We will examine the implementation aspects of this shortly.

The Diffie-Hellman exchange is illustrated in Figure 1. This figure shows a typical arrangement of communicating peers with third party eavesdropping on their communications.

Figure 1: Diagram illustrating the Diffie-Hellman key exchange process.
Mathematics for FPGA Implementations
The Diffie-Hellman key exchange algorithm was shown to depend on a mod ular exponentiation function similar to the one used by the RSA algorithm. There are several algorithms that can be used to implement the exponentiation of such large numbers.
After looking at the problem in detail, the Montgomery algorithm appeared to be particularly suitable for FPGA implementation, especially for a low cost device that does not have special features such as multipliers. We will start by examining the general Montgomery algorithm, and proceed to describe some of the optimizations to the algorithm to reduce hardware cost.
Montgomery math allows modular multiplications to be calculated without division, which are particularly ungainly in hardware. An exponentiation can be performed with Montgomery numbers identically to regular integers, by using the standard square and multiply algorithm. Using Montgomery numbers just adds two steps to the process; first, the integers must be converted to Montgomery space with all the intermediate multiplications being performed using Montgomery te chniques. Then the final result must be converted back to regular integer format.
The Montgomery modular multiplication—AB mod M— for Radix 2 calculations (i.e. calculating an n-bit number will require n iterations) is:

Where the subscript i refers to an individual bit position in a number, with i = 0 being the least significant bit (LSB).
As a result of the optimizations in Montgomery multiplications, specifically the right shifts, there is an inherent factor present in the result; the actual calculation performed is:

Hence the need to transform the operands A and B into Montgomery space, commonly referred to as m-residue, by the modular operation:

Since both operands will be converted to m-residue, there will be a 2ⁿ factor in the final Montgomery result, which can be removed with a final Montgomery multiplication by the integer '1'.
FPGA Implementation
The core of the Montgomery modular multiplication is the intermediate number calculation:

All three numbers—the intermediate number S, the modulus M, and the operand A—are n-bits wide, or 2048 bits in a typical IPSec session. Since numbers of this size cannot be added efficiently in current FPGA technology, an indirect method has to be found.
Under one indirect method typically implemented, designers use both systolic arrays and pipelined adders. However, we found that breaking down this equation into smaller subsections, and calculating the result iteratively, allowed the most efficient uses of FPGA resources, especially r outing.
Too often, FPGA implementations of algorithms are compared purely based on logic resources. Routing resources also impact the quality of the final result in two ways: first in the performance of the core itself (easily measured) and second in the performance of the surrounding logic of the system in which the core is a part of. If the core uses routing out of proportion to it's logic utilization, then the rest of the system will be negatively affected, thus increasing the effective die area required to implement the system and reducing overall system performance.
By implementing the calculation of the intermediate values in an iterative manner, the core logic and routing can be contained in a very small, easily routable section of the device, allowing the core to operate at the maximum performance allowable by the logic and routing fabric. This also allows the core to be scalable across a wide range of bit widths, even dynamically from multiplication to multiplication. Increased system perf ormance can then be realized by instantiating multiple cores in a single FPGA, each running at the maximum performance supported by the device, as all routing will be locally contained. Since the cores are very small—typically 10% of the smallest, least expensive FPGAs in existence—multiple instantiations are an effective method of scaling system performance.
The architecture of the core is relatively straightforward, as indicated by its very small size—as little as 300 logic cells (LCs). There are a number of memories for S, M, B, and A as described previously. These memories are used to hold the values for each individual Montgomery multiplication. In addition, two more memories, one for the exponent, E, and one for the intermediate modular exponentiation value, are required. A state machine controls the multiple nested loops that calculate the modular exponentiation, using multiple modular multiplications.
A simple word wide interface directly accesses the memo ries. The core can be parameterized by the user to fit the system bus width—16-, 32-, 64-, or 128-bit widths are supported—with the interface width also determining the processing step size. The throughput of the core is directly proportional to the word width specified by the user.
Diffie-Hellman FPGA Architecture
Having discussed the architecture of the cores, let's now discuss the best way to implement the Diffie-Hellman protocol in FPGAs, in order to ensure both performance and scalability. As well as the mathematical operations discussed so far, there are a number of other functions that must be supported by any practical Diffie-Hellman key exchange engine:

Choosing statistically random numbers to use in the first phase of the exchange.

Constructing the packets for the system to use.

Monitoring the random numbers and keys for collision attacks where the peer is sending repeating or weak keys.

Produce packets with some authentication method that reduces the risk of man-in-the-middle attacks.

In general, a practical Diffie-Hellman solution is composed of at least three basic components:

Modular arithmetic functional block.

A random number generator (RNG).

A processor controlling data flow and handling the packet processing. This can be a dedicated embedded processor, most often inside the FPGA, or the main system processor.

Figure 2 depicts a typical Diffie-Hellman accelerator block that can be used to offload modular calculations and random number generation from a host CPU into the FPGA fabric. The diagram shows a single calculation engine, however, it would be straightforward to increase the number of arithmetic engines to multiply the system's throughput linearly.

Figure 2: Typical Diffie-Hellman accelerator block for an FPGA.
When implementing an RSA function in hardware, it is preferable to also implement a straight modulus function, which can be used to carry out the conversion to m-residue space before starting the Montgomery exponentiation. The modulus function carries out the following mathematical function:

The modulus core is also required to accelerate the calculation of the intermediate values needed to implement the Chinese Remainder Theorem (CRT)-based version of modular exponentiation. While the details of the CRT are beyond the scope of this article, a modular exponentiation can be split into two smaller modular exponentiations. As the complexity of modular exponentiations will square with word length, this will typically give a four-fold improvement in performance over a straight calculation.
In the Diffie-Hellman core, this modulus function can serve another purpose besides the m-residue conversion. Consider the situat ion where the generator g = 2 (as defined for the IPSec IKE Diffie-Hellman groups described previously). The first stage of the Diffie-Hellman calculation is then Y=2^xmodp. To represent a power of 2 number in binary, it is a simple matter of inserting a '1' in the appropriate position in a field of zeros, resulting in the first-stage operations becoming Y=Amodp where A=2^x, which is the same equation used to describe the modulus core functionality.
This simplification is an important one because it reduces the first part of the key generation from a full modular exponentiation to a single modulus. Since the modulus operation is several orders of magnitude faster than the equivalent exponentiation calculation, we can save a significant amount of calculation time by exploiting this simple shortcut.
The Diffie-Hellman core above is intended to form part of an overall system that incorporates a processor, or some other mechanism for generatin g the packets and the digital signature insertion. Once the secret keys have been exchanged they will be added to the connections SA, after which the process is repeated for the next exchange.
Size and Scalability
An example implementation outlined above will have the following logic resources:

This function consumes a relatively small amount of logic resources. The logic utilization can be further reduced by omitting the RNG from the design and relying on software or data derived random numbers instead. This will save the 820 logic cells required by the RNG and produce the minimal Diffie-Hellman Core as follows:

To give an idea of the relative size of this solution, the smallest low cost FPGAs contain about 3000 logic cells, with a current high-volume price of $4. Thus a hardware acceleration solution supporting both Diffie-Hellman and RSA, is available for about $1.
Implementation Efficiency
Standalone ASIC (or ASSP) implementations generally do not exist for Diffie-Hellman only implementations; instead they tend to be part of a more application specific device such as an IPSec security processor or SSL processor. Using the Diffie-Hellman only part of the chip may not be straight forward or cost effective, as the Diffie-Hellman engine may be closely tied to the overall functionality of the chip, and not available independently.
Generally these devices have an efficient implementation of the modular arithmetic function as well as the important random number generator. However, the modular arithmetic functionality may be tuned to particular protocols such as IPSec or SSL, which may not be suitable for a general-purpose solution.
More commonly, general-purpose processors (GPPs), and sometimes digital signal processors (DSPs), are used to implement the modular exponentiation algorithms. There are many software libraries available, which can be targeted to a wide variety of platforms, allowing a programmer to implement any combination of key exchange and public key cryptography easily. Performance results for the processor approach can vary widely, dependant on the type of processor used, as well as the quality of the library. The major advantage to the processor approach is flexibility. There are two major disadvantages:

Compared to a hardware-accelerated implementation, most CPU-based systems are low-throughput in terms of the number of key-exchanges per second.

To increase the performance it may become necessary to dedicate a separate processor to the task of key exchanges, which will not only add a cost component into the system, but also consume board space, power, and design time.

An alternative approach, as described in this article, is to use an FPGA with its general-purpose logic fabric partitioned into an embedded processor and an a cceleration unit. For example, in the case of a 1024-bit exponent and modulus, it is possible to implement a radix-2 exponentiator that is 500 logic cells in size, and can achieve 32 key exchanges per second. This is for a very low cost, consumer product oriented device. Alternatively, using the higher performance FPGAs, which contains embedded 36x36 multipliers, a higher radix version of the modular exponentiation can be supported, that can achieve upto 640 key exchanges per second—again using only a small fraction of the resources of the device. Programmable logic compares favourably with both the performance and cost of ASSPs, but has the flexibility of processors; but more importantly, as the FPGA can be customized to any degree by the user, the design can be optimized for the application at hand.
An additional benefit of the FPGA approach is the ability to design systems for non-standard Diffie-Hellman generators and primes. For example, designers may wish to implement a special Diffie-Hellman group for use with a proprietary security implementation that uses a generator g=13. With the FPGA cores, changing generators can be as simple as changing the a few user-defined parameters in the acceleration core configuration file and recompiling the design for the same hardware platform - or even updating the hardware platform while in service.
About the Authors
Ahmed Shihab is the technical director at Alcahest Limited. He has worked in the DSP and network security industry for the last ten years since graduating with an Engineering Honours degree from Southampton University, UK. Ahmed can be reached at ashihab@alcahest.com.
Martin Langhammer is chief scientist at Altera's European Technology Center in the UK Martin has a B.A.Sc. in Electrical Engineering from University of Toronto and can be reached at mlangham@altera.com.