How to implement double-precision floating-point on FPGAs

By Danny Kreindler, Altera Corporation
October 03, 2007 -- pldesignline.com

Floating-point arithmetic is used extensively in many applications across multiple market segments. These applications often require a large number of calculations and are prevalent in financial analytics, bioinformatics, molecular dynamics, radar, and seismic imaging, to name a few. Although integer and single-precision (32-bit) floating-point math suffice for some of them, many applications demand higher precision and therefore require double-precision (64-bit) operations. This article demonstrates the double-precision floating-point performance of FPGAs using two different approaches. First, a theoretical "paper and pencil" calculation is used to demonstrate peak performance. This type of calculation may be useful for raw comparison between devices, but it is somewhat unrealistic: it assumes that data is always available to feed the device, and it does not take into account memory interfaces and latencies, place-and-route constraints, and other aspects of an actual FPGA design. Second, the measured results of a double-precision matrix multiply core, which can easily be extended to a full DGEMM benchmark, are presented, and the real-world constraints and challenges of achieving such results are discussed in detail.
Introduction
An increasing number of applications in many vertical market segments, from financial analytics to military radar to various imaging applications, rely on computations with floating-point (FP) numbers. These applications implement various basic functions and methods, such as fast Fourier transforms (FFTs), finite impulse response (FIR) filters, synthetic aperture radar (SAR), matrix math, and Monte Carlo simulation. Many of these implementations use single-precision FP, where FPGAs can provide up to ten times the sustained performance of traditional CPUs. Recently, interest has grown in double-precision performance, to see how well FPGAs can compete with CPUs, especially in designs that have power and cooling constraints.

In a recent article titled "FPGA Floating-Point Performance – A Paper and Pencil Evaluation," Dave Strenski discusses how to estimate the peak double-precision (64-bit) FP performance of an FPGA. This article evaluates his method and, more importantly, extends it with "real-world" considerations for estimating the sustained FP performance of an FPGA. These considerations are validated using a matrix multiplication design running in an Altera Stratix II FPGA.
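The paper-and-pencil peak estimate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Strenski's exact procedure: it assumes every FP core is fully pipelined and delivers one double-precision result per clock cycle, and the function names and parameters are hypothetical. The second function reflects the fact that a workload such as matrix multiply uses adders and multipliers in pairs, so unpaired cores contribute nothing.

```python
def peak_gflops(num_adders: int, num_multipliers: int, fmax_mhz: float) -> float:
    """Peak estimate assuming each FP core is fully pipelined and
    produces one double-precision result per clock cycle."""
    if num_adders < 0 or num_multipliers < 0 or fmax_mhz <= 0:
        raise ValueError("core counts must be >= 0 and fmax must be > 0")
    ops_per_cycle = num_adders + num_multipliers
    return ops_per_cycle * fmax_mhz / 1000.0  # MFLOPS -> GFLOPS


def balanced_peak_gflops(num_adders: int, num_multipliers: int,
                         fmax_mhz: float) -> float:
    """Peak for a workload that needs adders and multipliers 1:1:
    only complete adder/multiplier pairs do useful work."""
    pairs = min(num_adders, num_multipliers)
    return peak_gflops(pairs, pairs, fmax_mhz)
```

Both inputs are optimistic in practice: a full device rarely reaches the fMAX of an isolated core, and not every core that fits can actually be routed.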

The double-precision general matrix multiply (DGEMM) routine is referenced here. DGEMM is a common building block for many algorithms and is the most important component of the scientific LINPACK benchmark commonly used on CPUs. The Basic Linear Algebra Subprograms (BLAS) include DGEMM in the Level 3 group. The DGEMM routine calculates the new value of matrix C based on the product of matrix A and matrix B and the previous value of matrix C, using the formula C = αAB + βC (where α and β are scalar coefficients).
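For reference, the semantics of the formula can be written as a naive triple loop. This Python sketch only illustrates what DGEMM computes; it omits the transpose flags and leading-dimension arguments of the actual BLAS interface, and the function name dgemm here is a hypothetical stand-in, not a binding to any BLAS library.

```python
from typing import List

Matrix = List[List[float]]


def dgemm(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix) -> Matrix:
    """Return alpha*A*B + beta*C for an m x k matrix A, a k x n matrix B and
    an m x n matrix C. Reference semantics only; not an optimized routine."""
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B differ")
    n = len(b[0])
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must be m x n")
    result = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply and one add per step
            result[i][j] = alpha * acc + beta * c[i][j]
    return result
```

The input matrices are left unmodified; the new C is returned as a fresh list of rows.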

For this analysis, α = β = 1 is used, though any scalar value could be used, because the scaling can be applied as data is transferred in and out. With these coefficients, each step of the inner product requires one multiplication and one addition, so the operation needs adders and multipliers in a 1:1 ratio. This analysis also accounts for the logic required for a microprocessor interface protocol core and adds the following considerations:
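One way to see both points in this paragraph is to model a core hard-wired to α = β = 1 and apply the scalars as the data streams in: scaling A by α gives (αA)B = α(AB), and scaling C by β supplies the βC term. This is a hypothetical sketch (the names unit_core, scale, and dgemm_via_transfer are illustrative), and it also counts operations to show the 1:1 multiplier-to-adder ratio. Pre-scaling A can change floating-point rounding slightly compared with scaling the product, so general inputs may differ in the last bits.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def unit_core(a: Matrix, b: Matrix, c: Matrix) -> Tuple[Matrix, int, int]:
    """Model of a core fixed at alpha = beta = 1: returns (C + A*B, muls, adds).
    Each inner-product step uses exactly one multiplier and one adder."""
    m, k, n = len(a), len(b), len(b[0])
    out = [row[:] for row in c]  # accumulate on top of the incoming C
    muls = adds = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
                muls += 1
                adds += 1
    return out, muls, adds


def scale(s: float, x: Matrix) -> Matrix:
    """Scalar multiply, standing in for scaling applied while data streams in."""
    return [[s * v for v in row] for row in x]


def dgemm_via_transfer(alpha: float, a: Matrix, b: Matrix,
                       beta: float, c: Matrix) -> Matrix:
    """alpha*A*B + beta*C computed by pre-scaling A and C before the unit core."""
    out, _, _ = unit_core(scale(alpha, a), b, scale(beta, c))
    return out
```

For 2 × 2 inputs, unit_core performs 8 multiplications and 8 additions, one of each per inner-product step.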

Memory interface module for low latency access to local data

Data paths from memory interface to FPGA memory

Data path from FPGA memory to FP cores

Decrease in FP core fMAX when the FPGA is full

Unusable FPGA logic due to routing challenges of a full FPGA
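These considerations can be folded into the paper-and-pencil number as derating factors. The following is a hypothetical model, not the author's method: it treats logic as a single resource (real devices are also limited by DSP blocks and on-chip memory), subtracts the interface and data-path logic, keeps only the fraction that can still be routed, and runs the remaining adder/multiplier pairs at a reduced fMAX. All field names are illustrative, and no Stratix II figures are implied.

```python
from dataclasses import dataclass


@dataclass
class FpgaBudget:
    """Hypothetical inputs for illustration, in arbitrary resource units;
    none of these are Stratix II data."""
    total_logic: float        # logic available on the device
    interface_logic: float    # processor/memory interfaces and data paths
    logic_per_pair: float     # one FP adder plus one FP multiplier
    usable_fraction: float    # share of remaining logic that still routes
    fmax_isolated_mhz: float  # FP core clock in a sparsely filled design
    fmax_full_derate: float   # fraction of that clock reached when full


def sustained_gflops(b: FpgaBudget) -> float:
    """Derate the paper-and-pencil peak by the real-world considerations."""
    if not (0 < b.usable_fraction <= 1 and 0 < b.fmax_full_derate <= 1):
        raise ValueError("fractions must be in (0, 1]")
    free_logic = max(0.0, b.total_logic - b.interface_logic)
    # Whole adder/multiplier pairs that fit in the routable logic.
    pairs = int(free_logic * b.usable_fraction // b.logic_per_pair)
    fmax_mhz = b.fmax_isolated_mhz * b.fmax_full_derate
    return 2 * pairs * fmax_mhz / 1000.0  # 2 FLOPs per pair per cycle
```

Setting usable_fraction and fmax_full_derate to 1 and interface_logic to 0 recovers the balanced peak estimate, which makes the cost of each consideration easy to isolate.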

The FPGA benchmark focuses on the performance of an implementation of the AB matrix multiplication with data from a locally attached SRAM. Extending this core to include the accumulator that adds the old value of C requires relatively little effort.
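The data flow described here, with operands moving from the locally attached SRAM into FPGA memory and then into the FP cores, can be approximated in software as a tiled multiply. In this sketch the tile copies stand in for on-chip buffers, and the function name tiled_matmul and its tile parameter are hypothetical. The optional c_old argument illustrates why the DGEMM extension is minor: it adds one addition per output element (m × n in total), compared with m × n × k multiply-adds for the AB product.

```python
from typing import List, Optional

Matrix = List[List[float]]


def tiled_matmul(a: Matrix, b: Matrix, tile: int,
                 c_old: Optional[Matrix] = None) -> Matrix:
    """Compute A*B tile by tile, as if each tile were copied from external
    SRAM into on-chip buffers before feeding the FP cores. If c_old is
    given, it is added once per output element at the end."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # "Load" the current tiles of A and B into local buffers.
                a_buf = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_buf = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                for i, a_row in enumerate(a_buf):
                    for j in range(len(b_buf[0])):
                        acc = 0.0
                        for p, a_val in enumerate(a_row):
                            acc += a_val * b_buf[p][j]
                        out[i0 + i][j0 + j] += acc  # partial sum for this tile
    if c_old is not None:
        # The DGEMM extension: m*n extra adds versus m*n*k multiply-adds above.
        out = [[o + cv for o, cv in zip(orow, crow)]
               for orow, crow in zip(out, c_old)]
    return out
```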