3+ ways to design reconfigurable algorithm accelerator in IP block

by Tapio Ristimäki and Jari Nurmi, Institute of Digital and Computer Systems, Tampere University of Technology
Tampere, Finland

Abstract :

Three ways to design a reconfigurable algorithm accelerator are shown. It is also presented that the problem field makes it possible to make such a simplifications that any parallel topology can easily been adapted to be an accelerator IP block. The comparison between three approaches mentioned has been done and the results show that non-FPGA based solutions are more promising.

1. Introduction

The IP methodology is maybe the best solution with the view to reducing the productivity gap. An area that is not yet covered with IP blocks is application specific algorithms, which have to be designed and implemented with the hardware description language or whatsoever. With reconfigurable algorithm accelerator also the hardware of application specific algorithms could be implemented within the IP methodology and the implementation of functionality could be lifted to abstraction level of some kind of software. The reconfigurability would also make it possible to change the algorithms used after manufacturing even dynamically such that same silicon area could be used for accelerating several algorithms.

2. Premise

Parallel computing has been studied for decades in the academic society of the VLSI. It has been noticed that it is difficult to design a general-purpose parallel topology, which could replace the traditional Von Neuman processor. For our point of view the biggest problems has been in:

Making a compiler which makes efficient parallel assembler for a sequential presentation
Making parallel processor so generic that all kind of programs can be executed on it.
Implementing not only the parallel structures but also the structures needed in actual execution of sequential program flow breaks the “ideal” topology.

List 1. Problems of general-purpose parallel topologies

It is much easier to get things work on the machines which use parallel computing, if the problems mentioned above can be somehow sidestepped. Fortunately algorithm accelerators are used only for computing very tiny inner loops of the algorithm kernels, which limits the problem such that

Compiler is not needed. Short acceleration programs can be made by hand.
We do not have to design generic topology because the accelerator is used only in certain kinds of situations.
Set of sequential structures can be minimized.

List 2. Simplifications.

3. FPGA approach

The most obvious way to design algorithm accelerator IP is an FPGA block. The approach is published at least in [1,2]. LSI logic has also tried to make commercial application such that the FPGA block can be implemented to their RapidChip platform [3] but with no great success. We have also made case study, in which we implemented RSA encryption to the Xilinx platform FPGA[4]. In that study we found out that the SoC with one FPGA block inside is very attractive solution to made computation intensive applications. However we also realized that FPGA block does not fit well to the IP methodology. The biggest reason is that the functionality of the FPGA is designed more or less in the same level as the functionality of the hardwired accelerator. The other negative side is that FPGA cells are not synthesizable, and that is why only manufacturer specific cores can be used.

4. MIMD approach

Coarse-grain reconfigurable structures have produced a lot of different kinds of MIMD based architectures. The best known ones can be found from [5-10]. The idea of coarse-grain computing is to combine several from four bits to n bits wide processing elements to the one computing unit. In the begin the coarse-grain systems were combinations of circuit boards, in the 90’s those could be implemented into single silicon die and now we are in the point where the massively parallel matrix of 16-bit DSP processors can be implemented to be only one IP block of the SoC as shown in [11]. In [11] the actual implementation of a coarse-grain based reconfigurable IP block RAA (reprogrammable algorithm accelerator) is given. RAA is the matrix of 16-bit wide tiny DSP-processors and it can be connected to the SoC via its OCP-IP [12] interface. The communication topology of RAA is depicted in the figure 1.

Figure 1. (a) Communication topology from the external controller point of view. (b) Communication topology between the nodes.

The information what we got in the studies in connection with [11] was that coarse-grain based reconfigurable block could be feasibly as a commercial application.

5. Lifting abstraction level even higher

It is easy to see that the premises given in the list 2 can easily be adapted, not only to ones which are usually connected to the term “reconfigurable”, but also to any other parallel topology. The thing, which makes this very interesting, is that even topologies which are very difficult in normal situations to make work good enough can (even easily) be made work under conditions of the list 2. To give examples of potential parallel architectures we could name SIMDs, associative and neural architectures, reduction machines, dataflow machines etc. However, the more exotic TT (transport triggered) architecture [13] was chosen to be an example.
TTA is a processor, which basically has only one instruction, MOVE. Programmer controls the program flow by moving the operand to the operand register of certain functional unit. When operand register is trigged the particular functional unit of the TTA processor transfer operands according it’s function and stores result(s) to the result register. Execution goes forward by moving the result to the memory or to the other functional unit.
The processor itself has minimal a set of controlling structures and a lot of duties of the normal processor has moved to the compiler, which increases the ratio between area and theoretical computational efficiency (1). On the other hand it is easy to see that the sequential instructions does not fit to this topology at all, but adding them will break the ideal topology (2) and that the efficient compiler needs an artificial intelligence to be efficient (3). The clause (1) makes TTA attractive choice to be an algorithm accelerator and in the light of the list 2 clauses (2) and (3) does not mean that the implementation would be impossible or even difficult to do.
Our example TTA based algorithm accelerator IP has 30 MAC units, 5 sifters and 10 basic ALUs. It also has 10 register banks. The topology is depicted in the figure 2.

Figure 2. The topology of TTA based algorithm accelerator.

The implementations of functional units are trivial and effective Synopsys designware [14] components were used. On the other hand the implementation of the communication and control network is not trivial at all but it needs huge design space exploration to find even sub optimum structure. However the most obvious bus / connection matrix approach were used in this paper such that each functional unit is connected to the eight “busses”. To make design synthesizable the mux based architecture with separate write and read lines were used.

6. Conclusion

The comparison between the three mentioned topologies is given in the table 1. Same 1024-bit RSA encryption has been implemented to the every accelerator making comparing possible. The pure ASIC implementation is also shown.

Table 1. Comparison FPGA vs. MIMD vs. TTA.

	Silicon technology used	Area (mm2)	Code size of RSA example (rows)	Execution time of RSA (ms)	Hard/Soft core
ASIC [17]	0.35	41	2000	22	-
FPGA[4]	0.13	100-200(1)	1600	8.5	H
MIMD[11]	0.13	6.5	20	5.8	H+S
TTA(2)	0.18	2	1050(3)	4	H+S

(1)   Figure is approximated according the information of [15,16].
(2)   Figures are results of first stage experiments.
(3)   Code is trivial to write (copy-paste) for dataflow graph of the algorithm.

Table 1. indicates very clearly that ASIC based methods are more promising than FPGA based one on the reconfigurable algorithm accelerator IPs. Those are not only smaller in die size but – what more important - the functionality of them can be designed on the higher abstraction level and so designing time can be decreased. In ASIC and FPGA approaches the functionality is implemented with VHDL, so every control and data transfer must be determined in code and quite the huge amount of code is needed. In MIMD based architecture the code needed is the assembler code to every single processor. Fortunately the most of processors executes the same code and only few rows has to be written to modeling e.g. the most computational intensive parts of RSA algorithm. However, for our point of view the most illustrative coding style of the topologies presented is in TTA. The “code” is group of data moves from place to place. Actually we can straight forward use the data-flow of the algorithm by only writing it up in right format.

Our studies shows that it would be very feasible to implement application specific algorithms to the IP based SoCs in presented reconfigurable algorithm accelerator methodology. By using IP blocks also in application specific parts of SoC the designing and verification times could be shorten. The other clear advantage would be that the same silicon area could be used to accelerate several algorithms and on the other hand the algorithms used could be updated or even changed according the needs of the user.

12. References

[1]   S.J.E Wilton and R. Saleh, “Programmable logic IP cores in SoC design: opportunities and challenges”, Proc., IEEE Conference on Custom Integrated Circuits, 2001, pp. 63-66.

[2]   J.Greenbaum,”Reconfigurable logic in SoC systems”, Proc., IEEE Custom Integrated Circuits Conference, 2002, pp. 5-8.

[3]   http://www.eetimes.com/semi/news/OEG20020624S0016

[4]   T. Ristimäki, J. Nurmi, “Implementation of a Fast 1024-bit RSA Encryption on Platform FPGA”, Proc., 6th IEEE International Workshop on Design and Diagnostic of Electronic circuits and systems, 2003, pp.

[5]   A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin and B. Hutchings, “A reconfigurable arithmetic array for multimedia applications”, Proc., ACM/SIGDA international symposium on Field programmable gate arrays, 1999, pp. 135-143.

[6]   E. Mirsky and A. DeHon, “MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources”, Proc., IEEE Symposium on FPGAs for Custom Computing Machines, 1996, pp. 157-166.

[7]   J.R. Hauser and J. Wawrzynek, ”Garp: a MIPS processor with a reconfigurable coprocessor”, Proc., IEEE Symposium on FPGAs for Custom Computing Machines, 1997, pp. 12-21.

[8]   E. Waingold, M.Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe and A. Agarwal, “Baring it all to software: Raw machines” , Journal, IEEE Computer Volume: 30 Issue: 9 , pp. 86-93.

[9]   B. Salefski and L. Caglar, “Re-configurable computing in wireless”, Proc., Design Automation Conference, 2001, pp. 178-183.

[10]   T. Miyamori and K. Olukotun,” reconfigurable multimedia array coprocessor”, Proc., ACM/SIGDA international symposium on Field programmable gate arrays, 1998, pp. 261.

[11]   T.Ristimäki, J. Nurmi : “Implementing User and Application Specific Algorithms within IP-methodology: a Coarse-Grain Approach”, Proc., 1st IEEE International Symposium on System-on-Chip, 2003.

[12]   www.ocpip.org

[13]   H. Corporal: Microprocessor Architectures from VLIW to TTA (Book)

[14]   http://www.synopsys.com/products/designware/designware

[15]   http://www.chipworks.com/Reports/Flyers/Xilinx_Virtex._circuits.htm

[16]   www.xilinx.com

[17]   T.Ristimäki : ASIC Design of RSA Encryption (M.Sc. Thesis in finish), 2002.