C-based architecture assembly supports custom design
By Krishna V. Palem, Chief Technology Officer, Proceler Inc., Berkeley, Calif., EE Times
February 8, 2002 (12:50 p.m. EST)
URL: http://www.eetimes.com/story/OEG20020208S0058
Designing custom solutions has played a central role in addressing two of the most significant concerns driving the embedded-systems industry: cost and time-to-market. In the past, customization usually meant investing time and money to design an application-specific solution, usually an integrated circuit (ASIC). Prevalent as this approach might be, it faces two key hurdles: a long time-to-market and a high initial investment, usually referred to as nonrecurring engineering costs, or NRE, intrinsic to hardware design. The usual way to overcome the latter has been to amortize the NRE over a large number of ASICs, which makes the approach attractive only in the context of high-volume applications.
If the embedded-systems industry is to grow at the projected rates to yield an overall market size of $100 billion by the end of the decade, it is essential that the hurdles imposed by customization along the cost (NRE) and time-to-market dimensions be rapidly overcome.
The approach we have taken is architecture assembly, an idea best understood by taking a closer look at customization. Typically, the parts of an application that are prime candidates for customization are its compute-intensive kernels, quite often the bodies of loops or other iterative structures. At a high level, architecture assembly involves identifying a select collection of such compute-intensive (loop) structures and optimizing their execution via custom solutions designed completely automatically using patent-pending optimizing compiler technology. The resulting designs can therefore be realized quickly and at low cost.
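As an illustration of the kind of kernel described above, here is a minimal sketch of a finite impulse response (FIR) filter in C. The function name and signature are hypothetical and do not come from any Proceler tool; the example only shows a loop body in which nearly all of the execution time is spent, which is what would make it a candidate for customization.

```c
#include <stddef.h>

/* Hypothetical example kernel: an FIR filter with n_taps coefficients.
 * The input x must hold at least n_out + n_taps - 1 samples.
 * The inner multiply-accumulate loop dominates the run time, so it is
 * the part one would consider moving into custom hardware. */
void fir_filter(const int *x, const int *h, int *y,
                size_t n_out, size_t n_taps)
{
    for (size_t i = 0; i < n_out; i++) {
        int acc = 0;
        for (size_t k = 0; k < n_taps; k++) {
            acc += h[k] * x[i + k];   /* one multiply and one add per tap */
        }
        y[i] = acc;
    }
}
```

The loop body uses only a multiplier, an adder, a register for the accumulator and memory reads, which is roughly the vocabulary of hardware elements the rest of this article discusses.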
The approach allows application development and customization to be entirely software-based. Given its proliferation in the embedded systems industry, the C programming language is now the focus of our efforts with suitable extensions to Java being planned for subsequent versions of the product.
Another unique aspect of this approach is the focus on reconfigurable computing as the vehicle for deploying the custom designs. Based on field-programmable gate-array (FPGA) technology, reconfigurable computing affords efficient execution of custom hardware designs. Moreover, as the designs change, the same piece of hardware can be reconfigured to serve as the emulation engine for the new design(s). As the programmer designs new applications in C, these designs can therefore be deployed without new hardware design cycles and hence without the associated NRE, overcoming the inertia and concomitant costs of a hardware-centric design process.
Specifically, the hardware platforms that the tools target are conventional microprocessors coupled to a commercial off-the-shelf (COTS) reconfigurable processor from vendors such as Altera or Xilinx, as well as other tightly integrated solutions, often referred to as a configurable system-on-chip (CSOC), from vendors that include Atmel, Triscend and Chameleon.
The path from a C-based application to an executing custom design thus bypasses hardware expertise and hardware design, both by using tools that allow a pure software design cycle and by targeting COTS reconfigurable hardware.
Architecture assembly revolves around the idea of taking a piece of a program and instantiating it using "elements" of hardware, such as adders, multipliers, registers for storage and interconnections, that are best suited to it, creating a soft processor to accelerate that piece of the program. An integral part of the idea is to ensure that the elements from which a custom implementation is assembled are prebuilt or, equivalently, built offline.
The granularity and complexity of these elements are typically at the level of what is conventionally referred to as the instruction-set architecture of a typical microprocessor.
These architecture elements are constructed offline and implemented using highly efficient algorithms and synthesis techniques, as opposed to a time-consuming online synthesis performed during application development, as is the case in ASIC development.
As a result, our compiler includes a library of these architecture elements (which can be thought of as instructions, storage, and interconnection schemes for connecting them) together with their optimized implementations.
The challenges and innovations in the compiler technology we have developed lie in fast algorithms that analyze the candidate loop to be customized, and then assemble the execution vehicle best-suited for it using the menu of available (prebuilt) components from the proprietary instruction-set architecture choices. The approach involves taking the application kernel such as the body of a loop, and matching it to the architecture's elements: data path, storage and interconnect.
The resulting elements are then chosen from the library and assembled to yield the net-list implementing the kernel under consideration. During assembly, the geometric constraints that dictate the behavior of the assembled net-list on the FPGA are taken into account, and optimizations aimed at improving that behavior are applied.
The compute-intensive kernel of the application is translated into an intermediate representation, typically a program dependence graph. In so doing, the compute-intensive components, invariably loop bodies that are candidates for customization, are first parallelized at two levels: at the level of loops, by applying classical transformations such as distribution, unrolling, tiling and others, and at the fine-grained instruction level, to enhance and help leverage instruction-level parallelism, or ILP.
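To show what one of these classical transformations looks like at the source level, the following sketch unrolls a dot-product loop by a factor of four. It is written by hand for illustration, with hypothetical function names; a compiler would apply the same idea to its intermediate representation rather than to C text. The unrolled version keeps four independent partial sums, so the four multiply-accumulate operations in each iteration no longer depend on one another and can run in parallel.

```c
#include <stddef.h>

/* Original loop: a single accumulator, so every addition has to wait
 * for the previous one to finish. */
long dot_simple(const int *a, const int *b, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (long)a[i] * b[i];
    }
    return sum;
}

/* Unrolled by 4 with independent partial sums. The four statements in
 * the main loop body have no dependences on each other, which exposes
 * instruction-level parallelism. */
long dot_unrolled4(const int *a, const int *b, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += (long)a[i]     * b[i];
        s1 += (long)a[i + 1] * b[i + 1];
        s2 += (long)a[i + 2] * b[i + 2];
        s3 += (long)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) {          /* leftover iterations when n % 4 != 0 */
        s0 += (long)a[i] * b[i];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same value for integer inputs; only the shape of the dependence graph differs. On a reconfigurable target, the unrolled form could in principle be served by four multipliers working side by side, at the cost of more area.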
The resulting parallelized pieces of code are further analyzed and optimized for deployment in custom form on the target reconfigurable platform. To deploy these parallel loop bodies in custom form, we've adopted a two-tiered approach. First, the highly parallel loop body is instantiated using the prebuilt instruction-set architecture elements with novel enhancements to technology referred to in the compiler construction world as code generation. Here, a challenge is to ensure that the architecture repertoire is adequate from the perspective of providing customized solutions to any embedded application that is likely to be encountered.
Consequently, significant time and effort have been invested in developing such a repertoire or library of designs, which can then be selected to best suit the needs of any given loop body. At the end of this step, the operators of the program graph will all be assigned appropriate computational units from the prebuilt repertoire, as well as storage elements and mechanisms for interconnecting them to maintain and transfer the data being computed.
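The first step can be pictured with a small sketch in C. Everything in it is an assumption made for illustration: the element names, the area and latency figures, and the greedy selection rule are invented, and the actual library and algorithms are proprietary and not described in this article. The sketch binds each operation of a loop body to the fastest prebuilt element of the right kind that still fits within an area budget.

```c
#include <stddef.h>

/* Operation kinds that may appear in a loop body's dependence graph. */
typedef enum { OP_ADD, OP_MUL, OP_LOAD, OP_STORE } OpKind;

/* One prebuilt library element. Names and numbers are invented. */
typedef struct {
    OpKind      kind;
    const char *name;
    int         area;     /* logic cells occupied (hypothetical unit) */
    int         latency;  /* clock cycles per operation */
} Element;

static const Element LIBRARY[] = {
    { OP_ADD,   "add32_ripple",     32, 2 },
    { OP_ADD,   "add32_fast",       60, 1 },
    { OP_MUL,   "mul16_array",     250, 3 },
    { OP_MUL,   "mul16_pipelined", 300, 2 },
    { OP_LOAD,  "mem_read_port",    20, 1 },
    { OP_STORE, "mem_write_port",   20, 1 },
};
#define LIBRARY_SIZE (sizeof LIBRARY / sizeof LIBRARY[0])

/* Return the lowest-latency element of the given kind whose area does
 * not exceed area_left, or NULL if no such element exists. */
const Element *select_element(OpKind kind, int area_left)
{
    const Element *best = NULL;
    for (size_t i = 0; i < LIBRARY_SIZE; i++) {
        const Element *e = &LIBRARY[i];
        if (e->kind != kind || e->area > area_left) {
            continue;
        }
        if (best == NULL || e->latency < best->latency) {
            best = e;
        }
    }
    return best;
}

/* Greedily bind each operation, in order, to a library element.
 * Writes the choices into bound[] and returns the total area used,
 * or -1 if some operation could not be bound within the budget. */
int bind_kernel(const OpKind *ops, size_t n_ops, int area_budget,
                const Element **bound)
{
    int used = 0;
    for (size_t i = 0; i < n_ops; i++) {
        const Element *e = select_element(ops[i], area_budget - used);
        if (e == NULL) {
            return -1;
        }
        bound[i] = e;
        used += e->area;
    }
    return used;
}
```

A greedy pass like this is fast, but it can spend the budget on early operations and leave later ones without a fit. That is one reason a real tool would need cost models that weigh speed, size and power against each other.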
Then, as a second crucial step, a variety of compiler optimizations are performed that take the results of the first step and "map" them onto the two-dimensional geometry offered by the surface of the reconfigurable part of the target hardware platform, such as a CSOC. The result is a highly parallelized and optimized form of the computational kernel, delivered in some standard net-list format such as EDIF.
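A rough sense of what mapping onto a two-dimensional surface involves can be had from the sketch below. It is not the optimization used in our tools; it assumes a simple grid model, measures a placement by total Manhattan wire length, and improves it with a single pass of pairwise swaps. The type and function names are hypothetical.

```c
#include <stdlib.h>

typedef struct { int row, col; } Cell;   /* grid position of one element */
typedef struct { int from, to; } Wire;   /* indices into the placement   */

/* Total Manhattan length of all wires for a given placement. */
int wire_length(const Cell *place, const Wire *wires, size_t n_wires)
{
    int total = 0;
    for (size_t i = 0; i < n_wires; i++) {
        const Cell a = place[wires[i].from];
        const Cell b = place[wires[i].to];
        total += abs(a.row - b.row) + abs(a.col - b.col);
    }
    return total;
}

/* One pass of pairwise-swap improvement: exchange the positions of two
 * elements and keep the exchange only if it shortens the total wire
 * length. Modifies place[] in place and returns the final length. */
int improve_by_swaps(Cell *place, size_t n_elems,
                     const Wire *wires, size_t n_wires)
{
    int best = wire_length(place, wires, n_wires);
    for (size_t i = 0; i < n_elems; i++) {
        for (size_t j = i + 1; j < n_elems; j++) {
            Cell tmp = place[i];
            place[i] = place[j];
            place[j] = tmp;

            int len = wire_length(place, wires, n_wires);
            if (len < best) {
                best = len;              /* keep the improving swap */
            } else {
                tmp = place[i];          /* undo the swap */
                place[i] = place[j];
                place[j] = tmp;
            }
        }
    }
    return best;
}
```

Shorter wires generally mean less routing delay on an FPGA, which is why geometry matters to the speed of the assembled design. A production flow would also have to account for routing congestion, timing and the designer's constraints, none of which this sketch models.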
Again, to keep compilation times and hence time-to-market low, a key concern is to ensure that the above-mentioned two steps be realized using fast algorithms. We have taken this concern into account, balancing it against the need to ensure the resulting mappings and instantiations are indeed efficient along the various dimensions of customization of interest such as power, time and size of the resulting design.
This is achieved by a range of fast two-dimensional optimizations that explore and map the results of the first step onto the surface of the reconfigurable part, so that the constraints and expectations imposed by the designer are met.
In so doing, these optimizations leverage existing knowledge from the substantial area of compiler optimizations for ILP, whose commercial form is embodied in the processors co-developed by Hewlett-Packard Co. and Intel Corp. as part of the IA-64 program and referred to as Explicitly Parallel Instruction Computing (EPIC). They also rely on cost models we've developed that guide the adaptation and further innovation of these techniques to be relevant in the context of reconfigurable hardware.