System Design Methodologies for System on Chip and Embedded Systems
Abstract:
Several system-level design exploration methodologies exist that help designers transform a high-level specification into an implementation on a SoC or embedded system. These methodologies are starting to show up in EDA tool feature lists, but do not yet receive the attention they need. At IMEC we have a research tradition in application-specific system design methodologies. This paper gives an overview of the results obtained so far, together with a roadmap of the challenges that lie ahead.
1 Drawing the scene
SoC (System-on-Chip) based products and embedded systems are rapidly growing in complexity. On one hand we notice an enormous growth in the complexity of user requirements, such as flexibility, multiple functionality, autonomy and cost. On the other hand we observe a continuous improvement in the capabilities of the underlying hardware: the number of transistors per chip increases at a tremendous pace, and the clock speeds of electronic devices keep improving at an astonishing rate.
Unfortunately, the mapping efficiency of applications does not scale; it decreases continuously. As an example, a 2 GHz Pentium 4 does not perform linearly better than its 66 MHz 486 ancestor, despite containing many more transistors and having a significantly higher power density. For embedded systems, where additional constraints such as power consumption are very important, the picture looks even worse.
To make the problems harder still, entering the deep sub-micron technologies (90 nm and below) confronts the design community with new challenges: interconnect is becoming a key bottleneck for chip performance, and managing the leakage of the active devices is becoming essential for power-efficient systems.
Despite these difficulties, system champions have in the past still been able to obtain reasonably good results in a reasonable design time. Their success is based on years of experience, extensive know-how and intuition. In the future, however, even these champions will need to be assisted by new technologies and by established, proven design methodologies that guide designers through the complex and tedious job of designing complex embedded systems. A further problem is the scarcity of such system champions: to increase the efficiency and reduce the lead time of these designs, intelligent methodologies supported by the necessary design tools are a must.
2 A focused approach to resolve bottlenecks
Research at IMEC on design methodology has always focused on the development and demonstration of design methods that enable the design of advanced SoC-based products and embedded systems while optimizing design quality and design productivity. We focus on the trajectory between a given abstract application specification and its mapping onto an ASIC or platform architecture, as described in Figure 1. An application specification is often written in an executable language, like C++ or MATLAB, with the emphasis on the functionality and the proof of concept of the algorithm. We call this a "dirty" high-level application description, because it is written from a functional viewpoint and does not take the future implementation in an embedded system or SoC into account. By means of code-structure changes, optimized code is generated with emphasis on speed, energy consumption, an efficient memory hierarchy and so on, within constraints like real-time performance and QoS. Cleaning the application description ultimately results in a neatly multi-threaded description that can run on multi-processor platforms and that has an optimized memory architecture.

The designer then has to implement the function of each of these threads. By inserting constructs that take notions of time (registers), bit-widths, inter-processor messages and tokens into account, a complete multi-process system can be described and simulated, including processes running on hardware accelerators or instruction-set processors (ISPs). It is important that the designer keeps a continuous view on the correctness of his design by comparing it against a reference specification. Performance also has to be monitored, so models at the different levels of refinement must be available to give the designer an idea of the key characteristics of his design.
Figure 1: The abstract design flow
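As a small, hypothetical illustration of such a refinement step (the bit-widths, scaling and names below are illustrative assumptions, not the output of any IMEC tool), consider one filter tap: first as a purely functional operation, then annotated with a bit-true width and an output register that models one clock cycle of delay:

```cpp
#include <cstdint>

// Functional view: unconstrained integer arithmetic, no notion of time.
int tap_functional(int x, int c) { return x * c; }

// Refined view (sketch): a 16-bit bit-true datapath with an explicit
// output register, modeling one clock cycle of latency.
struct TapRefined {
    int16_t reg = 0;                           // output register (state)
    // Called once per clock tick: returns last cycle's registered
    // result and computes the next bit-true value.
    int16_t clock(int16_t x, int16_t c) {
        int16_t out = reg;
        int32_t p = int32_t(x) * int32_t(c);   // full-precision product
        reg = int16_t(p >> 8);                 // rescale/truncate to 16 bits
        return out;
    }
};
```

Simulating both views against the same stimuli is what lets the designer keep the continuous correctness check against the reference specification mentioned above.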
One of the major design challenges for a system designer is the partitioning of the different threads into 'software' (SW), i.e. functions executed on one or more ISPs, and 'hardware' (HW), i.e. dedicated hardware accelerators. One solution to this partitioning problem is trial and error. A better approach is to describe the different functional units in a unified language, and to make the choice between HW and SW only at the last moment, when the timed, bit-true description of all functional blocks is known.
And because the outside world is not digital, analog modules are often needed to interface with it. They add significantly to the complexity of modeling, simulation and verification.
Co-design of analog and digital modules requires, at a high abstraction level, a methodology to model and simplify the analog modules so that they can be simulated efficiently together with the digital modules.
Another problem is the co-habitation of digital and analog modules on the same die in SoC systems. Digital circuits are very robust to parasitic signals, but they inject noise into the substrate of the SoC, and this noise has a significant influence on the analog circuits. Being able to model these noise sources, and to understand how the noise reaches the sensitive analog circuits, allows the designer to make choices and trade-offs that make his SoC more robust.
At IMEC, the design flow described above is applied in several application domains, such as broadband wireless networks and image coding for advanced multimedia applications, and several of our industrial affiliates and partners have used it in collaboration with IMEC.
3 Overview of existing research results
3.1 The methodology chain
IMEC, as a research institute, has decades of experience in optimizing the design flow described in the previous section. The design flow of Figure 1 maps onto a set of IMEC design technologies, each addressing a particular design problem in that flow. Figure 2 shows the logical chain of all these tools, some mature, some still in a basic research phase. In the remainder of this section we focus on each of them; the availability of these tools and methodologies is discussed in Section 5.
Figure 2: The IMEC tool flow
3.2 Data Transfer and Storage Exploration (DTSE)
The goal of this research is to improve the efficiency (e.g. in terms of power consumption and memory footprint) of applications, both for custom architectures (ASICs) and for programmable processor platforms with predefined memory organizations (including, e.g., caches and SDRAMs).
The need to optimize application descriptions for memory access at a high abstraction level is driven by the fact that, for data-intensive applications, the data transfer and storage related actions are significant power drains. We also notice that processor clock speeds are increasing significantly over time, while the access speed of external memory lags behind. As a consequence, many applications today are no longer limited by their computational performance (Gops), but by the accesses to (shared) memory to store and retrieve data. By taking the memory hierarchy and data transfer and storage into account in an early phase, these memory bottlenecks can be removed and better overall design results obtained.
The DTSE methodology consists of a set of techniques that comprise different orthogonal optimization steps. The main objective is to offer a system designer a road-book for applying these steps in the right order, maximizing the effect of each step. Some steps are without tool support; other, more complex steps are supported by research prototypes or by more 'mature' prototype tools (mature meaning, in a research context, that scientists other than the original inventor can use them effectively). The mature tools are grouped in the ATOMIUM tool suite.
ATOMIUM (A Toolbox for Optimizing Memory I/O Using geometrical Models) operates at the behavioral level of an application, expressed in C. The output is a transformed C description, functionally equivalent to the original one, but typically leading to strongly reduced execution times, memory size and power consumption.
The different steps in the DTSE methodology are:
Analysis (ATOMIUM/Analysis): a profiling and pruning tool that quickly identifies which code in a complex application is actually used, and in which parts of the code most memory accesses are made (a toy sketch of this kind of instrumentation follows the list below). Typical questions that are answered by the tool are:
- which data structures and arrays are accessed most, and from which functions;
- which functions, or parts of functions, require the most memory accesses;
- how large the peak memory usage is, and at which point in the code it occurs.
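The toy instrumentation below illustrates what is being counted (hypothetical macros and counters written for this article; ATOMIUM/Analysis itself works on unmodified C code):

```cpp
#include <cstdio>

// Per-array access counters, incremented on every read of that array.
static long acc_img = 0, acc_coef = 0;

#define READ(counter, arr, i) (++(counter), (arr)[i])

void filter(const int* img, const int* coef, int* out, int n, int taps) {
    for (int i = 0; i + taps <= n; ++i) {
        int s = 0;
        for (int t = 0; t < taps; ++t)
            s += READ(acc_img, img, i + t) * READ(acc_coef, coef, t);
        out[i] = s;
    }
}

int main() {
    int img[64] = {}, coef[4] = {}, out[64];
    filter(img, coef, out, 64, 4);
    // Reports which arrays dominate the memory traffic.
    std::printf("img accesses: %ld, coef accesses: %ld\n", acc_img, acc_coef);
}
```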
Global loop transforms (methodology): mainly increase the locality and regularity of the accesses in the code. In an embedded context this clearly benefits memory size (area) and memory accesses (power), but also performance. This step should therefore be seen as a way to improve the quality of the code before the more detailed, but locally applied, conventional compiler/synthesis loop optimizations. This preprocessing also enables the later steps in the DTSE script.
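A minimal example of such a transformation, here a simple loop fusion (a generic textbook case, not a specific ATOMIUM output):

```cpp
// Before: two separate loops force the full intermediate array tmp[]
// to live in memory between the production and consumption phases.
void before(const int* a, int* out, int n) {   // n <= 1024 assumed
    int tmp[1024];
    for (int i = 0; i < n; ++i) tmp[i] = a[i] * a[i];
    for (int i = 0; i < n; ++i) out[i] = tmp[i] + 1;
}

// After loop fusion: each tmp value is consumed immediately after it
// is produced, so the array collapses to a scalar -- less memory size
// and far fewer memory accesses.
void after(const int* a, int* out, int n) {
    for (int i = 0; i < n; ++i) {
        int t = a[i] * a[i];
        out[i] = t + 1;
    }
}
```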
Memory Hierarchy Layer Assignment (research tool): decides how to optimally assign the application data to the various memory hierarchy layers such that data reuse is maximized and costly redundant inter-layer transfers are minimized. It takes the constraints of the ATOMIUM/MA step into account to guarantee a solution that meets the real-time constraints of the application.
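The hand-written, hypothetical sketch below shows the flavor of the resulting code: a heavily reused array is copied once into a buffer that would be assigned to a small, cheap memory layer:

```cpp
#include <cstring>

// Data-reuse copy (sketch): the coefficient array is read taps times
// per output sample, so it is copied once into a small local buffer
// (think scratchpad layer); the repeated inner-loop reads then hit the
// cheap layer instead of the large background memory.
void convolve(const int* img, const int* coef_bg /* background memory */,
              int* out, int n, int taps) {        // taps <= 16 assumed
    int coef_local[16];                           // local memory layer
    std::memcpy(coef_local, coef_bg, taps * sizeof(int));
    for (int i = 0; i + taps <= n; ++i) {
        int s = 0;
        for (int t = 0; t < taps; ++t)
            s += img[i + t] * coef_local[t];      // reuse from local copy
        out[i] = s;
    }
}
```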
Memory Architect (ATOMIUM/MA): allows designers to explore the effects of timing constraints on the required memory architecture and translates the timing constraints into optimized memory-architecture constraints. This information is very useful when defining a power- and area-efficient memory architecture that provides sufficient storage bandwidth to meet the application's timing constraints.
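A deliberately simplified, hypothetical calculation (all numbers made up) of the kind of trade-off involved: translating a real-time constraint into a minimum number of parallel memory ports:

```cpp
#include <cstdio>

// Back-of-the-envelope sketch: how many parallel memory ports are
// needed to sustain a given access count within a real-time budget?
int main() {
    const long accesses_per_frame = 2'000'000;   // e.g. from the analysis step
    const double frame_rate_hz    = 30.0;
    const double clock_hz         = 100e6;
    const double cycles_per_frame = clock_hz / frame_rate_hz;
    // Each port serves one access per cycle; round up.
    const int ports = int((accesses_per_frame + cycles_per_frame - 1)
                          / cycles_per_frame);
    std::printf("minimum parallel ports: %d\n", ports);
}
```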
Memory Compaction (ATOMIUM/MC): allows designers to automatically optimize the storage order of multi-dimensional data with the goal of reusing memory locations as much as possible, hence reducing the required memory sizes and power consumption of the application. This is especially useful for today's memory-hungry multimedia applications. An additional benefit of the reduced memory-space requirements achieved by ATOMIUM/MC is that cache hit rates can increase considerably.
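A generic example of the underlying idea (not ATOMIUM/MC's actual algorithm): when only a sliding window of an intermediate array is alive at any time, its storage can be folded modulo the window size:

```cpp
// Before: a full-size intermediate array, although only the last
// `taps` values are ever live simultaneously.
void before(const int* x, int* y, int n, int taps) {  // n <= 4096 assumed
    int tmp[4096];
    for (int i = 0; i < n; ++i) {
        tmp[i] = x[i] * 2;
        if (i >= taps - 1) {
            int s = 0;
            for (int t = 0; t < taps; ++t) s += tmp[i - t];
            y[i] = s;
        }
    }
}

// After storage-order optimization: the array is folded into a small
// circular buffer, reusing memory locations.
void after(const int* x, int* y, int n, int taps) {   // taps <= 16 assumed
    int tmp[16];
    for (int i = 0; i < n; ++i) {
        tmp[i % taps] = x[i] * 2;
        if (i >= taps - 1) {
            int s = 0;
            for (int t = 0; t < taps; ++t) s += tmp[(i - t) % taps];
            y[i] = s;
        }
    }
}
```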
RACE (research tool): reduces the addressing overhead typically present in applications with complex memory access patterns.
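In the spirit of this step (a textbook strength-reduction example, not RACE's actual algorithm), the multiplications in a two-dimensional index expression can be replaced by incrementally updated addresses:

```cpp
// Before: three multiplies per iteration just to compute addresses.
void before(const int* a, int* out, int H, int W) {
    for (int row = 1; row < H - 1; ++row)
        for (int col = 0; col < W; ++col)
            out[row * W + col] = a[(row - 1) * W + col]
                               + a[(row + 1) * W + col];
}

// After: the linear address is updated incrementally; the addressing
// overhead reduces to simple increments.
void after(const int* a, int* out, int H, int W) {
    int addr = W;                                // start of row 1
    for (int row = 1; row < H - 1; ++row)
        for (int col = 0; col < W; ++col, ++addr)
            out[addr] = a[addr - W] + a[addr + W];
}
```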
SPRINT (research tool): transforms sequential C code into a set of communicating SystemC processes, where the assignment of code to processes is specified by the user.
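The sketch below, written by hand for this article, shows the style of output such a transformation produces: a sequential produce-then-consume computation split into two SystemC processes communicating over a FIFO (the process boundaries here are an assumed user choice):

```cpp
#include <systemc.h>
#include <iostream>

SC_MODULE(Producer) {
    sc_fifo_out<int> out;
    void run() { for (int i = 0; i < 8; ++i) out.write(i * i); }
    SC_CTOR(Producer) { SC_THREAD(run); }
};

SC_MODULE(Consumer) {
    sc_fifo_in<int> in;
    void run() { for (int i = 0; i < 8; ++i) std::cout << in.read() << '\n'; }
    SC_CTOR(Consumer) { SC_THREAD(run); }
};

int sc_main(int, char*[]) {
    sc_fifo<int> q(4);             // bounded FIFO between the processes
    Producer p("p"); Consumer c("c");
    p.out(q); c.in(q);
    sc_start();                    // blocking read/write synchronize them
    return 0;
}
```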
The DTSE methodology is based on research that started around 1990. It is covered by several patents and by over 50 papers at major international conferences and in journals. The most recent articles on the tools and their success stories are [1], [2] and [3].
3.3 OCAPI methodology
The implementation of embedded networked appliances requires a mix of processor cores and HW accelerators on a single chip. When designing such complex and heterogeneous SoCs, the HW/SW partitioning decision normally needs to be made before refining the system description. With OCAPI-xl, we developed a methodology in which the partitioning decision can be made anywhere in the design flow, even just prior to code generation for both HW and SW. This is made possible by a refinable, implementable, architecture-independent system description.
OCAPI-xl is a C++ class library targeting system design for heterogeneous HW/SW architectures. The executable model contains the functionality specified in a target-independent way. It can also contain the architectural properties, but both are specified independently.
During design space exploration, OCAPI-xl gives the designer simulation based performance feedback. This allows exploring multiple HW/SW partitioning alternatives and making a well-founded decision for the final implementation. The path down to implementation is based on incremental refinement and automatic code generation. It results in plain ANSI-C code for the software components and synthesizable register transfer HDL (SystemC, VHDL or Verilog) for the hardware parts.
OCAPI-xl provides the following distinguishing features:
- A unified modeling style in C++ to represent functionality at different levels of abstraction, independent of architecture and/or implementation decisions.
- A system model consisting of parallel communicating processes. For the behavior in the time domain, multiple models at various levels of accuracy can be chosen for each process.
- A software-like set of high-level communication primitives that allow describing a wide range of applications and that can be implemented both in hardware and in software.
- Mapping and remapping functionality onto the architecture does not require a code rewrite, so partitioning exploration can be carried out easily. To help the designer, a detailed but easy-to-use simulation-based performance report is generated in HTML format.
- The path from the unified model down to implementation relies on an automatic code generator. The hardware parts result in synthesizable register-transfer HDL (SystemC, VHDL or Verilog). The software part is dumped as compilable ANSI-C requiring only a minimal OS API for threads and thread communication.
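The code below is explicitly not the OCAPI-xl API; it is a generic plain-C++ sketch of the central idea behind such a unified model: processes interact only through message-passing primitives, so a process body stays valid regardless of whether it later becomes ANSI-C software or RT-level hardware.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical communication primitive: a blocking message channel.
template <typename T> class Channel {
    std::queue<T> q; std::mutex m; std::condition_variable cv;
public:
    void put(T v) { { std::lock_guard<std::mutex> l(m); q.push(v); } cv.notify_one(); }
    T get() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&]{ return !q.empty(); });
        T v = q.front(); q.pop(); return v;
    }
};

int main() {
    Channel<int> ch;
    // Two "processes"; either body could later be mapped to HW or SW
    // without rewriting it, since they only touch the channel.
    std::thread producer([&]{ for (int i = 0; i < 4; ++i) ch.put(i * 3); });
    std::thread consumer([&]{ for (int i = 0; i < 4; ++i) std::printf("%d\n", ch.get()); });
    producer.join(); consumer.join();
}
```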
The mature class library and supporting environment have been used at IMEC for several implementations, such as a fully autonomous reconfigurable web camera [4], a MAC controller [5], an ADSL modem [6] and many others. Additional information can be found in [7].
3.4 DISHARMONY
Analog circuits are inherently nonlinear, which complicates simulation. IMEC developed a modeling environment, called DISHARMONY, that can compare the relevance of the different nonlinearities in analog designs. This gives a system designer insight into the nonlinear behavior of his design and helps him simplify the model; simulations are consequently sped up thanks to the reduced number of nonlinearities. Reliable high-level simulations of analog modules require models for the front-end blocks that take into account signal-degrading effects such as noise, impedance loading and nonlinear behavior. Here, too, DISHARMONY assists the designer.
Simulation-based models can be constructed in the DISHARMONY modeling environment starting from a circuit netlist. The resulting model describes frequency dependence in conjunction with nonlinear behavior.
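To give a feel for the kind of comparison involved (standard weakly-nonlinear analysis with made-up coefficients, not DISHARMONY's internal machinery): for y = a1*x + a2*x^2 + a3*x^3 driven by a sine of amplitude A, the harmonic distortion ratios are HD2 = a2*A/(2*a1) and HD3 = a3*A^2/(4*a1).

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double a1 = 10.0, a2 = 0.2, a3 = 0.05;  // hypothetical fit
    const double A  = 0.5;                        // input amplitude
    const double hd2 = std::fabs(a2 * A / (2.0 * a1));
    const double hd3 = std::fabs(a3 * A * A / (4.0 * a1));
    std::printf("HD2 = %.2e, HD3 = %.2e\n", hd2, hd3);
    // If HD3 << HD2 at the amplitudes of interest, the cubic term can
    // be dropped from the simplified model, speeding up simulation.
}
```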
DISHARMONY is currently a research tool that has been used intensively to support IMEC's research in 5GHz wireless broadband front-end architectures. A description of its powerful possibilities can be found in [8].
3.5 Front-end Architecture Simulator for digital Telecom applications (FAST)
FAST is an efficient simulator for analog and RF front-ends, described at the architectural level. It uses a special signal representation (multirate-multicarrier or MRMC) that can efficiently combine low-frequency analog circuits with RF circuits in one simulation. FAST is in essence a dataflow simulator in which the computations are scheduled such that the capabilities of the processor that is used for the simulations are fully exploited.
As an illustration, a FAST simulation of a 5GHz WLAN receiver front-end is ten times faster than a MATLAB implementation that uses the same MRMC signal representation; compared to MATLAB without the MRMC signal representation, FAST is 700 times more efficient.
Thanks to its dataflow nature, FAST can be coupled fairly easily with a digital simulator. In this way, FAST has been coupled with the digital modeling and simulation environment OCAPI-xl. As an example, a complete end-to-end simulation of a WLAN link, taking into account a complete receiver front-end and a complete receiver modem, is possible at a reasonable CPU time (0.35 seconds per OFDM symbol).
The performance of a complete telecom link is often quantified by the bit-error rate (BER). This measure is typically determined with lengthy Monte Carlo simulations. IMEC has developed a methodology that reduces the CPU time needed to determine the BER by more than two orders of magnitude compared to Monte Carlo simulation. The methodology is specific to telecom applications that use a multi-carrier modulation scheme; examples of such applications are 5GHz WLAN and ADSL. Using this methodology, the BER of the WLAN link referred to above is determined in one hour with the coupled FAST-OCAPI environment and the models for the analog and digital blocks.
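For contrast, a plain Monte Carlo BER estimate looks like the generic sketch below (an elementary BPSK-over-AWGN example with an assumed noise level, not IMEC's accelerated method). It shows why the brute-force approach is lengthy: resolving a BER around 1e-4 with reasonable confidence already requires on the order of a million symbols.

```cpp
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.3);  // assumed sigma
    std::bernoulli_distribution bit(0.5);
    const long n = 1'000'000;                          // transmitted bits
    long errors = 0;
    for (long i = 0; i < n; ++i) {
        const double tx = bit(rng) ? 1.0 : -1.0;       // BPSK symbol
        const double rx = tx + noise(rng);             // AWGN channel
        if ((rx > 0.0) != (tx > 0.0)) ++errors;        // threshold detector
    }
    std::printf("BER ~ %.2e after %ld bits\n", double(errors) / n, n);
}
```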
The FAST tool was a key enabler in the chip-package co-design of integrated receivers and in optimizing our digital IQ-mismatch compensation techniques. We refer to [9] for more detailed information.
3.6 Substrate Waveform ANalysis Tool (SWAN)
The different aspects of substrate-noise coupling are:
- Noise generated by switching digital circuits.
- Propagation of the noise through the substrate.
- Impact of the noise on the analog circuits.
In the last few years, IMEC has developed a methodology to analyze the switching noise generated in digital circuits of practical size on low-ohmic substrates. This methodology, called SWAN, generates a simulation model for the complete digital part, comprising the package as well as a combination of macro models for all standard cells in the design.
The high-level substrate-noise simulation methodology can be divided into three steps:
- Automatic library characterization: this step consists of the creation of substrate macro models for each digital gate. This characterization has to be performed only once for each new technology version of a digital standard-cell library.
- Extraction of the switching events: the switching events of the different gates are extracted and combined with the previously characterized macro-model currents to obtain the total bulk-injection currents and power-supply currents; in the same step, a substrate simulation model for the entire chip is created from the substrate macro models of the gates.
- The actual substrate-noise simulation: finally, the chip-level substrate model is combined with the total bulk-injection currents and the power-supply currents to simulate the substrate-noise generation of the entire digital circuit.
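A toy version of the second step (all waveforms and event times below are made up) illustrates the superposition at its core: every switching event adds a time-shifted copy of the corresponding macro-model current to the total injected current.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const double dt = 10e-12;                        // 10 ps time step
    // Pre-characterized macro-model injection current of one gate.
    const std::vector<double> macro = {0.0, 2e-4, 5e-4, 3e-4, 1e-4, 0.0};
    // Switching events (time-step indices) from a gate-level simulation.
    const std::vector<long> events = {3, 3, 7, 12};
    std::vector<double> total(32, 0.0);
    for (long e : events)
        for (size_t k = 0; k < macro.size() && e + k < total.size(); ++k)
            total[e + k] += macro[k];                // superposition
    for (size_t i = 0; i < total.size(); ++i)
        std::printf("%6.1f ps  %.2e A\n", i * dt * 1e12, total[i]);
}
```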
The SWAN methodology has been applied to verify several digital low-noise design techniques such as shaping of the power supply current, the use of decoupling capacitance between the digital VDD and ground, and the use of a separate substrate bias. More information is available in [10].
4 The future roadmap
Existing methodologies are being extended to new areas, like reconfigurable platforms, and more recently IMEC has been focusing on multi-threaded design problems, solving the process partitioning and scheduling dilemmas. Both are described below. Other research topics, like substrate-noise modeling in high-ohmic substrates or cache-optimization algorithms, are omitted here; information is available at www.imec.be.
4.1 High-level system modeling and mapping for reconfigurable systems
This research focuses on the development of methodologies and supporting tools to enable the modeling of future heterogeneous reconfigurable platform architectures and the mapping of applications onto these architectures. The objective is to assist the designer in making the correct trade-offs while mapping dynamic multimedia applications onto these platforms.
Given that future platforms will have embedded reconfigurable hardware, the research focuses on how applications can exploit this new possibility in an optimal way. Both the abstraction of the hardware platform and the unified representation of an application in HW and SW are studied, so that tasks can swap from one mode to the other at runtime without stopping the application. More information on this research is given in [11].
4.2 Task Concurrency Management
Applications are becoming increasingly dynamic: the load, both in terms of processing and bandwidth requirements, varies strongly as a function of time. As an example, variations in scene and objects in a 3D animation result in load differences of up to a factor of 30.
The increasing pressure to make systems compact and portable imposes important constraints. Hence (average) power reduction is a major design constraint and a differentiating criterion.
IMEC's Task Concurrency Management (TCM) approach dynamically minimizes the system-wide average power consumption as a function of the actual load of the system. The methodology starts from the idea that applications can be modeled as a set of concurrent tasks. A dynamic application is characterized by the fact that the number and nature of the tasks and their dependencies change heavily over time. At design time, many platform mappings are generated for each individual task.
The most promising mappings are represented in a Pareto curve that describes the optimal trade-off between energy consumption and execution time. These Pareto curves are used by a low-complexity run-time scheduler that maps the different tasks of a highly dynamic application in such a way that the total power consumption is minimized while the total execution-time constraint is respected.
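A deliberately small sketch of this scheme (two tasks with two mapping points each, all numbers hypothetical; the exhaustive search here stands in for TCM's low-complexity heuristic):

```cpp
#include <cstdio>
#include <vector>

// Design-time Pareto point: one platform mapping of a task.
struct Point { double time_ms, energy_mj; };

int main() {
    // Pareto curves per task: faster points cost more energy.
    std::vector<std::vector<Point>> tasks = {
        {{4.0, 2.0}, {2.0, 5.0}},   // task 0
        {{6.0, 1.0}, {3.0, 4.0}},   // task 1
    };
    const double deadline_ms = 8.0;
    // Run-time decision: pick one point per task minimizing total
    // energy under the global deadline.
    double best_e = 1e9; int best_i = -1, best_j = -1;
    for (size_t i = 0; i < tasks[0].size(); ++i)
        for (size_t j = 0; j < tasks[1].size(); ++j) {
            const double t = tasks[0][i].time_ms + tasks[1][j].time_ms;
            const double e = tasks[0][i].energy_mj + tasks[1][j].energy_mj;
            if (t <= deadline_ms && e < best_e) {
                best_e = e; best_i = int(i); best_j = int(j);
            }
        }
    std::printf("chosen points (%d,%d), energy %.1f mJ\n", best_i, best_j, best_e);
}
```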
More information about TCM is available in [12].
5 Design Methodologies at IMEC
As mentioned in the introduction, IMEC is a research institute whose mission is, among other things, to create know-how in the area of design technology. The design technology described above is the result, and the work in progress, of this research. But who can benefit from it?
First of all, IMEC has research programs in multimedia and broadband wireless applications. A significant amount of know-how has been built up together with key players in these areas from all continents. The model used for this collaboration is the IMEC Industrial Affiliation Program (IIAP). In these programs the partners have researchers present at IMEC, who have free access to all the tools mentioned above when needed for their research program.
A second objective of IMEC is to provide advanced, industry-oriented training on microelectronics and design methodologies. Both the DTSE and the OCAPI methodologies are taught on a regular basis. More information on schedules is available at www.imec.be/mtc.
Furthermore, we license the ATOMIUM tool suite as-is, until a commercial alternative is available. This is interesting for research teams of OEMs who want to experiment with the design methodologies and eventually integrate them in their design flow. Similarly, the OCAPI technology has been licensed to major partners for integration in their design flows.
The ultimate goal is to have an EDA vendor team up with IMEC to integrate one or several tools in its product portfolio, or to have a start-up or joint venture make real products out of these tools and bring them to the market. IMEC has followed that road several times in the past, CoWare and Target being two examples from a list of successful initiatives.
6 Conclusions
We have described the current status of the different design technology tools and methodologies developed at IMEC. They mainly focus on areas where EDA tool vendors are still hesitant to include such features in their product portfolios. The future design tools focus on the problems of tomorrow's wireless multimedia applications: how to manage complexity while keeping control of the power consumption of the application, even under extremely dynamic and flexible conditions.
These advanced tools are available to our research partners, and mature tools are also licensed to third parties until a commercial alternative is available.
7 References
[1] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, A. Vandecappelle, "Custom Memory Management Methodology -- Exploration of Memory Organisation for Embedded Multimedia System Design," ISBN 0-7923-8288-9, Kluwer Academic Publishers, Boston, 1998.
[2] F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P. G. Kjeldsberg, T. Van Achteren, T. Omnes, "Data Access and Storage Management for Embedded Programmable Processors," ISBN 0-7923-7689-7, Kluwer Academic Publishers, Boston, 2002.
[3] M. Miranda, F. Catthoor, M. Janssen, H. De Man, "High-level Address Optimisation and Synthesis Techniques for Data-Transfer Intensive Applications," IEEE Trans. on VLSI Systems, Vol. 6, No. 4, pp. 677-686, Dec. 1998.
[4] D. Desmet, P. Avasare, P. Coene, S. Decneut, F. Hendrickx, T. Marescaux, J.-Y. Mignolet, R. Pasko, P. Schaumont, D. Verkest, "Design of Cam-E-leon: A Run-time Reconfigurable Web Camera," Embedded Processor Design Challenges -- Systems, Architectures, Modeling, and Simulation (SAMOS), LNCS 2268, Springer, pp. 274-290.
[5] V. Nema, B. Van Poucke, M. Glassee, S. Vernalde, "Flexible System Model and Architectural Exploration for HIPERLAN 2 DLC Wireless LAN Protocol," International Conference on Communications, Anchorage, USA, May 2003.
[6] D. Desmet, M. Esvelt, P. Avasare, D. Verkest, H. De Man, "Timed Executable System Specification of an ADSL Modem using a C++ based Design Environment: A Case Study," International Conference on Hardware/Software Codesign, Rome, Italy, May 1999, pp. 38-42.
[7] G. Vanmeerbeeck, P. Schaumont, S. Vernalde, M. Engels, I. Bolsens, "Hardware/Software Partitioning for Embedded Systems in OCAPI-xl," CODES'01, Copenhagen, Denmark, April 2001.
[8] P. Dobrovolný, P. Wambacq, G. Vandersteen, D. Hauspie, S. Donnay, M. Engels, I. Bolsens, "The effective high-level modeling of a 5GHz RF variable gain amplifier," NDES 2001 (Nonlinear Dynamics of Electronic Systems), Delft, The Netherlands, 21-23 June 2001.
[9] G. Vandersteen, P. Wambacq, S. Donnay, Y. Rolain, W. Eberle, "FAST -- an efficient high-level dataflow simulator of mixed-signal front-ends of digital telecom transceivers," in Low-Power Design Techniques and CAD Tools for Analog and RF Integrated Circuits, P. Wambacq, G. Gielen and J. Gerrits (eds.), Kluwer Academic Publishers, 2001.
[10] M. Badaroglu, M. van Heijningen, V. Gravot, S. Donnay, H. De Man, G. Gielen, M. Engels, I. Bolsens, "High-Level Simulation of Substrate Noise Generation from Large Digital Circuits with Multiple Supplies," DATE 2001 (Design, Automation and Test in Europe Conference).
[11] J.-Y. Mignolet, S. Vernalde, D. Verkest, R. Lauwereins, "Enabling hardware-software multitasking on a reconfigurable computing platform for networked portable multimedia appliances," Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), pp. 116-122, Las Vegas, June 2002.
[12] P. Yang, C. Wong, P. Marchal, F. Catthoor, D. Desmet, D. Verkest, R. Lauwereins, "Energy-aware runtime scheduling for embedded-multiprocessor SOCs," IEEE Design & Test of Computers, Vol. 18, No. 5, pp. 46-58, 2001.