Prototyping ARM926 PrimeXsys-based SoCs
By Richard Newell, Aptix, CommsDesign.com
March 13, 2003 (6:51 p.m. EST)
URL: http://www.eetimes.com/story/OEG20030313S0016
The move to digital still camera, digital video, and other advanced multimedia capabilities is an inevitable step already being taken by the wireless sector. Camera phones with Internet access are becoming the norm today, with video-enabled phones beckoning. But to make these phones hum, designers need a processing platform optimized to handle both the traditional call-processing tasks of a mobile handset and the multimedia capabilities of next-generation wireless products. To that end, ARM unveiled its ARM926 PrimeXsys platform (PXP), which is gaining quite a bit of attention in the mobile space. As interest in PXP begins to swell, designers of system-on-a-chip (SoC) devices are tasked with making this processing platform bulletproof from an operational standpoint. To make that happen, engineers need to extensively validate the performance of their ARM9 PXP implementations before committing the core to silicon. But choosing the right prototyping method is not always as easy as it sounds. From hardware-based simulation to software-based simulation, there are myriad approaches designers can use to validate the performance of their ARM processing platform. While other methods exist, an FPGA prototyping scheme will prove to be the best option when implementing the ARM926 PXP. Here's why.
PXP: What's All The Hype About?
The ARM926 PXP is built around the ARM926EJ-S Harvard-architecture microcontroller core, which supports virtual memory, and it includes many of the more common peripherals. The PXP architecture is equipped with two processor buses, which are brought out to a six-layer AMBA AHB bus architecture that connects a four-port SDRAM controller and an SRAM controller (Figure 1).
A two-bus-master DMA controller performs DMA operations simultaneously with CPU instruction or data memory fetches. A special "MOVE" processor can accelerate pixel operations for video compression. The PXP has been optimized for low-power operation, consistent with its intended use in mobile and handheld applications.
PXP is more than just a mega hardware-IP core. ARM has worked with partners to ensure that several appropriate OSes are already ported to the PXP platform. Between ARM and its partners, a number of verification tools are available to speed hardware and software verification and SoC validation. Many of the ARM and partner tools have been designed to operate together. For instance, a JTAG ICE and a software development tool can be used together with an FPGA-based prototype running on a reconfigurable platform.
Validation Options
Since the OS is already ported to the core PXP platform, designers can concentrate on just the drivers required for their custom peripherals, and can start higher-level software development much sooner. Designers don't have to wait for silicon to arrive before starting work on the software.
So what are the options available for validation of your ARM software and ARM PXP-based hardware? Some of the common approaches considered include: software-based simulation; hardware-assisted simulation acceleration; the use of behavioral models for all or part of the design-under-test (DUT), including the special case of instruction-set simulators such as ARMulator to speed up simulation; hardware emulation using "mainframe" emulators; and an FPGA-based prototype.
Figure 2 shows the approximate relative performance and accuracy of the various options. Several of the technologies, both hardware- and software-based, run in the 100-kHz range. In modeling with behavioral models or instruction-set simulators, designers give up accuracy in order to get these speeds using general-purpose workstations and servers. Designers can improve accuracy by using full HDL or gate-level simulations. But if this option is used, the speed of the software-only approach drops so low that only small portions of the design can be considered in any one test. Additionally, simulating software running on the target hardware is almost out of the question.
The speed of simulating the ARM PXP by itself, let alone with a designer's own contributions to the SoC, may be measured in just tens or hundreds of hertz. Hardware-assisted simulation and emulation bring back the accuracy, but with no more speed than the lower-accuracy behavioral simulations.
An emulator may allow integration with external hardware, allowing for more rigorous testing than with simulation alone, but mainframe emulators are too expensive to use for software development. They are often a central company resource dedicated to hardware groups. In any case, 100 kHz is still too slow for serious software work.
A direct FPGA prototype is capable of higher speed than any of the other approaches, typically by one to two orders of magnitude, and it is at least as accurate as any of the other modeling approaches. By moving into the 10-MHz performance range, much more extensive software development is possible. Alternative approaches permit only low-level driver code snippets to be verified.
Through the FPGA-based approach, designers can run the actual operating system, all the firmware, and even application software at speeds usable for early development and debugging, long before the actual silicon is available. With the right tools, an FPGA prototype may be used both in a co-emulation mode, with cycle-accurate or transaction-level models running on a general-purpose workstation, and in an "in-circuit" mode attached to real-world hardware interfaces. Using an FPGA prototype for both early hardware integration and verification, and later for software integration and debugging, reduces risk in the SoC before tape-out, and allows much of the software to be available by the time the fabricated chip finally arrives.
The cost of an FPGA-based prototype is low enough to allow several copies or "replicas" to be built for use by your hardware and software developers, and to be placed with your key customers for integration in the next-level system.
FPGA Prototyping Options
If an FPGA prototype looks attractive for a project, designers will want to use the largest state-of-the-art RAM-based FPGAs with high pin-counts, lots of internal memory and other features best suited for prototyping SoC designs, such as those from the Xilinx Virtex-II family.
Prototypes are classified based upon the flexibility of the interconnect between the FPGAs: a fully programmable interconnect, a fixed predetermined interconnect, and a fixed fully custom layout. Depending on the gate and pin capacity of a project, a fixed one- or two-FPGA commercially available board may be suitable. Before investing in the fixed FPGA configuration, make sure that your system can really fit within the fixed architecture and capacity. Often, the number of primary I/O, available FPGA-to-FPGA nets, required external memory chips, or some other restriction imposed by the architecture is more limiting than the advertised gate capacity.
Don't confuse the way FPGA vendors and some board vendors count gates with what is required for estimating die size at a particular foundry. Designers will get more accurate results if they do memory capacity calculations in bits rather than equivalent transistors or gates. Compute the utilization based upon the actual memory organizations needed versus those that the FPGA technology provides.
For instance, if designers have a memory to model that is organized 20 x 1600 (32,000 bits), designers may be required to use three blocks of 36 x 512 memory (55,296 bits), for a utilization (in this example) of approximately 58%. Designers should be able to predict very accurately the internal FPGA memory utilization in this way, plus they will find which memories should be external. Even large FPGAs today only hold about 3 Mbits of memory.
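To make the arithmetic repeatable across a whole memory list, here is a minimal Python sketch of this utilization calculation. It assumes Virtex-II-style 18-Kbit block RAMs whose configurable aspect ratios (512 x 36 down to 16K x 1) allow a wide word to be split across slices; the function names are illustrative, not taken from any tool.

```python
from functools import lru_cache
from math import ceil

# Assumed 18-Kbit block RAM with Virtex-II-style aspect ratios
# (depth, width). These geometries match the Virtex-II family the
# article mentions; other FPGAs would need their own table.
CONFIGS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]
BLOCK_BITS = 18_432

def blocks_needed(width, depth):
    """Minimum blocks to build a width x depth memory, allowing the
    width to be split into slices that each use a different aspect
    ratio (e.g. 20 bits as an 18-bit slice plus a 2-bit slice)."""
    @lru_cache(None)
    def cost(w):
        if w <= 0:
            return 0
        # For each config, one column of ceil(depth/d) blocks covers cw bits.
        return min(ceil(depth / d) + cost(w - cw) for d, cw in CONFIGS)
    return cost(width)

def utilization(width, depth):
    """Fraction of allocated block-RAM bits actually used."""
    return (width * depth) / (blocks_needed(width, depth) * BLOCK_BITS)

# The 20 x 1600 memory from the text: three blocks, ~58% utilization.
print(blocks_needed(20, 1600))           # -> 3
print(f"{utilization(20, 1600):.0%}")    # -> 58%
```

Note that splitting the 20-bit word into an 18-bit slice and a 2-bit slice is what gets the example down to three blocks rather than the four a naive single-aspect-ratio tiling would require.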
The remaining logic in a design should be broken down into storage registers (flip-flops and latches) and combinatorial logic. Registers map one-to-one onto the FPGA's published internal register resources. To estimate the combinatorial logic, it is best to use historical averages from previous FPGA prototypes that were done using a similar design style. Lacking that data, designers might roughly estimate that a 4-input lookup table is the equivalent of about five or six two-input NAND gates, on average. Finally, apply a utilization factor for both the registers and lookup tables; in current-generation FPGAs this might be as high as 70 to 90%.
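Putting those rules of thumb together, the following sketch checks whether a block of logic fits a single FPGA. The 5.5-gates-per-LUT figure and 80% derating are assumptions drawn from the ranges above, and the FPGA resource counts in the example are hypothetical.

```python
# Rough logic-capacity check: can a block of the design fit one FPGA?
# All figures are assumptions within the ranges discussed in the text.
GATES_PER_LUT = 5.5   # ~5-6 two-input NAND gates per 4-input LUT
UTILIZATION = 0.80    # within the 70-90% achievable-utilization range

def fits(design_gates, design_registers, fpga_luts, fpga_registers):
    """True if a block's combinatorial gates and registers fit within
    an FPGA's LUT and register counts after derating."""
    luts_needed = design_gates / GATES_PER_LUT
    return (luts_needed <= fpga_luts * UTILIZATION and
            design_registers <= fpga_registers * UTILIZATION)

# Hypothetical example: 250k NAND-equivalent gates and 40k registers
# against a large FPGA with 67,584 LUTs and as many flip-flops.
print(fits(250_000, 40_000, 67_584, 67_584))  # -> True
```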
If the capacity calculation indicates that a design is near or over two FPGAs, the risks of using a board with a predetermined fixed-interconnect increase dramatically. A fixed interconnect must apportion the available I/O pins of each FPGA to nets going to each of the other FPGAs in the system, and perhaps to some multi-fanout nets, reserved test points, memory chips (which might or might not be useable in your design), and primary I/O connectors. With three or more FPGAs, the mix becomes highly splintered and it is unlikely that the mix of pin, net, and other resources selected by the board vendor will be useable for your design.
If a design is more than 500,000 equivalent "ASIC gates," designers may well decide they can't use a fixed-interconnect solution. If a designer is working with the ARM PXP core, a one- or two-FPGA solution is almost certainly not suitable, and the pre-defined interconnect in larger systems is probably too fractured to be useful.
The Other Options
Both fully reconfigurable systems and full-custom systems (a designer's other FPGA prototyping options) are very flexible. Pin and net resources can be allocated at board design time (or even just before downloading) to meet the requirements of the project. The capacity is ten FPGAs or more.
Many memory options are also available. So the decision as to which one (or both) to use hinges on other factors: quantity required, cost, schedule considerations, stability of the design, flexibility and use models available, useful life, EDA software, and more. Table 1 compares the target platforms considered here.
Table 1: Comparison of Reconfigurable, Fixed, and Custom FPGA Prototypes
It may at first appear that the cost of the reconfigurable system is much higher than that of a custom solution. Upon closer examination, when one considers the hidden costs of designing, fabricating, and debugging a custom board, and the earlier availability of the reconfigurable platform, the cost per unit per month is similar, unless the quantity of replicas required is high (over ten).
Reconfigurable systems can be leased for the term of a project, or re-used on the next project. In contrast, custom prototypes are likely to end up in the scrap bin at the end of the project. With cost almost removed as a factor for a low and moderate quantity of prototypes, the decision to build custom boards must be driven by other considerations, such as form factor, the desire for even higher performance, or very large quantities.
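To see why the per-unit-per-month comparison comes out close, consider the toy cost model below. All figures are entirely hypothetical and exist only to illustrate the crossover described above; the hidden NRE of a custom board (design, fabrication, debug) is folded into its up-front cost, and its later availability shows up as fewer useful months.

```python
def cost_per_unit_month(upfront, per_replica, replicas, months):
    """Total cost per prototype replica per month of useful availability."""
    return (upfront + per_replica * replicas) / (replicas * months)

# Hypothetical figures: a leased reconfigurable system has no NRE and is
# useful for the whole 12-month project; a custom board carries NRE and
# arrives months later.
recon = cost_per_unit_month(upfront=0, per_replica=100_000, replicas=4, months=12)
custom = cost_per_unit_month(upfront=350_000, per_replica=10_000, replicas=4, months=8)
print(f"reconfigurable: ${recon:,.0f}/unit-month")   # -> $8,333
print(f"custom:         ${custom:,.0f}/unit-month")  # -> $12,188
# At high replica counts (say, over ten) the custom NRE amortizes away
# and the custom board becomes the cheaper option, as the text notes.
```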
The reconfigurable target board has the advantage of being useable much earlier in the project, when the design is less stable. Being fully reconfigurable, different blocks and sub-systems can be mapped to FPGAs and partially verified before the full design is even available. Blocks can be "co-emulated" over a high-speed link to a workstation, with part or all of the design in the hardware and the remainder of the design and the test bench running on the workstation. Various languages and levels of abstraction are supported: VHDL, Verilog, and SystemC at behavioral, RTL, cycle-accurate, and transaction levels of abstraction, for example.
Mapping a PXP SoC Design to FPGAs
No matter what FPGA-based target system a designer chooses, making a direct prototype of a design will give the designer the advantages of accuracy and speed. However, FPGA technology has constraints relative to a full-custom IC design that must be considered. A small amount of planning during the design phase, knowing both the foundry technology and the FPGA technology, will make the process of mapping the design to the prototype hardware go more smoothly and quickly. This "design-for-prototyping" approach is exemplified by the codification of FPGA design-reuse guidelines by Xilinx, Mentor Graphics, and Synopsys in their jointly produced "FPGA Openmore" scoring system.
The ARM PXP has several features that affect FPGA mapping. Two of the top-level features of the ARM PXP are its power-saving features and its high performance. The power-saving features are implemented using several layers of clock gating and power-down control. One of the principal contributors to performance is the use of a multi-layer AMBA bus for a very high composite bus bandwidth.
First, let's consider the effects of clock gating with respect to a multi-FPGA implementation. FPGA vendors have done an admirable job of designing internal clock-distribution structures to drive tens of thousands of registers with the low skew required to avoid hold-time violations. Also, commercial multi-FPGA prototyping systems keep board-level clock skew under limits where there is no issue with hold time for FPGA-to-FPGA signals, and with care designers can do the same if they design their own boards.
However, a problem arises when two clocks, perhaps derived from a common source, are required to have closely coincident edges within the same FPGA. The two clocks will be placed on separate internal clock-tree resources, and it is almost impossible to guarantee that the edges will coincide within register hold-time requirements for minimum delay paths crossing between the two clock domains. Consider the impossible situation the FPGA vendors have: one clock tree may be loaded with ten thousand registers and the one next to it may have only one hundred loads. There is bound to be some clock skew in a case like this.
The solution is to remove most or all of the gated and derived clock domains in the prototype version of the design by transforming gated and derived clocks into synchronous clock enables, merging the clock domains together while preserving the functionality. Inserting delays, among other techniques, can fix any remaining hold-time issues. Fixing gated clocks is a good job for design-automation tools. Planning clock-gating logic and hierarchy carefully can make this job easier, whether done manually or through use of an EDA tool.
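To make the transformation concrete, the small cycle-level Python model below contrasts a register driven by a gated clock with its synchronous clock-enable equivalent. It is a behavioral sketch only (edge timing and glitch concerns are abstracted away), and the stimulus is made up for illustration.

```python
# Cycle-level model of the gated-clock-to-clock-enable transformation.
# In the gated form, the register sees a clock edge only when the gate
# is open; in the enable form, the register sees every edge of the one
# merged clock but updates only when the enable is asserted. Both
# produce identical register behavior, which is all the prototype needs.

def run_gated(stimulus):
    """Register clocked by (clk AND gate): updates only on gated edges."""
    q, trace = 0, []
    for d, gate in stimulus:
        if gate:              # a gated clock edge occurs this cycle
            q = d
        trace.append(q)
    return trace

def run_enabled(stimulus):
    """Register on the free-running clock with a synchronous enable."""
    q, trace = 0, []
    for d, enable in stimulus:
        q = d if enable else q   # every edge, but update only if enabled
        trace.append(q)
    return trace

# (data, gate/enable) per clock cycle; hypothetical stimulus.
stim = [(1, 1), (2, 0), (3, 1), (4, 0), (5, 1)]
assert run_gated(stim) == run_enabled(stim)
print(run_gated(stim))   # -> [1, 1, 3, 3, 5]
```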
AMBA Bus Challenges
The multi-layer AMBA bus presents a different challenge. Here, the problem is one of managing the interconnect between FPGAs.
In an SoC design, designers have many layers of interconnect to provide wide datapaths between highly separated regions. In a multi-FPGA implementation, however, designers are constrained by the number of I/O pins provided by the FPGA package. Since each FPGA also has a finite capacity (in gates, registers, memory, etc.), designers working with a PXP-based design will be required to partition their design into several FPGAs.
Good planning in the implementation and the hierarchical organization of the AHB data bus muxes can make the partitioning job much easier. A hierarchical partitioning tool can help find a partition that fits the FPGA capacity and pin constraints. In some more severe cases, multiplexing of signals may be required in order to fit the design in a reasonable number of FPGAs.
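As a rough feel for that constraint, the sketch below estimates the time-division multiplexing ratio required when the nets crossing an FPGA boundary outnumber the available pins; the net and pin counts in the example are hypothetical, not taken from any particular board.

```python
from math import ceil

def mux_ratio(crossing_nets, available_pins):
    """Signals that must share each pin (1 means direct connection).

    If a partition cut leaves more FPGA-to-FPGA nets than package
    pins, the excess must be time-multiplexed, which divides the
    effective prototype clock rate by roughly the same ratio.
    """
    return max(1, ceil(crossing_nets / available_pins))

# Hypothetical cut: wide multi-layer AHB datapaths plus control leave
# 900 nets crossing a boundary with only 400 usable I/O pins.
print(mux_ratio(900, 400))  # -> 3: three signals share each pin, so the
                            # prototype clock drops by ~3x on those paths
```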
It's important for designers to keep in mind that both hardware engineers and software developers will need some form of visibility into the PXP core or the custom logic for debugging purposes. To help out, a JTAG ICE interface with hardware and software breakpoint capability is built into the PXP core. By connecting a JTAG cable from the FPGA prototype to a PC (as seen in Figure 3), designers can load and examine memory and all the critical internal CPU registers. ARM also offers the Embedded Trace Macrocell for even more extensive logging of CPU operation.
FPGA mapping tools should also offer designers additional ways to see any signal in the prototype. With some tool sets it may be possible to cross-trigger between software and hardware events, making debugging easier.
Wrap Up
For many projects using the ARM PXP platform, an FPGA-based pre-silicon prototype can reduce risk prior to hardware tape-out, and allow serious software development to proceed several months before silicon and an evaluation board are available. The right choice of prototype target platform and appropriate tools, along with planning during the hardware design phase, can speed the availability of the prototype so its advantages to the project can be maximized.
About the Author
Richard Newell is the director of hardware product strategy at Aptix Corporation. He received a BSEE degree in 1976 and worked at Motorola, Rockwell International, and BEI Systron Donner before joining Aptix in 1996. Richard can be reached at richardn@aptix.com.