Selecting an embedded MCU: How to avoid the evaluation trap?

Didier MAURER, Aurélie DESCOMBES, Dolphin Integration

Abstract: This article focuses on the difficulties encountered by SoC integrators when selecting an embedded microcontroller (MCU). The selection is based on MCU performances, but the comparison becomes difficult, and may even be compromised, once all the parameters influencing these performances are considered.

When a SoC integrator has to select a microcontroller for his application, most of the time he relies on his own previous experience rather than on a rational assessment, as no standard benchmarking process has emerged for selecting an embedded microcontroller. The reason is that there are many criteria to consider, and it is difficult to achieve a fair comparison between the different products available. Experience shows that the main criteria to consider when selecting a microcontroller are, on the one hand, speed, area, power consumption, computing power and code density and, on the other hand, the quality and maturity of the development and debug tools.

Given the complexity of the parameters to be taken into account (technology, Process/Voltage/Temperature conditions, standard cell library, tool versions...), and knowing that suppliers rarely provide exhaustive specifications of all these parameters when communicating the performances of a core, it is unlikely that an evaluator can make a meaningful comparison between different products based only on the data given in the datasheets. In this article we discuss the pitfalls to avoid when assessing the relevant criteria for the selection of an embedded MCU, and we propose some guidelines to make such assessments more straightforward.

POWER CONSUMPTION

The power consumption of a processor depends on many factors, such as the target technology (lithography size, process flavour, threshold voltages... see Table 1), the library of standard cells used to implement the core, and the execution activity imposed on the processor during power simulation. Beyond that, what is left unsaid is often more important than what is stated explicitly. Consequently, caution and a judicious reading of the vendor datasheets are mandatory when comparing the power figures of competing processor IP: you will rarely find apples to compare with apples from the datasheets alone.

Table 1: Process option versus Power
Table 1 shows the great variability of power according to threshold voltages and process variant. In the given example (a NAND2 drive 2 gate), the dynamic power can be 43% higher in the LP process, but the leakage power can be up to 175 times lower! Furthermore, as leakage varies dramatically with temperature, leaving the reader unaware of the process and temperature corner used for the power estimation may lead to a wrong comparison, and thus to attributing artificial qualities to a processor.

Table 2 shows the impact of the standard cell library selection on power figures: depending on the library chosen, within the same lithography and the same process, the dynamic power may vary widely.

Table 2: Library options versus Power
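Because vendors rarely quote power at the same voltage, frequency and temperature, bringing two datasheets to common conditions requires at least a first-order normalization. Below is a minimal sketch of such a re-scaling, assuming the classic approximations that dynamic power scales with V²·f and that leakage roughly doubles every 10 °C; all numerical values are illustrative and are not taken from any real datasheet.

```c
#include <math.h>
#include <stdio.h>

/* First-order re-scaling of datasheet power figures to common conditions.
 * Assumptions (illustrative only):
 *  - dynamic power scales as V^2 * f
 *  - leakage roughly doubles every 10 degrees C
 */
static double scale_dynamic(double p_dyn, double v_sheet, double f_sheet,
                            double v_target, double f_target)
{
    return p_dyn * (v_target / v_sheet) * (v_target / v_sheet)
                 * (f_target / f_sheet);
}

static double scale_leakage(double p_leak, double t_sheet, double t_target)
{
    return p_leak * pow(2.0, (t_target - t_sheet) / 10.0);
}

int main(void)
{
    /* Hypothetical datasheet numbers: 6000 uW dynamic power measured at
     * 1.8 V / 50 MHz (i.e. 120 uW/MHz), 2 uW leakage measured at 25 C.
     * Re-scaled to the application conditions: 1.62 V, 40 MHz, 85 C. */
    double p_dyn  = scale_dynamic(6000.0, 1.8, 50.0, 1.62, 40.0); /* uW */
    double p_leak = scale_leakage(2.0, 25.0, 85.0);               /* uW */

    printf("dynamic: %.0f uW, leakage: %.1f uW\n", p_dyn, p_leak);
    return 0;
}
```

Such a normalization only makes the comparison less unfair; it cannot replace a power simulation in the real application conditions.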
Figure 1 shows the impact of the process corner on power specifications. Note that the PVT conditions applied for the measurement can decrease the reported power consumption by 30%.

Figure 1: Flip80251-Typhoon power consumption in different PVT conditions

While the power consumption values of a packaged processor (standard chip) necessarily take into account all the circuitry in the package, the power specifications of processor cores are based on simulations, and vendors are free to delete or ignore any number of power-dissipating functions when reporting power numbers. Some vendors do not include specific functions when they measure the power consumption. They often specify in their datasheets what they did not take into account; however, these functions are sometimes vital for the microcontroller. For example, when a supplier does not include the clock tree, the customer has to appreciate the consequences of this omission: the clock tree contains the gates operating at the highest frequency of the core, and it dissipates much of the CPU power. It is not possible to design a microcontroller without a clock tree. Obviously, it would be unfair to compare the power specifications of two processors while omitting the clock tree from one of them (see Figure 2).

Figure 2: Flip80251-Typhoon power consumption with/without function omission

Another example of a frequent omission: should the static consumption be considered? And even if the datasheet clarifies all the items listed above, one more important "detail" remains unclear in most datasheets: what is the processor doing while the power simulations run?

Table 3: Influence of processor activity on power figures
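The program executed during the power simulation matters: a core idling in a wait loop toggles far fewer nodes than one saturating its datapath. The fragment below is a hypothetical pair of stimulus routines that could be run during gate-level power simulation to bound the two extremes; it illustrates the principle and is not a vendor-prescribed test.

```c
#include <stdint.h>

/* Two extreme activity scenarios for gate-level power simulation.
 * Running both bounds the application's real consumption; a datasheet
 * figure is hard to interpret unless it states which kind of activity
 * was applied. */

/* Low-activity stimulus: the core spins waiting for a flag, exercising
 * little more than the fetch unit and the clock tree. */
volatile uint8_t wake_flag;
void idle_loop(void)
{
    while (!wake_flag)
        ;  /* busy-wait: minimal datapath switching */
}

/* High-activity stimulus: a multiply-accumulate kernel that keeps the
 * ALU, the register file and the memory interface toggling. */
uint32_t busy_loop(const uint16_t *a, const uint16_t *b, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (uint32_t)a[i] * b[i];  /* sustained MAC activity */
    return acc;
}

int main(void)
{
    static uint16_t a[256], b[256];
    wake_flag = 1;               /* skip the wait in this host build */
    idle_loop();
    return (int)busy_loop(a, b, 256);
}
```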
The data contained in datasheets give a global idea of the dynamic consumption, and thus help to get an overview of the consumption of competing solutions. But since no standardized power-benchmarking program for processor cores has emerged, and since the measurement conditions can be far from the customer's application conditions, it is unrealistic to obtain an accurate power consumption estimation for a microcontroller without running one's own power simulations.

AREA

When processor vendors communicate on area, there are also some omissions that can make the numbers meaningless. The MCU area depends a lot on the configuration considered: is it the processor-only configuration? If not, which peripherals are included? Which options? These questions are crucial when considering the MCU area.

Figure 3: Flip80251-Typhoon area

The number of gates is helpful to express the area independently of the library chosen for the synthesis. It is computed by dividing the core area after synthesis by the area of a reference gate. For all vendors this reference gate is a NAND2 gate, but is there only one NAND2 gate? The answer is no, because there are different subtypes of NAND2 gates, varying in size. The other point that makes gate count comparisons unfair is the variability of gate sizes between libraries: the size of the other cells relative to the NAND2 has a direct impact on the gate count. Imagine you are synthesizing a core with two libraries in TSMC 0.18 µm which differ only in the size of the NAND2 drive 1 and the size of one flip-flop. Library A has a NAND2 of 10 µm² and a DFF of 100 µm²; library B has a larger NAND2 (15 µm²) but a smaller DFF (95 µm²). Suppose that, for a given design, the number of NAND2s equals the number of DFFs. The total area would be identical with library A and library B; however, library A will give a gate count 50% larger than library B, even though it is the same core, synthesized with the same constraints in the same technology.

Therefore the silicon area is the figure that really matters: it represents the real cost for the designer and his customer. However, once again, this figure depends on many parameters: the lithography size and the process, the standard cell library used, the PVT conditions, the synthesis constraints (frequency, I/O delays, max capacitance, rest-of-SoC constraints, etc.), and whether the result is pre- or post-place-and-route. Considering all these parameters, it seems impossible to find area figures from different providers measured in exactly the same conditions. The customer will even have to ask the MCU provider for the detailed measurement conditions, because they are rarely indicated in the datasheets. The figures in the datasheets are therefore only useful for a first evaluation.

Table 4: Influence of library on core area
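To make the arithmetic of the gate-count trap concrete, here is a small sketch reproducing the example above; the cell areas are the hypothetical values from the text, and the instance count is arbitrary.

```c
#include <stdio.h>

/* Gate-count comparison for the two hypothetical libraries above:
 * same design (n NAND2s and n DFFs), same total silicon area, yet very
 * different "gate counts" because the NAND2 reference gate differs. */
int main(void)
{
    const int n = 10000;                 /* instances of each cell type */

    /* Library A: NAND2 = 10 um2, DFF = 100 um2 */
    double area_a  = n * 10.0 + n * 100.0;
    double gates_a = area_a / 10.0;      /* area / NAND2 reference */

    /* Library B: NAND2 = 15 um2, DFF = 95 um2 */
    double area_b  = n * 15.0 + n * 95.0;
    double gates_b = area_b / 15.0;

    printf("area A = %.0f um2, area B = %.0f um2\n", area_a, area_b);
    printf("gate count A = %.0f, gate count B = %.0f (+%.0f%%)\n",
           gates_a, gates_b, 100.0 * (gates_a - gates_b) / gates_b);
    return 0;
}
```

Both libraries yield exactly 1,100,000 µm² of silicon, yet library A reports 110,000 gates against 73,333 for library B: a 50% difference for the very same core.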
Moreover, the principal parameter impacting the area is the target clock frequency: between an area-oriented synthesis and a speed-oriented synthesis, the results will differ widely. This demonstrates that, when comparing values from datasheets, it is crucial to know whether the microcontrollers were synthesized under the same conditions.

Figure 4: Impact of target speed on silicon area

As shown above, even though frequency and area are strongly linked, the right way to define the target frequency is to look for the frequency that allows the critical part of the application program to be processed in the required amount of time. For instance, if the critical routine needs about 200,000 cycles and must complete within 10 ms, a 20 MHz core is sufficient. You can then assess the impact of this required frequency on area and power. Indeed, the goal is not to target the highest frequency achievable with a core, which would yield both a huge area and a high consumption, but to find the pair "processing power / core frequency" that provides the best area/power trade-off. In that sense, the processor frequency should be regarded as a means of achieving the other targets rather than as a goal in itself.

CODE DENSITY / PROCESSING POWER

The wide range of applications makes the embedded domain difficult to characterize. Embedded applications range from sensor systems built around simple MCUs to smartphones that offer almost the functionality of a desktop machine combined with support for wireless communications. Another particularity of the embedded world is that there is no significant legacy code base that would favour a standard instruction set architecture (ISA), as happened in the desktop world. This has led to a remarkable diversity of ISAs for embedded applications, which makes the selection of the benchmark program even more crucial for finding the best architecture for a particular application. The most frequent pitfall when evaluating code density or processing power is to underestimate the role of the selected benchmark in the reliability of the result. By not applying the right benchmark, you could select a processor which is not the most appropriate for your application, as shown in Table 5 below.
These benchmarks belong to the "industrial control" category of MiBench. So the question is: how to select the right benchmark for a meaningful appreciation of the processing power (and, consequently, of the code density)?

Since different application domains have different execution characteristics, a wide range of benchmark programs has been developed in an attempt to characterize these different domains. Most of these benchmarks target specific areas of computation. For instance, the primary focus of Dhrystone was to measure integer performance, LINPACK targets vectorizable computations, and Whetstone targets floating-point-intensive applications. Other benchmarks are available to stress network TCP/IP stacks, data input/output and other specific tasks.

Why is Dhrystone not good enough? To improve on the limited capabilities of such synthetic benchmarks, standardized sets of real application programs have been collected into various application-program benchmark suites. These real application programs can characterize how current applications will exercise a system more accurately than the other types of benchmark programs. However, to reduce the time required to run the entire set of programs, they often use artificially small input data sets. This constraint may limit the ability of the suite to accurately model the memory behaviour and I/O requirements of a user's application programs. Even with these limitations, though, these types of benchmark programs are the best developed to date.

There have been some efforts to characterize embedded workloads, most notably the suite developed by the EEMBC consortium and its academic equivalent, MiBench. Both have recognized the difficulty of using just one suite to characterize such a diverse application domain, and have instead produced a set of suites that typify workloads in specific embedded markets. For instance, the MiBench benchmark programs are divided into six suites, each targeting a specific area of the embedded market: Automotive and Industrial Control, Consumer Devices, Office Automation, Networking, Security, and Telecommunications. All the programs are available as standard C source code in order to be portable.

In summary: select the benchmark suite whose workloads are closest to your own application domain, rather than relying on a single synthetic kernel.
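In practice, the most reliable approach remains to compile and run the critical kernels of one's own application on each candidate core (or its instruction-set simulator). The sketch below shows a minimal, portable harness for doing so; read_cycle_counter() is a hypothetical hook that must be mapped to whatever cycle or timer counter the target provides (the stub here only lets the sketch compile on a host), and the small FIR kernel merely stands in for the user's real critical routine. The measured cycle count then feeds the frequency sizing described in the previous section.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical hook: on the real target, map this to the hardware cycle
 * counter or to the cycle count reported by the vendor's instruction-set
 * simulator. The stub below is a placeholder, NOT a real measurement. */
static uint32_t read_cycle_counter(void)
{
    static uint32_t fake;
    return fake += 1000;
}

/* Stand-in for the application's critical routine (here, a small FIR). */
static int32_t fir(const int16_t *x, const int16_t *h, int taps)
{
    int32_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

int main(void)
{
    static int16_t x[64], h[64];

    uint32_t t0 = read_cycle_counter();
    volatile int32_t y = fir(x, h, 64);   /* volatile: keep the call */
    uint32_t cycles = read_cycle_counter() - t0;
    (void)y;

    /* Sizing: required core frequency = cycles / deadline.
     * Example with a 1 ms deadline for the critical task: */
    double f_min_hz = (double)cycles / 1e-3;
    printf("%lu cycles -> at least %.2f MHz\n",
           (unsigned long)cycles, f_min_hz / 1e6);
    return 0;
}
```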
CONCLUSION

It is very easy to make a processor assessment that leads to a wrong conclusion because, at each step of the evaluation, there are many opportunities to compare data that were not measured under the same conditions. That is the reason why so many circuits embed not the most appropriate MCU, but merely a convenient one.