Competitive Advantages of the Mali Graphics Architecture
by Falanx Microsystems

Introduction

When mobile phones were first introduced, few could have predicted a future beyond simple text messaging. For 2004, however, research firm IDC projected sales of 93 million camera-enabled mobile phones, with projections for 300 million camera phones by 2007. Similarly, the Yankee Group reported that mobile users downloaded 49 million games in 2004, and projected that consumers will spend $1 billion on mobile games in 2008. In early 2005, Fox announced one-minute “mobisodes” (mobile episodes) of its hit thriller 24, while Verizon introduced a new multimedia service that will deliver news, sports, music videos and 3D games to mobile subscribers, with content providers including VH1, Comedy Central and NBC. Suddenly, users are asking their mobile phones to handle the same range of 2D, 3D and video formats that play on their desktop, and cell phone manufacturers must supply these capabilities to meet the revenue opportunities offered by 3D games and digital video.

While the demands may be similar, the environments for computer and mobile phone graphics couldn’t be more different. On the power-rich computer desktop, ardent gamers are combining two graphics boards, each with a separate power connection, for the most accelerated game play. In the mobile environment, however, chip size and battery life are paramount, and graphics technology must provide high visual quality, optimum performance and low power consumption. Clearly, the power-hungry graphics technologies that work on desktop computers are totally unsuited for mobile phones. This creates a new demand for graphics technology that allows mobile phone users to enjoy these new classes of content while meeting the size and power consumption requirements of mobile phone vendors. Falanx Microsystems’ Mali graphics architecture is designed to meet that demand.
This architecture requires the fewest gates and the smallest die size of any competing technology, so it can easily be integrated into the most compact mobile devices. It also draws the least power during normal operation, extending battery life. Despite the compact size and power efficiency, each Mali configuration includes bandwidth-efficient 4X Full Scene Anti-Aliasing (FSAA) for optimal 2D and 3D quality, and the unique (and patent pending) ability to reuse these 3D gates for video encoding and decoding. Falanx licenses the family of Mali IP Cores directly to equipment and System-on-a-Chip (SoC) vendors. Targeted at these customers and the trade press, this white paper details the operation and competitive advantages of the Mali graphics architecture.

Today, no one really knows if 3D games or video-related applications will contribute significantly to network revenue over the next few years. What is certain, however, is that mobile phones equipped with Falanx’s Mali technology are ideally equipped for both 3D and video. This makes Mali the obvious choice for vendors seeking to supply a cost-effective, high-performance platform for today’s market demands while “future proofing” their technology for potential revenue opportunities in the future.

Mali Technology Overview

The Mali architecture is a scalable technology for 2D and 3D graphics and video encode/decode acceleration. Originally designed in 1998 to meet the OpenGL® 2.0 standard, the Mali architecture has been streamlined for OpenGL® ES 1.1. The first generation of Mali cores, the Mali100 and Mali50, offers patent pending techniques for industry-leading 4X FSAA without memory spikes or performance degradation, optional 16X FSAA for leading image quality, and methods for re-using 3D gates for video encoding and decoding and for optimal performance at a low gate count.
The current line of Mali cores presents scalable solutions for different market segments, from entry-level mobile phones to high-end smart phones, game pads and set-top boxes, while maintaining binary compatibility with first generation cores, easing the upgrade cycle. As shown in Figure 1, there are two cores for rasterization (and video encode/decode) and one geometry processor for matching with either raster engine.
Briefly, the Mali55 offers the lowest gate count and is designed to be matched with ARM9 / ARM11 for transform and lighting (T&L). The Mali55 can also be matched with MaliGP for higher video performance and reduced power consumption for 3D graphics. The Mali110 is designed for high-performance 3D graphics and video, with performance matched to MaliGP. Finally, MaliGP is a programmable vertex shader/DSP architecture used for T&L and for accelerating several video encoding and decoding algorithms. Table 1 details the basic configuration and performance parameters of the Mali family. All Mali cores are designed for easy integration into existing SoC architectures, reducing time-to-market and system complexity for a fully featured 3D and video enabled SoC solution. All operations on the SoC bus are burst optimized with 8-burst or 16-burst transactions.

Table 1: Configuration and performance parameters for Mali family of graphics cores.
A. Most Compact and Least Expensive to Manufacture

As shown in Table 2, Mali requires fewer gates, less RAM and a smaller die size than the ARM MBX family of graphics cores. Note that the ARM figures in Table 2 do not include video decode, encode or a geometry processor, so the gate count and die size for these functions must be added to these totals for an accurate comparison.

Table 2: Comparative gate counts for ARM 3D and Falanx.
* All ARM figures from the ARM 3D Graphics Acceleration brochure downloaded from www.arm.com in December, 2004

Mali’s low gate counts and small die sizes offer several key competitive advantages, including the ability to produce smaller chips for compact cell phones, lower power consumption and ease of integration, the latter two discussed in more detail below. The most objectively measurable advantage, however, is the cost savings delivered by the smaller die size. Table 3 illustrates the cost savings an SoC vendor would realize by using the Falanx core rather than the PowerVR MBX core. Total added chip size for the ARM solution would be 4-8 mm², compared to 2-4.5 mm² for Falanx. Assuming a chip size without multimedia acceleration of 20 mm², and the largest die size for both ARM and Falanx, each wafer incorporating the Mali core would yield 358 more chips than a wafer incorporating ARM technology, resulting in a per chip saving of about $0.12. Over a life cycle of 10,000,000 chips, this would produce total savings of about $1.2 million.

Table 3: Comparative gate counts for ARM 3D and Falanx.
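The die-size argument above can be sketched as simple arithmetic. The following is a rough illustration only: the die areas, per-chip saving and life-cycle volume come from the text, while the 300 mm wafer diameter and the gross-die formula (wafer area divided by die area, with no edge loss or yield) are assumptions made here, since the paper does not state how its wafer figures were derived.

```python
import math

def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm):
    """Upper-bound die count: wafer area / die area, ignoring edge loss and yield."""
    wafer_area_mm2 = math.pi * (wafer_diameter_mm / 2) ** 2
    return math.floor(wafer_area_mm2 / die_area_mm2)

BASE_CHIP_MM2 = 20.0        # from the text: chip without multimedia acceleration
ARM_ADDED_MM2 = 8.0         # from the text: largest added die size, ARM solution
MALI_ADDED_MM2 = 4.5        # from the text: largest added die size, Falanx solution
WAFER_DIAMETER_MM = 300.0   # assumption: the wafer size is not given in the text

arm_dies = gross_dies_per_wafer(BASE_CHIP_MM2 + ARM_ADDED_MM2, WAFER_DIAMETER_MM)
mali_dies = gross_dies_per_wafer(BASE_CHIP_MM2 + MALI_ADDED_MM2, WAFER_DIAMETER_MM)
extra_dies = mali_dies - arm_dies

# Life-cycle saving, kept in whole cents to avoid floating-point rounding.
SAVING_PER_CHIP_CENTS = 12      # from the text: about $0.12 per chip
LIFETIME_CHIPS = 10_000_000     # from the text
lifetime_saving_usd = SAVING_PER_CHIP_CENTS * LIFETIME_CHIPS // 100

print(f"ARM dies/wafer:  {arm_dies}")
print(f"Mali dies/wafer: {mali_dies}")
print(f"Extra dies:      {extra_dies}")
print(f"Lifetime saving: ${lifetime_saving_usd:,}")
```

Under the assumed 300 mm wafer, the gross difference is 361 dies, in the same range as the 358 quoted above. The exact figure depends on wafer size, edge exclusion and yield, none of which are stated in this paper.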
Note that when comparing prices and size differences between IP cores, the critical parameters are the die size and wafer prices in a given process technology. Gate count can be presented in different ways, and different configurations of SRAM can have great impact on final die size. In a fair, apples to apples comparison, Mali is the most compact 2D, 3D and video encode/decode core available today, delivering substantial cost savings during manufacture, reducing initial NRE and minimizing power consumption in use. As we’ll see in the next section, Mali delivers this compact size without compromising 2D, 3D or video quality or performance.

B. High Performance, Comprehensive Feature Set

Despite Mali’s low gate count and small die size, the cores offer industry-leading 2D/3D quality and highly competitive video encoding/decoding. Here are the details:

Gate Free 30fps VGA Video Encoding/Decoding

To meet the end-user requirements for high quality mobile games and video, an SoC manufacturer needs to supply 2D, 3D and video encoding and decoding hardware. With other graphics core technologies, an SoC vendor must add separate cores for each of these functions, potentially from different vendors, contributing to higher gate counts and increased NRE. Falanx solves this problem with patent pending technology that enables 3D gates in the Mali core to perform most essential video encoding and decoding functions, significantly reducing the total cost of a multimedia enabled SoC (Figure 2). Specifically, Mali is designed to provide 30fps VGA resolution encode and decode for high-quality video services and personal video content, and 2-way conferencing at CIF resolutions.

Figure 2. Video functions performed by Mali cores.

The solution is flexible and scalable. For entry level video encoding/decoding, or if the SoC core already has a video capable DSP, the SoC manufacturer can integrate Mali55. For high performance smart phones, the SoC vendor can integrate both the MaliGP and Mali110.
Mali offers sufficient power for operations like embedding video streams into real-time graphics content, enabling many real-time effects and editing possibilities for video and images. Most importantly, as shown in Table 4, either Mali pixel processor, when configured with the ARM and MaliGP, can encode VGA resolution video into MPEG-4 format at 30 fps, providing full frame rate performance for video conferencing or video messaging. H.264 numbers are available on request.

Table 4: MPEG-4 Decode / Encode requirements.
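To give a sense of the workload behind the 30 fps VGA figure, the sketch below computes the raw pixel and macroblock rates a codec must sustain. It is illustrative arithmetic only: the 16x16 macroblock size is standard for MPEG-4 and H.264, but the 30 fps rate used for CIF is an assumption (the text gives a frame rate only for VGA), and the function name is hypothetical.

```python
MACROBLOCK_SIZE = 16  # MPEG-4 and H.264 use 16x16 luma macroblocks

def macroblocks_per_second(width, height, fps):
    """Macroblocks the codec must process each second at a given resolution."""
    # Round up so that partial macroblocks at the frame edge are counted.
    blocks_wide = -(-width // MACROBLOCK_SIZE)
    blocks_high = -(-height // MACROBLOCK_SIZE)
    return blocks_wide * blocks_high * fps

# VGA at 30 fps, as stated in the text.
vga_macroblock_rate = macroblocks_per_second(640, 480, 30)
vga_pixel_rate = 640 * 480 * 30

# CIF for two-way conferencing; 30 fps is an assumption, not from the text.
cif_macroblock_rate = macroblocks_per_second(352, 288, 30)

print(f"VGA: {vga_pixel_rate:,} pixels/s, {vga_macroblock_rate:,} macroblocks/s")
print(f"CIF: {cif_macroblock_rate:,} macroblocks/s")
```

VGA at 30 fps works out to 9,216,000 pixels and 36,000 macroblocks per second, roughly three times the CIF load at the same frame rate.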
Video messaging clearly represents a substantial new revenue opportunity for mobile networks, but to harvest that opportunity, mobile phones require full frame rate encode/decode functionality at VGA resolutions. Mali technology is the least expensive alternative available to SoC vendors seeking to power mobile phones with these capabilities.

Bandwidth and Power Efficient 4X FSAA

One might think that with the small, limited resolution LCD panels on mobile phones, image quality might not be important. Actually, the reverse is true, primarily because mobile phone users hold the display close to their eyes. In Graphics for the Masses: A Hardware Rasterization Architecture for Mobile Phones, authors Tomas Akenine-Moller and Jacob Strom compared the size and viewing distances for computer and mobile phones and concluded that “[t]hese display conditions implies that every pixel on a mobile phone should ultimately be rendered with higher quality than on a PC system.” http://www.cs.lth.se/home/Tomas_Akenine_Moller/pubs/masses.pdf.
The most noticeable quality deficit in 3D images and digital pictures is the “jaggies,” a deficit so widespread that it’s included in Webster’s jargon file (http://dictionary.reference.com/search?q=jaggies). Interestingly, after defining the jaggies as the “staircase effect observable when an edge … is rendered on a bitmap display,” Webster’s states that “the closer you are to the screen the easier you can spot artifacts due to under-sampling of geometry (aliasing).”

Almost all graphics core vendors use full scene anti-aliasing (FSAA) to smooth the jaggies. While undoubtedly effective (see Figure 3), FSAA is usually very memory and memory bandwidth intensive, because bandwidth requirements scale linearly with the number of samples per pixel used for anti-aliasing (4X the data for 4X FSAA). For this reason, most graphics core vendors offer only 2X FSAA or none at all, rather than the 4X/16X offered by Falanx.

Mali can offer 4X/16X FSAA courtesy of patent pending technology that enables 4X anti-aliasing with no hit to memory bandwidth or frame rate. Briefly, using immediate mode rendering-like techniques, the Mali architecture renders each primitive as it arrives, and generates each pixel and all the sub-pixels required for anti-aliasing on-chip. In contrast, traditional tile-based architectures basically render each pixel for each visible primitive. Since traditional anti-aliasing requires more sub-pixels to be rendered than rendering without anti-aliasing, the consequence is more bandwidth usage.

As described in Figure 4, the Mali architecture uses four parallel pipes, each processing one of the four sub-pixels. While this approach would seem to require increased gate count for the 4X FSAA renderer pipeline, that is not the case. Instead, by passing only the relative positions between the sub-samples down the pipeline, only thin data paths are required to handle the sub-pixel calculations.
For 4X FSAA, this is utilized to perform one averaged texture look-up per pixel, which is carefully filtered and applied to all the sub-pixels. This technique is commonly known as multi-sampling and does not introduce additional texture bandwidth (which is true for all architectures using this technique).
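The multi-sampling idea described above can be sketched in a few lines. This is a generic illustration of the technique, not Falanx's patent pending implementation: one texture colour is fetched per pixel, written to whichever of the four sub-pixels the primitive covers, and the sub-pixels are then averaged into the final pixel. The function names and the simple box-filter average are assumptions made for this sketch.

```python
def apply_fragment(coverage, texel_color, stored_samples):
    """Write one per-pixel texture colour to every sub-pixel the primitive covers.

    coverage       -- one boolean per sub-pixel (four entries for 4X FSAA)
    texel_color    -- (r, g, b) from a single filtered texture look-up
    stored_samples -- current (r, g, b) colour of each sub-pixel
    """
    return [texel_color if covered else old
            for covered, old in zip(coverage, stored_samples)]

def resolve_pixel(samples):
    """Average the sub-pixels into one displayed colour (box filter, an assumption)."""
    count = len(samples)
    return tuple(sum(sample[channel] for sample in samples) / count
                 for channel in range(3))

# A white triangle edge covering two of four sub-pixels over a black background.
background = [(0.0, 0.0, 0.0)] * 4
samples = apply_fragment([True, True, False, False], (1.0, 1.0, 1.0), background)
print(resolve_pixel(samples))  # (0.5, 0.5, 0.5): a mid-grey edge pixel
```

Because the texture is sampled once per pixel rather than once per sub-pixel, texture traffic is the same as without anti-aliasing; only the coverage test and the sub-pixel storage scale with the sample count.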
Finally, all the sub-pixels use on-board memory for z-reads and writes, thereby eliminating the linearly increasing bandwidth per sub-pixel introduced by traditional immediate mode rendering pipelines like those used by ATI and NVIDIA. For 16X, on-chip circuits up-sample the geometry to effectively produce 16 virtual sub-pixel pipelines. This technique does, however, add to the bandwidth usage, increasing power consumption and slowing the frame rate.

Figure 5 shows the impact of 4X and 16X FSAA on 3D performance in frames per second using the older generation Mali100 (predecessor to the Mali110) and the MaliGP. This is plotted over scenes with increasing complexity as measured by vertices per frame. Note that irrespective of image complexity, 4X FSAA frame rates are virtually identical to frame rates without FSAA, and as demonstrated below, 4X FSAA produces minimal bandwidth spikes. Even with 16X FSAA enabled, frame rates start at a respectable 60 fps, and remain above 20 fps even when displaying the most complex frame. These results were obtained with the Mali100 running at 40 MHz.
Falanx is the only vendor offering “free” 4X FSAA, and also the only vendor with available 16X FSAA. This enables Mali-equipped mobile phones to deliver market leading image quality for 3D game play at very compelling frame rates. As we’ll see in the next section, Mali produces this unique blend of performance and visual quality with minimal power consumption.

C. Lowest Power Consumption

There are three major drivers of power consumption in an embedded graphics system: gate count, software and memory bandwidth.

The gate-related share of power consumption is more or less a function of the total number of gates in the design. Techniques like clock gating are either automated during the integration phase, or embedded directly into the core implementation (Falanx offers both approaches). The critical feature is that the core can detect when it has finished rendering a frame and automatically turn itself off, which can dramatically improve power savings, especially for frame rate locked applications. However, since all serious IP vendors do this, actual power consumption comes down to a function of gate count.

In configurations where parts of the graphics functionality are performed in software, the software must be optimized for the target platform, and the interaction between the hardware and software must be efficient so that valuable cycles are not spent “doing nothing”.

That said, the greatest power consumer in a graphics system is memory bandwidth. In addition, high bandwidth usage tends to slow other mobile phone functions that also require bandwidth. For these reasons, low bandwidth consumption is a critical feature for graphics cores.

At a high level, there are two stages of 3D processing that impact overall graphics bandwidth: Transform and Lighting (Geometry) and Rendering (Pixels). Geometry is either handled by the device CPU or a geometry co-processor like the MaliGP, with the output then transferred to the 3D processing unit (like the Mali 50/55/100/110) for final rendering.
Either way, when evaluating or comparing graphics technologies, it’s critical to analyze and measure bandwidth consumption in both stages.

Traditional Immediate Mode Renderers

Traditionally, there are two common ways to design 3D graphics processors: immediate mode rendering and tile-based rendering. With immediate mode rendering (Figure 6), operation during the geometry stage is relatively efficient, since the geometry engine simply sends all raw vertices to the raster engine.
Once there, the raster engine renders, shades and textures each pixel, then sends it to an off-chip external Z-buffer and color buffer after it has determined if the pixel is actually visible in the frame, and not obscured by other pixels. This method of handling z- and color-values uses an enormous amount of bandwidth per rendered pixel and can easily bottleneck performance, especially when considering the additional per pixel bandwidth required for texturing and alpha blending. The situation worsens with higher overdraw (depth complexity of the scene) because bandwidth is expended for pixels that will never be visible (reading textures and z-values for each pixel, times the overdraw, to check if the pixel will be visible).

With FSAA enabled, the already high per pixel bandwidth in an immediate mode renderer increases linearly with the number of samples per pixel (e.g. 2 times for 2X FSAA, 4 times for 4X FSAA). This spikes memory bandwidth and power consumption significantly and often slows the display rate below acceptable levels. This is the reason why you seldom see more than 2X FSAA in immediate mode renderers for mobile phones.

Overall, traditional immediate mode rendering is a brute force approach that works acceptably well in the personal computer environment, where bandwidth is plentiful and power consumption largely irrelevant. However, in the bandwidth limited, power-starved mobile phone environment, it’s clearly not the optimal architecture.

Tile-based Renderers

In contrast, tile-based rendering (Figure 7) reduces per pixel bandwidth inefficiencies by breaking each frame into separate blocks called tiles, rendering them independently and then assembling them together before display. This enables tile-based renderers to perform z-buffer calculations on-chip, eliminating traffic to the z-buffer. The cost is added bandwidth per primitive, which is seldom mentioned.
The best known tile-based renderer company, PowerVR, performs Hidden Surface Removal (HSR) on-chip by introducing a pre-rendering pass and filling the z-buffer with values that track which pixels are visible. Then, in a second pass, the chip can render and fetch the textures for only those visible pixels. This reduces texture bandwidth per pixel because of reduced overdraw, but potentially doubles vertex bandwidth, and requires increased gate count over typical immediate mode renderers.

With FSAA enabled in a traditional tile-based architecture, the number of pixels that must be analyzed for visibility increases linearly with the number of samples (e.g. 2X for 2X FSAA), again boosting the bandwidth usage per vertex to unacceptably high levels. However, since bandwidth has been saved on texture data because of HSR and the on-chip Z-buffers, FSAA will often require lower bandwidth than in an immediate mode renderer.

In sum, traditional tile-based renderers offer better bandwidth utilization than immediate mode renderers in low complexity games, but as scene complexity increases, the bandwidth saving per pixel is eroded by the additional bandwidth usage for geometry. HSR increases the gate count and when anti-aliasing is enabled, memory bandwidth usage increases significantly over that required for non-FSAA display.

Falanx Mali Architecture

The Falanx Mali architecture (Figure 8) is a hybrid between tile-based and immediate mode renderers. In the geometry phase, Falanx’s Mali architecture operates like most tile-based renderers, dividing the image into tiles but using a proprietary method that reduces per vertex bandwidth and memory usage significantly over traditional tile-based renderers. Rather than performing all z-ordering before retrieving textures (HSR), Mali relies on two far more cost efficient techniques to reduce texture bandwidth.
First is a proprietary and highly efficient Early Z implementation that is virtually free in terms of gates and typically eliminates approximately 50% of occluded pixels, reducing texture bandwidth by the same percentage. Then, Mali applies high-quality 2 bits per pixel texture compression, FLXTC, a method that virtually eliminates the need to reduce texture bandwidth any further. By combining the best of immediate mode and tile-based renderer techniques, all Mali cores are extremely bandwidth efficient compared to either traditional approach, even without considering Mali’s highly efficient 4X FSAA. With 4X FSAA enabled, Mali requires significantly less bandwidth than all competing architectures under a very broad range of operating conditions.

Proving these competitive bandwidth claims is complicated by the lack of a common platform for actually testing bandwidth consumption. In an attempt to simulate the operation of immediate mode and tile-based rendering, Falanx produced theoretical models of how the Mali architecture would perform using both immediate mode and traditional tile-based rendering techniques. When possible, these calculations were compared to actual benchmark results published by vendors using these architectures, which generally confirmed the Falanx theoretical models. We then compared these calculations to actual results produced by the Mali architecture using the following assumptions:
The initial results are shown in Figure 9.
Without FSAA, the tile-based architecture requires less bandwidth than immediate mode for the geometry complexities that can be expected in advanced 2006 mobile games, which will be about 10-15k polygons, while the Mali architecture is even more efficient. As scene complexities approach roughly 20,000 polygons, however, immediate mode becomes the preferred approach.
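The crossover described above follows from a simple trade-off: a tile-based renderer saves per-pixel traffic but pays extra per-polygon traffic. The toy model below sketches that reasoning. It is not the Falanx theoretical model, and every constant in it (bytes per access, bytes per polygon, number of geometry passes, screen size, overdraw) is a hypothetical value chosen for illustration; texture traffic is left out entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scene:
    pixels: int    # visible pixels per frame
    overdraw: int  # average depth complexity
    polygons: int  # polygons per frame

# All constants are illustrative assumptions, not Falanx measurements.
BYTES_PER_SAMPLE_ACCESS = 6  # z read + z write + colour write, 16 bits each
BYTES_PER_FINAL_PIXEL = 2    # one 16-bit colour write per finished pixel
BYTES_PER_POLYGON = 40       # hypothetical size of one transformed polygon
TILER_GEOMETRY_PASSES = 3    # read input, write binned list, read it back

def immediate_mode_bytes(scene, samples=1):
    """Off-chip traffic per frame: every sample of every drawn pixel hits memory."""
    pixel_traffic = (scene.pixels * scene.overdraw * samples
                     * BYTES_PER_SAMPLE_ACCESS)
    return pixel_traffic + scene.polygons * BYTES_PER_POLYGON

def tile_based_bytes(scene):
    """Off-chip traffic per frame without FSAA: z stays on chip, geometry is re-read."""
    pixel_traffic = scene.pixels * BYTES_PER_FINAL_PIXEL
    geometry_traffic = (scene.polygons * BYTES_PER_POLYGON
                        * TILER_GEOMETRY_PASSES)
    return pixel_traffic + geometry_traffic

def crossover_polygons(pixels, overdraw):
    """Polygon count at which the two toy models use equal bandwidth."""
    saved_per_frame = pixels * (overdraw * BYTES_PER_SAMPLE_ACCESS
                                - BYTES_PER_FINAL_PIXEL)
    extra_per_polygon = BYTES_PER_POLYGON * (TILER_GEOMETRY_PASSES - 1)
    return saved_per_frame / extra_per_polygon

qvga = 320 * 240  # assumed screen size
for polygons in (5_000, 10_000, 20_000):
    scene = Scene(qvga, 3, polygons)
    print(polygons, immediate_mode_bytes(scene), tile_based_bytes(scene))
print("crossover:", crossover_polygons(qvga, 3))
```

With these made-up constants the tile-based model wins below about 15,000 polygons per frame and loses above it. The exact crossover point moves with every assumption, which is why measured figures matter more than any such model.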
However, with 4X FSAA enabled (Figure 10), both the traditional tile-based and the Mali architecture outperform immediate mode renderers by a significant margin. Most significantly, the Mali architecture consumes about 1/3 of the bandwidth of a traditional tile-based renderer.

Theory aside, what really counts are real-world numbers. Falanx has a large suite of real-world OpenGL ES applications running with the Falanx Performance Analysis Tool (PAT), which confirms Mali’s highly efficient memory bandwidth usage. When evaluating other technologies, be sure to evaluate bandwidth consumption and frame rates over scenes with different complexity levels, and with FSAA enabled and disabled.

Even without FSAA enabled, the Mali architecture is competitive against both immediate mode and tile-based renderers at most relevant scene configurations. However, relatively few mobile phone users will tolerate visible jaggies in their display, making 4X FSAA a practical necessity. With 4X FSAA enabled, Mali is clearly the most bandwidth/power efficient architecture, extending battery life significantly over all competitive technologies.

D. Easy to Integrate/Accelerated Time to Market

Clearly, advanced hardware technology is useless unless it can be efficiently integrated into the SoC and includes a full set of all relevant drivers. Falanx excels on both fronts, allowing customers to shrink their time to market.

Hardware

Since Mali offers 2D, 3D and video acceleration in one core solution, SoC complexity is significantly reduced compared to other solutions, thereby saving SoC designers significant time in actual integration and verification of the total solution. All Mali Cores have simple interfaces to standard SoC interfaces, including AMBA AHB 2.0, AXI and OCP, in both 64-bit and 32-bit bus configurations.
To minimize design changes to existing SoC architectures, Mali is designed to deliver high performance in high-latency systems and thus does not require a separate memory controller, further reducing the complexity of the total system. Mali shares the memory interface with other IP cores, so all memory accesses are burst optimized, with the typical case being 8-burst and 16-burst on the bus. This simplifies system verification, and also contributes to higher utilization of the total available bandwidth on the SoC bus system.

In addition, Mali is delivered as a soft-core IP that potentially can be implemented in different process technologies with different tool-chains. The Mali IP cores can quickly be retargeted to fit into any tool-chain for optimized and quick synthesis and implementation. To ease the burden of integration further, all Falanx IP cores contain a minimum of embedded RAM macro instances and provide direct access to these macros for RAM BIST or other test mechanisms.

Software

Software is as crucial as hardware when it comes to performance, functionality and power consumption in an embedded graphics system. Application developers are usually not interested in HOW the hardware works, but only in what kind of functionality it implements and how fast it can execute this functionality. The Falanx approach is to jointly develop software and hardware to produce efficient interfaces between the two, thereby squeezing out additional performance in an already efficient architecture. As a part of the Mali graphics cores, Falanx ships a full set of ARM optimized pre-verified software stacks, including all Khronos APIs, and the following software components:
As shown in Figure 11, several other software items are also included, including reference video codecs running on top of the hardware. Due to the programmability of the Mali architecture, Falanx is continuing to investigate other software that can be beneficial for the end-user multimedia experience, including font rendering technologies, imaging (e.g. JPEG2000) and audio acceleration.

Mali’s scalability makes the architecture uniquely well suited for product lines targeting multiple market segments, while Mali is easy to incorporate into an SoC design from both a hardware and software perspective. Overall, Mali enables SoC vendors to supply a high performance, diverse product line with minimal NRE costs and speedy time to market.

Summary of Competitive Advantages and Benefits
Conclusion

The Falanx Mali graphics architecture offers a unique combination of small size, efficient power consumption and comprehensive 2D, 3D and video encode/decode features with industry leading 4X FSAA. Mali’s scalable architecture is ideal for phones ranging from entry level to power user, and is designed for easy integration into SoC cores. Falanx targets semiconductor SoC vendors delivering platform and infrastructure products for mobile communication and computing. This includes the market for handheld 3D graphics and multimedia hardware such as mobile phones, PDAs, Tablet PCs, game consoles and in-car infotainment and navigation systems.

About Falanx

Falanx incorporated in April 2001 and is a privately held corporation developing and marketing IP Graphics Cores for use in mobile phones, PDAs, set-top boxes, handheld gaming devices and infotainment systems. The result of seven years of research and development targeting the weaknesses of traditional desktop graphics architectures, Falanx's Mali Graphics Solution™ is embedded into a scalable and easy-to-integrate architectural framework and is ready to accelerate existing and future APIs for 2D and 3D graphics. The Mali Graphics Solution™, with its patent pending algorithms, is capable of accelerating video encoders and decoders like MPEG-4 and H.264. The company is backed by leading, long term venture capital firms with European and U.S. operations and Invanor, a government organization providing funding and networks for start-up and growth companies. The Falanx Board of Advisors consists of representatives from the Norwegian University of Technology and Science, Nordic Semiconductors and SINTEF, Europe’s largest independent research organization.

Note: The information contained herein was gathered from a variety of sources and is believed correct at the time of this writing.
If you discover any outdated or incorrect information, please send an e-mail to jan@falanx.com with a message header stating “Mali White Paper - Errata.” Thank you.