Tuning Fork - A Tool For Optimizing Parallel Configurable Processors

Shay Gal-On, Steve Novack, Oz Levia – Improv Systems, Inc.

In this paper we describe our experience in developing Tuning Fork: a tool to assist the user in fine-tuning the match between application and processor.

1. Introduction

A developer has several considerations in selecting the 'optimal' processor configuration. These factors give rise to complex tradeoffs between chip size, power requirements, and performance [7]. Many applications, such as those found in the area of DSP, have large amounts of instruction-level parallelism in loops that can be exploited using various software pipelining techniques [5], [8], [9], [14]. Maximally utilizing this parallelism to increase performance requires a processor configuration with the right mix of parallel resources. Moreover, special-purpose resources (e.g. Multiply/Accumulate units, bit-manipulation units, SIMD units, etc.) can be added to a processor to further increase performance. However, every resource added to the processor has an associated cost in terms of die size, power, and even clock frequency. What, then, is the right trade-off between cost and performance?

Designing an Application Specific Processor (ASP) can be more of an art than a science, relying on the designer's expertise to find a combination of resources that will result in the best tradeoff between performance and other constraints. Tuning Fork aims to eliminate the guesswork and allow even relatively inexperienced designers to develop highly efficient ASPs.

Traditional tools are capable of revealing and even optimizing dynamic software bottlenecks. These include profilers providing feedback on where the bottlenecks are [3], suggestions on optimizing the software [2], and even on-the-fly optimizations [4].
EDA tools are available for the design and synthesis of embedded systems [15], and studies have compared the efficiency of providing custom hardware instead of software operations [10], [11]. Tools that support processor design are not as numerous: few configurable processors are available to the design community, and processor design has typically been the craft of a few architects.

Using runtime statistics, a profiling tool gives the user a view of the application in relation to resource utilization (the focus being on cycles as the critical resource). The assumption underlying all profilers is that the application is flexible and the processor is fixed, and that therefore it is the software that needs to be optimized.

These separate approaches are inadequate for Application Specific Processors (ASPs). When a suitable technology is available, as is the case with the configurable Jazz DSP processor, not only the application but also the processor itself can be extensively configured and fine-tuned towards one application. Special units, known as Designer Defined Computational Units (DDCUs) [6], can even be built and incorporated in the processor to optimally perform a key operation (e.g. shift by a constant amount and multiply, special SIMD-type operations as used in various DSP processors [12], or any other hardware that is applicable to the specific algorithm being used).

Figure 1 illustrates the development of an ASP as an iterative process. After developing and verifying the application, a specific embedded platform is targeted. From then on, an iterative process of optimization continues until all the required constraints are met: performance, die size, and power. Optimizing the HW (Step 9) is only possible with configurable processors.
Analyzing the processor in relation to a specific application allows Tuning Fork to integrate the software (Step 6) and hardware (Step 8) aspects of the optimization process for an ASP, thus allowing previously unattainable levels of refinement in the application as a whole - both software and hardware.

Tuning Fork provides both visual and pro-active feedback to the user. The visual feedback consists of both qualitative (graphical) and quantitative (numeric) analysis of an application running on a specified processor configuration (obtained via profiling). This feedback includes per-engine, per-method, and per-block code size and cycle-accurate performance feedback, along with dynamic and static resource utilization charts (at varying degrees of granularity). It also includes cyclic and acyclic dependence chain analyses that provide a hard lower bound on performance. Using this information, the user can isolate performance bottlenecks and gain insight into their cause.
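The dependence-chain lower bound mentioned above can be sketched as a longest-path computation over a block's data-dependence DAG: with unlimited resources, a block can never finish faster than its critical path. The following is a minimal illustration with hypothetical operations and latencies, not Tuning Fork's actual implementation:

```python
# Critical path through a basic block's dependence DAG gives a hard
# lower bound on the block's execution time (in cycles).
from functools import lru_cache

# op -> latency in cycles (hypothetical illustration data)
latency = {"load_a": 2, "load_b": 2, "mul": 2, "add": 1, "store": 1}
# op -> the ops whose results it consumes
deps = {"load_a": [], "load_b": [], "mul": ["load_a", "load_b"],
        "add": ["mul"], "store": ["add"]}

@lru_cache(maxsize=None)
def finish(op):
    """Earliest cycle at which op can complete, assuming unbounded resources."""
    start = max((finish(d) for d in deps[op]), default=0)
    return start + latency[op]

lower_bound = max(finish(op) for op in deps)
print(lower_bound)  # -> 6: loads (2) + mul (2) + add (1) + store (1)
```

If a block's measured cycle count exceeds this bound, the gap is attributable to resource limits rather than to the dependences themselves.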
The remainder of this paper is organized as follows:

2. Related Work

Tools such as VISTA [13] even aim at making the user understand the HW-SW interaction. Like Tuning Fork, VISTA provides both qualitative and quantitative feedback to the developer, allowing them to visualize resource utilization (both dynamic and static) and identify performance bottlenecks. However, unlike Tuning Fork, VISTA does not provide any pro-active suggestions to the user – VISTA's feedback is entirely passive. Tools such as VTune [2] direct the application developer to the application bottlenecks. VTune can even provide suggestions on application modifications for common cases. Other tools try to optimize the application code at runtime, relying on profiling information without user intervention. Dynamo [4], for example, dynamically selects an execution trace based on the profile and applies various optimizations to the code. The optimized code is then executed instead of the original code. However, none of these tools take into account the possibility of changing the processor itself to relieve the bottleneck.
3. Improv Jazz DSP processor capabilities

The Improv Jazz DSP is a scalable, configurable, high-performance VLIW multi-processor [18], [19]. The VLIW structure of the Jazz processor's Computation Units (CUs) and Memory Interface Units (MIUs) is controlled via instruction slots. Jazz has a shallow pipeline, and most instructions complete in two cycles (including the fetch). The Jazz processor uses a Harvard memory architecture, where data and program have different address spaces. It does not have a central register file; instead, registers are distributed among the various CUs and MIUs. The processor uses multiple task queues to control the order of execution (including interrupts) and includes power management features. Like many DSP processors, Jazz supports multiple data memories (Figure 2). Several aspects of the Jazz processor are configurable. These include:
Jazz Tools and Flow

Notation Environment: A high-level verification and SW development tool for embedded applications.

Composer: Allows the developer to define a processor, configuring any of the elements mentioned above, and in some cases to use multiple processors to create a platform, defining memory sizes, processor types, and the connections between them.

Jazz PSA Compiler: A retargetable compiler for C and Java source code. When invoked, the Jazz PSA Compiler loads the target native platform and compiles the application to perform optimally on that platform.

Once the application code has been compiled to a specific platform, the user has two options to measure performance:
The output from any of the methods above is a profile that can be correlated to the application source code. This profile is used by Tuning Fork, in conjunction with data collected during compilation, to provide the user with a visual representation of resource utilization on the various processors. In addition, Tuning Fork analyzes the match between the application and the processors, and suggests configuration changes that are likely to increase performance. Tuning Fork identifies resources that are under-utilized, and will also suggest specific slot overlays if the developer is looking to decrease instruction width, or changes to the memory layout to increase memory bandwidth, in addition to other insights that assist the developer in perfecting the match between the platform and the application.

Sample Application: Voice Processing System (VPS)

4. Visualizing and Solving Performance Bottlenecks

To identify the highly executed blocks of code, we rely on profile information [17]. Each basic block in the code is matched with two main values: the number of cycles a single execution of the block takes, and the number of times the block is executed. We present this information (Figure 3, panel 1) in a spreadsheet format that allows sorting by different criteria. Panel 1 allows the user to easily navigate to different basic blocks of interest. Once a specific basic block is chosen, the information panels are filled with relevant information: the corresponding source code (panel 6), the resulting assembly code (panel 5), static utilization of hardware resources in this block (panel 2), and summary information about the block collected during compilation (panel 4). Per-cycle utilization is also available to correlate specific assembly and dependence chains to the various resources (panel 3).
The summary information includes statistics such as total cycles and number of operations per cycle, the flow type of the block (inner loop, outer loop, etc.), the length of the dependence chain, the presence of cyclic dependencies, and so on.
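The per-block ranking described above amounts to weighting each block's per-execution cost by its execution count. A minimal sketch, using hypothetical profile entries rather than real Tuning Fork output:

```python
# Rank basic blocks by total dynamic cycles (cycles per execution * execution
# count), the same ordering the spreadsheet view supports.
# Profile entries are hypothetical illustration data.
profile = [
    # (block name, cycles per execution, execution count)
    ("adaptive_predictor.loop", 4, 50_000),
    ("init.entry",             12, 1),
    ("filter.inner",            6, 20_000),
]

ranked = sorted(profile, key=lambda b: b[1] * b[2], reverse=True)
for name, cycles, count in ranked:
    print(f"{name:<28}{cycles * count:>10} total cycles")
```

The hottest block (here, the adaptive predictor loop at 200,000 cycles) rises to the top regardless of how cheap a single execution is.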
Figure 3

This information allows the user to quickly answer the following question: can the execution of this basic block be improved by using different hardware? If the length of the dependence chain in the block (or the cyclic dependence chain in loops) is smaller than the number of cycles the block's execution requires, then there is a hardware bottleneck. The user can correlate the assembly code with the source code, and perhaps gain insight as to how the hardware or the software might be improved for a better match. The user may also want to know which hardware resources are most likely blocking the execution. The utilization graph shows the usage of each hardware resource. For loops that can be pipelined, any highly utilized resource may be a blocking resource, unless the loop has a cyclic dependency blocking further optimization.
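The hardware-bottleneck test above reduces to a simple comparison per block: measured cycles against the dependence-chain length. A sketch with hypothetical numbers (the 4-cycle loop with a 2-cycle chain mirrors the kind of gap discussed in the text):

```python
# A block whose measured cycle count exceeds its dependence-chain length is
# limited by resources, not by data dependences.
# Per-block numbers are hypothetical illustration data.
blocks = {
    # block name: (measured cycles per execution, dependence-chain length)
    "adaptive_predictor.loop": (4, 2),
    "filter.inner":            (6, 6),
}

bottlenecks = [name for name, (cycles, chain) in blocks.items()
               if cycles > chain]
print(bottlenecks)  # blocks flagged as hardware-bottlenecked
```

`filter.inner` is not flagged: it already runs at its dependence-chain lower bound, so no amount of extra hardware would speed it up.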
On the Jazz processor, a branch has one cycle of latency, meaning that it should be possible to execute the loop in two cycles. However, we observe that the loop takes 4 cycles. We also observe that slot 4 is 100% utilized by the Multiply Accumulate unit (named mp0). Indeed, the analysis provided by Tuning Fork identifies the Multiply Accumulate unit as a bottleneck in this block, along with an estimate of the cycles to be gained by adding another Multiply Accumulate unit (Figure 4).

Figure 4: Analysis results for main loop in Adaptive Predictor source code

Another important issue when trying to optimize hardware bottlenecks is even harder for the user to tackle without an automated tool: any change in the hardware is likely to affect multiple blocks. Estimating the effect of adding or removing resources on the overall performance of the application is therefore extremely valuable. The programmer can always create a new configuration and get the results after compilation; however, trying all possible combinations is simply not feasible. Instead, we use Tuning Fork to provide an overall view of the interaction between the processor and the application (Figure 5).
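Estimating the application-wide effect of a hardware change, as described above, can be sketched as summing each affected block's per-execution saving weighted by how often that block runs. All numbers below are hypothetical illustration data, not Tuning Fork's actual cost model:

```python
# Estimate total cycles saved across a run by adding one more
# Multiply Accumulate unit: sum per-block savings weighted by
# execution counts. Hypothetical illustration data throughout.
per_block_saving = {  # cycles saved per execution with the extra unit
    "adaptive_predictor.loop": 2,
    "filter.inner":            1,
}
exec_count = {
    "adaptive_predictor.loop": 50_000,
    "filter.inner":            20_000,
}

total_saved = sum(per_block_saving[b] * exec_count[b]
                  for b in per_block_saving)
print(total_saved)  # -> 120000 cycles saved over the whole run
```

Because the weighting uses dynamic counts, a small per-iteration gain in a hot loop can dominate a large gain in rarely executed code.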
Figure 5: Global view. Top left – static utilization. Top right – dynamic utilization. Bottom – resource bottlenecks analysis for the processor.
We can also observe that two of the memories are only used for reads and never for writes. This allows the designer to use memories with read-only ports in those cases.
Figure 6: Data allocation suggestions for the VPS application

5. Overlapping slots
On the Jazz platform, the programmer can control the instruction word size without reducing the number of resources on the processor by overlapping resources into a single instruction slot. However, doing so decreases the amount of parallelism available, potentially increasing the number of instructions in the program. The user is then faced with the challenge of finding the correct units to overlay such that the instruction width is within bounds, yet performance is not penalized too much. This is a multi-dimensional problem, as the user needs to select the units to use, the number of VLIW slots, and the right organization of the units in the slots.
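To make the search space concrete, the overlay problem can be sketched as a small combinatorial search: choose which unit pairs share a slot so that the instruction width fits a budget while the estimated cycle penalty is minimized. The unit widths and penalty model below are hypothetical illustration data, and exhaustive enumeration is only viable at this toy scale (which is exactly why the paper notes that trying all combinations is infeasible in practice):

```python
# Brute-force sketch of the slot-overlay search: enumerate ways of merging
# unit pairs into shared slots, keep the cheapest assignment that fits.
from itertools import combinations

units = {"mp0": 14, "alu0": 10, "alu1": 10, "miu0": 12}  # slot widths (bits)
width_budget = 34
penalty = {  # estimated cycles lost over a run if the pair shares a slot
    frozenset(("alu0", "alu1")): 5_000,
    frozenset(("mp0", "alu0")): 40_000,
    frozenset(("mp0", "alu1")): 40_000,
    frozenset(("mp0", "miu0")): 60_000,
    frozenset(("alu0", "miu0")): 8_000,
    frozenset(("alu1", "miu0")): 8_000,
}

best = None
for k in range(len(units) // 2 + 1):          # k = number of overlaid pairs
    for pairs in combinations(combinations(units, 2), k):
        chosen = [u for pair in pairs for u in pair]
        if len(set(chosen)) != len(chosen):
            continue                           # a unit shares at most one slot
        shared = set(chosen)
        width = sum(max(units[a], units[b]) for a, b in pairs)
        width += sum(w for u, w in units.items() if u not in shared)
        if width > width_budget:
            continue
        cost = sum(penalty[frozenset(pair)] for pair in pairs)
        if best is None or cost < best[0]:
            best = (cost, pairs, width)

print(best)  # (total penalty, chosen pairs, resulting instruction width)
```

With these numbers, the best fit overlays mp0 with alu0 and alu1 with miu0 (width 26 bits, penalty 48,000 cycles), beating the single-pair option that merges the two most expensive units.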
6. Algorithms

The mini scheduler: The Jazz PSA Compiler uses Mutation Scheduling [5] and Resource-Directed Loop Pipelining [14] to parallelize and schedule application code for the Jazz DSP processor. After scheduling, a fast mini-scheduler pass is used to detect resource bottlenecks. The pass is an enhanced basic block list scheduler, which marks any resource that is blocking an optimal schedule (an optimal schedule being defined as one in which the block's execution takes no longer than its longest dependence chain). Whenever a resource is identified as blocking the scheduler, the mini scheduler creates a virtual resource that is identical to the blocking resource, and that virtual resource becomes available to the mini scheduler.
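The mini-scheduler idea can be sketched as a cycle-by-cycle list scheduler that, whenever a ready operation stalls solely for lack of a free unit, grants a "virtual" copy of that unit and records it as blocking. This is a minimal single-cycle-latency sketch with hypothetical operations, not the Jazz PSA Compiler's actual pass:

```python
# Minimal list scheduler with virtual resources: an op that is data-ready
# but cannot issue gets a cloned (virtual) unit, and the unit is recorded
# as a blocking resource. Ops, units, and deps are hypothetical.
ops = {  # op name: (required unit, ops it depends on)
    "m1": ("mac", []),
    "m2": ("mac", []),
    "a1": ("alu", ["m1", "m2"]),
}
capacity = {"mac": 1, "alu": 1}   # one real instance of each unit
blocking = set()                  # units that forced a virtual clone
finish_cycle = {}
unscheduled = set(ops)
cycle = 0

while unscheduled:
    used = {u: 0 for u in capacity}
    for op in sorted(unscheduled):
        unit, deps = ops[op]
        # operand not yet computed, or computed only this cycle: not ready
        if any(d in unscheduled or finish_cycle[d] >= cycle for d in deps):
            continue
        if used[unit] >= capacity[unit]:
            blocking.add(unit)     # blocked purely by the resource
            capacity[unit] += 1    # grant a virtual copy of the unit
        used[unit] += 1
        finish_cycle[op] = cycle
        unscheduled.discard(op)
    cycle += 1

print(blocking, finish_cycle)
```

Here the two multiplies contend for the single MAC unit, so `mac` is flagged as blocking; with its virtual clone the block schedules in the two cycles its dependence chain requires, which is the estimate of what one more real MAC would buy.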
This method, while not completely accurate (as it does not take into account possible changes in register allocation and other transformations that might become possible or valuable in the presence of more hardware resources), provides a good approximation of the effect that adding the resource will have on the block. This method is particularly effective for pipelined loops, which tend to be the area the user is most concerned with in DSP applications.

Overlapping units:

7. Summary and future activity
Tuning Fork currently does not provide the designer with suggestions regarding customized hardware that does not already exist on the processor. It is our intention to provide analysis of code fragments that can lead to various custom hardware units; for example: operations that are likely to benefit from SIMD-type manipulation, chains of operations that use a constant, multiple operations that use the same input, and more. Tuning Fork also provides cost estimates for the various suggested processor changes, focusing mainly on the performance and code size aspects. We intend to extend and enhance these estimates to include power estimates, and to provide quick estimates for various user changes (a quick what-if analysis).

8. References

[1] R. Cartwright and M. Felleisen. "The Semantics of Program Dependence." SIGPLAN Conference on Programming Language Design and Implementation, 1989.
[2] Intel Corporation. VTune Performance Analyzer 4.5. http://developer.intel.com/vtune/analyzer/index.htm
[3] S. L. Graham, P. B. Kessler, and M. K. McKusick. "gprof: a Call Graph Execution Profiler." SIGPLAN Symposium on Compiler Construction, 1982.
[4] V. Bala, E. Duesterwald, and S. Banerjia. "Dynamo: A Transparent Dynamic Optimization System." SIGPLAN Conference on Programming Language Design and Implementation, 2000.
[5] S. Novack and A. Nicolau. "Mutation Scheduling: A Unified Approach to Compiling for Fine-Grain Parallelism." In Languages and Compilers for Parallel Computing, pages 16-30. Springer-Verlag, 1994.
[6] C. Liem, F. Breant, S. Jadhav, R. O'Farrell, R. Ryan, and O. Levia. "Embedded Tools & Compilation Techniques for a Reconfigurable and Customizable DSP Architecture." Proc. of the IEEE/ACM Co-design Symposium, pending publication, Estes Park, CO, May 2002.
[7] E. Zitzler, J. Teich, et al. "Multidimensional Exploration of Software Implementations for DSP Algorithms." VLSI Signal Processing Systems, 1999.
[8] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. "Software Pipelining." ACM Computing Surveys, 1995.
[9] B. R. Rau. "Iterative Modulo Scheduling: An Algorithm for Software Pipelined Loops." Proc. of the 27th Annual Int'l Symp. on Microarchitecture, pages 63-74, San Jose, California, Dec. 1994.
[10] S. S. Bhattacharyya, R. Leupers, and P. Marwedel. "Software Synthesis and Code Generation for Signal Processing Systems." IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 47(9):849-875, September 2000.
[11] W. Hardt and R. Camposano. "Trade-Offs in HW/SW Codesign." Proceedings of the 4th ACM Workshop on Hardware/Software Codesign, Pittsburgh, PA, March 1996.
[12] H. Nguyen and L. K. John. "Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using the AltiVec Technology." 13th International Conference on Supercomputing, 1999.
[13] S. Novack and A. Nicolau. "VISTA: The Visual Interface for Scheduling Transformations and Analysis." In Languages and Compilers for Parallel Computing, LNCS 768, Springer-Verlag, 1993.
[14] S. Novack and A. Nicolau. "Resource-Directed Loop Pipelining." The Computer Journal, Oxford University Press, 40(6), 1997.
[15] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara. Hardware-Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, 1997.
[16] T. Callahan and J. Wawrzynek. "Instruction Level Parallelism for Reconfigurable Computing." 8th International Workshop on Field-Programmable Logic and Applications, September 1998.
[17] T. Ball and J. R. Larus. "Optimally Profiling and Tracing Programs." ACM Transactions on Programming Languages and Systems, 16:1319-1360, July 1994.
[18] S. H. Leibson. "Jazz 2.0 Architecture Whitepaper." Microprocessor Report, Mar. 2000.
[19] Jazz 2.1 Architecture Reference Guide. Improv Systems, Jan. 2002.