A Heuristic Energy Aware Application Mapping algorithm for Network on Chip
* School of engineering of Islamic Azad University, Science & Research branch
** Iran Telecommunication Research Center
*** ECE department of Tehran University
Abstract
In this paper we have introduced a heuristic algorithm which automatically maps a given set of intellectual property onto a generic regular network-on-chip (NoC) architecture and constructs a deterministic routing function such that the total communication energy is minimized. We also compared the simulation results with a full and also partially GA based methods namely Scenario 2 and 3. The results show the maximum of 17% enhancement in energy consumption over the full GA based method is achieved.
1. INTRODUCTION
Continuing advance of the semiconductor technology will enable us to integrate billions of transistors on a single chip by the end of the decade. By this trend traditional System on Chips will face serious challenges due to their interconnection limitations such as design, synthesis, verification, test, etc. To overcome these challenges, an increase in reusability and interconnection efficiency is inevitable[1].
The communication-architecture synthesis consists of defining a communication topology and mapping cores onto the communication architecture[2]. Although the current electrical bus-based architectures are simple and completely widespread, use of these architectures will not meet the requirements of the future SoCs because of their seriously limited scalability and energy inefficiency[1]. By using the interconnection network as the communication infrastructure between cores, Network on Chip (NoC) will be an alternative and a promising solution for overcoming the existing challenges of the system on chip and uses the advantages of the interconnection networks, such as the structured network wiring, modularity, and scalability[3,4].
Simultaneously, the SoC design methodology must address “the system-level synthesis through HW/SW partitioning and mapping of the application tasks onto the pre-designed cores”, and “the communication-architecture synthesis through mapping of the cores onto the communication architecture”[2]. Of the two problems, the second one becomes more important due to the following reasons:
1- As the integration level of an SoC system scales up, the volume of data and control traffic among the cores grows increasingly and becomes the major performance bottleneck;
2- Unlike the power consumed within a core, the power dissipated in communication architecture does not benefit from down-scaling of feature size[3].
According to our assumptions, application mapping onto the communication architecture is a process of locating any individual IP-Core onto a specific platform which forms an Interconnection network or NOC. Given the communication requirements of cores, in this paper, we address the problem of optimally mapping the cores onto NoC architecture with two objective functions:
-
To minimize the average hop distance.
-
To minimize the maximum energy consumption over the communication links of the NoC architecture.
Both of the metrics are the key parameters of the performance and cost, e.g. the energy consumption and the communication delay. We refer to this problem as the Core Mapping Technique(CMT) problem(See Fig. 1).
Figure 1: Tile-based architecture and the illustration of the mapping/routing problems.[6]
To do such work we present an intelligent method to optimally map the cores on to the NOC platform. Figure 1 shows an example of optimally core mapping onto a 2-D mesh platform.
The rest of this paper is organized as follow:
Section two deals with the previous works and references some people who have worked in this area. In section three we show a formulation for modeling our problem. Within section four, five different scenarios will be discussed. In six, simulation results are presented and conclusion will be in section seven
2. Related Works
Network on chip is going to be a promising architecture for future complex SOCs. Demicheli and Jantsch have introduced NOC as such promising architecture. Dally suggest using the on-chip interconnection networks instead of ad-hoc global wiring to structure the top-level wires on a chip and facilitate modular design[6]. In [7], Hemani presents a honeycomb structure in which each processing core is located on a regular hexagonal node connected to three switches while these switches are directly linked to their next nearest neighbors.
Considering the communication-architecture synthesis, Hu has addressed the mapping and routing path allocation problem for tile based architectures[8]. T.Lei in [9] has proposed a two step genetic algorithm for mapping application on to a 2D mesh based NOC. D.Shin has also mentioned a similar case[9]. They have introduced a three step genetic algorithm to reduce the power consumption within a 2D mesh based NOC.
In [11] M.Nickray presented a similar strategy for reducing communication power consumption in NOCs. They have used Fat-Tree topology and a three step genetic algorithm for mapping and routing respectively.
We also refer the reader to several recent surveys [8-11] on NoCs for pointers to recent research and development. The mapping of clusters onto the physical topology of processors has been studied in the field of parallel processing[15,16]. The mapping of cores onto NoC architecture presents new challenges when compared to the mapping in parallel processing. A major difference is that the traffic requirements on the links of a NoC are known for a particular application, thus the bandwidth constraints in the NoC architecture need to be satisfied by the mapping. In what follows, we address this issue by presenting a heuristic mapping and routing techniques for NoC. The main contributions of our work is to obtain better routing algorithm comparing to the D.Shin[10] and M.Nickray[11] works.
3. Problem formulation
Simply stated, for a given application, our objective is to decide on which tile should each IP be mapped to and how should the packets be routed, such that the total communication energy consumption is minimized under given performance constraints. To formulate this problem, we need the following definitions.
Definition 1: An application characterization graph (APCG) G= G(C,A)is a directed graph, where each vertex ci represents one selected IP, and each directed arc ai,j characterizes the communication from ci to cj [8]. Each ai,j has the following properties:
- v(ai,j)is the arc volume from vertex ci to cj, which stands for the communication volume (bits) from ci to cj.
- b(ai,j)is the arc bandwidth requirement from vertex ci to cj, which stands for the minimum bandwidth (bits/second) that should be allocated by the network in order to meet the performance constraints. Please note that this definition is the same as the definition we have for task graph in [9,11]. Within a system, it is completely possible to express the data dependency, logical dependency and also timing dependency between tasks of a TG. In here, for simplification we just consider the data dependency.
Definition 2: An architecture characterization graph (ARCG) G’= G(T,R) is a directed graph, where each vertex ti represents one tile in the architecture, and each directed arc ri,j represents the routing parameter from ti to tj. Each ri,j has the following properties:
- Pi,j is the a set of candidate minimal paths from ti tile to tj tile , gives the set of links used by Pi,j.
- e(ri,j) is arc cost. This represents the average energy consumption of sending one bit of data from ti to tj. this is represented as [7].
4. Scenarios
Based on the mentioned target in the previous section, four different scenarios will be studied.
Scenario 1:
In the first scenario, mapping and routing are considered as a complex problem. In other words the problem is to find an algorithm which simultaneously solves both the mapping and routing problem.
It is clear that this method is highly time consuming and very complex to solve. In the literature, its been addressed as an unsuitable approach. We have addressed this scenario with the same purpose and therefore we omit it from our suggestion list.
Scenario 2:
In the scenario 2, the main problem is divided in two smaller problems namely: mapping and routing problems respectively. Within this scenario, we decide to find an optimal mapping for the tasks of a ATG based on a typical Genetic Algorithm with the goal of reducing the Manhattan distance between nodes.
For the routing, another GA based X-Y routing algorithm will be used for any individual routing path. The energy consumption for each routing path and then for all paths is calculated as a fitness function for our GA algorithm.
Scenario 3:
The scenario 3 is almost similar to the scenario 2 with a slight difference. For the mapping problem, we have used the same GA Mapping algorithm in the scenario 2. Based on the Application Task Graph (ATG) a list of mapped resources will be generated.
For the routing, we have utilized a heuristic search algorithm which first, generates a number of full routing paths. For example it generates 500 full routing paths. On the next step, we will evaluate goodness of the routing process by calculating a cost function. This function shows the communication energy consumption and also the overall distance which itself represents total system delay respectively.
Only a subset of these random full routing paths, which their fitness values are under a specific threshold, will be selected. This threshold should get determined within the simulation runs. We will call these selected full routing paths as “The Candidate List”.
Since each full routing path consists of “r” sub-routing paths and r is the number of connections within a TG, on the third step this algorithm will select the best sub-routing paths within the candidate list. Therefore, we expect to have a new full routing path with lower energy consumption.
Scenario 4
In this scenario, the routing process remains the same but the mapping changes from GA to another heuristic search algorithm which is almost similar to the routing algorithm.
In the new mapping algorithm, a number of random mappings based on the Application Task Graph (ATG) will be generated. Then by defining a fitness function and a threshold value a subset of these mapping will be selected. We similarly call this the candidate list for mapping. The threshold value determines the size of the candidate list. It is clear that the larger list cause more accurate mapping within more execution time. Hence the threshold should be decided in a way that it satisfies both the accuracy and the latency.
5. Proposed algorithms
5.1 Energy model
Hu and Marculescu in [8] have proposed a model for energy consumption. This is based on energy calculation of sending one bit over communication channel between node i and node j with a 2-D Mesh topology as
Where Esbit is bit energy consumed on the links between nodes and nhops is the Manhattan distance between the source and destination tile.
As indicated in [10] the above equation can also be used to calculate the energy consumed by sending a packet over a path established between a typical source and destination tiles.
Hence, for a connection(Pi,j) that sends k packets/sec and for the total connections the consumed energy will be as follow:
In this model our concern is the consumed energy over the communication links and therefore we don’t consider the average energy being consumed on the computational resources.
5.2 Task Graph
Based on the definition 1 in the section 3, we represent each application by a Task Graph[1,8]. As it is indicated in figure 2, a TG contains vertices and edges. For every vertex, , represents a PE which performs a certain task or function. This could be considered as a micro processor, DSP or a memory module. Each edge shows a connection between two different tasks. We denote each edge of the TG as the rate of packet generation at the source node. As its depicted a TAG is a weighted directed acyclic graph(WDAG) whose it’s edges show the priority for execution of tasks..
Figure 2: A simple Task Graph[11]
TGs are also periodic. This means in a real NOC application the system may go through the TG for many times. In this paper the energy model and all the calculations are based on one period execution of the TG
5.3 Mapping Algorithm
There are problems that may not get solved by exhaustive search methods. In these cases, heuristic techniques could be more useful due to their attempt to achieve a near optimal solution within acceptable period of time.
As mentioned in the previous section, each scenario has a heuristic mapping algorithm that randomly searches for the best mapping within the problem space.
The first approach, which is just for the sake of addressing the problem, is to put the IP cores onto different random places and calculates if they are put in a suitable manner and if they meet the basic problem constraints.
The second approach is determined in way that we use a GA based algorithm for task assignment and task mapping. Please note that, breaking down an application into tasks and assigning each task to an IP core, the task assignment has been done during the execution of our TG generator program[11]. The remaining part, task mapping, will be done by a GA based algorithm as follow:
- Generating a random population as initial chromosomes.
- Evaluate the fitness of each chromosome in population
- Select mate which itself includes: crossover, mutation and accepting.
- Replace the new population with the previous one to form a better generation of population.
- Test the answers to see if the algorithm has converged to a solution.
5.4 Scheduling Algorithm
The tasks within the ATG need to be scheduled in accordance to their precedence and logical dependencies. So far, many algorithms have been proposed in the literatures. Most of them have discussed this issue from Operating System point view. But this can be extended and applied to the task scheduling in NOC.
In [10] D.Shin has used a simple List Scheduling(LS) scheme. While J.Hu has suggested a Branch and Bound(BB) algorithm[8]. In [17] a GA based LS algorithm in proposed. Within our research, for the sake of simplicity, we first use the simple LS algorithm while other algorithms may have better performance.
The LS uses the mobility of each task to determine the priority. The mobility of each task is defined as the difference between ASAP start time and the ALAP end time. To get these times we need to know the communication delay of each edge. However, we cannot know the exact value due to the asynchronous communication protocol of an NOC.
5.5 Routing algorithm
In [10] they have applied a Genetic Algorithm to a deterministic X-Y routing method that finds a near optimal solution for each routing path independently in separate GA runs.
We consider the whole routing paths as one problem. Fist we generate almost all possible full routing paths without considering the suitableness of each full routing as depicted in figure 3.
Then a suitableness value for each full path will be calculated according to the power consumption and the delay of routing paths. This will help to establish a measure for selection the best candidates for routing paths. In the next step we use a kind of permutation between Routing Paths(RPs) of the selected full RPs to find better combinations for the RPs included in a full routing path.
Figure 3: A bank of full paths
The result will be the best routing paths with minimum communication energy consumption. For this purpose we have used an intelligent XY routing algorithm.
6. Simulation Results
We have modeled the TG and the mesh platform in Matlab. Also, we have implemented the GA base mapping and routing algorithms in that environment. Figures 4 to 9 show the simulation results. Within each simulation, the mesh size varies from 3*3 to 7*7 and number of tasks are equal to the total number of tiles. Also for the routing algorithm used in Scenario 3 and scenario 4, the number of full routing paths varies from 10 to 300. In this paper we call it “Max”.
Figures 4 and 5 represent the total energy consumption on the communication links and the algorithm complexity respectively. We will use this case as measure for comparing the energy consumption and execution time of our next scenarios. In figures 6 and 7 we have run the simulation for different cases of routing algorithm for scenario 3. Therefore, 4 levels of Max are considered in figures 6 and 7. The same thing is done in figures 8 and 9 which they represent the simulation results of scenario 4.
In the next step, we have compared the enhancement ratio of energy consumption and algorithm complexity of scenarios 3 and 4 over the scenario 2 in tables 1 and 2. For mesh sizes 3*3 to 6*6, table 1 shows the enhancement ratio of the total energy consumption while at the table 2 the time complexity of the algorithms is presented. The comparison shows that the scenario 4 has the best outcome comparing to the other scenarios we have introduced in section 4.
Figure 4: Total energy consumption of scenario 2
Figure 5: Total execution time of scenario 2
Figure 6: total energy consumption in scenario 3
Figure 7 : Total execution time of scenario 3
Figure 8: Total energy consumption in scenario 4
Figure 9: Total execution time of scenario 4
Table 1: Enhancement ratio of the total energy consumption
scen3 over scen2 (%) | scen4 over scen2 (%) | |
3*3 | 1.61 | 1.6 |
4*4 | 1.91 | 2.16 |
5*5 | 7.1 | 4.77 |
6*6 | 17.98 | 15.35 |
Table 2: Enhancement ratio of the algorithm complexity
scen 3 over scen 2 | scen 4 over scen 2 | |
3*3 | 2.5264 | 2.2232 |
4*4 | 3.0549 | 5.4683 |
5*5 | 3.3168 | 10.017 |
6*6 | 3.681 | 12.157 |
7. Conclusions
In this paper, we addressed the mapping and routing path allocation problems for regular tile-based NoC architectures. An efficient algorithm was proposed, which automatically maps the IPs to tiles and generates a suitable routing function such that the total communication energy consumption is minimized under specified performance constraints. As suggested, although we focus on the architectures interconnected by 2-D mesh networks, our algorithm can be adapted to other regular architectures with different network topologies. This remains to be done as future work.
As indicated in figures 10 and 11, in our new algorithm separation of the mapping and the routing procedure may lead to the sub-optimality of the solution. However in this way we could achieve a very good performance in algorithm complexity while we could minimize the total energy consumption over the communication links.
Figure 10: Energy consumption of the three scenarios
Figure 11: Execution time of the three scenarios
Reference
[1] S. Kumar et al., “A Network on Chip Architecture and Design Methodology,” in Proc. ISVLSI’02, pp. 105-112, April 2002.
[2] K. Lahiri, A. Raghunathan, and S. Dey, “Efficient Exploration of the SoC Communication Architecture Design Space,” in Proc. IEEE/ACM ICCAD’00, pp. 424-430, 2000.
[ 3] Shashi Kumar, Axel Jantsch,” A Network on Chip Architecture and Design Methodology”,
[4] . Benini and G. D. Micheli,” Network on Chip A New SoC Paradigm”, IEEE Computer , pp 70-77, Jan 2002
[5] L. Benini and G. D. Micheli, “Powering Networks on Chips,” in Proc. Int’l Symp. System Synthesis, pp. 33-38, Montreal, Canada, 2002.
[6] W. J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” in Proc. DAC’01, pp. 684-689, June 2001.
[7] A. Hemanni, “Network on a Chip: An architecture for billion transistor era” , Proceedings of the IEEE NorChip Conference, November 2000
[8] J. Hu and R. Marculescu, “Energy-Aware Mapping for Tile-based NoC Architectures Under Performance Constraints,” in Proc. ASP-DAC’03, pp. 233-239, Jan 2003.
[9] T. Lei and S. Kumar, “A Two-step Genetic Algorithm for Mapping Task Graphs to Network on Chip Architecture,” in Proc. DSD’03, pp. 180-187, Sept. 2003.
[10] D.Shin, J. Kim, “power aware communiation optimization for network on chips with volytage scalable links”, CODES + ISSS’04 , September 8-10, 2004
[11] M.nickray, “Power and Delay Optimization for Network on Chip”
[12] L.Benini and G.De Micheli, “Networks on Chips: A New SoC Paradigm”, IEEE Computers, pp. 70-78, Jan. 2002.
[13] P.Guerrier, A.Greiner,”A generic architecture for on-chip packet switched interconnections”, DATE 2000, pp. 250-256, March 2000.
[14] E.Rijpkema et al., ”Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip”,DATE 2003, pp. 350-355, Mar 2003.
[15] V.Lo et al.,”OREGAMI: Tools for Mapping Parallel Computations to Parallel Architectures”,Intl Journal of Parallel Programming, vol. 20, no. 3, 1991, pp. 237-270.
[16] N.Koziris et al.,”An Efficient Algorithm for the Physical Mapping of Clustered Task Graphs onto Multiprocessor Architectures”, Proc. of 8th EuroPDP, pp. 406-413, Jan, 2000.
[17] Yoginder S. Dandass, “Genetic List Scheduling for Soft Real-Time Parallel Applications“
Related Articles
- Application Driven Network on Chip Architecture Exploration & Refinement for a Complex SoC
- Using the application modeling and mapping methodology for system-level performance analysis
- Rapid Creation of Application Models from Bandwidth Aware Core Graphs
- A Multi-Objective Optimization Model for Energy and Performance Aware Synthesis of NoC Architecture
- From a Lossless (~1.5:1) Compression Algorithm for Llama2 7B Weights to Variable Precision, Variable Range, Compressed Numeric Data Types for CNNs and LLMs
New Articles
Most Popular
- System Verilog Assertions Simplified
- System Verilog Macro: A Powerful Feature for Design Verification Projects
- Synthesis Methodology & Netlist Qualification
- Enhancing VLSI Design Efficiency: Tackling Congestion and Shorts with Practical Approaches and PnR Tool (ICC2)
- Demystifying MIPI C-PHY / DPHY Subsystem
E-mail This Article | Printer-Friendly Page |