|
||||||||||
Modular Design Of Level-2 Cache For Flexible IP Configuration By Thang Tran, Synopsys Inc. Abstract: The microprocessor-memory gap has been growing for over 30 years, and in that time caches have been crucial components in digital system design. All high-performance microprocessors are designed with level 1 and level 2 caches, and multicore systems include many hierarchical levels of caches. The internet of things (IoT) along with the proliferation of portable devices are a major thrust for high-end embedded systems thus the need for higher performance processors and multicore; even little flash drives include multicore systems-on-chip (SoCs). As a result level 2 cache is becoming important in embedded designs. This paper presents an innovative level 2 cache design that meets the requirements of flexibility, configurability, low power, and small area for embedded systems. I. INTRODUCTION My first cache design was more than 30 years ago, it was the 2 KB Branch Target Cache (BTC). We jokingly called it “Bill-The-Cat.” Every transistor was designed by hand, including the SRAM cell, the sense-amp, the pre-charged logic, the row decoder, the input/output drivers, and the state machine. Fast forward to today’s technology where level 2 (L2) cache is no longer a custom design. A memory compiler is used to generate the exact required array size with a great deal of flexibility for tag, status, and data arrays configurations. The L2 cache is fairly complex to design for all possible configurations and features. To a great extent, the memory hierarchy is the driving force behind processor proliferation, which has evolved from desktop and laptop PCs to smartphones, tablets, and automobiles; and on the horizon, Internet of Things (IoT) and embedded vision applications. This shift requires extensible and configurable microprocessors with an optimal balance of performance, power, and area (PPA). The L2 cache must evolve to provide flexible configurability with the lowest possible PPA to be incorporated into the configurable microprocessors. II. BACKGROUND A. Microprocessor Evolution As soon as pipeline stages are added to a microprocessor design, the speed of the microprocessor improves at a much faster rate than the speed of the memory. As shown in Figure 1, processor speed was 200 times faster than the DRAM speed in the year 2000. Going to external memory takes hundreds of clock cycles, thus data fetch becomes critical for the microprocessor performance. The data dependency penalty is an obstacle to be solved by the microprocessor designers. The cache becomes an important feature for the microprocessor. The BTC is used to fetch the first two instructions internally, allowing the subsequent instructions to be fetched from the external memory. The cache memory evolves to multiple levels and the cache hierarchy is the most effective way to reduce the data fetch time. Current microprocessors for PCs, laptops, smart phones, and tablets have built-in cache hierarchy. The PCs and laptops are often advertised with the L2 cache size. Fig. 1. Microprocessor-Memory Performance Gap [1]
Fig. 1. Microprocessor-Memory Performance Gap [1] B. L2 Cache Requirements The first generation of microprocessors was driven by the high performance required by the PC and laptop market. Once the PC and laptop market became saturated, the requirements of the next generation of microprocessors was driven by mobile devices like smartphones and tablets, which require low-power microprocessors to enable longer battery life. Almost everyone has a cell phone worldwide and many have two, making the number of cell phones in use today more than the total population of the world [2]. The maturity of the tablet and smartphone market is evidenced by Apple announcing a drop in sales for the first time in 15 years. There is no obvious single next big application. The market is diverse with a wide range of applications with different requirements. The IoT, connected automobiles, wearables, medical devices, gaming, virtual devices, smart cities, building and factories are all driving a new and diverse set of processor requirements. The ability to configure and extend a processor to meet the requirements of the application is a promising new technology model for processors. It provides high degrees of flexibility, giving SoC developers the ability to include their own proprietary custom instructions. The SoC design can also reach the optimal tradeoffs in performance, power and area by customizing the processor IP to the application requirements. Also, the customization of processor technology provides the ability to meet evolving standards. With the wide variety of applications and requirements today, from consumer to automotive, there are just too many different types of applications to standardize on a single fixed processor IP. Following the same trend, L2 cache must provide the flexibility in configuration for the SoC design. The configurations include cache size, the number of way-associative, the number of banks, the number of input ports with individual read/write configuration, flexible memory latency, power-down options, error correcting code (ECC) and parity, and cache partitioning. In order to adapt to all the requirements for L2 cache design, a new design methodology is necessary. The modular design method for L2 cache will be discussed in detail in subsequent sections. Cache coherency is another full topic by itself and will not be discussed in this paper. Throughout this paper, “configure” refers to the build-time configuration of the RTL for the L2 cache and “program” refers to the run-time options. III. MODULAR DESIGN The L2 cache design must provide the flexibility to be used in many different embedded systems and applications. Like microprocessor design, L2 cache has a pipeline, as shown in Figure 2. The requests traverse through the pipeline of the tag array and the data array to read and write data. The traditional design interlocks the requests through all the pipeline stages. The design becomes complex when considering the cache miss, the cache replacement, the data eviction, the prefetching of data and all the possible conflicts between them. The traditional design lacks the flexibility required for all the possible configurations needed for an embedded system. Fig. 2. Typical 10-Stage Pipeline of the Level 2 Cache A. Modular Design Concept The modular design concept is breaking the L2 cache into independent units. The fixed pipelining concept is no longer important. Each unit processes its own function without interference from other units. The features of modular design are:
The tag bank unit in Figure 3 is an example of modular units. The tag bank is the execution block. The queue received requests from multiple cores and sends one request at a time to the tag bank. The tag bank can take 1 or more clock cycles to access data. The latency counter is used to allow the next request to access the tag bank. Each tag bank is an independent modular unit and each tag bank unit executes the requests from its own queue. Fig. 3. Tag Bank Units Modular design has the following advantages, which will be discussed in further detail in later sections.:
B. L2 Cache Modular Units The overall view of the L2 cache is shown in Figure 4. The L2 cache is broken down into many modular units. Each data bank, each tag bank, and each input channel are individual modular units. Each modular unit independently executes the request and sends to the next modular unit.
Fig. 4. L2 Cache Block Diagram The central unit of the L2 cache is the data buffer where the tag ID is assigned to individual request. As a new request is received, an entry in the data buffer, which will be used throughout the L2 cache, is assigned to the request. The data buffer consists of:
Other L2 cache modular units are:
IV. ADVANTAGES OF MODULAR DESIGN FOR L2 CACHE The modular design makes extensive use of the queue. The same queue structure is used in all units and is shown in Figure 5. The queue is a better alternative than the shift registers in terms of power. The data is written once to an entry instead of write and read to shift to the next entry. The data is retained until the execution is completed. The request can be replayed by moving the read pointer to the saved pointer. Fig. 5. Queue Structure and Operation A. Parallel Processing for Higher Performance There are several advantages for the tag bank units in Figure 3:
The modular design is similar to the reservation stations and functional units of the microprocessor. The requests are allowed to execute in parallel and out of order in different functional units. B. Flexibility of Execution Latency Synopsys’ DesignWare® ARC® HS cores are high-performance cores with L2 cache support. The L2 cache is designed to match the clock frequency of the ARC HS cores for easy interfacing and highest possible performance but the SRAM arrays can take multiple cycles to access data. The L2 cache can be configured with wide range of cache sizes from 128 KB to 8 MB. The access latencies are different for 128 KB and 8 MB caches. With modular design, the access latency is implemented with a simple counter. As the request is sent from the queue to the tag or data bank, the latency counter will count down to zero before allowing the next request to be accepted. This process is independent from all other units in the L2 cache. This simple mechanism of the latency counter allows further flexibility:
The modular design is very flexible in memory access latency and timing. For example, the input-channel read unit does not care about where or when the data is received. As long as the data in the Data Buffer is valid, then the input-channel read unit sends the data back to the input channel. The valid data can be from data banks in 2-5 clock cycles or from the external memory in 100 clock cycles. The data banks can take additional clock cycles for ECC correction which is unknown to the input-channel read unit. Basically, the modular design allows the memory access latency to be programmed or configured based on the requirements of the IP embedded design or application. In a way, this is similar to different execution times for the simple ALU and the complex ALU. C. Hierarchical Architectural Clock Gating The modular units are independently clock gated. For example, if there is no tag miss and no eviction, then the Data Miss Unit and Eviction Buffer Unit are clock-gated off to save power. The tag, data, and status arrays are organized with independent banks. The clocks for the banks are active only if there is a valid request for the bank. If a tag bank has no request, then it is clock-gated off to save power. When there is no request from the cores, then the L2 cache is totally clock-gated off, except for the core interface buffer which remains on to receive requests from the cores. The clocks of the L2 cache units are sequentially enabled as needed to service the request. D. Flexibility of Configuration Each modular unit is independent. The queue of the modular unit can receive the input from one source or multiple sources, making configuration much simpler. Of course, there is a limitation to the number of configurations due to timing constraints. The particular configurations of this L2 cache are:
E. Hierarchical Synthesis and Unit-Level Verification Additional benefits from the modular units are hierarchical synthesis and unit-level verification:
SUMMARY The presented L2 cache has many considerations for low-power design and great flexibility for the embedded memory system. Yet because of the modular design, the L2 cache design was completed in a relatively short time with minimal resources. This L2 cache can be configured and programmed in many types of multicore systems and SoCs. While the small L2 cache of 128 KB can be used as back-side L2 cache for a single core, with the large configurable size of 8 MB and large number of input ports, it can also be used as the L3 cache in multi-cluster system. In the current diverse markets from IoT to automotive to high-end application, an extensible and configurable microprocessor is needed to achieve optimal PPA. The flexible L2 cache is an integral part of the next generation of microprocessors [3]. Many basic concepts of the modular design in the L2 cache are applicable to other design. The future designs of complex superscalar microprocessor and multithread processing units takes advantage of the modular design to improve on all design parameters of performance, clock frequency, area, power, time-to-market, scalability, and configurability. REFERENCES [1] Inouye, J., et. al., “Overcoming the Memory Wall,” Oregon State University, 2012. [2] https://en.wikipedia.org/wiki/List_of_countries_by_number_of_mobile_phones_in_use [3] Casey, F. and Tran, T., “Configurable Microprocessor for Life Essential Devices,” IP-SoC 2016 [4] “DesignWare ARCv2 ISA, Programmer’s Reference Manual for ARCv2 Processors.” ARC Synopsys, August 2016. [5] “DesignWare ARC HS Series Databook,” ARC Synopsys, August 2016. If you wish to download a copy of this white paper, click here
|
Home | Feedback | Register | Site Map |
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved. |