STBus asynchronous decoupler: an answer to the IP integration issues in future technologies
and Daniele Mangano - University of Messina, Faculty of Engineering -- Messina Italy
Abstract
The block called Asynchronous Decoupler allows asynchronous communication between two synchronous IPs endowed with an STBus interface, according to the GALS (Globally Asynchronous Locally Synchronous) paradigm.
Background
Modern Systems-on-Chip (SoCs) are composed of a large number of IPs placed on the same silicon die and interconnected through an on-chip communication system, either a traditional bus or an innovative Network-on-Chip (NoC). These IPs operate at high clock frequencies that often differ from one IP to another.
The growing silicon die size and the high operating frequencies make clock skew a problem, usually solved with the traditional approach of a well-balanced clock tree.
However, timing closure and clock tree balancing are becoming harder and harder, requiring several respins to converge; hence, the traditional balanced clock tree solution loses efficiency.
Consequently, a new design paradigm has been developed: the GALS paradigm (Globally Asynchronous Locally Synchronous), according to which a complex design is partitioned into a suitable number of blocks, each synchronous to a local clock, while the communication among them is performed in an asynchronous fashion.
The GALS paradigm gives the advantage of reduced clock skew constraints and power consumption thanks to the smaller local clock trees.
Asynchronous communication between functional blocks operating at different frequencies then becomes the primary problem to solve.
Classical asynchronous solutions have the drawback of relying on a specific communication protocol, usually not compatible with existing synchronous protocols. Moreover, asynchronous buses relying on delay-insensitive encoding (such as 1-of-4 and dual rail) incur an additional cost in terms of both logic and wires, while asynchronous buses that do not use such an encoding require the delays of the wires connecting the system modules to be checked and balanced.
Solution
The proposed solution is an interface block called Asynchronous Decoupler (referred to as AD in the rest of the paper) allowing asynchronous communication between two synchronous IPs endowed with an STBus interface (STBus [1] is a proprietary communication system developed by STMicroelectronics).
Because the asynchronous link affects both latency and available bandwidth, the AD is particularly suitable for asynchronous communication between IPs that are far from each other or that have moderate bandwidth requirements.
The basic idea behind the AD is that of adapting, in a transparent manner, the synchronous layer of the STBus communication protocol to an asynchronous one, relying on a delay-insensitive encoding.
The AD is basically composed of three main building-blocks:
- a first module responsible for adapting the synchronous STBus physical layer to an asynchronous layer, using a four-phase handshake; it is seen as a target by the STBus initiator IP;
- a second module responsible for the delay-insensitive encoding and decoding and the completion detection according to the m-of-n scheme (with m=4 and n=8);
- a third module responsible for adapting the asynchronous layer to the synchronous STBus physical layer, also implementing a four-phase handshake, and seen as an initiator by the STBus target IP.
The m-of-n encoding relies on the fact that exactly m bits out of n are at the logic value ‘1’ as soon as the data are stable and ready to be transmitted. The transmission request is implicit in the data encoding and is considered asserted as soon as the number of ‘1’s on the data bus equals m; this implies the need to start from an idle state in which all the data bits equal ‘0’ (RZ, Return to Zero transmission).
With an m-of-n encoding technique, the number of symbols that can be transmitted with a data word is the binomial coefficient C(n, m), and the maximum number of symbols is obtained when m equals n/2.
The following table shows the number of symbols that can be represented with such an encoding (n/2-of-n) for different numbers of wires. It also shows the length of the binary data word that these symbols can encode, and the redundancy (in terms of wires) added by the n/2-of-n encoding with respect to a traditional binary representation.
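The implicit request can be illustrated with a minimal Python sketch. It is only a behavioural model under stated assumptions: the function names and the example codeword are hypothetical, and the real block is combinational hardware, not software.

```python
from itertools import permutations

M = 4  # number of wires that must be high in a valid 4-of-8 codeword

def request_asserted(wires, m=M):
    """The request is implicit: it is seen once exactly m wires are high."""
    return sum(wires) == m

def steps_until_request(codeword, arrival_order):
    """Raise the '1' wires of `codeword` in `arrival_order`, starting from idle.

    Returns the number of wire transitions observed before the request is seen.
    """
    wires = [0] * len(codeword)            # RZ idle state: all wires at '0'
    for step, index in enumerate(arrival_order, start=1):
        wires[index] = 1
        if request_asserted(wires):
            return step
    return None                            # codeword never completed

codeword = [1, 0, 1, 1, 0, 0, 1, 0]        # hypothetical 4-of-8 codeword
high_wires = [i for i, bit in enumerate(codeword) if bit]

# Whatever the relative wire delays, the request appears only after the 4th '1'.
results = {steps_until_request(codeword, order) for order in permutations(high_wires)}
print(results)
```

Because the receiver reacts only to the complete codeword, the order in which the individual wires switch does not matter, which is the property that makes the encoding delay-insensitive.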
Wires | Symbols | Bits | Redundancy |
2 | 2 | 1 | 100% |
4 | 6 | 2 | 100% |
6 | 20 | 4 | 50% |
8 | 70 | 6 | 33.3% |
10 | 252 | 7 | 42.9% |
12 | 924 | 9 | 33.3% |
14 | 3432 | 11 | 27.3% |
16 | 12870 | 13 | 23.1% |
32 | 601080390 | 29 | 10.3% |
64 | 1.832x10^18 | 60 | 6.7% |
128 | 2.395x10^37 | 124 | 3.2% |
256 | 5.768x10^75 | 251 | 2.0% |
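The rows of the table can be reproduced with a few lines of Python. This is a sketch under the assumption that "Bits" is the largest binary word length b with 2^b not exceeding the number of symbols, and that redundancy is (wires - bits) / bits; the function name is illustrative.

```python
from math import comb

def half_of_n_row(n):
    """Return (symbols, bits, redundancy_percent) for an n/2-of-n code on n wires."""
    symbols = comb(n, n // 2)            # codewords with exactly n/2 ones
    bits = symbols.bit_length() - 1      # largest b such that 2**b <= symbols
    redundancy = 100.0 * (n - bits) / bits
    return symbols, bits, redundancy

for n in (2, 4, 6, 8, 10, 12, 14, 16, 32):
    symbols, bits, redundancy = half_of_n_row(n)
    print(f"{n} | {symbols} | {bits} | {redundancy:.1f}% |")
```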
Even though the lowest redundancy is obtained with large numbers of wires, such a solution is not practical, since the complexity of the encoders and decoders would grow accordingly.
A good compromise appears to be the 4-of-8 encoding, which gives a redundancy of only 33.3% in terms of wires and makes 70 symbols available (enough to carry a 6-bit word) with only 8 wires.
The asynchronous communication technique based on this approach therefore requires collecting the wires of the parallel buses (data, address, control) in groups of 6 wires each and adding 2 redundant wires to each group in order to apply the 4-of-8 encoding.
Encoding logic
The encoding process must associate each of the 64 (2^6) possible input patterns with a unique and unambiguous pattern among the 70 possible output ones, according to the 4-of-8 encoding scheme.
The encoding logic consists of the following main blocks:
- a sorter, a combinational module that counts the number of ‘1’s at the inputs;
- a combinational circuit performing the actual encoding;
- a number of multiplexers and their driving circuits.
The decoding logic consists of the following main blocks:
- a combinational circuit performing the actual decoding;
- a request detector, responsible for telling the asynchronous receiver that the transmitted data are stable.
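The input/output behaviour of the encoder, decoder and request detector can be sketched with a lookup table. This is not the sorter-and-multiplexer circuit described above: the particular assignment of 6-bit values to codewords (the first 64 codewords in numeric order) is an arbitrary assumption made for illustration, and all names are hypothetical.

```python
from itertools import combinations

N_WIRES, M_ONES, DATA_BITS = 8, 4, 6

# All 70 4-of-8 codewords as 8-bit integers, in a fixed (numeric) order.
CODEWORDS = sorted(
    sum(1 << pos for pos in ones)
    for ones in combinations(range(N_WIRES), M_ONES)
)
ENCODE = CODEWORDS[: 1 << DATA_BITS]        # assumed mapping: first 64 codewords
DECODE = {code: value for value, code in enumerate(ENCODE)}

def encode(value):
    """Map a 6-bit value (0..63) to its 4-of-8 codeword."""
    return ENCODE[value]

def request_detected(wires):
    """Data are complete when exactly four of the eight wires are high."""
    return bin(wires).count("1") == M_ONES

def decode(wires):
    """Return the 6-bit value, or None if the wires hold no mapped codeword."""
    if not request_detected(wires):
        return None                         # idle or still in transition
    return DECODE.get(wires)                # None for the 6 unused codewords

print(len(CODEWORDS), len(ENCODE))
```

Six of the 70 codewords remain unused in this mapping; the source does not say whether the actual design assigns them any meaning.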
The purpose of the synchronous layer is to make the STBus AD appear as a target from an initiator perspective and as an initiator from a target perspective, hiding the asynchronous communication and its issues from the synchronous initiators and targets.
Because of the features of the STBus protocol, an STBus AD implementing the type 1 protocol (intended for access to low-performance peripherals) has initiator and target interfaces that are indistinguishable except for signal direction. An AD implementing the type 2 or type 3 protocol (intended for high-performance memory access) instead requires four distinct interfaces:
- STBus initiator/Asynchronous link
- Asynchronous link/STBus interconnect
- STBus interconnect/Asynchronous link
- Asynchronous link/STBus target
The described asynchronous communication methodology suffers from the following two problems:
- the time required to complete a transmission is about four times the average delay of the wires implementing the physical level;
- the need to avoid metastability further reduces the throughput, because the remedy for metastability is to use synchronizers, whose presence increases the latency of the asynchronous link.
To evaluate the implementation cost of the STBus Asynchronous Decoupler, we relied on the consideration that, in general, an N-input logic gate can be implemented using N-1 2-input logic gates.
On this basis, it has been determined that the 4-of-8 encoder can be implemented using about 100 2-input logic gates (36 for the sorter), and the related decoder using about 50 2-input logic gates, plus about 50 more for the request detector.
This means that, in order to implement an asynchronous link for a 256-bit wide bus, since 43 groups of 6 bits each are required, the number of 2-input logic gates needed is 43 x (100 + 50 + 50), that is, 8600 logic gates.
This approach adds 86 wires (2 per group), 170 fewer than the 256 additional wires required by a more classical approach such as dual-rail encoding.
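The arithmetic behind these figures can be checked with a short Python sketch. The gate counts are the approximate values quoted above; the function and parameter names are illustrative, and dual rail is assumed to add one extra wire per data bit.

```python
from math import ceil

def link_cost(bus_bits, group_bits=6, extra_wires_per_group=2,
              encoder_gates=100, decoder_gates=50, detector_gates=50):
    """Estimate gates and extra wires for a 4-of-8 encoded asynchronous link."""
    groups = ceil(bus_bits / group_bits)             # 6-bit groups needed
    gates = groups * (encoder_gates + decoder_gates + detector_gates)
    extra_wires = groups * extra_wires_per_group     # 2 redundant wires per group
    dual_rail_extra = bus_bits                       # assumed: one extra wire per bit
    return {
        "groups": groups,
        "gates": gates,
        "extra_wires": extra_wires,
        "wires_saved_vs_dual_rail": dual_rail_extra - extra_wires,
    }

print(link_cost(256))
```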
Benefits
The greatest advantage of the AD is the possibility to completely decouple part of the system (initiator and/or target IPs) from the interconnect (STBus). Unlike the asynchronous buses proposed so far, this solution integrates asynchronous communication within an intrinsically synchronous interconnect system such as the STBus, thus allowing the reuse of existing synchronous protocols.
Another advantage of such a solution compared to fully asynchronous solutions is the possibility to place the block only where it is strictly required, minimizing the amount of redundant logic and wires needed to implement asynchronous channels.
Moreover, thanks to the delay-insensitive encoding used by the block, no delay balancing is required on the wires supporting the communication at the physical level.
Performance
Some simulation results are reported in order to give an idea of the performance that can be achieved using the STBus Asynchronous Decoupler as transmission link.
An 8-wire link with the (arbitrarily chosen) delays reported in the table below has been simulated.
Wire | Delay (ns) |
1 | 180 |
2 | 220 |
3 | 230 |
4 | 210 |
5 | 310 |
6 | 200 |
7 | 255 |
8 | 245 |
Ack | 261 |
Simulations have been run with different values of the frequency of both initiator and target clock (clk_i and clk_t respectively).
A first observation from the simulation results is that the average throughput increases as the number of cells transmitted back-to-back increases. This is due to the higher speed at which the first cell is transmitted, since this is the only one that does not need to wait for the completion of the handshake related to a preceding cell.
The instantaneous throughput, calculated as the ratio between the number of bits contained in a transmitted cell and the time required to transmit the cell, shows evident fluctuations due to the different wire delays.
In fact, since only four out of eight wires are involved in each transmission, the actual delay between two consecutive cells, and therefore the instantaneous throughput, depends on which of the eight wires are actually involved in the transmission.
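One simple way to picture this dependence is the Python sketch below. It rests on assumptions that are not taken from the simulated design: a four-phase cycle is modelled as two traversals of the slowest active data wire plus two traversals of the ack wire, synchronizer and logic delays are ignored, and each cell is assumed to carry 6 bits. The wire delays are those of the table above.

```python
DELAY_NS = {1: 180, 2: 220, 3: 230, 4: 210, 5: 310, 6: 200, 7: 255, 8: 245}
ACK_DELAY_NS = 261
BITS_PER_CELL = 6          # assumption: one 4-of-8 codeword carries 6 data bits

def cycle_time_ns(active_wires):
    """Assumed four-phase cycle: data set, ack rise, data reset, ack fall."""
    slowest = max(DELAY_NS[w] for w in active_wires)   # wait for the last '1'
    return 2 * slowest + 2 * ACK_DELAY_NS

def instantaneous_throughput_mbps(active_wires):
    """Bits per cell over cycle time, converted from bit/ns to Mbit/s."""
    return BITS_PER_CELL / cycle_time_ns(active_wires) * 1e3

fast = (1, 2, 4, 6)        # codeword using the four fastest wires
slow = (3, 5, 7, 8)        # codeword using the four slowest wires
for name, wires in (("fast", fast), ("slow", slow)):
    print(name, cycle_time_ns(wires), round(instantaneous_throughput_mbps(wires), 2))
```

Under this model, two codewords that differ only in which wires they raise give visibly different cycle times, which is the kind of fluctuation described above.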
Analyzing the throughput of the link as a function of both clk_i and clk_t, it is possible to notice that:
- a saturation effect appears above certain values of clk_t;
- below a rather low value of clk_t the throughput is basically constant; we call this particular frequency the cut-off frequency.
Other interesting properties outlined by the simulations are:
- for high values of clk_i the throughput is independent of clk_t but depends on clk_i;
- on the contrary, for low values of clk_t the throughput does not depend on clk_i but depends on clk_t;
- the cut-off frequency is an increasing function of clk_i;
- the cut-off frequency is almost equal to 1 / (2 x t_ack + 2 x t_avg), where t_ack is the propagation delay of the ack wire and t_avg is the average propagation delay of the eight data wires.
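As a worked example, the last relation can be evaluated for the wire delays listed in the table above. This is only a sketch of the stated approximation; the function name is illustrative, and the resulting figure is computed here rather than taken from the simulation results.

```python
DATA_DELAYS_NS = [180, 220, 230, 210, 310, 200, 255, 245]
ACK_DELAY_NS = 261

def cutoff_frequency_hz(data_delays_ns, ack_delay_ns):
    """Approximate cut-off: inverse of 2 x ack delay + 2 x average data delay."""
    average_ns = sum(data_delays_ns) / len(data_delays_ns)
    period_ns = 2 * ack_delay_ns + 2 * average_ns
    return 1e9 / period_ns                 # ns -> Hz

f_cutoff = cutoff_frequency_hz(DATA_DELAYS_NS, ACK_DELAY_NS)
print(f"{f_cutoff / 1e6:.3f} MHz")
```

With an average data-wire delay of 231.25 ns, the denominator is 984.5 ns, giving a cut-off of roughly 1 MHz for this link.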
Because of its properties and performance, the STBus Asynchronous Decoupler is particularly suitable for connecting to the STBus interconnect those IPs that require only moderate bandwidth and do not have strict low-latency requirements. In such cases, expensive synchronous links can be replaced with equivalent long and slow asynchronous links.
Acknowledgments
The work described in this paper has been carried out by the STMicroelectronics OCCS (On Chip Communication System) team in cooperation with the Microelectronics group of the faculty of engineering of the University of Messina (Italy).
It’s thanks to the outstanding collaboration between these two groups that this publication has been made possible.
References
[1] STBus Functional Specs, STMicroelectronics, public web support site, http://www.stmcu.com/inchtml-pages-STBus_intro.html, April 2003