Enabling High Performance SoCs Through Multi-Die Re-use
Andrew Jones, Stuart Ryan (STMicroelectronics R&D Ltd)
Abstract:
This paper gives a high-level overview of a technique for rapid design of new IC designs using multiple dice packaged in a variety of aggregations allowing for differnent performance levels and price points to be achieved.
The technique relies on a new high-bandwidth low pin-count communication channel between two or more dice.
The channel allows the on-chip interconnect to be extended to bridge between chips while allowing other signals to be integrated in a low power manner. This arrangement provides the basis of a family of platforms which supports the integration of multiple chips within a package (or on a board) and permits each die to have been designed independently.
Introduction
The primary reason for the growing importance of System in package (SiP) technologies are the various workarounds for the complications of integrating heterogeneous circuits monolithically. E.g., when integrating noisy digital circuits with noise-sensitive analog circuits proves too problematic it is often expedient to circumvent the issue using a multi-chip module (MCM). Since defect density scales with area, the integration of sizable digital logic with smaller functions, e.g. RF devices, can diminish yield; SiP technologies have successfully addressed this [6].
High-density CMOS SoC methodologies involve accessing and working with a single technology library selected on the basis of its suitability for the SoC as well as more pragmatic factors to do with available manufacturing options. SoC partitioning choices are highly dependent on library construction and routing capabilities that are identified within the tools. The elevated integration
capacity of SiP can lower the complexity of each partition while lessening the size and routing complexity of the printed circuit board (PCB). It is notable that many SiP design challenges result from a lack of comparable design infrastructure between ASIC technologies and the sheer variety of layout possibilities. Packaging concepts include chip stacked on chip, flip-chip stacked on chip and chips placed side by side in a package[5].
In future CE devices, chips are increasingly likely to be constructed from multiple die within a single package. We should be clear on the motivation for this:
1) Analog and IO cells typically shrink very little in smaller geometries compared to digital logic but they do increase their cost. Typically, denser technology nodes are more expensive per unit area. So that if a given function doesn’t shrink significantly, its differential manufacturing cost is likely to be higher. Also, the use of unshrinking analog logic for IO may lead to increasingly pad-limited designs that, again, reduces silicon cost-effectiveness.
2) As the industry races to make 28-nm SoCs the standard node, there is a tension between supporting low voltage, high speed IO logic; and higher voltage interconnect technologies, for example HDMI, The lower voltage interfaces, e.g. DDR3, require a thinner transistor gate oxide e. Supporting both gate oxide thicknesses is an additional, cost burden. More tellingly, porting high-speed analog interfaces to a new process consumes significant expert resource in terms of characterisation and qualification time [8]. By decoupling the implementation of analogue blocks from that of the digital can produce a useful reduction in the time to working silicon.
The authors have completed architectures in which a SiP comprising a 32nm/28nm die containing supra-gigahertz SMP CPU cores, DDR3 controller(s) and highly-differentiated IP, is connected to a 40nm or 55nm die containing multiple analog PHYs. Because of the removal of most of the analogue blocks the 28nm/32nm die is able to optimise the benefit it derives from its higher gate density.
Unsurprisingly, the most powerful motivation for this approach is time to market (TTM). The implementation of certain types of analog circuits using “trailing edge” lithography is quite simply easier[6]. It is accompanied by the ability to re-use hardened cores, their validation, analog characterisation and, critically in some cases, their third-party certification. The SiP challenges [7] e.g. more difficult packaging and the development of architecture standards which allow interworking may be viewed as the cost to be paid for this TTM advantage.
While the initial step in this strategy would be to implement two dice in the SiP, it is easy to see that more complex devices can be considered if more dice are included, for example an RF die could be added to support a TV tuner, or perhaps a wireless networking PHY layer, within the same package.
The attractiveness of this approach increases when one looks at the increase in the number of IP blocks per design (see Figure 1 below).
Figure 1 : Past, present and future count of IP blocks per device (source: Semico research [10])
STAC: A Narrow inter-die Network
Here we describe the salient features of the STAC interconnect. Further outlines can be found in the literature e.g. [9], aspects of which have been extensively patented.
The STAC (ST Advanced inter-die Communication) port narrows the width of the on-chip interconnect to as few as 8-bits and uses a high-speed interface to bridge to a similar port on the companion die linking to its on-chip interconnect in turn. This allows the on chip network to carry packetized memory requests and responses onto a flow-controlled, token-based, high-speed network between the dice. The STAC implements a wormhole-switched ([11]) port with virtual channels and dedicated buffers to respect end-to-end Quality of Service (QoS) commitments.
Virtual Wires
A key innovation in the STAC architecture is the notion of virtual wires. Typically, a majority of communication between two die connected by a STAC will be memory transactions. Nonetheless, there will also be significant communication in the assertion and de-assertion of interrupt lines, DMA handshakes, reset request and acknowledges, power down requests and so on. These out-of-band (OOB) signals are packetized and transmitted over the STAC then multiplexed with the very same wires that carry memory-transacting packets.
Figure 2 STAC protocol levels
Memory transactions are carried by a sequence of segments over the STAC. OOB signals are mediated via a special type of STAC segment called a wire segment. OOB signals are each allocated a position within one of several bundles of wires. Each bundle is transmitted as single wire STAC segment with a bundle identifier called a virtual channel. The receiver of such a STAC segment is able to associate the state of each bit within a STAC segment with the state of a specific wire within the wire indicated by the virtual channel. The state of each wire in a bundle is not continuously transmitted but, rather, is sampled at regular intervals and it is only these samples that are transmitted in a wire STAC segment along with data traffic. In effect, each sample is used to specify the state of a register that holds the state of each OOB signals on the other side of the STAC. The STAC performs this transmission bi-directionally so that wires can be virtually connected from either side.
Figure 3 Virtual Wiring multiplexing of wire (I) packets with high(Nh) and
Low (Nl) priority memory (STNoC) packets on a STAC link
The key parameters, namely the sampling rate, the number of bundles transmitted and the priority of transmission of these bundles are all configurable. Accordingly, for reasons of power, the implementation will suppress the transmission of wire segments if there is no change from the previous transmission.
For SoCs that carry conditional access or data that is governed by a digital rights management (DRM) protocol then it is important that the STAC is secure from attack. This is true even though, in most cases, the STAC is not physically exposed to external probing. Greater security can be achieved by supporting packet encryption across the link. In some cases this is mandatory in order that the SoC conforms to the applicable security standards.
IP access rights may be configured so that all packets that originate off-die are marked as such to assist protection against potential spoofing attacks.
Quality of Service
The key to getting much of this to work efficiently is a robust quality of service infrastructure. The management of the delay experienced by virtual wire signals and memory transactions is of crucial importance. We address this problem with two STAC features:
- Control of the sample rate of virtual wires
- The implementation of priority classes for transactions
Figure 4 A STAC interface
End-to-end Quality of requires that both interconnect and resource managers (e.g. memory controllers) jointly managing system traffic. QoS is specified at the initiation of each request both for signals and for memory and subsequently used at each arbitration point. The STAC has to trade-off QoS priority with availability of channel bandwidth to optimise performance. The flexibility to dynamically alter the priority classification is key to enabling systems to adapt their behaviour for different system use cases.
Table 1 Example STAC traffic Classification
The mapping between traffic classes and requesters is highly configurable. The indications above in Table 1 are only indicative for each class. For example we would normally place CPUs in class A. An ARM dual-core Cortex A9 may require ~800MBytes/Sec and support a deep pipeline of outstanding memory accesses but its performance is latency sensitive.
A 1080p30 display may consume ~200MB/sec but is significantly more latency tolerant than most CPUs. A 3D GPU such as the ARM MALI-400 MP may use up to ~3.2GB/Sec but is thankfully quite latency tolerant. These figures argue against static priority schemes in systems that expect to support a wide variety of use cases.
Example architectures
Consider a high-end set-top box architecture may be conceived as shown below:
Figure 5 High-level block diagram of STB SoC
A partitioning which is partly determined by issues of porting analog I/P discussed earlier and partly influenced by die re-use aspirations leads to the kind of SiP partition below in Figure 3. In which there is a digital processing die that performs:
- Video transport pipeline,
- Conditional access,
- Decompression,
- Image quality improvement
- Rendering the UI
- Hosting any web applications
- Compositing.
The complementary I/O processing die performs
- High speed I/O e.g. SATA, USB, PCIe Gigabit Ethernet interfaces,
- Analog TV out interfaces
- Digital TV out interfaces
The digital processing die can be coupled with other, less generic, I/O die that performs specialised front-end processing for particular markets such as terrestrial, satellite or cable for cost-efficient solutions.
Figure 6 SiP partition for STB SoC
One area that the authors studied was the emerging gateway market. A headed gateway is a device which combines back-end decode capability with a front-end modem powerful enough to supply video, data and voice services to all devices in the home. In Figure 4 below illustrates how
The STB digital die may be attached to a DOCSIS 3 front-end die to construct a headed gateway.
Figure 7 SiP dies for Headed Gateway
This further allows the mix of technologies to be different from other SiP uses of the die.
Architectural Flexibility
A key driver in the industry is the quest for increased margin that sometimes implies greater optimisation of SoCs for particular Markets. For market coverage this can imply a number of different SoCs each tailored to a particular feature/price point. Commonly, the availability of design team resource is one obstacle which prevents the implementation of this type of specialisation. So finding flexible and higher productivity forms of design re-use are crucial.
A natural progression from IP re-use is towards greater die re-use within a SiP as it also enables re-utilization of tasks within the SoC process which are normally repeated for each integration. This includes reconfiguring certain volatile IP such as interconnect but also one also gets to re-use the timing, power and area optimizations of soft IP.
Of course, part of the challenge of a good re-use strategy is choosing the system partitioning which affords the greatest flexibility. There are 3 key flexibilities afforded by partitioning SiP along the lines indicated earlier.
The first comes from the observation that in the Set-top box market many devices are distinguished by the number and type of I/O interfaces as much as their ability to decode, composite and render digital images and sounds. E.g. the number of Ethernet ports, the type of USB ports (2 or 3, OTG or host only) the speed of SATA interfaces (3 Gbit or 6 Gbit) and so on. Also, it is a common observation that the lifecycle of I/O interface popularity tends to be unsynchronised with media processing requirements. Although this is not always true - some media processing innovations bring with them the requirement for a different type of interface - it is true frequently enough to be a useful observation.
Developing 2 varieties of media processing die and 2 varieties of I/O processing die can yield 4 different SiPs with a design resource utilisation closer to developing only 2 chips.
In addition there is a hybridisation benefit that results from the realisation that with appropriate architectural standards an I/O processing die can be enhanced with computation engines or that a media processing die can be enhanced with an I/O interface subject to the technology constraints mentioned earlier. For example STMicroelectronics is readying a cable modem device that can be deployed as a standalone gateway router or in conjunction with a media-processing die to comprise a cost-efficient headed gateway.
Conclusions
This paper has described and exemplified a new architectural approach for implementing a communication channel between systems on different silicon dice. It illustrated how the on-chip interconnect can be extended between chips while retaining high bandwidth and low latency. In addition, this architecture allows other signals to be integrated within this communication channel to reduce pin-count and power consumption in a manner which allows arbitrary implementations of such interfaces to retain their interface compatibility. STMicroelectronics has recently started deploying this technology across a range of consumer video devices.
Acknowledgements
Thanks to STMicroelectronics Grenoble, France R&D team as a whole and in particular, Ignazio Urzi & Dominique Henoff, but also Alberto Scandurra of the R&D team in Catania, Italy, for their experience, insights and pragmatism in implementing the first instantiation of these idea.
References
[1] Deaves R. and Jones A., “A Toolkit for Rapid Modeling, Analysis and Verification of SoC Designs”, IPSOC, Nov. 2003.
[2] Deaves R. and Jones A., “An IP-based SoC Design Kit for Rapid Time-to-Market”, IPSOC, Dec. 2002.
[3] Jones A. and Ryan S., “A re-usable architecture for functional isolation of SoCs”. IP 07 – IP Based Electronic System Conference. Dec 2007
[4] King L. Tai, “System-in-package (SIP): challenges and opportunities”, Proceedings of the 2000 conference on Asia South Pacific design automation, p.191-196, January 2000, Yokohama, Japan
[5] Trigas, C. “Design challenges for system-in-package vs system-on-chip” Custom Integrated Circuits Conference, 2003. Proceedings of the IEEE 2003 P.663 - 666
[6] Goetz, M. ”System on chip design methodology applied to system in package architecture” Electronic Components and Technology Conference, 2002. Proceedings. 52nd Pp 254 - 258
[7] Pfahl R.C and Adams,J. “System in Package Technology”, International. Electronics Manufacturing Initiative (iNEMI), 2005.
[8] Udo Sobe, Achim Graupner, Enno Böhme, Andreas Ripp and Michael Pronath, |”Analog IP Porting by Topology Conversion and Optimization” IP - ESC 2009 Conference Dec. 2009, Grenoble, France.
[9] Jones A. and Ryan S., |STAC: Advanced inter-die communication Technology” IP-SOC10 Grenoble, France
[11] Dally W.J. and Towles B. (2004). "13.2.1". “Principles and Practices of Interconnection Networks”. Morgan Kaufmann Publishers, Inc.
|
Related Articles
- Achieving High Performance Non-Volatile Memory Access Through "Execute-In-Place" Feature
- OCP 'tags' support high-performance SoCs
- Merged-logic-type embedded DRAM suits high-performance SoCs
- New Embedded DRAM Solutions for High-Performance SoCs
- NoCs and the transition to multi-die systems using chiplets
New Articles
- Quantum Readiness Considerations for Suppliers and Manufacturers
- A Rad Hard ASIC Design Approach: Triple Modular Redundancy (TMR)
- Early Interactive Short Isolation for Faster SoC Verification
- The Ideal Crypto Coprocessor with Root of Trust to Support Customer Complete Full Chip Evaluation: PUFcc gained SESIP and PSA Certified™ Level 3 RoT Component Certification
- Advanced Packaging and Chiplets Can Be for Everyone
Most Popular
- System Verilog Assertions Simplified
- System Verilog Macro: A Powerful Feature for Design Verification Projects
- UPF Constraint coding for SoC - A Case Study
- Dynamic Memory Allocation and Fragmentation in C and C++
- Enhancing VLSI Design Efficiency: Tackling Congestion and Shorts with Practical Approaches and PnR Tool (ICC2)
E-mail This Article | Printer-Friendly Page |