1Tb/s 3W Inductive-Coupling Transceiver IP for 3D-Stacked SiP
¹Keio University, Yokohama, Japan; ²NEC Corporation, Sagamihara, Japan
Abstract
A 1Tb/s 3W inter-chip transceiver transmits clock and data by inductive coupling at a clock rate of 1GHz and a data rate of 1Gb/s per channel. 1024 data transceivers are arranged with a pitch of 30μm in a layout area of 1mm². The total layout area including 16 clock transceivers is 2mm² in 0.18μm CMOS, and the chip thickness is reduced to 10μm. Bi-Phase Modulation (BPM) is employed for the data link to improve noise immunity, reducing power in the transceiver. 4-phase Time Division Multiplexing (TDM) reduces crosstalk and channel pitch. The Bit Error Rate (BER) is lower than 10⁻¹³ with a 130ps timing margin.
Introduction
The performance gap between computation within a chip and communication between chips is increasing, making inter-chip communication a bottleneck in the development of high-performance LSI systems. One approach to realizing high-speed interfaces is to shorten the chip-to-chip distance. System in Package (SiP) reduces the chip-to-chip distance significantly by thinning chips and stacking them on each other in a package, which provides strong motivation to develop high-speed, low-power, and high-density interfaces between three-dimensionally (3-D) stacked chips.
Several 3-D interface technologies have been investigated [1]-[4]. Mechanical wired approaches, such as microbumps or Through-Si Vias (TSVs), are employed in [1],[2]. Electrical wireless approaches based on capacitive coupling or inductive coupling are utilized in [3],[4]. We have developed an inductive-coupling interface [4]-[7] where chips are stacked and inductively coupled by on-chip metal inductors. A transmitter changes the current in the metal inductor, and a receiver samples the voltage induced through inductive coupling and then recovers the data. The inductive-coupling interface has many advantages over the wired interface. 1) Cost is lower, since the interface (a metal inductor) can be implemented in a standard LSI process, while the wired interface requires an additional mechanical process for fabrication. 2) Scaling is easier, since the inductive-coupling interface removes the scaling limitation imposed by the mechanical process in the wired approach. The inductive-coupling interface is scaled down by shortening the vertical distance, which can be reduced to several micrometers in face-to-face stacked chips. 3) Reliability is higher. The inductive-coupling interface is a non-contact scheme and chips are detachable. By using the interface as a test head, individual chips can be tested before assembly without damaging any chips. Power for the chips under test can also be transferred through the inductive coupling. Even if the power transfer efficiency is low, the chips can operate since the tester can transmit large power. 4) Area-consuming and highly capacitive ESD protection devices can be removed owing to the non-contact scheme. 5) The inductive-coupling interface can communicate through circuits. Transceiver circuits can be placed under the metal inductor to save layout area. Indeed, the transceiver circuits are placed under the metal inductor in this work. In addition, the inductive-coupling interface overcomes some limitations of the capacitive-coupling interface: it enables inter-chip communication across stacks of more than two chips, as reported in [6], while the capacitive-coupling interface is limited to two chips stacked face-to-face [3],[8]-[10]. Since chips can be stacked face-up, power and ground can be provided by bonding wires in low-power applications such as mobile phones or digital cameras. If one of the chips consumes higher power, it can be placed at the bottom and stacked face-down on an area-bump package. For high-performance and scaled systems, TSVs may be necessary to provide power through all stacked chips. However, advanced fine-pitch TSVs and at-speed testing are not required just for DC connections, so the cost and Known-Good-Die (KGD) problems do not occur.
Fig.1 Block diagram of inductive-coupling transceiver.
At ISSCC 2005, a 195Gb/s 1.2W inductive-coupling transceiver was reported [6] in which 195 transceiver channels are arranged with a pitch of 50μm. That transceiver communicates at 195Gb/s with a power efficiency of 6mW/Gb/s (6pJ/b) and an area efficiency of 2.5mm²/Tb/s. In this work, a state-of-the-art inductive-coupling transceiver is presented for data communication above 1Tb/s with higher power and area efficiency. The clock is also provided by inductive coupling for the first time.
Fig.2 BPM data transceiver and simulated waveforms.
Fig.3 (a) Simulated waveforms and (b) calculated BER in NRZ and BPM signaling.
Fig.4 Clock transceiver and simulated waveforms.
Fig.5 4-phase TDM.
Inductive-Coupling Transceiver
Figure 1 presents the block diagram of the inductive-coupling transceiver. The transceiver comprises 16 slices of a 64ch block, yielding 1024 data transceiver channels in total. Each 64ch block consists of 64 data transceivers and one clock transceiver. The transmitter clock Txclk is transmitted through the inductive coupling, and the receiver clock Rxclk is recovered by the clock transceiver. The clock frequency is 1GHz. The Phase Interpolator (PI) generates 4 time slots in one clock cycle by creating 4-phase clocks from both Txclk and Rxclk for Time Division Multiplexing (TDM). Data transceivers are divided among the time slots to reduce crosstalk. Each data transceiver communicates at 1Gb/s/channel. A data bandwidth of 1Tb/s is obtained from 1024 parallel data links.
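As a quick check of the channel arithmetic above, the following Python sketch recomputes the channel count, aggregate bandwidth, and TDM slot width from the figures quoted in the text. The constant and function names are illustrative, and the equal slot widths and even split of channels among slots are assumptions rather than statements from the design.

```python
# Figures quoted in the text; all names here are illustrative only.
SLICES = 16                # number of 64ch blocks
DATA_CH_PER_SLICE = 64     # data transceivers per block
CLOCK_CH_PER_SLICE = 1     # clock transceivers per block
CLOCK_HZ = 1e9             # 1GHz Txclk / Rxclk
RATE_PER_CH_BPS = 1e9      # 1Gb/s per data channel
TDM_PHASES = 4             # time slots per clock cycle


def link_summary():
    """Recompute the headline numbers of the transceiver array."""
    data_channels = SLICES * DATA_CH_PER_SLICE
    clock_channels = SLICES * CLOCK_CH_PER_SLICE
    aggregate_bps = data_channels * RATE_PER_CH_BPS
    # Assumption: the four time slots split the clock period equally.
    slot_ps = 1e12 / CLOCK_HZ / TDM_PHASES
    # Assumption: channels in a block are split evenly among the slots.
    channels_per_slot = DATA_CH_PER_SLICE // TDM_PHASES
    return {
        "data_channels": data_channels,
        "clock_channels": clock_channels,
        "aggregate_tbps": aggregate_bps / 1e12,
        "slot_ps": slot_ps,
        "channels_per_slot_per_block": channels_per_slot,
    }


summary = link_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
```

The aggregate comes out slightly above 1Tb/s because 1024 channels at 1Gb/s give 1.024Tb/s.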
A. Bi-Phase Modulation Data Transceiver
Figure 2 shows the schematic diagram of the data transceiver and simulated waveforms. Bi-Phase Modulation (BPM) signaling is employed for the data link. At the positive edge of Txclk, a pulse generator in the data transmitter produces negative voltage pulses whose width is determined by the delay of the inverter chain. NOR and NORB perform pulse shaping. A succeeding H-bridge circuit generates a positive or negative current pulse IT according to Txdata. In every clock cycle, a positive pulse is generated when Txdata is high, and a negative pulse is generated when Txdata is low. A sense-amplifier flip-flop in the data receiver samples the positive or negative voltage pulse VR corresponding to the polarity of IT, and then recovers Rxdata.
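The BPM mapping described above can be illustrated behaviorally. The sketch below is a minimal Python model, not a circuit simulation: each bit becomes one pulse polarity per clock cycle, and the receiver decides from the sign of the sampled voltage. The 100mV coupled amplitude and the alternating 10mV disturbance are hypothetical values chosen for illustration, and all function names are invented here.

```python
def bpm_transmit(tx_bits):
    """One pulse per clock cycle: +1 for a high bit, -1 for a low bit."""
    return [1 if bit else -1 for bit in tx_bits]


def induced_voltage(pulses, coupling_mv=100.0, disturbance_mv=None):
    """Scale each pulse by a hypothetical coupled amplitude and add a disturbance."""
    if disturbance_mv is None:
        disturbance_mv = [0.0] * len(pulses)
    return [coupling_mv * p + d for p, d in zip(pulses, disturbance_mv)]


def bpm_receive(vr_mv):
    """Recover each bit from the polarity of the sampled voltage."""
    return [1 if v > 0.0 else 0 for v in vr_mv]


tx_bits = [1, 1, 0, 0, 1, 0]
disturbance = [10.0, -10.0, 10.0, -10.0, 10.0, -10.0]  # hypothetical crosstalk, mV
vr = induced_voltage(bpm_transmit(tx_bits), disturbance_mv=disturbance)
rx_bits = bpm_receive(vr)
print(rx_bits)
```

Because a pulse of known magnitude is present in every cycle, the decision reduces to a sign test, which is why the receiver threshold can sit at zero in this model.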
In the previous work [6], Non-Return-to-Zero (NRZ) signaling was employed. Figure 3(a) shows the VR signals in NRZ and BPM signaling. In NRZ signaling, no VR signal is generated while the same data value continues. In BPM signaling, on the other hand, a VR signal is generated in every clock cycle. Noise immunity of the receiver is therefore improved, and the receiver’s sensitivity in BPM signaling can be maximized, while that in NRZ signaling has to be set low enough to ignore crosstalk, so a larger transmit pulse energy is required in NRZ signaling. The high sensitivity in BPM signaling enables a lower Bit Error Rate (BER) with smaller transmission power. Figure 3(b) presents the calculated BER of NRZ and BPM signaling as a function of the transmit pulse energy. The pulse energy in BPM signaling is reduced by a factor of 3 for a BER of 10⁻⁹. For lower BER, the energy reduction becomes more significant. Although switching activity is doubled in BPM signaling, the net power dissipation of the data transceiver is reduced.
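The difference between the two schemes can be seen by counting the clock cycles in which the receiver sees no pulse. The Python sketch below assumes that in NRZ signaling a pulse appears only when the data value changes (with polarity following the direction of the change), which matches the statement that no VR signal is generated while the same data continues; the sign convention and function names are assumptions made for illustration.

```python
def nrz_received_pulses(bits, initial=0):
    """NRZ: a pulse appears only on a data transition (+1 rising, -1 falling)."""
    pulses = []
    previous = initial
    for bit in bits:
        pulses.append(bit - previous)  # 0 when the same data continues
        previous = bit
    return pulses


def bpm_received_pulses(bits):
    """BPM: a pulse appears in every cycle, with polarity set by the bit."""
    return [1 if bit else -1 for bit in bits]


def silent_cycles(pulses):
    """Number of cycles in which the receiver sees no signal at all."""
    return sum(1 for p in pulses if p == 0)


bits = [1, 1, 1, 0, 0, 1]
nrz = nrz_received_pulses(bits)
bpm = bpm_received_pulses(bits)
print("NRZ pulses:", nrz, "silent cycles:", silent_cycles(nrz))
print("BPM pulses:", bpm, "silent cycles:", silent_cycles(bpm))
```

In the silent NRZ cycles the receiver must hold its previous decision while rejecting crosstalk, which is the reason its sensitivity has to be lowered; BPM has no such cycles.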
B. Wireless Clock Transceiver
Figure 4 presents the wireless clock transceiver. An H-bridge circuit in the transmitter is driven by Txclk through buffers INV and INVB, whose fanout is set high enough (~10) to reduce harmonics in Txclk and generate a triangular current ITC. Pre-amplifiers in the receiver amplify the received voltage VRC. Succeeding feedback inverter chains recover the full-swing Rxclk.
Fig.6 Inductive-coupling transceiver with test circuits.
Fig.7 Chip microphotograph.
Fig.8 Infra-red photo of stacked test chips.
Fig.9 Experimental setup.
The clock transceiver is an asynchronous circuit, like those in [8]-[13], and consumes considerable power because of static power dissipation. However, with the recovered synchronous clock, highly sensitive yet low-power circuits can be used for the data transceiver. Since the data link dominates the power dissipation of the 64ch block, total power dissipation is reduced by employing the synchronous data transceiver.
C. 4-phase TDM
4-phase TDM is utilized for crosstalk reduction. Figure 5 describes the scheme and simulated waveforms. The phase interpolator generates 4-phase clocks that are assigned in a checkerboard-like pattern in the data transceiver array. Simulated waveforms are shown on the left. When the channel pitch is reduced to 30μm, the crosstalk increases to the same level as the signal. 2-phase TDM reduces crosstalk to half of the signal; however, this is not low enough for communication with BER lower than 10⁻¹³. 4-phase TDM reduces crosstalk to 10mV peak and enables BER lower than 10⁻¹³.
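One way to picture a checkerboard-like 4-phase assignment is a repeating 2×2 tile of phases. The Python sketch below assumes that tiling and an 8×8 arrangement for a 64ch block; neither detail is given in the text, so both are hypothetical. It checks that no channel shares a time slot with any of its eight nearest neighbours, so the nearest channel active in the same slot is two pitches away.

```python
def phase_of(row, col):
    """Hypothetical 2x2 tiling: phases 0..3 repeat every two rows and columns."""
    return 2 * (row % 2) + (col % 2)


def phase_map(rows, cols):
    return [[phase_of(r, c) for c in range(cols)] for r in range(rows)]


def neighbours_share_phase(grid):
    """True if any channel has the same phase as one of its 8 neighbours."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        if grid[rr][cc] == grid[r][c]:
                            return True
    return False


PITCH_UM = 30.0                      # channel pitch from the text
grid = phase_map(8, 8)               # assumed 8x8 layout of a 64ch block
same_phase_pitch_um = 2 * PITCH_UM   # nearest same-slot channel under this tiling

for row in grid:
    print(" ".join(str(p) for p in row))
print("adjacent channels share a slot:", neighbours_share_phase(grid))
print("same-slot spacing (um):", same_phase_pitch_um)
```

Under this tiling a 2-phase scheme would still leave diagonal neighbours in the same slot, which is consistent with 2-phase TDM only halving the crosstalk.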
Test Chip Design and Experimental Setup
Figure 6 depicts the block diagram of the test circuits. A delay controller, a TDM controller, a pitch controller, and Built-In Self-Test (BIST) circuits are implemented in the 64ch block. The phase timing of Rxclk is adjusted by the delay controller in UI/128 steps (7.8ps). The TDM controller changes the number of phases and the phase assignment so that the transceiver can be tested with 4-phase TDM, with 2-phase TDM, or without TDM for comparison. The pitch controller selects the activated channels to change the channel pitch and the number of aggregated channels. The BIST circuits are implemented for BER measurement. Pseudo Random Binary Sequence (PRBS) generators produce a 2²³−1 word pattern for the transmitted data. The number of errors in the received data is counted in the receiver. A scan chain initializes the PRBS generators and outputs the measured error count for BER measurement.
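The BIST flow above (generate a PRBS at the transmitter, regenerate it at the receiver, count mismatches) can be sketched in software. The text gives only the sequence length 2²³−1; the generator polynomial x²³ + x¹⁸ + 1 used below is the common PRBS23 choice and is an assumption here, as are the seed and the function names. The scan chain is not modeled.

```python
import itertools


def prbs23(seed=0x7FFFFF):
    """Yield bits of a PRBS with period 2**23 - 1 (assumed polynomial x^23 + x^18 + 1)."""
    state = seed & 0x7FFFFF
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        new_bit = ((state >> 22) ^ (state >> 17)) & 1
        state = ((state << 1) | new_bit) & 0x7FFFFF
        yield new_bit


def count_errors(received_bits, seed=0x7FFFFF):
    """Compare received bits against a locally regenerated reference sequence."""
    reference = prbs23(seed)
    return sum(1 for rx, ref in zip(received_bits, reference) if rx != ref)


# Transmit 1000 bits and corrupt three of them to exercise the counter.
tx = list(itertools.islice(prbs23(), 1000))
rx = list(tx)
for position in (10, 500, 999):
    rx[position] ^= 1

errors = count_errors(rx)
print("errors:", errors, "BER estimate:", errors / len(rx))
```

Both ends must start from the same seed, which is the role the scan-chain initialization plays on the test chip.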
Figure 7 shows microphotographs of the test chips fabricated in 0.18μm CMOS. The transmitter chip is placed on top of the receiver chip, with both chips face up. Both chips are polished to a thickness of 10μm. The communication distance, including an adhesive layer, is 15μm. The clock transceiver transmits the 1GHz clock through a metal inductor with a diameter of 200μm. One clock transceiver is provided for every 64 data transceivers. Each data transceiver communicates at 1Gb/s/channel through a metal inductor with a diameter of 29.5μm. 1024 data transceivers are arranged with a pitch of 30μm. The transmitter and receiver circuits are placed under the metal inductors to save layout area. Experiments indicate that the influence of the transceiver circuits on the inductive channel is negligibly small. Because of the compact layout, inter-channel skew in the 64ch block is suppressed to 11ps in the clock distribution network. Figure 8 shows infra-red photos of the stacked chips. The two chips are aligned by conventional infra-red alignment using alignment patterns in the top-metal layer. The measured alignment error is less than 3μm, which is negligible.
Figure 9 describes the experimental setup for the stacked test chips. The stacked chips are mounted on a wafer, placed on a probe station without an electromagnetic shield, and tested in a laboratory room with no control of temperature, dust, or air. A probe card provides connections between the stacked chips and external sources. Power and ground are provided by DC probes. Scan-in data is generated by an external data-timing generator to initialize and control the chips. The data-timing generator provides the differential 1GHz clock Txclk to the transmitter chip. The wireless clock transceiver transmits the clock to the receiver chip, which recovers Rxclk. In the on-chip BIST circuit, PRBS generators produce a 2²³−1 word pattern for Txdata, and errors in Rxdata are counted. An oscilloscope monitors the waveforms of Rxclk and Rxdata. A logic analyzer measures the number of errors and calculates the BER.
Fig.10 Measured received clock and jitter.
Fig.11 Snapshot of data waveforms
Fig.12 Measured timing bathtub curve.
Measurement Results
A. Wireless Clock Transmission
Figure 10 presents measurement results of wireless clock transmission. The 1GHz clock is successfully transmitted by the wireless clock transceiver. The rms jitter in Rxclk is 9.5ps, part of which is caused by the 6ps-rms jitter in Txclk from the external data-timing generator. The jitter produced in the clock transceiver can therefore be estimated as 7.4ps ($\sqrt{9.5^2 - 6^2} \approx 7.4$). The clock transmitter consumes 4mW and the clock receiver consumes 6mW from a 1.8V supply.
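The jitter estimate above follows from subtracting rms values in quadrature, which presumes that the source jitter and the jitter added by the clock link are independent. A short Python sketch of that arithmetic, with illustrative names, also sums the quoted clock transmitter and receiver power:

```python
import math


def added_rms_jitter(total_rms_ps, source_rms_ps):
    """Jitter added by the link, assuming independent sources add in quadrature."""
    return math.sqrt(total_rms_ps ** 2 - source_rms_ps ** 2)


RXCLK_JITTER_PS = 9.5   # measured on the recovered clock
TXCLK_JITTER_PS = 6.0   # from the external data-timing generator

clock_link_jitter_ps = added_rms_jitter(RXCLK_JITTER_PS, TXCLK_JITTER_PS)

CLOCK_TX_MW = 4.0
CLOCK_RX_MW = 6.0
clock_transceiver_mw = CLOCK_TX_MW + CLOCK_RX_MW

print(f"jitter added by clock link: {clock_link_jitter_ps:.1f} ps rms")
print(f"power per clock transceiver: {clock_transceiver_mw:.0f} mW")
```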
Fig.13 Measured BER dependence on channel pitch.
Fig.14 Measured timing bathtub curve of center channel in channel array.
B. Data Communication
1Gb/s/channel data communication is demonstrated in Fig.11. The 1GHz clock is transmitted by the wireless clock transceiver, and a 2²³−1 PRBS generator provides the transmitted data. A snapshot of the data waveforms is presented on the right. It shows that both the data and clock transceivers operate correctly. The delay between Txdata and Rxdata is 10ns, including the delay caused by cables and buffers in the experimental setup. After removing this setup delay, it is confirmed that the latency between Txdata and Rxdata is 1 clock cycle. The measured BER is lower than 10⁻¹⁴, which is as reliable as that of wired interfaces. The data transmitter consumes 2mW and the data receiver consumes 0.4mW from a 1.8V supply. The measured timing bathtub curve is shown in Fig.12. The on-chip delay controller sweeps the phase timing of Rxclk in 7.8ps steps. BER lower than 10⁻¹³ is examined with the 2²³−1 PRBS data at 1Gb/s. A timing margin of 150ps is obtained. The margin is sufficiently wide that the timing can be easily adjusted. The edges of the bathtub curve match calculated results for the case where the sampling clock has 7.4ps-rms jitter, which indicates that the assumption on the sampling jitter is reasonable. In addition, there is no difference from the measured timing bathtub curve of a transceiver in which circuits are not placed under the metal inductor. Interference between the transceiver circuits and the metal inductors is therefore negligible. However, it is necessary to consider noise coupling between the transmitter and a blocking field.
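The relation between the delay-controller step and the timing margin can be made concrete with a little arithmetic. The Python sketch below converts UI/128 into picoseconds and counts how many whole steps fit in the 150ps margin. It also includes a simple Gaussian-jitter model for the slope of a bathtub edge; the text does not give the calculation used for Fig.12, so that model and all names are assumptions for illustration only.

```python
import math

UI_PS = 1000.0               # one unit interval at 1Gb/s
DELAY_STEPS_PER_UI = 128     # delay controller resolution from the text
STEP_PS = UI_PS / DELAY_STEPS_PER_UI


def whole_steps(margin_ps):
    """Number of complete delay-controller steps that fit inside a margin."""
    return int(margin_ps // STEP_PS)


def edge_error_probability(offset_ps, sigma_ps):
    """Assumed model: chance that Gaussian sampling jitter (rms sigma_ps)
    carries the sampling instant past an eye edge offset_ps away."""
    return 0.5 * math.erfc(offset_ps / (sigma_ps * math.sqrt(2.0)))


print(f"delay step: {STEP_PS} ps")
print(f"steps inside a 150 ps margin: {whole_steps(150.0)}")
for offset in (0.0, 20.0, 40.0, 60.0):
    p = edge_error_probability(offset, 7.4)
    print(f"offset {offset:5.1f} ps from edge -> error probability {p:.3e}")
```

The error probability falls steeply as the sampling point moves away from the edge, which gives the characteristic bathtub walls on either side of the margin.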
TABLE I PERFORMANCE SUMMARY
C. Array Communication
BER dependence on channel pitch and the number of phases in TDM was measured. The measured results are plotted in Fig.13. By increasing the number of phases in TDM, crosstalk is reduced and the channel pitch can be shortened for the same BER. 1024 transceivers arranged with a pitch of 30μm operate at BER lower than 10⁻¹³ with the 4-phase TDM. As a result, aggregate data bandwidth of 1Tb/s is achieved with 1mm² area for the data transceivers. Figure 14 presents the measured timing bathtub curve of the center channel in the array where all the surrounding channels are operating. Although the timing margin is reduced to 130ps due to the crosstalk, it is still wide enough to adjust the sampling timing against inter-channel skew and PVT variations.
D. Performance Summary and Comparison
Chip performance is summarized in Table I. A data bandwidth of 1Tb/s is obtained from 1024 data transceivers arranged with a pitch of 30μm. The 1GHz clock is also provided by inductive coupling. The transceiver chip consumes 3W from a 1.8V supply, of which the 1024 data transceivers consume 2.4W and the 16 clock transceivers and 16 phase interpolators consume 0.6W. The layout area for the data link is 1mm² and that for the clock link is 1mm² in 0.18μm CMOS. Power efficiency is 3mW/Gb/s, which is half that of the previous work [6], and area efficiency is 1mm²/Tb/s, which is 40% of that of the previous work [6].
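The efficiency figures above can be reproduced from the per-channel numbers quoted earlier. The Python sketch below is a back-of-the-envelope check with illustrative names; it uses the data-link area (1mm²) for area efficiency, as the text does, and takes the figures for the previous work from the description of [6].

```python
# This work (figures quoted in the text).
N_DATA_CH = 1024
DATA_TX_MW = 2.0
DATA_RX_MW = 0.4
TOTAL_POWER_W = 3.0
DATA_AREA_MM2 = 1.0
BANDWIDTH_GBPS = N_DATA_CH * 1.0        # 1Gb/s per channel

# Previous work [6] (figures quoted in the text).
PREV_POWER_W = 1.2
PREV_BANDWIDTH_GBPS = 195.0
PREV_AREA_EFF_MM2_PER_TBPS = 2.5

data_power_w = N_DATA_CH * (DATA_TX_MW + DATA_RX_MW) / 1000.0
power_eff_mw_per_gbps = TOTAL_POWER_W * 1000.0 / BANDWIDTH_GBPS
area_eff_mm2_per_tbps = DATA_AREA_MM2 / (BANDWIDTH_GBPS / 1000.0)

prev_power_eff_mw_per_gbps = PREV_POWER_W * 1000.0 / PREV_BANDWIDTH_GBPS
power_ratio = power_eff_mw_per_gbps / prev_power_eff_mw_per_gbps
area_ratio = area_eff_mm2_per_tbps / PREV_AREA_EFF_MM2_PER_TBPS

print(f"data-link power: {data_power_w:.2f} W")
print(f"power efficiency: {power_eff_mw_per_gbps:.2f} mW/Gb/s "
      f"({power_ratio:.0%} of previous work)")
print(f"area efficiency: {area_eff_mm2_per_tbps:.2f} mm2/Tb/s "
      f"({area_ratio:.0%} of previous work)")
```

Note that 1mW/Gb/s equals 1pJ/b, so the power efficiency can be read directly as energy per bit.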
The chip performance is compared with transceiver chips reported at ISSCC from 1996 to 2005 [1],[6]-[9],[14]-[25] in Fig.15. Since 3-D interfaces [1],[4],[6]-[9] communicate in close proximity, they have advantages in improving the bandwidth with higher power and area efficiency. The state-of-the-art inductive-coupling transceiver achieves the highest bandwidth with the lowest power and smallest layout area. The bandwidth is 3.3 times higher than that of Rambus FlexIO implemented in the CELL processor [25]. Power dissipation is 7 times lower than that of FlexIO, and layout area is 4 times smaller than that of FlexIO, even in a less advanced technology.
Conclusions
A 1Tb/s 3W inductive-coupling transceiver has been developed. The 1GHz clock is also transmitted by the proposed wireless clock transceiver. BPM signaling improves noise immunity, reducing power dissipation to 3mW/Gb/s (=3pJ/b). 4-phase TDM reduces crosstalk, decreasing the channel pitch to 30μm and the layout area to 1mm². As a result, among transceiver chips reported at ISSCC over the past ten years, the inductive-coupling transceiver achieves the highest bandwidth with the lowest power and smallest layout area.
Fig.15 Performance comparison with ISSCC transceiver chips in (a) bandwidth, (b) power, (c) layout area.
Acknowledgements
The authors are grateful to Prof. Takayasu Sakurai with the University of Tokyo for valuable discussions. The VLSI chips in this study have been fabricated in the chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with MOSIS and Taiwan Semiconductor Manufacturing Company (TSMC).
References
[1] T. Ezaki, et al., “A 160Gb/s Interface Design Configuration for Multichip LSI,” IEEE ISSCC Dig. Tech. Papers, pp.140-141, Feb. 2004.
[2] J. Burns, et al., “Three-Dimensional Integrated Circuits for Low-Power, High-Bandwidth Systems on a Chip,” IEEE ISSCC Dig. Tech. Papers, pp.268-269, Feb. 2001.
[3] K. Kanda, et al., “A 1.27Gb/s/ch 3mW/pin Wireless Superconnect (WSC) Interface Scheme,” IEEE ISSCC Dig. Tech. Papers, pp.186-187, Feb. 2003.
[4] D. Mizoguchi, et al., “A 1.2Gb/s/pin Wireless Superconnect Based on Inductive Inter-chip Signaling (IIS),” IEEE ISSCC Dig. Tech. Papers, pp.142-143, Feb. 2004.
[5] N. Miura, et al., “Analysis and Design of Inductive Coupling and Transceiver Circuit for Inductive Inter-Chip Wireless Superconnect,” IEEE J. Solid-State Circuits, vol.40, no.4, pp.829-837, Apr. 2005.
[6] N. Miura, et al., “A 195-Gb/s 1.2-W Inductive Inter-Chip Wireless Superconnect with Transmit Power Control Scheme for 3-D-Stacked System in a Package,” IEEE Journal of Solid-State Circuits, vol.41, no.1, pp.23-34, Jan. 2006.
[7] N. Miura, et al., “A 1Tb/s 3W Inductive-Coupling Transceiver for Inter-Chip Clock and Data Link,” IEEE ISSCC Dig. Tech. Papers, pp.424-425, Feb. 2006.
[8] R. Drost, et al., “Proximity Communication,” IEEE Journal of Solid-State Circuits, vol.39, no.9, pp.1529-1535, Sept. 2004.
[9] R. Drost, et al., “Electronic Alignment for Proximity Communication,” IEEE ISSCC Dig. Tech. Papers, pp.144-145, Feb. 2004.
[10] L. Luo, et al., “3Gb/s AC-Coupled Chip-to-Chip Communication using a Low-Swing Pulse Receiver,” IEEE ISSCC Dig. Tech. Papers, pp.522-523, Feb. 2005.
[11] A. Iwata, et al., “A 3D Integration Scheme utilizing Wireless Interconnections for Implementing Hyper Brains,” IEEE ISSCC Dig. Tech. Papers, pp.262-263, Feb. 2005.
[12] M. Sasaki, et al., “A 0.95mW/1.0Gbps Spiral-Inductor Based Wireless Chip-Interconnect with Asynchronous Communication Scheme,” Symposium on VLSI Circuits Dig. Tech. Papers, pp.348-351, June 2005.
[13] Jian Xu, et al., “2.8 Gb/s inductively coupled interconnect for 3D ICs,” Symposium on VLSI Circuits Dig. Tech. Papers, pp.352-355, June 2005.
[14] Y. Unekawa, et al., “A 5Gb/s 8×8 ATM Switch Element CMOS LSI Supporting Five Quality-of-Service Classes with 200MHz LVDS Interface,” IEEE ISSCC Dig. Tech. Papers, pp.118-119, Feb. 1996.
[15] Y. Ohtomo, et al., “A 40Gb/s 8×8 ATM Switch LSI using 0.25μm CMOS/SIMOX,” IEEE ISSCC Dig. Tech. Papers, pp.154-155, Feb. 1997.
[16] B. Lau, et al., “A 2.6GB/s Multi-Purpose Chip-to-Chip Interface,” IEEE ISSCC Dig. Tech. Papers, pp.162-163, Feb. 1998.
[17] T. Takahashi, et al., “110GB/s Simultaneous Bi-Directional Transceiver Logic Synchronized with a System Clock,” IEEE ISSCC Dig. Tech. Papers, pp.176-177, Feb. 1999.
[18] M. Fukaishi, et al., “A 20Gb/s CMOS Multi-Channel Transmitter and Receiver Chip Set for Ultra-High Resolution Digital Display,” IEEE ISSCC Dig. Tech. Papers, pp.260-261, Feb. 2000.
[19] K. Yang, et al., “A Scalable 32Gb/s Parallel Data Transceiver with On-Chip Timing Calibration Circuits,” IEEE ISSCC Dig. Tech. Papers, pp.258-259, Feb. 2000.
[20] T. Tanahashi, et al., “A 2Gb/s 21CH Low-Latency Transceiver Circuit for Inter-Processor Communication,” IEEE ISSCC Dig. Tech. Papers, pp.60-61, Feb. 2001.
[21] R. Nair, et al., “A 28.5GB/s CMOS Non-Blocking Router for Terabit/s Connectivity between Multiple Processors and Peripheral I/O Nodes,” IEEE ISSCC Dig. Tech. Papers, pp.224-225, Feb. 2001.
[22] P. Landman, et al., “A 62Gb/s Backplane Interconnect ASIC based on 3.1Gb/s Serial-Link Technology,” IEEE ISSCC Dig. Tech. Papers, pp.52-53, Feb. 2002.
[23] K. Tanaka, et al., “A 100Gb/s Transceiver with GND-VDD Common-Mode Receiver and Flexible Multi-Channel Aligner,” IEEE ISSCC Dig. Tech. Papers, pp.264-265, Feb. 2002.
[24] G. Paul, et al., “A Scalable 160Gb/s Switch Fabric Processor with 320Gb/s Memory Bandwidth,” IEEE ISSCC Dig. Tech. Papers, pp.410-411, Feb. 2004.
[25] K. Chang, et al., “Clocking and Circuit Design for a Parallel I/O on a First-Generation CELL Processor,” IEEE ISSCC Dig. Tech. Papers, pp.526-527, Feb. 2005.