Stochastic Computation applied to the design of Error Correcting Decoders
by Gord Harling, Dr. Shie Mannor, Dr. Warren Gross (WideSail)
ABSTRACT
We describe the application of stochastic computation to a family of error-correcting decoders. We have applied this technology to the Low Density Parity Check (LDPC) codes first described by Gallager in the 1960s. LDPC codes are among the highest-performance error-correcting codes known to date and are used in IEEE standards including WiFi, WiMAX, DVB-S2, and 10 GbE. The coding technique was rediscovered only recently, once semiconductor technology made it economically viable, and most implementations remain expensive in circuit area and power consumption. Architectures have been devised to reduce this cost, but they suffer performance loss due to the shortcuts taken to reduce circuit size.
We have completed the design and layout of an LDPC block decoder with greatly reduced size and power consumption. The use of stochastic computation reduces the effects of quantization error, which is a limiting factor in the performance of error-correcting decoders based on conventional bit-serial or partially parallel architectures. We will present results demonstrating the performance of our stochastic “bit-stream” technology.
Introduction
Generating stochastic streams in theory requires a source of randomness. Our studies have indicated that truly random numbers are not necessary and that pseudo-random (repeatable, “random-looking”) sequences are enough. Moreover, our results show that even pre-generating a small number of pseudo-random numbers and then storing them in memory in the decoder does not degrade performance. Therefore, in practice, stochastic decoders do not require extremely high entropy and can be predictable and controllable. We perform bit-true simulations to model the exact operation of the hardware.
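As an illustration, a repeatable pseudo-random source such as a linear feedback shift register (LFSR) is adequate for this purpose. The Python sketch below is a generic Galois LFSR; the function name and the tap mask are our illustrative choices, not details of the actual decoder hardware.

```python
def galois_lfsr(seed, taps=0xB8, width=8):
    """Repeatable, 'random-looking' W-bit sequence from a Galois LFSR.

    The tap mask 0xB8 yields a maximal-length (period 255) sequence
    for width 8. The seed must be nonzero, or the state sticks at 0.
    """
    state = seed & ((1 << width) - 1)
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps  # apply the feedback polynomial
        yield state

# The same seed always reproduces the same "random-looking" run:
gen = galois_lfsr(seed=1)
print([next(gen) for _ in range(8)])
```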
An essential component of a stochastic design is the comparator shown in Figure 1, which is used to convert probabilities to stochastic streams. In this figure, X and R are W-bit-wide inputs. R is a uniformly distributed random number that changes every clock cycle. The output bit of the comparator is ‘1’ when X > R and ‘0’ otherwise; therefore, the probability of the output being ‘1’ is X / 2^W.
Figure 1: Probability to stochastic stream conversion.
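A minimal Python sketch of this conversion (the function name is ours; a software pseudo-random generator stands in for the hardware random source):

```python
import random

def prob_to_stream(x, width, n_bits, seed=0):
    """Generate a stochastic stream with P(bit = 1) = x / 2**width.

    Each clock cycle draws a fresh uniform W-bit pseudo-random R;
    the output bit is 1 when x > R, as in the Figure 1 comparator.
    """
    rng = random.Random(seed)
    return [1 if x > rng.randrange(2 ** width) else 0
            for _ in range(n_bits)]

# Example: x = 192 with width 8 gives P(1) = 192/256 = 0.75.
stream = prob_to_stream(192, width=8, n_bits=10000)
print(sum(stream) / len(stream))  # close to 0.75
```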
Once a stochastic stream of bits has been generated it must be used to perform mathematical operations such as multiplication and division of probabilities.
Let Pa = Pr(ai = 1) and Pb = Pr(bi = 1) be the probabilities to be multiplied. The outcome, Pc = Pa Pb, can be computed by an AND gate as shown in Figure 2 below. Similarly, other Boolean operations (e.g., NOT, XOR) can be implemented using basic logic gates.
Figure 2: Stochastic multiplication.
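A sketch of this multiplication in Python (names and probability values are illustrative). Note that the two streams must be statistically independent, which is why each is generated from its own pseudo-random sequence:

```python
import random

def stochastic_multiply(stream_a, stream_b):
    """Bitwise AND of two independent stochastic streams:
    P(c = 1) = Pa * Pb."""
    return [a & b for a, b in zip(stream_a, stream_b)]

# Independent streams with Pa = 0.75 and Pb = 0.5:
rng_a, rng_b = random.Random(1), random.Random(2)
a = [1 if rng_a.random() < 0.75 else 0 for _ in range(10000)]
b = [1 if rng_b.random() < 0.50 else 0 for _ in range(10000)]
c = stochastic_multiply(a, b)
print(sum(c) / len(c))  # close to 0.75 * 0.5 = 0.375
```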
Consider the JK flip-flop shown in Figure 3 with input sequences ai and bi representing the probabilities Pa and Pb respectively. The output bit ci is equal to one with probability Pc and equal to zero with probability 1 - Pc. A random output transition from 0 to 1 occurs with probability (1 - Pc) Pa and the reverse transition occurs with probability Pc Pb. Since the expected rates of random transitions in the two directions must be equal, we have Pc = Pa / (Pa + Pb). The resulting circuit, shown in Figure 4, although an approximation to Pa / Pb, matches the variable node operation we require for an error-correcting decoder.
Figure 3: Stochastic division.
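The equilibrium argument above can be checked with a short behavioral model of the JK flip-flop (a sketch; names and probabilities are ours):

```python
import random

def jk_divide(stream_a, stream_b):
    """Behavioral JK flip-flop with J = a_i and K = b_i.

    The output rises (0 -> 1) whenever J = 1 and falls (1 -> 0)
    whenever K = 1, so in equilibrium the rates (1 - Pc) * Pa and
    Pc * Pb balance, giving Pc = Pa / (Pa + Pb).
    """
    c, out = 0, []
    for j, k in zip(stream_a, stream_b):
        if j and k:
            c ^= 1   # both inputs asserted: toggle
        elif j:
            c = 1    # J sets the output
        elif k:
            c = 0    # K resets the output
        # j == k == 0: hold the previous output
        out.append(c)
    return out

# Pa = 0.25, Pb = 0.5, so Pc should approach 0.25 / 0.75 = 0.333.
rng_a, rng_b = random.Random(3), random.Random(4)
a = [1 if rng_a.random() < 0.25 else 0 for _ in range(20000)]
b = [1 if rng_b.random() < 0.50 else 0 for _ in range(20000)]
print(sum(jk_divide(a, b)) / 20000)  # close to 0.333
```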
The stochastic representation of probabilities in the code graph results in low-complexity bit-serial parity-check and variable nodes.
Let Pa = Pr(ai = 1) and Pb = Pr(bi = 1) be the probabilities of the two input bits, ai and bi, of a degree-3 check node. The output probability Pc is computed as
Pc = Pa (1 - Pb) + Pb (1 - Pa).
Similarly, the equality function in a degree-3 variable node for inputs Pa and Pb is
Pc = Pa Pb / (Pa Pb + (1 - Pa)(1 - Pb)).
Figure 4 shows the resulting hardware structures. Note that the variable node in Figure 4(B) is in the hold state (i.e., ci = ci-1) if the two input bits are not equal.
Figure 4: Parity check and variable nodes in stochastic decoders.
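The two node operations reduce to a few gates. A minimal Python sketch (function names are ours):

```python
def check_node(a_bit, b_bit):
    """Degree-3 parity-check node: a single XOR realizes
    Pc = Pa * (1 - Pb) + Pb * (1 - Pa)."""
    return a_bit ^ b_bit

def variable_node(a_bit, b_bit, prev_c):
    """Degree-3 variable (equality) node: when the inputs agree the
    output follows them, realizing
    Pc = Pa * Pb / (Pa * Pb + (1 - Pa) * (1 - Pb));
    when they disagree, the node holds (c_i = c_{i-1})."""
    return a_bit if a_bit == b_bit else prev_c

# Disagreeing inputs put the variable node in the hold state:
print(check_node(1, 0))               # 1
print(variable_node(1, 0, prev_c=1))  # 1 (held)
```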
In addition to simple variable and check node structures, stochastic computation also reduces the routing congestion problem because only one bit (per direction) is required to represent an edge between a parity-check and a variable node. This implies that in a decoding round, the stochastic decoding proceeds by the variable and check nodes exchanging a bit (per direction) along each edge in the graph. We refer to these decoding rounds as decoding cycles (DCs) to highlight that they do not directly correspond to the iterations in conventional iterative decoding algorithms.
Initial attempts to implement stochastic decoders using the circuits described above were not successful, except for some special cases as in [1], and even those did not match the performance of conventional decoders. This is due to the sensitivity of stochastic decoding to the level of random switching activity (bit transitions). The problem is especially acute because the code graph has cycles. The latching problem refers to the case where cycles in the graph correlate messages such that a group of nodes is locked into a fixed state maintained solely by the correlated messages. The problem can be worse at high signal-to-noise ratios (SNRs) because the received log-likelihood ratios (LLRs) are large, so the corresponding probabilities approach 0 (or 1). In this situation, bits in the stochastic sequences are mostly 0 (or 1), making random switching events very rare.
Figure 5 illustrates the latching of two variable nodes, v1 and v2, into a fixed (hold) state for several DCs, forced by a 4-cycle in the absence of sufficient switching activity. Latching causes a severe performance loss when decoding LDPC codes, and previous attempts by other groups to tackle it were not successful. Our approach succeeds because it views the decoding as a relaxation procedure and is not concerned with reproducing the exact values of the probabilities. We introduce two simple procedures which, applied together, solve the latching problem.
Figure 5. An example of latching within a 4-cycle.
Ingredients of the Proposed Solution
1. Re-randomizing Edge Memories
The first component of our solution to the lockup problem is re-randomization: reordering the last few “good” bits generated at the variable nodes, i.e., those that were not generated from the hold state of the JK flip-flop. We introduce M-bit shift registers, called edge memories (EMs), at each edge of the code graph. Each EM is updated only when the variable node is not in the hold state for that edge. When a hold state occurs, a bit is randomly chosen from the EM of the edge and passed as the outgoing bit; this can be done by generating random addresses for the EMs. Bit-true simulations show that the random-number generation methods can be shared with the input streams with no loss of performance. This updating scheme increases the random switching activity of the stochastic messages in the code graph, reducing the chance of locking into a fixed state: in a hold state, the outgoing bit is drawn at random from previous outputs of the variable node that were not themselves produced in a hold state.
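A minimal Python sketch of a single edge memory follows; the class and method names are ours, and the start-up contents of the register are an arbitrary assumption for illustration:

```python
import random
from collections import deque

class EdgeMemory:
    """M-bit shift register of the most recent non-hold ('good')
    bits produced by a variable node on one edge of the graph."""

    def __init__(self, m=32, seed=0):
        self.rng = random.Random(seed)
        # Arbitrary initialization for this sketch; the hardware's
        # start-up contents are a separate design decision.
        self.bits = deque((self.rng.randrange(2) for _ in range(m)),
                          maxlen=m)

    def outgoing_bit(self, new_bit, in_hold_state):
        if in_hold_state:
            # Re-randomize: emit a randomly addressed stored bit
            # instead of repeating the held output.
            return self.bits[self.rng.randrange(len(self.bits))]
        # Not in hold: shift in the fresh bit and send it out.
        self.bits.append(new_bit)
        return new_bit

em = EdgeMemory(m=8, seed=5)
print(em.outgoing_bit(1, in_hold_state=False))  # 1, and EM updated
print(em.outgoing_bit(0, in_hold_state=True))   # random stored bit
```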
2. Noise-Dependent Scaling
The second component of our solution to the lockup problem is to scale the received values to increase switching activity. As noted above, the latching problem worsens at high SNRs due to the lack of switching activity. Our solution scales the channel LLRs by a factor proportional to the operating SNR. Because the scaling factor is proportional to the noise level, it ensures a similar level of switching activity across different ranges of SNRs. The scaled LLR, Si, for the i-th received symbol, yi, is calculated as
Si = (α N0 / Y) Li = 4 α yi / Y,
where Li = 4 yi / N0 is the channel LLR for the AWGN channel, Y is a fixed maximum value of the received symbols, and α is a constant whose value can be chosen based on BER performance. The noise dependence of the LLR is cancelled by the noise-dependent scaling factor, resulting in a noise-independent expression. This has the welcome side effect of eliminating the SNR-estimation block from the receiver.
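A one-function Python sketch of this scaling (the function name and the example values are ours; the LLR form Li = 4 yi / N0 is the standard AWGN expression implied by the equation above):

```python
def scale_llrs(received_symbols, y_max, alpha):
    """Noise-dependent scaling: Si = (alpha * N0 / Y) * Li.

    With Li = 4 * yi / N0, the N0 factors cancel, so the scaled
    value Si = 4 * alpha * yi / Y requires no SNR estimate.
    """
    return [4.0 * alpha * y / y_max for y in received_symbols]

# alpha is chosen empirically from BER simulations; the value and
# symbols here are purely illustrative.
print(scale_llrs([0.9, -1.1, 0.3], y_max=2.0, alpha=0.5))
```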
Competing Technologies
- Fully Parallel Multi-Bit Decoders
The first fully parallel LDPC decoder chip was reported in [2] in 2002. All of the nodes are directly implemented on the chip and connected by a large number of wires. The advantage of this technique is high throughput, but the high speed comes at a steep price: large silicon area and high power dissipation. Additionally, the design of this chip demanded an unusual effort, including the development of custom computer-aided design software to route the wires on the chip. This is not a commercially viable solution, and for these reasons the architecture has not been revisited in the six years since its publication.
In comparison to this architecture, stochastic decoding shares the throughput advantage since it is also fully parallel. However, because stochastic decoding messages are only 1-bit wide, the number of wires required is reduced dramatically and routing the wires can be done using conventional design methodologies. This has been verified by our synthesis results. The area of stochastic decoders is much smaller due to the ultra-low-complexity variable and check node hardware.
- Partially-Parallel Multi-Bit Decoders
Because of the difficulties in implementing fully parallel multi-bit decoders, most effort in LDPC decoding hardware, in both research and commercial activities, has focused on partially parallel decoders. In this approach, only a small number of hardware processing elements are implemented on the chip and are shared amongst the nodes of the graph to minimize silicon area. The drawback of this technique is that the throughput is significantly reduced. Power consumption is also high because of the data shuffling required between memory banks. This architecture works best with specially constructed codes that are designed to reduce collisions between accesses to the memory banks.
In contrast, stochastic decoders are fully parallel, do not require internal memory and have much greater throughput. They are not restricted to any particular code structure. Stochastic decoders directly attack the area and routing problems by using simple node hardware enabling us to integrate all of the nodes onto a single chip.
- Fully-Parallel Bit-Serial Decoders
Bit-serial decoders reduce the routing congestion problem: they are fully parallel decoders that send one bit of a multi-bit message at a time over a single wire. This reduces the routing complexity significantly, but the multi-bit messages must be stored, parallelized, and manipulated using multi-bit ALU hardware in both the check and equality nodes, so the silicon area required is still large.
Stochastic decoder nodes are much simpler. Moreover, when the decoder converges, the switching activity on the wires drops drastically as each stochastic stream becomes a train of all ‘0’s or all ‘1’s, reducing the power consumption of the decoder. In contrast, bit-serial decoders consume power even after the decoder has converged.
- Analog Decoders
To address the above concerns, decoders built from analog circuits were developed in the research community. Like stochastic decoders, they use very simple node circuitry and only one wire per message. However, analog decoders face several obstacles to commercialization:
- they are not scalable beyond small codes and are not able to decode commercially-relevant codes
- their design is very involved and
- analog designs are not as portable between integrated circuit technologies. This means that when a new silicon process is developed (say moving from 65 nm to 45 nm), the chip must be redesigned substantially.
Stochastic decoders use the standard digital design flow and can be ported to new technologies by computer-aided design software with little modification. Analog decoders tend to be designed to operate at the highest speed achievable in a given process to provide processing margin and may consume more power than is required. If a communication protocol offers variable throughput then a stochastic decoder may be clocked more slowly to lower the power consumption.
A Stochastic Decoder for 10Gb Ethernet
We have applied stochastic decoding to the LDPC code specified for 10 Gb Ethernet (10GBase-T). Our decoder provides excellent decoding performance, as shown in Figure 6, in which we compare simulations of the decoder circuit to a high-precision computation on the same input data. While computing the sum-product algorithm for 32 iterations using high-precision 64-bit arithmetic would be prohibitively expensive in hardware, our stochastic decoder implementation loses less than 0.2 dB in performance at the full line rate.
Figure 6: Stochastic decoding performance of LDPC code for 10 Gbit Ethernet.
Conclusions
We have described the principles of stochastic computation and demonstrated its advantages in performance, circuit size, and power consumption when applied to LDPC decoding. Properly designed, it can produce excellent results with few of the drawbacks of the simplifying design techniques used in most LDPC decoders.
References
[1] W. J. Gross, V. Gaudet, and A. Milner, “Stochastic implementation of LDPC decoders,” Proc. 39th Asilomar Conf. on Signals, Systems, and Computers, Nov. 2005.
[2] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b Rate-1/2 Low-Density Parity-Check Decoder,” IEEE Journal of Solid-State Circuits, Vol. 37, No. 3. pp. 404-412, March 2002.
Other Important Works
[1] S. Sharifi Tehrani, W. J. Gross, and S. Mannor, “Stochastic decoding of LDPC Codes,” IEEE Communications Letters, 10(10):716–718, Oct. 2006.
[2] S. Sharifi Tehrani, S. Mannor, and W. J. Gross, “Stochastic decoding of LDPC Codes,” To appear in the IEEE International Symposium on Multiple Valued Logic, May 2007.
[3] T. Richardson and R. Urbanke, “The capacity of low-density parity-check codes under message passing decoding,” IEEE Transactions on Information Theory, vol. 47, pp. 599–618, February 2001.
[4] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Kaufman, 1988.
[5] B. Gaines, Advances in Information Systems Science, Chapter 2, pages 37-172, Plenum, New York, 1969.
[6] R. G. Gallager. Low Density Parity Check Codes. MIT Press, Cambridge MA, 1963.
[7] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon-Limit Error-Correcting Codes: Turbo Codes,” Proc. IEEE Int. Conf. on Communications, pp. 1064-1070, May 1993.
[8] F. R. Kschischang, B. Frey, and H.-A. Loeliger, “Factor Graphs and the Sum-Product Algorithm,” IEEE Trans. Information Theory, Vol. 47, No. 2, pp. 498-519, February 2001.
[9] B. Brown and H. Card, “Stochastic neural computation I: Computational elements,” IEEE Trans. on Computers, 50(9):891–905, Sept. 2001.
[10] V. Gaudet and A. Rapley, “Iterative decoding using stochastic computation,” Electron. Letters, 39(3):299–301, Feb. 2003.