Industry Expert Blogs
High-Fidelity Conversion of Floating-Point Networks for Low-Precision Inference using Distillation
With Imagination Blog - Szabolcs Csefalvay, Imagination | May 27, 2021
A major challenge for fast, low-power inference on neural network accelerators is the size of the models. Recent years have seen a trend towards deeper neural networks with more activations and coefficients, and with this rise in model size comes a corresponding rise in inference time and energy consumption per inference. This is particularly significant in resource-constrained mobile and automotive applications. Low-precision inference helps reduce inference cost by reducing DRAM bandwidth (a significant contributor to device energy consumption), compute logic cost and power consumption.
In this context, the following question naturally arises: what is the optimal bit depth at which to encode neural network weights and activations? Several number formats have been proposed to reduce the bit depth, including Nvidia’s TensorFloat, Google’s 8-bit asymmetric fixed point (Q8A) and bfloat16. However, while being a step in the right direction, these formats cannot be said to be optimal: for example, most of them store an exponent for each individual value, which is redundant when multiple values share the same range. More importantly, they do not account for the fact that different parts of a neural network often have different bit-depth requirements. Some layers can be encoded with a low bit depth, while others (such as the input and output layers) need a higher bit depth. An example of this is MobileNet v3, which can be converted from 32-bit floating point to bit depths mostly in the 5 to 12 bit range (see Figure 1).
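To make the shared-exponent idea concrete, here is a minimal NumPy sketch (the function name, tensor shape and bit widths are illustrative assumptions, not taken from the article) that quantizes a weight tensor with a single exponent shared across all values, then measures the reconstruction error at a few candidate per-layer bit depths.

```python
import numpy as np

def quantize_shared_exponent(x, bits):
    """Sketch: quantize a tensor to a signed `bits`-bit mantissa with one
    exponent shared by the whole tensor, rather than one exponent per value."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x)
    # One exponent for the whole tensor, chosen from its largest magnitude.
    exponent = np.ceil(np.log2(max_abs))
    # Scale so the largest value fits the signed integer range for `bits`.
    scale = 2.0 ** (exponent - (bits - 1))
    q = np.clip(np.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale  # dequantized values, used here to measure the error

# Different layers tolerate different bit depths; compare a few candidates.
weights = np.random.randn(256, 256).astype(np.float32)
for bits in (5, 8, 12):
    err = np.abs(weights - quantize_shared_exponent(weights, bits)).max()
    print(f"{bits}-bit shared-exponent quantization, max abs error: {err:.5f}")
```

Because the exponent is stored once per tensor rather than once per value, the per-value storage collapses to the mantissa bits, which is where the 5 to 12 bit per-layer budgets mentioned above come from.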