FP8 Formats for Deep Learning
Created on 2022-09-23T10:27:41-05:00
Use of 8-bit floating-point formats to store and process the weights of deep neural networks.
Existing implementations often cast to a wider register to run the math, then scale and cast back to 8-bit.
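A minimal numpy sketch of that cast-compute-cast pattern, assuming per-tensor scaling into the E4M3 range (largest finite value 448); the 8-bit storage is only mimicked here by scaling and clipping float32 values, and the names `quantize_fp8` / `fp8_linear` are made up for illustration.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the E4M3 FP8 format

def quantize_fp8(x):
    """Scale a tensor so its magnitude fits the FP8 range, then clip.

    The clipped float32 array stands in for the real 8-bit encoding.
    """
    scale = E4M3_MAX / max(float(np.max(np.abs(x))), 1e-12)
    return np.clip(x * scale, -E4M3_MAX, E4M3_MAX), scale

def fp8_linear(x, w):
    """Cast-compute-cast: dequantize to FP32, do the matmul, requantize."""
    qx, sx = quantize_fp8(x)
    qw, sw = quantize_fp8(w)
    y = (qx / sx) @ (qw / sw)   # the math runs in the wider register
    return quantize_fp8(y)      # scale and cast back for 8-bit storage

x = np.random.randn(4, 16).astype(np.float32)
w = np.random.randn(16, 8).astype(np.float32)
qy, sy = fp8_linear(x, w)
print(qy.shape, sy)
```

Dividing by the per-tensor scale recovers an approximation of the original values, so the only information lost in this sketch is whatever the clipping removes; a real kernel would also lose mantissa precision in the 8-bit encoding.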
8-bit networks usually cope with realistic workloads without significant accuracy loss.
Skipping weight updates on overflow is a bad idea: 8-bit values overflow too often.
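A sketch of the skip-on-overflow rule being dismissed, assuming the check is against the E4M3 maximum of 448 on the updated weights; `apply_update` and `lr` are illustrative names, not anything from the source.

```python
import numpy as np

E4M3_MAX = 448.0  # values beyond this overflow the 8-bit format

def apply_update(w, grad, lr):
    """Skip-on-overflow: drop the step if the new weights would overflow.

    With FP8's narrow range this triggers so often that training stalls,
    which is why the note calls it a bad idea.
    """
    new_w = w - lr * grad
    if np.any(np.abs(new_w) > E4M3_MAX) or not np.all(np.isfinite(new_w)):
        return w, True   # overflow detected: keep the old weights
    return new_w, False
```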
Calculations were done by simulating FP8: the math was actually done in wider registers, with values converted to 8-bit encodings and back.
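A rough sketch of that simulation approach, assuming the E4M3 variant (4 exponent bits, 3 mantissa bits, saturating at ±448) and ignoring subnormals and the format's NaN encodings; `simulate_e4m3` is a hypothetical helper, not an API from the paper.

```python
import numpy as np

def simulate_e4m3(x):
    """Round float32 values to the nearest E4M3-representable value.

    E4M3 stores 3 mantissa bits (4 significant bits with the implicit
    leading one), so rounding the mantissa to 4 bits and clipping to the
    format's range approximates a round trip through the 8-bit encoding.
    """
    x = np.asarray(x, dtype=np.float32)
    x = np.clip(x, -448.0, 448.0)      # saturate to the E4M3 range
    m, e = np.frexp(x)                 # x = m * 2**e with m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0      # keep 4 significant bits (ties to even)
    return np.ldexp(m, e).astype(np.float32)

vals = np.array([0.1, 3.14159, 300.0, 1000.0], dtype=np.float32)
print(simulate_e4m3(vals))  # e.g. 3.14159 -> 3.25, 300.0 -> 288.0, 1000.0 -> 448.0
```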