FP8 Formats for Deep Learning
Created on 2022-09-23T10:27:41-05:00
Use of 8-bit floating-point formats to store and process the weights of deep neural networks.
Existing implementations often cast to a wider register to run the math, then scale and cast back to 8-bit.
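A minimal numpy sketch of that cast-compute-cast pattern, assuming per-tensor scaling into the E4M3 range (largest finite value 448); the 8-bit storage is only mimicked here by scaling and clipping float32 values, and the names `quantize_fp8` / `fp8_linear` are made up for illustration.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the E4M3 FP8 format

def quantize_fp8(x):
    """Scale a tensor so its magnitude fits the FP8 range, then clip.

    The clipped float32 array stands in for the real 8-bit encoding.
    """
    scale = E4M3_MAX / max(float(np.max(np.abs(x))), 1e-12)
    return np.clip(x * scale, -E4M3_MAX, E4M3_MAX), scale

def fp8_linear(x, w):
    """Cast-compute-cast: dequantize to FP32, do the matmul, requantize."""
    qx, sx = quantize_fp8(x)
    qw, sw = quantize_fp8(w)
    y = (qx / sx) @ (qw / sw)   # the math runs in the wider register
    return quantize_fp8(y)      # scale and cast back for 8-bit storage

x = np.random.randn(4, 16).astype(np.float32)
w = np.random.randn(16, 8).astype(np.float32)
qy, sy = fp8_linear(x, w)
print(qy.shape, sy)
```

Dividing by the per-tensor scale recovers an approximation of the original values, so the only information lost in this sketch is whatever the clipping removes; a real kernel would also lose mantissa precision in the 8-bit encoding.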
8-bit networks usually cope with realistic workloads without significant accuracy loss.
Skipping weight updates on overflow is a bad idea: 8-bit values overflow too often.
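A sketch of the skip-on-overflow rule being dismissed, assuming the check is against the E4M3 maximum of 448 on the updated weights; `apply_update` and `lr` are illustrative names, not anything from the source.

```python
import numpy as np

E4M3_MAX = 448.0  # values beyond this overflow the 8-bit format

def apply_update(w, grad, lr):
    """Skip-on-overflow: drop the step if the new weights would overflow.

    With FP8's narrow range this triggers so often that training stalls,
    which is why the note calls it a bad idea.
    """
    new_w = w - lr * grad
    if np.any(np.abs(new_w) > E4M3_MAX) or not np.all(np.isfinite(new_w)):
        return w, True   # overflow detected: keep the old weights
    return new_w, False
```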
Calculations were done by simulating FP8: the math was actually done in wider registers, with values converted to 8-bit encodings and back.
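A rough sketch of that simulation approach, assuming the E4M3 variant (4 exponent bits, 3 mantissa bits, saturating at ±448) and ignoring subnormals and the format's NaN encodings; `simulate_e4m3` is a hypothetical helper, not an API from the paper.

```python
import numpy as np

def simulate_e4m3(x):
    """Round float32 values to the nearest E4M3-representable value.

    E4M3 stores 3 mantissa bits (4 significant bits with the implicit
    leading one), so rounding the mantissa to 4 bits and clipping to the
    format's range approximates a round trip through the 8-bit encoding.
    """
    x = np.asarray(x, dtype=np.float32)
    x = np.clip(x, -448.0, 448.0)      # saturate to the E4M3 range
    m, e = np.frexp(x)                 # x = m * 2**e with m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0      # keep 4 significant bits (ties to even)
    return np.ldexp(m, e).astype(np.float32)

vals = np.array([0.1, 3.14159, 300.0, 1000.0], dtype=np.float32)
print(simulate_e4m3(vals))  # e.g. 3.14159 -> 3.25, 300.0 -> 288.0, 1000.0 -> 448.0
```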