Forward diffusion: Gaussian noise is gradually added to a sample over T steps, until it is indistinguishable from pure noise.
Reverse diffusion: a learned model removes the noise step by step, reconstructing a sample from what forward diffusion destroyed.
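The forward step above has a handy closed form: instead of adding noise T times, you can jump straight to any step t. A minimal NumPy sketch, assuming a DDPM-style linear beta schedule (the specific schedule values here are illustrative, not from the source):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise schedule (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative products used in the closed form

def q_sample(x0, t, rng=None):
    """Sample x_t ~ q(x_t | x_0) in one shot:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(np.shape(x0))  # Gaussian noise
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = np.ones(4)
print(q_sample(x0, 10))   # lightly noised, still close to x0
print(q_sample(x0, 999))  # almost pure Gaussian noise
```

By the final step, `alpha_bars[-1]` is essentially zero, so the signal is fully drowned out, which is exactly the "destroyed" state that reverse diffusion learns to undo.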
Quinn: In the case of text-to-image generation, it seems to use an encoder (usually "CLIP") to get an embedding of the prompt. Forward diffusion is only used during training, to scatter real images into 2D blobs of noise; at generation time the model starts from a fresh 2D blob of noise and reverse-diffuses it into an output image, conditioned on the prompt embedding.
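The generation loop Quinn describes can be sketched in a few lines. This is a hypothetical toy: `denoise` stands in for a trained noise-prediction network conditioned on the text embedding (here it just shrinks values toward zero), and the update rule is a crude simplification, not the real DDPM sampler:

```python
import numpy as np

def denoise(x_t, t, text_embedding):
    # Placeholder for eps_theta(x_t, t, c): a trained network would predict
    # the noise in x_t given the timestep and the prompt embedding.
    return 0.1 * x_t

def generate(text_embedding, T=50, shape=(8, 8), rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(shape)            # start from pure Gaussian noise
    for t in reversed(range(T)):              # walk the chain backwards
        eps_hat = denoise(x, t, text_embedding)
        x = x - eps_hat                       # simplified denoising update
    return x

image = generate(text_embedding=None)         # embedding unused by the stub
```

The key point is that nothing about the prompt is noised: the embedding only steers each denoising step.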
Quinn: Sounds dumb, actually. Like what an autoencoder does, but an extremely overcomplicated way of letting it iterate toward a more accurate result instead of attempting a one-shot conversion.