Voice characteristics conversion for HMM-based speech synthesis system

Created on 2022-05-22T21:22:09-05:00

This card pertains to a resource available on the internet.

Learn to synthesize a voice with an appropriately large dataset.
Learn a new voice by fitting the existing trained model to samples of the new voice.
Smooth and interpolate the parts of the model that the new samples did not cover.

Structure

Voice data is analyzed by mel-cepstrum analysis.
Phonemes are mapped to mel-cepstrum parameters by a phoneme-level HMM layer.
The words of a sentence are mapped to phonemes by another HMM layer.
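
As a rough illustration of the analysis step, the sketch below computes a plain real cepstrum for one frame of audio. It is a simplification: true mel-cepstrum analysis also warps the frequency axis to the mel scale, which is omitted here. The function name, frame length, sample rate, and coefficient count are all assumptions made for this example and do not come from the paper.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Return the first n_coeffs real-cepstrum coefficients of one frame.

    Simplified stand-in for mel-cepstrum analysis: no mel warping is applied.
    """
    # Taper the frame edges to reduce spectral leakage.
    windowed = frame * np.hanning(len(frame))
    # Magnitude spectrum of the frame.
    spectrum = np.abs(np.fft.rfft(windowed))
    # A small constant avoids log(0).
    log_spectrum = np.log(spectrum + 1e-10)
    # The inverse transform of the log spectrum is the cepstrum.
    cepstrum = np.fft.irfft(log_spectrum)
    return cepstrum[:n_coeffs]

# Hypothetical input: a 25 ms frame at 16 kHz containing two harmonics.
sample_rate = 16000
t = np.arange(400) / sample_rate
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
coeffs = real_cepstrum(frame)
```

Each frame of speech becomes one short vector like `coeffs`, and sequences of these vectors are what the HMM layers model.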

Converting a trained model to a new speaker

Conversion uses two algorithms: Maximum A Posteriori (MAP) estimation and Vector Field Smoothing (VFS).
MAP estimation starts from the existing trained parameters and shifts them towards the observed data from the new voice.
VFS interpolates adjustments for the parts of the model for which the new voice provided no training data.
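
The two steps above can be sketched on toy data. This is a minimal illustration, not the paper's implementation: it adapts only Gaussian mean vectors, assigns each sample to a single state, and uses inverse-distance weighting over the k nearest adapted states for the smoothing. The function names, the prior weight `tau`, and the choice of `k` are assumptions made for this example.

```python
import numpy as np

def map_adapt_mean(prior_mean, samples, tau=2.0):
    """MAP update of one mean: blend the prior mean with the new samples.

    tau (assumed value) controls how strongly the prior is trusted.
    """
    samples = np.atleast_2d(samples)
    n = samples.shape[0]
    return (tau * prior_mean + samples.sum(axis=0)) / (tau + n)

def vfs_interpolate(prior_means, adapted, k=2):
    """Fill in unadapted means from the shifts of nearby adapted means.

    adapted maps a state index to its MAP-adapted mean.
    """
    idx = np.array(sorted(adapted))
    # Transfer vectors: how far each adapted mean moved from its prior.
    transfer = np.array([adapted[i] - prior_means[i] for i in idx])
    result = prior_means.copy()
    for i in range(len(prior_means)):
        if i in adapted:
            result[i] = adapted[i]
            continue
        # Distance from this state to every adapted state, in the prior space.
        dist = np.linalg.norm(prior_means[idx] - prior_means[i], axis=1)
        nearest = np.argsort(dist)[:k]
        weights = 1.0 / (dist[nearest] + 1e-8)
        shift = (weights[:, None] * transfer[nearest]).sum(axis=0) / weights.sum()
        result[i] = prior_means[i] + shift
    return result

# Hypothetical model with four states; the new voice only covers states 0 and 3.
prior_means = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
new_samples = {
    0: np.array([[1.0, 1.0], [1.0, 1.0]]),
    3: np.array([[3.0, 2.0], [3.0, 2.0]]),
}
adapted = {i: map_adapt_mean(prior_means[i], x) for i, x in new_samples.items()}
converted = vfs_interpolate(prior_means, adapted)
```

States 1 and 2 never saw data from the new voice, yet they still move, each pulled mostly by whichever adapted neighbour was closest in the original model.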