Voice characteristics conversion for HMM-based speech synthesis system
Created on 2022-05-22T21:22:09-05:00
- Learn to synthesize a voice with an appropriately large dataset.
- Learn a new voice by adapting the existing trained model to samples of the new voice.
- Smooth and interpolate the model parameters that the new voice samples do not cover.
Structure
- Voice data is analyzed with mel-cepstral analysis.
- A phoneme-level HMM layer models the relationship between phonemes and mel-cepstrum parameters.
- Another HMM layer models the relationship between the words of a sentence and phonemes.
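To make the analysis step more concrete, here is a minimal sketch of a plain real cepstrum for one frame of audio. It is a simplification and not the analysis used in the paper: true mel-cepstral analysis also warps the frequency axis onto a mel-like scale, which is omitted here. The function names `dft` and `real_cepstrum` and the `floor` parameter are illustrative assumptions.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)); fine for short frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, floor=1e-10):
    """Real cepstrum: inverse DFT of the log magnitude spectrum.

    Assumption: no mel frequency warping is applied, so this is only
    the unwarped cousin of the mel-cepstrum.
    """
    n = len(frame)
    # Floor the magnitude so that log() never sees zero.
    log_mag = [math.log(max(abs(s), floor)) for s in dft(frame)]
    # The log magnitude of a real signal is symmetric, so the inverse
    # transform is real; keep only the real part.
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


# Hypothetical 32-sample frame built from two sinusoids and a periodic ramp.
frame = [
    math.sin(2 * math.pi * 3 * t / 32)
    + 0.5 * math.cos(2 * math.pi * 5 * t / 32)
    + 0.1 * (t % 4)
    for t in range(32)
]
coeffs = real_cepstrum(frame)
```

The low-order coefficients describe the smooth spectral envelope, which is the part of the signal that the HMMs are said above to relate to phonemes.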
Converting the trained model to a new speaker
- Maximum A Posteriori (MAP) estimation and Vector Field Smoothing (VFS) algorithms.
- MAP estimation starts from the existing model parameters and shifts them toward the data observed from the new voice.
- VFS interpolates parameters for the parts of the model for which the new voice provided no training data, and smooths the adapted parameters.
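The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it adapts only Gaussian means, assigns each frame to exactly one distribution, and covers only the interpolation half of VFS (the smoothing of already-adapted vectors is left out). The names `map_update_mean` and `vfs_interpolate`, the prior weight `tau`, and the exponential distance weighting are all hypothetical choices.

```python
import math


def map_update_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean.

    tau is the weight given to the prior (existing) mean; with no frames
    the prior is returned unchanged, with many frames the data dominates.
    """
    n = len(frames)
    dim = len(prior_mean)
    sums = [sum(f[d] for f in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vfs_interpolate(prior_means, adapted, k=2, width=1.0):
    """Estimate means that received no adaptation data.

    Each adapted mean defines a transfer vector (adapted - prior). An
    unadapted mean is moved by a weighted average of the transfer vectors
    of its k nearest adapted neighbours in the prior space.
    """
    transfer = {
        name: [a - p for a, p in zip(adapted[name], prior_means[name])]
        for name in adapted
    }
    result = dict(adapted)
    for name, mean in prior_means.items():
        if name in adapted:
            continue
        nearest = sorted(
            transfer, key=lambda other: sq_dist(mean, prior_means[other])
        )[:k]
        # Assumed weighting: closer neighbours count more.
        weights = [math.exp(-sq_dist(mean, prior_means[o]) / width) for o in nearest]
        total = sum(weights)
        if total == 0.0:  # all neighbours very far away: fall back to equal weights
            weights = [1.0] * len(nearest)
            total = float(len(nearest))
        shift = [
            sum(w * transfer[o][d] for w, o in zip(weights, nearest)) / total
            for d in range(len(mean))
        ]
        result[name] = [m + s for m, s in zip(mean, shift)]
    return result


# Hypothetical two-dimensional means for three distributions.
prior = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.5, 0.0]}
adapted = {
    "a": map_update_mean(prior["a"], [[2.0, 2.0]] * 10, tau=10.0),
    "b": [2.0, 1.0],
}
converted = vfs_interpolate(prior, adapted, k=2)
```

In this toy setup, "c" has no data from the new voice, so it is moved by the average shift of its neighbours "a" and "b". A larger `tau` keeps the adapted means closer to the original speaker when few samples are available.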