Detailed Word Error Rate (WER %)
Lower is better. Empty cells denote pairs not tested by a baseline model.
Src | Tgt | YourTTS | XTTSv2 | LatinX (Fine-tuned) | LatinX (DPO) |
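For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognized transcript, divided by the number of reference words. The sketch below illustrates that standard definition only. The function name `word_error_rate` is hypothetical, and the ASR system and text normalization behind the reported numbers are not specified here, so whitespace tokenization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Assumption: words are separated by whitespace and compared exactly;
    no casing or punctuation normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.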
Detailed Speaker Similarity (SMOS)
Human evaluation scores (Mean ± 95% CI). Higher is better.
Src | Tgt | Baseline (Real) | XTTSv2 | LatinX (Fine-tuned) | LatinX (DPO) |
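As an illustration of the "Mean ± 95% CI" format, the sketch below computes a mean and the half-width of a 95% confidence interval from a list of listener ratings. It assumes a normal approximation (z = 1.96) and the sample standard deviation. The procedure actually used for these tables is not stated, and a t-distribution would give wider intervals for small rater pools. The name `mean_ci95` is hypothetical.

```python
import math
import statistics


def mean_ci95(scores):
    """Return (mean, half_width) so a result can be reported as mean ± half_width.

    Assumption: normal approximation with z = 1.96 and the sample
    standard deviation (n - 1 in the denominator).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("at least two ratings are needed for an interval")
    mean = statistics.fmean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return mean, 1.96 * standard_error
```

A call such as `mean_ci95([4, 5, 3, 4, 4])` returns the two numbers that would appear on either side of the ± sign.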
Detailed Objective Similarity (Sim-O / Sim-E)
Cosine similarity scores. Higher is better. Best Sim-O per row is in bold. Sim-E values that surpass the best Sim-O are underlined.
Src | Tgt | YourTTS | XTTSv2 | LatinX (Fine-tuned) | LatinX (DPO) |
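The cosine similarity behind these scores can be sketched as follows. The code assumes each utterance has already been mapped to a fixed-length speaker embedding. The encoder that produces those embeddings, and which pair of utterances is compared for Sim-O versus Sim-E, are not specified here. The name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The score ranges from -1 to 1 and ignores vector magnitude, so only the direction of the embeddings matters.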
Detailed Naturalness (MOS)
Human evaluation scores for naturalness (Mean ± 95% CI). Higher is better. Real audio scores are in bold. The best result among generated models is underlined.
Src | Tgt | Baseline (Real Audio) | XTTSv2 | LatinX (Fine-tuned) | LatinX (DPO) |