Cannot Model Conversation Structure
English Language Dominance
Language Limitation
Check available languages
Memory Bottleneck in Training
No Pre-trained Language Model Use
Pronunciation Fine-tuning Limitations
Limited fine-tuning features
Real-Time Generation Delay
RVQ time-to-first-audio scales poorly