Cannot Model Conversation Structure
English Language Dominance
Memory Bottleneck in Training
Does Not Use a Pre-trained Language Model
Not All Languages Supported
Real-Time Generation Delay
Time-to-first-audio scales poorly with residual vector quantization (RVQ)
Requires Significant Compute
Needs substantial computational resources
Speech Quality Depends on Input
Varies with text complexity and length