Cannot Model Conversation Structure
English Language Dominance
Generation Time
minutes to hours
Memory Bottleneck in Training
No Commercial Use by Default
No Pre-trained Language Model Use
Real-Time Generation Delay
RVQ time-to-first-audio scales poorly