Cannot Model Conversation Structure
Device Performance Dependent
English Language Dominance
Keep App Open Requirement
Manual Language Selection Required
Memory Bottleneck in Training
No Explicit API Integration
No Pre-trained Language Model Use
Real-Time Generation Delay
RVQ Time-to-First-Audio Scales Poorly