Conversational Speech Generation
Dataset Size
1 million hours
Emotion Tags
normal, slow, crying, sleepy, sigh, chuckle
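Emotion tags of this kind are typically embedded inline in the input text before synthesis. The sketch below shows one plausible way to do that; the angle-bracket syntax and the `tag_text` helper are assumptions for illustration, not a documented API.

```python
# Hypothetical example: embedding an inline emotion tag in the input text.
# The <tag> syntax is an assumption; the tag names mirror the list above.
def tag_text(text: str, emotion: str) -> str:
    """Prefix a line of dialogue with an inline emotion tag."""
    supported = {"normal", "slow", "crying", "sleepy", "sigh", "chuckle"}
    if emotion not in supported:
        raise ValueError(f"unknown emotion tag: {emotion}")
    return f"<{emotion}> {text}"

prompt = tag_text("Well, that did not go as planned.", "sigh")
print(prompt)  # <sigh> Well, that did not go as planned.
```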
Guided Emotion and Intonation
Input Streaming for Lower Latency
LLM-based Customizability
Model Sizes
Tiny: 1B backbone, 100M decoder
Small: 3B backbone, 250M decoder
Medium: 8B backbone, 300M decoder
Model Tokenizer Type
Non-streaming (CNN-based) tokenizer
Multiple Speaker Handling
Objective Metrics
Word Error Rate
Speaker Similarity
Homograph Disambiguation
Pronunciation Consistency
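Of the metrics above, Word Error Rate is the most mechanical to compute: transcribe the generated audio with an ASR model, then take the word-level edit distance against the reference text. A minimal sketch of that second step (the ASR transcription itself is assumed to happen elsewhere):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3.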
Open Source Release Planned
Orpheus Speech Models
Medium (3B)
Small (1B)
Tiny (400M)
Nano (150M)
Multilingual Support
Partial support planned for 20+ languages
Pretrained and Finetuned Models
Pretrained models
Finetuned models
Sample Finetuning Scripts
Sliding Window Detokenizer
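A sliding-window detokenizer lets audio stream out as tokens arrive: each step re-decodes a window of recent tokens for context but emits only the samples for the newest tokens, which avoids boundary artifacts between chunks. The sketch below illustrates the windowing logic only; `decode`, the window/stride sizes, and `samples_per_token` are stand-in assumptions, not the actual detokenizer.

```python
# Sketch of sliding-window detokenization: decode overlapping windows of
# tokens, emit only the non-overlapping (new) tail of each decoded chunk.
from typing import Callable, Iterator

def sliding_window_detokenize(
    tokens: list[int],
    decode: Callable[[list[int]], list[float]],
    window: int = 28,          # assumed context size in tokens
    stride: int = 7,           # new tokens revealed per step
    samples_per_token: int = 10,
) -> Iterator[list[float]]:
    """Yield audio chunks; each step decodes up to `window` recent tokens
    but keeps only the samples for the newest `stride` tokens."""
    for end in range(stride, len(tokens) + 1, stride):
        start = max(0, end - window)
        audio = decode(tokens[start:end])
        yield audio[-stride * samples_per_token:]
```

Because every emitted sample was decoded with its left context present, concatenating the chunks reproduces the full-sequence decode for a context-free toy decoder, while real decoders get smooth chunk boundaries.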
Streaming Inference Speed
Faster than playback on A100 40GB for 3B model
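"Faster than playback" is usually expressed as a real-time factor (RTF) below 1: the time spent generating audio divided by the duration of the audio produced. A minimal sketch with illustrative numbers (the figures are assumptions, not measured benchmarks):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time to generate / duration of audio generated.
    RTF < 1 means audio is produced faster than it plays back,
    so a streaming pipeline never stalls."""
    return generation_seconds / audio_seconds

# Illustrative: 6 s of compute producing 10 s of audio.
rtf = real_time_factor(generation_seconds=6.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.60 -> faster than playback
```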
Subjective Metrics
Comparative Mean Opinion Score
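In a Comparative Mean Opinion Score test, raters hear paired samples from two systems and score their preference on a symmetric scale; the CMOS is the mean rating, with the sign indicating which system is preferred. The [-3, +3] scale below is a common convention but an assumption here; variants with other ranges exist.

```python
def cmos(ratings: list[int]) -> float:
    """Comparative Mean Opinion Score: each rater scores a paired comparison
    from -3 (strongly prefer system B) to +3 (strongly prefer system A);
    CMOS is the mean rating, positive favoring system A."""
    if not all(-3 <= r <= 3 for r in ratings):
        raise ValueError("ratings must lie in [-3, 3]")
    return sum(ratings) / len(ratings)
```

For example, ratings of [1, 2, 0, -1] give a CMOS of 0.5, a mild preference for system A.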
Training Data Volume
100k+ hours of speech, billions of text tokens