OpenAI Realtime API vs Sesame

Comparing the features of the OpenAI Realtime API and Sesame

Feature

OpenAI Realtime API

Sesame

Capability Features

Consistent Personality

Context Awareness

Conversational Dynamics

Conversational Speech Generation

Dataset Size

1 million hours

Emotional Intelligence

Enterprise Privacy Commitment

Evaluation Suite

Expanded Model Support Planned

Five New Voices

Function Calling

Human and Automated Safety Monitoring

Interruption Handling

Model Sizes

Tiny: 1B backbone, 100M decoder; Small: 3B backbone, 250M decoder; Medium: 8B backbone, 300M decoder

Multiple Speaker Handling

No Training on Data Without Permission

Objective Metrics

Word Error Rate, Speaker Similarity, Homograph Disambiguation, Pronunciation Consistency

Partial Multilingual Support Planned

Planned for 20+ languages

Playground Access

Prompt Caching Planned

Pronunciation Correction

Public Beta

Reference Client Available

Sequence Length

2048

Single-Stage Model

Six Preset Voices

Speech-to-Speech

Streaming Audio Inputs/Outputs

Subjective Metrics

Comparative Mean Opinion Score

Supports Text and Audio Inputs

Text, Audio

Text and Audio Input

Text, Audio

Training Epochs

Ultra Low Latency

WebSocket Connection

Integration Features

Agora Integration

Chat Completions API Integration

GitHub Release

LiveKit Integration

Llama Architecture Backbone

Mimi Split-RVQ Tokenizer

OpenAI Node.js SDK Planned

OpenAI Python SDK Planned

Supports GPT-4o

gpt-4o-realtime-preview

Twilio Voice API Integration

Limitation Features

AI Disclosure Requirement

Audio Only Modality (Initially)

Cannot Model Conversation Structure

English Language Dominance

Lower Session Limits Tiers 1-4

Lower than 100

Memory Bottleneck in Training

No Pre-trained Language Model Use

No Simultaneous Session Limit Anymore

Real-Time Generation Delay

RVQ time-to-first-audio scales poorly

Simultaneous Sessions Limit Tier 5

100

Usage Policy Restriction

Pricing Features

Approximate Audio Input Price

$0.06/minute

Approximate Audio Output Price

$0.24/minute

Free Preview

No Free Tier

Open Source

Apache 2.0

Pricing Audio Input

$100/1M tokens

Pricing Audio Output

$200/1M tokens

Pricing Cached Audio Input

$20/1M tokens

Pricing Cached Text Input

$2.50/1M tokens

Pricing Text Input

$5/1M tokens

Pricing Text Output

$20/1M tokens
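
The per-token and approximate per-minute audio prices above can be reconciled with a little arithmetic. The sketch below is illustrative only: the prices are copied from the pricing rows, while the tokens-per-minute figures (600 for audio input, 1200 for audio output) are an assumption back-derived from the approximate per-minute prices, not values stated in the table. The function names are hypothetical helpers, not part of any SDK.

```python
# Prices copied from the pricing rows above (USD per 1M tokens).
PRICE_PER_MILLION_TOKENS = {
    "audio_input": 100.00,
    "audio_output": 200.00,
    "cached_audio_input": 20.00,
    "text_input": 5.00,
    "cached_text_input": 2.50,
    "text_output": 20.00,
}

# ASSUMPTION: audio token rates implied by the approximate per-minute prices:
#   $0.06/min at $100/1M tokens -> 600 tokens per minute of input audio
#   $0.24/min at $200/1M tokens -> 1200 tokens per minute of output audio
AUDIO_TOKENS_PER_MINUTE = {"audio_input": 600, "audio_output": 1200}


def token_cost(kind: str, tokens: float) -> float:
    """Cost in USD for a number of tokens of the given kind."""
    return tokens * PRICE_PER_MILLION_TOKENS[kind] / 1_000_000


def audio_minutes_cost(kind: str, minutes: float) -> float:
    """Cost in USD for uncached audio, using the assumed tokens-per-minute rate."""
    return token_cost(kind, minutes * AUDIO_TOKENS_PER_MINUTE[kind])


def estimate_session_cost(
    input_minutes: float,
    output_minutes: float,
    text_input_tokens: int = 0,
    text_output_tokens: int = 0,
) -> float:
    """Rough session estimate: uncached audio in both directions plus optional text."""
    return (
        audio_minutes_cost("audio_input", input_minutes)
        + audio_minutes_cost("audio_output", output_minutes)
        + token_cost("text_input", text_input_tokens)
        + token_cost("text_output", text_output_tokens)
    )


if __name__ == "__main__":
    # Five minutes of user speech and three minutes of generated speech.
    print(f"${estimate_session_cost(5, 3):.2f}")
```

Under these assumptions, a call with five minutes of input audio and three minutes of output audio comes to 5 × $0.06 + 3 × $0.24. Cached input is priced lower in the table, so a real session that reuses context may cost less than this estimate.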