Speaker Diarization
Lira identifies who is speaking in real time using Deepgram Nova-2, combined with DOM-based participant name scraping from Google Meet.
Why Deepgram?
Google Meet provides a single mixed audio stream in which all participants' voices are combined. Lira needs to know who said what in order to produce:
- Speaker-attributed transcripts ("John: I think we should...")
- Per-person contribution analysis in summaries
- Accurate task assignment ("Sarah mentioned she'd handle the design")
How It Works
1. Deepgram Streaming
Meeting audio is streamed to Deepgram Nova-2 in parallel with Nova Sonic:
Audio Bridge → PCM audio → Deepgram WebSocket
    → Real-time transcripts with speaker indices
    → { text: "I think we should...", speaker: 0 }
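With `diarize: true`, Deepgram's streaming results attach the `speaker` index to each word in `channel.alternatives[0].words`, so a single result can contain more than one speaker. The following is a minimal sketch of grouping those words into per-speaker segments. The names `DeepgramWord`, `SpeakerSegment`, and `groupBySpeaker` are illustrative assumptions, not Lira's actual code.

```typescript
// Subset of one word entry in a Deepgram streaming result.
interface DeepgramWord {
  word: string;
  punctuated_word?: string; // present when smart_format is enabled
  speaker?: number;         // present when diarize is enabled
}

// Hypothetical output shape: one run of speech by one speaker index.
interface SpeakerSegment {
  speaker: number;
  text: string;
}

// Group consecutive words that share a speaker index into segments.
function groupBySpeaker(words: DeepgramWord[]): SpeakerSegment[] {
  const segments: SpeakerSegment[] = [];
  for (const w of words) {
    // Assumption: treat a missing index as speaker 0.
    const speaker = w.speaker ?? 0;
    const token = w.punctuated_word ?? w.word;
    const last = segments[segments.length - 1];
    if (last && last.speaker === speaker) {
      last.text += " " + token;
    } else {
      segments.push({ speaker, text: token });
    }
  }
  return segments;
}
```

Grouping per word rather than per result keeps a quick interjection from being merged into the previous speaker's sentence.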
2. DOM Participant Scraping
Simultaneously, Lira's browser automation scrapes Google Meet's UI to get participant names:
Google Meet DOM → Participant list
    → ["John Smith", "Sarah Chen", "Lira AI"]
3. Speaker Index → Name Correlation
Deepgram assigns numeric speaker indices (0, 1, 2...). Lira correlates these with real names by:
- Detecting which participant has the "speaking" indicator in the DOM
- Mapping the active Deepgram speaker index to that participant
- Building a correlation table that improves as more people speak
Speaker 0 → "John Smith" (correlated after John spoke first)
Speaker 1 → "Sarah Chen" (correlated after Sarah responded)
Speaker 2 → "Lira AI" (known — it's the bot)
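One way to build such a table is to count how often each name showed the speaking indicator while a given index was active, then pick the most frequent name. This is only a sketch under that assumption: the source does not say how Lira resolves conflicting observations, and `SpeakerCorrelator` is a hypothetical name.

```typescript
// Hypothetical vote-based table mapping Deepgram speaker indices to names.
class SpeakerCorrelator {
  // For each speaker index: candidate name -> number of observations.
  private votes = new Map<number, Map<string, number>>();

  // Record that `name` had the speaking indicator while `index` was active.
  observe(index: number, name: string): void {
    const counts = this.votes.get(index) ?? new Map<string, number>();
    counts.set(name, (counts.get(name) ?? 0) + 1);
    this.votes.set(index, counts);
  }

  // Most frequently observed name for `index`, or undefined if never seen.
  nameFor(index: number): string | undefined {
    const counts = this.votes.get(index);
    if (!counts) return undefined;
    let best: string | undefined;
    let bestCount = 0;
    counts.forEach((count, name) => {
      if (count > bestCount) {
        best = name;
        bestCount = count;
      }
    });
    return best;
  }
}
```

Counting votes instead of storing the first match lets a single wrong observation, for example during crosstalk, be outvoted by later ones.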
4. Named Transcripts
Every message stored in DynamoDB includes the speaker's real name:
{
  "speaker": "John Smith",
  "text": "I think we should redesign the homepage",
  "timestamp": "2026-03-29T10:15:32Z",
  "sentiment": "neutral"
}
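A typed version of this record might look as follows. The interface mirrors the four fields shown in the example; `TranscriptMessage` and `buildMessage` are hypothetical names, and how Lira computes `sentiment` or writes to DynamoDB is not covered here.

```typescript
// Hypothetical record shape with the same fields as the stored message.
interface TranscriptMessage {
  speaker: string;
  text: string;
  timestamp: string; // ISO 8601, UTC, second precision
  sentiment: string;
}

// Assemble a record from an already-resolved speaker name and transcript text.
function buildMessage(
  speaker: string,
  text: string,
  at: Date,
  sentiment: string,
): TranscriptMessage {
  // toISOString() includes milliseconds; drop them to match the example format.
  const timestamp = at.toISOString().replace(/\.\d{3}Z$/, "Z");
  return { speaker, text: text.trim(), timestamp, sentiment };
}
```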
Deepgram Configuration
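const deepgramConfig = {
  model: 'nova-2',
  language: 'en',
  smart_format: true,    // punctuation plus formatting of numbers, dates, etc.
  diarize: true,         // tag each word with a speaker index
  interim_results: true, // emit partial transcripts before they are final
  endpointing: 300,      // ms of silence before speech is treated as finished
  sample_rate: 16000,    // Hz
  channels: 1,           // mono
  encoding: 'linear16',  // 16-bit little-endian PCM
};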
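When connecting over a raw WebSocket, Deepgram's live endpoint `wss://api.deepgram.com/v1/listen` takes these options as query parameters. The sketch below shows that mapping. `buildListenUrl` is a hypothetical helper, authentication is omitted, and whether Lira uses a raw socket or an SDK is not stated in the source.

```typescript
type DeepgramOptions = Record<string, string | number | boolean>;

// Serialize streaming options into the query string of the live endpoint.
function buildListenUrl(options: DeepgramOptions): string {
  const url = new URL("wss://api.deepgram.com/v1/listen");
  for (const key of Object.keys(options)) {
    url.searchParams.set(key, String(options[key]));
  }
  return url.toString();
}

// Usage with the same values as the configuration above.
const listenUrl = buildListenUrl({
  model: "nova-2",
  language: "en",
  smart_format: true,
  diarize: true,
  interim_results: true,
  endpointing: 300,
  sample_rate: 16000,
  channels: 1,
  encoding: "linear16",
});
```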
Graceful Degradation
If Deepgram is unavailable, Lira degrades as follows:
- DOM-only speaker detection (less accurate, polling-based)
- Generic "Participant" labels if DOM scraping also fails
- Full transcription still works, just without attribution
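The fallback order above can be expressed as a simple chain. This is a sketch, not Lira's implementation: `resolveSpeakerLabel` and its two inputs are hypothetical, with `undefined` standing in for a source that is unavailable.

```typescript
// Pick the best available speaker label, falling back in the order listed above.
function resolveSpeakerLabel(
  correlatedName: string | undefined,   // from the Deepgram index -> name table
  domActiveSpeaker: string | undefined, // from polling the DOM speaking indicator
): string {
  return correlatedName ?? domActiveSpeaker ?? "Participant";
}
```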
Accuracy & Limitations
- 2-3 speakers: ~95% accuracy after initial correlation
- 4-6 speakers: ~85% accuracy
- 7+ speakers: Accuracy decreases as voices become harder to distinguish
- Overlapping speech: May misattribute during crosstalk