Speaker Diarization

Lira identifies who is speaking in real time by combining Deepgram Nova-2 diarization with participant names scraped from the Google Meet DOM.

Why Deepgram?

Google Meet provides a single mixed audio stream in which all participants' voices are combined. Lira needs to know who said what for:

Speaker-attributed transcripts ("John: I think we should...")
Per-person contribution analysis in summaries
Accurate task assignment ("Sarah mentioned she'd handle the design")

How It Works

1. Deepgram Streaming

Meeting audio is streamed to Deepgram Nova-2 in parallel with Nova Sonic:

Audio Bridge → PCM audio → Deepgram WebSocket
  → Real-time transcripts with speaker indices
  → { text: "I think we should...", speaker: 0 }
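
With `diarize: true`, Deepgram's streaming results carry a `speaker` index on each word. The following is a minimal sketch of how one result message could be split into speaker-tagged text like the example above. The helper `toSpeakerSegments` and the interface names are hypothetical, and only the response fields the sketch reads are declared; Lira's actual handling may differ.

```typescript
// Subset of a Deepgram streaming "Results" message used by this sketch.
interface DeepgramWord {
  word: string;
  start: number;
  end: number;
  speaker?: number; // present when diarize is enabled
  punctuated_word?: string; // present when smart_format or punctuate is enabled
}

interface DeepgramResult {
  is_final?: boolean;
  channel: { alternatives: { transcript: string; words: DeepgramWord[] }[] };
}

interface SpeakerSegment {
  speaker: number;
  text: string;
}

// Hypothetical helper: group consecutive words from the same speaker index.
function toSpeakerSegments(result: DeepgramResult): SpeakerSegment[] {
  const first = result.channel.alternatives[0];
  const words = first ? first.words : [];
  const segments: SpeakerSegment[] = [];
  for (const w of words) {
    // Assumption: a word without a speaker index is attributed to speaker 0.
    const speaker = w.speaker === undefined ? 0 : w.speaker;
    const token = w.punctuated_word === undefined ? w.word : w.punctuated_word;
    const last = segments[segments.length - 1];
    if (last && last.speaker === speaker) {
      last.text += " " + token;
    } else {
      segments.push({ speaker, text: token });
    }
  }
  return segments;
}
```

Because `interim_results` is enabled, a caller would likely persist only messages where `is_final` is true and treat the rest as provisional.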

2. DOM Participant Scraping

Simultaneously, Lira's browser automation scrapes Google Meet's UI to get participant names:

Google Meet DOM → Participant list
  → ["John Smith", "Sarah Chen", "Lira AI"]

3. Speaker Index → Name Correlation

Deepgram assigns numeric speaker indices (0, 1, 2...). Lira correlates these with real names by:

Detecting which participant has the "speaking" indicator in the DOM
Mapping the active Deepgram speaker index to that participant
Building a correlation table that improves as more people speak

Speaker 0 → "John Smith" (correlated after John spoke first)
Speaker 1 → "Sarah Chen" (correlated after Sarah responded)
Speaker 2 → "Lira AI"   (known: it's the bot)
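
One way to build a correlation table that improves over time is to count, for each Deepgram speaker index, how often each participant showed the speaking indicator while that index was active, and pick the most frequent name. The sketch below illustrates that idea only; the `SpeakerCorrelator` class and its method names are hypothetical and not taken from Lira's code.

```typescript
// Hypothetical vote-based correlation between speaker indices and DOM names.
class SpeakerCorrelator {
  // votes.get(index).get(name) = times `name` was the DOM's active speaker
  // while Deepgram reported `index`.
  private votes = new Map<number, Map<string, number>>();

  // Record one observation. `activeName` is null when the DOM shows no speaker.
  observe(speakerIndex: number, activeName: string | null): void {
    if (activeName === null) return;
    let counts = this.votes.get(speakerIndex);
    if (!counts) {
      counts = new Map<string, number>();
      this.votes.set(speakerIndex, counts);
    }
    counts.set(activeName, (counts.get(activeName) || 0) + 1);
  }

  // Return the most frequently observed name for an index, if any.
  nameFor(speakerIndex: number): string | undefined {
    const counts = this.votes.get(speakerIndex);
    if (!counts) return undefined;
    const ranked: { name: string; count: number }[] = [];
    counts.forEach((count, name) => ranked.push({ name, count }));
    ranked.sort((a, b) => b.count - a.count);
    return ranked.length > 0 ? ranked[0].name : undefined;
  }
}

const correlator = new SpeakerCorrelator();
correlator.observe(0, "John Smith");
correlator.observe(1, "Sarah Chen");
correlator.observe(0, "John Smith");
correlator.observe(0, "Sarah Chen"); // a noisy observation during crosstalk
console.log(correlator.nameFor(0)); // "John Smith"
```

Counting votes instead of locking in the first match lets a single mismatched observation, for example during overlapping speech, be outvoted later.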

4. Named Transcripts

Every message stored in DynamoDB includes the speaker's real name:

{
  "speaker": "John Smith",
  "text": "I think we should redesign the homepage",
  "timestamp": "2026-03-29T10:15:32Z",
  "sentiment": "neutral"
}
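
The record shape above can be expressed as a type plus a small builder. This is a sketch under assumptions: the `TranscriptMessage` interface and `buildTranscriptMessage` function are hypothetical names, the `sentiment` field is typed as a plain string because the source shows only the value "neutral", and the DynamoDB write itself is left out.

```typescript
// Hypothetical type mirroring the stored message shown above.
interface TranscriptMessage {
  speaker: string;
  text: string;
  timestamp: string; // ISO 8601 in UTC, second precision
  sentiment: string;
}

// Build a message record from a resolved speaker name and a transcript line.
function buildTranscriptMessage(
  speaker: string,
  text: string,
  at: Date,
  sentiment: string = "neutral"
): TranscriptMessage {
  // toISOString() includes milliseconds; drop them to match the example format.
  const timestamp = at.toISOString().replace(/\.\d{3}Z$/, "Z");
  return { speaker, text: text.trim(), timestamp, sentiment };
}
```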

Deepgram Configuration

const deepgramConfig = {
  model: 'nova-2',
  language: 'en',
  smart_format: true,
  diarize: true,
  interim_results: true,
  endpointing: 300,
  sample_rate: 16000,
  channels: 1,
  encoding: 'linear16',
};
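
Deepgram's live transcription endpoint takes these options as query parameters on the WebSocket URL. The sketch below shows one way the configuration could be serialized. The `buildListenUrl` helper is hypothetical, and whether Lira builds the URL by hand or goes through Deepgram's SDK is not stated here.

```typescript
const deepgramConfig = {
  model: 'nova-2',
  language: 'en',
  smart_format: true,
  diarize: true,
  interim_results: true,
  endpointing: 300,
  sample_rate: 16000,
  channels: 1,
  encoding: 'linear16',
};

// Hypothetical helper: append each config entry as a query parameter.
function buildListenUrl(config: Record<string, string | number | boolean>): string {
  const url = new URL('wss://api.deepgram.com/v1/listen');
  Object.keys(config).forEach((key) => {
    url.searchParams.set(key, String(config[key]));
  });
  return url.toString();
}

console.log(buildListenUrl(deepgramConfig));
```

The connection also has to be authenticated with a Deepgram API key, which is omitted from this sketch.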

Graceful Degradation

If Deepgram is unavailable, Lira falls back to:

DOM-only speaker detection (less accurate, polling-based)
Generic "Participant" labels if DOM scraping also fails
Transcription itself still works, just without speaker attribution
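
The fallback order above can be sketched as a single resolution function. The names `resolveSpeakerLabel`, `AttributionInput`, and `AttributionSource` are hypothetical; the sketch only illustrates the order of preference and is not Lira's implementation.

```typescript
type AttributionSource = "deepgram" | "dom" | "none";

interface AttributionInput {
  correlatedName?: string; // name resolved from a Deepgram speaker index, if any
  domActiveSpeaker?: string; // name from polling the DOM speaking indicator, if any
}

// Prefer the Deepgram correlation, then DOM-only detection, then a generic label.
function resolveSpeakerLabel(
  input: AttributionInput
): { label: string; source: AttributionSource } {
  if (input.correlatedName) {
    return { label: input.correlatedName, source: "deepgram" };
  }
  if (input.domActiveSpeaker) {
    return { label: input.domActiveSpeaker, source: "dom" };
  }
  return { label: "Participant", source: "none" };
}
```

Returning the source alongside the label lets downstream code, such as summaries or task assignment, treat DOM-only or generic attributions with less confidence.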

Accuracy & Limitations

2-3 speakers: ~95% accuracy after initial correlation
4-6 speakers: ~85% accuracy
7+ speakers: Accuracy decreases as voices become harder to distinguish
Overlapping speech: May misattribute during crosstalk
