Voice Identity (Acoustic Compass): Speaker Attribution in Real Time

Why Speaker Attribution Matters

Knowing what was said is not always enough. In a conversation with more than one speaker, knowing who said it changes the meaning, the urgency, and the appropriate response.

A nurse and a doctor both address a patient. The patient addresses the doctor. Each speaker’s translated output needs to be attributed to the right person — or the translated conversation becomes a confusing wall of statements with no relational context.

Voice Identity is Puente’s answer to this. It goes beyond the turn-taking labels of Group Mode and the basic speaker separation of Auto-detect mode. It builds an actual acoustic fingerprint of each speaker, attributes every translation turn to a specific person, and presents those attributions visually in the Speaker Table View.

The Three Signals

1. Voice Embedding (ECAPA-TDNN)

Puente builds an acoustic fingerprint of each speaker using an ECAPA-TDNN model — a neural network architecture designed for speaker verification. This fingerprint captures the unique combination of fundamental frequency, vocal tract resonances, and articulatory characteristics that make a voice identifiable.

Each speaker's fingerprint is added to a session registry the first time that speaker is heard. On subsequent turns, the fingerprint of the incoming audio is compared against the registry, and a match above the confidence threshold attributes the turn to that speaker. This is the primary identification signal and the most reliable for speakers who have already been heard in the current session.
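The register-then-match flow described above can be sketched in a few lines. This is a minimal illustration, not Puente's actual implementation: the `SessionRegistry` class, the use of cosine similarity, the 0.7 threshold, and the rule that an unmatched voice is registered as a new speaker are all assumptions made for the example. Real ECAPA-TDNN embeddings have far more dimensions than the toy three-element vectors used here.

```python
import math


class SessionRegistry:
    """Hypothetical per-session store of speaker fingerprints (embeddings)."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed value, not a documented Puente setting
        self.speakers = []          # list index doubles as the speaker ID

    @staticmethod
    def cosine(a, b):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def attribute(self, embedding):
        """Return (speaker_id, best_score, is_new) for one turn's embedding."""
        best_id, best_score = None, -1.0
        for speaker_id, known in enumerate(self.speakers):
            score = self.cosine(embedding, known)
            if score > best_score:
                best_id, best_score = speaker_id, score

        # A match at or above the threshold attributes the turn to a known speaker.
        if best_id is not None and best_score >= self.threshold:
            return best_id, best_score, False

        # Assumption: anything below the threshold is treated as a first occurrence.
        self.speakers.append(list(embedding))
        return len(self.speakers) - 1, best_score, True


registry = SessionRegistry()
print(registry.attribute([1.0, 0.0, 0.0]))  # first voice heard: registered as new
print(registry.attribute([0.0, 1.0, 0.0]))  # dissimilar voice: registered as new
print(registry.attribute([0.9, 0.1, 0.0]))  # close to the first voice: matched
```

In this sketch the stored fingerprint is never updated after registration. A production system might average in later turns to make the profile more robust, but nothing in this article says whether Puente does so.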

2. Direction of Arrival

When the native microphone array module ships, Puente will use the spatial angle of incoming audio to help distinguish speakers who are physically located in different parts of the room. A voice arriving from 30 degrees to the left is most likely a different speaker than a voice arriving from 90 degrees to the right.

Direction of Arrival is a fusion input, not a standalone identifier — it helps resolve ambiguous cases where two speakers have acoustically similar voices. The architecture is already in place; the native module that unlocks full directional resolution is in development.
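One way to picture direction as a tie-breaker rather than an identifier is the sketch below. It is an assumption-laden illustration: the function names, the 0.05 ambiguity margin, the convention of negative angles for left and positive for right, and the idea of remembering each speaker's last known bearing are all hypothetical, not details of Puente's fusion engine.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def resolve_with_direction(candidates, incoming_deg, ambiguity_margin=0.05):
    """Pick a speaker from a non-empty list of candidates.

    candidates:   list of (speaker_id, voice_score, last_known_deg)
    incoming_deg: bearing of the current audio, or None if unavailable
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top = ranked[0]

    # With no direction data or only one candidate, the voice score decides.
    if incoming_deg is None or len(ranked) < 2:
        return top[0]

    # If the best voice score clearly beats the runner-up, direction is not needed.
    if top[1] - ranked[1][1] > ambiguity_margin:
        return top[0]

    # Ambiguous case: among near-tied candidates, prefer the closest bearing.
    near_tied = [c for c in ranked if top[1] - c[1] <= ambiguity_margin]
    return min(near_tied, key=lambda c: angular_distance(c[2], incoming_deg))[0]


# Two acoustically similar voices, one to the left and one to the right.
similar = [("A", 0.81, -30.0), ("B", 0.80, 90.0)]
print(resolve_with_direction(similar, incoming_deg=85.0))  # B
print(resolve_with_direction(similar, incoming_deg=None))  # A
```

Direction only changes the outcome when the voice scores are too close to call, which matches its role as a fusion input.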

3. PTT Side

When an earbud pair with left/right button controls is in use, Puente can use the push-to-talk (PTT) side, meaning which earbud's button was pressed, as a confident speaker attribution signal. In a two-party earbud conversation, Party A holds the left bud and Party B holds the right. Pressing the left bud’s button to initiate speech is an unambiguous attribution signal that requires no acoustic analysis at all.

PTT side is the most reliable method in two-party earbud contexts because it is deterministic: there is no probability threshold and no confidence score. When available, it takes precedence over the other signals in the fusion engine.
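The precedence rule can be expressed as a short decision function. The sketch below is illustrative only: `TurnSignals`, `fuse`, the side-to-party mapping, and the 0.7 voice threshold are hypothetical names and values, and the fallback behaviour when no signal is confident is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TurnSignals:
    """Hypothetical bundle of attribution signals for one translation turn."""
    ptt_side: Optional[str] = None       # "left", "right", or None if no button press
    voice_speaker: Optional[str] = None  # best voice-embedding match, if any
    voice_score: float = 0.0             # confidence of that match


# Mapping taken from the two-party earbud scenario described above.
PTT_SIDE_TO_PARTY = {"left": "Party A", "right": "Party B"}


def fuse(signals: TurnSignals, threshold: float = 0.7) -> Tuple[Optional[str], str]:
    """Return (speaker, source), where source names the signal that decided."""
    # PTT side is deterministic, so it wins whenever it is present.
    if signals.ptt_side in PTT_SIDE_TO_PARTY:
        return PTT_SIDE_TO_PARTY[signals.ptt_side], "ptt"

    # Otherwise fall back to the voice embedding, subject to the threshold.
    if signals.voice_speaker is not None and signals.voice_score >= threshold:
        return signals.voice_speaker, "voice"

    # Assumption: with no confident signal, the turn is left unattributed.
    return None, "unresolved"


# The button press overrides even a confident but conflicting voice match.
print(fuse(TurnSignals(ptt_side="left", voice_speaker="Party B", voice_score=0.99)))
```

Returning the deciding source alongside the speaker keeps the result explainable, which is the kind of breakdown a detail view would need to display.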

Speaker Table View

The Speaker Table View is a new display layout available when Voice Identity is active. Each identified speaker gets a dedicated color-coded lane in the translation output:

Color coding — up to 8 distinct colors, one per identified speaker
Direction arrows — when Direction of Arrival data is available, an arrow icon shows the speaker’s approximate position in the room
Long-press detail sheet — tap and hold any speaker lane to see the identification breakdown: ECAPA-TDNN confidence score, direction estimate (if available), PTT side (if applicable)

The detail sheet exists for transparency: Puente shows you exactly how it identified each speaker, so you can trust the attribution — or correct it manually if the system made an error.

Privacy

Voice embeddings built by Acoustic Compass are stored locally on-device. They are never uploaded to any server, never shared with third parties, and never used for any purpose outside of speaker attribution within Puente sessions. You can clear all stored speaker profiles in Settings → Privacy → Clear Voice Profiles.

Download Puente — Speaker Table View available with Pro