Timestamp Identification for Clip Cutting
How we locate the exact start and end times for Congressional Record statements in floor video.
We want to provide audio and video clips for every statement in the Congressional Record. When you're reading what a member of Congress said on the floor, you should be able to watch or listen to them actually saying it—to fact-check, see the original context, or simply verify the source. We also want to enable AI models, researchers, and other platforms to build their own journalistic and analytical experiences on top of our data. This means we need to find the precise timestamp range in floor video where each statement occurs—and we need to do this at scale across roughly one million statements in our target corpus.
The challenge is that the Congressional Record and the floor audio are two fundamentally different documents. The Record is edited for readability: members and staff clean up spoken remarks, fix grammar, remove filler words, and restructure sentences before publication. This is intentional—the Record is meant to be a polished transcript, not a verbatim one. But it means we can't simply search for Record text in an audio transcript.
We have no choice but to work with both sources. The Congressional Record provides essential metadata we can't get elsewhere: precise speaker identification, topic headings, bill and amendment numbers, procedural context, and the full text of statements submitted for the record that were never spoken aloud. The floor audio provides the actual timestamps we need—and for our purposes, it's critical. We filter out procedural and administrative remarks to surface only substantive statements, and we need to cut precise clips for each one.
Prior Work
This post builds on work we published on arXiv. The paper introduces TimeStampEval, a benchmark for retrieving precise millisecond timestamps from long transcripts given non-verbatim quotes. The core challenge: official written records and speech-to-text transcripts are semantically identical but syntactically different, which breaks traditional fuzzy matching.
We evaluated six LLMs across the Gemini and GPT-5 families and found that the GPT-5 models achieved the highest accuracy on fuzzy matching. Our two-stage "Assisted Fuzzy" approach—RapidFuzz pre-filtering followed by LLM verification on short snippets—improved accuracy by up to 50 percentage points while cutting costs by over 90%. The paper covers the algorithm in detail.
Our motivating use case is an automated long-form podcast that stitches Congressional Record clips into AI-hosted narration. The core technical challenge: given a sentence-timestamped transcript and a target quote that may have syntactic drift from transcription errors or editorial changes, return the exact start_ms/end_ms boundaries. While standard algorithms handle verbatim matching perfectly, real production systems face fuzzy variants that break traditional approaches.
We evaluated six modern LLMs on a 2,800-sentence Congressional Record transcript (120k tokens), finding four key results. First, simple prompt engineering dominates model selection: placing the query before the transcript and using a compact text-first format improved accuracy by 3–20 percentage points while reducing tokens by 30–42%. Second, off-by-one errors form a distinct regime between complete failures and exact matches, suggesting models understand the task but struggle with precise boundaries. Third, a modest reasoning budget (600–850 tokens) rescues weak configurations from 37% to 77% accuracy and pushes strong ones to mid-90s. Fourth, our two-stage "Assisted Fuzzy" approach—using RapidFuzz to narrow candidates then LLM verification on short snippets—improved fuzzy matching accuracy by up to 50 percentage points while reducing latency by half and cost-per-correct by 87–96%.
A focused deep dive with Gemini Flash across ten transcripts (50k–900k tokens, spanning 1989–2025) confirmed three patterns: (1) the method scales robustly to 400k tokens with gradual degradation approaching one million, (2) accuracy remains stable across decades despite vocabulary and speaking style changes, and (3) the system correctly rejects absent targets with 95–100% accuracy, a critical production safeguard. The resulting recipe is practical: for verbatim quotes, skip LLMs entirely; for fuzzy matches, pre-narrow with traditional fuzzing then verify with an LLM on just the relevant snippet.
Read the full paper →
The Task
To see the problem concretely, consider this statement from David Schweikert on "Growing the GDP" from January 13, 2026. It starts like this:
Madam Speaker, as a mercy to the poor staff here, I am not going to use the
whole half an hour. I am just going to try to run through a couple of concepts.
I did a little tripping up of some details last week, where we were trying to
explain some of the things we see going on in the math. I want to walk through a
concept.
We actually, right now, as a country, are remarkably blessed. The GDP growth,
the economic growth, is actually well beyond almost any of what even my
economists were predicting over the last year.
We are seeing some data right now--if you actually look at the Atlanta Fed,
GDPNow, nowcasts, some of these--we could hit a five-handle this quarter, which
is remarkable.
Looking at some of the data from the third quarter of 2025, it looks like the
number of a 4.3 percent GDP growth is going to hold up....
To provide audio and video clips for this statement, we need to determine exactly where in the source files Rep. Schweikert starts and stops speaking—with millisecond precision. The House records its entire floor session as a single video file that can run 10+ hours. Somewhere in there, this statement happens. We need to find it.
Here's what the same statement looks like in the transcript. We re-transcribe the floor audio using AssemblyAI, which gives us word-level timestamps we can rely on:
{
"start_ms": 40191728,
"end_ms": 40193328,
"text": "Thank you, Madam speaker, pro tem."
},
{
"start_ms": 40193488,
"end_ms": 40199648,
"text": "I get myself slightly set up here as mercy to the poor staff here."
},
{
"start_ms": 40200078,
"end_ms": 40201678,
"text": "I'm not going to use the whole half an hour."
},
{
"start_ms": 40201678,
"end_ms": 40203918,
"text": "I'm going to just try to run through a couple of concepts."
},
{
"start_ms": 40206478,
"end_ms": 40214478,
"text": "Did a little tripping up of some details last week where we were trying to explain some of the things we see going on in the math."
},
{
"start_ms": 40215358,
"end_ms": 40219278,
"text": "And I want to walk through a concept."
},
{
"start_ms": 40220238,
"end_ms": 40223918,
"text": "We actually right now as a country are remarkably blessed."
},
{
"start_ms": 40224398,
"end_ms": 40232948,
"text": "The GDP growth, the economic growth is actually well beyond almost any of even my economists were predicting over the last year."
},
{
"start_ms": 40233028,
"end_ms": 40235668,
"text": "We're seeing some data right now in you."
},
{
"start_ms": 40235668,
"end_ms": 40244388,
"text": "If you actually look at the Atlanta feds GDP now, now cast some of these, we could hit a five handle this quarter which is just like remarkable."
},
{
"start_ms": 40245908,
"end_ms": 40256918,
"text": "We actually think the, you know, some of the Data from the third quarter, 25 looks like it's the number of a 4.3% GDP growth is going to hold up."
}
If you look closely, you can see that while these two sources cover the same speech, there are differences everywhere. In this short excerpt alone: "Madam Speaker" becomes "Thank you, Madam speaker, pro tem."; "I am" becomes "I'm"; an extra spoken sentence ("I get myself slightly set up here...") never made it into the Record; and "GDPNow, nowcasts" comes through as "GDP now, now cast."
This is just a few sentences from one statement. Now imagine doing this across a million statements—our goal is to process the full historical Congressional Record at scale. We can't do simple string matching. We need an algorithm that finds the right location in the transcript despite all these differences.
The transcription will occasionally mishear words, especially technical terms, acronyms, and proper nouns. That's expected. But for timestamp finding, we've found the transcript is more than sufficient. Our pipeline doesn't need a perfect transcription—it needs to locate the right region of audio, and there's enough signal in the text to do that reliably.
Prompt Format: Why These Conventions?
We don't send raw JSON to language models. Instead, we transform it into a compact text format that performs well while reducing token count by 30%. The specific format we use—and why it matters—emerged from systematic testing across thousands of trials.
To benchmark these choices, we used an exact match task: the target sentence appears verbatim in the transcript, so we know the correct answer and can measure accuracy precisely. Our actual production task is fuzzy—as shown above, the Congressional Record text does not appear verbatim in the transcript—but the exact match task gives us a clean signal for comparing prompt configurations.
We tested three dimensions:
(1) JSON versus text format. As shown above, our speech-to-text provider returns transcripts as JSON—each sentence an object with text, start_ms, and end_ms fields. The "text format" unpacks this into a compact one-line-per-sentence representation, eliminating braces, quotation marks, and key names—about 30% fewer tokens:
The House will be in order., start_ms: 34090, end_ms: 35050; The prayer will be offered by our chaplain, Reverend Kibben., start_ms: 35050, end_ms: 38420; Almighty God, we give you thanks for this gathering of your servants., start_ms: 39100, end_ms: 43280; ...
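As a concrete sketch, the JSON-to-text unpacking might look like this (the function name is illustrative; the field names match the transcript objects shown above):

```python
import json

def to_compact_text(sentences):
    """Unpack speech-to-text JSON sentences into one-line-per-sentence text.

    Drops braces, quotation marks, and key names, which is where the
    roughly 30% token savings comes from.
    """
    return " ".join(
        f"{s['text']}, start_ms: {s['start_ms']}, end_ms: {s['end_ms']};"
        for s in sentences
    )

raw = json.loads("""[
  {"start_ms": 34090, "end_ms": 35050, "text": "The House will be in order."},
  {"start_ms": 35050, "end_ms": 38420, "text": "The prayer will be offered by our chaplain, Reverend Kibben."}
]""")
compact = to_compact_text(raw)
# -> "The House will be in order., start_ms: 34090, end_ms: 35050; The prayer ..."
```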
(2) Field order within each sentence. In the text format, we can arrange the three fields—sentence text, start_ms, and end_ms—in different orders. We tested three configurations:
"First" — sentence text before timestamps.
text, start_ms: X, end_ms: Y;
The American people deserve better than this., start_ms: 9323830, end_ms: 9334470;
"Middle" — sentence text between timestamps.
start_ms: X, text, end_ms: Y;
start_ms: 9323830, The American people deserve better than this., end_ms: 9334470;
"End" — sentence text after timestamps.
start_ms: X, end_ms: Y, text
start_ms: 9323830, end_ms: 9334470, The American people deserve better than this.
(3) Target placement in the prompt. Whether the target sentence from the Congressional Record is placed before or after the transcript in the prompt:
Target before the transcript ("Before"):
Instructions: Find the start and end timestamps... Target: The American people deserve better than this. Transcript: The House will be in order., start_ms: 34090, end_ms: 35050; The American people deserve better than this., start_ms: 9323830, end_ms: 9334470; ...
Target after the transcript ("After"):
Instructions: Find the start and end timestamps... Transcript: The House will be in order., start_ms: 34090, end_ms: 35050; The American people deserve better than this., start_ms: 9323830, end_ms: 9334470; ... Target: The American people deserve better than this.
This deep dive used Gemini 2.5 Flash across 1,440 trials. To see the effect of each choice, we start from a naive baseline (JSON format, with the target sentence from the Congressional Record placed after the transcript) and apply each optimization in turn:
The progression tells an important story. Starting with native JSON and the target sentence from the Congressional Record placed after the transcript (52%), naively switching to text format actually makes things worse—dropping to 39%. But we can recover: placing the target sentence before the transcript jumps us to 72%, and reordering the text to appear before timestamps reaches 91%—while also saving 30% in token usage. The final configuration outperforms JSON by 39 percentage points.
Adding model thinking (via Gemini 2.5 Flash's extended thinking mode) pushes accuracy to 96%, and using the larger Gemini 2.5 Pro reaches 98%. OpenAI's GPT-5 series was released during our evaluation and showed similar patterns, confirming these findings generalize across model families.
Why GPT-5 for production? The benchmark above uses an exact match task, but our production task is fuzzy—the target doesn't appear verbatim in the transcript. We evaluated both tasks across models:
| Model | Fuzzy Match Accuracy |
|---|---|
| Gemini 2.5 Pro | 91.7% |
| Gemini 2.5 Flash | 87.8% |
| Gemini 2.5 Flash-Lite | 46.1% |
| GPT-5 | 92.8% |
| GPT-5 mini | 87.8% |
| GPT-5 nano | 70.6% |
GPT-5 has a slight accuracy advantage on the fuzzy matching task (92.8% vs. 91.7% for Gemini 2.5 Pro). Since errors can compound—we run this task twice per statement (start and end boundary), across hundreds of thousands of statements—even small accuracy differences matter at scale. We're also Tier 5 with OpenAI, which gives us the highest rate limits, while our Google Cloud tier is lower. At the volume we process, rate limits become a real constraint, so the combination of accuracy edge and higher throughput made GPT-5 the practical choice.
Why text over JSON? The text format is about 30% shorter in token count compared to equivalent JSON, eliminating braces, quotation marks, and key names. This directly reduces cost and latency. And when configured correctly ("First, Before"), it matches or exceeds JSON performance—so you get the savings without sacrificing accuracy.
Why text before timestamps? Placing the sentence text before its timestamps ("First") consistently outperformed placing it in the middle ("Middle") or at the end ("End"). We believe this ordering aligns better with how models process sequential information—they read the content first, then associate it with the relevant timestamps.
Why place the target sentence before the transcript? Placing the target sentence from the Congressional Record before the transcript in the prompt significantly outperformed placing it after. This makes sense: the model knows what it's searching for before it encounters the search space, rather than having to "remember back" after processing a potentially very long transcript.
Based on these findings, we standardized on this configuration throughout our pipeline: text format with sentence text before timestamps ("First"), and the target sentence from the Congressional Record placed before the transcript ("Before"). We also use structured outputs—a capability of modern LLM APIs that guarantees responses match a defined schema (we define ours with Pydantic models). Here's a simplified prompt showing the full structure:
INSTRUCTIONS: Find the start_ms timestamp for the target sentence in the transcript. Return the timestamp of the sentence that best matches the target.
OUTPUT FORMAT: { "start_ms": integer }
TARGET SENTENCE: Madam Speaker, as a mercy to the poor staff here, I am not going to use the whole half an hour.
TRANSCRIPT: The House will be in order., start_ms: 34090, end_ms: 35050; The prayer will be offered by our chaplain, Reverend Kibben., start_ms: 35050, end_ms: 38420; Almighty God, we give you thanks for this gathering of your servants., start_ms: 39100, end_ms: 43280; ... The chair recognizes the gentleman from Arizona, Mr. Schweikert, for 30 minutes., start_ms: 40185000, end_ms: 40190500; Thank you, Madam speaker, pro tem., start_ms: 40191728, end_ms: 40193328; I get myself slightly set up here as mercy to the poor staff here., start_ms: 40193488, end_ms: 40199648; I'm not going to use the whole half an hour., start_ms: 40200078, end_ms: 40201678; I'm going to just try to run through a couple of concepts., start_ms: 40201678, end_ms: 40203918; Did a little tripping up of some details last week where we were trying to explain some of the things we see going on in the math., start_ms: 40206478, end_ms: 40214478; And I want to walk through a concept., start_ms: 40215358, end_ms: 40219278; We actually right now as a country are remarkably blessed., start_ms: 40220238, end_ms: 40223918; The GDP growth, the economic growth is actually well beyond almost any of even my economists were predicting over the last year., start_ms: 40224398, end_ms: 40232948; We're seeing some data right now in you., start_ms: 40233028, end_ms: 40235668; If you actually look at the Atlanta feds GDP now, now cast some of these, we could hit a five handle this quarter which is just like remarkable., start_ms: 40235668, end_ms: 40244388; We actually think the, you know, some of the Data from the third quarter, 25 looks like it's the number of a 4.3% GDP growth is going to hold up., start_ms: 40245908, end_ms: 40256918;
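Putting the conventions together, a prompt in this configuration could be assembled like so (a minimal sketch, not our exact production code; the helper name is hypothetical):

```python
def build_prompt(target: str, sentences: list[dict]) -> str:
    """Assemble a timestamp-finding prompt in our standard configuration:
    text format, sentence text before its timestamps ("First"), and the
    target placed ahead of the transcript ("Before")."""
    transcript = " ".join(
        f"{s['text']}, start_ms: {s['start_ms']}, end_ms: {s['end_ms']};"
        for s in sentences
    )
    return (
        "INSTRUCTIONS: Find the start_ms timestamp for the target sentence "
        "in the transcript. Return the timestamp of the sentence that best "
        "matches the target.\n"
        'OUTPUT FORMAT: { "start_ms": integer }\n'
        f"TARGET SENTENCE: {target}\n"
        f"TRANSCRIPT: {transcript}"
    )
```

In production the response itself is constrained with structured outputs, so the model can only return JSON matching the declared schema.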
Our Approach
We search for the Congressional Record statement text (the target) in the speech-to-text transcript (the search corpus). We don't search for the entire statement—just the boundaries. We find the start timestamp and end timestamp separately.
We use a three-stage pipeline, starting with fast exact matching and escalating to more expensive methods when needed:
- Progressive shortening — trim words from the target until a unique exact match is found
- LLM-assisted fuzzy matching — use string distance metrics to construct short snippets, then pass them to a small LLM to extract timestamps
- Full transcript scan — send the full transcript to a large LLM when lower-cost methods fail
Most statements resolve in the first two stages; only a small remainder needs the full-transcript fallback.
1. Progressive Shortening
We take the first sentence (for start timestamp) or last sentence (for end timestamp) from the Congressional Record and search for a unique exact match in the transcript. If no unique match exists, we progressively shorten the sentence—removing words from the end for start timestamps, or from the beginning for end timestamps—until we find a unique match or hit a minimum length.
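In outline, the search might look like this (a simplified sketch; the real pipeline also handles skip lists, normalization, and honorific variants):

```python
def find_unique_match(target: str, sentences: list[str],
                      boundary: str = "start", min_words: int = 4):
    """Progressively shorten the target until exactly one transcript
    sentence contains it. Drop words from the end when searching for a
    start timestamp, from the beginning for an end timestamp."""
    words = target.split()
    while len(words) >= min_words:
        needle = " ".join(words)
        hits = [i for i, s in enumerate(sentences) if needle in s]
        if len(hits) == 1:
            return hits[0]   # unique match: use this sentence's timestamps
        if boundary == "start":
            words.pop()      # trim from the end
        else:
            words.pop(0)     # trim from the beginning
    return None              # no unique match: escalate to fuzzy matching
```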
Common phrases we skip
These parliamentary phrases appear in dozens or hundreds of statements per session; searching for them would waste time and never produce a unique match, so we skip them. (This is the same common-phrase list we reuse elsewhere in the pipeline.)
- "I yield myself such time as I may consume."
- "Thank you, Mr. Speaker." / "Thank you, Madam Speaker."
- "I thank the gentleman for yielding."
- "I reserve the balance of my time."
- "I yield back the balance of my time."
- "I yield back."
- "I yield the floor."
- "Without objection, so ordered."
Examples
Here are three scenarios showing how progressive shortening works in practice.
1. Standard match
We search for the target (from the Congressional Record) in the transcript (from speech-to-text). The example below is real—notice how the Record uses a colon and "there is" while the transcript uses a comma and "there's". These small editorial cleanups mean exact matching fails, so we progressively shorten from the end until we find a unique match.
2. Honorific handling
The Congressional Record almost always starts statements with an honorific ("Mr. Speaker," "Madam President," "Thank you, Mr. Chairman," etc.). But in the transcript, this honorific may appear as its own stand-alone sentence—if the speaker pauses after saying it, the speech-to-text provider interprets it as a separate sentence. And in extended floor debates, members sometimes skip the honorific entirely. The Record editors add it back—addressing the presiding officer is required by House and Senate rules. This means when we see "Mr. Speaker, [text]" in the Record, we're never sure if "Mr. Speaker" was actually spoken as part of that sentence, or if it was added by editors as a matter of protocol. To account for this, we search for both variants.
This seems sensible but introduces a subtle challenge when finding a target. Consider the statement, "Mr. Speaker, I rise today to honor our troops abroad." The speech-to-text provider might transcribe this as two separate sentences—"Mr. Speaker." followed by "I rise today to honor our troops abroad."—or as a single combined sentence, "Mr. Speaker, I rise today to honor our troops abroad." In the transcript excerpt below, both versions appear.
Our algorithm identifies timestamps based on uniqueness. If our target is "I rise today to honor our troops abroad." we'd find Match A as unique—but we should also consider Match B. If our target is "Mr. Speaker, I rise today to honor our troops abroad." we'd find Match B as unique—but we should also consider Match A. In reality, there's no unique match here, so this should go to the LLM. Without handling honorifics properly, we'd get false unique matches:
Our colleges and universities have taken hundreds of millions from foreign adversaries like China., start_ms: 1145000, end_ms: 1152000; They are selling access, influence, and sensitive research to the highest bidder., start_ms: 1152000, end_ms: 1158500; This is unacceptable., start_ms: 1158500, end_ms: 1160000; That is why the House passed the DETERRENT Act., start_ms: 1160000, end_ms: 1164000; America's intellectual capital should never be auctioned off to those who seek to undermine us., start_ms: 1164000, end_ms: 1171000; Freedom is worth defending economically, culturally, and morally., start_ms: 1171000, end_ms: 1176000; ...
MATCH A (separate):
Mr. Speaker., start_ms: 1205000, end_ms: 1207500; I rise today to honor our troops abroad., start_ms: 1213000, end_ms: 1218000; These men and women sacrifice everything for our freedom., start_ms: 1218000, end_ms: 1223500; History will not judge how cautious we were., start_ms: 1223500, end_ms: 1227000; It will judge how courageous we were., start_ms: 1227000, end_ms: 1230000; ...
MATCH B (combined):
Mr. Speaker, I rise today to honor our troops abroad., start_ms: 1456000, end_ms: 1463000; The fate of unborn millions depends on us., start_ms: 1463000, end_ms: 1468000; Let us act with conviction., start_ms: 1468000, end_ms: 1471000; Let us lead with faith., start_ms: 1471000, end_ms: 1474000; Let us govern with courage., start_ms: 1474000, end_ms: 1477000; Mr. Speaker, I yield back the balance of my time., start_ms: 1477000, end_ms: 1482000;
So for every target sentence, we automatically search both with and without the honorific. If either variant matches, we count both as potential matches when checking for uniqueness.
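A sketch of generating both variants (the honorific pattern shown is illustrative and narrower than what production handles):

```python
import re

# Leading honorific such as "Mr. Speaker," or "Madam President." (illustrative pattern)
HONORIFIC = re.compile(
    r"^(?:Thank you, )?(?:Mr\.|Madam) (?:Speaker|President|Chairman)[.,]\s*"
)

def honorific_variants(target: str) -> list[str]:
    """Return the target both with and without a leading honorific,
    so uniqueness checks consider both forms."""
    stripped = HONORIFIC.sub("", target, count=1)
    if stripped != target:
        return [target, stripped]
    return [target]
```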
This approach introduces a subtle edge case. Suppose the target is "Mr. Speaker, I rise today to honor our troops abroad." but only Match A exists in the transcript (where "Mr. Speaker." is separate). The exact match for the full sentence fails. We then try without the honorific: "I rise today to honor our troops abroad." This matches the second sentence in Match A at timestamp 1213000—but we've cut off "Mr. Speaker." which the member actually said at 1205000.
This isn't a big deal since we're most interested in the substantive portion of the speech. However, for completeness we apply a simple lookback: after finding a match, if the previous transcript sentence is just an honorific (e.g., "Mr. Speaker." or "Thank you, Madam President."), we extend the start timestamp to include it.
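The lookback itself can be a one-step check (a sketch; the pattern and helper names are illustrative):

```python
import re

# A transcript sentence that is *only* an honorific (illustrative pattern).
BARE_HONORIFIC = re.compile(
    r"^(?:Thank you, )?(?:Mr\.|Madam) (?:Speaker|President|Chairman)\.$"
)

def lookback_start(match_idx: int, sentences: list[str],
                   starts_ms: list[int]) -> int:
    """If the sentence just before the match is a stand-alone honorific,
    extend the clip to start there instead."""
    if match_idx > 0 and BARE_HONORIFIC.match(sentences[match_idx - 1].strip()):
        return starts_ms[match_idx - 1]
    return starts_ms[match_idx]
```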
3. Escalation to LLM-assisted fuzzy matching
When the Record text differs from what was spoken, progressive shortening exhausts all options and we fall back to LLM-assisted fuzzy matching (described in the next section).
I get myself slightly set up here as mercy to the poor staff here., start_ms: 40193488, end_ms: 40199648;
I'm not going to use the whole half an hour., start_ms: 40200078, end_ms: 40201678;
Progressive shortening requires an exact string match. Here, the transcript has "as mercy" instead of "as a mercy" and "I'm" instead of "I am"—small differences, but enough that no substring of the target matches exactly. When the algorithm exhausts all options without finding a unique match, we escalate to LLM-assisted fuzzy matching, where string distance metrics and an LLM can handle these variations.
Note: The demos above show finding the start timestamp using the first sentence. We find the end timestamp the same way, but using the last sentence.
2. LLM-Assisted Fuzzy Matching
When progressive shortening fails to find an exact match, we need a more flexible approach. There is natural drift between the transcript and the Record—sometimes small, sometimes significant.
Consider our earlier failure case: the Record says "as a mercy to the poor staff here, I am not going to use the whole half an hour" but the transcript has "I get myself slightly set up here as mercy to the poor staff here" followed by "I'm not going to use the whole half an hour." The speaker said something extra that didn't make it into the Record, and wording differs throughout—yet when viewed holistically, a human can immediately see these are the same statement.
An LLM can make this same holistic judgment. We use a small LLM (GPT-5-mini) to keep costs manageable, but small models are less capable—and as we found in our paper (referenced at the start of this article), sending an entire transcript to a small LLM is not a successful approach. Even with a small model, processing thousands of statements against full transcripts would be prohibitively expensive.
Fortunately, it's often unnecessary. By using fuzzy string matching to produce candidate snippets, we drastically narrow the search space. The small LLM only sees the relevant regions—typically 30–50 sentences instead of 3,000+—and this targeted approach is quite successful.
Phase 1: Finding candidates with RapidFuzz
Before matching, we apply light text cleaning to the target: normalizing whitespace and removing stray punctuation. This reduces false negatives from trivial formatting differences without changing the semantic content.
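A minimal sketch of this cleaning, assuming whitespace collapsing plus removal of stray quote and markup characters (the exact rules in production may differ):

```python
import re

def clean_target(text: str) -> str:
    """Normalize whitespace and drop stray punctuation that causes
    spurious mismatches, without touching the words themselves."""
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace runs
    text = re.sub(r"[\"\u201c\u201d\u2018\u2019*_]", "", text)  # stray quotes/markup
    return text
```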
We then use RapidFuzz, a fast string matching library, to find transcript sentences that are similar to our target. We run two complementary searches:
Sentence-level matching (ratio). We compare the target text against each transcript sentence individually using RapidFuzz's ratio scorer. This uses Indel distance—counting the minimum insertions and deletions needed to transform one string into another—normalized to a 0–100 similarity score. We keep the top 3 matches that score above 70%. This method works well for longer quotes that resemble complete sentences, but struggles with short phrases that may be similar to multiple sentences in a 3-hour transcript.
Substring alignment (partial_ratio_alignment). Record statements often span multiple transcript sentences. To catch these cases, we join the entire transcript into one long string and use partial_ratio_alignment to find the best substring match. Unlike ratio, which compares two strings in their entirety, partial_ratio_alignment slides the shorter string (the target) across the longer string (the joined transcript) and finds the contiguous region with the highest Indel-based similarity. It returns character offsets marking where in the transcript the best match starts and ends, which we map back to sentence indices. This method captures short or fragmentary quotes that sentence-level matching misses.
Why both? This dual approach was discovered empirically. During development, we found that using either method alone produced failure points: sentence-level matching failed on short phrases; alignment-based matching sometimes missed obvious single-sentence matches. By taking the union of both result sets—merging overlapping candidates—we get the best of both methods and cast a wider net over the transcript.
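The shape of this dual search, sketched with the standard library's difflib as a stand-in for RapidFuzz (SequenceMatcher's ratio is not identical to RapidFuzz's Indel-based scorers, but the structure is the same):

```python
from difflib import SequenceMatcher

def sentence_candidates(target: str, sentences: list[str],
                        cutoff: float = 70.0, top_k: int = 3) -> list[int]:
    """Sentence-level search: score the target against each transcript
    sentence; keep the top-k indices scoring above the cutoff."""
    scored = sorted(
        ((SequenceMatcher(None, target, s).ratio() * 100, i)
         for i, s in enumerate(sentences)),
        reverse=True,
    )
    return [i for score, i in scored[:top_k] if score >= cutoff]

def alignment_candidate(target: str, sentences: list[str]) -> int:
    """Substring-alignment search: find the best-matching region in the
    joined transcript, then map its character offset back to a sentence
    index (stand-in for partial_ratio_alignment)."""
    joined = " ".join(sentences)
    m = SequenceMatcher(None, target, joined).find_longest_match(
        0, len(target), 0, len(joined))
    offset, idx = 0, 0
    for i, s in enumerate(sentences):
        if m.b < offset + len(s):   # match starts inside sentence i
            idx = i
            break
        offset += len(s) + 1        # +1 for the joining space
    return idx
```

In production, the union of both result sets (with overlapping candidates merged) becomes the input to snippet building.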
Phase 2: Building snippets
Each candidate match points to a location in the transcript. But sending just that one sentence to the LLM wouldn't give it enough context. Instead, we build a "snippet" around each match—a window of surrounding sentences.
Dynamic snippet sizing. Short statements need small snippets; long statements need larger ones. We count how many sentences the target contains, add a buffer (25% of the sentence count, clamped between 3 and 8), and apply this combined offset to each side of the match center. A 4-sentence target gets an offset of 7 (4 + 3 buffer), yielding a snippet of roughly 15 sentences (7 before + match + 7 after). A 20-sentence target gets an offset of 25 (20 + 5 buffer), yielding roughly 51 sentences. This ensures the entire target statement is almost certainly included and gives the LLM enough context to confidently select the start and end timestamps.
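The sizing rule, as a sketch mirroring the numbers above (function name is illustrative):

```python
def snippet_bounds(match_idx: int, target_sentences: int,
                   transcript_len: int) -> tuple[int, int]:
    """Window around a candidate match: target sentence count plus a
    buffer of 25% of that count, clamped to [3, 8], applied to each side."""
    buffer = min(8, max(3, round(0.25 * target_sentences)))
    offset = target_sentences + buffer
    lo = max(0, match_idx - offset)
    hi = min(transcript_len, match_idx + offset + 1)  # half-open [lo, hi)
    return lo, hi
```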
Snippet merging. If two candidate matches are close together, their snippets might overlap. Rather than send redundant text to the LLM, we detect overlapping snippets and merge them into a single, larger snippet. This reduces token usage without losing information.
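Snippet merging is ordinary interval merging; a sketch:

```python
def merge_snippets(windows: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping (lo, hi) sentence windows so the LLM never sees
    the same transcript region twice."""
    merged: list[tuple[int, int]] = []
    for lo, hi in sorted(windows):
        if merged and lo <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of appending.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```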
Phase 3: LLM verification
Now we have one or more snippets that likely contain our target. We send them to GPT-5-mini along with the target text. The prompt asks the model to find the timestamp for the best semantic match—either the start or end boundary, depending on which we're currently searching for.
This is where the LLM earns its keep. It handles variations we cannot predict or enumerate: acronym differences ("Atlanta Fed" vs "the Atlanta Fed's GDPNow"), extra sentences the speaker said but the Record omits ("Let me get myself set up here"), contractions, minor rephrasing, and countless other corner cases we haven't encountered yet. We let the model make a holistic judgment.
The model returns JSON with the matched timestamp. If no reasonably close match exists in any snippet, it returns zero and we escalate to the next stage.
Why not embeddings? A natural question is why we don't use vector embeddings and cosine similarity. The answer is that embeddings capture semantic similarity—"I will support this bill" and "I am going to vote yes" would rank as similar because they express related intent. But we need lexical similarity: "I am going to vote yes" and "I'm gonna vote yes" are the same statement with transcription variations. String distance metrics correctly identify the second pair as a near-match while distinguishing the first pair as different text entirely. For near-duplicate matching with transcription noise, string distance is the right tool.
Why this works
The key insight is division of labor. RapidFuzz is fast but literal—it can't understand that "I am" and "I'm" mean the same thing. LLMs understand semantics but are slow and expensive. By using RapidFuzz to narrow the search space from 3,000+ sentences to maybe 30–50 sentences in a few snippets, we get the best of both: speed from the string matching, intelligence from the LLM, and reasonable cost.
In practice, this stage handles the majority of statements that progressive shortening misses. The combination of sentence-level and alignment-based matching rarely fails to surface the correct region, and GPT-5-mini is accurate enough to pick the right boundaries from well-targeted snippets.
3. Full Transcript Processing
When LLM-assisted fuzzy matching fails to find a match, we fall back to sending the full transcript to a large language model. This is slower and more expensive, but handles cases where the statement deviates significantly from the spoken text—or where RapidFuzz simply failed to surface the correct region.
In line with our research described above, we send the transcript in our text-first format (not JSON) with the target sentence placed before the transcript. This format reduces token count and gives the model clear context about what it's searching for.
We route requests based on transcript length: if the transcript is under 200,000 tokens, we send it to GPT-5, which has a 400,000 token context window. For longer transcripts—typically full-day Senate sessions—we use Gemini 3, which supports up to 1 million tokens. As with earlier stages, we run separate passes for the start and end timestamps to ensure accurate clip boundaries.
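The routing itself reduces to a length check (a sketch; the model identifiers are placeholders for whatever API names are current):

```python
def choose_model(transcript_tokens: int) -> str:
    """Route by transcript length: GPT-5 (400k context) for most sessions,
    Gemini 3 (up to 1M tokens) for very long ones, e.g. full-day Senate
    sessions. The 200k threshold is from our pipeline."""
    if transcript_tokens < 200_000:
        return "gpt-5"
    return "gemini-3"
```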
Guardrails and Fallbacks
At each stage we run a set of guardrails and sanity checks. If any check is triggered, we promote the boundary to the next phase for a more sophisticated matching technique. Some checks are universal (applied after every phase), while others are specific to a particular phase.
Universal Checks
After every phase produces a result, we validate it against three checks. Failing any one promotes the boundary to the next phase:
Zero timestamps (0,0). A return of (0,0) means "not found." Our research showed that every model we tested is excellent at returning (0,0) when passed a target sentence that does not appear in the transcript—95–100% accuracy. So (0,0) is a reliable signal that the match genuinely failed, not a model error.
Negative duration. If the end timestamp is earlier than the start timestamp, the result is nonsensical. This typically means one boundary matched in the wrong part of the transcript.
WPM sanity check. We estimate the expected clip duration from the statement's word count. We use a generous range—80 words per minute at the slow end (a deliberate, pausing speaker) to 220 words per minute at the fast end (rapid floor debate). If the clip duration implied by the start and end timestamps our algorithm found falls outside this range, the timestamps are likely wrong and the result is rejected.
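The negative-duration and WPM checks can be combined into one predicate; a sketch with the thresholds above:

```python
def plausible_duration(word_count: int, start_ms: int, end_ms: int,
                       slow_wpm: int = 80, fast_wpm: int = 220) -> bool:
    """Reject negative durations and clips whose implied speaking rate
    falls outside a generous 80-220 words-per-minute range."""
    duration_min = (end_ms - start_ms) / 60_000
    if duration_min <= 0:
        return False  # end before start: nonsensical result
    wpm = word_count / duration_min
    return slow_wpm <= wpm <= fast_wpm
```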
Manual review (email). If the final result still fails the WPM check, we send an email to our internal contact address with the statement ID, speaker, date, the bad timestamps, and the expected duration range so the clip can be corrected manually. This ensures we’re notified even when the pipeline runs automatically.