Engineering Deep Dive

Parsing the Congressional Record: Speaker Identification

Identifying speakers in the Congressional Record—and why a simple regex wasn't enough.

The Congressional Record is the official daily transcript of proceedings in the U.S. Congress, published continuously since 1873. Every day Congress is in session, the Government Publishing Office produces a PDF containing everything that happened on the floor—speeches, votes, procedural motions, and more.

The Record includes both spoken remarks and written statements that members submit for the record. It's edited for readability and flow, so while it's close to verbatim, nearly every statement has some differences between what was actually spoken and what appears in print. We download a bulk data package from congress.gov that contains HTML files broken up by page and topic—roughly 140 files per session day, or about 25,000 files per year.

Per year: 25K files, 21K statements, 12K topics

Goal

There are many projects that process the Congressional Record, but we want to do it better than it's been done before. We're building a website that's more usable, friendly, and accessible, with data exposed via MCP and API, with video and audio clips available, and—importantly—with all administrative boilerplate removed so users can focus on the substantive statements.

What We'll Cover

Today we'll cover speaker identification. This is a key part of our pipeline since every statement gets linked to the member who spoke it. The Congressional Record provides speaker cues in the HTML—but once you start trying to parse them, you find out it's much harder than you'd expect. In future posts, we'll cover topic extraction, administrative filtering, and media processing.

I handled the system design, high-level algorithmic design, and pipeline architecture. The implementation code was written with AI assistance.

The Basic Pattern

Here's what a typical Congressional Record file looks like:

          CREC-2025-03-03-pt1-PgS1445-7.htm — Complete file
        

<html>
<head>
  <title>Congressional Record, Volume 171 Issue 40 (Monday, March 3, 2025)</title>
</head>
<body><pre>
[Congressional Record Volume 171, Number 40 (Monday, March 3, 2025)]
[Senate]
[Page S1445]
From the Congressional Record Online through the Government Publishing Office
[www.gpo.gov]

                        Nomination of Linda McMahon

  Mr. GRASSLEY. Mr. President, the Senate will soon vote to confirm
Linda McMahon to be Secretary of Education. I know that some people feel
that the Secretary of Education should have extensive experience in a
school system; however, it is important to remember that education is
still mostly a State and local responsibility. There is no such thing
as a Federal superintendent of schools. The U.S. Department of Education
doesn't run any schools. Our many public school systems in this country
do not report to Washington, DC, and Washington also has no authority
over what is taught in our schools around the country.

  The job description of a Secretary of Education is to manage a
bureaucracy that runs a number of funding programs. By all accounts,
Linda McMahon did a great job running the Small Business Administration
in the last Trump administration. I have no reason to believe that she
cannot run the Department of Education. I think she understands the
difference between the State and Federal role in education. I also
expect that she understands the difference between the executive and
legislative branches when it comes to the serious policymaking of
education. Congress passes the laws and holds the power of the purse.
We have also the responsibility to make sure that laws are faithfully
executed by the executive branch of government. I expect Linda McMahon
to respond to all the congressional inquiries in a timely and responsive
manner.
</pre></body>
</html>
        

The filename tells you a lot: pt1 is the part number (most days have one part, occasionally two), Pg is page, and S is the chamber (Senate). House pages use H, and Extensions of Remarks (written House statements) use E. There's also D for Daily Digest, which we ignore. Page numbers start at 1 on the first day of each Congress and don't reset daily, so by year's end you're in the thousands. The -7 suffix indicates this is the 7th topic or section break on that page.
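To make the naming convention concrete, here is a minimal sketch of a filename parser. The regex is inferred only from the filenames shown in this post, and the names FILENAME_RE, RecordFile, and parse_record_filename are hypothetical, so treat it as an illustration rather than our production code. It treats the trailing suffix as optional because some filenames in this post have none.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: this pattern is inferred from the example filenames in this post.
FILENAME_RE = re.compile(
    r"^CREC-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"-pt(?P<part>\d+)"
    r"-Pg(?P<section>[HSED])(?P<page>\d+)"
    r"(?:-(?P<seq>\d+))?"
    r"\.htm$"
)


@dataclass(frozen=True)
class RecordFile:
    day: date
    part: int
    section: str          # "H", "S", "E", or "D"
    page: int
    seq: Optional[int]    # None when the filename has no suffix


def parse_record_filename(name: str) -> Optional[RecordFile]:
    """Return the parsed pieces of a filename, or None if it doesn't fit."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return RecordFile(
        day=date(int(m["year"]), int(m["month"]), int(m["day"])),
        part=int(m["part"]),
        section=m["section"],
        page=int(m["page"]),
        seq=int(m["seq"]) if m["seq"] is not None else None,
    )


print(parse_record_filename("CREC-2025-03-03-pt1-PgS1445-7.htm"))
```

Returning None for anything that doesn't fit keeps unexpected filenames visible instead of silently mis-parsing them.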

One thing you'll notice immediately in any Congressional Record file: members constantly say "Mr. Speaker" or "Mr. President." This isn't repetitive formality; it's a procedural requirement. In the House, all remarks are addressed to the Speaker (or Speaker pro tempore). In the Senate, remarks are addressed to the presiding officer, formally "Mr. President" or "Madam President." Members don't speak directly to each other; they speak through the chair. So every speech, every interjection, every yield starts by addressing the presiding officer.

Each time a member speaks, the Record follows a consistent format: Mr. LASTNAME. or Ms. LASTNAME. followed by the speech text.

Speaker Detection

Ms. FOXX. Mr. Speaker, I rise to recognize the 100th anniversary of the Leakesville Garden Club in Rockingham County...

Mr. GRASSLEY. Mr. President, the Senate will soon vote to confirm Linda McMahon to be Secretary of Education...

A simple regex handles the common case:

          pattern.py
        

import re


def classify_speaker_label(line):
    HONORIFICS = ["Mr.", "Ms.", "Mrs.", "Miss."]
    NAME_RE = r"[A-Z][A-Z]+"

    SPEAKER_RE = re.compile(
        r'^(?P<honorific>' + '|'.join(HONORIFICS) + r')\s+'
        r'(?P<surname>' + NAME_RE + r')\.'
    )

    match = SPEAKER_RE.match(line)
    if not match:
        return {"matched": False}

    return {
        "matched": True,
        "speaker_name": match.group("surname")
    }
        

The core mechanic: we scan the record line-by-line, and only lines that start with a speaker label open a new speaker. Everything else gets appended to the current statement.
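The scan described above can be sketched in a few lines. This is a simplified illustration, not our parser: it uses only the basic label pattern, strips indentation before matching (an assumption on our part), and drops the label from the stored text. LABEL_RE and scan are hypothetical names.

```python
import re

# Basic label pattern only; the corner cases covered later are left out.
LABEL_RE = re.compile(r"^(?:Mr|Ms|Mrs|Miss)\.\s+(?P<surname>[A-Z][A-Z]+)\.\s*")


def scan(lines):
    """Group lines into (speaker, text) pairs; a label line opens a new pair."""
    statements = []
    speaker, parts = None, []
    for raw in lines:
        line = raw.strip()  # assumption: indentation is not significant here
        if not line:
            continue
        m = LABEL_RE.match(line)
        if m:
            # A new label closes the statement in progress, if any.
            if speaker is not None:
                statements.append((speaker, " ".join(parts)))
            speaker, parts = m["surname"], [line[m.end():]]
        elif speaker is not None:
            parts.append(line)
    if speaker is not None:
        statements.append((speaker, " ".join(parts)))
    return statements


demo = [
    "Mr. GRASSLEY. Mr. President, I rise today to discuss transparency in government.",
    "For decades, I have fought to ensure that taxpayers",
    "know how their money is being spent.",
    "Ms. WARREN. Mr. President, I thank the Senator from Iowa for his leadership.",
    "Protecting whistleblowers is not a partisan issue.",
]
for speaker, text in scan(demo):
    print(speaker, "->", text[:40])
```

Note that "Mr. President" never opens a statement here: the surname part of the pattern requires all uppercase letters.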

Line-by-Line Scan

Mr. GRASSLEY. Mr. President, I rise today to discuss transparency in government.

For decades, I have fought to ensure that taxpayers

know how their money is being spent.

Whistleblowers are essential to rooting out waste and fraud.

They protect taxpayers and strengthen accountability.

Ms. WARREN. Mr. President, I thank the Senator from Iowa for his leadership.

Protecting whistleblowers is not a partisan issue.

It is fundamental to good government.

That seems simple enough—until you hit the corner cases.

Corner Cases

A combination of special name formatting, disambiguation rules, and contextual cues makes the problem surprisingly complex. Many of these cases we had to discover ourselves through trial and error—they aren't documented anywhere. We'll walk through each case in order of increasing complexity, showing how the regex evolves at each step.

1. Disambiguating members with the same last name

More than one Representative named SCOTT serves in the House, representing Georgia and Virginia. When disambiguation is needed, the Record includes the state. Our regex handles an optional "of STATE" suffix:

          Adding state to the pattern
        

import re


def classify_speaker_label(line):
    HONORIFICS = ["Mr.", "Ms.", "Mrs.", "Miss."]
    NAME_RE = r"[A-Z][A-Z]+"

    # NEW: allow optional "of STATE" disambiguation
    SPEAKER_RE = re.compile(
        r'^(?P<honorific>' + '|'.join(HONORIFICS) + r')\s+'
        r'(?P<surname>' + NAME_RE + r')'
        r'(?:\s+of\s+(?P<state>[A-Za-z\s]+))?'
        r'\.'
    )

    match = SPEAKER_RE.match(line)
    if not match:
        return {"matched": False}

    return {
        "matched": True,
        "speaker_name": match.group("surname"),
        "state_disambiguation": match.group("state")
    }
        

State Disambiguation

Mr. SCOTT of Virginia. Mr. Speaker, as we stand here today, workers' rights are under attack...

Mr. SCOTT of Georgia. Mr. Speaker, by direction of the Committee on Armed Services...

2. When state isn't enough: full name disambiguation

What if two members from the same state share the same last name? This is rare, but it happens. We discovered through trial and error that in these cases, the Record uses the member's full name. The regex handles an optional first name before the last name:

          Adding optional first name
        

import re


def classify_speaker_label(line):
    HONORIFICS = ["Mr.", "Ms.", "Mrs.", "Miss."]

    # NEW: allow optional first name before last name
    NAME_RE = r'(?:[A-Z]+\s+)?[A-Z][A-Z]+'

    SPEAKER_RE = re.compile(
        r'^(?P<honorific>' + '|'.join(HONORIFICS) + r')\s+'
        r'(?P<name>' + NAME_RE + r')'
        r'(?:\s+of\s+(?P<state>[A-Za-z\s]+))?'
        r'\.'
    )

    match = SPEAKER_RE.match(line)
    if not match:
        return {"matched": False}

    return {
        "matched": True,
        "speaker_name": match.group("name"),
        "state_disambiguation": match.group("state")
    }
        

Full Name Disambiguation

Mr. AUSTIN SCOTT of Georgia. Mr. Speaker, by direction of the Committee on Agriculture, I call up the bill...

3. Mixed-case name prefixes

Most names are fully uppercase: SMITH, JONES, JOHNSON. But some surnames have mixed case due to prefixes. A simple [A-Z]+ won't match these. We built the prefix list by examining every recorded member of Congress and compiling the prefixes into a shared config:

          speaker_patterns.json
        

import re


# NEW: helper to build name pattern with prefix exceptions
def build_name_pattern(prefix_exceptions):
    PREFIX_RE = '(?:' + '|'.join(prefix_exceptions) + ')?'
    return r'(?:[A-Z]+\s+)?' + PREFIX_RE + r'[A-Z][A-Z]+'


def classify_speaker_label(line):
    HONORIFICS = ["Mr.", "Ms.", "Mrs.", "Miss."]
    PREFIX_EXCEPTIONS = ["Mc", "Mac", "Van", "La", "Le", "De", "Del", "St"]

    NAME_RE = build_name_pattern(PREFIX_EXCEPTIONS)

    SPEAKER_RE = re.compile(
        r'^(?P<honorific>' + '|'.join(HONORIFICS) + r')\s+'
        r'(?P<name>' + NAME_RE + r')'
        r'(?:\s+of\s+(?P<state>[A-Za-z\s]+))?'
        r'\.'
    )

    match = SPEAKER_RE.match(line)
    if not match:
        return {"matched": False}

    return {
        "matched": True,
        "speaker_name": match.group("name"),
        "state_disambiguation": match.group("state")
    }
        

Name Prefix Exceptions

Mr. McGOVERN. Mr. Speaker, I yield myself such time as I may consume...

Ms. DeGETTE. Mr. Speaker, I rise in opposition to this legislation...

4. Apostrophe prefixes

Names with apostrophes like O'Brien or O'Connell require special handling: the apostrophe is not in the [A-Z] character class, so a plain uppercase pattern fails on them.

We solve this two ways: the prefix list explicitly includes O', and the speaker regex allows apostrophes inside the surname (so O'BRIEN and O'CONNELL still match).

          Apostrophe handling
        

import re


def build_name_pattern(prefix_exceptions):
    PREFIX_RE = '(?:' + '|'.join(prefix_exceptions) + ')?'
    # NEW: allow apostrophes in name pattern
    return r'(?:[A-Z]+\s+)?' + PREFIX_RE + r"[A-Z][A-Z']+"


def classify_speaker_label(line):
    HONORIFICS = ["Mr.", "Ms.", "Mrs.", "Miss."]
    # NEW: add O' prefix
    PREFIX_EXCEPTIONS = ["Mc", "Mac", "O'", "Van", "La", "Le", "De", "Del", "St"]

    NAME_RE = build_name_pattern(PREFIX_EXCEPTIONS)

    SPEAKER_RE = re.compile(
        r'^(?P<honorific>' + '|'.join(HONORIFICS) + r')\s+'
        r'(?P<name>' + NAME_RE + r')'
        r'(?:\s+of\s+(?P<state>[A-Za-z\s]+))?'
        r'\.'
    )

    match = SPEAKER_RE.match(line)
    if not match:
        return {"matched": False}

    return {
        "matched": True,
        "speaker_name": match.group("name"),
        "state_disambiguation": match.group("state")
    }
        

Apostrophe Names

Ms. O'BRIEN. Mr. Speaker, I rise today to speak on behalf of my constituents...

Mr. O'CONNELL. Madam President, I want to thank my colleague from New York...

5. Multi-word prefixes

Some prefixes are multiple words, which adds another layer of complexity. We handle these in the config alongside single-word prefixes, and in the same step we allow hyphens inside surnames:

          Multi-word prefixes in config
        

import re


def build_name_pattern(single_prefixes, multi_prefixes):
    SINGLE_RE = '(?:' + '|'.join(single_prefixes) + ')?'
    # NEW: handle multi-word prefixes separately; the inner group makes the
    # trailing \s+ apply to every alternative, not just the last one
    MULTI_RE = '(?:(?:' + '|'.join(multi_prefixes) + r')\s+)?'
    # NEW: allow hyphens in surnames (hyphen goes last in the class, so it is literal)
    return r'(?:[A-Z]+\s+)?' + MULTI_RE + SINGLE_RE + r"[A-Z][A-Z'-]+"


def classify_speaker_label(line):
    HONORIFICS = ["Mr.", "Ms.", "Mrs.", "Miss."]
    PREFIX_SINGLE = ["Mc", "Mac", "O'", "Van", "La", "Le", "De", "Del", "St"]
    PREFIX_MULTI = ["De La", "De Los", "Van der", "Van den"]

    NAME_RE = build_name_pattern(PREFIX_SINGLE, PREFIX_MULTI)

    SPEAKER_RE = re.compile(
        r'^(?P<honorific>' + '|'.join(HONORIFICS) + r')\s+'
        r'(?P<name>' + NAME_RE + r')'
        r'(?:\s+of\s+(?P<state>[A-Za-z\s]+))?'
        r'\.'
    )

    match = SPEAKER_RE.match(line)
    if not match:
        return {"matched": False}

    return {
        "matched": True,
        "speaker_name": match.group("name"),
        "state_disambiguation": match.group("state")
    }
        

The regex dynamically builds alternation patterns from this config, handling both single-word and multi-word prefixes:

Multi-Word Prefixes

Mr. De La CRUZ of Texas. Mr. Speaker, I rise today to honor...

Ms. Van der BERG. Madam President, I rise in support of this measure...

6. Nickname matching from member records

Sometimes the Record uses a member's nickname instead of their formal first name. We only treat this as a valid speaker name if the nickname exists in our member database (from Congress.gov). We don't guess nicknames—we extract them from the member's official name when it includes parentheses or quotes.

This happens after the speaker-line regex fires. Once a speaker line is detected, we pass the extracted name to the matcher, which resolves full names and nicknames against the member records.

          Nickname extraction and match
        

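import re
from collections import defaultdict


def build_name_pattern(single_prefixes, multi_prefixes):
    SINGLE_RE = '(?:' + '|'.join(single_prefixes) + ')?'
    # Inner group so the trailing \s+ applies to every multi-word prefix
    MULTI_RE = '(?:(?:' + '|'.join(multi_prefixes) + r')\s+)?'
    return r'(?:[A-Z]+\s+)?' + MULTI_RE + SINGLE_RE + r"[A-Z][A-Z'-]+"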


def classify_speaker_label(line):
    HONORIFICS = ["Mr.", "Ms.", "Mrs.", "Miss."]
    PREFIX_SINGLE = ["Mc", "Mac", "O'", "Van", "La", "Le", "De", "Del", "St"]
    PREFIX_MULTI = ["De La", "De Los", "Van der", "Van den"]

    NAME_RE = build_name_pattern(PREFIX_SINGLE, PREFIX_MULTI)

    SPEAKER_RE = re.compile(
        r'^(?P<honorific>' + '|'.join(HONORIFICS) + r')\s+'
        r'(?P<name>' + NAME_RE + r')'
        r'(?:\s+of\s+(?P<state>[A-Za-z\s]+))?'
        r'\.'
    )

    match = SPEAKER_RE.match(line)
    if not match:
        return {"matched": False}

    return {
        "matched": True,
        "speaker_name": match.group("name"),
        "state_disambiguation": match.group("state")
    }


# NEW: extract nicknames from member database records
# DB name formats: 'Robert C. "Bobby" Scott', 'Claude (Buddy), Jr. Leach'
def extract_nicknames(official_name):
    paren_matches = re.findall(r'\(([^)]+)\)', official_name)
    quote_matches = re.findall(r'"([^"]+)"', official_name)
    return paren_matches + quote_matches


def build_nickname_index(members):
    index = defaultdict(list)
    for member in members:
        nicknames = extract_nicknames(member.official_name)
        for nickname in nicknames:
            key = f"{nickname.upper()} {member.last_name.upper()}"
            index[key].append(member)
    return index


# Build index once at startup
nickname_index = build_nickname_index(all_members)


def resolve_speaker(extracted_name, chamber):
    # Try direct match first
    if match := find_member_by_name(extracted_name, chamber):
        return match

    # Try nickname match
    key = extracted_name.upper()
    if matches := nickname_index.get(key):
        return matches[0]

    return None
        

Nickname Matching

Congressional Record

Mr. BOBBY SCOTT. Mr. Speaker, I rise...

↓

Extract name

"BOBBY SCOTT"

↓

Nickname index lookup

"BOBBY SCOTT" ?

↓

Member record match

Robert C. "Bobby" Scott (D-VA)

7. Punctuation errors

The standard pattern expects a period after the speaker name. We discovered that occasionally a comma appears instead—likely a transcription typo. Without handling it, the comma-speaker line doesn't trigger a new speaker. The result: the speaker marker and their statement get attributed to the previous speaker.

We fix this in a post-processing pass that runs after the initial parse. We scan each statement's text looking for patterns like Mr. WALBERG, Mr. Speaker.... To avoid false positives (like vote lists), we validate what comes after the comma—it must be a presiding officer address or a common speech opener:

          parse_congrec.py — Post-processing comma fix
        

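import re
from collections import defaultdict


def build_name_pattern(single_prefixes, multi_prefixes):
    SINGLE_RE = '(?:' + '|'.join(single_prefixes) + ')?'
    # Inner group so the trailing \s+ applies to every multi-word prefix
    MULTI_RE = '(?:(?:' + '|'.join(multi_prefixes) + r')\s+)?'
    return r'(?:[A-Z]+\s+)?' + MULTI_RE + SINGLE_RE + r"[A-Z][A-Z'-]+"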


def classify_speaker_label(line):
    HONORIFICS = ["Mr.", "Ms.", "Mrs.", "Miss."]
    PREFIX_SINGLE = ["Mc", "Mac", "O'", "Van", "La", "Le", "De", "Del", "St"]
    PREFIX_MULTI = ["De La", "De Los", "Van der", "Van den"]

    NAME_RE = build_name_pattern(PREFIX_SINGLE, PREFIX_MULTI)

    SPEAKER_RE = re.compile(
        r'^(?P<honorific>' + '|'.join(HONORIFICS) + r')\s+'
        r'(?P<name>' + NAME_RE + r')'
        r'(?:\s+of\s+(?P<state>[A-Za-z\s]+))?'
        r'\.'
    )

    match = SPEAKER_RE.match(line)
    if not match:
        return {"matched": False}

    return {
        "matched": True,
        "speaker_name": match.group("name"),
        "state_disambiguation": match.group("state")
    }


def extract_nicknames(official_name):
    paren_matches = re.findall(r'\(([^)]+)\)', official_name)
    quote_matches = re.findall(r'"([^"]+)"', official_name)
    return paren_matches + quote_matches


def build_nickname_index(members):
    index = defaultdict(list)
    for member in members:
        nicknames = extract_nicknames(member.official_name)
        for nickname in nicknames:
            key = f"{nickname.upper()} {member.last_name.upper()}"
            index[key].append(member)
    return index


nickname_index = build_nickname_index(all_members)


def resolve_speaker(extracted_name, chamber):
    if match := find_member_by_name(extracted_name, chamber):
        return match

    key = extracted_name.upper()
    if matches := nickname_index.get(key):
        return matches[0]

    return None


def parse_congressional_record(lines):
    statements = []
    current = None

    # Phase 1: line-by-line parsing
    for line in lines:
        result = classify_speaker_label(line)
        if result["matched"]:
            if current:
                statements.append(current)
            current = Statement(speaker=result["speaker_name"], text=line)
        else:
            if current:
                current.text += " " + line

    if current:
        statements.append(current)

    # Phase 2: post-processing fixes
    # Comma typos cause missed speakers — the buried "Mr. NAME," pattern
    # ends up inside the previous statement's text. We scan and fix.
    statements = fix_comma_speaker_typos(statements)

    return statements


def fix_comma_speaker_typos(statements):
    # No ^ anchor: the buried label sits inside the previous statement's text
    COMMA_SPEAKER_RE = re.compile(r'\b(Mr|Ms|Mrs|Miss)\.\s+([A-Z][A-Z\'\-]+),')
    PRESIDING_OFFICER_RE = re.compile(
        r'^(Mr\.|Madam|Mrs\.)\s+(Speaker|President|Chair|Clerk)', re.I
    )
    VALID_SPEECH_STARTS = ['i rise', 'i yield', 'i ask', 'i thank', 'i move']

    for statement in statements:
        # Check every candidate, since vote lists can contain "Mr. NAME," too
        for match in COMMA_SPEAKER_RE.finditer(statement.text):
            after_comma = statement.text[match.end():].strip().lower()

            # Validate: must be presiding officer address or speech opener
            if PRESIDING_OFFICER_RE.match(after_comma) or any(
                after_comma.startswith(opener) for opener in VALID_SPEECH_STARTS
            ):
                # split_and_reattribute is a helper not shown in this post
                split_and_reattribute(statement, match)
                break

    return statements
        

Mr. WALBERG, Mr. Speaker, I rise today...

Initial Parse: attributed to previous speaker

After Comma Fix: attributed to Mr. WALBERG

8. Avoiding false matches in titles

Topic titles sometimes contain member names. We don't want to treat those as speaker lines:

          Title with member name (not a speaker)
        
                      TRIBUTE TO MR. CARTER, A LOCAL VETERAN

  Mr. SMITH. Mr. Speaker, I rise to honor my colleague...

We run a title detection phase before parsing. This identifies topic headers based on formatting patterns (all-caps lines, title case between blank lines, etc.). During the main parse loop, any line that's part of a detected title is skipped before we check for speaker patterns—so names in titles never get matched as speakers.
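A rough sketch of what such a title pass might look like. The heuristics here (a line standing alone between blank lines that is either all caps or title case without a trailing period) are our simplification for illustration, and SMALL_WORDS, looks_like_title, and find_title_lines are hypothetical names, not the rules our pipeline actually ships.

```python
SMALL_WORDS = {"a", "an", "and", "as", "at", "by", "for", "in", "of", "on", "or", "the", "to"}


def looks_like_title(text):
    """Heuristic: all caps, or title case with no sentence-ending period."""
    words = text.split()
    if not words or text.endswith("."):
        return False
    letters = [c for c in text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return True
    # Title case: every alphabetic word is capitalized or is a small word.
    return all(
        w[0].isupper() or w.lower() in SMALL_WORDS
        for w in words
        if w[0].isalpha()
    )


def find_title_lines(lines):
    """Return indexes of lines that sit between blank lines and look like titles."""
    titles = set()
    for i, raw in enumerate(lines):
        text = raw.strip()
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        if text and prev_blank and next_blank and looks_like_title(text):
            titles.add(i)
    return titles


page = [
    "",
    "                TRIBUTE TO MR. CARTER, A LOCAL VETERAN",
    "",
    "  Mr. SMITH. Mr. Speaker, I rise to honor my colleague...",
    "",
    "                   Nomination of Linda McMahon",
    "",
    "  Mr. GRASSLEY. Mr. President, the Senate will soon vote to confirm",
    "Linda McMahon to be Secretary of Education.",
]
print(sorted(find_title_lines(page)))
```

The main parse loop can then skip any index in the returned set before it checks for speaker labels.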

9. Permission intros are not speaker lines

House floor speeches often begin with a permission parenthetical that contains the speaker's name and sometimes their state:

          CREC-2025-03-03-pt1-PgH927-7.htm — Permission parenthetical
        

                              RECOGNIZING READ ACROSS AMERICA

  (Mr. THOMPSON of Pennsylvania asked and was given permission to address
  the House for 1 minute and to revise and extend his remarks.)

  Mr. THOMPSON of Pennsylvania. Mr. Speaker, I rise to recognize
Read Across America Week. On Sunday, we kicked off Read Across America
Day, marking the start of National Reading Month. Each March, teachers,
students, parents, librarians, and more all acknowledge the importance
of literacy and reading for all Americans...
        

We do not extract state or speaker info from these lines. The speaker regex is anchored to the start of the line and expects an honorific there, so a line that opens with a parenthesis never matches and never opens a new speaker. The actual speech begins at the real speaker line, where we pick up the state if it's present.

10. State mentions in speech body are not speaker lines

Members often refer to colleagues by state in the body of their speech—thanking them, yielding time, or acknowledging their work. These mentions can look like speaker cues, but they're just part of the current speaker's text.

          CREC-2025-03-03-pt1-PgH930.htm — State mentioned in speech
        

  Ms. MACE. Mr. Speaker, I thank my friend and the distinguished
chairman of the Committee on Oversight and Government Reform,
the gentleman from Kentucky (Mr. Comer), for yielding. I thank
both the chairman and the ranking member, Mr. Connolly, for their
leadership...
        

Here Ms. MACE mentions "the gentleman from Kentucky (Mr. Comer)" in the middle of her speech. This does not open a new speaker because the speaker regex is anchored to the start of a line and requires an honorific + surname + period (e.g., Mr. COMER.), which the mid-sentence mention doesn't match.
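Both negative cases (permission parentheticals and mid-speech mentions) are easy to check with a simplified version of the label pattern. The regex below is a trimmed stand-in for the cumulative pattern, the helper is_speaker_line is a hypothetical name, and stripping indentation before matching is an assumption.

```python
import re

# Simplified label: honorific + uppercase surname + optional "of State" + period.
SPEAKER_RE = re.compile(
    r"^(?:Mr|Ms|Mrs|Miss)\.\s+[A-Z][A-Z'\-]+(?:\s+of\s+[A-Za-z ]+)?\."
)


def is_speaker_line(line):
    """True when the line starts with something shaped like a speaker label."""
    return SPEAKER_RE.match(line.strip()) is not None


samples = [
    # Permission parenthetical: starts with "(", so the anchor rejects it.
    "(Mr. THOMPSON of Pennsylvania asked and was given permission to address",
    # Mid-speech mention: the honorific is not at the start of the line.
    "the gentleman from Kentucky (Mr. Comer), for yielding. I thank",
    # Constructed case: honorific at line start, but mixed-case name and a comma.
    "Mr. Comer, for yielding.",
    # Real speaker lines.
    "Mr. THOMPSON of Pennsylvania. Mr. Speaker, I rise to recognize",
    "Ms. MACE. Mr. Speaker, I thank my friend and the distinguished",
]
for line in samples:
    print(is_speaker_line(line), "|", line)
```

The third sample is constructed to show that the uppercase requirement does real work: even if a wrapped line happened to begin with "Mr. Comer", it would not be mistaken for a label.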

The Matching Hierarchy

After all this regex work, we have an extracted name—but we still need to match it to a real person. Our member database (from Congress.gov) includes all publicly listed members, so we filter by congress number. Members sometimes switch chambers (Jim Banks moved from the House in the 118th to the Senate in the 119th), so we verify chamber too.

It helps to separate the process into two phases: extraction (the regex pulls out a name and optional state) and resolution (we choose the best member given that extracted evidence). The resolution order is precision-first: we try the most specific, information-rich matches before the ambiguous fallbacks.

Matching hierarchy with corner-case references:

Extraction vs. Resolution

Extraction pulls a name (and optional state) from the speaker line. Resolution then decides which member that evidence points to.

Why Order Matters: Precision First

Resolution tries rules from most specific to least specific. If we tried "last name only" first, we'd grab the first match and never consult the more precise evidence we already extracted.

Normalize accented characters: pre-step applied to all matches (global, not a specific corner case)
Detect speaker line: driven by the cumulative regex plus post-processing (Corner cases 1–5 and 7)
Exact full name match: uses optional first names (Corner case 2)
Nickname matching: resolves names using member nicknames (Corner case 6)
Last name + state: uses extracted of STATE (Corner case 1)
Last name only: baseline fallback when unique in chamber (no specific corner case)
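The resolution order above can be sketched as an ordered list of rules, where the first rule that yields exactly one member wins. Everything here is illustrative: the Member shape, the normalize and resolve helpers, and the sample data are assumptions, and the sketch assumes states are stored as full names.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Member:  # hypothetical record shape
    first_name: str
    last_name: str
    state: str
    chamber: str  # "House" or "Senate"
    nicknames: List[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Strip accents and uppercase so accented and plain spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()


def resolve(name: str, state: Optional[str], chamber: str,
            members: List[Member]) -> Optional[Member]:
    pool = [m for m in members if m.chamber == chamber]
    target = normalize(name)
    rules = [
        # 1. Exact full name
        lambda m: normalize(f"{m.first_name} {m.last_name}") == target,
        # 2. Nickname + last name
        lambda m: any(normalize(f"{n} {m.last_name}") == target for n in m.nicknames),
        # 3. Last name + state
        lambda m: state is not None
        and normalize(m.last_name) == target
        and normalize(m.state) == normalize(state),
        # 4. Last name only (must be unique in the chamber)
        lambda m: normalize(m.last_name) == target,
    ]
    for rule in rules:
        hits = [m for m in pool if rule(m)]
        if len(hits) == 1:
            return hits[0]
    return None  # ambiguous or unknown: leave for manual review


members = [
    Member("Austin", "Scott", "Georgia", "House"),
    Member("David", "Scott", "Georgia", "House"),
    Member("Robert", "Scott", "Virginia", "House", ["Bobby"]),
    Member("Nancy", "Mace", "South Carolina", "House"),
]
print(resolve("SCOTT", "Virginia", "House", members))
```

Returning None when no rule produces a unique hit is a deliberate choice in this sketch: a bare "SCOTT" with no state stays unresolved instead of grabbing the first match.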

Examples: Full Flow

Let's see how this works with examples. Each one shows extraction (name/state) and then the first rule that resolves the member.

Example: Full Name

When the Record provides a first + last name, the full name match resolves immediately.

Full Name Flow

Congressional Record

Mr. AUSTIN SCOTT of Georgia. Mr. Speaker, by direction of the Committee on Agriculture...

↓

Extraction

name = "AUSTIN SCOTT", state = "Georgia"

↓

Resolution: Try full name match

"AUSTIN SCOTT" → Rep. Austin Scott (R-GA) ✓

Example: Nickname

Nicknames resolve only when the nickname is in the official member records.

Nickname Flow

Congressional Record

Mr. BOBBY SCOTT. Mr. Speaker, I rise in opposition...

↓

Extraction

name = "BOBBY SCOTT"

↓

Resolution: Try full name match

No member named "BOBBY SCOTT" → skip

↓

Resolution: Try nickname match

Nickname index: "BOBBY SCOTT" → Robert C. "Bobby" Scott (D-VA) ✓

Example: Last Name + State

The regex extracts last name and state together in the same pass; resolution uses the state to disambiguate.

State Disambiguation Flow

Congressional Record

Mr. THOMPSON of Mississippi. Mr. Speaker, as we stand here today...

↓

Extraction

name = "THOMPSON", state = "Mississippi"

↓

Resolution: Try full name match

No first name provided → skip

↓

Resolution: Try nickname match

"THOMPSON" not a nickname → skip

↓

Resolution: Try last name + state

THOMPSON + Mississippi → Rep. Thompson (MS) ✓

What happens without state? We risk matching the wrong THOMPSON or failing loudly and needing a manual correction.

Example: Last Name Only

When the name is unique in a chamber, last-name-only matching resolves cleanly.

Last Name Only Flow

Congressional Record

Ms. MACE. Mr. Speaker, I rise today...

↓

Extraction

name = "MACE"

↓

Resolution

Last name only → Rep. Nancy Mace (SC) ✓

Speaker Continuity Across Files

Here's something that surprised us during development: the Congressional Record is split across more than a hundred separate files per day. Each topic or section gets its own file. But a single speech might span multiple files, and the speaker's name only appears at the beginning of their speech, not at the start of each file.

We discovered this when we noticed that some files start with speech text but no speaker line. Look at these consecutive files from March 3, 2025:

          CREC-2025-03-03-pt1-PgS1445-7.htm
        

                                  Nomination of Linda McMahon

  Mr. GRASSLEY. Mr. President, the Senate will soon vote to confirm
Linda McMahon to be Secretary of Education. I know that some people
feel that the Secretary of Education should have extensive experience
in a school system...
        

          CREC-2025-03-03-pt1-PgS1445-8.htm — No speaker line!
        
                                    World Hearing Day

  Mr. President, on another subject, this day, March 3, is World
Hearing Day. Approximately 38 million Americans report hearing loss...

File -8 has no speaker line at all—it just starts with "Mr. President, on another subject..." This is still Senator Grassley speaking; he's simply transitioning to a new topic mid-speech. Without tracking speaker continuity across files, we'd lose attribution for this entire section.

The Solution: Carryover State

Our solution is to maintain a "carryover state" that tracks the last known speaker as we process files sequentially. The key insight: files are named with page numbers and suffixes that indicate ordering (e.g., S1445-7 comes before S1445-8). When we encounter a file without a speaker line, we check if it's sequential with the previous file. If so, we inherit the speaker.

But we have to be careful—not every consecutive file is a continuation. We validate that the sequence is truly contiguous (same page series, sequential numbering) before applying carryover. A jump from S1445-8 to S1447 would break the chain.
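A minimal sketch of that contiguity check, under assumptions of our own: file keys look like "S1445-8", and the check is deliberately conservative, allowing carryover only to the very next suffix on the same page. How page rollovers are handled isn't covered here, so this sketch simply refuses them. KEY_RE, parse_key, is_contiguous, and speaker_for are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Assumption: keys are section letter + page + optional "-suffix", e.g. "S1445-8".
KEY_RE = re.compile(r"^(?P<section>[HSE])(?P<page>\d+)(?:-(?P<seq>\d+))?$")


def parse_key(key: str) -> Optional[Tuple[str, int, Optional[int]]]:
    m = KEY_RE.match(key)
    if m is None:
        return None
    seq = int(m["seq"]) if m["seq"] is not None else None
    return m["section"], int(m["page"]), seq


def is_contiguous(prev_key: str, next_key: str) -> bool:
    """True only when next_key is the very next suffix on the same page."""
    prev, nxt = parse_key(prev_key), parse_key(next_key)
    if prev is None or nxt is None:
        return False
    (p_sec, p_page, p_seq), (n_sec, n_page, n_seq) = prev, nxt
    if p_sec != n_sec or p_page != n_page:
        return False
    if p_seq is None or n_seq is None:
        return False
    return n_seq == p_seq + 1


def speaker_for(file_key, own_speaker, prev_key, prev_speaker):
    """Prefer the file's own speaker line; otherwise inherit only if contiguous."""
    if own_speaker is not None:
        return own_speaker
    if prev_key is not None and is_contiguous(prev_key, file_key):
        return prev_speaker
    return None


print(speaker_for("S1445-8", None, "S1445-7", "GRASSLEY"))
print(speaker_for("S1447", None, "S1445-8", "GRASSLEY"))
```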

The carryover logic looks something like this:

          parse_congrec.py
        

# Carryover logic for multi-file speeches
continuing_speaker = last_statement_speaker or carryover_speaker

if continuing_speaker:
    open_speech_with_label(
        continuing_speaker,
        continuing_state,
        is_continuation=True
    )
        

Here's what that looks like in practice:

Cross-File Speaker Tracking

S1445-7.htm: Mr. GRASSLEY (explicitly detected)

S1445-8.htm: Mr. GRASSLEY (inherited from previous file)

S1445-9.htm: inherits the same way if it is contiguous and has no speaker line of its own

Testing an LLM Fallback

We initially hypothesized that some multi-speaker cases might slip through our regex-based parser. To test this, we added an LLM safety net: every statement was run through GPT-5 Nano using the OpenAI Batch API. The prompt was detailed—it included examples of what counts as multiple speakers (two distinct "Mr./Ms. NAME." patterns) and what doesn't (quoted material, yielding time without the second person actually speaking).

After processing all of 2025, about 1.5% of statements were flagged as potentially containing multiple speakers. We then manually examined a sample of the flagged results to validate the approach.

The results were disappointing. The model was flagging statements where a speaker quoted or referenced another member—not cases where the parser actually missed a speaker change. A senator quoting what their colleague said is not a multi-speaker parsing failure; it's just one speaker relaying another's words. The attribution is correct.

In that sample we found zero true positives: no cases where the parser had genuinely failed to detect a speaker change. Every flag we reviewed was a false positive, likely due to prompt ambiguity or model confusion about what "multiple speakers" means in this context.

We decided to drop this approach. The regex-based parser was already doing its job correctly, and the LLM "safety net" was adding cost without catching real issues. Sometimes the simplest solution is the right one.

Validating Speaker Attribution at Scale

Separately from that multi-speaker experiment, we validate speaker attribution by sampling data and sending it to GPT-5.2. For each sampled statement, we send the original raw HTML file plus our extracted statement text and speaker attribution. The model independently verifies whether the attribution is correct by examining the source document.

This catches attribution errors that our parser might have made—especially in complex multi-file carryover cases where a speaker continues across page boundaries. When the model flags a mismatch, we investigate: is it a parser bug, an edge case we hadn't seen, or a one-off anomaly in the source data?

We also run deterministic checks: scripts that find residual speaker markers (patterns like "Mr. SMITH." appearing mid-statement, suggesting we failed to split properly) and scripts that compare speaker counts between our extracted JSON and the source HTML. These complement the LLM validation with fast, reproducible checks.
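As an illustration of the first deterministic check, here is a small sketch that looks for label-shaped text after the start of a statement. The pattern is a simplified stand-in for the full speaker regex, and MARKER_RE and residual_markers are hypothetical names.

```python
import re

# Simplified marker: honorific + uppercase surname + optional "of State" + period.
MARKER_RE = re.compile(
    r"\b(?:Mr|Ms|Mrs|Miss)\.\s+[A-Z][A-Z'\-]+"
    r"(?:\s+of\s+[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)?\."
)


def residual_markers(text):
    """Return label-like markers found anywhere except the very start of the text."""
    return [m.group(0) for m in MARKER_RE.finditer(text) if m.start() > 0]


clean = "Ms. MACE. Mr. Speaker, I thank the gentleman from Kentucky (Mr. Comer), for yielding."
suspect = "Mr. GRASSLEY. Mr. President, I rise. Ms. WARREN. Mr. President, I thank the Senator."
print(residual_markers(clean))
print(residual_markers(suspect))
```

A non-empty result doesn't prove a parsing bug, but it gives a short, reproducible list of statements worth a manual look.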

This is part of an ongoing series on making the Congressional Record accessible. Next up: Topic Identification—how we extract and normalize the subjects being discussed.