Methodology

How this data is collected, processed, and presented

This document describes the technical pipeline that powers Cooked. Every figure displayed on this site traces back to a federal public record. Where we compute derived metrics, the method, formula, and limitations are documented below.

Important: Statistical patterns are not evidence of wrongdoing. A campaign finance anomaly may reflect legitimate fundraising strategy, data collection artifacts, or reporting timing. Correlation between lobbying activity and voting records does not establish causation.

1. Data Sources

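All displayed figures originate from official federal disclosure systems. No private datasets, scraped content, or third-party aggregators supply those figures; the two non-federal datasets listed below (congress-legislators and Voteview) are used only for identifier crosswalks and diagnostic validation. The following APIs and datasets are queried directly: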

Federal Election Commission (FEC)

Endpoint: api.open.fec.gov/v1

Individual contributions (Schedule A) and independent expenditures (Schedule E) for all federal candidates. Data covers House, Senate, and Presidential races. FEC filings are reported by committees and released on a rolling basis. Negative amounts (refunds, redesignations) are clamped to zero during ingestion.

Congress.gov & Senate.gov

Endpoints: api.congress.gov/v3, Senate XML roll-call feed

Roll-call vote records for the U.S. House (via Congress.gov API) and U.S. Senate (via the official XML feed). Each vote row records the member, position (Yea/Nay/Not Voting), question text, and result. Vote data is available for the 117th Congress onward.

Lobbying Disclosure Act (LDA) API

Endpoint: lda.senate.gov/api/v1

LD-2 quarterly lobbying filings and LD-203 semi-annual PAC contribution reports filed by registered lobbying firms. LD-2 filings include client name, registrant (lobbying firm), issue codes, and free-text descriptions that often reference specific bills. LD-203 filings report contributions from lobbying firm PACs to federal candidates.

GovInfo BILLSTATUS

Endpoint: api.govinfo.gov

Official bill metadata from the Government Publishing Office. Used to resolve bill numbers referenced in vote questions and lobbying descriptions into titles, policy areas, and latest actions. Bill references in raw data are matched using a normalized type-number-congress key.
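The normalized key can be illustrated with a minimal sketch. The function name `bill_key`, the hyphen separator, and the list of accepted bill types are assumptions made for illustration; the source only states that a type-number-congress key is used.

```python
import re
from typing import Optional

# Assumed set of bill-type codes; the pipeline's actual list is not documented here.
KNOWN_TYPES = {"hr", "s", "hres", "sres", "hjres", "sjres", "hconres", "sconres"}

# A type prefix (letters, dots, spaces) followed by a bill number.
_REF = re.compile(r"^\s*([A-Za-z][A-Za-z.\s]*?)\s*(\d+)\s*$")


def bill_key(reference: str, congress: int) -> Optional[str]:
    """Turn a raw reference such as 'H.R. 1234' into 'hr-1234-117'."""
    match = _REF.match(reference)
    if match is None:
        return None
    # Drop dots and spaces so 'H.R.' and 'HR' collapse to the same type code.
    bill_type = re.sub(r"[.\s]", "", match.group(1)).lower()
    if bill_type not in KNOWN_TYPES:
        return None
    # int() removes leading zeros so 'hr0042' and 'H.R. 42' agree.
    return f"{bill_type}-{int(match.group(2))}-{congress}"
```

With a key of this shape, a reference found in a vote question and one found in a lobbying description resolve to the same bill record as long as both carry the same type, number, and Congress.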

congress-legislators Dataset

Source: github.com/unitedstates/congress-legislators

Open-source directory mapping every legislator to their official identifiers: bioguide ID, FEC candidate ID, LIS ID, THOMAS ID, and ICPSR ID. This crosswalk is used to link FEC contribution records to Congressional vote records for the same person. The dataset is periodically snapshot-versioned in the database to support incremental matching improvements.
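As a rough sketch of how such a crosswalk might be built: records in the congress-legislators dataset carry an `id` mapping that includes `bioguide` and a list under `fec`. The function name `build_fec_crosswalk` and the sample identifiers below are placeholders, not real legislators or the pipeline's own code.

```python
from typing import Dict, Iterable, Mapping


def build_fec_crosswalk(legislators: Iterable[Mapping]) -> Dict[str, str]:
    """Map each FEC candidate ID to the legislator's bioguide ID."""
    crosswalk: Dict[str, str] = {}
    for record in legislators:
        ids = record.get("id", {})
        bioguide = ids.get("bioguide")
        if not bioguide:
            continue  # without a bioguide ID there is no primary key to link to
        # One person can hold several FEC candidate IDs (e.g., House, then Senate).
        for fec_id in ids.get("fec", []):
            crosswalk[fec_id] = bioguide
    return crosswalk


# Placeholder records shaped like the dataset's entries (IDs are made up).
sample = [
    {"id": {"bioguide": "X000001", "fec": ["H0XX00001", "S0XX00002"]}},
    {"id": {"bioguide": "X000002"}},  # no FEC ID on file
]
crosswalk = build_fec_crosswalk(sample)
```

Looking up an FEC candidate ID in the resulting dictionary yields the bioguide ID used to find the same person's vote records.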

Voteview (UCLA)

Source: voteview.com

Academic dataset of roll-call voting records maintained by UCLA political scientists. Used as a secondary validation source to cross-check member alignment and vote record consistency against the official Congress.gov data. Not used for primary display; diagnostic only.

2. Ingestion Pipeline

Source data is fetched through a series of specialized workers, each targeting one upstream API. Workers run on a recurring schedule and store records in dedicated tables per source.

Processing Order

The pipeline runs in a fixed order to respect data dependencies:

  1. Identity: Fetch the congress-legislators dataset, build crosswalks linking bioguide IDs to FEC candidate IDs.
  2. Congress: Ingest recent House and Senate roll-call votes.
  3. FEC: Fetch contributions and independent expenditures for all federal candidates, using crosswalk IDs to match candidates.
  4. GovInfo: Fetch bill metadata for bills referenced in votes and lobbying filings.
  5. Lobbying (LD-2): Fetch quarterly disclosure filings from the LDA API.
  6. LD-203: Fetch PAC contribution reports from lobbying registrants.
  7. Normalize: Rebuild the derived graph from all raw tables (see Section 3).

Data Quality Controls

  • Contribution amounts are validated as finite numbers; non-numeric values are discarded.
  • Dates are validated against ISO 8601 format; malformed dates are rejected at ingest time.
  • Candidate matching uses crosswalk-backed identifiers where available, falling back to name-based search with state and office constraints.
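The first two controls, together with the zero-clamping of negative amounts described in Section 1, could look roughly like the sketch below. The function names `clean_amount` and `clean_date` are hypothetical, and relying on `date.fromisoformat` is an assumption about how strictly ISO 8601 is enforced.

```python
import math
from datetime import date
from typing import Optional


def clean_amount(raw) -> Optional[float]:
    """Return a finite, non-negative amount, or None if the value is unusable."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # non-numeric values are discarded
    if not math.isfinite(value):
        return None  # NaN and infinities are discarded
    return max(value, 0.0)  # refunds and redesignations are clamped to zero


def clean_date(raw) -> Optional[date]:
    """Return a date for a valid ISO 8601 calendar date, or None to reject the row."""
    try:
        return date.fromisoformat(raw)
    except (TypeError, ValueError):
        return None
```

Returning `None` rather than raising lets an ingest worker skip a single bad row without aborting the whole batch; whether the actual workers behave that way is not stated in the source.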

3. Normalization & Graph Construction

Raw records from different sources use different identifier systems. A contributor in FEC data has no inherent link to a lobbying client in LDA data or a member in Congressional records. The normalization step reads all source records and builds a unified directed graph linking people, organizations, bills, votes, and issue areas.

Entity Resolution

Each entity (politician, contributor, committee, lobby client, lobbying firm, bill, issue area) is identified by a canonical key from its source system. For politicians, the bioguide ID serves as the primary key. For contributors, a composite of normalized name, employer, and occupation is used. For lobby clients, the LDA-assigned client ID is canonical. This approach ensures that the same real-world entity is consistently represented across data sources.
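A composite contributor key of this kind might be sketched as follows. The normalization rules (uppercase, punctuation stripped, whitespace collapsed) and the `|` separator are assumptions; the source only names the three fields that make up the key.

```python
import re


def _norm(text) -> str:
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    text = (text or "").upper()
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)
    return " ".join(text.split())


def contributor_key(name, employer, occupation) -> str:
    """Build a canonical key from normalized name, employer, and occupation."""
    return "|".join(_norm(part) for part in (name, employer, occupation))
```

Two filings that differ only in punctuation or letter case then collapse to one contributor, while the same name with a different employer stays distinct.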

Edge Construction

The graph currently defines 12 active directed edge kinds. Each edge carries metadata (accumulated dollar amounts, vote positions, filing references) and links to evidence records that trace back to specific raw table rows with source record IDs.

Edge | From | To | Source
Contribution | Individual | Committee | FEC Sched. A
Committee receipt | Committee | Politician | FEC Sched. A
Outside spending | PAC/Org | Politician | FEC Sched. E
Vote cast | Politician | Roll call | Congress.gov
Vote on bill | Roll call | Bill | GovInfo
Lobbying on issue | Lobby client | Issue area | LDA LD-2
Lobbying on bill | Lobby client | Bill | LDA LD-2
Lobby-politician link | Lobby client | Politician | LD-2 + vote join
Registrant-client | Lobbying firm | Client org | LDA LD-2
Registrant PAC | Lobbying firm | Politician | LDA LD-203
Bill policy area | Bill | Issue area | GovInfo
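One way to picture an edge with accumulated amounts and evidence links is the sketch below. The class names, field names, node identifiers, and the table name `fec_schedule_a` are all hypothetical; the source describes what an edge carries, not how it is stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EvidenceRef:
    source_table: str  # raw table the record came from
    record_id: str     # source record ID within that table
    label: str         # human-readable description


@dataclass
class Edge:
    kind: str
    src: str
    dst: str
    amount: float = 0.0
    evidence: List[EvidenceRef] = field(default_factory=list)


def add_contribution(edges: Dict[Tuple[str, str, str], Edge],
                     src: str, dst: str, amount: float, ref: EvidenceRef) -> Edge:
    """Accumulate a contribution onto the (kind, src, dst) edge and record its evidence."""
    key = ("contribution", src, dst)
    edge = edges.setdefault(key, Edge("contribution", src, dst))
    edge.amount += amount
    edge.evidence.append(ref)
    return edge
```

Keying edges by kind and endpoints means repeated contributions between the same pair add to one running total, while each raw row remains individually traceable through the evidence list.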

4. Statistical Signals

After graph construction, a set of signal builders analyze each politician's subgraph and emit structured findings. Each signal carries a confidence level (low, medium, high), a numeric score, a human-readable summary, and evidence references linking back to source records.

The system currently supports 23 signal kinds. A given refresh may materialize fewer kinds, depending on source coverage and whether any politician crosses a signal's minimum threshold. The table below lists each supported signal with its computation method and the minimum condition for it to appear. The confidence tier noted above is assigned from the magnitude of the finding.

Signal | Method | Threshold
Outside spending | Total independent expenditure volume (support + oppose) from FEC Schedule E filings. | Any amount
Issue overlap | Set intersection of issue codes that appear in both lobbying filings (LD-2) and contribution employer categories for the same politician. | ≥ 1 area
Vote deviation | Measures the fraction of a politician's total funding that flows into issue areas where the politician has no corresponding vote record. High values suggest a disconnect between funding sources and legislative activity. | Any mismatch
Temporal proximity | Finds contributions that fall within a 30-day window before or 7-day window after a roll-call vote on a related bill. Window is asymmetric because pre-vote contributions are more analytically interesting. | ≥ 1 match
Lobby-donate overlap | Identifies organizations whose name appears as both a lobbying client (LD-2 filings) and a contributor employer (FEC Schedule A) for the same politician. | ≥ 1 match
Funding concentration | Herfindahl-Hirschman Index (HHI): sum of squared contribution shares, scaled to 0-10,000. Higher values indicate fewer donors controlling more of the total. | HHI ≥ 1000
Party alignment | Fraction of scored roll calls where the politician voted with their party's majority direction. Only roll calls with at least 3 same-party votes are scored. | ≥ 5 scored votes
Contribution outlier | Z-score of the politician's top contributor amount versus the distribution of top contributor amounts across all politicians. Uses Bessel-corrected sample standard deviation. | Z ≥ 1.5
Peer anomaly | Leave-one-out Z-score across multiple axes (HHI, top donor share, outside spend ratio) against peers grouped by office and party. Each axis is independently tested. | |Z| > 2 on ≥ 1 axis
Funding-vote alignment | Finds lobby clients that are also contributor employers; measures the share of voted bills with convergent lobbying activity. | ≥ 2 convergent bill-votes OR > 30%
Industry shift | Cosine similarity of employer contribution vectors between consecutive election cycles. Vectors are constructed per employer, weighted by contribution amount. A large shift indicates the donor base has materially changed. | Similarity < 70%
Employer-cluster giving | Groups contributions by normalized employer within a filing cycle and counts distinct contributors per employer. Large clusters suggest coordinated giving. | ≥ 8 contributors from same employer
Contribution velocity spike | Sliding 30-day window over sorted contributions; computes z-score of peak window vs. baseline mean contribution rate. | Peak window Z > 2.5
Pre-vote surge | Computes baseline daily contribution rate, then measures the 14-day pre-vote contribution rate against that baseline. | Pre-vote rate > 3× baseline
Abstention on passage votes | Compares a politician's participation rate on final passage votes vs. their overall participation rate. Flags selective non-voting. | ≥ 15pp gap, overall > 80%, ≥ 3 missed
Dark money outside spend ratio | Classifies independent expenditure spenders as disclosed PACs or dark money 501(c)(4)s using name-pattern heuristics; computes the dark money share. | > 30% dark money AND > $50K
Lobby client cluster | Triple-path detection: contributions from employers that are lobby clients, registrants filing for those clients, and LD-203 PAC contributions from the same network. | All 3 paths active
Registrant client capture | Herfindahl index on registrant revenue distribution across lobbying clients. Identifies registrants dominated by a single client. | HHI > 7000 (top client > 70%)
Multi-registrant bill pressure | Counts independent lobbying registrants targeting the same bill the politician voted on, via graph traversal. | ≥ 3 registrants on same bill
Post-vote reward | Mirror of pre-vote surge: measures the 30-day post-vote contribution rate vs. baseline daily rate. | Post-vote rate > 2.5× baseline
Shared donor cohort | Pairwise Jaccard similarity on normalized employer donor sets across politicians. Identifies politicians funded by the same networks. | Jaccard ≥ 0.25 (25% overlap)
Bipartisan defection cluster | Cross-politician signal: identifies roll calls where both parties have members voting against their party majority. | ≥ 5 bipartisan defection roll calls
Lobby bill abstention | Compares abstention rate on bills with active lobbying vs. overall abstention rate. Flags selective disengagement on lobbied legislation. | Lobbied abstention ≥ 2× overall rate
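Four of the formulas named in the table (HHI, cosine similarity, Jaccard similarity, and the asymmetric vote window) are standard enough to sketch directly. The function names are hypothetical, and the inclusive window boundaries are an assumption; this is a minimal illustration, not the signal builders themselves.

```python
import math
from datetime import date, timedelta
from typing import Dict, Iterable, Set


def hhi(amounts: Iterable[float]) -> float:
    """Herfindahl-Hirschman Index: sum of squared percentage shares (0 to 10,000)."""
    amounts = list(amounts)
    total = sum(amounts)
    if total <= 0:
        return 0.0
    return sum((a / total * 100) ** 2 for a in amounts)


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two employer -> amount vectors (missing keys count as 0)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def in_vote_window(contribution: date, vote: date,
                   days_before: int = 30, days_after: int = 7) -> bool:
    """True if the contribution falls in the asymmetric window around the vote.

    Both boundary days are treated as inside the window (an assumption).
    """
    start = vote - timedelta(days=days_before)
    end = vote + timedelta(days=days_after)
    return start <= contribution <= end
```

For example, two donors with equal shares give an HHI of 5,000, well above the 1,000 threshold, and employer sets {A, B, C} and {B, C, D} give a Jaccard similarity of 0.5.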

5. Display Conventions

Name Formatting

FEC data stores individual names in “LAST, FIRST MIDDLE” format (all caps). During normalization, names are parsed and reordered to “First Last” using a title-case formatter. Organization names (PACs, lobbying clients, registrants) are title-cased without reordering. A suffix detection step prevents misinterpreting organization commas (e.g., “Brooklyn College Foundation, Inc.”) as personal name separators.
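The reordering and suffix check might be sketched like this. The function name `format_name`, the suffix list, and the choice to drop middle names are assumptions; note also that Python's `str.title()` is a blunt title-case formatter (it yields "Mcdonald" and "Pac"), so a production formatter would presumably need exceptions.

```python
# Assumed list of organization suffixes that can follow a comma.
ORG_SUFFIXES = {"INC", "LLC", "CORP", "LTD", "CO", "LLP", "LP", "PAC"}


def format_name(raw: str) -> str:
    """Convert 'LAST, FIRST MIDDLE' to 'First Last'; leave organizations in order."""
    cleaned = " ".join(raw.split())
    last, comma, rest = cleaned.partition(",")
    tokens = rest.split()
    if not comma or not tokens:
        return cleaned.title()  # no comma: treat as an organization name
    if tokens[0].strip(".").upper() in ORG_SUFFIXES:
        return cleaned.title()  # comma precedes a suffix such as 'Inc.', not a first name
    return f"{tokens[0].title()} {last.strip().title()}"
```

The suffix check runs before any reordering, which is what keeps "Brooklyn College Foundation, Inc." from being rendered as "Inc. Brooklyn College Foundation".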

Dollar Amounts

Amounts above $1M are shown with one decimal (e.g., $2.4M); amounts above $1K are shown in thousands (e.g., $43K); smaller amounts are displayed as whole dollars.
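A minimal sketch of this formatting rule follows. The function name `format_dollars` is hypothetical, and treating exactly $1M and exactly $1K as belonging to the higher tier is an assumption, since the source says only "above".

```python
def format_dollars(amount: float) -> str:
    """Abbreviate a dollar amount as described: $2.4M, $43K, or whole dollars."""
    if amount >= 1_000_000:
        return f"${amount / 1_000_000:.1f}M"
    if amount >= 1_000:
        return f"${amount / 1_000:.0f}K"
    return f"${amount:.0f}"
```

Amounts just under a tier boundary would round up awkwardly in this sketch (for instance, a value slightly below $1M would render in thousands), so the site's formatter may handle those cases differently.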

Confidence Levels

Each signal is assigned a confidence level based on the strength and quantity of supporting evidence. “High” typically requires multiple corroborating data points above the threshold. “Medium” reflects a single strong signal or multiple weaker ones. “Low” is used for findings that meet the minimum threshold but lack strong corroboration.

Peer Comparisons

When the interface shows a ratio relative to peers (e.g., “7.6x peer avg”), the figure is the politician's value divided by the peer group mean. Peer groups are defined by office (House or Senate) and party affiliation, and a group must contain at least 5 members. The comparison is leave-one-out: the politician being evaluated is excluded from the peer statistics so that their own value does not skew the baseline.
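The leave-one-out comparison can be sketched as below. The function name `peer_comparison` is hypothetical, and applying the minimum size of 5 to the peers that remain after exclusion is an assumption; the source does not say whether the evaluated politician counts toward the minimum.

```python
from statistics import mean, stdev
from typing import Dict, Optional


def peer_comparison(values: Dict[str, float], target: str,
                    min_group_size: int = 5) -> Optional[dict]:
    """Compare one politician to peers, excluding them from the peer statistics."""
    peers = [v for key, v in values.items() if key != target]
    if len(peers) < max(min_group_size, 2):
        return None  # group too small for a meaningful comparison
    peer_mean = mean(peers)
    peer_sd = stdev(peers)  # sample standard deviation (Bessel-corrected)
    value = values[target]
    return {
        "ratio": value / peer_mean if peer_mean else None,
        "z": (value - peer_mean) / peer_sd if peer_sd else None,
    }


# Illustrative values only: one outlier against five peers averaging 10.
group = {"A": 76.0, "B": 8.0, "C": 9.0, "D": 10.0, "E": 11.0, "F": 12.0}
result = peer_comparison(group, "A")
```

Here the peer mean is 10 because A's own value of 76 is left out, so the ratio is 7.6; had A been included, the mean would be 21 and the ratio would shrink to about 3.6, which is the distortion the leave-one-out method avoids.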

6. Known Limitations

Incomplete contribution coverage

FEC Schedule A data is reported on a rolling basis. The database contains the most recent available filings but does not represent a complete historical record for every candidate. Concentration metrics (HHI, top donor share) are computed from imported rows only and may differ from figures derived from the full FEC dataset.

Vote coverage begins at the 117th Congress

Roll-call vote records are available from the 117th Congress (2021) onward. Politicians who served before this period will not have vote data, and party alignment metrics will reflect only recent legislative sessions.

Lobbying-to-politician links are indirect

LD-2 lobbying filings disclose which bills and issues a client lobbied on, but not which specific legislators were contacted. The link between a lobbying client and a politician is inferred by finding politicians who voted on bills referenced in the client's filings. This means the connection reflects shared legislative activity, not direct lobbying contact.
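The inference described here is essentially a join on bill keys, which a short sketch can make concrete. The function name, input shapes, and sample identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def infer_lobby_links(client_bills: Dict[str, Set[str]],
                      votes: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Link each lobby client to politicians who voted on bills the client lobbied on.

    client_bills maps client -> bill keys from its filings;
    votes is an iterable of (politician, bill key) pairs.
    """
    voters_by_bill = defaultdict(set)
    for politician, bill in votes:
        voters_by_bill[bill].add(politician)

    links = defaultdict(set)
    for client, bills in client_bills.items():
        for bill in bills:
            for politician in voters_by_bill.get(bill, ()):
                links[(client, politician)].add(bill)  # shared bill is the evidence
    return dict(links)


links = infer_lobby_links(
    {"client:1": {"hr-10-117", "s-20-117"}, "client:2": {"hr-99-117"}},
    [("pol:A", "hr-10-117"), ("pol:B", "hr-10-117"), ("pol:A", "s-20-117")],
)
```

Because every member who votes on a lobbied bill receives a link, the result says nothing about which legislators the client actually contacted.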

Name-based matching has limits

When crosswalk-backed identifiers are unavailable, candidate matching falls back to name similarity with state and office constraints. This can produce incorrect matches for common names. A nickname resolution table handles common variants (e.g., “Ted” matches both “Edward” and “Theodore”), but unusual name spellings may still cause missed matches.
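A nickname table of this kind might work roughly as follows. The table contents beyond the "Ted" example and the function name `first_names_match` are illustrative assumptions.

```python
# Illustrative nickname table; only the 'Ted' entry comes from the text above.
NICKNAMES = {
    "TED": {"EDWARD", "THEODORE"},
    "BILL": {"WILLIAM"},
    "LIZ": {"ELIZABETH"},
}


def first_names_match(a: str, b: str) -> bool:
    """True if two first names are equal or related through the nickname table."""
    a, b = a.strip().upper(), b.strip().upper()
    if a == b:
        return True
    return b in NICKNAMES.get(a, set()) or a in NICKNAMES.get(b, set())
```

The lookup is deliberately not transitive: "Ted" matches both "Edward" and "Theodore", but "Edward" does not match "Theodore", which avoids merging two different people who merely share a nickname.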

Independent politicians require confirmed caucus affiliation

For peer-anomaly grouping, independent politicians (e.g., Sen. Sanders, Sen. King) are mapped to their caucus party when that affiliation is confirmed in metadata. Independents without a confirmed caucus are excluded from peer comparisons rather than assigned to a default group.

Lobby-donate matching uses employer name heuristics

The lobby-donate overlap signal matches lobbying clients to contributor employers by normalized business name, stripping common suffixes (LLC, Inc., Corp, etc.). This fuzzy matching can produce false positives when two distinct organizations share a normalized name, or false negatives when the same organization uses substantially different names across FEC and LDA filings.
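The suffix stripping, and the false positives it can cause, can be seen in a small sketch. The function name and suffix list are assumptions, and the company names are made up.

```python
import re

# Assumed suffix list; the text above names LLC, Inc., and Corp as examples.
SUFFIXES = {"LLC", "INC", "CORP", "CORPORATION", "CO", "LTD", "LP", "LLP"}


def normalize_business_name(name: str) -> str:
    """Uppercase, drop punctuation, and strip trailing corporate suffixes."""
    tokens = re.sub(r"[^A-Z0-9 ]+", " ", name.upper()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

"Acme Widgets, Inc." and "ACME WIDGETS LLC" both normalize to "ACME WIDGETS", which is the intended match. Two unrelated firms named "Delta Corp" and "Delta LLC" also collapse to "DELTA", which is the false-positive case described above.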

7. Evidence Traceability

Every signal and edge in the system carries evidence references that link back to specific records in the raw source tables. Each evidence reference includes the source table name, the record's unique identifier, and a human-readable label. On politician profile pages, evidence links resolve to the original filing on FEC.gov, Congress.gov, or lda.senate.gov.

The full evidence chain for any politician can be exported as a structured document from the profile page via the “Export evidence” link.

This methodology is current as of March 2026. The pipeline, signal definitions, and thresholds may be updated as new data sources are integrated or existing computations are refined. Material changes will be reflected on this page.