Address clustering is the technical foundation of geographic federal contractor intelligence. The basic idea sounds simple: group SAM.gov registered entities by their physical business address, then look at the resulting groups. In practice, the method requires careful handling of several technical problems — address normalization, deduplication, threshold selection, risk scoring, and exclusion cross-referencing — before the output is reliable enough to support compliance decisions.
This article explains the address clustering methodology used by Convergence Data Analytics: how raw SAM.gov data is processed, what the clustering threshold means, how risk scores are computed, and what the method can and cannot tell you about a specific cluster.
Step 1: Address normalization
The first technical problem is that raw addresses in SAM.gov contain countless inconsistencies. The same physical building might appear in the database as "525 Corporate Drive Suite 200," "525 CORPORATE DR STE 200," "525 Corporate Drive #200," and dozens of other variants. Without normalization, each variant clusters separately and the true geographic picture is hidden behind formatting differences.
Our normalization process strips and standardizes the following components:
- Suite and unit designators — STE, Suite, #, Unit, Apt, Apartment, Bldg, Building, Floor, Room, etc.
- Road type abbreviations — Street/St, Drive/Dr, Avenue/Ave, Boulevard/Blvd, Road/Rd, etc.
- Directional prefixes and suffixes — North/N, South/S, East/E, West/W, Northeast/NE, etc.
- Capitalization — everything uppercased for consistent comparison
- Punctuation and extra whitespace — commas, periods, multiple spaces collapsed to single spaces
The normalization is equally deliberate about what it does not strip. We preserve numeric components like building numbers and street numbers because those are the actual identifying components of an address. We preserve full street names because two different roads can share the same building number. The goal is to collapse formatting variation while preserving real address identity.
Edge cases require careful handling. The string "STE" appears in words like "STEAMBOAT" and "ESTERO" — if the regex were too greedy, it would incorrectly strip portions of those street names. Our normalization uses word-boundary anchors and trailing-position requirements to ensure that "STE 201" gets stripped but "STEAMBOAT SPRINGS RD" does not. This kind of detail matters when you are processing 2.68 million records and a single bad pattern produces thousands of false matches.
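The normalization and edge-case handling described above can be sketched roughly as follows. This is an illustrative simplification, not the production implementation: the function name, the unit-word list, and the abbreviation table are our own, and the real pipeline handles many more patterns.

```python
import re

# Illustrative sketch of address normalization. The word list and
# abbreviation table here are assumptions, trimmed for brevity.
UNIT_RE = re.compile(
    # "\b...\b" word boundaries keep "STEAMBOAT" and "ESTERO" intact,
    # and the "$" anchor enforces the trailing-position requirement.
    r"(?:\b(?:STE|SUITE|UNIT|APT|APARTMENT|BLDG|BUILDING|FLOOR|ROOM)\b|#)"
    r"\s*#?\s*\w+\s*$"
)
STREET_ABBREV = {
    "STREET": "ST", "DRIVE": "DR", "AVENUE": "AVE",
    "BOULEVARD": "BLVD", "ROAD": "RD",
}

def normalize_address(raw: str) -> str:
    addr = raw.upper()
    # Collapse punctuation and extra whitespace.
    addr = re.sub(r"[.,]", " ", addr)
    addr = re.sub(r"\s+", " ", addr).strip()
    # Strip a trailing suite/unit designator; street numbers survive.
    addr = UNIT_RE.sub("", addr).strip()
    # Standardize road-type words to their abbreviations.
    return " ".join(STREET_ABBREV.get(t, t) for t in addr.split())
```

With this sketch, the three variants from the example above all collapse to the same normalized string, while street names containing "STE" as a substring are left untouched.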
Step 2: The clustering threshold
Once addresses are normalized, the clustering step is straightforward: group all entities that share the same normalized address and the same city. The threshold for forming a cluster is three or more uniquely named entities. Both parts of that threshold are deliberate choices.
Three is the minimum. Two entities at the same address could trivially be a small office shared between a parent company and one subsidiary — not a meaningful pattern. Three or more uniquely named entities at the same physical location represents a pattern worth examining, even if most such patterns turn out to have legitimate explanations.
Uniquely named is the requirement. The "uniqueness" check matters because some entities have multiple SAM registrations under slightly different name variants. Counting these as separate entities would inflate cluster sizes artificially. By requiring three uniquely named legal entities, we ensure each cluster represents three or more genuinely distinct firms rather than three registration variants of the same firm.
The result of this filtering: 67,594 sellable address clusters from 2,684,826 total SAM registrations. Since each cluster contains at least three uniquely named entities, at least 202,782 entities (roughly 7.5 percent of all registrations) end up in clusters. The remaining entities sit at addresses that contain no other co-registered entities, or at most one or two others, below the cluster threshold.
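The grouping and threshold logic can be expressed compactly. This is a minimal sketch under assumed field names (`norm_address`, `city`, `legal_name`); the production system works against the database rather than in-memory records.

```python
from collections import defaultdict

# Sketch of the clustering step: key on (normalized address, city),
# then keep only groups with three or more uniquely named entities.
def build_clusters(entities, min_unique_names=3):
    groups = defaultdict(list)
    for e in entities:
        groups[(e["norm_address"], e["city"])].append(e)
    return {
        key: members
        for key, members in groups.items()
        # Count distinct legal names, so registration variants of the
        # same firm do not inflate the cluster size.
        if len({m["legal_name"] for m in members}) >= min_unique_names
    }
```

Note that the uniqueness check runs on legal names within a group, so an address with three registrations but only two distinct firms does not form a cluster.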
Step 3: Risk scoring
Each cluster receives an automated risk score on a 0–100 scale. The score is a screening indicator, not a conclusion. A high score means a cluster exhibits more of the patterns that compliance teams typically want to review — not that anything wrong has happened or is happening.
The score combines several weighted factors:
- Entity count (up to 20 points). Larger clusters receive more points, capped to prevent extremely high counts from dominating the score.
- Coordinated SAM expiration dates (15 points). When three or more entities at the same address have SAM registrations expiring within a 90-day window, that pattern receives points. Coordinated expirations can occur for legitimate reasons (same registration date, same renewal cycle) but the pattern is worth flagging.
- Mixed active/expired with active contracts (15 points). A cluster where some entities are active and have ongoing contracts while others are expired triggers this factor.
- Total contract value concentration (up to 20 points). Clusters with substantial federal contract activity score higher than clusters with no awards.
- Set-aside certification patterns (10 points). Clusters where multiple entities hold the same set-aside certification (8(a), HUBZone, SDVOSB, WOSB) trigger this factor.
- NAICS code concentration (10 points). Clusters where most entities operate in the same NAICS code space score higher than diverse clusters.
- Active contract holders (10 points). The number of entities in the cluster currently holding federal contracts.
The total possible score is 100. In practice, most clusters score in the 20–40 range. Scores above 60 are uncommon and typically reflect addresses with multiple risk factors stacking together. Scores above 80 are rare and warrant priority review — but again, "review" means look more closely, not "this is wrong."
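The weighted combination above can be sketched as a scoring function. The weights come from the factor list; the capping and scaling choices inside each term (for example, two points per entity, one point per million dollars) are our illustrative assumptions, not the production formulas.

```python
# Hedged sketch of the 0-100 risk score. Field names and per-factor
# scaling are assumptions; only the weights mirror the published list.
def score_cluster(c):
    score = 0.0
    score += min(c["entity_count"] * 2, 20)           # entity count, capped at 20
    if c["coordinated_expirations"]:                  # >=3 expirations in 90 days
        score += 15
    if c["mixed_status_with_active_contracts"]:       # active + expired mix
        score += 15
    score += min(c["total_contract_value"] / 1_000_000, 20)  # value concentration
    if c["shared_set_aside"]:                         # same 8(a)/HUBZone/etc.
        score += 10
    if c["naics_concentration"] >= 0.5:               # majority in one NAICS space
        score += 10
    score += min(c["active_contract_holders"] * 2, 10)
    return min(round(score), 100)
```

Because every factor is capped, no single dimension (such as a very large entity count) can dominate the total, which is what keeps most clusters in the 20-40 band.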
Step 4: Exclusion cross-referencing
After clustering and scoring, every entity in every cluster is checked against the 167,681 records in the SAM Exclusion List. The exclusion list is the federal government's record of entities barred from contracting, and a positive match is the most consequential finding our methodology produces.
Two matching methods run in parallel:
- UEI exact match. If the UEI in a SAM registration exactly matches a UEI in the exclusion record, that is the highest-confidence match. UEIs are unique identifiers, so this match has essentially zero false positive risk.
- Firm name exact match (city-restricted). If the legal entity name exactly matches an exclusion record name and both are in the same city, that is a high-confidence match. The city restriction prevents matching unrelated entities with similar names in different geographic locations.
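The two matching passes can be sketched as lookups against pre-built indexes. The function and field names here are illustrative assumptions; the real system runs these checks against the exclusion table in the database.

```python
# Sketch of the two parallel exclusion checks: UEI exact match, and
# name exact match restricted to the same city.
def find_exclusion_matches(entity, exclusions_by_uei, exclusions_by_name_city):
    matches = []
    # UEI exact match: unique identifier, essentially zero false positives.
    if entity["uei"] in exclusions_by_uei:
        matches.append(("uei_exact", exclusions_by_uei[entity["uei"]]))
    # Name match is only accepted when the city also matches, to avoid
    # pairing unrelated firms with similar names in different locations.
    key = (entity["legal_name"].upper(), entity["city"].upper())
    if key in exclusions_by_name_city:
        matches.append(("name_city_exact", exclusions_by_name_city[key]))
    return matches
```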
A confirmed exclusion match elevates the cluster status to "confirmed finding." Of 67,594 sellable clusters, 129 contain at least one confirmed match — 0.19 percent overall, but concentrated in the largest contracting markets.
What the methodology cannot tell you
It is worth being explicit about the limits of the method. Address clustering is a screening tool, not an investigative conclusion. It can tell you:
- Where federal contractor entities are co-located
- Which clusters have unusual characteristics (size, expiration patterns, NAICS concentration)
- Which clusters contain entities matching the exclusion list
- The geographic distribution of contractor density across cities and states
It cannot tell you:
- Whether co-located entities are actually affiliated, owned by the same parties, or operationally connected
- Whether any pattern reflects intentional behavior or coincidence
- Whether a specific cluster represents wrongdoing of any kind
- Anything about the underlying business purposes, ownership structures, or relationships of the entities involved
The methodology exists to provide screening signal — to surface patterns that warrant standard review and to give compliance teams the geographic context that single-entity SAM lookups cannot provide. Patterns identified may have legitimate explanations including normal corporate structures, shared office buildings, registered agent services, business incubators, or standard business practices. For more, see our separate article on legitimate co-location patterns.
Why DISTINCT and deduplication matter
One technical detail that turns out to be critical: every entity query in the underlying system uses SQL DISTINCT and is wrapped in a Python deduplication function. This sounds pedantic, but it matters because some entities have multiple SAM registrations — one current and several historical, or the same entity registered as both an active record and an expired record. Without DISTINCT, the same UEI would appear multiple times in cluster results, inflating entity counts and producing duplicate rows in the CSV export.
The DISTINCT requirement runs at two levels. At the database level, the SQL query selects DISTINCT entities by UEI. At the application level, a Python dedupe_entities() function provides a safety net to catch any duplicates that survive the database query. The combination ensures that entity counts in the published reports reflect actual unique entities, not duplicate registrations of the same entity.
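The application-level safety net is a simple first-occurrence filter on UEI. This is an illustrative version of the `dedupe_entities()` function described above; the real implementation may differ in detail.

```python
# Sketch of the application-level deduplication: keep the first record
# seen for each UEI, preserving input order.
def dedupe_entities(entities):
    seen = set()
    unique = []
    for e in entities:
        if e["uei"] not in seen:
            seen.add(e["uei"])
            unique.append(e)  # first occurrence wins; later duplicates dropped
    return unique
```

Keeping the first occurrence matters when the query orders records so that the current registration sorts ahead of historical ones, so the surviving row is the one compliance teams expect to see.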
This may sound obvious, but it is the kind of detail that distinguishes production-quality data from quick analysis. Independent reviews of our methodology specifically called out duplicate entity rows as a quality issue in early versions of the platform — and the fix was to enforce DISTINCT and deduplication everywhere entity data flows. The published reports today contain zero duplicate entities.
Quality assurance
Every step of the methodology has been independently reviewed and refined through multiple iterations. Key quality controls include:
- Virtual office filtering. Known registered agent and virtual office addresses are documented and auto-cleared to prevent false high-density clusters.
- Common word blocklist. Words like "TECHNOLOGY," "DEFENSE," "FEDERAL," and "SOLUTIONS" are excluded from name-stem matching to prevent false positives on generic terms.
- Auto-clear of known firms. Universities, government entities, established defense contractors, and utilities are automatically cleared from the review queue when they appear in clusters.
- Pre-sale verification. Every confirmed exclusion match is verified against the live sam.gov record before being included in a published report.
- Independent review. The methodology has been audited four times by external reviewers, with rating progression from 7/10 in early versions to 9.6/10 in the current production methodology.
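The common-word blocklist above can be illustrated with a small name-stemming helper. The word set and stemming approach here are our own simplification for illustration; the production blocklist is longer and the matching logic more involved.

```python
# Illustrative name-stem filter: drop generic words before comparing
# entity names, so "ACME FEDERAL SOLUTIONS" does not falsely match
# every other firm containing "FEDERAL" or "SOLUTIONS".
GENERIC_WORDS = {"TECHNOLOGY", "DEFENSE", "FEDERAL", "SOLUTIONS", "LLC", "INC"}

def name_stem(legal_name):
    tokens = legal_name.upper().replace(",", " ").split()
    return " ".join(t for t in tokens if t not in GENERIC_WORDS)
```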
For more on the underlying technical implementation, see our full methodology page.