Address clustering is the technical foundation of geographic federal contractor intelligence. The basic idea sounds simple: group SAM.gov registered entities by their physical business address, then look at the resulting groups. In practice, the method requires careful handling of several technical problems — address normalization, deduplication, threshold selection, risk scoring, and exclusion cross-referencing — before the output is reliable enough to support compliance decisions.

This article explains the address clustering methodology used by Convergence Data Analytics: how raw SAM.gov data is processed, what the clustering threshold means, how risk scores are computed, and what the method can and cannot tell you about a specific cluster.

Step 1: Address normalization

The first technical problem is that raw addresses in SAM.gov contain countless inconsistencies. The same physical building might appear in the database as "525 Corporate Drive Suite 200," "525 CORPORATE DR STE 200," "525 Corporate Drive #200," and dozens of other variants. Without normalization, each variant clusters separately and the true geographic picture is hidden behind formatting differences.

Our normalization process strips and standardizes the following components:

- Case: every address is uppercased, so "Suite" and "STE" variants compare equally.
- Suite and unit designators: trailing "Suite 200," "STE 200," and "#200" are removed, since separate units in the same building belong to the same physical location.
- Street-type abbreviations: "Drive" and "DR," "Street" and "ST," and similar pairs are mapped to a single canonical form.
- Punctuation and whitespace: periods, commas, and repeated spaces are collapsed.

The normalization is intentional in what it does not strip. We preserve numeric components like building numbers and street numbers because those are the actual identifying components of an address. We preserve full street names because two different roads can have the same suite numbers. The goal is to collapse formatting variation while preserving real address identity.

Edge cases require careful handling. The string "STE" appears in words like "STEAMBOAT" and "ESTERO" — if the regex were too greedy, it would incorrectly strip portions of those street names. Our normalization uses word-boundary anchors and trailing-position requirements to ensure that "STE 201" gets stripped but "STEAMBOAT SPRINGS RD" does not. This kind of detail matters when you are processing 2.68 million records and a single bad pattern produces thousands of false matches.
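As an illustration, a minimal version of this suite-stripping logic might look like the following. The pattern and abbreviation table are illustrative sketches, not the production rules:

```python
import re

# Trailing suite/unit designator: "STE 200", "SUITE 200", or "#200" at the
# END of the address only, and only when followed by a digit-led token.
# The word boundary plus digit requirement keeps "STEAMBOAT" intact.
SUITE_RE = re.compile(r"(?:\b(?:STE|SUITE)|#)\s*(\d+\w*)\s*$")

# Small illustrative sample; a real table covers all USPS street suffixes.
STREET_ABBREVS = {"DRIVE": "DR", "STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def normalize_address(raw: str) -> str:
    addr = raw.upper()
    addr = re.sub(r"[.,]", " ", addr)   # collapse punctuation to spaces
    addr = SUITE_RE.sub("", addr)       # strip trailing suite designator
    words = [STREET_ABBREVS.get(w, w) for w in addr.split()]
    return " ".join(words)
```

With this sketch, "525 Corporate Drive Suite 200," "525 CORPORATE DR STE 200," and "525 Corporate Drive #200" all normalize to the same string, while "1420 STEAMBOAT SPRINGS RD" passes through untouched because "STE" inside "STEAMBOAT" has no following digit-led token at the end of the address.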

Step 2: The clustering threshold

Once addresses are normalized, the clustering step is straightforward: group all entities sharing the same normalized address plus the same city. The threshold for forming a cluster is three or more uniquely named entities. Both components of that threshold, the count and the uniqueness requirement, are deliberate choices.

Three is the minimum. Two entities at the same address could trivially be a small office shared between a parent company and one subsidiary — not a meaningful pattern. Three or more uniquely named entities at the same physical location represents a pattern worth examining, even if most such patterns turn out to have legitimate explanations.

Uniquely named is the requirement. The "uniqueness" check matters because some entities have multiple SAM registrations under slightly different name variants. Counting these as separate entities would inflate cluster sizes artificially. By requiring three uniquely named legal entities, we ensure each cluster represents three or more genuinely distinct firms rather than three registration variants of the same firm.

The result of this filtering: 67,594 sellable address clusters from 2,684,826 total SAM registrations. The cluster count works out to roughly 2.5 percent of total registrations; the vast majority of entities sit at addresses with no other co-registered entities, or at most one or two others, below the cluster threshold.
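Under those definitions, the clustering step can be sketched as follows. The record fields are illustrative, and the exact-string uniqueness check is a simplification (the text above notes that production counting also has to collapse name variants of the same firm):

```python
from collections import defaultdict

def build_clusters(records, min_unique_names=3):
    """Group records by (normalized address, city); keep groups with
    three or more uniquely named entities."""
    groups = defaultdict(list)
    for rec in records:  # rec: dict with "address", "city", "name"
        groups[(rec["address"], rec["city"])].append(rec)

    clusters = {}
    for key, recs in groups.items():
        # Count distinct names, not raw registrations, so duplicate
        # registrations of one firm do not inflate the cluster size.
        unique_names = {r["name"] for r in recs}
        if len(unique_names) >= min_unique_names:
            clusters[key] = recs
    return clusters
```

An address with three registrations of the same firm plus one other firm would yield only two unique names and fall below the threshold, which is exactly the inflation the uniqueness requirement guards against.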

Step 3: Risk scoring

Each cluster receives an automated risk score on a 0–100 scale. The score is a screening indicator, not a conclusion. A high score means a cluster exhibits more of the patterns that compliance teams typically want to review — not that anything wrong has happened or is happening.

The score combines several weighted factors, with the weights summing to a total possible score of 100. In practice, most clusters score in the 20–40 range. Scores above 60 are uncommon and typically reflect addresses where multiple risk factors stack together. Scores above 80 are rare and warrant priority review, where "review" again means look more closely, not "this is wrong."
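Structurally, this is a weighted additive model capped at 0–100. The factor names and weights below are hypothetical placeholders for illustration only, not the production factors:

```python
# HYPOTHETICAL factor names and weights -- the real factor list and
# weighting are not reproduced here. Weights sum to 100.
FACTOR_WEIGHTS = {
    "cluster_size": 30,
    "name_similarity": 30,
    "registration_velocity": 40,
}

def risk_score(factors):
    """factors: dict mapping factor name -> normalized value in [0.0, 1.0].
    Missing factors contribute zero; the result is capped at 100."""
    raw = sum(FACTOR_WEIGHTS[name] * factors.get(name, 0.0)
              for name in FACTOR_WEIGHTS)
    return min(100, round(raw))
```

The additive structure is what produces the "stacking" behavior described above: no single factor can push a cluster past 60 on its own, so high scores require several factors firing at once.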

Step 4: Exclusion cross-referencing

After clustering and scoring, every entity in every cluster is checked against the 167,681 records in the SAM Exclusion List. The exclusion list is the federal government's record of entities barred from contracting, and a positive match is the most consequential finding our methodology produces.

Two matching methods run in parallel against the exclusion records, and an entity is flagged if either method fires.

A confirmed exclusion match elevates the cluster status to "confirmed finding." Of 67,594 sellable clusters, 129 contain at least one confirmed match — 0.19 percent overall, but concentrated in the largest contracting markets.
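The two parallel methods are not named above, so the sketch below assumes a common pairing: an exact UEI match plus a normalized-name match against the exclusion records. The function and field names are illustrative:

```python
def cross_reference(cluster_entities, exclusions):
    """Check every entity in a cluster against exclusion records.
    Returns (entity, match_type) pairs for any hits."""
    # Build lookup sets once; some exclusion records lack a UEI.
    excluded_ueis = {e["uei"] for e in exclusions if e.get("uei")}
    excluded_names = {e["name"].upper().strip() for e in exclusions}

    matches = []
    for ent in cluster_entities:
        if ent.get("uei") in excluded_ueis:
            matches.append((ent, "uei_match"))      # exact identifier hit
        elif ent["name"].upper().strip() in excluded_names:
            matches.append((ent, "name_match"))     # normalized-name hit
    return matches
```

An identifier match is the stronger signal; a name-only match on 167,681 exclusion records is more prone to collisions and would typically need manual confirmation before a cluster is elevated to "confirmed finding."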

What the methodology cannot tell you

It is worth being explicit about the limits of the method. Address clustering is a screening tool, not an investigative conclusion. It can tell you:

- how many uniquely named entities are registered at the same normalized address and city
- how a cluster's automated risk score compares with other clusters
- whether any entity in a cluster matches a record on the SAM Exclusion List

It cannot tell you:

- why the entities share an address
- whether the entities are related to one another in any ownership or operational sense
- whether any entity has done anything improper

The methodology exists to provide screening signal: to surface patterns that warrant standard review and to give compliance teams the geographic context that single-entity SAM lookups cannot provide. Patterns identified may have legitimate explanations, including normal corporate structures, shared office buildings, registered agent services, business incubators, and standard business practices. See our separate discussion of legitimate co-location patterns for more.

Why DISTINCT and deduplication matter

One technical detail that turns out to be critical: every entity query in the underlying system uses SQL DISTINCT and is wrapped in a Python deduplication function. This sounds pedantic, but it matters because some entities have multiple SAM registrations — one current and several historical, or the same entity registered as both an active record and an expired record. Without DISTINCT, the same UEI would appear multiple times in cluster results, inflating entity counts and producing duplicate rows in the CSV export.

The DISTINCT requirement runs at two levels. At the database level, the SQL query selects DISTINCT entities by UEI. At the application level, a Python dedupe_entities() function provides a safety net to catch any duplicates that survive the database query. The combination ensures that entity counts in the published reports reflect actual unique entities, not duplicate registrations of the same entity.
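A minimal sketch of that application-level safety net, assuming deduplication keys on UEI and preserves input order; the production dedupe_entities() may differ in detail:

```python
def dedupe_entities(entities):
    """Drop duplicate registrations of the same entity, keeping the
    first occurrence of each UEI and preserving input order."""
    seen = set()
    unique = []
    for ent in entities:  # ent: dict with a "uei" key
        if ent["uei"] not in seen:
            seen.add(ent["uei"])
            unique.append(ent)
    return unique
```

Keeping the first occurrence matters when the query orders records so that the active registration sorts ahead of expired ones; the dedupe then retains the current record and drops the historical duplicates.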

This may sound obvious, but it is the kind of detail that distinguishes production-quality data from quick analysis. Independent reviews of our methodology specifically called out duplicate entity rows as a quality issue in early versions of the platform — and the fix was to enforce DISTINCT and deduplication everywhere entity data flows. The published reports today contain zero duplicate entities.

Quality assurance

Every step of the methodology has been independently reviewed and refined through multiple iterations. Key quality controls include:

- Independent review of the methodology, with review findings (such as the duplicate entity rows noted above) driving fixes
- DISTINCT and deduplication enforced at both the database and application levels
- Word-boundary and trailing-position requirements in the address normalization patterns
- Uniqueness checks on entity names before counting cluster membership

For more on the underlying technical implementation, see our full methodology page.