Many organizations struggle with duplicate customer sites and accounts scattered across operational systems. These duplicates introduce confusion in invoicing, shipping, reporting, and analytics. The immediate need to resolve this data fragmentation often outpaces the availability of enterprise-wide tooling.
This post outlines how we built a pragmatic, SQL-based deduplication framework to identify, cluster, and prepare duplicate records for cleanup—without relying on a formal data governance platform. We walk through the architecture, logic, and lessons learned while keeping the implementation grounded in widely available tools and query techniques.
Step 1: Fuzzy Matching and Site Clustering Using SAS
The first step was to detect likely duplicate customer sites based on attributes such as address, postal code, business purpose, and customer name. Using SAS Data Quality functions, we configured fuzzy matching rules and similarity thresholds to create a flat output of matched site pairs and cluster assignments.
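The matchcode generation itself happened in SAS, but the intuition translates to SQL. The sketch below is only a rough analogue, not the production rules: it pairs sites that share a postal code and whose customer names score highly on SQL Server's DIFFERENCE() function, and the CustomerName and PostalCode columns are assumed for illustration.
-- Rough analogue only; production matching used SAS Data Quality functions and thresholds.
SELECT
    a.SiteID AS SiteA,
    b.SiteID AS SiteB,
    DIFFERENCE(a.CustomerName, b.CustomerName) AS NameSimilarity   -- 0 (no match) to 4 (strong)
FROM Sites AS a
JOIN Sites AS b
    ON a.PostalCode = b.PostalCode
   AND a.SiteID < b.SiteID                                         -- skip self-pairs and mirrored pairs
WHERE DIFFERENCE(a.CustomerName, b.CustomerName) >= 3;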
Each row in the resulting dataset linked a site ID to a Cluster ID—representing a group of sites that should be treated as potentially duplicative.
Step 2: Load Results into an Analytical Database
We then loaded the matching output into a dedicated SQL Server environment designed for analytical processing. This environment consisted of a few core tables:
- Sites: Basic customer site attributes
- Accounts: Top-level customer records
- MatchResults: SAS output including match scores and cluster assignments
- ClusterView: A derived view tying site-level clusters back to their associated accounts
This layer allowed us to inspect, refine, and build upon the matching results using SQL.
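Of these, ClusterView is the only derived object. A minimal sketch of how it might be defined, assuming Sites carries an AccountNumber column linking each site to its account:
-- Sketch only: ties each matched site (and its cluster) back to its owning account.
CREATE VIEW ClusterView AS
SELECT
    m.ClusterID,
    m.MatchScore,
    s.SiteID,
    a.AccountNumber
FROM MatchResults AS m
JOIN Sites    AS s ON s.SiteID = m.SiteID
JOIN Accounts AS a ON a.AccountNumber = s.AccountNumber;   -- assumes Sites holds the account key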
Step 3: Build Clusters and Super Clusters with SQL Logic
While the initial clusters grouped sites together, it quickly became clear that related accounts often shared sites or belonged to overlapping clusters. We extended the logic by introducing:
- DuplicateGroup: A refined group of duplicate sites
- SuperCluster: A roll-up of accounts that shared multiple duplicate groups
Using SQL window functions and DISTINCT self-joins, we connected sites to accounts and accounts to each other, uncovering multi-account relationships that had not previously been visible.
-- Roll accounts up: the smallest ClusterID an account appears in becomes its SuperClusterID
SELECT
    AccountNumber,
    MIN(ClusterID) OVER (PARTITION BY AccountNumber) AS SuperClusterID
FROM MatchResults;
This logic established which accounts could safely be viewed as belonging to the same logical customer entity, even if they were set up separately in the ERP.
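A rough sketch of that account-to-account step, assuming a derived relation AccountGroups(AccountNumber, DuplicateGroup) built from the refined groups. A single pass like this only links directly related accounts; fully transitive super clusters would take a recursive or iterative extension.
-- AccountGroups is assumed to hold one row per (AccountNumber, DuplicateGroup) pair.
SELECT
    a.AccountNumber,
    MIN(b.AccountNumber) AS SuperClusterID        -- smallest related account labels the super cluster
FROM AccountGroups AS a
JOIN AccountGroups AS b
    ON a.DuplicateGroup = b.DuplicateGroup        -- related through a shared duplicate group
GROUP BY a.AccountNumber;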
Step 4: Apply Ranking Logic to Identify Survivors
Once clusters were formed, we needed to determine which site in each group would serve as the “survivor” record. Ranking logic was applied using DENSE_RANK() to evaluate:
- Site status (active vs. inactive)
- Flags indicating primary usage
- Most recent invoice date
- Site ID (as a final tie-breaker)
-- Rank sites within each duplicate group and site use; rank 1 becomes the survivor
DENSE_RANK() OVER (
    PARTITION BY DuplicateGroup, SiteUse
    ORDER BY SiteStatus, IsPrimary DESC, LastInvoiceDate DESC, SiteID DESC
) AS SiteRank
Based on this rank, we assigned one of three actions to each site (a SQL sketch follows the list):
- Migrate: Keep as the active site and migrate others into it
- Merge: Deactivate or consolidate into the survivor
- Leave As-Is: If not part of a duplicate group
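Tying the ranking to the action list, the sketch below wraps the DENSE_RANK() expression in a CTE and maps each rank to an action code. The source table (ClusterView), the NULL DuplicateGroup convention for unmatched sites, and the numeric codes (1 marking the survivor, matching the merge query in the next step) are assumptions for illustration.
WITH RankedSites AS (
    SELECT
        SiteID,
        DuplicateGroup,
        DENSE_RANK() OVER (
            PARTITION BY DuplicateGroup, SiteUse
            ORDER BY SiteStatus, IsPrimary DESC, LastInvoiceDate DESC, SiteID DESC
        ) AS SiteRank
    FROM ClusterView                          -- assumed to carry site attributes and DuplicateGroup
)
SELECT
    SiteID,
    CASE
        WHEN DuplicateGroup IS NULL THEN 3    -- Leave As-Is: never matched into a duplicate group
        WHEN SiteRank = 1 THEN 1              -- Migrate: the survivor of its group
        ELSE 2                                -- Merge: consolidate into the survivor
    END AS Action
FROM RankedSites;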
Step 5: Generate Merge Files for Operational Action
Once survivor and victim designations were made, we produced merge files that paired each survivor with its associated victims. These files could be used for manual review or integration into downstream cleanup processes.
SELECT
    Survivor.SiteID AS SurvivorID,
    Victim.SiteID AS VictimID,
    'Y' AS DeleteAfterMergeFlag,
    'De-duplication Merge' AS MergeReason
FROM SiteClusters AS Survivor
JOIN SiteClusters AS Victim
    ON Survivor.DuplicateGroup = Victim.DuplicateGroup
   AND Victim.Action <> 1       -- victims: every other site in the group
WHERE Survivor.Action = 1;      -- survivors: the top-ranked site in each duplicate group
In some cases, filters were added to exclude sites with recent orders or open transactions, allowing low-risk merges to proceed first.
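A sketch of one such low-risk filter on the victim side, assuming SiteClusters carries LastInvoiceDate and that open transactions sit in an Orders table with a Status column (both names are placeholders):
SELECT Victim.SiteID
FROM SiteClusters AS Victim
WHERE Victim.Action <> 1
  AND Victim.LastInvoiceDate < DATEADD(DAY, -180, GETDATE())   -- no invoices in roughly six months
  AND NOT EXISTS (                                             -- and no open transactions
        SELECT 1
        FROM Orders AS o
        WHERE o.SiteID = Victim.SiteID
          AND o.Status = 'OPEN'
      );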
Step 6: Toward Operationalization with Stored Procedures
With the logic established and tested, we packaged the steps into parameterized stored procedures. These routines supported:
- On-demand deduplication
- Scheduled clustering runs
- Filtering by account type, transaction history, or site usage
- Audit trail creation for every merge recommendation
This modular approach allowed for gradual adoption and fine-tuning, enabling operational teams to take action with confidence.
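A minimal sketch of what one such routine might look like; the procedure name, the MergeAudit table, and the AccountType and LastInvoiceDate columns are placeholders rather than the actual implementation:
CREATE PROCEDURE usp_RunDeduplication
    @AccountType VARCHAR(50) = NULL,
    @MinDaysSinceLastInvoice INT = 180
AS
BEGIN
    SET NOCOUNT ON;

    -- Every recommendation is written to an audit table with a timestamp.
    INSERT INTO MergeAudit (SurvivorID, VictimID, MergeReason, RecommendedAt)
    SELECT
        Survivor.SiteID,
        Victim.SiteID,
        'De-duplication Merge',
        SYSDATETIME()
    FROM SiteClusters AS Survivor
    JOIN SiteClusters AS Victim
        ON Survivor.DuplicateGroup = Victim.DuplicateGroup
       AND Victim.Action <> 1
    WHERE Survivor.Action = 1
      AND (@AccountType IS NULL OR Victim.AccountType = @AccountType)
      AND Victim.LastInvoiceDate < DATEADD(DAY, -@MinDaysSinceLastInvoice, GETDATE());
END;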
Conclusion: A Foundation for Scalable Data Governance
Although this system was built with off-the-shelf tools and SQL logic, it provided meaningful business impact—reducing redundancy, improving reporting accuracy, and preparing the data landscape for future initiatives.
When a mature Master Data Management (MDM) platform is eventually adopted, this SQL-based system and the lessons learned from it will provide the foundation. The clustering models, survivorship logic, and deduplication patterns developed here serve as a blueprint for scaling data quality efforts enterprise-wide.