Many organizations struggle with duplicate customer sites and accounts scattered across operational systems. These duplicates introduce confusion in invoicing, shipping, reporting, and analytics. The immediate need to resolve this data fragmentation often outpaces the availability of enterprise-wide tooling.
This post outlines how we built a pragmatic, SQL-based deduplication framework to identify, cluster, and prepare duplicate records for cleanup—without relying on a formal data governance platform. We walk through the architecture, logic, and lessons learned while keeping the implementation grounded in widely available tools and query techniques.
Step 1: Fuzzy Matching and Site Clustering Using SAS
The first step was to detect likely duplicate customer sites based on attributes such as address, postal code, business purpose, and customer name. Using SAS Data Quality functions, we configured fuzzy matching rules and similarity thresholds to create a flat output of matched site pairs and cluster assignments.
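The matchcode generation itself happened in SAS, but the intuition translates to SQL. The sketch below is only a rough analogue, not the production rules: it pairs sites that share a postal code and whose customer names score highly on SQL Server's DIFFERENCE() function, and the CustomerName and PostalCode columns are assumed for illustration.
-- Rough analogue only; production matching used SAS Data Quality functions and thresholds.
SELECT
    a.SiteID AS SiteA,
    b.SiteID AS SiteB,
    DIFFERENCE(a.CustomerName, b.CustomerName) AS NameSimilarity   -- 0 (no match) to 4 (strong)
FROM Sites AS a
JOIN Sites AS b
    ON a.PostalCode = b.PostalCode
   AND a.SiteID < b.SiteID                                         -- skip self-pairs and mirrored pairs
WHERE DIFFERENCE(a.CustomerName, b.CustomerName) >= 3;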
Each row in the resulting dataset linked a site ID to a Cluster ID—representing a group of sites that should be treated as potentially duplicative.
Step 2: Load Results into an Analytical Database
We then loaded the matching output into a dedicated SQL Server environment designed for analytical processing. This environment consisted of a few core tables:
- Sites: Basic customer site attributes
- Accounts: Top-level customer records
- MatchResults: SAS output including match scores and cluster assignments
- ClusterView: A derived view tying site-level clusters back to their associated accounts
This layer allowed us to inspect, refine, and build upon the matching results using SQL.
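Of these, ClusterView is the only derived object. A minimal sketch of how it might be defined, assuming Sites carries an AccountNumber column linking each site to its account:
-- Sketch only: ties each matched site (and its cluster) back to its owning account.
CREATE VIEW ClusterView AS
SELECT
    m.ClusterID,
    m.MatchScore,
    s.SiteID,
    a.AccountNumber
FROM MatchResults AS m
JOIN Sites    AS s ON s.SiteID = m.SiteID
JOIN Accounts AS a ON a.AccountNumber = s.AccountNumber;   -- assumes Sites holds the account key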
Step 3: Build Clusters and Super Clusters with SQL Logic
While the initial clusters grouped sites together, it quickly became clear that related accounts often shared sites or belonged to overlapping clusters. We extended the logic by introducing:
- DuplicateGroup: A refined group of duplicate sites
- SuperCluster: A roll-up of accounts that shared multiple duplicate groups
Using SQL window functions and DISTINCT self-joins, we connected sites to accounts and accounts to each other, uncovering multi-account relationships that had not previously been visible.
-- Roll accounts up: the smallest ClusterID an account appears in becomes its SuperClusterID
SELECT
    AccountNumber,
    MIN(ClusterID) OVER (PARTITION BY AccountNumber) AS SuperClusterID
FROM MatchResults;
This logic established which accounts could safely be viewed as belonging to the same logical customer entity, even if they were set up separately in the ERP.
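A rough sketch of that account-to-account step, assuming a derived relation AccountGroups(AccountNumber, DuplicateGroup) built from the refined groups. A single pass like this only links directly related accounts; fully transitive super clusters would take a recursive or iterative extension.
-- AccountGroups is assumed to hold one row per (AccountNumber, DuplicateGroup) pair.
SELECT
    a.AccountNumber,
    MIN(b.AccountNumber) AS SuperClusterID        -- smallest related account labels the super cluster
FROM AccountGroups AS a
JOIN AccountGroups AS b
    ON a.DuplicateGroup = b.DuplicateGroup        -- related through a shared duplicate group
GROUP BY a.AccountNumber;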
Step 4: Apply Ranking Logic to Identify Survivors
Once clusters were formed, we needed to determine which site in each group would serve as the “survivor” record. Ranking logic was applied using DENSE_RANK() to evaluate:
- Site status (active vs. inactive)
- Flags indicating primary usage
- Most recent invoice date
- Site ID (as a final tie-breaker)
-- Rank sites within each duplicate group and site use; rank 1 becomes the survivor
DENSE_RANK() OVER (
    PARTITION BY DuplicateGroup, SiteUse
    ORDER BY SiteStatus, IsPrimary DESC, LastInvoiceDate DESC, SiteID DESC
) AS SiteRank
Based on this rank, we assigned one of three actions to each site (a SQL sketch follows the list):
- Migrate: Keep as the active site and migrate others into it
- Merge: Deactivate or consolidate into the survivor
- Leave As-Is: If not part of a duplicate group
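Tying the ranking to the action list, the sketch below wraps the DENSE_RANK() expression in a CTE and maps each rank to an action code. The source table (ClusterView), the NULL DuplicateGroup convention for unmatched sites, and the numeric codes (1 marking the survivor, matching the merge query in the next step) are assumptions for illustration.
WITH RankedSites AS (
    SELECT
        SiteID,
        DuplicateGroup,
        DENSE_RANK() OVER (
            PARTITION BY DuplicateGroup, SiteUse
            ORDER BY SiteStatus, IsPrimary DESC, LastInvoiceDate DESC, SiteID DESC
        ) AS SiteRank
    FROM ClusterView                          -- assumed to carry site attributes and DuplicateGroup
)
SELECT
    SiteID,
    CASE
        WHEN DuplicateGroup IS NULL THEN 3    -- Leave As-Is: never matched into a duplicate group
        WHEN SiteRank = 1 THEN 1              -- Migrate: the survivor of its group
        ELSE 2                                -- Merge: consolidate into the survivor
    END AS Action
FROM RankedSites;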
Step 5: Generate Merge Files for Operational Action
Once survivor and victim designations were made, we produced merge files that paired each survivor with its associated victims. These files could be used for manual review or integration into downstream cleanup processes.
SELECT
    Survivor.SiteID AS SurvivorID,
    Victim.SiteID AS VictimID,
    'Y' AS DeleteAfterMergeFlag,
    'De-duplication Merge' AS MergeReason
FROM SiteClusters AS Survivor
JOIN SiteClusters AS Victim
    ON Survivor.DuplicateGroup = Victim.DuplicateGroup
   AND Victim.Action <> 1       -- victims: every other site in the group
WHERE Survivor.Action = 1;      -- survivors: the top-ranked site in each duplicate group
In some cases, filters were added to exclude sites with recent orders or open transactions, allowing low-risk merges to proceed first.
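A sketch of one such low-risk filter on the victim side, assuming SiteClusters carries LastInvoiceDate and that open transactions sit in an Orders table with a Status column (both names are placeholders):
SELECT Victim.SiteID
FROM SiteClusters AS Victim
WHERE Victim.Action <> 1
  AND Victim.LastInvoiceDate < DATEADD(DAY, -180, GETDATE())   -- no invoices in roughly six months
  AND NOT EXISTS (                                             -- and no open transactions
        SELECT 1
        FROM Orders AS o
        WHERE o.SiteID = Victim.SiteID
          AND o.Status = 'OPEN'
      );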
Step 6: Toward Operationalization with Stored Procedures
With the logic established and tested, we packaged the steps into parameterized stored procedures. These routines supported:
- On-demand deduplication
- Scheduled clustering runs
- Filtering by account type, transaction history, or site usage
- Audit trail creation for every merge recommendation
This modular approach allowed for gradual adoption and fine-tuning, enabling operational teams to take action with confidence.
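A minimal sketch of what one such routine might look like; the procedure name, the MergeAudit table, and the AccountType and LastInvoiceDate columns are placeholders rather than the actual implementation:
CREATE PROCEDURE usp_RunDeduplication
    @AccountType VARCHAR(50) = NULL,
    @MinDaysSinceLastInvoice INT = 180
AS
BEGIN
    SET NOCOUNT ON;

    -- Every recommendation is written to an audit table with a timestamp.
    INSERT INTO MergeAudit (SurvivorID, VictimID, MergeReason, RecommendedAt)
    SELECT
        Survivor.SiteID,
        Victim.SiteID,
        'De-duplication Merge',
        SYSDATETIME()
    FROM SiteClusters AS Survivor
    JOIN SiteClusters AS Victim
        ON Survivor.DuplicateGroup = Victim.DuplicateGroup
       AND Victim.Action <> 1
    WHERE Survivor.Action = 1
      AND (@AccountType IS NULL OR Victim.AccountType = @AccountType)
      AND Victim.LastInvoiceDate < DATEADD(DAY, -@MinDaysSinceLastInvoice, GETDATE());
END;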
Conclusion: A Foundation for Scalable Data Governance
Although this system was built with off-the-shelf tools and SQL logic, it provided meaningful business impact—reducing redundancy, improving reporting accuracy, and preparing the data landscape for future initiatives.
When a mature Master Data Management (MDM) platform is eventually adopted, this SQL-based system and the lessons learned from it will provide the foundation. The clustering models, survivorship logic, and deduplication patterns developed here serve as a blueprint for scaling data quality efforts enterprise-wide.