Detecting Duplicate Uploads

Humans are human. Often a sample will get uploaded multiple times (perhaps the spouse uploaded the same data). Duplicate sample data can generate bias. In this post I will look at detecting duplicates and then eliminating all of them except for one (the last one uploaded). This is a recommended step before computing statistics as discussed in this prior post.

So how do you detect duplicate dataset easily? As a long time SQL developer I can think of some complex left joins and other pain, but as a statistician there is a simple elegant cheat – use standard deviation of the count column to create a hash that uniquely (at least until you get a few billion samples) identifies a sample.

Create Proc RemoveDuplicateUploads As
Declare @Rows int =1
Declare @ToDelete as Table (LabResultsId int)
While @Rows > 0
	Insert into @ToDelete(LabResultsId)
	Select Min(OldestId) From (
		SELECT R.[LabTestId],Min(T.LabResultsId) as oldestId
		, Cast(1000 * STDEV([BaseOneMillion]) as Int) as Hash
		FROM  [dbo].[LabResultTaxon] T (NOLOCK)
		JOIN [dbo].[LabResults] R (NOLOCK)
		ON T.LabResultsId=R.LabResultsId
		Group By R.LabTestId ) HashTable
	Group by Hash, LabTestId Having Count(1) > 1
	-- We cycle until we have deleted all
	-- We need to delete the taxon before the test
	Delete from [dbo].[LabResultTaxon] Where LabResultsId in (Select LabResultsId from @ToDelete)
	--Now delete the test
	Delete from [dbo].[LabResults] Where LabResultsId in (Select LabResultsId from @ToDelete)

The steps explained:

  • First we get a hash for each test’s measurements (as Hash)
  • Then we find all of the test data for a specific test that have the identical Hash and we have 2 or more tests with those values
  • We insert the OLDEST TestId into a table
  • We then delete the Taxon level data
  • We then delete the Test level data (we can’t delete the test and it’s data is deleted).
  • Once that is done, we repeat because there may be more duplicates – remember we only removed the oldest one.

That’s it. A needed cleanup of donated data to keep our data clean and unique.

5 thoughts on “Detecting Duplicate Uploads

    1. If you want to delete the latest, just change it to Max() instead of Min().
      My observation is that people uploaded to the wrong email etc and attempt to make a correction. Thus the latest is selected.
      Keeping the oldest means it is impossible for someone to correct!

Comments are closed.