Detecting Duplicate Uploads

Humans are human. Often a sample will get uploaded multiple times (perhaps the spouse uploaded the same data). Duplicate sample data can generate bias. In this post I will look at detecting duplicates and then eliminating all of them except for one (the last one uploaded). This is a recommended step before computing statistics as discussed in this prior post.

So how do you detect duplicate dataset easily? As a long time SQL developer I can think of some complex left joins and other pain, but as a statistician there is a simple elegant cheat – use standard deviation of the count column to create a hash that uniquely (at least until you get a few billion samples) identifies a sample.

Create Proc RemoveDuplicateUploads As
Declare @Rows int =1
Declare @ToDelete as Table (LabResultsId int)
While @Rows > 0
BEGIN
	Insert into @ToDelete(LabResultsId)
	Select Min(OldestId) From (
		SELECT R.[LabTestId],Min(T.LabResultsId) as oldestId
		, Cast(1000 * STDEV([BaseOneMillion]) as Int) as Hash
		FROM  [dbo].[LabResultTaxon] T (NOLOCK)
		JOIN [dbo].[LabResults] R (NOLOCK)
		ON T.LabResultsId=R.LabResultsId
		Group By R.LabTestId ) HashTable
	Group by Hash, LabTestId Having Count(1) > 1
	-- We cycle until we have deleted all
	SET @Rows=@@ROWCOUNT
	-- We need to delete the taxon before the test
	Delete from [dbo].[LabResultTaxon] Where LabResultsId in (Select LabResultsId from @ToDelete)
	--Now delete the test
	Delete from [dbo].[LabResults] Where LabResultsId in (Select LabResultsId from @ToDelete)
 END

The steps explained:

First we get a hash for each test’s measurements (as Hash)
Then we find all of the test data for a specific test that have the identical Hash and we have 2 or more tests with those values
We insert the OLDEST TestId into a table
We then delete the Taxon level data
We then delete the Test level data (we can’t delete the test and it’s data is deleted).
Once that is done, we repeat because there may be more duplicates – remember we only removed the oldest one.

That’s it. A needed cleanup of donated data to keep our data clean and unique.

5 thoughts on “Detecting Duplicate Uploads”

Why delete the oldest one rather than the newest? The earliest is likely the one done intentionally.

lassesen says:

August 1, 2019 at 9:22 pm

If you want to delete the latest, just change it to Max() instead of Min().
My observation is that people uploaded to the wrong email etc and attempt to make a correction. Thus the latest is selected.
Keeping the oldest means it is impossible for someone to correct!
1. davidmorr says:
  
  August 1, 2019 at 9:27 pm
  
  Well, you would know if anyone would what happens 🙂

Pingback: Populating with Open Data – Microbiome Prescription

Comments are closed.

Microbiome Prescription Blog

A site exploring the microbiome, what it affects and how to manipulate it.

5 thoughts on “Detecting Duplicate Uploads”

Recent Posts

Pages

Reference Material

Recent Comments