Humans are human. Often a sample will get uploaded multiple times (perhaps the spouse uploaded the same data). Duplicate sample data can generate bias. In this post I will look at detecting duplicates and then eliminating all of them except for one (the last one uploaded). This is a recommended step before computing statistics as discussed in this prior post.
So how do you detect duplicate dataset easily? As a long time SQL developer I can think of some complex left joins and other pain, but as a statistician there is a simple elegant cheat – use standard deviation of the count column to create a hash that uniquely (at least until you get a few billion samples) identifies a sample.
Create Proc RemoveDuplicateUploads As Declare @Rows int =1 Declare @ToDelete as Table (LabResultsId int) While @Rows > 0 BEGIN Insert into @ToDelete(LabResultsId) Select Min(OldestId) From ( SELECT R.[LabTestId],Min(T.LabResultsId) as oldestId , Cast(1000 * STDEV([BaseOneMillion]) as Int) as Hash FROM [dbo].[LabResultTaxon] T (NOLOCK) JOIN [dbo].[LabResults] R (NOLOCK) ON T.LabResultsId=R.LabResultsId Group By R.LabTestId ) HashTable Group by Hash, LabTestId Having Count(1) > 1 -- We cycle until we have deleted all SET @Rows=@@ROWCOUNT -- We need to delete the taxon before the test Delete from [dbo].[LabResultTaxon] Where LabResultsId in (Select LabResultsId from @ToDelete) --Now delete the test Delete from [dbo].[LabResults] Where LabResultsId in (Select LabResultsId from @ToDelete) END
The steps explained:
- First we get a hash for each test’s measurements (as Hash)
- Then we find all of the test data for a specific test that have the identical Hash and we have 2 or more tests with those values
- We insert the OLDEST TestId into a table
- We then delete the Taxon level data
- We then delete the Test level data (we can’t delete the test and it’s data is deleted).
- Once that is done, we repeat because there may be more duplicates – remember we only removed the oldest one.
That’s it. A needed cleanup of donated data to keep our data clean and unique.
Why delete the oldest one rather than the newest? The earliest is likely the one done intentionally.
If you want to delete the latest, just change it to Max() instead of Min().
My observation is that people uploaded to the wrong email etc and attempt to make a correction. Thus the latest is selected.
Keeping the oldest means it is impossible for someone to correct!
Well, you would know if anyone would what happens 🙂