Humans are human. Often a sample will get uploaded multiple times (perhaps a spouse uploaded the same data). Duplicate sample data can bias the statistics. In this post I will look at detecting duplicates and then eliminating all of them except one (the last one uploaded). This is a recommended step before computing statistics, as discussed in this prior post.
So how do you detect duplicate datasets easily? As a long-time SQL developer I can think of some complex left joins and other pain, but as a statistician there is a simple, elegant cheat: use the standard deviation of the count column to create a hash that uniquely identifies a sample (at least until you get a few billion samples).
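To make the trick concrete, here is a minimal sketch of the hash for a single uploaded result. It assumes (as the join in the procedure below suggests) that the BaseOneMillion counts live in LabResultTaxon, and the specific @LabResultsId value is just a hypothetical example:

-- Hash of one uploaded result: the scaled standard deviation of its counts.
-- Identical uploads of the same sample produce identical hashes, while
-- different samples are very unlikely to collide at this scale.
DECLARE @LabResultsId INT = 12345;   -- a hypothetical upload id

SELECT CAST(1000 * STDEV([BaseOneMillion]) AS INT) AS Hash
FROM   [dbo].[LabResultTaxon]
WHERE  LabResultsId = @LabResultsId;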
Create Proc RemoveDuplicateUploads As
Declare @Rows int = 1
Declare @ToDelete as Table (LabResultsId int)

While @Rows > 0
BEGIN
    -- Queue the oldest result of every duplicate group for deletion
    Insert into @ToDelete (LabResultsId)
    Select Min(OldestId)
    From (
        SELECT R.[LabTestId],
               Min(T.LabResultsId) as OldestId,
               Cast(1000 * STDEV([BaseOneMillion]) as Int) as Hash
        FROM [dbo].[LabResultTaxon] T (NOLOCK)
        JOIN [dbo].[LabResults] R (NOLOCK) ON T.LabResultsId = R.LabResultsId
        -- Hash each uploaded result on its own (grouping by the result id as
        -- well as the test id) so duplicate uploads can be compared
        Group By R.LabTestId, T.LabResultsId
    ) HashTable
    Group by Hash, LabTestId
    Having Count(1) > 1
    -- We cycle until we have deleted all duplicates
    SET @Rows = @@ROWCOUNT

    -- We need to delete the taxon data before the test
    Delete from [dbo].[LabResultTaxon]
    Where LabResultsId in (Select LabResultsId from @ToDelete)

    -- Now delete the test
    Delete from [dbo].[LabResults]
    Where LabResultsId in (Select LabResultsId from @ToDelete)
END
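With the procedure in place, the cleanup is a single call. A simple sanity check, which is my own sketch rather than part of the original procedure, is to compare row counts before and after:

-- Count results, run the de-duplication, count again
SELECT COUNT(*) AS ResultsBefore FROM [dbo].[LabResults];
EXEC dbo.RemoveDuplicateUploads;
SELECT COUNT(*) AS ResultsAfter  FROM [dbo].[LabResults];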
The steps explained:
- First we compute a hash for each uploaded result’s measurements (the Hash column)
- Then, within each test, we find groups of results that share an identical Hash, i.e. two or more uploads with the same values (a read-only version of this detection query is sketched after this list)
- We insert the oldest LabResultsId from each duplicate group into a table variable
- We then delete the taxon-level data for those ids
- We then delete the test-level rows (we can’t delete a test until its data has been deleted)
- Once that is done, we repeat because there may still be duplicates; remember, each pass removes only the oldest copy in each group
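If you want to preview what would be removed before running the procedure, the detection step can be run as a read-only query. The sketch below reuses the same tables and hash as the procedure; PerResult is just a name I chose for the derived table. It lists each duplicate group along with the oldest result id that the next pass would delete:

-- Dry run: list duplicate groups and the oldest LabResultsId in each,
-- i.e. the row the next pass of RemoveDuplicateUploads would remove
SELECT LabTestId,
       Hash,
       COUNT(*)          AS Copies,
       MIN(LabResultsId) AS WouldDeleteNext
FROM (
    SELECT R.LabTestId,
           T.LabResultsId,
           CAST(1000 * STDEV(T.[BaseOneMillion]) AS INT) AS Hash
    FROM [dbo].[LabResultTaxon] T (NOLOCK)
    JOIN [dbo].[LabResults] R (NOLOCK) ON T.LabResultsId = R.LabResultsId
    GROUP BY R.LabTestId, T.LabResultsId
) AS PerResult
GROUP BY LabTestId, Hash
HAVING COUNT(*) > 1;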
That’s it: a needed cleanup that keeps the donated data clean and unique.