Populating with Open Data

I have done open sharing of data at http://lassesen.com/ubiome in formats designed for data science import and processing. In this post, I taken some of this data and make it uploadable to our new database schema. I will also work on utilities to allow data to be exported and shared going forward. If you get data donated on your site (AND you obtain consent for sharing the data anonymously) remember to share it with other citizen scientists.

There are a series of tables that needed to be exported and then imported. The contents of the export is illustrated below. Note: Guid in the export will be converted to an Id by the import.

To deal with some issues (and to support interchange of data between people using this open source, I added two GUID columns. This means that every sample uploaded to a site, or report given to a site is unique and may be exchange between sites without duplicates. If the same microbiome report is uploaded to multiple sites, then there will be duplicates — until you de-duplicate the data (see this earlier post).

The process is simple:

  • Bring in all of the reference data (Continuous, Category, Labs, LabTests definitions) using their names as keys to prevent duplication
  • Import the lab and other results — using SyncGuid to prevent duplication.

This is done by passing the XML from one site to a stored procedure (using Datasets) and then processing it. It is important to test the code to insure that the same data being imported multiple times does not result in extra rows. The TSQL below gives the counts on key tables.

Select 'LabResultTaxon' as TableName, Count(1) as Count from LabResultTaxon
UNION
Select 'LabResults' as TableName, Count(1) as Count from LabResults
UNION
Select 'LabTests' as TableName, Count(1) as Count from LabTests
UNION
Select 'Labs' as TableName, Count(1) as Count from Labs
UNION
Select 'OwnerReport' as TableName, Count(1) as Count from OwnerReport
UNION
Select 'ReportCategory' as TableName, Count(1) as Count from ReportCategory
UNION
Select 'ReportContinuous' as TableName, Count(1) as Count from ReportContinuous
Example – your numbers should be around these

After the import, you should recalculate core statistics for each LabTest type (see this post for a reminder). Every thing has been rolled into the Library, so the program is very short (but the library code and tSQL code is not).

var import = new DataSet();
import.ReadXmlSchema(schemaFile.FullName);
import.ReadXml(exportFile.FullName);
DataInterfaces.ConnectionString = "Server=LAPTOP-....";
DataInterfaces.Import(import);
LabTests.ComputeAllLabs(4);

Source for Importer: https://github.com/Lassesen/Microbiome2/tree/master/Import

Data Location: https://github.com/Lassesen/Microbiome2/tree/master/Import/Data

Updated SQL Server Schema: https://github.com/Lassesen/Microbiome2/tree/master/DBV2

Non Parametric Detection

As we have seen, the microbiome is NOT a normal or bell curve. I struggled for almost a year trying to get statistical significance out of the data with parametric techniques. While I did get some results, the results were disappointing. When I switched to a non-parametric approach, I shouted EUREKA (without becoming a streaker thru town, unlike a certain ancient greek did).

In the last post we dealt with both continuous and category factors associated with a person. In terms of my existing site, using symptom explorer you will see tables such as the one shown below with used 4-quantilies.

In our earlier post on statistics, we saw how we can compute the quantiles for the available taxonomy. In this lesson we will use that data plus a category variable to detect significance as shown above in real time. This means that the results may change as more data is added — to me, this makes it a living research document.

First the Nerd Stuff — Moving to Libraries

For this example, I have consolidated into a library most of the key stuff from prior posts. The class diagram is below. I plan to keep expanding it with future posts.

https://github.com/Lassesen/Microbiome2/tree/master/MicrobiomeLibrary

Computating the non-parametric

This is done by selecting a LabTest (remember that technically we cannot compare uBiome numbers to XenoType numbers to Thryve numbers) and then some Category. I opted not to go down the control group to category group path because with my donated data, it is not reliable. I opted to go down the population to category group path, which while technically less sensitive — it is a reasonable approach.

We need to associate Category and Continuous Reports to Lab Results and this means just adding one new table LabResultReport as shown below, it links two timeline items together.

From the @LabTestId and @CategoryId we just need to select which quantile to use. Did we divide data into 3,4,5,6,7 etc. buckets. If you look at the prior post, we see that it is easy to select which one, “Q3_”, “Q4_”, etc is the @quantileRoot. We need one more value: @MinSamples – if we do not have at reasonable number, there is almost no change of getting significant. I usually require 4 data points per bin — so Q3_ -> 12, Q4_-> 16, Q5->20.

Passing these number to a stored procedure, we get a dataset back as shown below:

  • Quantiles
    • Taxon
    • Count
    • StatisticsValue
    • StatisticsName (i.e. Q3_1,Q3_2 or Q4_1,Q4_2,Q4_3 etc)
  • User Data
    • Taxon
    • Value
  • Taxon Data
    • Taxon
    • TaxonName

The process is simply counting the data in User Data in each range and then applying some simple statistics to get P Values.

In terms of the calling program, the code is very simple:

var data = DataInterfaces.GetNonParametricCategoryDataSet(1, 1, "Q4_", 20);
var matrix = MicrobiomeLibrary.Statistics.NonParametric.CategoricSignficance(data);
matrix.WriteXml("Matrix.xml");

I just dump the data to a file for simplicity sake. You can open this file via excel to get a nice spreadsheet.

For myself, I wrote a long running (24hrs) program that iterated thru the range of values for Categories (and combinations of categories!) with different quantiles.

https://github.com/Lassesen/Microbiome2/tree/master/ConsoleTestLibrary

Homework

When we work with Continuous variables, we need to convert the ranges into quantiles (just like we did for taxon). This could be done using the ranges we entered, or by breaking into quantiles. Personally, using quantiles would be my preference because too many numbers are not bell/normal curves but are assumed to be just that. I will leave people to do pull requests with their code suggestions.

Connecting the dots…

We have a microbiome, we have lab results, we have official conditions (ICD), we have symptoms. Last, we have substances (for example, probiotics) that modify the microbiome and thus may alter:

  • lab results
  • official condition status (i.e. mild, severe, acute)
  • microbiome
  • symptoms (one symptom may disappear or appear)

Information on expected impact of the above come from medical studies.

The typical question is “What should I take to improve {lab results|symptoms|official diagnosis|microbiome}?” The response should be typically, “Base on study A,B,C,K, you should take X to improve {lab results|symptoms|official diagnosis|microbiome}? “

The answers may come indirectly and may be by inference. For example:

I wish to improve my diabetes.

  • Severity of diabetes is connected with high A bacteria and low B bacteria and high levels of TNF-alpha
  • Substance X has no published studies for diabetes
  • Substance X has published studies for decreasing A and not altering B.
  • Substance Y has published studies for increasing B and not altering A but it does reduce TNF-alpha levels.

The inference is that you should consider taking X and Y to improve your diabetes. In some cases, you may find something like:

I wish to improve my mother’s Alzheimer’s Disease.

  • Severity of Alzheimer’s Disease is connected with high A bacteria and low B bacteria.
  • Substance X and Y has published studies for Alzheimer’s Disease showing positive results
  • Substance X has published studies for decreasing A and not altering B.
  • Substance Y has published studies for increasing B and not altering A.

The database schema below attempts to capture this information from citations (studies).

Let us look at what information may be in a study and map the information to tables (following are made up study results for illustrations)

  • Salted Herrings at 20gm/day improves IBS from Study A
    • Modifier: Salted Herring
    • Citation: A
    • ICDCode: IBS
    • ICDModifierCitation
      • DirectionOfImpact: +1
      • AmountOfImpact: NULL — nothing reported
      • UsageInformation: 20gm/day
  • Same study found TNF-Alpha Increases by 20% above control
    • Confinuous Reference: TNF-Alpha
    • ContinousModifierCitation:
      • DirectionOfInpact: +1
      • Amount of Impact: 1.2 (1 being no change)
      • UsageInformation: 20gm/day
  • Same study found Asthma Disappear in 30% of patients
    • CategoryReference: Asthma (Yes or No remember)
      • ContinousModifierCitation:DirectionOfInpact: -1
      • Amount of Impact: 0.8 (1 being no change)
      • UsageInformation: 20gm/day
  • Same study found Sillium bacteria increased in patients
    • TaxonHierarchy: Sillium
      • TaxonModifierCitation:DirectionOfInpact: +1
      • Amount of Impact: nothing reported
      • UsageInformation: 20gm/day

So the results of one study ended up with entries in 4 tables.

We have a lot of possible inferences here:

  • Sillium impacts TNF-Alpha
  • Low Sillium may be associated with Asthma

All of this stuff becomes facts in our Artificial Intelligence/Expert System engine which I will cover in a few weeks.

Alternative Names

Alternative names is actually critical for text mining (i.e. having programs determine if there is important data is a study, paragraph or sentence). Studies may use a multitude of names for the same thing. For example, you may decided to use the latin name for herbs, Hypericum perforatum and then have the alternative names “St. John Wart” and “Saint John Wart”. The alternative names should be unique, hence the unique index is placed on this column.

Bottom Line

Above is the full solution, I have only partially implemented it and the only one of the table that I have been populating has been TaxonModifierCitation. Readers have asked question about TNF-Alpha, Interleukin 10 (IL-10), also known as human cytokine synthesis inhibitory factor (CSIF). My own resources could only stretched to review and processing this table. Ideally, a crowd-source efforts (or a wealthy patron to have Ph.D. students) would allow the full solution to be populated.

Other Medical Properties

Where are we?

  • We have implemented a microbiome reference from ncbi data
  • We have implemented a medical condition reference from ICD
  • We have implement the ability to store personal microbiome values.
  • We have implement the computations on microbiome values (including non-parametrics, i.e. quantiles).

In this post we will look at storing personal medical conditions, lab results and symptoms. Medical parameters change over time, we need to date when these parameters is in effect (just like we date when microbiomes are done). All of the information at a particular time is a Report. I have my draft database diagram below (the SQL is uploaded at

https://github.com/Lassesen/Microbiome2/tree/master/DBV2

We are also going to add two more reference types which are up the user to define. There is no reference sources for these like we have for ICD and NCBI.

At the data modelling level these reference types are:

  • Categoric, effectively Yes or No
    • Do I have this ICD diagnosis?
    • Do I have this symptoms?
    • Am I a male or female?
  • Continuous
    • What is my C-Reactive Protein result?
    • What level is my Vitamin 1,25 D
    • How old am I

I call these items: Continuous Factors and Category Factors. For processing simplicity, instead of giving a choice for eye color of {Brown, Hazel, Blue, Green, Gray, Amber etc) – each is an independent choice (in the database). You could group them in the DisplayGroup as ‘Eye color’ and in the user interface allow only one to be chosen (although some people do have different color eyes!)

You will note that ICDCode is an optional column Category because I expect most sites will only use a subset of them. Having an Id (Integer) makes processing a lot easier for some non-parametric techniques.

Also note that we have Associations, for example, you may to wish to associate certain symptoms with IBS or IL-10 levels with Chronic Fatigue Syndrome. Thus if someone selects IBS, you may wish to present a short list of appropriate symptoms to select from (more on that in a later post – I did a patent filing on that approach years ago).

For the above tables we may wish to classify or group items for the sake of display. With this number of reports (from C-Reactive Protein to Brain Fog to DNA mutations) organizing the presentation on a web site can be a challenge.

On my current implementation, you will see that I have broken things into two large groups:

  • Symptoms
  • uBiome metabolism numbers

For symptoms, I ended up adding an additional layer as shown below. There can be poor UI designs – for example, hitting a person with 300 sequential questions often we result in the questions being only partially completed.

One way of handling a UI hierarchy

For common lab tests, I would suggest using Lab Tests Online as a starting point — remember each laboratory uses slightly different processes and have different reference ranges.

Continuous ranges also have ranges of values (with labels, for example on age: Infant 0-2, Toddler 2-5, Child 5-10, PreTeen 11-12 etc; or for other items: High, Normal, Low). The ranges help with interpretation and may often be used for the first cut with non-parametric techniques.

For symptoms, I strongly suggest for any ICD diagnosis, you search the web for symptoms often seen with each condition.

Bottom Line

Unlike prior tables where we can populate from prepared sources, we have to research, select and populate the table by hand. This can be a time consuming process to do right. Similarly, getting the user interface right is also time consuming.

Detecting Duplicate Uploads

Humans are human. Often a sample will get uploaded multiple times (perhaps the spouse uploaded the same data). Duplicate sample data can generate bias. In this post I will look at detecting duplicates and then eliminating all of them except for one (the last one uploaded). This is a recommended step before computing statistics as discussed in this prior post.

So how do you detect duplicate dataset easily? As a long time SQL developer I can think of some complex left joins and other pain, but as a statistician there is a simple elegant cheat – use standard deviation of the count column to create a hash that uniquely (at least until you get a few billion samples) identifies a sample.

Create Proc RemoveDuplicateUploads As
Declare @Rows int =1
Declare @ToDelete as Table (LabResultsId int)
While @Rows > 0
BEGIN
	Insert into @ToDelete(LabResultsId)
	Select Min(OldestId) From (
		SELECT R.[LabTestId],Min(T.LabResultsId) as oldestId
		, Cast(1000 * STDEV([BaseOneMillion]) as Int) as Hash
		FROM  [dbo].[LabResultTaxon] T (NOLOCK)
		JOIN [dbo].[LabResults] R (NOLOCK)
		ON T.LabResultsId=R.LabResultsId
		Group By R.LabTestId ) HashTable
	Group by Hash, LabTestId Having Count(1) > 1
	-- We cycle until we have deleted all
	SET @Rows=@@ROWCOUNT
	-- We need to delete the taxon before the test
	Delete from [dbo].[LabResultTaxon] Where LabResultsId in (Select LabResultsId from @ToDelete)
	--Now delete the test
	Delete from [dbo].[LabResults] Where LabResultsId in (Select LabResultsId from @ToDelete)
 END

The steps explained:

  • First we get a hash for each test’s measurements (as Hash)
  • Then we find all of the test data for a specific test that have the identical Hash and we have 2 or more tests with those values
  • We insert the OLDEST TestId into a table
  • We then delete the Taxon level data
  • We then delete the Test level data (we can’t delete the test and it’s data is deleted).
  • Once that is done, we repeat because there may be more duplicates – remember we only removed the oldest one.

That’s it. A needed cleanup of donated data to keep our data clean and unique.

Computing Statistics

For my last of the 4 installment per weekend, I will look at computing statistics. I have found non-parametrics analysis to get more interesting results, but classic core statistics are also helpful to understand the nature of each taxonomic layer.

Data Schema Update

We are dealing with this section of the schema which has another table added.

For thoses who are not statisticians, terms like Kurtosis may sound like a medical condition and not a statistical measure. I have enclosed links to Wikipedia articles on each of these items below.

The code sample does just one LabTest and one Quantitle. For a production implementation, I would do iterate over

Select LabTestId from LabTests

And quantiles = {3,4,5,6..}

The code is at:
https://github.com/Lassesen/Microbiome2/tree/master/TaxonomyStatistics

Remember to update database schema too.

Bottom Line

That’s it for this weeks installments. Homework on this one is whether you should include Zero (none-found) values or not. This is actually a complex question which may depend on what percentage of samples have a specific taxonomic unit.

Example of uploading a PDF Report

The prior posts dealt with uploading text files – perhaps a csv file or a json file. Sometimes reports are sent as PDF. Today I received a good example of a PDF to show how to upload it.

An example of the file is shown below.

XenoGene S.L., CIF B93406692. NICA 46447. C/ Tampa, nº 2, C.P. 29007 de Málaga. Tel. 951 15 23 23 Fax 95

We use the Taxon Name and Rank process as done in this earlier post. The trick is pretty simple: decompose the file using itext7 or earlier versions. And then create an appropriate data table (same format as in the earlier post), and use the same store procedure.

As before, not found bacteria are written to a file for later analysis of exception. In this case there was only two items not found :-).

<XenoGene>
  <Table>
    <taxon_rank>species</taxon_rank>
    <taxon_name>Burkholderia sp. CCGE1002</taxon_name>
    <BaseOneMillion>100</BaseOneMillion>
  </Table>
  <Table>
    <taxon_rank>species</taxon_rank>
    <taxon_name>tsetse mealybug Leptospira santarosai</taxon_name>
    <BaseOneMillion>100</BaseOneMillion>
  </Table>
</XenoGene>

That’s it. Inspect the source to see the details. Note: For different reports, you will need to build custom parsers.

Source is at: https://github.com/Lassesen/Microbiome2/tree/master/uploadXenoGene

Obtaining Source Citations

Part of being open sources is being open-source on where you are getting information. For myself, my primary criteria is published peer-reviewed scientific papers. You must determine your own criteria (for example, using unpublished studies referenced and summarized in patent filings).

Citations are the connectors between bacteria, probiotics, medical conditions, lab measurements etc. They are the evidence chain of how things are related.

In this post I will describe column by column what I feel should be in your citation table as a minimum:

  • CitationId – unique identifier used in relationships : int
  • Title – The title of the study or document: nvarchar(255)
  • CitationJson – A summary using the PubMed pattern: nvarchar(max)
  • RawText – A pure text version of the document. It will be used for Text Mining and manual inspection : nvarchar(max)
  • RawDocument – A binary version of the document. Often this will be a PDF file or may be an image. varbinary(max)
  • RawDocumentMimeType – The mime type for the RawDocument: varchar(60)
  • Summary – This is typically the PubMed summary for the study: nvarchar(max)
  • PMIDPubMed ID : int
  • PMCID: PubMed Citation Id: varchar(14)
  • DOIDigital Object Identifier (DOI) : varchar(60)

Ideally, a component should extract the RawDocument into RawText. For example, OCR for images and various converters for PDF. This type of conversion is outside of the scope of this blog series. They can be done (I have written such professionally).

Source: https://github.com/Lassesen/Microbiome2/tree/master/CitationsConsole

Below are the class diagrams:

For some background on the PubMed API calls, see this page. The

The following shows the response searching for “Corgi”.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=corgi&retmode=json&retmax=10

{
    "header": {
        "type": "esearch",
        "version": "0.3"
    },
    "esearchresult": {
        "count": "101",
        "retmax": "10",
        "retstart": "0",
        "idlist": [
            "31194653",
            "31172540",
            "30577744",
            "29794371",
            "29681557",
            "29355995",
            "28942296",
            "28933509",
            "28855429",
            "28566164"
        ],
        "translationset": [
        ],
        "translationstack": [
            {
                "term": "corgi[All Fields]",
                "field": "All Fields",
                "count": "101",
                "explode": "N"
            },
            "GROUP"
        ],
        "querytranslation": "corgi[All Fields]"
    }
}

Running it is simple, just give it the name of the condition (it defaults to 100 citations, but you can increase it. I usually run with 10,000 citations at a time for each bacteria taxonomy name).

CitationConsole “Some Unusual Condition”

What is the net result? Running it with “SIBO”, we see in the database.

So we have our data ready for either text mining or manual review.

Home Work

We can now fill up our citation tables, in some cases with the complete text (excluding appendices) as well as a short summary. How are you going to process the data?

Processing this data is where there is a major effort needed.

Importing medical conditions ( ICD tables)

We want to view relationships between microbiome and various medical conditions. There is also a need to know the impact of various probiotics on various conditions.

The International Classification of Disease (ICD) is the equivalent as taxon numbers for bacteria and is a well defined comprehensive list of medical conditions. We may need to add alternative names for some of the conditions to match common usage.

If something is not there (it should be), we can always add our own codes and diagnosis. For example there is no “SIBO” or ” Small intestine bacterial overgrowth” item, scanning on like we see that the code of A04.9 is often used. We would insert that into the ICDAltName table (likely two rows, one for each variation).

First we need to download some files as shown below.

https://www.cms.gov/medicare/coding/icd10/2019-icd-10-cm.html

The file that we wish to use is shown below.

This file is the parameter to pass to the C# console application that will upload and populate the table. Most of the magic happens in SQL.

We end up with tables as shown below. In the next posts we look at connecting conditions to probiotics to studies, and conditions to microbiomes to studies.

Example of the data after uploading is shown below. There are around 44 thousand IDC codes – fortunately, they are in a hierarchy.

The source code is at: https://github.com/Lassesen/Microbiome2/tree/master/UploadICD

You just need to give the location of the two files downloaded above

uploadICD.exe D:\Downloads\2019-ICD-10-CM-Tables-and-Index\icd10cm_index_2019.xml D:\Downloads\2019-ICD-10-CM-Tables-and-Index\icd10cm_tabular_2019.xml

Homework Item

We have some 44 thousand items which are in a hierarchy if you use the code. There are many ways to handle that on a web site — what approach will you do?

Thryve Upload Code has been added

This is just a variation of the uBiome upload because thryve provides the taxon numbers.

Example of their download file

"taxon_id","rank","name","parent","count"
"165179","species","Prevotella copri","Prevotella","36128"
"838","genus","Prevotella","Prevotellaceae","36982"
"171552","family","Prevotellaceae","Bacteroidales","40738"

The data format is automatically detected. Classes in the code converted to Interfaces for eventual consolidation into a DLL containing all of the uploads.

I have not updated the folder name

https://github.com/Lassesen/Microbiome2/tree/master/ubiomeUpload

Remember that the database schema has also been updates.