Technical Notes on Upload Flow

If something looks wrong, one of the following steps may have failed. Try to identify which one, then follow the suggested correction.

Transferring Data from your Computer to the Database

  1. File is uploaded
  2. File is broken down into individual lines
  3. Each line is parsed (broken apart). The following errors have occurred:
    1. The provider changed their file format. We show a sample of what we expect to see on each upload page. If your file does not match, send us the file so we may add a new adapter.
    2. The user read the file into Excel or some other program and then saved it. This can cause tabs to be replaced with commas, along with other format changes that break our adapter. Obtain a fresh copy from your provider and upload it directly.
  4. If the file does not contain NCBI taxon numbers, we proceed to do name matching
    1. Many bacteria have dozens of names; we make a best-effort match
    2. We log names that we have failed to match for periodic review and addition to our lookup database (3,016,747 names at present)
  5. We convert percentages or other measures to counts out of one million (the old uBiome standard)
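
The steps above can be sketched roughly as follows. This is a minimal illustration, not our actual adapter code: the three-column layout (name, taxon, percent), the tiny `NAME_TO_TAXON` table, and the function names are all assumptions made for the example, and it assumes the file has no header row.

```python
# Assumed layout for this sketch: name <TAB> NCBI taxon (may be blank) <TAB> percent
EXPECTED_FIELDS = 3

# Hypothetical stand-in for the name lookup database (the real one is far larger).
NAME_TO_TAXON = {
    "bifidobacterium": 1678,
    "lactobacillus": 1578,
}


def resolve_taxon(name, taxon_field, unmatched):
    """Use the NCBI taxon number when present, else fall back to name matching."""
    if taxon_field.strip().isdigit():
        return int(taxon_field)
    taxon = NAME_TO_TAXON.get(name.strip().lower())
    if taxon is None:
        unmatched.append(name)  # kept for periodic review
    return taxon


def percent_to_per_million(percent):
    """Convert a percentage to a count out of one million (1% = 10,000)."""
    return round(percent * 10_000)


def load_sample(text):
    """Split the upload into lines, parse each line, and build taxon -> count."""
    counts = {}
    unmatched = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != EXPECTED_FIELDS:
            # Typical when the file was re-saved from a spreadsheet program
            # (tabs replaced with commas) or the provider changed the format.
            raise ValueError(
                f"line {line_no}: expected {EXPECTED_FIELDS} tab-separated "
                f"fields, got {len(fields)}"
            )
        name, taxon_field, percent = fields
        taxon = resolve_taxon(name, taxon_field, unmatched)
        if taxon is not None:
            counts[taxon] = percent_to_per_million(float(percent))
    return counts, unmatched
```

With this sketch, a file that was re-saved with commas instead of tabs fails at the first line with a clear message, and names that cannot be matched are collected rather than silently dropped.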

Your Sample Has Been Saved to the Database at This Point

It is possible that some of the processing described below failed. If a table or chart appears wrong, you can rerun each step by hand from the last drop-down on the samples page to try to correct the failure.

If one of the bottom ones reports an error, the error is automatically reported and is usually fixed within 24 hours.
  1. We then do Upload PostProcessing, which consists of the steps below [Note: the source data rarely changes for these; if it does, we do a bulk update of all samples]
    1. Computing a Hash of the upload (so we can identify duplicates – please delete any duplicates)
    2. Update our online bacteria database with any new bacteria not seen before (at present we have 19,590 different bacteria [NCBI numbers] reported from different tests)
    3. We add missing members of the NCBI hierarchy
      1. Some labs do not report the complete hierarchy, or use their own.
      2. We apply NCBI rules to determine the parent of a particular bacterium.
      3. This allows a uniform presentation across many labs.
      4. There have been odd conflicts, where the count of the children of a family exceeds what the lab reported for the family. The root cause is that the lab is using a different hierarchy. We keep the lab’s value, but it can result in some odd Krona charts and other visual presentations.
    4. We identify and count bacteria that are rarely seen in other samples (1%,2%,4%,8%,16%)
    5. We then compute some Health Statistics
      1. GHMI Healthy, UnHealthy
      2. Condition Health
    6. We compute KEGG Module raw numbers
    7. We compute KEGG Product raw numbers
    8. We compute KEGG Enzymes raw numbers
    9. We compute KEGG Substrate raw numbers
    10. We compute KEGG Compound Produced
    11. We compute KEGG Compound Consumed (Substrate)
    12. We compute End Product raw numbers
  2. We now proceed to add percentiles to the above 6 tables (Bacteria Count, End Product, K-Module, K-Product, K-Enzymes, K-Substrate) using distribution tables that are recomputed daily:
    1. Taxonomy Reference
    2. End Product Reference
    3. K-Enzymes Reference
    4. K-Module Reference
    5. K-Product Reference
    6. K-Substrate Reference
    7. K-Compound Produced Reference
    8. K-Compound Consumed Reference
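
Three of the post-processing steps above (the duplicate hash, filling in the hierarchy, and adding percentiles) can be sketched as below. This is only an illustration under stated assumptions: the choice of SHA-256, the line-ending normalisation, the rule that a missing ancestor receives the sum of what is reported beneath it, and the percentile definition (share of reference values at or below the sample value) are all assumptions for the example, and the taxon IDs in the usage note are made up rather than real NCBI numbers.

```python
import bisect
import hashlib


def upload_hash(text):
    """Hash the upload so identical files can be flagged as duplicates."""
    # Assumption: normalise line endings so the same file hashes the same
    # whether it was saved with Windows or Unix line endings.
    normalised = "\n".join(text.splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def fill_hierarchy(lab_counts, parent_of):
    """Add ancestors the lab did not report; keep every value the lab reported.

    lab_counts: taxon -> count per million, as reported by the lab.
    parent_of:  child taxon -> parent taxon (hypothetical excerpt of the
                NCBI hierarchy).
    """
    filled = dict(lab_counts)
    for taxon, count in lab_counts.items():
        ancestor = parent_of.get(taxon)
        # Walk upward until we reach the root or an ancestor the lab reported
        # itself; a lab-reported ancestor keeps the lab's value unchanged.
        while ancestor is not None and ancestor not in lab_counts:
            filled[ancestor] = filled.get(ancestor, 0) + count
            ancestor = parent_of.get(ancestor)
    return filled


def percentile_rank(value, sorted_reference):
    """Percentage of reference values that are at or below `value`."""
    if not sorted_reference:
        raise ValueError("reference distribution is empty")
    at_or_below = bisect.bisect_right(sorted_reference, value)
    return 100.0 * at_or_below / len(sorted_reference)
```

For example, with `parent_of = {101: 10, 102: 10, 10: 1}` and a lab report of `{101: 300, 102: 200, 1: 400}`, the missing genus `10` is added with 500 while the lab's family value of 400 is kept. This is exactly the kind of conflict described above, where the children add up to more than the lab reported for the family.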

PubMed Medical Conditions Handling

Most studies in the US National Library of Medicine report whether cohort averages were above or below control averages with statistical significance. Averages are not the right measure, but they are what is being used. We punt by treating the 75th percentile and above as a match for higher, and the 25th percentile and below as a match for lower.
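
The matching rule can be written as a small function. This is a sketch of the rule as described here; the function name and the `"higher"`/`"lower"` labels are assumptions made for the example.

```python
def matches_study(percentile, reported_direction):
    """Return True when a sample's percentile matches a study's reported shift.

    percentile:         the sample's percentile for one bacterium (0 to 100).
    reported_direction: "higher" or "lower", as reported by the study.
    """
    if reported_direction == "higher":
        return percentile >= 75.0
    if reported_direction == "lower":
        return percentile <= 25.0
    raise ValueError(f"unknown direction: {reported_direction!r}")
```

A sample at the 80th percentile matches a study reporting a higher level; one at the 50th percentile matches neither direction.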

After the above processing (so we have percentiles to work from), we compute these numbers