Upload Taxon Name based Files

In my prior post on ubiome JSON files (taxon number based files), I gave the code and the pattern to use. Unfortunately, not everyone provides such easy-to-use data. Look at the following download file format (AmericanGut):

#taxon	relative_abundance
k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides	0.287722
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__	0.130347
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__	0.116602
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium	0.045869
k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Parabacteroides	0.042162
k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Rikenellaceae;g__	0.039537
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__	0.034286

We have the name hierarchy (which unfortunately can vary slightly by lab). The names identify the tax_rank (which can also vary by lab!).

For this console application, I am going to end up with a data table consisting of:

  • tax_rank (converted from f__, s__ etc)
  • tax_name (extracted from the last name before the number)
  • BaseOneMillion (done by multiplying the number by 1,000,000)
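
The three columns above can be sketched in a few lines. This is a minimal Python sketch, not the repository's C# code: the function name `parse_taxon_line` and the prefix-to-rank map are my own, and falling back to the deepest level that actually has a name (for lines ending in `f__;g__`) is an assumption about how empty levels should be handled.

```python
# Greengenes-style prefixes mapped to rank names. This map is an assumption;
# labs vary, and the repository's C# code may use a different mapping.
RANK_BY_PREFIX = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxon_line(line):
    """Return (tax_rank, tax_name, base_one_million), or None for lines to skip."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # header or blank line
    lineage, abundance = line.split("\t")
    # Walk from the deepest level upward and take the first level with a name.
    for level in reversed(lineage.split(";")):
        prefix, _, name = level.strip().partition("__")
        if name:
            rank = RANK_BY_PREFIX.get(prefix, prefix)
            return rank, name, round(float(abundance) * 1_000_000)
    return None  # no level carried a name
```

With this fallback, a line whose family and genus are both empty is reported at the order level, which keeps the abundance instead of discarding it. Whether that is the right choice for your data is a judgment call.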

Source: https://github.com/Lassesen/Microbiome2/tree/master/taxonNameUpload

This data table is then uploaded, and the SQL goes through a series of steps to attempt to match it. Things that are not matched are written to a file (as in the prior post). In this case, patching the differences is much easier: just add the name to the TaxonNames file with its appropriate taxon (no need to create a separate substitution table). The SQL attempts to match by taxon rank and name first; if that fails, it falls back to name alone.
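
The two-step match (rank and name first, then name alone) can be illustrated outside the database. The following Python sketch is not the stored procedure itself. The in-memory `TAXON_NAMES` list stands in for the TaxonNames table with a few illustrative rows, and the case-insensitive comparison is an assumption.

```python
# Stand-in for the TaxonNames table: (tax_rank, tax_name, taxon).
# Illustrative rows only; the real table is loaded from the ncbi dump.
TAXON_NAMES = [
    ("superkingdom", "Bacteria", 2),
    ("phylum", "Firmicutes", 1239),
    ("genus", "Bacteroides", 816),
]


def match_taxon(tax_rank, tax_name, names=TAXON_NAMES):
    """Return the taxon number, or None when nothing matches."""
    wanted = tax_name.lower()
    # Pass 1: rank and name must both agree.
    for rank, name, taxon in names:
        if rank == tax_rank and name.lower() == wanted:
            return taxon
    # Pass 2: fall back to the name alone.
    for rank, name, taxon in names:
        if name.lower() == wanted:
            return taxon
    return None
```

In this sketch `match_taxon("kingdom", "Bacteria")` fails the first pass, because the stand-in table lists Bacteria under a different rank name, and succeeds on the second. A name such as SMB53 that appears nowhere returns `None` and would go to the unmatched file.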

For the test file that I used, there were a lot of mismatches.

Spot checking for a few, I discovered that these are currently listed as having a parent of “unclassified Gammaproteobacteria” (118884)

For others, there is nothing that seems to be a match and the original data line does not clarify anything.

k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__SMB53	0.000927

Bottom Line

For a 103-line input we got 87 matches, with the rest unclear. The sample code gives you information on the challenges. You need to come up with your own resolutions (discard or patch).

P.S. Remember to update your SQL Server Schema to get the new sprocs and data types.

Upload ubiome Json

I have just added another console application. I am doing things in layers and downstream we could create a DLL with everything in it. At this point (especially for those wishing to port it to other languages), doing one feature in one console application is likely the best approach.

This installment takes the ubiome json file and uploads it to the tables.

This is the template to use for uploading tests that provide the standardized taxon number. The only difference would be in file parsing.


Input file structure

{
  "download_time_utc": "2019-06-24T22:46:36.000Z",
  "sequencing_revision": "1346982",
  "site": "gut",
  "sampling_time": "2019-06-12T00:00:00.000Z",
  "notes": "",
  "ubiome_bacteriacounts": [
    {
      "taxon": 1,
      "parent": 0,
      "count": 70967,
      "count_norm": 1000000,
      "tax_name": "root",
      "tax_rank": "root"
    },

After one upload, your data should look like this:

The code is simple: a single method takes the file name, reads the JSON into an object, walks the object to build a data table, and then calls a stored procedure with this data table and other information.
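
For readers porting this to another language, the parse-and-walk step might look like the following Python sketch. The function name `rows_from_ubiome_json` and the choice of columns (taxon, count, count_norm) are assumptions. The actual data table and stored procedure are defined in the repository.

```python
import json


def rows_from_ubiome_json(text):
    """Parse ubiome JSON text and return one row per bacteria count entry."""
    sample = json.loads(text)
    return [
        {
            "taxon": entry["taxon"],
            "count": entry["count"],
            "count_norm": entry["count_norm"],
        }
        for entry in sample["ubiome_bacteriacounts"]
    ]


# A one-entry file built from the root entry shown in the input structure.
example = """
{
  "site": "gut",
  "ubiome_bacteriacounts": [
    {"taxon": 1, "parent": 0, "count": 70967, "count_norm": 1000000,
     "tax_name": "root", "tax_rank": "root"}
  ]
}
"""
rows = rows_from_ubiome_json(example)
```

The resulting list of rows is what would be handed to the database call, along with the sample-level fields such as the site and sampling time.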

The code also writes a report of any taxon that it could not match to the ncbi microbiome hierarchy. This can be a case of “different strokes for different folks”, but more often a taxon was deprecated and ubiome has not updated their system.

In the execution folder, you will see a file containing the missing taxon

With the contents like:

If you open the json file, and search for it, you will see what they call it.

You could add this to the taxonHierarchy or ignore it. I searched for it by name and got an apparent match but with a different taxon number.

Resolving this disagreement is up to you. One option is a replacement table of ubiome’s taxon to ncbi taxon where you are confident that they are the same.
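
A replacement table and the missing-taxon report can be combined in one pass. This is a Python sketch under assumptions: `find_missing` and `write_missing_report` are hypothetical names, the replacement table is a plain dictionary of ubiome taxon to ncbi taxon, and the report is written as tab-separated text.

```python
def find_missing(rows, known_taxa, replacements=None):
    """Split rows into (resolved, missing) after applying replacements.

    rows: dicts with at least a "taxon" key.
    known_taxa: set of taxon numbers present in the hierarchy.
    replacements: optional dict of ubiome taxon -> ncbi taxon.
    """
    replacements = replacements or {}
    resolved, missing = [], []
    for row in rows:
        taxon = replacements.get(row["taxon"], row["taxon"])
        if taxon in known_taxa:
            resolved.append({**row, "taxon": taxon})
        else:
            missing.append(row)
    return resolved, missing


def write_missing_report(missing, path):
    """Write one line per unmatched taxon: number, then name if present."""
    with open(path, "w", encoding="utf-8") as report:
        for row in missing:
            report.write(f'{row["taxon"]}\t{row.get("tax_name", "")}\n')
```

Only add an entry to the replacement dictionary when you are confident the two numbers refer to the same organism. Everything else stays in the report for manual review.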

Uploading ncbi hierarchy data

First, you need to download the ncbi dump files.

Go to ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

When you unzip the file, you will see the available data.

Load up the C# project at https://github.com/Lassesen/Microbiome2/tree/master/UploadTaxHier

Modify the DB Configuration string to point to your database. Run the application with the location of the dmp files above.
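
If you are porting the uploader, the dump files are plain text: fields are separated by a tab, a vertical bar, and a tab, and each line ends with a tab and a vertical bar. In nodes.dmp the first three fields are the taxon, its parent, and the rank. In names.dmp the fields are the taxon, the name, a unique name, and the name class. The Python sketch below parses single lines. The function names are my own, and which columns the C# project actually uploads is not shown here.

```python
def split_dmp_line(line):
    """Split one ncbi taxonomy dump line into stripped fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-row marker
    return [field.strip() for field in line.split("\t|\t")]


def parse_node(line):
    """nodes.dmp line -> (taxon, parent taxon, rank)."""
    tax_id, parent_id, rank = split_dmp_line(line)[:3]
    return int(tax_id), int(parent_id), rank


def parse_name(line):
    """names.dmp line -> (taxon, name, name class)."""
    tax_id, name, _unique_name, name_class = split_dmp_line(line)[:4]
    return int(tax_id), name, name_class
```

A taxon usually has several rows in names.dmp (synonyms, common names and so on), so a port would need to decide which name classes to keep.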

After the upload, we see the number of records that we have

That’s it! If you want periodic updates, I will leave that to the reader to implement (and submit as a pull request). Remember, this is open source!

OpenSource Microbiome Project

Readers have expressed interest in some of my work being open sourced. The actual site would be described as an “evolved beta”; rather than subject people to its quirks and kludges, I am proceeding with a redesign as a V.2 product. If you are interested, please FOLLOW (top left) to get updates as they happen.

The Repository is at:


The first item that I want to get up for discussion is the core database tables – for review and comments. The Database diagram is shown below.

A few quick notes:

  • Statistics were done as a separate table instead of the typical additional columns because trying multiple quantiles is seen as the way to go for non-parametric analysis. This becomes open ended, with items like “Q2_18” (quantile 2 of an 18-way quantization) being possible. With that type of breakdown, we want to know if we are dealing with stale data, so we need to know the computation date.
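
To make the naming concrete, here is a minimal Python sketch of how rows for such a statistics table could be produced. It is an illustration under assumptions, not the project's code: the column names `stat_name`, `value`, and `computed_on` are hypothetical, and “Q2_18” is read as the second cut point of an 18-way split.

```python
from datetime import date
from statistics import quantiles


def quantile_rows(values, n, computed_on=None):
    """Return one row per cut point of an n-way quantization of values."""
    computed_on = computed_on or date.today()
    cuts = quantiles(values, n=n)  # returns n - 1 cut points
    return [
        {"stat_name": f"Q{i}_{n}", "value": cut, "computed_on": computed_on}
        for i, cut in enumerate(cuts, start=1)
    ]
```

Because each statistic is its own row, adding a 10-way or 18-way breakdown later needs no schema change, and the stored date shows which rows predate the latest data.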

Next post will deal with populating TaxonHierarchy and TaxonNames from ncbi downloads.

The Journey Begins with your microbiome

Thanks for joining me!

This is a companion site to the analysis site at: 


Most of the content was originally posted on https://cfsremission.com/ with the pages on the left being a restructuring of selected posts from over a thousand posts on that site.

Also a PodCast: https://quaxpodcast.com/2020/02/ep-49-interview-ken-lassesen-using-ai-to-improve-the-gut-microbiome/

Recommended Site For Testing

If you have ME/CFS or another financially disastrous condition, there is always a nasty cost factor for testing. My usual recommendation is the cheapest high-quality provider that supplies information for upload to my analysis site. Some sites provide a mountain more information, but the benefit from that extra information is almost nothing (and it adds $$$$ and complexity).

  • uBiome.com is shutting down. This had been my personal usual site because using a variety of techniques, the cost was $25/sample. Don’t order from there.
  • BiomeSight.com (EU based) is an excellent buy using our discount code [MICRO]. They have also automated data transfer to our analysis site.
  • Thryve is what I am starting to use. Their reports may be processed here for independent suggestions. I would also recommend