Ranges: Quantiles and Boxes are Insufficient

The page "Your microbiome is unique to you" illustrates how challenging it is to compare microbiomes between people. The one probable exception is identical twins who eat the same diet and live the same lifestyle.

I often read, “Thryve says my bacteria X is too high/low”. That determination is usually made by comparing against averages. With the microbiome, that is not a good approach.

Determining what is or is not a significant shift in the microbiome is challenging. My original algorithm was based on reverse engineering the normative values that uBiome appears to be working from. This was a quick and dirty solution, the best that I could do several years ago.

I work with computer systems. Microbiomes and complex computer systems tend to share similar challenges: their data are not normally distributed and are often long-tailed (skewed), which means that averages and standard deviations often produce poor results for detecting abnormal values.
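To make the point about skewed data concrete, here is a minimal sketch using only Python's standard library. The abundance values are invented for illustration (they are not taken from the uploaded samples), and the "mean plus or minus 2 standard deviations" range is just one common bell-curve convention, assumed here for the comparison.

```python
from statistics import mean, median, stdev

# Made-up, long-tailed relative abundances (1.0 = 100%); not real samples.
abundances = [0.001, 0.002, 0.002, 0.003, 0.004, 0.005, 0.008, 0.012, 0.05, 0.30]

avg = mean(abundances)
mid = median(abundances)
sd = stdev(abundances)

# A "normal range" of mean +/- 2 standard deviations, as used for bell curves.
low, high = avg - 2 * sd, avg + 2 * sd

# How many samples sit below the "average"?
below_mean = sum(1 for v in abundances if v < avg)

print(f"mean={avg:.4f} median={mid:.4f} stdev={sd:.4f}")
print(f"mean +/- 2 sd range: {low:.4f} to {high:.4f}")
print(f"{below_mean} of {len(abundances)} samples are below the mean")
```

With data shaped like this, the mean is pulled far above the median by a couple of large values, and the lower bound of the range goes negative, which is impossible for an abundance.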

We will Box Up the Issue!

A common data science step when preparing data for machine learning is excluding outliers. We are actually interested in finding the outliers! This is often done with boxplots. An example for some phylum-level bacteria is shown below (note: 1.0 = 100%). It uses some 500+ uploaded microbiome samples.

Outliers are shown as round circles.

And we can do the same at lower taxonomic levels, for example, order

Down to Species

The solid black line is the median (the middle value, a kind of average). For B. vulgatus we see that the range of values from the 25%ile to the median is almost the same as from the median to the 75%ile. For B. uniformis, the two ranges are very different.
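The numbers behind a boxplot can be sketched in a few lines. This is an illustration under stated assumptions, not the code used to draw the charts: `box_stats` is a hypothetical helper, the abundances are made up, and the `inclusive` quantile method is one of several conventions (plotting tools may compute quartiles slightly differently). The 1.5 × IQR fences are the usual boxplot rule for marking outliers.

```python
from statistics import quantiles

def box_stats(values):
    """Return quartiles, fences, and outliers for a list of numbers."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "lower_half": med - q1,  # width from 25%ile to median
        "upper_half": q3 - med,  # width from median to 75%ile
        "lower_fence": lower,
        "upper_fence": upper,
        "outliers": [v for v in values if v < lower or v > upper],
    }

# Made-up skewed abundances (1.0 = 100%); not real samples.
skewed = [0.01, 0.02, 0.02, 0.03, 0.04, 0.06, 0.09, 0.15, 0.40]
stats = box_stats(skewed)

for name, value in stats.items():
    print(name, value)
```

For a skewed taxon, `upper_half` comes out noticeably larger than `lower_half`, which is the asymmetry visible in the box for B. uniformis.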

Guarantee Every Sample Needs Fixing!

See Understanding Boxplots

Using the box or quantile method and deeming the normal range to be between Q1 (25%ile) and Q3 (75%ile) means that 50% of all values (0 to 25%ile plus 75%ile to 100%ile) will be deemed abnormal. This style of thinking leads to expecting a family to have 1.343 kids.
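A quick check of that arithmetic, as a sketch: `fraction_outside_iqr` is a hypothetical helper, and the sample is simply the integers 1 to 100 standing in for any measurement. Whatever the data, treating the 25%ile to 75%ile range as normal flags roughly half of the values by construction.

```python
from statistics import quantiles

def fraction_outside_iqr(values):
    """Fraction of values below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    flagged = [v for v in values if v < q1 or v > q3]
    return len(flagged) / len(values)

# Stand-in data: 100 evenly spread values (not microbiome data).
samples = list(range(1, 101))
print(fraction_outside_iqr(samples))
```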

We also have a problem when only 20% of samples have a given bacterium. Do we treat the samples without it as zero values? If we do, then we have Q1 = 0, Q3 = 0, IQR = 0, and thus the maximum Q3 + 1.5 × IQR = 0. Thus having any of this bacterium at detectable levels is abnormal!
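The zero-value problem can be reproduced with a small sketch. The data are invented: 100 samples, of which 80 are recorded as zero and 20 have small detectable amounts. `upper_fence` is a hypothetical helper applying the Q3 + 1.5 × IQR rule.

```python
from statistics import quantiles

def upper_fence(values):
    """Upper boxplot fence: Q3 + 1.5 * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)

# Made-up data: 80 samples without the bacterium, 20 with small amounts.
abundances = [0.0] * 80 + [0.001 * (i + 1) for i in range(20)]

fence = upper_fence(abundances)
flagged = [v for v in abundances if v > fence]

print("upper fence:", fence)
print("samples flagged as abnormal:", len(flagged))
```

Because more than 75% of the values are zero, both quartiles collapse to zero and every sample with a detectable amount lands above the fence.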