Technical Notes: Percentages of Percentiles as a Health Measure?

This is part of a series of Technical Notes on Microbiome Analysis.

For a while I have been using a variation of this concept for the 16S samples that I have reviewed. The concept is very simple to a statistician:

Converting data to percentiles maps it onto a uniform distribution. If you draw one ball from each of 1000 boxes, where each box holds 100 balls numbered 1-100, then you expect the numbers drawn to be uniformly distributed. If they are not, then something is definitely unfair.
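The box-and-ball picture can be simulated in a few lines. This is only a sketch of the idea: the function name `draw_counts` and the fixed seed are my own choices for illustration, and it draws one ball per box.

```python
import random
from collections import Counter

def draw_counts(n_boxes=1000, n_balls=100, seed=0):
    """Draw one ball from each box and count how often each number appears."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    draws = [rng.randint(1, n_balls) for _ in range(n_boxes)]
    return Counter(draws)

counts = draw_counts()
expected = 1000 / 100  # about 10 draws per ball number if the boxes are fair
print("expected per number:", expected)
print("observed min/max:", min(counts.values()), max(counts.values()))
```

With fair boxes the observed counts scatter around 10 per number. A number that turns up far more often than that is the "unfair" signal that the percentile charts are meant to expose.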


With the microbiome, things are a little more complex because a high value for a single strain may push its species into the high range, and thus the genus into the high range as well. We could treat each level independently, for example species only or genus only. The problem is that the population size starts to drop at each level, and the sensitivity decreases as a result.

I happen to have a small collection of shotgun samples processed through CosmosID. Their reports give a percentile for most of what they measure. Getting accurate percentiles requires large sample sizes.

Below I have charted the results in single-percentile ranges from reports that have between 2000 and 5000 different biological units reported. I have charted them using different approaches (the kitchen sink, and then selected taxonomic levels).
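As a rough sketch of how such a chart can be tabulated: the `(rank, percentile)` record layout and the function name `percent_by_percentile` below are hypothetical, chosen for illustration, and are not the CosmosID export format. The sample records are made up purely to show the mechanics.

```python
from collections import Counter

# Hypothetical records: (taxonomic rank, reported percentile). Illustrative only.
records = [
    ("species", 3), ("species", 57), ("species", 100), ("species", 100),
    ("genus", 42), ("genus", 100), ("family", 88),
]

def percent_by_percentile(records, rank=None):
    """Return {percentile: percent of taxa}. rank=None is the kitchen sink."""
    selected = [round(p) for r, p in records if rank is None or r == rank]
    if not selected:
        return {}
    counts = Counter(selected)
    return {p: 100.0 * n / len(selected) for p, n in sorted(counts.items())}

kitchen_sink = percent_by_percentile(records)
species_only = percent_by_percentile(records, rank="species")
print(species_only)
```

Each value is the percentage of reported taxa that fall in that single-percentile range, which is what the "percentages of percentiles" charts plot.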

All of these samples are from people with health issues. Note that the numbers come from rounding, so the 100% bucket covers only 99.5 to 100 (and not 99.5 to 100.5). It is half as wide as the other buckets, so the true spike at 100 is likely twice as high as charted.
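The rounding effect can be checked with simulated data. The sketch below assumes percentiles are continuous values between 0 and 100 that get rounded to the nearest whole number; the helper `edge_corrected` is a hypothetical name, and whether a 0 bucket appears in the real reports is also an assumption.

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the sketch is repeatable
# Simulated, perfectly uniform percentiles (not real report data).
values = [rng.uniform(0, 100) for _ in range(200_000)]
buckets = Counter(round(v) for v in values)

def edge_corrected(buckets, low=0, high=100):
    """Double the half-width edge buckets so every bucket is comparable."""
    return {b: n * 2 if b in (low, high) else n for b, n in buckets.items()}

corrected = edge_corrected(buckets)
print(buckets[50], buckets[100], corrected[100])
```

Even with perfectly uniform input, the raw count at 100 comes out at roughly half the count of an interior bucket such as 50. Doubling the edge bucket puts it back on the same scale as the others.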

Kitchen Sink

Filter to Species Only

Genus Level

Family Level

Bottom Line

Comparing different levels can be informative. To illustrate, the species chart below shows good uniformity until we hit the high levels.

Looking at the genus level for the same sample, the pattern is very different.

In this case, we drilled down into these high species and found a predominance of Corynebacterium species that fell into our 100% range (99.5-100 percentiles).

Taxonomy Name                                 Abundance
Anaerococcus mediterraneensis                 0.005611
Anaerococcus prevotii                         0.006486
Bacteroides rodentium                         0.001238
Corynebacteriaceae bacterium ‘ARUP UnID 227’  0.000437
Corynebacterium ammoniagenes                  0.000586
Corynebacterium aurimucosum                   0.1573
Corynebacterium callunae                      0.00013
Corynebacterium camporealensis                0.002243
Corynebacterium casei                         0.000726
Corynebacterium comes                         0.000391
Corynebacterium diphtheriae                   0.0755
Corynebacterium endometrii                    0.001051
Corynebacterium flavescens                    0.001684
Corynebacterium humireducens                  0.00053
Corynebacterium imitans                       0.001024
Corynebacterium jeikeium                      0.01813
Corynebacterium lactis                        0.000437
Corynebacterium liangguodongii                0.000558
Corynebacterium minutissimum                  0.03511
Corynebacterium phocae                        0.000865
Corynebacterium pseudotuberculosis            0.000233
Corynebacterium renale                        0.000493
Corynebacterium resistens                     0.001182
Corynebacterium riegelii                      0.001321
Corynebacterium segmentosum                   0.007016
Corynebacterium simulans                      0.3615
Corynebacterium singulare                     0.01858
Corynebacterium sp. NML 98-0116               0.001024
Corynebacterium stationis                     0.000577
Corynebacterium striatum                      0.04709
Corynebacterium timonense                     0.001321
Corynebacterium urealyticum                   0.00107
Corynebacterium uterequi                      0.000642
Corynebacterium yudongzhengii                 0.000689
Cutibacterium acnes                           0.002298
Dehalococcoides mccartyi                      0.006123
Dermabacter jinjuensis                        0.01404
Dermabacter vaginalis                         0.001265
Fastidiosipila sanguinis                      0.003536
Finegoldia magna                              0.06368
Helcococcus kunzii                            0.00014
Homo sapiens                                  1.985
Lawsonella clevelandensis                     0.003154
Mycobacterium gallinarum                      0.000261
Mycobacterium sp. DL592                       0.00013
Mycobacterium sp. ELW1                        0.001107
Mycobacterium sp. EPa45                       0.002298
Mycobacterium sp. PYR15                       0.008328
Mycolicibacterium aichiense                   0.000223
Negativicoccus massiliensis                   0.001935
Peptoniphilus harei                           0.04272
Peptoniphilus sp. ING2-D1G                    0.000893
Porphyromonas asaccharolytica                 0.06443
Porphyromonas bennonis                        0.000521
Propionibacterium freudenreichii              0.000465
Schaalia radingae                             0.001089
Streptococcus pyogenes                        0.00241
Streptococcus sp. NCTC 11567                  0.000149
Sutterella stercoricanis                      0.000149
Tessaracoccus timonensis                      0.00094
uncultured Chroococcidiopsis sp.              0.000242
uncultured Rhizobium sp.                      0.000772

We could also produce single-value statistical measures, for example chi-squared (χ²). We have an a priori expected value of 1% in each bucket.
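A minimal sketch of that chi-squared statistic, assuming 100 equal-width buckets that each expect 1% of the taxa. The function name `chi_squared_uniform` and the example counts are hypothetical, not taken from any report.

```python
def chi_squared_uniform(observed):
    """Chi-squared statistic against an equal expected count in every bucket."""
    total = sum(observed)
    expected = total / len(observed)  # 1% of the total when there are 100 buckets
    return sum((o - expected) ** 2 / expected for o in observed)

# Made-up counts: a perfectly flat report, and one with a spike in the top bucket.
flat = [30] * 100
spiked = [28] * 99 + [228]

print(chi_squared_uniform(flat))    # 0.0 for a perfectly flat report
print(chi_squared_uniform(spiked))  # grows as the top bucket pulls away
```

With 100 buckets there are 99 degrees of freedom, so a report whose percentiles really are uniform should give a statistic somewhere near 99, and values far above that point to non-uniformity. If the edge buckets are half-width because of rounding, their expected counts would need to be halved, which this sketch does not do.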

IMHO, percentages of percentiles are likely more effective for evaluating an individual person’s gut microbiome. The approach seems able to separate the noise from what is significant, for example the Corynebacterium case cited above, where the cause is a proliferation of species and not the dominance of one species.

This has since cascaded into a Eubiosis Index.