New Standards for Microbiome Analysis?

This sketch is suggested for all studies looking for associations between bacteria and anything else. At the end, I will illustrate it with sample results based on 100 people with self-declared Autism compared against a reference population. The need for a better standard came out of this post, Microbiome Statistics for Dummies. As an FYI, applying all of the methods below produced over 200 findings with P < 0.01 from the test sample.

Caveat emptor: This is based on 16S samples using self-declaration for Autism. The purpose of this post is not to report on microbiome shifts for Autism, but to show different statistics that should be used yet tend to be ignored.

Criteria

Measures must have the ability to compute statistical significance. Measures should be multi-faceted and not mono-faceted (e.g. comparing averages only). The following measures are proposed as a new standard:

  • Odds Ratio based on the Median of the targeted population.
  • Odds Ratio based on the Incidence of being seen between populations
  • Odds Ratio based on the Median of the reference population.
  • Difference of Median between target and general population
  • Difference of Means between target and general population
  • Difference of Skews between target and general population

Odds Ratio

To compute an odds ratio (OR) and assess its statistical significance, you

  • (1) calculate the OR from a 2×2 table, then
  • (2) compute a confidence interval (CI) and/or a p‑value
    • via chi‑square test, or Fisher’s exact test
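
The two steps above can be sketched with only the Python standard library. This is a minimal illustration, not the code used for the tables in this post: the function names and the example counts are made up, the confidence interval uses the common log (Wald) approximation, and the chi-square test is the uncorrected Pearson version.

```python
import math

def odds_ratio_2x2(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% confidence interval (log/Wald method).

    a: target, feature present       b: target, feature absent
    c: reference, feature present    d: reference, feature absent
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log)
    high = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, low, high

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and its p-value, 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function when df = 1
    return chi2, p_value

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum over tables no more likely than the observed one."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x target samples having the feature
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))

# Hypothetical counts: feature present in 45 of 100 target and 640 of 4000 reference samples
print(odds_ratio_2x2(45, 55, 640, 3360))
print(chi_square_2x2(45, 55, 640, 3360))
print(fisher_exact_2x2(45, 55, 640, 3360))
```

A zero in any cell makes the odds ratio undefined; adding 0.5 to every cell is one common workaround. Fisher's exact test is usually preferred when any expected cell count is small.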

So for the three odds ratios cited above, we have:

  • Median Odds Ratio When Detected
    • Numbers above/below the Median of the target/reference populations
      • For Reference Population
      • For Target Population
  • Incidence Odds Ratio When Detected
    • Number where it was detected, Number where it was not detected
      • For Reference Population
      • For Target Population
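
As a rough sketch of how the counts listed above could be assembled from raw abundances, the snippet below builds both kinds of table. The function names and the sample values are hypothetical, and it assumes that a missing value or a zero means "not detected" and that a value exactly at the median counts as "not above".

```python
from statistics import median

def incidence_table(target, reference):
    """(detected, not detected) counts for the target and the reference group."""
    def split(values):
        detected = sum(1 for v in values if v)  # None and 0 both count as not detected
        return detected, len(values) - detected
    return split(target), split(reference)

def median_split_table(target, reference, use="target"):
    """(above, not above) counts around one group's median, detected samples only."""
    t = [v for v in target if v]
    r = [v for v in reference if v]
    cutoff = median(t if use == "target" else r)

    def split(values):
        above = sum(1 for v in values if v > cutoff)
        return above, len(values) - above

    return cutoff, split(t), split(r)

# Hypothetical percentage abundances for one bacterium
target = [0.9, 0.0, 1.2, 0.4, None, 2.0]
reference = [0.1, 0.0, 0.2, None, 0.05, 0.0, 0.3, 1.5]

print(incidence_table(target, reference))
print(median_split_table(target, reference, use="target"))
print(median_split_table(target, reference, use="reference"))
```

Each returned pair supplies one row of a 2×2 table, which can then be passed to whatever odds ratio and significance test is being used.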

Difference of Skews

While this is not normally done, given the very high skewness (20+) seen with bacteria, it is a worthwhile investigation when sample sizes are sufficient. A description of how to do this is below.
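
One possible way to put a p-value on a difference of skews is a permutation test. The sketch below is an assumption on my part rather than an established standard: it uses moment skewness, shuffles the group labels, and counts how often the shuffled difference is at least as large as the observed one. The function names and default settings are illustrative only.

```python
import random
from statistics import fmean

def skewness(values):
    """Moment skewness: third central moment divided by variance ** 1.5."""
    values = list(values)
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5  # fails if every value is identical (zero variance)

def skew_difference_test(target, reference, n_permutations=2000, seed=0):
    """Observed skew difference and a two-sided permutation p-value."""
    rng = random.Random(seed)
    observed = skewness(target) - skewness(reference)
    pooled = list(target) + list(reference)
    n_target = len(target)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = skewness(pooled[:n_target]) - skewness(pooled[n_target:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return observed, (extreme + 1) / (n_permutations + 1)
```

Strictly speaking, the null hypothesis here is that both groups come from the same distribution, so a small p-value signals some difference in shape, not only in skew. Skewness estimates are also noisy, which is why this is only worth doing when sample sizes are sufficient.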

Difference of Means

Given the high skewness of most bacteria, this well-known standard approach will typically underestimate the significance.

Real Data using 100 Autism People

Bacteria Incidence P < 0.01

This first table made me do a double take: Bifidobacterium is generally low in Autism, yet the table shows the odds of having Autism increasing with certain Bifidobacterium. The key factor to remember is that Bifidobacterium is rarely reported with Autism, but when it is reported, the amounts tend to be higher; additionally, certain species tend to be a lot more common with Autism.

To verify my computations, I also included the percentage of samples in which each is seen, with Autism and without. These Bifidobacterium appear far more often in the Autism samples.

| tax_name | rank | Incidence Odds Ratio | Chi2 | Symptom % Seen | Reference % Seen |
|---|---|---|---|---|---|
| Bifidobacterium catenulatum subsp. kashiwanohense | subspecies | 2.8 | 39.80888 | 54 | 19 |
| Bifidobacterium angulatum | species | 2.8 | 32.81414 | 45 | 16 |
| Hungateiclostridium | genus | 2.4 | 17.91463 | 30 | 12 |
| Hungateiclostridiaceae | family | 2.4 | 17.7539 | 30 | 12 |
| Bifidobacterium scardovii | species | 2.1 | 13.64735 | 34 | 16 |
| Clostridium chartatabidum | species | 1.9 | 14.42216 | 48 | 24 |
| Bifidobacterium catenulatum PV20-2 | strain | 1.9 | 15.85311 | 63 | 33 |
| Bifidobacterium catenulatum | species | 1.9 | 15.01648 | 64 | 35 |
| Parascardovia | genus | 1.8 | 9.703312 | 35 | 19 |
| Bifidobacterium cuniculi | species | 1.8 | 9.141149 | 34 | 18 |
| Bifidobacterium gallicum | species | 1.7 | 12.55468 | 82 | 48 |
| ant endosymbionts | clade | 1.7 | 7.530999 | 39 | 23 |
| Candidatus Blochmanniella | genus | 1.7 | 7.530999 | 39 | 23 |
| Enterobacteriaceae incertae sedis | no rank | 1.6 | 6.782294 | 39 | 24 |
| Enterobacter hormaechei | species | 1.6 | 6.962239 | 41 | 25 |
| Moorella group | no rank | 1.6 | 8.616342 | 66 | 42 |
| unclassified Bacteroidetes Order II. | order | 1.6 | 8.740627 | 75 | 48 |
| Bifidobacterium indicum | species | 1.6 | 8.410521 | 76 | 49 |

Using Symptom Median P < 0.01

We have 152 bacteria identified. Bifidobacterium species feature again. For brevity, I show only the top 25 below. Remember that when using the Median of those with Autism, 50% of Autistic people will be below and 50% above; there is no need to show those counts.

Note that the top line says that if the amount of Bifidobacterium is above 0.85, the odds ratio is just 0.26 (greatly reduced). The Odds Ratio applies to those with the reported amount above the Median.

REMINDER: We are looking only at samples in which the bacteria was found.

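| tax_name | Rank | Median | Odds Ratio | Chi2 | Reference Below | Reference Above |
|---|---|---|---|---|---|---|
| Bifidobacterium | genus | 0.8505 | 0.26 | 49.3 | 3114 | 817 |
| Actinomycetota | phylum | 1.0869 | 0.28 | 44.5 | 3126 | 872 |
| Bifidobacteriaceae | family | 0.764 | 0.28 | 43.4 | 3075 | 870 |
| Bifidobacteriales | order | 0.764 | 0.28 | 43.4 | 3075 | 870 |
| Bifidobacterium catenulatum | species | 0.01 | 0.29 | 39 | 1148 | 331 |
| Bifidobacterium asteroides | species | 0.005 | 0.29 | 36.8 | 872 | 255 |
| Bifidobacterium subtile | species | 0.0065 | 0.33 | 31.1 | 1093 | 357 |
| Actinomycetes | class | 0.71595 | 0.34 | 30.7 | 2978 | 1013 |
| Bifidobacterium cuniculi | species | 0.006 | 0.31 | 30.7 | 598 | 188 |
| Parascardovia | genus | 0.004 | 0.32 | 30.6 | 608 | 192 |
| Leyella | genus | 0.005 | 0.35 | 28.4 | 1314 | 454 |
| Leyella stercorea | species | 0.005 | 0.35 | 28.4 | 1314 | 454 |
| Bacteroides rodentium | species | 0.0695 | 2.72 | 26.1 | 1060 | 2878 |
| Bifidobacterium gallicum | species | 0.084 | 0.37 | 24.9 | 1508 | 559 |
| Moraxellales | order | 0.004 | 0.37 | 24.7 | 1597 | 595 |
| Moraxellaceae | family | 0.004 | 0.37 | 24.7 | 1597 | 595 |
| Eukaryota | superkingdom | 0.004 | 0.37 | 24.6 | 1089 | 401 |
| Phocaeicola paurosaccharolyticus | species | 0.025 | 2.62 | 24.1 | 1080 | 2828 |
| Geopsychrobacteraceae | family | 0.017 | 0.39 | 22.7 | 1400 | 540 |
| Desulfuromusa | genus | 0.017 | 0.39 | 22.7 | 1400 | 540 |
| Burkholderiales genera incertae sedis | no rank | 0.018 | 0.38 | 22.4 | 1024 | 393 |
| Desulfuromonadales | order | 0.015 | 0.39 | 22.4 | 1575 | 614 |
| Desulfuromonadia | class | 0.015 | 0.39 | 22.3 | 1574 | 614 |
| Desulfuromonadaceae | family | 0.016 | 0.39 | 21.9 | 1446 | 568 |
| Psychrobacter | genus | 0.003 | 0.39 | 21.5 | 1250 | 493 |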

Using Reference Median P < 0.01

In this case, we have 89 bacteria. Bifidobacterium is very common again. To clarify matters a little: if the sample has Bifidobacterium reported and the amount is over 0.115, the odds ratio for this person having Autism is 3.66.

The Odds Ratio applies to those with the reported amount above the Median.

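| tax_name | Rank | Reference Median | Odds Ratio | Chi2 | Symptom Below | Symptom Above |
|---|---|---|---|---|---|---|
| Bifidobacteriales | order | 0.116 | 3.71 | 32.1 | 21 | 78 |
| Bifidobacteriaceae | family | 0.116 | 3.71 | 32.1 | 21 | 78 |
| Bifidobacterium | genus | 0.115 | 3.66 | 31.3 | 21 | 77 |
| Actinomycetes | class | 0.175 | 3.35 | 28.5 | 23 | 77 |
| Caloramator indicus | species | 0.005 | 0.09 | 25.8 | 34 | 3 |
| Bifidobacterium gallicum | species | 0.009 | 3.37 | 23.9 | 19 | 64 |
| Actinomycetota | phylum | 0.23 | 2.85 | 22.5 | 26 | 74 |
| Hathewaya histolytica | species | 0.159 | 0.39 | 18.3 | 71 | 28 |
| Hathewaya | genus | 0.159 | 0.39 | 18.3 | 71 | 28 |
| Anaerovibrio lipolyticus | species | 0.028 | 0.38 | 17.1 | 63 | 24 |
| Rhodothermota | phylum | 0.012 | 0.41 | 16.9 | 69 | 28 |
| Rhodothermia | class | 0.012 | 0.41 | 16.9 | 69 | 28 |
| Rhodothermales | order | 0.012 | 0.41 | 16.9 | 69 | 28 |
| Clostridium thermosuccinogenes | species | 0.008 | 0.37 | 16.3 | 57 | 21 |
| Pseudoclostridium | genus | 0.008 | 0.37 | 16.3 | 57 | 21 |
| Anaerovibrio | genus | 0.028 | 0.41 | 15.1 | 63 | 26 |
| Enterobacteriaceae | family | 0.055 | 2.12 | 12.7 | 32 | 68 |
| Enterobacterales | order | 0.057 | 2.12 | 12.7 | 32 | 68 |
| Peptoniphilus | genus | 0.052 | 0.46 | 12.6 | 65 | 30 |
| Porphyromonas canis | species | 0.005 | 0.39 | 12.1 | 46 | 18 |
| Bifidobacterium adolescentis | species | 0.011 | 2.18 | 12 | 28 | 61 |
| Phocaeicola massiliensis | species | 0.015 | 0.42 | 12 | 52 | 22 |
| Olivibacter | genus | 0.006 | 0.42 | 11.3 | 48 | 20 |
| Phocaeicola paurosaccharolyticus | species | 0.044 | 0.48 | 11.2 | 64 | 31 |
| Eukaryota | superkingdom | 0.002 | 3.5 | 11 | 8 | 28 |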
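
As a quick arithmetic check, the odds ratios in this table appear to be reproducible from the Symptom Below and Symptom Above counts alone. The sketch assumes that the reference samples split evenly around their own median (reference odds of 1, ignoring ties); the function name is made up for illustration.

```python
# (Symptom Below, Symptom Above) counts copied from three rows of the table above
rows = {
    "Bifidobacterium (genus)": (21, 77),
    "Caloramator indicus (species)": (34, 3),
    "Eukaryota (superkingdom)": (8, 28),
}

def odds_ratio_vs_reference_median(below, above):
    # Assumption: half of the reference samples lie above the reference median,
    # so the reference odds of being above the cut-off are taken as 1.
    return above / below

for name, (below, above) in rows.items():
    print(name, round(odds_ratio_vs_reference_median(below, above), 2))
```

The results come out close to the reported 3.66, 0.09 and 3.5.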

Comparing Averages

A classic question when using averages: do you include samples where a bacterium was not found as a zero, or exclude them from your average? I am inclined to suggest that both should be done.

| tax_name | Rank | Symptom Average | Reference Average | Symptom Average With Zero | Reference Average With Zero |
|---|---|---|---|---|---|
| Bifidobacterium catenulatum | species | 0.031 | 0.016 | 0.02 | 0.006 |
| Bifidobacterium gallicum | species | 0.845 | 0.293 | 0.695 | 0.142 |
| Bifidobacterium angulatum | species | 0.034 | 0.015 | 0.015 | 0.002 |
| Bifidobacterium asteroides | species | 0.007 | 0.005 | 0.003 | 0.001 |
| Caloramator indicus | species | 0.007 | 0.037 | 0.002 | 0.013 |
| Bifidobacterium cuniculi | species | 0.01 | 0.006 | 0.003 | 0.001 |
| Bifidobacterium subtile | species | 0.018 | 0.006 | 0.008 | 0.002 |
| Bacteroides rodentium | species | 0.187 | 0.397 | 0.182 | 0.366 |
| Phocaeicola paurosaccharolyticus | species | 0.035 | 0.06 | 0.033 | 0.055 |
| Bifidobacterium indicum | species | 0.028 | 0.019 | 0.021 | 0.009 |
| Hathewaya histolytica | species | 0.18 | 0.281 | 0.176 | 0.261 |
| Bifidobacterium scardovii | species | 0.005 | 0.009 | 0.002 | 0.001 |
| Leyella stercorea | species | 0.712 | 0.608 | 0.303 | 0.252 |
| Phocaeicola sartorii | species | 0.041 | 0.084 | 0.038 | 0.076 |
| Anaerovibrio lipolyticus | species | 0.055 | 0.114 | 0.047 | 0.098 |
| Phascolarctobacterium succinatutens | species | 0.033 | 0.061 | 0.022 | 0.045 |
| Sarcina maxima | species | 0.109 | 0.034 | 0.074 | 0.017 |
| Clostridium thermosuccinogenes | species | 0.008 | 0.015 | 0.006 | 0.012 |
| Butyricimonas virosa | species | 0.013 | 0.014 | 0.004 | 0.007 |
| Bifidobacterium adolescentis | species | 0.404 | 0.299 | 0.356 | 0.229 |
| Veillonella montpellierensis | species | 0.037 | 0.03 | 0.022 | 0.017 |
| Sporolactobacillus putidus | species | 0.039 | 0.018 | 0.015 | 0.005 |
| Caloramator uzoniensis | species | 0.005 | 0.01 | 0.002 | 0.005 |
| Johnsonella ignava | species | 0.065 | 0.051 | 0.063 | 0.047 |
| Bacteroides cellulosilyticus | species | 0.631 | 0.86 | 0.569 | 0.762 |
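
To make the two kinds of average concrete, here is a small sketch with made-up values. The function name is hypothetical, and it assumes that a missing value or a zero means the bacterium was not found.

```python
from statistics import fmean

def averages(values):
    """Return (mean over detected samples only, mean with non-detections counted as zero)."""
    detected = [v for v in values if v]  # drops None and 0
    if not detected:
        return None, 0.0
    return fmean(detected), sum(detected) / len(values)

# Hypothetical abundances: detected in 3 of 5 samples
sample = [0.8, None, 0.4, 0.0, 1.2]
detected_mean, mean_with_zero = averages(sample)
incidence = 3 / 5

print(detected_mean, mean_with_zero)
# The two averages are linked by the incidence:
# mean_with_zero equals detected_mean * incidence (up to floating-point rounding)
print(detected_mean * incidence)
```

The same relationship can be seen in the tables: Bifidobacterium gallicum is seen in 82% of symptom samples with an average of 0.845 when detected, and 0.82 × 0.845 ≈ 0.69, in line with the 0.695 in the With Zero column.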

Averages to Median

Bacteria tend to have very high skewness, rendering the use of means very unsafe. IMHO, the median is a far better statistic to use. As shown below, medians are almost always below the average because of extreme values. Often the mean will sit at the 90th percentile for a bacterium.

| tax_name | Rank | Symptom Average | Reference Average | Symptom Median | Reference Median |
|---|---|---|---|---|---|
| Bifidobacterium catenulatum | species | 0.031 | 0.016 | 0.01 | 0.003 |
| Bifidobacterium gallicum | species | 0.845 | 0.293 | 0.084 | 0.009 |
| Bifidobacterium angulatum | species | 0.034 | 0.015 | 0.009 | 0.004 |
| Bifidobacterium asteroides | species | 0.007 | 0.005 | 0.005 | 0.002 |
| Caloramator indicus | species | 0.007 | 0.037 | 0.002 | 0.005 |
| Bifidobacterium cuniculi | species | 0.01 | 0.006 | 0.006 | 0.003 |
| Bifidobacterium subtile | species | 0.018 | 0.006 | 0.007 | 0.003 |
| Bacteroides rodentium | species | 0.187 | 0.397 | 0.07 | 0.191 |
| Phocaeicola paurosaccharolyticus | species | 0.035 | 0.06 | 0.025 | 0.044 |
| Bifidobacterium indicum | species | 0.028 | 0.019 | 0.011 | 0.005 |
| Hathewaya histolytica | species | 0.18 | 0.281 | 0.092 | 0.159 |
| Bifidobacterium scardovii | species | 0.005 | 0.009 | 0.003 | 0.002 |
| Leyella stercorea | species | 0.712 | 0.608 | 0.005 | 0.003 |
| Phocaeicola sartorii | species | 0.041 | 0.084 | 0.017 | 0.033 |
| Anaerovibrio lipolyticus | species | 0.055 | 0.114 | 0.014 | 0.028 |
| Phascolarctobacterium succinatutens | species | 0.033 | 0.061 | 0.004 | 0.009 |
| Sarcina maxima | species | 0.109 | 0.034 | 0.017 | 0.007 |
| Clostridium thermosuccinogenes | species | 0.008 | 0.015 | 0.005 | 0.008 |
| Butyricimonas virosa | species | 0.013 | 0.014 | 0.011 | 0.006 |
| Bifidobacterium adolescentis | species | 0.404 | 0.299 | 0.044 | 0.011 |
| Veillonella montpellierensis | species | 0.037 | 0.03 | 0.019 | 0.007 |
| Sporolactobacillus putidus | species | 0.039 | 0.018 | 0.015 | 0.005 |
| Caloramator uzoniensis | species | 0.005 | 0.01 | 0.002 | 0.004 |
| Johnsonella ignava | species | 0.065 | 0.051 | 0.019 | 0.03 |
| Bacteroides cellulosilyticus | species | 0.631 | 0.86 | 0.02 | 0.079 |
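
A toy example shows how a single extreme value can push the mean far up the distribution while the median stays put. The values and the function name below are invented purely for illustration.

```python
from statistics import fmean, median

def mean_percentile(values):
    """Percentage of samples that fall strictly below the mean."""
    m = fmean(values)
    return 100 * sum(1 for v in values if v < m) / len(values)

# Hypothetical abundances: nine low readings and one extreme value
sample = [0.001] * 8 + [0.002, 1.0]

print("mean:", fmean(sample))
print("median:", median(sample))
print("percent of samples below the mean:", mean_percentile(sample))
```

Here nine of the ten samples sit below the mean, so the mean describes almost none of them, while the median still reflects a typical sample.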

Bottom Line

The purpose of this post is to discourage people from under-using the available data. Patterns can be counterintuitive. For example, Bifidobacterium is detected less often, but when detected the amount is higher.

My goal in doing this deep dive is to look at different Odds Ratios. Odds ratios allow prediction of likely and developing conditions. A sweet side effect is that they also allow priorities for each bacterium to be computed objectively, with the goal of higher success with interventions.

I have more investigations to do, especially double-checking computations and doing cross-validation with existing samples.