Technical Note: Prevalence, Average and Not Reported

In reviewing many microbiome papers, I have noticed that researchers often restrict their analysis to the taxa that are reported in every sample. I suspect this is due to insufficient statistical training and/or not understanding the nature of microbiome data.

Recently I came across these papers, which use an approach that I have often used myself: working from the relative frequency of detection, a.k.a. prevalence.

This post uses samples available at the Microbiome Prescription Citizen Science site. We restrict the data to one lab source and divide it into two groups based on self-declared symptoms and diagnoses.

  • Patients with Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) declared [Obs: 271]
  • Patients without Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) declared but with some other status declared (for example, “Asymptomatic”) [Obs: 569]

Naive First Pass

We take the average count for each group, ignoring samples where no value was reported. We restrict the analysis to taxa with at least 30 non-zero values [1,564 taxa]. We found 77 taxa with an absolute t-score over 2.81 (p < 0.005).

taxa name | taxa rank | Shift | T_score
Prevotella copri | species | low in ME/CFS | -5.27
Prevotella | genus | low in ME/CFS | -4.52
Sporolactobacillaceae | family | low in ME/CFS | -4.2
Sporolactobacillus putidus | species | low in ME/CFS | -4.19
Sporolactobacillus | genus | low in ME/CFS | -4.19
Prevotellaceae | family | low in ME/CFS | -4.1
Firmicutes | phylum | high in ME/CFS | 3.94
Blautia | genus | high in ME/CFS | 3.91
Cetobacterium ceti | species | high in ME/CFS | 3.89
Cetobacterium | genus | high in ME/CFS | 3.84
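
As a rough illustration of this first pass, here is a minimal Python sketch for a single taxon. The function names, the use of None for "not reported", and the toy counts are my own assumptions for illustration. The post does not say which form of t-test was used, so Welch's unequal-variance statistic is assumed here, and the 30-value minimum is assumed to apply to the two groups combined.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two lists of numbers (assumed test form)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return None  # no variation in either group, so t is undefined
    return (mean(a) - mean(b)) / se


def naive_t(mecfs, control, min_reported=30):
    """t-score for one taxon using only the samples that reported it.

    None marks "not reported"; those samples are dropped, not set to zero.
    """
    a = [x for x in mecfs if x is not None]
    b = [x for x in control if x is not None]
    # Assumption: the minimum number of reported values applies to both groups combined.
    if len(a) + len(b) < min_reported or len(a) < 2 or len(b) < 2:
        return None
    return welch_t(a, b)


# Toy counts for one hypothetical taxon (not real data).
mecfs_counts = [None, 2.0, 4.0, 6.0]
control_counts = [1.0, None, 1.0, 3.0]
t = naive_t(mecfs_counts, control_counts, min_reported=6)
print(t)  # a taxon would be kept when abs(t) > 2.81
```

With real data the same function would be called once per taxon, keeping the taxa whose absolute t-score exceeds 2.81.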

Deeming Not Reported to be Zero

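Here every sample in which a taxon was not reported is given a count of zero for that taxon. In this case we have 78 taxa with an absolute t-score over 2.81, with slight changes in the t-scores.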

taxa name | taxa rank | Shift | T_score
Prevotella copri | species | low in ME/CFS | -5.31
Sporolactobacillaceae | family | low in ME/CFS | -4.63
Sporolactobacillus putidus | species | low in ME/CFS | -4.62
Sporolactobacillus | genus | low in ME/CFS | -4.62
Prevotella | genus | low in ME/CFS | -4.5
Prevotella oulorum | species | low in ME/CFS | -4.35
Prevotellaceae | family | low in ME/CFS | -4.08
Bifidobacterium gallicum | species | low in ME/CFS | -3.97
Firmicutes | phylum | high in ME/CFS | 3.94
Blautia | genus | high in ME/CFS | 3.91
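
The only change from the naive pass is how "not reported" is handled. A minimal Python sketch for one taxon follows; the function name, the None convention for "not reported", and the toy counts are assumptions for illustration, and Welch's form of the t statistic is again assumed because the post does not name the test.

```python
from math import sqrt
from statistics import mean, variance


def zero_filled_t(mecfs, control):
    """Welch-style t-score for one taxon with "not reported" (None) counted as 0."""
    a = [0.0 if x is None else x for x in mecfs]
    b = [0.0 if x is None else x for x in control]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return None  # no variation in either group, so t is undefined
    return (mean(a) - mean(b)) / se


# Toy counts for one hypothetical taxon (not real data).
mecfs_counts = [None, 2.0, 4.0, 6.0]
control_counts = [1.0, None, 1.0, 3.0]
t_zero = zero_filled_t(mecfs_counts, control_counts)
print(t_zero)
```

Zero-filling keeps every sample in the calculation, so it changes both the group means and the variances. That is why the t-scores shift slightly and a taxon can move into or out of the significant list.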


Using prevalence (the percentage of samples in each group that report a taxon), we followed the same process as above and limited the results to a Chi-2 probability of < 0.005 (as used above). We ended up with 65 taxa.

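taxa name | taxa rank | in MECFS % | Control % | Difference | (not labeled)
Erysipelothrix inopinata | species | 21 | 10.7 | 10.3 | 142
Mogibacterium vescum | species | 27.7 | 15.8 | 11.9 | 131.8
Haploplasma cavigenitalium | species | 8.5 | 2.8 | 5.7 | 133
Prosthecobacter fluviatilis | species | 7.7 | 2.5 | 5.3 | 123.1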
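
For readers who want to try a prevalence comparison themselves, here is a minimal Python sketch of a standard Pearson chi-square test on a 2x2 table (reported / not reported, by group). The post does not state exactly how its Chi-2 figure was computed, so this is an assumed textbook form and need not reproduce the numbers in the table above. The function name and the reported counts are hypothetical; only the group sizes (271 and 569) come from the post.

```python
from math import erfc, sqrt


def prevalence_chi2(reported_a, n_a, reported_b, n_b):
    """Pearson chi-square (1 degree of freedom) on a 2x2 reported / not-reported table.

    Returns (chi2, p_value). No continuity correction is applied.
    """
    a, b = reported_a, n_a - reported_a  # group A: reported, not reported
    c, d = reported_b, n_b - reported_b  # group B: reported, not reported
    n = n_a + n_b
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # taxon reported in every sample or in none
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = erfc(sqrt(chi2 / 2))  # exact upper tail for 1 degree of freedom
    return chi2, p_value


# Group sizes are from the post; the reported counts are made up for illustration.
chi2, p = prevalence_chi2(60, 271, 60, 569)
print(f"ME/CFS {60 / 271:.1%} vs control {60 / 569:.1%}: chi2={chi2:.2f}, p={p:.2g}")
```

Because the test only needs to know whether a taxon was reported, it never has to decide what count to assign to a sample where the taxon is absent.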

Comparing these two lists, we found only 6 taxa in common:

  • Bifidobacterium angulatum
  • Propionigenium modestum
  • Pseudomonas viridiflava
  • Cetobacterium ceti
  • Cetobacterium
  • Propionigenium

The net result is that we have 78 + 65 - 6 = 137 statistically significant taxa with p < 0.005.
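
The count above is just the size of the union of the two lists. A small Python sketch makes the bookkeeping explicit; the six shared names come from the list above, while the remaining entries are placeholder strings (an assumption for illustration, since the full lists are not reproduced in this post).

```python
shared = {
    "Bifidobacterium angulatum",
    "Propionigenium modestum",
    "Pseudomonas viridiflava",
    "Cetobacterium ceti",
    "Cetobacterium",
    "Propionigenium",
}

# Placeholder names stand in for taxa found by only one method;
# they are not real taxa names.
average_only = {f"average-only taxon {i}" for i in range(78 - len(shared))}
prevalence_only = {f"prevalence-only taxon {i}" for i in range(65 - len(shared))}

average_list = shared | average_only
prevalence_list = shared | prevalence_only

# Union size = |A| + |B| - |A and B|
combined = average_list | prevalence_list
print(len(average_list), len(prevalence_list),
      len(average_list & prevalence_list), len(combined))  # 78 65 6 137
```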

Bottom Line

There are at least two different statistical ways of determining significance. IMHO, the prevalence approach is likely to be the superior tool for diagnostic purposes, because the probability of a match to the above patterns can still be computed when some bacteria are not reported.

The full list of bacteria is available here.