Exploration in Picking Critical Bacteria

This post continues a nerdy technical series; the previous installment was Predicting Conditions from PubMed Studies. I continue to use depression as my canary, working with two data sets from two different retail providers: Ombre Labs (USA) and Biomesight (UK).

Means, T-Scores – Walking Two Paths

There are two paths available, with no clear best choice:

  • Compare averages from samples that reported some amount (thus excluding zero values)
    • Fewer degrees of freedom and thus lower T-Scores
  • Compare averages across all samples, using a zero when an item is not reported
    • More degrees of freedom and thus higher T-Scores
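As a rough illustration of the two paths, here is a minimal sketch that computes a T-Score for one taxon both ways. It assumes Welch's unequal-variance form of the t-statistic, which may not be the exact test used for the numbers in this post; the function names and the sample counts are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def t_both_paths(with_condition, without_condition):
    """Return (t_reported_only, t_zero_filled) for one taxon.

    Each argument is a list of counts per sample, where None means the
    taxon was not reported for that sample.
    """
    # Path 1: keep only the samples that reported some amount.
    rep_a = [x for x in with_condition if x is not None]
    rep_b = [x for x in without_condition if x is not None]
    # Path 2: treat every unreported sample as a zero count.
    zero_a = [x or 0 for x in with_condition]
    zero_b = [x or 0 for x in without_condition]
    return welch_t(rep_a, rep_b), welch_t(zero_a, zero_b)


# Made-up counts per million for a single taxon, for illustration only.
depressed = [120, None, 300, 250, None, 180]
controls = [90, 110, None, 60, 100, None, None, 80]
t_reported, t_zero = t_both_paths(depressed, controls)
print(f"reported only: {t_reported:.2f}, zero if missing: {t_zero:.2f}")
```

Path 1 needs at least two reporting samples in each group, otherwise the variance is undefined; rarely seen taxa can only be scored by path 2.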

Most researchers seem to pick one of these paths a priori. To get a feel for the differences, I processed data from Ombre Labs and Biomesight by both methods. Ombre Labs samples are almost solely American; Biomesight samples come from around the world. The reported rates (23% for Ombre Labs versus 12% for Biomesight) suggest that Americans are more prone to depression.

| Measure | Biomesight | Ombre Labs |
|---|---|---|
| Bacteria Taxa Seen | 1,878 | 2,586 |
| With Depression | 87 | 101 |
| Without Depression | 639 | 343 |
| Percentage with Depression Reported | 12% | 23% |
| Highest T-Score with reported only (no zero values) | 95 | 60 |
| Abs(T-Score) over 3.2 (no zero values) | 1,135 | 1,185 |
| Highest T-Score with zero if missing | 282 | 176 |
| Abs(T-Score) over 3.2 (with zero if missing) | 1,391 | 1,492 |
| Highest Composite T-Score (square root of the product of the two T-Scores above) | 158 | 92 |
| Composite T-Score over 3.2 (p < 0.001) | 1,193 | 1,306 |
| Percentage of Bacteria with T-Score over 3.2 | 63% | 50% |

Note: a larger total sample size results in higher T-Scores.
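The composite T-Score in the table above is the square root of the two T-Scores multiplied together, which is their geometric mean. A minimal sketch follows. How to treat a taxon whose two T-Scores point in opposite directions is not stated here, so returning zero in that case is my assumption, and the taxon names and scores are hypothetical.

```python
from math import sqrt


def composite_t(t_reported, t_zero_filled):
    """Geometric mean of the two t-score magnitudes for one taxon.

    Assumption: when the two t-scores disagree in sign (or one is zero),
    the taxon is treated as having no consistent signal and 0.0 is returned.
    """
    if t_reported * t_zero_filled <= 0:
        return 0.0
    magnitude = sqrt(abs(t_reported) * abs(t_zero_filled))
    # Keep the shared direction of the shift (positive or negative).
    return magnitude if t_reported > 0 else -magnitude


THRESHOLD = 3.2  # cut-off used in the table above

# Hypothetical (reported-only, zero-if-missing) t-scores per taxon.
scores = {"taxon A": (4.0, 9.0), "taxon B": (-2.0, -3.0), "taxon C": (5.0, -6.0)}
composites = {name: composite_t(*pair) for name, pair in scores.items()}
significant = [name for name, t in composites.items() if abs(t) > THRESHOLD]
print(composites)
print(significant)
```

Because it is a geometric mean, the composite is pulled toward the smaller of the two scores, so a taxon needs a reasonably strong signal on both paths to clear the cut-off.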

We then computed a composite T-Score across the two labs for each bacterium to find the best indicators, listed below in decreasing order of significance, and compared them with results from studies on the US National Library of Medicine.

| Bacteria | Direction | Comparison with studies |
|---|---|---|
| Blautia wexlerae | Increased | Genus has 50% high and 50% low |
| Catonella morbi | Increased | |
| Sutterella wadsworthensis | Decreased | Match (genus) |
| Haemophilus parainfluenzae | Decreased | Match (genus) |
| Bacteroides ovatus | Decreased | |
| Prevotella | Decreased | General match at genus |
| Veillonella dispar | Decreased | Match (genus) |
| Blautia glucerasea | Increased | Genus has 50% high and 50% low |
| Blautia hansenii | Increased | Genus has 50% high and 50% low |
| Lachnospira | Decreased | General match at genus |
| [Ruminococcus] torques | Increased | Disagree |
| Desulfallaceae Watanabe et al. 2020 | Increased | |
| Bifidobacterium bifidum | Decreased | Disagrees, but the genus was a match |

Where a study was available, our results were in good agreement with it. Note that our sample sizes far exceed those of any of the studies we found.

Eliminating the samples with no/zero counts left us with about 20% fewer significant bacteria. Requiring a bacterium to be significant under both methods reduces the risk of false positives. In this case we have an abundance of significant bacteria, so such a criterion is viable.
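The "significant under both methods" criterion can be sketched as a simple filter. The function name is hypothetical; the Erwinia tasmaniensis scores are the ones quoted in this post, while the other two taxa and their scores are invented for illustration.

```python
THRESHOLD = 3.2


def significant_by_both(t_reported, t_zero_filled, threshold=THRESHOLD):
    """Taxa whose |t| exceeds the threshold under both methods.

    Both arguments map taxon name -> t-score; a taxon missing from either
    mapping is treated as not significant.
    """
    return sorted(
        name
        for name in t_reported.keys() & t_zero_filled.keys()
        if abs(t_reported[name]) > threshold
        and abs(t_zero_filled[name]) > threshold
    )


# T-scores when excluding samples that did not report the taxon.
t_reported = {"Erwinia tasmaniensis": 11.2, "taxon X": 2.1, "taxon Y": -4.0}
# T-scores when a zero is used for samples that did not report the taxon.
t_zero_filled = {"Erwinia tasmaniensis": 23.0, "taxon X": 5.5, "taxon Y": -7.3}

print(significant_by_both(t_reported, t_zero_filled))
```

Here "taxon X" is dropped because it clears the cut-off on only one path. A stricter variant could also require the two scores to share the same sign.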

Many of the items deemed very significant were seen in only a small fraction of samples. For example:

  • Erwinia tasmaniensis: seen in 7.2% of Biomesight samples, not reported in Ombre Labs.
    • With Condition: seen in 3% of samples, with a count of 353 per million
    • Without Condition: seen in 7.7% of samples, with a count of 128 per million
    • Significant when excluding zeros (T-Score 11.2) and when deeming missing values zero (T-Score 23)
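To make the low frequency concrete, the figures above can be turned into approximate sample counts and zero-filled averages. This is only back-of-the-envelope arithmetic: it assumes the quoted counts are averages over the samples that reported the taxon, it uses the Biomesight group sizes from the earlier table (87 with depression, 639 without), and the percentages are rounded in the source, so the results are approximate.

```python
def zero_filled_mean(prevalence, mean_when_reported):
    """Mean over all samples when unreported samples count as zero."""
    return prevalence * mean_when_reported


def reporting_samples(prevalence, group_size):
    """Approximate number of samples in a group that reported the taxon."""
    return round(prevalence * group_size)


# Erwinia tasmaniensis in Biomesight, figures as quoted above.
n_with = reporting_samples(0.03, 87)
n_without = reporting_samples(0.077, 639)
mean_with = zero_filled_mean(0.03, 353)
mean_without = zero_filled_mean(0.077, 128)

print(f"reporting samples: {n_with} with, {n_without} without")
print(f"zero-filled mean per million: {mean_with:.1f} with, {mean_without:.1f} without")
```

Under those assumptions only about 3 samples with the condition reported this taxon, and the 353 versus 128 gap among reporting samples shrinks to roughly 10.6 versus 9.9 once zeros are included.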

This takes us to the next phase of the analysis: given a random sample, what is the probability of correctly predicting depression? The end goal is to determine the most influential bacteria responsible for depression. One approach is to use a random forest on the data trimmed to those bacteria deemed significant or very significant. Another approach is to use splines mixed with logistic regression. We must be careful not to shove this problem into an existing square or round hole; instead, we need to allow the data to speak to us.
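Whatever model is chosen, a useful yardstick is the accuracy of always guessing the more common class. The sketch below computes that baseline from the sample counts reported earlier in this post. It is not part of the analysis itself, only a reference point that a random forest or logistic regression would need to beat, and the function name is hypothetical.

```python
def majority_baseline_accuracy(with_condition, without_condition):
    """Accuracy of always predicting the more common class."""
    total = with_condition + without_condition
    return max(with_condition, without_condition) / total


# (samples with depression, samples without) per lab, from the earlier table.
labs = {"Biomesight": (87, 639), "Ombre Labs": (101, 343)}

baselines = {
    lab: majority_baseline_accuracy(dep, non) for lab, (dep, non) in labs.items()
}
for lab, accuracy in baselines.items():
    print(f"{lab}: {accuracy:.1%}")
```

With these counts, always predicting "no depression" is already right about 88% of the time for Biomesight and about 77% for Ombre Labs, so raw accuracy alone says little about how well a model picks out the depressed samples.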

Remember, bacteria do not operate independently of each other; they are highly dependent on each other's metabolites. Some of my earlier explorations found that KEGG enzyme estimators were better predictors than bacteria. To be continued.