Bacteria interacting with Bacteria

Bacteria are like people: they interact with and influence one another. The problem is how to detect the interactions that are clinically significant, and the direction of each interaction, without grabbing at stereotypes the way people do when they generalize about whole nationalities.

We are going to look at two bacteria interacting: Phocaeicola massiliensis and Paraprevotella.

To see the results for other bacteria, look up your favorite bacteria at MicrobiomePrescription: Look up a bacteria taxa. See the video at the bottom for a walkthrough.

The Classic Way

From a collection of samples, we pull all samples that report both bacteria. We drop these numbers into a tool like Excel, chart the data, and try a linear regression. This is often done pro forma in research papers because it is what people learned by rote.
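The classic approach can be sketched in a few lines of Python instead of Excel. This is a minimal sketch, not the code behind this post: the counts are hypothetical, and `r_squared` is a helper name introduced here for illustration.

```python
from math import fsum


def r_squared(xs, ys):
    """R^2 of the least-squares line through (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n
    sxx = fsum((x - mean_x) ** 2 for x in xs)
    syy = fsum((y - mean_y) ** 2 for y in ys)
    sxy = fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no linear signal
    return (sxy * sxy) / (sxx * syy)


# Hypothetical counts for two taxa, one pair per sample reporting both.
counts_a = [120, 15, 900, 40, 3000, 75]
counts_b = [300, 2500, 90, 1800, 60, 700]
print(f"R^2 on raw counts: {r_squared(counts_a, counts_b):.3f}")
```

For simple linear regression, R² equals the squared Pearson correlation, which is what the helper computes.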

A Uniform Way

This is almost the same, except that we use the percentile rankings instead of the actual numbers. Using percentiles transforms the data to a uniform distribution, which produces stronger regression values. R2 increased tenfold but is still a long way from significance.

You can almost see signs of a trend in the middle of lots of noise.
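The percentile transform can be sketched as follows. This is an illustration under stated assumptions: ties share a mid-rank, the counts are hypothetical, and `percentile_ranks` and `r_squared` are helper names introduced here, not the site's own code.

```python
from bisect import bisect_left, bisect_right
from math import fsum


def percentile_ranks(values):
    """Map each value to its percentile rank (0-100) within the series.

    Tied values share the mid-rank (an assumption of this sketch).
    """
    n = len(values)
    ordered = sorted(values)
    ranks = []
    for v in values:
        below = bisect_left(ordered, v)
        equal = bisect_right(ordered, v) - below
        ranks.append(100.0 * (below + 0.5 * equal) / n)
    return ranks


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a simple linear fit."""
    n = len(xs)
    mx, my = fsum(xs) / n, fsum(ys) / n
    sxx = fsum((x - mx) ** 2 for x in xs)
    syy = fsum((y - my) ** 2 for y in ys)
    sxy = fsum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return 0.0 if sxx == 0 or syy == 0 else (sxy * sxy) / (sxx * syy)


# Hypothetical counts; the regression is run on the ranks, not the counts.
counts_a = [120, 15, 900, 40, 3000, 75]
counts_b = [300, 2500, 90, 1800, 60, 700]
ranks_a = percentile_ranks(counts_a)
ranks_b = percentile_ranks(counts_b)
print(f"R^2 on percentile ranks: {r_squared(ranks_a, ranks_b):.3f}")
```

Because the ranks are evenly spread, a single extreme count can no longer dominate the fit.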

A Non-parametric Way

We use the classic Chi2 test. The process is simple:

  • For bacteria A we determine the percentage of samples with a value of 100 or higher, say 5%
  • For bacteria B we determine the percentage of samples with a value of 1000 or higher, say 5%
  • We filter the samples to those with bacteria A at 100 or higher
  • If there are no interactions, then we expect 5% of the filtered samples to have bacteria B at 1000 or more
  • If we find that 30% of the filtered samples have bacteria B at 1000 or more, then it appears that high levels of A result in higher levels of B

From the above we can compute a statistic, Chi2, and thus the statistical significance. In this case, the result is very significant.
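The steps above can be sketched as a small calculation. The post does not spell out the exact Chi2 formula, so this sketch assumes a two-cell goodness-of-fit test (at or above the threshold versus below it) with 1 degree of freedom. The filtered sample size of 200 is hypothetical, and `chi2_for_threshold` is a name introduced here.

```python
from math import erfc, sqrt


def chi2_for_threshold(obs, filtered_count, expected_pct):
    """Two-cell goodness-of-fit Chi2 with 1 degree of freedom.

    obs            -- filtered samples with B at or above its threshold
    filtered_count -- samples remaining after filtering on A
    expected_pct   -- percentage of all samples with B at or above its threshold
    """
    expected = filtered_count * expected_pct / 100.0
    other_obs = filtered_count - obs
    other_expected = filtered_count - expected
    if expected <= 0 or other_expected <= 0:
        raise ValueError("expected counts must be positive in both cells")
    chi2 = ((obs - expected) ** 2 / expected
            + (other_obs - other_expected) ** 2 / other_expected)
    # Survival function of the Chi2 distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return expected, chi2, p_value


# 5% expected, 30% observed, in a hypothetical 200 filtered samples.
expected, chi2, p_value = chi2_for_threshold(obs=60, filtered_count=200, expected_pct=5)
print(f"expected={expected:.0f} chi2={chi2:.1f} p={p_value:.3g}")
```

If the observed count equals the expected count, Chi2 is 0 and the p-value is 1, which is the "no interaction" case from the list.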

This means that we isolate the impact of high values and low values, which the earlier methods did not do. We do not know how the middle values interact, but for clinical issues it is the abnormally high and abnormally low values that are of interest.

Implementation

The first question is how to pick the high and low threshold values. People can pick arbitrary values and try them. My own preference is a patent-pending algorithm that produces the ranges.

The second question or issue is the number of computations. People can download my data set from https://citizenscience.microbiomeprescription.com/ and do the same calculation.

The calculations were done on the following datasets with 5%ile and above, and 95%ile and above. The bigger the sample, the better the sensitivity and the more interactions are likely to be discovered.

  • All: 5,191,562 possible pairs on 5189 samples -> 1,270,000+ interactions found
  • Biomesight: 1,717,410 possible pairs on 2534 samples -> 275,000+ interactions found
  • Ombre: 1,743,720 possible pairs on 1540 samples -> 220,000+ interactions found
  • uBiome: 132,860 possible pairs on 791 samples -> 4,700+ interactions found

For each pair of taxa we have 4 scenarios (Low vs Low, Low vs High, High vs High, High vs Low), or about 32 million queries retrieving data sets and performing calculations. The bigger the sample size, the more items are likely to be identified. For thresholds, we use a patent-pending algorithm that appears to yield good results (shown above). The alternative would be to enumerate percentages and find the ones that work best (so 100 x 100 x 32 million = 320,000,000,000 queries).

An illustration of the query for one scenario (Low vs Low) is below.

SELECT SUM(CASE WHEN c1.Percentile < 19.274700171330668 /* 204516 low percentile threshold */
                THEN 1 ELSE 0 END) AS Obs,              /* low count within the filtered sample */
       COUNT(1) AS Cnt,                                 /* filtered sample count */
       CAST(COUNT(1) * 19.274700171330668 / 100 AS float) AS [Expected Value]
FROM UserCounts c1
JOIN UserCounts c2
  ON c1.SampleId = c2.SampleId
 AND c1.Taxon = 204516
 AND c2.Taxon = 577309
JOIN Users U ON c1.SampleId = SequenceId
WHERE c2.Percentile < 12.890741292051205 /* 577309 low percentile threshold */
GROUP BY c1.Taxon, c2.Taxon

dependent  independent  Label  Obs  Direction  Expected  Chi2  %
204516     577309       L,L    158  >          79        98.5  200%
204516     577309       L,H    204  <          238       6     86%
204516     577309       H,L    138  <          202       39.5  68%
204516     577309       H,H    724  >          609       43.1  119%

Taxa: 204516 (dependent) is Phocaeicola massiliensis; 577309 (independent) is Paraprevotella.

Example of results with counts
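The Direction and % columns in these results follow directly from Obs and Expected. The sketch below recomputes them from the four rows shown; `direction_and_pct` is a name introduced here for illustration. Chi2 is not recomputed, because the Expected values shown are rounded and the filtered sample counts are not listed.

```python
# (Label, Obs, Expected) as shown in the results above.
rows = [
    ("L,L", 158, 79),
    ("L,H", 204, 238),
    ("H,L", 138, 202),
    ("H,H", 724, 609),
]


def direction_and_pct(obs, expected):
    """Return '>' or '<' (or '=') and Obs as a whole percentage of Expected."""
    if obs > expected:
        direction = ">"
    elif obs < expected:
        direction = "<"
    else:
        direction = "="
    return direction, round(100.0 * obs / expected)


for label, obs, expected in rows:
    direction, pct = direction_and_pct(obs, expected)
    print(f"{label}: Obs {obs} {direction} Expected {expected} ({pct}%)")
```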

There is a question of whether to use Chi2 or the percentage increase or decrease.

Example:

  • High Paraprevotella: we get more high counts and fewer low counts of Phocaeicola massiliensis. In other words, Phocaeicola massiliensis numbers increase as a result (i.e. the median likely moved up).
  • Low Paraprevotella: we get fewer high counts and more low counts of Phocaeicola massiliensis. In other words, Phocaeicola massiliensis numbers decrease as a result (i.e. the median likely moved down).

With linear regression, we do not see this relationship.

Chi2 Low  Chi2 High  Number of Interactions Found
6         50         1427931
50        150        301973
150       250        35584
250       350        7966
350       450        3172
450       550        1303
550       649        695
950       1050       610
650       750        505
750       850        483
851       950        376
1051      1150       349
1250      1350       223
1150      1250       218
1350      1450       213
1551      1650       158
1450      1549       153
1651      1750       141
1850      1949       138
1751      1847       126
1950      2050       98
2050      2150       83
2150      2249       68
2252      2348       64
2350      2449       56
2451      2548       33
2551      2650       31
2951      3043       28
2753      2845       27
2659      2741       27
3057      3150       26
3153      3242       23
3253      3348       20
2856      2949       19
3862      3948       18
3365      3450       16
4057      4141       14
3470      3548       12
3752      3842       12
3555      3625       9
3652      3749       8
3956      4046       8
4162      4248       6
4278      4329       3
4867      4905       3
5763      5782       2
4373      4423       2
5102      5139       2
4771      4771       1
5189      5189       1
4737      4737       1

If you use your own limits, you can compare your results against this distribution to determine whether your limits are better or not.
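To make that comparison, the Chi2 values produced by your own limits can be grouped the same way. This is a minimal sketch: the bin `width` and `offset` are assumptions chosen to resemble the ranges in the distribution above, the input values are hypothetical, and `chi2_histogram` is a name introduced here.

```python
from collections import defaultdict


def chi2_histogram(chi2_values, width=100.0, offset=50.0):
    """Group Chi2 values into fixed-width bins.

    Returns (low, high, count) per non-empty bin, where low and high are the
    smallest and largest values actually seen in that bin, sorted by count
    (largest first).
    """
    bins = defaultdict(list)
    for value in chi2_values:
        index = int((value + offset) // width)  # bins: [-50, 50), [50, 150), ...
        bins[index].append(value)
    rows = [(min(vals), max(vals), len(vals)) for vals in bins.values()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical Chi2 values from a run with your own limits.
for low, high, count in chi2_histogram([6, 20, 49, 60, 140, 200]):
    print(low, high, count)
```

A set of limits that shifts more pairs into the higher Chi2 ranges is detecting stronger interactions on the same data.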

Next Project

For many taxa shifts, nothing in the literature is known to affect the taxa in a way that can be used in a clinical context. Identifying taxa with a strong interaction that we can affect should allow us to influence the target taxa indirectly. Yes, it gets complex, but with modern computing power it is very possible to do.