Open Source: The challenge of picking what bacteria to alter

A rich 16S taxonomy report may contain thousands of species. Every documented modifier increases some species and decreases others, often dozens at a time. Even with 1800 potential modifiers, finding a perfect modifier is very hard, if not impossible.

This post describes a variety of approaches that could be taken. On an ideal site, all of these choices would be available so that a very knowledgeable consumer or medical professional can select the best candidate.

Axiomatic Approach

Here an expert states what they believe, usually based on clinical experience, to be a healthy microbiome. This typically risks being a regional definition of a healthy microbiome, one in which the diet and the DNA of the region are implicitly included.

One example is Jason Hawrelak in Hobart, Tasmania, Australia. If you are from Chennai, India and a vegetarian, many of his proportions may not apply. See this post for the discussion on how the microbiome varies by country, DNA and even longitude.

Ideal species proportions Example

Species                          Ideal proportion
Bacteroides spp                  <=20%
Faecalibacterium prausnitzii     >=10-15%
Eubacterium spp                  <=15%
Roseburia spp                    5-10%
Ruminococcus spp                 <=15%
Blautia spp                      5-10%
Total butyrate producers         >=40%
Bifidobacterium spp              >=2.5-5%
Akkermansia spp                  1-3%
Lactobacillus spp                0.01-1%
Escherichia coli                 <0.1%
Methanobrevibacter spp           ~0.01%
Bilophila wadsworthia            <0.01%
Desulfovibrio spp                <0.01%

Filter by medical conditions or symptoms

We may have 1000 species. From published studies, we know that the average for people with a condition, compared to controls, may be high or low (and occasionally both). Some of the people with the condition may have a normal value; it is the group as a whole that has a low or high value. These values may be connected to the diet, age or other confounders of the group being studied.

Repeatability of results

We may have 20 studies on the microbiome associated with Facebookitis. Some studies report the same species, while other species are reported in just one study. I could assume that the more studies mention a species, the more reliable its association is. So we may have Facebookamina found in 12 studies and Twitteramina in just 1. We have a multitude of choices: use only the species over some threshold (say the median number of studies), create a weighting from the number of studies (for example, Log(Number of Citations)), etc. This is one of the challenges of building your algorithm.
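The two choices mentioned above can be sketched in a few lines of Python. This is only an illustration: the counts are made up, "Instagramella" is a hypothetical third species added so that the median is meaningful, and the function names are my own.

```python
import math
from statistics import median

# Hypothetical number of studies reporting each species for one condition.
study_counts = {"Facebookamina": 12, "Twitteramina": 1, "Instagramella": 4}

def above_median(counts):
    """Keep only species reported in more studies than the median count."""
    cutoff = median(counts.values())
    return {name for name, n in counts.items() if n > cutoff}

def log_weights(counts):
    """Weight each species by log(number of studies)."""
    return {name: math.log(n) for name, n in counts.items()}

kept = above_median(study_counts)
weights = log_weights(study_counts)
```

Note one side effect of the log weighting: a species reported in a single study gets a weight of log(1) = 0, so it is effectively ignored. Whether that is desirable is part of the same design decision.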

Of the 1000 species, perhaps 30 are reported high or low in studies for Facebookitis. In our sample of 1000 species we find that we have 20 of them. Of these 20, we find that 12 have matching shifts to the studies. It is these 12 that are our candidates to shift. We ignore everything else.
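A minimal sketch of this filtering step, assuming that both the literature and the sample have already been reduced to a "high" or "low" label per species. The species names and labels are placeholders, not real data.

```python
# Direction reported in the literature for the condition (hypothetical).
reported = {"A": "high", "B": "low", "C": "low", "D": "high"}

# Direction seen in the person's sample; species we do not have are omitted.
observed = {"A": "high", "B": "high", "C": "low", "E": "low"}

def matching_candidates(reported, observed):
    """Species present in the sample whose shift matches the reported shift."""
    return sorted(s for s, d in observed.items() if reported.get(s) == d)

candidates = matching_candidates(reported, observed)
```

Here B is dropped because its shift goes the wrong way, D because it is not in the sample, and E because no study reports it, leaving A and C as the candidates to shift.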

Of the 10 we do not have, 3 are reported as low in the studies. Do we deem our missing value to be a low? If these three are only seen in 4% of the population, do we still deem them to be low? There is only about a 12% chance that these will be reported at all. Is this noise or significant?
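One way to read the 12% figure is as 3 species times 4% each. If we assume the three species show up independently of one another (an assumption, and probably not true for real bacteria), the chance of seeing at least one of them can be computed directly:

```python
def prob_any_detected(prevalence, n_species):
    """Chance that at least one of n independent species is detected.

    Assumes every species has the same prevalence and that detections
    are independent of each other.
    """
    return 1 - (1 - prevalence) ** n_species

p_any = prob_any_detected(0.04, 3)
```

This gives roughly 11.5%, close to the simple sum. Either way, not seeing these species in a sample is the expected outcome for most people.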

We could do this process for multiple other conditions that we have. I tend to avoid tossing in every condition because conditions often are interconnected. My preference is to always start with the most annoying condition or symptoms.

Weighting of one bacteria shift to another

There are many ways of weighting, that is, giving a value to the amount of shift. The weight is important when we try finding modifiers because we are trying to estimate the net expected benefit of each modifier so we can select the best modifiers.

The classic approach is naive: how many do you have compared to the average. I do not recommend this approach. Let us consider some factors involved:

We find that only 60% of people have any of this bacteria. We can compute the average two ways:

  • Average over those who have it: say 0.5%
  • Average over everyone (so those who do not have it are counted as zero): the average is now 0.3%

If your value is 0.48%, are you high or normal?
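The two averages are easy to reproduce. The sketch below uses a made-up group of 10 people in which 6 carry the bacteria at 0.5% each; the function names are my own.

```python
def mean_among_carriers(values):
    """Average over only those people who have the bacteria."""
    present = [v for v in values if v > 0]
    return sum(present) / len(present)

def mean_over_everyone(values):
    """Average over everyone, counting absent as zero."""
    return sum(values) / len(values)

# Hypothetical abundances in percent: 60% of people carry the bacteria.
abundances = [0.5] * 6 + [0.0] * 4

carrier_avg = mean_among_carriers(abundances)   # 0.5
everyone_avg = mean_over_everyone(abundances)   # 0.3

my_value = 0.48
below_carrier_avg = my_value < carrier_avg      # True
above_everyone_avg = my_value > everyone_avg    # True
```

The same 0.48% reading is slightly low against one average and clearly high against the other, so the answer depends entirely on which reference you pick.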

Suppose the average for a different bacteria (say a genus) is 20% and your value is 22%, while you are at 0.6% for the prior bacteria (using the 0.5% average). At a per million level you are 20,000 high for one and 1,000 high for the other. Do you give weights of 20,000 and 1,000? But one is 10% higher and the other is 20% higher. Surely you should give the one with the bigger relative shift a greater weight? Picking the weighting is another step of developing the algorithm.
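The two candidate weights from this example can be computed side by side. This is a sketch only; the function names are illustrative and neither formula is being recommended.

```python
def absolute_shift_ppm(value_pct, average_pct):
    """Difference from the average in counts per million (1% = 10,000)."""
    return (value_pct - average_pct) * 10_000

def relative_shift(value_pct, average_pct):
    """Difference from the average as a fraction of the average."""
    return (value_pct - average_pct) / average_pct

genus_abs = absolute_shift_ppm(22, 20)     # about 20,000 per million
genus_rel = relative_shift(22, 20)         # about 0.10, i.e. 10% higher

prior_abs = absolute_shift_ppm(0.6, 0.5)   # about 1,000 per million
prior_rel = relative_shift(0.6, 0.5)       # about 0.20, i.e. 20% higher
```

The absolute weight ranks the genus first by a factor of 20, while the relative weight ranks the prior bacteria first by a factor of 2.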

If one of the bacteria happens to be Clostridium difficile, you likely want it to be zero. This seems like an exception to any logic you developed above.

Hand Picking

The above methods are mechanical. People often have experience or beliefs. Hand picking means going through and selecting the species one by one, after looking at the literature and associations for each species that is outside of the expected range.

Expected Range

The expected range can be computed many ways. The classic lab approach is to compute the average and then the standard deviation. The normal range becomes mean - 2 std dev to mean + 2 std dev. IF THIS WAS A TRUE BELL CURVE, this means that about 5% fall outside the range: roughly 2.5% above and 2.5% below. I strongly recommend against this approach.

I have moved on to actual percentiles from the lab data. So if you want to use the 5% criterion, you look it up against the actual data.
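To see why the bell curve assumption matters, here is a small comparison on made-up, heavily skewed abundance values (most people have very little, a few have a lot). It uses only Python's standard `statistics` module; the data and function names are hypothetical.

```python
from statistics import mean, stdev, quantiles

# Hypothetical abundances (percent) for one species across 20 samples.
values = [0.01, 0.02, 0.02, 0.03, 0.05, 0.05, 0.08, 0.1, 0.1, 0.2,
          0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0]

def sd_range(data):
    """Classic range: mean - 2 std dev to mean + 2 std dev."""
    m, s = mean(data), stdev(data)
    return m - 2 * s, m + 2 * s

def percentile_range(data, tail_pct=5):
    """Range between the tail_pct and (100 - tail_pct) percentiles."""
    cuts = quantiles(data, n=100)  # 99 cut points: 1st to 99th percentile
    return cuts[tail_pct - 1], cuts[100 - tail_pct - 1]

sd_low, sd_high = sd_range(values)
pct_low, pct_high = percentile_range(values)
```

On this data the lower bound from the standard deviation method is negative, which no abundance can ever be, so nobody could be flagged as low. The percentile bounds always sit inside the observed data.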

My own preference is 10%, with the additional criterion that a strongly supporting (correlated) species must also be at 10%. The goal is to identify a species-conspiracy and address (arrest) them as a whole.
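A rough sketch of this two-part test follows. It assumes each species has already been converted to a percentile rank (0 to 100) against reference data, that the list of strongly correlated partners comes from somewhere else, and that a positively correlated partner should sit in the same tail. All names and numbers are hypothetical.

```python
def in_tail(percentile, tail=10):
    """Return 'low', 'high' or None for a percentile rank in 0..100."""
    if percentile <= tail:
        return "low"
    if percentile >= 100 - tail:
        return "high"
    return None

def conspiracies(percentiles, partners, tail=10):
    """Flag a species only if it and its correlated partner share a tail."""
    flagged = {}
    for species, partner in partners.items():
        side = in_tail(percentiles.get(species, 50), tail)
        if side and side == in_tail(percentiles.get(partner, 50), tail):
            flagged[species] = side
    return flagged

# Hypothetical percentile ranks for one person, and assumed partner pairs.
percentiles = {"A": 96, "B": 93, "C": 4, "D": 40}
partners = {"A": "B", "C": "D"}

flagged = conspiracies(percentiles, partners)
```

Here A is flagged as high because its partner B is also in the top 10%, while C is ignored because its partner D is in the normal range.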

This is one more decision that you need to make in developing an algorithm.


Modifiers are in the same situation as diseases and symptoms: multiple studies with different results. Existing diet, DNA, etc. may be confounders of the published studies.

Rather than repeat the discussions above (how many studies reported the same thing, etc.), just re-read them. A major confounder is that different studies may not have tested for the same things. A no-change result will often be omitted from a study, leaving it uncertain what was actually tested for.

Bottom Line

Where are we:

  • We have a collection of bacteria shifts which may be due solely to diet or DNA, with no association to any condition or symptoms once diet and DNA are taken into account.
  • We need to identify which bacteria to change (a simple true/false)
  • We need to give a value/weight for how relatively important each bacteria is to change.
  • We have modifiers which are likely to impact these bacteria (a simple true/false)
  • We need to give a value/weight for how certain each bacteria is to be changed.

Now we need to optimise across all of these variables to get the optimal suggestions. The item to be optimized is the estimated weighted shift of taking a set of modifiers against a set of bacteria. The key word is estimated.
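To make the quantity being optimized concrete, here is one possible scoring sketch. It is not the algorithm of any particular site: the formula (wanted direction times effect direction times importance times certainty, summed over bacteria) is an assumption chosen for simplicity, and all names and numbers are made up.

```python
# Bacteria to change: wanted direction (+1 raise, -1 lower) and importance.
targets = {"A": (+1, 2.0), "B": (-1, 1.0)}

# Modifier effects per bacteria: direction (+1 increases, -1 decreases)
# and certainty in the range 0..1 (for example, derived from study counts).
modifiers = {
    "modifier_x": {"A": (+1, 0.8), "B": (+1, 0.5)},
    "modifier_y": {"A": (+1, 0.4), "B": (-1, 0.9)},
}

def estimated_benefit(effects, targets):
    """Sum of weighted shifts; positive when shifts go the wanted way."""
    total = 0.0
    for species, (direction, certainty) in effects.items():
        if species in targets:
            wanted, importance = targets[species]
            total += wanted * direction * importance * certainty
    return total

def rank_modifiers(modifiers, targets):
    """Modifier names ordered from highest to lowest estimated benefit."""
    return sorted(modifiers,
                  key=lambda m: estimated_benefit(modifiers[m], targets),
                  reverse=True)

ranking = rank_modifiers(modifiers, targets)
```

In this toy data modifier_x helps A strongly but pushes B the wrong way, while modifier_y helps both a little, so modifier_y ranks first. Change the importance or certainty values and the ranking can flip, which is exactly why the key word is estimated.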
