One Stool, Two Samples, One Lab — What the shit!

A reader sent me the message below and gave permission to use his sample. About a year ago I wrote The taxonomy nightmare before Christmas…, which looks at the differences between lab results using the same sample (as represented by a FASTQ digital file). We now try one more variation.

Last September I tested my microbiome with Thryve again. Because I had some general doubts about the validity of stool samples, I ordered two tests, took two different samples of the same stool, and sent them in under two different names.
…the results confirmed my doubts, as I got different bacteria levels for the ten strains Thryve shows in their overview.

STRAINS    % sample 1    % sample 2

So I do not doubt the reliability of each sample, but see that the validity of the sample is the problem. The results of a sample seem to be more or less random and not representative of the microbiome in general.
…so I think that any advice given, based on the results of one sample is arbitrary. If we are to take the importance of the microbiome seriously, we will have to consider a new way of getting a representative sample to have a solid base for interventions concerning our health.

Sampling Statistics

The typical sample seems to contain around 100,000 bacteria and is usually reported out of a million (scaled up). “Bacteria in faeces have been extensively studied. It’s estimated there are nearly 100 billion bacteria per gram of wet stool.” [src] The sample that you sent in was likely no more than one milligram.
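Here is a rough sketch of that arithmetic, plus a toy simulation of drawing two samples from the same stool. The abundance table is made up for illustration, and the simulation assumes a perfectly mixed stool where every read is an independent random draw, which real stool is not. It is not how any lab processes a sample.

```python
import random
from collections import Counter

BACTERIA_PER_GRAM = 100_000_000_000  # "nearly 100 billion per gram" from the quote
SAMPLE_MILLIGRAMS = 1                # the one-milligram guess above
READS = 100_000                      # approximate bacteria counted per sample
REPORT_SCALE = 1_000_000             # results are scaled up to "per million"

bacteria_in_sample = BACTERIA_PER_GRAM * SAMPLE_MILLIGRAMS // 1000
fraction_counted = READS / bacteria_in_sample  # share of the milligram actually counted

# Hypothetical true abundances (assumption, not real data).
TRUE_ABUNDANCE = {
    "common": 0.60,
    "moderate": 0.3989,
    "uncommon": 0.001,
    "rare": 0.0001,
}

def draw_sample(rng, reads=READS):
    """Draw `reads` bacteria at random and scale the counts to per-million."""
    names = list(TRUE_ABUNDANCE)
    weights = list(TRUE_ABUNDANCE.values())
    counts = Counter(rng.choices(names, weights=weights, k=reads))
    return {name: counts[name] * REPORT_SCALE // reads for name in names}

rng = random.Random(1)
sample_1 = draw_sample(rng)
sample_2 = draw_sample(rng)

print("bacteria in one milligram:", bacteria_in_sample)
print("fraction actually counted:", fraction_counted)
for name in TRUE_ABUNDANCE:
    print(name, sample_1[name], sample_2[name])
```

A taxon at 0.01% is expected to show up only about 10 times in 100,000 reads, so chance alone moves its reported level noticeably between two draws. Uneven mixing within the stool would add further variation on top of this.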

To use the “if I was a Martian” model… It is like a spaceship abducting a boatload of people in the Mediterranean…. If the boat is a cruise ship full of fat diabetic elderly Americans, you will get one result. If the boat is full of starving Nigerian children trying to become refugees in Europe, you will get a very different result. That is a disturbing concept when your mind is fixed on a deterministic, precise, definitive result. It’s a sample, folks! For most industrial processes, dozens (or hundreds) of samples are required to get quality assurance. For the nerds, some readings: [2015] [Wikipedia]

Example: Two employees working for the same company at the same job earning the same amount and living in the same community. You stop each of them and take a sample of how much money they have in their wallet.
Would you expect them to have the same amount? Would they have the same number of pennies? dimes? quarters? Credit Cards?

A Sampling Analogy

Reviewing these two samples

Fortunately, I already have sample comparison tools on my site.

Diversity By Taxonomy Rank

I would expect differences between samples to increase as you move down the ranks. It is similar to asking, at one level, [European, African, Asian] on the abducted ship above. At the next level [Swede, Dane, Italian, etc.], the counts between samples will diverge as the classification becomes more detailed.
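A small sketch of why this happens, using made-up counts for two samples. It rolls the same counts up to phylum, genus, and species and measures the difference at each rank with Bray-Curtis dissimilarity (sum of absolute differences divided by the combined total). Bray-Curtis is my choice for the illustration, not necessarily what any comparison tool uses, and every number here is hypothetical.

```python
from collections import defaultdict

# Hypothetical counts keyed by (phylum, genus, species). Made-up numbers.
SAMPLE_1 = {
    ("Firmicutes", "Faecalibacterium", "F. prausnitzii"): 300,
    ("Firmicutes", "Roseburia", "R. intestinalis"): 120,
    ("Firmicutes", "Roseburia", "R. faecis"): 80,
    ("Bacteroidetes", "Bacteroides", "B. uniformis"): 350,
    ("Bacteroidetes", "Bacteroides", "B. vulgatus"): 150,
}
SAMPLE_2 = {
    ("Firmicutes", "Faecalibacterium", "F. prausnitzii"): 260,
    ("Firmicutes", "Roseburia", "R. intestinalis"): 60,
    ("Firmicutes", "Roseburia", "R. faecis"): 160,
    ("Bacteroidetes", "Bacteroides", "B. uniformis"): 200,
    ("Bacteroidetes", "Bacteroides", "B. vulgatus"): 320,
}
RANKS = ["phylum", "genus", "species"]

def roll_up(sample, depth):
    """Sum counts over the first `depth` levels of each lineage."""
    totals = defaultdict(int)
    for lineage, count in sample.items():
        totals[lineage[:depth]] += count
    return dict(totals)

def bray_curtis(a, b):
    """0.0 means identical counts, 1.0 means nothing in common."""
    keys = set(a) | set(b)
    diff = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    return diff / (sum(a.values()) + sum(b.values()))

dissimilarity = {
    rank: bray_curtis(roll_up(SAMPLE_1, depth), roll_up(SAMPLE_2, depth))
    for depth, rank in enumerate(RANKS, start=1)
}
for rank in RANKS:
    print(rank, round(dissimilarity[rank], 3))
```

With these toy counts the two samples are nearly identical at the phylum level and clearly different at the species level. Merging categories lets opposite differences cancel out, so with this measure the coarser rank can never look more different than the finer one.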

This illustrates why I use fuzzy logic for predicting symptoms, with good success according to readers. Using studies from PubMed has been reported by readers to produce poor results.

When the two samples are used to predict symptoms, we have a strong convergence. While the actors may be different, their impacts are similar.

Adjusting for Natural Variation

Using counts without context is a good way to get upset without justification. I use percentiles to provide context and have a comparison page (which I need to revise). At the phylum level we see general agreement between the samples. One rare phylum was lacking in one sample (not found in 30% of Thryve samples but only 6% of BiomeSight samples – hint: download the FASTQ files and process them through BiomeSight [for free!]).
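To show what putting a count in context can look like, here is a minimal sketch of a percentile rank against a reference set of samples. The reference counts are invented, and the definition used (percent of reference samples strictly below the value) is one of several common ones. It is an assumption for illustration, not a description of how my site computes percentiles.

```python
from bisect import bisect_left

# Hypothetical counts per million for one bacterium across ten reference samples.
REFERENCE = [0, 0, 0, 120, 300, 450, 800, 1500, 4000, 9000]

def percentile_rank(value, reference):
    """Percent of reference samples with a count strictly below `value`."""
    ordered = sorted(reference)
    return 100.0 * bisect_left(ordered, value) / len(ordered)

count_sample_1 = 800    # made-up count from the first sample
count_sample_2 = 1500   # made-up count from the second sample

print(percentile_rank(count_sample_1, REFERENCE))
print(percentile_rank(count_sample_2, REFERENCE))
```

In this toy case one count is almost double the other, yet both sit in the middle of the reference range. That is the sense in which raw counts can look alarming while the percentiles tell a calmer story.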

Medical Condition Matches

Going over to the PubMed medical condition matches, we see a striking similarity between the samples as shown below. So for detecting medical conditions, they are almost identical to each other (despite the differences in bacteria).

End Products Predictions

Again, we have strong agreement between the samples using 3 buckets.

  • Both below 12%ile (i.e. Low)
  • Both above 82%ile (i.e. High)
  • Both in normal range

This means for this type of diagnostic evaluation — they appear to be the same.
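The three-bucket comparison can be sketched as follows, reading Low as below the 12th percentile and High as above the 82nd. The end product names and percentile values are hypothetical, and the helper functions are illustrative rather than the code behind the site.

```python
LOW_CUTOFF = 12    # below this percentile counts as low
HIGH_CUTOFF = 82   # above this percentile counts as high

def bucket(percentile):
    """Map a percentile to one of three buckets."""
    if percentile < LOW_CUTOFF:
        return "low"
    if percentile > HIGH_CUTOFF:
        return "high"
    return "normal"

# Hypothetical percentile rankings of predicted end products in each sample.
SAMPLE_1 = {"butyrate": 8.0, "acetate": 55.0, "lactate": 90.0, "propionate": 40.0}
SAMPLE_2 = {"butyrate": 3.0, "acetate": 70.0, "lactate": 85.0, "propionate": 14.0}

def agreement(a, b):
    """Return the share of shared items in the same bucket, and their names."""
    shared = sorted(set(a) & set(b))
    same = [name for name in shared if bucket(a[name]) == bucket(b[name])]
    return len(same) / len(shared), same

rate, matches = agreement(SAMPLE_1, SAMPLE_2)
print(rate, matches)
```

None of the made-up percentiles match between the two samples, but every item lands in the same bucket, which is the kind of agreement described above.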

Bottom Line

There are several questions that need to be asked (and an answer to one):

  • To the folks at Thryve: why are the numbers so different?
  • For users of my analysis site: for diagnostic purposes there are few differences! We have general agreement for:
    • End Product Production
    • Medical Studies Matches
    • Symptom Matches
    • Detecting high or low levels by percentile

The critical difference between the information from lab providers and my site is the sophistication of interpretation.

So, to answer the reader’s question: “The numbers are in major disagreement, but the diagnostic significance of the whole sample is in strong agreement”. Doing the lab analysis is worth it; just ignore the lab’s “value added” suggestions/information.
