Using Frequency of Detection in Samples

This is a technical note. I recently came across the following detection counts for one bacterium while analyzing Long COVID data.


  • With Long COVID: 55/152 samples or 36.2%
  • Reference (excluding Long COVID samples): 72/996 or 7.2%

This presents an interesting insight into possible blinkered thinking when seeing such data. Some examples are:

  • Don’t bother looking:
    • It’s a rare bacterium (just 7% of people have it…)
    • It does not occur in most Long COVID patients, so it is not interesting
  • I computed the means and standard deviations, and the difference is not sufficiently significant, so do not mention it

My take is simple: it occurs FIVE times as often. I view microbiome dysbiosis as the result of the “perfect storm”, or should I say “imperfect storm”: the wrong concentrations of compounds and enzymes coming together from a host of bacteria. With that view of dysbiosis, a rare-bacterium oddity like this hints at a subset. This is contrary to the common view that dysbiosis is caused by a single bacterium or a small group of bacteria, and that you can make simple either/or decisions based on their presence or absence.
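The "five times" figure comes straight from the counts listed above. A minimal sketch of that arithmetic (the variable names are mine, purely for illustration):

```python
# Detection counts quoted above: (samples with the bacterium, total samples).
long_covid_hits, long_covid_n = 55, 152
reference_hits, reference_n = 72, 996

# Frequency of detection in each group.
long_covid_freq = long_covid_hits / long_covid_n
reference_freq = reference_hits / reference_n

# How many times more often the bacterium is detected with Long COVID.
ratio = long_covid_freq / reference_freq

print(f"Long COVID: {long_covid_freq:.1%}")
print(f"Reference:  {reference_freq:.1%}")
print(f"Ratio:      {ratio:.1f}x")
```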

In the case of the Long COVID data, I observed some odd (by traditional thinking) situations. A few examples:

  • A 10-fold difference in frequency, with the higher-frequency group having a higher average: the traditional expectation. More of this bacterium is growing, hence we find it more often.
  • A 10-fold difference in frequency, with the higher-frequency group having a lower average, with statistical significance. This is what made me stop and re-examine my perspective, including the need to re-evaluate some blinkers.

The natural question: Determining Significance!

For most people dealing with biological data, presence or absence is typically the dependent variable, that is, the outcome. For example, mean bacterial abundances are compared with the outcome being Crohn’s disease detected or not (the control case). The data will often be dropped into logistic regression.

I went back to thinking about flipping biased coins, and raised a beer to the memory of Bernoulli. In the case above, the expected bias is that the coin lands heads 7.2% of the time. We try a new coin, toss it 152 times, and get heads 36.2% of the time…

The hypothesis to test: are the two coins equivalent?

  • The standard deviation of the population is a simple calculation, except that we need to change .50 to .362 in the usual fair-coin formula. (P.S. The Std Dev of the population is about 1%, so a range of .342 to .382 could be tried safely.)
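The formula in question is presumably the standard error of a proportion. A reconstruction under that assumption, with p = .362 substituted for .50 and n = 152 tosses (the symbols p, n, and p_0 are my notation, not the original's):

```latex
\[
\sigma_{\hat p} = \sqrt{\frac{p\,(1-p)}{n}}
               = \sqrt{\frac{0.362 \times 0.638}{152}} \approx 0.039,
\qquad
z = \frac{p_0 - p}{\sigma_{\hat p}}, \quad p_0 = 0.072
\]
```

Here p_0 is the reference frequency (72/996) and p is the frequency observed with Long COVID (55/152).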

The result is a z-score of -7.43, which is clearly significant, well beyond the 0.01 level. Thus the difference in presence versus absence is statistically significant, and frequency of detection should be included in any analysis (though it rarely seems to be in most papers).
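For anyone wanting to check the number, here is a minimal sketch in plain Python. The function name and its `use_observed_se` switch are hypothetical, introduced only for this illustration. With the standard error taken from the observed 36.2% (as described above) it gives the -7.43 figure. A textbook one-sample test would instead take the standard error from the reference 7.2%, which I include as an alternative for comparison.

```python
from math import sqrt, erfc


def one_proportion_z(hits, n, p_ref, use_observed_se=True):
    """z-score for an observed detection rate versus a reference rate.

    use_observed_se=True  -> standard error from the observed proportion
                             (the approach described in the text above).
    use_observed_se=False -> standard error from the reference proportion
                             (the usual null-hypothesis form; an alternative).
    """
    p_obs = hits / n
    p_for_se = p_obs if use_observed_se else p_ref
    se = sqrt(p_for_se * (1 - p_for_se) / n)
    z = (p_ref - p_obs) / se               # reference minus observed, hence negative
    p_value = erfc(abs(z) / sqrt(2))       # two-sided normal tail probability
    return z, p_value


p_ref = 72 / 996                           # reference frequency of detection
z_obs, p_val_obs = one_proportion_z(55, 152, p_ref)
z_null, p_val_null = one_proportion_z(55, 152, p_ref, use_observed_se=False)

print(f"z (observed-rate SE):  {z_obs:.2f}, p = {p_val_obs:.1e}")
print(f"z (reference-rate SE): {z_null:.2f}, p = {p_val_null:.1e}")
```

Either choice of standard error leaves the z-score far beyond the 0.01 threshold, so the conclusion does not depend on it.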