Performance of Microbiome-Symptom Prediction Model

Preliminary evaluations have shown strong and accurate forecasting ability. I am working with a Ph.D. in Molecular Genetics, and when I processed his microbiome sample, the top forecasts were all correct:

  • He is male
  • He has no health issues
  • He is blood type O-positive
  • He snores
  • He wakes up early

This was not from a blood sample, but from a stool sample. The methodology is described in my earlier posts.

The methodology is computationally intensive. Unlike many contemporary artificial intelligence engines, it is easy to understand:

  • In the entire population (n = 10,000), a bacterium such as unclassified Clostridiales is reported in 10% of samples, with a mean of 0.3%.
  • In the condition / sample population (n = 1,000), the same bacterium is reported in 30% of samples, with a mean of 4.8%.

I used the chi-squared (Chi2) test to determine significance, because testing significance with means assumes a normal distribution, which is a false assumption for this data.
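To make the comparison concrete, here is a minimal sketch of a Pearson chi-squared test on the presence / absence counts from the example above. It is an illustration under stated assumptions, not necessarily the exact computation used for the forecasts: the function name `chi2_presence` is hypothetical, the two groups are treated as separate (non-overlapping) samples, and no continuity correction is applied.

```python
import math

def chi2_presence(present_a, total_a, present_b, total_b):
    """Pearson chi-squared test (1 degree of freedom) on a 2x2
    presence/absence table for two groups of samples."""
    observed = [
        [present_a, total_a - present_a],
        [present_b, total_b - present_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [present_a + present_b,
                  grand_total - present_a - present_b]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if presence were unrelated to the group.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Assumed counts from the example: 30% of 1,000 condition samples versus
# 10% of 10,000 population samples report the bacterium.
statistic, p_value = chi2_presence(300, 1000, 1000, 10000)
print(f"chi2 = {statistic:.1f}, p = {p_value:.3g}")
```

Only the counts of samples that do or do not report the bacterium enter the test, so nothing is assumed about how the abundance values themselves are distributed.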

With this methodology it is useful to get some indication of what sample sizes are needed for conditions / symptoms (assuming the control group is at least the same size). A quick plot suggests that 200 is the minimum size, with 300+ being desirable.

One of the labs that I work with reports percentile rankings on their reports. This makes doing an exploration economical (i.e. cheap). Only the symptom / condition group needs to be tested, because the control numbers are given to you for free.

This also illustrates that a symptom is not tied to just one taxon, but to many taxa working in unison to influence the symptom (or being influenced by the condition).

Samples of Symptoms / Conditions

The ones with the most predictors are below.

  • Neurocognitive: Absent-mindedness or forgetfulness
  • Neurological: Impairment of concentration
  • Neurocognitive: Difficulty paying attention for a long period of time
  • General: Fatigue
  • Autonomic Manifestations: irritable bowel syndrome
  • Official Diagnosis: COVID19 (Long Hauler)
  • Sleep: Problems staying asleep
  • Neurocognitive: Slowness of thought
  • Neurological-Sleep: Insomnia
  • Sleep: Daytime drowsiness
  • Immune Manifestations: Bloating
  • Neurocognitive: Problems remembering things

For blood types:

Only two were significant, and the importance of sample size is apparent.

Symptom Name              Sample N    Forecast Variables
Blood Type: O Positive    284         295
Blood Type: A Positive    126         3

For Age:

Symptom Name    Sample N    Forecast Variables
Age: 60-70      206         143
Age: 0-10       98          42
Age: 50-60      164         33
Age: 20-30      152         12
Age: 30-40      369         11
Age: 40-50      244         8

For Gender:

Symptom Name      Sample N    Forecast Variables
Gender: Female    415         214
Gender: Male      554         200

Bottom Line

From the individual forecast variables, we can compute the odds of a match. Consider a simplistic example of three factors:

  • Factor A uses 5%ile
  • Factor B uses 10%ile
  • Factor C uses 20%ile

If A, B, and C are all true, there is a 99.9% chance of a taxa match (1 – .05 * .1 * .2 = 0.999), assuming the factors are independent. This does not mean the person has the condition; it indicates a significant risk of having it (depending on other factors such as DNA, age, gender, etc.).
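The expression 1 – .05 * .1 * .2 can be checked with a few lines of Python. This is only a sketch of the simplified arithmetic: the function name `match_probability` is hypothetical, and multiplying the cutoffs assumes the factors are independent of one another, which real taxa may not be.

```python
import math

def match_probability(percentile_cutoffs):
    """Return 1 minus the product of the percentile cutoffs
    (given as fractions), treating the factors as independent."""
    chance_all_by_luck = math.prod(percentile_cutoffs)
    return 1 - chance_all_by_luck

# The three factors from the example, expressed as fractions.
factors = {"A": 0.05, "B": 0.10, "C": 0.20}
probability = match_probability(factors.values())
print(f"{probability:.1%}")  # prints 99.9%
```

Each additional factor multiplies the product by another number below 1, so the result moves closer to 100% as more forecast variables match.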

Of course, the more forecasting variables, the better the estimate becomes and also the more tolerant the forecast is to variations in the microbiome.