Some statistics related to Symptom Forecasting

I recently finished building out a new symptom forecasting algorithm. The ultimate goal is not to forecast symptoms, but to identify the bacteria associated with each symptom and a numeric value for what each bacterium contributes.

The results are shown below. Honestly, they far exceeded my expectation of 70-85% accuracy. An odds ratio had to reach P < 0.01 to be included. All of the input data used is available for download.
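As a rough illustration of the P < 0.01 inclusion rule, here is a minimal sketch of computing an odds ratio and a P value from a 2x2 table of counts. The post does not say which significance test was used, so the Pearson chi-square test here is an assumption, as are the function names and the example counts.

```python
import math

def odds_ratio_with_p(a, b, c, d):
    """Odds ratio and P value for a 2x2 table of sample counts.

    a = low bacterium, symptom declared     b = low bacterium, no symptom
    c = not low, symptom declared           d = not low, no symptom
    The P value comes from a Pearson chi-square test with 1 degree of
    freedom (an assumed choice, not necessarily the one used in the post).
    """
    if min(a, b, c, d) == 0:
        raise ValueError("empty cell: odds ratio is undefined without a correction")
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, p_value

def include(a, b, c, d, alpha=0.01):
    """True when the odds ratio is significant enough to be kept."""
    _, p_value = odds_ratio_with_p(a, b, c, d)
    return p_value < alpha

# Invented counts, for illustration only.
ratio, p = odds_ratio_with_p(30, 70, 10, 190)
print(round(ratio, 2), include(30, 70, 10, 190))
```

With small counts an exact test would usually be a safer choice than the chi-square approximation.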

| Result | Biomesight | Ombre | uBiome | Thorne |
| --- | --- | --- | --- | --- |
| Forecast Symptom and Declared Symptom agree | 43,085 | 17,492 | 11,856 | 1,643 |
| Declared Symptom failed forecast (Odds Ratio < 1) | 646 | 63 | 26 | 11 |
| Percentage Accuracy | 98.5% | 99.996% | 99.998% | 99.99% |
| Total Samples | 4,436 | 1,319 | 791 | 253 |
| Symptoms Evaluated | 363 | 302 | 202 | 23 |
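For readers who want to check the arithmetic, here is a small sketch using the Biomesight counts. It assumes that accuracy is the number of agreements divided by agreements plus failed forecasts; the post does not state the formula, and the function name is hypothetical.

```python
def forecast_accuracy(agreements, failures):
    """Share of declared symptoms that the forecast agreed with.

    Assumed definition: agreements / (agreements + failures).
    """
    return agreements / (agreements + failures)

# Biomesight counts from the results above.
print(f"{forecast_accuracy(43_085, 646):.1%}")  # prints 98.5%
```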

These results were obtained by using odds ratios and accepting the complexities of the microbiome. A simple example of what to avoid is the assumption that the average can be used to decide whether a level is too high or too low. This is wrong in so many ways, especially given the high skewness seen in microbiome data.

No one threshold suits all

To assume a single value applies to all symptoms is naive ideological thinking. You can do it, but only with a significant drop in forecast accuracy (I’ve tried it). The following charts illustrate the patterns discovered.

FCB group for Ombre Data

The chart below illustrates that the best threshold for the odds ratio varies greatly by symptom, with about 45% being the most common one. The average over all samples is 30.8%.

Bacteroides for Biomesight

We see the same behavior elsewhere, with an average of 26%.

Clostridia for uBiome

Similar to the above, with an average of 61.4%.
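To make the idea of a per-symptom threshold concrete, here is a minimal sketch that scans candidate cut points for one bacterium and one symptom and keeps the cut point with the largest significant odds ratio. It is not the algorithm used for the results above. It assumes the threshold percentages are percentiles of the samples, that "low" means below the cut point, and that significance comes from a chi-square test; all names and the synthetic data are hypothetical.

```python
import math

def best_threshold(abundance, has_symptom, percentiles=range(5, 100, 5), alpha=0.01):
    """Return (percentile, odds_ratio, p_value) for the best cut point, or None.

    abundance:   one bacterium's amount in each sample
    has_symptom: True/False per sample for one symptom
    """
    ranked = sorted(abundance)
    best = None
    for pct in percentiles:
        cutoff = ranked[min(len(ranked) - 1, len(ranked) * pct // 100)]
        a = b = c = d = 0
        for value, symptom in zip(abundance, has_symptom):
            low = value < cutoff
            if low and symptom:
                a += 1      # low amount, symptom declared
            elif low:
                b += 1      # low amount, no symptom
            elif symptom:
                c += 1      # not low, symptom declared
            else:
                d += 1      # not low, no symptom
        if min(a, b, c, d) == 0:
            continue        # odds ratio undefined for this cut point
        odds_ratio = (a * d) / (b * c)
        n = a + b + c + d
        chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
        p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square, 1 degree of freedom
        if p_value < alpha and (best is None or odds_ratio > best[1]):
            best = (pct, odds_ratio, p_value)
    return best

# Synthetic data: the symptom mostly occurs in the lowest 30% of samples.
amounts = list(range(200))
declared = [(i < 60) != (i % 10 == 0) for i in amounts]
print(best_threshold(amounts, declared)[:2])
```

Running the same search for each bacterium-symptom pair yields a different cut point per pair, which is the pattern the charts describe.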

What do the resulting Odds Ratios look like?

The charts below are for the Ombre data sets. The numbers are log(Odds Ratio); the actual odds ratios are high, to put it mildly.

The next chart includes the failed forecasts. With their typical odds ratio being much lower than the others shown, this weakness in prediction is not unexpected. It illustrates that Odds Ratios are indeed Odds.
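The post does not spell out how the per-bacterium odds ratios are combined into a forecast. One common approach, assumed here purely for illustration, is to add the log odds ratios of every rule that applies to a sample and to forecast the symptom when the combined odds ratio is above 1. The rule format, taxon names and numbers below are all hypothetical.

```python
import math

def combined_odds_ratio(sample, rules):
    """Combine the odds ratios of all rules that match a sample.

    sample: dict mapping taxon name -> amount in this sample
    rules:  list of (taxon, threshold, odds_ratio); a rule matches when the
            sample's amount of that taxon is below the threshold
    Adding logs is the same as multiplying the odds ratios, which treats the
    rules as independent (a simplifying assumption).
    """
    log_odds = 0.0
    for taxon, threshold, odds_ratio in rules:
        if sample.get(taxon, 0.0) < threshold:
            log_odds += math.log(odds_ratio)
    return math.exp(log_odds)

# Invented rules and sample, for illustration only.
rules = [("Taxon A", 1.0, 4.0), ("Taxon B", 2.0, 0.5)]
sample = {"Taxon A": 0.5, "Taxon B": 1.0}
ratio = combined_odds_ratio(sample, rules)
print(ratio > 1)  # the symptom is forecast when the combined ratio exceeds 1
```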

Most Frequent Bacteria Used for Odds Ratio

It is interesting to note that Lactobacillus is not among the top bacteria (despite its popularity with some). Lactobacillaceae shows up as #91, Lactobacillales as #127, and individual Lactobacillus entries start around #151.

| Tax name | Rank |
| --- | --- |
| Bifidobacterium | genus |
| Bifidobacteriales | order |
| Bifidobacteriaceae | family |
| Actinomycetota | phylum |
| Actinomycetes | class |
| Sutterella stercoricanis | species |
| Listeriaceae | family |
| Caloramator fervidus | species |
| Devosiaceae | family |
| Marvinbryantia | genus |
| Oscillibacter valericigenes | species |
| Devosia | genus |
| Oscillospiraceae | family |
| Bacteroides | genus |
| Bacteroides cellulosilyticus | species |
| Bacteroides uniformis | species |
| Anaerotruncus colihominis | species |
| Coriobacteriaceae | family |
| Anaerotruncus | genus |
| Faecalibacterium | genus |
| Lachnobacterium | genus |
| Paraprevotella xylaniphila | species |
| Mycoplasmatota | phylum |
| Campylobacter ureolyticus | species |
| Mollicutes | class |
| Bacteroidaceae | family |
| Holdemania massiliensis | species |
| Metamycoplasmataceae | family |
| Desulfitobacteriaceae | family |
| Ruminococcaceae | family |

Bifidobacterium for Biomesight

The average is 0.93, which presents a stark contrast to the numbers below.

Odds ratios show significantly increased risk for most symptoms with low Bifidobacterium:

| Symptom | Odds Ratio with low amount |
| --- | --- |
| Immune Manifestations: Hyperphagia (abnormal hunger or desire to eat) | 14.65 |
| Comorbid: Reactive Hypoglycemia | 12.89 |
| Comorbid: Panic Attacks | 9.51 |
| Neurological: Neuropathy | 7.72 |
| Lactose intolerance | 5.88 |
| Neuroendocrine: Sweating hands | 4.4 |
| Comorbid: Constipation and Explosions (not diarrhea) | 4.14 |

Summary

Odds ratios are designed to have predictive power. An optimized algorithm using odds ratios appears to have awesome power. The challenges with odds ratios are simple:

  • Having sufficient data to obtain odds ratios with P < 0.01 (or better still, P < 0.0001).
  • Addressing each bacterium-symptom pair individually rather than falling back on a simplistic "one bacteria level fits all" approach.
  • Computational requirements: after a lot of performance tuning, my computations took over 12 hours on a well-equipped Microsoft SQL Server machine with 64 GB of memory. Other tech stacks may take considerably longer; a few (like multithreaded C++) may be a lot faster.

I attribute my success in doing the above to two factors: an M.Sc. in Operations Research, where optimization techniques rule, plus extensive experience with SQL Server performance (including writing multiple white papers for Microsoft). A last factor is not following current normative beliefs on how to approach this issue: I follow the numbers (i.e. statistics).

As a historical note, building out this model caused flashbacks to programming in Simula and other modelling languages back in the early 1980s. There is a “zen” to modelling.
