The data is available at [link]. I did some tutorials a few years back; these are linked below:

Some challenges for the reader:

  • Compute the percentile of each bacterium (taxon) across samples from the same lab
  • Test whether the data follow a normal distribution
  • Run regressions between different taxa/bacteria using:
    • Raw counts
    • Percentiles
    • Which gives stronger results?
    • Reframe this using random forest and other ML techniques.
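
As a starting point, the first three challenges can be sketched in Python. This is only a sketch under assumptions: the table layout (one row per sample, one column per taxon), the column names `lab`, `taxon_x` and `taxon_y`, and the synthetic numbers are all made up for illustration and are not the real data. Shapiro-Wilk is one of several normality tests you could pick.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data. The layout and column names are assumptions,
# not the format of the real download.
rng = np.random.default_rng(0)
n_samples = 200
wide = pd.DataFrame({
    "lab": rng.choice(["lab_a", "lab_b"], size=n_samples),
    "taxon_x": rng.lognormal(mean=5.0, sigma=1.5, size=n_samples),
})
# taxon_y is tied to taxon_x with multiplicative noise (skewed, like counts).
wide["taxon_y"] = wide["taxon_x"] ** 0.8 * rng.lognormal(0.0, 0.7, n_samples)

taxa = ["taxon_x", "taxon_y"]

# 1. Percentile of each taxon, ranked only against samples from the same lab.
pct = wide.groupby("lab")[taxa].rank(pct=True) * 100

# 2. Shapiro-Wilk test on the raw counts (a small p-value rejects normality).
normality_p = {}
for taxon in taxa:
    stat, p_value = stats.shapiro(wide[taxon])
    normality_p[taxon] = p_value
    print(f"{taxon}: W={stat:.3f}, p={p_value:.3g}")

# 3. The same simple linear regression on raw counts and on percentiles.
raw_fit = stats.linregress(wide["taxon_x"], wide["taxon_y"])
pct_fit = stats.linregress(pct["taxon_x"], pct["taxon_y"])
print(f"raw counts  R^2 = {raw_fit.rvalue ** 2:.3f}")
print(f"percentiles R^2 = {pct_fit.rvalue ** 2:.3f}")
```

Ranking inside `groupby("lab")` is what keeps the percentiles lab-specific. Swap in the real table and your own taxa before drawing any conclusions from the two R^2 values.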

Then compare your results to those shown here for Clostridium butyricum (taxon 1492).

When you use percentiles, what is the name of the resulting distribution? Why is that an advantage?

Next Project

Identify bacterial shifts associated with reported symptoms.

  • Do this for all labs combined first
  • Then filter by individual lab

Which approach gives better associations/models?
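
One possible way to set up this comparison is sketched below. It is an illustration under assumptions, not a prescribed method: the columns `lab`, `has_symptom` and `pct_taxon` are hypothetical, the data is synthetic, and the Mann-Whitney U test is just one reasonable choice for comparing samples with and without a symptom.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data. Column names are assumptions for illustration.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "lab": rng.choice(["lab_a", "lab_b", "lab_c"], size=n),
    "has_symptom": rng.random(n) < 0.4,
})
# Percentile of one taxon, shifted upward for symptomatic samples
# (an effect built into this fake data only).
df["pct_taxon"] = np.clip(
    rng.uniform(0, 100, n) + 15 * df["has_symptom"], 0, 100
)


def symptom_shift(frame: pd.DataFrame) -> dict:
    """Compare samples with and without the symptom (Mann-Whitney U)."""
    with_symptom = frame.loc[frame["has_symptom"], "pct_taxon"]
    without = frame.loc[~frame["has_symptom"], "pct_taxon"]
    result = stats.mannwhitneyu(with_symptom, without, alternative="two-sided")
    return {
        "n_with": len(with_symptom),
        "n_without": len(without),
        "median_shift": with_symptom.median() - without.median(),
        "p_value": result.pvalue,
    }


# All labs pooled first, then each lab on its own.
all_labs = symptom_shift(df)
per_lab = {lab: symptom_shift(group) for lab, group in df.groupby("lab")}

print("all labs:", all_labs)
for lab, summary in per_lab.items():
    print(lab, summary)
```

Pooling gives more samples per test, while filtering by lab removes lab-to-lab differences in how counts are reported. Comparing the two sets of p-values and median shifts on the real data is the point of the exercise.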