A reader raised a valid question, which in turn raises other related questions.

You seem to like the “percentage of percentiles” measurement, but I’m not convinced it’s being analyzed appropriately. As I understand it, you first convert to percentiles, getting numbers in [0, 100]. I think this is fine. Then you histogram these percentiles. Because each lab will perform the same measurements every time, I think this is also fine. However, the result is compositional data in the sense of Aitchison, and it should be analyzed in a manner consistent with that. For compositional data, a chi^2 test is inappropriate because it relies on the number of species (or genera) measured.

My suggestion is to apply a centered logratio transform to each person’s percentages and fit a normal distribution to the transformed data. To determine whether someone’s microbiome deviates significantly, calculate a multivariate normal tail probability. Beware that the covariance matrix will be rank deficient (you’re in a ten-dimensional space, but there are only nine parameters because percentages sum to 100). You may want a robust fit because it’s reasonable to expect that the microbiome of someone ill might be an outlier.

For more information about compositional data, see Aitchison, J., “The Statistical Analysis of Compositional Data,” Journal of the Royal Statistical Society. Series B (Methodological) Vol. 44, No. 2 (1982), pp. 139-177; Aitchison, J., “The Statistical Analysis of Compositional Data,” Chapman & Hall, London, 1986; and Aitchison, J. “A Concise Guide to Compositional Data Analysis,” unpublished manuscript, 2005, available online (just Google). For other approaches to compositional data analysis, see Greenacre, Michael; Grunsky, Eric; Bacon-Shone, John; Erb, Ionas; Quinn, Thomas, “Aitchison’s Compositional Data Analysis 40 Years On: A Reappraisal,” arXiv:2201.05197, 13 Jan 2022, to appear in Statistical Science.
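The reader's suggestion can be sketched roughly as follows. This is only an illustration of the centered logratio (CLR) idea, not an analysis I have run on real data: the function names, the pseudocount of 0.5 (needed because percentages of zero cannot be logged), and the use of a plain (non-robust) mean and covariance are all assumptions. The pseudo-inverse handles the rank-deficient covariance matrix the reader warns about.

```python
import numpy as np

def clr(percentages, pseudocount=0.5):
    """Centered logratio transform of each row of percentages.

    The pseudocount is an assumption added here to avoid log(0).
    """
    x = np.asarray(percentages, dtype=float) + pseudocount
    logx = np.log(x)
    # Subtract each row's mean log, so every transformed row sums to zero.
    return logx - logx.mean(axis=1, keepdims=True)

def mahalanobis_sq(reference, sample, pseudocount=0.5):
    """Squared Mahalanobis distance of one sample from a reference group,
    computed in CLR space with a plain (non-robust) mean and covariance."""
    ref = clr(reference, pseudocount)
    z = clr(np.atleast_2d(sample), pseudocount)[0]
    diff = z - ref.mean(axis=0)
    cov = np.cov(ref, rowvar=False)
    # CLR rows sum to zero, so cov is rank deficient: use a pseudo-inverse.
    cov_pinv = np.linalg.pinv(cov, rcond=1e-10, hermitian=True)
    return float(diff @ cov_pinv @ diff)

# Made-up reference data: 200 people, 10 percentage bins each.
rng = np.random.default_rng(0)
reference = rng.dirichlet(np.ones(10), size=200) * 100
distance = mahalanobis_sq(reference, reference[0])
```

If the transformed data really were multivariate normal, the squared distance could be compared against a chi-square distribution with nine degrees of freedom (one less than the number of bins) to get the tail probability the reader describes.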

## What is the statistical basis for other Diversity Indices?

How to calculate these numbers is well defined. They seem to be *brilliant ideas* tossed out there that seem to fit the data for some study. For some background, see this page. The problem is a lack of rigor, especially statistical rigor.

Diversity indices, particularly the Shannon-Wiener index, have extensively been used in analyzing patterns of diversity at different geographic and ecological scales. These indices have serious conceptual and statistical problems which make comparisons of species richness or species abundances across communities nearly impossible.

Conceptual and statistical problems associated with the use of diversity indices in ecology [2009]

The problem is the absence of a native statistical model. For example, diversity indices do not fit the usual ones:

- Normal distribution – “The distribution that shall rule them all” because that is what is always assumed and what is usually taught outside of mathematics departments (which know better)
- Continuous uniform distribution
- Poisson distribution
- Erlang distribution – which I used professionally for many years
- Cauchy distribution
- For even more choices, see this list

*The key question is simple: what is the distribution underlying diversity indices?* We read: “In the literature of biodiversity, according to Ricotta (2005), there are a “jungle” of biological measures of diversity.” [2017]. In Zheng’s “A new diversity estimator” [2017] in the Journal of Statistical Distributions and Applications, he states: “There are many other open problems built on this connection between birthday problem and diversity measures.” The problem is this: the birthday problem deals with 366 discrete, well-defined boxes. With the microbiome, we lack such boxes. Consider a microbiome sample measured in 2000: a large number of different bacterial species were placed in Lactobacillus. Today, these species are no longer placed in 1 genus, but in 25 genera [2020], including:

*Acetilactobacillus*, *Agrilactobacillus*, *Amylolactobacillus*, *Apilactobacillus*, *Bombilactobacillus*, *Companilactobacillus*, *Dellaglioa*, *Fructilactobacillus*, *Furfurilactobacillus*, *Holzapfelia*, *Lacticaseibacillus*, *Lactiplantibacillus*, *Lapidilactobacillus*, *Latilactobacillus*, *Lentilactobacillus*, *Levilactobacillus*, *Ligilactobacillus*, *Limosilactobacillus*, *Liquorilactobacillus*, *Loigolactobacillus*, *Paucilactobacillus*, *Schleiferilactobacillus*, *Secundilactobacillus*.

With the same strains/species, our diversity indices will be very different because our boxes are arbitrary and “*soft*”, unlike the days of the year or the roll of a die.
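A small sketch of the point about soft boxes: the same bacteria, counted once under the old single genus and once split across the new genera, give different Shannon index values. The read counts below are invented purely for illustration, and the three-way split is hypothetical.

```python
import math

def shannon(counts):
    """Shannon index H = -sum(p * ln p) over the non-empty boxes."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Invented read counts for ONE sample, classified two ways.
lumped = {"Lactobacillus": 600, "Bifidobacterium": 400}
split = {
    "Lactobacillus": 200,
    "Lacticaseibacillus": 200,
    "Limosilactobacillus": 200,
    "Bifidobacterium": 400,
}

h_lumped = shannon(lumped.values())  # about 0.67
h_split = shannon(split.values())    # about 1.33
print(f"one genus: {h_lumped:.2f}, reclassified: {h_split:.2f}")
```

Nothing about the sample changed between the two calculations; only the boxes did.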

## Back to percentage of percentiles

While I show genus and species in the table for ease of understanding of the typical reader, I originally did it solely with the lowest identifiable levels (the “atoms” of the microbiome) – species. At the species level, it is not *compositional*. There is no composition! Looking at the data that was actually received, I noticed many genera had no species listed. In some cases, the genus had species, but none of the known ones were detected. In other cases, the test did not report any species in over 3000 test results.

On this basis I decided to try using both species and genus. I soon discovered that they almost always exhibit a similar pattern and chi^2. At this point, I opted for benefiting my readers over as much rigor as some would like. One solution would be to report at the lowest taxonomic level available across the hierarchy.

This approach ends up side-stepping the classification issues cited above. We are dealing with distinct, non-overlapping events (a bacterium being identified) and then convert them to percentiles, giving us a continuous uniform distribution for each of these independent events. IMHO, at this point we have a good model to chi^2 test. We are not dealing with measuring a population, just a sample.
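As a rough sketch of the test described above: bucket a person's percentiles into equal-width bins and compare the counts with the flat histogram a uniform distribution would give. The choice of ten bins, the function names, and the sample data are assumptions made for illustration.

```python
def percentile_histogram(percentiles, bins=10):
    """Count how many percentiles (0 to 100) fall into each equal-width bin."""
    counts = [0] * bins
    width = 100 / bins
    for p in percentiles:
        # A percentile of exactly 100 goes into the last bin.
        idx = min(int(p // width), bins - 1)
        counts[idx] += 1
    return counts

def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts in every bin."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Invented example: 100 taxa, every one of them in the bottom 10% range.
skewed = [5.0] * 100
stat = chi_square_uniform(percentile_histogram(skewed))

# With 10 bins there are 9 degrees of freedom; the 5% critical value is 16.919.
print(stat, stat > 16.919)
```

A sample whose percentiles are spread evenly across the bins gives a statistic near zero, while one piled into a few bins gives a large value.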

The objection “*a chi^2 test is inappropriate because it relies on the number of species (or genera) measured*” is missing the point. If I get two bags of coins from the bank and then flip them to determine if they are biased, whether the bags contain 1000 or 100,000 coins matters only for the margin of error. The number of species/genera is only significant in that sense. If there is a strong bias with a small number, then having more will not change the bias.
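The coin example can be put into numbers. This sketch assumes the usual normal approximation for a proportion (margin of error = 1.96 × standard error at roughly 95% confidence); the head counts are invented.

```python
import math

def bias_estimate(heads, flips, z=1.96):
    """Return the observed proportion of heads and its approximate margin of error."""
    p = heads / flips
    margin = z * math.sqrt(p * (1 - p) / flips)
    return p, margin

# Invented results: both bags show 60% heads.
p_small, margin_small = bias_estimate(600, 1_000)
p_large, margin_large = bias_estimate(60_000, 100_000)

print(f"1,000 coins:   {p_small:.2f} +/- {margin_small:.4f}")
print(f"100,000 coins: {p_large:.2f} +/- {margin_large:.4f}")
```

The estimated bias is the same for both bags; only the margin of error shrinks (by a factor of ten here) with the larger bag.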
