In a recent Zoom meeting I heard frustrations about building a database from studies, because in some months over 1000 new studies are published. Microbiome Prescription is able to process all relevant new studies on a weekly basis. It is possible, though it may not be trivial or easy for others to do.
First Scope the core Tables in the database
The following referential / foreign key tables are suggested.
- Citation: Source of the information. It should typically support DOI, local file, and URI links.
- Bacteria: A list of bacteria in scope with historic alternative names and regional spellings and misspellings. Often new misspellings are dynamically added. I recommend using NCBI Taxon numbers since they are the most stable.
- Modifiers: A list of substances or actions that may change bacteria. This can become complex in two ways:
  - Alternative names, especially for prescription items which may have multiple names in different regions of the world.
  - Composition issues: See below
- Medical Conditions/Syndromes/Symptoms: As above, we need a list of alternative names and spellings. Ideally most will have their International Classification of Diseases (ICD) identifier. ICD has a hierarchy, which is very useful for making inferences when studies are sparse.
- Measures: This can be objective (i.e. quantifiable with a measurement) or subjective (feels happier).
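The reference tables above can be sketched as a small SQLite schema. This is only an illustration under stated assumptions: every table and column name here (`citation`, `bacteria_alias`, `resolve_taxon`, and so on) is hypothetical and is not the actual Microbiome Prescription schema. The point it shows is that aliases and misspellings live in their own table and resolve to one stable key, such as the NCBI Taxon number.

```python
import sqlite3

# Hypothetical schema: all names are assumptions made for illustration.
SCHEMA = """
CREATE TABLE citation (
    citation_id INTEGER PRIMARY KEY,
    doi         TEXT UNIQUE,
    local_file  TEXT,
    uri         TEXT
);
CREATE TABLE bacteria (
    taxon_id       INTEGER PRIMARY KEY,   -- NCBI Taxon number
    preferred_name TEXT NOT NULL
);
CREATE TABLE bacteria_alias (
    alias    TEXT PRIMARY KEY COLLATE NOCASE,  -- historic names, misspellings
    taxon_id INTEGER NOT NULL REFERENCES bacteria(taxon_id)
);
CREATE TABLE modifier (
    modifier_id    INTEGER PRIMARY KEY,
    preferred_name TEXT NOT NULL
);
CREATE TABLE modifier_alias (
    alias       TEXT PRIMARY KEY COLLATE NOCASE,  -- regional product names
    modifier_id INTEGER NOT NULL REFERENCES modifier(modifier_id)
);
CREATE TABLE condition (
    condition_id   INTEGER PRIMARY KEY,
    preferred_name TEXT NOT NULL,
    icd_code       TEXT,                  -- may be missing for some entries
    parent_id      INTEGER REFERENCES condition(condition_id)  -- hierarchy
);
CREATE TABLE measure (
    measure_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    is_objective INTEGER NOT NULL         -- 1 = measured, 0 = self-reported
);
"""


def build_db():
    """Create an in-memory database holding the reference tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def seed_example(conn):
    """Insert one bacterium with a short name and a common misspelling."""
    conn.execute("INSERT INTO bacteria VALUES (562, 'Escherichia coli')")
    conn.executemany(
        "INSERT INTO bacteria_alias VALUES (?, 562)",
        [("Escherichia coli",), ("E. coli",), ("Eschericia coli",)],
    )
    conn.commit()


def resolve_taxon(conn, name):
    """Map any known spelling to its taxon number, or None if unknown."""
    row = conn.execute(
        "SELECT taxon_id FROM bacteria_alias WHERE alias = ?", (name.strip(),)
    ).fetchone()
    return row[0] if row else None
```

New misspellings then become a single row inserted into `bacteria_alias`, with no change to any fact already stored against the taxon number.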
Modifier Composition Example:
Consider beta glucan, which is cited in some studies. It is also found in the items below. Should a study of beta glucan be taken to indicate a similar impact from each of them? This is not clear, because each of these items contains other substances that may inhibit or reverse the reported result.
- Oats: Among the richest sources of beta glucan, commonly found in oatmeal, oat bran, and oat flour.
- Barley: Another top grain source, often used in soups, stews, and as a cooked grain side.
- Wheat and Wheat Bran: Contain beta glucan in smaller amounts compared to oats and barley.
- Rye, Maize (corn), Sorghum, Triticale, and Durum Wheat: Additional grains with beta glucan content.
- Mushrooms: Certain varieties like shiitake, maitake, and oyster mushrooms contain beta glucan.
- Seaweed and Algae: Types like nori, kelp, and spirulina have beta glucan, often used in supplements.
This issue is also seen with turmeric and curcumin. Key points from studies:
- Turmeric contains various bioactive substances including curcumin, demethoxycurcumin, bisdemethoxycurcumin, and others like polysaccharides and volatile oils. Some studies show that these other components can add anti-inflammatory, analgesic, and anti-fungal effects beyond what curcumin alone provides. For example, turmeric as a whole showed stronger fungal growth inhibition than curcumin alone in certain studies.
- Curcumin is the main active compound often isolated for supplements due to its potent anti-inflammatory, antioxidant, and anticancer effects. It has been shown in randomized controlled trials to reduce arthritis symptoms, pain, and markers of inflammation like TNF and IL-6. Curcumin generally has better bioavailability when formulated with absorption enhancers like piperine (black pepper extract).
The recommended approach is to do the data entry at the finest resolution. Later, when using the data, selected reasonable inferences may be made where there is insufficient data.
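One way to keep entry at the finest resolution and still allow later inference is to store composition separately from facts. The sketch below is a minimal illustration: the `COMPONENTS` table, the `Evidence` class, and `evidence_for` are hypothetical names, and the effect strings passed in are placeholders rather than real findings. Facts recorded against a component are only borrowed for the whole item when that item has no facts of its own, and borrowed facts are flagged as inferred.

```python
from dataclasses import dataclass

# Hypothetical composition table: whole item -> finer-grained components.
COMPONENTS = {
    "oats": ["beta glucan"],
    "barley": ["beta glucan"],
    "turmeric": ["curcumin", "demethoxycurcumin", "bisdemethoxycurcumin"],
}


@dataclass(frozen=True)
class Evidence:
    modifier: str   # the modifier the fact was actually recorded against
    effect: str     # free-text effect; placeholder values in this sketch
    inferred: bool  # True when borrowed from a component


def evidence_for(modifier, direct_facts, components=COMPONENTS):
    """Return direct facts if any exist, else facts inferred from components."""
    direct = direct_facts.get(modifier, [])
    if direct:
        return [Evidence(modifier, effect, False) for effect in direct]
    return [
        Evidence(part, effect, True)
        for part in components.get(modifier, [])
        for effect in direct_facts.get(part, [])
    ]
```

Because inferred evidence is flagged, downstream code can weight it lower or drop it altogether, which matters when the whole item contains substances that may reverse the component's effect.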
The issue of adjectives
Adjectives are words that qualify an object. Above are lists of objects. Adjectives may be the dosage for a drug, CFUs per day for a probiotic, or Stage 1-5 for cancers. This also applies to studies, i.e. studies on diabetics, athletes, children, dogs, pigs, horses and, my favorite, zebrafish.
As with modifiers, the recommended approach is to do the data entry at the finest resolution; later, when using the data, selected reasonable inferences may be made where there is insufficient data. Note that a study may report its own data or cite data from prior, perhaps unpublished, studies. That distinction is also an adjective.
Moving on to Verbs
The simple ones are: increases or decreases some measure. As with nouns/objects above, we also have adverbs such as “statistically significant” and “trending”.
The Facts, Just the Facts ma’am
The objects and verbs, with their adverbs and adjectives, need to form a sentence, which is often referred to as a fact. This may be stored in one table, in many tables, or perhaps in a network database.
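A fact in this sense can be sketched as a small record type. The class and field names below (`Fact`, `Verb`, `Adverb`, `adjectives`) are assumptions made for illustration, and the values used with it are placeholders, not real findings. The sketch shows the verb and adverb drawn from a restricted vocabulary, while adjectives stay open-ended as name/value pairs.

```python
from dataclasses import dataclass
from enum import Enum


class Verb(Enum):
    INCREASES = "increases"
    DECREASES = "decreases"


class Adverb(Enum):
    SIGNIFICANT = "statistically significant"
    TRENDING = "trending"


@dataclass(frozen=True)
class Fact:
    citation: str           # DOI, file path, or URI of the source
    modifier: str           # substance or action
    verb: Verb
    taxon_id: int           # NCBI Taxon number of the bacterium
    adverb: Adverb
    adjectives: tuple = ()  # (name, value) pairs, e.g. ("population", "children")

    def sentence(self, taxon_names):
        """Render the fact as a readable sentence for a human reviewer."""
        base = (
            f"{self.modifier} {self.verb.value} "
            f"{taxon_names[self.taxon_id]} ({self.adverb.value})"
        )
        quals = ", ".join(f"{name}={value}" for name, value in self.adjectives)
        return f"{base} [{quals}]" if quals else base
```

Keeping the restricted vocabulary in enums means a typo in a verb or adverb fails at entry time instead of silently creating a new category.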
Implementation
There are two general ways of doing the extraction once you have the database designed and built.
- Using Natural Language Processing to produce the facts in the database, likely using:
  - Named Entity Recognition (NER)
  - Deep Learning NLP
- Using human reviewers, often done by hiring grad students with the appropriate background
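To make the software side concrete, here is a minimal sketch of dictionary-based entity matching. It is a much simpler stand-in for trained NER models and is not the method Microbiome Prescription uses. The `ALIASES` table and `find_bacteria` function are hypothetical; the only real identifier is 562, the NCBI Taxon number for Escherichia coli.

```python
import re

# Hypothetical alias table: lower-cased spelling -> NCBI Taxon number.
ALIASES = {
    "escherichia coli": 562,
    "e. coli": 562,
    "eschericia coli": 562,  # a misspelling kept deliberately
}


def find_bacteria(text, aliases=ALIASES):
    """Return (start, end, matched_text, taxon_id) for every alias in text."""
    # Try longer aliases first so the longest spelling wins at each position.
    alternatives = "|".join(
        re.escape(alias) for alias in sorted(aliases, key=len, reverse=True)
    )
    # Lookarounds prevent matching inside a longer word.
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
    return [
        (m.start(), m.end(), m.group(0), aliases[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

The returned character offsets are what allow the supporting passage to be highlighted for a reviewer later.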
Microbiome Prescription uses a hybrid of these approaches: facts are proposed by software analysis and then reviewed by a human before being committed. The facts shown to the human have been pre-coded for insertion into the database (i.e. Taxon number and ICD code already computed). The human just examines the highlighted portion of the article that the fact is based on and clicks [Yes], [No] or [Issue]. The [No] and [Issue] items are sent back to the NLP developers so they can further refine their algorithms. A [Yes] is inserted immediately, without the need for any additional coding. Often an [Issue] is a failure of the NLP, for example due to a misspelling or to confusing a probiotic (modifier) with a bacterium.
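The review step can be sketched as a simple routing function. This is an illustration only; `ProposedFact`, `Verdict`, and `apply_review` are hypothetical names, and the actual Microbiome Prescription tooling is not described in enough detail to reproduce. Each proposal arrives pre-coded, and the reviewer's single click decides whether it is committed or returned as feedback.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    YES = "Yes"
    NO = "No"
    ISSUE = "Issue"


@dataclass(frozen=True)
class ProposedFact:
    taxon_id: int   # already resolved by the software
    icd_code: str   # already resolved by the software (placeholder here)
    modifier: str
    verb: str
    highlight: str  # passage shown to the reviewer as supporting text


def apply_review(proposals, verdicts):
    """Split reviewed proposals into committed facts and developer feedback."""
    committed, feedback = [], []
    for proposal, verdict in zip(proposals, verdicts):
        if verdict is Verdict.YES:
            committed.append(proposal)            # inserted with no re-coding
        else:
            feedback.append((verdict, proposal))  # back to the NLP developers
    return committed, feedback
```

Keeping the verdict attached to each rejected proposal lets the developers separate outright wrong extractions ([No]) from problems such as a probiotic mistaken for a bacterium ([Issue]).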
The Microbiome Prescription database has been reviewed independently by teams from microbiome testing companies that have used its database. Human review often raises issues because the adjectives or adverbs were insufficient for the reviewers' preferences. This was expected. The NLP review reported 100% agreement with the facts extracted using the restricted vocabulary (better than I was expecting). Given that the accuracy of the actual studies is fuzzy, a small number of errors is not deemed critical. Data entry can easily be over-engineered at significant cost, with no improvement in the results of the algorithms that use the data.
I strongly recommend doing a similar two-step process: NLP followed by human review.
Summary
The cost of building a perfect database can easily reach several million dollars per year for the human reviewers. At the other end, doing a one-time search for some articles and building a naïve database may amount to just a few months of labor, a good task for a university summer student. A poor match between the skills required and the team implementing it can result in frequent missed deadlines, excessive hours and abandonment of the project.
If you are interested in trying this, I attach a list of the DOIs that useful information was extracted from. This list was filtered from 2,924,386 articles flagged in the first pass as mentioning a bacterium. This was then reduced by Text Processing Tasks and Application and then manual review to the 17,776 DOIs in the attached file. In other words, roughly 1 in 165 (about 0.6%) of the first-pass articles were useful.
The issue of using the data is another complex issue, especially since reasonable inference is a likely requirement to get good suggestions.
Use the DOIs to do a proof of concept and then proceed to tune the process to fit your financial constraints.