The original process was a quick punt. If a bacterium has an enzyme that produces or consumes a substance, we gave it a value of one for each bacterium counted. This depends on the lab bacteria matching the KEGG bacteria. KEGG items are about 50% species and 50% strains. This means that if the lab reports a sister strain of what is in KEGG, then it is not counted.
Fixing this is not a trivial project to implement.
The new approach
The first item is to work off the species level only. If we have 5 strains for a species, we will simply average the information to create a synthetic collection of genes. This should result in less hit and miss when counting up items.
Enzymes are produced from genes. Many different genes may produce the same enzyme. Suppose we have a bacterium X that has 4 enzymes that produce lactate.
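As a rough illustration of the averaging step, here is a minimal Python sketch. It assumes each strain is described by a mapping from a gene identifier to a copy count, and that a gene missing from a strain counts as zero for that strain. The function name `average_strains`, the data layout, and the gene names are all hypothetical; the real calculation may well differ.

```python
from collections import defaultdict


def average_strains(strain_gene_counts):
    """Average per-gene counts across the strains of one species.

    strain_gene_counts: list of dicts, one per strain, mapping a gene
    identifier to the number of copies in that strain (assumed layout).
    A gene absent from a strain contributes zero for that strain.
    """
    if not strain_gene_counts:
        return {}
    totals = defaultdict(float)
    for genes in strain_gene_counts:
        for gene, count in genes.items():
            totals[gene] += count
    n_strains = len(strain_gene_counts)
    return {gene: total / n_strains for gene, total in totals.items()}


# Five made-up strains of one species.
strains = [
    {"geneA": 1, "geneB": 1},
    {"geneA": 1},
    {"geneA": 1, "geneC": 2},
    {"geneA": 1, "geneB": 1},
    {"geneA": 1},
]
species_profile = average_strains(strains)
```

With this layout, a gene found in every strain keeps its full weight, while a gene found in only some strains is weighted down in the synthetic species profile.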
- Enzyme #1 comes from 4 genes found in X
- Enzyme #2 comes from 2 genes found in X
- Enzyme #3 comes from 1 gene found in X
- Enzyme #4 comes from 5 genes found in X
So we have 4+2+1+5 = 12 genes in the 4 enzymes that produce lactate. This gives bacteria count (20,000) x 12 = 240,000 units of lactate production. With the old approach, where bacteria X has one or more enzymes producing lactate, we would estimate 20,000 x 1 = 20,000 units. In reality, we do not double count any gene that is found in two of the enzymes.
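The arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `old_estimate` and `new_estimate`, the enzyme labels, and the gene names `g1` to `g12` are made up for the example. Putting the genes in a set is one simple way to avoid double counting a gene shared by two enzymes.

```python
def old_estimate(bacteria_count, enzyme_genes):
    """Old approach: a value of one per bacterium if any enzyme is present."""
    return bacteria_count * (1 if enzyme_genes else 0)


def new_estimate(bacteria_count, enzyme_genes):
    """New approach: weight by the number of distinct genes behind the enzymes."""
    distinct_genes = set()
    for genes in enzyme_genes.values():
        distinct_genes.update(genes)  # a shared gene is only counted once
    return bacteria_count * len(distinct_genes)


# Hypothetical gene names for the four lactate-producing enzymes of X.
lactate_enzymes = {
    "enzyme1": {"g1", "g2", "g3", "g4"},
    "enzyme2": {"g5", "g6"},
    "enzyme3": {"g7"},
    "enzyme4": {"g8", "g9", "g10", "g11", "g12"},
}

old_units = old_estimate(20_000, lactate_enzymes)  # 20,000 x 1
new_units = new_estimate(20_000, lactate_enzymes)  # 20,000 x 12

# If enzyme3 used a gene that enzyme1 also uses, only 11 distinct genes remain.
shared = dict(lactate_enzymes, enzyme3={"g1"})
shared_units = new_estimate(20_000, shared)  # 20,000 x 11
```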
At present, we see some bacteria that are both producers and consumers. With this approach (assuming it is a reasonable simulation of reality), we should be able to determine if it is likely a net producer or net consumer.
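To make the net producer versus net consumer idea concrete, here is a small Python sketch. It assumes consumption is scored the same way as production (bacteria count times the number of distinct genes involved) and that the two scores can simply be subtracted. Both are assumptions for illustration, and the function name `net_role` and the gene counts are invented.

```python
def net_role(production_units, consumption_units):
    """Label a bacterium by the sign of production minus consumption."""
    net = production_units - consumption_units
    if net > 0:
        return "net producer", net
    if net < 0:
        return "net consumer", -net
    return "balanced", 0


bacteria_count = 20_000
producing_genes = 12  # distinct genes behind lactate-producing enzymes
consuming_genes = 5   # made-up count of genes behind lactate-consuming enzymes

role, amount = net_role(
    bacteria_count * producing_genes,
    bacteria_count * consuming_genes,
)
```

In this made-up case the bacteria would be labelled a likely net producer, with 240,000 - 100,000 = 140,000 units left over.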
That’s it folks, that is what is in the works. It is not up on the site yet; there are some nasty calculations still to be done.