[Orechem] ORE to support Chemistry Extraction

Sun Mar 8 05:39:31 EDT 2009

Following Bill and my initial excursions into chemistry extraction I'd 
like to suggest we devote some time to sketching out an ORE design for 
supporting this process.

I think that the confluence of groups and technology in this project 
jumps us ahead of the world in what is an area of great interest. It is 
also clear that unless it is supported by an ontological approach it 
won't succeed - most of what is currently done is trivial and messy:
lexical identification of potential chemicals (NER) and conversion to
"structures" (N2S). Although this works OK for some sub-branches of
chemistry it's broken for large areas. A major problem is that it does
not distinguish between molecules and substances; sub-problems are
that some substances are mutable and heterogeneous. Although PubChem
has a well-designed approach, most databases (e.g. ChemSpider) are
confounded by this fallacy. Although ChEBI is a worthy first essay at a
lexicon/thesaurus for chemistry, its ontology is recognized to be broken.

I suggest we concentrate on high-throughput extraction of information 
reported for organic chemical syntheses.

It has the advantages that it's important, tractable and has a good 
implicit ontology. In essence a reported synthesis combines the following:
  * a target "molecule" M with a connection table (graph) G
  * Properties (P1...Pn) of a substance S
  * an assertion that S is composed of many identical Ms.
  * assertions that observations O1...On are consistent with expected 
properties P1...Pn of M (usually spectroscopic). It is normally 
difficult to predict properties of S reliably.
  * a recipe (P) for making M from M0, M1 ... under conditions C0, C1... 
using reaction R.

In general M, G and R are unknown or unreported and only accessible 
through their data and metadata and linguistic environment.

In such a report there is both ambiguity and redundancy. It is therefore 
critical to use ontological methods including not only provenance but 
also confidence.

There is almost never a clear statement of the connection table (graph 
G) of M. This may change if journals include InChIs. In the absence of G 
we can deduce it from:
  * a crystal structure X (very high confidence, but probably only 5% 
occurrence).
  * a chemical name N. I am hoping for somewhere in the region of 50-70%
recall and 98% accuracy.
  * a vector diagram V of the connection table G. I expect ca 25% recall 
and 95% accuracy.
  * the spectroscopic properties (P0...Pn). This is a common 
undergraduate exercise for simple molecules.

These sources should contain redundancy and, in good cases, will
reinforce each other. In others they allow us to make good guesses about 
M-C which are further constrained by P0...Pn and R. A typical example of 
constraints in R is "M1 was treated with acetic anhydride to give M2". 
This would normally add one or more acetyl groups (CH3-C(=O)-) to the 
structure, each adding C2H2O to the formula. Thus even if we didn't 
know the structures of M1 or M2 but did know their formulae we could see 
if they were consistent. Similarly we would expect to see acetyl groups
in the NMR and IR spectra of M2.
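
The formula check described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function names are hypothetical, formulae are assumed to be simple
element/count strings (no brackets, charges or isotopes), and each
acetylation is assumed to replace one H by CH3-C(=O)-, i.e. a net gain
of C2H2O.

```python
import re
from collections import Counter

# Assumption: each acetylation replaces one H by CH3-C(=O)-, net +C2H2O.
ACETYL_DELTA = {"C": 2, "H": 2, "O": 1}


def parse_formula(formula: str) -> Counter:
    """Parse a simple formula such as 'C7H6O3' into element counts.

    Hypothetical helper: no brackets, charges or isotopes are handled.
    """
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def acetyl_groups_added(before: str, after: str):
    """Return n if 'after' equals 'before' plus n * C2H2O (n >= 1), else None."""
    b, a = parse_formula(before), parse_formula(after)
    # Element-wise difference, keeping only elements that actually changed.
    diff = {el: a[el] - b[el] for el in set(a) | set(b) if a[el] != b[el]}
    n = diff.get("O", 0)  # one extra O per acetyl group
    if n < 1:
        return None
    expected = {el: n * k for el, k in ACETYL_DELTA.items()}
    return n if diff == expected else None


# Salicylic acid -> aspirin: consistent with one acetylation.
print(acetyl_groups_added("C7H6O3", "C9H8O4"))
# Glucose -> glucose pentaacetate: consistent with five acetylations.
print(acetyl_groups_added("C6H12O6", "C16H22O11"))
# A formula pair that acetylation alone cannot explain.
print(acetyl_groups_added("C7H6O3", "C9H10O4"))
```

A real system would of course need a proper formula parser and a
library of such reaction "deltas"; the point is only that R constrains
the formulae of M1 and M2 even when neither structure is known.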

All this gives us a very exciting area indeed in which to apply 
ontologies. Nico Adams (copied) has created the tools which can support 
this. We propose that PSU and Cambridge create a system to extract N, X, 
V, C, O, P from thousands of current published syntheses and to attempt 
to deduce their "structures" (G). Where that can be done we can also try 
to extract R (as CML).

We shall need to agree a means of adding confidence to our triples. 
Later we shall need a means of chaining these confidences.
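
To make the discussion concrete, here is one possible shape for a
triple carrying confidence and provenance, with a deliberately naive
chaining rule. Everything here is an assumption for illustration: the
class, field and predicate names are hypothetical, and multiplying
confidences treats the assertions as independent, which will often not
hold. It is a starting point for agreement, not a proposal for the
final scheme.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Assertion:
    """A triple annotated with confidence and provenance (hypothetical model)."""
    subject: str
    predicate: str
    obj: str
    confidence: float  # assumed to be a probability-like value in [0, 1]
    source: str        # provenance, e.g. which tool or group produced it


def chain_confidence(assertions) -> float:
    """Naive chaining: multiply confidences, ASSUMING independence."""
    return prod(a.confidence for a in assertions)


# Illustrative values only; the predicate and source names are made up.
chain = [
    Assertion("report1", "hasName", "N", 0.98, "name-extraction"),
    Assertion("N", "hasConnectionTable", "G", 0.95, "name-to-structure"),
]
print(round(chain_confidence(chain), 3))
```

Under this rule a conclusion can never be more certain than its weakest
step, so it cannot express independent sources (say X and N)
reinforcing each other; that case would need a different combination
rule.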

We should not underestimate the difficulty of getting a transportable 
and re-usable workflow for this. Probably the main problem is the 
heterogeneity of the tools: PSU is very UNIX-oriented, while Cambridge
tends to use a Java framework. Tools like ps2edit may have platform-dependent 
problems and anything to do with PDF is a nightmare...