[Orechem] Cambridge Software and software for Redmond meeting

Sun Mar 8 04:57:04 EDT 2009

[Carl - while writing this mail I feel it should be accessible from the 
wiki ... either linked from a mail page or cut-and-pasted]

WILLIAM J BROUWER wrote:
> looks great peter! I will have the PDF extractor wiki finished for you 
> all before the meeting.

I think that we should have a hackfest at Redmond on this. Getting a 
good supply of high-quality molecules must be an important target.

There are a number of positive threads:

  OPSIN
  =====
  Daniel Lowe (copied) is a first year PhD student in Cambridge and he 
has been making excellent progress with OPSIN (our IUPAC-like 
name2structure converter). It is always difficult to give accurate 
metrics as there are no gold standards, no corpora, no annotation 
guidelines for name2structure. (This does not stop a number of software 
houses making unmeasurable claims for their software.) Given these 
reservations we believe that OPSIN can give good conversion rates on 
Pubchem IUPAC names for organic compounds (up to ca 80%, although some 
corpora give considerably lower values). [Note that there are roughly 3 
approaches to name2structure:
     - lexical lookup. This is the only way for trivial names but 
suffers from synonymy and lexical variants.
     - interpretation of IUPAC-like names. IUPAC-like means that a human 
or program has tried to follow some of the IUPAC rules. An example is 
1-iodo-2-chloro-ethane. It's interpretable by machine.
     - semi-systematic names, e.g. 2-chloro-testosterone. You have to know 
what "testosterone" is and how it is numbered.]
  The errors in conversion (i.e. it manages to get an answer but this is 
wrong) are lower (ca 1%) than for most other systems.
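
To make the difference between "no answer" and "wrong answer" concrete, 
here is a minimal Python sketch of how the two figures could be tallied. 
It is not part of OPSIN: the function name score_conversions and the toy 
data are hypothetical, "conversion rate" is assumed to mean the fraction 
of names converted correctly, and a reference structure is assumed to 
exist for every name, which (as noted above) is exactly what is missing 
in practice.

```python
from __future__ import annotations

from typing import Optional


def score_conversions(results: list[tuple[Optional[str], str]]) -> dict[str, float]:
    """Tally name2structure outcomes (hypothetical helper, not OPSIN code).

    Each item is (predicted, reference). predicted is None when the
    converter produced no answer. Structures are compared as plain strings,
    so both sides are assumed to be canonicalised the same way.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    answered = sum(1 for pred, _ in results if pred is not None)
    wrong = sum(1 for pred, ref in results if pred is not None and pred != ref)
    return {
        # Assumption: "conversion rate" counts only correct answers.
        "conversion_rate": (answered - wrong) / total,
        # An answer was produced, but it does not match the reference.
        "error_rate": wrong / total,
        # The converter gave up on the name.
        "no_answer_rate": (total - answered) / total,
    }


# Toy data: structure strings are illustrative placeholders only.
toy = [("ClCCI", "ClCCI"), (None, "CCO"), ("CCO", "CCO"), ("CCN", "CCO")]
rates = score_conversions(toy)
```

Keeping the error rate separate from the no-answer rate matters because a 
converter that declines to answer is much less harmful than one that 
silently returns the wrong structure.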

OPSIN is Open Source at Sourceforge.
http://sourceforge.net/projects/oscar3-chem/

  OSCAR
  =====
  OSCAR is our chemical entity recognition and extraction software for 
textual documents. OSCAR recognizes chemical names in text (ca 80-85% F1 
according to corpus). OMII/ENGAGE are working with us to refactor OSCAR 
over the next 3 months. OSCAR emits confidence scores for recognized 
entities. OSCAR uses dictionary lookup combined with MEMM and other methods.

We can also extract recipes into a recipe ontology.

OSCAR is Open Source at Sourceforge.
http://sourceforge.net/projects/oscar3-chem/

  SVG2CML
  =======
  This is my code for converting SVG to chemical concepts. (The SVG is 
generated from PDF by, say, ps2edit, which Bill put me onto, or from WMF.) 
The system can currently extract molecules in CML from vectors/text and 
extract spectra from vectors/text. It relies on born-digital vectors, 
which are much more tractable than OCR'ed vectors. Currently I suspect 
about 20-25% of J.Org.Chem supplemental data is highly tractable.

=========

Last week I had a useful visit to the SCAI Fraunhofer Institute in Sankt 
Augustin where the bioinformatics group has several initiatives on 
extraction of chemistry from documents. Our main thrust was 
name2structure and it's clear that this depends considerably on source. 
OCR'ed names often contain whitespace and grot (1(one) vs l(ell), etc.). 
SCAI has put effort into dealing with whitespace - unfortunately most of 
their SW is not re-distributable.
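
As an illustration of the kind of cleanup involved, here is a minimal 
Python sketch. It is not SCAI's method (which I cannot redistribute or 
describe in detail); the function clean_ocr_name and its two heuristics 
are assumptions for illustration only. It removes whitespace around 
hyphens and commas and treats a lone "l" in a locant position as the 
digit 1. It deliberately leaves O/0 alone, because "O" is a legitimate 
locant (as in O-methyl), and it does not try to repair whitespace inside 
a word.

```python
import re


def clean_ocr_name(name: str) -> str:
    """Heuristic cleanup of an OCR'ed IUPAC-like name (illustrative only)."""
    # OCR often inserts spaces around hyphens and commas; IUPAC-like names
    # do not normally contain them, so remove them.
    name = re.sub(r"\s*([-,])\s*", r"\1", name.strip())
    # A lone 'l' that is not preceded by a letter or digit and is followed
    # by a hyphen or comma sits where a locant belongs, so read it as 1.
    name = re.sub(r"(?<![A-Za-z0-9])l(?=[-,])", "1", name)
    return name


cleaned = clean_ocr_name("l - iodo -2- chloro - ethane")
```

Rules like these are cheap but source-dependent, so any real pipeline 
would need to be checked against names from the publisher in question.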

They have also been working on OCRed vectors to chemistry (ChemOCR). OCR 
has many disadvantages against born-digital - apart from the 1/l and O/0 
stuff you cannot rely on knowing whether text is horizontally aligned, 
vectors actually meet, etc. They clearly had additional loss in conversion.

Summary
=======

Chemistry has a data drought and holders of data are not likely to let 
us have it without hairy legal negotiations. I see the following 
mainstream resources of substantial chemical data (as opposed to general 
metadata such as CiteSeer already extracts):
  * crystallography (from supplemental CIFs)
  * organic recipes from supplemental text. Error rates will be ca 5-20% 
I think.
  * organic structures from vectors in supplemental data. Recall will be 
ca. 25-50% (too many bitmaps and other grot). But errors should be low 
(< 10%)
  * organic structures from names in supplemental data. I'm guessing 
about 50-70% recall and 95% accuracy of conversion.
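
For the names route, the guesses above can be turned into a rough yield 
figure. The sketch below is back-of-envelope arithmetic only: the function 
name expected_correct and the corpus size of 1000 names are hypothetical, 
and it assumes that the accuracy figure applies to the names that were 
actually recalled.

```python
def expected_correct(n_names: int, recall: float, accuracy: float) -> float:
    """Expected number of correctly converted structures.

    Assumes accuracy is measured on the recalled names only, so the two
    rates simply multiply.
    """
    return n_names * recall * accuracy


# Low and high ends of the guessed recall range, at 95% accuracy,
# for a hypothetical corpus of 1000 names.
low = expected_correct(1000, 0.50, 0.95)
high = expected_correct(1000, 0.70, 0.95)
```

On these guesses, roughly 475 to 665 correct structures would come out of 
every 1000 names present in the supplemental data.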

I suggest that we create a clear chemistry data extraction project for 
synthetic organic chemistry. I think most of the current s/w has been 
created at Cambridge and PSU. I think ORE will be a key tool (see next 
post).

P.
