[Orechem] Cambridge Software and software for Redmond meeting
pm286 at cam.ac.uk
Sun Mar 8 04:57:04 EDT 2009
[Carl - while writing this mail I feel it should be accessible from the
wiki ... either linked from a mail page or cut-and-pasted]
WILLIAM J BROUWER wrote:
> looks great peter! I will have the PDF extractor wiki finished for you
> all before the meeting.
I think that we should have a hackfest at Redmond on this. Getting a
good supply of high-quality molecules must be an important target.
There are a number of positive threads:
Daniel Lowe (copied) is a first year PhD student in Cambridge and he
has been making excellent progress with OPSIN (our IUPAC-like
name2structure converter). It is always difficult to give accurate
metrics as there are no gold standards, no corpora, no annotation
guidelines for name2structure. (This does not stop a number of software
houses making unmeasurable claims for their software.) Given these
reservations we believe that OPSIN can give good conversion rates on
Pubchem IUPAC names for organic compounds (up to ca 80%, although some
corpora give considerably lower values). [Note that there are roughly 3
approaches to name2structure:
- lexical lookup. This is the only way for trivial names but
suffers from synonymy and lexical variants.
- interpretation of IUPAC-like names. IUPAC-like means that a human
or program has tried to follow some of the IUPAC rules. An example is
1-iodo-2-chloro-ethane. It's interpretable by machine.
- semi-systematic names: 2-chloro-testosterone. You have to know
what "testosterone" is and how it is numbered.]
The errors in conversion (i.e. it manages to get an answer but this is
wrong) are lower (ca 1%) than in most other systems.
OPSIN is Open Source at Sourceforge.
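To make the three approaches in the note above concrete, here is a rough Python sketch that only decides which approach a given name would need. It is not how OPSIN works: the tiny TRIVIAL_NAMES set, the locant regex and the classify_name function are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical toy lexicon; a real system would use a large curated dictionary.
TRIVIAL_NAMES = {"testosterone", "aspirin", "caffeine"}

# A name starting with one or more locants, e.g. "1-" or "1,2-".
LOCANT_PREFIX = re.compile(r"^\d+(,\d+)*-")


def classify_name(name: str) -> str:
    """Guess which name2structure approach a name would need."""
    normalized = name.strip().lower()
    # Approach 1: the whole name is a known trivial name.
    if normalized in TRIVIAL_NAMES:
        return "lexical lookup"
    # Approach 3: a trivial name carrying systematic substituents.
    if any(normalized.endswith(trivial) for trivial in TRIVIAL_NAMES):
        return "semi-systematic"
    # Approach 2: looks like it follows IUPAC-style locant conventions.
    if LOCANT_PREFIX.match(normalized):
        return "IUPAC-like"
    return "unknown"
```

With this toy lexicon, classify_name("1-iodo-2-chloro-ethane") gives "IUPAC-like" and classify_name("2-chloro-testosterone") gives "semi-systematic". The synonymy problem mentioned above shows up immediately: any trivial name missing from the lexicon falls through to "unknown".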
OSCAR is our chemical entity recognition and extraction software for
textual documents. OSCAR recognizes chemical names in text (ca 80-85% F1
according to corpus). OMII/ENGAGE are working with us to refactor OSCAR
over the next 3 months. OSCAR emits confidence scores for recognized
entities. OSCAR uses dictionary lookup combined with MEMM and other methods.
We can also extract recipes into a recipe ontology.
OSCAR is Open Source at Sourceforge.
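Since OSCAR is described above in terms of F1 and confidence scores, a small sketch of how the two interact may help. This is not OSCAR's API: the function name, the (text, confidence) input format and the idea of scoring against a set of gold-standard strings are simplifying assumptions (a real evaluation would compare annotated spans in a corpus).

```python
def f1_at_threshold(predictions, gold, threshold):
    """Return (precision, recall, F1) for entities kept at a confidence cut-off.

    predictions: iterable of (entity_text, confidence) pairs (hypothetical format)
    gold: set of entity strings a human annotator marked as chemical names
    """
    kept = {text for text, confidence in predictions if confidence >= threshold}
    true_positives = len(kept & gold)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example data, for illustration only.
predictions = [("benzene", 0.9), ("toluene", 0.6), ("the", 0.3)]
gold = {"benzene", "toluene", "ethanol"}
strict = f1_at_threshold(predictions, gold, 0.5)
lenient = f1_at_threshold(predictions, gold, 0.2)
```

Raising the threshold drops the spurious "the" and improves precision, while recall stays capped by the name that was never recognized at all.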
This is my code for converting SVG to chemical concepts. (The SVG is
generated from PDF by, say, ps2edit which Bill put me onto, or from WMF.)
The system can currently extract molecules in CML from vectors/text and
extract spectra from vectors/text. It relies on born digital vectors
which are much more tractable than OCR'ed vectors. Currently I suspect
about 20-25% of J.Org.Chem supplemental data is highly tractable.
Last week I had a useful visit to the SCAI Fraunhofer Institute in Sankt
Augustin where the bioinformatics group has several initiatives on
extraction of chemistry from documents. Our main thrust was
name2structure and it's clear that this depends considerably on source.
OCR'ed names often contain whitespace and grot (1(one) vs l(ell), etc.).
SCAI has put effort into dealing with whitespace - unfortunately most of
their software is not re-distributable.
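As an illustration of the kind of cleanup OCR'ed names need before name2structure, here is a minimal Python sketch. It is not SCAI's method or anything in OPSIN: the function clean_ocr_name and its three regex rules are assumptions of mine, they only handle lowercase names, and they will certainly misfire on some real names.

```python
import re


def clean_ocr_name(name: str) -> str:
    """Apply a few heuristic repairs to an OCR'ed chemical name."""
    # Drop stray whitespace around hyphens, commas and brackets.
    cleaned = re.sub(r"\s*([-,()\[\]])\s*", r"\1", name.strip())
    # A '1' or '0' flanked by letters is probably 'l' or 'o', e.g. "ch1oro".
    cleaned = re.sub(r"(?<=[a-z])1(?=[a-z])", "l", cleaned)
    cleaned = re.sub(r"(?<=[a-z])0(?=[a-z])", "o", cleaned)
    # A lone 'l' sitting where a locant belongs is probably '1', e.g. "l-iodo".
    cleaned = re.sub(r"(?<![a-z])l(?=[-,])", "1", cleaned)
    return cleaned
```

For example, clean_ocr_name("l-iodo - 2-ch1oro-ethane") returns "1-iodo-2-chloro-ethane". Whitespace inside a word (as opposed to around punctuation) is not handled here; that needs knowledge of the name grammar.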
They have also been working on OCRed vectors to chemistry (ChemOCR). OCR
has many disadvantages against born-digital - apart from the 1/l and O/0
stuff you cannot rely on knowing whether text is horizontally aligned,
whether vectors actually meet, etc. They clearly had additional loss in conversion.
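The "do vectors actually meet" problem can be sketched in a few lines of Python. This is only an illustration under my own assumptions (segments as pairs of (x, y) points, a hypothetical endpoints_meet helper, invented coordinates); it is not taken from ChemOCR or from my SVG code.

```python
import math


def endpoints_meet(seg_a, seg_b, tolerance=0.0):
    """True if any endpoint of seg_a lies within `tolerance` of an endpoint of seg_b.

    Each segment is a pair of (x, y) points, e.g. one bond line in a diagram.
    """
    return any(math.dist(p, q) <= tolerance for p in seg_a for q in seg_b)


# Born-digital: two bonds share an atom exactly, so zero tolerance works.
digital_a = ((0.0, 0.0), (1.0, 0.0))
digital_b = ((1.0, 0.0), (1.5, 0.866))

# Traced from a bitmap: the same junction is off by a few hundredths.
traced_a = ((0.0, 0.0), (0.97, 0.02))
traced_b = ((1.0, 0.0), (1.5, 0.866))
```

With born-digital input, endpoints_meet(digital_a, digital_b) is true with no tolerance at all. The traced pair only joins once a tolerance (here 0.05) is allowed, and choosing that tolerance is where spurious or missing bonds creep in.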
Chemistry has a data drought and holders of data are not likely to let
us have it without hairy legal negotiations. I see the following
mainstream resources of substantial chemical data (as opposed to general
metadata such as CiteSeer already extracts):
* crystallography (from supplemental CIFs)
* organic recipes from supplemental text. Error rates will be ca 5-20%.
* organic structures from vectors in supplemental data. Recall will be
ca. 25-50% (too many bitmaps and other grot), but errors should be low.
* organic structures from names in supplemental data. I'm guessing
about 50-70% recall and 95% accuracy of conversion.
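To see what the guessed figures for names imply overall, here is a back-of-envelope calculation in Python. It assumes the 95% accuracy applies only to the names that were actually found, so the fraction of all structures recovered correctly is recall times accuracy; that reading, and the function name, are my assumptions.

```python
def expected_correct_fraction(recall: float, accuracy: float) -> float:
    """Fraction of all structures in the source that end up correctly extracted.

    Assumes accuracy is measured over the recalled items only.
    """
    if not (0.0 <= recall <= 1.0 and 0.0 <= accuracy <= 1.0):
        raise ValueError("recall and accuracy must be fractions between 0 and 1")
    return recall * accuracy


# Guessed range for structures from names in supplemental data.
low = expected_correct_fraction(0.50, 0.95)
high = expected_correct_fraction(0.70, 0.95)
```

On these guesses, somewhere between 0.475 and 0.665 of the named structures would come out both found and correct.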
I suggest that we create a clear chemistry data extraction project for
synthetic organic chemistry. I think most of the current s/w has been
created at Cambridge and PSU. I think ORE will be a key tool (see next