[Orechem] High-throughput semantic computation in OREChem

Mon Jun 15 08:48:15 EDT 2009

Peter,

I know that NMREye workflow is already established but have you considered integrating to Chris Steinbeck's NMR prediction web services? See: http://www.ebi.ac.uk/nmrshiftdb/axis/NMRShiftDB.wsdl

I met with Chris and Stefan Kuhn while I was in Cambridge a couple of weeks ago and we will be doing the work to integrate ChemSPider to the web services ourselves in the near future.

If you would like to test NMREye on some real world spectra there are some excellent Open Data now available on ChemSPider and they can be reviewed here: http://www.chemspider.com/spectra.aspx. The molecules and associated JCAMP spectra can be downloaded. I know that this is not the project that you are trying to do but it may form a good test set anyway for NMREye. The data have been curated via the spectral game... www.spectralgame.com<http://www.spectralgame.com>

Best wishes.

Antony Williams, VP Strategic Development
ChemSpider, Royal Society of Chemistry

US Office: 904 Tamaras Circle, Wake Forest, NC-27587

Phone: +1 (919) 201-1516
Fax: +1 (919) 300-5321
Email: antony.williams at chemspider.com

URL: www.chemspider.com
Blog: http://www.chemspider.com/blog
Twitter: http://twitter.com/ChemSpiderman
Skype: tony27587
LinkedIn: http://www.linkedin.com/in/antonywilliams

________________________________
From: orechem-bounces at openarchives.org [orechem-bounces at openarchives.org] On Behalf Of Peter Murray-Rust [pm286 at cam.ac.uk]
Sent: Friday, June 12, 2009 10:31 AM
To: Geoffrey Fox; Marlon Pierce; Jim Downing; Nick Day; m.i.borkum at soton.ac.uk
Cc: orechem at openarchives.org
Subject: [Orechem] High-throughput semantic computation in OREChem

This is to review the subprojects that the computational geeks in OREChem have put together over the last few days. (a) is long term, (b) is immediate
(a) The general goal is to compute NMR spectra for all new published compounds and compare them with spectra. This is a new approach "robot refereeing of chemistry publications" and any differences suggest errors or new chemistry. This is long term (months) and consists of the following (as we have put on the wiki):
* PSU-Lee/Prasenjit retrieve chemistry-rich docs from publisher sites (ask for forgiveness policy) and segment the papers into text+non-text (tables, diagrams). This passes to:
* Mark - Soton extracts molecules and spectra out of this and converts them to SVG. The short-term goal is to get this working by the end of next week in a pragmatic form. (we do not mind if recall is poor as long as we get a few SVGs as we need to develop the machine-learning and/or heuristics and find out what unknown horrors we have to deal with.
Bitmaps are rejected at this stage
* PMR- cambridge develops heuristics to interpret (i) molecules (ii) spectra (C13 and H1). These might later be crowdsourced. The output is CML molecules and spectra. It is unlikely we have assignments
* PSU - Bill+Karl. Analyse spectra with peak-fitting.
* IU - Marlon. (independently) molecules are passed to IU in CML and put into the NMREye workflow for computing peaks (below). IU run this automatically and return results in CML

(b) To get IU up to speed we shall start immediately on simple molecules from Pubchem. This involves just Cambridge and IU.
* The NMREye workflow has been developed and tested and should work on simple organic compounds. It consists of the following:
  - convert PubchemXML2CML (already available in JUMBO)
  - convert CML to Gaussian input. We have an XSLT script, but could convert this to Java in an hour.
  - in parallel - create RDF metadata for provenance to this point (as this does not survive the Gaussian run)
  ... submit and run job ... (IU) ... and collect results
 - convert LOG file to CML (JUMBOMarker, effectively done)
 - convert CML to RDF (JUMBO). Add GaussianOWL dictionary in RDF

upload RDFs into reopository/tripleStore

In (b) we would expect to get 10,000 - 100,000 small molecules from Pubchem of up to, say , 15 first row atoms. These already have 3D coordinates (I am ignoring conformers at this stage). The process should be automatic. Jobs take from 0.1 seconds to 1 day (probably) as they scale with N^4.

P.

I will try to send this to the Wiki

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

DISCLAIMER:

This communication (including any attachments) is intended for the use of the addressee only and may contain confidential, privileged or copyright material. It may not be relied upon or disclosed to any other person without the consent of the RSC. If you have received it in error, please contact us immediately. Any advice given by the RSC has been carefully formulated but is necessarily based on the information available, and the RSC cannot be held responsible for accuracy or completeness. In this respect, the RSC owes no duty of care and shall not be liable for any resulting damage or loss. The RSC acknowledges that a disclaimer cannot restrict liability at law for personal injury or death arising through a finding of negligence. The RSC does not warrant that its emails or attachments are Virus-free: Please rely on your own screening.