[Orechem] High-throughput semantic computation in OREChem
Carl Lagoze
clagoze at gmail.com
Wed Jun 17 11:31:11 EDT 2009
Thanks for detailing this, Peter. It would be great to get this up on
the wiki. Let me know if I can do anything to facilitate.
Carl
On Jun 12, 2009, at 10:31 AM, Peter Murray-Rust wrote:
> This is to review the subprojects that the computational geeks in
> OREChem have put together over the last few days. (a) is long term,
> (b) is immediate.
> (a) The general goal is to compute NMR spectra for all newly published
> compounds and compare them with the reported spectra. This is a new
> approach, "robot refereeing of chemistry publications", and any
> differences suggest errors or new chemistry. This is long term (months)
> and consists of the following (as we have put on the wiki):
> * PSU-Lee/Prasenjit retrieve chemistry-rich docs from publisher
> sites (ask-for-forgiveness policy) and segment the papers into text
> + non-text (tables, diagrams). This passes to:
> * Mark - Soton extracts molecules and spectra out of this and
> converts them to SVG. The short-term goal is to get this working by
> the end of next week in a pragmatic form. (We do not mind if recall
> is poor as long as we get a few SVGs, as we need to develop the
> machine-learning and/or heuristics and find out what unknown horrors
> we have to deal with.)
> Bitmaps are rejected at this stage.
> * PMR - Cambridge develops heuristics to interpret (i) molecules (ii)
> spectra (C13 and H1). These might later be crowdsourced. The output
> is CML molecules and spectra. It is unlikely we will have assignments.
> * PSU - Bill+Karl. Analyse spectra with peak-fitting.
> * IU - Marlon. Independently, molecules are passed to IU in CML and
> put into the NMREye workflow for computing peaks (below). IU run
> this automatically and return results in CML.
>
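A minimal sketch of the comparison step in (a), assuming the computed and observed spectra have already been reduced to lists of chemical shifts in ppm. The function name `flag_differences`, the nearest-peak matching, and the 2.0 ppm tolerance are illustrative assumptions, not part of the OREChem workflow:

```python
def flag_differences(computed_ppm, observed_ppm, tolerance_ppm=2.0):
    """Return observed shifts with no computed shift within tolerance.

    Hypothetical helper: nearest-peak matching and the default tolerance
    are assumptions made for illustration only.
    """
    flagged = []
    for obs in observed_ppm:
        # Distance from this observed peak to the closest computed peak.
        nearest = min(
            (abs(obs - calc) for calc in computed_ppm),
            default=float("inf"),
        )
        if nearest > tolerance_ppm:
            flagged.append(obs)
    return flagged


# Made-up shift values, purely to show the call.
computed = [14.1, 22.8, 128.5]
observed = [14.3, 23.0, 170.2]
print(flag_differences(computed, observed))
```

Any shift that comes back flagged would be a candidate for "error or new chemistry" and would still need a human look, since without assignments a nearest-peak match can pair the wrong peaks.
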
> (b) To get IU up to speed we shall start immediately on simple
> molecules from PubChem. This involves just Cambridge and IU.
> * The NMREye workflow has been developed and tested and should work
> on simple organic compounds. It consists of the following:
> - convert PubchemXML2CML (already available in JUMBO)
> - convert CML to Gaussian input. We have an XSLT script, but could
> convert this to Java in an hour.
> - in parallel - create RDF metadata for provenance to this point
> (as this does not survive the Gaussian run)
> ... submit and run job ... (IU) ... and collect results
> - convert LOG file to CML (JUMBOMarker, effectively done)
> - convert CML to RDF (JUMBO). Add GaussianOWL dictionary in RDF
>
> upload RDFs into repository/tripleStore
>
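The CML-to-Gaussian step in the workflow above could look roughly like the following. This is a Python stand-in, not the XSLT script mentioned in the message. It assumes each CML atom carries `elementType`, `x3`, `y3` and `z3` attributes and that the molecule is a neutral singlet. The function name `cml_to_gaussian` and the route line (method, basis set) are placeholders chosen for illustration; the message does not say which level of theory NMREye uses.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}


def cml_to_gaussian(cml_text, title="NMR job", route="# B3LYP/6-31G(d) NMR",
                    charge=0, multiplicity=1):
    """Build Gaussian input text from a CML molecule with 3D coordinates.

    Assumptions: atoms carry elementType/x3/y3/z3 attributes, and the
    default route line, charge and multiplicity are examples only.
    """
    root = ET.fromstring(cml_text)
    atom_lines = []
    for atom in root.iterfind(".//cml:atom", CML_NS):
        x, y, z = (float(atom.get(k)) for k in ("x3", "y3", "z3"))
        atom_lines.append(
            f"{atom.get('elementType'):<2} {x:12.6f} {y:12.6f} {z:12.6f}"
        )
    if not atom_lines:
        raise ValueError("no atoms found in CML input")
    # Layout: route, blank, title, blank, charge/multiplicity, atoms, blank.
    parts = [route, "", title, "", f"{charge} {multiplicity}", *atom_lines, ""]
    return "\n".join(parts) + "\n"


# Hand-written water molecule, used only to exercise the function.
sample = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
</molecule>"""
print(cml_to_gaussian(sample, title="water"))
```

Nothing in the Gaussian input records where the molecule came from, which is why the workflow writes the provenance RDF separately before the job is submitted.
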
> In (b) we would expect to get 10,000 - 100,000 small molecules from
> PubChem of up to, say, 15 first-row atoms. These already have 3D
> coordinates (I am ignoring conformers at this stage). The process
> should be automatic. Jobs take from 0.1 seconds to 1 day (probably)
> as they scale with N^4.
>
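To make the N^4 remark concrete, here is a small sketch of how run time could be extrapolated from one reference job. What N measures (atoms, electrons, or basis functions) is not stated in the message, and the helper `estimated_seconds` is hypothetical:

```python
def estimated_seconds(n, n_ref, t_ref_seconds, exponent=4):
    """Extrapolate run time from a reference job, assuming t ~ N**exponent."""
    return t_ref_seconds * (n / n_ref) ** exponent


# Doubling N multiplies the estimate by 2**4 = 16.
print(estimated_seconds(n=2, n_ref=1, t_ref_seconds=1.0))

# Ratio of sizes needed to span 0.1 s to 1 day under N**4 scaling.
size_ratio = (86400 / 0.1) ** 0.25
print(round(size_ratio, 1))
```

Under pure N^4 scaling, the quoted 0.1 second to 1 day range corresponds to roughly a 30-fold spread in N, so a simple size estimate per molecule may be enough to sort jobs into short and long queues.
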
> P.
>
> I will try to send this to the Wiki
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
> _______________________________________________
> Orechem mailing list
> Orechem at openarchives.org
> http://www.openarchives.org/mailman/listinfo/orechem