[Orechem] High-throughput semantic computation in OREChem

Wed Jun 17 11:31:11 EDT 2009

Thanks for detailing this, Peter.  It would be great to get this up on  
the wiki.  Let me know if I can do anything to facilitate.

Carl
On Jun 12, 2009, at 10:31 AM, Peter Murray-Rust wrote:

> This is to review the subprojects that the computational geeks in  
> OREChem have put together over the last few days. (a) is long term,  
> (b) is immediate.
> (a) The general goal is to compute NMR spectra for all newly published
> compounds and compare them with the spectra reported in the papers.
> This is a new approach, "robot refereeing of chemistry publications":
> any differences suggest errors or new chemistry. This is long term
> (months) and consists of the following (as we have put on the wiki):
> * PSU - Lee/Prasenjit retrieve chemistry-rich docs from publisher
> sites (ask-for-forgiveness policy) and segment the papers into text
> and non-text (tables, diagrams). This passes to:
> * Mark - Soton extracts molecules and spectra out of this and
> converts them to SVG. The short-term goal is to get this working by
> the end of next week in a pragmatic form. (We do not mind if recall
> is poor as long as we get a few SVGs, as we need to develop the
> machine-learning and/or heuristics and find out what unknown horrors
> we have to deal with.)
> Bitmaps are rejected at this stage.
> * PMR - Cambridge develops heuristics to interpret (i) molecules and
> (ii) spectra (C13 and H1). These might later be crowdsourced. The output
> is CML molecules and spectra. It is unlikely that we will have assignments.
> * PSU - Bill+Karl. Analyse spectra with peak-fitting.
> * IU - Marlon. Independently, molecules are passed to IU in CML and
> put into the NMREye workflow for computing peaks (below). IU run
> this automatically and return results in CML.
>
> (b) To get IU up to speed we shall start immediately on simple  
> molecules from Pubchem. This involves just Cambridge and IU.
> * The NMREye workflow has been developed and tested and should work  
> on simple organic compounds. It consists of the following:
>   - convert PubchemXML2CML (already available in JUMBO)
>   - convert CML to Gaussian input. We have an XSLT script, but could  
> convert this to Java in an hour.
>   - in parallel - create RDF metadata for provenance to this point  
> (as this does not survive the Gaussian run)
>   ... submit and run job ... (IU) ... and collect results
>   - convert LOG file to CML (JUMBOMarker, effectively done)
>   - convert CML to RDF (JUMBO). Add GaussianOWL dictionary in RDF
>
> upload RDFs into repository/tripleStore
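
Because the provenance does not survive the Gaussian run, it has to be written out separately before submission. The sketch below shows one way that could look, emitting two N-Triples statements per job with plain Python. The choice of Dublin Core terms (dcterms:source, dcterms:created) is an assumption, not the vocabulary OREChem settled on, and the function name and example.org URIs are hypothetical.

```python
from datetime import datetime, timezone

DCTERMS = "http://purl.org/dc/terms/"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def provenance_ntriples(job_uri, source_uri, created):
    """Return N-Triples linking a job to its source record and creation time.

    `created` should be a timezone-aware datetime; it is written in UTC.
    """
    stamp = created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    triples = [
        f"<{job_uri}> <{DCTERMS}source> <{source_uri}> .",
        f'<{job_uri}> <{DCTERMS}created> "{stamp}"^^<{XSD_DATETIME}> .',
    ]
    return "\n".join(triples) + "\n"


# Hypothetical identifiers, for illustration only.
print(provenance_ntriples(
    "http://example.org/orechem/job/1",
    "http://example.org/pubchem/compound/1",
    datetime(2009, 6, 12, 14, 31, tzinfo=timezone.utc),
))
```

The output can be stored next to the job input and later merged with the RDF generated from the Gaussian results, keyed on the job URI.
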
>
> In (b) we would expect to get 10,000 - 100,000 small molecules from
> Pubchem of up to, say, 15 first-row atoms. These already have 3D
> coordinates (I am ignoring conformers at this stage). The process  
> should be automatic. Jobs take from 0.1 seconds to 1 day (probably)  
> as they scale with N^4.
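
A quick sanity check on those timings, as a rough sketch. It assumes N is some measure of problem size such as the number of basis functions (the message does not define N) and that run time follows a pure N^4 law with no fixed overhead. Both function names are made up for this example.

```python
def size_ratio_for_time_span(t_min_s, t_max_s, exponent=4):
    """Ratio of largest to smallest N implied by a run-time span under N**exponent."""
    return (t_max_s / t_min_s) ** (1.0 / exponent)


def scaled_time(t_ref_s, n_ref, n, exponent=4):
    """Extrapolate a reference timing at size n_ref to size n under N**exponent."""
    return t_ref_s * (n / n_ref) ** exponent


# 0.1 seconds to 1 day (86,400 seconds) corresponds to a factor of about 30 in N.
print(size_ratio_for_time_span(0.1, 86400))

# Doubling N multiplies the run time by 2**4 = 16.
print(scaled_time(1.0, 10, 20))
```

So the quoted range of 0.1 seconds to 1 day is consistent with the largest jobs being roughly 30 times bigger in N than the smallest ones.
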
>
> P.
>
> I will try to send this to the Wiki
>
>
> -- 
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
> _______________________________________________
> Orechem mailing list
> Orechem at openarchives.org
> http://www.openarchives.org/mailman/listinfo/orechem