[Orechem] Re: High-throughput semantic computation in OREChem

Fri Jun 12 11:14:11 EDT 2009

Great - that's exciting Bill and I am sure that it will be invaluable for
assignment. However I am focussing on what we can integrate today. The
integration problems are not trivial and the more that the components - or
the sites - are modularised the faster progress we shall make.

It's important to be pragmatic at this stage - there are things we can do
now and things that are research. We should do both but we must make sure
that the infrastructure continues in a straight line. I detailed what we
could do at present (some as rough proof of concept) that could fit into a
linear workflow. We must make sure that the research efforts in the pipeline
I indicated are small, as the integration itself will still be
challenging.

So I am proposing that we should ask:
* what can we do by Friday 19?
* what can we do by the start of August?
* what can we do in the rest of the project?

Each part depends on the previous one:
* Mark needs a few papers from Lee/Prasenjit which have good PDF chemistry
* PMR needs a few molecules and spectra in SVG
* Marlon needs a few CML molecules and the NMREye workflow.

I agree that Mark's work on general PDF parsing is exciting but we need a
stream of molecules for the later stages.

I am also going to suggest that we try to arrange weekly telcons to review
progress. The problem of a pipeline/workflow is that all bits have to be
delivering.

P.

On Fri, Jun 12, 2009 at 3:47 PM, WILLIAM J BROUWER <wjb19 at psu.edu> wrote:

> cool peter...
>
> I would also add that there's some mileage in substructure & similarity
> search on spectra. Han gave a great talk this morning, there is strong
> application of his graph mining work to building up complicated spectra on
> the basis of simpler (sub)spectra...
>
> -bill
>
>
> On Fri, Jun 12, 2009 10:31 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>
> This is to review the subprojects that the computational geeks in OREChem
> have put together over the last few days. (a) is long term, (b) is immediate.
> (a) The general goal is to compute NMR spectra for all new published
> compounds and compare them with the published spectra. This is a new
> approach, "robot refereeing of chemistry publications", and any differences
> suggest errors or new chemistry. This is long term (months) and consists of
> the following (as we have put on the wiki):
> * PSU-Lee/Prasenjit retrieve chemistry-rich docs from publisher sites (ask
> for forgiveness policy) and segment the papers into text+non-text (tables,
> diagrams). This passes to:
> * Mark - Soton extracts molecules and spectra out of this and converts them
> to SVG. The short-term goal is to get this working by the end of next week
> in a pragmatic form. (We do not mind if recall is poor as long as we get a
> few SVGs, as we need to develop the machine-learning and/or heuristics and
> find out what unknown horrors we have to deal with.)
> Bitmaps are rejected at this stage.
> * PMR - Cambridge develops heuristics to interpret (i) molecules (ii)
> spectra (C13 and H1). These might later be crowdsourced. The output is CML
> molecules and spectra. It is unlikely we have assignments.
> * PSU - Bill+Karl. Analyse spectra with peak-fitting.
> * IU - Marlon. (independently) molecules are passed to IU in CML and put
> into the NMREye workflow for computing peaks (below). IU run this
> automatically and return results in CML.
>
> (b) To get IU up to speed we shall start immediately on simple molecules
> from Pubchem. This involves just Cambridge and IU.
> * The NMREye workflow has been developed and tested and should work on
> simple organic compounds. It consists of the following:
>   - convert PubchemXML2CML (already available in JUMBO)
>   - convert CML to Gaussian input. We have an XSLT script, but could
> convert this to Java in an hour.
>   - in parallel - create RDF metadata for provenance to this point (as this
> does not survive the Gaussian run)
>   ... submit and run job ... (IU) ... and collect results
>   - convert LOG file to CML (JUMBOMarker, effectively done)
>   - convert CML to RDF (JUMBO). Add GaussianOWL dictionary in RDF
>
> upload RDFs into repository/tripleStore
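
The steps listed above can be sketched as a linear chain in which each step
consumes what the previous one produced. This is only a rough illustration in
Python and not the actual NMREye or JUMBO code: the Stage class, the
find_gaps helper and the format labels ("Gaussian LOG", "stored RDF" and so
on) are names assumed here for the example. The provenance RDF step runs in
parallel, so it is left out of the linear chain.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One step of the workflow (hypothetical helper, not part of JUMBO)."""
    name: str       # step name, taken from the list above
    consumes: str   # input format label (illustrative only)
    produces: str   # output format label (illustrative only)


# Main chain of the NMREye workflow in the order listed above.
NMREYE_CHAIN: List[Stage] = [
    Stage("PubchemXML2CML", "PubChem XML", "CML"),
    Stage("CML to Gaussian input", "CML", "Gaussian input"),
    Stage("submit and run job", "Gaussian input", "Gaussian LOG"),
    Stage("LOG to CML", "Gaussian LOG", "CML"),
    Stage("CML to RDF", "CML", "RDF"),
    Stage("upload to repository/tripleStore", "RDF", "stored RDF"),
]


def find_gaps(chain: List[Stage]) -> List[str]:
    """Return one message per hand-off where the formats do not match."""
    gaps = []
    for upstream, downstream in zip(chain, chain[1:]):
        if upstream.produces != downstream.consumes:
            gaps.append(
                f"{upstream.name} produces {upstream.produces!r} but "
                f"{downstream.name} expects {downstream.consumes!r}"
            )
    return gaps


if __name__ == "__main__":
    for stage in NMREYE_CHAIN:
        print(f"{stage.consumes:>15} -> {stage.name} -> {stage.produces}")
    print("gaps:", find_gaps(NMREYE_CHAIN) or "none")
```

Writing the chain down this way makes a missing converter show up as a gap
between two steps before any job is submitted.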
>
> In (b) we would expect to get 10,000 - 100,000 small molecules from Pubchem
> of up to, say, 15 first row atoms. These already have 3D coordinates (I am
> ignoring conformers at this stage). The process should be automatic. Jobs
> take from 0.1 seconds to 1 day (probably) as they scale with N^4.
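
To get a feel for the quoted range, here is a small Python sketch of what N^4
scaling implies. It assumes cost is proportional to N^4 with no constant
overhead, and it leaves N as whatever size measure the scaling law refers to
(the text above does not say). The function names are made up for this
example.

```python
def relative_cost(n: float, n_ref: float, exponent: float = 4.0) -> float:
    """Cost of a job of size n relative to one of size n_ref, assuming
    cost is proportional to n**exponent (assumption: pure power law)."""
    return (n / n_ref) ** exponent


def size_ratio_for_time_ratio(t_long: float, t_short: float,
                              exponent: float = 4.0) -> float:
    """Size ratio that accounts for a given run-time ratio under the
    same power-law assumption."""
    return (t_long / t_short) ** (1.0 / exponent)


if __name__ == "__main__":
    one_day = 24 * 60 * 60  # seconds
    print("doubling N multiplies cost by", relative_cost(2, 1))
    print("size ratio spanning 0.1 s to 1 day:",
          round(size_ratio_for_time_ratio(one_day, 0.1), 1))
```

Under these assumptions doubling N costs 16 times as much, and the whole span
from 0.1 seconds to 1 day corresponds to a size ratio of a little over 30.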
>
> P.
>
> I will try to send this to the Wiki
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
>
>
>
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069