[Orechem] Re: High-throughput semantic computation in OREChem

Fri Jun 12 16:12:32 EDT 2009

Great,
hope we can set up some telcons anyway

On Fri, Jun 12, 2009 at 7:24 PM, Marlon Pierce <mpierce at cs.indiana.edu> wrote:

>  My time over the next couple of weeks will be dominated by TeraGrid 09
> preparation, but I will make some headway in getting started and will be
> working in earnest by end of June.
>
>
> Marlon
>
>
>
> Peter Murray-Rust wrote:
>
> Great - that's exciting Bill and I am sure that it will be invaluable for
> assignment. However I am focusing on what we can integrate today. The
> integration problems are not trivial and the more that the components - or
> the sites - are modularised the faster progress we shall make.
>
> It's important to be pragmatic at this stage - there are things we can do
> now and things that are research. We should do both but we must make sure
> that the infrastructure continues in a straight line. I detailed what we
> could do at present (some as rough proof of concept) that could fit into a
> linear workflow. We must make sure that the research efforts in the pipeline
> I indicated are small as the integration itself will still be
> challenging.
>
> So I am proposing that we should ask:
> * what can we do by Friday 19?
> * what can we do by the start of August?
> * what can we do in the rest of the project?
>
> Each part depends on the previous one:
> * Mark needs a few papers from Lee/Prasenjit which have good PDF chemistry
> * PMR needs a few molecules and spectra in SVG
> * Marlon needs a few CML molecules and the NMREye workflow.
>
> I agree that Mark's work on general PDF parsing is exciting but we need a
> stream of molecules for the later stages.
>
> I am also going to suggest that we try to arrange weekly telcons to review
> progress. The problem of a pipeline/workflow is that all bits have to be
> delivering.
>
> P.
>
>
>
> On Fri, Jun 12, 2009 at 3:47 PM, WILLIAM J BROUWER <wjb19 at psu.edu> wrote:
>
>> cool peter...
>>
>> I would also add that there's some mileage in substructure & similarity
>> search on spectra. Han gave a great talk this morning, there is strong
>> application of his graph mining work to building up complicated spectra on
>> the basis of simpler (sub)spectra...
>>
>> -bill
>>
>>
>> On Fri, Jun 12, 2009 10:31 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>>
>> This is to review the subprojects that the computational geeks in OREChem
>> have put together over the last few days. (a) is long term, (b) is immediate.
>> (a) The general goal is to compute NMR spectra for all newly published
>> compounds and compare them with the published spectra. This is a new
>> approach, "robot refereeing of chemistry publications", and any differences
>> suggest errors or new chemistry. This is long term (months) and consists of
>> the following (as we have put on the wiki):
>> * PSU-Lee/Prasenjit retrieve chemistry-rich docs from publisher sites (ask
>> for forgiveness policy) and segment the papers into text+non-text (tables,
>> diagrams). This passes to:
>> * Mark - Soton extracts molecules and spectra out of this and converts
>> them to SVG. The short-term goal is to get this working by the end of next
>> week in a pragmatic form. (We do not mind if recall is poor as long as we
>> get a few SVGs, as we need to develop the machine-learning and/or heuristics
>> and find out what unknown horrors we have to deal with.)
>> Bitmaps are rejected at this stage.
>> * PMR - Cambridge develops heuristics to interpret (i) molecules (ii)
>> spectra (C13 and H1). These might later be crowdsourced. The output is CML
>> molecules and spectra. It is unlikely that we will have assignments.
>> * PSU - Bill+Karl. Analyse spectra with peak-fitting.
>> * IU - Marlon. (independently) molecules are passed to IU in CML and put
>> into the NMREye workflow for computing peaks (below). IU run this
>> automatically and return results in CML
>>
>> (b) To get IU up to speed we shall start immediately on simple molecules
>> from Pubchem. This involves just Cambridge and IU.
>> * The NMREye workflow has been developed and tested and should work on
>> simple organic compounds. It consists of the following:
>>   - convert PubchemXML2CML (already available in JUMBO)
>>   - convert CML to Gaussian input. We have an XSLT script, but could
>> convert this to Java in an hour.
>>   - in parallel - create RDF metadata for provenance to this point (as
>> this does not survive the Gaussian run)
>>   ... submit and run job ... (IU) ... and collect results
>>   - convert LOG file to CML (JUMBOMarker, effectively done)
>>   - convert CML to RDF (JUMBO). Add GaussianOWL dictionary in RDF
>>
>> upload RDFs into repository/tripleStore
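
As a rough illustration of the "convert CML to Gaussian input" step above, here is a minimal Python sketch. It is not the XSLT script or the JUMBO code mentioned in the message. The function name cml_to_gaussian, the route line "# B3LYP/6-31G(d) NMR" and the default charge and multiplicity are assumptions chosen only for the example; the message does not say which method or basis set the workflow uses. It assumes each CML atom carries elementType, x3, y3 and z3 attributes.

```python
import xml.etree.ElementTree as ET

# Namespace used by CML schema documents.
CML_NS = "{http://www.xml-cml.org/schema}"

# Tiny hand-written CML molecule used only as example input.
WATER_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
</molecule>"""


def cml_to_gaussian(cml_text, route="# B3LYP/6-31G(d) NMR",
                    charge=0, multiplicity=1):
    """Build a Gaussian input deck from one CML molecule with 3D coordinates.

    The route line, charge and multiplicity are placeholders (assumptions),
    not values taken from the NMREye workflow.
    """
    root = ET.fromstring(cml_text)
    # The molecule may be the root element or nested inside a wrapper.
    if root.tag == CML_NS + "molecule":
        molecule = root
    else:
        molecule = root.find(".//" + CML_NS + "molecule")
    if molecule is None:
        raise ValueError("no CML <molecule> element found")

    title = molecule.get("id", "molecule")
    # Layout: route section, blank, title, blank, charge and multiplicity.
    lines = [route, "", title, "", f"{charge} {multiplicity}"]

    for atom in molecule.iter(CML_NS + "atom"):
        coords = [atom.get(key) for key in ("x3", "y3", "z3")]
        if atom.get("elementType") is None or None in coords:
            raise ValueError("atom lacks elementType or 3D coordinates")
        x, y, z = (float(c) for c in coords)
        lines.append(f"{atom.get('elementType'):<2} {x:12.6f} {y:12.6f} {z:12.6f}")

    # A blank line terminates the geometry section.
    lines.append("")
    return "\n".join(lines) + "\n"
```

Calling cml_to_gaussian(WATER_CML) returns the text of an input deck with one line per atom. Provenance metadata is not carried into the deck, which is why the workflow records it in RDF separately before the run.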
>>
>> In (b) we would expect to get 10,000 - 100,000 small molecules from
>> Pubchem of up to, say, 15 first-row atoms. These already have 3D
>> coordinates (I am ignoring conformers at this stage). The process should be
>> automatic. Jobs take from 0.1 seconds to 1 day (probably) as they scale with
>> N^4.
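
To make the N^4 scaling concrete, here is a small Python sketch of the arithmetic. The message does not define N, so it is treated here as a generic size measure (for example atom count or number of basis functions). Both function names and the reference timing in the usage note are hypothetical.

```python
def estimated_seconds(n, ref_n, ref_seconds, exponent=4):
    """Scale a reference timing to size n, assuming cost grows as n**exponent."""
    return ref_seconds * (n / ref_n) ** exponent


def size_ratio_for_time_ratio(t_small, t_large, exponent=4):
    """Size ratio implied by two run times under the same power-law scaling."""
    return (t_large / t_small) ** (1.0 / exponent)


# Spread between the quoted extremes: 0.1 seconds and 1 day (86400 seconds).
ratio = size_ratio_for_time_ratio(0.1, 86400.0)
```

Under this assumption, the quoted range of 0.1 seconds to 1 day corresponds to a size ratio between 30 and 31. Likewise, if a job of size 5 took a hypothetical 1 second, estimated_seconds(15, 5, 1.0) would put a job of size 15 at 81 seconds. Estimates of this kind could be used to queue the cheapest jobs first.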
>>
>> P.
>>
>> I will try to send this to the Wiki
>>
>>
>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
>>
>>
>>
>>
>>
>
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069