[Orechem] Re: High-throughput semantic computation in OREChem

Mon Jun 29 18:20:15 EDT 2009

On Mon, Jun 29, 2009 at 11:11 PM, Marlon Pierce <mpierce at cs.indiana.edu> wrote:

>  * Taking a look at this now.  I grabbed and built Jumbo 5.5-b1 from
> SourceForge.  "mvn clean install" fails some tests (I'm getting "JNI InChI
> has failed to load the native libraries required"), but "mvn clean install
> -Dmaven.test.skip=true" works (compiles).
>

InChI is a menace because it's written in C and is messy. JNI-InChI is a JNI
wrapper around it.

Does it run as well as compile?

>
>
>
> * Assuming my Jumbo version and build are OK, I need to generate CML from
> PubChem with Jumbo. First, which PubChem XML should I use?  I presume the 3D
> version.
>

You need the JUMBO converter. Then you should run PubChem2CML.

We need some modest docco here because I tend to run it from Eclipse. Nick
has a Clojure pipeline. I will talk with him tomorrow. If we can't make that
work, I will create the docco.

P.

>
>
> * Finally, what is the command for doing this with Jumbo?
>
>
> Thanks, more questions to follow.
>
>
> Marlon
>
>
> Peter Murray-Rust wrote:
>
> Great,
> hope we can set up some telcons anyway
>
> On Fri, Jun 12, 2009 at 7:24 PM, Marlon Pierce <mpierce at cs.indiana.edu> wrote:
>
>>  My time over the next couple of weeks will be dominated by TeraGrid 09
>> preparation, but I will make some headway in getting started and will be
>> working in earnest by end of June.
>>
>>
>> Marlon
>>
>>
>> Peter Murray-Rust wrote:
>>
>> Great - that's exciting, Bill, and I am sure that it will be invaluable for
>> assignment. However, I am focussing on what we can integrate today. The
>> integration problems are not trivial, and the more the components - or
>> the sites - are modularised, the faster progress we shall make.
>>
>> It's important to be pragmatic at this stage - there are things we can do
>> now and things that are research. We should do both but we must make sure
>> that the infrastructure continues in a straight line. I detailed what we
>> could do at present (some as rough proof of concept) that could fit into a
>> linear workflow. We must make sure that the research efforts in the pipeline
>> I indicated are small, as the integration itself will still be
>> challenging.
>>
>> So I am proposing that we should ask:
>> * what can we do by Friday 19?
>> * what can we do by the start of August?
>> * what can we do in the rest of the project?
>>
>> Each part depends on the previous one:
>> * Mark needs a few papers from Lee/Prasenjit which have good PDF chemistry
>> * PMR needs a few molecules and spectra in SVG
>> * Marlon needs a few CML molecules and the NMREye workflow.
>>
>> I agree that Mark's work on general PDF parsing is exciting but we need a
>> stream of molecules for the later stages.
>>
>> I am also going to suggest that we try to arrange weekly telcons to review
>> progress. The problem with a pipeline/workflow is that every part has to be
>> delivering.
>>
>> P.
>>
>>
>>
>> On Fri, Jun 12, 2009 at 3:47 PM, WILLIAM J BROUWER <wjb19 at psu.edu> wrote:
>>
>>> cool peter...
>>>
>>> I would also add that there's some mileage in substructure & similarity
>>> search on spectra. Han gave a great talk this morning, there is strong
>>> application of his graph mining work to building up complicated spectra on
>>> the basis of simpler (sub)spectra...
>>>
>>> -bill
>>>
>>>
>>> On Fri, Jun 12, 2009 10:31 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>>>
>>> This is to review the subprojects that the computational geeks in OREChem
>>> have put together over the last few days. (a) is long term, (b) is immediate.
>>>
>>> (a) The general goal is to compute NMR spectra for all new published
>>> compounds and compare them with the reported spectra. This is a new
>>> approach, "robot refereeing of chemistry publications", and any differences
>>> suggest errors or new chemistry. This is long term (months) and consists of
>>> the following (as we have put on the wiki):
>>> * PSU-Lee/Prasenjit retrieve chemistry-rich docs from publisher sites
>>> (ask for forgiveness policy) and segment the papers into text+non-text
>>> (tables, diagrams). This passes to:
>>> * Mark - Soton extracts molecules and spectra out of this and converts
>>> them to SVG. The short-term goal is to get this working by the end of next
>>> week in a pragmatic form. (We do not mind if recall is poor as long as we
>>> get a few SVGs, as we need to develop the machine-learning and/or heuristics
>>> and find out what unknown horrors we have to deal with.)
>>> Bitmaps are rejected at this stage.
>>> * PMR - Cambridge develops heuristics to interpret (i) molecules (ii)
>>> spectra (C13 and H1). These might later be crowdsourced. The output is CML
>>> molecules and spectra. It is unlikely that we will have assignments.
>>> * PSU - Bill+Karl. Analyse spectra with peak-fitting.
>>> * IU - Marlon. (independently) molecules are passed to IU in CML and put
>>> into the NMREye workflow for computing peaks (below). IU run this
>>> automatically and return results in CML
>>>
>>> (b) To get IU up to speed we shall start immediately on simple molecules
>>> from Pubchem. This involves just Cambridge and IU.
>>> * The NMREye workflow has been developed and tested and should work on
>>> simple organic compounds. It consists of the following:
>>>   - convert PubchemXML2CML (already available in JUMBO)
>>>   - convert CML to Gaussian input. We have an XSLT script, but could
>>> convert this to Java in an hour.
>>>   - in parallel - create RDF metadata for provenance to this point (as
>>> this does not survive the Gaussian run)
>>>   ... submit and run job ... (IU) ... and collect results
>>>   - convert LOG file to CML (JUMBOMarker, effectively done)
>>>   - convert CML to RDF (JUMBO). Add GaussianOWL dictionary in RDF
>>>   - upload RDFs into repository/tripleStore
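
The "convert CML to Gaussian input" step above can be sketched roughly as
follows. This is not the project's XSLT script or any JUMBO code; it is a
minimal Python illustration that assumes the CML carries 3D coordinates as
x3/y3/z3 attributes on <atom> elements in the CML schema namespace. The route
line (method, basis set), title, charge and multiplicity are assumptions chosen
for illustration, as are the function name cml_to_gaussian and the sample
WATER_CML molecule.

```python
import xml.etree.ElementTree as ET

# CML schema namespace, used to qualify element names.
CML_NS = "{http://www.xml-cml.org/schema}"

# Illustrative molecule only; coordinates are rough, not taken from PubChem.
WATER_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
</molecule>"""


def cml_to_gaussian(cml_text, route="# B3LYP/6-31G(d) NMR",
                    title="NMR shielding job", charge=0, multiplicity=1):
    """Return Gaussian input text built from the 3D atoms of a CML molecule.

    The default route line is an assumption; the thread does not say which
    method or basis set the NMREye workflow uses.
    """
    root = ET.fromstring(cml_text)
    # Gaussian layout: route, blank, title, blank, charge/multiplicity, atoms.
    lines = [route, "", title, "", f"{charge} {multiplicity}"]
    atom_count = 0
    for atom in root.iter(f"{CML_NS}atom"):
        element = atom.get("elementType")
        coords = [atom.get(key) for key in ("x3", "y3", "z3")]
        if element is None or any(c is None for c in coords):
            raise ValueError(f"atom {atom.get('id')} lacks element or 3D coordinates")
        x, y, z = (float(c) for c in coords)
        lines.append(f"{element:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
        atom_count += 1
    if atom_count == 0:
        raise ValueError("no CML atoms found")
    # A blank line terminates the molecule specification.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    print(cml_to_gaussian(WATER_CML), end="")
```

Keeping the route line as a parameter means the same converter could serve
other job types if the workflow later needs them.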
>>>
>>> In (b) we would expect to get 10,000 - 100,000 small molecules from
>>> PubChem of up to, say, 15 first-row atoms. These already have 3D
>>> coordinates (I am ignoring conformers at this stage). The process should be
>>> automatic. Jobs take from 0.1 seconds to 1 day (probably) as they scale with
>>> N^4.
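
To put the N^4 scaling in concrete terms: the quoted range of 0.1 seconds to
1 day (86,400 seconds) is a ratio of 864,000 in run time, and the fourth root
of that is about 30, so a factor of roughly 30 in N would span the whole
range. The sketch below shows how a cost estimate of this kind could be used
to order jobs before submission. The thread does not define N (atoms, basis
functions or something else), so the size measure, the reference timing and
all function names here are assumptions for illustration only.

```python
def relative_cost(n, n_ref):
    """Cost of a job of size n relative to one of size n_ref, assuming N^4 scaling."""
    return (n / n_ref) ** 4


def estimate_seconds(n, n_ref, seconds_ref):
    """Extrapolate run time from one measured reference job (hypothetical timing)."""
    return seconds_ref * relative_cost(n, n_ref)


def order_by_cost(sizes):
    """Return job ids sorted cheapest first; sizes maps job id -> size N."""
    return sorted(sizes, key=lambda job: sizes[job] ** 4)


if __name__ == "__main__":
    # Doubling N multiplies the estimated cost by 2**4 = 16.
    print(relative_cost(2, 1))
    # Hypothetical job sizes, submitted smallest first.
    print(order_by_cost({"mol_a": 10, "mol_b": 3, "mol_c": 7}))
```

Running the cheapest jobs first is only one possible policy; it gets many
results back early, whereas a site short on wall-clock time might prefer to
start the longest jobs first.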
>>>
>>> P.
>>>
>>> I will try to send this to the Wiki
>>>
>>>
>>> --
>>> Peter Murray-Rust
>>> Reader in Molecular Informatics
>>> Unilever Centre, Dep. Of Chemistry
>>> University of Cambridge
>>> CB2 1EW, UK
>>> +44-1223-763069
>>>
>>>
>>>
>>>
>>>
>>
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069