<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

* Taking a look at this now.&nbsp; I grabbed and built Jumbo 5.5-b1 from

SourceForge.&nbsp; "mvn clean install" fails some tests (I'm getting "JNI

InChI has failed to load the native libraries required"), but "mvn

clean install -Dmaven.test.skip=true" works (compiles).<br>

<br>

<br>

* Assuming my Jumbo version and build are OK, I need to generate CML

from PubChem with Jumbo. First, which PubChem XML should I use?&nbsp; I

presume the 3D version.<br>

<br>

<br>

* Finally, what is the command for doing this with Jumbo?<br>

<br>

<br>

Thanks, more questions to follow. <br>

<br>

<br>

Marlon<br>

<br>

<br>

Peter Murray-Rust wrote:

<blockquote

 cite="mid:67fd68330906121312v773cbf68we2248629727874c4@mail.gmail.com"

 type="cite">Great,<br>

hope we can set up some telcons anyway<br>

  <br>

  <div class="gmail_quote">On Fri, Jun 12, 2009 at 7:24 PM, Marlon

Pierce <span dir="ltr">&lt;<a moz-do-not-send="true"

 href="mailto:mpierce@cs.indiana.edu">mpierce@cs.indiana.edu</a>&gt;</span>

wrote:<br>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div bgcolor="#ffffff" text="#000000">

My time over the next couple of weeks will be dominated by TeraGrid 09

preparation, but I will make some headway in getting started and will

be working in earnest by end of June.<br>

    <font color="#888888"><br>

    <br>

Marlon</font>

    <div>

    <div class="h5"><br>

    <br>

    <br>

Peter Murray-Rust wrote:

    <blockquote type="cite">Great - that's exciting Bill and I am sure

that it will be

invaluable for assignment. However, I am focussing on what we can

integrate today. The integration problems are not trivial and the more

that the components - or the sites - are modularised the faster

progress we shall make.<br>

      <br>

It's important to be pragmatic at this stage - there are things we can

do now and things that are research. We should do both but we must make

sure that the infrastructure continues in a straight line. I detailed

what we could do at present (some as rough proof of concept) that could

fit into a linear workflow. We must make sure that the research efforts

in the pipeline I indicated are small, as the integration itself will

still be challenging. <br>

      <br>

So I am proposing that we should ask:<br>

* what can we do by Friday 19?<br>

* what can we do by the start of August?<br>

* what can we do in the rest of the project?<br>

      <br>

Each part depends on the previous one:<br>

* Mark needs a few papers from Lee/Prasenjit which have good PDF

chemistry<br>

* PMR needs a few molecules and spectra in SVG<br>

* Marlon needs a few CML molecules and the NMREye workflow.<br>

      <br>

I agree that Mark's work on general PDF parsing is exciting but we need

a stream of molecules for the later stages.<br>

      <br>

I am also going to suggest that we try to arrange weekly telcons to

review progress. The problem of a pipeline/workflow is that all bits

have to be delivering.<br>

      <br>

P.<br>

      <br>

      <br>

      <br>

      <div class="gmail_quote">On Fri, Jun 12, 2009 at 3:47 PM, WILLIAM

J

BROUWER <span dir="ltr">&lt;<a moz-do-not-send="true"

 href="mailto:wjb19@psu.edu" target="_blank">wjb19@psu.edu</a>&gt;</span>

wrote:<br>

      <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

        <div>cool peter...<br>

        <br>

I would also add that there's some mileage in

substructure &amp; similarity search on spectra. Han gave a great talk

this

morning; there is strong application of his graph-mining work to

building up

complicated spectra on the basis of simpler (sub)spectra...<br>

        <br>

-bill <br>

        <div>

        <div><br>

        <br>

On Fri, Jun 12, 2009 10:31 AM, <b>Peter Murray-Rust

&lt;<a moz-do-not-send="true" href="mailto:pm286@cam.ac.uk"

 target="_blank">pm286@cam.ac.uk</a>&gt;</b> wrote:<br>

        <blockquote

 style="border-left: 1px solid rgb(0, 0, 0); padding-left: 3px; padding-right: 0px; margin-left: 3px; margin-right: 0px;">

          <p>This is to review the subprojects

that the computational geeks in OREChem have put together over the last

few

days. (a) is long term, (b) is immediate<br>

(a) The general goal is to compute

NMR spectra for all new published compounds and compare them with the

published spectra. This

is a new approach, "robot refereeing of chemistry publications", and any

differences suggest errors or new chemistry. This is long term (months)

and

consists of the following (as we have put on the wiki):<br>

* PSU-Lee/Prasenjit retrieve chemistry-rich docs from publisher sites

(ask for

forgiveness policy) and segment the papers into text+non-text (tables,

diagrams). This passes to:<br>

* Mark - Soton extracts molecules and spectra out

of this and converts them to SVG. The short-term goal is to get this

working by

the end of next week in a pragmatic form. (we do not mind if recall is

poor as

long as we get a few SVGs as we need to develop the machine-learning

and/or

heuristics and find out what unknown horrors we have to deal with). <br>

Bitmaps are rejected at this stage<br>

* PMR - Cambridge develops heuristics to

interpret (i) molecules (ii) spectra (C13 and H1). These might later be

crowdsourced. The output is CML molecules and spectra. It is unlikely

we will have

assignments<br>

* PSU - Bill+Karl. Analyse spectra with peak-fitting. <br>

* IU - Marlon.

Independently, molecules are passed to IU in CML and put into the

NMREye

workflow for computing peaks (below). IU run this automatically and

return

results in CML<br>

          <br>

(b) To get IU up to speed we shall start immediately on

simple molecules from PubChem. This involves just Cambridge and IU.<br>

* The

NMREye workflow has been developed and tested and should work on simple

organic

compounds. It consists of the following:<br>

&nbsp; - convert PubchemXML2CML (already available in JUMBO)<br>

&nbsp; -

convert CML to Gaussian input. We have an XSLT script, but could

convert this

to Java in an hour.<br>

&nbsp; - in parallel - create RDF metadata for

provenance to this point (as this does not survive the Gaussian run)<br>

&nbsp; ... submit and run job ... (IU) ... and collect results<br>

&nbsp;-

convert LOG file to CML (JUMBOMarker, effectively done)<br>

&nbsp;- convert CML

to RDF (JUMBO). Add GaussianOWL dictionary in RDF<br>

&nbsp;<br>

upload RDFs into

repository/tripleStore<br>

          <br>

In (b) we would expect to get 10,000 - 100,000

small molecules from PubChem of up to, say, 15 first-row atoms. These

already

have 3D coordinates (I am ignoring conformers at this stage). The

process

should be automatic. Jobs take from 0.1 seconds to 1 day (probably) as

they

scale with N^4.<br>

          <br>

P.<br>

          <br>

I will try to send this to the Wiki<br>

          <br clear="all">

          <br>

-- <br>

Peter Murray-Rust<br>

Reader in Molecular

Informatics<br>

Unilever Centre, Dep. Of Chemistry<br>

University of

Cambridge<br>

CB2 1EW,

UK<br>

+44-1223-763069<br>

          </p>

        </blockquote>

        <br>

        <br>

        <br>

        </div>

        </div>

        </div>

      </blockquote>

      </div>

      <br>

      <br clear="all">

      <br>


    </blockquote>

    </div>

    </div>

    </div>

  </blockquote>

  </div>

  <br>

  <br clear="all">

  <br>


</blockquote>

</body>

</html>