Great,<br>hope we can set up some telcons anyway<br><br><div class="gmail_quote">On Fri, Jun 12, 2009 at 7:24 PM, Marlon Pierce <span dir="ltr">&lt;<a href="mailto:mpierce@cs.indiana.edu">mpierce@cs.indiana.edu</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div bgcolor="#ffffff" text="#000000">

My time over the next couple of weeks will be dominated by TeraGrid 09

preparation, but I will make some headway in getting started and will

be working in earnest by end of June.<br><font color="#888888">

<br>

<br>

Marlon</font><div><div></div><div class="h5"><br>

<br>

<br>

Peter Murray-Rust wrote:

<blockquote type="cite">Great - that&#39;s exciting Bill and I am sure that it will be

invaluable for assignment. However, I am focusing on what we can

integrate today. The integration problems are not trivial and the more

that the components - or the sites - are modularised the faster

progress we shall make.<br>

  <br>

It&#39;s important to be pragmatic at this stage - there are things we can

do now and things that are research. We should do both but we must make

sure that the infrastructure continues in a straight line. I detailed

what we could do at present (some as rough proof of concept) that could

fit into a linear workflow. We must make sure that the research efforts

in the pipeline I indicated are small as the integration itself will

still be challenging. <br>

  <br>

So I am proposing that we should ask:<br>

* what can we do by Friday 19?<br>

* what can we do by the start of August?<br>

* what can we do in the rest of the project?<br>

  <br>

Each part depends on the previous one:<br>

* Mark needs a few papers from Lee/Prasenjit which have good PDF

chemistry<br>

* PMR needs a few molecules and spectra in SVG<br>

* Marlon needs a few CML molecules and the NMREye workflow.<br>

  <br>

I agree that Mark&#39;s work on general PDF parsing is exciting but we need

a stream of molecules for the later stages.<br>

  <br>

I am also going to suggest that we try to arrange weekly telcons to

review progress. The problem of a pipeline/workflow is that all bits

have to be delivering.<br>

  <br>

P.<br>

  <br>

  <br>

  <br>

  <div class="gmail_quote">On Fri, Jun 12, 2009 at 3:47 PM, WILLIAM J

BROUWER <span dir="ltr">&lt;<a href="mailto:wjb19@psu.edu" target="_blank">wjb19@psu.edu</a>&gt;</span>

wrote:<br>

  <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div>cool peter...<br>

    <br>

I would also add that there&#39;s some mileage in

substructure &amp; similarity search on spectra. Han gave a great talk

this

morning, there is strong application of his graph mining work to

building up

complicated spectra on the basis of simpler (sub)spectra...<br>

    <br>

-bill

    <br>

    <div>

    <div><br>

    <br>

On Fri, Jun 12, 2009 10:31 AM, <b>Peter Murray-Rust

&lt;<a href="mailto:pm286@cam.ac.uk" target="_blank">pm286@cam.ac.uk</a>&gt;</b> wrote:<br>

    <blockquote style="border-left: 1px solid rgb(0, 0, 0); padding-left: 3px; padding-right: 0px; margin-left: 3px; margin-right: 0px;">

      <p>This is to review the subprojects

that the computational geeks in OREChem have put together over the last

few

days. (a) is long term, (b) is immediate<br>

(a) The general goal is to compute

NMR spectra for all new published compounds and compare them with

spectra. This

is a new approach, &quot;robot refereeing of chemistry publications&quot;, and any

differences suggest errors or new chemistry. This is long term (months)

and

consists of the following (as we have put on the wiki):<br>

* PSU-Lee/Prasenjit retrieve chemistry-rich docs from publisher sites

(ask for

forgiveness policy) and segment the papers into text+non-text (tables,

diagrams). This passes to:<br>

* Mark - Soton extracts molecules and spectra out

of this and converts them to SVG. The short-term goal is to get this

working by

the end of next week in a pragmatic form. (We do not mind if recall is

poor as

long as we get a few SVGs, as we need to develop the machine-learning

and/or

heuristics and find out what unknown horrors we have to deal with.) <br>

Bitmaps are rejected at this stage<br>

* PMR - Cambridge develops heuristics to

interpret (i) molecules (ii) spectra (C13 and H1). These might later be

crowdsourced. The output is CML molecules and spectra. It is unlikely

that we will have

assignments.<br>

* PSU - Bill+Karl. Analyse spectra with peak-fitting. <br>

* IU - Marlon.

(independently) molecules are passed to IU in CML and put into the

NMREye

workflow for computing peaks (below). IU run this automatically and

return

results in CML<br>

      <br>

(b) To get IU up to speed we shall start immediately on

simple molecules from Pubchem. This involves just Cambridge and IU.<br>

* The

NMREye workflow has been developed and tested and should work on simple

organic

compounds. It consists of the following:<br>

  - convert PubchemXML2CML (already available in JUMBO)<br>

  -

convert CML to Gaussian input. We have an XSLT script, but could

convert this

to Java in an hour.<br>

  - in parallel - create RDF metadata for

provenance to this point (as this does not survive the Gaussian run)<br>

  ... submit and run job ... (IU) ... and collect results<br>

 -

convert LOG file to CML (JUMBOMarker, effectively done)<br>

 - convert CML

to RDF (JUMBO). Add GaussianOWL dictionary in RDF<br>

 <br>

upload RDFs into

repository/tripleStore<br>

      <br>

In (b) we would expect to get 10,000 - 100,000

small molecules from Pubchem of up to, say, 15 first-row atoms. These

already

have 3D coordinates (I am ignoring conformers at this stage). The

process

should be automatic. Jobs take from 0.1 seconds to 1 day (probably) as

they

scale with N^4.<br>

      <br>

P.<br>

      <br>

I will try to send this to the Wiki<br>

      <br clear="all">

      <br>

-- <br>

Peter Murray-Rust<br>

Reader in Molecular

Informatics<br>

Unilever Centre, Dep. Of Chemistry<br>

University of

Cambridge<br>

CB2 1EW,

UK<br>

+44-1223-763069<br>

      </p>

    </blockquote>

    <br>

    <br>

    <br>

    </div>

    </div>

    </div>

  </blockquote>

  </div>

  <br>

  <br clear="all">

  <br>


</blockquote>

</div></div></div>

</blockquote></div><br><br clear="all"><br>-- <br>Peter Murray-Rust<br>Reader in Molecular Informatics<br>Unilever Centre, Dep. Of Chemistry<br>University of Cambridge<br>CB2 1EW, UK<br>+44-1223-763069<br>