<div>no worries Peter, understood. If anybody needs anything from me beyond
what's in our paper & at the wiki already ->
http://services.nsdl.org/trac/oreChem/wiki/ChemistryDataExtraction, do let me
know...<br><br>I'm going to start taking a look at the AI/search possibilities
of our fitted spectra database then ->
http://pubs.acs.org/doi/abs/10.1021/ci950092p<br><br>cheers,<br>bill<br><br>On
Fri, Jun 12, 2009 11:14 AM, <b>Peter Murray-Rust &lt;pm286@cam.ac.uk&gt;</b>
wrote:<br><blockquote id="quoted_response" style="border-left: 1px solid rgb(0,
0, 0); padding-left: 3px; padding-right: 0px; margin-left: 3px; margin-right:
0px;">
<p>Great - that's exciting Bill and I am sure that it will be invaluable for
assignment. However I am focusing on what we can integrate today. The
integration problems are not trivial and the more that the components - or the
sites - are modularised the faster progress we shall make.<br><br>It's important to
be pragmatic at this stage - there are things we can do now and things that are
research. We should do both but we must make sure that the infrastructure
continues in a straight line. I detailed what we could do at present (some as
rough proof of concept) that could fit into a linear workflow. We must make
sure that the research efforts in the pipeline I indicated are small as the
integration itself will still be challenging. <br><br>So I am proposing
that we should ask:<br>* what can we do by Friday 19?<br>* what can we do by
the start of August?<br>* what can we do in the rest of the
project?<br><br>Each part depends on the previous one:<br>* Mark needs a few
papers from Lee/Prasenjit which have good PDF chemistry<br>
* PMR needs a few molecules and spectra in SVG<br>* Marlon needs a few CML
molecules and the NMREye workflow.<br><br>I agree that Mark's work on general
PDF parsing is exciting but we need a stream of molecules for the later
stages.<br><br>I am also going to suggest that we try to arrange weekly telcons
to review progress. The problem of a pipeline/workflow is that all bits have to
be delivering.<br><br>P.<br><br><br><br></p>
<div class="gmail_quote">On Fri, Jun 12, 2009 at 3:47 PM, WILLIAM J BROUWER
<span dir=""><<a href="#" target="">wjb19@psu.edu</a>></span>
wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid
rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div>cool peter...<br><br>I would also add that there's some mileage in
substructure & similarity search on spectra. Han gave a great talk this
morning, there is strong application of his graph mining work to building up
complicated spectra on the basis of simpler (sub)spectra...<br><br>-bill
<br><div>
<div></div>
<div>
<br><br>On Fri, Jun 12, 2009 10:31 AM, <b>Peter Murray-Rust
<<a href="#" target="">pm286@cam.ac.uk</a>></b> wrote:<br><blockquote
style="border-left: 1px solid rgb(0, 0, 0); padding-left: 3px; padding-right:
0px; margin-left: 3px; margin-right: 0px;">
<p>This is to review the subprojects
that the computational geeks in OREChem have put together over the last few
days. (a) is long term, (b) is immediate<br>(a) The general goal is to compute
NMR spectra for all new published compounds and compare them with the published spectra. This
is a new approach, "robot refereeing of chemistry publications", and any
differences suggest errors or new chemistry. This is long term (months) and
consists of the following (as we have put on the wiki):<br>
* PSU-Lee/Prasenjit retrieve chemistry-rich docs from publisher sites (ask for
forgiveness policy) and segment the papers into text+non-text (tables,
diagrams). This passes to:<br>* Mark - Soton extracts molecules and spectra out
of this and converts them to SVG. The short-term goal is to get this working by
the end of next week in a pragmatic form. (We do not mind if recall is poor as
long as we get a few SVGs, as we need to develop the machine-learning and/or
heuristics and find out what unknown horrors we have to deal with.)<br>
Bitmaps are rejected at this stage.<br>* PMR - Cambridge develops heuristics to
interpret (i) molecules (ii) spectra (C13 and H1). These might later be
crowdsourced. The output is CML molecules and spectra. It is unlikely we have
assignments<br>
* PSU - Bill+Karl. Analyse spectra with peak-fitting. <br>* IU - Marlon.
(independently) molecules are passed to IU in CML and put into the NMREye
workflow for computing peaks (below). IU run this automatically and return
results in CML<br><br>(b) To get IU up to speed we shall start immediately on
simple molecules from Pubchem. This involves just Cambridge and IU.<br>* The
NMREye workflow has been developed and tested and should work on simple organic
compounds. It consists of the following:<br>
- convert PubchemXML2CML (already available in JUMBO)<br> -
convert CML to Gaussian input. We have an XSLT script, but could convert this
to Java in an hour.<br> - in parallel - create RDF metadata for
provenance to this point (as this does not survive the Gaussian run)<br>
... submit and run job ... (IU) ... and collect results<br> -
convert LOG file to CML (JUMBOMarker, effectively done)<br> - convert CML
to RDF (JUMBO). Add GaussianOWL dictionary in RDF<br> <br>upload RDFs into
repository/tripleStore<br><br>In (b) we would expect to get 10,000 - 100,000
small molecules from Pubchem of up to, say, 15 first-row atoms. These already
have 3D coordinates (I am ignoring conformers at this stage). The process
should be automatic. Jobs take from 0.1 seconds to 1 day (probably) as they
scale with N^4.<br><br>P.<br><br>I will try to send this to the Wiki<br><br
clear="all"><br>-- <br>Peter Murray-Rust<br>Reader in Molecular
Informatics<br>Unilever Centre, Dep. Of Chemistry<br>University of
Cambridge<br>CB2 1EW,
UK<br>+44-1223-763069<br></p>
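<p>To make the N^4 remark concrete, here is a rough sketch in Python. It is illustrative only: N is treated as a generic size measure (for example the number of basis functions, which is an assumption, since the message does not define it), the function names are hypothetical, and real Gaussian timings will depend on method, basis set and hardware.</p>
<pre>
```python
SECONDS_PER_DAY = 24 * 60 * 60


def relative_cost(n_small, n_large, exponent=4):
    """Ratio of run times for two jobs if cost grows as N**exponent."""
    return (n_large / n_small) ** exponent


def size_ratio_for_time_ratio(t_short_s, t_long_s, exponent=4):
    """Size ratio N_large / N_small implied by two run times under N**exponent scaling."""
    return (t_long_s / t_short_s) ** (1.0 / exponent)


# Doubling N multiplies the estimated cost by 2**4.
doubling_factor = relative_cost(1, 2)

# Size ratio implied by the quoted spread of run times (0.1 seconds to 1 day).
implied_size_ratio = size_ratio_for_time_ratio(0.1, SECONDS_PER_DAY)

print(doubling_factor, round(implied_size_ratio, 1))
```
</pre>
<p>Under these assumptions, the quoted spread from 0.1 seconds to 1 day corresponds to a size ratio of only about 30, so a fairly modest growth in molecule size is enough to cover most of that range.</p>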
</blockquote>
<br><br><br>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote><br><br><br></div>