[Orechem] latest plan for textmining crystallography in OREChem

Peter Murray-Rust pm286 at cam.ac.uk
Fri Mar 26 17:45:38 EDT 2010


Please treat as project-confidential [I am copying in textmining colleagues
in Cambridge...and Brian McMahon of IUCr]
We discussed the following component of OREChem and intend to go ahead asap.

Textmining crystallography into RDF
===========================

PennState are expert in bibliographic textmining (authors, institutions,
et.) and have great expertise in disambiguation. Cambridge have expertise in
chemical/crystallographic textmining. IUCr operate a journal (Acta Cryst E)
which has been an Open Access for 2008, 2009 and 2010. This has about 7000
open articles (Brian - do you have a better number?) We have already
abstracted the CIFs into CrystalEye but now plan the following:
* agree on a unique ID for the summary page of each crystaleye [==CIF data_
block]. Perhaps work towards this being a DOI
 * extract the bibliographic info from the IUCr site [This is similar to the
current CIF spidering]. This would be a wonderful resource for bibliometrics
with a closed group of 7000 extremely similar papers from all round the
world. We could do time, geographic, subject metrics, etc.
* extract the supplemental info from the IUCr site [This is similar to the
current CIF spidering]
* parse the Experimental section from the paper using ChemicalTagger (Lezan
et al) into CML. This works already!! (see below)
* create a LinkedOpenData Object for the information and link to Wikepedia,
Pubchem, etc. - anywhere that suports LOD.

The mechanics are that we would run a spider on Acta E and creat a
bibliographic feed and a suppdata feed. Na Li would take the bibliographic
feed and extract disambiguated authors in RDF.
We will run a Chemicaltagger service and the experimental will go into that
and be emitted as CML+RDF. This will go back into the triplestore.

============
Example
To a solution of ethyl
3-{5-[(methylsulfonyloxy)methyl]isoxazol-3-yl}-2-phenyl
/H/-pyrazolo[1,5-a]pyridine-5-carboxylate (Meng /et al./, 2010) (0.33 g,
0.75 mmol) in THF (20 ml) was added diethylamine (0.22 ml, 2.25 mmol). The
mixture was stirred for 12 h. Water and dichloromethane were added in turn
and stirred, and layers were separated. The aqueous layer was back-extracted
with dichloromethane. The combined organics were washed with brine, dried
over sodium sulfate, filtered and concentrated. The residue was purified by
column chromatography (yield 89%). The crystals of (I) were obtained from a
hexane-ethyl acetate-dichloromethane
(3:1:1, /v///v///v/) solution by slow evaporation at room temperature (m.p.
363–364 K).

Result:
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<Document>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<Sentence>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<PrepPhrase>
 * * <TO>*To*</TO>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<NounPhrase>
 * * <DT>*a*</DT>
 * * <NN-CHEMENTITY>*solution*</NN-CHEMENTITY>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<PrepPhrase>
 * * <IN-OF>*of*</IN-OF>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<NounPhrase>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<MOLECULE>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<OSCARCM>
 * * <OSCAR-CM>*ethyl*</OSCAR-CM>
 * * <OSCAR-CM>*3-{5-[(methylsulfonyloxy)methyl]isoxazol-3-yl}-2-phenyl*</
OSCAR-CM>
 * * <OSCAR-CM>*H-pyrazolo[1,5-a]pyridine-5-carboxylate*</OSCAR-CM>
* * </OSCARCM>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<MIXTURE>
 * * <_-LRB->*(*</_-LRB->
 * * <NNP>*Meng*</NNP>
 * * <FW>*et*</FW>
 * * <FW>*al.*</FW>
 * * <COMMA>*,*</COMMA>
 * * <CD>*2010*</CD>
 * * <_-RRB->*)*</_-RRB->
* * </MIXTURE>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<QUANTITY>
 * * <_-LRB->*(*</_-LRB->
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<MASS>
 * * <CD>*0.33*</CD>
 * * <NN-MASS>*g*</NN-MASS>
* * </MASS>
 * * <COMMA>*,*</COMMA>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<AMOUNT>
 * * <CD>*0.75*</CD>
 * * <NN-AMOUNT>*mmol*</NN-AMOUNT>
* * </AMOUNT>
 * * <_-RRB->*)*</_-RRB->
* * </QUANTITY>
* * </MOLECULE>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<PrepPhrase>
 * * <IN-IN>*in*</IN-IN>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<MOLECULE>
 * * <OSCAR-CM>*THF*</OSCAR-CM>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<QUANTITY>
 * * <_-LRB->*(*</_-LRB->
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<VOLUME>
 * * <CD>*20*</CD>
 * * <NN-VOL>*ml*</NN-VOL>
* * </VOLUME>
 * * <_-RRB->*)*</_-RRB->
* * </QUANTITY>
* * </MOLECULE>
* * </PrepPhrase>
* * </NounPhrase>
* * </PrepPhrase>
* * </NounPhrase>
* * </PrepPhrase>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<VerbPhrase>
 * * <VBD>*was*</VBD>
 * * <VB-ADD>*added*</VB-ADD>
* * </VerbPhrase>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<NounPhrase>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<MOLECULE>
 * * <OSCAR-CM>*diethylamine*</OSCAR-CM>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<QUANTITY>
 * * <_-LRB->*(*</_-LRB->
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<VOLUME>
 * * <CD>*0.22*</CD>
 * * <NN-VOL>*ml*</NN-VOL>
* * </VOLUME>
 * * <COMMA>*,*</COMMA>
 *-*<file:///C:/Users/pm286/AppData/Local/Temp/Temp1_crystal1_xml[1].zip/crystal1.xml#>
<AMOUNT>
 * * <CD>*2.25*</CD>
 * * <NN-AMOUNT>*mmol*</NN-AMOUNT>
* * </AMOUNT>
 * * <_-RRB->*)*</_-RRB->
* * </QUANTITY>
* * </MOLECULE>
* * </NounPhrase>
 * * <STOP>*.*</STOP>
* * </Sentence>

I think that is sensational. OPSIN is able to parse the name (there is a
missing locant for the H; I assume 1H) into

*ethyl* *3-{5-[(methylsulfonyloxy)methyl]isoxazol-3-yl}-2-phenyl-1**
H-pyrazolo[1,5-a]pyridine-5-carboxylate*

try this in http://opsin.ch.cam.ac.uk/

P.


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.openarchives.org/pipermail/orechem/attachments/20100326/987eb3d0/attachment-0001.htm


More information about the Orechem mailing list