BioMagResBank/XML
What is BioMagResBank/XML?
As an example application, the entire
BioMagResBank has been transformed into XML format.
In the following, we'll show some example code for the conversion process as well as a structured query
to a BioMagResBank/XML file.
Example code for the STAR/XML conversion
As a short introduction, let's download the file bmr749.str
from
http://www.bmrb.wisc.edu.
The necessary Python code for the conversion to XML is rather short:
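import StarDOM.convertStar
inFileName = 'bmr749.str'
outFileName = 'bmr749.xml'
# Parse the STAR file into a DOM object.
print('reading: %s' % inFileName)
dom = StarDOM.convertStar.convert(inFileName, open(inFileName))
# Write the DOM object out as XML and close the file so the output is flushed.
print('writing: %s' % outFileName)
outFile = open(outFileName, 'w')
StarDOM.convertStar.dumpAsXML(dom, outFile)
outFile.close()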
The
output XML file bmr749.xml conforms to
our DTD (Document Type Definition).
To convert the entire BioMagResBank database, you just have to
extend this example code with a for loop.
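The loop mentioned above could look roughly like the following sketch. It assumes the STAR files have already been downloaded into one directory and end in .str. The function name convert_directory and the directory names are hypothetical. The two StarDOM functions are passed in as parameters and called exactly as in the example above.

```python
import os


def convert_directory(in_dir, out_dir, convert, dump_as_xml):
    """Convert every .str file in in_dir to an .xml file in out_dir.

    convert and dump_as_xml are expected to behave like
    StarDOM.convertStar.convert and StarDOM.convertStar.dumpAsXML
    (an assumption based on the single-file example).
    """
    written = []
    for name in sorted(os.listdir(in_dir)):
        if not name.endswith('.str'):
            continue  # skip anything that is not a STAR file
        in_path = os.path.join(in_dir, name)
        out_path = os.path.join(out_dir, name[:-len('.str')] + '.xml')
        with open(in_path) as in_file:
            dom = convert(in_path, in_file)
        with open(out_path, 'w') as out_file:
            dump_as_xml(dom, out_file)
        written.append(out_path)
    return written


# Intended use (directory names are placeholders):
# import StarDOM.convertStar
# convert_directory('star', 'xml',
#                   StarDOM.convertStar.convert,
#                   StarDOM.convertStar.dumpAsXML)
```

Passing the converter functions in as arguments keeps the loop itself independent of StarDOM, so it can be tried out with stand-in functions first.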
As a proof of concept, we converted the mmCIF file
1nk3.mmCIF
with the same Python script to
1nk3.xml.
Querying the data with XML query languages
This section illustrates the use of an XML query language to extract data without
writing special-purpose parsers. In this example, we'll be using a Perl5 implementation
of XQL as the query language.
Statistical queries against an entire database of experimental information
can be performed by using standard XML query languages. Here, we illustrate
this scenario given a typical question:
What are the hydrogen chemical shifts
of all the aspartic acid residues?
In the XML data file, this information will be hidden in layers of LIST
elements which contain nested DATAITEM elements. The relevant
elements look like:
<LIST>
<DATAITEM>
<NAME>
Residue_label
</NAME>
<VALUE>
D
</VALUE>
</DATAITEM>
<DATAITEM>
<NAME>
Atom_name
</NAME>
<VALUE>
H
</VALUE>
</DATAITEM>
<DATAITEM>
<NAME>
Atom_type
</NAME>
<VALUE>
H
</VALUE>
</DATAITEM>
<DATAITEM>
<NAME>
Chem_shift_value
</NAME>
<VALUE>
8.33
</VALUE>
</DATAITEM>
</LIST>
Building up the query string
Here, we'll build up the query string step by step by applying filters to
XML elements:
- Find lists which contain aspartic acid data: This is
done with a statement like
//LIST[DATAITEM/NAME='Residue_label' and DATAITEM/VALUE='D']
Note that we're looking for LIST elements anywhere
in the hierarchy. This is what the leading // is for.
Next, the list must contain DATAITEM elements marking it as
aspartic acid data in the format shown above. The bracketed subquery,
which tests for the proper residue label, keeps only those
LIST elements.
- Get only those lists which contain hydrogen information as well:
Adding the expression
[DATAITEM/NAME='Atom_type' and DATAITEM/VALUE='H']
to our query string above filters the above collection even further.
The result will be only those LIST elements that contain
hydrogen information as well.
- Extract the chemical shift information: Now that we
have the proper
LIST elements at hand, we extract those
DATAITEM subelements which contain the chemical shift information.
This additional filter can be applied by appending the expression
/DATAITEM[NAME='Chem_shift_value']/VALUE
to our search
string.
The entire query string looks like this:
//LIST[DATAITEM/NAME='Residue_label' and DATAITEM/VALUE='D']
[DATAITEM/NAME='Atom_type' and DATAITEM/VALUE='H']
/DATAITEM[NAME='Chem_shift_value']/VALUE
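For readers without an XQL engine, the same three filter steps can be sketched in Python using only the standard library. This is not XQL and not part of StarDOM. The function names item_values and asp_hydrogen_shifts are hypothetical, and the sketch assumes the LIST/DATAITEM/NAME/VALUE layout shown above. It pairs each NAME with the VALUE of the same DATAITEM, which is a slightly stricter test than the query string as written.

```python
import xml.etree.ElementTree as ET


def item_values(list_elem):
    """Map the NAME text of each direct DATAITEM child to its VALUE text."""
    values = {}
    for item in list_elem.findall('DATAITEM'):
        # Text is stripped because the elements may contain line breaks.
        name = item.findtext('NAME', default='').strip()
        value = item.findtext('VALUE', default='').strip()
        values[name] = value
    return values


def asp_hydrogen_shifts(root):
    """Return chemical shift values of hydrogen atoms in aspartic acid."""
    shifts = []
    for list_elem in root.iter('LIST'):          # like the leading //LIST
        values = item_values(list_elem)
        if values.get('Residue_label') != 'D':   # step 1: aspartic acid
            continue
        if values.get('Atom_type') != 'H':       # step 2: hydrogen
            continue
        if 'Chem_shift_value' in values:         # step 3: extract the shift
            shifts.append(values['Chem_shift_value'])
    return shifts


# Intended use with the converted file from the first example:
# root = ET.parse('bmr749.xml').getroot()
# print(asp_hydrogen_shifts(root))
```

The values come back as strings; convert them with float() before computing statistics.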
Execute the query
The query can be executed by using a Perl frontend which is part of the XML::XQL Perl module. To obtain this package,
use the CPAN module search (XQL as 'Name').