BioMagResBank/XML

News: Check out the prototype implementation of navigable BMR data!

What is BioMagResBank/XML?

As an example application of StarDOM, the entire BioMagResBank has been transformed into XML format.
In the following, we'll show some example code for the conversion process as well as a structured query to a BioMagResBank/XML file.

Example code for the STAR/XML conversion


As a short introduction, let's download the file bmr749.str from http://www.bmrb.wisc.edu.

The necessary Python code for the conversion to XML is rather short:

import StarDOM.convertStar

inFileName = 'bmr749.str'
outFileName = 'bmr749.xml'

# Parse the STAR file into a DOM tree
print 'reading:', inFileName
dom = StarDOM.convertStar.convert(inFileName, open(inFileName))

# Write the DOM tree out as XML
print 'writing:', outFileName
StarDOM.convertStar.dumpAsXML(dom, open(outFileName, 'w'))

The output XML file bmr749.xml conforms to our DTD (Document Type Definition). To convert the entire BioMagResBank database, you just have to extend this example code with a for loop.
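The for loop mentioned above could look roughly like the following sketch. It assumes that all STAR files sit in one directory with the extension .str. The function name convert_directory and the directory names in the usage comment are made up for illustration. The two conversion functions are passed in as arguments and are expected to behave like StarDOM.convertStar.convert and StarDOM.convertStar.dumpAsXML from the example above.

```python
import glob
import os


def convert_directory(in_dir, out_dir, convert, dump_as_xml):
    """Convert every *.str file in in_dir to an .xml file in out_dir.

    convert(name, file) must return a DOM, and dump_as_xml(dom, file)
    must write it out, as in the single-file example.
    Hypothetical helper, not part of StarDOM.
    """
    written = []
    for in_path in sorted(glob.glob(os.path.join(in_dir, '*.str'))):
        base = os.path.splitext(os.path.basename(in_path))[0]
        out_path = os.path.join(out_dir, base + '.xml')
        with open(in_path) as in_file:
            dom = convert(in_path, in_file)
        with open(out_path, 'w') as out_file:
            dump_as_xml(dom, out_file)
        written.append(out_path)
    return written


# Usage with StarDOM (assumes the package is importable and that the
# directories 'star' and 'xml' exist; both names are placeholders):
#
#   import StarDOM.convertStar
#   convert_directory('star', 'xml',
#                     StarDOM.convertStar.convert,
#                     StarDOM.convertStar.dumpAsXML)
```

Passing the converter functions as arguments keeps the loop independent of the StarDOM import, so it can be tried out with stand-in functions first.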

As a proof of concept, we converted the mmCIF file 1nk3.mmCIF with the same Python script to 1nk3.xml.

Querying the data with XML query languages

This section illustrates the use of an XML query language to extract data without writing special-purpose parsers. In this example, we'll be using a Perl5 implementation of XQL as the query language.

Statistical queries against an entire database of experimental information can be performed with standard XML query languages. Here, we illustrate this with a typical question: What are the hydrogen chemical shifts of all the aspartic acid residues?
In the XML data file, this information is nested in layers of LIST elements, each of which contains DATAITEM elements. The relevant elements look like this:

<LIST>
<DATAITEM>
	<NAME>
        	Residue_label
	</NAME>
	<VALUE>
        	D
	</VALUE>
</DATAITEM>
<DATAITEM>
	<NAME>
        	Atom_name
	</NAME>
	<VALUE>
		H
	</VALUE>
</DATAITEM>
<DATAITEM>
	<NAME>
		Atom_type
	</NAME>
	<VALUE>
		H
	</VALUE>
</DATAITEM>
<DATAITEM>
	<NAME>
		Chem_shift_value
	</NAME>
	<VALUE>
		8.33
	</VALUE>
</DATAITEM>

</LIST>
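To make the nesting concrete, here is a minimal Python sketch (standard library only) that reads one such LIST element and collects its DATAITEM children into a dictionary. The helper name list_to_dict is an illustrative assumption, not part of StarDOM, and the sample is an abbreviated copy of the element shown above. Names and values are stripped of whitespace because the text sits between line breaks and indentation in the file.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the LIST element shown above
SAMPLE = """<LIST>
<DATAITEM>
  <NAME>
    Residue_label
  </NAME>
  <VALUE>
    D
  </VALUE>
</DATAITEM>
<DATAITEM>
  <NAME>
    Atom_type
  </NAME>
  <VALUE>
    H
  </VALUE>
</DATAITEM>
<DATAITEM>
  <NAME>
    Chem_shift_value
  </NAME>
  <VALUE>
    8.33
  </VALUE>
</DATAITEM>
</LIST>"""


def list_to_dict(list_element):
    """Map the NAME of each DATAITEM child to its VALUE (hypothetical helper)."""
    result = {}
    for item in list_element.findall('DATAITEM'):
        name = item.findtext('NAME', default='').strip()
        value = item.findtext('VALUE', default='').strip()
        result[name] = value
    return result


row = list_to_dict(ET.fromstring(SAMPLE))
print(row['Residue_label'], row['Chem_shift_value'])  # prints: D 8.33
```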

Building up the query string

Here, we'll build up the query string step by step by applying filters to XML elements:

  1. Find lists which contain aspartic acid data: This is done with a statement like
    	//LIST[DATAITEM/NAME='Residue_label' and DATAITEM/VALUE='D']
    	
    Note that we're looking for LIST elements anywhere in the hierarchy; that is what the leading // is for. Each matching list must also contain DATAITEM elements that mark it as aspartic acid data in the format shown above. The bracketed subquery, which tests for the proper residue label, selects exactly those LIST elements.
  2. Get only those lists which contain hydrogen information as well: Adding the expression
    	[DATAITEM/NAME='Atom_type' and DATAITEM/VALUE='H']
    	
    to the query string narrows the collection further: only those LIST elements that also contain hydrogen information remain.
  3. Extract the chemical shift information: Now that we have the proper LIST elements at hand, we extract those DATAITEM subelements which contain the chemical shift information. This additional filter can be applied by appending the expression
    	/DATAITEM[NAME='Chem_shift_value']/VALUE
    	
    to our query string.

The complete query string looks like this:

//LIST[DATAITEM/NAME='Residue_label' and DATAITEM/VALUE='D']
[DATAITEM/NAME='Atom_type' and DATAITEM/VALUE='H']
/DATAITEM[NAME='Chem_shift_value']/VALUE

Executing the query

The query can be executed by using a Perl frontend which is part of the XML::XQL Perl module. To obtain this package, use the CPAN module search (XQL as 'Name').
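The Perl front end is not reproduced here. For readers who want to check the filter logic without XQL, the following is a minimal sketch in Python using only the standard library. It is not XQL and not the method used for the conversion above; it simply re-implements the three filter steps by hand. The wrapper element DATA, the function names, and all sample values other than 8.33 are made up for illustration. The sketch also requires NAME and VALUE to match within the same DATAITEM, which is a slightly stricter reading than the query string.

```python
import xml.etree.ElementTree as ET


def data_items(list_element):
    """Return (name, value) pairs for the DATAITEM children of a LIST."""
    return [(item.findtext('NAME', default='').strip(),
             item.findtext('VALUE', default='').strip())
            for item in list_element.findall('DATAITEM')]


def asp_hydrogen_shifts(root):
    """Collect Chem_shift_value entries for hydrogen atoms of residue D."""
    shifts = []
    for lst in root.iter('LIST'):                  # LIST at any depth, like //LIST
        pairs = data_items(lst)
        if ('Residue_label', 'D') not in pairs:    # step 1: aspartic acid only
            continue
        if ('Atom_type', 'H') not in pairs:        # step 2: hydrogen only
            continue
        shifts.extend(value for name, value in pairs
                      if name == 'Chem_shift_value')   # step 3: extract shifts
    return shifts


# Made-up miniature document: an outer LIST containing three data rows
DOC = """<DATA>
<LIST>
  <LIST>
    <DATAITEM><NAME>Residue_label</NAME><VALUE>D</VALUE></DATAITEM>
    <DATAITEM><NAME>Atom_type</NAME><VALUE>H</VALUE></DATAITEM>
    <DATAITEM><NAME>Chem_shift_value</NAME><VALUE>8.33</VALUE></DATAITEM>
  </LIST>
  <LIST>
    <DATAITEM><NAME>Residue_label</NAME><VALUE>D</VALUE></DATAITEM>
    <DATAITEM><NAME>Atom_type</NAME><VALUE>N</VALUE></DATAITEM>
    <DATAITEM><NAME>Chem_shift_value</NAME><VALUE>120.1</VALUE></DATAITEM>
  </LIST>
  <LIST>
    <DATAITEM><NAME>Residue_label</NAME><VALUE>A</VALUE></DATAITEM>
    <DATAITEM><NAME>Atom_type</NAME><VALUE>H</VALUE></DATAITEM>
    <DATAITEM><NAME>Chem_shift_value</NAME><VALUE>8.10</VALUE></DATAITEM>
  </LIST>
</LIST>
</DATA>"""

result = asp_hydrogen_shifts(ET.fromstring(DOC))
print(result)  # prints: ['8.33']
```

The values are returned as strings, mirroring the VALUE elements the query string selects; converting them to numbers is left to the caller.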


Jens Linge, Lutz Ehrlich