Integrating and Accessing Molecular Biology Resources

David P. Hansen and Thure Etzold, LION Bioscience Ltd, Cambridge, U.K.

Introduction

With biological research increasingly focusing on molecular biology and genetics, more and more resources are becoming available for this research community to use. These resources include the data which are being collected and stored in databases, and applications which operate on these data, producing yet more data. Databases range from simple sequence databases to complex metabolic pathway databases, as well as chemical compound, mutation and literature databases. The value of these databases can only be fully realised when they are integrated so that relationships between data can be found. There is also a vast range of applications available now, from similarity searches and sequence alignment programs to Hidden Markov Models. Results from these applications must also be easily integrated.

The growth of the data available from molecular biology resources has introduced several complexities. The main problem is obtaining a consistent, cross-referenced view of the data. Databases contain widely different data and hence are inevitably presented in different ways. Even databases containing similar data may be formatted totally differently. This makes relatively simple tasks quite difficult. For example, searching for similar text in multiple databases becomes an extremely difficult operation. More difficult tasks, such as finding relationships between different entries, then become almost impossible. While some databases cross-reference entries to other databases, these explicit references are only a beginning to the many relationships possible among the data.

Further difficulties in integrating resources exist. The quality of the data must be considered, while keeping all the databases and resources updated is a considerable task. With databases and applications changing regularly, a practical way of maintaining and updating data must be found.

In this chapter, we first examine the resources which are available to the molecular biology community, both databases and applications. We then discuss approaches to dealing with the problems mentioned above. One resource which attempts to address these issues is SRS. SRS is now a mature product and is widely used within academic institutes and commercial companies worldwide. Our discussion first describes the way SRS tackles the above problems and then compares this approach with the approaches adopted elsewhere.

Bioinformatics - From Genomes to Drugs. Volume II: Applications. Edited by Thomas Lengauer. Copyright © 2002 WILEY-VCH Verlag GmbH, Weinheim. ISBN: 3-527-29988-2

Molecular biology resources

Molecular biology resources include, but are not restricted to, databases and applications. In this section we examine the databases and applications which have given rise to the need for bioinformatics solutions.

Databases

As mentioned in the introduction, there is now a considerable number of molecular biology and related databases available. While some are freely available, such as the DNA sequence collection EMBL [30] and GENBANK [18], others are only freely available to the academic community, such as the protein sequence database SWISS-PROT [15], and others are only available on subscription, such as the EST databases available from the company Incyte Genomics. Academics and pharmaceutical companies also have their own proprietary data which must be integrated into a system so that relationships with publicly available data can be found.

Molecular biology databases can be divided into groups, such as the sequence databases, which hold actual DNA and protein sequences; sequence-related databases, such as protein structure databases; whole species databases; metabolism databases; and so forth. Figure 1.1 shows a representation of the main publicly available databases. The databases are represented as a network, with lines drawn between databases where a link has been defined between them. From this network it is possible to plot a path from a piece of data in one database through to a related piece of data in another database. This link may sometimes be through the result of an application, which means that an application uses data from one database and produces results which are included in the application result database and then linked to another database.
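Plotting a path through such a network can be sketched as a breadth-first search. The following is a minimal illustration only: the database names are taken from this chapter, but the specific links in `LINKS` and the function name `find_path` are assumptions made for the example and do not reproduce the network of Figure 1.1.

```python
from collections import deque

# Hypothetical link network: each database maps to the databases it links to.
# The names appear in this chapter; the links themselves are assumed.
LINKS = {
    "EMBL": ["SWISS-PROT", "Taxonomy"],
    "SWISS-PROT": ["EMBL", "PDB", "Prosite", "Enzyme"],
    "PDB": ["SWISS-PROT", "HSSP"],
    "HSSP": ["PDB"],
    "Prosite": ["SWISS-PROT"],
    "Enzyme": ["SWISS-PROT"],
    "Taxonomy": ["EMBL"],
}


def find_path(links, start, goal):
    """Return the shortest chain of linked databases from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in links.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = find_path(LINKS, "EMBL", "HSSP")
print(" -> ".join(path))
```

With the assumed links, a nucleotide entry reaches a secondary structure entry through a protein sequence and a protein structure database.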

It is also necessary at this point to discuss another problem, that of the maintenance of databases. Some databases are maintained and annotated by humans, while other databases are generated automatically. Errors in the human-annotated databases, databases not being updated and badly maintained databases are just some of the problems which need to be addressed when it comes to the integration of these resources.

Fig. 1.1 The SRS universe at the EBI. Legend: Sequence, Protein Structure, SeqRelated, Genome, Metabolic, Literature, Others, Transfac, Mapping.

1.2.1.1 Sequence databases

The first databases to appear were DNA sequence databases, namely those from the EMBL (Europe), NCBI (USA) and the DDBJ (Japan), known as EMBL [30], GENBANK [18] and DDBJ [1] respectively. These are DNA databases of sequences and their annotations. These databases continue as a collaborative effort, with the three databases sharing their information. So all three databases contain identical data, albeit in a different format.

The EMBL database is a model nucleotide database to consider. Each entry in the database is given a unique ID and information about the entry is presented in a semi-structured way. The actual nucleotide sequence is presented at the end of the entry.

These databases contain genomes, genes, ESTs, variations, introns, exons - all pieces of DNA which have been sequenced and submitted by scientists around the world to the maintainers of the databases.

Each entry in the databases contains keywords and descriptions, which describe the function of the piece of data. Any publication on the sequence and the authors' names also form part of the entry. However, the most interesting information is in the annotated feature tables. These contain information on which parts of the sequence are coding sequences, what they code for, which regions of the sequence are introns or exons, and any known function of other parts of the sequence.

Protein databases used to be built from direct protein sequencing, but now they are made almost exclusively from the translation of ORFs (Open Reading Frames) on DNA sequences. Both the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) provide databases, TREMBL [15] and GENPEPT [19] respectively, which are automatic translations of the CDS features of the DNA in their nucleotide databases. These include some automated annotation of the role which the protein plays.
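The translation step itself can be illustrated with a short sketch. This is not the pipeline used to build TREMBL or GENPEPT; it is a minimal example that assumes the standard genetic code, an input containing only the bases A, C, G and T, and a coding sequence that starts in frame. The names `CODON_TABLE` and `translate_cds` are chosen for the example.

```python
# Standard genetic code, built from the usual compact TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[pos:pos + 3]]  # KeyError on non-ACGT symbols
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


print(translate_cds("atgaaatggtga"))
```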

SWISS-PROT [15] is a protein database which includes annotation by humans, as well as automated annotation. SWISS-PROT also includes links to many more databases than other protein databases do, and hence is a rich source of links between data.

1.2.1.2 Sequence-related databases

The sequence-related databases contain further information about the sequences in the sequence databases. Often the information is derived computationally from the sequence data, although experimentally defined databases often contain more detailed information. Entries in sequence-related databases almost always provide an explicit link to a sequence database.

One of the best known protein structure databases is the Protein Data Bank (PDB) [23]. PDB archives experimentally determined three-dimensional structures of biological macromolecules and contains atomic coordinates, bibliographic citations, primary and secondary structure information, as well as crystallographic structure factors and NMR experimental data.

DSSP [24] and HSSP [17] are databanks of secondary structure, based on proteins in the PDB. HSSP contains homology-derived structures of proteins, with each protein in PDB providing the basis for including sequence homologies and then aligning those sequences with the sequence of known structure from PDB. This provides implied secondary and tertiary structures for many more proteins than could be found experimentally.

Taxonomy [19] is a database which contains the names of all organisms that are represented in GENBANK with at least one nucleotide or protein sequence. The taxonomy database is organised as a tree, so the link from species to subspecies can be followed. Although Taxonomy contains the approximately 30 000 species represented in GENBANK, there is no comprehensive databank of all approximately 2 million known species. Other species data are available from projects such as Biodiversity and Species 2000.

There are also several protein family databases. Prosite [14] and Blocks [22] provide motifs or patterns for protein families, so that new proteins can be examined to see if they fit into a particular protein family. PFAM [13] and DOMO [21] contain protein families and their alignments, so that changes in the proteins can be observed. Sequences also share domains, or modules, and the ProDom [20] and Smart [2] databases contain these domains and the proteins which fit those domains.

1.2.1.3 Genome databases

The first genome database was the GDB (Genome Databank), which contained all known human loci. With the start of the human genome project this database was discontinued, but it is now once again being updated. Several whole genome databases have recently been completed or are nearing completion. These include many microbial genomes and several eukaryotic genomes (including yeast and C. elegans), available at the Sanger Centre in the UK and from The Institute for Genomic Research (TIGR) in the USA. The Human Genome Mapping Project (HGMP) is an inter-governmental project started in 1995 to sequence the whole human genome. In June 2000 the privately listed company Celera Genomics, which published the Drosophila genome in April 2000 [27], together with the HGMP, announced a draft human sequence.

1.2.1.4 Gene expression

Gene expression databases can be made from two sources, ESTs (Expressed Sequence Tags) or DNA chip expression data.

ESTs are short (200-500 bp) DNA sequences generated from the 3' and 5' ends of randomly selected cDNA clones. Quite a large proportion of the sequences in EMBL and GENBANK are actually ESTs. Other, proprietary data are available from a private company, Incyte Genomics.

There are currently no publicly available DNA chip expression databases; however, the Microarray Gene Expression Database Group (MGED) [3] at the EBI aims to facilitate the creation of a gene expression data repository.

1.2.1.5 Mutations and SNPs

Mutations occur at the DNA level and can give rise to a different protein product. This can be the cause of an inherited disease (such as breast cancer or cystic fibrosis) or indicate a condition.

Currently OMIM [19], from the NCBI, is the most comprehensive database of genetic diseases and their associated mutations. However, there are many more locus-specific databases, such as the mutation databases for p53 and cystic fibrosis.

There are three types of mutations (Figure 1.2): exchange, where a nucleotide is exchanged for another; insertion, where an extra nucleotide is inserted into the sequence; and deletion, where a nucleotide is deleted.

Three types of mutations.
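The three mutation types can be expressed as simple string operations. The sketch below is illustrative only: the short sequence is used purely as an example, positions are zero-based, and the function names are chosen for this example rather than taken from any mutation database.

```python
def exchange(seq, pos, base):
    """Replace the nucleotide at pos with base."""
    return seq[:pos] + base + seq[pos + 1:]


def insertion(seq, pos, base):
    """Insert an extra nucleotide before position pos."""
    return seq[:pos] + base + seq[pos:]


def deletion(seq, pos):
    """Remove the nucleotide at pos."""
    return seq[:pos] + seq[pos + 1:]


original = "actgttact"
print(exchange(original, 2, "a"))   # acagttact
print(insertion(original, 3, "a"))  # actagttact
print(deletion(original, 3))        # actttact
```

An exchange leaves the sequence length unchanged, whereas an insertion or deletion shifts every downstream position by one, which is one reason why a uniform representation across databases is hard to achieve.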

There are, of course, several different ways of representing these mutations, and creating a uniform way of searching for a specific mutation across dozens of mutation databases is a challenge. The work of Lehvaslaiho et al. [26] at the EBI on creating unified access to mutation databases attempts to solve the problem of a lack of uniformity across the mutation databases.

SNPs (Single Nucleotide Polymorphisms) are the most frequent type of variation in humans. A SNP is defined as:

SNPs are single base pair positions in genomic DNA at which different sequence alternatives (alleles) exist in normal individuals in some population(s), wherein the least frequent allele has an abundance of 1% or greater. In practice, the term SNP is typically used more loosely than above. Single base variants in cDNAs (cSNPs) are usually classed as SNPs since most of these will reflect underlying genomic DNA variants. SNP datasets also typically contain variants of less than 1% allele frequency. The 'some population' component of the definition is limited by practical challenges of surveying representative global population samples. [4]

There are several SNP databases, such as MITSNP [31] and one from the SNP consortium, which contain the SNP and its associated effect.

Databanks such as SWISS-PROT and EMBL are also including variation information in their annotations.

1.2.1.6 Pathway and chemical compounds

Databases such as Enzyme [12], Brenda [5], Ligand [28] and EMP [6] contain enzymes which are involved in a large number of reactions. Each enzyme with a known function is catalogued and named by a nomenclature committee. Also included in the databases is information on the reaction and specificity of the enzyme and the various conditions under which the enzyme is active. Information about the structure and stability of the enzyme and the preparation of the reaction may also be included.

Reaction databases, such as MPW (or WIT) [7], KEGG [8] and EcoCyc [9] contain specific information about the pathways of a reaction, giving input and products as well as enzyme information. Links to various protein and chemical databanks are included in these databases.

Publicly available chemical compound databases, consisting of ligands and sometimes including information related to the sequence databases, are relatively small; they include Ligand [28] and Clotho (US). Proprietary databases, such as CAS, Beilstein and others, are much larger but are very expensive.


1.2.1.7 Transcription factors

There are several transcription factor databases. Transfac [32] and TFD [10] provide databases of transcription factors, their binding sites and the genes which regulate these transcriptions.

1.2.1.8 Others

Some of the most important data is not in databases, but in the literature. Medline (published by the National Library of Medicine) and Excerpta Medica (published by Elsevier) are two databases which provide abstracts of published papers. It is increasingly important to link these resources with the scientific databases listed above.

Applications

Bioinformatics applications have for the most part revolved around similarity search engines such as BLAST and FASTA [29], and sequence alignment tools such as CLUSTALW.

BLAST and FASTA, and their derivatives, take a sequence as input and compare that sequence to a database of sequences (the search database). A list of sequences which are similar to the query sequence is then produced with a score of how well the sequences match. The search database is usually composed of all the sequences in a database such as SWISS-PROT or GENBANK, and hence the matched sequence in the search database contains information which may be useful in annotating the query sequence.

CLUSTALW is a sequence alignment tool, which takes as input a set of sequences and produces a multiple alignment of those sequences.

The results from these applications themselves form a database which contains links to other databases (namely the search databases). This is a rich source of relationships between a user's own sequences and the public databases.

Databases and applications world-wide

Catalogues of databases are limited to LiMB, which unfortunately is no longer maintained and is out of date, and DBCAT [11], from Infobiogen in France, which has information on 400 databases relevant to biology. The first issue of Nucleic Acids Research each year also provides a list of currently available databases.

However, the growth in the number of databases means that it is not always easy to find available resources. The Databank of Databanks [25], part of the SRS server at the EBI, provides a place not only to search for available databases but also to find the most recently updated version of a particular database.

Fig. 1.3 The SRS databank of databanks (a list of SRS servers worldwide).

Figure 1.3 shows a partial list of the publicly available SRS servers taking part in the Databank of Databanks project. Each night, each SRS server retrieves information about all the other servers, such as the databases and applications which each server offers. This information is then combined into a searchable database. It is then possible to search for databases (such as all databases which have "sequence and align" in their description) or for servers (such as the server with the most up-to-date version of a particular database).
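The nightly combination step can be pictured with a small sketch. Everything in it is assumed for illustration: the server names, release dates and the functions `combine`, `servers_with` and `most_recent` are invented and do not describe the actual exchange format used between SRS servers.

```python
from datetime import date

# Hypothetical nightly reports: server -> {database: release date held there}.
reports = {
    "server_a": {"EMBL": date(2001, 3, 1), "SWISS-PROT": date(2001, 2, 10)},
    "server_b": {"EMBL": date(2001, 3, 15), "PDB": date(2001, 1, 20)},
    "server_c": {"SWISS-PROT": date(2001, 3, 5)},
}


def combine(reports):
    """Merge per-server reports into one table: database -> [(server, release date)]."""
    combined = {}
    for server, databases in reports.items():
        for name, released in databases.items():
            combined.setdefault(name, []).append((server, released))
    return combined


def servers_with(combined, database):
    """All servers which hold a copy of the given database."""
    return sorted(server for server, _ in combined.get(database, []))


def most_recent(combined, database):
    """The server holding the newest copy of the given database."""
    return max(combined[database], key=lambda item: item[1])[0]


combined = combine(reports)
print(servers_with(combined, "EMBL"), most_recent(combined, "EMBL"))
```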


An overview of SRS

SRS attempts to meet all the challenges listed in the introduction to this chapter. Molecular biology resources - databases and applications - can be quickly and easily incorporated into the system. Using the meta definitions of the format and syntax of the resources, SRS creates indices of the data (both databases and the output of applications) which can then be quickly and easily queried. Once a query is performed against one or more databases, either the whole database entry or just the information which the user requires can be extracted from the databases and displayed. Links (both explicit and implicit) between those data and other databases can be made and hence navigation between resources is possible.

Fig. 1.4 The SRS system. The diagram shows, from top to bottom: clients (WWW server, automated analysis, data viewers); wrappers (C++, CORBA, Perl, Java, Python); applications, object loaders and viewer adapters; the SRS core; and resources (DBMS, external, flat file).

SRS consists of several layers, as shown in Figure 1.4. Central to SRS are the twin philosophies of:

(1) leaving data in their original context

(2) the use of meta-description to describe the data.

To achieve this SRS utilizes the special purpose programming language, ICARUS (Interpreter of Commands and Recursive Syntax), which is used both for the meta definition of biological resources and the syntactic description of the data in the resources. The SRS core includes programs which use ICARUS meta definitions and syntactic descriptions to index and retrieve information from resources. In the case of applications, the meta definition includes information on how to run the programs as well as the meta description of the syntax and structure of the output.

To index and retrieve data, SRS uses a token server. The meta definition is used to extract tokens, i.e. strings, from the database for indexing. When a query is performed, SRS searches for the query term against these indices and creates a set of entries which match that query. The fields of the entry which the user wants displayed are then extracted from the entry. This is done by using the connection made in the meta definition between a field and tokens received from the token server.

We now go on to describe the layers of the overall system, as shown in Figure 1.4, in more detail.

The meta-definition layer

Meta definition can be defined as a higher level definition and is widely used in SRS. Meta definition is used to describe both the resources to be integrated and the format (or syntax) of the data in those resources. The data extracted using this syntax can then either be viewed or passed to another program. It also has several ancillary uses within SRS, such as describing the HTML pages which are generated from the web interface.

Meta definition has many advantages, foremost being that changes to the format of input and/or outputs only require minimal changes to the meta definition, and none to any core component of SRS.

In SRS, meta definition is used to describe objects which the SRS core uses. In the case of a database, a library object must be defined. This object contains the name of the library, what sort of library it is (i.e. what group of databanks it belongs to), and the name and location of the flat files containing the data. It also contains a link to a file containing a list of rules which describe the internal syntax of the databank. These syntax rules will be described below.

From the meta definition layer, it is also possible to use alternative viewer adaptors. The viewer adaptor is given a meta definition, which describes the input that the viewer requires. This is currently used to allow graphical Java applets to use information from SRS for data visualization.

The SRS core

The SRS core consists of several programs which use the meta definition layer to allow the resources to be queried, data from these resources to be returned and, in the case of applications, other programs run on selected data. This includes the creation of indices from the data resources. It is the creation of these indices which makes the querying of the databases quick.

To create the indices, SRS uses the rules given in the syntax description to create tokens. In this case, a token is defined as a piece of text from the database entry. These tokens are stored in token tables which are linked to a particular field in the database by a meta definition in the library meta definition object.

These rules are also used to extract data from the database. An object loader, another object in the SRS system, can be defined in the meta definition layer and describes the way a token table relates to a field to be extracted and viewed.

Wrappers

Another advantage of the meta-definition approach which SRS has taken is that wrappers can be generated to provide a programming interface to SRS. This allows programmers to link the SRS system into their own programs (C++, Perl, Java or Python) or to connect to a remote SRS system via the CORBA (Common Object Request Broker Architecture) server.

Fig. 1.5 The library select page of the SRS web client.

The wrappers provide full access to the SRS system. Queries can be done and data from the databases returned. It is also possible to get information about the resources and to run applications.

1.3.4 Clients

The top layer of the SRS system is where biologists come to retrieve data from the system and there are a number of ways in which SRS can be used. The most popular client is the web client available at numerous academic sites around the world or from a local installation (Figure 1.5).

Other clients include data viewers, which extract information from the databases and present it graphically, and programs for automated analysis, which use SRS to extract data, perform the analysis and write the results to another file.

Integrating molecular biology resources

With the aim of integrating databases to offer sensible queries where only the pertinent information is retrieved, the structure of data is important.

Databases can be thought of as having three levels. All databases contain entries, which are made up of several fields within the entry and then each field has a micro-structure. For the purposes of both querying, and retrieving information from, a particular field, the micro-structure, or syntax, of the field must be known.

    ID   AB000360   standard; DNA; HUM; 2582 BP.
    KW   glycosylphosphatidylinositol-synthesis gene; PIGC.
    SQ   Sequence 2582 BP; 694 A; 494 C; 581 G; 813 T; 0 other;
         ggatccctgc tgcagagggg gtaacggtgt ctggcttgcc tgttgtggtc        50

Fig. 1.6 Sample piece of a typical database entry.

In this section we describe the relationship between the meta description of the library object, and other objects, and the syntax description of the fields.

The SRS token server

For each library a syntax description of the data must be supplied. This syntax description must describe the data in the database and takes the form of a set of rules, also called a grammar. This list of rules is written in the SRS programming language ICARUS. The rules in this case provide the syntax part of the ICARUS language (the other part being Meta Definition) and are called productions. Productions are used to extract pieces of text from a database entry, tokens which are written to token tables. It is the data which is in these token tables that is indexed and used when querying the database, or extracting parts of the database for display.

All flat file databases are semi-structured, containing a list of entries, with each entry containing a list of data-fields (e.g. an ID, an accession number, keywords, a sequence etc.). Figure 1.6 shows a sample of a database entry. Each of the data-fields consists of strings or tokens. The set of productions for each database must describe how to divide the database into entries, then further into fields, and then into the strings or tokens within each data-field. It is these tokens within each field which are inserted into an index.
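The division of an entry into fields and then into tokens can be sketched in ordinary code. The example below is written in Python rather than ICARUS and is not how SRS itself is implemented; it assumes an EMBL-style layout in which each line starts with a two-letter line code, continuation lines start with blanks, and an entry ends with `//`. The function names `split_fields` and `tokens` are chosen for the example.

```python
ENTRY = """\
ID   AB000360   standard; DNA; HUM; 2582 BP.
KW   glycosylphosphatidylinositol-synthesis gene; PIGC.
SQ   Sequence 2582 BP; 694 A; 494 C; 581 G; 813 T; 0 other;
     ggatccctgc tgcagagggg gtaacggtgt ctggcttgcc tgttgtggtc        50
//
"""


def split_fields(entry):
    """Group the lines of one entry by their two-letter line code."""
    fields = {}
    code = None
    for line in entry.splitlines():
        if line.startswith("//"):      # end of entry
            break
        if line[:2].strip():           # a new line code starts a new field
            code = line[:2]
        fields.setdefault(code, []).append(line[5:].strip())
    return fields


def tokens(fields, code):
    """Extract the tokens of a field: the ID word, or the ';'-separated keywords."""
    if code == "ID":
        return [fields["ID"][0].split()[0]]
    if code == "KW":
        text = " ".join(fields["KW"]).rstrip(".")
        return [part.strip() for part in text.split(";") if part.strip()]
    raise KeyError(code)


fields = split_fields(ENTRY)
token_tables = {"id": tokens(fields, "ID"), "key": tokens(fields, "KW")}
print(token_tables)
```

Each list in `token_tables` plays the role of one token table: the strings it holds are what would be written to an index for that field.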

Figure 1.7 shows a subset of the rules for describing an entry in the EMBL database. The $Wrt command is the instruction to write a string into a token table.

Meta definition of molecular biology resources

The meta definition which describes the structure of the resource is also written using ICARUS. For the EMBL database this meta definition consists of the definition of several objects.

    entry: id_line ac_line de_line kw_line sq_line end_line
    id_line:
    ac_line:
    kw_line:
    sq_line:
    end_line:
    id:
    'SQ' sequence
    word:

Fig. 1.7 Subset of rules describing an entry in the EMBL database.

    $EMBL_DB=$Library:[EMBL
      group:$SEQUENCE_LIBS
      format:$EMBL_FORMAT
      searchName:'*.dat'
    ]

Fig. 1.8 The EMBL library object.

The EMBL library object (Figure 1.8) defines the EMBL library as being part of the sequence library group, with the format being described in the $EMBL_FORMAT object (Figure 1.9) and the files which make up the EMBL database being all those files with the extension 'dat' in the EMBL flat file directory.

The $EMBL_FORMAT object contains a list of the fields which make up the database. In our example EMBL database the library has five fields.

The field information is related to a particular token table via the indexToken attribute in the $Field object. Hence, when a particular piece of information from the description field of the EMBL databank is requested, SRS looks in the description field meta definition for the EMBL database for the name of the token table which contains that information. The SRS token server then requests that this token table be built, and the tokens, or strings, from the databank can be returned.

    $EMBL_FORMAT=$LibFormat:[
      syntax:$EMBL_SYNTAX
      contains:$DNASEQ_DATA
      fileType:{$DAT_FILE $SEQ_FILE}
      fields:{
        $Field:[$DF_ID index:id indexToken:id]
        $Field:[$DF_Accession index:str indexToken:acc]
        $Field:[$DF_Keywords index:str indexToken:key]
        $Field:[$DF_Description index:str indexToken:des]
        $Field:[$DF_DNASequence format:embl]
      }
    ]

Fig. 1.9 The EMBL FORMAT object.
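The lookup from a field to its token table can be mimicked with a plain data structure. The sketch below is a Python stand-in, not ICARUS: the dictionary mirrors the field list of the $EMBL_FORMAT object, while `TOKEN_TABLES`, `token_table_for` and `field_values` are hypothetical names introduced only to show the indirection.

```python
# Plain-Python mirror of the $EMBL_FORMAT field list (not ICARUS syntax).
EMBL_FORMAT = {
    "syntax": "EMBL_SYNTAX",
    "fields": {
        "ID": {"index": "id", "indexToken": "id"},
        "Accession": {"index": "str", "indexToken": "acc"},
        "Keywords": {"index": "str", "indexToken": "key"},
        "Description": {"index": "str", "indexToken": "des"},
        "DNASequence": {"format": "embl"},
    },
}

# Hypothetical token tables, as a token server might build them for one entry.
TOKEN_TABLES = {
    "id": ["AB000360"],
    "key": ["glycosylphosphatidylinositol-synthesis gene", "PIGC"],
}


def token_table_for(lib_format, field):
    """Name of the token table which backs a field, or None if it has none."""
    return lib_format["fields"][field].get("indexToken")


def field_values(lib_format, tables, field):
    """Tokens for a field, found by following its indexToken attribute."""
    table = token_table_for(lib_format, field)
    return tables.get(table, []) if table else []


print(token_table_for(EMBL_FORMAT, "Description"))
print(field_values(EMBL_FORMAT, TOKEN_TABLES, "Keywords"))
```

Because the field only names its token table, the format of the underlying data can change without touching the code which requests the field.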

Indexing databases

In the case of indexing databases, the token tables for each field which is to be indexed are constructed. This means that for each field defined in the $LibFormat object, a token table is generated for that field based on the indexToken attribute in the meta definition. The values in the token table for each entry are then written to indices which are sorted and can be searched quickly.
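Why sorted indices can be searched quickly is easy to show with a binary search. The following sketch is an assumption-laden illustration and not the SRS index format: the second entry ID is invented, tokens are lower-cased for matching, and `build_index` and `lookup` are names chosen for the example.

```python
from bisect import bisect_left


def build_index(token_table):
    """Turn {entry ID: [tokens]} into a sorted list of (token, entry ID) pairs."""
    return sorted(
        (token.lower(), entry)
        for entry, entry_tokens in token_table.items()
        for token in entry_tokens
    )


def lookup(index, token):
    """Binary-search the sorted index for every entry containing the token."""
    token = token.lower()
    pos = bisect_left(index, (token, ""))  # "" sorts before any entry ID
    hits = []
    while pos < len(index) and index[pos][0] == token:
        hits.append(index[pos][1])
        pos += 1
    return hits


# Keyword token table for two entries; "X00001" is a made-up entry ID.
keywords = {"AB000360": ["PIGC", "gene"], "X00001": ["gene"]}
index = build_index(keywords)
print(lookup(index, "gene"))
```

The search touches only a logarithmic number of index positions before reading off the matching entries, instead of scanning every entry in the flat files.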

As well as the usual indices, link indices can be built which allow for navigation between databases or cross-database queries. Two fields from different databanks must be specified as having a link between them.

Querying and linking databases

Queries are performed on databases based on fields. SRS provides a full query language for interrogating both string and numeric fields and combining searches across multiple libraries and fields. SRS returns a set as the result of a query. Sets can be combined with the AND ('&'), OR ('|') and BUT NOT ('!') operators, which give the intersection, union and "and not" combination of two sets. In addition, two link operators are provided. These allow two sets to be linked together, returning either the set of entries in set 2 which link to set 1 or the set of entries in set 1 which link to set 2.
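The behaviour of these operators can be imitated with ordinary set arithmetic. This is a sketch of the semantics only, not of the SRS query language: all entry IDs and the contents of `LINK_INDEX` are invented, Python writes the "and not" combination as `-` rather than '!', and `link_forward` and `link_back` are hypothetical names for the two link directions.

```python
# Query results are sets of entry IDs; all IDs and links here are invented.
kinase = {"P1", "P2", "P3"}
human = {"P2", "P3", "P4"}

both = kinase & human       # AND: entries in both sets (intersection)
either = kinase | human     # OR: entries in at least one set (union)
but_not = kinase - human    # BUT NOT: entries in the first set only

# Hypothetical link index: protein entry -> nucleotide entries it refers to.
LINK_INDEX = {"P1": {"N1"}, "P2": {"N2", "N3"}, "P4": {"N4"}}


def link_forward(left, right, links):
    """Entries in `right` which are linked from some entry in `left`."""
    return {
        target
        for source in left
        for target in links.get(source, ())
        if target in right
    }


def link_back(left, right, links):
    """Entries in `left` which link to some entry in `right`."""
    return {source for source in left if links.get(source, set()) & right}


nucleotide = {"N1", "N2", "N4"}
print(sorted(link_forward(kinase, nucleotide, LINK_INDEX)))
print(sorted(link_back(kinase, nucleotide, LINK_INDEX)))
```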

Views and object loader

The result of a query or link operation in SRS is a set of entries. This is just a collection of entry IDs and does not contain any information from the entry in the databank. Views provide a way of viewing part of the information from a databank for those entries in the set. Moreover, related information from different databanks can be included in the view, thus creating a virtual entry.

Object loaders [16] provide a new and more flexible way of obtaining the data for those entries in the set from the databases. Instead of retrieving a set of strings or text, the object loader retrieves an object containing the requested information. Views within SRS now use object loaders to obtain the data before printing it out in the requested way.

Fig. 1.10 Object loaders. The diagram relates the grammar, token server, loader specification, class and object loader, with access from CORBA, Java and Perl.

This provides much more flexibility, as the data objects returned by the object loader can be passed to other programs, such as those written in C, C++, Java, Perl or Python, passed over the Internet via a CORBA server, or used to create more interesting views, such as a view including an image made from data in the object (an example of this can be seen in Figure 1.11 later in this chapter).

Figure 1.10 shows how objects containing information from multiple databases can be constructed. From the databases, DB1 and DB2, the token server can retrieve information via the grammars describing those databases as described above. An object loader can then be defined which relates an attribute in the object to a specific token table from the token server. The object loader can then take this meta-defined object and load information into it from the databases for each entry in the set.
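The idea of relating object attributes to token tables can be sketched as follows. This is a hypothetical Python analogue rather than the SRS object loader interface: the class `SequenceRecord`, the `LOADER_SPEC` mapping, the stand-in `TOKEN_SERVER` and all of the data are invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """Meta-defined object whose attributes are filled from token tables."""
    entry_id: str
    description: str = ""
    keywords: list = field(default_factory=list)
    structure_ids: list = field(default_factory=list)


# Loader specification: attribute -> (database, token table). Hypothetical.
LOADER_SPEC = {
    "description": ("DB1", "des"),
    "keywords": ("DB1", "key"),
    "structure_ids": ("DB2", "id"),
}

# Stand-in token server: database -> token table -> entry ID -> tokens.
TOKEN_SERVER = {
    "DB1": {
        "des": {"E1": ["Example protein"]},
        "key": {"E1": ["kinase", "membrane"]},
    },
    "DB2": {"id": {"E1": ["S1", "S2"]}},
}


def load(entry_id, spec, server):
    """Build one object for an entry by pulling each attribute from its token table."""
    record = SequenceRecord(entry_id)
    for attribute, (database, table) in spec.items():
        found = list(server[database][table].get(entry_id, []))
        if isinstance(getattr(record, attribute), list):
            setattr(record, attribute, found)
        else:
            setattr(record, attribute, " ".join(found))
    return record


records = [load(entry, LOADER_SPEC, TOKEN_SERVER) for entry in ["E1", "E2"]]
print(records[0])
```

The resulting objects combine data from both databases and can be handed to any other code, which is the flexibility that a purely textual view lacks.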

Applications - analyzing data

As mentioned in Section 1.2 describing the various Molecular Biology resources available, applications have played a large part in developing these resources. In fact some databases are entirely based on the results of applications.

The output from these applications itself forms a database and often contains information related to entries in other databases. For example, if a BLAST run is performed against the SWISS-PROT database then the output will contain SWISS-PROT accession numbers and parts of sequences in the SWISS-PROT database. In SRS this output is parsed and indexed, so that it forms part of the SRS universe and can be used to link data in different databases.

The SRS data warehouse

Keeping all relevant databases up to date, both internal and external, can be a difficult task. Databases can be updated irregularly and updating them is a routine but cumbersome task. Once the databases have been downloaded from their repository, certain tasks must be performed. These may include reformatting the data, combining them with other data to make new databases, and indexing the databases.

While all this sounds routine, there are many things which can go wrong and there are many interdependent tasks. If, for example, one particular FTP site is down, then the downloading of the database may not occur. If the indexing of the database goes ahead anyway, then that database will be out of date and databases with links to that database may also contain incorrect information.

A new module of SRS, Prisma, automates this process. Prisma not only performs the downloading, reformatting and re-indexing tasks normally associated with such a tool, but also reports the progress of the process. Figure 1.11 shows the status of the updating of the databases after a particular run. Each task in the process (such as downloading a file, reformatting a database, indexing the database etc.) is represented by a box, and dependent tasks are joined by a connecting line. This view allows the updating process to be checked quickly, with any failures and their effects immediately apparent (if a process fails for some reason the box representing it will become red, while processes waiting on another task are shown in purple).
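The dependency handling described above can be sketched in a few lines of Python. This is not Prisma's implementation: the task names, the `run_update` function and the status labels are assumptions made for illustration. The sketch runs tasks in dependency order, marks a task as failed when it does not succeed, and leaves every task that depends on an unfinished task in a waiting state instead of running it.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Box colours as described in the text; other states are left uncoloured here.
COLOURS = {"failed": "red", "waiting": "purple"}


def run_update(tasks, depends_on):
    """Run tasks in dependency order and return a status per task.

    tasks: name -> callable returning True on success.
    depends_on: name -> set of prerequisite task names (all present in tasks).
    """
    graph = {name: set(depends_on.get(name, ())) for name in tasks}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        if any(status[dep] != "done" for dep in graph[name]):
            status[name] = "waiting"  # a prerequisite failed or never ran
        else:
            status[name] = "done" if tasks[name]() else "failed"
    return status


# Hypothetical update run in which the FTP site for DB1 is down.
TASKS = {
    "download_db1": lambda: False,
    "reformat_db1": lambda: True,
    "index_db1": lambda: True,
    "download_db2": lambda: True,
    "index_db2": lambda: True,
    "link_db2_to_db1": lambda: True,
}
DEPENDS_ON = {
    "reformat_db1": {"download_db1"},
    "index_db1": {"reformat_db1"},
    "index_db2": {"download_db2"},
    "link_db2_to_db1": {"index_db1", "index_db2"},
}

status = run_update(TASKS, DEPENDS_ON)
for name, state in status.items():
    print(name, state, COLOURS.get(state, ""))
```

In this run the failed download keeps DB1 from being re-indexed with stale data, and the link between DB2 and DB1 waits as well, which is the situation the status view is meant to make visible at a glance.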

This extension to SRS has proved very successful at keeping data up to date at both public sites, such as the EBI, and private companies.

Accessing integrated data

Integrating the data into a single system is only part of the problem. The interface presented to the scientist must be easy to use and provide several powerful features to enable easy manipulation and viewing of the data. It must also be possible to access the data and use it easily from other programming environments, such as other languages or across a network.

Fig. 1.11 Prisma view.

The web interface

SRS provides an extremely powerful, but relatively simple, web interface to the databases mentioned above. It is possible to select a database that one wants to search and then retrieve the whole database, or search one or more fields within the database. Figure 1.12 shows the standard query page for the SWISS-PROT database. Users can either search for their search term in the ID, keyword, or other fields of the database, or search all text fields for their search term.
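The difference between searching one field and searching all text fields can be shown with a small inverted index. The Python sketch below is only an illustration of the principle: the entries, field names and the functions `build_index` and `search` are hypothetical and do not reflect how SRS stores its indices.

```python
from collections import defaultdict

# Hypothetical entries with a few text fields each.
ENTRIES = {
    "ENTRY_A": {"keywords": "membrane receptor", "description": "example receptor"},
    "ENTRY_B": {"keywords": "kinase", "description": "example membrane kinase"},
}


def build_index(entries):
    """Index every word under the (field, word) pair it occurs in."""
    index = defaultdict(set)
    for entry_id, fields in entries.items():
        for field, text in fields.items():
            for word in text.lower().split():
                index[(field, word)].add(entry_id)
    return index


def search(index, term, field=None):
    """Search one field, or all text fields when no field is given."""
    term = term.lower()
    if field is not None:
        return set(index.get((field, term), set()))
    matches = set()
    for (_, word), entry_ids in index.items():
        if word == term:
            matches |= entry_ids
    return matches


index = build_index(ENTRIES)
print(search(index, "membrane", field="keywords"))
print(search(index, "membrane"))
```

Restricting the search to the keyword field returns only the entry annotated with that keyword, while the all-text search also finds entries that merely mention the term in their description.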

Once a set of entries matching the search terms has been retrieved, various operations can be performed on that set. The simplest is to view each whole entry in it, or to view just selected fields from the entries. Predefined views, either part of the SRS package or defined by the local SRS administrator, allow the data to be presented in other formats. Figure 1.13 shows the default entry view of a whole SWISS-PROT entry. This looks considerably different from the text in the flat file, shown in Figure 1.14, and the information is easier to read. Either a predefined view of the entries can be used to view just some of the fields, or alternatively a new view of the data can be built and used to view the data. The data can also be downloaded onto the user's local disk.

Fig. 1.12 Querying the SWISS-PROT database.

Fig. 1.13 The Entry Page for a SWISS-PROT Entry.

ID CD22_HUMAN STANDARD; PRT; 847 AA.
AC P20273; Q01665; Q92872; O95699; O95701; O95702; O95703;
DT 15-JUL-1999 (Rel. 38, Last sequence update)
DT 15-JUL-1999 (Rel. 38, Last annotation update)
DE B-CELL RECEPTOR CD22 PRECURSOR (LEU-14) (B-LYMPHOCYTE CELL
ADHESION
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
RN [1]
RP SEQUENCE FROM N.A. (CD22-BETA).
RC TISSUE=TONSIL;
RX MEDLINE; 91086838.

Fig. 1.14 A part of the equivalent SWISS-PROT text entry.
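To make the contrast between the flat file and a field-based view concrete, the Python sketch below splits a few lines of the flat-file excerpt of Fig. 1.14 into fields keyed by their two-letter line codes. It is a deliberately simple stand-in for the grammar-driven parsing that SRS performs: the function name `parse_entry` is hypothetical, and repeated line codes are simply joined with a space.

```python
# A few lines transcribed from the flat-file excerpt (Fig. 1.14).
FLAT_FILE = """\
ID CD22_HUMAN STANDARD; PRT; 847 AA.
AC P20273; Q01665; Q92872; O95699; O95701; O95702; O95703;
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
"""


def parse_entry(text):
    """Group flat-file lines by line code and join continuation lines."""
    parts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        code, _, value = line.partition(" ")
        parts.setdefault(code, []).append(value.strip())
    return {code: " ".join(values) for code, values in parts.items()}


entry = parse_entry(FLAT_FILE)
entry_name = entry["ID"].split()[0]
accessions = [acc.strip() for acc in entry["AC"].split(";") if acc.strip()]
taxonomy = [name.strip() for name in entry["OC"].rstrip(".").split(";") if name.strip()]

print(entry_name)
print(accessions)
print(taxonomy)
```

Once the text has been split into named fields in this way, a view only has to choose which fields to show and how to lay them out.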

From the set of entries, it is also possible to link to other databases and find related entries, enabling navigation across all the available databases. It is also possible to run applications, such as BLAST and FASTA, on retrieved data. The results are indexed and immediately available for querying.
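Linking a set of entries to another database can be pictured as a lookup in an index of cross-references. The Python sketch below uses invented database names and entry identifiers, and it assumes that a link can be followed in either direction; the functions `build_link_index` and `link_to` are hypothetical and only illustrate the operation.

```python
from collections import defaultdict

# Hypothetical cross-references between entries: (database, entry id) pairs.
CROSS_REFS = [
    (("PROTEINS", "P1"), ("NUCLEOTIDES", "N1")),
    (("PROTEINS", "P1"), ("NUCLEOTIDES", "N2")),
    (("PROTEINS", "P2"), ("LITERATURE", "L1")),
]


def build_link_index(cross_refs):
    """Index each cross-reference in both directions (an assumption here)."""
    index = defaultdict(set)
    for source, target in cross_refs:
        index[source].add(target)
        index[target].add(source)
    return index


def link_to(entry_set, target_db, index):
    """Return the entries of target_db that are linked to any entry in the set."""
    return {
        target
        for entry in entry_set
        for target in index.get(entry, ())
        if target[0] == target_db
    }


index = build_link_index(CROSS_REFS)
print(link_to({("PROTEINS", "P1")}, "NUCLEOTIDES", index))
print(link_to({("NUCLEOTIDES", "N2")}, "PROTEINS", index))
```

Applying `link_to` repeatedly, with the result of one call as the entry set of the next, gives a path from one database to another through any intermediate databases.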

The web interface provides an easy way for scientists to access the SRS functionality. Examples of the SRS web interface can be seen at srs.ebi.ac.uk or www.celera.com/celeraScience and other public web sites.

The application programmers' interfaces (APIs)

The SRS APIs provide an object-oriented programmer's interface to an SRS system. This allows programmers to access the databases in their SRS system from programs written in C++, Perl, Java or Python. The databases can be queried and information retrieved about the set of entries. The SRS system itself can be interrogated about the databases available and the fields which can be queried within those databases.

Below (Figure 1.15) is a Perl script for searching the SWISS-PROT database for the entry with ID "C931_SOYBN". A set of entries is returned, this time with just one entry in it. The entry can be retrieved using the getEntry method of the set class, and the fieldToString method of the entry class is used to print the whole entry and just the sequence field respectively. Alternatively, loaders can be used to retrieve an object with the fields defined in the loader available to the programmer.

    use lib $ENV{SRSEXE};
    use srsperl;

    # Open a session and query SWISS-PROT by ID.
    $sess = new Session;
    $set = $sess->query("[swissprot-id:C931_SOYBN]", "");
    $numEntries = $set->size();
    print "The query produced ", $numEntries, " entries\n";

    # Retrieve the single entry in the set (the index argument is an assumption).
    $e = $set->getEntry(0);

    # Print the whole entry, then just the Sequence field.
    print $e->fieldToString("", "", "", "");
    print "Now print just the Sequence field\n";
    print $e->fieldToString("Sequence", "", "", "");

Fig. 1.15 An example Perl program performing a query.

Other interfaces

In addition to the APIs and the web interface, there are other ways of accessing the SRS system. Possibly the most powerful is to use the ICARUS programming language itself. This provides an interface for performing queries and running applications similar to the APIs, but also allows for parsing and easier definition of object loaders etc. Another interface to SRS is via the Unix command line and the "getz" program, which allows for querying of databases and extraction of fields.

Other approaches

Other attempts to solve the problem of integrating Molecular Biology resources fall into two approaches: either using relational databases to store and retrieve data, or using database-specific programs to parse flat files.

The National Center for Biotechnology Information (NCBI) presents its series of databases, GENBANK, GENPEPT, etc., on the web with a simple query form. The query form encourages users to think about the data differently from SRS: users search for a DNA or protein sequence rather than searching a specific database. This allows scientists and occasional users to ask simple questions very quickly. However, it is not possible to search a specific field in a database or for users to build their own view of the data.

The NCBI does use a flat file approach to parse and retrieve the data in its databases and present it on the web. While the NCBI continues to add databases, there are not as many databases available as on some SRS servers, and hence it is difficult to find relationships that may exist between the data displayed and data in other databases. Since the NCBI presents its data on its own web site, it is also not possible for other academic institutions or companies to bring the software in-house to integrate their proprietary data.

The use of Relational Database Management Systems (RDBMSs) to store genomic data is particularly widespread in pharmaceutical companies, where security is of significant importance. While RDBMSs provide great security and ease of database maintenance, they can be very slow when querying the data, especially when asked novel questions linking data in many different tables. With finding links between data in different databases becoming more important, queries can take a very long time. Scalability is also a big problem as the number of databases increases. However, with pharmaceutical companies wanting to keep their data in an RDBMS and still look for links in the public databases, it will become imperative for a tool such as SRS to allow the integration of RDBMS and public data into one system.

Conclusions

In this Chapter, we have detailed the Molecular Biology resources available and the way in which these data may be integrated to allow scientists to ask novel questions and investigate the relationships between data in different databases.

While there are many and varied approaches to this problem, the approach which SRS uses allows new databases to be quickly integrated into the system. Users then have the flexibility to follow a path from data in one database to data in another database, and then to view whichever data in both interest them.

References

1 http://www.ddbj.nig.ac.jp/.

2 http://www.smart.embl-heidelberg.de/.

3 http://hgbase.cgr.ki.se/hgbase/snpinfo.htm.

4 http://www.ebi.ac.uk/microarray/MGED/.

5 http://www.brenda.uni-koeln.de/.

6 http://www.biobase.com/EMP/.

7 http://www.cme.msu.edu/WIT/.

8 http://www.genome.ad.jp/kegg/.

9 http://ecocyc.PangeaSystems.com/ecocyc/.

10 http://www.isbi.net/.

11 http://www.infobiogen.fr/services/dbcat/.

12 Bairoch A. The enzyme data bank. Nucleic Acids Res., 21(13):3155-6, 1993.
