COMPUTATIONAL BIOLOGY

THEORETICAL MOLECULAR EVOLUTION AND COMPARATIVE GENOMICS
 
All forms of life contain genetic material in the form of DNA. Sequencing technologies have provided draft DNA sequences for many prokaryotic (e.g., bacterial) and eukaryotic (e.g., human) genomes. The genes that code for proteins are arranged on the chromosomes in a complex but optimal manner. Genes in eukaryotes generally contain two types of sequence: (1) exons (regions that form the protein) and (2) introns (regions that do not form the protein). Efforts are still underway to determine the exact number of genes in the human genome. Likewise, many other questions about genes in eukaryotic genomes remain open. How are these genes arranged in the genome? How many paralogs (homologs within the same genome) are present in each genome? How many introns does each gene contain, and where are they located within the gene? What are the characteristics of intron-containing genes in eukaryotes? Are there intronless genes in eukaryotic genomes? These questions remain largely unaddressed.

Databases

To address these questions, we mined GenBank and developed specialized datasets and databases. One such database is ExInt (Sakharkar et al., 2000; Sakharkar et al., 2002), a collection of intron-containing eukaryotic genes derived from GenBank. Each record describes the protein sequence coded by an intron-containing eukaryotic gene, the source of the sequence, the gene structure (exon-intron arrangement) and other related properties of the sequence under study. This dataset is extremely useful for studying the features that gene structures share across eukaryotic genes. A concrete understanding of these features may help to unlock some of the key events in molecular pathogenesis.
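As a rough illustration of how an exon-intron arrangement can be read from a GenBank record, the sketch below parses a CDS location string such as `join(1..100,201..300)` and derives the introns as the gaps between consecutive exons. It is a minimal sketch and not the procedure used to build ExInt: the function names (`parse_exons`, `introns_between`, `gene_structure`) are hypothetical, and it assumes all intervals lie on a single sequence, so locations that reference other accessions are not handled.

```python
import re

def parse_exons(location):
    """Return sorted (start, end) exon intervals from a GenBank-style CDS location."""
    # Each 'start..end' pair is treated as one exon.
    # '<' and '>' mark partial ends and are ignored in this sketch.
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return sorted((int(start), int(end)) for start, end in pairs)

def introns_between(exons):
    """Return introns as the gaps between consecutive exons (1-based, inclusive)."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
        if next_start - prev_end > 1  # skip exons that directly abut
    ]

def gene_structure(location):
    """Summarize the exon-intron arrangement encoded by one CDS location."""
    exons = parse_exons(location)
    introns = introns_between(exons)
    return {
        "strand": "-" if location.startswith("complement") else "+",
        "exons": exons,
        "introns": introns,
        "intron_count": len(introns),
    }

if __name__ == "__main__":
    print(gene_structure("join(1..100,201..300,401..450)"))
    print(gene_structure("complement(join(10..50,90..120))"))
    print(gene_structure("1..300"))
```

A record whose `intron_count` is greater than zero would be a candidate for an intron-containing collection, while a single-interval location such as `1..300` yields no introns.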

Another database, SEGE, was developed to collect all intronless genes in eukaryotes (Sakharkar et al., 2002). It is likewise derived from GenBank. Intronless genes are of particular interest in studies of genome evolution, gene arrangement and gene discovery. Because they lack introns, intronless genes largely circumvent alternative splicing. Human proteins encoded by functional intronless genes (particularly those without intron-containing paralogs) could be considered as drug targets with less caution. Other databases, such as IEKB (Intron-Exon Knowledge Base) and MIDB (Mismatched Intron Database), were also developed by our group, and their features are described elsewhere (Sakharkar et al., 2000; Sakharkar et al., 2001). We are also studying alternative splicing by exon skipping and protein fusion using computational procedures (Yiting et al., 2004).
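The selection of intronless genes that have no intron-containing paralogs can be sketched as a simple filter. This is an illustrative sketch only, not the SEGE pipeline: it assumes each gene has already been assigned to a paralog family and annotated with an intron count, and the function name and the `(gene_id, family_id, intron_count)` record layout are hypothetical.

```python
from collections import defaultdict

def intronless_without_intron_paralogs(genes):
    """Select intronless genes whose paralog family has no intron-containing member.

    genes: iterable of (gene_id, family_id, intron_count) tuples (assumed layout).
    """
    genes = list(genes)  # the records are scanned twice

    # First pass: note which families contain at least one intron-containing gene.
    family_has_introns = defaultdict(bool)
    for _, family_id, intron_count in genes:
        family_has_introns[family_id] |= intron_count > 0

    # Second pass: keep intronless genes from families with no such member.
    return [
        gene_id
        for gene_id, family_id, intron_count in genes
        if intron_count == 0 and not family_has_introns[family_id]
    ]

if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        ("g1", "famA", 0),
        ("g2", "famA", 3),
        ("g3", "famB", 0),
        ("g4", "famC", 2),
        ("g5", "famB", 0),
    ]
    print(intronless_without_intron_paralogs(example))
```

In the made-up example, `g1` is intronless but is excluded because its paralog `g2` contains introns, whereas `g3` and `g5` are retained.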