Everything You Need To Know About Databases In Biological Experiments


Every level of a biological experiment generates datasets. For example, sequencing techniques produce the nucleotide sequences of genes and genomes. These nucleotide sequences are routinely deposited in publicly available databases such as GenBank, so that any researcher can access the sequence data. Likewise, sequencing at the transcriptome level produces RNA sequences, which can also be deposited in public databases. For proteins, the data may consist of amino acid sequences, conserved sequence motifs (derived from multiple alignments), and descriptions of protein structures. At the highest level of complexity addressed in bioinformatics (often considered to be systems biology), there are also databases covering protein-protein interactions, regulatory pathways, and metabolic pathways.

Primary databases

Primary databases consist mostly of experimentally obtained data, including nucleotide sequences and three-dimensional structures. There are three types of primary database:

  • Genome Database
  • Protein Database
  • Complex Database

Genome databases comprise sequence databases and structural databases, while protein databases include only structural databases. Complex databases include protein-nucleic acid complex databases.

Secondary databases

Secondary databases are derived from the analysis or processing of primary databases and include secondary structures and hydrophobicity plots. Domains are also stored in secondary databases. In the context of secondary databases, protein databases include only sequence databases, while complex databases include protein-nucleic acid interaction databases.

Secondary databases can be generated by using primary data sources as input. An example is UniGene, which assembles ESTs (partial mRNA sequences) that typically originate from the same gene into clusters. It then generates the sequence of the full gene from each cluster of assembled ESTs. The goal is to derive the complete mRNA sequence of every transcribed gene, including its splice variants.

For humans, for example, UniGene consists of approximately 130,000 EST clusters, which are typically regarded as the currently available information about the human transcriptome. Rfam is another example of a secondary database. It takes as input 2.5 million sequences of noncoding RNAs and derives motifs representing conserved regions of related RNAs. Roughly 2,500 families of related RNA motifs have been derived, using multiple alignments to identify the conserved positions. Similar methods are used in the Prosite database to derive motifs representing the conserved regions of approximately 1,800 protein families and domains. In SCOP (Structural Classification of Proteins), the primary protein data from PDB is used to derive a classification of proteins into structural classes.

This makes it possible to examine related proteins according to the structural features they share. The KEGG (Kyoto Encyclopedia of Genes and Genomes) resource contains many databases, among them databases for regulatory and metabolic pathways. Here, each entry shows how a number of genes or proteins interact in a particular biological process.
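The idea of deriving a motif from the conserved positions of a multiple alignment, as described for Rfam and Prosite, can be sketched in a few lines. This is a minimal illustration only: the toy alignment, the function names, and the use of "x" for variable positions are assumptions made for this example, and the real databases use considerably richer models and pattern syntax.

```python
from collections import Counter


def conserved_positions(alignment, threshold=1.0):
    """Return (column, residue) pairs where the most common residue
    reaches the given fraction of rows. Gap-dominated columns are skipped."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for col in range(length):
        column = [row[col] for row in alignment]
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(alignment) >= threshold:
            result.append((col, residue))
    return result


def motif_string(alignment, threshold=1.0):
    """Write conserved residues as letters and variable positions as 'x'."""
    conserved = dict(conserved_positions(alignment, threshold))
    return "".join(conserved.get(i, "x") for i in range(len(alignment[0])))


# Hypothetical ungapped alignment of four short protein fragments.
toy_alignment = [
    "GCGHKLC",
    "ACGHRLC",
    "GCAHKMC",
    "TCGHKLC",
]

print(motif_string(toy_alignment))        # strictly conserved columns only
print(motif_string(toy_alignment, 0.75))  # columns conserved in at least 75% of rows
```

Lowering the threshold trades specificity for sensitivity: more positions enter the motif, but each one is supported by fewer sequences.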


GenBank

GenBank is one of the most influential sequence databases. It contains only nucleotide sequences and is managed by NCBI. GenBank is a public database, which means that anyone can freely access the data in it. Anyone can also submit new sequences, as long as they follow the required format. Many scientific journals require new sequences to be submitted to GenBank (or to one of the other primary nucleotide databases) before an article is published. This is one of the reasons the database is growing so rapidly. GenBank held more than 210 million entries as of January 2022 and is growing at a rate of approximately a million sequences per month. There is also a WGS (Whole Genome Shotgun) section of GenBank, which collects sequences from WGS projects, mostly annotated fragments; this section is even larger and growing more rapidly.

Each GenBank entry has two main identifiers, the Locus and the Accession number. Nowadays the two identifiers are normally the same. Historically, the Locus was a descriptive name that indicated which organism the sequence came from, along with other information. The drawback was that locus identifiers occasionally had to be updated when errors in the data were discovered. The accession number was more reliable: it is only an arbitrary string of letters and numbers, so there was never any reason to change it.
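To make the two identifiers concrete, here is a small sketch that reads them from the header of a GenBank flat-file record. The header lines are abridged and illustrative, written in the style of an older record where the descriptive locus name and the accession number still differ. The function name `read_identifiers` is an assumption for this example, and a real parser (for instance the one in Biopython) handles many more fields and edge cases.

```python
# Abridged, illustrative header lines in GenBank flat-file style.
SAMPLE_HEADER = """\
LOCUS       SCU49845                5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
VERSION     U49845.1
"""


def read_identifiers(record_text):
    """Extract the locus name and the first accession number from a record header."""
    ids = {}
    for line in record_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "LOCUS":
            ids["locus"] = fields[1]
        elif fields[0] == "ACCESSION":
            ids["accession"] = fields[1]
    return ids


ids = read_identifiers(SAMPLE_HEADER)
print(ids)
print("identical" if ids["locus"] == ids["accession"] else "different")
```

In this older-style header the locus name begins with an organism abbreviation, while the accession number carries no meaning of its own, which is what makes it stable.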

Similarity scoring

Identity scoring (+1 for a match, -1 for a mismatch) does not work very well for amino acid sequences. Some mutations from one amino acid to another have little or no effect on protein structure and function, because the two amino acids have similar properties. Other mutations can have a much stronger effect. In addition, the genetic code makes some amino acid mutations easier to achieve, and therefore more frequent, than others. Some mutations require only one nucleotide substitution (for example, going from codon GAT to codon GAA only requires replacing T with A), whereas others require three nucleotide substitutions (compare codon GAT with TGG). For these reasons, the former kind of mismatch should be scored differently from the latter. This is reflected in amino acid substitution matrices such as BLOSUM62. The complete BLOSUM62 matrix specifies a score for every amino acid pair.
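The contrast between the two scoring schemes can be shown with the codons from the text. GAT encodes aspartate (D), GAA glutamate (E), and TGG tryptophan (W). The sketch below counts nucleotide differences between codons and compares identity scoring with a small excerpt of BLOSUM62. Only five entries of the matrix are included, and the helper names are assumptions for this example.

```python
def codon_distance(a, b):
    """Number of nucleotide positions at which two codons differ."""
    if len(a) != 3 or len(b) != 3:
        raise ValueError("codons must have exactly three nucleotides")
    return sum(x != y for x, y in zip(a, b))


def identity_score(x, y, match=1, mismatch=-1):
    """Identity scoring: every mismatch costs the same."""
    return match if x == y else mismatch


# Excerpt of BLOSUM62 (five entries only; the full matrix covers all pairs).
BLOSUM62_EXCERPT = {
    ("D", "D"): 6,
    ("E", "E"): 5,
    ("W", "W"): 11,
    ("D", "E"): 2,
    ("D", "W"): -4,
}


def substitution_score(x, y, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


print(codon_distance("GAT", "GAA"))  # D -> E needs one substitution
print(codon_distance("GAT", "TGG"))  # D -> W needs three substitutions
print(identity_score("D", "E"), identity_score("D", "W"))          # same penalty
print(substitution_score("D", "E"), substitution_score("D", "W"))  # different scores
```

Identity scoring treats both mismatches alike, whereas the substitution matrix gives the conservative D/E exchange a positive score and the D/W exchange a clearly negative one.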

Point Accepted Mutation (PAM)

In 1978, Dayhoff analyzed alignments of 71 protein families, looking at strongly conserved regions where only 1% of the amino acids had mutated. The observed substitution frequencies form the PAM1 matrix. Matrices for more distantly related (less conserved) proteins were derived from it, all the way up to PAM250. The PAM number indicates the number of accepted mutations per 100 positions, where "accepted" means that the mutation has persisted in the gene pool over time. Over long periods (sometimes millions of years), the amino acid at a given position can be replaced multiple times and can also change back to its original amino acid (often referred to as a back mutation). Statistically, it can be shown that 250 random mutations of a sequence of length 100 result in a sequence that is approximately 20% identical to the original.
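The effect of repeated hits and back mutations can be explored with a small simulation. The sketch below is not the Dayhoff model: it assumes, purely for illustration, that every mutation hits a uniformly chosen position and replaces the residue with a uniformly chosen different amino acid. Under that simplifying assumption the remaining identity after 250 mutations comes out at roughly 12%, lower than the figure of about 20% quoted above, which rests on observed, non-uniform substitution frequencies. The point of the sketch is only that 250 mutations per 100 positions still leave a noticeable fraction of positions identical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(sequence, n_mutations, rng):
    """Apply n point mutations at uniformly chosen positions (simplified model)."""
    seq = list(sequence)
    for _ in range(n_mutations):
        pos = rng.randrange(len(seq))
        # Always change to a different residue; a later hit may change it back.
        seq[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return "".join(seq)


def fraction_identical(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_identity(n_mutations, length=100, trials=500, seed=0):
    """Average identity to the original over many random sequences."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        original = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        total += fraction_identical(original, mutate(original, n_mutations, rng))
    return total / trials


for n in (1, 100, 250):
    print(n, round(mean_identity(n), 3))
```

Identity falls quickly at first and then levels off, because later mutations increasingly land on positions that have already changed.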


BLOSUM

The BLOSUM (BLOcks SUbstitution Matrix) similarity scoring matrices were derived from ungapped local alignments, “blocks”, with different levels of identity. For BLOSUM62, sequences within a block that were more than 62% identical were clustered and counted as a single sequence before the substitution frequencies between all amino acid pairs were calculated.
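How substitution frequencies from a block turn into scores can be sketched as follows. The example counts all residue pairs within each column of a toy block and converts observed versus expected pair frequencies into log-odds scores in half-bit units, the scaling commonly used for BLOSUM62. The toy block and function names are assumptions for illustration, and the clustering of highly similar sequences is deliberately left out to keep the sketch short.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(block):
    """Log-odds scores (half-bit units, rounded) for the pairs observed in a block."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2  # unordered pair: (a, b) and (b, a)
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Hypothetical block: four sequences, five ungapped columns.
toy_block = ["AALLA", "AALLA", "AALLA", "AALLL"]
print(log_odds_scores(toy_block))
```

Pairs that appear together in a column more often than chance would predict get positive scores, and pairs that appear together less often get negative scores. Pairs never observed in the block receive no score here, whereas a real matrix is built from enough blocks to cover every pair.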
