dotplot with Python

By leonardo maffi
Version 0.10, July 6 2006


A basic implementation of the dotplot algorithm in Python: dotplot.zip
Notes: it requires Psyco. The zip contains bioutil.py too, but you may find an updated version of bioutil.py on the software page.

Until recently, dotplots were used in bioinformatics to visualize how similar two given genome sequences are, and where. A well-known C program that does this is Dotter (its sources are available too, elsewhere):
http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
A part of its description: Dotter is a graphical dotplot program for detailed comparison of two sequences. Here, every residue in one sequence is compared to every residue in the other sequence. The first sequence runs along the x-axis and the second sequence along the y-axis. In regions where the two sequences are similar to each other, a row of high scores will run diagonally across the dot matrix. If you're comparing a sequence against itself to find internal repeats, you'll notice that the main diagonal scores maximally, since it's the 100% perfect self-match. To make the score matrix more intelligible, the pairwise scores are averaged over a sliding window.
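The description above can be sketched in a few lines of Python. This is only a minimal illustration, not the code of dotter or of dotplot.py: the function name `dot_matrix` and its parameters are hypothetical, and to stay short it assumes a plain identity score (1 for equal residues, 0 otherwise) instead of a substitution matrix. Any other scoring function can be passed in.

```python
def dot_matrix(seq_x, seq_y, window=5, score=None):
    """Return a len(seq_y) x len(seq_x) matrix of window-averaged scores.

    seq_x runs along the x-axis (columns), seq_y along the y-axis (rows).
    """
    if score is None:
        # Assumption: identity scoring, not a real substitution matrix.
        score = lambda a, b: 1.0 if a == b else 0.0
    nx, ny = len(seq_x), len(seq_y)
    half = window // 2
    matrix = [[0.0] * nx for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            total = 0.0
            count = 0
            # Average the pairwise scores along the diagonal through (x, y).
            for k in range(-half, window - half):
                i, j = x + k, y + k
                if 0 <= i < nx and 0 <= j < ny:
                    total += score(seq_x[i], seq_y[j])
                    count += 1
            matrix[y][x] = total / count  # count >= 1 because k = 0 is always valid
    return matrix
```

Comparing a sequence against itself with this sketch gives the maximal score on every cell of the main diagonal, as the description says. The triple loop makes the cost proportional to len(seq_x) * len(seq_y) * window, so it is meant for small inputs only.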

It seems that dotplots aren't used much anymore (more powerful and faster software is used instead), but to learn and experiment with some basic bioinformatics I have implemented a dotplot program using Python+Psyco. I have used Python because I like it a lot and because it's fast to debug; it's a good language for developing prototypes (which can later be translated to C, if necessary). With some small changes this dotplot.py program can be adapted for ShedSkin (shedskin.sourceforge.net), which, given the structure of dotplot.py, would probably make it about as fast as dotter.

The dotplot.py program runs from the shell, given the names of two FASTA files to load (small ones, since the algorithm is O(n^2)), plus optionally the width of the scanning window. It measures how similar the two sequences are locally, using the sliding window and the BLOSUM62 amino acid substitution matrix (which tells how likely it is for a certain amino acid to change into another one, under some specific but rather common conditions). dotplot.py outputs a PGM (Portable Grey Map) greyscale image, which contains too many shades of grey and must be imported into a graphics program to remove most of the greys.
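The input and output ends of such a program are simple enough to sketch. The two helpers below are hypothetical and are not taken from dotplot.py or bioutil.py: `read_fasta` returns the first sequence found in the lines of a FASTA file, and `pgm_bytes` turns a matrix of scores into the bytes of a binary (P5) PGM image. Mapping the lowest score to black and the highest to white is an assumption; the real program may scale differently.

```python
def read_fasta(lines):
    """Return the first sequence found in an iterable of FASTA lines."""
    parts = []
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts another record: stop here
            seen_header = True
        elif line:
            parts.append(line.upper())
    return "".join(parts)


def pgm_bytes(matrix):
    """Encode a non-empty matrix of numbers as a binary PGM (P5) image."""
    height, width = len(matrix), len(matrix[0])
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat matrix
    header = "P5\n{} {}\n255\n".format(width, height).encode("ascii")
    body = bytearray()
    for row in matrix:
        for value in row:
            # Assumption: linear scaling, lowest score -> 0, highest -> 255.
            grey = int(round(255.0 * (value - lo) / span))
            body.append(max(0, min(255, grey)))
    return header + bytes(body)
```

A file can be read with `read_fasta(open(name))`, and the image saved by writing the result of `pgm_bytes` to a file opened in "wb" mode.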

To perform my tests I have mostly used human protein sequences from (very big file):
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/protein/protein.fa.gz


Image produced by dotplot.py applied to the opsin1.fa and rhodopsin.fa fasta files.
You can see that they share a lot of structure: their amino acid sequences are quite similar.
(More quantitative comparisons can be made, but this is just a start.)



Image produced by dotplot.py applied to the zincFinger406.fa and likeZincFinger14.fa fasta files.
You can see the repeated structures (of the zinc finger). Dotter produces much the same image.

 

I have done some tests to see how dotplot.py compares to dotter, and sometimes the results look different; I don't know why yet (dotter uses BLOSUM62 too). Here are a few examples of the differences:


Image produced by dotplot.py on the "big" collagen1.fa fasta file (against itself).
(The main diagonal is so clearly visible because the sequence is identical to itself when shifted in parallel.)



Image produced by dotter on the collagen1.fa fasta file (against itself).
You can see some spots that are missing in the previous image, and the image is less smeared, etc.



Three images produced by comparing the hemoglobinB.fa and cytoglobin.fa fasta files (the two proteins may be similar).
The first image from the left is the output of dotplot.py.
The second image is the output of dotplot.py after the usual histogram processing you have to do to show the true matches.
The third image is the one shown by dotter. There is a quite clear similarity almost along the main diagonal, which I can't see in the dotplot.py output.
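The histogram processing mentioned in these captions is done by hand in a graphics program, so the exact steps are not fixed. A rough stand-in, under the assumption that higher pixel values mean stronger matches, is a percentile threshold: keep only the brightest fraction of pixels and turn everything else black. The function name and the default fraction below are hypothetical.

```python
def threshold_greys(pixels, keep_fraction=0.05):
    """Set the brightest pixels to 255 and all the others to 0.

    pixels is a non-empty list of rows of grey values; keep_fraction is
    the approximate share of pixels to keep (ties at the cutoff are kept).
    """
    flat = sorted(v for row in pixels for v in row)
    cut_index = min(len(flat) - 1, int(len(flat) * (1.0 - keep_fraction)))
    cutoff = flat[cut_index]
    return [[255 if v >= cutoff else 0 for v in row] for row in pixels]
```

For example, `threshold_greys([[10, 20], [30, 200]], keep_fraction=0.25)` keeps only the pixel with value 200. Choosing the fraction is a matter of trial and error, much like moving the sliders of a levels tool.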


Two images produced by comparing the hemoglobinB.fa and myoglobin.fa fasta files (the two proteins may be similar).
The first image is the output of dotplot.py (after the usual histogram processing you have to do to show the true matches).
The second image is produced by dotter. Again, near the main diagonal there is an island of similarity that is missing from the dotplot.py output.
