Some explorations of string repetition statistics

Based on a program by Matt Mahoney
By leonardo maffi, v.1.3, Jul 15 2009
(email address on the Home Page)

[Go back to the index]

Update Jul 15 2009: converted the fv.cpp code to the D language, it's inside the same zip. Compiling it with the LDC compiler (on Linux for now) leads to a program a bit faster than the C++ one (mostly thanks to putc_unlocked/getc_unlocked).

On his site about data compression (http://cs.fit.edu/~mmahoney/compression/textdata.html ) Matt Mahoney (designer of PAQ compression algorithm) has released a little and very interesting program called "fv". Given any file it produces an interesting visual representation of its repetition statistics, and essentially its entropic structure too.

I have modified fv a little, now it runs on my PC with just 256 MB of RAM (at the top of the single C++ document there is a constant that can be changed to choose the memory used), and it saves a truecolor BMP image with a name similar to the input file. This is the Unix diff file of my changes, and this is the zip file that contains the modified version, plus its compiled version for Windows (compiled with MinGW). The source code is very dense, with too few comments, but it is efficient and fast for big files too, and the output is refined.

The program can be simplified making it save PPM (pixel map) images that have a simpler structure, and it can be made a bit more complex plotting the values (1, 2, 10, 20, 100, 200, 1'000, 2'000, 10'000, 20'000, etc, on the vertical scale, to do it the program has just to store (for example in a string) the bitmap of tiny versions of '012 characters). On the right there is an example of such vertical scale (the vertical scale is always different, according to the actual length of the input file).

Few explanations from Matt Mahoney:
>The following diagrams show the distribution of string matches of length 1 (black), 2 (red), 4 (green), and 8 (blue). The horizontal axis represents the position in the file. The vertical axis shows the distance backwards to the previous match on a logarithmic scale. The major tick marks reading upwards are 1, 10, 100, 1000, etc.<

This program was designed to show the structure of the huge textual files from the Wikipedia, but it can be used to analyse the structure of any kind of file. It can also be used to 'see' the contents of an unknown file. In this page I just show some of the little surprises I have had applying it to various kinds of files. All the images in this page are scaled to 200%, to better see the details on hi-res monitors. All images are converted from truecolor BMP to 256-colors PNGs to reduce image size, but they look essentially the same.

On Windows it can be created a shortcut for Explorer contextual menu, to apply fv on it, and create an image on the fly that shows the file contents; such image can then be seen with PicaView or popping up a little window using some scripting language like Python that manages windows and BMP images.


random256_30mb.png - I start with a very simple file, 30 millions of pseudorandom bytes, produced with this Python script. (Python uses the Mersenne Twister to generate as pseudorandom generator, that is very good). You can see the results are quite smooth (because the input file is very big and very uniform). You can see two horizontal bands, the black one (distribution of len 1) centred at about 200 (vertical scale) and the red (len 2) at about 60000. There is a faint trace of the third band, the greenish one (len 4), at about 1 million of more. In a random file the longer repetitions are far apart.


random150_1mb.png - 1 million random bytes, just the 0-149 bytes are present. The red band is lower, around 20000. The image is more noisy because there is less data to compute on.


random50_1mb.png - 1 million random bytes, just the 0-49 bytes are present. The two bands are quite lower, at about 50 and 2500.


pure_txt_mail_483kb.png - After the random files, I have tried it on text data, on 1/2 megabyte of pure textual emails, very cleaned up, with just a small header for each one. There is a brown horizontal line at about 70, it probably corresponds to the newlines (those emails often have lines 70 chars long). Here the green and blue colours too can be seen, because a text file has lot of redundancy, and words are repeated all the time. The red band is centred horizontally at about 100, the green one at about 4000. The white vertical bands, usually in the green-blue zone, denote points where the entropic contents change a lot. The very dark spots spread are repetitions, like the email addresses of my usual correspondents, the repetitions inside the header, and the (short) copied parts of emails inside the answers. In the very low parts of the image there are faint horizontal bands, they are probably of little significance.


pure_txt_mail_2.3mb.png - Tested on more than 2 megabytes of very cleaned up pure-txt emails. There are few interesting vertical zones, where the situation changes a lot, the image is much smother essentially because of more data to compute the statistics on.


pure_txt_mail_40mb.png - Even more email text, about 40 megabytes. Despite using 'only' 128 MB of RAM, the fv program can find correlations for the whole file still.


pure_txt_mail_483kb_x2.png - This is a test to show how fv visualises whole block duplications. This file is composed of two copies of the 483 kb long txt file. You can see a dark blue horizontal line from half of the file, that clearly shows the duplication.


small_html_file_160kb.png - An html file (about evolution), about 160000 bytes long. The horizontal brown line at 70 is the length of the lines. There is another fainter line between 6 and 7 (maybe word length), and another even fainter one at about 13. Inside this document there is a lot of redundancy, because of the Html tags.


tk84.dll_1038kb.png - This is a test of the compression of some data. I have used a quite compressible file, a dll of the TCL scripting language, that is about 1 MB. Once leant to understand this kind of images, you can read all kinds of things from them. This image contains lot of information. At the end of the file you can see a zone with black-red band, it is probably a compressed region, maybe an image. The parts where the higher part of the image have a sharp indenting followed by a logarithmic growth denote parts where the entropic contents of the file have a sharp change, where the data is very different and the old statistics collected up to that point are of little use.

tk84.dll.zip_483kb.png - Can fv tell apart compressed files from the random file? This is compressed with Zip, at the lowest compression level (result is about 483 kb). In the last part of the image you can see some vertical bands, zip has failed to remove all the redundancy from the input file. Black and red bands are centred at about 200 and 70000. The images contains few curved images, the zip dictionary has probably reset itself some there. On the right there is the last cropped part of a random file (all possible 256 bytes) of the same length of the zip file. You can see the black and red bands are very similar.


tk84.dll.bz2_379kb.png - This is the same dll compressed with a better compressor, a BWT-based one (bz2). The file is shorter (379 kb instead of 483 kb of the zip) and there are less redundancy irregularities left too. The red zone goes a bit higher than in the zip file, it can be seen better in the following images.


tk84.dll.rar_355kb.png - An even better compressor, WinRar set to its max compression. Irregularities are almost vanished, the file is even shorter.


tk84.dll.paq_211kb.png - In the end I have compressed it with one of the best compressors, PAQ8i, by Matt Mahoney. The file is even smaller (211 kb), all traces of irregularities are absent. It seems fv can't tell this file from a purely random one. The final stage of arithmetic compression produces essentially random data. Better tests, used to test pseudorandom generators, maybe can tell this file from a true (psuedo)random one (this file has a short textual header than can be removed).


smiles_noprogressive.jpg.png - This is a very big jpeg image, a file of about 3 MB, this is compressed with "no progressive" style, optimised Huffman encoding. The last stage of the encoding, the entropic one, has made the file like the random one.


smiles_noprogressive_arith.jpg.png - The same image, but the entropic encoding is an arithmetic encoder (semi-standard for jpegs, see Jpegcrop program). The resulting image is smaller (2.9 MB instead of 3.1).


smiles_noprogressive_huffdefault.jpg.png - This is the same image encoded with a baseline Huffman encoding. The image file is even bigger (3.26 MB) and there is some redundancy left that can be seen as faint horizontal (grey) bands at around 50-200 (vertical scale). (WinRar can compress this jpeg file about 0.5%).


smiles_progressive.jpg.png - The same jpeg file, this time as progressive, with Huffman encoding. You can see 6-7 bands where the file statistics suddenly change, they are the layers of the progressive jpeg (each layer contains different frequencies of the image). The file is a bit smaller (2.9 MB) than the notprogressive optimised Huffman encoding, this happens most of the times, probably thanks to the grouping of the frequencies, that allow a better entropic compression.


smiles_progressive_arith.jpg.png - The same progressive jpeg, but with arithmetic encoding. This time the data is more random, and no vertical bands can be seen. This is the smaller jpeg encoding, about 2.8 MB, this happens most of the times.


oldfield.mp3_32kbps_160kb - A small mp3 file with a small datarate (32 kbps). There are two sharp brown horizontal lines, they are probably related to the mp3 blocks. Over those lines there are many fainter lines, at their multiples. At the end at the bottom it can be seen a black vertical line, between 1 and 2, it means there is a single repeated byte.


podcast.mp3_32kbps_20mb - Another mp3 file at low datarate, but this time much longer, about 20 megabytes. This time the brown line is at about 450, and there is a fainter grey band at about 180-200. This time some green can be seen at about 10 millions. Mp3 files too have a final entropic stage that makes the output data almost random.


mozart_jupiter_16s_44khz_5.8mb.wav.png - After the mp3 file I have tried with a purely PCM wav, much more compressible than the mp3. I have chosen a light passage mostly with strings of about 33 seconds in the middle of the Allegro vivace (first movement) of the Jupiter Symphony By Mozart. This wav file is about 6 megabytes, 44.1 kHz, pcm, stereo. There are many redundancies because nearby samples are quite correlated, and there are two very correlated signals (it's a stereo), and being 16 bit signals, the pairs aren't much correlated. At the beginning of a new string start you can see faint restarting curves. Only black and red bands are visible. There is a strong correlation band at 4, the length of a single instant of time. In this image there are lot of other structures that I am not able to explain.


mozart_jupiter_16_44khz.wav.png - Same file, but this is mono. The black horizontal stripe is at 2. At about 100-200 there is an interesting banding. At about 10000 there are different interesting bands in the red zone. The green zone (4 bytes) is now much more visible, the program can find much more long repetitions.


mozart_jupiter_8s_44khz.wav.png - Same file, but converted to 8 bit/sample, stereo. The banding is now very interesting. Blue too is visible (even if the file is half length of the original one).


mozart_jupiter_8_44khz.wav.png - 8 bit/sample, mono, 44.1 kHz still. Now the fv program can spot lot of repetitions, up to the blue (8 byte) zone. The banding at 60-300 is even stronger, and probably correlated to the typical frequencies emitted by the strings.


mozart_jupiter_8_8khz.wav.png - The same as before, 8 bit/sample, mono, but downsampled to 8 kHz, to show only the most informative armonic part of the signal. The blue parts are vanished. The horizontal banding is now lower, from 10 to about 150. There are two brown zones at the bottom, probably where the volume is very low, and there is strong sequential correlation between samples (1 byte/sample).


nc_000913.fna_4.5mb.png - For the last tests I have chosen a totally different kind of files, some biologic data. This is the complete genome of Escherichia coli K12, each base is a byte. To have a more meaningful image, I have removed the header of this fasta format file and I have removed all the newlines with this Python script. The genome is quite clean and uniform, but can be seen some faint structures anyway. There is a strong brown banding at 3 (vertical axis) relative to the groups of 3 bases that encode for the amino acids. Other horizontal bands can be seen, all in the red/black/brown zone. There there are some faint restating (curved) traces, especially near the end, but the whole file is quite uniform. Only 4 different bytes are used (ACGT) and many repetitions can be found. A file similar to this one (with header and newlines, can be found here).


alt_celera_chr21.fa.filt.png - (gi|89161201|ref|AC_000064.1|AC_000064) Homo sapiens chromosome 21, alternate assembly (based on Celera assembly), whole genome shotgun sequence. I have removed headers and newlines, about 33 MB. Very few structures can be seen, some vertical banding. Ands some irregularities at the very beginning. On the whole it's quite randomly uniform. The strong band relative to 3 (vertical scale) is much more fainter than in the E.coli, I don't know why, maybe it's because of the presence of not coding zones in the genes (not encoding amino acidid with the usual 3-codon code), that makes the 3-letter average signal fainter.


alt_celera_chr_y.fa.filt.png - (gi|89161219|ref|AC_000067.1|AC_000067) Homo sapiens chromosome Y, alternate assembly (based on Celera assembly), whole genome shotgun sequence. I have removed headers and newlines, about 11 MB.


ref_chr_y.fa.filt.png - (gi|89161220|ref|NC_000024.8|NC_000024) Homo sapiens chromosome Y, reference assembly, complete sequence. I have removed headers and newlines, about 57 MB.


human_protein.fa.filt.png - A big part of the human proteonome, from the Celera (there are about 20 different bytes here). I have removed headers and newlines, about 21 MB. This is one of the most interesting images in this page. There are some very interesting structures, some horizontal lines, like one at about 80000, plus some fainter lines at 170000 and more. I don't know what they mean (it means that the same sequence repeats over and over, about 260 times, but the sequences are not evenly spaced). The file structure is quite different. There are also some other horizontal structures at 30, 80, 250.

[Go back to the index]