last update: Sun Aug 20 2000

deTect

Welcome to deTect! This pair of programs was originally written as an auxiliary tool for the Hnews news-reader.
Hnews makes downloading huge numbers of files from the news fast and easy, but news is highly redundant: you will find the same attachments again and again. The problem therefore arose of how to detect identical copies of a file, in order to avoid storing the same file (e.g. a JPEG) multiple times.

DetectEquals

This comparatively simple problem is taken care of by DetectEquals. The program expects one or more directory names as parameters and compares all files in these directories, including all their subdirectories: every single file with every other one. Since the number of possible pairs grows with the square of the number of files, some tricks are employed to keep the computational effort down. As currently implemented, on a computer with a reasonably fast hard disk, directories containing a few hundred thousand files can easily be scanned. In particular, there is no need to watch DetectEquals while it works:
You can simply start it in the evening, and the next morning you will find a list of all pairs of identical copies it has encountered. Using that method, you can search terabytes of disk space.
On a Pentium 133 with a 5 GB Quantum hard disk running Linux 2.2.5, checking 10,000 images takes approximately one minute.
DetectEquals is easy to compile and should be completely platform independent within the UNIX family. It does not require any external helper programs and can compare any type of file (text, binary, images, ...).
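
To illustrate the kind of trick that keeps the pairwise comparison cheap, here is a minimal Python sketch (not the actual DetectEquals source). It assumes that files are first grouped by size, so that only files of equal size are ever read and compared byte by byte. The function name find_identical_pairs is made up for this example.

```python
import filecmp
import os
from collections import defaultdict
from itertools import combinations


def find_identical_pairs(directories):
    """Return a sorted list of (path_a, path_b) pairs with identical content."""
    by_size = defaultdict(list)
    seen = set()
    for top in directories:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                real = os.path.realpath(path)
                if real in seen:  # directory was given twice or overlaps another
                    continue
                seen.add(real)
                by_size[os.path.getsize(path)].append(path)

    pairs = []
    for paths in by_size.values():
        # Files of different size can never be identical, so only files
        # within the same size group are opened and compared.
        for a, b in combinations(sorted(paths), 2):
            if filecmp.cmp(a, b, shallow=False):
                pairs.append((a, b))
    return sorted(pairs)
```

Calling find_identical_pairs(["news/images", "archive"]) (hypothetical directory names) would return the list of duplicate pairs found below these two directories.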

DetectSimilars

But this solves only part of the problem: by far the largest portion of attachments in the news are images, and with these you will frequently find not IDENTICAL copies of a file but EXTREMELY SIMILAR ones. Such modifications include added text, resized images, modified colors or contrast, montages, or parts of images. Apart from the last type, these can be detected by computing what are called Normalized Regular Moments of images, and that is exactly what DetectSimilars does.
The main difficulty is that even if two images are just modifications of the same original, the computed Regular Moments will never be exactly the same. DetectSimilars therefore cannot decide unambiguously whether two images stem from the same original. Naturally not, since, unlike "equal", "similar" is a matter of degree.
Instead, DetectSimilars produces a long list of all those images for which it found a "reasonably similar" counterpart. You then have to go through this list and decide in each case whether the pair that DetectSimilars found to be "similar" really consists of two versions of the same image.
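
To give an idea of how moments can make resized or brightened images comparable, here is a small Python sketch. It is only an illustration under assumptions, not the formula DetectSimilars actually uses: it takes regular (geometric) moments of a gray-scale image, with pixel coordinates scaled to the range 0..1 and the result divided by the total brightness. The function names and the choice of maximum order 3 are made up for this example.

```python
def normalized_moments(image, max_order=3):
    """image: list of rows of gray values. Returns {(p, q): moment}."""
    height = len(image)
    width = len(image[0])
    total = float(sum(sum(row) for row in image))
    if total == 0:
        raise ValueError("image is completely black; moments are undefined")
    moments = {}
    for p in range(max_order + 1):
        for q in range(max_order + 1 - p):
            acc = 0.0
            for y, row in enumerate(image):
                # Pixel centers scaled to 0..1, so image size drops out.
                y_term = ((y + 0.5) / height) ** q
                for x, gray in enumerate(row):
                    acc += ((x + 0.5) / width) ** p * y_term * gray
            # Dividing by the total brightness removes uniform brightening.
            moments[(p, q)] = acc / total
    return moments


def moment_distance(m1, m2):
    """Largest absolute difference between corresponding moments."""
    return max(abs(m1[key] - m2[key]) for key in m1)
```

With such a measure, two images would be reported as "reasonably similar" when moment_distance stays below some tolerance. Any concrete tolerance is a tuning choice, which is why the final decision is left to you.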
Another difficulty that DetectSimilars has to deal with, and which is far less severe for DetectEquals, is the amount of data that needs to be handled. In over 99% of all cases (pairs to be compared), DetectEquals can see that the two files at hand are not identical merely from the fact that their sizes differ, i.e. it does not need to touch the files themselves at all.
DetectSimilars is quite different: it needs to decode/read every single image and compute its Regular Moments before it can compute any similarities. That is, the effort to detect a non-similarity (the normal, most frequent case) is just as big as that to detect a similarity (the special, rare case).
Computing the Regular Moments for some 10,000 images, though, takes several hours, even on a fast computer. To minimize this problem, DetectSimilars has not only been optimized down to the last detail, it also uses quite elaborate mechanisms to avoid re-computations wherever possible.
The main "trick" that DetectSimilars applies is to use a storage file. Whenever it computes the Regular Moments for an image, it stores the name (including full path), the file's cksum value and the computed Regular Moments in that file. When you run DetectSimilars again at some later time (e.g. after having deleted some old images and added a few new ones), it will immediately recognize a file it has already handled before and just read its Regular Moments from the storage file. To make sure that it uses the correct values, it computes the cksum value of the file at hand and compares it with the one in the storage file. If the two differ, the file at hand has the same name and location but has evidently replaced the one that was there when DetectSimilars was last run, so the stored values cannot be used for it.
Even though cksum is VERY fast, it still slows down the process of fetching the Regular Moments from the storage file: fetching the pre-computed values for 1000 files takes a little over half a minute. If you are sure that none of the images you are currently comparing has replaced one with the same name since you last ran DetectSimilars, you can use "trusting" mode (option "-t"). In trusting mode, DetectSimilars assumes that a file with the same name and at the same location as in its storage file really is that file and does not perform the cksum verification. In this mode, fetching 1000 files as above takes less than 2 seconds.
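
The lookup logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the DetectSimilars code: the cache is assumed to be a dictionary mapping a path to a (checksum, moments) pair, zlib's CRC-32 stands in for the cksum utility (which uses a different CRC variant), and get_moments and compute are hypothetical names.

```python
import zlib


def file_checksum(path):
    """CRC-32 of the file content (a stand-in for the cksum utility)."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def get_moments(path, cache, compute, trusting=False):
    """Return the moments for path, reusing the cached entry when valid.

    cache maps path -> (checksum, moments); compute(path) is the slow step.
    """
    entry = cache.get(path)
    if entry is not None and trusting:
        # Trusting mode: same name and location is taken as the same file.
        return entry[1]
    checksum = file_checksum(path)
    if entry is not None and entry[0] == checksum:
        return entry[1]
    # New file, or a file that replaced an older one of the same name.
    moments = compute(path)
    cache[path] = (checksum, moments)
    return moments
```

The sketch also shows where the time goes: in non-trusting mode every file has to be read once for the checksum, while in trusting mode a known file is not opened at all.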
Once DetectSimilars has computed (for new files) or read (for old files) the Regular Moments for all files it is told to compare, it erases all storage-file entries belonging to files that no longer exist (i.e. files you have deleted since you last ran the program) and then updates the storage file. Then it starts the actual comparison.
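
As a rough Python illustration of this clean-up step (again only a sketch, not the real code): entries for vanished files are dropped and the storage file is rewritten. The JSON format, the temporary-file name and the function name prune_and_save are assumptions made for this example; the real storage file format is not described here.

```python
import json
import os


def prune_and_save(cache, storage_path):
    """Drop entries whose file no longer exists, then rewrite the storage file."""
    kept = {path: entry for path, entry in cache.items() if os.path.exists(path)}
    tmp_path = storage_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(kept, f)
    # Swap the new file into place in one step, so that an interrupted run
    # does not leave a half-written storage file behind.
    os.replace(tmp_path, storage_path)
    return kept
```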
Useful note: Since the one-time process of computing the Regular Moments for all the images you have collected so far can take very long indeed, you may want to use the compute-only mode (option "-c") for it. In compute-only mode, DetectSimilars only computes and stores the Regular Moments of the files you specify. You can call it a few times, each time for a subset of your images, thus building up the storage file step by step. Once the Regular Moments have been precomputed for most of your images, you can work with DetectSimilars very efficiently.
If your collection changes rather frequently, it might be a good idea to run DetectSimilars automatically in compute-only, trusting mode every midnight and perhaps once a week in non-trusting mode. That way, your storage file will always be up to date, which makes DetectSimilars as fast as possible when you want to use it.

DetectSimilars supports the comparison by printing the text of a shell script to stdout (standard output) unless you specify an output file; it is therefore best to start it with the -o option, e.g. "DetectSimilars -osimilars file1 file2 ...". All text that does not belong in that script (e.g. progress reports) is still displayed on the screen. After DetectSimilars has finished, you can execute 'similars' to inspect the comparison results.
There are two different formats for this shell script that you can choose between. If you only have a rather small number of images or only need to run the comparison once, you may want to use the simpler "built-in" comparison routine (this is the default).
For each pair of similar images, e.g. 'a.jpg' and 'b.jpg', DetectSimilars then writes a line like 'MostSimilar "a.jpg" "b.jpg"' into the output file. MostSimilar is a tool that comes with this package. It prints a line with the two names and starts an xv (which is hopefully installed on your machine) for each of them. It automatically proceeds to the next pair as soon as you close the current two xv windows. Thus, after DetectSimilars has finished, all you need to do is start the output file (the script) that has just been created ('similars [Enter]'); you will then be shown one pair after the other. Paste the lines corresponding to pairs that really are different versions of the same image into an editor, and afterwards delete one image of each such pair. Especially for large numbers of pairs, this method is rather tedious!
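
The following Python sketch shows how such a script could be generated from a list of pairs. It is an illustration only: the "#!/bin/sh" first line is an assumption, and shlex.quote is used here in place of the plain double quotes shown above, so that file names containing spaces or shell special characters survive intact. The function name comparison_script is made up.

```python
import shlex


def comparison_script(pairs, viewer="MostSimilar"):
    """Return the text of a shell script with one viewer call per pair."""
    lines = ["#!/bin/sh"]
    for first, second in pairs:
        # shlex.quote only adds quotes where the shell would need them.
        lines.append(f"{viewer} {shlex.quote(first)} {shlex.quote(second)}")
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file named 'similars' and making it executable gives a script of the kind described above.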
The second, much more elegant method uses the program MakeThumbs, which is also available on this page. If you want to use this method, you must install that program. Attention: If you already have MakeThumbs, make sure that you have at least version 1.4; older versions are not compatible with this version of DetectSimilars.
When you run DetectSimilars with the MakeThumbs option (-g), it still prints a shell script, which you have to handle just as described above. But running this shell script calls MakeThumbs and thus creates a number of thumbnail sheets, each showing 15 pairs of "similar" images. Look at these thumbnail sheets (jpeg images) with whatever image viewer (xv, NetScape, ...) you want. You can then find the matching pairs on those thumbnails very easily and comfortably.
But the "-g" option does even more: it also creates HTML files for the thumbnail images that were created. If you have NetScape on your computer, load the file "similars.html", which comes with the deTect package, into it, and you will see that finding the similar pairs can hardly be made any more comfortable. You see one thumbnail sheet (15 pairs) at a time. The thumbnails are clickable, so you can have a closer look at all those pairs which look similar and decide which image you want to erase. Below the thumbnail sheet there is a table: once you have decided that you want to remove a specific image, just click on its name in that table; the name is then added to a delete-list which is held in a separate window. If you clicked on a wrong name by mistake, just click it again, and it is removed from the delete-list.
At the very bottom of the page is a 'next page' button that leads you on to the next 15 pairs. Once you have reached the end or decided that you have found enough duplicates, just save the delete-list that has accumulated in the window, make it executable and execute it.
Attention: DetectSimilars requires three programs:
  1. djpeg to convert ".jpg" to ".pgm"-format
  2. convert to convert ".tif" images to ".pgm"-format
  3. xv as described above (only if not using the "-g" option)
DetectSimilars itself can only read the ".pgm" and ".gif" formats directly. The conversion using these tools is therefore necessary for all other image formats.
A suggestion if you do not have djpeg (and know a little C/C++): I only use djpeg for jpeg images because it is about 10 times faster than convert. If you do have convert, modify "DetectSimilars.cpp" such that it uses convert instead of djpeg.
If these tools are not installed on your computer, you need to find or install other converters that do the same job before you can use DetectSimilars. If your tools use a different command line syntax, you will have to modify the DetectSimilars source code accordingly!
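
To make the role of the converters concrete, here is a Python sketch of how a converter command could be chosen by file extension. It is not taken from "DetectSimilars.cpp": the exact command lines used there are not shown on this page. The two paths are the defaults quoted in the installation section below, "-grayscale" and "-pnm" are standard djpeg switches, "pgm:-" is the ImageMagick syntax for writing PGM to standard output, and the function name pgm_command is made up.

```python
import os

DJPEG = "/usr/bin/djpeg"            # default path from graphics.h
CONVERT = "/usr/X11R6/bin/convert"  # default path from graphics.h


def pgm_command(path):
    """Return the argument list that writes path as PGM to stdout.

    Returns None for formats that need no conversion (.pgm, .gif).
    """
    ext = os.path.splitext(path)[1].lower()
    if ext in (".pgm", ".gif"):
        return None
    if ext in (".jpg", ".jpeg"):
        return [DJPEG, "-grayscale", "-pnm", path]
    if ext in (".tif", ".tiff"):
        return [CONVERT, path, "pgm:-"]
    raise ValueError("no converter known for " + path)
```

The returned list can be handed to subprocess.run(cmd, stdout=subprocess.PIPE, check=True) to obtain the PGM data. If your converters use a different syntax, only the two lists above would have to change.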
Do not hesitate to consult me if you have problems installing or using DetectSimilars. I know it isn't the easiest program, but it is worth the effort! I have frequently been amazed at the matches it found, many of which I would never have recognized as being related if this program had not explicitly drawn my attention to them!

INSTALLING deTect

First create a directory for deTect, move the downloaded deTect-archive into it and unpack it.
While DetectEquals can be installed "as is", DetectSimilars needs some adjustments first. To apply these, load "graphics.h" into an editor (e.g. "emacs graphics.h"). You then have to adjust the following two lines, starting at line 13:
  1. Line 16: #define DJPEG "/usr/bin/djpeg"
  2. Line 22: #define DTIFF "/usr/X11R6/bin/convert"
In both cases the path specifications at the end have to be modified so that they point to the locations of "djpeg" and "convert" respectively. If you do not have these programs installed on your computer, you can still compile DetectSimilars; just comment out the respective line(s). If you do not have "convert", DetectSimilars will not be able to read TIFF files. They are rather rare, so that does not really do much harm. If you do not have "djpeg", though, JPEG files cannot be read. Since they make up about 95% of all image files on the Internet, you should first obtain and install "djpeg" and only then continue with the installation of DetectSimilars.

If the program "cksum" is not located in its usual directory "/usr/bin" on your computer, you also have to adapt the entry in line 73 of "DetectSimilars.cpp". That should normally not be necessary, though.

After these adjustments have been made, simply type
make 
The two files DetectEquals and DetectSimilars should be created. You may move DetectEquals into a general directory for executables, e.g. "/usr/bin". DetectSimilars, though, uses and creates so many auxiliary files that it has to be run in its own directory. I suggest that you just run
make clean 
after successfully installing the deTect package (this removes the object files in the directory that are no longer needed) and in future simply run DetectSimilars (and DetectEquals) from that directory. If you do copy any of these and the file MostSimilar to a directory for executables, e.g. "/usr/bin", restart the shell you are currently using (otherwise the newly copied programs will not be found), and you are ready to go.
Keep in mind: deTect is still very young and sometimes "uncomfortable" to use, especially DetectSimilars. If you have problems, you are welcome to send me email (see below) and I will try to help to solve the problem. I have little spare time, though, and therefore cannot guarantee anything.




Download the deTect-archive (ZIP, TGZ) or go back to the Unix section?


