deTect
Welcome to deTect! This pair of programs was originally implemented as
an auxiliary tool for the Hnews news-reader:
Hnews makes downloading huge numbers of files from the news fast and
easy, but news in general is highly redundant, i.e. you will find the same
attachments again and again. So the problem arose of how to detect
identical copies of a file, in order to avoid storing the same file
(e.g. a JPEG image) multiple times.
DetectEquals
This comparatively simple problem is taken care of by DetectEquals. This
program expects one or more directory names as parameters and will then
compare all files in these directories, including all their subdirectories -
every single file with every other one. Since the number of possible pairs
grows proportionally to the square of the number of files, some tricks are
employed to keep the computational effort down. As currently implemented,
directories containing some hundred thousand files can easily be scanned
on a computer with a reasonably fast hard disk. In particular, there is no
need to watch DetectEquals while it works:
You can simply start it in the evening, and the next morning
you will find a list of all pairs of identical copies it has encountered.
Using that method, you can search terabytes of disk space.
On a Pentium 133 with a 5GB Quantum hard disk running Linux 2.2.5,
checking 10,000 images takes approximately one minute.
DetectEquals is easy to compile and should - within the UNIX family - be
completely platform independent. It does not require any external helper
programs and can compare any type of file (text, binary, images, ...).
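The tricks themselves are not spelled out here, but one of them is mentioned further down: most pairs can be ruled out by file size alone. The following C++ sketch illustrates that idea under stated assumptions. Files are grouped by size, and only files of equal size are compared byte by byte. The function names (fileSize, sameContent, findIdenticalPairs) are made up for this illustration and are not taken from the DetectEquals source.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Size of a file in bytes, or -1 if it cannot be opened.
long long fileSize(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::ate);
    if (!in) return -1;
    return static_cast<long long>(in.tellg());
}

// Byte-wise comparison of two files.
bool sameContent(const std::string& a, const std::string& b) {
    std::ifstream fa(a.c_str(), std::ios::binary);
    std::ifstream fb(b.c_str(), std::ios::binary);
    if (!fa || !fb) return false;
    std::istreambuf_iterator<char> ia(fa), ib(fb), end;
    while (ia != end && ib != end) {
        if (*ia != *ib) return false;
        ++ia;
        ++ib;
    }
    return ia == end && ib == end;  // both files must end together
}

// Hypothetical helper: returns all pairs of identical files.
// Only files that share a size are ever opened for comparison.
std::vector<std::pair<std::string, std::string> >
findIdenticalPairs(const std::vector<std::string>& paths) {
    std::map<long long, std::vector<std::string> > bySize;
    for (std::size_t i = 0; i < paths.size(); ++i) {
        const long long size = fileSize(paths[i]);
        if (size >= 0) bySize[size].push_back(paths[i]);
    }

    std::vector<std::pair<std::string, std::string> > result;
    std::map<long long, std::vector<std::string> >::const_iterator it;
    for (it = bySize.begin(); it != bySize.end(); ++it) {
        const std::vector<std::string>& group = it->second;
        for (std::size_t i = 0; i < group.size(); ++i)
            for (std::size_t j = i + 1; j < group.size(); ++j)
                if (sameContent(group[i], group[j]))
                    result.push_back(std::make_pair(group[i], group[j]));
    }
    return result;
}
```

Within a group of equally sized files the comparison is still pairwise. A real tool would probably add further shortcuts (for example checksums) for large groups, but that is beyond this sketch.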
DetectSimilars
But this only solves part of the problem: by far the largest portion of
attachments in the news are images, and with these you will frequently
find not IDENTICAL copies of a file, but EXTREMELY SIMILAR
ones. Such modifications include added text, resized images, modified
colors or contrast, montages, or parts of images. Apart from the last type,
these can be detected by computing what are called Normalized Regular
Moments of images, and that is exactly what DetectSimilars does.
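The exact formula behind the Normalized Regular Moments is not given here. As a rough illustration only, the sketch below computes the textbook normalized central moments of a grayscale image, eta_pq = mu_pq / mu_00^(1 + (p+q)/2), which do not change when the image content is shifted and change only slightly when it is resized. The type GrayImage, the function names and the choice of orders (p+q = 2 and 3) are assumptions made for this example, not the DetectSimilars implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical image type: 8-bit gray values, stored row by row.
struct GrayImage {
    int width;
    int height;
    std::vector<unsigned char> pixels;  // width * height values
};

// Raw moment m_pq = sum over all pixels of x^p * y^q * I(x, y).
double rawMoment(const GrayImage& img, int p, int q) {
    double sum = 0.0;
    for (int y = 0; y < img.height; ++y)
        for (int x = 0; x < img.width; ++x)
            sum += std::pow(static_cast<double>(x), p)
                 * std::pow(static_cast<double>(y), q)
                 * img.pixels[static_cast<std::size_t>(y) * img.width + x];
    return sum;
}

// Returns eta_20, eta_11, eta_02, eta_30, eta_21, eta_12, eta_03.
std::vector<double> normalizedMoments(const GrayImage& img) {
    const double m00 = rawMoment(img, 0, 0);
    if (m00 <= 0.0) return std::vector<double>(7, 0.0);  // all-black image

    // Centroid of the gray values; measuring from it removes translation.
    const double cx = rawMoment(img, 1, 0) / m00;
    const double cy = rawMoment(img, 0, 1) / m00;

    static const int orders[7][2] = {
        {2, 0}, {1, 1}, {0, 2}, {3, 0}, {2, 1}, {1, 2}, {0, 3}};

    std::vector<double> eta;
    for (int k = 0; k < 7; ++k) {
        const int p = orders[k][0];
        const int q = orders[k][1];
        double mu = 0.0;  // central moment mu_pq
        for (int y = 0; y < img.height; ++y)
            for (int x = 0; x < img.width; ++x)
                mu += std::pow(x - cx, p) * std::pow(y - cy, q)
                    * img.pixels[static_cast<std::size_t>(y) * img.width + x];
        // Dividing by a power of m00 compensates for the image size.
        eta.push_back(mu / std::pow(m00, 1.0 + (p + q) / 2.0));
    }
    return eta;
}
```

These particular values are not invariant to contrast or color changes, so a real detector would need additional normalization for those cases.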
The main difficulty is that even if two images are just modifications of
the same original, the computed Regular Moments will never be exactly the
same. DetectSimilars therefore cannot unambiguously decide whether two
images stem from the same original - naturally not, since, unlike "equal",
"similar" is a gradual notion.
Instead, DetectSimilars will produce a long list of all
those images for which it found a "reasonably similar" counterpart. You
will then have to go through this list and decide in each case which of
the pairs that DetectSimilars found to be "similar" are indeed just
different versions of the same image.
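How "reasonably similar" is measured is not stated. One plausible reading, sketched below in C++, is a distance between the moment vectors of two images combined with a cut-off value. The Euclidean distance, the threshold parameter and all names (ImageFeatures, momentDistance, similarCandidates) are assumptions for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: an image name plus its precomputed moments.
struct ImageFeatures {
    std::string name;
    std::vector<double> moments;
};

// Euclidean distance between two moment vectors of equal length.
double momentDistance(const std::vector<double>& a,
                      const std::vector<double>& b) {
    if (a.size() != b.size())
        return std::numeric_limits<double>::infinity();
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Lists every pair whose distance is below the threshold.
// The result is only a list of candidates; a human still has to decide.
std::vector<std::pair<std::string, std::string> >
similarCandidates(const std::vector<ImageFeatures>& images, double threshold) {
    std::vector<std::pair<std::string, std::string> > result;
    for (std::size_t i = 0; i < images.size(); ++i)
        for (std::size_t j = i + 1; j < images.size(); ++j)
            if (momentDistance(images[i].moments, images[j].moments) < threshold)
                result.push_back(std::make_pair(images[i].name, images[j].name));
    return result;
}
```

A smaller threshold yields fewer false alarms but misses more heavily modified copies, which is the gradual nature of "similar" described above.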
Another difficulty that DetectSimilars has to deal with, and which
is less severe for DetectEquals, is the amount of data that
needs to be handled: In over 99% of all cases (pairs to be compared)
DetectEquals can see that the two files it has at hand
are not identical merely from the fact that their sizes differ, i.e. it
does not even need to touch the files themselves.
The case of DetectSimilars is quite different: It needs to
decode/read every single image and compute the Regular Moments for it
before it can compute any similarities. That is, the effort to detect a
non-similarity (the normal, most frequent case) is just as big as that
to detect a similarity (the special, rare case).
Computing the Regular Moments for some 10,000 images, though, will take
several hours, even on a fast computer. In order to minimize this problem,
DetectSimilars has not only been optimized to the very last detail,
it also uses quite elaborate mechanisms to avoid re-computations wherever
possible.
The main "trick" that DetectSimilars applies is to use a storage
file. Whenever it computes the Regular Moments for an image, it stores
the name (including full path), the file's cksum value and the computed
Regular Moments in that file. When you run DetectSimilars again
at some later time (e.g. after having deleted some old images and added
a few new ones), it will immediately recognize a file it already handled
before and just read its Regular Moments from the storage file. To make
sure that it uses the correct values, it will compute the cksum
value of the file it has at hand and compare it with the one stored in
the storage file. If the two differ, the file at hand has the
same name and location but has evidently replaced the
one that was there when DetectSimilars was last run, so the stored
values cannot be used for it.
Even though cksum is VERY fast, it still slows down the process of
fetching the appropriate Regular Moments from the storage file: Fetching
the pre-computed values for 1000 files will take a little over half a
minute. If you are sure that none of the images you are currently
comparing has replaced one with the same name since you last ran
DetectSimilars, you can use trusting mode (option "-t"). In
trusting mode, DetectSimilars will assume that a file with the
same name and at the same location as in its storage file really
is that file and will not perform the cksum verification. In this mode,
fetching 1000 files as above will take less than 2 seconds.
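The lookup logic described above can be sketched in C++ roughly as follows. This is a minimal illustration under assumptions: the storage file is represented by an in-memory map, and the checksum and the moment computation are passed in as callbacks, whereas the real program uses the external cksum tool and its own file format. All names (CacheEntry, MomentCache, momentsFor) are hypothetical.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One record of the (hypothetical) storage file.
struct CacheEntry {
    unsigned long checksum;
    std::vector<double> moments;
};

// Full path of an image -> stored checksum and moments.
typedef std::map<std::string, CacheEntry> MomentCache;

const std::vector<double>& momentsFor(
    MomentCache& cache,
    const std::string& path,
    bool trusting,
    const std::function<unsigned long(const std::string&)>& checksumOf,
    const std::function<std::vector<double>(const std::string&)>& computeMoments)
{
    MomentCache::iterator it = cache.find(path);
    if (it != cache.end()) {
        // Trusting mode: same name and location is taken as the same file.
        if (trusting) return it->second.moments;

        // Otherwise verify that the file has not been replaced.
        const unsigned long sum = checksumOf(path);
        if (sum == it->second.checksum) return it->second.moments;

        // Checksum differs: the stored values belong to an older file.
        it->second.checksum = sum;
        it->second.moments = computeMoments(path);
        return it->second.moments;
    }

    // Unknown file: compute once and remember the result.
    CacheEntry fresh;
    fresh.checksum = checksumOf(path);
    fresh.moments = computeMoments(path);
    return cache.insert(std::make_pair(path, fresh)).first->second.moments;
}
```

The expensive call (computeMoments) only happens for new or replaced files, and in trusting mode even the cheap checksum call is skipped for known names.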
Once DetectSimilars has computed (for new files) or read (for old
files) the Regular Moments for all files it is told to compare, it will
erase all entries in the storage file belonging to files that no longer
exist (e.g. because you have deleted files since you last ran the program)
and then update the storage file. Then it will start the actual comparison.
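The clean-up step can be pictured like this. The C++ sketch below is only an illustration: it assumes the stored entries are held in a map keyed by full path, and it removes every entry whose file can no longer be opened. The names StoredMoments, fileExists and pruneMissing are invented for the example.

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Full path of an image -> its stored moments (checksum omitted for brevity).
typedef std::map<std::string, std::vector<double> > StoredMoments;

// Simple existence test: a file exists if it can be opened for reading.
bool fileExists(const std::string& path) {
    std::ifstream in(path.c_str());
    return in.good();
}

// Erases all entries whose file is gone and returns how many were erased.
// Entries of files that still exist are kept, even if they are not part
// of the current run.
std::size_t pruneMissing(
    StoredMoments& stored,
    const std::function<bool(const std::string&)>& exists = fileExists)
{
    std::size_t removed = 0;
    for (StoredMoments::iterator it = stored.begin(); it != stored.end();) {
        if (!exists(it->first)) {
            it = stored.erase(it);  // erase returns the next valid iterator
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Passing the existence test as a parameter keeps the function easy to try out without touching the disk.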
Useful note: Since the one-time process of computing the Regular
Moments for all the images you have collected so far can take
very long indeed, you may want to use the compute-only mode (option "-c")
to do that. In compute-only mode, DetectSimilars will only compute
and store the Regular Moments of the files you specify. You can
call it a few times, each time for a subset of your images, thus
building up the storage file step by step. Once the Regular Moments have
been precomputed for most of your images, you can then work very
efficiently with DetectSimilars.
If your collection changes rather frequently, it might be a good idea
to automatically execute DetectSimilars in compute-only, trusting
mode every midnight and maybe once a week in non-trusting mode. That way,
your storage file will always be up to date, so DetectSimilars runs as
fast as possible when you want to use it.
DetectSimilars supports the comparison by printing the text of a
shell script to stdout (standard output) unless you specify an output
file; it is therefore best to start it with the -o option, e.g.:
"DetectSimilars -osimilars file1 file2 ... ".
All text that does not belong in that script (e.g. progress reports) will
still be displayed on the screen. After DetectSimilars has finished,
you can then execute 'similars' to inspect the comparison results.
There are two different formats for this shell script that you can
choose between. If you only have a rather small number
of images or only need to run the comparison once, you may want to use
the simpler "built-in" comparison routine (this is the default).
For each pair of similar images, e.g. 'a.jpg'
and 'b.jpg', DetectSimilars will then write a line like
'MostSimilar "a.jpg" "b.jpg"' into the output file.
MostSimilar is a tool that comes with this package. It will print
a line with the two names and start an xv
(which is hopefully installed on your machine) for each of them.
It will automatically proceed to the next pair as soon as you close
the current two xv windows. Thus, after DetectSimilars has
finished, all you need to do is start the output file (the script) that has
just been created ('similars [Enter]');
you will then be shown one pair after the other. Just paste those lines which
correspond to image pairs that really are just different versions of the
same image into an editor, and afterwards erase one image of each of
these pairs.
Especially for large numbers of pairs, this method is rather tedious!
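To make the script format concrete, here is a small C++ sketch that produces lines of the form 'MostSimilar "a.jpg" "b.jpg"' from a list of pairs. The function names are hypothetical, and the escaping of special characters inside the double quotes is an addition for safety; whether DetectSimilars itself escapes file names is not stated.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Wraps a file name in double quotes for use in a shell script.
// The characters " \ $ and ` keep a special meaning inside double
// quotes, so they are escaped with a backslash (an assumption of
// this sketch, not necessarily what DetectSimilars does).
std::string shellQuote(const std::string& name) {
    std::string out = "\"";
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char c = name[i];
        if (c == '"' || c == '\\' || c == '$' || c == '`') out += '\\';
        out += c;
    }
    out += '"';
    return out;
}

// Builds the script text: one MostSimilar call per candidate pair.
std::string comparisonScript(
    const std::vector<std::pair<std::string, std::string> >& pairs) {
    std::ostringstream script;
    for (std::size_t i = 0; i < pairs.size(); ++i)
        script << "MostSimilar " << shellQuote(pairs[i].first) << ' '
               << shellQuote(pairs[i].second) << '\n';
    return script.str();
}
```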
The second, much more elegant version uses the program MakeThumbs,
which is also available on this page. If you want to use this method, you
must install this program. Attention: If you already have
MakeThumbs, make sure that you have at least version 1.4; older
versions are not compatible with this version of DetectSimilars.
When you run DetectSimilars with the MakeThumbs option (-g), it will
still print a shell script, which you have to handle just as
described above. But running this shell script will call MakeThumbs
and thus create a number of thumbnail sheets, each showing 15 pairs of
"similar" images. Look at these thumbnail sheets (JPEG images) with
whatever image viewer (xv, Netscape, ...) you want.
You can then find the matching pairs on those thumbnails very easily
and comfortably.
But the "-g" option does even more: It will also create HTML files for the
thumbnail images that were created. If you have Netscape on your computer,
load the file "similars.html", which comes with the deTect package, into it.
Finding the similar pairs can hardly be any more
comfortable than this: You will see one thumbnail sheet (15 pairs) at a
time. The thumbnails are clickable, so you can have a closer look at all those
pairs which look similar and decide which one you want to erase. Below the
thumbnail sheet there is a table: once you have decided that you want to
remove a specific image,
just click on its name in that table, and its name will be
added to a delete-list which is held in a separate window. If you clicked on
a wrong name by mistake, just click it again, and it will be removed from
the delete-list.
At the very bottom of the page is a 'next page' button that will lead you
on to the next 15 pairs. Once you have reached the end or decided that you
have found enough doubles, just save the delete-list that has accumulated
in the window, make it executable and execute it.
Attention: DetectSimilars requires three programs:
- djpeg to convert ".jpg" to ".pgm"-format
- convert to convert ".tif" images to ".pgm"-format
- xv as described above (only if not using the "-g" option)
DetectSimilars itself can only read the ".pgm" and ".gif" formats directly.
Thus the conversion using these tools is necessary for all other image
formats.
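For readers who want to plug in a different converter, it may help to see how little is needed to read the ".pgm" format. The C++ sketch below parses the binary variant (magic number "P5") with at most 8 bits per pixel, following the Netpbm format description. It is not taken from the DetectSimilars source; the names PgmImage, nextHeaderInt and parsePgm are hypothetical, and the plain-text variant ("P2") and 16-bit images are not handled.

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for a decoded 8-bit PGM image.
struct PgmImage {
    int width;
    int height;
    int maxValue;
    std::vector<unsigned char> pixels;  // row by row, width * height bytes
};

// Reads the next header number, skipping whitespace and '#' comment lines.
bool nextHeaderInt(std::istream& in, int& value) {
    for (;;) {
        in >> std::ws;
        if (in.peek() != '#') break;
        std::string comment;
        std::getline(in, comment);
    }
    return static_cast<bool>(in >> value);
}

// Parses the complete contents of a binary PGM file held in a string.
bool parsePgm(const std::string& bytes, PgmImage& image) {
    std::istringstream in(bytes);
    char first = 0, second = 0;
    in.get(first);
    in.get(second);
    if (!in || first != 'P' || second != '5') return false;

    if (!nextHeaderInt(in, image.width) ||
        !nextHeaderInt(in, image.height) ||
        !nextHeaderInt(in, image.maxValue))
        return false;
    if (image.width <= 0 || image.height <= 0 ||
        image.maxValue <= 0 || image.maxValue > 255)
        return false;

    in.get();  // exactly one whitespace byte separates header and data

    const std::size_t count =
        static_cast<std::size_t>(image.width) * image.height;
    image.pixels.resize(count);
    in.read(reinterpret_cast<char*>(&image.pixels[0]),
            static_cast<std::streamsize>(count));
    return static_cast<std::size_t>(in.gcount()) == count;
}
```

Any converter that can write this format to a file or pipe could in principle take the place of djpeg or convert, provided the source code is adapted to call it.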
A suggestion if you do not have djpeg (and know a little C/C++):
I only use djpeg for jpeg images, because it is about 10 times
faster than convert. If you do have convert, modify
"DetectSimilars.cpp" such that it uses that instead of djpeg.
If these tools are not installed on your computer, you need to find or
install other converters that do the same job before being able to use
DetectSimilars. If your tools use a different command line syntax,
you will have to modify the DetectSimilars source code accordingly!
Do not hesitate to consult me if you have problems installing or using
DetectSimilars. I know it isn't the easiest, but it is worth the
effort!
I have frequently been amazed at the matches it found, many of which I
would never have recognized as being related if this program had not
explicitly drawn my attention to them!
INSTALLING deTect
First create a directory for deTect, move the downloaded deTect-archive into
it and unpack it.
While DetectEquals can be installed "as is", DetectSimilars
needs some adjustments, first.
To apply these, load "graphics.h" into an editor (e.g. "emacs graphics.h").
You then have to adjust the following two lines, starting line 13:
- Line 16: #define DJPEG "/usr/bin/djpeg"
- Line 22: #define DTIFF "/usr/X11R6/bin/convert"
In both cases the path specifications at the end have to be modified in
such a way that they point to the locations of "djpeg" and "convert"
respectively. If you do not have these programs installed on your computer,
you can still compile DetectSimilars; just comment the respective
line(s) out.
If you do not have "convert", DetectSimilars will not be able to
read TIFF files. They are rather rare, so that does not really do much
harm. If you do not have "djpeg", though, JPEG files cannot be read. Since
they make up about 95% of all image files on the Internet, you should first
obtain and install "djpeg" and then continue with the installation of
DetectSimilars afterwards.
If the program "cksum" is not located in its usual directory "/usr/bin"
on your computer, you also have to adapt the entry in line 73 of
"DetectSimilars.cpp". That should normally not be necessary, though.
After these adjustments have been made, simply type
make
The two files DetectEquals and DetectSimilars should be
created.
You may move DetectEquals into a general directory for executables,
e.g. "/usr/bin". DetectSimilars, though, uses and creates so many
auxiliary files that it has to be run in its own directory. I suggest
that you just run
make clean
after successfully installing the deTect package (that will remove the
object files in the directory that are no longer needed) and then in
future just run DetectSimilars (and DetectEquals) from that directory.
these and the file MostSimilar to a directory for executables, e.g.
"/usr/bin", restart the shell you are currently using (otherwise the newly
copied programs will not be found), and you are ready to go.
Keep in mind: deTect is still very young and sometimes
"uncomfortable" to use, especially DetectSimilars. If you have problems,
you are welcome to send me email (see below) and I will try to help to
solve the problem. I have little spare time, though, and therefore cannot
guarantee anything.
Download the deTect-archive (ZIP, TGZ).