Web Documents: Percolation

how to start such a working project, and where shall we go to ?

(This document is subject of continuous improvement :-)...)

Goal and Tools

Web documents clustering is a fairly popular question nowadays. In order to get sufficient number of text documents for coming to resonable statistics, Web servers are ideal sources for selecting such documents. Let us suppose, that approximately one, or two, or more thousand documents from the news server, say, clari.usa.top , are at our disposal. A standard Perl program helps to get them. When I did this some time ago, a fragment of the list of documents' addresses looked like this:

..............................................
<37a6bb62.0@news.accessus.net
37a6bbe9.0@news.accessus.net
37a6bc41.0@news.accessus.net
9tzp3.49$uh3.203823@nnrp2.proxad.net
37A725C2.2B19@usit.net
37A6FEAA.72DA31BE@remove-this.hotmail.com
7o6v2p$9t8$1@coranto.ucs.mun.ca
7o72gm$9dg@newsops.execpc.com
IQstZDAjJxp3Ewd5@clarsach.demon.co.uk
19990803135654.09209.00006817@ng-cc1.aol.com
7o7bb2$a5d$1@news5.isdnet.net
..............................................
..............................................
7o7c3r$rrj$1@coranto.ucs.mun.ca
19990803153814.02940.00000300@ng-fm1.aol.com
7o7gp1$ek5$1@autumn.news.rcn.net
19990803162122.05734.00003282@ng-cl1.aol.com
lTIp3.12047$J5.136882@c01read02-admin.service.talkway.com
7o7pqs$v0a$1@news.inc.net
7o7q8h$v9e$1@news.inc.net
7o7r03$1034$1@news.inc.net
7o7re0$10b1$1@news.inc.net
7o7ob4$el5$1@bgtnsc03.worldnet.att.net
7o7sdf$118p$1@news.inc.net
..............................................

There are two possibilities then. In the first one, you can predefine a set of key-words. Then, again with a help of Perl, patterns matching can be done in order to count how many times each word of the set of key-words is contained in each document. After that, any document can be represented by a vector whose co-ordinates are corresponding integer numbers for any given key-word. Strictly speaking, a common procedure is consistent with normalization of a vector to a unit sphere. Then all the documents can be interpreted as located on the unit sphere, like stars on the night sky. A big difference with such an interpretation is the sky is a projection from our three-dimensional world, but a typical number of key-words is of order one hundred. So, we deal with a multi-dimensional sky! With its non-negative sector.

Well, the main goal is ahead:

How the documents are arranged in clusters?

Definitions:

  • Two documents are considered as directly coupled, if a distance on the unit sphere between them is less than some predefined length, say, rc .
  • A cluster is defined as a set of documents in which any two can be coupled through a "chain", at least, one "chain", of directly coupled documents.
    Such a definition is for a maximally porous sort of a cluster which is a subject of consideration below.
    In the opposite case of a maximally dense cluster all pairs of documents are directly coupled.

    Now, we can start selecting documents within a cluster. One possible solution is consistent with creating an associative array in which subscript elements (documents) point to directly coupled documents. Starting from any document, and using a recursive algorithm, a cluster can be passed entirely.

    How to search: an algorithm

  • Ascribe zeroes to the values associated with all documents.
  • Select a document, say, doc1, ascribe 1 to doc1, remove those pairs of the associative array which point to doc1.
  • Select one of the documents to which doc1 points, say, doc2. Ascribe 1 to doc2, remove those pairs of the associative array which point to doc2. Start to form a linked list: doc1=>doc2.
  • Select one of the documents to which doc2 points, say, doc3. Ascribe 1 to doc3, remove those pairs of the associative array which point to doc3. Continue to form a linked list: doc1=>doc2=>doc3.
    However, if doc2 points to nothing, remove the last element of the linked list: it becomes doc1. Select the other document to which doc1 points.
  • Repeat with doc3 the operations performed for doc2.
  • And so on...
  • Finally, after the cluster is passed completely, the linked list becomes doc1, and no associated values exist which doc1 points to. All documents, belonging to the cluster, get a value of 1.
  • Repeat with other clusters...
  • The simplest statistical characteristic of web-documents might be the following: how many clusters versus rc exist? It seems, this function should be universal, if a set of key-words was representative...
    Last updated: Jan 18 2000
    1