Web Documents - Percolation

Web Documents: Percolation

how to start such a working project, and where shall we go to ?

(This document is subject of continuous improvement :-)...)

Goal and Tools

Web documents clustering is a fairly popular question nowadays. In order to get sufficient number of text documents for coming to resonable statistics, Web servers are ideal sources for selecting such documents. Let us suppose, that approximately one, or two, or more thousand documents from the news server, say, clari.usa.top , are at our disposal. A standard Perl program helps to get them. When I did this some time ago, a fragment of the list of documents' addresses looked like this:

..............................................
<37a6bb62.0@news.accessus.net
37a6bbe9.0@news.accessus.net
37a6bc41.0@news.accessus.net
9tzp3.49$uh3.203823@nnrp2.proxad.net
37A725C2.2B19@usit.net
37A6FEAA.72DA31BE@remove-this.hotmail.com
7o6v2p$9t8$1@coranto.ucs.mun.ca
7o72gm$9dg@newsops.execpc.com
IQstZDAjJxp3Ewd5@clarsach.demon.co.uk
19990803135654.09209.00006817@ng-cc1.aol.com
7o7bb2$a5d$1@news5.isdnet.net
.............................................. ..............................................
7o7c3r$rrj$1@coranto.ucs.mun.ca
19990803153814.02940.00000300@ng-fm1.aol.com
7o7gp1$ek5$1@autumn.news.rcn.net
19990803162122.05734.00003282@ng-cl1.aol.com
lTIp3.12047$J5.136882@c01read02-admin.service.talkway.com
7o7pqs$v0a$1@news.inc.net
7o7q8h$v9e$1@news.inc.net
7o7r03$1034$1@news.inc.net
7o7re0$10b1$1@news.inc.net
7o7ob4$el5$1@bgtnsc03.worldnet.att.net
7o7sdf$118p$1@news.inc.net
..............................................

There are two possibilities then. In the first one, you can predefine a set of key-words. Then, again with a help of Perl, patterns matching can be done in order to count how many times each word of the set of key-words is contained in each document. After that, any document can be represented by a vector whose co-ordinates are corresponding integer numbers for any given key-word. Strictly speaking, a common procedure is consistent with normalization of a vector to a unit sphere. Then all the documents can be interpreted as located on the unit sphere, like stars on the night sky. A big difference with such an interpretation is the sky is a projection from our three-dimensional world, but a typical number of key-words is of order one hundred. So, we deal with a multi-dimensional sky! With its non-negative sector.

Well, the main goal is ahead:

How the documents are arranged in clusters?
Definitions:
Two documents are considered as directly coupled, if a distance on the unit sphere between them is less than some predefined length, say, r_c .

A cluster is defined as a set of documents in which any two can be coupled through a "chain", at least, one "chain", of directly coupled documents.
Such a definition is for a maximally porous sort of a cluster which is a subject of consideration below.
In the opposite case of a maximally dense cluster all pairs of documents are directly coupled.

Now, we can start selecting documents within a cluster. One possible solution is consistent with creating an associative array in which subscript elements (documents) point to directly coupled documents. Starting from any document, and using a recursive algorithm, a cluster can be passed entirely.

How to search: an algorithm
Ascribe zeroes to the values associated with all documents.

Select a document, say, doc1, ascribe 1 to doc1, remove those pairs of the associative array which point to doc1.

Select one of the documents to which doc1 points, say, doc2. Ascribe 1 to doc2, remove those pairs of the associative array which point to doc2. Start to form a linked list: doc1=>doc2.

Select one of the documents to which doc2 points, say, doc3. Ascribe 1 to doc3, remove those pairs of the associative array which point to doc3. Continue to form a linked list: doc1=>doc2=>doc3.
However, if doc2 points to nothing, remove the last element of the linked list: it becomes doc1. Select the other document to which doc1 points.

Repeat with doc3 the operations performed for doc2.

And so on...

Finally, after the cluster is passed completely, the linked list becomes doc1, and no associated values exist which doc1 points to. All documents, belonging to the cluster, get a value of 1.

Repeat with other clusters...

The simplest statistical characteristic of web-documents might be the following: how many clusters versus r_c exist? It seems, this function should be universal, if a set of key-words was representative...
Last updated: Jan 18 2000

Goal and Tools
Web documents clustering is a fairly popular question nowadays. In order to get sufficient number of text documents for coming to resonable statistics, Web servers are ideal sources for selecting such documents. Let us suppose, that approximately one, or two, or more thousand documents from the news server, say, clari.usa.top , are at our disposal. A standard *Perl* program helps to get them. When I did this some time ago, a fragment of the list of documents' addresses looked like this:
.............................................. <37a6bb62.0@news.accessus.net 37a6bbe9.0@news.accessus.net 37a6bc41.0@news.accessus.net 9tzp3.49$uh3.203823@nnrp2.proxad.net 37A725C2.2B19@usit.net 37A6FEAA.72DA31BE@remove-this.hotmail.com 7o6v2p$9t8$1@coranto.ucs.mun.ca 7o72gm$9dg@newsops.execpc.com IQstZDAjJxp3Ewd5@clarsach.demon.co.uk 19990803135654.09209.00006817@ng-cc1.aol.com 7o7bb2$a5d$1@news5.isdnet.net ..............................................		.............................................. 7o7c3r$rrj$1@coranto.ucs.mun.ca 19990803153814.02940.00000300@ng-fm1.aol.com 7o7gp1$ek5$1@autumn.news.rcn.net 19990803162122.05734.00003282@ng-cl1.aol.com lTIp3.12047$J5.136882@c01read02-admin.service.talkway.com 7o7pqs$v0a$1@news.inc.net 7o7q8h$v9e$1@news.inc.net 7o7r03$1034$1@news.inc.net 7o7re0$10b1$1@news.inc.net 7o7ob4$el5$1@bgtnsc03.worldnet.att.net 7o7sdf$118p$1@news.inc.net ..............................................
	There are two possibilities then. In the first one, you can predefine a set of key-words. Then, again with a help of *Perl, patterns matching can be done in order to count how many times each word of the set of key-words is contained in each document. After that, any document can be represented by a vector* whose co-ordinates are corresponding integer numbers for any given key-word. Strictly speaking, a common procedure is consistent with normalization of a vector to a unit sphere. Then all the documents can be interpreted as located on the unit sphere, like stars on the night sky. A big difference with such an interpretation is the sky is a projection from our three-dimensional world, but a typical number of key-words is of order one hundred. So, we deal with a multi-dimensional sky! With its non-negative sector.
Well, the main goal is ahead: How the documents are arranged in clusters? Definitions: Two documents are considered as directly coupled, if a distance on the unit sphere between them is less than some predefined length, say, r_c . A cluster is defined as a set of documents in which any two can be coupled through a "chain", at least, one "chain", of directly coupled documents. Such a definition is for a maximally porous sort of a cluster which is a subject of consideration below. In the opposite case of a maximally dense cluster all pairs of documents are directly coupled. Now, we can start selecting documents within a cluster. One possible solution is consistent with creating an associative array in which subscript elements (documents) point to directly coupled documents. Starting from any document, and using a recursive algorithm, a cluster can be passed entirely. How to search: an algorithm Ascribe zeroes to the values associated with all documents. Select a document, say, doc1, ascribe 1 to doc1, remove those pairs of the associative array which point to doc1. Select one of the documents to which doc1 points, say, doc2. Ascribe 1 to doc2, remove those pairs of the associative array which point to doc2. Start to form a linked list: doc1=>doc2. Select one of the documents to which doc2 points, say, doc3. Ascribe 1 to doc3, remove those pairs of the associative array which point to doc3. Continue to form a linked list: doc1=>doc2=>doc3. However, if doc2 points to nothing, remove the last element of the linked list: it becomes doc1. Select the other document to which doc1 points. Repeat with doc3 the operations performed for doc2. And so on... Finally, after the cluster is passed completely, the linked list becomes doc1, and no associated values exist which doc1 points to. All documents, belonging to the cluster, get a value of 1. Repeat with other clusters... The simplest statistical characteristic of web-documents might be the following: how many clusters versus r_c exist? It seems, this function should be universal, if a set of key-words was representative...
Last updated:	Jan 18 2000