Goal and Tools |
||
Web documents clustering is a fairly popular question nowadays. In order to get sufficient number of text documents for coming to resonable statistics, Web servers are ideal sources for selecting such documents. Let us suppose, that approximately one, or two, or more thousand documents from the news server, say, clari.usa.top , are at our disposal. A standard Perl program helps to get them. When I did this some time ago, a fragment of the list of documents' addresses looked like this: |
||
.............................................. <37a6bb62.0@news.accessus.net 37a6bbe9.0@news.accessus.net 37a6bc41.0@news.accessus.net 9tzp3.49$uh3.203823@nnrp2.proxad.net 37A725C2.2B19@usit.net 37A6FEAA.72DA31BE@remove-this.hotmail.com 7o6v2p$9t8$1@coranto.ucs.mun.ca 7o72gm$9dg@newsops.execpc.com IQstZDAjJxp3Ewd5@clarsach.demon.co.uk 19990803135654.09209.00006817@ng-cc1.aol.com 7o7bb2$a5d$1@news5.isdnet.net .............................................. |
.............................................. 7o7c3r$rrj$1@coranto.ucs.mun.ca 19990803153814.02940.00000300@ng-fm1.aol.com 7o7gp1$ek5$1@autumn.news.rcn.net 19990803162122.05734.00003282@ng-cl1.aol.com lTIp3.12047$J5.136882@c01read02-admin.service.talkway.com 7o7pqs$v0a$1@news.inc.net 7o7q8h$v9e$1@news.inc.net 7o7r03$1034$1@news.inc.net 7o7re0$10b1$1@news.inc.net 7o7ob4$el5$1@bgtnsc03.worldnet.att.net 7o7sdf$118p$1@news.inc.net .............................................. | |
![]() |
There are two possibilities then. In the first one, you can predefine a set of key-words. Then, again with a help of Perl, patterns matching can be done in order to count how many times each word of the set of key-words is contained in each document. After that, any document can be represented by a vector whose co-ordinates are corresponding integer numbers for any given key-word. Strictly speaking, a common procedure is consistent with normalization of a vector to a unit sphere. Then all the documents can be interpreted as located on the unit sphere, like stars on the night sky. A big difference with such an interpretation is the sky is a projection from our three-dimensional world, but a typical number of key-words is of order one hundred. So, we deal with a multi-dimensional sky! With its non-negative sector. | |
Well, the main goal is ahead:
How the documents are arranged in clusters?
Definitions:
Now, we can start selecting documents within a cluster. One possible solution is consistent with creating an associative array in which subscript elements (documents) point to directly coupled documents. Starting from any document, and using a recursive algorithm, a cluster can be passed entirely.
How to search: an algorithm
| ||
Last updated: | Jan 18 2000 |