ࡱ>    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSRoot Entry FK8eCCTWordDocument CompObj^n the table. And there you have hashing. A well designed hashing table will yield this number of collisions per insertion or search: T1.0 Hashing Collision Table TABLE FULLNESS COLLISIONS ================================== 50% 2.0 60% 2.5 70% 3.3 90% 10.0 =======>8 TABLE 1.0 ENDS HERE 8<========= That shows better results than even the binary search, with large lists! Research has shown that the most efficient hashing tables, that is, the ones with the least number of collisions, have a prime number of entries. A table size of 1019 should produce less collisions than one of 1000. Research has also shown that if the prime is of the form 4K+3, where K is any positive integer, then collisions are reduced even further. 1019 also meets this second requirement. But, since a table size twice the size of the maximum number of entries it will ever hold is inefficient, the 4K+3 criterion should be abandoned at a certain point in favor of any prime number. Since most of us aren't idiot savants who can just come up with that number to suit our needs, here is a FUNCTION, written by Charles Graham, that accepts the maximum number of entries a table will have, and returns the proper type of prime number, to be used as a hashing table size: S4.0 FSTPRIME.BAS [F210S04.BAS] DEFINT A-Z ' This FUNCTION returns a prime number that is at least 30% greater than ' threshold. It will TRY to return a prime number that also fits into the ' form 4K+3, where k is any integer, but if the prime number is twice the ' size of the threshold, it will ignore this criterion. ' ' Written by Charles Graham ' FUNCTION funFirstPrime (threshold) CONST TRUE = -1 CONST FALSE = NOT TRUE tp30 = INT((threshold * 1.3) + .5) IF tp30 / 2 = tp30 \ 2 THEN tp30 = tp30 + 1 END IF c = tp30 - 2 IF c < 1 THEN c = 1 END IF t2 = threshold * 2 DO c = c + 2 FOR z = 3 TO SQR(c) ind = TRUE IF c / z = c \ z THEN ind = FALSE EXIT FOR END IF NEXT z IF ind THEN IF (c - 3) / 4 = INT((c - 3) / 4) OR c > t2 THEN funFirstPrime = c EXIT DO END IF END IF LOOP END FUNCTION =======>8 SAMPLE 4.0 ENDS HERE 8<========= Q3.1 How do I know when to use sequential searches, when to use binary searches, and when to use hashing? Are there any sort of guidelines? A3.1 Well, first let's consider where hashing is in its prime. (You'll pardon that one, okay?) It is best suited to dynamic list generation where items need to be added on a regular basis, but not deleted, since deletion is fairly difficult to implement on a hashed list. The main strength of a hashing system is its ability to quickly insert new items into the table in such a manner that they can be located quickly "on-the-fly." (See T1.0 for the average number of collisions before locating the correct entry.) Since the collisions increase with the ratio of full buckets to empty buckets, and not with the size of the actual table involved, hashing is more efficient than even binary searches when lists start to become huge. Also, because the binary method of searching demands a sorted list, insertion of items at a later time becomes very cumbersome, even with such techniques as the QuickSort and pushing all entries after the insertion up by one. (Try that technique on a list of 30,000 items, when you only want to add two new items that land near the beginning of the list, and you'll know what disk wear and tear is all about!) Typical applications of the hashing algorithm involve word distribution counts, dictionary table ܥe# PM,l,l X(T):XMS Sans Serif Symbol SystemTimes New RomanCourier New=======>8 SAMPLE 3.0 ENDS HERE 8<========= Q3.0 Advanced Topics -- "Hashing in QuickBASIC" Q3.1 That's pretty fast! I was so used to doing a sequential search on an unsorted list. Now that I have the QuickSort and the BiSearch routines, I can use them as a pair for faster list searches. The thing is, as soon as I want to add something to the list, it puts everything out of order by only one entry, and that hardly seems worth sorting all over again, even with something as fast as Cornel Huth's iterative QuickSort algorithm. Are there any alternatives to this way of doing things? I've heard talk of something called 'hashing' but I don't have any idea of what that is all about. How would I use hashing to avoid having to either resort the list, or use a slow insertion algorithm? Insertion is horrendously slow with disk files. A3.1 Hashing is a very efficient method of record access, be it in RAM or be it with a disk file. Basically, hashed arrays or data files can be quickly searched for a given item by a key index. Whenever you have to add an item to the list, you can at lightening speed, and since hashing "sorts" the array on-the-fly, as it were, there is no need to push records around to add new items to a hashed record. The first concept you must understand with hashing is the key index. Every data structure you design with hashing in mind has to have one field that is unique. This is a prerequisite that you just can't get around. Of course, you could actually combine several fields to generate this unique key, which effectively serves the same purpose. A good application of this is a Fidonet nodelist that uses the node address as the hashing key. No two alike in theory. But just how does this key work? First of all, let's take a look at the Fidonet example. Every full Fidonet address is unique to one node. Assume that the full nodelist has about 15000 entries. Okay, if you want a hashing table to hold 15000 unique entries, then research has shown that the table should be at least 30% greater than the number of entries in it. That would make 19500 table entries. This means that 4500 entries in the list will be left empty for best hashing results. Now, another problem comes up. How does the key come into play? Well, let's look at a simple key: 1153999. Since the list is 19500 long, we certainly can't just put this in record 1153999. Hashing involves dividing the key by the table size and taking the remainder and using that as the record number: 59 ---------- R 3499 19500) 1153999 Okay, 3499 is the record number in which we would put the data. This is the basic idea behind hashing. There is a trouble, however. Collision occurs whenever a node address, when divided by 19500 has a remainder of 3499. That 'bucket' is already full! So, what to do? Generate another bucket number, see if that bucket is full, and if it is, keep generating new buckets until we find an empty bucket. To find an item in a hashed table, we get its key, divide by the table size, and look at the bucket that is represented by the remainder. If that isn't the one, we generate the next bucket address, until we arrive at an empty bucket. If we encounter the correct key BEFORE we arrive at an empty bucket, then we've found our entry. If we arrive at an empty bucket, the record is not igenerators that involve dictionaries that will be added to dynamically, and things of that nature. Consider the word distribution count problem. Each word is a unique key, and so is perfect for hashing. Sequential methods only work well up until the table has so many entries in it that looking up entries in the table becomes a real effort. Remember, words already in the list do not need to be added twice. Binary methods allow for quick searching, but each case of a new word being added to the list requires a sort or cumbersome insertion. This takes time, if a text file is of even average length. Hashing, on the other hand, can increment the count of words already in the list, or add new words to the list, without the overhead of sorting, sequential searches, or push-type insertion. Also, remember that entry deletion is a problem with hashing. Word distribution counts NEVER require entries to be struck, and so are well-suited to hashing systems. A good rule of thumb to determine which method may be best for a given problem is to cosider the points on this table: T2.0 List Management System Ratings List Type SEQUENTIAL BINARY HASHED ===================================================== small list 1 3 2 medium list 3 1 2 large list 3 2 1 huge list 3 2 1 Insertion 2 3 1 Modification 3 2 1 Deletion 1 2 3 Browsing 2 1 3 (Systems are ranked first, second, or third) =======>8 TABLE 2.0 ENDS HERE 8<========= Using this table, we can see that the best method for short lists that require frequent deletions might be the sequential list. The best for huge lists that require insertions, modifications, but not deletions (such as a nodelist index) is probably a hashed list. A hashed list, however, will not do much for you if you regularly want to access the next item, first item in the list, or last item, such as in a list browsing system. Hashed lists have no logical beginning or end, and for this reason, there is no such thing as a "first item" or "next item" in a hashed list. Each entry is a single entity, retrievable only as a single entity, with no relation to any other entry in the hashed list. This excludes applications that require browsing, as I have mentioned, but is perfect for symbol tables, dictionaries, and the like. Q3.2 This is all pretty new to me. Give me a practical review. A3.2 Okay. In the hashed list there is no sense of sequence in the classic sense of the concept. Items are put into buckets based upon the type of calculation I have already discussed, and if the bucket is already in use, a new bucket is found according to a set system. Therefore, two similar items in a hashed table may actually have a physical distance of 500 entries between them. A practical example: We have a hash table 7 buckets big, and you want to store three entries in it, using hashing. For simplicity, let's just store the characters A, B, and C, using their ASCII values as keys. Their buckets would be: Item Formula Bucket ========================= A 65 MOD 7 2 B 66 MOD 7 3 C 67 MOD 7 4 No collisions have occured here, since this is a simple case. Now, let us add just one more item: H. The first bucket that H will request is 72 MOD 2, or 2, which is being used by A. This is collision. Now, we must find an empty bucket, and so, we apply a common method to the old bucket: we subtract an offset from 2. The offset is calulated thus: Offset = TableSize - Bucket, or Offset = 7 -2 Offset = 5 Okay, now, whenever a collision occurs, we recalculate a position using this formula: NewPos = OldPos - Offset NewPos = 2 - 5 NewPos = -3 In cases where NewPos is less than 0, we then add the table size to the interim result: NewPos = NewPos + TableSize, or NewPos = -3 + 7 NewPos = 4 We see that this new bucket, 4, is being used by C, and so we have to recalculate the bucket one more time: NewPos = OldPos - Offset, or NewPos = 4 - 5 NewPos = -1 NewPos <0 so NewPos = NewPos + TableSize, or NewPos = -1 + 7 NewPos = 6 We see that 6 is an empty bucket, and therefore, our table now looks something like this: Entry Bucket ============== 1 (empty bucket) A 2 (no collisions) B 3 (no collisions) C 4 (no collisions) 5 (empty bucket) H 6 (arrived at after two collisions) 7 (empty bucket) Now, remember from past explanations that searches are conducted by comparing each entry to the key until an empty bucket is reached. Therefore, to find A in the table, we calculate a bucket of 65 MOD 7, or 2. We look in bucket 2, and see that our key of A is the same as the table entry A. We have therefore found our entry in one look! Now, let's look for I. That's a bit different, since it isn't in the list. How many looks are needed to tell us that it isn't? Well 73 MOD 7 is 3, and we see immediately that bucket 3 is a B, not an I. We recalculate the next bucket, and get: Offset = 4 NewPos = (3 - 4) or -1 Less than 0, so NewPos = 6 Bucket 6 is occupied by an H, and so we calculate the next bucket: Offset = 4 NewPos = (6-4) = 2 Bucket 2 is occupied by an A, and so: NewPos = (2 - 4) NewPos = -2 + 7 = 5 Finally, bucket 5 is empty. Therefore, since we've arrived at an empty bucket BEFORE arriving at I, we can say that I is not in the list. How many steps required? Four. Quite a bit of overhead on a short list of 7 entries, but consider a list of 100,000 entries! Four searches to find an item is fast! Q3.3 Okay, how about a real working example of hashing in QuickBASIC? Theory is fine for CompSci freaks, but I'm a coffee and pizza programmer, not an egghead. A3.3 I mentioned that one perfect use of hashing is for word distribution counters. Here is one from Rich Geldreich that has been tweaked by me to account for some things that Rich did not know then about hashing table sizes. S5.0 WORDHASH.BAS [F210S05.BAS] 'WORDHASH.BAS v1.10 By Rich Geldreich 1992 ' 'Uses hashing to quickly tally up the frequency of all of the words in a 'text file. (This program assumes that words are seperated by either tab 'or space characters. Also, all words are converted to uppercase before 'the search.) ' DEFINT A-Z DECLARE SUB Show.Counts () DECLARE SUB Process.Line (A$) DECLARE SUB UpdateFreq (A$, KeyIndex) CONST TRUE = -1, FALSE = 0 DIM SHARED TableSize Main: FileName$ = COMMAND$ CLS LOCATE 1, 1 PRINT "WORDHASH.BAS By Rich Geldreich 1992" OPEN FileName$ FOR INPUT AS #1 LEN = 16384 ' In Rich's original version, the TableSize was set at 7000. My version ' guesses at how large the table needs to be based on this: ' There are 5.5 characters in the average word. Therefore, divide the ' text file length by 5.5. For safety, assume that as many as ' half of those will be unique. In normal text, half the words are in the ' hundred most common list, so this plays it pretty safe! It will die ' if you take a file that is over about 50% unique words, however! This ' is for NORMAL text files, not word dictionaries, where all entries are ' unique! ' 'SPLICE IN FROM EARLIER SAMPLE 4.0 IN THIS FAQ ' VVVVVVVVVVVVV TableSize = funFirstPrime(LOF(1) * .09) REDIM SHARED WordTable$(TableSize) REDIM SHARED Counts(TableSize) DIM SHARED New.Words DO UNTIL EOF(1) LINE INPUT #1, A$ Process.Line A$ N = N + 1 LOCATE 3, 1: PRINT N; "lines processed,"; New.Words; "words found" LOOP SUB Process.Line (A$) ASEG = SSEG(A$) 'QuickBASIC 4.5 users change this to VARSEG(A$) AOFS& = SADD(A$) DEF SEG = ASEG + AOFS& \ 16 AAddress = AOFS& AND 15 Astart = AAddress AEndAddress = AAddress + LEN(A$) 'get a word GOSUB GetAWord 'update the frequency of the word until there aren't any words left DO WHILE Word$ <> "" UpdateFreq Word$, KeyIndex GOSUB GetAWord LOOP EXIT SUB GetAWord: Word$ = "" 'find a character P = PEEK(AAddress) DO WHILE (P = 32 OR P = 9) AND AAddress <> AEndAddress AAddress = AAddress + 1 P = PEEK(AAddress) LOOP 'if not at end of string then find a space IF AAddress <> AEndAddress THEN KeyIndex = 0 GOSUB UpdateKeyIndex 'remember where the character started WordStart = AAddress AAddress = AAddress + 1 P = PEEK(AAddress) GOSUB UpdateKeyIndex 'find the leading space DO UNTIL (P = 32 OR P = 9) OR AAddress = AEndAddress AAddress = AAddress + 1 P = PEEK(AAddress) GOSUB UpdateKeyIndex LOOP KeyIndex = KeyIndex - L 'make the word Word$ = UCASE$(MID$(A$, WordStart - Astart + 1, AAddress - WordStart)) END IF RETURN UpdateKeyIndex: IF P >= 97 AND P <= 122 THEN L = P - 32 KeyIndex = KeyIndex + L ELSE L = P KeyIndex = KeyIndex + L END IF RETURN END SUB SUB UpdateFreq (A$, KeyIndex) STATIC collisions 'adjust the keyindex so its within the table KeyIndex = KeyIndex MOD TableSize 'calculate an offset for retries IF KeyIndex = 0 THEN Offset = 1 ELSE Offset = TableSize - KeyIndex END IF 'main loop of hashing DO 'is this entry empty? IF WordTable$(KeyIndex) = "" THEN 'add this entry to the hash table WordTable$(KeyIndex) = A$ New.Words = New.Words + 1 IF New.Words = TableSize THEN BEEP PRINT : PRINT "Not enough room in word table!" END END IF EXIT SUB 'is this what we're looking for? ELSEIF WordTable$(KeyIndex) = A$ THEN 'increment the frequency of the entry Counts(KeyIndex) = Counts(KeyIndex) + 1 EXIT SUB 'this entry contains a string other than what we're looking for: 'adjust the KeyIndex and try again ELSE collisions = collisions + 1 LOCATE 5, 1: PRINT "Collisions: "; collisions KeyIndex = KeyIndex - Offset 'wrap back the keyindex if it's <0 IF KeyIndex < 0 THEN KeyIndex = KeyIndex + TableSize END IF END IF LOOP END SUB =======>8 SAMPLE 5.0 ENDS HERE 8<========= END OF QUIK_BAS FAQ ,-.`a01BCDcd56xy@ANO{|} Q R ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8   f g   [ \ 0 1 y z D E 01WXuvwxKL]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8DEFab)*+jk67]^123z{]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8LM#$kl;<567ABC !XYZ[z{|}]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8!"./<=FGMN`acdqrJKhi|}^_uv]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8vwGHbcDE  Y Z +!,!q!r!!!!!!""]"^"r"s"t"""##]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8#I#J#####!$"$j$k$$$$$$:%;%y%z%%% & &D&E&F&&&&&&&&&&'''e'f'''''#($(_(`(((((())P)Q))]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8)))))))))?*@***** + +Q+R+++++&,',m,n,,,,,;-<----------=.>.....//]/^//////]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8// 0 0S0T00000000001191:1W1X1Y11111(2)2o2p222222337383R3S3T333333334444 4h4i44]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]84444444444:5;5p5q5r5555555555'6(6G6H6b6c6d66666666666 7!7D7E7h7i77777778 8 8R8]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8R8S88888"9#9h9i99999?:@::::::::::; ;#;$;%;o;p;q;;;;;;;;;;<#<$<%<k<l<<<<<>=?====]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8===>>3>4>5>t>u>>>??2?3?4?V?W?X???????@@^@_@l@m@n@o@p@z@{@@@@@@@@@@ A A AAA'A(A,A-A9A:A]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8:AfAgAAAAAABBB`BaBBBBB1C2CzC{CCCCCCCCCDD@DADcDdDDDDDDDDDDDDDD,E-E2E3E4EIEJEKEE]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8EEEEEEEEEEEFFF,F-F?F@FFFFFFFFFFFFFFFFF G G G"G#G9G:GtGuGGGGGGGGGG H H!H"H>H]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8>H?H@HmHnHHHHHHHHHHII@IAIdIeIIIIIIIIIIII3J4J5J?J@JFJGJHJWJXJxJyJJJJJJJJJJJJJJJ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]8JJJJJKK.K/K_K`KKKKKKKKKKKLLLL*L+L1L2LOLPLyLzLLLLLLLMM2M3MqMrMMMMMMMMMNN6N7N]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]87NjNkNNNNNNNOO(O)ObOcOOOOOOO P PPP-P.P6P7P8P?P@PAPkPlPmPPPPP]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]'-.a1CDd6yAOO|} R  g  \ 1 z E  1XvwxLEFb*+kk7^23{M$l<67BC!Y[{}"/=GNadrKi}_vwHcE  Z ,!r!!!!"^"s"t""#J###"$k$$$$;%z%z%% &E&F&&&&&&''f'''$(`(((()Q))))))))@*** +R+++',n,,,<------->...//^///// 0T0000001:1X1Y111)2p22223383S3T3333344 4i4444444;5q5r5555555(6H6c6d6666666!7E7i7777 8 8S888#9i9i999@:::::: ;$;%;p;q;;;;;;<$<%<l<<<<?====>4>5>u>>?3?4?W?X????@_@m@o@p@{@@@@@@@ A AA(A-A:AgAAAABBaBBB2C{CCCCCCDADdDDDDDDDD-E3E4EJEKEEEEEEEFFF-F@FFFFFFFFFF G G#G:GuGGGGGG H"H?H?H@HnHHHHHHIAIeIIIIIII4J5J@JGJHJXJyJJJJJJJJJJJJK/K`KKKKKKLL+L2LPLzLLLLLM3MrMMMMMN7NkNNNNO)OcOOOO PP.P.P7P8P@PAPlPmPPPP P)P:M&K@Normala "A@"Default Paragraph Font@ FMicrosoft Word 6.0 Document MSWordDoc9qࡱ> ࡱ>