Knowledge Mapper  


Markup

The hyperlink and title for each document are written to the output file. The previous word is extracted and its length is checked. If it is longer than a preset number of characters, it is marked for insertion into the list. Before the insertion is done, the word is checked against the stop word list. The process is repeated for the next word. The next step is to produce a plain ASCII text file. Information on each hit is written to the hits file. This information and the ASCII file are used in the next phase to generate the marked-up file. The memory used for additional themes is freed after each theme has been processed. The next line is then read from the input file.
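The length and stop word checks described above can be sketched in C as follows. This is only an illustration: the threshold value, the contents of the stop word list, and the function names are all assumptions, since the text gives none of them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed threshold; the text only says the number of characters is preset. */
#define MIN_WORD_LEN 3

/* Assumed stop word list; the real list is not given in the text. */
const char *const STOP_WORDS[] = { "about", "their", "there", "which" };
const size_t STOP_WORD_COUNT = sizeof STOP_WORDS / sizeof STOP_WORDS[0];

/* Return true if `word` matches an entry in the stop word list exactly. */
bool is_stop_word(const char *word)
{
    for (size_t i = 0; i < STOP_WORD_COUNT; i++) {
        if (strcmp(word, STOP_WORDS[i]) == 0)
            return true;
    }
    return false;
}

/* A word is marked for insertion only if it is longer than the
   threshold and does not appear in the stop word list. */
bool should_insert(const char *word)
{
    if (strlen(word) <= MIN_WORD_LEN)
        return false;
    return !is_stop_word(word);
}
```

With these assumed values, should_insert("publishing") is true, while "the" is rejected for its length and "which" is rejected as a stop word.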

The final phase uses these files to generate a new HTML version of each document with theme words hyperlinked. When the end user clicks on such a link, the list of documents for that theme is displayed.
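As a rough illustration of hyperlinking a theme word, the C sketch below wraps one span of a text line in an anchor tag. The function name, its parameters, and the form of the link target are hypothetical; the text does not describe how the links are actually built. The sketch does no HTML escaping, and the caller must make sure the span lies within the line.

```c
#include <stddef.h>
#include <stdio.h>

/* Copy `line` into `out`, wrapping the characters in the range
   [start, start + length) in a link to `href`.
   Returns 0 on success, or -1 if the result did not fit in `out`. */
int write_linked(char *out, size_t out_size, const char *line,
                 int start, int length, const char *href)
{
    int n = snprintf(out, out_size, "%.*s<a href=\"%s\">%.*s</a>%s",
                     start, line,            /* text before the theme word */
                     href,                   /* link to the theme's document list */
                     length, line + start,   /* the theme word itself */
                     line + start + length); /* text after the theme word */
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```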

The method used is to make a list of all the mark-up files and process each of them in turn. All previous HTML files are usually deleted first; a flag set on the command line controls this. When a mark-up file is processed, all of its entries are loaded into memory. The number of entries is not easy to predict, since it depends on the size of the document, the number of hits per theme, and the overall number of themes. A linked list is therefore used for this purpose. This is a linear collection of self-referencing structures, or nodes, connected by pointers. The list is accessed through a pointer to the first node, and each subsequent node is reached through the link pointer in the node before it. Memory for nodes is allocated dynamically, with each node created as it is needed. After the mark-up file has been processed, all of the nodes in the list are destroyed. A new list is created for the next mark-up file.
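A minimal C sketch of such a list is shown below. The structure fields and function names are assumptions made for illustration; the text does not say what a mark-up entry contains. For brevity, new nodes are added at the head of the list.

```c
#include <stdlib.h>
#include <string.h>

/* One mark-up entry. The fields are illustrative only. */
typedef struct Entry {
    int line;            /* line on which the hit occurs */
    int column;          /* column where the hit starts */
    char *theme;         /* heap-allocated copy of the theme text */
    struct Entry *next;  /* link pointer to the following node */
} Entry;

/* Allocate a node and link it in at the head of the list.
   Returns 0 on success, or -1 if memory could not be allocated,
   in which case the list is left unchanged. */
int entry_push(Entry **head, int line, int column, const char *theme)
{
    Entry *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;

    size_t size = strlen(theme) + 1;
    node->theme = malloc(size);
    if (node->theme == NULL) {
        free(node);
        return -1;
    }
    memcpy(node->theme, theme, size);

    node->line = line;
    node->column = column;
    node->next = *head;
    *head = node;
    return 0;
}

/* Walk the list through the link pointers and count the nodes. */
size_t entry_count(const Entry *head)
{
    size_t count = 0;
    for (; head != NULL; head = head->next)
        count++;
    return count;
}

/* Destroy every node once the mark-up file has been processed. */
void entry_free_all(Entry *head)
{
    while (head != NULL) {
        Entry *next = head->next;
        free(head->theme);
        free(head);
        head = next;
    }
}
```

A caller would start each mark-up file with an empty list (a NULL head pointer), call entry_push once per entry, and call entry_free_all before moving on to the next file.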

Entries are sorted into the order in which they should appear. A check is then done for overlapping entries on the same line, e.g. COMPANY and PUBLISHING COMPANY. The shorter of the two is discarded. Since more than two entries can overlap (such as ZIFF-DAVIES PUBLISHING COMPANY), the list is checked repeatedly until no overlapping entries remain. There is a check to guard against infinite loops: an error is generated when the loop repeats more than 100 times.
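The sort, overlap check, and loop guard can be sketched in C as follows. To keep the example short it works on an array rather than the linked list, and the structure layout, the function names, and the tie-breaking rule for entries of equal length are assumptions. Only the limit of 100 passes comes from the text.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_PASSES 100  /* loop guard taken from the text */

/* Position of one entry; the layout is illustrative only. */
typedef struct {
    int line;    /* line on which the entry appears */
    int start;   /* column of its first character */
    int length;  /* number of characters it covers */
} Span;

/* Order spans by line, then by starting column. */
int span_cmp(const void *a, const void *b)
{
    const Span *x = a;
    const Span *y = b;
    if (x->line != y->line)
        return (x->line > y->line) - (x->line < y->line);
    return (x->start > y->start) - (x->start < y->start);
}

/* Two spans overlap when they share a line and their ranges intersect. */
int spans_overlap(const Span *a, const Span *b)
{
    return a->line == b->line &&
           a->start < b->start + b->length &&
           b->start < a->start + a->length;
}

/* Sort the spans, then sweep the array repeatedly, dropping the shorter
   of each adjacent overlapping pair, until a sweep removes nothing.
   If the lengths are equal, the later span is dropped (an assumption).
   Returns the new count, or -1 if more than MAX_PASSES sweeps are needed. */
int remove_overlaps(Span *spans, int count)
{
    int pass = 0;
    int changed;

    qsort(spans, (size_t)count, sizeof *spans, span_cmp);
    do {
        if (++pass > MAX_PASSES)
            return -1;
        changed = 0;
        for (int i = 0; i + 1 < count; ) {
            if (spans_overlap(&spans[i], &spans[i + 1])) {
                int drop = (spans[i].length < spans[i + 1].length) ? i : i + 1;
                memmove(&spans[drop], &spans[drop + 1],
                        (size_t)(count - drop - 1) * sizeof *spans);
                count--;
                changed = 1;
            } else {
                i++;
            }
        }
    } while (changed);
    return count;
}
```

For the example in the text, the three overlapping entries ZIFF-DAVIES PUBLISHING COMPANY, PUBLISHING COMPANY, and COMPANY on one line reduce to the longest entry alone.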

Powered by SWISH-E