Knowledge Mapper
Markup
The hyperlink and title for each document are written to the output file. The previous
word is extracted and its length checked. If it is longer than a preset number of
characters, it is marked for insertion into the list. Before it is inserted, the word
is checked against the stop word list. The process is repeated for the next word.
The next step is to produce a plain ASCII text file. Information on this hit is
written to the hits file. This information and the ASCII file are used in the next
phase to generate the marked-up file. The memory used for additional themes is freed
after each theme has been processed. The next line is then read from the input file.
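
The word-length and stop-word checks described above could be sketched in C roughly as follows. This is only an illustration: the threshold MIN_WORD_LEN, the contents of STOP_WORDS, and the function names are assumptions, since the text gives neither the preset length nor the actual stop word list.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumed threshold; the text only says "a preset number of characters". */
#define MIN_WORD_LEN 3

/* Hypothetical stop word list; the real list is not given in the text. */
static const char *const STOP_WORDS[] = {
    "about", "that", "their", "there", "which", "with"
};
#define STOP_WORD_COUNT (sizeof STOP_WORDS / sizeof STOP_WORDS[0])

/* Compare two NUL-terminated strings ignoring case; returns 1 if equal. */
static int equal_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Returns 1 if the word appears in the stop word list, 0 otherwise. */
static int is_stop_word(const char *word)
{
    size_t i;
    for (i = 0; i < STOP_WORD_COUNT; i++) {
        if (equal_ignore_case(word, STOP_WORDS[i]))
            return 1;
    }
    return 0;
}

/* Returns 1 if the word should be marked for insertion into the list:
   it must be longer than MIN_WORD_LEN and must not be a stop word. */
int should_insert_word(const char *word)
{
    if (strlen(word) <= MIN_WORD_LEN)
        return 0;
    return !is_stop_word(word);
}
```

In this sketch the length test runs first because it is cheap, and the stop word list is consulted only for words that pass it.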
The final phase uses these files to generate a new HTML version of each document with
theme words hyperlinked. When the end user clicks on such a link, the list of documents
for that theme is displayed.
The method is to build a list of all the mark-up files and process each of these in turn.
All previous HTML files are normally deleted; a flag set from the command line
controls this. When processing a mark-up file, all of the entries in the file are
loaded into memory. The number of entries in the list is not easy to predict, since
it depends on the size of the document, the number of hits per theme as well as the
overall number of themes. A linked list is therefore used for this purpose. This is
a linear collection of self-referencing structures or nodes that are connected by
pointers. The list is accessed through a pointer to the first node, and each node that
follows is reached through the link pointer in the preceding node. Memory for nodes is allocated
dynamically, with each node created as necessary. After processing the mark-up file,
all of the nodes in the list are destroyed. A new list is created for the next
mark-up file.
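
A minimal C sketch of such a list is given here for illustration. The structure layout, field names and function names (struct entry, entry_push, entry_count, entry_free_all) are assumptions, not taken from the original program; the sketch only shows a self-referencing node, allocation on demand, and destruction of the whole list after a mark-up file has been processed.

```c
#include <stdlib.h>
#include <string.h>

/* One mark-up entry. All field names here are illustrative assumptions. */
struct entry {
    int line;            /* line in the ASCII file where the hit occurs */
    int column;          /* starting column of the hit on that line */
    char theme[64];      /* theme text, truncated if too long */
    struct entry *next;  /* link to the next node, NULL at the end */
};

/* Allocate one node and link it in at the front of the list.
   Returns 0 on success, -1 if memory could not be allocated
   (in which case the list is left unchanged). */
int entry_push(struct entry **head, int line, int column, const char *theme)
{
    struct entry *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->line = line;
    node->column = column;
    strncpy(node->theme, theme, sizeof node->theme - 1);
    node->theme[sizeof node->theme - 1] = '\0';
    node->next = *head;
    *head = node;
    return 0;
}

/* Count the nodes by following the link pointers from the first node. */
size_t entry_count(const struct entry *head)
{
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Destroy every node; called once a mark-up file has been processed. */
void entry_free_all(struct entry *head)
{
    while (head != NULL) {
        struct entry *next = head->next;  /* save link before freeing */
        free(head);
        head = next;
    }
}
```

A caller would start each mark-up file with an empty list (a NULL head pointer), call entry_push once per entry read from the file, and call entry_free_all before moving on to the next file.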
Entries are sorted by the order in which they should appear. A check is then done for
overlapping entries on the same line, e.g. COMPANY and PUBLISHING COMPANY. The shorter
of the two is then discarded. Since more than two entries can overlap (such as
ZIFF-DAVIES PUBLISHING COMPANY), the list is checked repeatedly until no overlapping
entries remain. A guard against infinite loops generates an error if the check
repeats more than 100 times.
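
The sort, overlap check and loop guard might look something like the following C sketch. It uses an array with a discarded flag rather than the linked list, purely to keep the example short. The names struct span, remove_overlaps and MAX_PASSES are hypothetical, and the rule of comparing each entry with the most recent kept entry on the same line is an assumption about how the check could be done.

```c
#include <stddef.h>
#include <stdlib.h>

/* Pass limit taken from the text: more than 100 repeats is an error. */
#define MAX_PASSES 100

/* One entry on a line of the document. Field names are illustrative. */
struct span {
    int line;       /* line on which the entry appears */
    int start;      /* column of its first character */
    int length;     /* number of characters it covers */
    int discarded;  /* set to 1 once the entry has been dropped */
};

/* Order entries by line, then by starting column. */
static int compare_spans(const void *a, const void *b)
{
    const struct span *x = a, *y = b;
    if (x->line != y->line)
        return (x->line > y->line) - (x->line < y->line);
    return (x->start > y->start) - (x->start < y->start);
}

/* Two entries overlap if they share a line and their columns intersect. */
static int overlaps(const struct span *a, const struct span *b)
{
    return a->line == b->line &&
           a->start < b->start + b->length &&
           b->start < a->start + a->length;
}

/* Sorts the entries, then repeatedly discards the shorter of any two
   overlapping entries until a full pass finds nothing to discard.
   Returns the number of entries kept, or -1 if the loop guard trips. */
int remove_overlaps(struct span *spans, size_t count)
{
    int passes = 0;
    int changed;
    int kept = 0;
    size_t i;

    qsort(spans, count, sizeof *spans, compare_spans);

    do {
        struct span *prev = NULL;  /* most recent kept entry */
        changed = 0;
        if (++passes > MAX_PASSES)
            return -1;             /* guard against an infinite loop */
        for (i = 0; i < count; i++) {
            struct span *cur = &spans[i];
            if (cur->discarded)
                continue;
            if (prev != NULL && overlaps(prev, cur)) {
                if (cur->length > prev->length) {
                    prev->discarded = 1;  /* previous entry is shorter */
                    prev = cur;
                } else {
                    cur->discarded = 1;   /* current entry is shorter */
                }
                changed = 1;
            } else {
                prev = cur;
            }
        }
    } while (changed);

    for (i = 0; i < count; i++) {
        if (!spans[i].discarded)
            kept++;
    }
    return kept;
}
```

With the three overlapping entries from the example on one line, only the longest (ZIFF-DAVIES PUBLISHING COMPANY) would survive in this sketch, while entries on other lines are left alone.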