Knowledge Mapper
Query
This module reads the theme list, runs the queries and stores the results. This phase
requires the most processor power and disk I/O. During this phase the web pages listing
the documents relating to each theme are produced. The output files of this phase serve
as the input files for the next phase. These are called mark-up files. There is one
mark-up file for each document in the repository. Each contains the following information:
- Theme name
- Line and column where mark-up starts
- Line and column where mark-up ends
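As an illustration, one mark-up entry could be modelled as a small record. This is only a sketch: the class name `MarkupEntry`, the field names and the tab-separated layout are assumptions made for the example, since the on-disk format of the mark-up files is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkupEntry:
    """One mark-up span for a theme (hypothetical record layout)."""
    theme: str
    start_line: int
    start_col: int
    end_line: int
    end_col: int

    def to_line(self) -> str:
        # Assumed serialization: five tab-separated fields per entry.
        fields = [self.theme, self.start_line, self.start_col,
                  self.end_line, self.end_col]
        return "\t".join(str(f) for f in fields)

    @classmethod
    def from_line(cls, line: str) -> "MarkupEntry":
        # Inverse of to_line(); raises ValueError on a malformed line.
        theme, sl, sc, el, ec = line.rstrip("\n").split("\t")
        return cls(theme, int(sl), int(sc), int(el), int(ec))
```

Keeping the start and end positions as separate line and column pairs lets the next phase insert the mark-up without searching the text for the theme again.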
Options that can be set from the command line are:
- Input file name
- Maximum documents per theme
- Additional theme threshold percentage
- Working directory
- Minimum document rank
- Maximum number of themes
- Summary mode
- Expert mode
- Flag to write text files
- Debug mode
- Debug level
- Flag to process additional theme file only
- Flag to overwrite files
- Help
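The option list above could be declared roughly as follows. This is a sketch using Python's `argparse`, not the program's real interface: every flag name (`--input`, `--max-docs`, and so on) is invented for illustration, and no default values are assumed because none are given here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the listed options (flag names are hypothetical)."""
    p = argparse.ArgumentParser(prog="query", description="Query phase options")
    p.add_argument("--input", metavar="FILE", help="input file name")
    p.add_argument("--max-docs", type=int, help="maximum documents per theme")
    p.add_argument("--threshold", type=float,
                   help="additional theme threshold percentage")
    p.add_argument("--workdir", metavar="DIR", help="working directory")
    p.add_argument("--min-rank", type=int, help="minimum document rank")
    p.add_argument("--max-themes", type=int, help="maximum number of themes")
    p.add_argument("--summary", action="store_true", help="summary mode")
    p.add_argument("--expert", action="store_true", help="expert mode")
    p.add_argument("--write-text", action="store_true", help="write text files")
    p.add_argument("--debug", action="store_true", help="debug mode")
    p.add_argument("--debug-level", type=int, help="debug level")
    p.add_argument("--additional-only", action="store_true",
                   help="process the additional theme file only")
    p.add_argument("--overwrite", action="store_true", help="overwrite files")
    # The help option (-h/--help) is added by argparse automatically.
    return p
```

A call such as `build_parser().parse_args(["--input", "themes.txt", "--threshold", "40"])` would then yield an object with `input` and `threshold` attributes.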
We read the stop word list from file and store all of its words in a binary tree.
Before processing the input file, the program must check available features, initialize
and open SWISH-E. At the start of the application code, we set the search properties and
open the input file. The output is called a hits list file, containing the following data
in each line:
- Library name
- Document ID
- Number of lines
- File name
- Document title
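A line of the hits list file could be read back as shown below. The field separator is not stated in this description, so the tab character used here is an assumption, as are the names `Hit` and `parse_hit_line` and the choice to keep the document ID as a string.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One line of the hits list file (field names are illustrative)."""
    library: str
    doc_id: str
    num_lines: int
    file_name: str
    title: str


def parse_hit_line(line: str, sep: str = "\t") -> Hit:
    # Split into at most five fields so that the title, which comes last,
    # may itself contain the separator. Raises ValueError if fields are missing.
    library, doc_id, num_lines, file_name, title = line.rstrip("\n").split(sep, 4)
    return Hit(library, doc_id, int(num_lines), file_name, title)
```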
During testing it was found that combinations of two or three words occur frequently,
e.g. NETWORK MANAGEMENT or VICE PRESIDENT. A mechanism to find such combinations was
therefore implemented. In every document in the list, the words before and after the
best hit are saved in a binary tree. The number of occurrences of each of these words
is stored in a counter.
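The counting step can be sketched as follows. For brevity the sketch uses Python's `collections.Counter` in place of the binary tree with counters described above. It also assumes that each document is available as a list of tokens together with the index of its best hit, and that the theme is a single token. The function name `count_neighbours` is hypothetical.

```python
from collections import Counter


def count_neighbours(documents):
    """Count the words directly before and after the best hit.

    documents: iterable of (tokens, best_hit_index) pairs, one per document.
    Returns two Counters: words seen before the hit, words seen after it.
    """
    before, after = Counter(), Counter()
    for tokens, best_hit in documents:
        if best_hit > 0:
            before[tokens[best_hit - 1].upper()] += 1
        if best_hit + 1 < len(tokens):
            after[tokens[best_hit + 1].upper()] += 1
    return before, after
```

Because only the best hit of each document is examined, each counter value is also the number of documents in which that neighbouring word was seen.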
After all of the hits for all of the documents for a theme have been processed, the
counters are checked. The number of times that each of these words occurs is calculated
as a percentage of the documents that contain it. If this percentage is above a preset
threshold, the theme is written to an output file. This file with additional themes can
be edited and used as input during a subsequent run. No further additional themes need to
be listed during such a run. A command line option makes provision for this.
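A minimal sketch of the threshold check follows. It assumes the percentage is taken relative to the number of documents with hits for the theme, which is one reading of the description above. The function name `additional_themes` and its parameters are invented for the example.

```python
def additional_themes(theme, before, after, docs_with_hits, threshold_pct):
    """Return word combinations whose frequency exceeds the threshold.

    before/after: mappings of neighbouring word -> number of occurrences.
    docs_with_hits: number of documents with hits for the theme (assumed base).
    """
    if docs_with_hits == 0:
        return []
    found = []
    for word, count in sorted(before.items()):
        if 100.0 * count / docs_with_hits > threshold_pct:
            found.append(f"{word} {theme}")   # word precedes the theme
    for word, count in sorted(after.items()):
        if 100.0 * count / docs_with_hits > threshold_pct:
            found.append(f"{theme} {word}")   # word follows the theme
    return found
```

With four documents and a threshold of 60 percent, a word that precedes the theme in three of them (75 percent) is reported, while a word seen only once (25 percent) is not.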
The list produced is of value, but cannot be used without human intervention. Since the
combinations identified are based on purely statistical grounds, a significant proportion
of the entries don't make much sense from a linguistic point of view. The administrative
user therefore has to edit the list manually before proceeding with the next phase of
processing.
The hyperlink and title for each document are written to the output file. The previous
word is extracted and its length checked. If it is longer than a preset number of
characters, it is marked for insertion into the list. Before this is done, the word is
checked against the stop word list. The process is repeated for the next word.
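The length check and the stop word lookup might look like the sketch below. It uses a plain unbalanced binary search tree. The names `StopWordTree` and `is_candidate`, the case-insensitive comparison and the `min_length` parameter are assumptions made for illustration.

```python
class _Node:
    __slots__ = ("word", "left", "right")

    def __init__(self, word):
        self.word, self.left, self.right = word, None, None


class StopWordTree:
    """Unbalanced binary search tree holding lower-cased stop words."""

    def __init__(self, words=()):
        self.root = None
        for w in words:
            self.insert(w)

    def insert(self, word):
        word = word.lower()
        if self.root is None:
            self.root = _Node(word)
            return
        node = self.root
        while word != node.word:          # duplicates are ignored
            side = "left" if word < node.word else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, _Node(word))
                return
            node = child

    def __contains__(self, word):
        word, node = word.lower(), self.root
        while node is not None:
            if word == node.word:
                return True
            node = node.left if word < node.word else node.right
        return False


def is_candidate(word, stop_words, min_length):
    # A neighbouring word qualifies only if it is longer than the preset
    # length and does not appear in the stop word list.
    return len(word) > min_length and word not in stop_words
```

Note that if the stop word file is sorted alphabetically, inserting it in order into an unbalanced tree degenerates into a linked list, so lookups become linear rather than logarithmic.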
The next step is to produce a plain ASCII text file. Information on this hit is
written to the hits file. This information and the ASCII file are used in the next
phase to generate the marked-up file. The memory used for additional themes is freed
after each theme has been processed. The next line is then read from the input file.