Knowledge Mapper

Query

This module reads the theme list, runs the queries and stores the results. This phase requires the most processor power and disk I/O. During this phase, the web pages listing the documents related to each theme are produced. The output files of this phase, called mark-up files, serve as the input files for the next phase. There is one mark-up file for each document in the repository. Each mark-up file contains the following information:

Theme name
Line and column where mark-up starts
Line and column where mark-up ends
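
To make the record layout concrete, here is a minimal Python sketch of one mark-up entry. The source does not specify the on-disk format, so the tab-separated layout and the names `MarkupEntry`, `format_entry` and `parse_entry` are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkupEntry:
    """One mark-up record: a theme plus the span it covers (hypothetical layout)."""
    theme: str
    start_line: int
    start_col: int
    end_line: int
    end_col: int


def format_entry(entry: MarkupEntry) -> str:
    # Assumption: fields are tab-separated; the real file format is not documented here.
    return "\t".join([
        entry.theme,
        str(entry.start_line), str(entry.start_col),
        str(entry.end_line), str(entry.end_col),
    ])


def parse_entry(line: str) -> MarkupEntry:
    # Inverse of format_entry: split on tabs and convert the positions to integers.
    theme, start_line, start_col, end_line, end_col = line.rstrip("\n").split("\t")
    return MarkupEntry(theme, int(start_line), int(start_col),
                       int(end_line), int(end_col))
```

Keeping the start and end positions as separate line and column pairs lets the next phase insert mark-up without searching the document for the theme again.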

The following options can be set from the command line:

Input file name
Maximum documents per theme
Additional theme threshold percentage
Working directory
Minimum document rank
Maximum number of themes
Summary mode
Expert mode
Flag to write text files
Debug mode
Debug level
Flag to process additional theme file only
Flag to overwrite files
Help
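
As an illustration of how such an option set might be declared, the following Python sketch uses the standard `argparse` module. The source lists only what each option does, not its switch, so every flag name here (`--input`, `--max-docs` and so on), the program name `query`, and the value types are hypothetical. No default values are assumed; unset options are left as `None`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # All switch names below are illustrative; the real program's flags are not documented here.
    parser = argparse.ArgumentParser(
        prog="query", description="Query phase options (illustrative names only)")
    parser.add_argument("--input", help="input file name")
    parser.add_argument("--max-docs", type=int, help="maximum documents per theme")
    parser.add_argument("--threshold", type=float,
                        help="additional theme threshold percentage")
    parser.add_argument("--workdir", help="working directory")
    parser.add_argument("--min-rank", type=int, help="minimum document rank")
    parser.add_argument("--max-themes", type=int, help="maximum number of themes")
    parser.add_argument("--summary", action="store_true", help="summary mode")
    parser.add_argument("--expert", action="store_true", help="expert mode")
    parser.add_argument("--write-text", action="store_true", help="write text files")
    parser.add_argument("--debug", action="store_true", help="debug mode")
    parser.add_argument("--debug-level", type=int, help="debug level")
    parser.add_argument("--additional-only", action="store_true",
                        help="process the additional theme file only")
    parser.add_argument("--overwrite", action="store_true", help="overwrite files")
    # argparse adds -h/--help automatically, which covers the Help option.
    return parser
```

For example, `build_parser().parse_args(["--input", "themes.txt", "--max-docs", "50"])` yields a namespace with `input` and `max_docs` set and every flag left at `False`.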

The stop word list is read from a file and all of its words are stored in a binary tree. Before processing the input file, the program must check the available features, then initialize and open SWISH-E. At the start of the application code, the search properties are set and the input file is opened. The output is called a hits list file, and each of its lines contains the following data:

Library name
Document ID
Number of lines
File name
Document title
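
A line of the hits list file could be modeled as follows. This is a sketch only: the source names the five fields but not the delimiter or the field types, so the tab separator, the string-valued document ID, and the names `HitRecord`, `format_hit` and `parse_hit` are assumptions.

```python
from typing import NamedTuple


class HitRecord(NamedTuple):
    """One line of the hits list file (field types are assumed)."""
    library: str
    doc_id: str       # kept as text, since the ID format is not specified
    line_count: int
    file_name: str
    title: str


def format_hit(hit: HitRecord) -> str:
    # Assumption: tab-separated fields, with the free-text title last.
    return "\t".join((hit.library, hit.doc_id, str(hit.line_count),
                      hit.file_name, hit.title))


def parse_hit(line: str) -> HitRecord:
    # Split at most four times so the title field is never broken up.
    library, doc_id, line_count, file_name, title = line.rstrip("\n").split("\t", 4)
    return HitRecord(library, doc_id, int(line_count), file_name, title)
```

Placing the title last and limiting the number of splits keeps the parser tolerant of unusual characters in document titles.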

During testing it was found that combinations of two or three words occur frequently, e.g. NETWORK MANAGEMENT or VICE PRESIDENT. A mechanism to find such combinations was therefore implemented. In every document in the list, the words before and after the best hit are saved in a binary tree. The number of occurrences of each of these words is stored in a counter.
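
The counting step can be sketched in Python as shown below. Two simplifications are assumed: `collections.Counter` stands in for the binary tree with per-word counters, and separate counters are kept for the word before and the word after the hit so that the word order of a combination can be reconstructed. The function name `count_neighbors` and the input shape (a word list plus the index of the best hit) are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_neighbors(
    documents: Iterable[Tuple[List[str], int]]
) -> Tuple[Counter, Counter]:
    """Count the words directly before and after the best hit in each document.

    Each item is (words, hit_index), where hit_index is the position of the
    best hit for the theme in that document's word list.
    """
    before: Counter = Counter()
    after: Counter = Counter()
    for words, hit in documents:
        if hit > 0:                      # a previous word exists
            before[words[hit - 1]] += 1
        if hit + 1 < len(words):         # a next word exists
            after[words[hit + 1]] += 1
    return before, after
```

With the theme MANAGEMENT, a document containing "NETWORK MANAGEMENT TOOLS" would add one to the count for NETWORK in the first counter and one to the count for TOOLS in the second.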

After all of the hits in all of the documents for a theme have been processed, the counters are checked. The number of times that each of these words occurs is calculated as a percentage of the documents that contain it. If this percentage is above a preset threshold, the combination is written to an output file as an additional theme. This file of additional themes can be edited and used as input during a subsequent run. No further additional themes need to be listed during such a run, and a command line option makes provision for this.
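
The threshold check might look like the sketch below. It assumes that the percentage is taken relative to the number of documents processed for the theme, which is one reading of the description above, and that "above" means strictly greater than the threshold. The function name `additional_themes` and its parameters are hypothetical.

```python
from typing import Dict, List


def additional_themes(theme: str,
                      before: Dict[str, int],
                      after: Dict[str, int],
                      doc_count: int,
                      threshold_pct: float) -> List[str]:
    """Return the word combinations whose frequency exceeds the threshold.

    before/after map a neighboring word to the number of times it was seen
    directly before or after the theme's best hit.
    """
    if doc_count <= 0:
        return []
    found = []
    for word, count in before.items():
        # Assumption: percentage of the documents processed for this theme.
        if 100.0 * count / doc_count > threshold_pct:
            found.append(f"{word} {theme}")
    for word, count in after.items():
        if 100.0 * count / doc_count > threshold_pct:
            found.append(f"{theme} {word}")
    return sorted(found)
```

For instance, if NETWORK precedes MANAGEMENT in 6 of 10 documents and the threshold is 50 percent, NETWORK MANAGEMENT is reported as an additional theme, while a word seen in only 1 of 10 documents is not.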

The list produced is of value, but cannot be used without human intervention. Since the combinations identified are based on purely statistical grounds, a significant proportion of the entries don't make much sense from a linguistic point of view. The administrative user therefore has to edit the list manually before proceeding with the next phase of processing.

The hyperlink and title for each document are written to the output file. The previous word is extracted and its length checked. If it is longer than a preset number of characters, it is marked for insertion into the list, but only after it has been checked against the stop word list. The process is repeated for the next word. The next step is to produce a plain ASCII text file. Information on this hit is written to the hits file. This information and the ASCII file are used in the next phase to generate the marked-up file. The memory used for additional themes is freed after each theme has been processed. The next line is then read from the input file.
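
The length and stop word checks can be illustrated with a small Python sketch. It assumes a plain, unbalanced binary search tree for the stop words and a case-insensitive comparison; neither detail is stated in the source. The names `StopWordTree` and `is_candidate` are hypothetical.

```python
from typing import Iterable, Optional


class _Node:
    __slots__ = ("word", "left", "right")

    def __init__(self, word: str) -> None:
        self.word = word
        self.left: Optional["_Node"] = None
        self.right: Optional["_Node"] = None


class StopWordTree:
    """Unbalanced binary search tree of lower-cased stop words."""

    def __init__(self, words: Iterable[str] = ()) -> None:
        self._root: Optional[_Node] = None
        for word in words:
            self.insert(word)

    def insert(self, word: str) -> None:
        word = word.lower()
        if self._root is None:
            self._root = _Node(word)
            return
        node = self._root
        while True:
            if word == node.word:
                return                      # already stored
            side = "left" if word < node.word else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, _Node(word))
                return
            node = child

    def __contains__(self, word: str) -> bool:
        word = word.lower()
        node = self._root
        while node is not None:
            if word == node.word:
                return True
            node = node.left if word < node.word else node.right
        return False


def is_candidate(word: str, stop_words: StopWordTree, min_length: int) -> bool:
    # Length check first, then the stop word lookup, as described above.
    return len(word) > min_length and word not in stop_words
```

With stop words "the", "of", "and" and "with" and a preset length of 3, NETWORK passes both checks, "with" is rejected by the stop word list, and "of" is rejected by the length check alone.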