Knowledge Mapper
Description of software modules
The following steps can be taken to obtain a list of themes for a collection of documents:
- Index the documents with a search engine, and write a dump of all the words in the index.
- Read and analyse the contents of the index to get a list of the words or phrases that are used
most often.
- To keep the list short, add only nouns and names to it. Any word not found in a
standard English dictionary is assumed to be a name.
- Query the index for each word or phrase in the compiled list.
- Store the list of matching documents returned by each query.
- For each document, build a list of theme words or phrases that occur inside it and the
positions where they occur.
- Generate a marked-up version of each document, with the theme words and phrases hyperlinked
to the list of documents for that theme.
- Finally, publish a master index list.
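
The steps above can be illustrated with a short Python sketch. It is not the project's code: the real engine is built on SWISH-E and works from an index dump, whereas this sketch scans the document text directly as a stand-in for indexing and querying. The `DICTIONARY` and `NOUNS` sets, the function names, and the `themes/<word>.html` link target are all hypothetical, chosen only for illustration.

```python
import html
import re
from collections import Counter
from urllib.parse import quote

WORD_RE = re.compile(r"[A-Za-z][A-Za-z-]*")

# Hypothetical stand-ins: a real run would load a full English word list
# and use a lexicon (for example WordNet) to decide which words are nouns.
DICTIONARY = {"the", "an", "builds", "uses", "search", "engine", "index"}
NOUNS = {"search", "engine", "index"}


def tokenize(text):
    """Return (lowercased word, character offset) pairs for a text."""
    return [(m.group().lower(), m.start()) for m in WORD_RE.finditer(text)]


def select_themes(docs, top_n=10):
    """Pick the most frequent words that are nouns or assumed names."""
    counts = Counter(w for text in docs.values() for w, _ in tokenize(text))
    themes = []
    for word, _ in counts.most_common():
        is_name = word not in DICTIONARY  # not in the dictionary: treat as a name
        if word in NOUNS or is_name:
            themes.append(word)
        if len(themes) == top_n:
            break
    return themes


def map_themes(docs, themes):
    """Build theme -> documents and document -> theme positions."""
    theme_set = set(themes)
    theme_docs = {t: [] for t in themes}
    doc_positions = {}
    for doc_id, text in docs.items():
        positions = {}
        for word, offset in tokenize(text):
            if word in theme_set:
                positions.setdefault(word, []).append(offset)
        doc_positions[doc_id] = positions
        for word in positions:
            theme_docs[word].append(doc_id)
    return theme_docs, doc_positions


def mark_up(text, themes):
    """Return HTML with each theme word linked to its (assumed) theme page."""
    theme_set = set(themes)
    parts, last = [], 0
    for m in WORD_RE.finditer(text):
        word = m.group().lower()
        if word in theme_set:
            parts.append(html.escape(text[last:m.start()]))
            parts.append('<a href="themes/%s.html">%s</a>'
                         % (quote(word), html.escape(m.group())))
            last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Given two small documents, `select_themes` returns the candidate themes, `map_themes` gives the document list per theme and the positions per document, and `mark_up` produces the hyperlinked text. The master index would then be a page listing every key of `theme_docs`.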
The categorization engine was developed with the C API Toolkit for SWISH-E.
The input repository is processed in phases, and a separate program was developed
for each phase. The results of each phase are stored in flat-file databases, so the
output files of one phase are the input files of the phases that follow. Processing
proceeds as follows:
- Index data using SWISH-E
- Compile a dump file with all indexed words
- Read index dump files, identifying nouns and names using WordNet
- Search for each item in the list and identify additional themes
- Process additional themes
- Produce marked-up text
- Publish
- Add or delete themes
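
The phased design, in which each program reads the flat file written by the previous one, might look like the following Python sketch. It is an illustration under stated assumptions, not the project's implementation: the tab-separated file layout, the file names `words.tsv` and `themes.tsv`, the `min_count` threshold, and the function names are all hypothetical, and plain word counting stands in for the SWISH-E index dump.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(path, rows):
    """Write rows to a tab-separated flat file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


def read_rows(path):
    """Read all rows from a tab-separated flat file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))


def phase_dump_words(docs, out_path):
    """Phase 1 (stand-in for the index dump): write word counts."""
    counts = {}
    for text in docs.values():
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    write_rows(out_path, sorted(counts.items()))


def phase_select_themes(in_path, out_path, min_count=2):
    """Phase 2: keep words that occur at least min_count times."""
    rows = [(w, c) for w, c in read_rows(in_path) if int(c) >= min_count]
    write_rows(out_path, rows)


def run_pipeline(docs, workdir=None):
    """Run the phases in order; each reads the previous phase's file."""
    workdir = Path(workdir or tempfile.mkdtemp())
    words_file = workdir / "words.tsv"
    themes_file = workdir / "themes.tsv"
    phase_dump_words(docs, words_file)
    phase_select_themes(words_file, themes_file)
    return [word for word, _ in read_rows(themes_file)]
```

Because every phase communicates only through files, a single phase can be rerun or replaced without touching the others, and the intermediate files can be inspected with ordinary text tools.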
Most modules use a command line interface (CLI). There is a web-based wizard for
editing and creating configuration files. The code has been tested on the
following platforms:
- Windows 2000, compiled with Visual C++ 6.0.
- Red Hat Linux 8.0, compiled with GCC 3.2.
Work to compile the Windows edition with MinGW is under way.