Knowledge Mapper

Description of software modules

The following steps can be taken to obtain a list of themes for a collection of documents:

Index the documents with a search engine, and write a dump of all the words in the index.
Read and analyse the contents of the index to get a list of the words or phrases that are used most often.
To keep the list short, only nouns and names are added to it. Any word not found in a standard English dictionary is assumed to be a name.
Query the index for each word or phrase in the compiled list.
Store the list of documents returned for each query.
For each document, build a list of theme words or phrases that occur inside it and the positions where they occur.
Recreate a markup version of each document with the theme words and phrases hyperlinked to the list of documents for that theme.
Finally, publish a master index list.
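
The steps above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's code: the in-memory document set stands in for the search engine index, the tiny DICTIONARY table is a hypothetical stand-in for a standard English dictionary, and the function names candidate_themes and map_themes are invented for this example.

```python
import re
from collections import Counter

# Hypothetical stand-in for a standard English dictionary with
# part-of-speech tags. Words missing from it are treated as names.
DICTIONARY = {
    "the": "det", "a": "det", "of": "prep", "in": "prep", "on": "prep",
    "is": "verb", "runs": "verb", "builds": "verb", "fast": "adj",
    "engine": "noun", "index": "noun", "search": "noun",
}

def tokenize(text):
    """Return (lower-cased word, character offset) pairs for each word."""
    return [(m.group().lower(), m.start())
            for m in re.finditer(r"[A-Za-z][A-Za-z-]*", text)]

def candidate_themes(docs, top_n=5):
    """Pick the most frequent words that are nouns or presumed names."""
    counts = Counter(w for text in docs.values() for w, _ in tokenize(text))
    keep = [w for w, _ in counts.most_common()
            if DICTIONARY.get(w, "name") in ("noun", "name")]
    return keep[:top_n]

def map_themes(docs, themes):
    """Build theme -> documents and document -> theme positions."""
    theme_docs = {t: [] for t in themes}
    doc_themes = {name: {} for name in docs}
    for name, text in docs.items():
        for word, pos in tokenize(text):
            if word in theme_docs:
                if name not in theme_docs[word]:
                    theme_docs[word].append(name)
                doc_themes[name].setdefault(word, []).append(pos)
    return theme_docs, doc_themes

if __name__ == "__main__":
    docs = {
        "a.txt": "The search engine runs on SWISH-E. The index is fast.",
        "b.txt": "SWISH-E builds the index of the search engine.",
    }
    themes = candidate_themes(docs)
    theme_docs, doc_themes = map_themes(docs, themes)
    print(themes)
    print(theme_docs)
    print(doc_themes)
```

Note how the dictionary heuristic behaves: "swish-e" is kept as a theme only because it is absent from the dictionary, so any misspelled or unlisted word would be kept in the same way.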

Development of the categorization engine was done with the C API Toolkit for SWISH-E.

The input repository is processed in phases, and a separate program was developed for each phase. The results of each phase are stored in flat file databases, so the output files of one phase are the input files for the phases that follow. Processing proceeds as follows:

Index data using SWISH-E
Compile a dump file with all indexed words
Read index dump files, identifying nouns and names using WordNet
Search for each item in the list and identify additional themes
Process additional themes
Produce marked-up text
Publish
Add or delete themes
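
As a rough illustration of how phases can be chained through flat files, the sketch below has one phase write a theme list to a file and a later phase read it back to produce marked-up text. It is written in Python rather than the project's C, and everything specific in it is an assumption: the tab-separated record layout, the themes/NAME.html link scheme, and the function names are hypothetical and are not taken from the software.

```python
import csv
import html
import re
from urllib.parse import quote

def write_theme_file(path, theme_docs):
    """Earlier phase: store one 'theme<TAB>doc1,doc2,...' record per line.

    Assumed layout; document names must not contain commas or tabs.
    """
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        for theme, names in sorted(theme_docs.items()):
            writer.writerow([theme, ",".join(names)])

def read_theme_file(path):
    """Later phase: load the records written by write_theme_file."""
    with open(path, newline="", encoding="utf-8") as fh:
        return {row[0]: row[1].split(",")
                for row in csv.reader(fh, delimiter="\t") if row}

def mark_up(text, themes):
    """Return HTML with each theme word linked to a per-theme page."""
    if not themes:
        return html.escape(text)
    # Try longer themes first so phrases win over their sub-words.
    ordered = sorted(themes, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    out, last = [], 0
    for m in pattern.finditer(text):
        out.append(html.escape(text[last:m.start()]))
        href = "themes/" + quote(m.group().lower()) + ".html"
        out.append('<a href="%s">%s</a>' % (href, html.escape(m.group())))
        last = m.end()
    out.append(html.escape(text[last:]))
    return "".join(out)

if __name__ == "__main__":
    write_theme_file("themes.tsv", {"swish-e": ["a.txt", "b.txt"],
                                    "engine": ["a.txt"]})
    themes = read_theme_file("themes.tsv")
    print(mark_up("SWISH-E is an engine.", themes))
```

Keeping each phase's results in a plain file means a phase can be rerun or inspected on its own, which fits the one-program-per-phase design described above.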

Most modules use a command line interface (CLI). There is a web-based wizard for editing and creating configuration files. The code has been tested on the following platforms:

Windows 2000, compiled with Visual C++ 6.0.
Red Hat Linux 8.0, compiled with GCC 3.2.

Work to compile the Windows edition with MinGW is under way.