Associative Search
Associative search is a type of information retrieval
based on the similarity between documents.
Associative search is useful when the user's information
need is not clearly expressed by one or several keywords,
but the user has some texts of interest to use as search keys
(see Clipboard Association).
Associative search is also useful as a feedback tool.
If you find interesting items in the search results, then
associative search with these items as search keys may bring
you more related items which were not previously retrieved.
(see Interaction in
Basic.)
Associative search in DualNAVI consists of the following
steps:
- Extraction of characterising words from the selected
documents.
- The default number of characterising words is 200.
- This value is set with the
Number of Internal Words of PE in the
Advanced section of the Option window.
- For each word (w) that appears at least once in the
selected documents, a score is calculated as
score(w) = tf(w) / TF(w), where
tf(w) and TF(w) are the term frequencies of w
in the selected documents and in the whole database,
respectively.
The words are then ranked by this score, and the top-ranked words,
up to the number given above, are selected.
- These extracted words are used as a query,
and the relevance of each document (d) in the target database
to this query (q) is calculated by the following formula:
(where DF(w) is the number of documents containing w,
and N is the total number of documents in the database.)
- The results are sorted by this similarity and the
top-scoring documents are returned as the search results.
- The number of search results is set with the
Number of Articles in the
Basic section of the Option window.
(See customization.)
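
The word-extraction step above can be sketched in a few lines of Python. This is only an illustration of the score tf(w) / TF(w), not DualNAVI's actual implementation: the function name characterising_words, the representation of documents as lists of words, and the alphabetical tie-breaking are all assumptions made for the example. It also assumes that the selected documents belong to the database, so TF(w) is never zero.

```python
from collections import Counter

def characterising_words(selected_docs, all_docs, limit=200):
    """Rank words by score(w) = tf(w) / TF(w) and keep the top `limit`.

    Hypothetical helper: documents are plain lists of words, and the
    selected documents are assumed to be part of `all_docs`.
    """
    # tf(w): term frequency of w within the selected documents
    tf = Counter(w for doc in selected_docs for w in doc)
    # TF(w): term frequency of w across the whole database
    TF = Counter(w for doc in all_docs for w in doc)
    scores = {w: tf[w] / TF[w] for w in tf}
    # Highest score first; ties broken alphabetically (an assumption)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:limit]

# Toy database of four already-tokenised documents (made up for the example)
database = [
    ["search", "engine", "index"],
    ["search", "query", "index"],
    ["graph", "topic", "search"],
    ["graph", "topic", "word"],
]
selected = database[2:]  # the user selected the last two documents
top = characterising_words(selected, database, limit=3)
print(top)
```

In this toy data, "search" appears in the selected documents but is also common in the rest of the database, so it scores lower than words that occur only in the selected documents.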
Topic Word Graph
Topic word graphs summarize the search results and
suggest suitable words for further searching.
- The words composing a topic word graph are those
characterizing the retrieved documents.
- The score for selecting topic words is given by
df(w) / DF(w), where df(w) is the number of
retrieved documents containing w and DF(w) is the total
number of documents (in the database) containing w.
- In order to make a balanced selection of common words
and specific words, all the candidate words (those appearing
in at least one retrieved document) are classified into
several frequency classes, and an appropriate number of topic
words is selected from each frequency class.
- A link (an edge) between two words means that the two
words are strongly related, that is,
they co-appear in many documents in the retrieval results.
- Each topic word X is linked to the topic word Y
that maximizes the co-occurrence strength df(X & Y) / df(Y)
with X, among those having a higher document
frequency than X. Here df(X & Y) means the number of
retrieved documents which contain both X and Y.
- The length of a link does not mean anything.
(Although it may be natural to expect that a shorter link
means a stronger relation.)
- As for the two-dimensional arrangement of topic words,
the vertical coordinate represents the (document) frequency of
each word in the retrieved documents,
whereas the horizontal coordinate has no meaning.
High frequency words (common words) are placed in the
upper part, and low frequency words (specific words) are placed
in the lower part.
Therefore, this graph can be considered as a hierarchy of topics
in the search results.
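
The topic word score and the linking rule described above can be illustrated with a short Python sketch. It is not DualNAVI's implementation: the names topic_word_scores and topic_links are hypothetical, documents are assumed to be lists of words, the selection by frequency class is left out because the classes are not specified here, and two choices are assumptions of the example (a word with no co-occurring higher-frequency word gets no link, and ties go to the first candidate in list order).

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each word, the number of documents containing it."""
    return Counter(w for doc in docs for w in set(doc))

def topic_word_scores(retrieved_docs, all_docs):
    """Score each candidate word by df(w) / DF(w)."""
    df = document_frequency(retrieved_docs)   # within the retrieved documents
    DF = document_frequency(all_docs)         # within the whole database
    return {w: df[w] / DF[w] for w in df}

def topic_links(retrieved_docs, topic_words):
    """Link each topic word X to the higher-frequency topic word Y
    that maximises df(X & Y) / df(Y)."""
    doc_sets = [set(doc) for doc in retrieved_docs]
    df = document_frequency(retrieved_docs)
    links = {}
    for x in topic_words:
        best, best_strength = None, 0.0
        for y in topic_words:
            if df[y] <= df[x]:
                continue  # Y must have a higher document frequency than X
            # df(X & Y): retrieved documents containing both words
            co = sum(1 for doc in doc_sets if x in doc and y in doc)
            strength = co / df[y]
            if strength > best_strength:  # strict: first candidate wins ties
                best, best_strength = y, strength
        if best is not None:
            links[x] = best
    return links

# Toy data (made up): four retrieved documents inside a six-document database
retrieved = [
    ["search", "engine", "index"],
    ["search", "engine", "query"],
    ["search", "graph", "topic"],
    ["search", "graph"],
]
database = retrieved + [["engine", "index", "query"], ["topic", "word", "index"]]

scores = topic_word_scores(retrieved, database)
links = topic_links(retrieved, ["search", "engine", "graph", "index", "topic"])
print(links)
```

Because every link points from a word to one with a higher document frequency, following the links from any word leads upward toward the most common words, which matches the reading of the graph as a hierarchy of topics.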