Build "bag of words" vectors for each document, optionally drop irrelevant dimensions, and find the n (5..10) most similar documents for each document.
Output: a short, automatically generated "see also" list for each page, either on the page itself or in the border.
There are two ways to achieve this: with or without clustering.
Without clustering
Brute force: for each topic, find the n nearest topics (by comparing the distances between their vectors), and append a line of the form
SeeAlso: NearestTopic1, NearestTopic2, ... NearestTopicn
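The brute-force variant can be sketched in a few lines of Python. This is only an illustration under assumptions: the notes do not fix a distance measure, so cosine similarity over raw word counts is used here, and the sample topics and function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def see_also_lines(topics, n=5):
    """Brute force: compare every topic with every other topic."""
    vectors = {name: bag_of_words(text) for name, text in topics.items()}
    lines = {}
    for name, vec in vectors.items():
        others = [o for o in vectors if o != name]
        # Most similar first; ties are broken by topic name.
        others.sort(key=lambda o: (-cosine_similarity(vec, vectors[o]), o))
        lines[name] = "SeeAlso: " + ", ".join(others[:n])
    return lines


# Hypothetical sample topics, for illustration only.
TOPICS = {
    "InstallingFlexWiki": "installing flexwiki on a web server",
    "ConfiguringFlexWiki": "configuring flexwiki after installing",
    "WikiTalkBasics": "wikitalk script language basics",
    "WikiTalkExamples": "wikitalk script examples",
}

for line in see_also_lines(TOPICS, n=2).values():
    print(line)
```

The number of comparisons grows quadratically with the number of topics, which is the main reason to consider the clustering variant.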
With clustering
Run a bottom-up clustering and stop when the clusters reach size n. Calculate the medoid of each cluster (the topic nearest to the cluster's center of mass). For each document, append a property line of the form
:Medoid: TheMedoidTopic
In the NormalBorders, write a clever WikiTalk script that generates the SeeAlso list. For performance reasons, an index topic may be needed that contains
TheMedoidTopic: ClusterMemberTopic1, ClusterMemberTopic2, ... ClusterMemberTopicn
lines.]
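A rough sketch of the clustering variant follows. It assumes dense numeric topic vectors and Euclidean distance between cluster centers, and it reads "n-sized" as "at most n members", so some clusters may end up smaller. All names are hypothetical, and the medoid follows the definition used above (the member nearest to the center).

```python
import math


def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_with_medoids(topics, n):
    """Greedy bottom-up merging; a merge is allowed only if the result has at most n members."""
    clusters = [[name] for name in sorted(topics)]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i]) + len(clusters[j]) > n:
                    continue
                d = math.dist(
                    centroid([topics[t] for t in clusters[i]]),
                    centroid([topics[t] for t in clusters[j]]),
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None:
            break  # no merge left that respects the size limit
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    result = {}
    for members in clusters:
        center = centroid([topics[t] for t in members])
        # Medoid: the member closest to the center; ties are broken by name.
        medoid = min(members, key=lambda t: (math.dist(topics[t], center), t))
        result[medoid] = sorted(members)
    return result


def index_lines(clusters):
    """One index line per cluster: the medoid, then all cluster members."""
    return [f"{medoid}: {', '.join(members)}" for medoid, members in sorted(clusters.items())]


# Hypothetical two-dimensional topic vectors, for illustration only.
VECTORS = {
    "A": [0, 0], "B": [0, 1], "C": [0, 2],
    "D": [10, 0], "E": [10, 1], "F": [10, 2],
}

for line in index_lines(cluster_with_medoids(VECTORS, n=3)):
    print(line)
```

With such an index topic, generating the list for a page only requires looking up its medoid line instead of comparing it with every other topic.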
Topic tree - "Sitemap"
Build a hierarchical clustering of the topics and find names for the clusters.
Output: a topic (maybe a tree of topics) with a full, hierarchical table of contents for the namespace.
Hierarchical clustering
Find a good algorithm from the net.
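One candidate is plain agglomerative clustering with average linkage. The sketch below is only meant to show the shape of the result (a binary tree of topics). It assumes dense numeric vectors and Euclidean distance, uses hypothetical names throughout, and is far too slow for a large namespace.

```python
import math


def leaves(tree):
    """Flatten a nested-tuple tree into a list of topic names."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def average_linkage(a, b, vectors):
    """Mean pairwise Euclidean distance between the leaves of two subtrees."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(math.dist(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)


def build_tree(vectors):
    """Repeatedly merge the two closest subtrees until a single tree remains."""
    forest = sorted(vectors)
    while len(forest) > 1:
        _, i, j = min(
            (average_linkage(forest[i], forest[j], vectors), i, j)
            for i in range(len(forest))
            for j in range(i + 1, len(forest))
        )
        merged = (forest[i], forest[j])
        forest = [t for k, t in enumerate(forest) if k not in (i, j)] + [merged]
    return forest[0]


def outline(tree, depth=0):
    """Render the tree as indented lines; inner nodes get a placeholder name."""
    if isinstance(tree, str):
        return ["  " * depth + tree]
    lines = ["  " * depth + "(cluster)"]
    for child in tree:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical two-dimensional topic vectors, for illustration only.
VECTORS = {"A": [0, 0], "B": [0, 1], "C": [5, 0], "D": [5, 2]}

print("\n".join(outline(build_tree(VECTORS))))
```

The "(cluster)" placeholders are where generated cluster names would go.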
Finding names
This is easier in FlexWiki than in text mining in general, because the topic name and the summary contain words that strongly describe the content. To generate a cluster's name:
- Get the frequency of the words in the names and summaries of the topics in the cluster
- Subtract the frequency of these words in the names and summaries of all topics
- Sort the words by the result; the first 1-3 words give the cluster name.
- Maybe add grammatical analysis, e.g. include the most frequent verb too
- Easy solution: take the most frequent word that ends in "ing", plus the most frequent word. I expect results like: "installing FlexWiki"
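The naming steps above can be sketched as follows. One assumption is made explicit: the frequencies are relative (each word count divided by the total word count), because subtracting raw counts of the whole namespace from raw counts of a subset could never give a positive score. The input strings stand for "name plus summary" of each topic, and all names and sample data are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphabetic words of a text."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequencies(texts):
    """Word frequencies over all texts, normalized so that they sum to 1."""
    counts = Counter(w for t in texts for w in tokens(t))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cluster_name(cluster_texts, all_texts, max_words=3):
    """Words that are over-represented in the cluster, best first."""
    inside = relative_frequencies(cluster_texts)
    overall = relative_frequencies(all_texts)
    scores = {w: f - overall.get(w, 0.0) for w, f in inside.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:max_words]


def easy_name(cluster_texts):
    """Easy solution: most frequent word ending in 'ing' plus the most frequent other word."""
    counts = Counter(w for t in cluster_texts for w in tokens(t))
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    gerund = next((w for w in ranked if w.endswith("ing")), None)
    other = next((w for w in ranked if w != gerund), None)
    return " ".join(w for w in (gerund, other) if w)


# Hypothetical name-plus-summary strings, for illustration only.
ALL_TEXTS = [
    "installing flexwiki server",
    "installing flexwiki database",
    "wikitalk script basics",
    "wikitalk script examples",
]
CLUSTER = ALL_TEXTS[:2]

print(cluster_name(CLUSTER, ALL_TEXTS, max_words=2))
print(easy_name(CLUSTER))
```

The "ing" rule is crude: it also matches nouns such as "string", so it only works where topic names follow the "doing something" pattern.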
Search engine
A sophisticated search engine with
- inverted index (speed)
- (field-based) relevancy calculation that gives topic names, summaries, keywords and headings a higher weight
- ranking that takes links into account (PageRank, or something similar)
This is the part I don't want to implement, because it would require me to modify the FlexWiki source and also to find a good way to store the indexes...
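For reference, the three ingredients are small when kept in memory. The sketch below is not tied to FlexWiki in any way: the field names, the weights and the sample data are all assumptions, persistence of the index is ignored, and how to combine the link rank with the relevancy score is left open.

```python
import re
from collections import defaultdict

# Hypothetical field weights: names, summaries, keywords and headings count more than body text.
FIELD_WEIGHTS = {"name": 5.0, "summary": 3.0, "keywords": 3.0, "headings": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase alphabetic words of a text."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(topics):
    """Inverted index: word -> {topic: field-weighted term count}."""
    index = defaultdict(lambda: defaultdict(float))
    for topic, fields in topics.items():
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for word in tokenize(text):
                index[word][topic] += weight
    return index


def search(index, query):
    """Topics matching any query word, highest weighted score first."""
    scores = defaultdict(float)
    for word in tokenize(query):
        for topic, weight in index.get(word, {}).items():
            scores[topic] += weight
    return sorted(scores, key=lambda t: (-scores[t], t))


def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over {topic: [linked topics]}; topics without links spread rank evenly."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {t: 1.0 / len(nodes) for t in nodes}
    for _ in range(iterations):
        new = {t: (1.0 - damping) / len(nodes) for t in nodes}
        for t in nodes:
            targets = links.get(t) or nodes
            share = damping * rank[t] / len(targets)
            for target in targets:
                new[target] += share
        rank = new
    return rank


# Hypothetical topics and links, for illustration only.
TOPICS = {
    "InstallingFlexWiki": {"name": "installing flexwiki", "body": "copy the files to the server"},
    "ServerSetup": {"name": "server setup", "body": "installing the web server"},
}
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"]}

INDEX = build_index(TOPICS)
print(search(INDEX, "installing"))
print(pagerank(LINKS))
```

A word in a topic name outweighs the same word in body text, so a query for "installing" lists InstallingFlexWiki before ServerSetup.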
Marcell Szabó, student in computer sciences at bme.hu
1/24/2008 7:54:15 AM - FLWCOM-jwdavidson