A simple binary categorization: spam/not spam, similar to the e-mail spam filters. The process:
training
input: training corpus: many spam and many not spam documents
generating the input: I'll take the flexwiki.com files. A topic version (.awiki) that is succeeded by a version containing "delete" must be spam. Others are not, especially if they have many, but not too many, versions (playpages have too many versions).
algorithm: I'll try some well-known clustering algorithms that use a "bag of words", and maybe somehow take into account the number of references to the page (heuristically: spam typically has 0 or 1 references).
output: a well-trained clustering algorithm
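The labeling rule for the training corpus could be sketched roughly like this. It is only a sketch under assumptions: the function name `label_versions`, the version thresholds, and the idea of leaving uncertain versions unlabeled (`None`) are mine, and reading the .awiki files from disk is left out.

```python
from typing import List, Optional


def label_versions(versions: List[str],
                   min_versions: int = 3,
                   max_versions: int = 50) -> List[Optional[str]]:
    """Label each version text of one topic as 'spam', 'ham', or None (unsure).

    `versions` is ordered oldest first. The thresholds are illustrative
    guesses, not values measured on flexwiki.com.
    """
    # Assumption: a topic with "many, but not too many" versions is trusted.
    trusted = min_versions <= len(versions) <= max_versions
    labels: List[Optional[str]] = []
    for i, text in enumerate(versions):
        followed_by_delete = (
            i + 1 < len(versions) and "delete" in versions[i + 1].lower()
        )
        if followed_by_delete:
            labels.append("spam")      # somebody deleted this version
        elif "delete" in text.lower():
            labels.append(None)        # the deletion marker is not training data
        elif trusted:
            labels.append("ham")
        else:
            labels.append(None)        # too few or too many versions to be sure
    return labels


history = [
    "Welcome to the install guide",
    "Install guide with fixes",
    "BUY CHEAP PILLS",
    "delete",
    "Install guide restored",
]
print(label_versions(history))  # ['ham', 'ham', 'spam', None, 'ham']
```

Leaving some versions unlabeled is a cautious reading of "others are not"; labeling every remaining version as not spam would also match the rule as written.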
running the clustering algorithm - 1
input: every topic on the wiki
output: a new topic that contains the names of the "potential spam" topics
running: as scheduled task
running the clustering algorithm - 2
running: in the newsletter and rss feed generator
input: a new or modified topic on the wiki
output: decision whether it is spam or not, putting it in the feed as a tip, maybe generating a "potential spam" feed.
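The train-then-decide flow above could look something like the sketch below. Note that I swap in a multinomial naive Bayes classifier over a bag of words here, which is the usual e-mail spam filter technique, not a clustering algorithm; the class name `NaiveBayesSpamFilter` and the sample documents are made up, and the reference-count heuristic is not included.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase bag-of-words tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical two-class (spam/ham) naive Bayes over word counts."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {"spam": Counter(), "ham": Counter()}
        self.doc_counts: Counter = Counter()

    def train(self, labeled_docs: Iterable[Tuple[str, str]]) -> None:
        for text, label in labeled_docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))

    def spam_log_odds(self, text: str) -> float:
        """log P(spam | text) - log P(ham | text), with add-one smoothing."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        score = 0.0
        for label, sign in (("spam", 1.0), ("ham", -1.0)):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in words:
                # One extra vocabulary slot stands in for unseen words.
                log_p += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            score += sign * log_p
        return score

    def is_spam(self, text: str) -> bool:
        return self.spam_log_odds(text) > 0.0


spam_filter = NaiveBayesSpamFilter()
spam_filter.train([
    ("buy cheap pills online now", "spam"),
    ("cheap casino pills", "spam"),
    ("installing FlexWiki on a server", "ham"),
    ("FlexWiki topic formatting rules", "ham"),
])
print(spam_filter.is_spam("cheap pills"))          # True
print(spam_filter.is_spam("installing FlexWiki"))  # False
```

The scheduled task would call `is_spam` on every topic, and the feed generator on each new or modified topic; the log-odds value could also be shown in the feed as a "how suspicious" tip.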
Building a hierarchical clustering of the topics, and finding names for the clusters.
Output: a topic (maybe a tree of topics) with a full, hierarchical table of contents for the namespace.
Hierarchical clustering
Find some good algorithm from the net.
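One candidate is plain bottom-up (agglomerative) clustering: start with one cluster per topic and keep merging the two most similar clusters. The sketch below is only an illustration, assuming cosine similarity on summed word counts (a centroid-style linkage); the function names and the sample topics are made up, and the quadratic search is fine only for a small namespace.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple, Union

# A tree is either a topic name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(topics: Dict[str, str]) -> Tree:
    """Merge the two most similar clusters until a single tree remains."""
    if not topics:
        raise ValueError("need at least one topic")
    # Each cluster is (tree, summed bag of words of its topics).
    clusters: List[Tuple[Tree, Counter]] = [
        (name, bag_of_words(text)) for name, text in topics.items()
    ]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


topics = {
    "InstallGuide": "installing flexwiki on windows server",
    "InstallFaq": "installing flexwiki server problems",
    "WikiTalkIntro": "wikitalk language dynamic content",
    "WikiTalkExamples": "wikitalk language examples content",
}
print(cluster(topics))
# (('WikiTalkIntro', 'WikiTalkExamples'), ('InstallGuide', 'InstallFaq'))
```

The nested pairs map directly onto a tree of topics, so the table of contents can be generated by walking the result.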
Finding names
This is easier in FlexWiki than generally in text mining, because the topic name and the summary contain words that strongly describe the content. So generating a cluster's name:
Get the frequency of the words in the names and summaries for topics in the cluster
Subtract the frequency of these words in the names and summaries for every topic
Sort them, the first 1-3 words will give the cluster name.
Maybe do some grammatical analysis, e.g. include the most frequent verb too
Easy solution: get the most frequent word that ends with "ing", and the most frequent word. I expect results like: "installing FlexWiki"
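The naming steps above, plus the easy "ing" solution, could be sketched as follows. Some assumptions of mine: frequencies are relative (counts divided by the total number of words) so that the subtraction is meaningful, words are lowercased, and in the easy variant the second word is the most frequent word that does not end with "ing". The function names and sample texts are made up.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def word_freq(texts: Iterable[str]) -> Counter:
    """Count words over topic names and summaries."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def relative(counts: Counter) -> Dict[str, float]:
    total = sum(counts.values()) or 1
    return {word: count / total for word, count in counts.items()}


def cluster_name(cluster_texts: List[str], all_texts: List[str],
                 max_words: int = 3) -> List[str]:
    """Words that are more frequent in the cluster than in the whole wiki."""
    in_cluster = relative(word_freq(cluster_texts))
    overall = relative(word_freq(all_texts))
    scored = {w: f - overall.get(w, 0.0) for w, f in in_cluster.items()}
    # Highest difference first; ties are broken alphabetically.
    ranked = sorted(scored, key=lambda w: (-scored[w], w))
    return [w for w in ranked[:max_words] if scored[w] > 0]


def easy_name(cluster_texts: List[str]) -> str:
    """Most frequent '-ing' word followed by the most frequent other word."""
    ranked = [word for word, _ in word_freq(cluster_texts).most_common()]
    ing = [w for w in ranked if w.endswith("ing")]
    others = [w for w in ranked if not w.endswith("ing")]
    return " ".join(ing[:1] + others[:1])


cluster_texts = [
    "Installing FlexWiki",
    "Installing FlexWiki on IIS",
    "FlexWiki installing problems",
]
all_texts = cluster_texts + [
    "WikiTalk language reference",
    "FlexWiki formatting rules",
    "WikiTalk examples",
]
print(cluster_name(cluster_texts, all_texts, max_words=2))  # ['installing', 'flexwiki']
print(easy_name(cluster_texts))                             # installing flexwiki
```

Because of the lowercasing, the result is "installing flexwiki" and not "installing FlexWiki"; keeping the original spelling of each word would need one more lookup table.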
Not so devastating... Click on the boxes to open/close. The words in the boxes are the most frequent words of the cluster. The topics of the cluster are displayed on the right side.
4. Search engine
A sophisticated search engine with
inverse index (speed)
(field-based) relevancy calculation that weights topic names, summaries, keywords, and headings with a greater coefficient than body text
ranking that takes links into account (pagerank, or something similar)
This is what I don't want to implement, because it would require me to modify the FlexWiki source, and also to find a good way to store the indexes...
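Just to show what the first two points mean, here is a toy in-memory sketch of an inverted index with field weighting. It has nothing to do with the FlexWiki source: the class name, the field names, and the weights are invented for the example, nothing is stored on disk, and link-based ranking is not included.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Illustrative coefficients: a hit in the name counts more than one in the body.
FIELD_WEIGHTS = {"name": 3.0, "summary": 2.0, "body": 1.0}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each word to the topics containing it, with a weighted count."""

    def __init__(self) -> None:
        # word -> topic name -> accumulated field-weighted count
        self.postings: Dict[str, Dict[str, float]] = defaultdict(
            lambda: defaultdict(float)
        )

    def add_topic(self, name: str, fields: Dict[str, str]) -> None:
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for word in tokenize(text):
                self.postings[word][name] += weight

    def search(self, query: str) -> List[Tuple[str, float]]:
        """Return (topic, score) pairs, best first; only query words are visited."""
        scores: Dict[str, float] = defaultdict(float)
        for word in tokenize(query):
            for topic, weight in self.postings.get(word, {}).items():
                scores[topic] += weight
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = InvertedIndex()
index.add_topic("InstallGuide", {
    "name": "Install Guide",
    "summary": "How to install FlexWiki",
    "body": "Run the setup.",
})
index.add_topic("SpamNotes", {
    "name": "Spam Notes",
    "summary": "Notes on spam",
    "body": "Some people install filters.",
})
print(index.search("install"))  # [('InstallGuide', 5.0), ('SpamNotes', 1.0)]
```

The speed comes from looking up only the query words instead of scanning every topic; a link-based score could later be multiplied into the result of `search`.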
Marcell Szabó, student in computer sciences at bme.hu
1/24/2008 7:54:15 AM - FLWCOM-jwdavidson
The software running this site. -> jump to HomePage
10/22/2006 7:52:17 AM - -81.182.199.248