Text Mining Project
Summary
SzaMa's project for enhancing FlexWiki with text mining algorithms

I'm a student attending a lecture course on text mining. If I implement something useful, I don't have to take the exam and automatically get the best grade. So I want to implement some text mining solutions to enhance FlexWiki, which is an exciting thing and sounds useful anyway. The deadline is 2007. Jan. 24. I'd appreciate any help and/or suggestions.

-- SzaMa - 2006.12.17.

I don't have to write the algorithms from scratch; I can use any existing source code or library. I'm expected to set up 2 useful scenarios.

TextMiningProjectBlog

Scenarios

1. Spam-crap filter

A simple binary categorization: spam/not spam, similar to the e-mail spam filters. The process:
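One way such a filter could work, as a minimal sketch: a naive Bayes classifier over word counts, the technique commonly used by e-mail spam filters. The choice of naive Bayes, the class name `NaiveBayesSpamFilter`, the labels `"spam"`/`"ham"`, and the training texts are all assumptions made for illustration; nothing here is part of FlexWiki.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical two-class filter; train it on both labels before use."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def score(self, text, label):
        """Log-probability of the label given the text (up to a constant)."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[label].values())
        log_prob = math.log(self.doc_counts[label] / total_docs)
        for word in tokenize(text):
            count = self.word_counts[label][word]
            # Add-one smoothing so unseen words do not zero out the product.
            log_prob += math.log((count + 1) / (total_words + len(vocab)))
        return log_prob

    def is_spam(self, text):
        return self.score(text, "spam") > self.score(text, "ham")


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap pills online now", "spam")
spam_filter.train("cheap casino online win money", "spam")
spam_filter.train("FlexWiki topic about wiki formatting rules", "ham")
spam_filter.train("how to configure the wiki namespace", "ham")
```

In a wiki setting the training data would presumably come from edits that were reverted as spam versus edits that were kept, but that is a guess about how the labels could be collected.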

2. Similar topics - "see also"

AboutSimilarTopics

OriginalPlan

Building "bag of words" vectors for each document, maybe delete irrelevant dimensions, and find the most similar n (5..10) documents for each document.
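A minimal sketch of the vector-building step, assuming plain word counts as the vector entries and a "word appears in too few documents" rule for deleting irrelevant dimensions. The stop-word list, the function names, and the `min_docs` threshold are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with"}


def bag_of_words(text):
    """Map a document to a sparse word-count vector."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def prune_rare_terms(vectors, min_docs=2):
    """Drop dimensions (words) that occur in fewer than min_docs documents."""
    doc_freq = Counter()
    for vec in vectors.values():
        doc_freq.update(vec.keys())
    return {
        name: Counter({w: c for w, c in vec.items() if doc_freq[w] >= min_docs})
        for name, vec in vectors.items()
    }


vectors = {
    "TopicA": bag_of_words("wiki spam filter"),
    "TopicB": bag_of_words("wiki search"),
}
pruned = prune_rare_terms(vectors)
```

Words that occur in only one document cannot contribute to any similarity between two documents, which is why dropping them is a cheap first reduction.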

Output : a short, automatic "see also" list for each page, either on the page, or on the border

Two ways to achieve this: with or without clustering

Without clustering

Brute force: for each topic, find the n nearest topics (comparing the vectors' distance), and append

  SeeAlso: NearestTopic1, NearestTopic2, ... NearestTopicn 
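The brute-force variant can be sketched as follows. Cosine similarity is used here as the closeness measure, which is an assumption (the plan only says "distance"); the topic names and counts are made up for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_topics(name, vectors, n=5):
    """Compare one topic against every other topic and keep the n closest."""
    others = [t for t in vectors if t != name]
    others.sort(key=lambda t: cosine_similarity(vectors[name], vectors[t]),
                reverse=True)
    return others[:n]


def see_also_line(name, vectors, n=5):
    """Format the property line that would be appended to the topic."""
    return "SeeAlso: " + ", ".join(nearest_topics(name, vectors, n))


vectors = {
    "SpamFilter": Counter(spam=3, filter=2, wiki=1),
    "AntiSpam": Counter(spam=2, filter=1),
    "TopicTree": Counter(topic=2, tree=2, wiki=1),
    "SiteMap": Counter(tree=1, topic=1, sitemap=2),
}
line = see_also_line("SpamFilter", vectors, n=2)
```

Every topic is compared with every other topic, so the cost grows quadratically with the number of topics in the namespace.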

With clustering

Run a bottom-up (agglomerative) clustering, and stop when the clusters reach size n. Calculate the medoid of each cluster (the topic nearest to the cluster's center of mass). For each document, append a

  :Medoid: TheMedoidTopic

In the NormalBorders, write a clever WikiTalk script that generates the SeeAlso list. For performance reasons there may be a need for an index topic that contains

  TheMedoidTopic: ClusterMemberTopic1, ClusterMemberTopic2, ... ClusterMemberTopicn

lines.
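The clustering variant might look like the sketch below. It assumes cosine distance between cluster centroids as the merge criterion and keeps merging the closest pair whose combined size does not exceed n; both choices, and all function names, are assumptions. The medoid follows the definition given above: the member nearest to the cluster's centroid.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity for sparse word-count vectors."""
    dot = sum(c * b.get(w, 0) for w, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)


def centroid(members, vectors):
    """Average vector of the cluster members."""
    total = Counter()
    for name in members:
        total.update(vectors[name])
    return {w: c / len(members) for w, c in total.items()}


def cluster_bottom_up(vectors, max_size):
    """Start with one cluster per topic; merge the closest pair that fits."""
    clusters = [[name] for name in vectors]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i]) + len(clusters[j]) > max_size:
                    continue  # merging would exceed the size limit
                d = cosine_distance(centroid(clusters[i], vectors),
                                    centroid(clusters[j], vectors))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None:
            return clusters  # no pair can be merged any more
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]


def medoid(members, vectors):
    """The member topic nearest to the cluster's centroid."""
    center = centroid(members, vectors)
    return min(members, key=lambda name: cosine_distance(vectors[name], center))


def medoid_properties(clusters, vectors):
    """Map each topic to the ':Medoid:' line that would be appended to it."""
    return {name: ":Medoid: " + medoid(members, vectors)
            for members in clusters for name in members}


vectors = {
    "SpamFilter": Counter(spam=3, filter=2, wiki=1),
    "AntiSpam": Counter(spam=2, filter=1),
    "TopicTree": Counter(topic=2, tree=2, wiki=1),
    "SiteMap": Counter(tree=1, topic=1, sitemap=2),
}
clusters = cluster_bottom_up(vectors, max_size=2)
properties = medoid_properties(clusters, vectors)
```

Note that this sketch has no distance threshold, so unrelated topics can still end up in the same cluster if they are the only ones left that fit the size limit.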

3. Topic tree - "Sitemap"

Building a hierarchical clustering of the topics, and finding names for the clusters.

Output : a topic (maybe a tree of topics) with a full, hierarchical table of contents for the namespace.

Hierarchical clustering

Find some good algorithm from the net.
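As one candidate, here is a minimal sketch of average-linkage agglomerative clustering that returns the whole merge tree as nested tuples. The algorithm choice is an assumption, not a decision made in this plan, and the distance table is made-up data; in practice the distances would come from the topic vectors.

```python
def average_linkage(members_a, members_b, distance):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(distance(a, b) for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def build_tree(names, distance):
    """Merge the two closest clusters until one tree remains.

    Each cluster is a pair (tree, members); a leaf tree is a topic name
    and an inner node is a 2-tuple of subtrees. Needs at least one name.
    """
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i][1], clusters[j][1], distance)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def outline(tree, depth=0):
    """Render the tree as indented lines, one topic per leaf."""
    if isinstance(tree, str):
        return ["  " * depth + tree]
    lines = ["  " * depth + "(cluster)"]
    for child in tree:
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up symmetric distances between three hypothetical topics.
TABLE = {frozenset(["A", "B"]): 0.1,
         frozenset(["A", "C"]): 0.9,
         frozenset(["B", "C"]): 0.8}


def table_distance(a, b):
    return TABLE[frozenset([a, b])]


tree = build_tree(["A", "B", "C"], table_distance)
```

The "(cluster)" placeholders in the outline are where generated cluster names would go.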

Finding names

This is easier in FlexWiki than generally in text mining, because the topic name and the summary contain words that strongly describe the content. So generating a cluster's name:
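A possible sketch of that idea: split each WikiWord topic name into its words, add the words of the summary, and name the cluster after the words shared by the most member topics. The scoring rule, the stop-word list, and the example topics and summaries are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "on", "to", "in", "for", "with"}


def split_wiki_word(topic_name):
    """'TextMiningProject' -> ['text', 'mining', 'project']"""
    return [w.lower() for w in re.findall(r"[A-Z][a-z0-9]*", topic_name)]


def name_cluster(summaries, top_k=2):
    """Name a cluster from its members' topic names and summaries.

    summaries maps topic name -> summary text. Each word is counted once
    per topic, so the result favors words shared by many members.
    """
    counts = Counter()
    for topic, summary in summaries.items():
        words = set(split_wiki_word(topic))
        words.update(re.findall(r"[a-z]+", summary.lower()))
        counts.update(words - STOP_WORDS)
    # Sort by count, then alphabetically, so ties are resolved predictably.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return " ".join(ranked[:top_k])


cluster_name = name_cluster({
    "SpamFilter": "Filtering spam edits",
    "SpamBlacklist": "List of banned spam links",
    "AntiSpamFilter": "Notes on the filter",
})
```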

4. Search engine

A sophisticated search engine with

This is the one I don't want to implement, because it would require modifying the FlexWiki source and finding a good way to store the indexes.
