I told my teacher that in the long run I want to implement all 4 scenarios, but that what I do in the short term depends on what he requires. He told me to implement 2 of my choice.
For the first one, I chose spam classification, because it seemed to be trivial... Well... Let's see what happened.
I asked for the full flexwiki.com repository on the FlexWikiMailingList, and CraigAndera was kind enough to give it to me within a few hours: a 50 MB zip file with 36 471 files, of which 1564 are actual wiki pages and the rest are older versions (including those of deleted topics).
I wrote an Excel macro to read in the filenames and determine which versions are spam and which are not:
If a version of a topic contains a single word, "delete", it means the topic was deleted, so the previous version was spam.
(1) I assume that if a topic was modified and not deleted, then it wasn't spam, so the version before the latest one is a non-spam sample.
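The two labeling rules above can be sketched in a few lines of Python instead of an Excel macro. This is only an illustration: the function name `label_versions` and the idea of passing in the version texts of one topic, oldest first, are assumptions; the original macro worked from the filenames, and how it ordered the versions is not described here.

```python
def is_delete_marker(text):
    # A version whose whole content is the single word "delete".
    return text.strip().lower() == "delete"


def label_versions(versions):
    """Label versions of one topic (hypothetical helper).

    versions: list of page texts, oldest first.
    Returns a list of (version_index, label) pairs.
    """
    labels = []
    # Rule: the version right before a "delete" marker is spam.
    for i in range(1, len(versions)):
        if is_delete_marker(versions[i]) and not is_delete_marker(versions[i - 1]):
            labels.append((i - 1, "spam"))
    # Assumption (1): if the topic was modified and is not deleted now,
    # the version before the latest one is a non-spam sample.
    last = len(versions) - 1
    if last >= 1 and not is_delete_marker(versions[last]) \
            and not is_delete_marker(versions[last - 1]):
        labels.append((last - 1, "notspam"))
    return labels


history = ["real content", "buy pills", "delete", "real again", "real again, edited"]
print(label_versions(history))
```

Note that the second rule takes its samples from pages that were merely edited, so any spam that was overwritten rather than deleted ends up in the non-spam set.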
So I got 2885 spam and 1258 non-spam samples. With another Excel macro, I copied these into a spam and a notspam folder.
And now, I thought, I'm almost done: find a classification tool, train it and enjoy the results. NOPE.
http://kt.ijs.si/Dunja/textgarden/ seemed to do the trick, but some of the links are broken, and where the links work, the program doesn't. No source code.
http://www.cs.cmu.edu/~mccallum/bow/ is another candidate, but "The library does not: Claim to be finished. Have good documentation. Claim to be bug-free." I spent a day trying to compile it under Cygwin, with no success.
SVMLight: http://svmlight.joachims.org/ Now this is a classifier, but an abstract one: it doesn't read text files, you have to give it the feature-value pairs. I didn't want to do this.
There is nice Javadoc, but no specification or how-to, so I spent more than a day finding out what it does and how to run it. I'm considering writing a short spec and how-to, and sending it back to the author...
LOL: TCT writes out a matrix to a text file, and because of my Hungarian locale, Java put , instead of . in the float values. When I wanted to use the file, TCT threw a "not valid float value" exception... So I had to replace them back. BUT! In a 2 MB text file, it took Notepad 2 seconds to replace a single occurrence. Even Notepad++ and Word died. So I had to write a script that did it line by line...
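A line-by-line fix like the one described above might look like this in Python. It is a sketch, not the script I actually used: the function names and the encoding are made up, and it assumes that a comma between two digits is always a decimal separator. If the matrix file also used commas to separate fields, this would corrupt it.

```python
import re

# A comma with a digit on both sides is treated as a decimal comma.
DECIMAL_COMMA = re.compile(r"(?<=\d),(?=\d)")


def fix_decimal_commas(line):
    """Turn '0,25' into '0.25' without touching anything else."""
    return DECIMAL_COMMA.sub(".", line)


def fix_file(src_path, dst_path, encoding="utf-8"):
    """Stream the file line by line, so its size does not matter."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write(fix_decimal_commas(line))


print(fix_decimal_commas("0,25 1,5e-3 3"))
```

Streaming the file avoids what killed the editors: nothing bigger than one line is ever held or rewritten in memory.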
Right now, I've parsed my training data with TCT and written a VBScript to convert the result into an SVMLight input file.
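For reference, the conversion can be sketched in Python as well. SVMLight expects one document per line: the target (+1 or -1), followed by feature:value pairs with feature numbers starting at 1 and in increasing order. The (doc_id, term_id, weight) triples used as input here are an assumed stand-in for the TCT matrix file, whose exact layout is not described above; both function names are hypothetical.

```python
from collections import defaultdict


def group_by_document(triples):
    """Collect (doc_id, term_id, weight) triples into one dict per document.

    The column order matters: mixing up doc_id and term_id still produces
    a file that looks valid, but it describes terms instead of documents.
    """
    docs = defaultdict(dict)
    for doc_id, term_id, weight in triples:
        docs[doc_id][term_id] = weight
    return docs


def to_svmlight_line(label, features):
    """label: +1 (spam) or -1 (non-spam); features: {term_id: weight}, ids >= 1."""
    parts = ["+1" if label > 0 else "-1"]
    for term_id in sorted(features):          # SVMLight wants increasing ids
        value = features[term_id]
        if value != 0:                        # zero entries are simply omitted
            parts.append(f"{term_id}:{value:.6g}")
    return " ".join(parts)


triples = [(0, 3, 0.5), (0, 1, 2.0), (1, 3, 1.0)]
docs = group_by_document(triples)
print(to_svmlight_line(+1, docs[0]))
print(to_svmlight_line(-1, docs[1]))
```

Python's float formatting always uses a dot as the decimal separator, so the locale problem with the commas cannot come back through this step.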
Before trying it, I looked at the files and realized that the "bag of words" model with stemming is not good for wiki content, mainly because of the wikiwords - a modified split algorithm would solve that problem. But there are links too, and these are crucial in detecting spam. (2) So my heuristic is that an n-gram model would be much better. So I started a new parsing run with n-grams, and got 15000 terms. I started another run with dimension cutting: a term must occur in at least 10 docs - got 7700 terms, maybe this will be good.
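To make the n-gram idea and the dimension cutting concrete, here is a small Python sketch. It assumes character n-grams of length 3, which is my choice for the example only; the parsing itself was done by TCT, and its n-gram settings are not shown here. The function names are hypothetical; `min_df=10` corresponds to the "at least 10 docs" cut.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(docs, n=3, min_df=10):
    """Keep only the n-grams that occur in at least min_df documents.

    Returns {ngram: term_id} with term ids starting at 1.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(char_ngrams(doc, n)))   # count each doc once per term
    kept = sorted(term for term, count in doc_freq.items() if count >= min_df)
    return {term: term_id for term_id, term in enumerate(kept, start=1)}


docs = ["abcd", "abce", "xbcd"]
print(build_vocabulary(docs, n=3, min_df=2))
```

Character n-grams need no tokenizer, so wikiwords and URLs are handled the same way as ordinary text, which is the point of heuristic (2).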
Aha, precision is 38%. This is not good. What's the problem? The training data marked as non-spam contains lots of spam, so (1) is not true. The new rule: a topic that is very long and not deleted is not spam. Also, too many modifications indicate a play page, so I'll filter those out too. Back to the Excel macro.
Precision for the BOW training file is 45%, so maybe (2) isn't true either.
OK, found the bug: I used the termID-docID pairs as docID-termID pairs.
I used Access to swap the docID and termID columns, but the result is still not good...
SzaMa's project for enhancing FlexWiki with text mining algorithms
9/19/2007 7:03:30 PM - -76.84.225.95