Original corpus: Vagabond Globetrotting 3 (2004)
1) concgram corpus
2) parse sentences
3) key sentences to concgrams
4) XSLT transform to knowledgebase
140 characters: The average number of words in an English sentence is around 16, and the average number of words in a tweet is 13.
2011 Questionbase: 12 months of questions
1) aggregate questions
2) filter duplicates
3) delete questions with too few words (<1)
4) delete questions with too many words (>32)
5) delete rows beginning with spaces, characters, numerals
6) Google Refine: case & clustering
7) Google Refine export (CSV), remove duplicates
8) limit character length to 140
9) remove foreign languages
10) remove miscellaneous characters (TextPad)
11) concordance (ConcGram1)
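
The mechanical filters in the Questionbase list (steps 2-5 and 8) can be sketched in a few lines of Python. This is only an illustration, not the tooling actually used. The function name clean_questions and the constants are hypothetical. The sketch also makes three assumptions: step 5 is read as "keep only rows that begin with a letter", step 8 is read as "drop questions longer than 140 characters" rather than truncate them, and duplicates are matched case-insensitively.

```python
MIN_WORDS = 1    # step 3: drop questions with fewer words than this
MAX_WORDS = 32   # step 4: drop questions with more words than this
MAX_CHARS = 140  # step 8: character limit (assumed to mean drop, not truncate)


def clean_questions(raw_questions):
    """Apply the filters from steps 2-5 and 8 to an iterable of strings."""
    seen = set()
    kept = []
    for question in raw_questions:
        # Step 5 (assumed reading): skip empty rows and rows that begin
        # with a space, a symbol, or a digit.
        if not question or not question[0].isalpha():
            continue
        # Steps 3 and 4: word-count bounds, splitting on whitespace.
        n_words = len(question.split())
        if n_words < MIN_WORDS or n_words > MAX_WORDS:
            continue
        # Step 8: enforce the 140-character limit.
        if len(question) > MAX_CHARS:
            continue
        # Step 2: filter duplicates (case-insensitive match is an assumption).
        key = question.casefold()
        if key in seen:
            continue
        seen.add(key)
        kept.append(question)
    return kept
```

Steps 6, 7, and 9-11 depend on Google Refine, TextPad, and ConcGram, so they are left out of the sketch.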