
VagabotKB


Notes: 
N-grams are sequences of n consecutive words .. whereas concgrams may include any n co-occurring words regardless of position .. 
concgrams allow for gaps between the words and for variation in their order too .. 
Latent Semantic Indexing seems very similar to concgramming ..  
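
The difference between the two can be sketched in a few lines of Python. This is only an illustration of the definitions in the notes above, not how ConcGram or WSConcGram work internally; the function names and the `span` window size are assumptions made for the example.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Contiguous sequences of n tokens, in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concgrams(tokens, n=2, span=4):
    """Unordered sets of n distinct words that co-occur within `span` tokens.

    `span` is an assumed window size. Gaps between the words and the
    order of the words are both ignored, so each result is a frozenset.
    """
    found = set()
    for i in range(len(tokens)):
        window = tokens[i:i + span]
        for combo in combinations(window, n):
            if len(set(combo)) == n:  # skip a word paired with itself
                found.add(frozenset(combo))
    return found


tokens = "travel light and travel far".split()
print(ngrams(tokens, 2))
print(frozenset({"light", "far"}) in concgrams(tokens, n=2, span=4))
```

"light" and "far" never form a bigram here because they are not adjacent, but they do form a concgram once the window is wide enough to cover the gap between them.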

References: 
Keyness in Texts (2010) [Book] .. edited by Marina Bondi & Mike Scott
ConcGram 1.0: A Phraseological Search Engine (2009) [PDF] ..  
From n-gram to skipgram to concgram (2006) [PDF] .. fully automated search that reveals all of the word association patterns 

Tools:
ConcApp Concordancer ..
ConcGram List Builder ..
Google Books Ngram Viewer (About) ..
Microsoft Web N-gram Service (Beta) .. unigram, bigram, trigram, N-gram with N=4 (4-gram) ..
WSConcGram .. "program for finding concgrams, essentially related pairs, triplets, quadruplets" .. (by Mike Scott)



Stock personality: Julia

1) global name change


Wikipedia:
Most common words in English (100) || Adobe FrameMaker .. converts books (ePub) into XML | TopBraid Composer .. converts XML into RDF | Altova XMLSpy .. XML editor & XSLT processor


Original corpus: Vagabond Globetrotting 3 (2004)

1) concgram corpus

2) parse sentences

3) key sentences to concgrams

4) XSLT transform to knowledgebase
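
Steps 2 and 3 might look roughly like the sketch below. It assumes a naive punctuation-based sentence splitter and uses unordered word pairs as a stand-in for concgrams; `split_sentences`, `key_sentences`, and the tiny stopword set are hypothetical names and values for illustration only. The XSLT step is not covered.

```python
import re
from collections import defaultdict
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def key_sentences(sentences, stopwords=frozenset()):
    """Map each unordered word pair to the sentences containing both words."""
    index = defaultdict(list)
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        unique = sorted({w for w in words if w not in stopwords})
        for pair in combinations(unique, 2):
            index[pair].append(sentence)
    return index


text = "Pack light. Always carry a light pack! Where is the hostel?"
sentences = split_sentences(text)
# Hypothetical stopword set; a list of the most common English words could be used instead.
index = key_sentences(sentences, stopwords=frozenset({"a", "the", "is"}))
print(index[("light", "pack")])
```

Because the pairs are sorted and unordered, "Pack light." and "Always carry a light pack!" are filed under the same key even though the word order differs.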


VagaBot ChatLogs: Jan11 | Feb11 | Mar11 | Apr11 | May11 | Jun11 | Jul11 | Aug11 | Sep11 | Oct11 | Nov11 | Dec11 || Tools: Google Refine | TextPad | ConcGram1

140 characters: The average number of words in an English sentence is around 16, and the average number of words in a tweet is 13.


2011 Questionbase: 12 months of questions

1) aggregate questions

2) filter duplicates

3) delete too few words (<1)

4) delete too many words (>32)

5) delete rows beginning with spaces, characters, numerals

6) google refine, case & clustering

7) gr export (CSV), remove duplicates

8) limit character length to 140

9) remove foreign languages

10) remove miscellaneous characters (TextPad)

11) concordance (ConcGram1)
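
The mechanical filters in the list above (steps 2 to 5 and step 8) could be approximated in a single pass, as in this sketch. It rests on several assumptions: step 5 is read as "drop rows that do not start with a letter", step 8 is read as dropping rather than truncating over-long rows, and duplicates are compared case-insensitively. The Google Refine clustering, language filtering, TextPad, and ConcGram steps are not reproduced, and `clean_questions` is a hypothetical name.

```python
def clean_questions(rows, min_words=1, max_words=32, max_chars=140):
    """Filter raw question rows using the word, character and duplicate limits."""
    seen = set()
    kept = []
    for row in rows:
        # Step 5 (assumed reading): keep only rows that start with a letter.
        if not row or not row[0].isalpha():
            continue
        row = row.strip()
        n_words = len(row.split())
        # Steps 3 and 4: too few or too many words.
        if n_words < min_words or n_words > max_words:
            continue
        # Step 8 (assumed reading): drop rows longer than the tweet limit.
        if len(row) > max_chars:
            continue
        # Step 2: case-insensitive duplicate check (an assumption).
        key = row.casefold()
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


rows = [
    "Where is the hostel?",
    "where is the hostel?",
    " leading space",
    "123 numbers first",
    "?what",
    "x " * 40,
    "Is it safe",
]
print(clean_questions(rows))
```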