path: root/kgramstats.h
Commit message | Author | Age | Files | Lines
* hashtags are now randomized | Kelly Rauchenberger | 2016-01-25 | 1 | -4/+20
* Rewrote quite a bit of kgramstats | Kelly Rauchenberger | 2016-01-04 | 1 | -11/+73
  The algorithm still treats most tokens literally, but now groups together tokens that terminate a clause (that is, tokens containing any of .?!,), without distinguishing between the different terminating characters. For each word that can terminate a sentence, the algorithm builds a histogram of the terminating characters and the number of occurrences of those characters for that word (tracking the number of occurrences lets things like um???? and um,,,,, still be folded down into um.). Then, when the terminating version of that token is generated, a random terminating string drawn from that word's histogram is appended to the token (again, to allow things like the desu-ly use of multiple commas to end clauses).

  The algorithm now also has a slightly more advanced kgram structure: a special "sentence wildcard" kgram value, set aside from normal strings of tokens, can match any terminating token. This value is never printed; it is only ever present in the query kgrams and cannot appear in the histograms, because it is of a different datatype. It is used at the beginning of sentence generation to make sure that the first couple of words generated actually form the beginning of a sentence instead of picking up somewhere in the middle of one. It is also used to reset sentence generation on the rare occasion that the end of the corpus is reached.
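  The terminator folding described in this message could be sketched roughly as follows. This is a minimal illustration, not the code from kgramstats.h: the names `SplitToken`, `TerminatorHistogram`, `splitToken`, `observe`, and `sampleTerminator` are hypothetical, and storing the whole trailing run of punctuation as the histogram key is an assumption about how "characters and number of occurrences" might be recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <string>

// Hypothetical types; none of these names come from kgramstats.h.
// Maps a terminating string (e.g. "????" or ",,,,") to how often it was seen.
using TerminatorHistogram = std::map<std::string, std::size_t>;

struct SplitToken {
    std::string word;        // token with any trailing .?!, removed
    std::string terminator;  // trailing run of .?!, (empty if none)
};

// Splits a raw token into its base word and its trailing terminator run.
SplitToken splitToken(const std::string& raw)
{
    const std::size_t last = raw.find_last_not_of(".?!,");
    const std::size_t cut = (last == std::string::npos) ? 0 : last + 1;
    return {raw.substr(0, cut), raw.substr(cut)};
}

// Records one raw token and returns its folded form, so that "um????" and
// "um,,,," both become "um." while their terminators go into the histogram.
std::string observe(std::map<std::string, TerminatorHistogram>& stats,
                    const std::string& raw)
{
    const SplitToken t = splitToken(raw);
    if (t.terminator.empty()) return t.word;
    ++stats[t.word][t.terminator];
    return t.word + ".";
}

// Picks a terminating string with probability proportional to its count.
std::string sampleTerminator(const TerminatorHistogram& hist, std::mt19937& rng)
{
    assert(!hist.empty());
    std::size_t total = 0;
    for (const auto& entry : hist) total += entry.second;
    std::uniform_int_distribution<std::size_t> dist(0, total - 1);
    std::size_t roll = dist(rng);
    for (const auto& entry : hist) {
        if (roll < entry.second) return entry.first;
        roll -= entry.second;
    }
    return hist.begin()->first;  // not reached when all counts are positive
}
```

  With this sketch, the generator would look up the folded token's word in `stats` and append `sampleTerminator(...)` to it whenever the terminating version of a token is chosen.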
* Added malapropisms | Kelly Rauchenberger | 2015-11-22 | 1 | -8/+7
* Kerjiggered the algorithms | Kelly Rauchenberger | 2015-07-19 | 1 | -0/+5
* Rewrote weighted random number generator | Feffernoose | 2013-10-05 | 1 | -1/+2
  The previous method of picking the next token was flawed in some mysterious way: it ended up picking various words that occurred only once in the input corpus as the first word of the generated output (most notably, "hysterically," "Anarchy," "Yorkshire," and "impunity.").
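  The message does not say what the flaw was or how the rewrite works, so the following is only a sketch of one common way to pick a token with probability proportional to its count: a cumulative table searched with `std::map::upper_bound`. The names `buildCumulative`, `pickAt`, and `pickWeighted` are hypothetical and do not come from kgramstats.h.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <string>

// Builds a cumulative table: each key is the running total after adding that
// token's count, so a token owns the half-open range [previous key, own key).
std::map<std::size_t, std::string>
buildCumulative(const std::map<std::string, std::size_t>& counts)
{
    std::map<std::size_t, std::string> cumulative;
    std::size_t running = 0;
    for (const auto& entry : counts) {
        if (entry.second == 0) continue;  // zero-count tokens get no range
        running += entry.second;
        cumulative[running] = entry.first;
    }
    return cumulative;
}

// Returns the token whose range contains `roll`, where 0 <= roll < total.
// upper_bound finds the first key strictly greater than roll.
const std::string& pickAt(const std::map<std::size_t, std::string>& cumulative,
                          std::size_t roll)
{
    assert(!cumulative.empty());
    assert(roll < cumulative.rbegin()->first);
    return cumulative.upper_bound(roll)->second;
}

// Draws a uniform roll in [0, total) and returns the matching token.
const std::string& pickWeighted(const std::map<std::size_t, std::string>& cumulative,
                                std::mt19937& rng)
{
    assert(!cumulative.empty());
    std::uniform_int_distribution<std::size_t> dist(0, cumulative.rbegin()->first - 1);
    return pickAt(cumulative, dist(rng));
}
```

  For counts of 1 for "a" and 3 for "the", the table has keys 1 and 4, so a roll of 0 selects "a" and rolls 1 through 3 select "the". Using `lower_bound` with the same roll range would instead hand "a" two of the four slots, which is one way a selection like this can drift toward rare tokens; whether that resembles the original bug is not known.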
* Weighed token casing and presence of periods | Feffernoose | 2013-10-01 | 1 | -3/+9
  Tokens that differ only by casing or by the presence of an ending period are now considered the same token. When tokens are generated, they are cased according to the prevalence of Upper/Title/Lower casing of that token in the input corpus; similarly, a period is appended to the word according to how often the same token ended with a period in the input corpus.
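  One way to picture the folding and re-casing described here is sketched below. It is an illustration under assumptions, not the code from this commit: `TokenStats`, `observe`, and `render` are hypothetical names, and treating a single capital letter as Title case rather than Upper case is an arbitrary choice made for the sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <random>
#include <string>

// Hypothetical per-token counters; lower-case occurrences are the remainder.
struct TokenStats {
    std::size_t all = 0;     // total occurrences of the folded token
    std::size_t upper = 0;   // occurrences written in UPPER case
    std::size_t title = 0;   // occurrences written in Title case
    std::size_t period = 0;  // occurrences that ended with '.'
};

// Folds a raw token to lower case without its ending period, updates the
// counters for that folded key, and returns the key.
std::string observe(std::map<std::string, TokenStats>& stats, std::string raw)
{
    const bool period = !raw.empty() && raw.back() == '.';
    if (period) raw.pop_back();

    bool anyLower = false;
    bool anyUpper = false;
    for (unsigned char c : raw) {
        if (std::islower(c)) anyLower = true;
        if (std::isupper(c)) anyUpper = true;
    }
    const bool firstUpper =
        !raw.empty() && std::isupper(static_cast<unsigned char>(raw[0]));

    std::string key = raw;
    for (char& c : key) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    TokenStats& s = stats[key];
    ++s.all;
    if (anyUpper && !anyLower && raw.size() > 1) ++s.upper;
    else if (firstUpper) ++s.title;
    if (period) ++s.period;
    return key;
}

// Renders a folded token, choosing casing and the ending period with
// probabilities proportional to how often each was observed.
std::string render(const std::string& key, const TokenStats& s, std::mt19937& rng)
{
    assert(s.all > 0);
    std::uniform_int_distribution<std::size_t> dist(0, s.all - 1);
    std::string out = key;
    const std::size_t caseRoll = dist(rng);
    if (caseRoll < s.upper) {
        for (char& c : out) {
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        }
    } else if (caseRoll < s.upper + s.title && !out.empty()) {
        out[0] = static_cast<char>(std::toupper(static_cast<unsigned char>(out[0])));
    }
    if (dist(rng) < s.period) out += '.';
    return out;
}
```

  In this sketch "Yorkshire.", "Yorkshire", and "yorkshire" all share the key "yorkshire", and `render` reproduces the Title-case form and the ending period roughly as often as they appeared in the input.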
* Wrote program | Feffernoose | 2013-10-01 | 1 | -0/+28