path: root/kgramstats.cpp
Commit message    Author    Age    Files    Lines
* Converted to C++ style randomization    Kelly Rauchenberger    2019-02-28    1    -18/+25
| | | | The logic in rawr::randomSentence with the cuts might be slightly different now but who even knows what's going on there.
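For context on the commit above: "C++ style randomization" generally means replacing rand()/srand() with a <random> engine and distribution. The snippet below is only a minimal illustrative sketch of that pattern, not the actual body of rawr::randomSentence.

    #include <iostream>
    #include <random>
    #include <string>
    #include <vector>

    int main()
    {
      std::vector<std::string> tokens {"the", "quick", "brown", "fox"};

      std::random_device rd;   // nondeterministic seed source
      std::mt19937 rng(rd());  // Mersenne Twister engine instead of srand()/rand()

      // Unbiased index into tokens, replacing the usual rand() % tokens.size().
      std::uniform_int_distribution<std::size_t> pick(0, tokens.size() - 1);

      std::cout << tokens[pick(rng)] << std::endl;
    }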
* Allow the sentence to end at the end of a corpus    Kelly Rauchenberger    2019-02-27    1    -0/+7
|
* The beginning of a corpus should be treated as a new sentence    Kelly Rauchenberger    2019-02-27    1    -0/+27
|
* Fixed crash when maxK is greater than a corpus's length    Kelly Rauchenberger    2019-02-27    1    -1/+1
|
* Interned tokens to reduce memory footprint    Kelly Rauchenberger    2018-08-26    1    -35/+60
|
* Made cuts algorithm more aggressive    Kelly Rauchenberger    2018-02-06    1    -5/+10
|
* Tweaked the cutting algorithm and disabled newlines    Kelly Rauchenberger    2016-11-27    1    -17/+14
|
* Merge branch 'master' of https://github.com/hatkirby/rawr-ebooks    Kelly Rauchenberger    2016-08-20    1    -1/+1
|\
| * Fixed final closing delimiters appearing on new line    Kelly Rauchenberger    2016-06-05    1    -1/+1
| |
* | Marked rawr::randomSentence const    Kelly Rauchenberger    2016-08-20    1    -3/+3
|/
* Added ability to require a minimum number of corpora in generated output    Kelly Rauchenberger    2016-05-31    1    -47/+87
| | | | Also fixed a bug with tokenizing multiple corpora.
* Newlines, colons, and semicolons are now valid terminators    Kelly Rauchenberger    2016-05-29    1    -24/+53
|
* Pulled the ebooks functionality out into a library    Kelly Rauchenberger    2016-05-20    1    -213/+252
|
* Changed "full sentence mode" to "don't stop believing" mode    Kelly Rauchenberger    2016-03-10    1    -14/+1
|
* Member hiding is fun    Kelly Rauchenberger    2016-03-08    1    -2/+2
|
* Full sentences mode!    Kelly Rauchenberger    2016-03-08    1    -2/+15
|
* Removed aspell session editing    Kelly Rauchenberger    2016-02-28    1    -4/+0
| | | | This wasn't really necessary since it was completely automated anyway, and with some bad corpora it caused crashes for reasons that I haven't looked into.
* Reverted to an older kgram cut rate    Kelly Rauchenberger    2016-02-20    1    -13/+9
|
* Added percentage display to preprocessing stage    Kelly Rauchenberger    2016-02-20    1    -4/+52
|
* Modified kgram cut rate. It's do or die.    Kelly Rauchenberger    2016-02-17    1    -10/+13
|
* Attempted to fix line-endings for Windows    Kelly Rauchenberger    2016-02-17    1    -0/+10
|
* Fixed issue when names.txt was not present    Kelly Rauchenberger    2016-02-15    1    -24/+13
| | | | Also removed any code mentioning $noun$ because it turns out the current version of the canonical corpus doesn't even use it anymore.
* Tweaked kgram cut rate some more (it never ends)    Kelly Rauchenberger    2016-02-15    1    -1/+1
|
* Tweaked kgram cut rate AGAIN    Kelly Rauchenberger    2016-02-14    1    -1/+2
|
* Fixed incorrect diversity of tokens containing the letters aemnou    Kelly Rauchenberger    2016-02-14    1    -1/+1
|
* Tweaked kgram cut rate again    Kelly Rauchenberger    2016-02-14    1    -2/+2
|
* Fixed problem wherein "$name$'s" was considered a form of "name's"    Kelly Rauchenberger    2016-02-14    1    -8/+6
|
* Fixed issue where queries with both the wildcard token and a terminating ↵    Kelly Rauchenberger    2016-02-13    1    -14/+5
| | | | token would reset the prefix
* Tweaked kgram cut rate again    Kelly Rauchenberger    2016-02-09    1    -4/+8
|
* Merge branch 'master' of http://github.com/hatkirby/rawr-ebooks    Kelly Rauchenberger    2016-02-03    1    -0/+1
|\
| * Added #include <cstring> to kgramstats    Kelly Rauchenberger    2016-02-03    1    -0/+1
| |
* | Declared old-style $name$ and $noun$ canonical    Kelly Rauchenberger    2016-02-03    1    -0/+6
|/ | | | Without this, they get mixed in by the spell checker with "name" and "noun."
* Token generator now uses aspell to link different spellings of a word    Kelly Rauchenberger    2016-02-03    1    -3/+56
| | | | This is the grand scheme for the multi-formed word design.
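For context: linking spellings with aspell usually means folding variant or misspelled tokens onto a canonical dictionary form. The sketch below uses the standard aspell C API to illustrate one plausible approach; the function name canonicalForm and the overall logic are assumptions for illustration, not the actual code in kgramstats.cpp.

    #include <aspell.h>
    #include <iostream>
    #include <string>

    // Fold a token onto a canonical spelling: dictionary words are kept as-is,
    // anything else is replaced by aspell's top suggestion when one exists.
    std::string canonicalForm(AspellSpeller* speller, const std::string& token)
    {
      if (aspell_speller_check(speller, token.c_str(), -1))
        return token;

      const AspellWordList* suggestions =
        aspell_speller_suggest(speller, token.c_str(), -1);
      AspellStringEnumeration* els = aspell_word_list_elements(suggestions);
      const char* first = aspell_string_enumeration_next(els);
      std::string result = first ? first : token;
      delete_aspell_string_enumeration(els);

      return result;
    }

    int main()
    {
      AspellConfig* config = new_aspell_config();
      aspell_config_replace(config, "lang", "en_US");

      AspellCanHaveError* err = new_aspell_speller(config);
      if (aspell_error_number(err) != 0) return 1;
      AspellSpeller* speller = to_aspell_speller(err);

      std::cout << canonicalForm(speller, "cooool") << std::endl;   // e.g. "cool"

      delete_aspell_speller(speller);
      delete_aspell_config(config);
    }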
* Terminator characters in the middle of tokens are no longer stripped    Kelly Rauchenberger    2016-02-03    1    -11/+16
| | | | Emoticon checking is also now case sensitive, and a few more emoticons were added to the list.
* Fixed issue where closing opened delimiters wouldn't pop them off the stack    Kelly Rauchenberger    2016-02-01    1    -0/+2
| | | | This would cause a random quotation mark, for instance, to appear at the end of a tweet if a quote had been opened and closed naturally within the tweet.
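A minimal sketch of the delimiter-stack behavior described above, with illustrative names rather than the real rawr data structures: opening a delimiter pushes its closer, a natural close pops it (the fix), and anything still open is appended at the end of the generated text.

    #include <iostream>
    #include <stack>
    #include <string>

    std::string closeDelimiters(const std::string& text)
    {
      std::stack<char> open;
      for (char c : text)
      {
        if (c == '"')
        {
          // Quotes toggle: either they close the currently open quote...
          if (!open.empty() && open.top() == '"') open.pop();
          else open.push('"');                     // ...or they open a new one.
        } else if (c == '(') {
          open.push(')');
        } else if (c == ')' && !open.empty() && open.top() == ')') {
          open.pop();   // the fix: a natural close pops the stack
        }
      }

      // Whatever is still open gets closed at the end of the generated text.
      std::string result = text;
      while (!open.empty()) { result += open.top(); open.pop(); }
      return result;
    }

    int main()
    {
      // The ')' is popped when it closes naturally; only the open quote
      // remains, so a single closing '"' is appended.
      std::cout << closeDelimiters("she said (quietly) \"hi") << std::endl;
    }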
* Added emoji freevar    Kelly Rauchenberger    2016-02-01    1    -3/+102
| | | | Strings of emojis are tokenized separately from anything else, and added to an emoticon freevar, which is mixed in with regular emoticons like :P. This breaks old-style freevars like $name$ and $noun$ so some legacy support for compatibility is left in but eventually $name$ should be made into an actual new freevar. Emoji data is from gemoji (https://github.com/github/gemoji).
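Reading "freevar" as a named pool of interchangeable tokens, the sketch below shows how emoji runs pulled out of the corpus and plain emoticons like :P could share one pool that generation draws from at random. All names here are illustrative assumptions, not the actual kgramstats types.

    #include <iostream>
    #include <map>
    #include <random>
    #include <string>
    #include <vector>

    // One pool per freevar name; the emoticon pool mixes text emoticons with
    // emoji sequences (seeded from the corpus plus gemoji data).
    std::map<std::string, std::vector<std::string>> freevars {
      {"emoticon", {":P", ":)", "😂", "🎉✨"}}
    };

    std::string drawFreevar(const std::string& name, std::mt19937& rng)
    {
      const std::vector<std::string>& pool = freevars.at(name);
      std::uniform_int_distribution<std::size_t> pick(0, pool.size() - 1);
      return pool[pick(rng)];
    }

    int main()
    {
      std::mt19937 rng(std::random_device{}());
      std::cout << drawFreevar("emoticon", rng) << std::endl;
    }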
* Rewrote how tokens are handled    Kelly Rauchenberger    2016-01-29    1    -176/+277
| | | | | | A 'word' is now an object that contains a distribution of forms that word can take. For now, most words just contain one form, the canonical one. The only special use is currently hashtags. Malapropisms have been disabled because of compatibility issues and because an upcoming feature is planned to replace them.
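A sketch of the data structure that note describes: a word owning a weighted distribution of surface forms, drawn in proportion to corpus frequency. The member names are assumptions for illustration, not the real declarations in kgramstats.cpp.

    #include <iostream>
    #include <map>
    #include <random>
    #include <string>

    struct word {
      std::string canonical;
      std::map<std::string, int> forms;   // surface form -> occurrence count

      // Draw a form proportionally to how often it occurred in the corpus.
      const std::string& randomForm(std::mt19937& rng) const
      {
        if (forms.empty()) return canonical;

        int total = 0;
        for (const auto& f : forms) total += f.second;

        std::uniform_int_distribution<int> pick(1, total);
        int roll = pick(rng);
        for (const auto& f : forms)
        {
          roll -= f.second;
          if (roll <= 0) return f.first;
        }
        return canonical;   // not reached when forms is non-empty
      }
    };

    int main()
    {
      std::mt19937 rng(std::random_device{}());
      word hashtag {"#hashtag", {{"#rawr", 3}, {"#ebooks", 1}}};
      std::cout << hashtag.randomForm(rng) << std::endl;
    }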
* hashtags are now randomized    Kelly Rauchenberger    2016-01-25    1    -33/+82
|
* Did you know you can put comments in front of ascii art ↵    Kelly Rauchenberger    2016-01-05    1    -0/+34
| | | | (https://twitter.com/rawr_ebooks/status/684376473369706498)
* Rewrote quite a bit of kgramstats    Kelly Rauchenberger    2016-01-04    1    -277/+165
| | | | | | The algorithm still treats most tokens literally, but now groups together tokens that terminate a clause somehow (i.e. contain .?!,), without distinguishing between the different terminating characters. For each word that can terminate a sentence, the algorithm creates a histogram of the terminating characters and the number of occurrences of those characters for that word (the occurrence counts allow things like um???? and um,,,,, to still be folded down into um.). Then, when the terminating version of that token is invoked, a random terminating string is added to that token based on the histogram for that word (again, to allow things like the desu-ly use of multiple commas to end clauses). The algorithm now also has a slightly more advanced kgram structure: a special "sentence wildcard" kgram value, distinct from normal strings of tokens, can match any terminating token. This kgram value is never printed (it is only ever present in the query kgrams and, being of a different datatype, cannot actually appear in the histograms) and is used at the beginning of sentence generation to make sure that the first couple of words generated actually form the beginning of a sentence instead of picking up somewhere in the middle of one. It is also used to reset sentence generation on the rare occasion that the end of the corpus is reached.
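The note above describes two mechanisms concretely enough that a small sketch may help. Everything below is illustrative, with assumed type and member names rather than the actual kgramstats.cpp declarations: a per-word histogram of terminating strings, and a kgram element that is either a literal token or the sentence wildcard.

    #include <map>
    #include <random>
    #include <string>

    // (1) Per-word terminator histogram: "um????" and "um,,,,," both fold down
    // to the token "um", while the histogram remembers which terminating
    // strings followed it and how often, so one can be re-drawn later.
    struct terminator_hist {
      std::map<std::string, int> counts;   // e.g. {"????": 3, ",,,,,": 1, ".": 10}

      std::string draw(std::mt19937& rng) const
      {
        int total = 0;
        for (const auto& t : counts) total += t.second;
        if (total == 0) return ".";

        std::uniform_int_distribution<int> pick(1, total);
        int roll = pick(rng);
        for (const auto& t : counts)
        {
          roll -= t.second;
          if (roll <= 0) return t.first;
        }
        return ".";
      }
    };

    // (2) A kgram element is either a literal token or the special sentence
    // wildcard, which never appears in the histograms and only occurs in query
    // kgrams, where it matches any terminating token. Querying with it first
    // makes generation start at a sentence boundary.
    struct kgram_elem {
      enum class kind { literal, sentence_wildcard } tag;
      std::string token;   // only meaningful when tag == kind::literal
    };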
* guess what! the algorithm    Kelly Rauchenberger    2015-12-30    1    -31/+56
| | | | | | | this time it's a literal algorithm again; not canonizing away punctuation; newlines are actually considered new sentences; now we look for the end of a sentence and then start after that
* You guessed it,,, twerked the algo    Kelly Rauchenberger    2015-11-23    1    -44/+41
|
* Added malapropisms    Kelly Rauchenberger    2015-11-22    1    -68/+93
|
* I may have made things better. I may have made things worse.    Kelly Rauchenberger    2015-11-22    1    -5/+5
|
* Added some newline recognition    Kelly Rauchenberger    2015-07-24    1    -31/+55
|
* Took into account question marks and exclamation marks    Kelly Rauchenberger    2015-07-19    1    -2/+2
|
* Stopped using C++11 because yamlcpp didn't like it    Kelly Rauchenberger    2015-07-19    1    -3/+6
|
* Kerjiggered the algorithms    Kelly Rauchenberger    2015-07-19    1    -21/+166
|
* Modified kgram shortening rate    Kelly Rauchenberger    2014-04-22    1    -1/+1
|
* Stripped empty tokens from corpus    Feffernoose    2013-10-06    1    -2/+8
|