
rawr-ebooks

I suddenly found it very hilarious. --@Rawr_Ebooks

rawr-ebooks is a very good example of taking things too far. One of the assignments in the algorithms course I took was to implement an algorithm in SML that would generate nonsense statistically similar to an input corpus (basically, a plain text file with words and sentences in it). Of course, the actual point of the assignment was more focused on finding an algorithm that would do this within certain required cost bounds, but after the assignment ended, I decided that the project was too fun to let go and, combined with the recent revelation that @Horse_Ebooks was, contrary to popular belief, not actually a bot, decided to augment my algorithm with the ability to post to Twitter.

The main project in this repository is a library called librawr that provides an interface for generating nonsense from one or more corpora. The repository also contains rawr-ebooks, the canonical Twitter bot that started it all, and rawr-gen, which generates the same content as rawr-ebooks but does not post it to Twitter.

librawr

The interface for librawr can be found in kgramstats.h, although, for readability, the header rawr.h includes the same code. The main interface is a class named rawr.

void rawr::addCorpus(std::string corpus)

This function takes a string and adds its content to the internal corpus. This function does not prepare rawr for generation; you must also call compile.
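Because addCorpus takes the corpus contents as a string rather than a filename, you need to read the file yourself first. A minimal sketch (readCorpus is a hypothetical helper for illustration, not part of librawr):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: read an entire text file into a string so it
// can be passed to rawr::addCorpus. Not part of librawr itself.
std::string readCorpus(const std::string& path)
{
  std::ifstream in(path);
  std::ostringstream contents;
  contents << in.rdbuf();
  return contents.str();
}

// Usage with librawr's documented interface:
//   rawr kgramstats;
//   kgramstats.addCorpus(readCorpus("corpus.txt"));
//   kgramstats.compile(4);
```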

void rawr::compile(int maxK)

This function prepares rawr for text generation by tokenizing its internal corpus, analyzing word distribution, and building Markov chains. Depending on the size of your corpus, it can take a significant amount of time to run. Currently, it outputs progress information to STDOUT, although the ability to disable this will be added eventually. The parameter maxK controls the order of the Markov chain generated; a higher number means that the generation process can keep a longer history of tokens. When maxK is too high, the generated output will be too similar to the provided corpus; when maxK is too low, the generated output will be too random. As a starting point, the canonical bot uses a maxK of 4.
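To get an intuition for why this happens, consider a toy order-k chain (this is just an illustration of the concept, not librawr's implementation): the longer the history, the fewer distinct next tokens each history maps to, so generation sticks closer to the corpus.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using History = std::vector<std::string>;

// Toy order-k chain: map each window of k consecutive tokens to the
// set of tokens that followed it in the corpus. Purely illustrative.
std::map<History, std::set<std::string>> buildChain(
    const std::vector<std::string>& tokens, int k)
{
  std::map<History, std::set<std::string>> chain;

  for (size_t i = static_cast<size_t>(k); i < tokens.size(); i++)
  {
    History h(tokens.begin() + (i - k), tokens.begin() + i);
    chain[h].insert(tokens[i]);
  }

  return chain;
}
```

With the corpus "the cat sat a cat ran", an order-1 chain gives the history {"cat"} two possible successors ("sat" and "ran"), while an order-2 chain pins {"the", "cat"} down to just "sat" — higher k, less randomness.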

void rawr::setTransformCallback(std::function<std::string(std::string, std::string)> callback)

With this function, you can provide a callback that rawr will call when generating output. When compiling its corpus, rawr collects words that it thinks are different forms of the same word into a distribution; usually this means words that it thinks are typos of each other. The transform callback is called after rawr has chosen the next word to output and after it has chosen the form of that word to output. The first parameter of the callback is the canonical form of the chosen word, while the second parameter is the chosen form. If a transform callback is provided, rawr will call it with this information and use the return value as the string to output.

An example: say you have a std::set<std::string> of words to censor in the output. Because rawr transforms its generated text so much, it can be difficult to censor text after generation; however, you can use a transform callback to do the job, like so:

std::set<std::string> blacklist;
rawr kgramstats;
// Initialize blacklist and kgramstats
// ...

kgramstats.setTransformCallback([&] (std::string canonical, std::string form) -> std::string {
   // Check if the next word is in the blacklist
   if (blacklist.count(canonical) == 1)
   {
      // If so, return asterisks instead
      return std::string(canonical.length(), '*');
   } else {
      // Otherwise return the form rawr chose
      return form;
   }
});

std::string rawr::randomSentence(int maxL)

This function actually performs text generation. It currently prints debug information to STDOUT as it works, although an option to disable this will be added eventually. The parameter maxL controls the termination condition. Every time a terminator token is generated (a token ending in some combination of periods, commas, semicolons, colons, newlines, question marks, and exclamation marks; notably, a single comma on its own is not a terminator), rawr checks if maxL tokens have been output; if so, it ends generation. If not, there is a 25% chance that it will end generation anyway.
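The termination rule can be sketched like this (a simplified reimplementation for illustration; not librawr's actual code):

```cpp
#include <random>

// Called whenever a terminator token has just been generated.
// Stops if maxL tokens have been output; otherwise stops anyway
// with 25% probability.
bool shouldTerminate(int tokensOutput, int maxL, std::mt19937& rng)
{
  if (tokensOutput >= maxL)
  {
    return true;
  }

  std::uniform_int_distribution<int> roll(0, 3);
  return roll(rng) == 0; // one chance in four
}
```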

rawr-ebooks and rawr-gen

rawr-ebooks is the canonical Twitter bot that uses librawr to generate text. It is hosted at @Rawr_Ebooks. It takes no command line arguments, and instead reads a configuration file called config.yml. An example of the format of this configuration file can be found in config-example.yml. The config file contains the filename of a corpus to use, OAuth strings used to authenticate with Twitter, and the maximum amount of time to wait between tweets.
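A config along these lines would name the corpus file, the Twitter OAuth credentials, and the maximum delay between tweets. The key names below are illustrative only; consult config-example.yml in the repository for the actual format:

```yaml
# Hypothetical sketch of config.yml; the real key names are in
# config-example.yml.
corpus: corpus.txt      # plain text file to learn from
consumer_key: "..."     # Twitter OAuth credentials
consumer_secret: "..."
access_key: "..."
access_secret: "..."
delay: 3600             # maximum time to wait between tweets
```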

rawr-ebooks uses a special word in its generation called $name$. It reads a newline-delimited file called names.txt from the working directory; whenever the word $name$ is chosen for output, it is replaced with a random line from that file.
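The substitution behaves roughly like this sketch (illustrative only; not rawr-ebooks' actual code):

```cpp
#include <random>
#include <string>
#include <vector>

// Replace every occurrence of "$name$" in text with a line chosen at
// random from names (the contents of names.txt). Illustrative sketch,
// not rawr-ebooks' actual implementation.
std::string substituteNames(std::string text,
                            const std::vector<std::string>& names,
                            std::mt19937& rng)
{
  std::uniform_int_distribution<size_t> pick(0, names.size() - 1);

  size_t pos;
  while ((pos = text.find("$name$")) != std::string::npos)
  {
    text.replace(pos, 6, names[pick(rng)]);
  }

  return text;
}
```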

rawr-gen does not use config.yml. It instead takes the filename of the corpus to use as a command line argument.

Implementation details

I ended up rewriting the algorithm in C++ as the SML implementation did not handle randomization very well and would have been very difficult to adjust to post to Twitter. The new version includes many changes that improve the quality of the generated output, and the input corpus that I use for @Rawr_Ebooks is growing every day. As of May 31st, 2016, it is about 440,000 words long.

librawr uses aspell to detect typos. rawr-ebooks additionally uses yaml-cpp to read configuration data from a file (mainly, where the input corpus is located, and the information used to connect to Twitter), and my own library libtwitter++ to post to Twitter. rawr-gen has no external dependencies other than what librawr uses.