author    Kelly Rauchenberger <fefferburbia@gmail.com>  2016-05-31 11:26:28 -0400
committer Kelly Rauchenberger <fefferburbia@gmail.com>  2016-05-31 11:26:28 -0400
commit    d514d8ee6ab88ca4fdf51a2a824b79c7602ffc95 (patch)
tree      538c28243bfabdba53e67d3007c9f695cb1c16d8
parent    7b8462390a4947710b856c0c85696f66d541b2d3 (diff)

Update README.md

 README.md | 70
 1 file changed, 50 insertions(+), 20 deletions(-)
diff --git a/README.md b/README.md
index 4512015..d8bb191 100644
--- a/README.md
+++ b/README.md
@@ -4,34 +4,64 @@
rawr-ebooks is a very good example of taking things too far. One of the assignments in the algorithms course I took was to implement an algorithm in SML that would generate nonsense statistically similar to an input corpus (basically, a plain text file with words and sentences in it). Of course, the actual point of the assignment was more focused on finding an algorithm that would do this within certain required cost bounds, but after the assignment ended, I decided that the project was too fun to let go and, combined with the recent revelation that [@Horse_Ebooks](https://twitter.com/Horse_Ebooks) was actually not a bot as widely believed, decided to augment my algorithm with the ability to post to Twitter.

The main project in this repository is a library called `librawr` that provides an interface for generating nonsense from one or more corpuses. The repository also contains `rawr-ebooks`, the canonical Twitter bot that started it all, and `rawr-gen`, which generates the same content as `rawr-ebooks` but does not post it to Twitter.

## librawr

The interface for `librawr` can be found in `kgramstats.h`, although for readability purposes the header `rawr.h` includes the same code. The main interface is a class named `rawr`.

```
void rawr::addCorpus(std::string corpus)
```

This function takes a string and adds its contents to the internal corpus. This function does not prepare `rawr` for generation; you must also call `compile`.

```
void rawr::compile(int maxK)
```

This function prepares `rawr` for text generation by tokenizing its internal corpus, analyzing word distribution, and building Markov chains. Depending on the size of your corpus, it can take a significant amount of time to run. Currently, it outputs progress information to STDOUT, although the ability to disable this will be added eventually. The parameter `maxK` controls the order of the Markov chain generated; a higher number means that the generation process can keep a longer history of tokens. When `maxK` is too high, the generated output will be too similar to the provided corpus; when `maxK` is too low, the generated output will be too random. As a starting point, the canonical bot uses a `maxK` of 4.

```
void rawr::setTransformCallback(std::function<std::string(std::string, std::string)> callback)
```

With this function, you can provide a callback that `rawr` will call when generating output. When compiling its corpus, `rawr` collects words that it thinks are different forms of the same word into a distribution; usually this means words that it thinks are typos of each other. The transform callback is called after `rawr` has chosen the next word to output and after it has chosen the form of that word to output. The first parameter of the callback is the canonical form of the chosen word, while the second parameter is the chosen form. If a transform callback is provided, `rawr` will call it with this information and use the return value as the string to output.

An example: say you have a `std::set<std::string>` of words to censor in the output. Because `rawr` transforms its generated text so much, it can be difficult to censor text after generation; however, you can use a transform callback to do the job, like so:

```
std::set<std::string> blacklist;
rawr kgramstats;
// Initialize blacklist and kgramstats
// ...

kgramstats.setTransformCallback([&] (std::string canonical, std::string form) -> std::string {
  // Check if the next word is in the blacklist
  if (blacklist.count(canonical) == 1)
  {
    // If so, return asterisks instead
    return std::string(canonical.length(), '*');
  } else {
    // Otherwise, return the form rawr chose
    return form;
  }
});
```

```
std::string rawr::randomSentence(int maxL)
```

This function actually performs the text generation. It currently prints debug information to STDOUT as it works, although an option to disable this will be added eventually. The parameter `maxL` controls the termination condition. Every time a terminator token is generated (a token ending in some combination of periods, commas, semicolons, colons, newlines, question marks, and exclamation marks; notably, a single comma is not a terminator), `rawr` checks whether `maxL` tokens have been output; if so, it ends generation. If not, there is a 25% chance that it will end generation anyway.

## rawr-ebooks and rawr-gen

`rawr-ebooks` is the canonical Twitter bot that uses `librawr` to generate text. It is hosted at [@Rawr_Ebooks](https://twitter.com/rawr_ebooks). It takes no command line arguments, and instead reads a configuration file called `config.yml`. An example of the format of this configuration file can be found in `config-example.yml`. The config file contains the filename of a corpus to use, the OAuth strings used to authenticate with Twitter, and the maximum amount of time to wait between tweets.

`rawr-ebooks` uses a special word in its generation called `$name$`. It reads in a newline-delimited file in the working directory called `names.txt`, and whenever the word `$name$` is chosen to be output, it is replaced by a random line from `names.txt`.

`rawr-gen` does not use `config.yml`; it instead takes the filename of the corpus to use as a command line argument.

## Implementation details

I ended up rewriting the algorithm in C++, as the SML implementation did not handle randomization very well and would have been very difficult to adapt to post to Twitter. The new version has many improvements to the quality of the generated output, and the input corpus that I use for @Rawr_Ebooks is growing every day. As of May 31st, 2016, it is about 440,000 words long.

`librawr` uses [aspell](http://aspell.net/) to detect typos. `rawr-ebooks` additionally uses [yaml-cpp](https://github.com/jbeder/yaml-cpp) to read configuration data from a file (mainly, where the input corpus is located, and the information used to connect to Twitter), and my own library [libtwitter++](https://github.com/hatkirby/libtwittercpp) to post to Twitter. `rawr-gen` has no external dependencies other than what `librawr` uses.