path: root/data/maps/the_entry/rooms/Flipped Link Area.txtpb
name: "Flipped Link Area"
panel_display_name: "Pyramid Area"
panels {
  name: "WANDER"
  path: "Panels/Pilgrimage/cream_4"
  clue: "wander"
  answer: "roam"
  symbols: SUN
  display_name: "WANDER (Flipped)"
}
paintings {
  name: "NEAR"
  path: "Components/Paintings/aches2"
  orientation: "north"
  gravity: Y_PLUS
  display_name: "Flipped Near Painting"
}
paintings {
  name: "FAR"
  path: "Components/Paintings/aches4"
  orientation: "south"
  gravity: Y_PLUS
  display_name: "Flipped Far Painting"
}
`librawr` is a library that provides an interface for generating nonsense from one or more corpora. The repository also contains `rawr-ebooks`, the canonical Twitter bot that started it all, and `rawr-gen`, which generates the same content as `rawr-ebooks` but does not post it to Twitter.

## librawr

The interface for `librawr` can be found in `kgramstats.h`, although for readability purposes the header `rawr.h` includes the same code. The main interface is a class named `rawr`.

```
void rawr::addCorpus(std::string corpus)
```

This function takes a string and adds its content to the internal corpus. It does not prepare `rawr` for generation; you must also call `compile`.

```
void rawr::compile(int maxK)
```

This function prepares `rawr` for text generation by tokenizing its internal corpus, analyzing word distribution, and building Markov chains. Depending on the size of your corpus, it can take a significant amount of time to run. Currently, it outputs progress information to STDOUT, although the ability to disable this will be added eventually.

The parameter `maxK` controls the order of the Markov chain generated; a higher number means that the generation process can keep a longer history of tokens. When `maxK` is too high, the generated output will be too similar to the provided corpus; when `maxK` is too low, the generated output will be too random. As a starting point, the canonical bot uses a `maxK` of 4.

```
void rawr::setTransformCallback(std::function<std::string(std::string, std::string)> callback)
```

With this function, you can provide a callback that `rawr` will call when generating output. When compiling its corpus, `rawr` collects words that it thinks are different forms of the same word into a distribution; usually this means words that it thinks are typos of each other. The transform callback is called after `rawr` has chosen the next word to output and after it has chosen the form of that word to output.
The first parameter of the callback is the canonical form of the chosen word, while the second parameter is the chosen form. If a transform callback is provided, `rawr` will call it with this information and use the return value as the string to output.

An example: say you have a `std::set<std::string>` of words to censor in the output. Because `rawr` transforms its generated text so much, it can be difficult to censor text after generation; however, you can use a transform callback to do the job like so:

```
std::set<std::string> blacklist;
rawr kgramstats;

// Initialize blacklist and kgramstats
// ...

kgramstats.setTransformCallback([&] (std::string canonical, std::string form) -> std::string {
  // Check if the next word is in the blacklist
  if (blacklist.count(canonical) == 1)
  {
    // If so, return asterisks instead
    return std::string(canonical.length(), '*');
  } else {
    // Otherwise return the form rawr chose
    return form;
  }
});
```

```
std::string rawr::randomSentence(int maxL)
```

This function actually performs text generation. It currently prints debug information to STDOUT as it works, although an option to disable this will be added eventually.

The parameter `maxL` controls the termination condition. Every time a terminator token is generated, `rawr` checks whether `maxL` tokens have been output; if so, it ends generation. If not, there is a 25% chance that it will end generation anyway. A terminator token is one ending in some combination of periods, commas, semicolons, colons, newlines, question marks, and exclamation marks, although a single comma notably is not a terminator.

## rawr-ebooks and rawr-gen

`rawr-ebooks` is the canonical Twitter bot that uses `librawr` to generate text. It is hosted at [@Rawr_Ebooks](https://twitter.com/rawr_ebooks). It takes no command line arguments, and instead reads a configuration file called `config.yml`. An example of the format of this configuration file can be found in `config-example.yml`.
The config file contains the filename of a corpus to use, OAuth strings used to authenticate with Twitter, and the maximum amount of time to wait between tweets.

`rawr-ebooks` uses a special word in its generation called `$name$`. It reads a newline-delimited file in the working directory called `names.txt`, and whenever the word `$name$` is chosen to be output, it is replaced by a random line from `names.txt`.

`rawr-gen` does not use `config.yml`. It instead takes the filename of the corpus to use as a command line argument.

## Implementation details

I ended up rewriting the algorithm in C++ as the SML implementation did not handle randomization very well and would have been very difficult to adjust to post to Twitter. The new version includes many changes that improve the quality of the generated output, and the input corpus that I use for @Rawr_Ebooks is growing every day. As of May 31st, 2016, it is about 440,000 words long.

`librawr` uses [aspell](http://aspell.net/) to detect typos. `rawr-ebooks` additionally uses [yaml-cpp](https://github.com/jbeder/yaml-cpp) to read configuration data from a file (mainly, where the input corpus is located, and the information used to connect to Twitter), and my own library [libtwitter++](https://github.com/hatkirby/libtwittercpp) to post to Twitter. `rawr-gen` has no external dependencies other than what `librawr` uses.
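
To make the `randomSentence` termination rule described earlier more concrete, here is a minimal, self-contained sketch of one way to read it. This is not code from `librawr`: the helper names `isTerminator` and `shouldStop` are hypothetical, the random draw is passed in by the caller so the logic stays deterministic, and both the reading of "some combination of" as a trailing run of punctuation and the use of `>=` for the length check are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of librawr): returns true if `token` ends in
// a run of terminating punctuation, unless that run is exactly one comma.
bool isTerminator(const std::string& token) {
    const std::string marks = ".,;:\n?!";
    // Position of the last character that is NOT a terminating mark.
    const std::size_t lastOther = token.find_last_not_of(marks);
    // The trailing run starts right after it, or at 0 if every char is a mark.
    const std::size_t runStart =
        (lastOther == std::string::npos) ? 0 : lastOther + 1;
    const std::string run = token.substr(runStart);
    return !run.empty() && run != ",";
}

// Hypothetical helper: decides whether generation ends once a terminator
// token has been produced. `draw` is a uniform random number in [0, 1)
// supplied by the caller.
bool shouldStop(int tokensEmitted, int maxL, double draw) {
    assert(draw >= 0.0 && draw < 1.0);
    if (tokensEmitted >= maxL) {
        return true;        // assumed reading: the length budget is reached
    }
    return draw < 0.25;     // otherwise, stop anyway 25% of the time
}
```

Under these assumptions, `isTerminator("done.")` and `isTerminator("what?!")` are true, while `isTerminator("wait,")` is false because the trailing run is a single comma. A caller would only consult `shouldStop` after `isTerminator` returns true for the token just generated.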