-rw-r--r-- Makefile.am 6
-rw-r--r-- README.md 9
-rw-r--r-- gen.cpp 28
3 files changed, 28 insertions, 15 deletions
diff --git a/Makefile.am b/Makefile.am
index c9f61cf..3d4dad6 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -4,7 +4,5 @@ ACLOCAL_AMFLAGS = ${ACLOCAL_FLAGS}
 bin_PROGRAMS = rawr-ebooks rawr-gen
 rawr_ebooks_SOURCES = ebooks.cpp kgramstats.cpp
 rawr_gen_SOURCES = gen.cpp kgramstats.cpp
-rawr_ebooks_CPPFLAGS = $(LIBTWITCURL_CFLAGS)
-AM_CPPFLAGS = $(YAML_CFLAGS)
-rawr_ebooks_LDADD = $(LIBTWITCURL_LIBS) $(YAML_LIBS)
-rawr_gen_LDADD = $(YAML_LIBS)
\ No newline at end of file
+rawr_ebooks_CPPFLAGS = $(LIBTWITCURL_CFLAGS) $(YAML_CFLAGS)
+rawr_ebooks_LDADD = $(LIBTWITCURL_LIBS) $(YAML_LIBS)
\ No newline at end of file
diff --git a/README.md b/README.md
index e01eb45..4512015 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
 
 rawr-ebooks is a very good example of taking things too far. One of the assignments in the algorithms course I took was to implement an algorithm in SML that would generate nonsense statistically similar to an input corpus (basically, a plain text file with words and sentences in it). Of course, the actual point of the assignment was more focused on finding an algorithm that would do this in certain required cost bounds, but after the assignment ended, I decided that the project was too fun to let go and, combined with the recent revelation that [@Horse_Ebooks](https://twitter.com/Horse_Ebooks) was actually not a bot as widely believed, decided to augment my algorithm with the ability to post to Twitter.
 
-rawr-ebooks actually consists of two programs: `rawr-ebooks`, which generates nonsense and posts it to a Twitter account every hour, and `rawr-gen`, which generates nonsense on command. `rawr-gen` is probably more useful for the casual, well, anybody.
+rawr-ebooks actually consists of two programs: `rawr-ebooks`, which generates nonsense and posts it to a Twitter account every hour, and `rawr-gen`, which generates nonsense on command. `rawr-gen` is probably more useful for the casual, well, anybody. It also, unlike `rawr-ebooks`, does not require that the user have any external libraries installed.
 
 Here is how one would go about compiling `rawr-gen`:
 
@@ -24,15 +24,14 @@ Here is how one would go about compiling `rawr-gen`:
 
     <pre>make rawr-gen</pre>
 
-5. Rename `config-example.yml` to `config.yml` and within it, replace `corpus.txt` with the path to your input
-6. Run `rawr-gen`
+5. Run `rawr-gen` with your corpus. For instance, if your corpus was called `corpus.txt`, you would run:
 
-    <pre>./rawr-gen</pre>
+    <pre>./rawr-gen corpus.txt</pre>
 
 ## Implementation details
 
 I ended up rewriting the algorithm in C++ as the SML implementation did not handle randomization very well and would have been very difficult to adjust to post to Twitter. The new version has many improvements that improve the quality of the generated output, and the input corpus that I use for @Rawr_Ebooks is growing every day. As of October 6th, 2013, it is about 200,000 words long.
 
-rawr-ebooks uses [yamlcpp](https://code.google.com/p/yaml-cpp/) to read configuration data from a file (mainly, where the input corpus is located, and the information used to connect to Twitter), and [twitcurl](https://code.google.com/p/twitcurl/) to post to Twitter.
+`rawr-ebooks` uses [yamlcpp](https://code.google.com/p/yaml-cpp/) to read configuration data from a file (mainly, where the input corpus is located, and the information used to connect to Twitter), and [twitcurl](https://code.google.com/p/twitcurl/) to post to Twitter. `rawr-gen` has no external dependencies, for ease of use, and accepts a corpus as a command-line argument.
 
 The program is roughly divided into two stages: a preprocessing stage and a generation stage. The preprocessing stage runs once at the beginning of the program's run and generates information to ease in the generation of output. This stage runs in O(t^2) time where t is the number of tokens in the input corpus. As you can probably tell, the preprocessing stage can take a fair bit of time to run sometimes. The generation stage actually generates the output and can occur multiple times per program run (in fact it should, otherwise you aren't making good use of the time spent during the preprocessing stage!). It runs in O(n log t) time, where t is the number of tokens in the input corpus, and n is the number of words to generate, which is usually between 5 and 50. As you can see, the generation stage runs far, far more quickly than the preprocessing stage.
\ No newline at end of file
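The two-stage structure described in the README's implementation notes can be illustrated with a minimal word-level k-gram sketch. This is not the code in `kgramstats.cpp`, which this diff does not show. The names `Kgram`, `Model`, `tokenize`, `preprocess`, and `generate` are hypothetical, and the table is a plain `std::map`, so the preprocessing cost here differs from the O(t^2) bound the README quotes. Each generation step is one map lookup, which is logarithmic in the number of distinct k-grams; only the choice of the starting k-gram is a linear walk.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not the kgramstats.cpp API.
using Kgram = std::vector<std::string>;
using Model = std::map<Kgram, std::vector<std::string>>;

// Split a corpus into whitespace-separated tokens.
std::vector<std::string> tokenize(const std::string& corpus)
{
  std::istringstream in(corpus);
  std::vector<std::string> tokens;
  std::string word;
  while (in >> word) tokens.push_back(word);
  return tokens;
}

// Preprocessing stage: for every run of k tokens, record the token that follows it.
Model preprocess(const std::vector<std::string>& tokens, std::size_t k)
{
  assert(k > 0);
  Model model;
  for (std::size_t i = 0; i + k < tokens.size(); ++i)
  {
    Kgram key(tokens.begin() + i, tokens.begin() + i + k);
    model[key].push_back(tokens[i + k]);
  }
  return model;
}

// Generation stage: start from a random k-gram, then repeatedly sample a
// recorded follower of the last k tokens until n tokens have been produced.
std::vector<std::string> generate(const Model& model, std::size_t n, std::mt19937& rng)
{
  std::vector<std::string> out;
  if (model.empty()) return out;

  auto pick = [&rng](std::size_t size) {
    return std::uniform_int_distribution<std::size_t>(0, size - 1)(rng);
  };

  // Linear walk to a random starting key; acceptable for a sketch.
  Kgram current = std::next(model.begin(), pick(model.size()))->first;
  out = current;

  while (out.size() < n)
  {
    auto it = model.find(current);
    if (it == model.end()) break; // this k-gram only appears at the very end of the corpus
    const std::string& next = it->second[pick(it->second.size())];
    out.push_back(next);
    current.erase(current.begin());
    current.push_back(next);
  }

  if (out.size() > n) out.resize(n); // the starting k-gram may already exceed n
  return out;
}
```

Calling `preprocess(tokenize(text), 2)` once and then `generate(model, 20, rng)` as often as needed mirrors the README's point that the table is built once and reused for many outputs.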
diff --git a/gen.cpp b/gen.cpp
index dc73e0f..e4a58e5 100644
--- a/gen.cpp
+++ b/gen.cpp
@@ -7,18 +7,34 @@
 #include <cstdlib>
 #include <fstream>
 #include <iostream>
-#include <unistd.h>
-#include <yaml-cpp/yaml.h>
 
 using namespace::std;
 
 int main(int argc, char** args)
 {
   srand(time(NULL));
 
-  YAML::Node config = YAML::LoadFile("config.yml");
-
-  ifstream infile(config["corpus"].as<std::string>().c_str());
+  if (argc == 1)
+  {
+    cout << "rawr-gen, version 1.0" << endl;
+    cout << "Usage: rawr-gen corpus-file" << endl;
+    cout << " where 'corpus-file' is the path to your input" << endl;
+
+    return 0;
+  }
+
+  ifstream infile(args[1]);
+  if (!infile)
+  {
+    cout << "rawr-gen, version 1.0" << endl;
+    cout << "Usage: rawr-gen corpus-file" << endl;
+    cout << " where 'corpus-file' is the path to your input" << endl;
+    cout << endl;
+    cout << "The file you specified does not exist." << endl;
+
+    return 0;
+  }
+
   string corpus;
   string line;
   while (getline(infile, line))