author: Kelly Rauchenberger <fefferburbia@gmail.com> 2016-04-18 15:09:20 -0400
committer: Kelly Rauchenberger <fefferburbia@gmail.com> 2016-04-18 15:09:20 -0400
commit: 965a3206df834f846f2c560438c80a707dcee4cb
tree: 7b3b817763721b9aac2bf8bc8224ec131859b2e2 /generator
parent: 04338f2b040fee5142904c062e0e38c836601034
Fixed problem with words containing certain characters
The generator previously ignored WordNet lemmas containing certain non-alphabetic characters (hyphens, slashes, numbers, apostrophes). Besides leaving these words out of the generated datafile, this had the side effect of attaching any relationship involving an ignored word (e.g. hypernymy, synonymy, etc.) to the word with id 0, which did not exist. This rarely caused a failure with direct queries, but it allowed hierarchical queries (most notably full hyponymy, which is where the error was noticed) to return far more lemmas than they should have, because a very large number of words could be reached transitively through the sentinel word id 0.

The generator has been fixed to no longer ignore words containing special characters, which removed the word id 0 from most relationships and therefore fixed hierarchical queries. The only remaining occurrences of word id 0 are as a synonym of "free-flying" (synset 301380571) and as an anti-mannernym of "aerially" (synset 400202718). This is because the WordNet data is malformed in the definitions of two words: "aerial" (synset 301380267) and "marine" (synset 301380721). The generator ignores those two lines, which causes the two remaining occurrences, although ignoring the latter word does not cause any other errors.

The bug was discovered when the Twitter bot difference (https://github.com/hatkirby/difference) generated a tweet (https://twitter.com/differencebot/status/722084219925700613) as a result of returning the noun "tearaway" in a full hyponym query of "artifact".
Diffstat (limited to 'generator')
-rw-r--r-- generator/generator.cpp 7
1 files changed, 6 insertions, 1 deletions
diff --git a/generator/generator.cpp b/generator/generator.cpp
index e2ebfa1..3201154 100644
--- a/generator/generator.cpp
+++ b/generator/generator.cpp
@@ -1103,7 +1103,7 @@ int main(int argc, char** argv)
 {
   ppgs.update();
 
-  std::regex relation("^s\\(([134]\\d{8}),(\\d+),'([\\w ]+)',");
+  std::regex relation("^s\\(([134]\\d{8}),(\\d+),'(.+)',\\w,\\d+,\\d+\\)\\.$");
   std::smatch relation_data;
   if (!std::regex_search(line, relation_data, relation))
   {
@@ -1113,6 +1113,11 @@ int main(int argc, char** argv)
   int synset_id = stoi(relation_data[1]);
   int wnum = stoi(relation_data[2]);
   std::string word = relation_data[3];
+  size_t word_it;
+  while ((word_it = word.find("''")) != std::string::npos)
+  {
+    word.erase(word_it, 1);
+  }
 
   std::string query;
   switch (synset_id / 100000000)