From 965a3206df834f846f2c560438c80a707dcee4cb Mon Sep 17 00:00:00 2001
From: Kelly Rauchenberger
Date: Mon, 18 Apr 2016 15:09:20 -0400
Subject: Fixed problem with words containing certain characters

The generator previously ignored WordNet lemmas containing certain
non-alphabetic characters (hyphens, slashes, digits, apostrophes).
Besides leaving these words out of the generated datafile, this had the
side effect that relationships involving the ignored words (e.g.
hypernymy, synonymy, etc.) were instead attached to the word with id 0,
which did not exist. This rarely caused a failure in direct queries, but
it allowed hierarchical queries (most notably full hyponymy, which is
where the error was noticed) to return far more lemmas than they should
have, because a very large number of words could be reached transitively
through the sentinel word id 0.

The generator has been fixed so that it no longer ignores words
containing special characters. This removed word id 0 from most
relationships and therefore fixed hierarchical queries.

The only remaining occurrences of word id 0 are as a synonym of
"free-flying" (synset 301380571) and as an anti-mannernym of "aerially"
(synset 400202718). This is because the WordNet data is malformed in the
definitions of two words: "aerial" (synset 301380267) and "marine"
(synset 301380721). The generator ignored those two lines, causing the
described error, although ignoring the latter word did not cause any
other errors.

The bug was discovered when the Twitter bot difference
(https://github.com/hatkirby/difference) generated a tweet
(https://twitter.com/differencebot/status/722084219925700613) as a
result of a full hyponym query of "artifact" returning the noun
"tearaway".
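The transitive effect of word id 0 described above can be illustrated
with a minimal sketch. This is a simplified model, not the generator's
or the library's actual query code: the graph type, the function names
(build_graph, full_hyponyms, sentinel_example) and the word ids 1 to 4
are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Hypothetical model: maps a word id to the ids of its direct hyponyms.
using Graph = std::map<int, std::vector<int>>;

// Build the graph from (hypernym, hyponym) pairs. Any id listed in
// `dropped` is replaced by the sentinel 0, mimicking a failed lookup
// for a lemma that was skipped during generation.
Graph build_graph(const std::vector<std::pair<int, int>>& edges,
                  const std::set<int>& dropped) {
    Graph g;
    for (const auto& e : edges) {
        int parent = dropped.count(e.first) ? 0 : e.first;
        int child = dropped.count(e.second) ? 0 : e.second;
        g[parent].push_back(child);
    }
    return g;
}

// Full (transitive) hyponym query: breadth-first search from `start`.
std::set<int> full_hyponyms(const Graph& g, int start) {
    std::set<int> seen;
    std::queue<int> pending;
    pending.push(start);
    while (!pending.empty()) {
        int cur = pending.front();
        pending.pop();
        auto it = g.find(cur);
        if (it == g.end()) continue;
        for (int child : it->second) {
            if (seen.insert(child).second) pending.push(child);
        }
    }
    return seen;
}

// Two unrelated edges: 1 -> 2 and 3 -> 4. If 2 and 3 are skipped, both
// collapse onto id 0 and word 4 becomes reachable from word 1.
void sentinel_example() {
    std::vector<std::pair<int, int>> edges = {{1, 2}, {3, 4}};
    assert(full_hyponyms(build_graph(edges, {}), 1) == std::set<int>({2}));
    assert(full_hyponyms(build_graph(edges, {2, 3}), 1) ==
           std::set<int>({0, 4}));
}
```

In this model a single shared sentinel is enough to join otherwise
unrelated hierarchies, which matches the over-broad results described
for full hyponymy.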
---
 generator/generator.cpp | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

(limited to 'generator')

diff --git a/generator/generator.cpp b/generator/generator.cpp
index e2ebfa1..3201154 100644
--- a/generator/generator.cpp
+++ b/generator/generator.cpp
@@ -1103,7 +1103,7 @@ int main(int argc, char** argv)
     {
       ppgs.update();
 
-      std::regex relation("^s\\(([134]\\d{8}),(\\d+),'([\\w ]+)',");
+      std::regex relation("^s\\(([134]\\d{8}),(\\d+),'(.+)',\\w,\\d+,\\d+\\)\\.$");
       std::smatch relation_data;
       if (!std::regex_search(line, relation_data, relation))
       {
@@ -1113,6 +1113,11 @@ int main(int argc, char** argv)
       int synset_id = stoi(relation_data[1]);
       int wnum = stoi(relation_data[2]);
       std::string word = relation_data[3];
+      size_t word_it;
+      while ((word_it = word.find("''")) != std::string::npos)
+      {
+        word.erase(word_it, 1);
+      }
 
       std::string query;
       switch (synset_id / 100000000)
-- 
cgit 1.4.1
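
The effect of the regex change and the apostrophe handling in the patch
above can be checked in isolation with a small sketch. The two patterns
and the erase loop are taken from the diff; the SenseLine struct, the
function names and the sample line for "ne'er-do-well" (including its
synset id) are assumptions made up for illustration, not real WordNet
data or generator code.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical holder for the three captured fields of one s(...) line.
struct SenseLine {
    int synset_id;
    int wnum;
    std::string word;
};

// Same loop as in the patch: each "''" loses one apostrophe. The search
// restarts from the beginning every time, so a longer run of apostrophes
// would collapse to a single one.
std::string unescape_quotes(std::string word) {
    std::size_t pos;
    while ((pos = word.find("''")) != std::string::npos) {
        word.erase(pos, 1);
    }
    return word;
}

// Pattern from before the patch: the lemma may only contain \w and spaces.
bool old_pattern_matches(const std::string& line) {
    static const std::regex relation(
        R"re(^s\(([134]\d{8}),(\d+),'([\w ]+)',)re");
    return std::regex_search(line, relation);
}

// Pattern from the patch: the lemma is any text, anchored by the
// remaining fields and the closing ")." of the line.
bool parse_sense_line(const std::string& line, SenseLine& out) {
    static const std::regex relation(
        R"re(^s\(([134]\d{8}),(\d+),'(.+)',\w,\d+,\d+\)\.$)re");
    std::smatch m;
    if (!std::regex_search(line, m, relation)) return false;
    out.synset_id = std::stoi(m[1].str());
    out.wnum = std::stoi(m[2].str());
    out.word = unescape_quotes(m[3].str());
    return true;
}

// Usage check on a made-up line containing a hyphen and an escaped quote.
void sense_line_example() {
    const std::string line = "s(104000000,1,'ne''er-do-well',n,1,0).";
    SenseLine s{};
    assert(!old_pattern_matches(line));
    assert(parse_sense_line(line, s));
    assert(s.word == "ne'er-do-well");
}
```

Anchoring the new pattern on the trailing fields is what lets the lemma
group accept arbitrary characters without running past the closing
quote.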