Diffstat (limited to 'docs')
-rw-r--r-- | docs/new_object_structure.md | 72 | ||||
-rw-r--r-- | docs/object_structure_example.md | 26 | ||||
-rw-r--r-- | docs/verb_frames.md | 100 |
3 files changed, 198 insertions, 0 deletions
diff --git a/docs/new_object_structure.md b/docs/new_object_structure.md new file mode 100644 index 0000000..9e25615 --- /dev/null +++ b/docs/new_object_structure.md | |||
@@ -0,0 +1,72 @@ | |||
1 | # New object structure | ||
2 | |||
3 | The rewrite of verbly uses a completely redesigned object structure that builds on the existing WordNet structure and extends it with data from other sources. | ||
4 | |||
5 | ## notion | ||
6 | Something that can be expressed with words. fields: part of speech, wnid (WordNet ID, optional). nouns also have images field (number of images ImageNet has for this notion). has many words. related to each other through hypernymy, meronymy, synonymy, etc. parts of speech are: | ||
7 | - noun {0} | ||
8 | - adjective {1} | ||
9 | - adverb {2} | ||
10 | - verb {3} | ||
11 | - preposition {4} | ||
12 | |||
13 | relations are: | ||
14 | - hypernymy (noun/noun and verb/verb) | ||
15 | - instantiation (noun/noun) | ||
16 | - meronymy (noun/noun) | ||
17 | - variation (noun/adjective) | ||
18 | - similarity (adjective/adjective) [symmetric] | ||
19 | - entailment (verb/verb) | ||
20 | - causality (verb/verb) | ||
21 | |||
22 | notion also has a special relation "is a" between a preposition and a string group name | ||
23 | |||
24 | ## word | ||
25 | An expression of a concept. belongs to a notion. belongs to a lemma. tag count (optional). adjectives also have position field. verbs optionally belong to groups. has several relations to itself: | ||
26 | - antonymy (noun/noun, adjective/adjective, adverb/adverb, verb/verb) [symmetric] | ||
27 | - specification (adjective/adjective, verb/verb) | ||
28 | - pertainymy (noun/adjective) | ||
29 | - mannernymy (adjective/adverb) | ||
30 | - usage (noun/noun, noun/adjective, noun/adverb, noun/verb) | ||
31 | - topicality (noun/noun, noun/adjective, noun/adverb, noun/verb) | ||
32 | - regionality (noun/noun, noun/adjective, noun/adverb, noun/verb) | ||
33 | |||
34 | adjective positions are: | ||
35 | - predicate {0} | ||
36 | - attributive {1} | ||
37 | - postnominal {2} | ||
38 | |||
39 | ## lemma | ||
40 | A lexical set that can be used to represent words. has many inflections (including the base inflection). has many words (that it represents). relations with itself: | ||
41 | - derivation [not implemented yet] | ||
42 | |||
43 | in implementation, this object has no fields, and thus it does not need a table. uniquely identifiable by base form. constructible from base form. | ||
44 | |||
45 | ## lemma/form | ||
46 | The inflection relationship relates an uninflected lemma to its inflected forms. there can potentially be multiple ways to inflect a lemma, so the tuple (lemma_id, category) is not necessarily unique. field: type of inflection. ex: "care" is a singular (base) inflection of a noun, and a base inflection of a verb. "cares" is both a plural and an s form inflection of "care". the types of inflection are: | ||
47 | - base {0} | ||
48 | - plural (nouns) {1} | ||
49 | - comparative (adjectives and adverbs) {2} | ||
50 | - superlative (adjectives and adverbs) {3} | ||
51 | - past tense (verbs) {4} | ||
52 | - past participle (verbs) {5} | ||
53 | - ing form (verbs) {6} | ||
54 | - s form (verbs) {7} | ||
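
The lemma/form relationship described above can be sketched as a join table. This is only an illustration using Python's `sqlite3` module: the table names `forms` and `lemmas_forms` and the column `form_id` are assumptions made for the example (the text above only names `lemma_id` and `category`), and the real verbly schema may differ. It shows "cares" being joined to the "care" lemma under two different categories.

```python
import sqlite3

# Inflection categories, numbered as in the list above.
(BASE, PLURAL, COMPARATIVE, SUPERLATIVE,
 PAST_TENSE, PAST_PARTICIPLE, ING_FORM, S_FORM) = range(8)


def build_example():
    """Create an in-memory database holding the "care"/"cares" example."""
    db = sqlite3.connect(":memory:")
    # Hypothetical table: one row per distinct text form.
    db.execute(
        "CREATE TABLE forms (form_id INTEGER PRIMARY KEY, form TEXT UNIQUE NOT NULL)"
    )
    # Hypothetical join table. There is deliberately no uniqueness
    # constraint on (lemma_id, category), because a lemma may be
    # inflected in more than one way for the same category.
    db.execute(
        "CREATE TABLE lemmas_forms ("
        "lemma_id INTEGER NOT NULL, "
        "form_id INTEGER NOT NULL, "
        "category INTEGER NOT NULL)"
    )
    db.executemany("INSERT INTO forms VALUES (?, ?)", [(1, "care"), (2, "cares")])
    care_lemma = 1  # a lemma has no fields, so an id is all it needs
    db.executemany(
        "INSERT INTO lemmas_forms VALUES (?, ?, ?)",
        [(care_lemma, 1, BASE), (care_lemma, 2, PLURAL), (care_lemma, 2, S_FORM)],
    )
    return db


def categories_of(db, text):
    """Return the sorted inflection categories under which `text` appears."""
    rows = db.execute(
        "SELECT lf.category FROM lemmas_forms AS lf "
        "JOIN forms AS f ON f.form_id = lf.form_id "
        "WHERE f.form = ? ORDER BY lf.category",
        (text,),
    )
    return [row[0] for row in rows]
```

Looking up a word by spelling then needs a single comparison against `forms.form`, whatever the inflection type is.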
55 | |||
56 | ## form | ||
57 | An inflection of a lemma. fields: text form, complexity (number of spaces plus one), proper (true if there is at least one capital letter, false otherwise). uniquely identifiable by text form. constructible from text form. has many and belongs to many pronunciations. | ||
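
The two derived fields of a form follow directly from its text. A minimal sketch in Python, assuming the definitions given above are applied literally; the function names `complexity` and `is_proper` are made up for this example and are not part of verbly.

```python
def complexity(text: str) -> int:
    """Number of spaces in the text form, plus one."""
    return text.count(" ") + 1


def is_proper(text: str) -> bool:
    """True if the text form contains at least one capital letter."""
    return any(ch.isupper() for ch in text)
```

For instance, under these definitions "route" has complexity 1 and is not proper, while "New York" has complexity 2 and is proper.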
58 | |||
59 | ## form/pronunciation | ||
60 | One spelling of a word can have multiple pronunciations (whether by homography or speaker variation), but multiple words can also have the same pronunciation (homophony). the current data we have doesn't tell us which pronunciations go with which words, so we just associate all pronunciations of a form with the form. | ||
61 | |||
62 | ## pronunciation | ||
63 | Fields: phonemes, rhyme phonemes, prerhyme, syllables, stress structure. has many and belongs to many forms. | ||
64 | |||
65 | ## frame | ||
66 | A verb frame. belongs to a group. has many parts. | ||
67 | |||
68 | ## group (word/frame) | ||
69 | A collection of verb frames. has many frames. has many words. this is not really an object per se, but rather the name given to the cross join between sets of words and sets of frames. in implementation, this join has no fields, and thus it does not need a table. | ||
70 | |||
71 | ## part | ||
72 | An ordered element of a verb frame. belongs to a frame. fields: index (position in the frame), and type. the tuple (frame_id, index) is unique. there are additional fields depending on the type of the part. noun phrases have role and selrestrs. prepositions have prepositions and preposition_literality. literals have literal_value. in addition, noun phrases have synrestrs, which, in order to be queryable, are located in a separate table called "synrestrs". | ||
diff --git a/docs/object_structure_example.md b/docs/object_structure_example.md new file mode 100644 index 0000000..d056204 --- /dev/null +++ b/docs/object_structure_example.md | |||
@@ -0,0 +1,26 @@ | |||
1 | # Object structure example | ||
2 | |||
3 | verbly has a rather complicated object model. Understanding the object model is key to being able to query for data effectively. To aid in understanding why the object model is set up the way it is, this article steps through an example of some objects in the model, how they're related, and why those relationships are useful. | ||
4 | |||
5 | The point of verbly is to be able to manipulate concepts that can be expressed using words. For instance, "a course that is traveled". We know that this can be expressed with the word "route". So, we create the "word" object to represent this instantiation of a concept. | ||
6 | |||
7 | One issue that arises is that we know there is another word for this concept, i.e. "path". This relationship is called synonymy. We could have a many-to-many relationship between words in order to represent synonymy, but it is easier to create a new object called "notion", which represents something that can be expressed with words. So, both "route" and "path" are words belonging to the notion "a course that is traveled". | ||
8 | |||
9 | The next issue is that words need to be able to be inflected. The noun "route" also has a plural form, "routes". We could put fields on the word object for each possible type of inflection, and just stuff the textual representation of those inflections into there, but there are a couple of problems with that. First, that makes querying for words harder. One of the major functions verbly provides is the ability to query words, so it is important that this is easy to do. If you want to find a word that is spelled "routes", you would have to look for words that have a base form of "routes" OR a plural form of "routes" OR etc... for each type of inflection. | ||
10 | |||
11 | We could solve this by creating an "inflection" object that belongs to the "word" object and contains fields for the text of the inflection, and the type of inflection. This makes querying for words by text easier, but doesn't completely solve the problem. Consider the notion of "a regular itinerary". This contains a word with the singular form "route" and the plural form "routes", as in a bus route. This word belongs to a different notion than the first word we described, but it is inflected in exactly the same way. This is a form of homography, and would require having duplicate "inflection" objects that share the same text. | ||
12 | |||
13 | The way verbly approaches this is by forgetting the "inflection" object, and creating two different objects: "lemma" and "form". Let's start with "form". A "form" is a literal collection of characters, such as "route" or "routes". There is exactly one "form" for every collection of characters that is part of a word. Thus, it doesn't matter that "route" can mean both "a course that is traveled" and "a regular itinerary". They both use the same "form". | ||
14 | |||
15 | How do they use the "form"? Via the "lemma" object. A "lemma" describes how to inflect a word. Every "word" has a "lemma", and multiple "word"s can have the same "lemma", as in the two "route"s we described earlier. A "lemma" also forms a many-to-many relationship with "form" with an inflection type attached. For instance, the noun lemma for "route" has a base form relationship with the "route" form, and a plural form relationship with the "routes" form. | ||
16 | |||
17 | This object model provides many advantages in addition to those described already: | ||
18 | |||
19 | * There is a second form of homography where different words share a form but are inflected differently. Consider the notion "to plan a course that is traveled". There is a word for that notion with the base form "route". However, the rest of the lemma is different, because this is a verb and does not have a plural inflection, but instead has a simple present, a present participle, a simple past, and a past participle. This is a different "lemma" from the first one we described, but it is related to the same "route" form from earlier. | ||
20 | * It is possible for two different inflections of a lemma to have the same form. Consider the verb lemma in the previous paragraph. The simple past inflection and the past participle inflection of that lemma are both spelled "routed". They are joined to the same form, but are distinct inflections. | ||
21 | * It is possible for there to be two different ways to inflect a lemma. There isn't a "route"-related example for this, so consider the adjective word with the base form "small" meaning "inferior in size". The comparative inflection of this word can be spelled "less" or "lesser". These are two different forms, but they are joined to the lemma with the same inflection. | ||
22 | |||
23 | The next issue concerns pronunciations. If there were only one way to pronounce a form, it would be simple to put the pronunciation information into the "form" object. However, speaker variation prevents this from being so. For example, the form "route" can be pronounced as both "root" and "r-ow-t". To handle this, we create a "pronunciation" object that has a many-to-many relationship with "form". The reason that it is many-to-many as opposed to "form" having many "pronunciation"s is that homophony exists. Consider the form "rout", which is the base form of a word for the notion "to defeat completely". It is pronounced "r-ow-t", which is a pronunciation for the form "route". | ||
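
The many-to-many relationship between forms and pronunciations can be illustrated with a small in-memory index. This is a sketch only: the class `PronunciationIndex` and its methods are hypothetical, and pronunciations are written as the informal strings used in the paragraph above rather than as real phoneme data.

```python
from collections import defaultdict


class PronunciationIndex:
    """Hypothetical two-way index between forms and pronunciations."""

    def __init__(self):
        self._by_form = defaultdict(set)
        self._by_pronunciation = defaultdict(set)

    def link(self, form, pronunciation):
        """Record that `form` can be pronounced as `pronunciation`."""
        self._by_form[form].add(pronunciation)
        self._by_pronunciation[pronunciation].add(form)

    def pronunciations(self, form):
        """All pronunciations associated with a form."""
        return set(self._by_form.get(form, ()))

    def homophones(self, form):
        """Other forms sharing at least one pronunciation with `form`."""
        result = set()
        for pronunciation in self._by_form.get(form, ()):
            result |= self._by_pronunciation[pronunciation]
        result.discard(form)
        return result


index = PronunciationIndex()
index.link("route", "root")
index.link("route", "r-ow-t")
index.link("rout", "r-ow-t")
```

With this data, `index.homophones("route")` contains "rout" because both forms are linked to "r-ow-t". A one-to-many layout could not express that shared pronunciation without duplicating it.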
24 | |||
25 | This is not an exhaustive list of the relationships that verbly objects can have. For more detailed information, check out [the object structure document](https://code.fourisland.com/verbly/about/docs/new_object_structure.md). | ||
26 | |||
diff --git a/docs/verb_frames.md b/docs/verb_frames.md new file mode 100644 index 0000000..64f3f7e --- /dev/null +++ b/docs/verb_frames.md | |||
@@ -0,0 +1,100 @@ | |||
1 | # Verb frames | ||
2 | |||
3 | Verbly's verb frame data comes from VerbNet, a database compiled by the University of Colorado Boulder Department of Linguistics. More information, including a download for v3.2, the version used in the canonical verbly datafile, can be found on [Martha Palmer's website](http://verbs.colorado.edu/~mpalmer/projects/verbnet.html). | ||
4 | |||
5 | The downloadable data for VerbNet v3.2 has a lot of quirks and inadequacies that make it unsuitable for natural language generation. In particular, it makes no distinction between noun phrases and adjective phrases, so figuring out how to fill in that particular blank is in most cases impossible. In order to make the data more usable, I have gone through the data and sanitized it in a lot of places. A patch file, applicable to a clean VerbNet v3.2 download, [can be found in the repository](https://code.fourisland.com/verbly/tree/generator/vn-3.2.diff). This patch will likely continue to be updated as verbly is developed. | ||
6 | |||
7 | ## Syntactic Restrictions | ||
8 | The data from VerbNet allows for a set of syntactic restrictions to follow either AND logic or OR logic; however, OR logic is never used, so we shall be ignoring it in our implementation. Additionally, syntactic restrictions are listed as being additive or subtractive; however, each distinct syntactic restriction always appears positively or always appears negatively. Therefore, we can remove the additive/subtractive modifier from our implementation, and change the meaning of the subtractive restrictions to mean the negation of what they "should" mean. These syntactic restrictions frequently indicate that the noun phrase is not actually a noun phrase and should be treated differently, so these are important to watch out for. | ||
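
The simplification described above, dropping the additive/subtractive modifier because each restriction only ever appears with one sign, can be sketched as follows. This is an illustrative Python sketch, not the verbly generator's code: the `(sign, name)` tuple representation, the function names, and the sample data are assumptions. In particular, the signs given for `plural` and `that_comp` in the sample are guesses made for the example.

```python
def collect_polarities(noun_phrases):
    """Map each restriction name to the set of signs it appears with."""
    seen = {}
    for restrictions in noun_phrases:
        for sign, name in restrictions:
            seen.setdefault(name, set()).add(sign)
    return seen


def strip_signs(restrictions, polarities):
    """Reduce an AND-list of signed restrictions to a plain set of names.

    Raises ValueError if a name is used with both signs somewhere in the
    data, since the sign could not then be folded into the name's meaning.
    """
    names = set()
    for _sign, name in restrictions:
        if len(polarities[name]) != 1:
            raise ValueError(f"restriction {name!r} appears with both signs")
        names.add(name)
    return names


# Illustrative sample only; not actual VerbNet data.
SAMPLE = [
    [("+", "plural")],
    [("-", "definite")],
    [("+", "that_comp"), ("-", "tensed_that")],
]
```

Calling `strip_signs(SAMPLE[2], collect_polarities(SAMPLE))` yields the names `that_comp` and `tensed_that`. The sign is gone, and `tensed_that` is simply understood to carry its negative meaning.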
9 | |||
10 | **np_ppart** | ||
11 | As far as I can tell, it has no purpose. It always appears before an ADJP or ADJ though. | ||
12 | |||
13 | **be_sc_ing, ac_ing, sc_ing, np_omit_ing** | ||
14 | Used for gerund phrases. | ||
15 | |||
16 | **oc_ing** | ||
17 | Used for gerund phrases. Always preceded by a noun or an objective pronoun; most of the time, the two are separated by a preposition, but not always. | ||
18 | |||
19 | **poss_ing, possing, pos_ing** | ||
20 | Used for a possessive (whether it be a noun with an apostrophe s or a possessive pronoun) followed by a gerund phrase. | ||
21 | |||
22 | **acc_ing** | ||
23 | Used for a noun (or an objective pronoun) followed by a participle phrase. | ||
24 | |||
25 | **genitive** | ||
26 | Used to indicate that the noun phrase should be possessive. | ||
27 | |||
28 | **that_comp** | ||
29 | Used for the word "that" followed by an independent clause in the simple past perfect tense. | ||
30 | |||
31 | **tensed_that** | ||
32 | Always appears negatively and alongside a that_comp. Use unknown. | ||
33 | |||
34 | **wh_comp** | ||
35 | Used for a phrase starting with the word "whether." The data using this restriction is a bit muddy, so this is not a perfect description. It will likely be cleaned up in a future release. | ||
36 | |||
37 | **what_extract** | ||
38 | Used for a phrase starting with the word "what." The data using this restriction is a bit muddy, so this is not a perfect description. It will likely be cleaned up in a future release. | ||
39 | |||
40 | **how_extract** | ||
41 | Used for a phrase starting with the word "how." The data using this restriction is a bit muddy, so this is not a perfect description. It will likely be cleaned up in a future release. | ||
42 | |||
43 | **sc_to_inf, ac_to_inf, vc_to_inf, rs_to_inf** | ||
44 | Used for infinitive phrases. | ||
45 | |||
46 | **oc_to_inf** | ||
47 | Used for infinitive phrases. Always immediately preceded by a noun or an objective pronoun. | ||
48 | |||
49 | **oc_bare_inf** | ||
50 | Used for infinitive phrases with bare infinitives. Always immediately preceded by a noun or an objective pronoun. | ||
51 | |||
52 | **wh_inf** | ||
53 | Used for the word "how" (or sometimes, "when" or "whether"), followed by an infinitive phrase. Which starting word is used is frame-dependent. The data using this restriction is a bit muddy, so this is not a perfect description. It will likely be cleaned up in a future release. | ||
54 | |||
55 | **what_inf** | ||
56 | Used for the word "what" followed by an infinitive phrase. One frame in empathize-88.2 erroneously uses it to indicate a phrase of the form "what they want." This will likely be cleaned up in a future release. | ||
57 | |||
58 | **wheth_inf** | ||
59 | Used for the word "whether" followed by an infinitive phrase. | ||
60 | |||
61 | **for_comp** | ||
62 | Used to indicate the following format: the word "for", followed by a noun or an objective pronoun, followed by an infinitive phrase. | ||
63 | |||
64 | **quotation** | ||
65 | Used to indicate a quotation. | ||
66 | |||
67 | **plural** | ||
68 | Used to indicate that the noun phrase should be plural. | ||
69 | |||
70 | **definite** | ||
71 | Always used negatively to indicate that a noun phrase should not be definite. | ||
72 | |||
73 | **adv_loc** | ||
74 | Used to indicate either the word "here" or "there." In one case (throw-17.1), the word "away" is acceptable too. This will likely be cleaned up in a future release. | ||
75 | |||
76 | **refl** | ||
77 | Used to indicate the usage of a reflexive pronoun. | ||
78 | |||
79 | **adjp** | ||
80 | Used to indicate an adjective. | ||
81 | |||
82 | **sentential** | ||
83 | Use unknown. | ||
84 | |||
85 | ## Selectional Restrictions | ||
86 | Selectional restrictions are used to semantically filter nouns and prepositions. The namespaces for nouns and prepositions are separate. | ||
87 | |||
88 | Selectional restrictions for nouns are usually found in the role descriptions for each verb group; however, they can occasionally also be found in a specific NP element. Usually, subgroups inherit their roles from their parents, and NP elements inherit their selectional restrictions from the role they are assigned. When an NP element defines selectional restrictions even though its assigned role already has restrictions, or when a role is given restrictions in a subgroup when it already has restrictions in the parent, the parent restrictions are ignored and the child's restrictions are used. | ||
89 | |||
90 | In the original data, restrictions for nouns and roles can be defined using AND logic or OR logic. In a few rare cases, two AND clauses are ORed together. Additionally, restrictions can be either positive or negative. In our implementation, we have flattened the selectional restriction trees in order to make them easier to query and parse. Selectional restrictions are implemented as sets of positive restrictions ORed together. In order to do this, some changes had to be made to the VerbNet data. Specifically, 7 new restrictions were created to represent complex cases in the original data, which were either AND clauses or negative restrictions. The new restrictions are: | ||
91 | |||
92 | - **concrete_inanimate**: concrete && !animate | ||
93 | - **group**: concrete && plural | ||
94 | - **inanimate**: !animate | ||
95 | - **non_region_location**: location && !region | ||
96 | - **non_solid_food**: comestible && !solid | ||
97 | - **slinky**: nonrigid && elongated | ||
98 | - **solid_food**: comestible && solid | ||
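
The flattening described above can be sketched as a lookup from each complex clause to its replacement name. This is a hedged Python illustration rather than the actual generator code: the `(sign, name)` representation of a restriction and the function name `flatten` are assumptions, while the seven mappings are taken from the list above.

```python
# Each key is one AND clause from the original data, expressed as a set of
# (sign, name) pairs; each value is the new positive restriction replacing it.
COMPOSITES = {
    frozenset({("+", "concrete"), ("-", "animate")}): "concrete_inanimate",
    frozenset({("+", "concrete"), ("+", "plural")}): "group",
    frozenset({("-", "animate")}): "inanimate",
    frozenset({("+", "location"), ("-", "region")}): "non_region_location",
    frozenset({("+", "comestible"), ("-", "solid")}): "non_solid_food",
    frozenset({("+", "nonrigid"), ("+", "elongated")}): "slinky",
    frozenset({("+", "comestible"), ("+", "solid")}): "solid_food",
}


def flatten(or_of_ands):
    """Turn an OR of AND clauses into a set of positive restriction names.

    `or_of_ands` is a list of clauses, each a list of (sign, name) pairs.
    The names in the returned set are understood to be ORed together.
    """
    result = set()
    for clause in or_of_ands:
        key = frozenset(clause)
        if len(key) == 1 and next(iter(key))[0] == "+":
            # A single positive restriction needs no translation.
            result.add(next(iter(key))[1])
        elif key in COMPOSITES:
            result.add(COMPOSITES[key])
        else:
            raise ValueError(f"no flattened name for clause {sorted(key)}")
    return result
```

Because the result is a flat set of positive names, checking whether a noun satisfies the restrictions reduces to testing whether any of its properties is in the set.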
99 | |||
100 | For prepositions, selectional restrictions are always positive. Usually at most one restriction is used, but in the rare event that more are present (6 cases out of 146), they are always applied using OR logic. The restrictions used for prepositions are the names of the preposition groups defined in [prepositions.txt](https://code.fourisland.com/verbly/tree/generator/prepositions.txt), which makes querying for applicable prepositions easy. | ||