author Star Rauchenberger <fefferburbia@gmail.com> 2023-10-03 22:17:22 +0000
committer Star Rauchenberger <fefferburbia@gmail.com> 2023-10-03 22:17:22 +0000
commit 13ace4d50e9a35090be4914775e30e76ffed393f (patch)
tree 9139b8da34e45709cf53ff7b8ca79e36613e1f6b
parent e8a2ef4c2513c0d11e050cc44df4c59f50b94f9a (diff)
Imported docs from Github HEAD master
-rw-r--r-- docs/new_object_structure.md 72
-rw-r--r-- docs/object_structure_example.md 26
-rw-r--r-- docs/verb_frames.md 100
3 files changed, 198 insertions, 0 deletions
diff --git a/docs/new_object_structure.md b/docs/new_object_structure.md
new file mode 100644
index 0000000..9e25615
--- /dev/null
+++ b/docs/new_object_structure.md
@@ -0,0 +1,72 @@
1# New object structure
2
3The rewrite of verbly uses a completely redesigned object structure that was designed to build off of the already-existing WordNet structure and add to it the data we are getting from other sources.
4
5## notion
6Something that can be expressed with words. fields: part of speech, wnid (WordNet ID, optional). nouns also have images field (number of images ImageNet has for this notion). has many words. related to each other through hypernymy, meronymy, synonymy, etc. parts of speech are:
7- noun {0}
8- adjective {1}
9- adverb {2}
10- verb {3}
11- preposition {4}
12
13relations are:
14- hypernymy (noun/noun and verb/verb)
15- instantiation (noun/noun)
16- meronymy (noun/noun)
17- variation (noun/adjective)
18- similarity (adjective/adjective) [symmetric]
19- entailment (verb/verb)
20- causality (verb/verb)
21
22notion also has a special relation "is a" between a preposition and a string group name
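
The part-of-speech codes and the notion relations listed above could be captured in a small lookup table. The following is only a sketch for illustration, not verbly's actual implementation; the names `PartOfSpeech`, `NOTION_RELATIONS`, and `relation_allowed` are hypothetical.

```python
from enum import IntEnum


class PartOfSpeech(IntEnum):
    """Numeric codes as listed in the notion section."""
    NOUN = 0
    ADJECTIVE = 1
    ADVERB = 2
    VERB = 3
    PREPOSITION = 4


N, A, V = PartOfSpeech.NOUN, PartOfSpeech.ADJECTIVE, PartOfSpeech.VERB

# Hypothetical table: relation name -> allowed (left, right) part-of-speech pairs.
NOTION_RELATIONS = {
    "hypernymy": {(N, N), (V, V)},
    "instantiation": {(N, N)},
    "meronymy": {(N, N)},
    "variation": {(N, A)},
    "similarity": {(A, A)},  # symmetric
    "entailment": {(V, V)},
    "causality": {(V, V)},
}


def relation_allowed(relation, left, right):
    """Return True if the relation may connect notions with these parts of speech."""
    return (left, right) in NOTION_RELATIONS[relation]
```

For example, `relation_allowed("hypernymy", PartOfSpeech.VERB, PartOfSpeech.VERB)` is true, while a meronymy between a noun and a verb is rejected.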
23
24## word
25An expression of a concept. belongs to a notion. belongs to a lemma. tag count (optional). adjectives also have position field. verbs optionally belong to groups. has several relations to itself:
26- antonymy (noun/noun, adjective/adjective, adverb/adverb, verb/verb) [symmetric]
27- specification (adjective/adjective, verb/verb)
28- pertainymy (noun/adjective)
29- mannernymy (adjective/adverb)
30- usage (noun/noun, noun/adjective, noun/adverb, noun/verb)
31- topicality (noun/noun, noun/adjective, noun/adverb, noun/verb)
32- regionality (noun/noun, noun/adjective, noun/adverb, noun/verb)
33
34adjective positions are:
35- predicate {0}
36- attributive {1}
37- postnominal {2}
38
39## lemma
40A lexical set that can be used to represent words. has many inflections (including the base inflection). has many words (that it represents). relations with itself:
41- derivation [not implemented yet]
42
43in implementation, this object has no fields, and thus it does not need a table. uniquely identifiable by base form. constructible from base form.
44
45## lemma/form
46The inflection relationship relates an uninflected lemma to its inflected forms. there can potentially be multiple ways to inflect a lemma, so the tuple (lemma_id, category) is not necessarily unique. field: type of inflection. ex: "care" is a singular (base) inflection of a noun, and a base inflection of a verb. "cares" is both a plural and an s form inflection of "care". the types of inflection are:
47- base {0}
48- plural (nouns) {1}
49- comparative (adjectives and adverbs) {2}
50- superlative (adjectives and adverbs) {3}
51- past tense (verbs) {4}
52- past participle (verbs) {5}
53- ing form (verbs) {6}
54- s form (verbs) {7}
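
The "care"/"cares" example above can be sketched as rows of a lemma/form join. This is an illustrative model only, assuming a lemma is identified by its base form as the lemma section states; `InflectionType`, `LEMMA_FORMS`, and `inflection_types` are hypothetical names.

```python
from enum import IntEnum


class InflectionType(IntEnum):
    """Numeric codes as listed in the lemma/form section."""
    BASE = 0
    PLURAL = 1
    COMPARATIVE = 2
    SUPERLATIVE = 3
    PAST_TENSE = 4
    PAST_PARTICIPLE = 5
    ING_FORM = 6
    S_FORM = 7


# Hypothetical join rows: (lemma base form, inflection type, form text).
LEMMA_FORMS = [
    ("care", InflectionType.BASE, "care"),
    ("care", InflectionType.PLURAL, "cares"),
    ("care", InflectionType.S_FORM, "cares"),
]


def inflection_types(lemma, form_text):
    """List every way the given form text inflects the given lemma."""
    return sorted(t for l, t, f in LEMMA_FORMS if l == lemma and f == form_text)
```

Here `inflection_types("care", "cares")` yields both the plural and the s form, matching the description above.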
55
56## form
57An inflection of a lemma. fields: text form, complexity (number of spaces plus one), proper (true if there is at least one capital letter, false otherwise). uniquely identifiable by text form. constructible from text form. has many and belongs to many pronunciations.
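
The two derived fields of a form follow directly from its text. A minimal sketch of the rules as stated above (the function names are hypothetical, and verbly's own generator may compute these differently):

```python
def complexity(text):
    """Number of spaces in the form text, plus one."""
    return text.count(" ") + 1


def is_proper(text):
    """True if the form text contains at least one capital letter."""
    return any(ch.isupper() for ch in text)
```

Under these rules, "ice cream" has complexity 2, and "New York" is proper while "route" is not.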
58
59## form/pronunciation
60One spelling of a word can have multiple pronunciations (whether by homography or speaker variation), but multiple words can also have the same pronunciation (homophony). the current data we have doesn't tell us which pronunciations go with which words, so we just associate all pronunciations of a form with the form.
61
62## pronunciation
63Fields: phonemes, rhyme phonemes, prerhyme, syllables, stress structure. has many and belongs to many forms.
64
65## frame
66A verb frame. belongs to a group. has many parts.
67
68## group (word/frame)
69A collection of verb frames. has many frames. has many words. this is not really an object per se; rather, it is the name given to the cross join between sets of words and sets of frames. in implementation, this join has no fields, and thus it does not need a table.
70
71## part
72An ordered element of a verb frame. belongs to a frame. fields: index (position in the frame), and type. the tuple (frame_id, index) is unique. there are additional fields depending on the type of the part. noun phrases have role and selrestrs. prepositions have prepositions and preposition_literality. literals have literal_value. in addition, noun phrases have synrestrs, which, in order to be queryable, are located in a separate table called "synrestrs".
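
The uniqueness of the (frame_id, index) tuple could be enforced at the database level. The sketch below uses an in-memory SQLite database; the table layout and the column names `part_index` and `part_type` are assumptions made for illustration and are not taken from verbly's real schema.

```python
import sqlite3


def make_parts_db():
    """Create an in-memory database with a hypothetical 'parts' table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE parts (
            part_id    INTEGER PRIMARY KEY,
            frame_id   INTEGER NOT NULL,
            part_index INTEGER NOT NULL,  -- position in the frame
            part_type  INTEGER NOT NULL,
            UNIQUE (frame_id, part_index)
        )
        """
    )
    return db


def add_part(db, frame_id, part_index, part_type):
    """Insert a part; return False if (frame_id, part_index) is already taken."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO parts (frame_id, part_index, part_type) VALUES (?, ?, ?)",
                (frame_id, part_index, part_type),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A second part at the same position of the same frame is rejected, while the same position in a different frame is accepted.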
diff --git a/docs/object_structure_example.md b/docs/object_structure_example.md
new file mode 100644
index 0000000..d056204
--- /dev/null
+++ b/docs/object_structure_example.md
@@ -0,0 +1,26 @@
1# Object structure example
2
3verbly has a rather complicated object model. Understanding the object model is key to being able to query for data effectively. To aid in understanding why the object model is set up the way it is, this article steps through an example of some objects in the model, how they're related, and why those relationships are useful.
4
5The point of verbly is to be able to manipulate concepts that can be expressed using words. For instance, "a course that is traveled". We know that this can be expressed with the word "route". So, we create the "word" object to represent this instantiation of a concept.
6
7One issue that arises is that we know there is another word for this concept, namely "path". This relationship is called synonymy. We could have a many-to-many relationship between words in order to represent synonymy, but it is easier to create a new object called "notion", which represents something that can be expressed with words. So, both "route" and "path" are words belonging to the notion "a course that is traveled".
8
9The next issue is that words need to be able to be inflected. The noun "route" also has a plural form, "routes". We could put fields on the word object for each possible type of inflection, and just stuff the textual representation of those inflections into there, but there are a couple of problems with that. First, that makes querying for words harder. One of the major functions verbly provides is the ability to query words, so it is important that this is easy to do. If you want to find a word that is spelled "routes", you would have to look for words that have a base form of "routes" OR a plural form of "routes" OR etc... for each type of inflection.
10
11We could solve this by creating an "inflection" object that belongs to the "word" object and contains fields for the text of the inflection, and the type of inflection. This makes querying for words by text easier, but doesn't completely solve the problem. Consider the notion of "a regular itinerary". This contains a word with the singular form "route" and the plural form "routes", as in a bus route. This word belongs to a different notion than the first word we described, but it is inflected in exactly the same way. This is a form of homography, and would require having duplicate "inflection" objects that share the same text.
12
13The way verbly approaches this is by forgetting the "inflection" object, and creating two different objects: "lemma" and "form". Let's start with "form". A "form" is a literal collection of characters, such as "route" or "routes". There is exactly one "form" for every collection of characters that is part of a word. Thus, it doesn't matter that "route" can mean both "a course that is traveled" and "a regular itinerary". They both use the same "form".
14
15How do they use the "form"? Via the "lemma" object. A "lemma" describes how to inflect a word. Every "word" has a "lemma", and multiple "word"s can have the same "lemma", as in the two "route"s we described earlier. A "lemma" also forms a many-to-many relationship with "form" with an inflection type attached. For instance, a "lemma" that has a base form relationship with the "route" form, and a plural form relationship with the "routes" form.
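
The relationships described so far can be sketched with plain data. This is a toy model for illustration only; the lemma labels such as `"route (noun)"` and the names `WORDS`, `LEMMA_FORMS`, and `notions_spelled` are hypothetical and do not reflect verbly's query API.

```python
# Each word belongs to exactly one notion and one lemma.
WORDS = [
    {"notion": "a course that is traveled", "lemma": "route (noun)"},
    {"notion": "a course that is traveled", "lemma": "path (noun)"},
    {"notion": "a regular itinerary", "lemma": "route (noun)"},
]

# Many-to-many join between lemmas and forms, tagged with an inflection type.
LEMMA_FORMS = [
    ("route (noun)", "base", "route"),
    ("route (noun)", "plural", "routes"),
    ("path (noun)", "base", "path"),
    ("path (noun)", "plural", "paths"),
]


def notions_spelled(text):
    """Find every notion that has a word with some inflection spelled `text`."""
    lemmas = {lemma for lemma, _type, form in LEMMA_FORMS if form == text}
    return sorted({w["notion"] for w in WORDS if w["lemma"] in lemmas})
```

A single lookup on the form text "routes" reaches both notions through the shared lemma, with no per-inflection OR clauses and no duplicated inflection rows.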
16
17This object model provides many advantages in addition to those described already:
18
19* There is a second form of homography where different words share a form but are inflected differently. Consider the notion "to plan a course that is traveled". There is a word for that notion with the base form "route". However, the rest of the lemma is different, because this is a verb and does not have a plural inflection, but instead has a simple present, a present participle, a simple past, and a past participle. This is a different "lemma" from the first one we described, but it is related to the same "route" form from earlier.
20* It is possible for two different inflections of a lemma to have the same form. Consider the verb lemma in the previous paragraph. The simple past inflection and the past participle inflection of that lemma are both spelled "routed". They are joined to the same form, but are distinct inflections.
21* It is possible for there to be two different ways to inflect a lemma. There isn't a "route"-related example for this, so consider the adjective word with the base form "small" meaning "inferior in size". The comparative inflection of this word can be spelled "less" or "lesser". These are two different forms, but they are joined to the lemma with the same inflection.
22
23The next issue concerns pronunciations. If there were only one way to pronounce a form, it would be simple to put the pronunciation information into the "form" object. However, speaker variation prevents this from being so. For example, the form "route" can be pronounced as both "root" and "r-ow-t". To handle this, we create a "pronunciation" object that has a many-to-many relationship with "form". The reason that it is many-to-many as opposed to "form" having many "pronunciation"s is that homophony exists. Consider the form "rout", which is the base form of a word for the notion "to defeat completely". It is pronounced "r-ow-t", which is also a pronunciation for the form "route".
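
The many-to-many relationship between forms and pronunciations can be sketched as a list of pairs. This is an illustration using the informal spellings from the paragraph above rather than real phoneme data; `FORM_PRONUNCIATIONS`, `pronunciations_of`, and `homophones_of` are hypothetical names.

```python
# Join rows: (form text, pronunciation).
FORM_PRONUNCIATIONS = [
    ("route", "root"),
    ("route", "r-ow-t"),
    ("rout", "r-ow-t"),
]


def pronunciations_of(form):
    """All pronunciations associated with a form."""
    return {p for f, p in FORM_PRONUNCIATIONS if f == form}


def homophones_of(form):
    """Other forms that share at least one pronunciation with `form`."""
    sounds = pronunciations_of(form)
    return {f for f, p in FORM_PRONUNCIATIONS if p in sounds and f != form}
```

Because "r-ow-t" is a single pronunciation shared by two forms, `homophones_of("route")` finds "rout" without any duplicated pronunciation data.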
24
25This is not an exhaustive list of the relationships that verbly objects can have. For more detailed information, check out [the object structure document](https://code.fourisland.com/verbly/about/docs/new_object_structure.md).
26
diff --git a/docs/verb_frames.md b/docs/verb_frames.md
new file mode 100644
index 0000000..64f3f7e
--- /dev/null
+++ b/docs/verb_frames.md
@@ -0,0 +1,100 @@
1# Verb frames
2
3Verbly's verb frame data comes from VerbNet, a database compiled by the University of Colorado Boulder Department of Linguistics. More information, including a download for v3.2, the version used in the canonical verbly datafile, can be found on [Martha Palmer's website](http://verbs.colorado.edu/~mpalmer/projects/verbnet.html).
4
5The downloadable data for VerbNet v3.2 has a lot of quirks and inadequacies that make it unsuitable for natural language generation. In particular, it makes no distinction between noun phrases and adjective phrases, so figuring out how to fill in that particular blank is in most cases impossible. In order to make the data more usable, I have gone through the data and sanitized it in a lot of places. A patch file, applicable to a clean VerbNet v3.2 download, [can be found in the repository](https://code.fourisland.com/verbly/tree/generator/vn-3.2.diff). This patch will likely continue to be updated as verbly is developed.
6
7## Syntactic Restrictions
8The data from VerbNet allows for a set of syntactic restrictions to follow either AND logic or OR logic; however, OR logic is never used, so we shall be ignoring it in our implementation. Additionally, syntactic restrictions are listed as being additive or subtractive; however, each distinct syntactic restriction always appears positively or always appears negatively. Therefore, we can remove the additive/subtractive modifier from our implementation, and change the meaning of the subtractive restrictions to mean the negation of what they "should" mean. These syntactic restrictions frequently indicate that the noun phrase is not actually a noun phrase and should be treated differently, so these are important to watch out for.
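
The simplification described above, dropping the additive/subtractive modifier once every restriction is known to have a fixed polarity, might look like the following. This is a sketch under stated assumptions: the input shape (lists of `(sign, name)` pairs per noun phrase) and the function names are hypothetical, and the sample signs in the test data are illustrative rather than extracted from VerbNet.

```python
def check_consistent_polarity(all_restrictions):
    """Verify each restriction name always appears with the same sign.

    `all_restrictions` is an iterable of lists of (sign, name) pairs,
    one list per noun phrase. Returns a dict of name -> sign.
    """
    seen = {}
    for restrictions in all_restrictions:
        for sign, name in restrictions:
            if seen.setdefault(name, sign) != sign:
                raise ValueError(f"{name} appears both positively and negatively")
    return seen


def flatten(restrictions):
    """Drop the sign; the remaining names are implicitly ANDed together."""
    return frozenset(name for _sign, name in restrictions)
```

Once the consistency check passes over the whole dataset, a flattened name such as `definite` can simply be interpreted as its negation wherever it occurs.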
9
10**np_ppart**
11As far as I can tell, it has no purpose. It always appears before an ADJP or ADJ though.
12
13**be_sc_ing, ac_ing, sc_ing, np_omit_ing**
14Used for gerund phrases.
15
16**oc_ing**
17Used for gerund phrases. Always preceded by a noun or an objective pronoun; most of the time, the two are separated by a preposition, but not always.
18
19**poss_ing, possing, pos_ing**
20Used for a possessive (whether it be a noun with an apostrophe s or a possessive pronoun) followed by a gerund phrase.
21
22**acc_ing**
23Used for a noun (or an objective pronoun) followed by a participle phrase.
24
25**genitive**
26Used to indicate that the noun phrase should be possessive.
27
28**that_comp**
29Used for the word "that" followed by an independent clause in the simple past perfect tense.
30
31**tensed_that**
32Always appears negatively and alongside a that_comp. Use unknown.
33
34**wh_comp**
35Used for a phrase starting with the word "whether." The data using this restriction is a bit muddy, so this is not a perfect description. It will likely be cleaned up in a future release.
36
37**what_extract**
38Used for a phrase starting with the word "what." The data using this restriction is a bit muddy, so this is not a perfect description. It will likely be cleaned up in a future release.
39
40**how_extract**
41Used for a phrase starting with the word "how." The data using this restriction is a bit muddy, so this is not a perfect description. It will likely be cleaned up in a future release.
42
43**sc_to_inf, ac_to_inf, vc_to_inf, rs_to_inf**
44Used for infinitive phrases.
45
46**oc_to_inf**
47Used for infinitive phrases. Always immediately preceded by a noun or an objective pronoun.
48
49**oc_bare_inf**
50Used for infinitive phrases with bare infinitives. Always immediately preceded by a noun or an objective pronoun.
51
52**wh_inf**
53Used for the word "how" (or sometimes, "when" or "whether"), followed by an infinitive phrase. Which starting word is used is frame-dependent. The data using this restriction is a bit muddy, so this is not a perfect description. It will likely be cleaned up in a future release.
54
55**what_inf**
56Used for the word "what" followed by an infinitive phrase. One frame in empathize-88.2 erroneously uses it to indicate a phrase of the form "what they want." This will likely be cleaned up in a future release.
57
58**wheth_inf**
59Used for the word "whether" followed by an infinitive phrase.
60
61**for_comp**
62Used to indicate the following format: the word "for", followed by a noun or an objective pronoun, followed by an infinitive phrase.
63
64**quotation**
65Used to indicate a quotation.
66
67**plural**
68Used to indicate that the noun phrase should be plural.
69
70**definite**
71Always used negatively to indicate that a noun phrase should not be definite.
72
73**adv_loc**
74Used to indicate either the word "here" or "there." In one case (throw-17.1), the word "away" is acceptable too. This will likely be cleaned up in a future release.
75
76**refl**
77Used to indicate the usage of a reflexive pronoun.
78
79**adjp**
80Used to indicate an adjective.
81
82**sentential**
83Use unknown.
84
85## Selectional Restrictions
86Selectional restrictions are used to semantically filter nouns and prepositions. The namespaces for nouns and prepositions are separate.
87
88Selectional restrictions for nouns are usually found in the role descriptions for each verb group; however, they can occasionally also be found in a specific NP element. Usually, subgroups inherit their roles from their parents, and NP elements inherit their selectional restrictions from the role they are assigned. When an NP element defines selectional restrictions even though its assigned role already has restrictions, or when a role is given restrictions in a subgroup when it already has restrictions in the parent, the parent restrictions are ignored and the child's restrictions are used.
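
The override rule above amounts to "the most specific non-empty definition wins". A minimal sketch, assuming roles are stored as dictionaries from role name to a set of restriction names; the function names and the treatment of an empty set as "nothing defined" are assumptions, not verbly's actual logic.

```python
def role_restrictions(role, parent_roles, child_roles):
    """Restrictions for a role in a subgroup: the subgroup's own, else the parent's."""
    own = child_roles.get(role)
    if own:
        return own  # parent restrictions are ignored entirely
    return parent_roles.get(role, frozenset())


def element_restrictions(own, role, parent_roles, child_roles):
    """Restrictions for an NP element: its own, else those of its assigned role."""
    if own:
        return own
    return role_restrictions(role, parent_roles, child_roles)
```

Note that the child's restrictions replace the parent's rather than being merged with them.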
89
90In the original data, restrictions for nouns and roles can be defined using AND logic or OR logic. In a few rare cases, two AND clauses are ORed together. Additionally, restrictions can be either positive or negative. In our implementation, we have flattened the selectional restriction trees in order to make them easier to query and parse. Selectional restrictions are implemented as sets of positive restrictions ORed together. In order to do this, some changes had to be made to the VerbNet data. Specifically, 7 new restrictions were created to represent complex cases in the original data, which were either AND clauses or negative restrictions. The new restrictions are:
91
92- **concrete_inanimate**: concrete && !animate
93- **group**: concrete && plural
94- **inanimate**: !animate
95- **non_region_location**: location && !region
96- **non_solid_food**: comestible && !solid
97- **slinky**: nonrigid && elongated
98- **solid_food**: comestible && solid
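
The seven derived restrictions and the flattened "set of positive restrictions ORed together" model can be sketched as follows. This is an illustration only: the representation of a noun as a set of feature names, the names `DERIVED`, `satisfies`, and `matches`, and the choice to treat an empty restriction set as "no restriction" are all assumptions.

```python
# Derived restriction -> tuple of (original feature, required presence).
DERIVED = {
    "concrete_inanimate": (("concrete", True), ("animate", False)),
    "group": (("concrete", True), ("plural", True)),
    "inanimate": (("animate", False),),
    "non_region_location": (("location", True), ("region", False)),
    "non_solid_food": (("comestible", True), ("solid", False)),
    "slinky": (("nonrigid", True), ("elongated", True)),
    "solid_food": (("comestible", True), ("solid", True)),
}


def satisfies(features, restriction):
    """Check one positive restriction against a noun's set of features."""
    if restriction in DERIVED:
        return all((feat in features) == wanted for feat, wanted in DERIVED[restriction])
    return restriction in features


def matches(features, restrictions):
    """Flattened model: the noun passes if ANY restriction is satisfied."""
    if not restrictions:
        return True  # assumption: no restrictions means anything is allowed
    return any(satisfies(features, r) for r in restrictions)
```

Expanding each derived name back into its original AND clause shows that the flattening loses no information: every query remains a simple OR over restriction names.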
99
100For prepositions, selectional restrictions are always positive. Usually at most one restriction is used, but in the rare event that more are present (6 cases out of 146), they are always applied using OR logic. The restrictions used for prepositions are the names of the preposition groups defined in [prepositions.txt](https://code.fourisland.com/verbly/tree/generator/prepositions.txt), which makes querying for applicable prepositions easy.