| | <HTML> |
| | <HEAD> |
| | <TITLE>WNDB(5WN) manual page</TITLE> |
| | </HEAD> |
| | <BODY> |
| | <A HREF="#toc">Table of Contents</A><P> |
| | |
| | <H2><A NAME="sect0" HREF="#toc0">NAME </A></H2> |
| | index.noun, data.noun, index.verb, data.verb, index.adj, data.adj, index.adv, |
| | data.adv - WordNet database files <P> |
| | noun.exc, verb.exc, adj.exc, adv.exc - morphology |
| | exception lists <P> |
| | sentidx.vrb, sents.vrb - files used by search code to display |
| | sentences illustrating the use of some specific verbs |
| | <H2><A NAME="sect1" HREF="#toc1">DESCRIPTION </A></H2> |
| | For |
| | each syntactic category, two files are needed to represent the contents |
| | of the WordNet database - <B>index.</B><I>pos</I> and <B>data.</B><I>pos</I>, where <I>pos</I> is one of <B>noun</B>, |
| | <B>verb</B>, <B>adj</B>, or <B>adv</B>. The other auxiliary files are used by the WordNet |
| | library's searching functions and are needed to run the various WordNet |
| | browsers. <P> |
| | Each index file is an alphabetized list of all the words found |
| | in WordNet in the corresponding part of speech. On each line, following |
| | the word, is a list of byte offsets (<I>synset_offset </I>s) in the corresponding |
| | data file, one for each synset containing the word. Words in the index |
| | file are in lower case only, regardless of how they were entered in the |
| | lexicographer files. This folds various orthographic representations of |
| | the word into one line enabling database searches to be case insensitive. |
| | See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
| | for a detailed description of the lexicographer files. |
| | <P> |
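<P>
The case folding described above can be sketched as follows. This is a minimal illustration, not the WordNet library's own code: the function name <I>to_index_lemma</I> is hypothetical, and it assumes the query separates the words of a collocation with spaces, which the index files join with underscores.

```python
def to_index_lemma(text: str) -> str:
    """Fold a query string into the form used for lemmas in index.pos."""
    # Index lemmas are lower case only; the words of a collocation are
    # joined with underscore characters.
    return "_".join(text.lower().split())


folded = to_index_lemma("Hot Dog")
```

Applying the same folding to every query is what makes a lookup in the index file case insensitive.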
| | A data file for a syntactic category contains information corresponding |
| | to the synsets that were specified in the lexicographer files, with relational |
| | pointers resolved to <I>synset_offset </I>s. Each line corresponds to a synset. |
| | Pointers are followed and hierarchies traversed by moving from one synset |
| | to another via the <I>synset_offset </I>s. <P> |
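<P>
Following a <I>synset_offset</I> amounts to a seek and a single line read. The sketch below assumes the data file is opened in binary mode so that offsets are byte positions. The helper name <I>read_synset_line</I> and the in-memory file contents are made up for illustration and are not taken from the database.

```python
import io


def read_synset_line(data_file, synset_offset: int) -> str:
    """Return the data line that starts at byte offset synset_offset."""
    data_file.seek(synset_offset)  # data_file must be opened in binary mode
    return data_file.readline().decode("ascii").rstrip("\n")


# Tiny in-memory stand-in for a data.pos file (all contents are made up).
header = b"  1 made-up header line\n"
first = b"%08d 03 n 01 alpha 0 000 | first made-up gloss\n" % len(header)
second_offset = len(header) + len(first)
second = b"%08d 03 n 01 beta 0 000 | second made-up gloss\n" % second_offset
demo_file = io.BytesIO(header + first + second)

line = read_synset_line(demo_file, second_offset)
```

With a real database the same function would be called on a file object obtained from <I>open(path, "rb")</I>.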
| | The exception list files, <I>pos </I><B>.exc |
| | </B>, are used to help the morphological processor find base forms from irregular |
| | inflections. <P> |
| | The files <B>sentidx.vrb </B> and <B>sents.vrb </B> contain sentences illustrating |
| | the use of specific senses of some verbs. These files are used by the |
| | searching software in response to a request for verb sentence frames. |
| | Generic sentence frames are displayed when an illustrative sentence is |
| | not present. <P> |
| | The various database files are in ASCII formats that are |
| | easily read by both humans and machines. All fields, unless otherwise |
| | noted, are separated by one space character, and all lines are terminated |
| | by a newline character. Fields enclosed in italicized square brackets |
| | may not be present. <P> |
| | See <B><A HREF="wngloss.7WN.html">wngloss</B>(7WN)</A> |
| | for a glossary of WordNet terminology |
| | and a discussion of the database's content and logical organization. |
| | <H3><A NAME="sect2" HREF="#toc2">Index |
| | File Format </A></H3> |
| | Each index file begins with several lines containing a copyright |
| | notice, version number and license agreement. These lines all begin with |
| | two spaces and the line number so they do not interfere with the binary |
| | search algorithm that is used to look up entries in the index files. All |
| | other lines are in the following format. In the field descriptions, <B>number |
| | </B> always refers to a decimal integer unless otherwise defined. <P> |
| | <I>lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt |
| | synset_offset [synset_offset...] </I> <BR> |
| | <P> |
| | |
| | <DL> |
| |
|
| | <DT><I>lemma</I> </DT> |
| | <DD>lower case ASCII text of word |
| | or collocation. Collocations are formed by joining individual words with |
| | an underscore (<B>_ </B>) character. </DD> |
| |
|
| | <DT><I>pos</I> </DT> |
| | <DD>Syntactic category: <B>n </B> for noun files, |
| | <B>v </B> for verb files, <B>a </B> for adjective files, <B>r </B> for adverb files. </DD> |
| | </DL> |
| | <P> |
| | <P> |
| | All remaining |
| | fields are with respect to senses of <I>lemma </I> in <I>pos </I>. <P> |
| | |
| | <DL> |
| |
|
| | <DT><I>synset_cnt</I> </DT> |
| | <DD>Number |
| | of synsets that <I>lemma </I> is in. This is the number of senses of the word |
| | in WordNet. See <FONT SIZE=-1><B>Sense Numbers </B></FONT> |
| | below for a discussion of how sense numbers |
| | are assigned and the order of <I>synset_offset </I>s in the index files. </DD> |
| |
|
| | <DT><I>p_cnt</I> |
| | </DT> |
| | <DD>Number of different pointers that <I>lemma </I> has in all synsets containing |
| | it. </DD> |
| |
|
| | <DT><I>ptr_symbol</I> </DT> |
| | <DD>A space separated list of <I>p_cnt </I> different types of pointers |
| | that <I>lemma </I> has in all synsets containing it. See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
| | for a list |
| | of <I>pointer_symbol </I>s. If all senses of <I>lemma </I> have no pointers, this field |
| | is omitted and <I>p_cnt </I> is <B>0 </B>. </DD> |
| |
|
| | <DT><I>sense_cnt</I> </DT> |
| | <DD>Same as <I>synset_cnt </I> above. This |
| | is redundant, but the field was preserved for compatibility reasons. </DD> |
| |
|
| | <DT><I>tagsense_cnt</I> |
| | </DT> |
| | <DD>Number of senses of <I>lemma </I> that are ranked according to their frequency |
| | of occurrence in semantic concordance texts. </DD> |
| |
|
| | <DT><I>synset_offset</I> </DT> |
| | <DD>Byte offset |
| | in <B>data.<I>pos </I></B> file of a synset containing <I>lemma </I>. Each <I>synset_offset </I> in |
| | the list corresponds to a different sense of <I>lemma </I> in WordNet. <I>synset_offset |
| | </I> is an 8 digit, zero-filled decimal integer that can be used with <B><A HREF="fseek.3.html">fseek</B>(3)</A> |
| | |
| | to read a synset from the data file. When passed to <B><A HREF="read_synset.3WN.html">read_synset</B>(3WN)</A> |
| | along |
| | with the syntactic category, a data structure containing the parsed synset |
| | is returned. </DD> |
| | </DL> |
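<P>
The field layout above can be read with a simple whitespace split. The following is a minimal sketch, not the WordNet library's parser. The names <I>IndexEntry</I> and <I>parse_index_line</I> are hypothetical, and the sample line is constructed to follow the format rather than copied from an index file.

```python
from typing import List, NamedTuple, Optional


class IndexEntry(NamedTuple):
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    tagsense_cnt: int
    synset_offsets: List[int]


def parse_index_line(line: str) -> Optional[IndexEntry]:
    """Parse one line of index.pos; return None for the header lines."""
    if line.startswith("  "):
        return None  # copyright/license lines begin with two spaces
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    ptr_symbols = fields[4:4 + p_cnt]  # empty when p_cnt is 0
    rest = fields[4 + p_cnt:]
    # rest[0] is sense_cnt, which repeats synset_cnt and is skipped here.
    tagsense_cnt = int(rest[1])
    offsets = [int(text) for text in rest[2:2 + synset_cnt]]
    return IndexEntry(lemma, pos, synset_cnt, ptr_symbols, tagsense_cnt, offsets)


# Made-up line that follows the documented layout.
entry = parse_index_line("example_word n 2 3 @ ~ #m 2 1 00001740 00002137")
```

The offsets are converted to integers so they can be passed directly to a seek call on the corresponding data file.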
| | |
| | <H3><A NAME="sect3" HREF="#toc3">Data File Format </A></H3> |
| | Each data file begins with several lines |
| | containing a copyright notice, version number and license agreement. These |
| | lines all begin with two spaces and the line number. All other lines are |
| | in the following format. Integer fields are of fixed length, and are zero-filled. |
| | <P> |
| | <I>synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] <B>| |
| | </B></I><I> gloss </I> <BR> |
| | <P> |
| | |
| | <DL> |
| |
|
| | <DT><I>synset_offset</I> </DT> |
| | <DD>Current byte offset in the file represented |
| | as an 8 digit decimal integer. </DD> |
| |
|
| | <DT><I>lex_filenum</I> </DT> |
| | <DD>Two digit decimal integer |
| | corresponding to the lexicographer file name containing the synset. See |
| | <B><A HREF="lexnames.5WN.html">lexnames</B>(5WN)</A> |
| | for the list of filenames and their corresponding numbers. |
| | </DD> |
| |
|
| | <DT><I>ss_type</I> </DT> |
| | <DD>One character code indicating the synset type: </DD> |
| | </DL> |
| | <P> |
| | <blockquote><B>n </B><tt> </tt> <tt> </tt> NOUN <BR> |
| | <B>v </B><tt> </tt> <tt> </tt> VERB |
| | <BR> |
| | <B>a </B><tt> </tt> <tt> </tt> ADJECTIVE <BR> |
| | <B>s </B><tt> </tt> <tt> </tt> ADJECTIVE SATELLITE <BR> |
| | <B>r </B><tt> </tt> <tt> </tt> ADVERB <BR> |
| | </blockquote> |
| |
|
| | <DL> |
| |
|
| | <DT><I>w_cnt</I> </DT> |
| | <DD>Two digit hexadecimal |
| | integer indicating the number of words in the synset. </DD> |
| |
|
| | <DT><I>word</I> </DT> |
| | <DD>ASCII form |
| | of a word as entered in the synset by the lexicographer, with spaces replaced |
| | by underscore characters (<B>_ </B>). The text of the word is case sensitive, |
| | in contrast to its form in the corresponding <B>index. </B><I>pos </I> file, which contains |
| | only lower-case forms. In <B>data.adj </B>, a <I>word </I> is followed by a syntactic |
| | marker if one was specified in the lexicographer file. A syntactic marker |
| | is appended, in parentheses, onto <I>word </I> without any intervening spaces. |
| | See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
| | for a list of the syntactic markers for adjectives. </DD> |
| |
|
| | <DT><I>lex_id</I> |
| | </DT> |
| | <DD>One digit hexadecimal integer that, when appended onto <I>lemma </I>, uniquely |
| | identifies a sense within a lexicographer file. <I>lex_id </I> numbers usually |
| | start with <B>0 </B>, and are incremented as additional senses of the word are |
| | added to the same file, although there is no requirement that the numbers |
| | be consecutive or begin with <B>0 </B>. Note that a value of <B>0 </B> is the default, |
| | and therefore is not present in lexicographer files. </DD> |
| |
|
| | <DT><I>p_cnt</I> </DT> |
| | <DD>Three digit |
| | decimal integer indicating the number of pointers from this synset to |
| | other synsets. If <I>p_cnt </I> is <B>000 </B> the synset has no pointers. </DD> |
| |
|
| | <DT><I>ptr</I> </DT> |
| | <DD>A pointer |
| | from this synset to another. <I>ptr </I> is of the form: </DD> |
| | </DL> |
| | <P> |
| | <I>pointer_symbol synset_offset pos source/target |
| | </I> <BR> |
| | <P> |
| | where <I>synset_offset </I> is the byte offset of the target synset in the |
| | data file corresponding to <I>pos </I>. <P> |
| | The <I>source/target </I> field distinguishes |
| | lexical and semantic pointers. It is a four byte field, containing two |
| | two-digit hexadecimal integers. The first two digits indicate the word |
| | number in the current (source) synset, the last two digits indicate the |
| | word number in the target synset. A value of <B>0000 </B> means that <I>pointer_symbol |
| | </I> represents a semantic relation between the current (source) synset and |
| | the target synset indicated by <I>synset_offset </I>. <P> |
| | A lexical relation between |
| | two words in different synsets is represented by non-zero values in the |
| | source and target word numbers. The first and last two bytes of this field |
| | indicate the word numbers in the source and target synsets, respectively, |
| | between which the relation holds. Word numbers are assigned to the <I>word |
| | </I> fields in a synset, from left to right, beginning with <B>1 </B>. <P> |
| | See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
| | |
| | for a list of <I>pointer_symbol </I>s, and semantic and lexical pointer classifications. |
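<P>
Decoding the <I>source/target</I> field takes only a few lines. This is a sketch under the description above, with the hypothetical function name <I>decode_source_target</I>. It returns the kind of relation together with the two word numbers.

```python
def decode_source_target(field: str):
    """Split a four-character source/target field into its two word numbers."""
    if len(field) != 4:
        raise ValueError("source/target must be exactly four characters")
    source = int(field[:2], 16)  # word number in the current (source) synset
    target = int(field[2:], 16)  # word number in the target synset
    # 0000 marks a semantic relation between whole synsets; anything else
    # is treated as a lexical relation between two specific words.
    kind = "semantic" if field == "0000" else "lexical"
    return kind, source, target


semantic = decode_source_target("0000")
lexical = decode_source_target("0a01")
```

Word numbers start at 1, so a decoded value of 1 refers to the first <I>word</I> field of the synset.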
| |
|
| | <DL> |
| |
|
| | <DT><I>frames</I> </DT> |
| | <DD>In <B>data.verb </B> only, a list of numbers corresponding to the generic |
| | verb sentence frames for <I>word </I>s in the synset. <I>frames </I> is of the form: |
| | </DD> |
| | </DL> |
| | <P> |
| | <I>f_cnt </I> <B>+ </B> <I> f_num w_num [ </I> <B>+ </B> <I> f_num w_num...] </I> <BR> |
| | <P> |
| | where <I>f_cnt </I> is a two |
| | digit decimal integer indicating the number of generic frames listed, |
| | <I>f_num </I> is a two digit decimal integer frame number, and <I>w_num </I> is a two |
| | digit hexadecimal integer indicating the word in the synset that the frame |
| | applies to. As with pointers, if this number is <B>00 </B>, <I>f_num </I> applies to |
| | all <I>word </I>s in the synset. If non-zero, it is applicable only to the word |
| | indicated. Word numbers are assigned as described for pointers. Each <I>f_num w_num |
| | </I> pair is preceded by a <B>+ </B>. See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
| | for the text of the generic |
| | sentence frames. |
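<P>
The <I>frames</I> field can be read in groups of three tokens. A minimal sketch, assuming the field has already been split on spaces. The function name <I>parse_frames</I> and the sample values are made up for illustration.

```python
def parse_frames(tokens):
    """Parse the tokens of a frames field into (f_num, w_num) pairs."""
    f_cnt = int(tokens[0])  # two digit decimal count of frames
    frames = []
    for i in range(f_cnt):
        # Each frame is written as: + f_num w_num
        plus, f_num, w_num = tokens[1 + 3 * i:4 + 3 * i]
        if plus != "+":
            raise ValueError("each f_num w_num pair must be preceded by +")
        # f_num is decimal; w_num is hexadecimal, and 0 means all words.
        frames.append((int(f_num), int(w_num, 16)))
    return frames


pairs = parse_frames("02 + 08 00 + 11 01".split())
```

In this made-up example, frame 8 applies to every word in the synset and frame 11 applies only to the first word.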
| | <DL> |
| |
|
| | <DT><I>gloss</I> </DT> |
| | <DD>Each synset contains a gloss. A <I>gloss </I> is represented |
| | as a vertical bar (<B>| </B>), followed by a text string that continues until |
| | the end of the line. The gloss may contain a definition, one or more example |
| | sentences, or both. </DD> |
| | </DL> |
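<P>
Putting the fields together, a data line can be split as sketched below. This is an illustration of the layout described in this section, not the code behind <B>read_synset</B>(3WN). The function name <I>parse_data_line</I> and the sample line are hypothetical, and the sketch assumes that no field before the gloss contains a vertical bar.

```python
def parse_data_line(line: str) -> dict:
    """Split one line of data.pos into its documented fields."""
    body, _, gloss = line.partition("|")  # the gloss runs to end of line
    f = body.split()
    w_cnt = int(f[3], 16)  # two digit hexadecimal word count
    # word / lex_id pairs start at field 4; lex_id is one hexadecimal digit.
    words = [(f[4 + 2 * i], int(f[5 + 2 * i], 16)) for i in range(w_cnt)]
    p = 4 + 2 * w_cnt  # position of p_cnt
    p_cnt = int(f[p])  # three digit decimal pointer count
    # Each pointer is: pointer_symbol synset_offset pos source/target
    pointers = [tuple(f[p + 1 + 4 * i:p + 5 + 4 * i]) for i in range(p_cnt)]
    return {
        "synset_offset": int(f[0]),
        "lex_filenum": int(f[1]),
        "ss_type": f[2],
        "words": words,
        "pointers": pointers,
        "frame_tokens": f[p + 1 + 4 * p_cnt:],  # present in data.verb only
        "gloss": gloss.strip(),
    }


# Made-up line that follows the documented layout.
synset = parse_data_line(
    "00002137 03 n 02 Alpha 0 beta_word 1 001 @ 00001740 n 0000 | a made-up gloss"
)
```

The pointers are left as raw four-field tuples and the frame tokens as a raw list so that each can be decoded separately.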
| | |
| | <H3><A NAME="sect4" HREF="#toc4">Sense Numbers </A></H3> |
| | Senses in WordNet are generally ordered |
| | from most to least frequently used, with the most common sense numbered |
| | <B>1 </B>. Frequency of use is determined by the number of times a sense is tagged |
| | in the various semantic concordance texts. Senses that are not semantically |
| | tagged follow the ordered senses. The <I>tagsense_cnt </I> field for each entry |
| | in the <B>index.<I>pos </I></B> files indicates how many of the senses in the list have |
| | been tagged. <P> |
| | The <B><A HREF="cntlist.5WN.html">cntlist</B>(5WN)</A> |
| | file provided with the database lists the |
| | number of times each sense is tagged in the semantic concordances. The |
| | data from <B>cntlist </B> is used by <B><A HREF="grind.1WN.html">grind</B>(1WN)</A> |
| | to order the senses of each word. |
| | When the <B>index </B>.<I>pos </I> files are generated, the <I>synset_offset </I>s are output |
| | in sense number order, with sense 1 first in the list. Senses with the |
| | same number of semantic tags are assigned unique but consecutive sense |
| | numbers. The WordNet <FONT SIZE=-1><B>OVERVIEW </B></FONT> |
| | search displays all senses of the specified |
| | word, in all syntactic categories, and indicates which of the senses are |
| | represented in the semantically tagged texts. |
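<P>
Because the <I>synset_offset</I>s are written in sense number order, sense numbers can be recovered from list position alone. The sketch below assumes, as described above, that the tagged senses come first. The function name <I>numbered_senses</I> and the offsets are made up for illustration.

```python
def numbered_senses(synset_offsets, tagsense_cnt):
    """Attach sense numbers and a tagged flag to an index entry's offsets."""
    # Sense 1 is first in the list; the first tagsense_cnt senses are the
    # ones ordered by frequency in the semantic concordance texts.
    return [
        (number, offset, number <= tagsense_cnt)
        for number, offset in enumerate(synset_offsets, start=1)
    ]


senses = numbered_senses([1740, 2137, 900], tagsense_cnt=2)
```

Here the third sense is untagged, so its position after the first two carries no frequency information.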
| | <H3><A NAME="sect5" HREF="#toc5">Exception List File Format |
| | </A></H3> |
| | Exception lists are alphabetized lists of inflected forms of words and |
| | their base forms. The first field of each line is an inflected form, followed |
| | by a space separated list of one or more base forms of the word. There |
| | is one exception list file for each syntactic category. <P> |
| | Note that the |
| | noun and verb exception lists were automatically generated from a machine-readable |
| | dictionary, and contain many words that are not in WordNet. Also, for |
| | many of the inflected forms, base forms could be easily derived using |
| | the standard rules of detachment programmed into Morphy (see <B><A HREF="morphy.7WN.html">morphy</B>(7WN)</A>). |
| | These anomalies are allowed to remain in the exception list files, as |
| | they do no harm. <P> |
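<P>
An exception list maps naturally onto a dictionary from inflected form to base forms. A minimal sketch, with the hypothetical function name <I>parse_exception_lines</I>. The two sample lines are illustrative and only assumed to resemble entries in <B>noun.exc</B>.

```python
def parse_exception_lines(lines):
    """Build a mapping from inflected form to its list of base forms."""
    exceptions = {}
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            # First field is the inflected form; the rest are base forms.
            exceptions[fields[0]] = fields[1:]
    return exceptions


noun_exceptions = parse_exception_lines(["geese goose", "mice mouse"])
base_forms = noun_exceptions.get("geese", [])
```

A form that is missing from the mapping is not irregular as far as the list is concerned, and would be left to the regular rules of detachment.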
| | |
| | <H3><A NAME="sect6" HREF="#toc6">Verb Example Sentences </A></H3> |
| | For some verb senses, example |
| | sentences illustrating the use of the verb sense can be displayed. Each |
| | line of the file <B>sentidx.vrb </B> contains a <I>sense_key </I> followed by a space |
| | and a comma separated list of example sentence template numbers, in decimal. |
| | The file <B>sents.vrb </B> lists all of the example sentence templates. Each |
| | line begins with the template number followed by a space. The rest of |
| | the line is the text of a template example sentence, with <B>%s </B> used as |
| | a placeholder in the text for the verb. Both files are sorted alphabetically |
| | so that the <I>sense_key </I> and template sentence number can be used as indices, |
| | via <B><A HREF="binsrch.3WN.html">binsrch</B>(3WN)</A> |
| | , into the appropriate file. <P> |
| | When a request for <FONT SIZE=-1><B>FRAMES |
| | </B></FONT> |
| | is made, the WordNet search code looks for the sense in <B>sentidx.vrb </B>. |
| | If found, the sentence templates listed are retrieved from <B>sents.vrb</B>, |
| | and the <B>%s </B> is replaced with the verb. If the sense is not found, the |
| | applicable generic sentence frames listed in <I>frames </I> are displayed. |
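<P>
The lookup described above can be sketched with three small helpers. The function names, the <I>sense_key</I>, and the template text are all made up for illustration. Real keys follow the encoding in <B>senseidx</B>(5WN), and the real search code uses <B>binsrch</B>(3WN) rather than the direct parsing shown here.

```python
def parse_sentidx_line(line: str):
    """Split a sentidx.vrb line into a sense_key and template numbers."""
    sense_key, _, numbers = line.rstrip("\n").partition(" ")
    return sense_key, [int(n) for n in numbers.split(",") if n]


def parse_sents_line(line: str):
    """Split a sents.vrb line into a template number and template text."""
    number, _, template = line.rstrip("\n").partition(" ")
    return int(number), template


def fill_template(template: str, verb: str) -> str:
    """Replace the %s placeholder in a template with the verb."""
    return template.replace("%s", verb)


# Made-up lines that follow the documented layout.
key, template_numbers = parse_sentidx_line("exampleverb%2:38:00:: 12,34")
number, template = parse_sents_line("12 They %s quickly")
sentence = fill_template(template, "walk")
```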
| | <H2><A NAME="sect7" HREF="#toc7">NOTES |
| | </A></H2> |
| | Information in the <B>data.<I>pos </I></B> and <B>index.<I>pos </I></B> files represents all of the |
| | word senses and synsets in the WordNet database. The <I>word </I>, <I>lex_id </I>, and |
| | <I>lex_filenum </I> fields together uniquely identify each word sense in WordNet. |
| | These can be encoded in a <I>sense_key </I> as described in <B><A HREF="senseidx.5WN.html">senseidx</B>(5WN)</A> |
| | . Each |
| | synset in the database can be uniquely identified by combining the <I>synset_offset |
| | </I> for the synset with a code for the syntactic category (since it is possible |
| | for synsets in different <B>data.<I>pos </I></B> files to have the same <I>synset_offset |
| | </I>). <P> |
| | The WordNet system provides both command line and window-based browser |
| | interfaces to the database. Both interfaces utilize a common library of |
| | search and morphology code. The source code for the library and interfaces |
| | is included in the WordNet package. See <B><A HREF="wnintro.3WN.html">wnintro</B>(3WN)</A> |
| | for an overview of |
| | the WordNet source code. |
| | <H2><A NAME="sect8" HREF="#toc8">ENVIRONMENT VARIABLES (UNIX) </A></H2> |
| |
|
| | <DL> |
| |
|
| | <DT><B>WNHOME</B> </DT> |
| | <DD>Base directory |
| | for WordNet. Default is <B>/usr/local/WordNet-3.0 </B>. </DD> |
| |
|
| | <DT><B>WNSEARCHDIR</B> </DT> |
| | <DD>Directory in |
| | which the WordNet database has been installed. Default is <B>WNHOME/dict |
| | </B>. </DD> |
| | </DL> |
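<P>
The documented defaults can be mirrored in a few lines. This sketch only reproduces the defaults listed above and is not the WordNet library's actual lookup code. The function name <I>wordnet_search_dir</I> and its <I>environ</I> parameter are hypothetical.

```python
import os


def wordnet_search_dir(environ=None) -> str:
    """Resolve the database directory from WNSEARCHDIR and WNHOME."""
    env = os.environ if environ is None else environ
    if "WNSEARCHDIR" in env:
        return env["WNSEARCHDIR"]
    # Fall back to WNHOME/dict, with the documented default for WNHOME.
    home = env.get("WNHOME", "/usr/local/WordNet-3.0")
    return home + "/dict"


default_dir = wordnet_search_dir({})
```

Passing a plain dictionary instead of the process environment makes the fallback order easy to check.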
| | |
| | <H2><A NAME="sect9" HREF="#toc9">REGISTRY (WINDOWS) </A></H2> |
| |
|
| | <DL> |
| |
|
| | <DT><B>HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome</B> </DT> |
| | <DD>Base directory |
| | for WordNet. Default is <B>C:\Program Files\WordNet\3.0 </B>. </DD> |
| | </DL> |
| | |
| | <H2><A NAME="sect10" HREF="#toc10">FILES </A></H2> |
| |
|
| | <DL> |
| |
|
| | <DT><B>index.<I>pos </I></B> </DT> |
| | <DD>database |
| | index files </DD> |
| |
|
| | <DT><B>data.<I>pos </I></B> </DT> |
| | <DD>database data files </DD> |
| |
|
| | <DT><B>*.vrb</B> </DT> |
| | <DD>files of sentences illustrating |
| | the use of verbs </DD> |
| |
|
| | <DT><B><I>pos </I>.exc</B> </DT> |
| | <DD>morphology exception lists </DD> |
| | </DL> |
| | |
| | <H2><A NAME="sect11" HREF="#toc11">SEE ALSO </A></H2> |
| | <B><A HREF="grind.1WN.html">grind</B>(1WN)</A> |
| | , |
| | <B><A HREF="wn.1WN.html">wn</B>(1WN)</A> |
| | , <B><A HREF="wnb.1WN.html">wnb</B>(1WN)</A> |
| | , <B><A HREF="wnintro.3WN.html">wnintro</B>(3WN)</A> |
| | , <B><A HREF="binsrch.3WN.html">binsrch</B>(3WN)</A> |
| | , <B><A HREF="wnintro.5WN.html">wnintro</B>(5WN)</A> |
| | , <B><A HREF="cntlist.5WN.html">cntlist</B>(5WN)</A> |
| | , |
| | <B><A HREF="lexnames.5WN.html">lexnames</B>(5WN)</A> |
| | , <B><A HREF="senseidx.5WN.html">senseidx</B>(5WN)</A> |
| | , <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A> |
| | , <B><A HREF="morphy.7WN.html">morphy</B>(7WN)</A> |
| | , <B><A HREF="wngloss.7WN.html">wngloss</B>(7WN)</A> |
| | , |
| | <B><A HREF="wngroups.7WN.html">wngroups</B>(7WN)</A> |
| | , <B><A HREF="wnstats.7WN.html">wnstats</B>(7WN)</A> |
| | . <P> |
| |
|
| | <HR><P> |
| | <A NAME="toc"><B>Table of Contents</B></A><P> |
| | <UL> |
| | <LI><A NAME="toc0" HREF="#sect0">NAME</A></LI> |
| | <LI><A NAME="toc1" HREF="#sect1">DESCRIPTION</A></LI> |
| | <UL> |
| | <LI><A NAME="toc2" HREF="#sect2">Index File Format</A></LI> |
| | <LI><A NAME="toc3" HREF="#sect3">Data File Format</A></LI> |
| | <LI><A NAME="toc4" HREF="#sect4">Sense Numbers</A></LI> |
| | <LI><A NAME="toc5" HREF="#sect5">Exception List File Format</A></LI> |
| | <LI><A NAME="toc6" HREF="#sect6">Verb Example Sentences</A></LI> |
| | </UL> |
| | <LI><A NAME="toc7" HREF="#sect7">NOTES</A></LI> |
| | <LI><A NAME="toc8" HREF="#sect8">ENVIRONMENT VARIABLES (UNIX)</A></LI> |
| | <LI><A NAME="toc9" HREF="#sect9">REGISTRY (WINDOWS)</A></LI> |
| | <LI><A NAME="toc10" HREF="#sect10">FILES</A></LI> |
| | <LI><A NAME="toc11" HREF="#sect11">SEE ALSO</A></LI> |
| | </UL> |
| | </BODY></HTML> |