	<!-- manual page source format generated by PolyglotMan v3.0.3a12, -->
	<!-- available via anonymous ftp from ftp.cs.berkeley.edu:/ucb/people/phelps/tcltk/rman.tar.Z -->

	<HTML>
	<HEAD>
	<TITLE>WNDB(5WN) manual page</TITLE>
	</HEAD>
	<BODY>
	<A HREF="#toc">Table of Contents</A><P>

	<H2><A NAME="sect0" HREF="#toc0">NAME </A></H2>
	index.noun, data.noun, index.verb, data.verb, index.adj, data.adj, index.adv,
	data.adv - WordNet database files <P>
	noun.exc, verb.exc, adj.exc, adv.exc - morphology
	exception lists <P>
	sentidx.vrb, sents.vrb - files used by search code to display
	sentences illustrating the use of some specific verbs
	<H2><A NAME="sect1" HREF="#toc1">DESCRIPTION </A></H2>
	For
	each syntactic category, two files are needed to represent the contents
	of the WordNet database - <B>index. </B><I>pos </I> and <B>data. </B><I>pos </I>, where <I>pos </I> is one of <B>noun
	</B>, <B>verb </B>, <B>adj </B> or <B>adv </B>. The other auxiliary files are used by the WordNet
	library's searching functions and are needed to run the various WordNet
	browsers. <P>
	Each index file is an alphabetized list of all the words found
	in WordNet in the corresponding part of speech. On each line, following
	the word, is a list of byte offsets (<I>synset_offset </I>s) in the corresponding
	data file, one for each synset containing the word. Words in the index
	file are in lower case only, regardless of how they were entered in the
	lexicographer files. This folds various orthographic representations of
	the word into one line enabling database searches to be case insensitive.
	See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A>
	for a detailed description of the lexicographer files.
	<P>
	A data file for a syntactic category contains information corresponding
	to the synsets that were specified in the lexicographer files, with relational
	pointers resolved to <I>synset_offset </I>s. Each line corresponds to a synset.
	Pointers are followed and hierarchies traversed by moving from one synset
	to another via the <I>synset_offset </I>s. <P>
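	<P>
	As a minimal sketch of following a pointer by byte offset, the Python below seeks into a data file and reads one synset line. The helper name <I>read_synset_line</I> and the in-memory file contents are invented for illustration; a real <B>data.<I>pos </I></B> file would be opened in binary mode so that offsets count bytes.

```python
import io


def read_synset_line(data_file, synset_offset):
    """Return the synset line that starts at the given byte offset."""
    data_file.seek(int(synset_offset))
    return data_file.readline().decode("ascii").rstrip("\n")


# In-memory stand-in for a data.pos file; every value here is invented.
header = b"  1 stand-in license line\n"
first_offset = len(header)
first = b"%08d 03 n 01 thing 0 000 | first stand-in synset\n" % first_offset
second_offset = first_offset + len(first)
second = b"%08d 03 n 01 item 0 001 @ %08d n 0000 | second stand-in synset\n" % (
    second_offset,
    first_offset,
)
data_file = io.BytesIO(header + first + second)

# Read the second synset, then follow its only pointer to the first one.
line = read_synset_line(data_file, "%08d" % second_offset)
target_offset = line.split()[8]  # synset_offset inside the single ptr group
parent = read_synset_line(data_file, target_offset)
```

	Because every offset is written as a fixed-width, zero-filled number, the length of a line does not depend on the offset value it contains, which is why the stand-in file can be built in one pass.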
	The exception list files, <I>pos </I><B>.exc
	</B>, are used to help the morphological processor find base forms from irregular
	inflections. <P>
	The files <B>sentidx.vrb </B> and <B>sents.vrb </B> contain sentences illustrating
	the use of specific senses of some verbs. These files are used by the
	searching software in response to a request for verb sentence frames.
	Generic sentence frames are displayed when an illustrative sentence is
	not present. <P>
	The various database files are in ASCII formats that are
	easily read by both humans and machines. All fields, unless otherwise
	noted, are separated by one space character, and all lines are terminated
	by a newline character. Fields enclosed in italicized square brackets
	may not be present. <P>
	See <B><A HREF="wngloss.7WN.html">wngloss</B>(7WN)</A>
	for a glossary of WordNet terminology
	and a discussion of the database's content and logical organization.
	<H3><A NAME="sect2" HREF="#toc2">Index
	File Format </A></H3>
	Each index file begins with several lines containing a copyright
	notice, version number and license agreement. These lines all begin with
	two spaces and the line number so they do not interfere with the binary
	search algorithm that is used to look up entries in the index files. All
	other lines are in the following format. In the field descriptions, <B>number
	</B> always refers to a decimal integer unless otherwise defined. <P>
	<I>lemma  pos  synset_cnt  p_cnt  [ptr_symbol...]  sense_cnt  tagsense_cnt
	synset_offset  [synset_offset...] </I> <BR>
	<P>

	<DL>

	<DT><I>lemma</I> </DT>
	<DD>lower case ASCII text of word
	or collocation. Collocations are formed by joining individual words with
	an underscore (<B>_ </B>) character. </DD>

	<DT><I>pos</I> </DT>
	<DD>Syntactic category: <B>n </B> for noun files,
	<B>v </B> for verb files, <B>a </B> for adjective files, <B>r </B> for adverb files. </DD>
	</DL>
	<P>
	<P>
	All remaining
	fields are with respect to senses of <I>lemma </I> in <I>pos </I>. <P>

	<DL>

	<DT><I>synset_cnt</I> </DT>
	<DD>Number
	of synsets that <I>lemma </I> is in. This is the number of senses of the word
	in WordNet. See <FONT SIZE=-1><B>Sense Numbers </B></FONT>
	below for a discussion of how sense numbers
	are assigned and the order of <I>synset_offset </I>s in the index files. </DD>

	<DT><I>p_cnt</I>
	</DT>
	<DD>Number of different pointers that <I>lemma </I> has in all synsets containing
	it. </DD>

	<DT><I>ptr_symbol</I> </DT>
	<DD>A space separated list of <I>p_cnt </I> different types of pointers
	that <I>lemma </I> has in all synsets containing it. See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A>
	for a list
	of <I>pointer_symbol </I>s. If all senses of <I>lemma </I> have no pointers, this field
	is omitted and <I>p_cnt </I> is <B>0 </B>. </DD>

	<DT><I>sense_cnt</I> </DT>
	<DD>Same as <I>synset_cnt </I> above. This
	is redundant, but the field was preserved for compatibility reasons. </DD>

	<DT><I>tagsense_cnt</I>
	</DT>
	<DD>Number of senses of <I>lemma </I> that are ranked according to their frequency
	of occurrence in semantic concordance texts. </DD>

	<DT><I>synset_offset</I> </DT>
	<DD>Byte offset
	in <B>data.<I>pos </I></B> file of a synset containing <I>lemma </I>. Each <I>synset_offset </I> in
	the list corresponds to a different sense of <I>lemma </I> in WordNet. <I>synset_offset
	</I> is an 8 digit, zero-filled decimal integer that can be used with <B><A HREF="fseek.3.html">fseek</B>(3)</A>

	to read a synset from the data file. When passed to <B><A HREF="read_synset.3WN.html">read_synset</B>(3WN)</A>
	along
	with the syntactic category, a data structure containing the parsed synset
	is returned. </DD>
	</DL>
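	<P>
	The index line layout above can be read with a few lines of Python. This is a sketch, not the WordNet library's own parser: the function name <I>parse_index_line</I> is hypothetical and the sample line is illustrative. Callers are assumed to skip the header lines, which begin with two spaces.

```python
def parse_index_line(line):
    """Split one index.pos entry into the documented fields."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # ptr_symbol list is absent when p_cnt is 0.
    ptr_symbols = fields[4:4 + p_cnt]
    rest = fields[4 + p_cnt:]
    sense_cnt = int(rest[0])
    tagsense_cnt = int(rest[1])
    synset_offsets = rest[2:]
    if len(synset_offsets) != synset_cnt:
        raise ValueError(
            "expected %d offsets, found %d" % (synset_cnt, len(synset_offsets))
        )
    return {
        "lemma": lemma,
        "pos": pos,
        "synset_cnt": synset_cnt,
        "ptr_symbols": ptr_symbols,
        "sense_cnt": sense_cnt,
        "tagsense_cnt": tagsense_cnt,
        # Kept as 8 digit strings; int(offset) gives the byte position.
        "synset_offsets": synset_offsets,
    }


# Illustrative line in the documented layout.
sample = (
    "dog n 7 5 @ ~ #m #p %p 7 1 "
    "02084071 10114209 10023039 09886220 07676602 03901548 02710044"
)
entry = parse_index_line(sample)
```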

	<H3><A NAME="sect3" HREF="#toc3">Data File Format </A></H3>
	Each data file begins with several lines
	containing a copyright notice, version number and license agreement. These
	lines all begin with two spaces and the line number. All other lines are
	in the following format. Integer fields are of fixed length, and are zero-filled.
	<P>
	<I>synset_offset  lex_filenum  ss_type  w_cnt  word  lex_id  [word  lex_id...]  p_cnt  [ptr...]  [frames...]  <B>|
	</B></I><I>  gloss </I> <BR>
	<P>

	<DL>

	<DT><I>synset_offset</I> </DT>
	<DD>Current byte offset in the file represented
	as an 8 digit decimal integer. </DD>

	<DT><I>lex_filenum</I> </DT>
	<DD>Two digit decimal integer
	corresponding to the lexicographer file name containing the synset. See
	<B><A HREF="lexnames.5WN.html">lexnames</B>(5WN)</A>
	for the list of filenames and their corresponding numbers.
	</DD>

	<DT><I>ss_type</I> </DT>
	<DD>One character code indicating the synset type: </DD>
	</DL>
	<P>
	<blockquote><B>n </B><tt> </tt> <tt> </tt> NOUN <BR>
	<B>v </B><tt> </tt> <tt> </tt> VERB
	<BR>
	<B>a </B><tt> </tt> <tt> </tt> ADJECTIVE <BR>
	<B>s </B><tt> </tt> <tt> </tt> ADJECTIVE SATELLITE <BR>
	<B>r </B><tt> </tt> <tt> </tt> ADVERB <BR>
	</blockquote>

	<DL>

	<DT><I>w_cnt</I> </DT>
	<DD>Two digit hexadecimal
	integer indicating the number of words in the synset. </DD>

	<DT><I>word</I> </DT>
	<DD>ASCII form
	of a word as entered in the synset by the lexicographer, with spaces replaced
	by underscore characters (<B>_ </B>). The text of the word is case sensitive,
	in contrast to its form in the corresponding <B>index. </B><I>pos </I> file, that contains
	only lower-case forms. In <B>data.adj </B>, a <I>word </I> is followed by a syntactic
	marker if one was specified in the lexicographer file. A syntactic marker
	is appended, in parentheses, onto <I>word </I> without any intervening spaces.
	See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A>
	for a list of the syntactic markers for adjectives. </DD>

	<DT><I>lex_id</I>
	</DT>
	<DD>One digit hexadecimal integer that, when appended onto <I>lemma </I>, uniquely
	identifies a sense within a lexicographer file. <I>lex_id </I> numbers usually
	start with <B>0 </B>, and are incremented as additional senses of the word are
	added to the same file, although there is no requirement that the numbers
	be consecutive or begin with <B>0 </B>. Note that a value of <B>0 </B> is the default,
	and therefore is not present in lexicographer files. </DD>

	<DT><I>p_cnt</I> </DT>
	<DD>Three digit
	decimal integer indicating the number of pointers from this synset to
	other synsets. If <I>p_cnt </I> is <B>000 </B> the synset has no pointers. </DD>

	<DT><I>ptr</I> </DT>
	<DD>A pointer
	from this synset to another. <I>ptr </I> is of the form: </DD>
	</DL>
	<P>
	<I>pointer_symbol  synset_offset  pos  source/target
	</I> <BR>
	<P>
	where <I>synset_offset </I> is the byte offset of the target synset in the
	data file corresponding to <I>pos </I>. <P>
	The <I>source/target </I> field distinguishes
	lexical and semantic pointers. It is a four byte field, containing two
	two-digit hexadecimal integers. The first two digits indicate the word
	number in the current (source) synset, the last two digits indicate the
	word number in the target synset. A value of <B>0000 </B> means that <I>pointer_symbol
	</I> represents a semantic relation between the current (source) synset and
	the target synset indicated by <I>synset_offset </I>. <P>
	A lexical relation between
	two words in different synsets is represented by non-zero values in the
	source and target word numbers. The first and last two bytes of this field
	indicate the word numbers in the source and target synsets, respectively,
	between which the relation holds. Word numbers are assigned to the <I>word
	</I> fields in a synset, from left to right, beginning with <B>1 </B>. <P>
	See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A>

	for a list of <I>pointer_symbol </I>s, and semantic and lexical pointer classifications.
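	<P>
	To make the <I>source/target </I> encoding concrete, here is a small Python sketch that decodes one pointer group. The function name <I>parse_pointer</I> and both sample pointers are invented for illustration.

```python
def parse_pointer(fields):
    """Decode one ptr group: pointer_symbol, synset_offset, pos, source/target."""
    symbol, synset_offset, pos, source_target = fields
    # Two two-digit hexadecimal word numbers packed into four characters.
    source_word = int(source_target[:2], 16)
    target_word = int(source_target[2:], 16)
    return {
        "symbol": symbol,
        "synset_offset": synset_offset,
        "pos": pos,
        "source_word": source_word,
        "target_word": target_word,
        # 0000 marks a semantic relation between whole synsets.
        "semantic": source_target == "0000",
    }


semantic_ptr = parse_pointer("@ 02083346 n 0000".split())
lexical_ptr = parse_pointer("+ 01234567 v 0a01".split())
```

	In the second sample, <B>0a01 </B> decodes to word 10 of the source synset and word 1 of the target synset, so the relation is lexical.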

	<DL>

	<DT><I>frames</I> </DT>
	<DD>In <B>data.verb </B> only, a list of numbers corresponding to the generic
	verb sentence frames for <I>word </I>s in the synset. <I>frames </I> is of the form:
	</DD>
	</DL>
	<P>
	<I>f_cnt   </I> <B>+ </B> <I>  f_num  w_num  [ </I> <B>+ </B> <I>  f_num  w_num...] </I> <BR>
	<P>
	where <I>f_cnt </I> is a two
	digit decimal integer indicating the number of generic frames listed,
	<I>f_num </I> is a two digit decimal integer frame number, and <I>w_num </I> is a two
	digit hexadecimal integer indicating the word in the synset that the frame
	applies to. As with pointers, if this number is <B>00 </B>, <I>f_num </I> applies to
	all <I>word </I>s in the synset. If non-zero, it is applicable only to the word
	indicated. Word numbers are assigned as described for pointers. Each <I>f_num  w_num
	</I> pair is preceded by a <B>+ </B>. See <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A>
	for the text of the generic
	sentence frames.
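	<P>
	A short Python sketch of decoding the <I>frames </I> field follows. The function name <I>parse_frames </I> and the sample field values are assumptions made for illustration; note that <I>f_num </I> is read as decimal and <I>w_num </I> as hexadecimal, as described above.

```python
def parse_frames(fields):
    """Decode 'f_cnt + f_num w_num [+ f_num w_num...]' into (f_num, w_num) pairs."""
    f_cnt = int(fields[0])
    frames = []
    i = 1
    for _ in range(f_cnt):
        if fields[i] != "+":
            raise ValueError("each f_num w_num pair must be preceded by +")
        f_num = int(fields[i + 1])      # decimal frame number
        w_num = int(fields[i + 2], 16)  # hexadecimal word number, 0 = all words
        frames.append((f_num, w_num))
        i += 3
    return frames


frames = parse_frames("02 + 08 00 + 11 01".split())
```

	Here frame 8 applies to every word in the synset, and frame 11 applies only to the first word.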
	<DL>

	<DT><I>gloss</I> </DT>
	<DD>Each synset contains a gloss. A <I>gloss </I> is represented
	as a vertical bar (<B>| </B>), followed by a text string that continues until
	the end of the line. The gloss may contain a definition, one or more example
	sentences, or both. </DD>
	</DL>
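	<P>
	Putting the data file fields together, the following Python sketch parses one complete data line. It is not the library's <B>read_synset </B> routine: the name <I>parse_data_line </I> is hypothetical, both sample lines are invented and much shorter than real entries, and the sketch assumes the first vertical bar on the line starts the gloss.

```python
def parse_data_line(line):
    """Parse one data.pos line into its documented fields."""
    body, _, gloss = line.partition("|")
    f = body.split()
    synset_offset, lex_filenum, ss_type = f[0], int(f[1]), f[2]
    w_cnt = int(f[3], 16)  # hexadecimal word count
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((f[i], int(f[i + 1], 16)))  # (word, lex_id)
        i += 2
    p_cnt = int(f[i])  # decimal pointer count
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, target, pos, st = f[i:i + 4]
        pointers.append((symbol, target, pos, int(st[:2], 16), int(st[2:], 16)))
        i += 4
    frames = []
    if f[i:]:  # anything left before the gloss is the data.verb frames list
        f_cnt = int(f[i])
        i += 1
        for _ in range(f_cnt):
            frames.append((int(f[i + 1]), int(f[i + 2], 16)))
            i += 3
    return {
        "synset_offset": synset_offset,
        "lex_filenum": lex_filenum,
        "ss_type": ss_type,
        "words": words,
        "pointers": pointers,
        "frames": frames,
        "gloss": gloss.strip(),
    }


noun = parse_data_line(
    "02084071 05 n 03 dog 0 domestic_dog 0 Canis_familiaris 0 "
    "002 @ 02083346 n 0000 ~ 01322604 n 0000 | a stand-in gloss"
)
verb = parse_data_line(
    "01234567 29 v 01 run 0 001 @ 07654321 v 0000 01 + 02 00 | move fast"
)
```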

	<H3><A NAME="sect4" HREF="#toc4">Sense Numbers </A></H3>
	Senses in WordNet are generally ordered
	from most to least frequently used, with the most common sense numbered
	<B>1 </B>. Frequency of use is determined by the number of times a sense is tagged
	in the various semantic concordance texts. Senses that are not semantically
	tagged follow the ordered senses. The <I>tagsense_cnt </I> field for each entry
	in the <B>index.<I>pos </I></B> files indicates how many of the senses in the list have
	been tagged. <P>
	The <B><A HREF="cntlist.5WN.html">cntlist</B>(5WN)</A>
	file provided with the database lists the
	number of times each sense is tagged in the semantic concordances. The
	data from <B>cntlist </B> is used by <B><A HREF="grind.1WN.html">grind</B>(1WN)</A>
	to order the senses of each word.
	When the <B>index </B>.<I>pos </I> files are generated, the <I>synset_offset </I>s are output
	in sense number order, with sense 1 first in the list. Senses with the
	same number of semantic tags are assigned unique but consecutive sense
	numbers. The WordNet <FONT SIZE=-1><B>OVERVIEW </B></FONT>
	search displays all senses of the specified
	word, in all syntactic categories, and indicates which of the senses are
	represented in the semantically tagged texts.
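	<P>
	Since <I>synset_offset </I>s are listed in sense number order and tagged senses come first, sense numbers can be recovered from an index entry by position. The Python sketch below shows this; the function name <I>sense_numbers </I> and the sample offsets are illustrative only.

```python
def sense_numbers(synset_offsets, tagsense_cnt):
    """Pair each offset with its sense number and whether it was tagged."""
    return [
        # The first tagsense_cnt senses are the ones ranked by tag frequency.
        (number, offset, tagsense_cnt >= number)
        for number, offset in enumerate(synset_offsets, start=1)
    ]


senses = sense_numbers(["02084071", "10114209", "10023039"], 1)
```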
	<H3><A NAME="sect5" HREF="#toc5">Exception List File Format
	</A></H3>
	Exception lists are alphabetized lists of inflected forms of words and
	their base forms. The first field of each line is an inflected form, followed
	by a space separated list of one or more base forms of the word. There
	is one exception list file for each syntactic category. <P>
	Note that the
	noun and verb exception lists were automatically generated from a machine-readable
	dictionary, and contain many words that are not in WordNet. Also, for
	many of the inflected forms, base forms could be easily derived using
	the standard rules of detachment programmed into Morphy (see <B><A HREF="morphy.7WN.html">morphy</B>(7WN)</A>
	).
	These anomalies are allowed to remain in the exception list files, as
	they do no harm. <P>
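	A minimal Python sketch of using an exception list is shown here. It loads the lines into a dictionary instead of binary searching the sorted file, which is a simplification; the names <I>load_exceptions </I> and <I>base_forms </I> and the sample entries are assumptions for illustration.

```python
def load_exceptions(lines):
    """Map each inflected form to its list of base forms."""
    table = {}
    for line in lines:
        parts = line.split()
        if parts:
            # First field is the inflected form, the rest are base forms.
            table[parts[0]] = parts[1:]
    return table


def base_forms(table, word):
    """Return the listed base forms, or an empty list if there is no entry."""
    return table.get(word.lower(), [])


# Illustrative entries in the documented layout.
exceptions = load_exceptions(["axes axis ax", "geese goose", "mice mouse"])
found = base_forms(exceptions, "Geese")
missing = base_forms(exceptions, "dogs")
```

	An empty result means the word is not an irregular form, so a caller would fall back to the regular rules of detachment.
	<P>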

	<H3><A NAME="sect6" HREF="#toc6">Verb Example Sentences </A></H3>
	For some verb senses, example
	sentences illustrating the use of the verb sense can be displayed. Each
	line of the file <B>sentidx.vrb </B> contains a <I>sense_key </I> followed by a space
	and a comma separated list of example sentence template numbers, in decimal.
	The file <B>sents.vrb </B> lists all of the example sentence templates. Each
	line begins with the template number followed by a space. The rest of
	the line is the text of a template example sentence, with <B>%s </B> used as
	a placeholder in the text for the verb. Both files are sorted alphabetically
	so that the <I>sense_key </I> and template sentence number can be used as indices,
	via <B><A HREF="binsrch.3WN.html">binsrch</B>(3WN)</A>
	, into the appropriate file. <P>
	When a request for <FONT SIZE=-1><B>FRAMES
	</B></FONT>
	is made, the WordNet search code looks for the sense in <B>sentidx.vrb </B>.
	If found, the sentence template(s) listed is retrieved from <B>sents.vrb
	</B>, and the <B>%s </B> is replaced with the verb. If the sense is not found, the
	applicable generic sentence frame(s) listed in <I>frames </I> is displayed.
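	<P>
	The lookup described above can be sketched in Python as follows. The function names, the sample <I>sense_key </I> and the template texts are invented for illustration, and dictionaries stand in for the binary search that the WordNet library performs on the sorted files.

```python
def load_sentidx(lines):
    """Map each sense_key to its list of template numbers."""
    index = {}
    for line in lines:
        key, _, numbers = line.strip().partition(" ")
        index[key] = [int(n) for n in numbers.split(",") if n]
    return index


def load_templates(lines):
    """Map each template number to its template text."""
    templates = {}
    for line in lines:
        number, _, text = line.rstrip("\n").partition(" ")
        templates[int(number)] = text
    return templates


def example_sentences(sense_key, verb, index, templates):
    """Fill in the templates for a sense; an empty list means none are listed."""
    return [templates[n].replace("%s", verb) for n in index.get(sense_key, [])]


# Stand-in contents for sentidx.vrb and sents.vrb.
index = load_sentidx(["run%2:38:00:: 1,2"])
templates = load_templates(
    ["1 The children %s to the playground", "2 The horses %s across the field"]
)
sentences = example_sentences("run%2:38:00::", "run", index, templates)
fallback = example_sentences("walk%2:38:00::", "walk", index, templates)
```

	When the result is empty, a caller would display the generic frames from the synset's <I>frames </I> field instead.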
	<H2><A NAME="sect7" HREF="#toc7">NOTES
	</A></H2>
	Information in the <B>data.<I>pos </I></B> and <B>index.<I>pos </I></B> files represents all of the
	word senses and synsets in the WordNet database. The <I>word </I>, <I>lex_id </I>, and
	<I>lex_filenum </I> fields together uniquely identify each word sense in WordNet.
	These can be encoded in a <I>sense_key </I> as described in <B><A HREF="senseidx.5WN.html">senseidx</B>(5WN)</A>
	. Each
	synset in the database can be uniquely identified by combining the <I>synset_offset
	</I> for the synset with a code for the syntactic category (since it is possible
	for synsets in different <B>data.<I>pos </I></B> files to have the same <I>synset_offset
	</I>). <P>
	The WordNet system provides both command line and window-based browser
	interfaces to the database. Both interfaces utilize a common library of
	search and morphology code. The source code for the library and interfaces
	is included in the WordNet package. See <B><A HREF="wnintro.3WN.html">wnintro</B>(3WN)</A>
	for an overview of
	the WordNet source code.
	<H2><A NAME="sect8" HREF="#toc8">ENVIRONMENT VARIABLES (UNIX) </A></H2>

	<DL>

	<DT><B>WNHOME</B> </DT>
	<DD>Base directory
	for WordNet. Default is <B>/usr/local/WordNet-3.0 </B>. </DD>

	<DT><B>WNSEARCHDIR</B> </DT>
	<DD>Directory in
	which the WordNet database has been installed. Default is <B>WNHOME/dict
	</B>. </DD>
	</DL>
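	<P>
	As a sketch of how a program might resolve the database directory from these variables, the Python below applies the defaults listed above. The function name <I>wordnet_dict_dir </I> is hypothetical, and the optional <I>environ </I> argument exists only so the logic can be exercised without changing the real environment.

```python
import os


def wordnet_dict_dir(environ=None):
    """Resolve the database directory from WNSEARCHDIR and WNHOME."""
    env = os.environ if environ is None else environ
    home = env.get("WNHOME", "/usr/local/WordNet-3.0")
    # WNSEARCHDIR wins when set; otherwise fall back to WNHOME/dict.
    return env.get("WNSEARCHDIR", home + "/dict")


default_dir = wordnet_dict_dir({})
custom_home = wordnet_dict_dir({"WNHOME": "/opt/wn"})
custom_search = wordnet_dict_dir({"WNHOME": "/opt/wn", "WNSEARCHDIR": "/data/wn"})
```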

	<H2><A NAME="sect9" HREF="#toc9">REGISTRY (WINDOWS) </A></H2>

	<DL>

	<DT><B>HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome</B> </DT>
	<DD>Base directory
	for WordNet. Default is <B>C:\Program Files\WordNet\3.0 </B>. </DD>
	</DL>

	<H2><A NAME="sect10" HREF="#toc10">FILES </A></H2>

	<DL>

	<DT><B>index.<I>pos </I></B> </DT>
	<DD>database
	index files </DD>

	<DT><B>data.<I>pos </I></B> </DT>
	<DD>database data files </DD>

	<DT><B>*.vrb</B> </DT>
	<DD>files of sentences illustrating
	the use of verbs </DD>

	<DT><B><I>pos </I>.exc</B> </DT>
	<DD>morphology exception lists </DD>
	</DL>

	<H2><A NAME="sect11" HREF="#toc11">SEE ALSO </A></H2>
	<B><A HREF="grind.1WN.html">grind</B>(1WN)</A>
	,
	<B><A HREF="wn.1WN.html">wn</B>(1WN)</A>
	, <B><A HREF="wnb.1WN.html">wnb</B>(1WN)</A>
	, <B><A HREF="wnintro.3WN.html">wnintro</B>(3WN)</A>
	, <B><A HREF="binsrch.3WN.html">binsrch</B>(3WN)</A>
	, <B><A HREF="wnintro.5WN.html">wnintro</B>(5WN)</A>
	, <B><A HREF="cntlist.5WN.html">cntlist</B>(5WN)</A>
	,
	<B><A HREF="lexnames.5WN.html">lexnames</B>(5WN)</A>
	, <B><A HREF="senseidx.5WN.html">senseidx</B>(5WN)</A>
	, <B><A HREF="wninput.5WN.html">wninput</B>(5WN)</A>
	, <B><A HREF="morphy.7WN.html">morphy</B>(7WN)</A>
	, <B><A HREF="wngloss.7WN.html">wngloss</B>(7WN)</A>
	,
	<B><A HREF="wngroups.7WN.html">wngroups</B>(7WN)</A>
	, <B><A HREF="wnstats.7WN.html">wnstats</B>(7WN)</A>
	. <P>

	<HR><P>
	<A NAME="toc"><B>Table of Contents</B></A><P>
	<UL>
	<LI><A NAME="toc0" HREF="#sect0">NAME</A></LI>
	<LI><A NAME="toc1" HREF="#sect1">DESCRIPTION</A></LI>
	<UL>
	<LI><A NAME="toc2" HREF="#sect2">Index File Format</A></LI>
	<LI><A NAME="toc3" HREF="#sect3">Data File Format</A></LI>
	<LI><A NAME="toc4" HREF="#sect4">Sense Numbers</A></LI>
	<LI><A NAME="toc5" HREF="#sect5">Exception List File Format</A></LI>
	<LI><A NAME="toc6" HREF="#sect6">Verb Example Sentences</A></LI>
	</UL>
	<LI><A NAME="toc7" HREF="#sect7">NOTES</A></LI>
	<LI><A NAME="toc8" HREF="#sect8">ENVIRONMENT VARIABLES (UNIX)</A></LI>
	<LI><A NAME="toc9" HREF="#sect9">REGISTRY (WINDOWS)</A></LI>
	<LI><A NAME="toc10" HREF="#sect10">FILES</A></LI>
	<LI><A NAME="toc11" HREF="#sect11">SEE ALSO</A></LI>
	</UL>
	</BODY></HTML>