===============================================================================
Copyright (c) 1993 Massachusetts Institute of Technology and
University of Pennsylvania.  All rights reserved.
This program was written by Eric Brill (brill@goldilocks.lcs.mit.edu)
Feel free to contact me with any questions, comments, . . .
(After July 1 1994, my email address will be brill@blaze.cs.jhu.edu)
===============================================================================
THIS SOFTWARE IS PROVIDED "AS IS", AND M.I.T. MAKES NO REPRESENTATIONS 
OR WARRANTIES, EXPRESS OR IMPLIED.  By way of example, but not 
limitation, M.I.T. MAKES NO REPRESENTATIONS OR WARRANTIES OF 
MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF 
THE LICENSED SOFTWARE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY 
PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
=============================================================================
Please Read The COPYRIGHT file included with the tagger.
=============================================================================

This README file contains a brief description of the tagger,
and information on how to modify the tagger to suit your needs.
For more detailed information on the tagger, see the papers listed at
the end of this file.  

===================================================================

Tagging is done in two stages.  First, every word is assigned its most likely
tag in isolation.  Each word in the tagged training corpus has a
lexical entry consisting of a partially ordered list of tags,
indicating the most likely tag for that word, as well as all other
tags seen with that word (in no particular order).  A list of
transformations is provided for determining the most likely tag for
words not in the lexicon.  Unknown words are first assumed to be nouns
(proper nouns if capitalized), and then cues based upon prefixes,
suffixes, infixes, and adjacent word cooccurrence are used to change
the guess of most likely tag.  (To find out how to alter the strategy
of tagging unknown words first as proper nouns if capitalized and
nouns otherwise, see the file README.TRAINING.)  Next, contextual
transformations are used to improve accuracy.
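
In outline, the two stages look something like the following Perl
sketch.  This is an illustration of the design only -- the actual
tagger is written in C, and both the lexicon handling and the rule
matching are much richer (the toy lexicon, word list, and lowercased
lookup below are simplifying assumptions):

  #!/usr/bin/perl
  # Toy illustration of the two-stage design (not the real tagger).
  %most_likely_tag = ("the", "DT", "boy", "NN", "runs", "VBZ");
  @words = ("The", "boy", "Xyzzyfied", "runs");   # made-up input

  # Stage 1 (start-state tagger): every word gets its most likely tag
  # in isolation; unknown words default to NNP if capitalized and NN
  # otherwise, before the learned lexical rules revise the guesses.
  foreach $w (@words) {
      if (defined $most_likely_tag{lc $w}) { push(@tags, $most_likely_tag{lc $w}); }
      elsif ($w =~ /^[A-Z]/)               { push(@tags, "NNP"); }
      else                                 { push(@tags, "NN"); }
  }
  # ...the learned lexical rules (prefix, suffix, infix, and
  # adjacent-word cues) would now be applied to the unknown words...

  # Stage 2 (final-state tagger): the learned contextual
  # transformations are applied, in order, to the initially tagged text.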

===================================================================

To compile the programs, type (in the tagger base directory):

 make

or first edit the Makefile to suit your needs.

==================================================================

(If you have altered the file structure of the tagger after untarring
the programs, then you will have to adjust the instructions
accordingly.)

First cd into the directory Bin_and_Data/.

To execute the program, type:

tagger LEXICON YOUR-CORPUS BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE

where YOUR-CORPUS is the file name of the (currently untagged) corpus
you wish to have tagged, and the other files are all provided with the
tagger.

Options (which are typed after all of the file names) are:

-h             :: help

-w wordlist    :: provide an extra set of words beyond those in LEXICON.
	          See below.

-i filename    :: writes intermediate results from start state tagger
		  into filename

-s number      :: processes the corpus to be tagged "number" lines at
		  a time.  This should be specified if memory problems
	          result from trying to process too large a corpus at
		  once.  For example, on a Sparc 10 with 32 meg RAM,
	          I usually process 250,000 words at a time.  On a
	          machine with 48 meg, I typically do 500,000 words.
	          (note that "number" is the number of lines, not
		  words).   These numbers are just guidelines.  You
	 	  can test out what works best for you if you plan to
		  tag large corpora.

-S 	       :: use start state tagger only.

-F             :: use final state tagger only.  In this case,
	         YOUR-CORPUS is a tagged corpus, whose taggings will
	         be changed according to the final-state-tagger
	         contextual rules.  YOUR-CORPUS should be a tagged
	         corpus ONLY when using this option.
	
The tagger writes to standard output.
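
For example, to tag a corpus in a file named sample.txt (a hypothetical
file name) with the files shipped with the tagger, processing 5,000
lines at a time and saving the result:

  tagger LEXICON sample.txt BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE \
         -s 5000 > sample.tagged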

=======================================================================

Information on training files:

LEXICON: a list of (word tag1 tag2 ... tagn), where tag1 is the most
likely tag for "word" in the training corpus, and tag2...tagn are
other taggings of "word" seen in the training corpus (in no particular
order).  
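
For example, a lexical entry might look like this (a made-up entry,
using Penn Treebank tags):

light JJ NN VB

meaning "light" was most often tagged JJ in the training corpus, and
was also seen tagged as NN and VB.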

There are three different lexica provided with this release:

  LEXICON.BROWN.AND.WSJ was derived from the Penn Treebank tagging of
	the WSJ, roughly 3 million words, and the Brown Corpus.
  LEXICON.BROWN was derived from the Penn Treebank tagging of the
	Brown corpus only.
  LEXICON.WSJ was derived from the WSJ only.
  (LEXICON is a link to LEXICON.BROWN.AND.WSJ)

Which lexicon you choose to use will depend on the type of corpus you
wish to tag.

CORPUS: the corpus you wish to tag.  Should be one sentence per line, with
punctuation (and anything else appropriate) tokenized.  Words can be
pretagged by joining the word and its tag with two slashes:

He 's the winner ( at least , that 's what I was told ) .
The boy//NN said : `` I am here . ''

BIGRAMS: a list of adjacent word pairs seen in the training corpus.  Used to
apply transformations such as: "change the tag from X to Y if word Z ever
appears to the right."  In this distribution, this file is just set to
a dummy list, as a place holder.  This is because "BIGRAMS" is only
used for unknown words, and there is no Brown or WSJ text in the Penn
Treebank that is not tagged.  This file can be augmented if more
unannotated text is available -- see below. 
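
Each line of the bigram file presumably holds one adjacent word pair,
for example (made-up pairs):

the boy
boy said

If in doubt about the exact layout, inspect the output of
incorporate-new-bigrams.prl, described below.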

LEXICALRULEFILE: list of transformations used for initial tagging of words not
in the lexicon.  There are two lexical rule files provided with this
release:

  LEXICALRULEFILE.BROWN was derived from roughly 300,000 words of
	tagged text from the Brown Corpus.
  LEXICALRULEFILE.WSJ was derived from roughly 300,000 words of
	tagged text from the WSJ.
  (LEXICALRULEFILE is a link to LEXICALRULEFILE.WSJ)

CONTEXTUALRULEFILE: list of contextually triggered transformations.
There are three contextual rule files provided with this release:

  CONTEXTUALRULEFILE.BROWN was derived from roughly 600,000 words of
	tagged text from the Brown Corpus.
  CONTEXTUALRULEFILE.WSJ was derived from roughly 600,000 words of
	tagged text from the WSJ.
  CONTEXTUALRULEFILE.WSJ.NOLEX was derived from roughly 600,000 words
	of tagged text from the WSJ, disallowing all transformations
	that make reference to words.
  (CONTEXTUALRULEFILE is a link to CONTEXTUALRULEFILE.WSJ)
====================================================================

For information on learning lexical and contextual rules, see the file
README.TRAINING.  Below we discuss how an already trained tagger can
be augmented when more training material becomes available by altering
the data files used by the tagger.

If you have a corpus you want annotated, information about that corpus
can be added to the training files to help the tagger.  For instance,
if a corpus contains the words abcds, abcding, and abcded, the tagger
can make some guesses about these words even if they are unknown
words. 

(Note: all perl utility programs can be found in the Utilities/
directory.)

First, the bigram list can be augmented by calling:

incorporate-new-bigrams.prl LEXICALRULEFILE BIGRAMS NEWCORPUS > \
	NEWBIGRAMS

where NEWCORPUS is the corpus to be tagged.  Then NEWBIGRAMS should be
used instead of BIGRAMS in tagging.   This program assumes NEWCORPUS
will have one sentence per line.

To augment the lexicon, call:

incorporate-new-words.prl LEXICON NEWCORPUS > WORDLIST

Then call the tagger with the -w WORDLIST option.  This does not add
to the lexicon, but adds to the list of known words.  This is of use
to the unknown word tagger, for rules such as: change the tag of an
unknown word to adjective if adding the suffix "ly" results in a word.
To determine whether something "results in a word", we see if it is
listed in the LEXICON or WORDLIST.
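
Putting the augmentation steps together, a typical session might look
like this (file names other than the distributed ones are arbitrary):

  incorporate-new-bigrams.prl LEXICALRULEFILE BIGRAMS NEWCORPUS > NEWBIGRAMS
  incorporate-new-words.prl LEXICON NEWCORPUS > WORDLIST
  tagger LEXICON NEWCORPUS NEWBIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE \
         -w WORDLIST > NEWCORPUS.TAGGED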

====================================================================

The data files can be manually edited.  If the word "alsfjls"
occurs frequently in your corpus and is always a noun, just add the
line:

alsfjls NN

to the LEXICON file.  If it is usually a noun, but can also be an
adjective or a determiner, add:

alsfjls NN JJ DT

You can also manually add rules to the rule lists.  Here are some
examples of the meaning of lexical rules:

0 haspref 1 CD x == if a word has prefix "0" (of length 1 character),
	            tag it as "CD".

VBN un fhaspref 2 JJ x == if a word has prefix "un" (of length 2
	           characters), and it is currently tagged as "VBN", 
		   then change the tag to "JJ".

- char JJ x   == If the character "-" appears anywhere in the word, 
	         tag it as "JJ".

ly hassuf 2 RB x  == If a word has suffix "ly", tag it as "RB".

ly addsuf 2 JJ x  == If adding the letters "ly" to the end of a word
	             results in a word (the new word appears in
	             LEXICON or the extended wordlist), tag it as "JJ".


Mr. goodright NNP x == If the word ever appears to the right of "Mr.",
			tag it as NNP.
		 

Note the difference between haspref/fhaspref, goodright/fgoodright,
etc.  Rule names starting with "f" are restricted (they apply only if
the word's current tag matches the tag specified in the rule), while
the other rules apply regardless of the current tagging.


And contextual rules:

NN VB PREVTAG TO == Change a tag from NN to VB if the previous tag is TO
VBP VB PREV1OR2OR3TAG MD == Change a tag from VBP to VB if one of the
	                     3 previous tags is MD
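
For instance, given the initially tagged fragment (in word/tag
notation):

to/TO run/NN quickly/RB

the first rule above fires on "run", whose current tag is NN and whose
previous tag is TO, producing:

to/TO run/VB quickly/RB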

====================================================================

IMPORTANT:

Tokenization and vocabulary follow the Penn Treebank.  Punctuation must
be split off from words, etc.  The corpus should be in one-sentence-per-line
format.  Example:

We  're going today , are you ?
`` I 'm hungry , '' he said .

Since the tagger was trained on the Penn Treebank, it must either be
augmented (e.g., by adding lexical entries for we're, I'm, ", etc.) or
the input must conform to Penn Treebank tokenization 
rules.  For more information about the Penn Treebank style, contact
treebank@unagi.cis.upenn.edu.

Also, one sentence per line is assumed.
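
If your text is not yet tokenized, a crude Perl filter along the
following lines can serve as a starting point.  This is NOT the Penn
Treebank's own tokenizer and it misses many cases (sentence splitting,
abbreviations, numbers containing commas); it only illustrates the
kind of preprocessing required:

  #!/usr/bin/perl
  # Crude tokenizer sketch: reads one sentence per line, writes
  # roughly Penn-Treebank-tokenized text (illustration only).
  while (<>) {
      chomp;
      s/"(\S)/`` $1/g;                # opening double quote -> ``
      s/(\S)"/$1 ''/g;                # closing double quote -> ''
      s/([,;:?!()])/ $1 /g;           # split off most punctuation
      s/(\w)\.$/$1 ./;                # split off sentence-final period
      s/(\w)n't\b/$1 n't/g;           # don't -> do n't
      s/(\w)('re|'m|'s|'ll|'ve|'d)\b/$1 $2/g;  # we're -> we 're
      s/\s+/ /g; s/^ //; s/ $//;      # normalize whitespace
      print "$_\n";
  }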

==================================================================
Important: The tagger is provided with a default algorithm for
initially guessing the tags of unknown words before applying
the learned rules.  This default algorithm is:

  Tag words beginning with a capital letter with NNP.
  Tag all other words with NN.

If you are using the tagger as-is, then everything is fine.  However,
if you have retrained the tagger, and altered the default algorithm in
unknown-lexical-learn.prl, then you must also alter this algorithm in
the start-state-tagger.c code, so the two are the same.  Instructions
for how to do this can be found in the comments in both programs.
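
In sketch form, the default is simply (the real code lives in
start-state-tagger.c and unknown-lexical-learn.prl; this Perl fragment
only restates it):

  # Default initial guess for a word not found in the lexicon:
  sub default_tag {
      my ($word) = @_;
      return ($word =~ /^[A-Z]/) ? "NNP" : "NN";
  }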

==================================================================
Helpful hints:

If tagging speed matters, and the tagger spends much of its time
processing contextual rules (spitting out c's to the screen), you can
try using only the first n rules from the contextual rule file, for
some n.  The tagger gets diminishing returns as rules lower on the
list are applied, so the trade-off between speed and accuracy may
favor using only a prefix of the rule list.
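
For example, to try only the first 100 contextual rules (100 is an
arbitrary cutoff; the rule files are plain text with one rule per
line, so the standard Unix head command suffices):

  head -100 CONTEXTUALRULEFILE > CONTEXTUALRULEFILE.TOP100
  tagger LEXICON YOUR-CORPUS BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE.TOP100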
===================================================================


For a more thorough description of the learning programs, see:

\bibitem[Bri92] E.~Brill.
\newblock A simple rule-based part of speech tagger.
\newblock In {\em Proceedings of the Third Conference on Applied Natural
  Language Processing, ACL}, Trento, Italy, 1992.

\bibitem[Bri93] E.~Brill.
\newblock {\em A Corpus-Based Approach to Language Learning}.
\newblock PhD thesis, Department of Computer and Information Science,
  University of Pennsylvania, 1993.

\bibitem[Bri94] E.~Brill.
\newblock Some advances in rule-based part of speech tagging.
\newblock In {\em Proceedings of the Twelfth National Conference on
  Artificial Intelligence (AAAI-94)}, Seattle, WA, 1994.

(These papers are available by anonymous ftp from
lightning.lcs.mit.edu in pub/BRILL/Papers.  The last paper is included
with this distribution, in the Docs/ directory.  After July 1, 1994,
contact me at brill@blaze.cs.jhu.edu for information about how to
obtain these papers.)