The tagset

Daniel Ridings
Unit for Digital Documentation
UiO
Norway

The tagset for the Nynorsk implementation of Eric Brill's wordclass tagger is based on previous work I did for the Swedish Parole corpus. It follows some basic principles, although I have broken with them on occasion to make things easier.

A tag for a wordclass always consists of the same number of characters. For example, "hus" is tagged as SANE0U or SANF0U. All nouns will have a tag of exactly 6 places. If there is a place that is needed at times, but not for a particular analysis, then that place is held with a zero (0).

The first character usually denotes the wordclass: "S" for substantive.

Each character after the first one provides information about attributes. I have deviated from this principle in the interest of clarity on x occasions.

The first characters used are:

If this principle were applied across the board, it would be necessary to use non-intuitive characters in the first position for adverbs, pronouns, prepositions, infinitive markers, interjections, and subordination. Tags for these classes are verbose instead:

NOUNS

Position  Explanation
1         S (wordclass)
2         A/P for appellative or proper noun
3         M/F/N for the gender
4         E/F for singular or plural
5         case (usually 0 but can be G for the genitive)
6         U/B for the indefinite or definite

It would be misleading to write "N" (normal) in position 5 when the noun is not in the genitive. It could be confused with "N" for the nominative, which is not a relevant category for Norwegian nouns. Therefore we write "0", which means: "This position is required for some analyses in Norwegian, but not in this particular case." It serves as a placeholder.
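
The positional scheme for nouns can be illustrated with a small decoder. This is only a sketch: the function name `decode_noun_tag`, the attribute names, and the English labels are assumptions made for the example, and it handles only plain six-place tags, not variants with a suffix.

```python
# Illustrative decoder for the six-place noun tags described above.
# Attribute names and English labels are assumptions made for this sketch.
NOUN_POSITIONS = [
    ("wordclass", {"S": "substantive"}),
    ("type", {"A": "appellative", "P": "proper"}),
    ("gender", {"M": "masculine", "F": "feminine", "N": "neuter"}),
    ("number", {"E": "singular", "F": "plural"}),
    ("case", {"G": "genitive"}),
    ("definiteness", {"U": "indefinite", "B": "definite"}),
]


def decode_noun_tag(tag):
    """Return a dict of attributes; a '0' placeholder decodes to None."""
    if len(tag) != len(NOUN_POSITIONS):
        raise ValueError(f"noun tags have exactly 6 places: {tag!r}")
    decoded = {}
    for char, (name, codes) in zip(tag, NOUN_POSITIONS):
        if char == "0":
            decoded[name] = None  # the place exists but is not used here
        elif char in codes:
            decoded[name] = codes[char]
        else:
            raise ValueError(f"unexpected {char!r} for {name} in {tag!r}")
    return decoded


print(decode_noun_tag("SANE0U"))
```

For "SANE0U" the case entry comes back as None, which mirrors the role of "0" as a placeholder rather than as a value.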

ADJECTIVES

Position  Explanation
1         A
2         Q (qualitative*)
3         P/K/S degree (positive, comparative, superlative)
4         M/F/N/2 masculine, feminine, neuter, masculine/feminine (most)
5         E/F (number)
6         G case (rare)
7         U/B (indefinite, definite)

* This is historical and reflects a decision that we made as we worked. It will be explained further below. XXX

PARTICIPLES

Position  Explanation
1         P (class)
2         F/P (perfect, present)
3         0 (always 0, not used)
4         2/N (gender: masc/fem, neuter, 0 [most common])
5         E/F (number: sing/pl)
6         never used
7         U/B (indefinite, definite)*

* Participles used in predicative position never have a trait for definite or indefinite, but are always 0.
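
Because several places in a participle tag are fixed, a tag can be checked against the table with a regular expression. The pattern below is a sketch built from the table; the name `is_participle_tag` is hypothetical, and allowing 0 in the number place is an assumption based on tags such as PF00000.

```python
import re

# P, then F or P, a fixed 0, gender (2, N or 0), number (E, F or 0),
# the unused sixth place (0), and finally U, B or 0.
# Allowing 0 for number is an assumption, not something the table states.
PARTICIPLE_TAG = re.compile(r"P[FP]0[2N0][EF0]0[UB0]")


def is_participle_tag(tag):
    """Return True if the whole tag fits the seven-place participle layout."""
    return PARTICIPLE_TAG.fullmatch(tag) is not None


for candidate in ("PF02E0U", "PF00000", "PREP"):
    print(candidate, is_participle_tag(candidate))
```

Matching the whole string matters here, since a verbose tag such as PREP also begins with "P".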

PREPOSITIONS

PREP

DETERMINERS

Position  Explanation
1         D (class)
2         D/K/P (demonstrative, numerical, possessive)
3         M/F/N (gender)
4         E/F (number)
5         G (case: genitive, usually 0)
6         B/0 (def/indef) ?

PUNCTUATION, etc

Tag      Explanation
FE       External punctuation (. ! ?)
FI       Internal punctuation (, ;)
FP       Parallel matching punctuation: quotes, parentheses, etc.
GRAFIKK  Markers in the text for divisions, etc. (***)
SYMBOL   Publishing symbols

ABBREVIATIONS

FORK

Interjections

INTERJ

Conjunctions, subjunctions

KONJ

SBU

PRONOUNS

PRON- + six further positions (6 to 11, counting the five characters of the prefix):

Position  Explanation
6         P/R/H (personal, reflexive, interrogative "who")
7         1/2/3 (first, second, third person)
8         M/F/N (gender)
9         E/F (number)
10        N/A (normal, object form)
11        H/I (animate, inanimate)
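
Since the pronoun positions are numbered from 6, the five characters of "PRON-" count as positions 1 to 5. The sketch below shows that offset in practice. The name `decode_pronoun_tag` and the attribute labels are assumptions; codes the table does not list are returned unchanged rather than guessed at.

```python
PREFIX = "PRON-"

# (position in the full tag, attribute name, documented codes).
# Attribute names and English labels are assumptions made for this sketch.
PRONOUN_POSITIONS = [
    (6, "type", {"P": "personal", "R": "reflexive", "H": "interrogative"}),
    (7, "person", {"1": "first", "2": "second", "3": "third"}),
    (8, "gender", {"M": "masculine", "F": "feminine", "N": "neuter"}),
    (9, "number", {"E": "singular", "F": "plural"}),
    (10, "form", {"N": "normal", "A": "object"}),
    (11, "animacy", {"H": "animate", "I": "inanimate"}),
]


def decode_pronoun_tag(tag):
    """Decode positions 6 to 11 of a PRON- tag; '0' decodes to None."""
    if not tag.startswith(PREFIX) or len(tag) != 11:
        raise ValueError(f"not an 11-place pronoun tag: {tag!r}")
    decoded = {}
    for position, name, codes in PRONOUN_POSITIONS:
        char = tag[position - 1]  # positions are 1-based, indices 0-based
        # Undocumented codes are passed through as the raw character.
        decoded[name] = None if char == "0" else codes.get(char, char)
    return decoded


print(decode_pronoun_tag("PRON-P10ENH"))
```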

NUMBERS

TO - ordinal

TALL - number

VERBS

Tag                 Explanation
INF-M               infinitive marker
V-IMP               imperative
V-INF               infinitive
V-INF-GEN           XXX
V-INF-PRES-ST-FORM  XXX
V-INF-ST-FORM       XXX
V-PAA-ST-PRET       XXX
V-PRES              XXX
V-PRES-SIDEFORM     XXX
V-PRES-ST-FORM      XXX
V-PRET              XXX

Foreign words

X
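
The mix of single-character and verbose tags means that a wordclass lookup has to test the verbose forms first: PREP and PRON- begin with "P", and SBU begins with "S". The sketch below illustrates one way to do that. The function name `wordclass_of` and the English labels are assumptions, and the tag inventory is simply the one listed in this document.

```python
# Verbose tags are checked before the first-character rule, because several
# of them begin with a letter that also names a wordclass.
VERBOSE_TAGS = {
    "ADV": "adverb",
    "ADV-UN": "adverb",
    "PREP": "preposition",
    "INTERJ": "interjection",
    "KONJ": "conjunction",
    "SBU": "subjunction",
    "FORK": "abbreviation",
    "GRAFIKK": "text marker",
    "SYMBOL": "symbol",
    "INF-M": "infinitive marker",
    "TALL": "number",
}
VERBOSE_PREFIXES = {"PRON-": "pronoun", "V-": "verb"}
FIRST_CHARACTER = {
    "S": "substantive",
    "A": "adjective",
    "P": "participle",
    "D": "determiner",
    "F": "punctuation",
    "T": "number",
    "X": "foreign word",
}


def wordclass_of(tag):
    """Map a tag to a wordclass label (labels are illustrative)."""
    if tag in VERBOSE_TAGS:
        return VERBOSE_TAGS[tag]
    for prefix, name in VERBOSE_PREFIXES.items():
        if tag.startswith(prefix):
            return name
    if tag[:1] in FIRST_CHARACTER:
        return FIRST_CHARACTER[tag[:1]]
    raise ValueError(f"unknown tag: {tag!r}")


for tag in ("PREP", "PF02E0U", "PRON-P10ENH", "SBU", "SANE0U"):
    print(tag, wordclass_of(tag))
```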

Distribution of the tags

Adverbs

Tag     Abs. Freq.  %-wordclass  %-corpus
ADV          27587         99.4       7.4
ADV-UN         158          0.6       0.0

Adverbs : 7.5% of the training corpus
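
The %-wordclass column appears to be each tag's share of the total for its wordclass. A minimal sketch of that calculation, using the two adverb frequencies, follows; the function name `wordclass_shares` is hypothetical. The %-corpus column would be the same calculation with the size of the whole training corpus as the denominator, a figure the text does not state.

```python
def wordclass_shares(frequencies):
    """Return each tag's share of the wordclass total, in percent."""
    total = sum(frequencies.values())
    return {
        tag: round(100 * count / total, 1)
        for tag, count in frequencies.items()
    }


adverbs = {"ADV": 27587, "ADV-UN": 158}
print(wordclass_shares(adverbs))  # {'ADV': 99.4, 'ADV-UN': 0.6}
```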

Adjectives

Tag      Abs. Freq.  %-wordclass  %-corpus
AP00000        1610          5.7       0.4
AQ00000          26          0.1       0.0
AQ0000B         725          2.6       0.2
AQ000GB           1          0.0       0.0
AQ00E0B           8          0.0       0.0
AQ00F00         617          2.2       0.2
AQ00FG0           1          0.0       0.0
AQ0FE0U         100          0.4       0.0
AQ0ME0U         243          0.9       0.1
AQ0NE0U         177          0.6       0.0
AQK0000        2083          7.3       0.6
AQP0000         569          2.0       0.2
AQP000U           2          0.0       0.0
AQP0E00        1302          4.6       0.4
AQP0E0B        4098         14.4       1.1
AQP0E0U        2889         10.2       0.8
AQP0EGB           2          0.0       0.0
AQP0F00        5521         19.5       1.5
AQP0FG0           2          0.0       0.0
AQP2E00        2042          7.2       0.6
AQP2E0B           2          0.0       0.0
AQP2E0U        1309          4.6       0.4
AQPFE0U          37          0.1       0.0
AQPME0U         115          0.4       0.0
AQPNE00         176          0.6       0.0
AQPNE0U        2700          9.5       0.7
AQS000B         718          2.5       0.2
AQS000U         523          1.8       0.1
AT00000         193          0.7       0.1
ATP0000           5          0.0       0.0
ATP0E0B         530          1.9       0.1
ATP0E0U           2          0.0       0.0
ATP0F00          44          0.2       0.0
ATP2E0U           4          0.0       0.0
ATP3E0B           4          0.0       0.0
ATPNE0U           3          0.0       0.0

Adjectives : 7.7% of the training corpus

Determiners

Tag     Abs. Freq.  %-wordclass  %-corpus
DD0E00           1          0.0       0.0
DD0F00        2175         10.2       0.6
DDFE00        1229          5.8       0.3
DDME00        2642         12.4       0.7
DDNE00        2230         10.5       0.6
DK0000          31          0.1       0.0
DK000B         108          0.5       0.0
DK0F00        2320         10.9       0.6
DK0FG0           1          0.0       0.0
DKFE00        1624          7.6       0.4
DKME00        3868         18.2       1.0
DKMEG0           1          0.0       0.0
DKNE00        2560         12.0       0.7
DP0F00         629          3.0       0.2
DPFE00         497          2.3       0.1
DPFF00           1          0.0       0.0
DPME00         757          3.6       0.2
DPMF00           1          0.0       0.0
DPNE00         564          2.6       0.2
DS0000          63          0.3       0.0

Determiners : 5.7% of the training corpus

Punctuation et al.

Tag      Abs. Freq.  %-wordclass  %-corpus
FE            20321         47.6       5.5
FI            19497         45.7       5.3
FP             2822          6.6       0.8
GRAFIKK          14          0.0       0.0
SYMBOL            2          0.0       0.0

Punctuation et al. : 11.5% of the training corpus

Abbreviations

Tag   Abs. Freq.  %-wordclass  %-corpus
FORK           4        100.0       0.0

Abbreviations : 0.0% of the training corpus

Interjections

Tag     Abs. Freq.  %-wordclass  %-corpus
INTERJ         618        100.0       0.2

Interjections : 0.2% of the training corpus

Conjunctions/Subjunctions

Tag   Abs. Freq.  %-wordclass  %-corpus
KONJ       20090         64.7       5.4
SBU        10952         35.3       3.0

Conjunctions/Subjunctions : 8.4% of the training corpus

Participles

Tag      Abs. Freq.  %-wordclass  %-corpus
PF00000        5798         73.1       1.6
PF00E00         420          5.3       0.1
PF00E0B         124          1.6       0.0
PF00E0U         125          1.6       0.0
PF00F00         573          7.2       0.2
PF02E00         107          1.3       0.0
PF02E0B           6          0.1       0.0
PF02E0U         369          4.7       0.1
PF02F00           1          0.0       0.0
PF0NE00          26          0.3       0.0
PF0NE0U         380          4.8       0.1

Participles : 2.1% of the training corpus

Prepositions

Tag   Abs. Freq.  %-wordclass  %-corpus
PREP       44427        100.0      12.0

Prepositions : 12.0% of the training corpus

Pronouns

Tag          Abs. Freq.  %-wordclass  %-corpus
PRON-000E00         407          1.5       0.1
PRON-000F00         174          0.7       0.0
PRON-H0000H          68          0.3       0.0
PRON-H0000I         374          1.4       0.1
PRON-P00E00         299          1.1       0.1
PRON-P00E0H         685          2.6       0.2
PRON-P00F00          57          0.2       0.0
PRON-P10EAH         507          1.9       0.1
PRON-P10ENH        2073          7.8       0.6
PRON-P10FAH         289          1.1       0.1
PRON-P10FNH        1306          4.9       0.4
PRON-P20EAH         153          0.6       0.0
PRON-P20ENH         603          2.3       0.2
PRON-P20FAH          23          0.1       0.0
PRON-P20FNH          67          0.3       0.0
PRON-P30F00          64          0.2       0.0
PRON-P30FAH         507          1.9       0.1
PRON-P30FAI           3          0.0       0.0
PRON-P30FNH        1847          7.0       0.5
PRON-P30FNI           1          0.0       0.0
PRON-P32E00         452          1.7       0.1
PRON-P3FEA0         215          0.8       0.1
PRON-P3FEN0         947          3.6       0.3
PRON-P3ME00         216          0.8       0.1
PRON-P3ME0H           1          0.0       0.0
PRON-P3MEA0         632          2.4       0.2
PRON-P3MEN0        3868         14.6       1.0
PRON-P3MENH           2          0.0       0.0
PRON-P3NE00        7958         30.0       2.1
PRON-R000A0        2461          9.3       0.7
PRON-R00E00         133          0.5       0.0
PRON-R00F00          15          0.1       0.0
PRON-S00F0H         127          0.5       0.0

Pronouns : 7.2% of the training corpus

Substantives

Tag        Abs. Freq.  %-wordclass  %-corpus
S00000            190          0.2       0.1
S00000-FO           4          0.0       0.0
S00F0U             24          0.0       0.0
SA0000            107          0.1       0.0
SA0E0U              2          0.0       0.0
SA0F0U              4          0.0       0.0
SAF000              1          0.0       0.0
SAFE0B           5450          6.5       1.5
SAFE0U           5920          7.1       1.6
SAFEGB             71          0.1       0.0
SAFEGU             18          0.0       0.0
SAFF0B           1492          1.8       0.4
SAFF0U           2429          2.9       0.7
SAFFGB              8          0.0       0.0
SAFFGU              9          0.0       0.0
SAM000             28          0.0       0.0
SAME0B          10699         12.8       2.9
SAME0U          12623         15.2       3.4
SAMEGB            363          0.4       0.1
SAMEGU             80          0.1       0.0
SAMF0B           2707          3.2       0.7
SAMF0U           4673          5.6       1.3
SAMFGB              7          0.0       0.0
SAMFGU             25          0.0       0.0
SAN000             41          0.0       0.0
SANE0B           6820          8.2       1.8
SANE0U           6694          8.0       1.8
SANEG0              1          0.0       0.0
SANEGB            149          0.2       0.0
SANEGU            100          0.1       0.0
SANF0B           1647          2.0       0.4
SANF0U           3599          4.3       1.0
SANFGB             15          0.0       0.0
SANFGU             38          0.0       0.0
SP0000          16654         20.0       4.5
SP00G0            608          0.7       0.2

Substantives : 22.5% of the training corpus

Numbers

Tag   Abs. Freq.  %-wordclass  %-corpus
T0            14          0.7       0.0
TALL        1901         99.3       0.5

Numbers : 0.5% of the training corpus

Verbs

Tag                 Abs. Freq.  %-wordclass  %-corpus
INF-M                     4209          7.8       1.1
V-IMP                      304          0.6       0.1
V-INF                     9731         18.1       2.6
V-INF-GEN                    2          0.0       0.0
V-INF-PRES-ST-FORM         425          0.8       0.1
V-INF-ST-FORM                9          0.0       0.0
V-PAA-ST-PRET                4          0.0       0.0
V-PRES                   21895         40.8       5.9
V-PRES-SIDEFORM              1          0.0       0.0
V-PRES-ST-FORM               8          0.0       0.0
V-PRET                   17054         31.8       4.6

Verbs : 14.5% of the training corpus