Searching the Oslo Multilingual Corpus / English-Norwegian Parallel Corpus with PerlTCE
By Hilde Hasselgård
To get to the Oslo Multilingual Corpus (OMC) or the English-Norwegian
Parallel Corpus (ENPC), you open the browser page at http://www.tekstlab.uio.no/cgi-bin/omc/PerlTCE.cgi,
and type in your username and password.
Note: This link, and some of the others in this document, will only work
if you have access to the OMC/ENPC. You will find an application form here. Due to restrictions in the copyright agreement for the texts,
permission is given only to staff and students in the arts
faculties at the universities of Oslo and Bergen who use the corpus for
research or for courses where corpus work is specified as part of the syllabus.
If you have access to the corpus, you will see the following search form:
With the default settings, as shown in the picture, the browser will
search in English original texts from the fiction component of the corpus. The
structure of the OMC is explained at http://www.hf.uio.no/ilos/english/services/omc/.
See also the information on the ENPC at http://www.hf.uio.no/ilos/english/services/omc/enpc/.
Performing a simple search
To perform a search, type a word (e.g. however) in the 'Enter search' box and click the 'Submit search' button. The next screen
will show you the sentences in which however occurs in the corpus,
with the Norwegian translations immediately following the English sentences.
- The three boxes in the third row of the search form allow you to specify your
search. The default settings are 'Fiction', 'English', and 'Original'.
The first box allows you to choose among different databases. 'ENPC/Fiction'
and 'ENPC/Non-fiction' belong to the English-Norwegian Parallel Corpus,
while the other databases are part of the OMC and contain more languages.
For example, En-Ge-No contains English originals with German and Norwegian translations.
The second box is for specifying which language you want to search in.
The third box gives you the choice between original and translated text.
- When the box for 'hide tags' is ticked (the default),
you will not see the long identification tag of each sentence, and special
characters are rendered on the screen. Remove the tick and run a search to
see the difference.
- The box for 'direct speech' allows you to
search only the dialogue parts of the fiction texts (when the box is
ticked). NB: This applies only to the original texts in the ENPC.
- The box for 'position' can be used if you are
looking for a word in a certain position in a sentence. Thus if you write
1 in the box, the browser will find only the examples where the word you
are looking for is the first word in a sentence; –1 will find only the
examples where it is the last word in a sentence.
- The box for 'context' allows you to specify the
number of sentences of context (max. 25) shown before and after the sentence
that contains your search word.
- ‘Number of hits to display per page’ can be
set to 50, 100, or 200. The default is 100: the first page shows the first
100 hits and the second page shows the rest. You have to click on a link to see the
second page. Example: If you search for good in ENPC fiction, you
should get the following message on the results page:
before filters: 447. Displaying first 100 matches.
Results: 101 - 447. (after
If you want to search for a word with alternative
forms or spellings, you can write the alternative forms together, separated by
|. (The | means or.) Example: bein|ben
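The `|` alternation and the 'position' box described above can be pictured with a small offline sketch. This is not how PerlTCE is implemented; the function name `simple_search`, the crude tokenisation, and the four sample sentences are all invented for illustration.

```python
def simple_search(sentences, query, position=None):
    """Return the sentences that contain any alternative in `query`.

    `query` may hold alternatives separated by "|" (e.g. "bein|ben").
    `position` mimics the 'position' box: 1 = first word, -1 = last word.
    This is a hypothetical helper, not part of PerlTCE.
    """
    alternatives = set(query.lower().split("|"))
    hits = []
    for sentence in sentences:
        # Crude tokenisation: split on whitespace, strip surrounding punctuation.
        words = [w.strip(".,;:!?\"'()").lower() for w in sentence.split()]
        words = [w for w in words if w]
        if position is None:
            matched = any(w in alternatives for w in words)
        else:
            # Convert the 1-based (or negative) position to a list index.
            index = position - 1 if position > 0 else position
            in_range = -len(words) <= index < len(words)
            matched = in_range and words[index] in alternatives
        if matched:
            hits.append(sentence)
    return hits


# Invented sample sentences, not taken from the corpus.
sample = [
    "However, the plan failed.",
    "The plan failed, however.",
    "Han brakk et bein.",
    "Hun har vondt i et ben.",
]

print(simple_search(sample, "bein|ben"))              # both Norwegian sentences
print(simple_search(sample, "however", position=1))   # sentence-initial however
print(simple_search(sample, "however", position=-1))  # sentence-final however
```

The sketch lower-cases everything before comparing, which is one way of picturing why the search string itself should be typed in lower case.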
By ticking the box ‘sort
output by matched word’ you get an alphabetically ordered concordance
(word list) if you have searched for alternative forms or used a wildcard in
your search (see below).
The box ‘List texts in corpus’ (below the search form) gives you a list of
the texts included in the database shown in the box.
Don’t use capital letters in your search, not even for
The ‘Enter search’ box can contain only one
word. If you want to search for a string of words, you need to use the filters.
It is possible to search for punctuation marks (e.g. ? to find all the questions in the corpus).
The code in brackets that appears at the end of each
example is a reference to a corpus text. “T” at the end of a code shows that
the sentence comes from a translated text.
Searching with filters
The various filters allow you to make a more refined search, e.g.:
- for a word at a fixed point in a sentence (e.g. the first word)
- for word combinations, using the and/not +/- <filter> box. Typing red in the search box
and AND +3 blue in the and/not +/- filter box will give you all
examples where red is followed by blue within a span of 3 words.
- for the relationship between original text and translation, using the and/not
<filter> box. For example, the search string however
combined with the filter AND imidlertid
will give you all examples where the English sentence has however
and the Norwegian sentence has imidlertid.
A filter with NOT, e.g. however combined with NOT imidlertid, will give you all examples where however
does not correspond to imidlertid.
The filters can be combined with each other. It is also possible to specify
two filters in each category. (E.g. red in the search
box, AND +3 blue in the first and/not +/- filter box, and NOT +5
white in the second and/not +/- filter box will give
you combinations of red and blue, but not red, white and blue.)
Read the Help
menu for further details about searching in the corpus.
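The logic of the filters can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the function names, the tuple format for span filters, and the sentence pairs (including their Norwegian renderings) are invented for illustration and say nothing about how the corpus browser works internally.

```python
def tokens(sentence):
    """Lower-case a sentence and split it into bare words."""
    stripped = (w.strip(".,;:!?\"'()").lower() for w in sentence.split())
    return [w for w in stripped if w]


def follows_within(words, start, target, span):
    """True if `target` occurs among the `span` words after index `start`."""
    return target in words[start + 1 : start + 1 + span]


def filtered_search(pairs, query, and_after=None, not_after=None,
                    and_translation=None, not_translation=None):
    """Filter (original, translation) sentence pairs (hypothetical helper).

    and_after / not_after: (word, span) tuples, e.g. ("blue", 3) for "AND +3 blue".
    and_translation / not_translation: a word that must / must not occur
    in the translated sentence.
    """
    hits = []
    for original, translation in pairs:
        words = tokens(original)
        target_words = tokens(translation)
        for i, word in enumerate(words):
            if word != query:
                continue
            if and_after and not follows_within(words, i, *and_after):
                continue
            if not_after and follows_within(words, i, *not_after):
                continue
            if and_translation and and_translation not in target_words:
                continue
            if not_translation and not_translation in target_words:
                continue
            hits.append((original, translation))
            break  # one hit per sentence pair is enough
    return hits


# Invented sentence pairs, not taken from the corpus.
pairs = [
    ("However, he stayed.", "Han ble imidlertid."),
    ("However, she left.", "Men hun dro."),
    ("The red and blue flag.", "Det røde og blå flagget."),
    ("A red, white and blue flag.", "Et rødt, hvitt og blått flagg."),
]

# "red" with AND +3 blue and NOT +5 white
print(filtered_search(pairs, "red", and_after=("blue", 3), not_after=("white", 5)))
# "however" with AND imidlertid, then with NOT imidlertid
print(filtered_search(pairs, "however", and_translation="imidlertid"))
print(filtered_search(pairs, "however", not_translation="imidlertid"))
```

In this sketch the span counts only the words that follow the search word, matching the description of red being followed by blue within three words.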
A wildcard is a character that represents one or more unspecified
characters. The wildcard used in the OMC is *. Note that the question mark (?)
is not used as a wildcard in the OMC/ENPC. (On the contrary, a search for “?”
will find all the question marks in the corpus.)
Wildcards are useful if you are unsure of the spelling of a word, or if
a word has alternative spellings.
- If you
want to look up all forms of the word mind (i.e. mind - minds -
minded - minding), you can use the * wildcard to represent any set of
characters. A search for mind* in the ENPC finds minds, minded, and
minding, as well as mind's and mindful. Note that it
does not find mind itself, only words where mind is followed
by one or more characters. To find all forms including mind, type mind|mind* in the “Enter search” field.
English words can be spelt with the endings -ize
or -ise. To make sure you get all uses of
the word realize/realise you can type reali*,
so that you get both spelling variants in the same search. (Alternatively,
you can search for realize|realise.)
Wildcards can, in principle, be used to represent the
beginning or the end of a word. Note that a search for a word with a wildcard
at the beginning (e.g. *ly, to find all words
ending in -ly) will usually take rather a long time,
because the browser has to check all the words in the corpus from
beginning to end.
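The wildcard behaviour described above can be approximated with a regular expression. The sketch below assumes, as the text states, that * stands for one or more characters and that | separates alternatives; the function names and the word list are invented, and the real browser may differ in detail.

```python
import re


def wildcard_to_regex(query):
    """Compile a query such as "mind|mind*" into a whole-word pattern.

    Assumptions taken from the text: "*" stands for ONE OR MORE characters
    (so "mind*" does not match "mind"), and "|" separates alternatives.
    """
    parts = []
    for alternative in query.split("|"):
        # Escape the literal text, then join the pieces with
        # "one or more non-space characters" where each "*" was.
        pieces = [re.escape(piece) for piece in alternative.split("*")]
        parts.append(r"\S+".join(pieces))
    return re.compile("(?:" + "|".join(parts) + ")")


def matching_words(words, query):
    """Return the words that match the whole query pattern."""
    pattern = wildcard_to_regex(query)
    return [w for w in words if pattern.fullmatch(w)]


# Invented word list for illustration.
words = ["mind", "minds", "minded", "minding", "mind's", "mindful",
         "realize", "realise", "reality", "quickly", "fly"]

print(matching_words(words, "mind*"))       # every form except bare "mind"
print(matching_words(words, "mind|mind*"))  # all forms, including "mind"
print(matching_words(words, "reali*"))      # both spellings, plus "reality"
print(matching_words(words, "*ly"))         # words ending in -ly
```

In this sketch reali* also picks up reality, which is why an explicit alternation such as realize|realise can be the more precise choice.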
Saving your results
You can save your search results by using the 'Save' or 'Save as' option
in your web browser ('Lagre' / 'Lagre
som'). You can choose between saving the results as
an html file or as a text (txt) file. A text file can be imported into a word
processor and edited. If you do not need to save more than a few of your search
results, the easiest way might be to copy the examples you want
and paste them into a Word file.
For a large corpus investigation it is usually practical to store the
results in a database, where they may be annotated, sorted and retrieved in
Using the tagged ENPC
The original texts in the ENPC have been tagged and lemmatized
(meaning that each word has a word class tag, and that all grammatical forms of a word
are grouped together under one lexeme).
- Log on by
clicking on the link “ENPC (tagged)” on the PerlTCE
browser page. You will see that the interface looks slightly different
from that of the untagged OMC/ENPC.
- The box
marked "L" means "lemma". (A lemma is a group of
grammatically related word forms.) Tick this box and write take as
your search string. Press "search". You will then see all
occurrences of the lemma "take" (take, takes, took, taken)
in the corpus. If the lemma box is not ticked, you will only get the word
- If you
try the same kind of search with like, the search will produce not
only all forms of the verb like, but also the preposition like
plus the nouns likes (as in likes and dislikes) and liking
and the adjective liked. In order to exclude the preposition, for
example, tick the "not" box and choose "PREP" in the
box to the right. Press "search" again. You will still have
nouns and adjectives among your hits, so a better idea is to remove the
tick in the “not” box and instead select all the word class codes preceded
by “V” (“V” on its own won’t find anything in the English material,
unfortunately) plus “ING”.
- The next
step might be to see how often the verb like is followed by an
infinitive or by a present participle. Write "AND +1 to" in the
box after the ‘L box’ in the “Original” row to get LIKE TO. Write “AND +1”
in the same box and select the tag ING to get all examples of the verb like
followed immediately by a present participle.
- There is
further information on how to search in the tagged ENPC just below the
search form (or click here).
- A list of
all the word class tags is found in
the manual to the ENPC.
A word of warning: The tagging has been performed automatically, and
although the analysis is fairly reliable and has been partly checked, there are
still some errors. If you are using the material for research, always check
that your results are correct.
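The lemma and word class searches described in this section can be pictured with a toy data structure in which every token carries a word form, a lemma and a tag. Everything in the sketch is an assumption made for illustration: only PREP and ING are tag names mentioned in this document, the other tags (VPRES, VPAST, VINF, PRON, N, CONJ, INFM) are invented, and the sentences are not from the corpus.

```python
# Each token is (word form, lemma, tag). Tags other than PREP and ING
# are invented here; see the ENPC manual for the real tag set.
tagged = [
    [("she", "she", "PRON"), ("likes", "like", "VPRES"),
     ("to", "to", "INFM"), ("swim", "swim", "VINF")],
    [("he", "he", "PRON"), ("liked", "like", "VPAST"),
     ("swimming", "swim", "ING")],
    [("it", "it", "PRON"), ("looks", "look", "VPRES"),
     ("like", "like", "PREP"), ("rain", "rain", "N")],
    [("her", "her", "PRON"), ("likes", "like", "N"),
     ("and", "and", "CONJ"), ("dislikes", "dislike", "N")],
]


def lemma_search(sentences, lemma, tags=None, exclude_tags=None,
                 next_word=None, next_tag=None):
    """Find sentences containing `lemma` (hypothetical helper).

    tags / exclude_tags: restrict or exclude word class tags of the hit.
    next_word / next_tag: mimic "AND +1 <word>" and "AND +1" plus a tag.
    """
    hits = []
    for sentence in sentences:
        for i, (form, token_lemma, tag) in enumerate(sentence):
            if token_lemma != lemma:
                continue
            if tags is not None and tag not in tags:
                continue
            if exclude_tags and tag in exclude_tags:
                continue
            following = sentence[i + 1] if i + 1 < len(sentence) else None
            if next_word and (following is None or following[0] != next_word):
                continue
            if next_tag and (following is None or following[2] != next_tag):
                continue
            hits.append(" ".join(f for f, _, _ in sentence))
            break
    return hits


verb_tags = {"VPRES", "VPAST", "ING"}  # invented stand-ins for the V codes plus ING

print(lemma_search(tagged, "like"))                           # every word class
print(lemma_search(tagged, "like", exclude_tags={"PREP"}))    # "not" + PREP
print(lemma_search(tagged, "like", tags=verb_tags, next_word="to"))
print(lemma_search(tagged, "like", tags=verb_tags, next_tag="ING"))
```

Excluding PREP alone still leaves the noun in the results, which mirrors the advice above to select the verb tags instead.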
- Look up
words ending in -ish in the ENPC, and see
how this ending (in words like 'reddish') is translated into Norwegian.
- Use an
English-Norwegian dictionary and check the translations of please, pardon, mister,
Then look these words up in the ENPC. Which translations do you find? To
what extent do the corpus findings agree with the dictionary?
- Use the
tagged corpus to look for the verb and noun show. How many do you
find of each? What is the most common Norwegian correspondence of the
verb? Of the noun?
© Hilde Hasselgård and the Department of Literature,
Area Studies and European Languages, University of Oslo