Searching the Oslo Multilingual Corpus / English-Norwegian Parallel Corpus with PerlTCE

By Hilde Hasselgård



Getting started

To get to the Oslo Multilingual Corpus (OMC) or the English-Norwegian Parallel Corpus (ENPC), open the page at http://www.tekstlab.uio.no/cgi-bin/omc/PerlTCE.cgi in your browser and type in your username and password.

Note: This link, and some of the others in this document, will only work if you have access to the OMC/ENPC. You will find an application form here. Due to restrictions in the copyright agreement for the texts, permission is given only to staff and students in the arts faculties at the universities of Oslo and Bergen who use the corpus for research or for courses where corpus work is specified as part of the syllabus.

If you have access to the corpus, you will see the following search form:

 

With the default settings, as shown in the picture, the browser will search in English original texts from the fiction component of the corpus. The structure of the OMC is explained at http://www.hf.uio.no/ilos/english/services/omc/. See also the information on the ENPC at http://www.hf.uio.no/ilos/english/services/omc/enpc/.


Performing a simple search

To perform a search, type a word (e.g. however) in the 'Enter search' box and click the 'Submit search' button. The next screen shows the sentences in which however occurs in the corpus, with the Norwegian translation immediately following each English sentence.

Total before filters: 447. Displaying first 100 matches.

good : 447
Results: 101 - 447. (after filters)

·        If you want to search for a word with alternative forms or spellings, you can write the alternative forms together, separated by |. (The | means or.) Example: bein|ben

·        By ticking the box ‘sort output by matched word’ you get an alphabetically ordered concordance (word list) if you have searched for alternative forms or used a wildcard in your search (see below).

·        The box ‘List texts in corpus’ (below the search form) gives you a list of the texts included in the database shown in the box. 

Note:

·        Don’t use capital letters in your search, not even for proper nouns.

·        The ‘Enter search’ box can only contain one word. If you want to search for a string of words, you need to use the filters (see below).

·        It is possible to search for punctuation marks (e.g. ? to find all the questions in the corpus).

·        The code in brackets that appears at the end of each example is a reference to a corpus text. “T” at the end of a code shows that the sentence comes from a translated text.
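The search rules above can be illustrated with a small Python sketch. It is not the PerlTCE implementation: the function names, the tokenisation, the case-insensitive matching and the reference codes "XX1" / "XX1T" are assumptions made only for this illustration.

```python
import re

def matches(term: str, sentence: str) -> bool:
    """Return True if any alternative in `term` occurs as a whole token.

    `term` is a single search word; alternatives are separated by "|",
    which means "or" (e.g. "bein|ben").
    """
    # The search is written in lower case; the sentence is lower-cased
    # here too (an assumption about how matching works).
    forms = {form for form in term.lower().split("|") if form}
    # Words and punctuation marks are treated as separate tokens, so a
    # search for "?" finds sentences containing a question mark.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    return any(token in forms for token in tokens)

def is_translation(ref_code: str) -> bool:
    """A reference code ending in "T" marks a sentence from a translated text."""
    return ref_code.endswith("T")
```

For instance, matches("bein|ben", "Han brakk et ben.") is true, while matches("ben", "Han satt på benken.") is false, because the search word must match a whole word.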



Searching with filters

The various filters allow you to make a more refined search, e.g.:

 

Read the Help menu for further details about searching in the corpus.


Wildcard (*)

A wildcard is a character that represents one or more unspecified characters. The wildcard used in the OMC is *. Note that the question mark (?) is not used as a wildcard in the OMC/ENPC. (On the contrary, a search for “?” will find all the question marks in the corpus.)

Wildcards are useful if you are unsure of the spelling of a word, or if a word has alternative spellings.

Examples:

 

Wildcards can, in principle, be used to represent the beginning or the end of a word. Note that a search with a wildcard at the beginning of a word (e.g. *ly, to find all words ending in -ly) will usually take rather a long time, because the browser will have to check all the words in the corpus from beginning to end.
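The behaviour of the wildcard can be sketched in Python by translating a search pattern into a regular expression. This is only an illustration of the description above, not the code used by the corpus browser; the function names are hypothetical, and the sketch assumes that * stands for one or more letters, as stated in the definition.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a search pattern with * into a compiled regular expression."""
    # Every character except * is taken literally, so "?" only matches
    # a question mark and never acts as a wildcard.
    literal_parts = [re.escape(part) for part in pattern.lower().split("*")]
    # Each * is replaced by "one or more word characters".
    return re.compile(r"\w+".join(literal_parts))

def find_words(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match the whole pattern, in their original order."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word.lower())]
```

With these definitions, find_words("*ly", ["quickly", "fly", "lyric"]) returns the words ending in -ly, and find_words("organi*", ...) picks up organise, organize and organisation alike. The sketch also shows why an initial wildcard is slow: no word can be ruled out by its first letters, so every word has to be checked.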



Saving your results

You can save your search results by using the 'Save' or 'Save as' option in your web browser ('Lagre' / 'Lagre som'). You can choose between saving the results as an HTML file or as a text (txt) file. A text file can be imported into a word processor and edited. If you only need to save a few of your search results, the easiest way might be to use the 'Copy-and-paste' function and paste the examples you want into a Word file.

For a large corpus investigation it is usually practical to store the results in a database, where they may be annotated, sorted and retrieved in various ways.
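As one possible way of doing this, the sketch below stores search results in an SQLite database using Python's standard sqlite3 module. The table layout, the column names, the function names and the annotation labels are all assumptions chosen for the example; any database program can be used in a similar way.

```python
import sqlite3

def open_results_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a database with one table of corpus hits."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS hits (
               id          INTEGER PRIMARY KEY,
               search_term TEXT NOT NULL,
               ref_code    TEXT NOT NULL,   -- text reference from the corpus
               original    TEXT NOT NULL,   -- the sentence that was found
               translation TEXT,            -- the corresponding sentence, if any
               annotation  TEXT             -- your own classification
           )"""
    )
    return conn

def add_hit(conn, search_term, ref_code, original, translation=None) -> int:
    """Store one search result and return its row id."""
    cur = conn.execute(
        "INSERT INTO hits (search_term, ref_code, original, translation) "
        "VALUES (?, ?, ?, ?)",
        (search_term, ref_code, original, translation),
    )
    conn.commit()
    return cur.lastrowid

def annotate(conn, hit_id: int, label: str) -> None:
    """Attach an annotation label to a stored result."""
    conn.execute("UPDATE hits SET annotation = ? WHERE id = ?", (label, hit_id))
    conn.commit()

def hits_with_annotation(conn, label: str) -> list[tuple]:
    """Retrieve all results with a given label, sorted by text reference."""
    return conn.execute(
        "SELECT ref_code, original, translation FROM hits "
        "WHERE annotation = ? ORDER BY ref_code",
        (label,),
    ).fetchall()
```

Passing a file name instead of ":memory:" keeps the database on disk between sessions.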



Using the tagged ENPC

The original texts in the ENPC have been tagged and lemmatized. This means that each word has a word class tag, and that all grammatical forms of a word are grouped together under one lexeme.

A word of warning: The tagging has been performed automatically, and although the analysis is fairly reliable and has been partly checked, there are still some errors. If you are using the material for research, always check that your results are correct.
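To show what tagging and lemmatization make possible, here is a minimal Python sketch. The token format, the tag names ("verb", "noun") and the sample words are invented for the illustration and do not reproduce the actual ENPC tag set or file format.

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    form: str   # the word as it appears in the text
    lemma: str  # the lexeme under which all its forms are grouped
    tag: str    # the word class tag (hypothetical tag names)

# Invented sample data, for illustration only.
SAMPLE = [
    Token("goes", "go", "verb"),
    Token("went", "go", "verb"),
    Token("going", "go", "verb"),
    Token("walks", "walk", "noun"),
    Token("walked", "walk", "verb"),
]

def forms_of(lemma: str, tokens: list[Token], tag: Optional[str] = None) -> list[str]:
    """Return the distinct forms of a lemma, optionally limited to one word class."""
    return sorted(
        {t.form for t in tokens
         if t.lemma == lemma and (tag is None or t.tag == tag)}
    )
```

A single search for the lemma go thus retrieves goes, going and went, and the tag makes it possible to separate the noun walk from the verb walk. Because the tags were assigned automatically, results of this kind still need to be checked by hand.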



Practice



© Hilde Hasselgård and the Department of Literature, Area Studies and European Languages, University of Oslo