EDD / NO2014 / Corpus

Daniel Ridings

Non-linguistic (oh well, and some linguistic) aspects

Corpus development by EDD for NO2014 began seriously in 2002. At that point two basic specifications were made: the texts were to be encoded in SGML, and ORACLE was to be used as the underlying database.

The rest was more or less up to me.

Legacy data

There was a substantial amount of older texts, "older" generally meaning texts for which the copyright had expired. These texts had been created in earlier projects (DOKPRO). They were in SGML, but in a form of SGML that was more or less unusable without substantial manual processing.

I have seen many levels of quality when it comes to corpus texts. Sometimes the SGML is unusable because the formalism has been used in runaway fashion; anything and everything has been marked up. Users tend to look at the TEI and feel obligated to use everything the TEI has to offer. That was not the main reason the existing texts were unusable, however. DOKPRO was simply before its time. It was already doing the things that the TEI was created to facilitate; the main problem was that it was doing them before the TEI had completed much of its work.

SGML, and the TEI further down the road, was initially seen as a format for document exchange. SGML was added so that documents could be exchanged between sites in a known, documented format, but in most cases each site would strip off the SGML in order to use the documents internally. The requirement to use SGML was a break with this tradition. I will return to this later.

ORACLE

ORACLE is not the obvious application for providing access to a large corpus. The corpus, however, was never intended to be a stand-alone resource, even if it works well as one. ORACLE is the glue that connects various working processes into the "big picture."

Data modelling, for an amateur like me (in a data-modelling context, that is), involves making intelligent guesses at what kind of information a user will want to glean from a data set. The exact information need not be known, but the types of questions, a typology of queries, should be identified. Once that is done, the relationships between members of the data set can be established and the model formed accordingly.

When dealing with natural language, an open-ended, infinite set of data, the above is only possible if one has solved all of the outstanding questions of linguistics. There was really no reason to even try to map linguistic data to a relational database.

The solution was to assume that corpus users wanted to search for words and that they wanted to see these words returned in context.

ORACLE was, and is, best at returning larger chunks of data, of textual evidence, based on simple searches. By simple searches I mean that ORACLE begins to feel convoluted as soon as you want to search for:

put, followed by "up with" or "in"

The ORACLE implementation, however, shines when it comes to returning variable, but logical, stretches of context.
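By way of illustration: in a relational model with one row per running word (the table and column names below are my own invention, not the actual NO2014 schema), every extra word in the pattern means another self-join on the token position, and the query quickly becomes unwieldy. A rough sketch:

    # Sketch only: assumes a hypothetical table tokens(text_id, seq, word)
    # with one row per running word; not the actual NO2014 schema.
    import sqlite3  # stands in for the ORACLE client library here

    def put_up_with_or_in(conn: sqlite3.Connection):
        """Find 'put' followed either by 'up with' or by 'in'."""
        sql = """
            SELECT t1.text_id, t1.seq
              FROM tokens t1
              JOIN tokens t2 ON t2.text_id = t1.text_id AND t2.seq = t1.seq + 1
              LEFT JOIN tokens t3 ON t3.text_id = t1.text_id AND t3.seq = t1.seq + 2
             WHERE t1.word = 'put'
               AND (t2.word = 'in'
                    OR (t2.word = 'up' AND t3.word = 'with'))
        """
        return conn.execute(sql).fetchall()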

Concordances

Concordances are a refinement of KWIC (keyword in context) applications. The refinement is simple, yet fundamental to their usefulness in linguistic contexts.

The context in concordances is sorted. There is little reason to sort on the actual search word in a dynamically created concordance. We know what the result is going to be, because we searched for it. The only surprising result would be a failure to find anything. If we are only interested in knowing whether or not a word occurs in the corpus, a frequency list would suffice.

What we are primarily interested in when we create concordances of corpus searches are the contexts that words are found in. If these contexts are not sorted, then similar contexts will be scattered about in the KWIC material. If the corpus is large, and they do tend to be large when created for lexicographic purposes, then we end up with thousands of lines of excerpts that are not significantly more helpful than a filing drawer of slips.

When the results are sorted by context, the page or screen in front of us actually creates a visual effect that draws attention to cases where the same context is found line after line. When line after line of the same context is found, the words create a pattern. The search word is often the same from one line to another as well and anchors the visual effect. The table format of ORACLE data tends to lessen the effectiveness of concordances in this respect. Tables keep the search word separate in a column of its own, isolated, visually, from the context.

Another, related, strength of concordances over KWIC results is the ability to sort the context in various ways. We can sort on the context on the right-hand side of the search word if we are looking for the nouns that adjectives qualify or the argument structure of transitive verbs. Then again, we might be searching for specific nouns and be interested in the left-hand context, since Norwegian is an AN language (the adjective precedes the noun).
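To make the point about sorting concrete, here is a small stand-alone sketch, entirely independent of the corpus software (and with invented example lines), that sorts KWIC lines on either the right-hand or the reversed left-hand context:

    def sort_concordance(lines, side="right"):
        """Sort KWIC lines on their context rather than on the keyword.

        Each line is a (left_context, keyword, right_context) triple.
        Sorting on the right context groups recurring continuations;
        sorting on the reversed left context groups the words immediately
        preceding the keyword, useful for an AN language like Norwegian.
        """
        if side == "right":
            key = lambda line: line[2].lower().split()
        else:
            key = lambda line: list(reversed(line[0].lower().split()))
        return sorted(lines, key=key)

    lines = [
        ("han ville ikke", "finne", "seg i det"),
        ("de kunne ikke", "finne", "fram til huset"),
        ("vi må", "finne", "seg i å vente"),
    ]
    for left, kw, right in sort_concordance(lines):
        print(f"{left:>25}  {kw}  {right}")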

Note that a concordance creates a subset of a corpus, a subset based on a specific search. We devote some attention to balancing the text types that a corpus is made up of. Having done that, we make claims as to the norm of this and that phenomenon. In a balanced corpus we hope to represent linguistic traits according to the same norm in which they occur in natural language (which can never quite be done, and there are no scientific criteria for doing so).

Nevertheless, we try. Then what do we do? We disturb that balance by creating subsets that are intentionally skewed, skewed by the fact that every single line contains at least one word that we forced to be there; forced to be there by the simple fact that we searched for it.

If it is true that you can know a word by the company it keeps (Firth), then these skewed subsets, based on known search criteria, should also contain another imbalance, an imbalance consisting of the company that the search words tend to keep. This is the basic idea behind the method developed by Sofia Johansson, a master's student in computational linguistics I supervised at Göteborgs universitet: Kollikon (see pdf file). At some point in time I would like to apply what we learned to the Norwegian corpus and see if we can come a step or two further.
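The intuition can be sketched in a few lines. This is only the raw counting step, not Sofia Johansson's actual Kollikon implementation, which would weigh such counts against overall corpus frequencies:

    from collections import Counter

    def collocates(concordance, window=3):
        """Count words occurring within `window` positions of the keyword.

        `concordance` is a list of (left_context, keyword, right_context)
        triples. A real measure (mutual information, log-likelihood, ...)
        would compare these counts with overall corpus frequencies; here
        we only gather the raw counts that such a measure would start from.
        """
        counts = Counter()
        for left, _, right in concordance:
            counts.update(left.lower().split()[-window:])
            counts.update(right.lower().split()[:window])
        return counts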

Document tree

I already mentioned that one of the strong points of ORACLE is the ability to return various lengths of motivated context.

Motivated context? Well, we could simply specify x number of words or characters to the left or to the right. That would work, but it would make life a bit difficult for the lexicographer who eventually wants to isolate an example that will be included to illustrate a sense of a word that is being defined and described. Examples will be read by human readers, and syntactically correct sentences are needed. How many characters to the right and left, how many words to the right and left, are needed in order to capture a full "s-unit" (sentence unit)?

Two aspects of SGML proved to be useful. The one, the document tree, is what I used to provide various amounts of context to the users. (The other was the ability to associate bibliographical information with a text by means of the TEI-header). Not much else was of any import.

What is a document tree? It is a document hierarchy. At the top of the hierarchy is the document itself, let us say the novel. We can give each novel a unique identification number (1, 2, 3, 4, etc) and that number will represent the top of the hierarchy. So if we are referring to text 3, then we can just refer to 3 rather than a long title. Somewhere 3 will be associated with the bibliographic information anyway.

The next level down in a typical novel is "chapter". They too can be numbered, so we have the first chapter of 3, 3.1, the second chapter, 3.2, the third chapter, 3.3 etc. We can always refer uniquely to the third chapter of a particular book by qualifying the first position with the unique number for the book: 1.3 or 2.3 or 3.3 or 4.3 etc.

The next level down from a chapter is often the paragraph. So the first paragraph of the first chapter of book 3 would be: 3.1.1, the second paragraph of the same, 3.1.2, the third paragraph, 3.1.3, the third paragraph of the fourth chapter of book 3, 3.4.3 etc.

Once we get down to the level of the paragraph we have the two most interesting levels for a lexicographer: the sentence and the word. So the fourth word of the second sentence of the third paragraph of the fourth chapter of book 3 is: 3.4.3.2.4

SGML, by creative use of ID's (the numbers above are used to build the ID's), allows us to uniquely identify any single element (a word, sentence, paragraph, chapter, or book is an element, some being subelements of others). Once again, 3.4.3.2.4 expresses a particular word. If we want to refer to the sentence the word occurs in, we just have to back up one step to 3.4.3.2. If we want to see a little more context, let us say the paragraph, then we just back up one more step again, 3.4.3. If we want the whole chapter, 3.4; the whole book, 3. The corpus application does not permit anyone to go up a level higher than the book (or article, or whatever the text happens to be). To go up one level higher would mean going up to the top of the corpus. Returning the whole corpus as context for a search result is probably just a little bit more information than anyone needs to know. There are defensible reasons for going up to the level of the whole text (for removing the text, replacing the text, or exporting a whole work out of the corpus, for example).
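The mechanics are trivial, which is part of the point. A toy sketch of my own (not the actual corpus code) of how a dotted ID can be widened one level at a time:

    def wider_context(node_id, levels=1):
        """Back up the document tree by dropping trailing components.

        '3.4.3.2.4' (a word) -> '3.4.3.2' (its sentence) -> '3.4.3'
        (paragraph) -> '3.4' (chapter) -> '3' (the whole text). We never
        go above the text itself: that would mean returning the corpus.
        """
        parts = node_id.split(".")
        keep = max(1, len(parts) - levels)
        return ".".join(parts[:keep])

    assert wider_context("3.4.3.2.4") == "3.4.3.2"   # word -> sentence
    assert wider_context("3.4.3.2.4", 2) == "3.4.3"  # word -> paragraph
    assert wider_context("3", 5) == "3"              # never above the text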

The application that the lexicographers use for editing permits them to ask for more context, a step at a time. The same technique is used for one of the web versions:

Oracle interface

In that version you can search for a word using SQL strings (% = wild card), such as: sam%

That will bring up a list of words that match. If I were to let the user go to a concordance directly, it might tie up their machine for quite a while, since wild card searches can result in numerous concordance lines. Click on one of the words in the list and you will be taken to a concordance.
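In outline, that two-step flow is nothing more exotic than the following (again with invented table and column names; conn is a DB-API connection standing in for the ORACLE client):

    def matching_words(conn, pattern):
        """Step 1: list distinct word forms matching an SQL pattern such as 'sam%'."""
        sql = "SELECT word, COUNT(*) FROM tokens WHERE word LIKE ? GROUP BY word"
        return conn.execute(sql, (pattern,)).fetchall()

    def concordance_lines(conn, word):
        """Step 2: only once the user has picked a single word do we fetch its s-units."""
        sql = "SELECT s_unit_id FROM tokens WHERE word = ?"
        return [row[0] for row in conn.execute(sql, (word,))]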

Once you have a concordance in front of you, you might see a concordance line that needs more context in order to be useful. If you click on such a line, a pop-up window with yet another step up the document tree will be opened. Yet another step usually equals a paragraph, a p-unit, since the concordance line is built out of s-units. One step up from an s-unit is a p-unit.

You will also notice that the concordance line, ideally, does not extend over sentence boundaries. The initial context consists of the immediate syntactic information and that does not include the next or previous sentence. To get to that information, you need to ask for more context.

So the ID value of each concordance line is the same as that of the s-unit. That being the case, a lexicographer can also save that SGML element, the s-unit, for inclusion in a dictionary article.

Word class tagging

Word class tagging was the straw that broke the back of the ORACLE search system. We managed to get around the less than impressive capacity to search for multi-word units, but to tack on attributes (part-of-speech information is an attribute of a word) was just a bit too much.

I was the first in Scandinavia, if not the first on the network anywhere, to implement the Corpus Query System from Stuttgart within CGI back in the mid-90s, and I finally dumped the ORACLE system and went back to it.
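To give a feel for why the change was worth it, these are the kinds of query expressions CQP takes in its stride (the attribute names and tag values below are assumptions on my part, not quoted from the NO2014 configuration):

    # Example CQP query strings, kept as plain Python constants.
    # Attribute names (word, pos) and tag values are assumed, not the actual tagset.
    CQP_EXAMPLES = {
        # 'put' followed by 'up with' or by 'in'; the case that was awkward in SQL
        "multi_word": '[word="put"] ([word="up"] [word="with"] | [word="in"])',
        # word forms beginning with 'sam', restricted by part of speech
        "with_pos": '[word="sam.*" & pos="subst"]',
    }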

I led the development of a part-of-speech tagger for nynorsk based on the formalism from Eric Brill. We didn't have much of a training corpus to go by, so we worked in increments. We hand-checked a few thousand words of corpus text, trained the tagger, tagged twice as much, hand-checked that, trained the tagger, tagged even more text, hand-checked that text, re-trained the tagger and so on.

The results can be seen at the links below. The tagger is freely available and can be downloaded from the same links, if you know what you are doing.

Tagger info
User guide in nynorsk

Hints on how word class searching can be used are found in "Nynorskkorpuset ved Norsk Ordbok" by myself and Oddrun Grønvik (second link above).

Marriage of Oracle and CQP

Even though ORACLE was rejected as a search system, it is still used. There is a one-to-one relationship between information in CQP and information in the database. This relationship is maintained through the careful use of ID's.

ORACLE is used as a repository. Texts are put there for future extraction. ORACLE also works as a form of quality control, maintaining the integrity of unique ID's. The workflow for entering texts into ORACLE can be found at:

Admin manual

The application that the lexicographers use accesses the corpus through HTTP and CQP. The results are returned to the Delphi application as XML. This XML information is then used by the Delphi application to retrieve bits of the corpus from ORACLE (once again, using the SGML ID's). So the ORACLE data is never really searched by the lexicographers, since they work with the CQP system, but in the final step corpus material is integrated with ORACLE through the results that CQP returns to the Delphi application, which is a module in a larger system, the larger system that was the motivation for requiring ORACLE in the first place.
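In rough outline, the round trip looks like this (the URL, element names and column names are invented for the sketch; the real client is the Delphi application, not Python):

    import urllib.request
    import xml.etree.ElementTree as ET

    def s_unit_ids_from_cqp(query_url):
        """Run a CQP search over HTTP and pull the s-unit ID's out of the XML reply.

        Assumes the CGI wrapper returns hits along the lines of
        <hit id="3.4.3.2">...</hit>; the real element names may differ.
        """
        with urllib.request.urlopen(query_url) as reply:
            tree = ET.parse(reply)
        return [hit.get("id") for hit in tree.iter("hit")]

    def fetch_s_unit(conn, s_unit_id):
        """Fetch the stored SGML fragment for one s-unit from the repository by its ID."""
        sql = "SELECT sgml FROM s_units WHERE id = ?"
        return conn.execute(sql, (s_unit_id,)).fetchone()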

Current status

Almost all corpus documentation is kept routinely up to date on the web.

Texts
Distribution

Some nice things are discreetly presented that are rather unique, even internationally. The first link above looks like a mere bibliography, but if you follow one of the bibliographic links, you will be presented with a frequency list.

Every time a text is added to the corpus, the new vocabulary that it brings, compared with the existing corpus texts, is presented. In this way the corpus takes the first steps towards becoming what John Sinclair refers to as a "monitor corpus". I am not immediately aware of any other readily available corpus that does this.

The new vocabulary that each text brings with it, if you study it, reveals what the text is about. Another ambition is to formalize this information, the "aboutness", in order to create objective keywords for what the text is about. Keywords that an individual puts on a text are subjective. Two people will not provide the same keywords. That being the case, one can argue that the addition of manual keywords is more damaging, and hardly scientifically motivated, than useful.

The new vocabulary provides the seeds for allowing the individual texts to report what they are about, in an objective, repeatable manner.
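The computation behind "new vocabulary" is as simple as it sounds, which is exactly what makes it attractive as a seed for objective keywords. Roughly, in my own sketch rather than the production code:

    from collections import Counter

    def new_vocabulary(new_text_tokens, corpus_vocabulary):
        """Word forms in an incoming text that the existing corpus has not seen.

        Sorted by frequency within the new text, the top of the list
        already says a good deal about what the text is 'about'.
        """
        counts = Counter(w.lower() for w in new_text_tokens)
        unseen = {w: n for w, n in counts.items() if w not in corpus_vocabulary}
        return sorted(unseen.items(), key=lambda item: item[1], reverse=True)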