Administering the NO2014 corpus

Daniel Ridings, Unit for Digital Documentation, University of Oslo, Norway

NB: This document is being updated continually (sometimes every 15 minutes). It is literally work-in-progress.

This document will describe the administrative work that is required in order to import an SGML file into the concordancing system. It assumes a good deal of background knowledge that the corpus group already possesses, so it cannot simply be put in the hands of a beginner in that group. It will, however, help a senior member understand why they give certain instructions to a new member.

SGML tagging

The project uses the TEI system of tagging texts, but only a subset. Since the corpus is being used by lexicographers who are interested in language phenomena, we have limited our use of the TEI to tags that help identify such phenomena. The TEI permits a whole range of esoteric tags; we do not use them unless they make sense in a lexicographical context.

What is a lexicographical context? The assumption is that lexicographers are interested in 1) words and 2) the context they occur in. The immediate context is provided by a concordance. There are times when that context is too limited and the whole paragraph needs to be inspected, or even more context is needed, for instance when the whole paragraph is a short turn in a dialogue and anaphora need to be resolved.

The focus of this document is on administering the files that have already been tagged. The presentation will start schematically but move to more technical details later on.

In general, the simplest structure boils down to the following elements in a text:
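Pieced together from the elements discussed later in this document, the overall skeleton looks roughly like this (a sketch, not the full DTD):

<TEI.2>
  <TEIHEADER>
    ... bibliographical information, including the <NOTE> elements described below ...
  </TEIHEADER>
  <TEXT>
    <BODY>
      <DIV>
        <P>
          ... paragraph text, with <PB> page breaks and <HI> rendition tags;
          during import the text is segmented into <S> elements (s-units) ...
        </P>
      </DIV>
    </BODY>
  </TEXT>
</TEI.2>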

The header contains important information that is required if further sophistication is to be added to the corpus access software. The most important pieces of information are bibliographical: title, author, publisher, place and year. There is also a provision to classify texts according to a uniform system: medium (book, pamphlet, newspaper, journal) and type (spoken, written). A system based on the classification used in university library systems could also be implemented. The hooks (the potential) are there, but they have not been implemented.

The most important element for immediate use is the <NOTE> element within the <BIBSTRUCT> element of the <SOURCEDESC> element in the text header. It occurs twice: once to hold the NO-signatur and once to hold the signum used in the left-hand margin of the concordance. Many times these are the same, but not always. Dag og Tid, for instance, has a NO-signatur of DT, but it is cited differently according to the year it was published: DT 2002, DT 2001, DT 2004 etc. The same applies to Syn og Segn, SS, which also gets the year appended to it. Novels, on the other hand, keep the same form, since the signum itself is unique and refers to only one book.

These note-elements look like this:
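Roughly like this (the exact wording of the NO-signatur note may vary; "Abc" is a made-up signum):

<NOTE>NO-signatur: Abc</NOTE>
<NOTE>Conc-ref: Abc</NOTE>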

A journal or newspaper looks like this:
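Roughly like this, using Dag og Tid from the example above:

<NOTE>NO-signatur: DT</NOTE>
<NOTE>Conc-ref: DT 2002</NOTE>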

NOTE: In practice we have deviated from this rule. The first texts the group worked with were of the kind that varies from year to year. When the group moved on to novels, they continued, out of habit, to put the year after the signum. This "mistake" turned out to be desirable: the editors appreciated the information that immediately placed a work in a period of time. The signa for novels and books are not transparent or well-known enough to be dated at first glance. So in reality the Conc-ref ends up looking like this:
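Roughly (with a made-up signum and year):

<NOTE>Conc-ref: Abc 2003</NOTE>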

Whatever is written in the note with the Conc-ref is extracted and used in the concordance. A comma (,) and page number are appended when individual concordances are generated.

The text portions of the <NOTE> element should be written exactly as indicated: 1) the opening tag, 2) "Conc-ref:" followed by a space, 3) the signum and 4) the closing tag. The colon (:) in particular is used to pick out the various parts of the text and identify the signum. I have tried to write the routine so that it accounts for all of the deviant ways of providing this information that I have seen so far, but there are probably more. Avoid them if possible. There is not much room for creativity here.

Things to watch out for

Some issues I have noticed and will elaborate on later:

It helps (and saves time) if this is done logically: page breaks should fall between paragraphs, not at the end or beginning of them. When a page break falls in the middle of a paragraph, that is fine; just put it where it falls. But at the beginning or end of a paragraph, page breaks cause problems (in import-prepared-into-db.prl). The script 01-checkPB.prl, mentioned in the third step below, had to be written to create some consistency in this regard.

Another little adjustment that I end up having to do by hand is the scope of attributes. By that I mean: "How much of a text should be assigned a certain attribute?"

To be concrete: we assign rendition attributes to texts, and we use the same mechanism to signal that a word is a foreign word (when that information is already there). For example:
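Roughly like this (a made-up example; the first line shows the scope I prefer, the second the scope discussed below, where the comma has been swept into the element):

... et <HI REND="KURS">carpe diem</HI>, sa han ...
... et <HI REND="KURS">carpe diem,</HI> sa han ...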

I prefer not to set the punctuation inside the <HI> element. I will probably fix this in one of the scripts (prepare-sgml-for-db.prl). This note will remind me to do that.

Adding a text to the corpus

As it is now (31-08-2004) these scripts are hard-coded for file names and directories. For instance, 00-normalize-sgml-for-perl.prl takes a file name as an input parameter and creates a new file with the same name in a directory called "foo" one level down. The script called prepare-sgml-for-db.prl takes a file as input and creates a new file called "old-file.sgm-new" in the foo-directory. That is, it appends "-new" to the file name and saves it under foo.

The reason for this is historical and reflects the process of working out these steps and reviewing the results before moving on to the next step. This will be changed so that one script feeds into the next one until only one final file is created, the one that gets imported into ORACLE.

In the following it is assumed that the SGML files are available in ~/Projects/no2014/sgml and that the catalog file for SGML has been adjusted for your own circumstances. As it is, the file "catalog.sgml" informs the parser that the TEI dtd and accompanying files are installed at /hf/hedvig/muspro-u1/danielr/tei. If they are not (and they should not be, unless you are running these scripts as user "danielr"), then you will have to edit the catalog.sgml file. Just search and replace the above string with wherever you have installed the TEI files in your own directory structure.

Here follow the basic steps in the workflow. They will be commented on afterwards. These steps assume that 1) there are no footnotes in the text and 2) the text is referred to by NO-signatur and page number. The only text we have run across so far that is not referred to in this way is the Bible.

  1. Parse the SGML file. No errors are allowed (one exception; see below).
  2. Create a normalized SGML version
  3. Run the normalized version through 01-checkPB.prl
  4. Run the result of the previous step through 00-normalize-sgml-for-perl.prl
  5. Run the result through prepare-sgml-for-db.prl
  6. Check that the SGML-ids match the level (depth) of elements.
  7. Import the previous step into the database with import-prepared-into-db.prl

NOTE: Here and there I have mentioned a "to do". I have done that unsystematically. I know very well that there are several adjustments that must be made to the routines, so do not let the existing "to do":s imply that I am not aware of the others. I do not mind being reminded, so feel free to mention things you think of yourself. You may very well mention something I have not thought about.

The first 2 steps, nsgmls and sgmlnorm

Once someone gets comfortable with the routines, the first two steps could be combined. nsgmls is a parser. It checks that the document being parsed has been tagged in a syntactically correct way with respect to the TEI dtd:s and the customizations to them that have been created for NO2014. sgmlnorm parses the document just like nsgmls does. It even uses the same flags (-D ~/Projects/osv -c catalogue.sgml osv) on the command line. It differs only in the output it produces.

nsgmls is described here. We do not use its output, so I will not go into details. The fact that we do not use the output is reflected in the last parameter on the command line:
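Something like this (the file name and the exact -D path are placeholders for your own setup; the important part for this discussion is the redirection at the end):

nsgmls -D ~/Projects/no2014/sgml -c catalog.sgml file.sgm > /dev/null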

The >/dev/null sends the output of nsgmls into what is jokingly called the bit bucket. It is like sending things into the trash can, except that you never have to empty it. The output simply disappears "into thin air".

sgmlnorm, however, creates more useful output. We could just as well have used the output from nsgmls in the subsequent routines. We do not. The reason is that I want a normalized version of the files that have been worked on. By normalized, I mean that:

  1. I want files where all element tags have the same case (upper case).
  2. I want all values assigned to attributes to be enclosed in quotation marks: <HI REND="KURS">.
  3. I want all attribute names (REND) to be in upper case.
  4. I want matching closing tags for all opening tags.

I could ask everyone tagging these texts to try to conform to those rules, but it would be error prone. Everyone has enough to keep track of as it is. The problem is that none of the items on my wish-list are required to produce valid SGML. SGML does not require any of them. You can write tags in a mixture of upper and lower case at will. You do not have to enclose your attribute values in quotation marks, and SGML (more specifically, the TEI dtd) allows you to leave out many closing tags when they can be supplied implicitly. For example, if you start a paragraph with <P>, write some text and then start another paragraph with <P>, the parser can implicitly supply a closing tag, </P>, for the first one, since you cannot have paragraphs within paragraphs.

With one additional requirement, the above specifications would also define what is called well-formed XML. The additional requirement would be that empty tags be written in a certain way. So the requirements are not unusual; in fact, if you give it a little thought, you will see that if you meet these requirements, which are those of XML, you do not really need to follow the TEI dtd. The only things that are really required are those listed above, with one exception: the empty tags. I will get to that when I describe what is done in step 4, 00-normalize-sgml-for-perl.prl. So with a slight change in one of the scripts, you could simply work with well-formed XML.

So why this lengthy introduction to step 2, sgmlnorm? For the simple reason that sgmlnorm does all of this for you. If you recall the output of nsgmls, you will remember that we sent the output to the bit-bucket. We just tossed it out. We were only interested in ascertaining whether or not the document was syntactically correct and if it was not, we wanted to know where to find the erroneous lines of text.

When it comes to sgmlnorm we want to save the output. So instead of sending the output to nowhere, we send it to another file. I usually append "-norm" to the main filename just so I know what I am working with:
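Something like this (file names and the -D path are placeholders, as above):

sgmlnorm -D ~/Projects/no2014/sgml -c catalog.sgml file.sgm > file-norm.sgm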

That will parse the file again, rewrite the file according to the requirements I set out above and output the new version to file-norm.sgm.

01-checkPB.prl

This script tries to clean up the page breaks to avoid the undesired placements described above under Things to watch out for. It also tries to move a <PB> tag from the middle of a hyphenated word to after the second part of the word. The consequence of this is that the whole word is then marked as being on the first page, even though it is split between pages. I have not considered this a problem, just a slight technical inaccuracy.

00-normalize-sgml-for-perl.prl

What this script does is place all tags on lines by themselves, or remove them entirely. The ones that are removed are the <HI REND="XYZ"> and </HI> tags. The information is not lost: the value of the attribute is saved and appended to the word that the tag applied to. Ie:
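For example (a made-up line; "Beowulf" is just a stand-in for any italicized word):

Before:  Han leste <HI REND="KURS">Beowulf</HI> i fjor.
After:   Han leste Beowulf//KURS i fjor.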

This is the routine that can have trouble later on if you put punctuation within the "highlight". Further down the road this will happen:
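If the tagging was <HI REND="KURS">Beowulf?</HI>, with the question mark inside the element, the normalized text reads Beowulf?//KURS, and after tokenization the two tokens come out roughly as:

Beowulf
?//KURS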

That is, the italics attribute will be assigned to the question mark but not to the word. Now that I know about this problem (there are related problems) I'll add the necessary logic into the script.

The script does a few more things to make life easier down the road. It:

prepare-sgml-for-db.prl

This script is the workhorse. It assumes that it will be able to find all text on lines by themselves; if all elements occupy their own lines, then the text does too. Until it gets past the <TEIHEADER>, it does nothing with the text other than save it.

One of its major tasks is to assign unique SGML ID:s to every single element and token in the document, in such a way that the ID:s reflect the position of any given token or element in the document hierarchy (the "document tree"). This information is what is used when the user requests more and more context. Technically, they move higher and higher up the document tree.

What does the script do with the text once it is past the header? Just about everything. It reads a chunk of text at a time and sends it over to an external program (nor-tokenizer).

nor-tokenizer does what its name suggests and, in addition, segments chunks of text into orthographic units ("sentences"). It returns an array of text lines, each one being an <S> (s-unit). Every single linguistically motivated element of a text is returned in isolation from other elements. Sentence-terminating punctuation is returned on its own. Similar punctuation (the punctuational equivalent of a "homograph") that is not sentence-terminating is kept where it belongs. Ie, cit. ca. dok. fig. sk. s.k. pga. p.g.a are returned as they are: the punctuation is part of the linguistic unit, the abbreviation, and not a sentence-terminating period. Special attention is paid to instances where such abbreviations as etc. m.fl. osv. o.s.v. cm. evt. f.Kr. km. o.l. o. l. o. lign. occur at the end of a sentence.

nor-tokenizer does quite a lot, and it grows with usage. Texts from the 1800's have added a wealth of abbreviations to those in use today; these must be taken into account if the older texts are to be tokenized and segmented (split into orthographic sentences). It is written in LEX (actually FLEX, the GNU version of Lex). It has been growing steadily ever since the middle of the 1990's. It started out as swe-tokenizer.

Tokenization and segmentation result in two more elements being added to the document: the <S> element (sentence) and the <W> element (word). This last one is not explicitly written out. Every token within an <S> element represents a <W> (word) or <C> (character, usually punctuation) element. The database structure keeps them unique, and the element tags can be mapped to them if ever needed. This is the lowest level of the document tree: the word or punctuation mark. These are uniquely identified and various attributes can be assigned to them. Attributes are things like graphical representation (italics, bold, etc.) or linguistic information (part of speech, lemma, etc.).

If you ever have a problem matching up the SGML ID:s with the individual elements (that is, if the ID hierarchy does not match the document tree; I will get to that below), then the source of the problem was probably here. It is usually caused by page breaks that have been encoded at the very beginning or very end of paragraphs (instead of between paragraphs), or by a new empty element that has been introduced into the texts without telling this program about it. Remember, all the elements have been put onto lines by themselves by 00-normalize-sgml-for-perl.prl above. This program will pick up an opening element, go one step deeper in the document structure (a new element leads you down the document tree), create ID:s to reflect this structure, and not bump the ID:s one step back up until it finds a matching closing element.

A disadvantage with SGML, compared to XML, is that empty elements (elements with no closing tag) look just like normal paired elements. The only way to know if an element is empty or not is to read through the DTD. The perl scripts here do not read through the DTD. The text files have already been parsed, so the assumption is that there are no errors. The text files are assumed to be syntactically perfect. There is no reason to parse them time and time again.

When this particular script runs across an empty tag on a line, it has to know that it will not be finding a closing tag. Empty tags are not reflected in the document structure. They are usually just milestones. Page numbers are milestones: they do not reflect the internal structure of a document the way headings and paragraphs do; they just signal that, in this particular edition, the page ended here. As a result, empty tags receive their own sequence of ID numbers. They simply start with 1 at the beginning of a document and continue sequentially. The sequential number together with the document number, which is unique, ensures that the resulting ID:s will be unique, for example <PB N="85" ID="E-27252-33">. An "E-" is prefixed just to signal visually that the ID number refers to an empty element.

Since SGML does not have an easy mechanism to identify empty elements, the script needs to be told explicitly what they are. That way, when it runs across them it can treat them separately from elements with matching closing tags and keep the ID numbers synchronized with the document tree. If the tagging ever requires a new empty element, this script should be edited. The relevant line is:
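It is a string of element names, roughly of the following form (the variable name is made up; <PB> is the only empty element discussed in this document):

my $empty_elements = " PB ";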

The name of the element, in upper case, should be added to that string, surrounded by spaces.

If changes are made to the script, it should then be committed to the revision control system. More on that later.

Check that the SGML-ids match the level (depth) of elements.

The script prepare-sgml-for-db.prl adds ID attributes to each element in the document. The ID string gets longer the deeper you go into the document: 1.1.3.4.5 (book 1, chapter 1, paragraph 3, sentence 4, word 5), and shorter as you go back up: 1.2 (book 1, chapter 2). There should always be closing elements that match the opening elements; sgmlnorm has made sure that even those that can optionally be closed have been closed explicitly. A document begins with <TEI.2> and ends with </TEI.2>. In the present simplified example the opening element will have ID=1, and as we work through the document, opening elements take us deeper into the structure (1.1.3.4.5) and closing elements take us back up the document tree (1.2); by the time we reach </TEI.2>, we should be back at 1.

It is not as complicated as it sounds, and it is easier to visualize by following the steps one would normally take. All we really have to do is look at the end of the text. If the ID:s and elements of the last few lines match, then it is very probable that they match everywhere before. It is conceivable that two earlier errors could balance each other out, but I cannot come up with a case where that could happen that would also slip past the parsing we did in the first two steps.

The easiest way is to perform the unix command 'tail' on the file that was created by prepare-sgml-for-db.prl. The new file will have the name of the input file with "-new" appended. 'tail' prints out the last 10 lines (by default) of the file you run it on. So 'tail fk_d7kF.sgm-new' results in:

hedvig ~/Projects/no2014/scripts/corpus-admin/work/foo $ tail fk_d7kF.sgm-new 
<S ID='27252.2.1.1.17.3.33.2'>
Sluttord
</S>
</P>
</DIV>
</DIV>
</DIV>
</BODY>
</TEXT>
</TEI.2>

We need at least one opening element so that we can see the ID number. We're only interested in the last opening element and that's what we got here. Sometimes you might need more than 10 lines at the end. In that case 'tail -20 xyz-new' will give you the last 20 lines.

So we have an ID number assigned to an <S> element: 27252.2.1.1.17.3.33.2. We also have a number of closing elements. If we chop off a period and a number from the ID number every time an element closes, we should end up with 27252 and nothing else by the time we get to </TEI.2>. We do not want to run out of numbers too early or have numbers left over when we get there. The top element, the <TEI.2> element, will have an ID of 27252 and that is what we want for the closing one as well.

So the <S> element is 27252.2.1.1.17.3.33.2, and every closing element chops one piece off the end of the ID:

</S>     removes '.2'   leaving 27252.2.1.1.17.3.33
</P>     removes '.33'  leaving 27252.2.1.1.17.3
</DIV>   removes '.3'   leaving 27252.2.1.1.17
</DIV>   removes '.17'  leaving 27252.2.1.1
</DIV>   removes '.1'   leaving 27252.2.1
</BODY>  removes '.1'   leaving 27252.2
</TEXT>  removes '.2'   leaving 27252

That brings us down to the closing </TEI.2> with only the document number left. Exactly what we wanted. Everything is fine.

What we have done in the preceding paragraph is "walk up the document tree". We went from the sentence to the paragraph, from the paragraph to a sub-section (<DIV>), from that one to one higher up, and then to one higher up still. The same mechanism, using the ID numbers, is used in the concordance program to give progressively more context. So the synchronization of ID numbers with elements is imperative if the routines that give the user progressively more textual context are to work.

import-prepared-into-db.prl

If everything has gone well so far, then the last step, actually importing the data into Oracle, can be taken. A lot of things are going on behind the scenes. Most of the pre-processing has already been performed and now it is time to populate the tables. The most important are WORDTYPE, OCCURRENCE and TEXT. These are the minimum required for the corpus system to work.

WORDTYPE is a table that records each distinct graphical form of a word in the corpus. There is only one entry for each form, and the table is not case sensitive: 'Og', 'og' and 'OG' share a single entry. This is accomplished by folding all upper case into lower case. Each form receives a unique ID.

OCCURRENCE, on the other hand, records each individual occurrence of all of the graphical forms in WORDTYPE. Punctuation is a graphical form, so it will be found here as well. The actual forms are not recorded, only the ID:s from the WORDTYPE table. Each row of data in this table contains: 1) its own unique ID (rarely used), 2) an ID relating the row back to an entry in the WORDTYPE table, 3) an ID relating this row of data to the TEXT table, to the row where the data for an orthographic sentence (s-unit) is stored, 4) positional information, that is, the position in the s-unit where this particular form is found, and 5) the page number where this token is found. Sentences can cross page boundaries, so it is not enough to know what page the sentence is found on.

There are some other things done here as well such as populating the ATTRIBUTE table with information about words that are in italics, have part of speech, have an assigned lemma etc. I will not go into that here.

The basic logic is:
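In rough, simplified Perl it amounts to something like the sketch below. This is not the actual script: the column names apart from T_ID and SGML_ID, the connection details and the counter start values are assumptions, and the handling of structural tags, //KURS attributes and the ATTRIBUTE table is only indicated in the comments.

use DBI;

# A hedged, simplified sketch of import-prepared-into-db.prl, not the real code.
my $dbh = DBI->connect('dbi:Oracle:', 'user', 'password',
                       { AutoCommit => 0, RaiseError => 1 });

my $get_wt  = $dbh->prepare('SELECT ID FROM WORDTYPE WHERE FORM = ?');
my $ins_wt  = $dbh->prepare('INSERT INTO WORDTYPE (ID, FORM) VALUES (?, ?)');
my $ins_occ = $dbh->prepare(
    'INSERT INTO OCCURRENCE (ID, WT_ID, T_ID, POS, PAGE) VALUES (?, ?, ?, ?, ?)');
my $ins_txt = $dbh->prepare(
    'INSERT INTO TEXT (ID, DOC_ID, SGML_ID, CONTENT) VALUES (?, ?, ?, ?)');

# Start values: presumably the current totals in the three tables
# (cf. the "WT: ... OCC: ... TEXT: ..." line in the output shown further down).
my ($wt_max, $occ_max, $text_max) = (643376, 21019088, 9142562);
my ($doc_id, $page) = (27252, 1);          # example values
my ($sgml_id, @tokens);

while (my $line = <>) {                    # the prepared *.sgm-new file
    chomp $line;
    if ($line =~ /^<PB N="(\d+)"/) {       # a milestone: just remember the page
        $page = $1;
    }
    elsif ($line =~ /^<S ID='([\d.]+)'>/) {   # an s-unit begins
        $sgml_id = $1;
        @tokens  = ();
    }
    elsif ($line =~ m{^</S>}) {            # s-unit complete: write it out
        my $t_id = $text_max + 1;          # the TEXT row this s-unit will get
        my $pos  = 0;
        for my $token (@tokens) {          # OCCURRENCE is populated first ...
            my ($word, $attr) = split m{//}, $token;  # word//KURS -> ATTRIBUTE (not shown)
            my $form = lc $word;           # WORDTYPE is folded to lower case
            $get_wt->execute($form);
            my ($wt_id) = $get_wt->fetchrow_array;
            unless ($wt_id) {              # a graphical form we have not seen before
                $wt_id = ++$wt_max;
                $ins_wt->execute($wt_id, $form);
            }
            $ins_occ->execute(++$occ_max, $wt_id, $t_id, ++$pos, $page);
        }
        $ins_txt->execute(++$text_max, $doc_id, $sgml_id, "@tokens");
        $dbh->commit;                      # ... and COMMIT only after TEXT
    }
    elsif ($line =~ /^</) {
        # other structural tags get their own rows in TEXT as well (not shown)
    }
    else {
        push @tokens, $line;               # a token line inside the s-unit
    }
}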

Various other things are done in the process, such as picking attributes off individual tokens. These have been appended to a token in an earlier step, separated by '//', ie: word//KURS.

If all goes well, you just get your prompt back. If it does not go well, you might get something like this:

hedvig ~/Projects/no2014/scripts/corpus-admin/work $
../import-prepared-into-db.prl foo/fk_knivenFF.sgm-new 
WT: 643376 OCC: 21019088 TEXT: 9142562 DOC: 
DBD::Oracle::st execute failed: ORA-03113: end-of-file on
communication channel (DBD ERROR: OCIStmtExecute) at
../import-prepared-into-db.prl line 199, <> chunk 1.
DBD::Oracle::st execute failed: ORA-01041: internal error. hostdef
extension doesn't exist (DBD ERROR: OCIStmtExecute) at
../import-prepared-into-db.prl line 202, <> chunk 1.
DBD::Oracle::st execute failed: ORA-01041: internal error. hostdef
extension doesn't exist (DBD ERROR: OCIStmtExecute) at
../import-prepared-into-db.prl line 185, <> chunk 1.
Segmentation Fault
hedvig ~/Projects/no2014/scripts/corpus-admin/work $  
What do we do now?

What can happen is that the connection between the machine running the import script and the server running ORACLE is broken. This does not happen often, but it does happen (mostly on weekends and at night). It usually happens after a few hours, when a couple of hundred pages have been imported. We cannot just restart the script; we would run into all kinds of problems with conflicting ID:s when we try to put text into the database that is already there.

So ... what do we do?

We need to know what the last piece of information that went into the database was. Remember, we will run into error messages if we try to put in duplicate information. So what we want to do is start where the program left off before it got interrupted.

Start PLSQL Developer and set USD_LEKS_KORPUS as the current schema.
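If you are working from an SQL prompt rather than the PL/SQL Developer GUI, the equivalent would be something like:

ALTER SESSION SET CURRENT_SCHEMA = USD_LEKS_KORPUS;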

Now remember, the rows in the OCCURRENCE table are populated before we write to the TEXT table (there may be <PB> tags inside the orthographic sentence, and from those we just want the page number; we do not save the tag the way we do when it occurs between structural elements in the text. We also want to pull the //KURS information off words that have it and populate the ATTRIBUTE table, without letting it pass through to the TEXT table). If the connection with the database failed while the OCCURRENCE table was being populated but before the TEXT table was written, then we will have references in OCCURRENCE to a non-existent row in the TEXT table. (In practice this cannot happen, since a COMMIT is not performed until the TEXT table has been populated.)

We want to find out the last row of data in the TEXT table, ascertain its ID and make sure that no rows in the OCCURRENCE table refer to an ID higher than that.

We can do this by simply listing the two tables in reverse order (the most recent last). Since these can get to be pretty big (they contain the whole corpus) it would be better to list only the rows in these two tables that belong to the last work that was entered into the database, the one we were working with that got interrupted.

First we figure out what the document ID was. This will give us something to narrow down our presentation of data. We can get this by looking at the DOCUMENT table sorted in descending order (most recent first). This can be seen in the following illustration:
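The query behind such a listing is roughly the following (SHORTREF is discussed further down; the other column names are guesses at the schema):

SELECT ID, SHORTREF
FROM   DOCUMENT
ORDER  BY ID DESC;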

We can now use the ID for the last document that was added to the corpus, the first line, in order to list those lines. Once again, we are only interested in the last few lines. We perform a new search in the database and get this:
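Roughly like this, with the document ID from the previous step supplied as a bind variable (DOC_ID and CONTENT are guesses at the column names; ID and SGML_ID are referred to below):

SELECT ID, SGML_ID, CONTENT
FROM   TEXT
WHERE  DOC_ID = :doc_id
ORDER  BY ID DESC;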

We have to read the listing from the bottom up, since the most recent lines are at the top. We can see that one s-unit was closed (text ID 9159624, the second line) but the last line is an incomplete element, text ID 9159625. The s-unit is opened, but that is where the connection with the database appears to have been broken. We cannot, however, be sure. Remember that the OCCURRENCE table is populated before the TEXT table, so it is possible that the connection was lost while the OCCURRENCE table was being populated but before the TEXT table was written. This is not likely, since a commit is only performed once the TEXT table has been updated, but there is no harm in being sceptical and looking to make sure.

Remember that there is a column in the OCCURRENCE table, T_ID, that relates a row of data to a row in the TEXT table. We now want to make sure that there is no reference in OCCURRENCE to a row of data in TEXT that never made it there because of the interruption. There is no need to look at all the data in OCCURRENCE in reverse order; that would require the DBMS to monkey around with over 20 million rows of data. We can use the T_ID as a criterion to limit the number of rows we have to look at. We are really only interested in the last one.
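A check along these lines is enough (a sketch; in practice one might also list the last few rows):

SELECT MAX(T_ID)
FROM   OCCURRENCE;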

We can now see that the last T_ID points at 9159623 in TEXT and that row is there. We don't need to remove any rows in OCCURRENCE.

We still have a problem though. We want to rerun the program, skipping everything that has already gone in. The easiest way to pick up where we left off is by running the program as usual, but doing nothing when it comes to entering data into the database until we get to a given SGML ID.

If we now look at the TEXT table, the last SGML ID there is incomplete: 27255.2.1.1.12.3.5.6. So we want to start with that one but part of it, the opening tag, is already there. What we will do is remove the row with ID 9159625 containing SGML_ID 27255.2.1.1.12.3.5.6 and instruct the program to start there. If we did not remove it, we would get an error since that SGML_ID is already there and an attempt to write it again would create a duplicate. Duplicate SGML ID:s are not allowed.

First we remove the line:
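In SQL terms, roughly (the ID is the row discussed above; the incomplete row carrying SGML_ID 27255.2.1.1.12.3.5.6):

DELETE FROM TEXT
WHERE  ID = 9159625;

COMMIT;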

Then we use a slightly different version of the import script, continue-import-prepared-into-db.prl, and change one line. From:

 if ($id eq "27213.2.1.1.1.20.129.4.7") {
or whatever the old value was. The easiest way to find it is to search for:
$continue = 1;
The line that needs to be changed will be the line before that one. We change the line to the SGML ID that we want to start with. In our case:
 if ($id eq "27255.2.1.1.12.3.5.6") {
After that change, we run the script on the same text. It will not start populating the database until it gets to the material that did not make it through the first time.

Making small corrections

From time to time small mistakes are made during the tagging phase and these should be corrected. For instance, the NO-signatur or conc-ref field in the text header might be encoded incorrectly. The abbreviation might be wrong or perhaps the wrong year might have been entered.

The information that goes into the DOCUMENT table is taken directly from the TEIHEADER of the sgml document. In turn, this information is used every time the user brings up a concordance. The TEIHEADER is also the source of the bibliographic information in the web pages for texts that have been added to the corpus.

So two corrections have to be made. The first is in the TEIHEADER, since errors there will be propagated into everything that is generated from it. The DOCUMENT table also has to be corrected; that table is generated only once, when the text is imported into the Oracle database system.

For the sake of demonstration, let us assume that NordliF 1999 was erroneously entered into the TEIHEADER instead of the correct NordliF 2000. This will propagate to the SHORTREF column in the DOCUMENT table which will, in turn, result in the wrong reference information in the concordances. It will also show up in the various informational pages describing the texts recently added since they are generated from TEIHEADER. It will not help to simply correct the relevant web pages because they are regenerated every time a new text is added.

We want to:

  1. modify the entry in the DOCUMENT table (the SHORTREF column)
  2. modify the text line in the TEIHEADER

We will start with the DOCUMENT table, which has the advantage of providing us with the document-id so that we can limit our search through all the text in the corpus to just the relevant document. In this example we are using PL/SQL Developer again.

We now want to change the erroneous text to the correct text, from NordliF 1999 to NordliF 2000:
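In SQL terms, roughly (6936 is the document-id we just looked up in the DOCUMENT table; the ID column name is assumed):

UPDATE DOCUMENT
SET    SHORTREF = 'NordliF 2000'
WHERE  ID = 6936;

COMMIT;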

The document-id, 6936, can be used to limit our search through the whole corpus for the line that has to be corrected in the TEIHEADER.