Beyond the Edition: On the Linguistic Annotation of Vernacular Texts

Preparing an edition of a classical or medieval text is more often than not a long-term project for an editor. The aim of the exercise is the constitutio textus, the bringing together of divergent and often contradictory sources, ideally into a unified whole. The edition provides a point of entry to the text, making the text accessible in its variety, although at the same time it incorporates the editor’s critical understanding of the sources and his or her recommended routes through them. The publication of the edition is not necessarily the end of the editorial work, but beyond this point, the editor will have the company of other contributors in the form of scholars who draw attention to specific aspects of the text by commenting upon it, by annotating it in various ways, or by contextualising it more generally. The focus of this chapter will be on annotation, and specifically on the linguistic annotation of medieval texts. Although the examples given here are drawn from the Nordic vernaculars, the principles are in the main the same for other language families, and there are in fact several annotation projects that span a wide variety of texts that are diverse with respect to their language, their provenance, their dating and their contents. After an initial discussion of the representation of manuscript text, looking in particular at vernacular texts, the chapter will move on to two central types of linguistic annotation, those of morphology and of syntax. While many projects concerning medieval texts have been annotating morphology, there are far fewer that have included syntactic structures in their annotation.

In a digital workflow, annotation is so interwoven with the encoding of a text that it may be useful to see annotation as an integral part of the whole editorial process. The annotation establishes a link to a number of external resources such as dictionaries and grammars, and it makes the text accessible for a broader user base. As I will argue below, morphological and syntactic annotation is not a strictly linguistic endeavour. It also means that literary and historical investigations of a much wider scope can be conducted.

Multi-level representation of a text
Some works are only preserved in a single manuscript, a situation that is probably more common in medieval than in classical literature, and this is certainly the case in Nordic vernacular literature. The majority of the Eddic poems, for example, have been handed down to us in a single manuscript, and the same goes for many other of the most prominent early Nordic works. 1 A work preserved in a single manuscript makes life simple for the editor, but whenever a work has been preserved in more than one manuscript, the editor has to decide whether the various manuscripts should be transcribed in extenso or whether some of them, perhaps all of them except the Leithandschrift, should be reflected solely through variants in the apparatus of the edition. As for the actual representation of the source, the text may be represented with various degrees of fidelity, ranging from a close reproduction of its graphical form to an extensive regularisation of its orthography. In the encoding of vernacular sources, I suggest that three focal levels can be identified along this axis: a facsimile level, a diplomatic level and a normalised level. 3 This multi-level rendering can be illustrated with a short extract from the Old Norwegian Homily Book, with my rather literal translation into English at the end: Sꝩa ſem drıupanda [< drıupande] hunang ero ꝩarrar poꝛt kono. 7 bıartara ꝩıð ſmıoꝛꝩı hálſ haennaɼ. en hınır aefſto lutıɼ haennaɼ ero bıtrıɼ ſem aeıtr. 7 o lyfıan. ok hꝩaſſer ſem tíu aeggıat ſꝩaerð. fǿtr hennaɼ ſtıga nıðr til dꜵuða. 7 lıggıa gꜵtuɼ hennar tıl haelꝩıtıſ.
Like dripping honey are the lips of a harlot, and her throat is brighter than oil. But her end is bitter as wormwood and poison, and sharp as a two-edged sword. Her feet lead down to death, and her road to hell.
This admonition, which ultimately is a faithful rendering of Proverbs 5.3-5, was quoted in Ch. 18 'De castitate' of Alcuin's De virtutibus et vitiis, written between 801 and 804. 4 Alcuin's immensely popular text was translated into Old Norwegian around 1200 and became part of the Homily Book.
The three levels exemplified above can be seen as three perspectives on the same passage in the text. They differ in what might be termed granularity, i.e. the degree to which they adhere to the source. On the facsimile level, all characters, including diacritical marks and abbreviation signs, are copied in their position along the base line. On the diplomatic level, a smaller number of characters are used, so that allographs, such as the round and the straight 'r', are represented by a single character, 'r', and abbreviations are expanded. On the normalised level, the orthography is regularised according to the standard grammars and dictionaries of the language. The latter level is in many respects unique for Old Norse (i.e. Old Icelandic and Old Norwegian) texts; there are no comparable standard orthographies for most other European vernaculars. Even for medieval European vernaculars without a standard orthography, however, a certain degree of regularisation is not uncommon, such as the introduction of punctuation (often according to modern rules), the capitalisation of proper names and the first word after a full stop, and perhaps the ironing out of minor spelling variations. It should be underlined that the three levels illustrated here are not variants of the text, since variants would be quoted on the same level of granularity, and there are no variants unless there are at least two manuscript witnesses of the same passage. The levels exemplified here are representations of a single source. They are alternative ways of seeing and representing a specific manuscript rather than a work as such.
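The relation between the facsimile and the diplomatic level can be sketched as a simple character mapping. The following is a minimal illustration in Python, not the procedure used by any particular archive: the table `ALLOGRAPHS` and the function `to_diplomatic` are hypothetical names, and only the merging of the variant forms of 'r' and of the Insular 'v' is taken from the discussion in this chapter. The treatment of long 's' and dotless 'i' is an assumption added for the sake of the example, and the expansion of abbreviations is left out entirely.

```python
# Hypothetical allograph table: each facsimile-level character is mapped
# to the single character that represents it on the diplomatic level.
ALLOGRAPHS = {
    "ꝩ": "v",  # Insular form of 'v'
    "ɼ": "r",  # variant form of 'r' as found in the extract
    "ꝛ": "r",  # round 'r'
    "ſ": "s",  # long 's' (assumption, not stated in the chapter)
    "ı": "i",  # dotless 'i' (assumption, not stated in the chapter)
}

# str.maketrans accepts a dict of single characters to replacement strings.
TABLE = str.maketrans(ALLOGRAPHS)


def to_diplomatic(facsimile: str) -> str:
    """Merge allographs; abbreviations are not expanded in this sketch."""
    return facsimile.translate(TABLE)


for word in ["Sꝩa", "ꝩarrar", "haennaɼ"]:
    print(word, "->", to_diplomatic(word))
```

A mapping of this kind only runs in one direction: the diplomatic form can be derived from the facsimile form, but not the other way round, which is one reason for recording the levels side by side.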
As can be seen from the square brackets in the Old Norwegian transcription above, there are several comments and corrections that the editor might wish to add to the text. The first is visible on the facsimile level and points to the fact that the scribe had obviously corrected an 'e' to an 'a' in the third word of the first line. On the diplomatic level, there is another comment regarding the form of the comparative biartare 'lighter, more luminous', which is spelt biartara in the manuscript. However, this form is wrong according to Old Norse grammars, which specify the plural form of the comparative as bjartari, or, in the case of a manuscript with vowel harmony, as bjartare. 5 At some point, the editor might want to correct the text here, although not necessarily on the diplomatic level. In fact, this is an early reflection of the merger of the endings of comparatives that took place in Old Norwegian during the thirteenth century, and should probably be transcribed as such. Finally, the form tíu in the manuscript has been analysed as a metathesis of tuí 'double', part of the adjective tvíeggjaðr 'two-edged'. This correction seems obvious and also helpful for the users of the edition. One of the strengths of the multi-level type of editing exemplified here is that it allows uncorrected and corrected text to live side by side, and it can also draw a useful distinction between what may be termed scribal interventions, such as the correction driupande > driupanda, and editorial interventions, such as tíu > tuí. At the facsimile level, the text is rendered "as is", in an uncorrected state, apart from corrections made by the scribe himself, while on the diplomatic level, and even more so on the normalised level, editorial interventions are allowable. Furthermore, this division into levels has ramifications for the annotation of the text, especially, as we shall see below, with respect to linguistic annotation.
5 Vowel harmony was a characteristic trait of Old Norwegian and meant that the height of an unstressed vowel in the ending of a word was controlled by the height of the stressed root vowel. For example, in a word with a high root vowel, such as líf n. 'life', the unstressed vowels would also be high, lífi and lífum, while in a word with a non-high root vowel, such as lof n. 'praise', the unstressed vowels would also be non-high, lofe and lofom.
For classical scholars, the focus on minute variation may seem odd. Why immerse oneself in the accidentals of a text, when there are substantives to behold? 6 The answer lies in the fact that vernacular texts are important sources for the history of the early stages of a written language. When the provenance and dating of a source have been established, the orthography has its story to tell. Often, it is a conflated story, since the orthography of a manuscript has to be understood in the context of the orthography of the exemplar, the linguistic norm of the scribe, and, sometimes, even the intended audience. 7 In order to use the orthographical representation of any vernacular manuscript, these influences need to be identified and isolated. In some cases, this can be done with a high degree of certainty. In other cases, the language of the source appears inconsistent, which usually is understood as a conflict of norms, between the copy and the exemplar, or between the orthography of the manuscript and the internalised orthographical norms of the scribe. A normalisation of the orthography will remove these traces of norm conflict and lessen the source value of the manuscript. This might be fine for some scholars, but certainly not for those who are trying to extract the linguistic norm from it. In the model discussed above, the facsimile level and to some extent the diplomatic level offer the necessary level of granularity here, while the normalised level moves the text closer to a representation of the work and away from the source itself.

Encoding procedures
Most editorial projects nowadays will situate their work in an open, digital environment, encoding their texts in an interchangeable format. In recent years, this has become more or less equivalent with XML, Extensible Markup Language, an open and stable format for a variety of texts. 8 This format was chosen for the archive that the author of this chapter heads, the Medieval Nordic Text Archive. 9 A great number of other text archives also use XML, and many follow the guidelines set up by the Text Encoding Initiative (TEI). 10 It has to be admitted that texts encoded in XML look forbidding as they stand. While the raw XML is not intended for most users of an archive, the editor must nevertheless understand and be comfortable with the basics of it. The upside is that when a text has been encoded in an open and to a great extent self-documenting format like XML, it will be accessible to a wide range of users hopefully over a very long period of time. An XML file is a straightforward text file, as simple as they come, and it will be readable as long as basic, unformatted text files can be read.
The Guidelines published by the Text Encoding Initiative, now in their fifth version, specify the encoding of a wide variety of sources, prose as well as poetry. 11 However, in our experience, handwritten medieval sources require a number of additional specifications. The Menota Handbook is intended to supplement the TEI Guidelines, explaining and exemplifying how XML encoding as recommended by the Text Encoding Initiative can be used for the specific purpose of encoding vernacular medieval documents. 12 For a linguistic annotation of a text, there are two categories that need to be clearly marked up: sentences and words. In XML, the <s> element groups each sentence, and the <w> element each word. To this pair of elements, a <pc> element should be added for punctuation characters, such as full stop, comma, colon and the like. Each element states its opening, e.g. <s>, and its end, e.g. </s>. So, this is how a sentence is contained in the <s> element, and the words in the <w> element and punctuation in the <pc> element: <s> <w>This</w> <w>is</w> <w>a</w> <w>sentence</w> <pc>.</pc> </s> A multilevel edition can be encoded as parallel readings for each word, using elements such as <facs>, <dipl> and <norm> for the three levels exemplified above. 13 The very first word would receive the following encoding: <w> <facs>Sꝩa</facs> <dipl>Sva</dipl> <norm>Svá</norm> </w> On the <facs> level, the usage of an Insular form of the 'v' is recorded. On the <dipl> level, this character is merged with the ordinary 'v', and on the <norm> level, an accent is added to the vowel to indicate that it is phonemically long. In order to make the multi-level structure explicit, the element <choice> states that the contents of this element are alternatives: <w> <choice> <facs>Sꝩa</facs> <dipl>Sva</dipl> <norm>Svá</norm> </choice> </w> It should be underlined that the encoding examples given here are not meant to be typed, character by character, by a transcriber. They are the representations of transcriptions that would be done by various input methods. 14
The actual encoding of a text on several levels is no guarantee that it can easily be displayed in a manner that is accessible for any non-technical reader. In the Menota archive, the display is based on the Corpuscle application. 15 As shown in Fig. 2, this application allows a text to be displayed at up to three parallel levels, including a photographic facsimile. The Corpuscle application is used for several other archives, some of which are rather close in scope and format to Menota, such as the Georgian National Corpus, covering Old, Middle and Modern Georgian. 16 While the characters on the normalised level and largely on the diplomatic level can be rendered by almost any font, many of the characters on the facsimile level require specialised fonts. Until a few years ago, this meant that users of the archive had to install a font containing the necessary characters. Such fonts have been offered by the Medieval Unicode Font Initiative since 2004, and several of these fonts can be downloaded, installed and used free of charge. 17 In spite of the availability of free fonts, Menota could not offer a "plug and play" solution, and some users were surely put off by missing characters, boxes or question marks in the web display. It was a great step forward when the Web Open Font Format (WOFF) was officially launched in 2012. 18 This means that any recent browser can display all necessary characters on the fly, irrespective of which fonts happen to be installed on the user's computer.
14 The Menota Handbook exemplifies this in its Tutorial, which was introduced in v. 3.0 of the handbook, https://www.menota.org/HB3_T1.xml.
15 The Corpuscle application is a corpus management system for annotated texts developed
Having established this simple (although admittedly verbose) model of representing sentences and words on more than one level in XML, the next question is how to enhance the text with additional information.
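To indicate how such a multi-level encoding can be put to use, the following minimal sketch in Python reads one level at a time out of a small XML fragment. It is an illustration under stated assumptions rather than a description of the Menota tools: the fragment uses the simplified element names of this chapter (an actual archive file would carry namespaces and a header), the two words are taken from the extract but do not form a complete sentence, and `SAMPLE` and `read_level` are hypothetical names. Only the standard library module xml.etree.ElementTree is used.

```python
import xml.etree.ElementTree as ET

# Simplified fragment: two words of the extract, each with three readings.
SAMPLE = """
<s>
  <w><choice><facs>Sꝩa</facs><dipl>Sva</dipl><norm>Svá</norm></choice></w>
  <w><choice><facs>ꝩarrar</facs><dipl>varrar</dipl><norm>varrar</norm></choice></w>
</s>
"""


def read_level(sentence_xml: str, level: str) -> list[str]:
    """Return the words of one <s> element on the requested level."""
    root = ET.fromstring(sentence_xml.strip())
    words = []
    for w in root.iter("w"):
        # Each <w> holds a <choice> with one child per level.
        reading = w.find(f"choice/{level}")
        if reading is not None and reading.text:
            words.append(reading.text)
    return words


for level in ("facs", "dipl", "norm"):
    print(level, read_level(SAMPLE, level))
```

The same file thus yields a facsimile, a diplomatic or a normalised text depending on which reading is selected, without any of the levels being stored separately.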

Annotating a text
There is a plethora of textual features that can be identified and annotated: motifs and themes, rhetorical devices, names of persons and places, stylistic features, metrical properties, allusions to or readings from other texts, and so on. From a linguistic point of view, phonological, morphological, syntactic and lexical features are all relevant, but, in my experience, morphology and syntax are particularly well suited for annotation.

Morphological annotation
A morphological annotation will as a minimum specify the lemma, i.e. headword, of each running word in the text, and usually also the grammatical form. In the Menota XML, this information is added to each word by way of attributes to the <w> element. In the example of the adverb svá (as the entry is spelt in an Old Norse dictionary), the lemma attribute is simply svá, as shown in a slightly simplified encoding: <w lemma="svá"> <facs>Sꝩa</facs> <dipl>Sva</dipl> <norm>Svá</norm> </w> Adverbs like svá are not inflected, so the only grammatical information in addition to the lemma will be its word class (part of speech). In the Menota project, the msa attribute (the full form being me:msa) specifies the morpho-syntactic analysis of the word. An adverb will receive the value xAV, in which x signifies word class and AV adverb: <w lemma="svá" msa="xAV"> <facs>Sꝩa</facs> <dipl>Sva</dipl> <norm>Svá</norm> </w> Words with more grammatical categories, such as nouns, adjectives and verbs, have a longer list of values for the msa attribute, but the principle is the same, so that, for example, the noun varrar 'lips' will, in addition to its word class, be annotated for gender, case, number and species (the latter category has two values, indefinite as in varrar 'lips' or definite as in varrarnar 'the lips'): <w lemma="vǫrr" msa="xNC gF cN nP sI"> <facs>ꝩarrar</facs> <dipl>varrar</dipl> <norm>varrar</norm> </w> The msa attribute contains one or more name tokens, each specifying a grammatical category; in this case xNC for "noun common", gF for "gender feminine", cN for "case nominative", nP for "number plural" and sI for "species indefinite". 19 Assuming that each word of a text has received morphological annotation, the information can be displayed in various ways, including tabular displays such as the one in Fig. 3. 
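The structure of the msa attribute lends itself to a simple decoding step, sketched here in Python. This is a hedged illustration only: the tables contain no more than the categories and values named in this chapter (the full inventory is given in the Menota Handbook and is not reproduced here), and the names `CATEGORIES`, `VALUES` and `decode_msa` are hypothetical. Values that are not in the tables are passed through unchanged rather than guessed.

```python
# Category prefixes and values as named in the chapter; not a full inventory.
CATEGORIES = {
    "x": "word class",
    "g": "gender",
    "c": "case",
    "n": "number",
    "s": "species",
}
VALUES = {
    "x": {"NC": "noun common", "AV": "adverb"},
    "g": {"F": "feminine"},
    "c": {"N": "nominative"},
    "n": {"P": "plural"},
    "s": {"I": "indefinite"},
}


def decode_msa(msa: str) -> dict[str, str]:
    """Split an msa value into name tokens and spell each one out."""
    analysis = {}
    for token in msa.split():
        # The first character names the category, the rest is its value.
        prefix, value = token[0], token[1:]
        category = CATEGORIES.get(prefix, prefix)
        analysis[category] = VALUES.get(prefix, {}).get(value, value)
    return analysis


print(decode_msa("xNC gF cN nP sI"))
print(decode_msa("xAV"))
```

Because each name token carries its own category prefix, the order of the tokens is immaterial, and uninflected words simply have fewer tokens.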
The actual encoding and the display are more or less self-explanatory and draw on a long tradition of dictionary archives based on the venerable index card. 20 It should be pointed out that the usefulness of a linguistic annotation is dependent on the variability of the orthography of the texts. For a literature in which the language of the texts is highly regularised, such as in most corpora of modern texts, a morphological analysis can to a high degree be done by semi-automatic analysis. For medieval vernacular texts in which the orthography is variable not only between sources but even within sources, a linguistic annotation really comes into its own. By the same token, the annotation is more time-consuming and less suited for automatisation. The inflection of many Old Norse words illustrates this point, e.g. the verb verða 'become', whose forms vary considerably: the initial v- is dropped in several forms, and the root vowel shifts between e, a, u, y and o due to a combination of Ablaut and Umlaut. This degree of variation within a paradigm is known from many other languages, but what really complicates the analysis here is the fact that each form could be spelt in more than one way, sometimes in a frustratingly high number of ways. For example, the regularised form urðu, 3rd person plural preterite indicative of verða, might (at least in theory) be spelt urðu, urðv, vrðu, vrðv, urþu, urþv, vrþu, vrþv, urþo, urðo, vrþo, vrðo, and more. On the normalised level, there would only be the form urðu, but on the diplomatic, and especially the facsimile level, there will be many more forms. An added difficulty is that while Old Norwegian and Old Icelandic (i.e. Old Norse) have a normalised orthography, this is not the case for Old Swedish or Old Danish. The Old Norse normalised orthography was established in the late nineteenth century and is used with minor variation in standard grammars, dictionaries and in many editions. 21 No similar norm exists for Old Swedish and Old Danish, even if Old Swedish texts in particular might be suited for normalisation. 22 The Old Danish language is less conducive to normalisation, partly due to the fact that it evolved so quickly in the Middle Ages and partly due to the sparsity of sources up to ca 1300. 23 A similar lack of a norm applies to other European vernaculars, and as a consequence, morphological annotation is a desideratum across the board for the European vernaculars. In spite of these difficulties, morphological annotation of the type discussed above is a fairly simple undertaking and it is not the object of many linguistic controversies. After all, the aim of the annotation is to link texts to existing resources such as grammars and dictionaries, and, consequently, the grammatical categories will be traditional. There is some variation in matters such as the lemma orthography and the grammatical categories, especially word classes, but at least within Medieval Nordic philology, these problems are relatively small. 24
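The spelling variation quoted above for urðu can be made concrete with a small combinatorial sketch in Python. It is an illustration only, and it assumes that the variation is independent in each position of the word: 'u' or 'v' initially, 'ð' or 'þ' medially, and 'u', 'v' or 'o' finally. The names `SLOTS`, `spellings` and `LOOKUP` are hypothetical. Under this assumption, the combinations are exactly the twelve distinct spellings quoted for this form.

```python
from itertools import product

# One tuple per position in the word; each tuple lists the alternatives.
SLOTS = [("u", "v"), ("r",), ("ð", "þ"), ("u", "v", "o")]


def spellings(slots):
    """Return every spelling obtained by picking one option per position."""
    return ["".join(combo) for combo in product(*slots)]


variants = spellings(SLOTS)
print(len(variants), variants)

# On the normalised level all of these collapse into a single form.
LOOKUP = {variant: "urðu" for variant in variants}
print(LOOKUP["vrþo"])
```

The number of theoretical spellings is the product of the number of alternatives in each position, here 2 × 1 × 2 × 3 = 12, which explains why the count grows so quickly for longer words.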

Syntactic annotation
Syntactic annotation is more of a challenge than morphological annotation. Competing syntactic models have evolved over the years, and there is a varying degree of compatibility between them. For a language of comparatively free word order, such as the Medieval Nordic languages, it seems that dependency analysis is a suitable and fairly simple syntactic model. 25 In dependency analysis, each word is described by its function and hierarchical position within the sentence. This is typically displayed in a tree with labels for each word specifying its function, as shown in Fig. 4. It is a characteristic and perhaps unexpected trait of dependency analysis that words rather than phrases are assigned syntactic functions. There are some non-intuitive consequences of this, for example that conjunctions are analysed as heads (as in Fig. 4) and for this reason have full sentences as their dependents. However, the internal hierarchy of the coordinated sentences, each having a predicate as its head, is not affected by the fact that the conjunction has been elevated, as it were, to the position of a head.
While morphological annotation can easily be incorporated in the XML discussed above, syntactic annotation is better carried out in a separate module. The PROIEL project developed at the University of Oslo offers exactly this type of annotation environment. 26 PROIEL initially undertook a syntactic analysis of the New Testament in five old Indo-European languages, the original Greek and early translations into Latin, Gothic, Armenian and Old Church Slavonic. Through cooperation with other projects, the annotated corpus in the PROIEL format has later been extended to many more languages, among them Old Icelandic, Old Norwegian and Old Swedish. 27 While the PROIEL project was originally designed for the study of information structure, all texts were annotated for morphology and on the basis of this also for syntax. The result is a deep annotation with considerable linguistic information about each text, organised in the form of a treebank. 28 The original PROIEL treebank project and the subsequent projects now form what may be called a treebank family of early attested Indo-European languages, ranging from classical to medieval, and in some cases modern, stages in their development. 29 In total, approx. 1.6 million words have been annotated manually at a high level of accuracy. For Old Norwegian and Old Swedish, there are so far no other treebanks than those in the PROIEL family. 30 The dependency model illustrated here is in some respects close to lexical-functional grammar (LFG), but it contrasts especially with phrase structure models. However, as can be seen in the examples here, the functional categories used in dependency analysis are by and large familiar, such as predicate, subject and object, although some categories, especially the obliques and the external objects, are less familiar. Even so, there seems to be a sufficiently high degree of recognition between central syntactic models. The major criterion in such cases is, I believe, that an insight is only a fruitful insight if it can be transferred from one model to another; if not, it may be an insight solely into the model, not what the model purports to explain. A case in point is the fact that a dependency tree (with some modifications) can successfully be converted to an LFG representation, and the other way round; in other words, dependency and LFG models are able to express similar analytic insights. 31
Figure 4 (cf. Fig. 1). The conjunction ok 'and' functions as the head of the two coordinated sentences, while each of these has a predicate as its head, stiga 'go' and liggia 'lie'. The subject of the first sentence is fǿtr 'feet', with hennar 'her' as an attribute, and the subject of the second is gaotur 'roads' also with hennar 'her' as an attribute. The preposition til 'to' in both sentences is heading an oblique specifying the goal of the predicate, dominating daouða 'death' and haelvitis 'hell' respectively, and finally, the adverb niðr 'down' specifies the direction of the predicate stiga 'go'. Copyright: Odd Einar Haugen, License: CC BY-NC-ND.
27 These projects include the ISWOC project for Old English and several Romance languages, the TOROT project for Russian, the Menotec project for Old Norwegian, and the MAÞIR project for Old Swedish. PROIEL has also added many Greek and Latin texts to the original New Testament texts. The exact number of languages depends on the classification; five major language families are represented, i.e. Armenian, Germanic, Greek, Romance and Slavic, and all but Armenian have several branches. The total of languages (or linguistic stages) covered by PROIEL is 18.
28 Treebank is a term that reflects the fact that a syntactic analysis typically takes the form of a tree, and that a collection of such trees makes up a bank.
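The dependency analysis of the final sentence of the Homily Book extract, as described in the caption to the dependency tree, can be represented as a simple list of tokens, each pointing to its head. The following Python sketch is an illustration and not the PROIEL data format. The class `Token` and the functions `dependents` and `render` are hypothetical names; the labels "root", "adverbial" and "complement" are my own placeholders for relations that the caption does not name, while "predicate", "subject", "attribute" and "oblique" follow the caption.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int        # position in the sentence
    form: str      # word form as quoted in the caption
    head: int      # id of the head token, 0 for the top of the tree
    relation: str  # syntactic function relative to the head


TOKENS = [
    Token(1, "fǿtr", 3, "subject"),
    Token(2, "hennar", 1, "attribute"),
    Token(3, "stiga", 7, "predicate"),
    Token(4, "niðr", 3, "adverbial"),       # label assumed
    Token(5, "til", 3, "oblique"),
    Token(6, "daouða", 5, "complement"),    # label assumed
    Token(7, "ok", 0, "root"),              # label assumed
    Token(8, "liggia", 7, "predicate"),
    Token(9, "gaotur", 8, "subject"),
    Token(10, "hennar", 9, "attribute"),
    Token(11, "til", 8, "oblique"),
    Token(12, "haelvitis", 11, "complement"),  # label assumed
]


def dependents(tokens, head_id):
    """Return the tokens directly dominated by the given head."""
    return [t for t in tokens if t.head == head_id]


def render(tokens, head_id=0, depth=0):
    """Return the tree as indented lines, one line per word."""
    lines = []
    for t in dependents(tokens, head_id):
        lines.append("  " * depth + f"{t.form} ({t.relation})")
        lines.extend(render(tokens, t.id, depth + 1))
    return lines


print("\n".join(render(TOKENS)))
```

The representation makes the point of the discussion above explicit: every word, including the conjunction, is a node with exactly one head, and phrases only emerge as the set of words dominated by a given node.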

The appeal of annotation
A text annotated for morphology and syntax is indeed a boon for the linguist and the language historian. For many other scholars, for example of a literary or historical inclination, the annotation is one of several resources for a better understanding of opaque or ambiguous passages in a text. The annotated text is a close cousin of the commentary; while the latter can go into extensive detail and list a number of interpretations, the annotated text makes a decision and is usually unambiguous, unless the categories themselves have been designed to be polyvalent. Old Norse poetry offers a host of complex and enigmatic passages. In the Eddic poem Vǫluspá 'The Prophecy of the Seeress', stanza 2.4-6 is still unresolved. As Fig. 5 from Codex Regius shows, the poem is written in continuous lines, and the script is somewhat difficult to read here.
1 Ec man iotna
2 ár um borna,
3 þá er forðom mic
4 foedda hǫfðo;
5 nío man ec heima,
6 nío íviði,
7 miotvið moeran
8 fyr mold neðan.
1 I remember yet
2 the giants of yore,
3 Who gave me bread
4 in the days gone by;
5 Nine worlds I knew,
6 the nine in the tree
7 With mighty roots
8 beneath the mold.
A much-debated question concerns the reading and understanding of the phrase "nio iviþi" in stanza 2.6 of the poem. This has been taken as normalised níu í viði 'nine in the tree' in many editions, since there commonly was no word division between a preposition and its complement, and viði is a bona fide dative of viðr m. 'wood, tree'. However, after studying the manuscript closely, some philologists conclude that there is an almost invisible abbreviation character after "viþi", in the form of an ur sign. If this is correct, the reading of the word becomes "iviþiur", normalised íviðjur, meaning 'giantesses'. This reading happens to be supported by another manuscript, Hauksbók in Copenhagen, The Arnamagnaean Collection, AM 544 4º, f. 20r, l. 3, so even if the reading in GKS 2365 4º is pending, íviðjur 'giantesses' has been adopted in the latest edition of the Eddic poems. 34 In the annotated version of this poem, the latter interpretation has been selected, as can be seen from the morphological analysis in Fig. 6. 35 As for the syntax of the half-stanza, the analysis is shown in Fig. 7. In this analysis, íviðjur 'giantesses' is explicitly analysed as the object in níu [man ek] íviðjur 'nine [I remember] giantesses', and níu 'nine' as an attribute. It is an analysis that makes rather heavy assumptions about ellipted words, but, assuming that íviðjur is a noun rather than a preposition and its complement, the present analysis seems to be the best one.
The point of this example is not that a linguistic annotation offers the definitive answer to an enigmatic reading. That would be presumptuous. Rather, what it offers is a clear and consistent analysis of each word in each sentence of the poem in question, not skipping over any difficult passages, and as such it has a high degree of explanatory potential. It offers a point of reference for any interpretation of the text.
Figure 7. The syntactic annotation of Vǫluspá stanza 2.5-8. The main predicate is man (of muna) 'remember', which is present in st. 2.5 and must be seen as covert in the two other sentences, 2.6 and 2.7-8, represented by an encircled "V", and slashed lines pointing to man in order to indicate identity with the overt predicate. Copyright: Odd Einar Haugen, License: CC BY-NC-ND.

Costs and benefits
Nobody would deny that annotation enhances an edition and makes it accessible for a wider audience. What is a matter of discussion is the cost of annotation compared to the benefits it offers. For a literature where many texts are still awaiting an edition (or a sufficiently good edition), the priority would be to edit the remaining texts, unless these texts by common consent were regarded as being of too little value even for the most avid scholar of the period. This is not the case for the field this author is most familiar with, Old Norse literature. The great majority of Old Norse works have been edited, many several times, and there are few works that are not available in a decent edition. Most works are preserved in more than one manuscript, but the really large manuscript traditions known from e.g. classical scholarship are few and far between. In Old Norwegian literature, the 15 preserved manuscripts of Barlaams saga ok Jósafats count as a rather large tradition, and Konungs skuggsjá with 60 preserved manuscripts (the majority being younger Icelandic ones) is one of the largest manuscript traditions, only surpassed by the law code of Magnús Hákonarson. A notable trait is the fact that almost all of the earlier manuscripts of these works are fragmentary, so that the text of each work has to be pieced together from several textual witnesses. And, as stated above, these manuscripts typically have different orthographies, representative of their time and locale.
In comparatively small textual traditions it makes sense and is in many cases feasible to transcribe each of the manuscripts. Any critical edition can only offer a glimpse of the textual variation through its apparatus, and while for many scholars it is fine to have an apparatus that only contains the substantive variants, for other users, the variation in accidentals is as interesting. Even in a fairly small vernacular manuscript tradition, there are simply too many variants for a workable apparatus, so the only way to record them is to edit each manuscript as an individual witness to the work. These transcriptions will preferably be digital ones published in text archives and searchable within these archives.
Many editors would be happy with an edition of the text as it appears in a single, typically best, manuscript. Probably the majority of texts published in an archive like the Medieval Nordic Text Archive will remain at this level. However, some editors would like to add annotation in order to open up the text, and other scholars would like to contribute by annotating editions by previous editors. Incrementally, more and more texts will be annotated. On the whole, this process is likely to be self-regulatory, although canonical texts are more prone to receive annotation than other texts. GKS 2365 4º, the major source for the Eddic poems, is a prime example of this type of text. The question of cost will ultimately be decided against the background of the canonicity of, and thus the general interest in, the texts to be annotated.