1. Introduction
This document is meant to serve as a reference for the encoding of PressMint corpora of historical newspapers. In order for the PressMint corpora to be interoperable (i.e. so that the same scripts can be used to process them), their structure is fairly rigid, primarily in terms of file names and folder structure, and, partially, their TEI XML encoding. This is not to say that all the corpora have to contain exactly the same information because we distinguish obligatory information, which all the corpora should contain, from that which is optional, and present only in the corpora for which it has been possible to gather it from the corpus sources.
This document is a modification of the ParlaMint encoding guidelines, which are a customisation of the TEI Guidelines. While ParlaMint specifies many requirements on the structure of the documents and on obligatory data and metadata, PressMint makes only minimal requirements for the purposes of interoperability and leaves considerable space for optional extensions.
The rest of these recommendations are structured as follows:
- Chapter 2 explains the overall XML structure of a PressMint corpus, and introduces the distinction between the corpus root and corpus components;
- Chapter 3 explains the basic requirements and file conventions of a PressMint corpus and introduces the top level XML elements and their attributes, incl. links;
- Chapter 4 gives the encoding of the corpus metadata, such as the title information, documenting the source of the corpus, taxonomies used etc.;
- Chapter 5 explains how to encode facsimile-related information, i.e. links to and documentation of the newspaper images;
- Chapter 6 treats the encoding of the texts of the corpus newspapers;
- Chapter 7 details the addition of linguistic annotations to the corpus;
- Chapter 8 introduces scripts to finalise, validate and convert a PressMint corpus to other formats;
- Chapter 9 gives instructions on how to contribute samples of a PressMint corpus to GitHub;
- Appendix A gives the formal specification of the PressMint schema.
2. Overall corpus structure
2.1. XML structure
The <teiHeader> of a corpus component (further detailed in the Section on Corpus metadata) contains the metadata specific for this component (along with some redundant metadata about its provenance), and which should be unique in the corpus, i.e. the corpus component metadata should distinguish it from all the other components of the corpus.
2.2. Use of XInclude
The fact that a corpus is one XML document does not mean that it is also stored in one file. In fact, PressMint requires that each corpus component is stored in a separate file, with the corpus root, i.e. the top-level <teiCorpus>, also stored as one file.
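The way a corpus root pulls in its separately stored components can be sketched with the XInclude support in the Python standard library. This is a minimal illustration, not the PressMint tooling: the relative `href` value, the in-memory `COMPONENTS` table and the function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical corpus root: the href layout is an assumption.
ROOT_XML = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
           xmlns:xi="http://www.w3.org/2001/XInclude"
           xml:id="PressMint-SI">
  <teiHeader/>
  <xi:include href="1899/PressMint-SI_1899-01-02.xml"/>
</teiCorpus>"""

# Stand-in for the component files on disk (hypothetical content).
COMPONENTS = {
    "1899/PressMint-SI_1899-01-02.xml":
        '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
        'xml:id="PressMint-SI_1899-01-02"><teiHeader/><text/></TEI>',
}


def in_memory_loader(href, parse, encoding=None):
    """Return the parsed component for an xi:include href."""
    return ET.fromstring(COMPONENTS[href])


def assemble_corpus(root_xml, loader):
    """Parse the corpus root and replace each xi:include by its target."""
    root = ET.fromstring(root_xml)
    ElementInclude.include(root, loader=loader)
    return root


corpus = assemble_corpus(ROOT_XML, in_memory_loader)
component_ids = [tei.get(f"{{{XML_NS}}}id")
                 for tei in corpus.findall(f"{{{TEI_NS}}}TEI")]
```

With real files, the default loader of `ElementInclude` reads each `href` from disk, so the custom loader is only needed to keep the sketch self-contained.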
2.3. File names and directory structure
PressMint has strict rules on how to name the various files that constitute a corpus, and how to collect them in directories.
The file names have the following structure:
- The corpus root file name should start with the string PressMint-, followed by the ISO 3166 country code (cf. Section on Standard values) of the country whose team is contributing the corpus, e.g. PressMint-SI.xml.
- A corpus component filename should start with the name of the root, followed by an underscore and the ISO 8601 formatted date of the publication of the newspaper, for example PressMint-SI_1899-04-16.xml. In case the exact date of the publication is unknown, only the month or year can be given, e.g. PressMint-SI_1899-04.xml or PressMint-SI_1899.xml.
- The corpus component filename can be extended with an underscore and a string containing only ASCII letters, numbers and the hyphen character, e.g. PressMint-SI_1899_KRN-NUK.xml. This extra suffix can encode an abbreviation of the newspaper's name, and serve to distinguish two different newspapers that were published on the same day, or to distinguish the sources from which a newspaper was obtained.
- Certain metadata elements from the corpus root <teiHeader> are stored in separate files, in particular the PressMint taxonomies, stored in <taxonomy> elements. The common PressMint taxonomies are named as in PressMint-taxonomy-topics.xml, i.e. they start with PressMint-taxonomy- followed by a string describing what the taxonomy contains, in this case, the topics that the newspaper articles will be classified into. However, if a taxonomy (or other metadata file) is specific to a particular corpus, then this is indicated by having the file name (and the IDs contained in it) prefixed by the name of the corpus root, e.g. PressMint-NL-taxonomy-topics.xml for additional topics distinguished by the Dutch corpus. Finally, if a taxonomy (or other separate metadata file) refers to the linguistically annotated corpus only, then it should get the suffix .ana, e.g. PressMint-taxonomy-NER.ana.xml.
- The file names of the corpus as a whole or corpus components that have been automatically converted from the source XML into some other format should have the same name as the corpus root or components, respectively, but with appropriate file extensions, e.g. PressMint-SI_1899_KRN-NUK.txt; this is further explained in the Section on Conversions.
- As discussed in the Chapter on Linguistic annotation, we distinguish the linguistically annotated version of the corpus from the ‘plain-text’ one, with the linguistically annotated version having the additional suffix .ana on the corpus root and components, e.g. PressMint-SI.ana.xml or PressMint-SI_1899_KRN-NUK.ana.xml.
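The naming rules for root and component files can be checked mechanically. The sketch below is an illustration only: the regular expressions are our reading of the rules above, and they assume a two-letter upper-case country code; taxonomy and converted files are deliberately not covered.

```python
import re

# Corpus root, e.g. PressMint-SI.xml or PressMint-SI.ana.xml
ROOT_RE = re.compile(r"^PressMint-(?P<country>[A-Z]{2})(?P<ana>\.ana)?\.xml$")

# Corpus component: root name, underscore, ISO 8601 date (year, year-month
# or full date), optional underscore-separated suffix, optional .ana
COMPONENT_RE = re.compile(
    r"^PressMint-(?P<country>[A-Z]{2})"
    r"_(?P<date>\d{4}(?:-\d{2}(?:-\d{2})?)?)"
    r"(?:_(?P<suffix>[A-Za-z0-9-]+))?"
    r"(?P<ana>\.ana)?\.xml$"
)


def classify(name):
    """Return 'root', 'component' or None for a candidate file name."""
    if ROOT_RE.match(name):
        return "root"
    if COMPONENT_RE.match(name):
        return "component"
    return None
```

For example, `classify("PressMint-SI_1899_KRN-NUK.ana.xml")` yields `"component"`, while a name using a hyphen instead of the underscore before the date is rejected.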
For distribution, the complete XML corpus should be stored in a directory that has the same name prefix as the corpus root file, extended with the format (e.g. TEI). The directory then contains the corpus root file and its metadata files (such as taxonomies), while the corpus components should be in subdirectories, one per year, for example:
PressMint-SI.TEI/PressMint-SI.xml
PressMint-SI.TEI/PressMint-taxonomy-topic.xml
...
PressMint-SI.TEI/1899/PressMint-SI_1899-01-02.xml
PressMint-SI.TEI/1899/PressMint-SI_1899-01-03.xml
PressMint-SI.TEI/1899/PressMint-SI_1899-01-04.xml
...
PressMint-SI.TEI/1900/PressMint-SI_1900-01-02.xml
PressMint-SI.TEI/1900/PressMint-SI_1900-01-03.xml
PressMint-SI.TEI/1900/PressMint-SI_1900-01-04.xml
...
The linguistically annotated version of the corpus is stored separately, with the main directory and, as mentioned, the corpus root, its metadata files for linguistic annotation, and component filenames having the additional suffix .ana, e.g.
PressMint-SI.TEI.ana/PressMint-SI.ana.xml
PressMint-SI.TEI.ana/PressMint-taxonomy-topic.xml
PressMint-SI.TEI.ana/PressMint-taxonomy-NER.ana.xml
...
PressMint-SI.TEI.ana/1899/PressMint-SI_1899-01-02.ana.xml
PressMint-SI.TEI.ana/1899/PressMint-SI_1899-01-03.ana.xml
PressMint-SI.TEI.ana/1899/PressMint-SI_1899-01-04.ana.xml
...
PressMint-SI.TEI.ana/1900/PressMint-SI_1900-01-02.ana.xml
PressMint-SI.TEI.ana/1900/PressMint-SI_1900-01-03.ana.xml
PressMint-SI.TEI.ana/1900/PressMint-SI_1900-01-04.ana.xml
...
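The directory layout shown in the two listings can be derived from a file name alone. The following is a rough sketch under the same assumptions as the listings (format label TEI, one subdirectory per year); the function name and the regular expression are ours, and taxonomy files are not handled.

```python
import re
from pathlib import PurePosixPath

NAME_RE = re.compile(
    r"^(?P<corpus>PressMint-[A-Z]{2})"   # corpus root name
    r"(?:_(?P<year>\d{4})[^.]*)?"        # components only: date and suffix
    r"(?P<ana>\.ana)?\.xml$"             # optional .ana before .xml
)


def distribution_path(filename):
    """Return the relative path of a root or component file in a release."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"not a corpus root or component name: {filename}")
    parts = [f"{m['corpus']}.TEI{m['ana'] or ''}"]
    if m["year"]:
        parts.append(m["year"])          # components go into a year directory
    parts.append(filename)
    return str(PurePosixPath(*parts))
```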
3. General requirements
This section gives some general requirements a PressMint corpus has to meet, in particular those relating to the characters in a corpus, and the use of standards. It also details the structure of the file names of the PressMint root and component files, as well as the attributes expected on the <teiCorpus> and <TEI> tags.
3.1. Characters
The corpus should be encoded in Unicode, using the UTF-8 character encoding, at least for European languages. In cases where the original contains characters from the Unicode Private Use Area, these should, if possible, be given their closest Unicode equivalents or substituted by the Unicode replacement character U+FFFD. End-of-line hyphens, if present in the source files, should be removed, and the split words joined in order to enhance searching the corpus and to simplify linguistic processing.
The following characters, esp. prevalent when the source documents were in Word or HTML, deserve special mention:
- The TAB (U+0009) character helps the alignment of strings on successive lines. As PressMint is not interested in preserving the layout, all TAB characters must be substituted by space characters (U+0020).
- NO-BREAK SPACE (U+00A0) prevents, with some applications, an automatic line break at its position and also collapsing such consecutive characters into a single space. As the use of this character complicates (or breaks) further processing, esp. linguistic annotation, this character must be substituted by the normal space character (U+0020). The same holds for other variants of Unicode space characters (U+2000 - U+200A), which are, however, used much less frequently.
- ZERO WIDTH NO-BREAK SPACE (U+FEFF), also used as the Byte Order Mark (BOM) in Windows files, should be removed.
- NON-BREAKING HYPHEN (U+2011), similarly to NO-BREAK SPACE, prevents a line break, in this case following its position. With a similar reasoning as above, this character should be substituted by the normal hyphen character ('-', U+002D).
- SOFT HYPHEN (U+00AD) indicates that a word can be hyphenated at that point. Occurrences of this character should be removed from the corpus.
Text-bearing elements should also not start or end with space characters, and sequences of whitespace characters should be changed into a single space.
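The character substitutions listed above can be sketched as a single clean-up function. This is an illustration rather than the project's own script; the function name is hypothetical, and the sketch does not cover Private Use Area characters or the joining of end-of-line hyphenated words.

```python
import re

# TAB, NO-BREAK SPACE and the Unicode spaces U+2000..U+200A become U+0020
_TABLE = {0x0009: " ", 0x00A0: " "}
_TABLE.update({code: " " for code in range(0x2000, 0x200B)})
_TABLE[0xFEFF] = None   # ZERO WIDTH NO-BREAK SPACE / BOM: removed
_TABLE[0x00AD] = None   # SOFT HYPHEN: removed
_TABLE[0x2011] = "-"    # NON-BREAKING HYPHEN: plain hyphen U+002D


def normalise_text(text):
    """Apply the character rules, collapse whitespace and trim both ends."""
    text = text.translate(_TABLE)
    return re.sub(r"\s+", " ", text).strip()
```

The trimming step corresponds to the requirement that text-bearing elements neither start nor end with a space.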
3.2. Standard values
Whenever possible, PressMint uses standards for information coding. In particular, the following information must be standardised:
- As the identity of a PressMint corpus is determined by the country that is contributing the corpus, its code appears in many places. For specifying these codes, the ISO 3166 standard should be used, in particular ISO 3166-1 alpha-2 for the two letter codes of the countries.
- The codes for the languages used in the corpora (i.e. the possible values of the xml:lang attribute) should follow BCP 47 (cf. also ‘xml:lang in XML document schemas’). Essentially, this means that the value for a language code should have two letters, following ISO 639-1 or, and only if a two letter code does not exist for a language, the three-letter ISO 639-2/T code. PressMint corpora will use (except for Great Britain) at least two languages, i.e. the language that the newspapers are written in, which we will call the local language, and English, as the meta-language, which is (also) used in the metadata.
- Temporal, i.e. time-related information is typically stored in the when, from and to attributes of various elements. To specify a date or time as the value of these attributes, formatting according to the ISO 8601 standard should be used, e.g. 1888-04-01 for the 1st of April 1888. More information on temporal attributes is given in the Section on Temporal attributes.
3.3. Attributes of top-level elements
The Chapter on Overall corpus structure introduced the top level elements of the corpus root file and of the component files (i.e. the <teiCorpus> and <TEI> elements), but did not elaborate on their attributes; these are presented in this section.
- xmlns determines the namespace of the element, and this should always be the TEI namespace, i.e. http://www.tei-c.org/ns/1.0 (apart from the elements using the XInclude directive, cf. the Section on Use of XInclude). Note that lower level elements inherit the namespace of the superordinate element, unless explicitly overridden, so it is only necessary to specify the TEI namespace on the root element of a file.
- xml:id is an attribute from the (implicitly assumed) XML namespace, and gives the identifier of the element bearing it. The value of an ID should be unique in the corpus as a whole and should obey format requirements as defined by W3C. For the corpus root, as well as for the components, it is required that this top level identifier is identical to the file name (without the file extension). The xml:id is a global attribute, so any element can have it. While this is not required, it is necessary for any element that is then referred to (via this same ID) by some other element, such as many elements in the <teiHeader>, as is explained in the Section on Corpus metadata. The subordinate elements in the text that have an ID (such as page breaks, or, more accurately page beginnings), are recommended to have the top level xml:id as a prefix and to indicate the element name in the ID. For example, if the top level ID is PressMint-SI_1899-01-02, the first page beginning would have the ID PressMint-SI_1899-01-02.pb1.
- xml:lang is also a global attribute and gives the language code of the text content of the element; for the corpus root this means the content of its TEI header, while for corpus components this is the textual content of their TEI headers and <text> elements. The convention is that the language of the text content of an element is determined by the value of the first xml:lang attribute on its ancestor axis. In cases where the content is multilingual, the language code should be that of the majority language. When the proportion of the languages is about equal, then the mul code for multiple languages can also be used.
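The requirements on the top-level attributes lend themselves to a quick automated check. The sketch below is an assumption-laden illustration: the function name and the wording of the messages are ours, and treating a missing xml:lang as a problem is our reading of the text, not a rule taken from the schema.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def check_top_level(filename, xml_text):
    """Return a list of problems found on the top-level element."""
    problems = []
    root = ET.fromstring(xml_text)
    # The root must be teiCorpus or TEI in the TEI namespace.
    if root.tag not in (f"{{{TEI_NS}}}teiCorpus", f"{{{TEI_NS}}}TEI"):
        problems.append(f"unexpected root element {root.tag}")
    # xml:id must equal the file name without the .xml extension.
    expected_id = PurePosixPath(filename).stem
    if root.get(f"{{{XML_NS}}}id") != expected_id:
        problems.append(f"xml:id should be {expected_id}")
    # xml:lang is expected on the top-level element (our assumption).
    if root.get(f"{{{XML_NS}}}lang") is None:
        problems.append("xml:lang is missing")
    return problems
```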
3.4. Pointing attributes
The PressMint encoding can use pointing attributes for various purposes, e.g. for references to the IDs of the facsimile elements or for references to taxonomy categories.
While a few elements have dedicated pointing attributes, there are some generally used ones. They share the characteristics that they can all be used by a number of different elements and that their value is a series of pointers, i.e. a white-space delimited sequence of references to the values of some xml:id attribute in the corpus or, in general, to a URI. The attributes are:
- facs gives the pointer(s) to the elements contained in the <facsimile> element (cf. Section on Newspaper facsimile).
- ana serves to provide an analysis or to classify an element according to some pre-determined vocabulary. In PressMint the target element will typically be a category in a taxonomy.
- ref provides an explicit reference to the full definition or identity for the entity being named. In PressMint it could be used e.g. for referring to the Wikimedia entry for a person. The value of this attribute is often, but not always, a URL, e.g. for associating a place name with its GeoNames URL.
3.5. Temporal attributes
PressMint makes use of temporal information, in particular to encode when a newspaper was published. As mentioned in the Section on Standard values, the ISO 8601 format should be used to specify the dates or times.
The following attributes are used to specify temporal information:
- The when attribute is used when the temporal information refers to a point in time, typically a date, and is used e.g. to give the date when a corpus text was published, or when a change in the corpus was made.
- The from and to attributes give the starting and ending date or time of an interval, e.g. the time period the corpus covers. If only one of the two attributes is present, then the assumption is that this interval extends at least to the start (if from is missing) or after the end (if to is missing) of the time period that the particular PressMint corpus covers. Similarly, if both attributes are missing, the assumption is that the interval covers the complete time period of a PressMint corpus.
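The defaulting rule for missing from and to attributes can be illustrated as follows. This is a sketch only: the corpus time span used here is invented, the function name is ours, and full ISO 8601 dates (year-month-day) are assumed.

```python
from datetime import date

# Hypothetical time period covered by a corpus (not taken from the guidelines).
CORPUS_SPAN = (date(1880, 1, 1), date(1918, 12, 31))


def effective_interval(corpus_span, from_=None, to=None):
    """Fill a missing from/to value with the matching bound of the corpus period."""
    start = date.fromisoformat(from_) if from_ else corpus_span[0]
    end = date.fromisoformat(to) if to else corpus_span[1]
    return start, end
```

An element carrying only `to="1899-12-31"` is thus read as spanning from the start of the corpus period to the end of 1899.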
4. Corpus metadata
As mentioned, <teiCorpus> and <TEI> elements contain the obligatory <teiHeader> element, which stores the metadata for the corpus root and components. In this section we explain and give examples of the required and optional metadata that is contained in the <teiHeader>, proceeding through its various elements, and there distinguishing which parts and what content is appropriate for the corpus root, and which for a corpus component.
As a general remark, most metadata contains free text, and it is a requirement of PressMint that this data is given in the English language, to help researchers from other countries to understand it. It is also recommended to give it in the local language in which the (main portion of the) newspapers is written, so that local researchers can use it in their native tongue.
Note that some obligatory PressMint metadata in the <teiHeader> elements is redundant, in the sense that it can be automatically inserted, and, if missing, is so inserted by the PressMint corpus finalisation scripts (cf. the Section on Finalisation of corpora). This means that the compilers of a particular corpus need not code this metadata in their <teiHeader> elements. To highlight which metadata elements are automatically inserted, this information is given as in-line notes in the following sections. These notes look like this:
Note: Note specifying which part of the metadata can be omitted.
4.1. File description
Note: <editionStmt> and <extent> are automatically inserted by the PressMint finalisation scripts.
4.1.1. Title statement
The title statement, <titleStmt>, gives the title of the corpus root or component, along with the specification of the particular session(s) of the parliament contained, the persons responsible for compiling the corpus, and the funder(s) of the project.
The main title has a formulaic structure ‘<Country_name> historical newspaper corpus PressMint-<Country_code> [PressMint]’, with an equivalent structure for the local language. Note that the corpus ‘stamp’ in square brackets can also be ‘[PressMint.ana]’ for the linguistically annotated version of the corpus (as explained in the Chapter on Linguistic annotation) or ‘[PressMint SAMPLE]’ for corpus data samples, as available on the PressMint GitHub repository.
Note: The title stamp is automatically inserted by the PressMint finalisation scripts.
After the titles come one or more responsibility statements, <respStmt>, each one containing one or more person names, <persName>, with an optional ref attribute, giving the (typically ORCID) URL, where more information about the person can be found, and the responsibility element <resp>, which specifies what responsibility the statement is about.
In a similar manner, the <funder> elements give information on the organisations which have financially contributed to the compilation of the corpus, with the names of the organisations given in the <orgName> elements.
Note: Again, the title stamp is automatically inserted by the PressMint finalisation scripts.
4.1.2. Edition statement
Note: The <editionStmt> is automatically inserted by the PressMint finalisation scripts.
4.1.3. Extents
Note: The <extent> is automatically inserted by the PressMint finalisation scripts; however, this is only done for the extents in English. If the corpus compilers want to have the extents also in their local language, they must insert these elements themselves.
4.1.4. Publication statement
Note: The handle <idno> is automatically inserted by the PressMint finalisation scripts.
The <availability> specifies, via its <licence> element, the fixed-value CC BY 4.0 URL, and in the following paragraph gives a prose description of the licence, including its URL via the target attribute of <ref>. As usual, the textual information is given in both languages. Finally, the <date> gives the date of the release, where the when attribute gives the date in the ISO 8601 format, while the textual content can give it according to the conventions used in the local language.
Note: The <date> element is automatically inserted by the PressMint finalisation scripts; its value is the date when the finalisation is performed.
4.1.5. Source description
<title level="j"> and the <date> when the edition was published; note that here when should contain the date in ISO format, while the content can be in the convention appropriate for the local language. If available, the URL of the digital edition on the Web should also be given. In cases where the whole component corresponds to a particular article in the newspaper, this title is marked as <title level="a">, and if the number of the volume / issue of the newspaper is known, this can also be encoded in <biblScope> with the appropriate unit. Further optional metadata, such as the <publisher> and <pubPlace> (place of publication), can also be given, as shown in the following example:
4.2. Encoding description
In contrast, the encoding description of a corpus component contains only two elements, namely (and redundantly) the <projectDesc> and the <tagsDecl>.
4.2.1. Project description
Note: The project description in English is automatically inserted by the PressMint finalisation scripts.
4.2.2. Editorial declaration
4.2.3. Tags declaration
Note: The <tagsDecl> element is automatically inserted by the PressMint finalisation scripts.
4.2.4. Class declaration and taxonomies
The class declaration, <classDecl> is used only in the corpus root and contains only definitions of (most) controlled vocabularies used in PressMint corpora. These vocabularies, possibly hierarchically organised, are encoded using the <taxonomy> element.
NL (followed by hyphen) into the filename.
PressMint requires only one taxonomy to be defined in the class declaration of the corpus root (as well as additional ones for the linguistically annotated corpus, as further described in the Section on Linguistic metadata). As mentioned, the taxonomies are defined globally and available as part of the data on the PressMint GitHub repository, and there is a special procedure for modifying them, in particular for inserting translations for a new language.
The obligatory taxonomy is:
- The CAP topics taxonomy, which gives major topic labels of the Comparative Agendas Project.
Furthermore, there are several obligatory taxonomies which pertain to the linguistically analysed version of the corpus only, cf. the Section on Linguistic taxonomies.
4.2.5. Prefix definitions
Pointing attributes, such as url or ana, take as their value a reference or space-delimited series of references to a URL and/or the value of xml:id attributes. If the reference is to an ID, then it is prefixed by the hash character, #, e.g. #argic, and if it is to an ID in another XML document, then the hash follows the URL of the document, e.g. https://nl.ijs.si/ME/V6/msd/tables/msd-fslib2-sl.xml#Vmpr1p.
Because complete URLs tend to be long, especially inconvenient when such references are given to many elements, TEI introduces the so called Abbreviated Pointers, whereby a reference can be given in the form of a prefix, which is separated by a colon from the local part of the reference, and the value of this prefix is determined via the <prefixDef> element in the <encodingDesc> of the TEI header.
mte prefix, so for any reference with this prefix, e.g. mte:Vmpr1p, the part after the prefix (Vmpr1p) should be matched against (.+), with the matched part (here the entire tag Vmpr1p) substituted by #$1, i.e. by the hash character followed by the original value, so that mte:Vmpr1p gives https://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl#Vmpr1p.
Finally, each prefix definition also contains a possibly bi-lingual paragraph explaining the definition.
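The expansion of abbreviated pointers described in this section can be sketched as below. The sketch is illustrative: the function name and the `PREFIX_DEFS` table are hypothetical, and the replacement pattern is an assumption reconstructed from the expanded URL quoted in the text, not copied from an actual <prefixDef>.

```python
import re

# prefix -> (match pattern, replacement pattern); the values are assumptions.
PREFIX_DEFS = {
    "mte": (r"(.+)",
            "https://nl.ijs.si/ME/V6/msd/tables/msd-fslib2-sl.xml#$1"),
}


def expand_pointer(pointer, prefix_defs):
    """Expand prefix:local into a full reference; leave other pointers alone."""
    prefix, sep, local = pointer.partition(":")
    if not sep or prefix not in prefix_defs:
        return pointer                      # e.g. '#id' or an ordinary URL
    match_pattern, replacement = prefix_defs[prefix]
    m = re.fullmatch(match_pattern, local)
    if m is None:
        raise ValueError(f"{pointer} does not match {match_pattern}")
    # The patterns write group references as $1; Python expects \g<1>.
    template = re.sub(r"\$(\d)", r"\\g<\1>", replacement)
    return m.expand(template)
```

Pointers without a defined prefix, such as a local `#id` reference or a complete `https:` URL, pass through unchanged.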
4.3. Profile description
4.3.1. Setting description
4.3.2. Language usage
@default="true".
4.4. Revision description
5. Newspaper facsimile
Facsimiles (i.e. images) of the newspapers are highly useful, both for providing the original to the transcriptions in their analysis, as well as for allowing better OCR as the state of the art improves. If the facsimile is available, it should also be published together with the PressMint corpora, and should be referred to from the corpus, in particular from each corpus component.
How to encode references to the facsimile images in TEI is, in the general case, explained in the Chapter on Representation of Primary Sources of the TEI Guidelines. In this chapter we only provide the basic representation that is directly supported in PressMint.
5.1. The facsimile element
The <facsimile> element should appear in a corpus component immediately after the <teiHeader>, cf. the Section on Overall XML corpus structure. It contains pointers to the complete facsimile or its parts, i.e. URLs of the images of an issue or its individual pages, and can further structure or document these images.
Apart from modelling pages with <surface>, areas inside them can also be specified. For this, <zone> elements inside <surface> are used; these can specify a rectangle or, in general, a polygon inside it; the details are given in the TEI Section on Digital Facsimiles. Note, however, that if this approach is used, a mechanism needs to be implemented to show the correct zone on the image.
5.1.1. Connecting the text to the facsimile
Note that these elements can appear anywhere in the text, including in the middle of an (end-of-line hyphenated) word, which makes the linguistic annotation of such text more complicated, as textual data is mixed with markup, which is typically not otherwise the case.
5.2. Structure of newspaper texts
This kind of encoding is appropriate for texts with no internal structure. However, the text can also be split into divisions, encoded using the standard <div> element. PressMint allows two levels of divisions. The upper one (if used) corresponds to sections of a newspaper, such as ‘foreign news’, ‘crime reports’, ‘sports’ etc., while the lower one encodes individual articles, advertisements and other self-contained portions of the text. The type of the division is indicated by the value of its type attribute, which has the following recommended values:
- domesticNews: groups articles on domestic news
- foreignNews: groups articles on foreign news
- sportsNews: groups articles on sports events
- crimeNews: groups articles on crime and misdemeanors
- supplements: groups the supplements to the newspaper
- banner: contains the banner of a newspaper
- colophon: contains the colophon (details of printing, responsibility etc.) of the newspaper
- article: contains an individual newspaper article
- advertisement: contains an advertisement
- supplement: contains a supplement to the newspaper
In case a corpus also contains other types of divisions, these can be encoded using a local taxonomy of division types, whose categories are referred to via the ana attribute on <div>.
5.2.1. Gaps
6. Linguistic annotation
This section introduces the PressMint linguistic annotation. An important note is that a linguistically annotated PressMint corpus is stored separately from its base (or plain-text) TEI version, i.e. the version that has been discussed in the preceding sections. The encoding of the linguistically annotated version differs from the plain-text one in the following:
- All the corpus root and component file names have the extension .ana.xml. For example, if the plain-text TEI root has the file name PressMint-CZ.xml, the linguistically annotated one should be PressMint-CZ.ana.xml, and if the component plain-text file is PressMint-CZ_1901-04-13.xml, the linguistically annotated one is PressMint-CZ_1901-04-13.ana.xml.
- Because the file ID (i.e. the value of the top level element attribute xml:id, as explained in the Section on Attributes of top-level elements) should be the same as the file name (cf. the Section on File names and directory structure), the previous point also means that the linguistically annotated files should have the top level ID suffixed with .ana, e.g. <teiCorpus xml:id="PressMint-CZ.ana">.
- The corpus stamp in the main title of the corpus root or components (cf. the Section on Title statement), which is [PressMint] in the plain-text version, should be [PressMint.ana] for the linguistically annotated version.
- All the plain text of the newspaper paragraphs (i.e. the text immediately contained by the <p> elements) of the plain-text version should be linguistically annotated on the specified levels, as is further explained in the following Section on Linguistic markup.
- The linguistically annotated version of the corpus should also have some added metadata in the TEI header of the corpus root, which is detailed in the Section on Metadata for linguistic annotation.
6.1. Linguistic markup
Linguistic annotation is added only to the immediate text content of <p> elements. For this text, PressMint requires the following additional markup to be present:
- tokens: what is a word, and what is punctuation, with preserved information on inter-token spaces;
- sentences: what is a sentence;
- normalised form (optional): what is the modernised spelling of archaically spelled words;
- lemmas: the base form of each word;
- Universal Dependencies (UD) part-of-speech and morphological features, and, optionally, part-of-speech tags from a different (local) tagset;
- named entities (NE): a name, categorised into the standard four NE classes.
Below, we explain the encoding of each of these levels.
6.1.1. Word-level annotation
The base form or lemma of a word is given as the value of the lemma attribute, while punctuation characters, <pc>, do not have this attribute.
The UD part-of-speech and morphological features are both packed in the msd attribute, with the part-of-speech having the UPosTag linguistic attribute, and the features separated by the vertical bar.
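How such a packed msd value might be unpacked is sketched below. The function name and the sample value are illustrative assumptions; the sample follows the Universal Dependencies feature notation but is not taken from a PressMint corpus.

```python
def parse_msd(msd):
    """Split a packed msd value into a dict of attribute names and values."""
    features = {}
    for item in msd.split("|"):          # features are separated by '|'
        name, _, value = item.partition("=")
        features[name] = value
    return features


# Hypothetical msd value for a finite verb form
sample = parse_msd("UPosTag=VERB|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin")
upos = sample["UPosTag"]                 # the UD part-of-speech
```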
PressMint also allows (but does not require) part-of-speech tags from some other tagset to be added to the linguistic annotation. Where this information is encoded depends on the type of tagset.
mte: is a prefix that is, via the TEI extended pointer syntax as defined in the TEI header (cf. the Section on Prefix definitions), expanded so that the value of such an ana attribute points to the expansions of the given tag to a feature structure. For example, the value mte:Vmpr1p would be expanded to https://nl.ijs.si/ME/V6/msd/tables/msd-fslib2-sl.xml#Vmpr1p, which then resolves to the feature-structure below:
6.1.2. Text modernisation
The language of older newspapers might differ significantly from the contemporary norm. This has an impact on the quality of linguistic annotations, in cases where the annotation tool has been trained on contemporary texts only, as well as hindering searching for particular words or lemmas in their contemporary spellings. To alleviate this, normalisation (i.e. modernisation) is often used on archaic texts, and the subsequent linguistic annotation is performed on such modernised text.
Modern neural approaches typically take a complete chunk of text and normalise it, while more traditional approaches perform the normalisation on individual words. The former has the advantage of being capable not only of modernising the spelling of individual words but also of substituting archaic words with their contemporary equivalents, modernising multi-word units or even syntactic constructions. However, if such a method is used on a PressMint corpus, this means that the linguistically annotated variant of the corpus will contain only the modernised text, and the alignment to the plain-text variant of the corpus will be at the paragraph level only. In other words, losing word-alignment with the original tokens means also losing the ability to search for or directly view the original tokens.
In contrast, traditional methods (such as cSMTiser) will typically normalise only the spelling of individual words, or, at most, sequences of words. This means that the text has to be first tokenised, normalisation applied to such (series of) tokens, and the resulting normalised word-tokens then linguistically annotated. Here both the original and normalised and annotated words are available in the linguistically annotated version of the corpus.
join="right" should be added to the top level word as well as to the last nested word.
6.1.3. Named entities
PressMint also requires annotation of Named Entities (NE), which should be categorised into the following four types:
- PER: person
- LOC: location
- ORG: organisation
- MISC: miscellaneous
6.2. Metadata for linguistic annotation
What kind of metadata a plain-text PressMint corpus should contain was explained in the Section on Corpus metadata and in this section we detail what additions must be made to the metadata for the linguistically annotated version. Note that the other changes for this version of a corpus have been already explained at the start of this Chapter. In short, there are two additional parts that should be added to the <teiHeader> of the corpus root, namely a description of the tool(s) used to linguistically annotate the corpus and an additional taxonomy for named entities.
6.2.1. Application information for linguistic processing
6.2.2. Linguistic taxonomies
Some linguistic annotations have fixed vocabularies and these should be encoded as taxonomies in the TEI header of the linguistically analysed corpus root, similarly to other taxonomies, as discussed in the Section on the Class declaration.
7. Validation and conversion
This chapter explains how to validate and finalise a PressMint corpus, and introduces scripts for converting a PressMint corpus to other, derived formats.
7.1. Validating PressMint corpora
The XML structure of PressMint corpora can be validated against a RelaxNG schema produced as a customisation of the TEI Guidelines.
The TEI customisation is written as a TEI ODD document, which is, in fact, the XML version of this document, and is available in the TEI/ directory of the PressMint GitHub repository. The XML contains not only the prose guidelines but also the formal specification of the TEI schema, which is given in Appendix A. In the on-line version, this specification is converted to a reference of all the elements, attributes and classes used in PressMint corpora. The reference is fairly long, as the PressMint schema has been left open enough to accommodate differing requirements in the encoding.
The ODD document is not immediately useful for XML validation but has to be converted with the standard TEI XSLT stylesheets to a RelaxNG schema. The TEI ODD, its RelaxNG schema (PressMint.rng) and the HTML guidelines are always kept in sync. This schema should be used to check that PressMint component files validate against TEI, typically using Jing (cf. Contributing to PressMint).
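A validation run with Jing could be wrapped as in the following sketch. It assumes that a `jing` executable is on the PATH (many installations instead run `java -jar jing.jar`), and the schema location and the sample file name are hypothetical. The sketch relies only on Jing's basic command line, which takes the schema followed by the files to validate and exits with a non-zero status when a file is invalid.

```python
import subprocess

def build_jing_command(schema, xml_files, jing="jing"):
    """Build the command line: jing <schema.rng> <file.xml> ..."""
    return [jing, str(schema), *map(str, xml_files)]

def validate(schema, xml_files):
    """Run Jing and return (is_valid, messages)."""
    result = subprocess.run(
        build_jing_command(schema, xml_files),
        capture_output=True,
        text=True,
    )
    # Jing exits with status 0 only when all files are valid.
    return result.returncode == 0, result.stdout + result.stderr

# Hypothetical paths; adjust them to the local checkout of the repository.
cmd = build_jing_command("PressMint.rng", ["PressMint-XX_sample.xml"])
print(" ".join(cmd))
```

Calling `validate()` with the same arguments runs the command and returns Jing's messages, which list the location of each validation error.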
7.2. Finalisation of corpora
While the vast majority of the work of converting source encodings into the PressMint corpus format is left to the compilers of a corpus, a few metadata elements can be produced by a common script on the basis of a nearly finished corpus, which then results in the final version of the corpus for a particular release. This includes setting the date, edition and handle under which the corpus will be distributed, and also calculating the size of the corpus (cf. the Sections on Extents and on Tags declaration). The script for finalisation can be found in the Scripts/ directory of the PressMint GitHub repository; the README file briefly explains its function, and more comments can be found in the script itself.
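The size calculation can be pictured with the sketch below, which counts how often each element occurs below <text> in a small TEI document. It is not the project's finalisation script; the sample document and the function name are hypothetical, and the use of <s>, <w> and <pc> for sentences, words and punctuation is an assumption about the linguistically annotated version.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical miniature corpus component.
SAMPLE = f"""<TEI xmlns="{TEI_NS}"><text><body>
<p><s><w>Old</w><w>news</w><pc>.</pc></s></p>
<p><s><w>More</w><w>text</w><w>here</w><pc>.</pc></s></p>
</body></text></TEI>"""

def tag_usage(root):
    """Count how often each TEI element occurs in <text>, itself included."""
    text = root.find(f"{{{TEI_NS}}}text")
    # Strip the namespace prefix "{...}" from each tag name.
    return Counter(el.tag.split("}")[1] for el in text.iter())

usage = tag_usage(ET.fromstring(SAMPLE))
words = usage["w"]

print("Words:", words)
print("Tag usage:", dict(usage))
```

Counts of this kind are what the extent and tag usage metadata record, which is why they are best computed by a script once the corpus is otherwise complete.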
7.3. Conversions
A TEI-encoded document is, in general, not meant to be used directly by software programs; rather, it serves as an interchange and storage format. The PressMint project has produced various scripts to down-convert the XML-encoded corpora to other formats. They can be found in the Scripts/ directory of the PressMint GitHub repository, with the README file listing them and explaining their function. In short, the scripts convert the PressMint XML to plain text, to CoNLL-U, and to vertical format. There is also a script that takes a PressMint corpus and makes from it a sample for inclusion in the PressMint GitHub repository.
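To give an idea of what such a down-conversion involves, the sketch below turns annotated sentences into a simple vertical format, with one token per line and tab-separated columns. It is not one of the project's scripts; the sample, the `lemma` attribute and the two-column layout are assumptions made for illustration, and nested words are ignored.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
T = f"{{{TEI_NS}}}"  # namespace prefix in ElementTree notation

# Hypothetical annotated sentence.
SAMPLE = f"""<TEI xmlns="{TEI_NS}"><text><body><p>
<s><w lemma="old">Old</w><w lemma="news">news</w><pc>.</pc></s>
</p></body></text></TEI>"""

def to_vertical(root):
    """Emit one token per line (form, lemma), wrapped in <s> ... </s>."""
    lines = []
    for s in root.iter(T + "s"):
        lines.append("<s>")
        for tok in s:  # direct children only
            if tok.tag in (T + "w", T + "pc"):
                form = (tok.text or "").strip()
                # Punctuation has no lemma here, so the form is reused.
                lemma = tok.get("lemma", form)
                lines.append(f"{form}\t{lemma}")
        lines.append("</s>")
    return "\n".join(lines)

vertical = to_vertical(ET.fromstring(SAMPLE))
print(vertical)
```

A real conversion would carry more columns and structural attributes, but the principle of flattening the XML into token lines is the same.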
8. Contributing to PressMint
The PressMint GitHub repository contains these guidelines, the PressMint XML schemas, the scripts used to validate, finalise and convert the PressMint TEI XML corpora to derived formats, and samples of the PressMint corpora. The main branches in the repository are:
- main is the default branch used for the synchronisation of other branches. It is also used for releasing sample files that correspond to published corpora.
- data serves as a pushing place for new sample files in ./Data/PressMint-XX directories.
- devel is used for the development of scripts and documentation.
The validation procedure for corpora is explained in the Section on Validating PressMint corpora, while the technical aspects of contributing corpora are further explained in the CONTRIBUTING file of the repository.
9. Acknowledgements
The work on these recommendations was funded by the CLARIN Research Infrastructure for Language Resources and Tools.