The structure and encoding of PressMint corpora
2025-11-13

1. Introduction

This document is meant to serve as a reference for the encoding of PressMint corpora of historical newspapers. In order for the PressMint corpora to be interoperable (i.e. so that the same scripts can be used to process them), their structure is fairly rigid, primarily in terms of file names and folder structure, and, partially, their TEI XML encoding. This is not to say that all the corpora have to contain exactly the same information: we distinguish obligatory information, which all the corpora should contain, from optional information, which is present only in the corpora for which it has been possible to gather it from the corpus sources.

This document is a modification of the ParlaMint encoding guidelines, which are a customisation of the TEI Guidelines. But while ParlaMint specifies many requirements on the structure of the documents and on obligatory data and metadata, PressMint makes only minimal requirements for the purposes of interoperability and leaves considerable space for optional extensions.

The rest of these recommendations are structured as follows:

2. Overall corpus structure

2.1. XML structure

The newspapers of one contributing country constitute one PressMint corpus, which is stored as one XML document, with <teiCorpus> as its top-level element. It is composed of a <teiHeader>, giving the metadata for the corpus as a whole (further detailed in the Section on Corpus metadata), followed by a series of <TEI> elements that each contain one corpus component, as illustrated below:

<!-- Corpus root -->
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>...</teiHeader>
  <TEI>...</TEI> <!-- Corpus component -->
  <TEI>...</TEI> <!-- Corpus component -->
  ...            <!-- More corpus components -->
</teiCorpus>

We do not specify what exactly a corpus component should contain, as this can differ substantially between corpora, e.g. it can be a newspaper edition corresponding to a particular day, or a collection of newspapers for a month or even a year. However, at least the year of the publication must be clear.
A corpus component will thus be rooted in the <TEI> element, which contains its metadata in its own <teiHeader>, followed by the optional <facsimile> element, giving the links to the images, and then by the obligatory <text> element, which contains the text of the particular component, as illustrated below:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>...</teiHeader>
  <facsimile>...</facsimile>
  <text>...</text>
</TEI>

The <teiHeader> of a corpus component (further detailed in the Section on Corpus metadata) contains the metadata specific to this component (along with some redundant metadata about its provenance). This metadata should be unique in the corpus, i.e. it should distinguish the component from all the other components of the corpus.

2.2. Use of XInclude

The fact that a corpus is one XML document does not mean that it is also stored in one file. In fact, PressMint requires that each corpus component is stored in a separate file, with the corpus root, i.e. the top-level <teiCorpus>, also stored as one file.

To enable one XML document to be composed of many files, we use the XInclude mechanism: the corpus root uses <include> elements in the XInclude namespace to include its corpus component files, so a corpus root will in fact be encoded similarly to the following example:

<!-- Corpus root file -->
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>...</teiHeader>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="1899/PressMint-SI_1899-04-16.xml"/>  <!-- Corpus component file -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="1899/PressMint-SI_1899-04-16.xml"/>  <!-- Corpus component file -->
  ...                                            <!-- More corpus component files -->
</teiCorpus>

Apart from corpus components, some parts of the overall corpus metadata (i.e. the <taxonomy> elements) are also stored as separate files, and hence also included in the corpus root using the same XInclude mechanism as explained above.
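
As an illustration of how a script might discover the files that make up a corpus, the following minimal Python sketch lists the href values of all XInclude directives in a corpus root, whether they include components or separately stored metadata. It uses only the standard library; the function name included_files is a hypothetical helper and not part of any PressMint tooling.

```python
import xml.etree.ElementTree as ET

# Qualified name of the XInclude <include> element.
XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def included_files(root_xml: str) -> list[str]:
    """Return the href of every xi:include found anywhere in the corpus root."""
    root = ET.fromstring(root_xml)
    return [el.get("href") for el in root.iter(XI_INCLUDE)]


example = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>...</teiHeader>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="1899/PressMint-SI_1899-04-16.xml"/>
</teiCorpus>"""

print(included_files(example))
```

Note that the sketch only reads the directives; it does not resolve them, which would require an XInclude-aware processor.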

2.3. File names and directory structure

PressMint has strict rules on how to name the various files that constitute a corpus, and how to collect them in directories.

The file names have the following structure:

  • The corpus root file name should start with the string PressMint-, followed by the ISO 3166 country code (cf. Section on Standard values) of the country whose team is contributing the corpus, e.g. PressMint-SI.xml.
  • A corpus component filename should start with the name of the root, followed by an underscore and the ISO 8601 formatted date of publication of the newspaper, for example PressMint-SI_1899-04-16.xml. If the exact date of publication is unknown, only the month or year can be given, e.g. PressMint-SI_1899-04.xml or PressMint-SI_1899.xml.
  • The corpus component filename can be extended with an underscore and a string containing only ASCII letters, digits and the hyphen character, e.g. PressMint-SI_1899_KRN-NUK.xml. This extra suffix can encode an abbreviation of the newspaper's name, and can serve to distinguish two different newspapers that were published on the same day, or to distinguish the sources from which a newspaper was obtained.
  • Certain metadata elements from the corpus root <teiHeader> are stored in separate files, in particular the PressMint taxonomies, stored in <taxonomy> elements. The common PressMint taxonomies are named as in PressMint-taxonomy-topics.xml, i.e. they start with PressMint-taxonomy- followed by a string describing what the taxonomy contains, in this case the topics that the newspaper articles will be classified into. However, if a taxonomy (or other metadata file) is specific to a particular corpus, then this is indicated by having the file name (and the IDs contained in the file) prefixed by the name of the corpus root, e.g. PressMint-NL-taxonomy-topics.xml for additional topics distinguished by the Dutch corpus. Finally, if a taxonomy (or other separate metadata file) refers to the linguistically annotated corpus only, then it should get the suffix .ana, e.g. PressMint-taxonomy-NER.ana.xml.
  • The file names of the corpus as a whole or of corpus components that have been automatically converted from the source XML into some other format should have the same name as the corpus root or components, respectively, but with the appropriate file extension, e.g. PressMint-SI_1899_KRN-NUK.txt; this is further explained in the Section on Conversions.
  • As discussed in the Chapter on Linguistic annotation, we distinguish the linguistically annotated version of the corpus from the ‘plain-text’ one, with the linguistically annotated version having the additional suffix .ana on the corpus root and components, e.g. PressMint-SI.ana.xml or PressMint-SI_1899_KRN-NUK.ana.xml.
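
The naming rules for component files can be summarised as a regular expression. The Python sketch below is one possible reading of the rules above, not an official validator: it assumes a two-letter upper-case country code, a date given as YYYY, YYYY-MM or YYYY-MM-DD, an optional suffix of ASCII letters, digits and hyphens, and an optional .ana marker. The function name parse_component_name is hypothetical.

```python
import re

# One reading of the component file name rules; root files such as
# PressMint-SI.xml deliberately do not match.
COMPONENT_NAME = re.compile(
    r"PressMint-(?P<country>[A-Z]{2})"            # ISO 3166-1 alpha-2 code
    r"_(?P<date>\d{4}(?:-\d{2}(?:-\d{2})?)?)"     # year, year-month or full date
    r"(?:_(?P<suffix>[A-Za-z0-9-]+))?"            # optional distinguishing suffix
    r"(?P<ana>\.ana)?\.xml"                       # optional .ana, then extension
)


def parse_component_name(filename: str):
    """Split a component file name into its parts, or return None if invalid."""
    match = COMPONENT_NAME.fullmatch(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["ana"] = parts["ana"] is not None
    return parts


print(parse_component_name("PressMint-SI_1899_KRN-NUK.ana.xml"))
print(parse_component_name("PressMint-SI.xml"))
```

The pattern checks only the shape of the date, not whether it is a valid calendar date.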

For distribution, the complete XML corpus should be stored in a directory that has the same name prefix as the corpus root file, extended with the format (e.g. TEI). The directory contains the corpus root file and its metadata files (such as taxonomies), while the corpus components should be in subdirectories, one per year, for example:

PressMint-SI.TEI/PressMint-SI.xml
PressMint-SI.TEI/PressMint-taxonomy-topic.xml
...
PressMint-SI.TEI/1899/PressMint-SI_1899-01-02.xml
PressMint-SI.TEI/1899/PressMint-SI_1899-01-03.xml
PressMint-SI.TEI/1899/PressMint-SI_1899-01-04.xml
...
PressMint-SI.TEI/1900/PressMint-SI_1900-01-02.xml
PressMint-SI.TEI/1900/PressMint-SI_1900-01-03.xml
PressMint-SI.TEI/1900/PressMint-SI_1900-01-04.xml
...

The linguistically annotated version of the corpus is stored separately, with the main directory and, as mentioned, the corpus root, its metadata files for linguistic annotation, and the component file names having the additional suffix .ana, e.g.

PressMint-SI.TEI.ana/PressMint-SI.ana.xml
PressMint-SI.TEI.ana/PressMint-taxonomy-topic.xml
PressMint-SI.TEI.ana/PressMint-taxonomy-NER.ana.xml
...
PressMint-SI.TEI.ana/1899/PressMint-SI_1899-01-02.ana.xml
PressMint-SI.TEI.ana/1899/PressMint-SI_1899-01-03.ana.xml
PressMint-SI.TEI.ana/1899/PressMint-SI_1899-01-04.ana.xml
...
PressMint-SI.TEI.ana/1900/PressMint-SI_1900-01-02.ana.xml
PressMint-SI.TEI.ana/1900/PressMint-SI_1900-01-03.ana.xml
PressMint-SI.TEI.ana/1900/PressMint-SI_1900-01-04.ana.xml
...
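
The directory layout shown in the two listings can be derived mechanically from a component file name. The following Python sketch illustrates this under the assumption that the format part of the directory name is TEI, as in the examples; the function name distribution_path is hypothetical.

```python
from pathlib import PurePosixPath


def distribution_path(component_filename: str) -> PurePosixPath:
    """Return the relative path of a component inside the distribution directory."""
    if not component_filename.endswith(".xml"):
        raise ValueError("expected an .xml component file name")
    stem = component_filename[: -len(".xml")]
    is_ana = stem.endswith(".ana")
    if is_ana:
        stem = stem[: -len(".ana")]
    if "_" not in stem:
        raise ValueError("not a component file name (no date part)")
    # The root name precedes the first underscore; the year starts the date.
    root_name, rest = stem.split("_", 1)
    year = rest[:4]
    top = f"{root_name}.TEI" + (".ana" if is_ana else "")
    return PurePosixPath(top) / year / component_filename


print(distribution_path("PressMint-SI_1899-01-02.xml"))
print(distribution_path("PressMint-SI_1900-01-03.ana.xml"))
```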

3. General requirements

This section gives some general requirements a PressMint corpus has to meet, in particular those relating to the characters in a corpus and the use of standards. It also details the attributes expected on the <teiCorpus> and <TEI> tags, as well as the pointing and temporal attributes.

3.1. Characters

The corpus should be encoded in Unicode, using the UTF-8 character encoding, at least for European languages. In cases where the original contains characters from the Unicode Private Use Area, these should, if possible, be given their closest Unicode equivalents or substituted by the Unicode replacement character U+FFFD. End-of-line hyphens, if present in the source files, should be removed, and the split words joined in order to enhance searching the corpus and to simplify linguistic processing.

The following characters, esp. prevalent when the source documents were in Word or HTML, deserve special mention:

  • TAB (U+0009) character helps the alignment of strings on successive lines. As PressMint is not interested in preserving the layout, all TAB characters must be substituted by space characters (U+0020).
  • NO-BREAK SPACE (U+00A0) prevents, with some applications, an automatic line break at its position and also collapsing such consecutive characters into a single space. As the use of this character complicates (or breaks) further processing, esp. linguistic annotation, this character must be substituted by the normal space character (U+0020). The same holds for other variants of Unicode space characters (U+2000 - U+200A), which are, however, used much less frequently.
  • ZERO WIDTH NO-BREAK SPACE (U+FEFF), also used as the Byte Order Mark (BOM) in Windows files, should be removed.
  • NON-BREAKING HYPHEN (U+2011), similarly to NO-BREAK SPACE, prevents a line break, in this case following its position. By similar reasoning as above, this character should be substituted by the normal hyphen character ('-', U+002D).
  • SOFT HYPHEN (U+00AD) indicates that a word can be hyphenated at that point. Occurrences of this character should be removed from the corpus.

Text-bearing elements should also not start or end with space characters, and sequences of whitespace characters should be changed into a single space.
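
The character substitutions listed above could be applied to the text content of an element roughly as in the following Python sketch. It is an illustration only, with a hypothetical function name; it covers the five special characters and the whitespace rule, but does not handle Private Use Area characters or the joining of end-of-line hyphenated words.

```python
import re

# TAB, NO-BREAK SPACE and the Unicode space variants U+2000 - U+200A.
_SPACE_VARIANTS = re.compile("[\u0009\u00A0\u2000-\u200A]")
# ZERO WIDTH NO-BREAK SPACE (BOM) and SOFT HYPHEN are simply dropped.
_REMOVED = re.compile("[\uFEFF\u00AD]")
_WHITESPACE_RUN = re.compile(r"\s+")


def normalise_text(text: str) -> str:
    """Apply the character substitutions and whitespace rules to a string."""
    text = _SPACE_VARIANTS.sub(" ", text)
    text = _REMOVED.sub("", text)
    text = text.replace("\u2011", "-")       # NON-BREAKING HYPHEN -> '-'
    text = _WHITESPACE_RUN.sub(" ", text)    # collapse whitespace sequences
    return text.strip()                      # no leading or trailing space


sample = "\ufeffLjub\u00adljana\u00a0\tje\u2003 lepo\u2011lepa "
print(normalise_text(sample))
```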

3.2. Standard values

Whenever possible, PressMint uses standards for information coding. In particular, the following information must be standardised:

  • As the identity of a PressMint corpus is determined by the country that is contributing the corpus, its code appears in many places. For specifying these codes, the ISO 3166 standard should be used, in particular ISO 3166-1 alpha-2 for the two letter codes of the countries.
  • The codes for the languages used in the corpora (i.e. the possible values of the xml:lang attribute) should follow BCP 47 (cf. also xml:lang in XML document schemas). Essentially, this means that the value for a language code should have two letters, following ISO 639-1 or, and only if a two-letter code does not exist for a language, the three-letter ISO 639-2/T code. PressMint corpora will use (except for Great Britain) at least two languages, i.e. the language that the newspapers are written in, which we will call the local language, and English, as the meta-language, which is (also) used in the metadata.
  • Temporal, i.e. time-related information is typically stored in the when, from and to attributes of various elements. To specify a date or time as the value of these attributes, formatting according to the ISO 8601 standard should be used, e.g. 1888-04-01 for the 1st of April 1888. More information on temporal attributes is given in the Section on Temporal attributes.

3.3. Attributes of top-level elements

The Chapter on Overall corpus structure introduced the top level elements of the corpus root file and of the component files (i.e. the <teiCorpus> and <TEI> elements), but did not elaborate on their attributes; these are presented in this section.

The corpus root has three required attributes, as shown below:

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
           xml:id="PressMint-SI"
           xml:lang="sl">

All three attributes can also be used on any other element, and are thus of special importance:
  • xmlns determines the namespace of the element, and this should always be the TEI namespace, i.e. http://www.tei-c.org/ns/1.0 (apart from the elements using the XInclude directive, cf. the Section on Use of XInclude). Note that lower level elements inherit the namespace of the superordinate element, unless explicitly overridden, so it is only necessary to specify the TEI namespace on the root element of a file.
  • xml:id is an attribute from the (implicitly assumed) XML namespace, and gives the identifier of the element bearing it. The value of an ID should be unique in the corpus as a whole and should obey the format requirements defined by the W3C. For the corpus root, as well as for the components, it is required that this top-level identifier is identical to the file name (without the file extension). The xml:id is a global attribute, so any element can have it. While this is not required in general, it is necessary for any element that is referred to (via this same ID) by some other element, such as many elements in the <teiHeader>, as is explained in the Section on Corpus metadata. The subordinate elements in the text that have an ID (such as page breaks or, more accurately, page beginnings) are recommended to have the top-level xml:id as a prefix and to indicate the element name in the ID. For example, if the top-level ID is PressMint-SI_1899-01-02, the first page beginning would have the ID PressMint-SI_1899-01-02.pb1.
  • xml:lang is also a global attribute and gives the language code of the text content of the element; for the corpus root this means the content of its TEI header, while for corpus components this is the textual content of their TEI headers and <text> elements. The convention is that the language of the text content of an element is determined by the value of the first xml:lang attribute on its ancestor axis. In cases where the content is multilingual, the language code should be that of the majority language. When the proportions of the languages are about equal, the mul code for multiple languages can also be used.
A corpus component also has the same three required attributes, but can additionally have the ana attribute, which associates the text with a category or categories defined in one or more taxonomies:

<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xml:id="PressMint-SI_1899-01-02"
     xml:lang="sl"
     ana="#SI-frequency.daily">

As with the corpus root, the component also sets the TEI namespace and gives the language of its textual content, while its xml:id, of course, identifies the particular component. The ana attribute is a pointing attribute, and we introduce these attributes in the next section.
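
The requirement that the top-level xml:id equals the file name without its extension is easy to check automatically. The Python sketch below shows one way to do so; it assumes that only the final .xml counts as the file extension (so that an .ana marker would remain part of the expected ID), and the function name top_level_id_matches is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# ElementTree exposes xml:id under the fixed XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def top_level_id_matches(path: str, xml_text: str) -> bool:
    """Check that the root element's xml:id equals the file name minus .xml."""
    name = PurePosixPath(path).name
    expected = name[: -len(".xml")] if name.endswith(".xml") else name
    return ET.fromstring(xml_text).get(XML_ID) == expected


component = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xml:id="PressMint-SI_1899-01-02" xml:lang="sl"/>'
)
print(top_level_id_matches("1899/PressMint-SI_1899-01-02.xml", component))
```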

3.4. Pointing attributes

The PressMint encoding can use pointing attributes for various purposes, e.g. for references to the IDs of the facsimile elements or for references to taxonomy categories.

While a few elements have dedicated pointing attributes, there are some generally used ones. They share the characteristics that they can all be used by a number of different elements and that their value is a series of pointers, i.e. a white-space delimited sequence of references to the values of some xml:id attribute in the corpus or, in general, to a URI. The attributes are:

  • facs gives the pointer(s) to the elements of the <facsimile> element (cf. the Section on Newspaper facsimile).
  • ana serves to provide an analysis or to classify an element according to some pre-determined vocabulary. In PressMint the target element will typically be a category in a taxonomy.
  • ref provides an explicit reference to the full definition or identity for the entity being named. In PressMint it could be used e.g. for referring to the Wikimedia entry for a person. The value of this attribute is often, but not always, a URL, e.g. for associating a place name with its GeoNames URL.
To illustrate, the example below gives some elements that contain one or more of these attributes:

<p facs="#PressMint-SI_1899_KRN-NUK.page1 #PressMint-SI_1899_KRN-NUK.page2">
  <pb facs="#PressMint-SI_1899_KRN-NUK.page1"/>
  <name ref="https://www.geopedia.world/#T12_x1614772.8705537645_y5789479.6377019035_s12_b2345">Ljubljana</name>
</p>
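
Because the value of a pointing attribute is a white-space delimited sequence, a processing script will typically split it before resolving the individual pointers. The following Python sketch separates local references (those starting with #, which point to an xml:id) from all other pointers, such as URLs. The function name split_pointers is hypothetical, and the classification is a simplifying assumption.

```python
def split_pointers(value: str) -> tuple[list[str], list[str]]:
    """Split a pointing attribute value into local IDs and other pointers."""
    local_ids: list[str] = []
    other: list[str] = []
    for pointer in value.split():
        if pointer.startswith("#"):
            local_ids.append(pointer[1:])   # drop the leading '#'
        else:
            other.append(pointer)           # e.g. a full URL
    return local_ids, other


facs = "#PressMint-SI_1899_KRN-NUK.page1 #PressMint-SI_1899_KRN-NUK.page2"
print(split_pointers(facs))
```

Note that a URL may itself contain a # character, as in the GeoNames-style example above; only a leading # marks a local reference here.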

3.5. Temporal attributes

PressMint makes use of temporal information, in particular to encode when a newspaper was published. As mentioned in the Section on Standard values, the ISO 8601 format should be used to specify the dates or times.

The following attributes are used to specify temporal information:

  • The when attribute is used when the temporal information refers to a point in time, typically a date, and is used e.g. to give the date when a corpus text was published, or when a change in the corpus was made.
  • The from and to attributes give the starting and ending date or time of an interval, e.g. the time period the corpus covers. If only one of the two attributes is present, the assumption is that the interval extends back at least to the start (if from is missing) or forward at least to the end (if to is missing) of the time period that the particular PressMint corpus covers. Similarly, if both attributes are missing, the assumption is that the interval covers the complete time period of the PressMint corpus.
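
The defaulting rule for missing from and to attributes can be expressed compactly. The Python sketch below returns the narrowest interval consistent with the rule, i.e. it substitutes the bounds of the corpus time period for a missing attribute; the function and parameter names are hypothetical, and dates are kept as ISO 8601 strings.

```python
from typing import Optional


def resolve_interval(
    from_: Optional[str],
    to: Optional[str],
    corpus_from: str,
    corpus_to: str,
) -> tuple[str, str]:
    """Fill in a missing from/to with the bounds of the corpus time period."""
    start = from_ if from_ is not None else corpus_from
    end = to if to is not None else corpus_to
    return start, end


# Only 'to' is given, so the interval is assumed to start with the corpus.
print(resolve_interval(None, "1900-12-31", "1899-01-01", "1914-12-31"))
```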

4. Corpus metadata

As mentioned, <teiCorpus> and <TEI> elements contain the obligatory <teiHeader> element, which stores the metadata for the corpus root and components. In this section we explain and give examples of the required and optional metadata that is contained in the <teiHeader>, proceeding through its various elements, and there distinguishing which parts and what content is appropriate for the corpus root, and which for a corpus component.

As a general remark, most metadata contains free text, and it is a requirement of PressMint that this data is given in English, to help researchers from other countries understand it. It is also recommended to give it in the local language in which the newspapers are (mainly) written, so that local researchers can use it in their native tongue.

Note that some obligatory PressMint metadata in the <teiHeader> elements is redundant, in the sense that it can be automatically inserted, and, if missing, is so inserted by the PressMint corpus finalisation scripts (cf. the Section on Finalisation of corpora). This means that the compilers of a particular corpus need not code this metadata in their <teiHeader> elements. To highlight which metadata elements are automatically inserted, this information is given as in-line notes in the following sections. These notes look like this:

Note: Note specifying which part of the metadata can be omitted.

A PressMint <teiHeader> contains three obligatory elements: the file description, <fileDesc>, the encoding description, <encodingDesc>, and the profile description, <profileDesc>, and an optional revision description, <revisionDesc>:

<teiHeader>
  <fileDesc>...</fileDesc>
  <encodingDesc>...</encodingDesc>
  <profileDesc>...</profileDesc>
  <revisionDesc>...</revisionDesc>
</teiHeader>

Below we explain each of these elements in turn.

4.1. File description

The file description, <fileDesc>, is composed of five obligatory elements, namely the title statement, <titleStmt>, the edition statement, <editionStmt>, the extent, <extent>, the publication statement, <publicationStmt>, and the source description, <sourceDesc>:

<fileDesc>
  <titleStmt>...</titleStmt>
  <editionStmt>...</editionStmt>
  <extent>...</extent>
  <publicationStmt>...</publicationStmt>
  <sourceDesc>...</sourceDesc>
</fileDesc>

Note: <editionStmt> and <extent> are automatically inserted by the PressMint finalisation scripts.

4.1.1. Title statement

The title statement, <titleStmt>, gives the title of the corpus root or component (for a component, this includes the name and date of the newspaper it contains), the persons responsible for compiling the corpus, and the funder(s) of the project.

This structure is exemplified by the following corpus root title statement:

<titleStmt>
  <title type="main">Korpus starejših slovenskih časopisov PressMint-SI [PressMint]</title>
  <title type="main" xml:lang="en">Slovenian historical newspaper corpus PressMint-SI [PressMint]</title>
  <respStmt>
    <persName ref="https://orcid.org/0000-0002-1560-4099">Tomaž Erjavec</persName>
    <resp>Kodiranje PressMint TEI XML</resp>
    <resp xml:lang="en">PressMint TEI XML corpus encoding</resp>
  </respStmt>
  <funder>
    <orgName>Raziskovalna infrastruktura CLARIN</orgName>
    <orgName xml:lang="en">The CLARIN research infrastructure</orgName>
  </funder>
  <funder>
    <orgName>Slovenska raziskovalna infrastruktura CLARIN.SI</orgName>
    <orgName xml:lang="en">The Slovenian research infrastructure CLARIN.SI</orgName>
  </funder>
</titleStmt>

The title statement starts with two titles (one in English and one in the local language), with the appropriate language code possibly inherited from a superordinate element.

The main title has a formulaic structure ‘<Country_name> historical newspaper corpus PressMint-<Country_code> [PressMint]’, with an equivalent structure for the local language. Note that the corpus ‘stamp’ in square brackets can also be ‘[PressMint.ana]’ for the linguistically annotated version of the corpus (as explained in the Chapter on Linguistic annotation) or ‘[PressMint SAMPLE]’ for corpus data samples, as available on the PressMint GitHub repository.

Note: The title stamp is automatically inserted by the PressMint finalisation scripts.

After the titles come one or more responsibility statements, <respStmt>, each one containing one or more person names, <persName>, with an optional ref attribute, giving the (typically ORCID) URL, where more information about the person can be found, and the responsibility element <resp>, which specifies what responsibility the statement is about.

In a similar manner, the <funder> elements give information on the organisations which have financially contributed to the compilation of the corpus, with the names of the organisations given in the <orgName> elements.

A corpus component has a very similar title statement to the corpus root, except that certain elements specify the metadata of the component, rather than of the complete corpus. It can also contain redundant metadata, in particular the responsibility statement and the funder, as illustrated in the example below:

<titleStmt>
  <title>Korpus starejših slovenskih časopisov PressMint-SI, Kmetijske in rokodelske novice, 16. 4. 1899 [PressMint]</title>
  <title xml:lang="en">Slovenian historical newspaper corpus PressMint-SI, "Agricultural and Artisan News", April 16th, 1899 [PressMint]</title>
  <respStmt>
    <persName ref="https://orcid.org/0000-0002-1560-4099">Tomaž Erjavec</persName>
    <resp>Kodiranje PressMint TEI XML</resp>
    <resp xml:lang="en">PressMint TEI XML corpus encoding</resp>
  </respStmt>
  <funder>
    <orgName>Raziskovalna infrastruktura CLARIN</orgName>
    <orgName xml:lang="en">The CLARIN research infrastructure</orgName>
  </funder>
  <funder>
    <orgName>Slovenska raziskovalna infrastruktura CLARIN.SI</orgName>
    <orgName xml:lang="en">The Slovenian research infrastructure CLARIN.SI</orgName>
  </funder>
</titleStmt>

In the example it can be seen that the title of a corpus component is simply an extension of the corpus root title, as it also gives the name and date of the newspaper that the component contains. Note that the component title must be unique in the complete corpus.

Note: Again, the title stamp is automatically inserted by the PressMint finalisation scripts.

4.1.2. Edition statement

PressMint corpora have their edition statement, <editionStmt>, both in the corpus root and in the components. As illustrated below, the only element it contains is <edition>:

<editionStmt>
  <edition>1.0</edition>
</editionStmt>

We use semantic versioning to specify the version of the corpus, i.e. giving the version number, where a new major version means substantial changes to the corpus, while the minor version is reserved for e.g. correcting errata or other minor changes. We do not use the patch number. It should be noted that - at least so far - all the PressMint corpora were released together, so that they are all of the same edition, i.e. have the same version number.

Note: The <editionStmt> is automatically inserted by the PressMint finalisation scripts.

4.1.3. Extents

The <extent> element gives information on selected sizes of the complete corpus (in the corpus root) or of one corpus component, as illustrated below in the case of a corpus root extent:

<extent>
  <measure unit="texts" quantity="75122" xml:lang="sl">75.122 besedil</measure>
  <measure unit="texts" quantity="75122" xml:lang="en">75,122 texts</measure>
  <measure unit="words" quantity="20190034" xml:lang="sl">20.190.034 besed</measure>
  <measure unit="words" quantity="20190034" xml:lang="en">20,190,034 words</measure>
</extent>

PressMint requires two sizes to be given, preferably in both languages: the number of texts and the number of words, which are distinguished by the unit attribute. The exact quantity is given in the quantity attribute, while the text content of <measure> gives the quantity together with the unit; if possible, the number here should contain the thousands separator appropriate for the language.

Note: The <extent> is automatically inserted by the PressMint finalisation scripts; however, this is only done for the extents in English. If the corpus compilers want to have the extents also in their local language, they must insert these elements themselves.
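
To make the relation between the quantity attribute and the text content of <measure> concrete, the following Python sketch builds an English <measure> element from a unit and a count, formatting the number with the comma as thousands separator. It is only an illustration with a hypothetical function name, and does not reproduce the actual finalisation scripts.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def english_measure(unit: str, quantity: int) -> str:
    """Serialise an English <measure> for the given unit and exact quantity."""
    measure = ET.Element(
        "measure",
        {"unit": unit, "quantity": str(quantity), XML_LANG: "en"},
    )
    # Text content repeats the quantity with a thousands separator.
    measure.text = f"{quantity:,} {unit}"
    return ET.tostring(measure, encoding="unicode")


print(english_measure("texts", 75122))
print(english_measure("words", 20190034))
```

A local-language variant would additionally need the translated unit name and the separator conventional for that language, e.g. the full stop in the Slovenian example above.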

4.1.4. Publication statement

The publication statement <publicationStmt> must appear in the corpus root as well as, in identical form, in the corpus components. As illustrated below, it contains information about the publisher of the corpus, the persistent identifier where the complete corpus can be found, the licence under which it is distributed, and when it was released:

<publicationStmt>
  <publisher>
    <orgName xml:lang="sl">Raziskovalna infrastruktura CLARIN</orgName>
    <orgName xml:lang="en">CLARIN research infrastructure</orgName>
    <ref target="https://www.clarin.eu/">www.clarin.eu</ref>
  </publisher>
  <idno type="URI" subtype="handle">http://hdl.handle.net/11356/8943</idno>
  <availability status="free">
    <licence>http://creativecommons.org/licenses/by/4.0/</licence>
    <p xml:lang="sl">To delo je ponujeno pod
      <ref target="http://creativecommons.org/licenses/by/4.0/">Creative Commons Priznanje avtorstva 4.0 mednarodna licenca</ref>.</p>
    <p xml:lang="en">This work is licensed under the
      <ref target="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ref>.</p>
  </availability>
  <date when="2021-06-11">11. 6. 2023</date>
</publicationStmt>

The <publisher> is, at least for the corpora produced in the scope of the CLARIN PressMint project, the CLARIN research infrastructure, and the element also gives the home page of the infrastructure. The ‘identifier number’ element, <idno>, specifies via its type and subtype attributes with fixed values URI and handle that the identifier is a handle, and contains the handle where the complete corpus corresponding to the specified version can be found.

Note: The handle <idno> is automatically inserted by the PressMint finalisation scripts.

The <availability> specifies, via its <licence> element, the fixed-value CC BY 4.0 URL, and in the following paragraphs gives a prose description of the licence, including its URL via the target attribute of <ref>. As usual, the textual information is given in both languages. Finally, the <date> gives the date of the release, where the when attribute gives the date in the ISO 8601 format, while the textual content can give it according to the conventions used in the local language.

Note: The <date> element is automatically inserted by the PressMint finalisation scripts; its value is the date when the finalisation is performed.

4.1.5. Source description

The source description <sourceDesc> of the corpus root encodes the immediate digital source(s) of the PressMint corpus in the <bibl> element(s), as shown in the following example:

<sourceDesc>
  <bibl>
    <author>Dobranić, Filip</author>
    <author>Evkoski, Bojan</author>
    <author>Ljubešić, Nikola</author>
    <title type="main" xml:lang="sl">Korpus slovenske periodike (1771-1914) sPeriodika 1.0</title>
    <title type="main" xml:lang="en">Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0</title>
    <idno type="URI" subtype="handle">http://hdl.handle.net/11356/8943</idno>
    <date>2023</date>
    <bibl>
      <title type="main" xml:lang="sl">Digitalna knjižnica Slovenije dLib</title>
      <title type="main" xml:lang="en">Digital library of Slovenia dLib</title>
      <idno type="URI">https://dlib.si/</idno>
    </bibl>
  </bibl>
</sourceDesc>

Apart from the <author>s and bilingual <title>s, the source description should also contain the <idno> element with the fixed type URI, which gives the URL where the source is available, if such exists, while the <date> gives the date of publication of the source. As can be seen, the source corpus was itself based on publications available in the digital library of Slovenia, and this is encoded in the nested <bibl> element.
For corpus components the source description should encode the edition of the newspaper that the component encodes, i.e. the name of the newspaper, marked as <title level="j">, and the <date> when the edition was published; note that here the when attribute should contain the date in ISO format, while the content can follow the convention appropriate for the local language. If available, the URL of the digital edition on the Web should also be given. In cases where the whole component corresponds to a particular article in the newspaper, the title of the article is marked as <title level="a">, and if the number of the volume or issue of the newspaper is known, this can also be encoded in <biblScope> with the appropriate unit. Further optional metadata, such as the <publisher> and <pubPlace> (place of publication), can also be given, as shown in the following example:

<sourceDesc>
  <bibl>
    <title level="a">Gospodarske skušnje</title>
    <title level="j">Kmetijske in rokodelske novice</title>
    <publisher>Jožef Blaznik</publisher>
    <pubPlace>Ljubljana</pubPlace>
    <date when="1899-04-16">16. 4. 1899</date>
    <biblScope unit="volume">38</biblScope>
    <biblScope unit="issue">3</biblScope>
    <idno type="URI">https://dlib.si/details/URN:NBN:SI:DOC-000TTDCE/</idno>
  </bibl>
</sourceDesc>

4.2. Encoding description

The encoding description <encodingDesc> of the corpus root contains the following elements:

<encodingDesc>
  <projectDesc>...</projectDesc>
  <editorialDecl>...</editorialDecl>
  <tagsDecl>...</tagsDecl>
</encodingDesc>

In contrast, the encoding description of a corpus component contains only two elements, namely (and redundantly) the <projectDesc> and the <tagsDecl>.

4.2.1. Project description

The project description <projectDesc> of the corpus root contains a short description of the project in the scope of which the corpus was compiled:
<projectDesc>
  <p xml:lang="en"><ref target="https://www.clarin.eu/pressmint">PressMint</ref> is a
    project that aims to (1) create a multilingual set of corpora of historical
    newspapers uniformly encoded according to the
    <ref target="https://clarin-eric.github.io/PressMint/">PressMint encoding guidelines</ref>;
    (2) add linguistic annotations to the corpora; and (3) make the
    corpora available through concordancers.</p>
  <p xml:lang="sl">Projekt <ref target="https://www.clarin.eu/pressmint">PressMint</ref> bo izdelal večjezične,
    primerljive, označene in prevedene interoperabilne korpuse starejših Evropskih
    časopisov z začetka 20. stoletja. Korpusi PressMint bodo odprto dostopni, tako za
    prenos v raznovrstnih formatih, kot tudi v več spletnih platformah za analizo
    korpusov.</p>
</projectDesc>

Note: The project description in English is automatically inserted by the PressMint finalisation scripts.

4.2.2. Editorial declaration

The editorial declaration, <editorialDecl>, is used only in the corpus root and contains prose descriptions of the editorial decisions made in the process of compiling the corpus, along several dimensions, in particular what types, if any, of <correction>, <normalization>, <quotation>, <hyphenation>, and <segmentation> were performed on the texts of the corpus. The example below illustrates the use of these elements:
<editorialDecl>
  <quotation>
    <p xml:lang="en">Quotation marks have been left in the text and are not explicitly marked up.</p>
  </quotation>
  <hyphenation>
    <p xml:lang="en">A processing step for the sPeriodika corpus attempted to join end-of-line
      hyphenated words, although it was not successful in all cases. Previously hyphenated words are
      not marked up as such.</p>
  </hyphenation>
  <correction>
    <p xml:lang="en">In the source sPeriodika corpus the OCR-ed texts were corrected with
      <ref target="https://github.com/clarinsi/csmtiser">cSMTiser</ref>, a text normalisation tool based on
      character-level machine translation. No additional correction was performed in the scope of
      PressMint.</p>
  </correction>
  <normalization>
    <p xml:lang="en">Text has not been normalised, except for spacing.</p>
  </normalization>
  <segmentation>
    <p xml:lang="en">The texts are split by page breaks, and then segmented into paragraphs.</p>
  </segmentation>
</editorialDecl>

4.2.3. Tags declaration

The tags declaration, <tagsDecl>, gives the count of all the XML tags used in the data part (so, not in the TEI header) of the whole corpus (for the corpus root) or of an individual corpus component (for a component). To distinguish the TEI elements from the possible use of elements from other namespaces, a <namespace> element giving the TEI namespace in its name attribute is introduced first. Inside it, each TEI tag is listed in its own <tagUsage> element, with the attribute gi giving the name of the tag and the attribute occurs giving the number of occurrences, as shown in the following example:
<tagsDecl>
  <namespace name="http://www.tei-c.org/ns/1.0">
    <tagUsage gi="text" occurs="414"/>
    <tagUsage gi="body" occurs="414"/>
    <tagUsage gi="pb" occurs="75122"/>
    <tagUsage gi="p" occurs="280971"/>
    <tagUsage gi="gap" occurs="789"/>
  </namespace>
</tagsDecl>

Note: The <tagsDecl> element is automatically inserted by the PressMint finalisation scripts.
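
The counting behind <tagsDecl> can be sketched as follows. This is only a minimal illustration in Python, not the PressMint finalisation script itself; the miniature sample component and the function names count_text_tags and build_tags_decl are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical miniature corpus component, used only for illustration.
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><pb/><p>First.</p><p>Second <gap reason="editorial"/></p></body></text>
</TEI>"""


def count_text_tags(tei_root):
    """Count TEI elements inside <text> (the data part), ignoring the header."""
    counts = Counter()
    for text in tei_root.iter(f"{{{TEI_NS}}}text"):
        for elem in text.iter():  # includes the <text> element itself
            # ElementTree spells qualified names as "{namespace}local".
            if elem.tag.startswith(f"{{{TEI_NS}}}"):
                counts[elem.tag.split("}", 1)[1]] += 1
    return counts


def build_tags_decl(counts):
    """Build a <tagsDecl> element with one <tagUsage> per counted tag."""
    tags_decl = ET.Element("tagsDecl")
    namespace = ET.SubElement(tags_decl, "namespace", name=TEI_NS)
    for gi, occurs in counts.items():
        ET.SubElement(namespace, "tagUsage", gi=gi, occurs=str(occurs))
    return tags_decl


counts = count_text_tags(ET.fromstring(SAMPLE))
print(ET.tostring(build_tags_decl(counts), encoding="unicode"))
```

Elements in the TEI header (here the <title>) are deliberately left out of the count, in line with the definition above.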

4.2.4. Class declaration and taxonomies

The class declaration, <classDecl> is used only in the corpus root and contains only definitions of (most) controlled vocabularies used in PressMint corpora. These vocabularies, possibly hierarchically organised, are encoded using the <taxonomy> element.

The taxonomies themselves are stored in separate files, and are typically PressMint-wide, i.e. all corpora use the same taxonomies. The taxonomies are included in the document root with the XInclude directive, as illustrated below:
<classDecl>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="PressMint-taxonomy-topic.xml"/> <!-- Common taxonomy of topics -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="PressMint-NL-taxonomy-topics.xml"/> <!-- Dutch additional taxonomy of topics -->
</classDecl>
As can be seen, the first taxonomy is a general PressMint one, while the second one is specific to the Dutch corpus. The corpus-specific taxonomy is distinguished by the country code NL (followed by a hyphen) in its filename.
To illustrate the structure of a taxonomy element, we give below the start of the common taxonomy of topics, where we include only the descriptions in English and German:
<classDecl>
  ...
  <taxonomy xml:id="ParlaMint-taxonomy-topic" xml:lang="mul">
    <desc xml:lang="en">
      <term>Topics</term>: Comparative Agendas Project
      <ref target="https://www.comparativeagendas.net/pages/master-codebook">CAP major topic labels</ref>
    </desc>
    <desc xml:lang="de">
      <term>Themen</term>: Deutsche Übersetzungen der im Comparative Agendas Projekt (CAP)
      <ref target="https://www.comparativeagendas.net/pages/master-codebook">entstandenen Annotations-Tags für Themen</ref>
    </desc>
    <category xml:id="argic">
      <catDesc xml:lang="en">
        <term>Agriculture</term>
      </catDesc>
      <catDesc xml:lang="de">
        <term>Landwirtschaft</term>
      </catDesc>
    </category>
    <category xml:id="civil">
      <catDesc xml:lang="en">
        <term>Civil Rights</term>
      </catDesc>
      <catDesc xml:lang="de">
        <term>Bürgerrechte</term>
      </catDesc>
    </category>
    ...
  </taxonomy>
</classDecl>
A <taxonomy> thus first describes, via <desc>, what it is a taxonomy of, and then lists (the possibly nested) categories in <category> elements. Crucial here are the values of their xml:id attributes, by which a category is referred to via the ana attribute of some other element, as was already explained in the Section on Attributes of top-level elements, in connection with classifying a corpus component via the ana attribute. The taxonomy category then bilingually glosses its meaning in its <catDesc> elements, which should always first contain the short name of the category, encoded in the <term> element.

PressMint requires only one taxonomy to be defined in the class declaration of the corpus root (as well as an additional one for the linguistically annotated corpus, as further described in the Section on Linguistic metadata). As mentioned, the taxonomies are defined globally and are available as part of the data in the PressMint GitHub repository, and there is a special procedure for modifying them, in particular for inserting translations into a new language.

The obligatory taxonomy is:

  • The CAP topics taxonomy, which gives major topic labels of the Comparative Agendas Project.

Furthermore, there is an obligatory taxonomy which pertains to the linguistically analysed version of the corpus only, cf. the Section on Linguistic taxonomies.

4.2.5. Prefix definitions

Pointing attributes, such as url or ana, take as their value a reference, or a space-delimited series of references, to a URL and/or the value of an xml:id attribute. If the reference is to an ID in the same document, then it is prefixed with the hash character, #, e.g. #argic, and if it is to an ID in another XML document, then the hash follows the URL of the document, e.g. https://nl.ijs.si/ME/V6/msd/tables/msd-fslib2-sl.xml#Vmpr1p.

Because complete URLs tend to be long, which is especially inconvenient when such references are given on many elements, TEI introduces the so-called Abbreviated Pointers, whereby a reference can be given in the form of a prefix, which is separated by a colon from the local part of the reference, and the value of this prefix is determined via the <prefixDef> element in the <encodingDesc> of the TEI header.

PressMint can use this mechanism for linguistic annotations with a closed vocabulary, in particular for corpus-specific analytical part-of-speech tags (cf. the Section on Word-level annotation). The example below illustrates the prefix definitions for optional MULTEXT-East tags:
<listPrefixDef>
  <prefixDef ident="mte" matchPattern="(.+)"
      replacementPattern="https://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#$1">
    <p xml:lang="en">Private URIs with this prefix point to feature-structure elements defining the Slovenian MULTEXT-East Version 6 MSDs.</p>
  </prefixDef>
</listPrefixDef>
The specialised element for listing prefix definitions, <listPrefixDef>, gives a (series of) prefix definitions, i.e. <prefixDef> elements. A prefix definition defines its prefix as the value of the ident attribute, and then specifies a regular expression that matches the part of the reference after the prefix in its matchPattern attribute, and its substitution as the value of the replacementPattern attribute. The prefix definition above thus defines the mte prefix, so for any reference with this prefix, e.g. mte:Vmpr1p, the part after the prefix (Vmpr1p) is matched against (.+), and the matched part (here the entire tag Vmpr1p) is substituted for $1 in the replacement pattern, i.e. it is appended after the URL of the document and the hash character, so that mte:Vmpr1p gives https://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#Vmpr1p.

Finally, each prefix definition also contains a possibly bi-lingual paragraph explaining the definition.
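
The expansion of an abbreviated pointer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the PressMint scripts: the function name expand_pointer and the PREFIX_DEFS dictionary are hypothetical, the values are copied from the <prefixDef> example above, and Python's regular expressions are used as an approximation of the TEI pattern syntax.

```python
import re

# Hypothetical in-memory form of <prefixDef>: ident -> (matchPattern, replacementPattern).
PREFIX_DEFS = {
    "mte": ("(.+)", "https://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#$1"),
}


def expand_pointer(pointer, prefix_defs=PREFIX_DEFS):
    """Expand prefix:local into a full URI; other references are returned unchanged."""
    prefix, sep, local = pointer.partition(":")
    if not sep or prefix not in prefix_defs:
        return pointer  # e.g. "#argic" or an ordinary URL
    match_pattern, replacement = prefix_defs[prefix]
    match = re.fullmatch(match_pattern, local)
    if match is None:
        return pointer
    # Substitute $1, $2, ... in the replacement pattern with the matched groups.
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), replacement)


print(expand_pointer("mte:Vmpr1p"))
```

References that are local IDs or complete URLs do not start with a defined prefix and pass through unchanged.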

4.3. Profile description

The profile description, <profileDesc> is the third main division of the metadata provided by the TEI header. It contains a description of non-bibliographic aspects of the corpus, in particular, the date range of the corpus. It is only used in the corpus root and contains two elements:
<profileDesc>
  <settingDesc>...</settingDesc>
  <langUsage>...</langUsage>
</profileDesc>
We explain the contents of each element in the following sections.

4.3.1. Setting description

The setting description, <settingDesc>, is used only in the corpus root and contains only one element, <setting>, which in turn gives information on the date (year) range that the corpus covers:
<settingDesc>
  <setting>
    <date from="1870" to="1915">1870--1915</date>
  </setting>
</settingDesc>

4.3.2. Language usage

The language usage, <langUsage> is the last element of the profile description of a corpus root and defines the languages that are used in the corpus. Typically the language use will define (bilingually) only two languages, the local language and English, as the language used in the metadata, for example:
<langUsage>
  <language ident="sl" xml:lang="sl">slovenski</language>
  <language ident="en" xml:lang="sl">angleški</language>
  <language ident="sl" xml:lang="en">Slovenian</language>
  <language ident="en" xml:lang="en">English</language>
</langUsage>
In cases where the corpus contains more than one language, the percentage of their use can also be indicated in the usage attribute of the <language> elements, as illustrated in the example below:
<langUsage>
  <language ident="en" xml:lang="en">English</language>
  <language ident="en" xml:lang="nl">Engels</language>
  <language default="true" usage="55" ident="nl" xml:lang="en">Dutch</language>
  <language default="true" usage="55" ident="nl" xml:lang="nl">Nederlands</language>
  <language usage="45" ident="fr" xml:lang="en">French</language>
  <language usage="45" ident="fr" xml:lang="nl">Frans</language>
</langUsage>
Note that only one of the local languages should have @default="true".
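
A simple consistency check of a <langUsage> element could look like the sketch below. It is an assumption-laden illustration rather than a PressMint tool: the function names are hypothetical, the sample omits the TEI namespace for brevity, and the expectation that the usage percentages add up to 100 is our own reading, not a stated requirement.

```python
import xml.etree.ElementTree as ET

# Sample based on the multilingual example above (namespace omitted for brevity).
LANG_USAGE = """<langUsage>
  <language ident="en" xml:lang="en">English</language>
  <language default="true" usage="55" ident="nl" xml:lang="en">Dutch</language>
  <language default="true" usage="55" ident="nl" xml:lang="nl">Nederlands</language>
  <language usage="45" ident="fr" xml:lang="en">French</language>
  <language usage="45" ident="fr" xml:lang="nl">Frans</language>
</langUsage>"""


def default_languages(lang_usage):
    """Return the set of language codes marked with default="true"."""
    return {lang.get("ident") for lang in lang_usage.iter("language")
            if lang.get("default") == "true"}


def usage_by_language(lang_usage):
    """Map each language code to its usage percentage, where one is given."""
    return {lang.get("ident"): int(lang.get("usage"))
            for lang in lang_usage.iter("language")
            if lang.get("usage") is not None}


root = ET.fromstring(LANG_USAGE)
defaults = default_languages(root)
usage = usage_by_language(root)
print(defaults, usage, sum(usage.values()))
```

Because each language is glossed in several languages, the check works on the distinct ident values rather than on the number of <language> elements.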

4.4. Revision description

The revision description, <revisionDesc> is the fourth, and last element of the TEI header. It is an optional element that can appear in the corpus root or component, and documents the revisions made in the corpus or component. Its structure is illustrated below:
<revisionDesc>
  <change when="2025-07-11">
    <name>Tomaž Erjavec</name>: Finalized encoding.</change>
  <change when="2025-07-03">
    <name>Tomaž Erjavec</name>: Built corpus.</change>
</revisionDesc>
The revision description consists of a series of <change> elements, with the attribute when giving the date of the change, and the content containing the <name> of the person responsible for the change, and a free-text description of the change. Note that the <change> elements follow reverse chronological order, i.e. the most recent changes are at the top.
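
The ordering convention can be checked mechanically. The sketch below is a hypothetical helper, not part of the PressMint scripts; it assumes that every when attribute holds a complete YYYY-MM-DD date and, for brevity, that the sample carries no TEI namespace.

```python
import xml.etree.ElementTree as ET
from datetime import date

REVISION_DESC = """<revisionDesc>
  <change when="2025-07-11"><name>Tomaž Erjavec</name>: Finalized encoding.</change>
  <change when="2025-07-03"><name>Tomaž Erjavec</name>: Built corpus.</change>
</revisionDesc>"""


def is_reverse_chronological(revision_desc):
    """True if the <change> elements run from the most recent to the oldest."""
    dates = [date.fromisoformat(change.get("when"))
             for change in revision_desc.iter("change")]
    return all(newer >= older for newer, older in zip(dates, dates[1:]))


ordered = is_reverse_chronological(ET.fromstring(REVISION_DESC))
print(ordered)
```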

5. Newspaper facsimile

Facsimiles (i.e. images) of the newspapers are highly useful, both for comparing the transcriptions with the original during their analysis, and for allowing better OCR as the state of the art improves. If the facsimile is available, it should also be published together with the PressMint corpora, and should be referred to from the corpus, in particular from each corpus component.

How to encode references to the facsimile images in TEI is, in the general case, explained in the Chapter on Representation of Primary Sources of the TEI Guidelines. In this chapter we only provide the basic representation that is directly supported in PressMint.

5.1. The facsimile element

The <facsimile> element should appear in a corpus component immediately after the <teiHeader>, cf. the Section on Overall XML corpus structure. It contains pointers to the complete facsimile or its parts, i.e. URLs of the images of an issue or its individual pages, and can further structure or document these images.

The <facsimile> element can contain a <graphic> element that points to the complete facsimile of the corpus component (typically an issue of a newspaper). This is followed by a series of <surface> elements, each one typically corresponding to a printed page. These, in turn, also contain <graphic> elements, each one pointing to the image of the page. It is important that each element containing a <graphic> has the xml:id attribute, as this serves to connect the text to the images. The following example illustrates this basic structure:
<facsimile xml:id="PressMint-SI_1851-12-03_KRN-NUK.facsimile">
  <graphic xml:id="PressMint-SI_1851-12-03_KRN-NUK.graphic"
      url="https://nl.ijs.si/inz/speriodika/NUK-KRN_1851-12-03.jpg"/>
  <surface xml:id="PressMint-SI_1851-12-03_KRN-NUK.page1">
    <graphic url="https://nl.ijs.si/inz/speriodika/NUK-KRN_1851-12-03-p1.png"/>
  </surface>
  <surface xml:id="PressMint-SI_1851-12-03_KRN-NUK.page2">
    <graphic url="https://nl.ijs.si/inz/speriodika/NUK-KRN_1851-12-03-p2.png"/>
  </surface>
</facsimile>

Apart from modelling pages with <surface>, areas inside them can also be specified. For this, <zone> elements inside <surface> are used; these can specify a rectangle or, in general, a polygon inside it; the details are given in the TEI Section on Digital Facsimiles. Note, however, that if this approach is used, a mechanism needs to be implemented to show the correct zone on the image.

5.1.1. Connecting the text to the facsimile

Elements of the transcript are connected to the <facsimile>, <surface> or <zone> elements using the facs attribute, which has a series of ID references as its value. For example, if a paragraph appears on the first and second page of a newspaper issue, this would be modelled as in the following example:
<body>
  <p facs="#PressMint-SI_1899_KRN-NUK.page1 #PressMint-SI_1899_KRN-NUK.page2"> ...
  </p>
</body>
The exact point where a new page or area starts can be represented by using the empty page beginning (<pb>), column beginning (<cb>) or line beginning (<lb>) elements, in this order, as shown below:
<body>
  <pb facs="#PressMint-SI_1899_KRN-NUK.pb1"/>
  <cb facs="#PressMint-SI_1899_KRN-NUK.cb1"/>
  <lb facs="#PressMint-SI_1899_KRN-NUK.lb1"/>
  <p>v Ljubljani, v četrtek 16. aprila 1899.</p>
</body>
If these empty elements are used, they should always appear, as their name suggests, in front of the part of the text they cover, i.e. a page/column/line beginning should come before the text, as above, which is different from the usual practice of putting line breaks at the end of the line. Furthermore, by convention, the beginnings should appear as high up in the hierarchy as possible, i.e. if they appear at the beginning of a paragraph, they should be placed before the start of the paragraph, as in the example above.

Note that these elements can appear anywhere in the text, including in the middle of an (end-of-line hyphenated) word, which makes the linguistic annotation of such text more complicated, as textual data is mixed with markup, which is otherwise typically not the case.
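
To show how the facs attribute ties the transcript to the images, the following Python sketch resolves the ID references of an element to image URLs. It is a minimal illustration only: the sample component, its IDs and example.org URLs, and the function names are all invented for the example, and <zone> elements are not handled.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical component with invented IDs and URLs.
COMPONENT = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface xml:id="demo.page1"><graphic url="https://example.org/p1.png"/></surface>
    <surface xml:id="demo.page2"><graphic url="https://example.org/p2.png"/></surface>
  </facsimile>
  <text><body><p facs="#demo.page1 #demo.page2">...</p></body></text>
</TEI>"""


def image_urls_by_id(root):
    """Map the xml:id of each element that directly contains a <graphic> to its URL."""
    urls = {}
    for facsimile in root.iter(TEI + "facsimile"):
        for elem in facsimile.iter():
            graphic = elem.find(TEI + "graphic")  # direct child only
            if graphic is not None and elem.get(XML_ID):
                urls[elem.get(XML_ID)] = graphic.get("url")
    return urls


def resolve_facs(elem, urls):
    """Return the image URLs referenced by the facs attribute of an element."""
    ids = [ref[1:] for ref in elem.get("facs", "").split() if ref.startswith("#")]
    return [urls[i] for i in ids if i in urls]


root = ET.fromstring(COMPONENT)
paragraph = root.find(f".//{TEI}p")
pages = resolve_facs(paragraph, image_urls_by_id(root))
print(pages)
```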

5.2. Structure of newspaper texts

The newspaper texts are encoded in the <text> element of corpus components. This element must contain the <body> element, which, at the minimum, will contain a series of paragraph (<p>) elements; this corresponds to a newspaper text that has no internal divisions, as illustrated below:
<text>
  <body>
    <p>...</p>
    <p>...</p>
    ...
  </body>
</text>
The paragraphs can also be mixed with empty page, column and line beginning elements, as discussed in the preceding Section on Connecting the text to the facsimile.

This kind of encoding is appropriate for texts with no internal structure. However, the text can also be split into divisions, encoded using the standard <div> element. PressMint allows two levels of divisions. The upper one (if used) corresponds to sections of a newspaper, such as ‘foreign news’, ‘crime reports’, ‘sports’ etc., while the lower one encodes individual articles, advertisements and other self-contained portions of the text. The type of the division is indicated by the value of its type attribute, which has the following recommended values:

  • domesticNews: groups articles on domestic news
  • foreignNews: groups articles on foreign news
  • sportsNews: groups articles on sports events
  • crimeNews: groups articles on crime and misdemeanors
  • supplements: groups the supplements to the newspaper
  • banner: contains the banner of a newspaper
  • colophon: contains the colophon (details of printing, responsibility etc.) of the newspaper
  • article: contains an individual newspaper article
  • advertisement: contains an advertisement
  • supplement: contains a supplement to the newspaper

In case a corpus also contains other types of divisions, these can be encoded using a local taxonomy of division types, the categories of which are referred to via the ana attribute on <div>.

The divisions can also start with the (typically article) title, i.e. the <head> element, and can also start or (more commonly) end with the author of the article, encoded in the <byline> element. This structure, using both levels of <div>, is then as follows:
<text>
  <body>
    ...
    <div type="foreignNews">
      <div type="article">
        <head>News for New York</head>
        <p>...</p>
        ...
        <byline>A.B.</byline>
      </div>
      ...
    </div>
    ...
  </body>
</text>
Apart from the <body>, the <text> element can also contain <front> and <back>, which, if used, group together matter that is not part of the newspaper body proper (so, are not articles). The <front> element will contain the front matter of a newspaper issue, i.e. its banner and possibly colophon, while <back> contains material that either comes at the end of a newspaper (such as supplements) or does not fit in well with the article-based structure of a newspaper, in particular, the advertisements, as shown in the example below:
<text>
  <front>
    <div type="banner"> ...
    </div>
  </front>
  <body>
    <div type="article">...</div>
    <div type="article">...</div>
    ...
  </body>
  <back>
    <div type="advertisement">...</div>
    <div type="advertisement">...</div>
    ...
  </back>
</text>

5.2.1. Gaps

For various reasons, such as portions of the text not being of interest to PressMint (e.g. tables), parts of the text can be omitted. To mark missing material, the <gap> element is used, which should also have the reason attribute, specifying why the material was omitted. It is also possible to give a description of the omitted content in the <desc> element of the <gap>, as illustrated below:
The city has provided us with a table giving the main expenditures and incomes:
<gap reason="editorial">
  <desc>Table of expenditures and incomes.</desc>
</gap>

6. Linguistic annotation

This section introduces the PressMint linguistic annotation. An important note is that a linguistically annotated PressMint corpus is stored separately from its base (or plain-text) TEI version, i.e. the version that has been discussed in the preceding sections. The encoding of the linguistically annotated version differs from the plain-text one in the following:

6.1. Linguistic markup

Linguistic annotation is added only to the immediate text content of <p> elements. For this text, PressMint requires the following additional markup to be present:

  • tokens: what is a word, and what is punctuation, with preserved information on inter-token spaces;
  • sentences: what is a sentence;
  • normalised form (optional): what is the modernised spelling of archaically spelled words;
  • lemmas: the base form of each word;
  • Universal Dependencies (UD) part-of-speech and morphological features, and, optionally, part-of-speech tags from a different (local) tagset;
  • named entities (NE): a name, categorised into the standard four NE classes.

Below, we explain the encoding of each of these levels.

6.1.1. Word-level annotation

Basic linguistic annotation comprises tokenisation, sentence segmentation, part-of-speech tagging and lemmatisation, and this mark-up is illustrated in the example below:
<s>
  <w msd="UPosTag=DET|Case=Gen|Gender=Neut|Number=Sing|PronType=Dem" lemma="ta">Tega</w>
  <w msd="UPosTag=PRON|PronType=Prs|Reflex=Yes|Variant=Short" lemma="se">se</w>
  <w msd="UPosTag=PART" lemma="sploh">sploh</w>
  <w msd="UPosTag=AUX|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin" lemma="biti">nisem</w>
  <w msd="UPosTag=VERB|Aspect=Perf|Gender=Masc|Number=Sing|VerbForm=Part" lemma="zavesti" join="right">zavedel</w>
  <pc msd="UPosTag=PUNCT">.</pc>
</s>
Sentences are marked up using the <s> element, words with the <w> element and punctuation symbols with the <pc> element. To retain the linguistically significant whitespace, the join attribute with the fixed value right is used, meaning there should be no whitespace to the right of the token. There can be an added complication with tokenisation, which is further taken up in the next Section on Text modernisation.

The base form, or lemma, of a word is given as the value of the lemma attribute, while punctuation characters, <pc>, do not have this attribute.

The UD part-of-speech and morphological features are both packed in the msd attribute, with the part-of-speech having the UPosTag linguistic attribute, and the features separated by the vertical bar.
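
The two conventions just described, whitespace via join and features packed into msd, can be unpacked as in the Python sketch below. It is an illustration with hypothetical function names, and it assumes a flat sentence whose direct children are <w> and <pc> elements (so no nested words or <name> elements).

```python
import xml.etree.ElementTree as ET

# Shortened version of the example sentence above.
SENTENCE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w msd="UPosTag=PART" lemma="sploh">sploh</w>
  <w msd="UPosTag=VERB|Aspect=Perf|Gender=Masc|Number=Sing|VerbForm=Part"
     lemma="zavesti" join="right">zavedel</w>
  <pc msd="UPosTag=PUNCT">.</pc>
</s>"""


def parse_msd(msd):
    """Split an msd value into a dictionary of attribute-value pairs."""
    return dict(pair.split("=", 1) for pair in msd.split("|"))


def sentence_text(sentence):
    """Rebuild the plain text of a flat sentence, honouring join="right"."""
    parts = []
    for token in sentence:  # direct children: <w> and <pc>
        parts.append(token.text or "")
        if token.get("join") != "right":
            parts.append(" ")
    return "".join(parts).strip()


s = ET.fromstring(SENTENCE)
text = sentence_text(s)
features = parse_msd(s[1].get("msd"))
print(text)
print(features["UPosTag"], features["Number"])
```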

PressMint also allows (but does not require) part-of-speech tags from some other tagset to be added to the linguistic annotation. Where this information is encoded depends on the type of tagset.

Synthetic tagsets, such as the Penn Treebank tagset, have atomic tags that cannot always be decomposed into attribute-value pairs (e.g. the tag ‘TO’ for the word ‘to’). Such tags should be encoded using the pos attribute on words and punctuation symbols, as shown in the example below:
<s>
  <w lemma="I" msd="UPosTag=PRON|Case=Nom|Number=Sing|Person=1|PronType=Prs" pos="PRP">I</w>
  <w lemma="support" msd="UPosTag=VERB|Mood=Ind|Tense=Pres|VerbForm=Fin" pos="VBP">support</w>
  <w lemma="the" msd="UPosTag=DET|Definite=Def|PronType=Art" pos="DT">the</w>
  <w lemma="amendment" msd="UPosTag=NOUN|Number=Sing" pos="NN" join="right">amendment</w>
  <pc msd="UPosTag=PUNCT" pos=".">.</pc>
</s>
For analytic tagsets, where a part-of-speech tag can always be decomposed into a set of attribute-value pairs, the pointing attribute ana should be used. An example of such a collection of tagsets for various languages is given in the MULTEXT-East morphosyntactic specifications, and we give below an example that uses this tagset:
<s>
  <w ana="mte:Vmpr1p" lemma="prehajati"
     msd="UPosTag=VERB|Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin">Prehajamo</w>
  <w ana="mte:Sa" lemma="na" msd="UPosTag=ADP|Case=Acc">na</w>
  <w ana="mte:Ncnsa" join="right" lemma="odločanje"
     msd="UPosTag=NOUN|Case=Acc|Gender=Neut|Number=Sing">odločanje</w>
  <pc ana="mte:Z" msd="UPosTag=PUNCT">.</pc>
</s>
The mte: prefix is, via the TEI extended pointer syntax as defined in the TEI header (cf. the Section on Prefix definitions), expanded so that the value of such an ana attribute points to the expansion of the given tag into a feature structure. For example, the value mte:Vmpr1p would be expanded to https://nl.ijs.si/ME/V6/msd/tables/msd-fslib2-sl.xml#Vmpr1p, which then resolves to the feature structure below:
<fs xml:id="Vmpr1p" xml:lang="en" corresp="#Ggnspm">
  <f name="CATEGORY">
    <symbol value="Verb"/>
  </f>
  <f name="Type">
    <symbol value="main"/>
  </f>
  <f name="Aspect">
    <symbol value="progressive"/>
  </f>
  <f name="VForm">
    <symbol value="present"/>
  </f>
  <f name="Person">
    <symbol value="first"/>
  </f>
  <f name="Number">
    <symbol value="plural"/>
  </f>
</fs>

6.1.2. Text modernisation

The language of older newspapers might differ significantly from the contemporary norm. This has an impact on the quality of linguistic annotations, in cases where the annotation tool has been trained on contemporary texts only, and also hinders searching for particular words or lemmas in their contemporary spellings. To alleviate this, normalisation (i.e. modernisation) is often used on archaic texts, and the subsequent linguistic annotation is performed on such modernised text.

Modern neural approaches typically take a complete chunk of text and normalise it, while more traditional approaches perform the normalisation on individual words. The former has the advantage of being capable not only of modernising the spelling of individual words but also of substituting archaic words with their contemporary equivalents, and of modernising multi-word units or even syntactic constructions. However, if such a method is used on a PressMint corpus, this means that the linguistically annotated variant of the corpus will contain only the modernised text, and the alignment to the plain-text variant of the corpus will be at the paragraph level only. In other words, losing word alignment with the original tokens also means losing the ability to search for or directly view the original tokens.

In contrast, traditional methods (such as cSMTiser) will typically normalise only the spelling of individual words, or, at most, sequences of words. This means that the text has to be first tokenised, normalisation applied to such (series of) tokens, and the resulting normalised word-tokens then linguistically annotated. Here both the original and normalised and annotated words are available in the linguistically annotated version of the corpus.

If the normalised word token is identical to the original one, then the annotation is exactly the same as for non-normalised text. If an original token is normalised into a single, but different token, then the norm attribute is used to record the value of the normalised token. These two cases are illustrated in the following example, where we also give the lemma of the words but, for simplicity, no further linguistic annotation:
<w lemma="lep">lepo</w>
<w norm="sonce" lemma="sonce">solnce</w>
A complication arises when one original token corresponds to several normalised tokens or vice versa. For these cases we use the same mechanism as was used in ParlaMint for splitting orthographic words into syntactic ones, which is illustrated in the following two examples, the first where an archaic word was split into two contemporary ones, and the second where two archaic words form one contemporary word. Note that the linguistic annotation is given only to the normalised forms:
<w>neſkèrbite
  <w norm="ne" lemma="ne"/>
  <w norm="skrbite" lemma="skrbeti"/>
</w>
...
<w norm="najmanjši" lemma="lep">
  <w>nar</w>
  <w>manſhi</w>
</w>
Also note that if such nested tokens do not have a following space, join="right" should be added to the top level word as well as to the last nested word.
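
The four cases (unchanged, one-to-one, one-to-many and many-to-one) can be read off the markup as in the following Python sketch. The function name original_and_normalised is hypothetical, and the code assumes that nesting is at most one level deep, as in the examples above.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="lep">lepo</w>
  <w norm="sonce" lemma="sonce">solnce</w>
  <w>neſkèrbite <w norm="ne" lemma="ne"/> <w norm="skrbite" lemma="skrbeti"/></w>
  <w norm="najmanjši" lemma="lep"><w>nar</w> <w>manſhi</w></w>
</s>"""


def original_and_normalised(w):
    """Return (original tokens, normalised tokens) for one top-level <w>."""
    children = list(w)
    if not children:
        form = (w.text or "").strip()
        # Without a norm attribute the normalised form equals the original.
        return [form], [w.get("norm", form)]
    if w.get("norm") is None:
        # One original token split into several normalised tokens.
        return [(w.text or "").strip()], [child.get("norm") for child in children]
    # Several original tokens joined into one normalised token.
    return [(child.text or "").strip() for child in children], [w.get("norm")]


pairs = [original_and_normalised(w) for w in ET.fromstring(SNIPPET)]
for original, normalised in pairs:
    print(original, "->", normalised)
```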

6.1.3. Named entities

PressMint also requires annotation of Named Entities (NE), which should be categorised into the following four types:

  • PER: person
  • LOC: location
  • ORG: organisation
  • MISC: miscellaneous
The identified names and their type are marked up as the <name> element with the appropriate value of its type attribute, as shown in the example below:
...
<w lemma="and" msd="UPosTag=CCONJ">and</w>
<name type="ORG">
  <w lemma="Westminster" msd="UPosTag=PROPN|Number=Sing">Westminster</w>
  <w join="right" lemma="Hall" msd="UPosTag=PROPN|Number=Sing">Hall</w>
</name>
<pc msd="UPosTag=PUNCT">,</pc>
...
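
Extracting the annotated names from such markup is straightforward, as the Python sketch below suggests. The function name named_entities is hypothetical, and for simplicity the tokens of a name are joined with single spaces, ignoring the join attribute.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

SNIPPET = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="and" msd="UPosTag=CCONJ">and</w>
  <name type="ORG">
    <w lemma="Westminster" msd="UPosTag=PROPN|Number=Sing">Westminster</w>
    <w join="right" lemma="Hall" msd="UPosTag=PROPN|Number=Sing">Hall</w>
  </name>
  <pc msd="UPosTag=PUNCT">,</pc>
</s>"""


def named_entities(root):
    """Return (type, text) pairs for every <name> element under root."""
    entities = []
    for name in root.iter(TEI + "name"):
        tokens = [token.text or "" for token in name.iter()
                  if token.tag in (TEI + "w", TEI + "pc")]
        entities.append((name.get("type"), " ".join(tokens)))
    return entities


entities = named_entities(ET.fromstring(SNIPPET))
print(entities)
```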

6.2. Metadata for linguistic annotation

What kind of metadata a plain-text PressMint corpus should contain was explained in the Section on Corpus metadata and in this section we detail what additions must be made to the metadata for the linguistically annotated version. Note that the other changes for this version of a corpus have been already explained at the start of this Chapter. In short, there are two additional parts that should be added to the <teiHeader> of the corpus root, namely a description of the tool(s) used to linguistically annotate the corpus and an additional taxonomy for named entities.

6.2.1. Application information for linguistic processing

As the linguistic analysis of a PressMint corpus will be performed by a tool, the information on which tool (or tools) have been used should be documented in the corpus root TEI header. This information is encoded in the <appInfo> element of the <encodingDesc>, as shown in the example below:
<appInfo>
  <application version="1.0" ident="classla">
    <label>CLASSLA</label>
    <desc xml:lang="en">Linguistic processing performed with CLASSLA trained for
      Slovene, available from <ref target="https://github.com/clarinsi/classla">https://github.com/clarinsi/classla</ref>.</desc>
  </application>
</appInfo>
The <appInfo> element contains, in general, a series of <application> elements, each one giving the information on one tool. The element gives the version number of the tool and specifies, via ident, an identifying code. It has two subordinate elements, with <label> giving the name of the tool and <desc> a short description of it, preferably with a pointer to the URL where it can be found or is at least documented.

6.2.2. Linguistic taxonomies

Some linguistic annotations have fixed vocabularies and these should be encoded as taxonomies in the TEI header of the linguistically analysed corpus root, similarly to other taxonomies, as discussed in the Section on the Class declaration.

In PressMint there is only one such taxonomy, namely that for Named Entity types, which has - apart from translating the categories into the local language - a fixed structure, as follows:
<taxonomy xml:id="ParlaMint-taxonomy-NER.ana">
  <desc xml:lang="en">
    <term>Named entities</term>
  </desc>
  <category xml:id="PER">
    <catDesc xml:lang="sl">
      <term>oseba</term>
    </catDesc>
    <catDesc xml:lang="en">
      <term>person</term>
    </catDesc>
  </category>
  <category xml:id="LOC">
    <catDesc xml:lang="sl">
      <term>lokacija</term>
    </catDesc>
    <catDesc xml:lang="en">
      <term>location</term>
    </catDesc>
  </category>
  <category xml:id="ORG">
    <catDesc xml:lang="sl">
      <term>organizacija</term>
    </catDesc>
    <catDesc xml:lang="en">
      <term>organisation</term>
    </catDesc>
  </category>
  <category xml:id="MISC">
    <catDesc xml:lang="sl">
      <term>drugo</term>
    </catDesc>
    <catDesc xml:lang="en">
      <term>miscellaneous</term>
    </catDesc>
  </category>
</taxonomy>

7. Validation and conversion

This chapter explains how to validate and finalise a PressMint corpus, and introduces scripts for converting a PressMint corpus to other, derived formats.

7.1. Validating PressMint corpora

The XML structure of PressMint corpora can be validated via RelaxNG schema produced as a customisation of the TEI Guidelines.

The TEI customisation is written as a TEI ODD document, which is, in fact, the XML version of this document, and is available in the TEI/ directory of the PressMint GitHub repository. The XML contains not only the prose guidelines, but also the formal specification of the TEI schema, which is given in the Appendix A. In the XML it contains the formal schema specification, while in the on-line version this is converted to a reference to all the elements, attributes and classes used in PressMint corpora --- quite a lot, as the PressMint schema has been left open enough to accommodate differing requirements in the encoding.

The ODD document cannot be used directly for XML validation; it first has to be converted to a RelaxNG schema with the standard TEI XSLT stylesheets. The TEI ODD, its RelaxNG schema (PressMint.rng) and the HTML guidelines are always kept in sync. This schema should be used to check that PressMint component files validate against TEI, typically using Jing (cf. the Section on Contributing to PressMint).
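
As an illustration of such a validation run, the following Python sketch wraps a Jing call. It is not one of the PressMint scripts. It assumes that Jing is installed and available on the PATH as the command jing, which accepts the schema followed by the files to validate; the function names and the component file name in the usage example are hypothetical, and the location of PressMint.rng in the working directory is likewise an assumption.

```python
import shutil
import subprocess


def jing_command(schema, files, executable="jing"):
    """Build the command line: jing SCHEMA FILE [FILE ...]."""
    return [executable, schema, *files]


def validate(schema, files, executable="jing"):
    """Run Jing and return (is_valid, messages).

    Assumes the given executable is a Jing launcher on the PATH that
    exits with a non-zero status when a file does not validate.
    """
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on the PATH")
    result = subprocess.run(
        jing_command(schema, files, executable),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    ok, messages = validate("PressMint.rng", ["PressMint-XX_1900.xml"])
    print("valid" if ok else messages)
```

Keeping the construction of the command separate from its execution makes it easy to inspect what would be run before any corpus files are touched.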

7.2. Finalisation of corpora

While the vast majority of the work of converting source encodings into the PressMint corpus format is left to the compilers of a corpus, a few metadata elements can be produced by a common script on the basis of a nearly finished corpus, which then yields the final version of the corpus for a particular release. This includes setting the date, edition and handle under which the corpus will be distributed, and also calculating the size of the corpus (cf. the Sections on Extents and on Tags declaration). The script for finalisation can be found in the Scripts/ directory of the PressMint GitHub repository; the README file briefly explains its function, and more comments can be found in the script itself.
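
To give an idea of what calculating the size of a corpus involves, the following Python sketch counts how often each element occurs inside the text of a TEI document, which is the kind of information that TEI records in tagUsage elements (the gi and occurs attributes). It is a simplified illustration and not the finalisation script itself: the function name tag_usage, the sample document and the choice to count everything from the text element downwards are assumptions.

```python
from collections import Counter
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Tiny made-up document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader/>
  <text><body><div><p>One.</p><p>Two.</p></div></body></text>
</TEI>"""


def tag_usage(tei_xml):
    """Count element occurrences in <text> (the element itself included)."""
    root = ET.fromstring(tei_xml)
    text = root.find(f"{{{TEI_NS}}}text")
    counts = Counter()
    if text is not None:
        for elem in text.iter():
            # Drop the "{namespace}" prefix to obtain the element name.
            counts[elem.tag.split("}")[-1]] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for name, occurs in tag_usage(SAMPLE).items():
        print(f'<tagUsage gi="{name}" occurs="{occurs}"/>')
```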

7.3. Conversions

A TEI encoded document is, in general, not meant to be used directly by software programs; rather, it serves as an interchange and storage format. The PressMint project has produced various scripts to down-convert the XML encoded corpora to other formats. They can be found in the Scripts/ directory of the PressMint GitHub repository, with the README file listing them and explaining their function. In short, the scripts convert the PressMint XML to plain text, to CoNLL-U, and to vertical format. There is also a script that takes a PressMint corpus and makes a sample from it for inclusion in the PressMint GitHub repository.
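
The simplest of these down-conversions, to plain text, can be sketched as follows. This Python fragment is an illustration only and does not reproduce the PressMint conversion scripts. It assumes that the running text is held in p elements (the element name is a parameter for that reason); the function name tei_to_text and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Tiny made-up component, for illustration only.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <p>First   paragraph of
           an article.</p>
        <p>Second <hi>paragraph</hi>.</p>
      </div>
    </body>
  </text>
</TEI>
"""


def tei_to_text(tei_xml, element="p"):
    """Return one whitespace-normalised line per matching element.

    Assumes the matching elements are not nested inside each other.
    """
    root = ET.fromstring(tei_xml)
    lines = []
    for elem in root.iter(f"{{{TEI_NS}}}{element}"):
        # Concatenate all text inside the element, ignoring inline markup.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            lines.append(text)
    return "\n".join(lines)


if __name__ == "__main__":
    print(tei_to_text(SAMPLE))
```

Inline markup is discarded and line breaks inside a paragraph are collapsed, so each paragraph of the source ends up on a single line of the output.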

8. Contributing to PressMint

The PressMint GitHub repository contains these guidelines, the PressMint XML schemas, the scripts used to validate, finalise and convert the PressMint TEI XML corpora to derived formats, and samples of the PressMint corpora. There are four main branches in the repository:

The validation procedure for corpora is explained in the Section on Validating PressMint corpora, while the technical aspects of contributing corpora are further explained in the CONTRIBUTING file of the repository.

9. Acknowledgements

The work on these recommendations was funded by the CLARIN Research Infrastructure for Language Resources and Tools.

Notes
1
Note that this is an illustrative example, i.e. a valid PressMint corpus would also need certain attributes to be defined on the illustrated elements. This holds for all the examples in this chapter.
2
Note that this is different from ParlaMint, where a hyphen, not an underscore, is used.
3
These are typically tagsets developed and used for specific languages; they can be found in the XPOS column of CoNLL-U files, which is the native format for UD treebanks.
4
Note that PressMint does not foresee syntactic parsing, so there is no ambiguity as to whether a word is split because of normalisation or because of its syntactic analysis. However, if both were present, the outer one would correspond to normalisation and the inner one to syntactic words.
Tomaž Erjavec, tomaz.erjavec@ijs.si and Matyáš Kopp, kopp@ufal.mff.cuni.cz. Date: 2025-11-13