Introducing the Federated Content Search (FCS)

  • Description, History & Glossary

What is the FCS?

  • “Federated Content Search” at CLARIN

    In short: Content Search over Distributed Resources

    Also: Federated “Corpus Query Platform”

  • Search for patterns in distributed text collections

  • No central index!

  • Text resources include annotated corpora, full-texts etc.

  • FCS = interface specification, search infrastructure and software ecosystem

  • Usage of established standards and extensibility!

What is included in the FCS?

Interface Specification

  • Description of search protocol (query languages, formats and communication channels)
    “for homogeneous access to heterogeneous search engines”

  • RESTful protocol

Search Infrastructure in CLARIN and Text+

  • Central client (search result “Aggregator” and web portal)

  • Decentralized endpoints at the data centers (local search engines on resources)

Software Ecosystem primarily in Java

  • Libraries (Java, Python, …)

  • Tools (Validator, Aggregator, Registry)

Requirements for participation in the FCS

  • (Own) text resources

  • “Search engine” on those text resources

    • Minimum: full-text search

  • Deployment of publicly accessible FCS endpoint(s)

Pros and Cons for the FCS (as Infrastructure)

Pros

  • Integration of many resources, linking and comparison of results

  • Integration with other tools (Weblicht, Registry/VLO, Switchboard, …)

  • Same queries, formats, result presentation

  • No duplicate data storage, no inconsistency

Cons

  • No control over resources

  • No deterministic results (e.g. links for publications)

  • No global ranking of results possible

Pros and Cons for FCS Endpoints (Operators)

Pros

  • Control over resources and search (ranking, fuzzy, …)

  • No duplication of data into a central index

  • Increased visibility in a larger resource catalog

Cons

  • Deployment of (additional) endpoint necessary

Comparison of FCS with Central Index

Data

  • FCS: ➕ At the endpoints

  • Central index: ➖ Duplicate data storage, possible inconsistency (age, updates); transferring the data may not be legally possible

Updates to Data

  • FCS: ➕ Endpoints can react quickly

  • Central index: ➖ Difficult, e.g. removal of resources in the event of legal problems; updates entail longer delays, if they are possible at all

Global Ranking

  • FCS: ➖ Very difficult/impossible

  • Central index: ➕ Quite possible (?), probably with implicit assumptions and normalization of data for indexing

Faceted Search

  • FCS: ➖ Difficult (e.g. via external metadata; not explicitly intended)

  • Central index: ➕ Indexing allows clustering/classification according to topics and categories

History

  • ~ 2011 Started as Working Group in CLARIN

  • May 2011 EDC/FCS Workshop

  • ~ 2011–2013 Initial version, now named FCS “Legacy”

    • SRU Scan for resources, BASIC Search (CQL/full-text), KWIC

  • April 2013 FCS Workshop

  • ~ 2013/2014 Code and Spec for FCS Core 1.0

    • fcs-simple-endpoint:1.0.0, sru-server:1.5.0

    • BASIC Hits Data View, SRU Scan operation not used anymore

  • ~ 2015/2016 Starting work on and Code for FCS Core 2.0

    • fcs-simple-endpoint:1.3.0, sru-server:1.8.0

    • Advanced Data Views (FCS-QL), …

  • June 2017 Official release of FCS Core 2.0 Spec

  • 2022 FCS is focus in Text+ (Findability)

  • 2023 New FCS maintainer in CLARIN

    • Migration of Source Code to GitHub.com, updated documentation

    • Python FCS endpoint libraries

    • Updated libraries & tools

    • Prototypes for LexFCS extension

  • 2024

    • Experiments with Entity Search (extension)

    • Rewrite of FCS Endpoint Validator

FCS Architecture

Communication Protocol

SRU (Search/Retrieval via URL) / OASIS searchRetrieve

  • Standardized by Library of Congress (LoC) / OASIS

    • RESTful

    • Explain: Listing of resources

      • Languages, annotations, supported data views and formats etc.

    • SearchRetrieve: Search request

  • Data as XML

  • Extensions to the protocol explicitly allowed

Basic Assumption on the Data Structure

  • Different (optional) annotation layers

Full-text                   The  cyclists  are   fast
Part of Speech              DET  NOUN      VERB  ADJ
Lemmatisation               The  cyclist   is    fast
Phonetic Transcription      …    …         …     …
Orthographic Transcription  …    …         …     …
[…]

Explain: Resource Discovery

Endpoint Description

Explain: Resource Discovery (2)

Endpoint Description

Query Language FCS-QL

  • Based on CQP

  • Supports various annotation layers

Visual Query Builder for FCS-QL

Visualization of Results

HITS Results

Visualization of Results (2)

KWIC Results

Visualization of Results (3)

ADV Results

Current state of the FCS

Current Work

  • Lexical Resources extension

    • First specification and implementation in Text+

    • Official extension of CLARIN → ~2024 Working Plan

  • AAI integration

    • Specification and implementation

    • Goal: Support access-restricted resources

    • Securing the aggregator via Shibboleth → Passing on AAI attributes to endpoints

    • Preliminary work from CLARIAH-DE, part of the Text+ work plan (IDS Mannheim, Uni/SAW Leipzig, preliminary work BBAW)

  • Syntactic Search

  • Entity Search

  • Optional metadata for each result

Current status regarding Lexical Resources

Current status of participants

  • 209 Resources (94 in Advanced)

    in 61 Languages

    from 20 Institutions in 12 Countries

  • 53 Resources (17 in Advanced, 30 in Lexical)

    in 6 Languages

    from 9 Institutions in Germany

Integration in FCS Infrastructure

CLARIN

  • Alpha/Beta using Side-Loading in Aggregator

  • Stable/Long-Term: Entry in Centre Registry

    • CLARIN account + application form as a Centre

    • Including monitoring etc.

Text+

  • Side-Loading in Aggregator

  • WIP: Registry (index of endpoints)

Alternative Ways of using FCS

Bootstrapping Endpoint Development

Guide to Endpoint Development

  • Important preliminary questions

  • Existing implementations, resources for new development

  • Prerequisites

Development Decisions

❓ Can I host the endpoint myself?

❗ No → HelpDesk: CLARIN, Text+

❓ What type of data do I have?

❗ Raw text, Vertical/CONLL, TEI, …

❓ Which search engine do I use / can I use?

❗ KorAP, Korp/CWB, Lucene/Solr/ElasticSearch, BlackLab, (No)SketchEngine, …

❓ Customization or new development?

❗ List of existing endpoint implementations (Awesome List)

❓ Programming language?

❗ Java, Python, (PHP, XQuery)

❓ In-house development: Use of the reference libraries (Java, Python)

❗ Maven Archetype, Korp

❗ SRU + FCS specifications …

Endpoint Implementations

Sources: clarin, awesome-fcs

New Endpoint Development

Prerequisite for local search engine

❗ Full-text search

❓ With Hit markers

❓ Corpus search (segmented text with annotations)

❕ Pagination (total number of hits)

❗ Resource PID

❓ Linking to result pages

Fundamentals

  • SRU (Overview, APD/Models, Request Parameter, Diagnostics, …)

  • FCS (Discovery, Endpoint Description, Search, SRU Parameter, Diagnostics)

  • FCS Notes (Versions, Compatibility, Aggregator)

Disclaimer

Main focus on:

  • Version: FCS Core 2.0; maximum compatibility with FCS Core 1.0

  • SRU Server, FCS Endpoint; not FCS client application development

  • Using the reference libraries

    Java and Python

  • Possible (re-)use of existing endpoints

No:

  • Working through the specification; only the essential information

  • New or redevelopment of SRU/FCS protocols, libraries etc.

    (e.g. in other languages)

SRU – History

SRU: Search/Retrieve via URL (LOC)

  • Originally developed by the Library of Congress (LOC)

    2004: SRU 1.1 - LOC

    2007: SRU 1.2 - LOC

  • As of SRU 2.0 standardized by OASIS [1] as “searchRetrieve Version 1.0 OASIS Standard”

    2013: SRU 2.0 - LOC, OASIS (OASIS Announcement)

    Extension of SRU 1.2 → Differences to SRU 1.2 (LOC)

searchRetrieve Version 1.0 – OASIS Standard

  • Part 0. Overview Version 1.0

  • Part 1. Abstract Protocol Definition Version 1.0

  • Part 2. SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0

  • Part 3. SRU searchRetrieve Operation: APD Binding for SRU 2.0 Version 1.0

  • Part 4. APD Binding for OpenSearch 1.0 version 1.0

  • Part 5. CQL: The Contextual Query Language version 1.0

  • Part 6. SRU Scan Operation version 1.0

  • Part 7. SRU Explain Operation version 1.0

  • grayed out: not relevant for us

  • crossed out: plays no role at all for the FCS

searchRetrieve: Part 0. – Overview Version 1.0

SRU (SRU: Search/Retrieve via URL) is a web service protocol supported over both SOAP and REST for client-server based search. SRU1.x was developed as a web service replacement for the NISO Z39.50 protocol. SRU2.0 is a revision to SRU which as well as including many enhancements to SRU1.2 was developed alongside the APD.

For the SRU protocol model, three operations are defined as part of its Processing Model:

  • SearchRetrieve Operation. The actual SearchRetrieve operation defined by the SRU protocol; A SearchRetrieve operation consists of a SearchRetrieve request from client to server followed by a SearchRetrieve response from server to client.

  • Scan Operation. Similar to SRU, the Scan protocol defines a request message and a response message for iterating through available search terms. A Scan operation consists of a Scan request followed by a Scan response.

  • Explain Operation. Every SRU or scan server provides an associated Explain document as part of its Description and Discovery Model, providing information about the server’s capabilities. A client may retrieve this document and use the information to self-configure and provide an appropriate interface to the user. When a client retrieves an Explain document, this constitutes an Explain operation.

searchRetrieve – APD and Bindings

  • Abstract Protocol Definition (APD) for “searchRetrieve operation”

    • Model for SearchRetrieve Operation

    • Describes Capabilities and General Characteristics of a Server or Search Engine, as well as how access should take place

    • Defines abstract Request parameters and Response elements

  • Binding

    • Describes corresponding names of the parameters and elements

    • static (for human), dynamic (for machine), …

    • Bindings: SRU 1.2, SRU 2.0, (OpenSearch)

    • Examples: “startPosition” (APD) → “startRecord” (SRU 2.0)

      “recordPacking” (SRU 1.2) → “recordXMLEscaping” (SRU 2.0)
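
The renaming that a binding performs can be pictured as a lookup table. The following is a minimal sketch only: the two tables contain just the examples named above, and the helper `rename_parameters` is a hypothetical name introduced for illustration, not part of any SRU library.

```python
# Name mappings taken from the two examples above; real bindings define more.
APD_TO_SRU20 = {"startPosition": "startRecord"}
SRU12_TO_SRU20 = {"recordPacking": "recordXMLEscaping"}


def rename_parameters(params, table):
    """Return a copy of params with keys renamed according to table.

    Names that are not listed in the table are passed through unchanged.
    """
    return {table.get(name, name): value for name, value in params.items()}


# An abstract (APD) request expressed with SRU 2.0 parameter names.
abstract_request = {"query": "cat", "startPosition": "11"}
print(rename_parameters(abstract_request, APD_TO_SRU20))
```

Only names are translated here; parameters whose semantics changed between versions (such as recordPacking) need more care than a rename.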

searchRetrieve – APD Abstract Models

Data Model
Description of the data on which the search is to be executed

Query Model
Description of the construction of search queries

Processing Model
Description of how query is sent from client to server

Result Set Model
Structure of the results of a search

Diagnostics Model
Description of how errors are communicated from the server to the client

Description and Discovery Model
Description, for the discovery of the “Search Service”, self-description of the functionality of the service

SRU 2.0 – Operation Model

  • SRU Request (Client → Server) with Response (Server → Client)

  • Operations

    • SearchRetrieve

    • Scan

    • Explain

SRU 2.0 – Data Model

  • Server = Database for Client for search/retrieval

  • Database = Collection of Units of Data → Abstract Record

  • Abstract Records (or Response Records) in one/multiple formats by server

  • Format (or Item Type) = Record Schema

SRU 2.0 – Protocol Model

  • HTTP GET

    • Parameters encoded as “key=value”

    • UTF-8

    • %-Escaping

    • Separation at “?”, “&”, “=”

  • HTTP POST

    • application/x-www-form-urlencoded

    • No character encoding necessary

    • No length restriction

  • HTTP SOAP (?)
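
The HTTP GET rules above (key=value pairs, UTF-8, %-escaping, separation at “?”, “&” and “=”) can be sketched with the Python standard library. The function name `build_sru_get_url` and the endpoint URL are assumptions made for this illustration.

```python
from urllib.parse import quote, urlencode


def build_sru_get_url(base_url, params):
    """Encode params as UTF-8, %-escaped key=value pairs appended after '?'."""
    query_string = urlencode(params, encoding="utf-8", quote_via=quote)
    return f"{base_url}?{query_string}"


url = build_sru_get_url(
    "https://fcs.example.org/sru",  # hypothetical endpoint base URL
    {"query": '"Fahrräder"', "maximumRecords": 10},
)
print(url)
```

For HTTP POST the same `urlencode` output would be sent as the request body with the content type application/x-www-form-urlencoded.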

SRU Protocol

SRU 2.0 – Processing Model

  • “Request processing on the server”

  • Request

    • Number of records

    • Identifier for Record Schema (→ Records in Response)

    • Identifier for Response Schema (→ whole Response)

  • Response

    • Records in Result Set

    • Diagnostic Information

    • Result Set Identifier for requests for further results

SRU 2.0 – Query Model

  • Any “appropriate query language” can be used

  • Mandatory support of

    “Contextual Query Language” (CQL)

SRU 2.0 – Parameter Model

  • Use of Parameters, some predefined by SRU 2.0

  • Parameters not defined in the protocol are also permitted

  • Parameter “query”

    • included in every query in some manner (“query” or by parameters not defined in the protocol)

    • Query with “queryType” (default “cql”)

SRU 2.0 – Result Set Model

  • Logical model → “Result Sets” are not mandatory

  • Query → Selection of suitable Records

    • Ordered list, non-modifiable set after creation

    • Sorting/order determined by server

  • for Client:

    • Set of abstract Records, counting starts with 1

    • Each record can be requested in its own format

    • Individual records can “disappear”, no reordering in the Result Set by the Server, but Diagnostic to inform

SRU 2.0 – Diagnostic Model

  • fatal

    • Execution of the query cannot be completed

    • e.g. invalid query

  • non-fatal

    • Processing impaired, but request can be completed

    • e.g. individual records are not available in the requested schema, server only sends the available ones and informs about the rest

    • surrogate

      • For single Records

    • non-surrogate

      • All records are available, but something went wrong, e.g. sorting

      • Or simply a warning

SRU 2.0 – Explain Model

  • Must be available for HTTP GET via the base URL of the SRU server

  • → Server Capabilities

  • In the client for self-configuration and to provide the corresponding user interface

  • Details on supported Query Types, CQL Context Sets, Diagnostic Sets, Records Schemas, Sorting options, defaults, …

SRU 2.0 – Serialization Model

  • No restriction on the serialization of responses

    (for the entire message or single records)

  • Non-XML serialization is allowed

searchRetrieve 2.0 – Request Parameter

  • All parameters are optional, non-repeatable

  • query, startRecord, maximumRecords, recordXMLEscaping, recordSchema, resultSetTTL, stylesheet; Extension parameters

  • New in 2.0: queryType, sortKeys, renderedBy, httpAccept, responseType, recordPacking; Facet Parameters

searchRetrieve 2.0 – Response Elements

  • All elements are optional, non-repeatable by default

  • numberOfRecords, resultSetId, records, nextRecordPosition, echoedSearchRetrieveRequest, diagnostics, extraResponseDataⓇ

  • New in 2.0: resultSetTTL, resultCountPrecisionⓇ, facetedResultsⓇ, searchResultAnalysisⓇ

    (Ⓡ = repeatable)

searchRetrieve 2.0 – Query

  • query (Parameter)

    • Query

    • Mandatory if no specification of queryType

  • queryType (Parameter, SRU 2.0)

    • Optional, by default “cql”

    • Query Types must be listed in the Explain, with URL for definition and usage abbreviation

    • Reserved

      • cql

      • searchTerms (processing is left to the server, < SRU 2.0)

searchRetrieve 2.0 – Pagination

  • Request the result range beginning at startRecord with at most maximumRecords records

  • startRecord (Parameter)

    • Optional, positive integer, starting with 1

  • maximumRecords (Parameter)

    • Optional, non-negative integer

    • Server selects default if not specified

    • Server can respond with fewer records, never more

  • Response with total number (numberOfRecords) of records in the Result Set, with offset (nextRecordPosition) to next results

  • numberOfRecords (Element)

    • Number of results in the Result Set

    • If query fails, it must be “0”

  • nextRecordPosition (Element)

    • Counter for next result set, if last record in the response is not last in the result set

    • If no further records, then this element must not appear
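
The pagination rules can be made concrete with a small calculation. This is a sketch under stated assumptions: the helper `page_window` is hypothetical, and the default of 10 for maximumRecords merely stands in for whatever default a server chooses.

```python
def page_window(number_of_records, start_record=1, maximum_records=10):
    """Return (records_in_response, next_record_position) for one request.

    Positions are 1-based. next_record_position is None when the response
    reaches the end of the result set, i.e. the element must not appear.
    """
    if start_record < 1:
        raise ValueError("startRecord must be a positive integer")
    if maximum_records < 0:
        raise ValueError("maximumRecords must be a non-negative integer")
    if number_of_records > 0 and start_record > number_of_records:
        # Corresponds to SRU diagnostic 61, first record position out of range.
        raise ValueError("first record position out of range")
    available = max(number_of_records - start_record + 1, 0)
    count = min(maximum_records, available)  # fewer is allowed, never more
    next_position = start_record + count
    if next_position > number_of_records:
        next_position = None
    return count, next_position


print(page_window(25, start_record=1, maximum_records=10))
print(page_window(25, start_record=21, maximum_records=10))
```

A client can keep requesting with startRecord set to the returned nextRecordPosition until that value is absent.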

searchRetrieve 2.0 – Result Set

  • resultSetId (Element)

    • Optional, identifier for the Result Set, for referencing in the subsequent requests

  • resultSetTTL (Parameter / Element, Element in SRU 2.0 only)

    • Optional, in seconds

    • In the request: the Client states the time after which it will no longer use the Result Set

    • In response from Server, how long Result Set is available (“good-faith estimate”, → can be longer or shorter)

  • resultCountPrecision (Element, SRU 2.0)

    • URI: “info:srw/vocabulary/resultCountPrecision/1/…”

    • exact / unknown / estimate / maximum / minimum / current

searchRetrieve 2.0 – Pagination (Cont.)

Response XML for CQL search for "cat"

searchRetrieve 2.0 – Serialization

  • recordXMLEscaping (Parameter, SRU 2.0)

    • If records are serialized as XML, “string” of the Records can be escaped (“<”, “>”, “&”); default is “xml” as direct embedding of the Records in the Response, e.g. for Stylesheets

  • recordPacking (Parameter, SRU 2.0)

    • In SRU 1.2 used to have the semantic of recordXMLEscaping

    • “packed” (default), Server should deliver Records with requested schema; “unpacked”, Server can determine the location of the application data in the Records itself (?)

  • httpAccept (Parameter, SRU 2.0)

    • Schema for Response, default is “application/sru+xml”

  • responseType (Parameter)

    • Schema for Response (in combination with httpAccept parameter)

  • recordSchema (Parameter)

  • records (Element)

    • Contains Records / Surrogate Diagnostics

    • According to default Schema a list of “<record>” elements

  • recordSchema with http://clarin.eu/fcs/resource can be used for multiplexing if several SRU functionalities are offered via one endpoint, e.g. also DFG Viewer or similar.
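
The difference between the two recordXMLEscaping values can be illustrated as follows. This is only a sketch: `embed_record` is a hypothetical helper, and a real server would write the record into the surrounding SRU response rather than return a string.

```python
from xml.sax.saxutils import escape


def embed_record(record_xml, record_xml_escaping="xml"):
    """Prepare a serialized record for inclusion in the SRU response."""
    if record_xml_escaping == "xml":
        # Default: the record is embedded directly as XML.
        return record_xml
    if record_xml_escaping == "string":
        # The record is embedded as text, so "<", ">" and "&" are escaped.
        return escape(record_xml)
    # Corresponds to SRU diagnostic 71, unsupported recordXMLEscaping value.
    raise ValueError(f"unsupported recordXMLEscaping value: {record_xml_escaping}")


record = "<record><title>Cats & dogs</title></record>"
print(embed_record(record, "xml"))
print(embed_record(record, "string"))
```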

searchRetrieve 2.0 – Unsupported Parameters

  • Sorting (sortKeys) and Faceting not supported

SRU 2.0 – Extensions

  • Extensions possible in

    • Request via Extension Parameter (prefixed with “x-” and namespace identifier, e.g. “x-fcs-”)

    • Response in the “<extraResponseData>” element

  • Response with extraResponseData only if requested in the Request with a corresponding parameter, never unsolicited

    • Server can ignore the request, no obligation

  • Unknown extension parameters are to be ignored

SRU 2.0 – Backwards Compatibility

  • Parameters “operation” and “version” only in SRU 1.1/SRU 1.2, removed in SRU 2.0 → Assumption of a separate endpoint for each SRU version

  • Heuristic for detecting the SRU version

    • searchRetrieve = query or queryType parameter

    • scan = scanClause parameter

    • explain

  • Interoperability with older versions:

    • Use of operation/version parameters → SRU < 2.0

    • Caution with parameters with changed semantics

      especially recordPacking
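
The heuristic above can be written down as a short decision function. It is a sketch under assumptions: `detect_request` is a hypothetical name, and the input is a plain dictionary of already decoded request parameters.

```python
def detect_request(params):
    """Guess (uses_legacy_parameters, operation) from request parameters.

    uses_legacy_parameters is True when "operation" or "version" is present,
    which points to SRU 1.1/1.2 because SRU 2.0 removed both parameters.
    """
    uses_legacy_parameters = "operation" in params or "version" in params
    if "operation" in params:
        operation = params["operation"]
    elif "query" in params or "queryType" in params:
        operation = "searchRetrieve"
    elif "scanClause" in params:
        operation = "scan"
    else:
        operation = "explain"
    return uses_legacy_parameters, operation


print(detect_request({"operation": "searchRetrieve", "version": "1.2", "query": "cat"}))
print(detect_request({"query": "cat"}))
print(detect_request({}))
```

A request without any of these parameters is treated as an explain request, which matches the requirement that Explain is reachable via the base URL.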

SRU 2.0 – Diagnostics

  • “Error handling”

  • Difference between (non-)fatal, (non-)surrogate → SRU 2.0 – Diagnostic Model

  • Schema: info:srw/schema/1/diagnostics-v1.1

    Prefix: info:srw/diagnostic/1/

    • uri (ID), details (additional information, depending on Diagnostic), message

  • Information:

  • Categories: General (1-9), CQL (10-49), Result Sets (50-60), Records (61-74), Sorting (80-96), Explain (100-102), Stylesheets (110-111), Scan (120-121)

  • Not limited to this list only, custom diagnostics possible

No.  Description                                        Details
1    General system error                               Debugging information (traceback)
2    System temporarily unavailable
3    Authentication error
4    Unsupported operation
5    Unsupported version                                Highest version supported
6    Unsupported parameter value                        Name of parameter
7    Mandatory parameter not supplied                   Name of missing parameter
8    Unsupported parameter                              Name of the unsupported parameter
9    Unsupported combination of parameters
10   Query syntax error
23   Too many characters in term                        Length of longest term
26   Non special character escaped in term              Character incorrectly escaped
35   Term contains only stopwords                       Value
37   Unsupported boolean operator                       Value
38   Too many boolean operators in query                Maximum number supported
47   Cannot process query; reason unknown
48   Query feature unsupported                          Feature
60   Result set not created: too many matching records  Maximum number
61   First record position out of range
64   Record temporarily unavailable
65   Record does not exist
66   Unknown schema for retrieval                       Schema URI or short name
67   Record not available in this schema                Schema URI or short name
68   Not authorized to send record
69   Not authorized to send record in this schema
70   Record too large to send                           Maximum record size
71   Unsupported recordXMLEscaping value
80   Sort not supported
110  Stylesheets not supported
111  Unsupported stylesheet                             URL of stylesheet
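
A diagnostic consists of uri, details and message. The following sketch builds such an element with the Python standard library. The XML namespace used here is an assumption about the SRU 2.0 diagnostics namespace (SRU 1.x uses a different one), and `make_diagnostic` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for SRU 2.0 diagnostics; verify against the schema in use.
DIAG_NS = "http://docs.oasis-open.org/ns/search-ws/diagnostic"
DIAG_PREFIX = "info:srw/diagnostic/1/"


def make_diagnostic(number, message, details=None):
    """Build a diagnostic element for a numbered SRU diagnostic."""
    diagnostic = ET.Element(f"{{{DIAG_NS}}}diagnostic")
    ET.SubElement(diagnostic, f"{{{DIAG_NS}}}uri").text = f"{DIAG_PREFIX}{number}"
    if details is not None:
        ET.SubElement(diagnostic, f"{{{DIAG_NS}}}details").text = details
    ET.SubElement(diagnostic, f"{{{DIAG_NS}}}message").text = message
    return diagnostic


ET.register_namespace("diag", DIAG_NS)
element = make_diagnostic(7, "Mandatory parameter not supplied", details="query")
print(ET.tostring(element, encoding="unicode"))
```

Custom diagnostics, such as the FCS ones, would use their own URI instead of the `info:srw/diagnostic/1/` prefix.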

FCS Interface Specification

FCS Architecture
  • FCS = Description of capabilities, extensions according to SRU, and operations

    → Use of SRU/CQL and extensions according to SRU

  • Interface specification = formats and transport protocol

    • Endpoint = bridge between client (FCS formats) and local search engine

    • Client = user interface, query input and result presentation

  • Discovery and Search mechanism

FCS – Discovery

  • SRU Explain

    • Help and information for the client on accessing, requesting and processing results from the server

  • Information about endpoint

    • Capabilities: Basic Search, Advanced Search?

    • Resources for search

    → Endpoint Description (XML) via explain SRU Operation

FCS – Endpoint Description

  • XML according to the schema Endpoint-Description.xsd

  • <ed:EndpointDescription>

    • @version with “2”

    • <ed:Capabilities> (1)

    • <ed:SupportedDataViews> (1)

    • <ed:SupportedLayers> (1) (if Advanced Search Capability)

    • <ed:Resources> (1)

  • <ed:Capability>

    • Content: Capability Identifier, URI

      • http://clarin.eu/fcs/capability/basic-search

      • http://clarin.eu/fcs/capability/advanced-search

  • <ed:SupportedDataView>

    • Content: MIME type, e.g. application/x-clarin-fcs-hits+xml

    • @id → for referencing in <ed:Resource>

    • @delivery-policy: send-by-default / need-to-request

    • No duplicates (based on MIME type) allowed

  • <ed:SupportedLayer>

    • (only for Advanced Search)

    • Content: Layer Identifier, e.g. “orth”

    • @id → for referencing in <ed:Resource>

    • @result-id → Referencing the layer in the Advanced Data View

    • @qualifier → Identifier in FCS-QL Search Term for the layer

    • @alt-value-info, @alt-value-info-uri: short description of the layer, e.g. for tagset, + URL with further information

    • No duplicates allowed based on @result-id

  • <ed:Resource>

    • @pid: Persistent Identifier (e.g. MdSelfLink from CMDI Record)

    • <ed:Title> (1+) with @xml:lang, no duplicates, English required

    • <ed:Description> (0+) with @xml:lang, English required, should be at most 1 sentence

    • <ed:Institution> (0+) with @xml:lang, English required

    • <ed:LandingPageURI> (0/1) – link to the website of the resource (or institution) with more information

    • <ed:Languages> (1) with <ed:Language> content according to ISO 639-3 language codes

    • <ed:AvailableDataViews> (1) with @ref = list of IDs of the <ed:SupportedDataView> elements, e.g. “hits adv”

    • <ed:AvailableLayers> (1) (if Advanced Search Capability), with @ref = list of IDs of the <ed:SupportedLayer> elements, e.g. “word lemma pos”

    • <ed:Resources> (0/1) for sub resources

    • Sub-resources should support the same <ed:AvailableDataViews> and <ed:AvailableLayers> lists as their parent resource; the lists must nevertheless be declared again for each sub-resource

Minimal Endpoint Description for BASIC Search
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
Endpoint Description with CMDI Data View and Sub-Resources
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="cmdi" delivery-policy="need-to-request">application/x-cmdi+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- top-level resource 1 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
    <!-- top-level resource 2 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0816">
      <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen</ed:Title>
      <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus</ed:Title>
      <ed:LandingPageURI>http://repos.example.org/corpus2.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits cmdi" />
      <ed:Resources>
        <!-- sub-resource 1 of top-level resource 2 -->
        <ed:Resource pid="http://hdl.handle.net/4711/0816-1">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (vor 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (before 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub1</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits cmdi" />
        </ed:Resource>
        <!-- sub-resource 2 of top-level resource 2 ... -->
      </ed:Resources>
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
Endpoint Description with ADVANCED Search Capability
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
    <ed:Capability>http://clarin.eu/fcs/capability/advanced-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="adv" delivery-policy="send-by-default">application/x-clarin-fcs-adv+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:SupportedLayers>
    <ed:SupportedLayer id="word" result-id="http://spraakbanken.gu.se/ns/fcs/layer/word">text</ed:SupportedLayer>
    <ed:SupportedLayer id="orth" result-id="http://endpoint.example.org/Layers/orth" type="empty">orth</ed:SupportedLayer>
    <ed:SupportedLayer id="lemma" result-id="http://spraakbanken.gu.se/ns/fcs/layer/lemma">lemma</ed:SupportedLayer>
    <ed:SupportedLayer id="pos" result-id="http://spraakbanken.gu.se/ns/fcs/layer/pos"
                       alt-value-info="SUC tagset"
                       alt-value-info-uri="https://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf"
                       qualifier="suc">pos</ed:SupportedLayer>
    <ed:SupportedLayer id="pos2" result-id="http://spraakbanken.gu.se/ns/fcs/layer/pos2"
                       alt-value-info="2nd tagset"
                       qualifier="t2">pos</ed:SupportedLayer>
  </ed:SupportedLayers>

  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="hdl:10794/suc">
      <ed:Title xml:lang="sv">SUC-korpusen</ed:Title>
      <ed:Title xml:lang="en">The SUC corpus</ed:Title>
      <ed:Description xml:lang="sv">Stockholm-Umeå-korpusen hos Språkbanken.</ed:Description>
      <ed:Description xml:lang="en">The Stockholm-Umeå corpus at Språkbanken.</ed:Description>
      <ed:LandingPageURI>https://spraakbanken.gu.se/resurser/suc</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>swe</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits adv" />
      <ed:AvailableLayers ref="word lemma pos pos2" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
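
On the client side, an Endpoint Description like the ones above can be read with a standard XML parser. The sketch below uses a shortened copy of the minimal BASIC Search example; `summarize` is a hypothetical helper that only looks at top-level resources and does not follow sub-resources.

```python
import xml.etree.ElementTree as ET

ED_NS = {"ed": "http://clarin.eu/fcs/endpoint-description"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened copy of the minimal BASIC Search example.
SAMPLE = """\
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Languages><ed:Language>deu</ed:Language></ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
"""


def summarize(xml_text):
    """Extract capabilities, data views and top-level resources."""
    root = ET.fromstring(xml_text)
    capabilities = [c.text for c in root.findall("ed:Capabilities/ed:Capability", ED_NS)]
    data_views = {
        view.get("id"): view.text
        for view in root.findall("ed:SupportedDataViews/ed:SupportedDataView", ED_NS)
    }
    resources = []
    for resource in root.findall("ed:Resources/ed:Resource", ED_NS):
        titles = {t.get(XML_LANG): t.text for t in resource.findall("ed:Title", ED_NS)}
        available = resource.find("ed:AvailableDataViews", ED_NS)
        resources.append({
            "pid": resource.get("pid"),
            "title": titles.get("en"),  # an English title is required
            "data_views": available.get("ref").split(),
        })
    return {
        "version": root.get("version"),
        "capabilities": capabilities,
        "data_views": data_views,
        "resources": resources,
    }


print(summarize(SAMPLE))
```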

FCS – Search

  • SRU SearchRetrieve

  • Actual “Search”

    • Basic Search with CQL

    • Advanced Search with FCS-QL

  • Search results are serialized in Resource (Fragment) and in Data View formats

  • Implementation details → Chapter Resources and Data Views
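
To give an impression of a Data View payload, the sketch below marks one hit inside a text snippet in the style of the HITS Data View. The namespace and element names are assumptions taken from the FCS specification as I understand it, `hits_data_view` is a hypothetical helper, and the surrounding Resource, Resource Fragment and Data View wrapper elements are left out.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the HITS Data View payload.
HITS_NS = "http://clarin.eu/fcs/dataview/hits"


def hits_data_view(text, start, end):
    """Wrap text[start:end] in a Hit element inside a Result element."""
    result = ET.Element(f"{{{HITS_NS}}}Result")
    result.text = text[:start]  # left context
    hit = ET.SubElement(result, f"{{{HITS_NS}}}Hit")
    hit.text = text[start:end]  # the match itself
    hit.tail = text[end:]  # right context
    return result


ET.register_namespace("hits", HITS_NS)
sentence = "The cyclists are fast"
start = sentence.index("cyclists")
element = hits_data_view(sentence, start, start + len("cyclists"))
print(ET.tostring(element, encoding="unicode"))
```

A local search engine that reports character offsets for its matches can feed this kind of function directly.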

FCS – SRU Extension Parameter

  • x-fcs-endpoint-description (explain)

    • “true” - <sru:extraResponseData> of the Explain Response contains the Endpoint Description document

  • x-fcs-context (searchRetrieve)

    • Comma-separated list of PIDs

    • Restrict the search to resources identified by these PIDs

  • x-fcs-dataviews (searchRetrieve)

    • Comma-separated list of Data View identifiers

    • Endpoints should also deliver these need-to-request Data Views if requested

  • x-fcs-rewrites-allowed (searchRetrieve)

    • “true” - Endpoint can simplify query for higher recall
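
The extension parameters can be combined with the regular SRU parameters as shown in this sketch. The helper names are hypothetical, and the value “fcs” for queryType in Advanced Search is an assumption based on FCS Core 2.0 rather than something stated on this slide.

```python
from urllib.parse import urlencode


def fcs_explain_params():
    """Parameters asking for the Endpoint Description in the Explain response."""
    return {"x-fcs-endpoint-description": "true"}


def fcs_search_params(query, pids=(), dataviews=(), advanced=False, rewrites_allowed=False):
    """Assemble searchRetrieve parameters including FCS extension parameters."""
    params = {"query": query}
    if advanced:
        params["queryType"] = "fcs"  # assumed queryType value for FCS-QL
    if pids:
        params["x-fcs-context"] = ",".join(pids)  # restrict to these resources
    if dataviews:
        params["x-fcs-dataviews"] = ",".join(dataviews)  # need-to-request views
    if rewrites_allowed:
        params["x-fcs-rewrites-allowed"] = "true"
    return params


params = fcs_search_params(
    '"cat"',
    pids=["http://hdl.handle.net/4711/0816"],
    dataviews=["cmdi"],
)
print(urlencode(params))
```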

FCS – Diagnostics

Identifier URI                      Description                                                                        Impact
http://clarin.eu/fcs/diagnostic/1   Persistent identifier passed by the Client for restricting the search is invalid.  non-fatal
http://clarin.eu/fcs/diagnostic/2   Resource set too large. Query context automatically adjusted.                      non-fatal
http://clarin.eu/fcs/diagnostic/3   Resource set too large. Cannot perform Query.                                      fatal
http://clarin.eu/fcs/diagnostic/4   Requested Data View not valid for this resource.                                   non-fatal
http://clarin.eu/fcs/diagnostic/10  General query syntax error.                                                        fatal
http://clarin.eu/fcs/diagnostic/11  Query too complex. Cannot perform Query.                                           fatal
http://clarin.eu/fcs/diagnostic/12  Query was rewritten.                                                               non-fatal
http://clarin.eu/fcs/diagnostic/14  General processing hint.                                                           non-fatal

Versions and Backwards Compatibility

  • “Clients MUST be compatible to CLARIN-FCS 1.0” (source)

    • Thus implementation of SRU 1.2 still required (?)

    • Restriction to Basic Search Capability

    • Processing of legacy XML namespaces (SRU Response, Diagnostics)

  • Heuristic for version detection (of endpoints)

    • Client: Explain request without version and operation parameters

    • Endpoint: SRU Response <sru:explainResponse>/<sru:version> with default SRU version

  • Versions

    • FCS 2.0 ↔ SRU 2.0

    • FCS 1.0 ↔ SRU 1.2 (SRU 1.1)

Notes on FCS SRU Aggregator

  • Currently no (?) support for FCS 2.0 only endpoints

    • For compatibility reasons support of Legacy FCS and FCS 1.0

    • Assumption that endpoints in FCS 2.0 also support earlier FCS Versions… (no issue with CLARIN SRU/FCS libraries)

    FCS 2.0 only endpoints may therefore still receive FCS 1.0 (SRU 1.2) requests!

  • Aggregator sends searchRetrieve requests with only one resource PID in the x-fcs-context parameter for each resource requested

    • i.e. search across N resources of an endpoint → N separate search queries

Reference Implementations

  • Java and Python, focus on FCS endpoints

  • Java class hierarchies, organization & structure, processes & lifecycles, configuration

CLARIN Reference Libraries (Java)

  • Development started ~2012

  • Modularized: Client/Server, SRU/FCS, Parser

  • Java 1.8+ (EOL: end of 2030)

  • Extensive documentation, some tests (proven by being in use for a long time)

  • Artifacts in CLARIN Nexus, Code on Github

  • Server/endpoint: external dependencies to

    • Logging: slf4j

    • HTTP: javax.servlet:servlet-api

    • Parser: antlr4 (FCS-QL) / CQL

  • Build: maven

  • Deployment: jetty, tomcat, …

CLARIN Reference Libraries (Python)

  • ~ 2022: Translation of Java reference libraries to Python

  • Strong orientation towards the Java reference libraries

    → (almost) identical interfaces, class/function names

  • but: slight optimizations for Python, no 1:1 copy

  • Focus on (new) FCS endpoints → no clients!

  • Typed, documented; published on PyPI

  • Synchronous, minimal WSGI - allows embedding in existing apps

  • Python 3.8+

  • Dependencies to

    • XML parsing: lxml

    • HTTP/WSGI: werkzeug

    • Query Parser: PLY (CQL), ANTLR4 (FCS-QL)

CLARIN Reference Libraries

  • Maven Endpoint Archetype: Java

  • FCS SRU Aggregator: Java

  • FCS Endpoint Validator: Java (old), Java ← test compliance with SRU/FCS protocol

  • Korp: Java, Python

Indexdata: CQL-Parser, Querela: Python implementations

  • Note: concrete examples and implementations will follow in a later section, high-level overview here

FCS Endpoint – Design and structure

  • Query Parser (CQL, FCS-QL)

  • FCS SRU Server

    • SRU configurations, versions, parameters, diagnostics, namespaces

    • XML SRU Writer

    • Request Parameter parser, SRUServer (request handler)

    • Abstract SRU interfaces (results, SRUSearchEngine)

    • Auth (Interface, WIP)

  • FCS Simple Endpoint

    • FCS configurations (Endpoint Description), parameters, diagnostics, namespaces

    • XML Endpoint Description parser, Record and Data View writer

    • SimpleEndpointSearchEngineBase (SRUSearchEngine + FCS extensions)

  • FCS Endpoint for XYZ

    • Implementation of abstract classes and bindings to search engine, query translation

    • Configuration: Endpoint Description, SRU Server Configuration

    • Deployment on Java Servlet Server or as WSGI app

FCS Endpoint – Initialization

SRUServerServlet / SRUServerApp (web server)

  • Set default WebApp parameters

  • Parse the SRU Server Config

  • Create QueryParserRegistry (CQL)

  • Initialize SRUSearchEngine

  • Create SRUServer (with SearchEngine + configurations)

SRUSearchEngine
(user implementation, → SimpleEndpointSearchEngineBase)

  • Further initialization of the QueryParserRegistry (FCS-QL)

  • do_init (user init)

  • Create Endpoint Description

FCS Endpoint – Communication Flow

[GET] request (incoming)

SRUServerServlet / SRUServerApp (web server)

SRUServer

  • URL parameter evaluation

  • Multiplexing by operation: search/scan/explain

SimpleEndpointSearchEngineBase (user implementation)

  • Parse search query (CQL/FCS-QL) and send to search engine

  • Wrap result in SRUSearchResultSet

  • Possible diagnostics etc.

  • optional error handling

  • XML output generation (SRU parameter)
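
The multiplexing step can be sketched in a few lines. This is not the code of the reference libraries; the fallback rules for a missing operation parameter are modeled on SRU 2.0 and are an assumption here, and `handle_request` as well as the shape of `search_engine` are hypothetical.

```python
SUPPORTED_OPERATIONS = ("explain", "scan", "searchRetrieve")


def detect_operation(params):
    """Decide which SRU operation a request refers to."""
    operation = params.get("operation")
    if operation is not None:
        if operation not in SUPPORTED_OPERATIONS:
            raise ValueError(f"unsupported operation: {operation}")
        return operation
    # no explicit operation: infer it from the other parameters
    if "query" in params:
        return "searchRetrieve"
    if "scanClause" in params:
        return "scan"
    return "explain"


def handle_request(params, search_engine):
    """Dispatch to the matching method of a search engine object."""
    handlers = {
        "explain": search_engine.explain,
        "scan": search_engine.scan,
        "searchRetrieve": search_engine.search,
    }
    return handlers[detect_operation(params)](params)
```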

FCS Endpoint – Class Hierarchy

SRUServerServlet - Java (Servlet) / SRUServerApp - Python (WSGI)

Servlet implementation for servlet container, doGet handler, setup of SRUServer wrapper/application executed by the endpoint operator

SRUServer - Java, Python

SRU protocol implementation, handleRequest, error handling, XML output generation

SRURequestImpl - Java, Python

Specific SRU GET parameter evaluation (parsing, validation; SRU versions) + possible FCS parameters (“x-…​”), SRU version detection

↳ SRURequest (Interface) - Java, Python

Documentation of all SRU parameters

XYZEndpointSearchEngine - korp: Java, Python

Actual implementation of createEndpointDescription, do* methods

SimpleEndpointSearchEngineBase (abstract) - Java, Python

Lifecycle (init, destroy), integration of endpoint description, interfaces for users

SRUSearchEngineBase (abstract) - Java

SRUSearchEngine (interface) - Java, Python

Interface: search, explain, scan

XYZSRUSearchResultSet - korp: Java, Python

Actual implementation, nextRecord + writeRecord iterator and serialization of results

SRUSearchResultSet (abstract) - Java, Python

Fields for searchRetrieve operation results (total, records, …)

SRUAbstractResult (interface) - Java, Python

Diagnostics + ExtraResponseData

XYZSRUScanResultSet, XYZSRUExplainResult do not need to be implemented separately, default behavior is adequate

SRUConstants - Java, Python

  • Diagnostic codes

  • Namespaces

  • Python: SRU parameter + values

SRUDiagnostic - Java, Python

  • Error handling, message (text description) of the diagnostic

Endpoint Configurations

<?xml version="1.0" encoding="UTF-8"?>
<endpoint-config xmlns="http://www.clarin.eu/sru-server/1.0/">
    <databaseInfo>
        <title xml:lang="sv">Språkbankens korpusar</title>
        <title xml:lang="en" primary="true">The Språkbanken corpora</title>
        <description xml:lang="sv">Sök i Språkbankens korpusar.</description>
        <description xml:lang="en" primary="true">Search in the Språkbanken corpora.</description>
        <author xml:lang="en">Språkbanken (The Swedish Language Bank)</author>
        <author xml:lang="sv" primary="true">Språkbanken</author>
    </databaseInfo>

    <indexInfo>
        <set name="fcs" identifier="http://clarin.eu/fcs/resource">
            <title xml:lang="se">Clarins innehållssökning</title>
            <title xml:lang="en" primary="true">CLARIN Content Search</title>
        </set>
        <index search="true" scan="false" sort="false">
            <title xml:lang="en" primary="true">Words</title>
            <map primary="true">
                <name set="fcs">words</name>
            </map>
        </index>
    </indexInfo>

    <schemaInfo>
        <schema identifier="http://clarin.eu/fcs/resource" name="fcs"
                sort="false" retrieve="true">
            <title xml:lang="en" primary="true">CLARIN Content Search</title>
        </schema>
    </schemaInfo>
</endpoint-config>

WebApp parameters (web.xml or similar) - Korp example

  • SRU Version

  • SRU/FCS configurations

SRU (SRU Server Config) - Korp example

  • databaseInfo about endpoint, but no evaluation in client?

  • default: indexInfo + schemaInfo

  • Mandatory: database field in serverInfo!

FCS (Endpoint Description) - Korp example

  • FCS Version (1/2)

  • Capabilities, Layer, DataViews

  • Resources

Resources and Data Views

  • Endpoint Capabilities, BASIC/ADVANCED Search, FCS-QL

  • Resource, Resource Fragment, Data View (Hits, Advanced)

  • Result serialization, query languages

Endpoint Description – Capabilities

http://clarin.eu/fcs/capability/basic-search

  • Mandatory

  • Query: Full-text search (Basic) with minimal CQL (AND/OR)

  • DataView: HITS

http://clarin.eu/fcs/capability/advanced-search

  • Optional

  • Query: FCS-QL (Structured search over annotation layers)

  • DataView: HITS and Advanced

  • Other capabilities possible

    → currently limited to Basic and Advanced Search!

  • Do not only determine search modes!

  • Work in progress:

    • Authentication/authorization

    • Lexical search: …/lex-search → LexCQL, LexHITS

    • Syntactic search?

  • Note: according to XSD, capability URIs have the following schema http://clarin.eu/fcs/capability/\w([\.\-]{0,1}\w)*
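
The capability URI schema from the XSD can be checked with a regular expression. The sketch below is an approximation: XSD patterns are implicitly anchored (hence `fullmatch`), the dots of the host name are escaped here (stricter than the XSD pattern), and Python's `\w` is not identical to the XSD character class (Python also accepts the underscore).

```python
import re

# Approximation of the XSD pattern for capability URIs
CAPABILITY_PATTERN = re.compile(r"http://clarin\.eu/fcs/capability/\w([.\-]?\w)*")


def is_valid_capability_uri(uri):
    """True if the whole string matches the capability URI pattern."""
    return CAPABILITY_PATTERN.fullmatch(uri) is not None
```

Single dots or hyphens are allowed between word characters, but not at the end and not doubled.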

cat
"cat"
cat AND dog
"grumpy cat"
"grumpy cat" AND dog
"grumpy cat" OR "lazy dog"
cat AND (mouse OR "lazy dog")
  • Mandatory!

  • Simple full-text search

  • Contextual Query Language (CQL) as query language

  • Endpoints

    • must support “term-only” queries

    • can support Boolean operators (AND/OR) and sub-queries

    • must abort in case of errors with appropriate diagnostics

    • can decide themselves what to search for

      (text, normalization etc.)

  • Results serialized in Generic Hits (HITS) Data View

http://clarin.eu/fcs/capability/basic-search

"walking"
[token = "walking"]
"Dog" /c
[word = "Dog" /c]
[pos = "NOUN"]
[pos != "NOUN"]
[lemma = "walk"]
"blaue|grüne" [pos = "NOUN"]
"dogs" []{3,} "cats" within s
[z:pos = "ADJ"]
[z:pos = "ADJ" & q:pos = "ADJ"]
  • Optional

  • Structured search in annotated data,

    represented in annotation layers

    → Query language FCS-QL

    • Queries can combine different annotation layers

    • Endpoints should support as many annotation layers as possible

  • Results serialized in Advanced (ADV) Data View and Generic Hits (HITS) Data View

http://clarin.eu/fcs/capability/advanced-search

FCS-QL

  • Annotation Layers, containing annotations of a certain type (e.g. text, POS tags, …)

  • Query supports combination of these layers

  • Each layer is segmented → search for individual lemma

    • No requirement as to how segmentation should be done

    • Assumption that segmentation is consistent across layers (for display in Advanced Data View)

    • Queries can combine segments for multi-token patterns

FCS-QL – Notes

  • Endpoints must be able to parse FCS-QL completely!

  • Requests with unsupported operators or layers?

    • Generate errors with diagnostics, or

    • Rewrite queries if permitted by “x-fcs-rewrites-allowed” (on request)

  • Searches are Case Sensitive (configurable in the query)

  • Searches (by endpoints) should take place on layers where it makes sense,

    e.g. if there are several text or POS layers

FCS-QL – Layer Types

| Layer Type Identifier | Annotation Layer Description | Syntax | Examples (without quotes) |
|---|---|---|---|
| text | Textual representation of resource, also the layer that is used in Basic Search | String | "Dog", "cat", "walking", "better" |
| lemma | Lemmatisation | String | "good", "walk", "dog" |
| pos | Part-of-Speech annotations | Universal POS tags | "NOUN", "VERB", "ADJ" |
| orth | Orthographic transcription of (mostly) spoken resources | String | "dug", "cat", "wolking" |
| norm | Orthographic normalization of (mostly) spoken resources | String | "dog", "cat", "walking", "best" |
| phonetic | Phonetic transcription | SAMPA | "'du:", "'vi:-d6 'ha:-b@n" |

  • Universal Dependencies, Universal POS tags v2.0

  • Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems. Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7

FCS-QL – Layer Type Identifier

  • Identifies layers for FCS-QL and Advanced Data View

  • Other identifiers are not allowed, except for testing purposes

  • Custom identifiers must be prefixed with “x-”

Result Serialization

  • Results must be serialized in CLARIN FCS format

    • Resource (Fragment), Data View

    • XML → XSD

  • Important: 1 Hit = 1 Result Record

    • Do not combine multiple hits in one record

      → generate separate SRU records for each hit that reference the same resource

    • Multiple hit markers are allowed, e.g. for boolean expressions to highlight individual terms

    • Each “Hit” should be defined in a sentence context

Resource

  • “searchable and addressable entity” in the endpoint, e.g. text corpus

  • “self contained”, i.e. entire document, not a single sentence from a document

  • Addressable as a whole via Persistent Identifier or URI

Resource Fragment

  • Part of a Resource, e.g. single sentence, or time interval in audio transcription (for multi-modal corpora)

  • Should be addressable within a Resource (offset / ID)

  • Optional, but recommended

Data View

  • Serialization of a “Hit” in a Resource (Fragment)

  • Enables different representations, expandable

Result Serialization – Linking

  • Endpoints should provide link to Resource (Fragment)

    • Persistent Identifier (PID) / URI

    • If direct linking is not possible, then e.g. website with description of the resource, corpus or collection

    • Link should be as specific as possible

    • PIDs preferred to URIs, both together recommended

Result Serialization – Examples

HITS Data View of a resource with PID
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/00-15">
  <fcs:DataView type="application/x-clarin-fcs-hits+xml">
    <!-- data view payload omitted -->
  </fcs:DataView>
</fcs:Resource>
HITS Data View for a resource with Resource Fragment for more granular structuring
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15">
  <fcs:ResourceFragment>
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <!-- data view payload omitted -->
    </fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>
HITS Data View with CMDI Data View for resource metadata
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource"
              pid="http://hdl.handle.net/4711/08-15"
              ref="http://repos.example.org/file/text_08_15.html">
  <!-- (1) -->
  <fcs:DataView type="application/x-cmdi+xml"
                pid="http://hdl.handle.net/4711/08-15-1"
                ref="http://repos.example.org/file/08_15_1.cmdi">
      <!-- data view payload omitted -->
  </fcs:DataView>

  <!-- (2) -->
  <fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2"
                        ref="http://repos.example.org/file/text_08_15.html#sentence2">
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <!-- data view payload omitted -->
    </fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>
(1) Specification of CMDI metadata for the resource
(2) Hit is part of a larger resource (“semantically more meaningful”)

Data Views

  • Specification (with XSD schema, examples)

  • Specified in FCS Core 2.0

    • Advanced (ADV) Data View

    • Generic Hits (HITS) Data View

  • Additional Data Views such as Component Metadata (CMDI), Images (IMG), Geolocation (GEO) are included, but not used in the standard FCS client “Aggregator”

  • Delivery policy: mandatory “send-by-default”

    or optional “need-to-request”

  • Generic Hits Data View is mandatory, must always be sent

  • Only send Data Views that

    • are explicitly requested with the (SRU) FCS parameter “x-fcs-dataviews”, or

    • have the delivery policy “send-by-default”

  • Invalid Data Views → non-fatal diagnostic for each requested Data View

    http://clarin.eu/fcs/diagnostic/4

    ("Requested Data View not valid for this resource")
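
The delivery rules above can be sketched as a small selection function. It assumes that “x-fcs-dataviews” carries a comma-separated list of Data View identifiers and that the endpoint knows the delivery policy per identifier; both function names are hypothetical.

```python
def parse_dataviews_param(value):
    """Split the value of x-fcs-dataviews into a list of identifiers."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def select_data_views(supported, requested):
    """Decide which Data Views to send for a resource.

    supported: dict of Data View id -> delivery policy
    requested: list of explicitly requested Data View ids
    Returns (ids to send, invalid ids); each invalid id should result in
    one non-fatal diagnostic http://clarin.eu/fcs/diagnostic/4.
    """
    to_send = [dv for dv, policy in supported.items()
               if policy == "send-by-default"]
    invalid = []
    for dv in requested:
        if dv not in supported:
            invalid.append(dv)
        elif dv not in to_send:
            to_send.append(dv)
    return to_send, invalid
```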

Hits Data View

Description: The representation of the hit
MIME type: application/x-clarin-fcs-hits+xml
Payload Disposition: inline
Payload Delivery: send-by-default (REQUIRED)
Recommended Short Identifier: hits (RECOMMENDED)
XML Schema: DataView-Hits.xsd

  • Required implementation

  • Simplest serialization, (lossy) approximation of results

  • Each hit should only occur in a single sentence context (or similar)

  • Multiple hit annotations possible, e.g. for conjunctions in the query

Hits Data View – Examples

HITS Data View with a hit marker
<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
  <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
    The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy dog.
  </hits:Result>
</fcs:DataView>
HITS Data View with multiple hit markers for boolean queries
<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
  <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
    The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.
  </hits:Result>
</fcs:DataView>
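
Such a Data View can also be generated from plain text plus hit offsets. The following sketch uses only the Python standard library instead of the writer classes of the reference libraries; the function name and the offset convention (end exclusive, ranges sorted and non-overlapping) are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/resource"
HITS_NS = "http://clarin.eu/fcs/dataview/hits"
HITS_MIME = "application/x-clarin-fcs-hits+xml"

# use the same prefixes as in the examples when serializing
ET.register_namespace("fcs", FCS_NS)
ET.register_namespace("hits", HITS_NS)


def build_hits_dataview(text, hit_ranges):
    """Wrap each (start, end) range of text in a <hits:Hit> element."""
    dataview = ET.Element(f"{{{FCS_NS}}}DataView", {"type": HITS_MIME})
    result = ET.SubElement(dataview, f"{{{HITS_NS}}}Result")
    pos = 0
    last_hit = None
    for start, end in hit_ranges:
        between = text[pos:start]
        if last_hit is None:
            result.text = between      # text before the first hit
        else:
            last_hit.tail = between    # text between two hits
        last_hit = ET.SubElement(result, f"{{{HITS_NS}}}Hit")
        last_hit.text = text[start:end]
        pos = end
    rest = text[pos:]
    if last_hit is None:
        result.text = rest
    else:
        last_hit.tail = rest
    return dataview
```

Calling `ET.tostring(build_hits_dataview(text, ranges), encoding="unicode")` yields the XML string for one record.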

KWIC Data View

Description: The representation of the hit
MIME type: application/x-clarin-fcs-kwic+xml
Payload Disposition: inline
Payload Delivery: send-by-default (REQUIRED)
Recommended Short Identifier: kwic (RECOMMENDED)
XML Schema: -

  • Deprecated!

  • Only for compatibility with Legacy FCS clients

  • Example in CQP/SRU bridge

  • Mapping of

    • left and right context,

    • hits

  • Serializer Java, Python

  • Aggregator transforms it to Hits Data View!

Advanced Data View

Description: The representation of the hit for Advanced Search
MIME type: application/x-clarin-fcs-adv+xml
Payload Disposition: inline
Payload Delivery: send-by-default (REQUIRED)
Recommended Short Identifier: adv (RECOMMENDED)
XML Schema: DataView-Advanced.xsd

  • Serialization for Advanced Search for multimedia data (text, transcribed audio)

  • Presentation of structured information via multiple annotation layers

  • Annotations are streams (ranges) over the signal in a stand-off like format with start and end offsets (inclusive)

  • Segmentation via <Segment>, annotations in <Span> in <Layer>

    • Segments must be possible to align over all annotation layers
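
How segments with inclusive end offsets line up across layers can be sketched as follows. The sketch assumes zero-based character offsets, non-empty tokens separated by a single space, and made-up segment ids ("s1", "s2", …); it does not produce the actual XML of the Advanced Data View.

```python
def compute_segments(tokens):
    """Return (segment id, start, end) per token, end offset inclusive."""
    segments = []
    offset = 0
    for i, token in enumerate(tokens, start=1):
        start = offset
        end = start + len(token) - 1   # inclusive end offset
        segments.append((f"s{i}", start, end))
        offset = end + 2               # skip the separating space
    return segments


def align_layer(segments, values):
    """Attach one annotation value per segment (same segmentation for every layer)."""
    if len(segments) != len(values):
        raise ValueError("layer must provide exactly one value per segment")
    return [(seg_id, value) for (seg_id, _, _), value in zip(segments, values)]
```

Because every layer refers to the same segment ids, a client can render the layers as aligned rows.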

Advanced Data View – Example

Advanced Data View - Stream Data

Advanced Data View – Example (2)

Advanced Data View - Relation

Examples

Query Translation

  • Query Languages, Visualization

  • FCS-QL Details

  • Query Mapping

Query Languages

CQL-JS Demo

FCS-QL – Visualization

FCS-QL parse tree
  • Installation

    pip install antlr4-tools
    git clone https://github.com/clarin-eric/fcs-ql.git
    cd fcs-ql/src/main/antlr4/eu/clarin/sru/fcs/qlparser
  • Visualization according to ANTLR4 > Getting Started

    antlr4-parse FCSParser.g4 FCSLexer.g4 query -gui
    [ word = "her.*" ] [ lemma = "Artznei" ] [ pos = "VERB" ]
    ^D

FCS-QL Query Nodes

QueryNode (with child node “children”)

  • Expression (layer identifier, layer identifier qualifier, operator, regular expression + flags)

    • Wildcard

    • Group → 1 QueryNode; “(” … “)”

    • NOT → 1 QueryNode

    • AND, OR → list of QueryNodes

  • QueryDisjunction → list of QueryNodes

  • QuerySequence → list of QueryNodes → “list of QuerySegments”

  • QuerySegment (min, max) → Expression → “a single token”

  • QueryGroup (min, max) → QueryNode

  • Within-Query (SimpleWithin, QueryWithWithin) (Scope: sentence, utterance, paragraph, turn, text, session) (unused)

  • grayed out: currently not supported by the FCS Aggregator for searching (in visual query builder)

FCS-QL Query Nodes – Aggregator

FCS-QL Query Builder

Parsed Query:

  • Query Sequence: with list of Query Segments

    [ word = ".*her" ] [ lemma = "Artznei" ] [ pos = "VERB" ]

  • Query Segment: a token (can be repeatable)

    [ word = "her.*" & ( word = "test" | word = "Apfel" ) ] [ pos = "ADV" ]{1,3}

    • Expression AND

      [ word = "her.*" & word = "test" ]

      • Expression Group

      • Expression

    • Expression Group → Expression OR → list of Expressions

      [ ( word = "her.*" | word = "Test" ) ]

    • Expression: Layer Identifier, Operator, Regex (value)

      [ word = "her.*" ]

FCS-QL – Remarks

  • Currently (Aggregator v3.9.1) only limited support of all FCS-QL features

    → partly due to Visual Query Builder

  • Free text input / improved query builder planned for the future

  • Use appropriate diagnostics if query features are not supported

    • SRU: info:srw/diagnostic/1/48 - Query feature unsupported.

    • FCS: http://clarin.eu/fcs/diagnostic/10 - General query syntax error. - should be intercepted by FCS-QL parser library

    • FCS: http://clarin.eu/fcs/diagnostic/11 - Query too complex. Cannot perform Query.

Query-Mapping

  • Idea:

    • Let libraries parse raw queries (CQL, FCS-QL)

    • Recursively walk through the parsed query tree, “depth first”

    • Successively generate transformed query (for target system),

      e.g. StringBuilder in Java

  • Examples:

  • ElasticSearch

  • Solr

    • Only BASIC Search

    • ADVANCED Search with e.g. MTAS (“Multi Tier Annotation Search”)

  • In general: use actual Corpus Search Engine for ADVANCED Search

    → otherwise at most a single annotation layer (“text”) can be searched
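
The recursive, depth-first mapping idea can be illustrated with a toy query tree. The node classes below are hypothetical stand-ins and not the classes of the CQL or FCS-QL parser libraries; the Lucene-like target syntax and the field name "text" are assumptions as well.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Term:
    value: str


@dataclass
class BoolOp:
    operator: str            # "AND" or "OR"
    children: List["Node"]


Node = Union[Term, BoolOp]


def to_target_query(node, field="text"):
    """Walk the tree depth first and build the target query string."""
    if isinstance(node, Term):
        # escape backslashes and quotes inside the phrase
        escaped = node.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'{field}:"{escaped}"'
    if isinstance(node, BoolOp):
        parts = [to_target_query(child, field) for child in node.children]
        return "(" + f" {node.operator} ".join(parts) + ")"
    raise ValueError(f"unsupported node type: {type(node).__name__}")
```

Unsupported node types raise an error here; in an endpoint this is the place to emit a diagnostic such as "Query feature unsupported" instead.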

FCS Endpoint Development

  • VSCode settings, kickstart a project

  • Minimal FCS endpoint, search engine connection, result serialization

  • Deployment, Embedding, Extensibility

Visual Studio Code (suggestion)

  • QoL = Quality of Life

Visual Studio Code – Debugging (Java)

  • For *.war/Jetty web application testing

  • No hot code swapping / do not make any changes between compilation and debugging!

  • VSCode Debug Setting:

    • Run and Debug > Add Configuration … > “Java: Attach by Process ID”

  • Run application with Maven:

    MAVEN_OPTS="-Xdebug -Xnoagent -Djava.compiler=NONE -agentlib:jdwp=transport=dt_socket,server=y,address=5005" \
        mvn [jetty:run-war|...]

Visual Studio Code – Debugging (Python)

  • launch.json

    • pytest: no predefined configuration in “Run and Debug” menu

    • file/module: as required

  • settings.json

    • pytest: coverage must be deactivated here!

{
    "name": "Python: pytest",
    "type": "python",
    "request": "launch",
    "console": "integratedTerminal",
    "purpose": [
        "debug-test"
    ],
    "justMyCode": false
}
"python.testing.pytestArgs": [
    ".",
    // disable coverage for debugging
    "--no-cov",
    // disable ansi color output (-vv)
    "-q",
],

Kickstart FCS Endpoint Project

CLARIN SRU/FCS Endpoint Archetype

  • Installation of the archetype in the local Maven repository, or

  • Configuration of the CLARIN Nexus as a remote repository

  • Project generation with Maven:

mvn archetype:generate \
    -DarchetypeGroupId=eu.clarin.sru.fcs \
    -DarchetypeArtifactId=fcs-endpoint-archetype \
    -DarchetypeVersion=1.6.0 \
    -DgroupId=[ id.group.fcs ] \
    -DartifactId=[ my-cool-endpoint ] \
    -Dversion=[ 1.0-SNAPSHOT ] \
    -DinstitutionName=[ "My Institution" ]
  • all […​ ] placeholders must be replaced with the appropriate values (enclose values with spaces in quotation marks)

  • if archetype is installed using git, then use archetypeVersion=1.6.0-SNAPSHOT (see details in pom.xml)

Minimal FCS Endpoint

  • Required class implementations

    • SimpleEndpointSearchEngineBase

    • SRUSearchResultSet

    • Wrapper or adapter for search engine (!)

  • Required configurations

    • sru-server-config.xml

    • endpoint-description.xml

    • Web app configurations

      (Java: web.xml, Python: key-value parameter dict)

      • Reference to implementation of the SimpleEndpointSearchEngineBase

      • Required SRU parameters (host, port, server, …)

Minimal FCS Endpoint – Initialization

SimpleEndpointSearchEngineBase (Java, Python)

void doInit (ServletContext context, SRUServerConfig config, SRUQueryParserRegistry.Builder queryParsersBuilder, Map<String, String> params) - Java, Python

  • Required implementation!

  • (optional) initialization of APIs, default values (PIDs), …

EndpointDescription createEndpointDescription (ServletContext context, SRUServerConfig config, Map<String, String> params) - Java, Python

  • Required implementation!

  • Loading of EndpointDescription (Java, Python)

    • embedded XML file (load with SimpleEndpointDescriptionParser, Java, Python) or

    • construction dynamically, e.g. via API - example NoSketchEngine

Minimal FCS Endpoint – Scan/Explain

  • (theoretically) nothing to implement

    → Default handlers for “explain” and “scan” respond to requests automatically

  • Endpoint Description is always returned as an “explain” operation (in case of doubt)

SimpleEndpointSearchEngineBase (Java, Python)

Minimal FCS Endpoint – Search Request

SRUSearchResultSet search (SRUServerConfig config, SRURequest request, SRUDiagnosticList diagnostics)

  • Parse query (search request)

    • Check “queryType” parameter, whether CQL, FCS-QL, …

    • Error: SRU_CANNOT_PROCESS_QUERY_REASON_UNKNOWN

  • Analyze ExtraRequestData

    • “x-fcs-context” - requested resource (scope of search)

      • Diagnostic: FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID - invalid PIDs

      • Error: SRU_UNSUPPORTED_PARAMETER_VALUE - e.g. too many PIDs, no PIDs

    • “x-fcs-dataviews” - requested Data Views

      • Diagnostic: FCS_DIAGNOSTIC_REQUESTED_DATA_VIEW_INVALID - invalid Data Views

  • Pagination → startRecord (1) / maximumRecords (-1)

  • Process search with (local) search engine

  • Wrap results in SRUSearchResultSet

  • “If in Doubt” → `SRU_GENERAL_SYSTEM_ERROR`
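
The pagination step usually means converting the 1-based startRecord and the maximumRecords value into an offset and a count for the local search engine. A minimal sketch, assuming that -1 stands for "no limit requested" as suggested by the defaults above; the function names are hypothetical.

```python
def page_slice(start_record=1, maximum_records=-1):
    """Convert SRU pagination into a zero-based (offset, count) pair.

    count is None when no limit was requested (maximum_records < 0).
    """
    if start_record < 1:
        raise ValueError("startRecord is 1-based and must be >= 1")
    offset = start_record - 1
    count = None if maximum_records < 0 else maximum_records
    return offset, count


def apply_page(results, start_record=1, maximum_records=-1):
    """Apply the pagination to an in-memory list of results."""
    offset, count = page_slice(start_record, maximum_records)
    end = None if count is None else offset + count
    return results[offset:end]
```

In practice an endpoint would pass offset and count to the search engine rather than slicing a full result list, and may additionally cap the count with its own maximum.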

Search Engine Integration

  • Input: Parameters of search query

    • Query (translated for (local) search engine)

    • Resource (PID)

    • Pagination: offset + count, → startRecord (1) / maximumRecords (-1)

    • (Request object and Server configurations)

    • (all global/static objects, such as API adapters etc.)

  • Output: Details for response, results

    • Total number (optional, FCS 2.0 allows indication of accuracy)

    • List of results

      • with “hit highlighting” (Hits) (Basic + Advanced Search)

      • tokenized (using character offsets) for FCS-QL (Advanced Search) with optional Advanced Search annotation layers

    • Diagnostics

  • Wrapper for results

    • Total number of results

    • List of results (text with hit offsets; tokens + annotations)

    • Resource PID, URL to result details

  • SRUSearchResultSet implementation

    • Iterator interface → nextRecord(), writeRecord(); curRecordCursor

  • Ex: MyResults, NoSkESRUFCSSearchResultSet

protected NoSkESRUFCSSearchResultSet(..., MyResults results) {
    super(diagnostics);
    this.serverConfig = serverConfig;
    this.request = request;

    this.results = results;
    currentRecordCursor = -1;  // cursor starts before the first record
    // ...
}

public int getTotalRecordCount() { return (int) results.getTotal(); }
public int getRecordCount() { return results.getResults().size(); }

public boolean nextRecord() throws SRUException {
    // advance the cursor as long as there are records left
    if (currentRecordCursor < (getRecordCount() - 1)) {
        currentRecordCursor++;
        return true;
    }
    return false;
}

public void writeRecord(XMLStreamWriter writer) throws XMLStreamException {
    // serialize the record the cursor currently points to
    MyResults.ResultEntry result = results.getResults().get(currentRecordCursor);

    XMLStreamWriterHelper.writeStartResource(writer, results.getPid(), null);
    XMLStreamWriterHelper.writeStartResourceFragment(writer, null, result.landingpage);
    // ...
    XMLStreamWriterHelper.writeEndResourceFragment(writer);
    XMLStreamWriterHelper.writeEndResource(writer);
}
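
The cursor logic of the result set is independent of Java and can be mirrored in a few lines of Python. The class below is a hypothetical stand-in to show the nextRecord pattern; it does not extend the SRUSearchResultSet class of the Python reference library.

```python
class ResultCursor:
    """Minimal cursor over a list of results (one result = one record)."""

    def __init__(self, results):
        self._results = list(results)
        self._cursor = -1  # before the first record

    def get_record_count(self):
        return len(self._results)

    def next_record(self):
        """Advance to the next record; False when no records are left."""
        if self._cursor < self.get_record_count() - 1:
            self._cursor += 1
            return True
        return False

    def current(self):
        """Return the record the cursor points to."""
        if self._cursor < 0:
            raise LookupError("call next_record() first")
        return self._results[self._cursor]
```

The server side then loops with `while cursor.next_record(): ...` and serializes `cursor.current()` in each iteration.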

Result Serialization

  • SRUXMLStreamWriter - Java, Python

    • (internal), specifically for SRU “recordXmlEscaping”

  • XMLStreamWriterHelper - Java, Python (FCSRecordXMLStreamWriter)

    • Boilerplate + help for writing Record, RecordFragment, Hits/Kwic Data View

  • AdvancedDataViewWriter - Java, Python

    • Help with writing Advanced Data Views

    • addSpans (content, layer, offset, hit?)

      writeHitsDataView, writeAdvancedDataView

Minimal Configuration – Endpoint Description

  • FCS Version: 2

  • Capabilities: BASIC Search

  • Data Views: HITS

  • Resources: (min: 1)

    • Title

    • Description

    • LandingPage URL

    • Languages → one language (ISO 639-3)

<?xml version="1.0" encoding="UTF-8"?>
<EndpointDescription xmlns="http://clarin.eu/fcs/endpoint-description"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://clarin.eu/fcs/endpoint-description ../../schema/Core_2/Endpoint-Description.xsd"
             version="2">
  <Capabilities>
    <Capability>http://clarin.eu/fcs/capability/basic-search</Capability>
  </Capabilities>
  <SupportedDataViews>
    <SupportedDataView id="hits" delivery-policy="send-by-default" >application/x-clarin-fcs-hits+xml</SupportedDataView>
  </SupportedDataViews>
  <Resources>
    <Resource pid="hdl:10794/sbkorpusar">
      <Title xml:lang="sv">Språkbankens korpusar</Title>
      <Title xml:lang="en">The Språkbanken corpora</Title>
      <Description xml:lang="sv">Korpusarna hos Språkbanken.</Description>
      <Description xml:lang="en">The corpora at Språkbanken.</Description>
      <LandingPageURI >https://spraakbanken.gu.se/resurser/corpus</LandingPageURI>
      <Languages>
        <Language>swe</Language>
      </Languages>
      <AvailableDataViews ref="hits"/>
    </Resource>
  </Resources>
</EndpointDescription>
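
Before deploying, the minimal checklist above can be verified with a short script. The sketch mirrors only that checklist (Basic Search capability, HITS Data View, at least one resource with title, description, landing page and language); it is not a replacement for validation against the XSD, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

ED_NS = {"ed": "http://clarin.eu/fcs/endpoint-description"}
BASIC_SEARCH = "http://clarin.eu/fcs/capability/basic-search"
HITS_MIME = "application/x-clarin-fcs-hits+xml"


def check_minimal_endpoint_description(xml_text):
    """Return a list of problems; an empty list means the checklist is met."""
    root = ET.fromstring(xml_text)
    problems = []

    capabilities = [(c.text or "").strip()
                    for c in root.findall("ed:Capabilities/ed:Capability", ED_NS)]
    if BASIC_SEARCH not in capabilities:
        problems.append("basic-search capability missing")

    data_views = [(v.text or "").strip()
                  for v in root.findall(
                      "ed:SupportedDataViews/ed:SupportedDataView", ED_NS)]
    if HITS_MIME not in data_views:
        problems.append("HITS data view missing")

    resources = root.findall("ed:Resources/ed:Resource", ED_NS)
    if not resources:
        problems.append("no resource declared")
    for resource in resources:
        pid = resource.get("pid", "<no pid>")
        for path in ("ed:Title", "ed:Description",
                     "ed:LandingPageURI", "ed:Languages/ed:Language"):
            if resource.find(path, ED_NS) is None:
                problems.append(f"{pid}: {path} missing")
    return problems
```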

Minimal Configuration – SRU

  • SRU Server Configurations → Endpoint Configurations (sru-server-config.xml)

    • databaseInfo with general information about endpoint

    • default: indexInfo + schemaInfo

    • required: serverInfo > database (host and port by default)

  • Web server configuration

    • Optional adjustment of SRU / FCS parameters

    • Java: web.xml

    • Python: key-value dictionary

  • default: indexInfo + schemaInfo → copy/paste from template/existing endpoints, configuration remains largely the same here

FCS Endpoint Deployment (Java)

  • Using Maven (!) / pom.xml

    • <packaging>war</packaging>

    • Build Plugin:

      • org.apache.maven.plugins:maven-war-plugin[:2.6] (?)

      • org.apache.maven.plugins:maven-compiler-plugin

  • Create WAR artifact

    • mvn clean compile war:war

    • mvn clean package (also run tests etc.)

  • Deploy with Java Servlet Engine / HTTP server like Apache Tomcat / Eclipse Jetty / …

  • TODO: Check if maven-war-plugin is no longer necessary?

FCS Endpoint Deployment (Python)

Embedded FCS Endpoint (Python)

  • Tested only with Python as WSGI app in Flask

    → in kosh: PR, commit

  • Idea:

    • Create SRUServer with SRUSearchEngine (global)

    • Forward requests (filtered by path) to SRUServer

from flask import Flask, Response, request

# SRUServer, SRUQueryParserRegistry and the search engine base class come from
# the CLARIN SRU/FCS Python libraries; the methods below belong to a wrapper class.

def init(self, flask: Flask) -> None:
    self.server = self.build_fcs_server()
    # Flask URL rules must start with a leading slash
    flask.add_url_rule("/some-path/fcs", "some-path/fcs", self.handle)

def build_fcs_server(self) -> SRUServer:
    params = self.build_fcs_server_params()
    config = self.build_fcs_server_config(params)
    qpr_builder = SRUQueryParserRegistry.Builder(True)
    search_engine = KoshFCSEndpointSearchEngine(
        endpoint_description=self.build_fcs_endpointdescription(),
        # ... other parameters
    )
    search_engine.init(config, qpr_builder, params)
    return SRUServer(config, qpr_builder.build(), search_engine)

def handle(self) -> Response:
    LOGGER.debug("request: %s", request)  # Flask/Werkzeug Request
    LOGGER.debug("request?args: %s", request.args)
    response = Response()                 # Flask/Werkzeug Response
    self.server.handle_request(request, response)
    return response

FCS Endpoint – Extensibility

  • Supports own query languages, Data Views etc.

  • Example: LexFCS (FCS extension for lexical resources)

    → i.e. new query language and Data View

  • LexCQL - query language (CQL dialect)

    • SRUQueryParser (Java, Python), based on CQLQueryParser (Java, Python)

      LexCQLQueryParser with LexCQLQuery

    • SimpleEndpointSearchEngineBase.doInit() (Java, Python)

      queryParsersBuilder.register(new LexCQLQueryParser());

  • LexHITS - HITS Data View extension

    • in SRUSearchResultSet.writeRecord (Java, Python) appropriate XML result serialization

Deployment

  • Deployment instructions for FCS Endpoint Tester/Validator, FCS SRU Aggregator and FCS Korp Endpoint

FCS Endpoint Protocol Conformance Tester

  • NOTE: This is about the now legacy FCS endpoint tester, see Section: FCS Endpoint Validator for the updated and rewritten validator!

  • WebApp for testing the compliance with the FCS specification of endpoints

  • Installation uses SNAPSHOT versions of the SRU/FCS libraries, and normally reserved functions to validate the SRU/FCS protocols

FCS Endpoint Conformance Tester – Deployment

SRU/FCS SNAPSHOT libraries must be installed directly from Git

$ git clone https://github.com/clarin-eric/fcs-sru-client.git && cd fcs-sru-client
$ mvn install && cd ..
$ git clone https://github.com/clarin-eric/fcs-simple-client.git && cd fcs-simple-client
$ mvn install && cd ..

Build with Maven

$ git clone https://github.com/clarin-eric/fcs-endpoint-tester.git && cd fcs-endpoint-tester
$ mvn clean package

Deployment with Jetty on http://localhost:8080/

$ JETTY_VERSION="9.4.51.v20230217"
$ wget https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-distribution/${JETTY_VERSION}/jetty-distribution-${JETTY_VERSION}.zip && unzip jetty-distribution-${JETTY_VERSION}.zip && rm jetty-distribution-${JETTY_VERSION}.zip
$ cd jetty-distribution-${JETTY_VERSION}/
$ java -jar start.jar --add-to-start=http,deploy
$ cd webapps/ && cp ../../target/FCSEndpointTester-X.Y.Z-SNAPSHOT.war ROOT.war && cd ..
$ java -jar start.jar

FCS Endpoint Conformance Tester – Deployment (Docker)

Create Docker Image

$ git clone https://github.com/clarin-eric/fcs-endpoint-tester.git && cd fcs-endpoint-tester
$ docker build -t fcs-endpoint-tester .

Run Container

$ docker run --rm -it -p 8080:8080 fcs-endpoint-tester

FCS Endpoint Validator

  • This is an updated and completely rewritten SRU/FCS Endpoint Validator based on the FCS Endpoint Protocol Conformance Tester. In addition to more test cases, it allows inspecting HTTP requests/responses and storing validation results.

  • Web app for testing FCS endpoints for compliance with the SRU/FCS specification

FCS Endpoint Validator – Deployment

Build with Maven

$ git clone https://github.com/saw-leipzig/fcs-endpoint-validator.git && cd fcs-endpoint-validator
$ mvn clean package install

Deployment with Spring Boot on http://localhost:8080/ (might automatically open a new browser tab)

$ cd fcs-endpoint-validator-ui/
$ mvn spring-boot:run

FCS Endpoint Validator – Deployment (Docker)

Download sources:

$ git clone https://github.com/saw-leipzig/fcs-endpoint-validator.git && cd fcs-endpoint-validator

Create docker-compose.yml deployment description:

version: '3'

services:
  fcs-endpoint-validator:
    build:
      context: .
      dockerfile: fcs-endpoint-validator-ui/Dockerfile
    container_name: fcs-endpoint-validator
    ports:
      # default, public 8080 to docker container 8080
      - 8080:8080
    restart: unless-stopped

Run Docker-Compose deployment:

$ docker compose build
$ docker compose down -v
$ docker compose up -d
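
Because "docker compose up -d" returns before the web application has finished starting, a small readiness check can be handy. The following is only a sketch: the function name wait_until_up, the default timings and the URL in the comment are assumptions, not part of the validator.

```python
import http.client
import time
import urllib.error
import urllib.request

# Opener that ignores proxy settings, since the target is a local service.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_until_up(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `url` until the server sends any HTTP response or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with _opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            return True  # the server answered, even if with an error status
        except (OSError, http.client.HTTPException):
            pass  # refused, reset or timed out: not ready yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example (assumed URL from the compose file above):
#   wait_until_up("http://localhost:8080/")
```

Any HTTP response counts as "up" here, including error statuses, because the goal is only to detect that the container accepts connections.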

FCS SRU Aggregator

  • Primary FCS client application

  • Central search interface for users,

    sends FCS search queries to the distributed endpoints and “aggregates” their results

FCS SRU Aggregator – Deployment

Build application (native)

$ git clone https://github.com/clarin-eric/fcs-sru-aggregator.git && cd fcs-sru-aggregator
$ ./build.sh --jar

Configuration (endpoint sideloading + logging) in aggregator_devel.yml (aggregator.yml for production deployment)

  • aggregatorParams.additionalFCSEndpoints

  • logging.loggers

$ ./build.sh --run

FCS SRU Aggregator – Deployment (Docker)

Create Docker Image

$ git clone https://github.com/clarin-eric/fcs-sru-aggregator.git && cd fcs-sru-aggregator
$ docker build --tag=fcs-aggregator .

Run Docker Container

$ touch fcsAggregatorResources.json fcsAggregatorResources.backup.json
$ docker run -d --restart unless-stopped \
    -p 4019:4019 -p 5005:5005 \
    -v $(pwd)/aggregator.yml:/work/aggregator.yml:ro \
    -v $(pwd)/fcsAggregatorResources.json:/var/lib/aggregator/fcsAggregatorResources.json \
    -v $(pwd)/fcsAggregatorResources.backup.json:/var/lib/aggregator/fcsAggregatorResources.backup.json \
    fcs-aggregator

FCS Korp Endpoint

  • Reference endpoint for Korp corpus search engine

  • Example → the Korp API is publicly accessible, so no further configuration is required for testing

FCS Korp Endpoint – Deployment (Java)

Build Application

$ git clone https://github.com/clarin-eric/fcs-korp-endpoint.git && cd fcs-korp-endpoint
$ mvn clean compile war:war

Then deploy with Jetty, Tomcat, etc., analogous to the FCS Endpoint Tester

FCS Korp Endpoint – Deployment (Python)

Prepare Deployment

$ git clone https://github.com/Querela/fcs-korp-endpoint-python.git && cd fcs-korp-endpoint-python
$ python3 -m venv venv && source venv/bin/activate
$ python3 -m pip install -e .

Test Deployment (http://localhost:8080)

$ python3 -m korp_endpoint

Production deployment with Docker (http://localhost:5000)

$ docker build --progress=plain -t korpy .
$ docker run --rm -it -p 5000:5000 korpy
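
Once an endpoint is running, it can be smoke-tested with plain SRU requests. The sketch below only builds the request URLs; the helper name sru_url, the base URL (root path on port 8080) and the query term are assumptions. The parameters operation, query and queryType are standard SRU 2.0 parameters, and x-fcs-endpoint-description=true asks an FCS endpoint to include its Endpoint Description in the explain response.

```python
from urllib.parse import urlencode


def sru_url(base_url: str, operation: str, **params: str) -> str:
    """Build an SRU request URL from a base URL, an operation and extra parameters."""
    query = {"operation": operation, **params}
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(query)


BASE = "http://localhost:8080/"  # assumed endpoint base URL

# Explain request, including the FCS Endpoint Description
explain_url = sru_url(BASE, "explain", **{"x-fcs-endpoint-description": "true"})
# Simple full-text search with the basic CQL query type
search_url = sru_url(BASE, "searchRetrieve", query="Haus", queryType="cql")

print(explain_url)
print(search_url)
```

Opening either URL in a browser or with curl should return an XML response if the endpoint is reachable at the assumed base URL.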

Deployment Notes

  • When using Docker and localhost, network configurations may need to be adjusted so that the Docker container has access to the host

    • host.docker.internal
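
As an illustration of the note above, the following sketch rewrites endpoint URLs that point at the local machine so that they resolve from inside a container. The function name for_container is hypothetical; note that credentials embedded in the URL are not preserved by this simple version.

```python
from urllib.parse import urlsplit, urlunsplit

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def for_container(url: str, host_alias: str = "host.docker.internal") -> str:
    """Replace a local host name in `url` with the Docker host alias."""
    parts = urlsplit(url)
    if parts.hostname not in LOCAL_HOSTS:
        return url  # remote endpoints are left untouched
    # Keep the port if one was given; user:password@ prefixes are dropped.
    netloc = host_alias if parts.port is None else f"{host_alias}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(for_container("http://localhost:8080/sru?x=1"))
```

On Linux, the alias is typically not defined by default; it can be added when starting the container with --add-host=host.docker.internal:host-gateway.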

Resources

Publications

  • As part of Text+ → see Zotero.org tagged “FCS”

  • Listing in Text+ Awesome FCS list

  • In the context of CLARIN? → no dedicated bibliography

    • CLARIN Federated Content Search (CLARIN-FCS) – Core Specification, 2014, Oliver Schonefeld et al.

    • Federated Search: Towards a Common Search Infrastructure, 2012, Herman Stehouwer et al.

    • Several workshops


1. OASIS: Organization for the Advancement of Structured Information Standards