FCS Endpoint Development Tutorial

Introducing the Federated Content Search (FCS)

Description, History & Glossary

What is the FCS?

“Federated Content Search” at CLARIN

In short: Content Search over Distributed Resources

Also: Federated “Corpus Query Platform”
Search for patterns in distributed text collections
No central index!
Text resources include annotated corpora, full-texts etc.
FCS = interface specification, search infrastructure and software ecosystem
Usage of established standards and extensibility!

What is included in the FCS?

Interface Specification

Description of search protocol (query languages, formats and communication channels)
“for homogeneous access to heterogeneous search engines”
RESTful protocol

Search Infrastructure in CLARIN and Text+

Central client (search result “Aggregator” and web portal)
Decentralized endpoints at the data centers (local search engines on resources)

Software Ecosystem primarily in Java

Libraries (Java, Python, …)
Tools (Validator, Aggregator, Registry)

Requirements for participation in the FCS

(Own) text resources
“Search engine” on those text resources
- Minimum: full-text search
Deployment of publicly accessible FCS endpoint(s)

Pros and Cons for the FCS (as Infrastructure)

Pros

Integration of many resources, linking and comparison of results
Integration with other tools (Weblicht, Registry/VLO, Switchboard, …)
Same queries, formats, result presentation
No duplicate data storage, no inconsistency

Cons

No control over resources
No deterministic results (e.g. links for publications)
No global ranking of results possible

Pros and Cons for FCS Endpoints (Operators)

Pros

Control over resources and search (ranking, fuzzy, …)
No duplication of data due to central index
Increased visibility in a larger resource catalog

Cons

Deployment of (additional) endpoint necessary

Comparison of FCS with Central Index

Aspect	FCS	Central Index
Data	➕ At the endpoints	➖ Duplicate data storage, possible inconsistency (age, updates); legally no transfer may be possible
Updates to Data	➕ Endpoints can react quickly	➖ Difficult, e.g. removal of resources in the event of legal problems; updates entail longer delays, if at all possible
Global Ranking	➖ Very difficult/impossible	➕ Quite possible (?), probably implicit assumption and normalization of data for indexing
Faceted Search	➖ Difficult (e.g. via external metadata; not explicitly intended)	➕ Indexing allows clustering/classification according to topics and categories

History

~ 2011 Started as Working Group in CLARIN
May 2011 EDC/FCS Workshop
~ 2011–2013 Initial version, now named FCS “Legacy”
- SRU Scan for resources, BASIC Search (CQL/full-text), KWIC
April 2013 FCS Workshop
~ 2013/2014 Code and Spec for FCS Core 1.0
- fcs-simple-endpoint:1.0.0, sru-server:1.5.0
- BASIC Hits Data View, SRU Scan operation not used anymore

much has disappeared into the annals of history …
https://github.com/clarin-eric/fcs-misc/tree/main/historical/documents
https://trac.clarin.eu/wiki/FederatedSearch?version=1
https://trac.clarin.eu/browser/FCSSimpleEndpoint/tags
https://trac.clarin.eu/wiki/FCS/Specification?action=history
https://trac.clarin.eu/wiki/Taskforces/FCS/FCS-Specification-Draft?action=history
https://www.clarin.eu/event/2013/federated-content-search-workshop
EDC: European Demonstrator Case

~ 2015/2016 Starting work on and Code for FCS Core 2.0
- fcs-simple-endpoint:1.3.0, sru-server:1.8.0
- Advanced Data Views (FCS-QL), …
June 2017 Official release of FCS Core 2.0 Spec
2022 FCS is focus in Text+ (Findability)
2023 New FCS maintainer in CLARIN
- Migration of Source Code to GitHub.com, updated documentation
- Python FCS endpoint libraries
- Updated libraries & tools
- Prototypes for LexFCS extension
2024
- Experiments with Entity Search (extension)
- Rewrite of FCS Endpoint Validator

FCS Architecture

Communication Protocol

SRU (Search/Retrieval via URL) / OASIS searchRetrieve

Standardized by Library of Congress (LoC) / OASIS
- RESTful
- Explain: Listing of resources
  - Languages, annotations, supported data views and formats etc.
- SearchRetrieve: Search request
Data as XML
Extensions to the protocol explicitly allowed

Basic Assumption on the Data Structure

different (optional) annotation layers

Full-text	The	cyclists	are	fast
Part of Speech	DET	NOUN	VERB	ADJ
Lemmatisation	The	cyclist	is	fast
Phonetic Transcription	…	…	…	…
Orthographic Transcription	…	…	…	…
[…]
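
The layered view above can be pictured as parallel lists with one entry per token. A minimal sketch, assuming a plain dict of lists as a stand-in for whatever a local search engine actually stores (the names `layers`, `token_at` and `positions_where` are illustrative only):

```python
# Hypothetical in-memory representation: one list per annotation layer,
# all lists aligned by token position.
layers = {
    "text":  ["The", "cyclists", "are", "fast"],
    "pos":   ["DET", "NOUN", "VERB", "ADJ"],
    "lemma": ["The", "cyclist", "is", "fast"],
}


def token_at(index):
    """Return all annotations of the token at the given position."""
    return {name: values[index] for name, values in layers.items()}


def positions_where(layer, value):
    """Return the positions at which a layer has the given value."""
    return [i for i, v in enumerate(layers[layer]) if v == value]


print(token_at(1))
print(positions_where("pos", "NOUN"))
```

Layers are optional, so a resource without annotations would simply carry the "text" list only.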

Explain: Resource Discovery

Explain: Resource Structure

Source: https://spraakbanken.gu.se/ws/fcs/2.0/endpoint/korp/sru?operation=explain&version=1.2&x-fcs-endpoint-description=true

Query Language FCS-QL

Based on CQP
Supports various annotation layers

Visualization of Results

Current state of the FCS

Current version of the specification: FCS Core 2.0
Poster at Bazaar @ CLARIN2023 on the current status
😎 “Awesome FCS” List: github.com/clarin-eric/awesome-fcs with relevant links to specs, tools, libraries, implementations and much more
- Additions by Text+ (e.g. on LexFCS/LexCQL/Forks/Software): gitlab.gwdg.de/textplus/ag-fcs-documents/-/blob/main/awesome-fcs.md
CLARIN specifications: github.com/clarin-eric/fcs-misc
Small ecosystem (Code on Github/Gitlab)
- Software libraries (SRU/FCS, endpoint + client, Java/Python)
- Aggregator (Code: Github, Text+ Fork)
- Online Validator for endpoints (fcsvalidator, Code: Github (old), Github (new))
Endpoint Registry: centres.clarin.eu/fcs

Current Work

Lexical Resources extension
- First specification and implementation in Text+
- Official extension of CLARIN → ~2024 Working Plan

AAI integration
- Specification and implementation
- Goal: Support access-restricted resources
- Securing the aggregator via Shibboleth → Passing on AAI attributes to endpoints
- Preliminary work from CLARIAH-DE, part of the Text+ work plan (IDS Mannheim, Uni/SAW Leipzig, preliminary work BBAW)

Syntactic Search
Entity Search
Optional metadata for each result

Current status regarding Lexical Resources

CLARIN-EU Taskforce
CLARIN ERIC working plan: “extending the protocol to cover additional data types (e.g. lexica) will be explored”
- on the CLARIN 2024 Working Plan
Interest expressed from various countries
Preliminary work: “RABA” (Estonia): e.g. “Eesti Wordnet”
First specification and implementation in Text+
- Specification on Zenodo: zenodo.org/records/7849754
- Presentation at eLex 2023: “A Federated Search and Retrieval Platform for Lexical Resources in Text+ and CLARIN”
- Aggregator: fcs.text-plus.org/?queryType=lex

Current status of participants

CLARIN (contentsearch.clarin.eu, Registry)

209 Resources (94 in Advanced)

in 61 Languages

from 20 Institutions in 12 Countries

Text+ (fcs.text-plus.org)

53 Resources (17 in Advanced, 30 in Lexical)

in 6 Languages

from 9 Institutions in Germany

Integration in FCS Infrastructure

CLARIN

Alpha/Beta using Side-Loading in Aggregator
Stable/Long-Term: Entry in Centre Registry
- CLARIN account + registration form as a Centre
- Including monitoring etc.

Text+

Side-Loading in Aggregator
WIP: Registry (index of endpoints)

Alternative Ways of using FCS

Development of an alternative aggregator frontend as Web Component
- Code: Vue.js Store + Vuetify Component (Dialog); Demo
- Use of the Aggregator API
- Restriction to subset of resources, e.g. for integration on own website
- Faceting, alternative visualization

Bootstrapping Endpoint Development

Java: Maven Archetype github.com/clarin-eric/fcs-endpoint-archetype
Java & Python (reference implementation Korp):
- github.com/clarin-eric/fcs-korp-endpoint
- github.com/Querela/fcs-korp-endpoint-python
😎 “Awesome FCS” List: github.com/clarin-eric/awesome-fcs
- List of reference implementations, endpoints, query parsers
- Code for FCS SRU Aggregator and SRU/FCS Validator

Guide to Endpoint Development

Important preliminary questions
Existing implementations, resources for new development
Prerequisites

Development Decisions

❓ Can I host the endpoint myself?

❗ No → HelpDesk: CLARIN, Text+

❓ What type of data do I have?

❗ Raw text, Vertical/CONLL, TEI, …

❓ Which search engine do I use / can I use?

❗ KorAP, Korp/CWB, Lucene/Solr/ElasticSearch, BlackLab, (No)SketchEngine, …

❓ Customization or new development?

❗ List of existing endpoint implementations (Awesome List)

❓ Programming language?

❗ Java, Python, (PHP, XQuery)

❓ In-house development: Use of the reference libraries (Java, Python)

❗ Maven Archetype, Korp

❗ SRU + FCS specifications …

Endpoint Implementations

Korp FCS 2.0 - reference implementation, Korp corpus search
CQP/SRU bridge - Corpus Workbench (CWB)
KonText, fcs-noske-endpoint - (No)SketchEngine (CONLL/Vertical)
oclcsrw - SRW/SRU server for DSpace, Lucene and/or Pears/Newton
corpus_shell, SADE - MySQL PHP/DDC Perl, eXist/XQuery
arche-fcs - ARCHE Suite, PHP
Blacklab / MTAS - corpus search engines using Lucene/Solr
KorapSRU - KorAP (IDS)

Sources: clarin, awesome-fcs

New Endpoint Development

Customization of reference implementation (Korp)
- Java: github.com/clarin-eric/fcs-korp-endpoint
- Python: github.com/Querela/fcs-korp-endpoint-python
Development using CLARIN SRU/FCS libraries
- Java: github.com/clarin-eric/fcs-endpoint-archetype
- Docs:
  - Java SRU: clarin-eric.github.io/fcs-sru-server/apidocs/index.html
  - Java FCS: clarin-eric.github.io/fcs-simple-endpoint/apidocs/index.html
  - Python SRU: fcs-sru-server-python.readthedocs.io
  - Python FCS: fcs-simple-endpoint-python.readthedocs.io
“New” development specifications (for other languages)
- SRU: OASIS SRU Overview, Library of Congress SRU
- FCS: github.com/clarin-eric/fcs-misc → “FCS Core 2.0”
Awesome List: github.com/clarin-eric/awesome-fcs

Prerequisite for local search engine

❗ Full-text search

❓ With Hit markers

❓ Corpus search (segmented text with annotations)

❕ Pagination (total number of hits)

❗ Resource PID

❓ Linking to result pages

Fundamentals

SRU (Overview, APD/Models, Request Parameter, Diagnostics, …)
FCS (Discovery, Endpoint Description, Search, SRU Parameter, Diagnostics)
FCS Notes (Versions, Compatibility, Aggregator)

Disclaimer

Main focus on:

Version: FCS Core 2.0; maximum compatibility with FCS Core 1.0
SRU Server, FCS endpoint; not FCS client application development
Using the reference libraries

→ Java and Python
Possible (re-)use of existing endpoints

No:

Working through the specification; only the essential information
New or redevelopment of SRU/FCS protocols, libraries etc.

(e.g. in other languages)

SRU – History

SRU: Search/Retrieve via URL → LOC

Originally developed by the Library of Congress (LOC)

2004: SRU 1.1 - LOC

2007: SRU 1.2 - LOC
As of SRU 2.0 standardized by OASIS as “searchRetrieve Version 1.0 OASIS Standard”

2013: SRU 2.0 - LOC, OASIS (OASIS Announcement)

Extension of SRU 1.2 → Differences to SRU 1.2 (LOC)

searchRetrieve Version 1.0 – OASIS Standard

Part 0. Overview Version 1.0
Part 1. Abstract Protocol Definition Version 1.0
Part 2. SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0
Part 3. SRU searchRetrieve Operation: APD Binding for SRU 2.0 Version 1.0
Part 4. ~~APD Binding for OpenSearch 1.0 version 1.0~~
Part 5. CQL: The Contextual Query Language version 1.0
Part 6. SRU Scan Operation version 1.0
Part 7. SRU Explain Operation version 1.0

grayed out: not relevant for us
crossed out: plays no role at all for the FCS

searchRetrieve: Part 0. – Overview Version 1.0

SRU (SRU: Search/Retrieve via URL) is a web service protocol supported over both SOAP and REST for client-server based search. SRU1.x was developed as a web service replacement for the NISO Z39.50 protocol. SRU2.0 is a revision to SRU which as well as including many enhancements to SRU1.2 was developed alongside the APD.

For the SRU protocol model, three operations are defined as part of its Processing Model:

SearchRetrieve Operation. The actual SearchRetrieve operation defined by the SRU protocol; A SearchRetrieve operation consists of a SearchRetrieve request from client to server followed by a SearchRetrieve response from server to client.
Scan Operation. Similar to SRU, the Scan protocol defines a request message and a response message for iterating through available search terms. A Scan operation consists of a Scan request followed by a Scan response.
Explain Operation. Every SRU or scan server provides an associated Explain document as part of its Description and Discovery Model, providing information about the server’s capabilities. A client may retrieve this document and use the information to self-configure and provide an appropriate interface to the user. When a client retrieves an Explain document, this constitutes an Explain operation.

http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.html
SRW = search/retrieve for the web

searchRetrieve – APD and Bindings

Abstract Protocol Definition (APD) for “searchRetrieve operation”
- Model for SearchRetrieve Operation
- Describes Capabilities and General Characteristics of a Server or Search Engine, as well as how access should take place
- Defines abstract Request parameters and Response elements
Binding
- Describes corresponding names of the parameters and elements
- static (for human), dynamic (for machine), …
- Bindings: SRU 1.2, SRU 2.0, (OpenSearch)
- Examples:
  - “startPosition” (APD) → “startRecord” (SRU 2.0)
  - “recordPacking” (SRU 1.2) → “recordXMLEscaping” (SRU 2.0)

http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.html#recordPacking
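
A binding is, in effect, a renaming table between abstract and concrete names. A minimal sketch of that idea, using only the two name pairs given as examples above (a real binding covers many more names; the dictionaries and the function `rename_params` are hypothetical helpers, not part of any library):

```python
# Name mappings taken from the two examples above.
APD_TO_SRU20 = {"startPosition": "startRecord"}
SRU12_TO_SRU20 = {"recordPacking": "recordXMLEscaping"}


def rename_params(params, mapping):
    """Return a copy of params with keys renamed according to mapping."""
    return {mapping.get(name, name): value for name, value in params.items()}


legacy_request = {"query": "cat", "recordPacking": "string"}
print(rename_params(legacy_request, SRU12_TO_SRU20))
```

Renaming alone is only safe where the semantics are unchanged, which is exactly why recordPacking needs care: the name still exists in SRU 2.0, but with a different meaning.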

searchRetrieve – APD Abstract Models

Data Model
Description of the data on which the search is to be executed

Query Model
Description of the construction of search queries

Processing Model
Description of how query is sent from client to server

Result Set Model
Structure of the results of a search

Diagnostics Model
Description of how errors are communicated from the server to the client

Description and Discovery Model
Description, for the discovery of the “Search Service”, self-description of the functionality of the service

http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.html#_Toc312151029

SRU 2.0 – Operation Model

SRU Request (Client → Server) with Response (Server → Client)
Operations
- SearchRetrieve
- Scan
- Explain

SRU 2.0 – Data Model

Server = Database for Client for search/retrieval
Database = Collection of Units of Data → Abstract Record
Abstract Records (or Response Records) in one/multiple formats by server
Format (or Item Type) = Record Schema

SRU 2.0 – Protocol Model

HTTP GET
- Parameters encoded as “key=value”
- UTF-8
- %-Escaping
- Separation at “?”, “&”, “=”
HTTP POST
- application/x-www-form-urlencoded
- No character encoding necessary
- No length restriction
HTTP SOAP (?)
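
The GET encoding rules can be tried out with the Python standard library. A minimal sketch, assuming a placeholder base URL (`endpoint.example.org` is not a real endpoint) and two hypothetical helper functions:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

BASE_URL = "https://endpoint.example.org/sru"  # placeholder, not a real endpoint


def build_get_url(base_url, params):
    """Encode parameters as key=value pairs, UTF-8 encoded and %-escaped."""
    return base_url + "?" + urlencode(params, encoding="utf-8")


def read_params(url):
    """Decode the query string of a request URL back into a dict."""
    return dict(parse_qsl(urlsplit(url).query, encoding="utf-8"))


url = build_get_url(BASE_URL, {"query": '"användning"', "maximumRecords": 10})
print(url)
print(read_params(url))
```

The string produced by `urlencode` is also what an `application/x-www-form-urlencoded` POST body would contain.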

SRU 2.0 – Processing Model

“Request processing on the server”
Request
- Number of records
- Identifier for Record Schema (→ Records in Response)
- Identifier for Response Schema (→ whole Response)
Response
- Records in Result Set
- Diagnostic Information
- Result Set Identifier for requests for further results

SRU 2.0 – Query Model

Any “appropriate query language” can be used
Mandatory support of

“Contextual Query Language” (CQL)

SRU 2.0 – Parameter Model

Use of Parameters, some predefined by SRU 2.0
Parameters not defined in the protocol are also permitted
Parameter “query”
- included in every query in some manner (“query” or by parameters not defined in the protocol)
- Query with “queryType” (default “cql”)

SRU 2.0 – Result Set Model

Logical model → “Result Sets” are not mandatory
Query → Selection of suitable Records
- Ordered list, non-modifiable set after creation
- Sorting/order determined by server
for Client:
- Set of abstract Records, counting starts with 1
- Each record can be requested in its own format
- Individual records can “disappear”, no reordering in the Result Set by the Server, but Diagnostic to inform

SRU 2.0 – Diagnostic Model

fatal
- Execution of the query cannot be completed
- e.g. invalid query
non-fatal
- Processing impaired, but request can be completed
- e.g. individual records are not available in the requested schema, server only sends the available ones and informs about the rest
- surrogate
  - For single Records
- non-surrogate
  - All records are available, but something went wrong, e.g. sorting
  - Or simply a warning

SRU 2.0 – Explain Model

Must be available for HTTP GET via the base URL of the SRU server
→ Server Capabilities
In the client for self-configuration and to provide the corresponding user interface
Details on supported Query Types, CQL Context Sets, Diagnostic Sets, Records Schemas, Sorting options, defaults, …

SRU 2.0 – Serialization Model

No restriction on the serialization of responses

(for the entire message or single records)
Non-XML serialization is allowed

searchRetrieve 2.0 – Request Parameter

All parameters are optional, non-repeatable
query, startRecord, maximumRecords, recordXMLEscaping, recordSchema, resultSetTTL, stylesheet; Extension parameters
New in 2.0: queryType, sortKeys, renderedBy, httpAccept, responseType, recordPacking; Facet Parameters

Spec: “Actual Request Parameters for this Binding”

searchRetrieve 2.0 – Response Elements

All elements are optional, non-repeatable by default
numberOfRecords, resultSetId, records, nextRecordPosition, echoedSearchRetrieveRequest, diagnostics, extraResponseDataⓇ
New in 2.0: resultSetTTL, resultCountPrecisionⓇ, facetedResultsⓇ, searchResultAnalysisⓇ

(Ⓡ = repeatable)

Spec: “Actual Response Elements for this Binding”

http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.html#_Toc324162438

searchRetrieve 2.0 – Query

query (Parameter)
- Query
- Mandatory if no specification of queryType
queryType (Parameter, SRU 2.0)
- Optional, by default “cql”
- Query Types must be listed in the Explain, with URL for definition and usage abbreviation
- Reserved
  - cql
  - searchTerms (processing is left to the server, < SRU 2.0)

searchRetrieve 2.0 – Query (Examples)

spraakbanken.gu.se/…/sru?query=cat
(default, FCS 2.0, SRU 2.0)

spraakbanken.gu.se/…/sru?operation=searchRetrieve&version=1.2&query=cat
(FCS 1.0, SRU 1.2)

spraakbanken.gu.se/…/sru?operation=searchRetrieve&queryType=cql&query=%22anv%C3%A4ndning%22
(FCS 2.0, SRU 2.0)

spraakbanken.gu.se/…/sru?operation=searchRetrieve&queryType=fcs&query=%5bword%3d%22anv%C3%A4ndning%22%5d&x-cmd-resource-info=true
(FCS 2.0 with FCS-QL Query)
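
The FCS-QL example above contains brackets, quotes and a non-ASCII letter, all of which must be %-escaped in the URL. A minimal sketch of building such a request, assuming a placeholder base URL and a hypothetical helper `search_url`:

```python
from urllib.parse import quote

BASE_URL = "https://endpoint.example.org/sru"  # placeholder, not a real endpoint


def search_url(base_url, query, query_type=None):
    """Build a searchRetrieve URL; queryType is left out when not given,
    so that the server falls back to its default ("cql")."""
    parts = []
    if query_type is not None:
        parts.append("queryType=" + quote(query_type, safe=""))
    parts.append("query=" + quote(query, safe=""))
    return base_url + "?" + "&".join(parts)


print(search_url(BASE_URL, "cat"))
print(search_url(BASE_URL, '[word="användning"]', query_type="fcs"))
```

Python emits uppercase hex digits (`%5B`) where the example URL uses lowercase (`%5b`); both forms are equivalent.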

searchRetrieve 2.0 – Pagination

Query for a range of results: starting at startRecord, with at most maximumRecords records
startRecord (Parameter)
- Optional, positive integer, starting with 1
maximumRecords (Parameter)
- Optional, non-negative integer
- Server selects default if not specified
- Server can respond with fewer records, never more
Response with total number (numberOfRecords) of records in the Result Set, with offset (nextRecordPosition) to next results
numberOfRecords (Element)
- Number of results in the Result Set
- If query fails, it must be “0”
nextRecordPosition (Element)
- Position of the next record, if the last record in the response is not the last in the result set
- If no further records, then this element must not appear
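
The pagination rules boil down to a little arithmetic on 1-based positions. A minimal sketch, assuming a hypothetical function `page` and arbitrary server-side values for the default page size (250) and the upper limit (1000):

```python
def page(total, start_record=1, maximum_records=None,
         default_records=250, server_limit=1000):
    """Return (number_of_returned_records, next_record_position).

    next_record_position is None when no further records exist,
    mirroring the rule that the element must not appear in that case.
    """
    if start_record < 1:
        raise ValueError("startRecord must be a positive integer")
    if maximum_records is not None and maximum_records < 0:
        raise ValueError("maximumRecords must be a non-negative integer")
    if total > 0 and start_record > total:
        raise IndexError("First record position out of range")
    wanted = default_records if maximum_records is None else maximum_records
    wanted = min(wanted, server_limit)  # server may return fewer, never more
    count = max(0, min(wanted, total - start_record + 1))
    next_position = start_record + count
    return count, (next_position if next_position <= total else None)


print(page(9220))
print(page(9220, start_record=300, maximum_records=10))
```

A real endpoint would report the out-of-range case as an SRU diagnostic in the response instead of raising an exception.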

searchRetrieve 2.0 – Result Set

resultSetId (Element)
- Optional, identifier for the Result Set, for referencing in the subsequent requests
resultSetTTL (Parameter / Element, Element in SRU 2.0 only)
- Optional, in seconds
- In request from Client: time after which the Result Set will no longer be needed
- In response from Server, how long Result Set is available (“good-faith estimate”, → can be longer or shorter)
resultCountPrecision (Element, SRU 2.0)
- URI: “info:srw/vocabulary/resultCountPrecision/1/…”
- exact / unknown / estimate / maximum / minimum / current

searchRetrieve 2.0 – Pagination (Cont.)

spraakbanken.gu.se/…/sru?query=cat
→ 9220 results, next results starting from 251

spraakbanken.gu.se/…/sru?query=cat&startRecord=300&maximumRecords=10
→ More from 310

spraakbanken.gu.se/…/sru?query=cat&startRecord=10000&maximumRecords=10
→ Error, because “out of range”

spraakbanken.gu.se/…/sru?query=catsss
→ No results

spraakbanken.gu.se/…/sru?query=cat&maximumRecords=100000
→ Restricted to 1000 Records

searchRetrieve 2.0 – Serialization

recordXMLEscaping (Parameter, SRU 2.0)
- Escaping of Records that are serialized as XML: “string” escapes the Record as a string (“<”, “>”, “&”); “xml” (default) embeds the Record directly in the Response, e.g. for Stylesheets
recordPacking (Parameter, SRU 2.0)
- In SRU 1.2 used to have the semantic of recordXMLEscaping
- “packed” (default), Server should deliver Records with requested schema; “unpacked”, Server can determine the location of the application data in the Records itself (?)
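
The difference between the two recordXMLEscaping values can be sketched in a few lines. This is an illustration only, with a hypothetical helper `embed_record` and a made-up record; it is not how the reference libraries implement it:

```python
from xml.sax.saxutils import escape


def embed_record(record_xml, record_xml_escaping="xml"):
    """Return the record as it would be placed into the response."""
    if record_xml_escaping == "xml":
        return record_xml           # embedded directly as XML
    if record_xml_escaping == "string":
        return escape(record_xml)   # "<", ">" and "&" are escaped
    raise ValueError("Unsupported recordXMLEscaping value: " + record_xml_escaping)


record = "<hit>Tom &amp; Jerry</hit>"  # made-up record content
print(embed_record(record, "xml"))
print(embed_record(record, "string"))
```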

httpAccept (Parameter, SRU 2.0)
- Schema for Response, default is “application/sru+xml”
responseType (Parameter)
- Schema for Response (in combination with httpAccept parameter)
recordSchema (Parameter)
- Schema of the Records in Response, e.g. “http://clarin.eu/fcs/resource”
- Identifier for schema from Explain Response
records (Element)
- Contains Records / Surrogate Diagnostics
- According to default Schema a list of “<record>” elements

recordSchema with http://clarin.eu/fcs/resource can be used for multiplexing if several SRU functionalities are offered via one endpoint, e.g. also DFG Viewer or similar.

stylesheet (Parameter)
- URL to stylesheet, for display to the user
renderedBy (Parameter, SRU 2.0)
- Where the stylesheet for the Response is rendered
- “client” (default), URL of stylesheet parameter is simply echoed → “thin client” (in Web Browser)
- “server”, should transform default SRU response with stylesheet (e.g. for httpAccept with HTML format)
spraakbanken.gu.se/…/sru?query=cat&recordXMLEscaping=string
→ Possible serialization error in Java library

spraakbanken.gu.se/…/sru?operation=searchRetrieve&version=1.2&query=cat&version=1.2&recordPacking=string
(FCS 1.0, SRU 1.2, like recordXMLEscaping)

spraakbanken.gu.se/…/sru?query=cat&recordPacking=unpacked
→ No noticeable change here

…

searchRetrieve 2.0 – Unsupported Parameters

Sorting (sortKeys) and Faceting not supported

SRU 2.0 – Extensions

Extensions possible in
- Request via Extension Parameter (prefixed with “x-” and namespace identifier, e.g. “x-fcs-”)
- Response in the “<extraResponseData>” Element
Response with extraResponseData, only if requested in Request with corresponding parameter, never voluntary
- Server can ignore the request, no obligation
Unknown extension parameters are to be ignored
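
On the server side, the rules above mean that extension parameters are separated from the standard ones and that unknown extensions are silently dropped. A minimal sketch, assuming a hypothetical helper `split_params`; the set of known extensions and the parameter `x-unknown-flag` are made up for illustration:

```python
# Extension parameters this sketch's server understands (an assumption).
KNOWN_EXTENSIONS = {"x-fcs-endpoint-description"}


def split_params(params):
    """Separate SRU parameters from extension parameters (prefix "x-")."""
    standard, extensions, ignored = {}, {}, []
    for name, value in params.items():
        if not name.startswith("x-"):
            standard[name] = value
        elif name in KNOWN_EXTENSIONS:
            extensions[name] = value
        else:
            ignored.append(name)  # unknown extensions are ignored, not an error
    return standard, extensions, ignored


request = {
    "operation": "explain",
    "x-fcs-endpoint-description": "true",
    "x-unknown-flag": "1",  # hypothetical parameter
}
print(split_params(request))
```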

http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.html#Extensions

SRU 2.0 – Backwards Compatibility

Parameters “operation” and “version” only in SRU 1.1/SRU 1.2, removed in SRU 2.0 → Assumption of a separate endpoint for each SRU version
Heuristic for detecting the operation (no “operation” parameter in SRU 2.0)
- searchRetrieve = query or queryType parameter
- scan = scanClause parameter
- explain = otherwise
Interoperability with older versions:
- Use of operation/version parameters → SRU < 2.0
- Caution with parameters with changed semantics, especially recordPacking

http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.html#interop
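
The heuristic can be written down as a small decision function. A minimal sketch with a hypothetical helper `detect_request`; treating every request without query, queryType or scanClause as explain is an assumption of this sketch:

```python
def detect_request(params):
    """Guess operation and protocol generation from the request parameters."""
    # operation/version only exist in SRU 1.1/1.2
    generation = "SRU 1.x" if ("operation" in params or "version" in params) else "SRU 2.0"
    if "operation" in params:
        operation = params["operation"]
    elif "query" in params or "queryType" in params:
        operation = "searchRetrieve"
    elif "scanClause" in params:
        operation = "scan"
    else:
        operation = "explain"
    return operation, generation


print(detect_request({"query": "cat"}))
print(detect_request({"operation": "searchRetrieve", "version": "1.2", "query": "cat"}))
print(detect_request({}))
```

Once the generation is known, a server can decide how to interpret parameters such as recordPacking.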

SRU 2.0 – Diagnostics

“Error handling”
Difference between (non-)fatal, (non-)surrogate → SRU 2.0 – Diagnostic Model
Schema: info:srw/schema/1/diagnostics-v1.1

Prefix: info:srw/diagnostic/1/
- uri (ID), details (additional information, depending on Diagnostic), message
Information:
- General information and notes (LOC, OASIS SRU 2.0)
- List of Diagnostics (LOC, OASIS SRU 2.0)
Categories: General (1-9), CQL (10-49), Result Sets (50-60), Records (61-74), Sorting (80-96), Explain (100-102), Stylesheets (110-111), Scan (120-121)
Not limited to this list only, custom diagnostics possible

No.	Description	Details
1	General system error	Debugging information (traceback)
2	System temporarily unavailable
3	Authentication error
4	Unsupported operation
5	Unsupported version	Highest version supported
6	Unsupported parameter value	Name of parameter
7	Mandatory parameter not supplied	Name of missing parameter
8	Unsupported parameter	Name of the unsupported parameter
9	Unsupported combination of parameters
10	Query syntax error
23	Too many characters in term	Length of longest term
26	Non special character escaped in term	Character incorrectly escaped
35	Term contains only stopwords	Value
37	Unsupported boolean operator	Value
38	Too many boolean operators in query	Maximum number supported
47	Cannot process query; reason unknown
48	Query feature unsupported	Feature
60	Result set not created: too many matching records	Maximum number
61	First record position out of range
64	Record temporarily unavailable
65	Record does not exist
66	Unknown schema for retrieval	Schema URI or short name
67	Record not available in this schema	Schema URI or short name
68	Not authorized to send record
69	Not authorized to send record in this schema
70	Record too large to send	Maximum record size
71	Unsupported recordXMLEscaping value
80	Sort not supported
110	Stylesheets not supported
111	Unsupported stylesheet	URL of stylesheet
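
A diagnostic consists of a URI built from the prefix and the number, optional details and a message. A minimal sketch of serializing one with the standard library, assuming the SRU 1.x diagnostic namespace and a hypothetical helper `make_diagnostic`; the stylesheet URL is a placeholder:

```python
import xml.etree.ElementTree as ET

# Namespace of SRU 1.x diagnostics (an assumption for this sketch;
# check the specification for the version you target).
DIAG_NS = "http://www.loc.gov/zing/srw/diagnostic/"
DIAG_PREFIX = "info:srw/diagnostic/1/"

ET.register_namespace("diag", DIAG_NS)


def make_diagnostic(number, message, details=None):
    """Build a <diag:diagnostic> element with uri, details and message."""
    diagnostic = ET.Element("{%s}diagnostic" % DIAG_NS)
    ET.SubElement(diagnostic, "{%s}uri" % DIAG_NS).text = DIAG_PREFIX + str(number)
    if details is not None:
        ET.SubElement(diagnostic, "{%s}details" % DIAG_NS).text = details
    ET.SubElement(diagnostic, "{%s}message" % DIAG_NS).text = message
    return diagnostic


diag = make_diagnostic(111, "Unsupported stylesheet", "https://example.org/style.xsl")
print(ET.tostring(diag, encoding="unicode"))
```

The reference libraries already provide their own means of reporting diagnostics, so this only illustrates the structure of the element.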

FCS Interface Specification

FCS = Description of capabilities, extensions and operations

→ Use of SRU/CQL and extensions according to SRU
Interface specification = formats and transport protocol
- Endpoint = bridge between client (FCS formats) and local search engine
- Client = user interface, query input and result presentation
Discovery and Search mechanism

FCS – Discovery

SRU Explain
- Help and information for the client on accessing, requesting and processing results from the server
Information about endpoint
- Capabilities: Basic Search, Advanced Search?
- Resources for search
→ Endpoint Description (XML) via explain SRU Operation

FCS 2.0 §3 CLARIN-FCS to SRU/CQL binding
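
Requesting the Endpoint Description is an ordinary explain request with one extension parameter. A minimal sketch that only builds the URL (no network access), with a hypothetical helper `explain_url`; the base URL is the Korp endpoint used as an example earlier in this tutorial:

```python
from urllib.parse import urlencode

BASE_URL = "https://spraakbanken.gu.se/ws/fcs/2.0/endpoint/korp/sru"


def explain_url(base_url, legacy=False):
    """Build an explain URL that also asks for the Endpoint Description.

    With legacy=True the SRU 1.2 parameters operation/version are added.
    """
    params = {}
    if legacy:
        params["operation"] = "explain"
        params["version"] = "1.2"
    params["x-fcs-endpoint-description"] = "true"
    return base_url + "?" + urlencode(params)


print(explain_url(BASE_URL, legacy=True))
print(explain_url(BASE_URL))
```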

FCS – Endpoint Description

XML according to the schema Endpoint-Description.xsd
<ed:EndpointDescription>
- @version with value “2”
- <ed:Capabilities> (1)
- <ed:SupportedDataViews> (1)
- <ed:SupportedLayers> (1) (if Advanced Search Capability)
- <ed:Resources> (1)
<ed:Capability>
- Content: Capability Identifier, URI
  - http://clarin.eu/fcs/capability/basic-search
  - http://clarin.eu/fcs/capability/advanced-search
<ed:SupportedDataView>
- Content: MIME type, e.g. application/x-clarin-fcs-hits+xml
- @id → for referencing in <ed:Resource>
- @delivery-policy: send-by-default / need-to-request
- No duplicates (based on MIME type) allowed
<ed:SupportedLayer>
- (only for Advanced Search)
- Content: Layer Identifier, e.g. “orth”
- @id → for referencing in <ed:Resource>
- @result-id → Referencing the layer in the Advanced Data View
- @qualifier → Identifier in FCS-QL Search Term for the layer
- @alt-value-info, @alt-value-info-uri: short description of the layer, e.g. for tagset, + URL with further information
- No duplicates allowed (based on @result-id)
<ed:Resource>
- @pid: Persistent Identifier (e.g. MdSelfLink from CMDI Record)
- <ed:Title> (1+) with @xml:lang, no duplicates, English required
- <ed:Description> (0+) with @xml:lang, English required, should be at most 1 sentence
- <ed:Institution> (0+) with @xml:lang, English required
- <ed:LandingPageURI> (0/1) – link to the website of the resource (or institution) with more information
- <ed:Languages> (1) with <ed:Language> content according to ISO 639-3 language codes
- <ed:AvailableDataViews> (1) with @ref = list of IDs of the <ed:SupportedDataView> elements, e.g. “hits adv”
- <ed:AvailableLayers> (1) (if Advanced Search Capability), with @ref = list of IDs of the <ed:SupportedLayer> elements, e.g. “word lemma pos”
- <ed:Resources> (0/1) for sub resources
- For <ed:AvailableDataViews> and <ed:AvailableLayers>: sub-resources should support the same lists as their parent, but must still declare them explicitly

Minimal Endpoint Description for BASIC Search

<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
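
A quick consistency check for such a document is to verify that every ID referenced in <ed:AvailableDataViews> has been declared in <ed:SupportedDataViews>. A minimal sketch using the standard library, with a hypothetical helper `check_data_view_refs` and a shortened copy of the example above:

```python
import xml.etree.ElementTree as ET

ED_NS = "http://clarin.eu/fcs/endpoint-description"
NS = {"ed": ED_NS}

# Shortened copy of the minimal Endpoint Description shown above.
ENDPOINT_DESCRIPTION = """\
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Languages><ed:Language>deu</ed:Language></ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
"""


def check_data_view_refs(xml_text):
    """Return (pid, id) pairs for data view references without a declaration."""
    root = ET.fromstring(xml_text)
    declared = {
        view.get("id")
        for view in root.findall("ed:SupportedDataViews/ed:SupportedDataView", NS)
    }
    problems = []
    # iter() also visits sub-resources nested inside other resources
    for resource in root.iter("{%s}Resource" % ED_NS):
        available = resource.find("ed:AvailableDataViews", NS)
        refs = available.get("ref", "").split() if available is not None else []
        problems += [(resource.get("pid"), ref) for ref in refs if ref not in declared]
    return problems


print(check_data_view_refs(ENDPOINT_DESCRIPTION))
```

This does not replace validation against Endpoint-Description.xsd or the FCS Endpoint Validator; it only catches one common mistake early.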

Endpoint Description with CMDI Data View and Sub-Resources

<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="cmdi" delivery-policy="need-to-request">application/x-cmdi+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- top-level resource 1 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
    <!-- top-level resource 2 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0816">
      <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen</ed:Title>
      <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus</ed:Title>
      <ed:LandingPageURI>http://repos.example.org/corpus2.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits cmdi" />
      <ed:Resources>
        <!-- sub-resource 1 of top-level resource 2 -->
        <ed:Resource pid="http://hdl.handle.net/4711/0816-1">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (vor 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (before 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub1</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits cmdi" />
        </ed:Resource>
        <!-- sub-resource 2 of top-level resource 2 ... -->
      </ed:Resources>
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>

Endpoint Description with ADVANCED Search Capability

<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
    <ed:Capability>http://clarin.eu/fcs/capability/advanced-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="adv" delivery-policy="send-by-default">application/x-clarin-fcs-adv+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:SupportedLayers>
    <ed:SupportedLayer id="word" result-id="http://spraakbanken.gu.se/ns/fcs/layer/word">text</ed:SupportedLayer>
    <ed:SupportedLayer id="orth" result-id="http://endpoint.example.org/Layers/orth" type="empty">orth</ed:SupportedLayer>
    <ed:SupportedLayer id="lemma" result-id="http://spraakbanken.gu.se/ns/fcs/layer/lemma">lemma</ed:SupportedLayer>
    <ed:SupportedLayer id="pos" result-id="http://spraakbanken.gu.se/ns/fcs/layer/pos"
                       alt-value-info="SUC tagset"
                       alt-value-info-uri="https://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf"
                       qualifier="suc">pos</ed:SupportedLayer>
    <ed:SupportedLayer id="pos2" result-id="http://spraakbanken.gu.se/ns/fcs/layer/pos2"
                       alt-value-info="2nd tagset"
                       qualifier="t2">pos</ed:SupportedLayer>
  </ed:SupportedLayers>

  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="hdl:10794/suc">
      <ed:Title xml:lang="sv">SUC-korpusen</ed:Title>
      <ed:Title xml:lang="en">The SUC corpus</ed:Title>
      <ed:Description xml:lang="sv">Stockholm-Umeå-korpusen hos Språkbanken.</ed:Description>
      <ed:Description xml:lang="en">The Stockholm-Umeå corpus at Språkbanken.</ed:Description>
      <ed:LandingPageURI>https://spraakbanken.gu.se/resurser/suc</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>swe</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits adv" />
      <ed:AvailableLayers ref="word lemma pos pos2" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
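
A minimal sketch, using only the Python standard library, of how a client might read such an Endpoint Description and list all resources including nested sub-resources. The helper name `list_resources` and the trimmed sample document are assumptions for illustration; they are not part of the FCS reference libraries.

```python
import xml.etree.ElementTree as ET

ED_NS = {"ed": "http://clarin.eu/fcs/endpoint-description"}

# Trimmed sample modelled on the BASIC example (one resource, one sub-resource).
SAMPLE = """\
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="cmdi" delivery-policy="need-to-request">application/x-cmdi+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <ed:Resource pid="http://hdl.handle.net/4711/0816">
      <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus</ed:Title>
      <ed:AvailableDataViews ref="hits cmdi" />
      <ed:Resources>
        <ed:Resource pid="http://hdl.handle.net/4711/0816-1">
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (before 1990)</ed:Title>
          <ed:AvailableDataViews ref="hits cmdi" />
        </ed:Resource>
      </ed:Resources>
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
"""


def list_resources(xml_text):
    """Return (depth, pid, data view ids) for every resource, depth first."""
    root = ET.fromstring(xml_text)
    found = []

    def walk(container, depth):
        for res in container.findall("ed:Resource", ED_NS):
            views_el = res.find("ed:AvailableDataViews", ED_NS)
            views = views_el.get("ref", "").split() if views_el is not None else []
            found.append((depth, res.get("pid"), views))
            nested = res.find("ed:Resources", ED_NS)
            if nested is not None:
                walk(nested, depth + 1)  # sub-resources are nested <ed:Resources>

    top = root.find("ed:Resources", ED_NS)
    if top is not None:
        walk(top, 0)
    return found


for depth, pid, views in list_resources(SAMPLE):
    print("  " * depth + pid, views)
```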

FCS – Search

SRU searchRetrieve
Actual “Search”
- Basic Search with CQL
- Advanced Search with FCS-QL
Search results are serialized in Resource (Fragment) and in Data View formats
Implementation details → Chapter Resources and Data Views

FCS – SRU Extension Parameter

x-fcs-endpoint-description (explain)
- “true” - <sru:extraResponseData> of the Explain Response contains the Endpoint Description document
x-fcs-context (searchRetrieve)
- Comma-separated list of PIDs
- Restrict the search to resources identified by these PIDs
x-fcs-dataviews (searchRetrieve)
- Comma-separated list of Data View identifiers
- Endpoints should also deliver these need-to-request Data Views if requested
x-fcs-rewrites-allowed (searchRetrieve)
- “true” - Endpoint can simplify query for higher recall

https://github.com/clarin-eric/fcs-misc/blob/main/fcs-core-2.0/normative-appendix.adoc#list-of-extra-request-parameters
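
The extension parameters listed above are ordinary URL query parameters. The following sketch shows how a client could assemble a searchRetrieve URL with them. The function name `build_search_url` and the endpoint address `https://endpoint.example.org/sru` are hypothetical; only the parameter names come from the specification.

```python
from urllib.parse import urlencode


def build_search_url(endpoint, query, query_type="cql",
                     pids=(), dataviews=(), rewrites_allowed=False):
    """Assemble a searchRetrieve request URL with optional FCS extension parameters."""
    params = {
        "operation": "searchRetrieve",
        "queryType": query_type,
        "query": query,
    }
    if pids:
        # comma-separated list of resource PIDs to restrict the search
        params["x-fcs-context"] = ",".join(pids)
    if dataviews:
        # comma-separated list of Data View identifiers (e.g. need-to-request views)
        params["x-fcs-dataviews"] = ",".join(dataviews)
    if rewrites_allowed:
        params["x-fcs-rewrites-allowed"] = "true"
    return endpoint + "?" + urlencode(params)


print(build_search_url(
    "https://endpoint.example.org/sru",  # hypothetical endpoint
    '"användning"',
    pids=["hdl:10794/suc"],
    dataviews=["cmdi"],
))
```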

FCS – Diagnostics

Supplement the SRU diagnostics → SRU 2.0 – Diagnostics
Prefix: http://clarin.eu/fcs/diagnostic/
Refer to the Extra Request Parameters

Identifier URI	Description	Impact
`http://clarin.eu/fcs/diagnostic/1`	Persistent identifier passed by the Client for restricting the search is invalid.	non-fatal
`http://clarin.eu/fcs/diagnostic/2`	Resource set too large. Query context automatically adjusted.	non-fatal
`http://clarin.eu/fcs/diagnostic/3`	Resource set too large. Cannot perform Query.	fatal
`http://clarin.eu/fcs/diagnostic/4`	Requested Data View not valid for this resource.	non-fatal
`http://clarin.eu/fcs/diagnostic/10`	General query syntax error.	fatal
`http://clarin.eu/fcs/diagnostic/11`	Query too complex. Cannot perform Query.	fatal
`http://clarin.eu/fcs/diagnostic/12`	Query was rewritten.	non-fatal
`http://clarin.eu/fcs/diagnostic/14`	General processing hint.	non-fatal

https://github.com/clarin-eric/fcs-misc/blob/main/fcs-core-2.0/normative-appendix.adoc#list-of-diagnostics
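
As a small illustration, the diagnostics table can be kept as a lookup structure so that a client can decide whether a response must be treated as failed. This is a sketch only; the names `FCS_DIAGNOSTICS`, `describe` and `must_abort` are made up for this example.

```python
FCS_DIAGNOSTIC_PREFIX = "http://clarin.eu/fcs/diagnostic/"

# code -> (description, fatal?), copied from the diagnostics table
FCS_DIAGNOSTICS = {
    1: ("Persistent identifier passed by the Client for restricting the search is invalid.", False),
    2: ("Resource set too large. Query context automatically adjusted.", False),
    3: ("Resource set too large. Cannot perform Query.", True),
    4: ("Requested Data View not valid for this resource.", False),
    10: ("General query syntax error.", True),
    11: ("Query too complex. Cannot perform Query.", True),
    12: ("Query was rewritten.", False),
    14: ("General processing hint.", False),
}


def describe(uri):
    """Return (description, fatal) for an FCS diagnostic URI."""
    if not uri.startswith(FCS_DIAGNOSTIC_PREFIX):
        raise ValueError("not an FCS diagnostic: %r" % uri)
    code = int(uri[len(FCS_DIAGNOSTIC_PREFIX):])
    return FCS_DIAGNOSTICS[code]


def must_abort(uris):
    """True if at least one of the reported diagnostics is fatal."""
    return any(describe(uri)[1] for uri in uris)
```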

Versions and Backwards Compatibility

“Clients MUST be compatible to CLARIN-FCS 1.0” (source)
- Thus implementation of SRU 1.2 still required (?)
- Restriction to Basic Search Capability
- Processing of legacy XML namespaces (SRU Response, Diagnostics)
Heuristic for version detection (of endpoints)
- Client: Explain request without version and operation parameters
- Endpoint: SRU Response <sru:explainResponse>/<sru:version> with default SRU version
Versions
- FCS 2.0 ↔ SRU 2.0
- FCS 1.0 ↔ SRU 1.2 (SRU 1.1)

https://github.com/clarin-eric/fcs-misc/blob/main/fcs-core-2.0/interface-specification.adoc#versioning-and-extensions
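
The version heuristic can be sketched on the client side as follows: inspect the namespace of the explain response root element and read its version element. The two namespaces are the ones named in this tutorial for FCS 1.0 and FCS 2.0 responses; the function name `detect_versions` and the shortened sample responses are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# SRU response namespace -> FCS/SRU pairing as described above
SRU_RESPONSE_NAMESPACES = {
    "http://www.loc.gov/zing/srw/": "FCS 1.0 (SRU 1.2/1.1)",
    "http://docs.oasis-open.org/ns/search-ws/sruResponse": "FCS 2.0 (SRU 2.0)",
}


def detect_versions(explain_xml):
    """Return (FCS/SRU pairing, content of the version element or None)."""
    root = ET.fromstring(explain_xml)
    if not root.tag.startswith("{"):
        raise ValueError("response root element has no namespace")
    # ElementTree writes qualified names as "{namespace}localname"
    namespace, _, local_name = root.tag[1:].partition("}")
    if local_name != "explainResponse":
        raise ValueError("not an explain response: %s" % local_name)
    if namespace not in SRU_RESPONSE_NAMESPACES:
        raise ValueError("unknown SRU namespace: %s" % namespace)
    sru_version = root.findtext("{%s}version" % namespace)
    return SRU_RESPONSE_NAMESPACES[namespace], sru_version


legacy = ('<sru:explainResponse xmlns:sru="http://www.loc.gov/zing/srw/">'
          '<sru:version>1.2</sru:version></sru:explainResponse>')
print(detect_versions(legacy))
```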

Notes on FCS SRU Aggregator

Currently no (?) support for FCS 2.0 only endpoints
- For compatibility reasons support of Legacy FCS and FCS 1.0
- Assumption that endpoints in FCS 2.0 also support earlier FCS Versions… (no issue with CLARIN SRU/FCS libraries)
→ FCS 2.0 only endpoints may therefore still receive FCS 1.0 (SRU 1.2) requests!
Aggregator sends searchRetrieve requests with only one resource PID in the x-fcs-context parameter for each resource requested
- i.e. search across N resources of an endpoint → N separate search queries

Reference Implementations

Java and Python, focus on FCS endpoints
Java class hierarchies, organization & structure, processes & lifecycles, configuration

CLARIN Reference Libraries (Java)

Development started ~2012
Modularized: Client/Server, SRU/FCS, Parser
Java 1.8+ (EOL: end of 2030)
Extensive documentation, some tests (proven by being in use for a long time)
Artifacts in CLARIN Nexus, Code on Github
Server/endpoint: external dependencies on
- Logging: slf4j
- HTTP: javax.servlet:servlet-api
- Parser: antlr4 (FCS-QL) / CQL
Build: maven
Deployment: jetty, tomcat, …

CLARIN Reference Libraries (Python)

~ 2022: Translation of Java reference libraries to Python
Strong orientation towards the Java reference libraries

→ (almost) identical interfaces, class/function names
but: slight optimizations for Python, no 1:1 copy
Focus on (new) FCS endpoints → no clients!
Typed, documented; published on PyPI
Synchronous, minimal WSGI - allows embedding in existing apps
Python 3.8+
Dependencies on
- XML parsing: lxml
- HTTP/WSGI: werkzeug
- Query Parser: PLY (CQL), ANTLR4 (FCS-QL)

CLARIN Reference Libraries

FCS SRU Server: Java (docs), Python (docs)
FCS Simple Endpoint: Java (docs), Python (docs)

FCS SRU Client: Java (docs)
FCS Simple Client: Java (docs)

CQL Parser: Java (docs?), Python, JavaScript
FCS-QL Parser: Java, Python (docs)

Maven Endpoint Archetype: Java
FCS SRU Aggregator: Java
FCS Endpoint Validator: Java (old), Java ← test compliance with SRU/FCS protocol
Korp: Java, Python

Indexdata: CQL-Parser, Querela: Python implementations

Note: concrete examples and implementations will follow in a later section, high-level overview here

FCS Endpoint – Design and structure

Query Parser (CQL, FCS-QL)

FCS SRU Server
- SRU configurations, versions, parameters, diagnostics, namespaces
- XML SRU Writer
- Request Parameter parser, SRUServer (request handler)
- Abstract SRU interfaces (results, SRUSearchEngine)
- Auth (Interface, WIP)

FCS Simple Endpoint
- FCS configurations (Endpoint Description), parameters, diagnostics, namespaces
- XML Endpoint Description parser, Record and Data View writer
- SimpleEndpointSearchEngineBase (SRUSearchEngine + FCS extensions)

FCS Endpoint for XYZ
- Implementation of abstract classes and bindings to search engine, query translation
- Configuration: Endpoint Description, SRU Server Configuration
- Deployment on Java Servlet Server or as WSGI app

FCS Endpoint – Initialization

SRUServerServlet / SRUServerApp (web server)

Set default WebApp parameters
Parse the SRU Server Config
Create QueryParserRegistry (CQL)
Initialize SRUSearchEngine
Create SRUServer (with SearchEngine + configurations)

SRUSearchEngine
(user implementation, → SimpleEndpointSearchEngineBase)

Further initialization of the QueryParserRegistry (FCS-QL)
do_init (user init)
Create Endpoint Description

FCS Endpoint – Communication Flow

[GET] request (incoming)

↳ SRUServerServlet / SRUServerApp (web server)

↳ SRUServer

URL parameter evaluation
Multiplexing by operation: search/scan/explain

↳ SimpleEndpointSearchEngineBase (user implementation)

Parse search query (CQL/FCS-QL) and send to search engine
Wrap result in SRUSearchResultSet
Possible diagnostics etc.

↲

optional error handling
XML output generation (SRU parameter)

FCS Endpoint – Class Hierarchy

SRUServerServlet - Java (Servlet) / SRUServerApp - Python (WSGI)

Servlet implementation for servlet container, doGet handler, setup of SRUServer wrapper/application executed by the endpoint operator

SRUServer - Java, Python

SRU protocol implementation, handleRequest, error handling, XML output generation

SRURequestImpl - Java, Python

Specific SRU GET parameter evaluation (parsing, validation; SRU versions) + possible FCS parameters (“x-…”), SRU version detection

↳ SRURequest (Interface) - Java, Python

Documentation of all SRU parameters

XYZEndpointSearchEngine - korp: Java, Python

Actual implementation of createEndpointDescription, do* methods

↳ SimpleEndpointSearchEngineBase (abstract) - Java, Python

Lifecycle (init → destroy), integration of endpoint description, interfaces for users

↳ SRUSearchEngineBase (abstract) - Java

↳ SRUSearchEngine (interface) - Java, Python

Interface: search, explain, scan

XYZSRUSearchResultSet - korp: Java, Python

Actual implementation, nextRecord + writeRecord iterator and serialization of results

↳ SRUSearchResultSet (abstract) - Java, Python

Fields for searchRetrieve operation results (total, records, …)

↳ SRUAbstractResult (interface) - Java, Python

Diagnostics + ExtraResponseData

XYZSRUScanResultSet, XYZSRUExplainResult do not need to be implemented separately, default behavior is adequate

SRUConstants - Java, Python

Diagnostic codes
Namespaces
Python: SRU parameter + values

SRUDiagnostic - Java, Python

Error handling, message (text description) of the diagnostic

Endpoint Configurations

<?xml version="1.0" encoding="UTF-8"?>
<endpoint-config xmlns="http://www.clarin.eu/sru-server/1.0/">
    <databaseInfo>
        <title xml:lang="sv">Språkbankens korpusar</title>
        <title xml:lang="en" primary="true">The Språkbanken corpora</title>
        <description xml:lang="sv">Sök i Språkbankens korpusar.</description>
        <description xml:lang="en" primary="true">Search in the Språkbanken corpora.</description>
        <author xml:lang="en">Språkbanken (The Swedish Language Bank)</author>
        <author xml:lang="sv" primary="true">Språkbanken</author>
    </databaseInfo>

    <indexInfo>
        <set name="fcs" identifier="http://clarin.eu/fcs/resource">
            <title xml:lang="sv">Clarins innehållssökning</title>
            <title xml:lang="en" primary="true">CLARIN Content Search</title>
        </set>
        <index search="true" scan="false" sort="false">
            <title xml:lang="en" primary="true">Words</title>
            <map primary="true">
                <name set="fcs">words</name>
            </map>
        </index>
    </indexInfo>

    <schemaInfo>
        <schema identifier="http://clarin.eu/fcs/resource" name="fcs"
                sort="false" retrieve="true">
            <title xml:lang="en" primary="true">CLARIN Content Search</title>
        </schema>
    </schemaInfo>
</endpoint-config>

WebApp Parameter (web.xml or similar) - Korp example

SRU Version
SRU/FCS configurations

SRU (SRU Server Config) - Korp example →

databaseInfo about endpoint, but no evaluation in client?
default: indexInfo + schemaInfo
Mandatory: database field in serverInfo!

FCS (Endpoint Description) - Korp example

FCS Version (1/2)
Capabilities, Layer, DataViews
Resources

Resources and Data Views

Endpoint Capabilities, BASIC/ADVANCED Search, FCS-QL
Resource, Resource Fragment, Data View (Hits, Advanced)
Result serialization, query languages

Endpoint Description – Capabilities

http://clarin.eu/fcs/capability/basic-search

Mandatory
Query: Full-text search (Basic) with minimal CQL (AND/OR)
DataView: HITS

http://clarin.eu/fcs/capability/advanced-search

Optional
Query: FCS-QL (Structured search over annotation layers)
DataView: HITS and Advanced
Other capabilities possible

→ currently limited to Basic and Advanced Search!
Capabilities do not only determine search modes!
Work in progress:
- Authentication/authorization
- Lexical search: …/lex-search → LexCQL, LexHITS
- Syntactic search?
Note: according to XSD, capability URIs have the following schema http://clarin.eu/fcs/capability/\w([\.\-]{0,1}\w)*
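
A short sketch of how the capability URI pattern from the note above could be checked in Python. The pattern is taken verbatim from the note; `is_valid_capability` is a hypothetical helper. Note that XSD patterns are implicitly anchored (hence `fullmatch`), and that `\w` in Python is not defined identically to `\w` in XSD, so this is an approximation.

```python
import re

# Pattern as quoted from the XSD; XSD patterns always match the whole value.
CAPABILITY_PATTERN = re.compile(r"http://clarin.eu/fcs/capability/\w([\.\-]{0,1}\w)*")


def is_valid_capability(uri):
    """Check a capability URI against the pattern (approximation of XSD semantics)."""
    return CAPABILITY_PATTERN.fullmatch(uri) is not None


for uri in (
    "http://clarin.eu/fcs/capability/basic-search",
    "http://clarin.eu/fcs/capability/advanced-search",
    "http://clarin.eu/fcs/capability/-broken",
):
    print(uri, is_valid_capability(uri))
```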

BASIC Search

cat
"cat"
cat AND dog
"grumpy cat"
"grumpy cat" AND dog
"grumpy cat" OR "lazy dog"
cat AND (mouse OR "lazy dog")

Mandatory!
Simple full-text search
Contextual Query Language (CQL) as query language
Endpoints
- must support “term-only” queries
- can support Boolean operators (AND/OR) and sub-queries
- must abort in case of errors with appropriate diagnostics
- can decide themselves what to search for (text, normalization etc.)
Results serialized in Generic Hits (HITS) Data View

http://clarin.eu/fcs/capability/basic-search

ADVANCED Search

"walking"
[token = "walking"]
"Dog" /c
[word = "Dog" /c]
[pos = "NOUN"]
[pos != "NOUN"]
[lemma = "walk"]
"blaue|grüne" [pos = "NOUN"]
"dogs" []{3,} "cats" within s
[z:pos = "ADJ"]
[z:pos = "ADJ" & q:pos = "ADJ"]

Optional
Structured search in annotated data, represented in annotation layers
→ Query language FCS-QL
- Queries can combine different annotation layers
- Endpoints should support as many annotation layers as possible
Results serialized in Advanced (ADV) Data View and Generic Hits (HITS) Data View

http://clarin.eu/fcs/capability/advanced-search

FCS-QL

Annotation Layers, containing annotations of a certain type (e.g. text, POS tags, …)
Query supports combination of these layers
Each layer is segmented → search for individual segments (e.g. a single token or its lemma)
- No requirement as to how segmentation should be done
- Assumption that segmentation is consistent across layers (for display in Advanced Data View)
- Queries can combine segments for multi-token patterns

FCS-QL – Notes

Endpoints must be able to parse FCS-QL completely!
Requests with unsupported operators or layers?
- Generate errors with diagnostics, or
- Rewrite queries if permitted by “x-fcs-rewrites-allowed” (on request)
Searches are case-sensitive (configurable in the query)
Searches (by endpoints) should take place on layers where it makes sense, e.g. if there are several text or POS layers

FCS-QL – Layer Types

Layer Type Identifier	Annotation Layer Description	Syntax	Examples (without quotes)
`text`	Textual representation of resource, also the layer that is used in Basic Search	String	"Dog", "cat", "walking", "better"
`lemma`	Lemmatisation	String	"good", "walk", "dog"
`pos`	Part-of-Speech annotations	Universal POS tags	"NOUN", "VERB", "ADJ"
`orth`	Orthographic transcription of (mostly) spoken resources	String	"dug", "cat", "wolking"
`norm`	Orthographic normalization of (mostly) spoken resources	String	"dog", "cat", "walking", "best"
`phonetic`	Phonetic transcription	SAMPA	"'du:", "'vi:-d6 'ha:-b@n"

Universal Dependencies, Universal POS tags v2.0
Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems. Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7

FCS-QL – Layer Type Identifier

Identifies layers for FCS-QL and Advanced Data View
Other identifiers are not allowed, except for testing purposes
Custom identifiers must be prefixed with “x-”
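
The rule for layer type identifiers can be expressed as a small check. This is a sketch; the set simply repeats the identifiers from the layer type table, and `is_allowed_layer_type` is a hypothetical helper name.

```python
# Layer type identifiers from the layer type table
STANDARD_LAYER_TYPES = frozenset({"text", "lemma", "pos", "orth", "norm", "phonetic"})


def is_allowed_layer_type(identifier):
    """Standard identifiers are allowed; custom ones must carry the "x-" prefix."""
    if identifier in STANDARD_LAYER_TYPES:
        return True
    return identifier.startswith("x-") and len(identifier) > len("x-")


print(is_allowed_layer_type("pos"))        # standard layer type
print(is_allowed_layer_type("x-syntax"))   # custom, correctly prefixed
print(is_allowed_layer_type("syntax"))     # custom without prefix
```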

Result Serialization

Results must be serialized in CLARIN FCS format
- Resource (Fragment), Data View
- XML → XSD
Important: 1 Hit = 1 Result Record
- Do not combine multiple hits in one record
  → generate separate SRU records for each hit that reference the same resource
- Multiple hit markers are allowed, e.g. for boolean expressions to highlight individual terms
- Each “Hit” should be presented within a sentence context

Resource

“searchable and addressable entity” in the endpoint, e.g. text corpus
“self contained”, i.e. entire document, not a single sentence from a document
Addressable as a whole via Persistent Identifier or URI

Resource Fragment

Part of a Resource, e.g. single sentence, or time interval in audio transcription (for multi-modal corpora)
Should be addressable within a Resource (offset / ID)
Optional, but recommended

Data View

Serialization of a “Hit” in a Resource (Fragment)
Enables different representations, expandable

Result Serialization – Linking

Endpoints should provide link to Resource (Fragment)
- Persistent Identifier (PID) / URI
- If direct linking is not possible, then e.g. website with description of the resource, corpus or collection
- Link should be as specific as possible
- PIDs preferred to URIs, both together recommended

Result Serialization – Examples

HITS Data View of a resource with PID

<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/00-15">
  <fcs:DataView type="application/x-clarin-fcs-hits+xml">
    <!-- data view payload omitted -->
  </fcs:DataView>
</fcs:Resource>

HITS Data View for a resource with Resource Fragment for more granular structuring

<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15">
  <fcs:ResourceFragment>
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <!-- data view payload omitted -->
    </fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>

HITS Data View with CMDI Data View for resource metadata

<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource"
              pid="http://hdl.handle.net/4711/08-15"
              ref="http://repos.example.org/file/text_08_15.html">
  <!-- (1) -->
  <fcs:DataView type="application/x-cmdi+xml"
                pid="http://hdl.handle.net/4711/08-15-1"
                ref="http://repos.example.org/file/08_15_1.cmdi">
      <!-- data view payload omitted -->
  </fcs:DataView>

  <!-- (2) -->
  <fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2"
                        ref="http://repos.example.org/file/text_08_15.html#sentence2">
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <!-- data view payload omitted -->
    </fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>

(1) Specification of CMDI metadata for the resource
(2) The hit is part of a larger resource; the Resource Fragment makes the reference “semantically more meaningful”

Data Views

Specification (with XSD schema, examples)
- Data Views 1.0 (pdf, repo)
- FCS Core 2.0 (pdf, repo) (primary)
Specified in FCS Core 2.0
- Advanced (ADV) Data View
- Generic Hits (HITS) Data View
Additional Data Views such as Component Metadata (CMDI), Images (IMG), Geolocation (GEO) are included, but not used in the standard FCS client “Aggregator”
Delivery policy: mandatory “send-by-default” or optional “need-to-request”
Generic Hits Data View is mandatory, must always be sent
Only send Data Views that
- were explicitly requested with the (SRU) FCS parameter “x-fcs-dataviews”, or
- have the delivery policy “send-by-default”
Invalid Data Views → non-fatal diagnostic for each requested Data View

http://clarin.eu/fcs/diagnostic/4

("Requested Data View not valid for this resource")
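
The delivery rules above can be sketched as a small selection function: send every “send-by-default” view, add explicitly requested views, and report a non-fatal diagnostic for each requested view that is not valid for the resource. The function name and the plain dict/list inputs are assumptions for illustration and do not mirror the reference library API.

```python
DIAGNOSTIC_DATA_VIEW_INVALID = "http://clarin.eu/fcs/diagnostic/4"


def select_data_views(supported, available, requested):
    """Decide which Data Views to serialize for one resource.

    supported: dict of Data View id -> delivery policy (endpoint-wide)
    available: list of Data View ids available for this resource
    requested: list of ids from the x-fcs-dataviews parameter
    Returns (ids to send, list of (diagnostic URI, offending id)).
    """
    to_send = [v for v in available if supported.get(v) == "send-by-default"]
    diagnostics = []
    for view in requested:
        if view not in supported or view not in available:
            # one non-fatal diagnostic per invalid requested Data View
            diagnostics.append((DIAGNOSTIC_DATA_VIEW_INVALID, view))
        elif view not in to_send:
            to_send.append(view)
    return to_send, diagnostics


supported = {"hits": "send-by-default", "cmdi": "need-to-request"}
print(select_data_views(supported, ["hits", "cmdi"], ["cmdi"]))
print(select_data_views(supported, ["hits"], ["cmdi"]))
```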

Hits Data View

Description	The representation of the hit
MIME type	`application/x-clarin-fcs-hits+xml`
Payload Disposition	inline
Payload Delivery	send-by-default (`REQUIRED`)
Recommended Short Identifier	`hits` (`RECOMMENDED`)
XML Schema	DataView-Hits.xsd

Required implementation
Simplest serialization, (lossy) approximation of results
Each hit should only occur in a single sentence context (or similar)
Multiple hit annotations possible, e.g. for conjunctions in the query

Hits Data View – Examples

HITS Data View with a hit marker

<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
  <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
    The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy dog.
  </hits:Result>
</fcs:DataView>

HITS Data View with multiple hit markers for boolean queries

<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
  <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
    The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.
  </hits:Result>
</fcs:DataView>
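
A minimal sketch of how such a HITS Data View could be produced from a sentence and a list of hit offsets, using only the Python standard library. The reference libraries ship their own writers; the function `build_hits_data_view` and its offset convention (character offsets, end exclusive) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/resource"
HITS_NS = "http://clarin.eu/fcs/dataview/hits"
ET.register_namespace("fcs", FCS_NS)
ET.register_namespace("hits", HITS_NS)


def build_hits_data_view(text, hit_spans):
    """Serialize text with hit markers.

    hit_spans: sorted, non-overlapping (start, end) character offsets,
    end exclusive (an assumption of this sketch).
    """
    data_view = ET.Element("{%s}DataView" % FCS_NS,
                           {"type": "application/x-clarin-fcs-hits+xml"})
    result = ET.SubElement(data_view, "{%s}Result" % HITS_NS)
    cursor = 0
    previous = None  # last <hits:Hit> element written so far
    for start, end in hit_spans:
        between = text[cursor:start]
        if previous is None:
            result.text = between      # text before the first hit
        else:
            previous.tail = between    # text between two hits
        previous = ET.SubElement(result, "{%s}Hit" % HITS_NS)
        previous.text = text[start:end]
        cursor = end
    remainder = text[cursor:]
    if previous is None:
        result.text = remainder
    else:
        previous.tail = remainder
    return ET.tostring(data_view, encoding="unicode")


sentence = "The quick brown fox jumps over the lazy dog."
fox = sentence.index("fox")
dog = sentence.index("dog")
print(build_hits_data_view(sentence, [(fox, fox + 3), (dog, dog + 3)]))
```

Each call produces the payload for exactly one result record, in line with the rule that one hit (or one sentence context with several hit markers) corresponds to one record.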

KWIC Data View

Description	The representation of the hit
MIME type	`application/x-clarin-fcs-kwic+xml`
Payload Disposition	inline
Payload Delivery	send-by-default (`REQUIRED`)
Recommended Short Identifier	`kwic` (`RECOMMENDED`)
XML Schema	-

Deprecated!
Only for compatibility with Legacy FCS clients
Example in CQP/SRU bridge
Mapping of
- left and right context,
- hits

Serializer Java, Python
Aggregator transforms it to Hits Data View!

Advanced Data View

Description	The representation of the hit for Advanced Search
MIME type	`application/x-clarin-fcs-adv+xml`
Payload Disposition	inline
Payload Delivery	send-by-default (`REQUIRED`)
Recommended Short Identifier	`adv` (`RECOMMENDED`)
XML Schema	DataView-Advanced.xsd

Serialization for Advanced Search for multimedia data (text, transcribed audio)
Presentation of structured information via multiple annotation layers
Annotations are streams (ranges) over the signal in a stand-off like format with start and end offsets (inclusive)
Segmentation via <Segment>, annotations in <Span> in <Layer>
- It must be possible to align segments across all annotation layers
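
The segment/layer/span structure can be sketched as follows. Caution: the namespace `http://clarin.eu/fcs/dataview/advanced` and the attribute names (`unit`, `id`, `start`, `end`, `ref`, `highlight`) are written from memory of the FCS Core 2.0 specification and should be treated as assumptions; verify them against DataView-Advanced.xsd. The function name and the one-space token separator are also assumptions. Offsets are 1-based and inclusive, as described above.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace as recalled from FCS Core 2.0, check DataView-Advanced.xsd
ADV_NS = "http://clarin.eu/fcs/dataview/advanced"
ET.register_namespace("adv", ADV_NS)


def build_advanced(tokens, layers, hit_indexes):
    """Build an <adv:Advanced> element.

    tokens: list of token strings (assumed to be separated by one space)
    layers: dict of layer result-id -> one annotation value per token
    hit_indexes: set of token positions (0-based) that belong to the hit
    """
    advanced = ET.Element("{%s}Advanced" % ADV_NS, {"unit": "item"})
    segments = ET.SubElement(advanced, "{%s}Segments" % ADV_NS)
    offset = 1  # 1-based, inclusive offsets
    for i, token in enumerate(tokens):
        end = offset + len(token) - 1
        ET.SubElement(segments, "{%s}Segment" % ADV_NS,
                      {"id": "s%d" % (i + 1), "start": str(offset), "end": str(end)})
        offset = end + 2  # skip the separating space
    layers_el = ET.SubElement(advanced, "{%s}Layers" % ADV_NS)
    for layer_id, values in layers.items():
        if len(values) != len(tokens):
            raise ValueError("layer %s is not aligned with the segments" % layer_id)
        layer = ET.SubElement(layers_el, "{%s}Layer" % ADV_NS, {"id": layer_id})
        for i, value in enumerate(values):
            attrs = {"ref": "s%d" % (i + 1)}
            if i in hit_indexes:
                attrs["highlight"] = "h1"
            span = ET.SubElement(layer, "{%s}Span" % ADV_NS, attrs)
            span.text = value
    return advanced


tokens = ["The", "quick", "brown", "fox"]
element = build_advanced(tokens, {
    "http://spraakbanken.gu.se/ns/fcs/layer/word": tokens,
    "http://spraakbanken.gu.se/ns/fcs/layer/pos": ["DET", "ADJ", "ADJ", "NOUN"],
}, hit_indexes={3})
print(ET.tostring(element, encoding="unicode"))
```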

Advanced Data View – Example

Advanced Data View – Example (2)

Advanced Data View – Presentation

Advanced Data View - Visualization in Aggregator

https://contentsearch.clarin.eu/?&queryType=fcs&query=%5B%20word%20%3D%20%22her.*%22%20%5D%20%5B%20lemma%20%3D%20%22Artznei%22%20%5D%20%5B%20pos%20%3D%20%22VERB%22%20%5D

Examples

→ see more examples (searchRetrieve query)
endpoint: https://spraakbanken.gu.se/ws/fcs/2.0/endpoint/korp/sru

…?operation=searchRetrieve&queryType=fcs&query=%5bword%3d%22anv%C3%A4ndning%22%5d

→ FCS 2.0, FCS-QL: [ word = "användning" ], HITS + ADV
…?operation=searchRetrieve&queryType=cql&query=%22anv%C3%A4ndning%22

→ FCS 2.0, CQL: "användning", HITS
…?operation=searchRetrieve&version=1.2&query=cat ↔ …?query=cat → HITS
- FCS 1.0, sru="http://www.loc.gov/zing/srw/"
- FCS 2.0, sruResponse="http://docs.oasis-open.org/ns/search-ws/sruResponse"
more parameters: x-indent-response=1 / x-fcs-dataviews=cmdi / x-fcs-context=11022/0000-0000-20DF-1

Query Translation

Query Languages, Visualization
FCS-QL Details
Query Mapping

Query Languages

More resources in Awesome FCS List > Query Parsers
CQL (Contextual Query Language)
- BNF grammar: www.loc.gov/standards/sru/cql/spec.html#bnf
- Hand-written parser implementation in Java, Python, JS, …
- Documentation: Java
- Visualization in demo of JS parser
  - Validation for Text+ LexCQL
FCS-QL (Federated Content Search Query Language)
- EBNF grammar: github.com/clarin-eric/fcs-misc (FCS Core 2.0)
- Parser implementation in Java, Python, using ANTLR4, parsed into own wrappers (Java, Python)
- Documentation: Java, Python
- Grammar visualization with ANTLR4 tools

FCS-QL – Visualization

Installation

pip install antlr4-tools
git clone https://github.com/clarin-eric/fcs-ql.git
cd fcs-ql/src/main/antlr4/eu/clarin/sru/fcs/qlparser

Visualization according to ANTLR4 > Getting Started

antlr4-parse FCSParser.g4 FCSLexer.g4 query -gui
[ word = "her.*" ] [ lemma = "Artznei" ] [ pos = "VERB" ]
^D

FCS-QL Query Nodes

QueryNode (with child node “children”)

Expression (layer identifier, layer identifier qualifier, operator, regular expression + flags)
- Wildcard
- Group → 1 QueryNode; “(” … “)”
- NOT → 1 QueryNode
- AND, OR → list of QueryNodes
QueryDisjunction → list of QueryNodes
QuerySequence → list of QueryNodes → “list of QuerySegments”
QuerySegment (min, max) → Expression → “a single token”
QueryGroup (min, max) → QueryNode
Within-Query (SimpleWithin, QueryWithWithin) (Scope: sentence, utterance, paragraph, turn, text, session) (unused)

grayed out: currently not supported by the FCS Aggregator for searching (in visual query builder)

FCS-QL Query Nodes – Aggregator

Parsed Query:

Query Sequence → with list of Query Segment

[ word = ".*her" ] [ lemma = "Artznei" ] [ pos = "VERB" ]
Query Segment → a token (can be repeatable)

[ word = "her.*" & ( word = "test" | word = "Apfel" ) ] [ pos = "ADV" ]{1,3}
- Expression AND
  
  [ word = "her.*" & word = "test" ]
  
  Expression Group
  
  Expression
- Expression Group → Expression OR → list of Expression
  
  [ ( word = "her.*" | word = "Test" ) ]
- Expression → Layer Identifier, Operator, Regex (value)
  
  [ word = "her.*" ]

FCS-QL – Remarks

Currently (Aggregator v3.9.1) only limited support of all FCS-QL features

→ partly due to Visual Query Builder
Free text input / improved query builder planned for the future
Use appropriate diagnostics if query features are not supported
- SRU: info:srw/diagnostic/1/48 - Query feature unsupported.
- FCS: http://clarin.eu/fcs/diagnostic/10 - General query syntax error. - should be intercepted by FCS-QL parser library
- FCS: http://clarin.eu/fcs/diagnostic/11 - Query too complex. Cannot perform Query.

Query-Mapping

Idea:
- Let libraries parse raw queries (CQL, FCS-QL)
- Recursively walk through the parsed query tree, “depth first”
- Successively generate transformed query (for target system), e.g. StringBuilder in Java
Examples:
- Korp: CQL → CQP (Java, Python), FCS-QL → CQP (Java, Python)
- NoSketchEngine: CQL → CQL (Java), FCS-QL → CQL (Java)
- Solr: CQL → Solr (Java), LexCQL → Solr (Java)
  - SolrQuery with highlighting, Custom hit prefixes/postfixes, use Solr result as pre-formatted Data View content (Code)
- CQI Bridge: CQL → CQP (Java)
ElasticSearch
- Only BASIC Search with full-text queries, e.g. with Simple Query String
Solr
- Out of the box only BASIC Search
- ADVANCED Search possible with e.g. MTAS (“Multi Tier Annotation Search”)
In general: use actual Corpus Search Engine for ADVANCED Search

→ otherwise at most a single annotation layer (“text”) can be searched
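
The mapping idea (parse, walk the tree depth first, build the target query step by step) can be sketched with a tiny stand-in query tree. The node classes below are deliberately simplified and hypothetical; they are not the classes of the FCS-QL parser libraries, and the CQP-like output syntax and the layer mapping are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Expression:      # e.g. word = "her.*"
    layer: str
    operator: str      # "=" or "!="
    regex: str


@dataclass
class And:
    children: List[object]


@dataclass
class Or:
    children: List[object]


@dataclass
class Segment:         # a single token, optionally repeated
    expression: object
    min: int = 1
    max: int = 1


@dataclass
class Sequence:
    segments: List[Segment]


# Hypothetical mapping from FCS-QL layer identifiers to target attributes
LAYER_TO_ATTRIBUTE = {"text": "word", "word": "word", "lemma": "lemma", "pos": "pos"}


def translate(node):
    """Depth-first translation of the query tree into a CQP-like string."""
    if isinstance(node, Sequence):
        return " ".join(translate(segment) for segment in node.segments)
    if isinstance(node, Segment):
        quantifier = "" if (node.min, node.max) == (1, 1) else "{%d,%d}" % (node.min, node.max)
        return "[" + translate(node.expression) + "]" + quantifier
    if isinstance(node, And):
        return "(" + " & ".join(translate(child) for child in node.children) + ")"
    if isinstance(node, Or):
        return "(" + " | ".join(translate(child) for child in node.children) + ")"
    if isinstance(node, Expression):
        if node.layer not in LAYER_TO_ATTRIBUTE:
            # a real endpoint would report a diagnostic here
            raise ValueError("unsupported layer: %s" % node.layer)
        # note: quotes inside the regex are not escaped in this sketch
        return '%s %s "%s"' % (LAYER_TO_ATTRIBUTE[node.layer], node.operator, node.regex)
    raise TypeError("unknown node type: %r" % (node,))


query = Sequence([
    Segment(And([Expression("word", "=", "her.*"),
                 Or([Expression("word", "=", "test"), Expression("word", "=", "Apfel")])])),
    Segment(Expression("pos", "=", "ADV"), min=1, max=3),
])
print(translate(query))
```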

FCS Endpoint Development

VSCode settings, kickstart a project
Minimal FCS endpoint, search engine connection, result serialization
Deployment, Embedding, Extensibility

Visual Studio Code (suggestion)

Download & Installation: code.visualstudio.com
Extensions:
- Java
  - vscjava.vscode-java-pack
  - redhat.vscode-xml (optional)
- Python
  - ms-python.python
  - ms-python.vscode-pylance, ms-python.black-formatter (optional)
- Quality of Life
  - eamodio.gitlens
  - ms-vscode-remote.vscode-remote-extensionpack, ms-vscode.remote-explorer (for WSL or remote via SSH)

Visual Studio Code – Debugging (Java)

For *.war/Jetty web application testing
No hot code swapping / do not make any changes between compilation and debugging!
VSCode Debug Setting:
- Run and Debug > Add Configuration … > “Java: Attach by Process ID”

Run application with Maven:

MAVEN_OPTS="-Xdebug -Xnoagent -Djava.compiler=NONE -agentlib:jdwp=transport=dt_socket,server=y,address=5005" \
    mvn [jetty:run-war|...]

Visual Studio Code – Debugging (Python)

launch.json
- pytest: no predefined configuration in “Run and Debug” menu
- file/module: as required
settings.json
- pytest: coverage must be deactivated here!

{
    "name": "Python: pytest",
    "type": "python",
    "request": "launch",
    "console": "integratedTerminal",
    "purpose": [
        "debug-test"
    ],
    "justMyCode": false
}

"python.testing.pytestArgs": [
    ".",
    // disable coverage for debugging
    "--no-cov",
    // disable ansi color output (-vv)
    "-q",
],

Kickstart FCS Endpoint Project

See Guide to Endpoint Development

→ Using reference endpoint implementations

Using the Korp endpoint
- Java: github.com/clarin-eric/fcs-korp-endpoint
- Python: github.com/Querela/fcs-korp-endpoint-python
- Java: github.com/clarin-eric/fcs-sru-cqi-bridge (CQP/SRU bridge)
Java: project generation with Maven
- Project template: github.com/clarin-eric/fcs-endpoint-archetype

CLARIN SRU/FCS Endpoint Archetype

Installation of the archetype in the local Maven repository, or
Configuration of the CLARIN Nexus as a remote repository
Project generation with Maven:

mvn archetype:generate \
    -Pclarin \
    -DarchetypeGroupId=eu.clarin.sru.fcs \
    -DarchetypeArtifactId=fcs-endpoint-archetype \
    -DarchetypeVersion=1.6.0 \
    -DgroupId=[ id.group.fcs ] \
    -DartifactId=[ my-cool-endpoint ] \
    -Dversion=[ 1.0-SNAPSHOT ] \
    -DinstitutionName=[ "My Institution" ]

all [… ] placeholders must be replaced with the appropriate values (enclose values with spaces in quotation marks)
if using the CLARIN remote repository, the custom profile is selected with -Pclarin, see example maven configuration
if archetype is installed using git, then use archetypeVersion=1.6.0-SNAPSHOT (see details in pom.xml)

Minimal FCS Endpoint

Required class implementations
- SimpleEndpointSearchEngineBase
- SRUSearchResultSet
- Wrapper or adapter for search engine (!)
Required configurations
- sru-server-config.xml
- endpoint-description.xml
- Web app configurations (Java: web.xml, Python: key-value parameter dict)
  - Reference to implementation of the SimpleEndpointSearchEngineBase
  - Required SRU parameters (host, port, server, …)

Minimal FCS Endpoint – Initialization

→ SimpleEndpointSearchEngineBase (Java, Python)

void doInit (ServletContext context, SRUServerConfig config, SRUQueryParserRegistry.Builder queryParsersBuilder, Map<String, String> params) - Java, Python

Required implementation!
(optional) initialization of APIs, default values (PIDs), …

EndpointDescription createEndpointDescription (ServletContext context, SRUServerConfig config, Map<String, String> params) - Java, Python

Required implementation!
Loading of EndpointDescription (Java, Python)
- embedded XML file (load with SimpleEndpointDescriptionParser, Java, Python) or
- construction dynamically, e.g. via API - example NoSketchEngine

Minimal FCS Endpoint – Scan/Explain

(theoretically) nothing to implement

→ Default handlers for “explain” and “scan” respond to requests automatically
In case of doubt, the Endpoint Description is returned via the “explain” operation

→ SimpleEndpointSearchEngineBase (Java, Python)

Minimal FCS Endpoint – Search Request

SRUSearchResultSet search (SRUServerConfig config, SRURequest request, SRUDiagnosticList diagnostics)

Parse query (search request)
- Check “queryType” parameter, whether CQL, FCS-QL, …
- Error: SRU_CANNOT_PROCESS_QUERY_REASON_UNKNOWN
Analyze ExtraRequestData
- “x-fcs-context” - requested resource (scope of search)
  - Diagnostic: FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID - invalid PIDs
  - Error: SRU_UNSUPPORTED_PARAMETER_VALUE - e.g. too many PIDs, no PIDs
- “x-fcs-dataviews” - requested Data Views
  - Diagnostic: FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID
Pagination → startRecord (default: 1) / maximumRecords (default: -1)

Process search with (local) search engine
Wrap results in SRUSearchResultSet

“If in Doubt” → `SRU_GENERAL_SYSTEM_ERROR`
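
The checks above can be sketched in plain Python. This is a minimal illustration and not the API of the SRU/FCS libraries: `parse_search_request`, `ParsedRequest` and `FatalRequestError` are hypothetical names, the request is modelled as a plain dict, and the diagnostic constants are simple labels instead of the real diagnostic URIs. The `queryType` values `cql` and `fcs` are assumed for CQL and FCS-QL.

```python
from dataclasses import dataclass, field

# Diagnostic names as they appear above; the values here are plain labels for
# this sketch, not the diagnostic URIs defined by the SRU/FCS libraries.
SRU_CANNOT_PROCESS_QUERY_REASON_UNKNOWN = "SRU_CANNOT_PROCESS_QUERY_REASON_UNKNOWN"
SRU_UNSUPPORTED_PARAMETER_VALUE = "SRU_UNSUPPORTED_PARAMETER_VALUE"
FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID = "FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID"


class FatalRequestError(Exception):
    """Hypothetical stand-in for an error that aborts the request."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


@dataclass
class ParsedRequest:
    query_type: str
    query: str
    pids: list
    dataviews: list
    start_record: int
    maximum_records: int
    diagnostics: list = field(default_factory=list)  # non-fatal (code, details)


def _split(value):
    """Split a comma-separated parameter value, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_search_request(params, known_pids, known_dataviews, default_pid):
    query_type = params.get("queryType", "cql")
    if query_type not in ("cql", "fcs"):
        raise FatalRequestError(
            SRU_CANNOT_PROCESS_QUERY_REASON_UNKNOWN,
            f"unsupported queryType {query_type!r}",
        )

    parsed = ParsedRequest(
        query_type=query_type,
        query=params.get("query", ""),
        pids=[],
        dataviews=[],
        start_record=int(params.get("startRecord", 1)),
        maximum_records=int(params.get("maximumRecords", -1)),
    )

    # "x-fcs-context": unknown PIDs are reported but do not abort the request
    if "x-fcs-context" in params:
        for pid in _split(params["x-fcs-context"]):
            if pid in known_pids:
                parsed.pids.append(pid)
            else:
                parsed.diagnostics.append(
                    (FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID, pid))
        if not parsed.pids:
            raise FatalRequestError(
                SRU_UNSUPPORTED_PARAMETER_VALUE, "no valid PID in x-fcs-context")
    else:
        parsed.pids.append(default_pid)

    # "x-fcs-dataviews": same pattern, using the diagnostic named above
    for dataview in _split(params.get("x-fcs-dataviews", "")):
        if dataview in known_dataviews:
            parsed.dataviews.append(dataview)
        else:
            parsed.diagnostics.append(
                (FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID, dataview))
    return parsed
```

Non-fatal problems are collected and returned with the results, while fatal ones abort the request, which mirrors the Diagnostic/Error distinction in the list above.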

Search Engine Integration

Input: Parameters of search query
- Query (translated for (local) search engine)
- Resource (PID)
- Pagination: offset + count → startRecord (default: 1) / maximumRecords (default: -1)
- (Request object and Server configurations)
- (all global/static objects, such as API adapters etc.)

Output: Details for response, results
- Total number (optional, FCS 2.0 allows indication of accuracy)
- List of results
  - with “hit highlighting” (Hits) (Basic + Advanced Search)
  - tokenized (using character offsets) for FCS-QL (Advanced Search) with optional Advanced Search annotation layers
- Diagnostics
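
The pagination input listed above usually has to be translated for the local search engine. A minimal sketch, assuming a search engine that expects a 0-based offset and a count; the function name and the two page-size limits are made-up example values, not something prescribed by SRU or FCS:

```python
def to_offset_and_count(start_record=1, maximum_records=-1,
                        default_count=25, max_count=250):
    """Translate SRU pagination (1-based) into offset + count (0-based)."""
    if start_record < 1:
        raise ValueError("startRecord must be >= 1")
    offset = start_record - 1
    if maximum_records < 0:
        # -1 means the client did not ask for a specific number of records
        count = default_count
    else:
        # never return more than the endpoint is willing to serve at once
        count = min(maximum_records, max_count)
    return offset, count
```

For example, `startRecord=11` with `maximumRecords=10` requests the second page of ten results, i.e. offset 10 and count 10.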

Wrapper for results
- Total number of results
- List of results (text with hit offsets; tokens + annotations)
- Resource PID, URL to result details
SRUSearchResultSet implementation
- Iterator interface → nextRecord(), writeRecord(); curRecordCursor
Ex: MyResults, NoSkESRUFCSSearchResultSet

protected NoSkESRUFCSSearchResultSet(..., MyResults results) {
    super(diagnostics);
    this.serverConfig = serverConfig;
    this.request = request;

    this.results = results;
    // cursor starts before the first record; nextRecord() advances it
    currentRecordCursor = -1;
    // ...
}

@Override
public int getTotalRecordCount() {
    return (int) results.getTotal();
}

@Override
public int getRecordCount() {
    return results.getResults().size();
}

@Override
public boolean nextRecord() throws SRUException {
    if (currentRecordCursor < (getRecordCount() - 1)) {
        currentRecordCursor++;
        return true;
    }
    return false;
}

@Override
public void writeRecord(XMLStreamWriter writer) throws XMLStreamException {
    MyResults.ResultEntry result = results.getResults().get(currentRecordCursor);

    XMLStreamWriterHelper.writeStartResource(writer, results.getPid(), null);
    XMLStreamWriterHelper.writeStartResourceFragment(writer, null, result.landingpage);
    // ... write the Data Views (e.g. HITS) for the current result
    XMLStreamWriterHelper.writeEndResourceFragment(writer);
    XMLStreamWriterHelper.writeEndResource(writer);
}
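
The same cursor pattern, sketched in plain Python without the SRU/FCS libraries. `ResultEntry`, `MyResults` and `ResultSetCursor` are hypothetical stand-ins that only mirror the names used above; the real `SRUSearchResultSet` classes have more required methods.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ResultEntry:
    text: str
    hits: List[Tuple[int, int]]  # (start, end) character offsets of each hit
    landingpage: str


@dataclass
class MyResults:
    pid: str                     # PID of the searched resource
    total: int                   # total number of matches in the search engine
    results: List[ResultEntry]   # only the entries of the requested page


class ResultSetCursor:
    """Walks over one page of results, one record at a time."""

    def __init__(self, results: MyResults):
        self.results = results
        self._cursor = -1        # before the first record

    def get_total_record_count(self) -> int:
        return self.results.total

    def get_record_count(self) -> int:
        return len(self.results.results)

    def next_record(self) -> bool:
        if self._cursor < self.get_record_count() - 1:
            self._cursor += 1
            return True
        return False

    def current(self) -> ResultEntry:
        if self._cursor < 0:
            raise IndexError("call next_record() before accessing a record")
        return self.results.results[self._cursor]
```

The server drives the loop: it calls `next_record()` until it returns `False` and serializes the current record after each successful call.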

Result Serialization

SRUXMLStreamWriter - Java, Python
- (internal), specifically for SRU “recordXmlEscaping”
XMLStreamWriterHelper - Java, Python (FCSRecordXMLStreamWriter)
- Boilerplate + help for writing Record, RecordFragment, Hits/Kwic Data View
AdvancedDataViewWriter - Java, Python
- Help with writing Advanced Data Views
- addSpans (content, layer, offset, hit?)
  
  writeHitsDataView, writeAdvancedDataView
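
To illustrate what the writer helpers produce, here is a minimal sketch that builds a HITS Data View from a text and its hit offsets, using only `xml.etree.ElementTree`. It is not the libraries' implementation; `build_hits_dataview` is a hypothetical helper, and the namespaces are the ones assumed from the FCS specification.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/resource"
HITS_NS = "http://clarin.eu/fcs/dataview/hits"
HITS_MIME_TYPE = "application/x-clarin-fcs-hits+xml"

ET.register_namespace("fcs", FCS_NS)
ET.register_namespace("hits", HITS_NS)


def build_hits_dataview(text, hits):
    """Wrap each (start, end) character range of `text` in a <hits:Hit>."""
    dataview = ET.Element(f"{{{FCS_NS}}}DataView", {"type": HITS_MIME_TYPE})
    result = ET.SubElement(dataview, f"{{{HITS_NS}}}Result")

    position = 0
    last_hit = None
    for start, end in sorted(hits):
        if start < position or start >= end or end > len(text):
            raise ValueError(f"invalid or overlapping hit range ({start}, {end})")
        leading = text[position:start]
        # text before the first hit belongs to the parent, later text is the
        # "tail" of the previous hit element (mixed content in ElementTree)
        if last_hit is None:
            result.text = leading
        else:
            last_hit.tail = leading
        last_hit = ET.SubElement(result, f"{{{HITS_NS}}}Hit")
        last_hit.text = text[start:end]
        position = end

    trailing = text[position:]
    if last_hit is None:
        result.text = trailing
    else:
        last_hit.tail = trailing
    return dataview
```

Serializing the element with `ET.tostring(..., encoding="unicode")` gives the XML fragment that would sit inside a Resource or ResourceFragment.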

Minimal Configuration – Endpoint Description

FCS Version: 2
Capabilities: BASIC Search
Data Views: HITS
Resources: (min: 1)
- Title
- Description
- LandingPage URL
- Languages → one language (ISO 639-3)

<?xml version="1.0" encoding="UTF-8"?>
<EndpointDescription xmlns="http://clarin.eu/fcs/endpoint-description"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://clarin.eu/fcs/endpoint-description ../../schema/Core_2/Endpoint-Description.xsd"
             version="2">
  <Capabilities>
    <Capability>http://clarin.eu/fcs/capability/basic-search</Capability>
  </Capabilities>
  <SupportedDataViews>
    <SupportedDataView id="hits" delivery-policy="send-by-default" >application/x-clarin-fcs-hits+xml</SupportedDataView>
  </SupportedDataViews>
  <Resources>
    <Resource pid="hdl:10794/sbkorpusar">
      <Title xml:lang="sv">Språkbankens korpusar</Title>
      <Title xml:lang="en">The Språkbanken corpora</Title>
      <Description xml:lang="sv">Korpusarna hos Språkbanken.</Description>
      <Description xml:lang="en">The corpora at Språkbanken.</Description>
      <LandingPageURI >https://spraakbanken.gu.se/resurser/corpus</LandingPageURI>
      <Languages>
        <Language>swe</Language>
      </Languages>
      <AvailableDataViews ref="hits"/>
    </Resource>
  </Resources>
</EndpointDescription>
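
A small sanity check for such a file can be sketched with the Python standard library. This is not a schema validation and not part of the FCS libraries; `check_minimal_endpoint_description` is a hypothetical helper that only tests the minimal items listed above.

```python
import xml.etree.ElementTree as ET

NS = {"ed": "http://clarin.eu/fcs/endpoint-description"}
BASIC_SEARCH = "http://clarin.eu/fcs/capability/basic-search"
HITS_MIME_TYPE = "application/x-clarin-fcs-hits+xml"


def check_minimal_endpoint_description(xml_data):
    """Return a list of problems; an empty list means the minimum is met."""
    root = ET.fromstring(xml_data)
    problems = []

    if root.get("version") != "2":
        problems.append("version attribute should be 2")

    capabilities = [(c.text or "").strip()
                    for c in root.findall("ed:Capabilities/ed:Capability", NS)]
    if BASIC_SEARCH not in capabilities:
        problems.append("basic-search capability missing")

    dataviews = [(d.text or "").strip()
                 for d in root.findall(
                     "ed:SupportedDataViews/ed:SupportedDataView", NS)]
    if HITS_MIME_TYPE not in dataviews:
        problems.append("HITS Data View missing")

    resources = root.findall("ed:Resources/ed:Resource", NS)
    if not resources:
        problems.append("at least one Resource is required")
    for resource in resources:
        pid = resource.get("pid", "<no pid>")
        for tag in ("Title", "Description", "LandingPageURI"):
            if resource.find(f"ed:{tag}", NS) is None:
                problems.append(f"{pid}: {tag} missing")
        if not resource.findall("ed:Languages/ed:Language", NS):
            problems.append(f"{pid}: no Language given")
    return problems
```

Running it over the embedded `endpoint-description.xml` during development catches missing elements before the endpoint is deployed; a full validation against the XSD is still advisable.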

Minimal Configuration – SRU

SRU Server Configurations → Endpoint Configurations (sru-server-config.xml)
- databaseInfo with general information about endpoint
- default: indexInfo + schemaInfo
- required: serverInfo > database (host and port by default)
Web server configuration
- Optional adjustment of SRU / FCS parameters
- Java: web.xml
- Python: key-value dictionary

default: indexInfo + schemaInfo → copy/paste from template/existing endpoints, configuration remains largely the same here

FCS Endpoint Deployment (Java)

Using Maven (!) / pom.xml
- <packaging>war</packaging>
- Build Plugin:
  - org.apache.maven.plugins:maven-war-plugin[:2.6] (?)
  - org.apache.maven.plugins:maven-compiler-plugin
Create WAR artifact
- mvn clean compile war:war
- mvn clean package (also run tests etc.)
Deploy with Java Servlet Engine / HTTP server like Apache Tomcat / Eclipse Jetty / …

TODO: Check if maven-war-plugin is no longer necessary?

FCS Endpoint Deployment (Python)

“make_app()” method

→ provides configured WSGI SRUServerApp (Python)
Deployment suggestion: gunicorn (Python WSGI HTTP server)
Example: fcs-korp-endpoint-python
- as module with werkzeug test server
  
  python3 -m korp_endpoint
- gunicorn in Docker Container (Dockerfile)
  
  gunicorn 'korp_endpoint.app:make_gunicorn_app()'
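
The `'module:function()'` form tells gunicorn to call a factory and serve the WSGI application it returns. A minimal, self-contained sketch of that pattern follows; this `make_app` is a hypothetical stand-in that returns a trivial WSGI callable, whereas the real factory returns the configured SRUServerApp.

```python
def make_app(config=None):
    """Hypothetical application factory returning a WSGI callable."""
    settings = dict(config or {})
    name = settings.get("name", "demo-endpoint")

    def app(environ, start_response):
        # A real endpoint would hand the request to the SRU server here.
        query_string = environ.get("QUERY_STRING", "")
        body = f"{name} received: {query_string}".encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app
```

Saved as a module, for example `my_endpoint.py` (an assumed file name), it could be served with `gunicorn 'my_endpoint:make_app()'`. Keeping the setup inside a factory means configuration is read when the server starts, not when the module is imported.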

Embedded FCS Endpoint (Python)

Tested only with Python as WSGI app in Flask

→ in kosh: PR, commit
Idea:
- Create SRUServer with SRUSearchEngine (global)
- Forward requests (filtered by path) to SRUServer

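# requires: from flask import Flask, Response, request
def init(self, flask: Flask) -> None:
    self.server = self.build_fcs_server()
    # Flask URL rules must start with a leading slash
    flask.add_url_rule("/some-path/fcs", "some-path/fcs", self.handle)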

def build_fcs_server(self) -> SRUServer:
    params = self.build_fcs_server_params()
    config = self.build_fcs_server_config(params)
    qpr_builder = SRUQueryParserRegistry.Builder(True)
    search_engine = KoshFCSEndpointSearchEngine(
        endpoint_description=self.build_fcs_endpointdescription(),
        # ... other parameters
    )
    search_engine.init(config, qpr_builder, params)
    return SRUServer(config, qpr_builder.build(), search_engine)

def handle(self) -> Response:
    LOGGER.debug("request: %s", request)  # Flask/Werkzeug Request
    LOGGER.debug("request?args: %s", request.args)
    response = Response()                 # Flask/Werkzeug Response
    self.server.handle_request(request, response)
    return response

FCS Endpoint – Extensibility

Supports own query languages, Data Views etc.
Example: LexFCS (FCS extension for lexical resources)

→ i.e. new query language and Data View

LexCQL - query language (CQL dialect)
- SRUQueryParser (Java, Python), based on CQLQueryParser (Java, Python)
  
  → LexCQLQueryParser with LexCQLQuery
- SimpleEndpointSearchEngineBase.doInit() (Java, Python)
  
  → queryParsersBuilder.register(new LexCQLQueryParser());
LexHITS - HITS Data View extension
- in SRUSearchResultSet.writeRecord (Java, Python) appropriate XML result serialization
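
The registration step can be pictured as a lookup table from `queryType` to parser. The sketch below is a plain-Python illustration of that idea and not the SRU library API: all class names are hypothetical stand-ins, the parsers do no real CQL parsing, and the `queryType` value `lex` for LexCQL is an assumption.

```python
class ParsedQuery:
    def __init__(self, query_type, raw_query):
        self.query_type = query_type
        self.raw_query = raw_query


class SimpleParser:
    """Stand-in parser: only trims the query and rejects empty input."""

    def __init__(self, query_type):
        self.query_type = query_type

    def parse(self, raw_query):
        raw_query = raw_query.strip()
        if not raw_query:
            raise ValueError("empty query")
        return ParsedQuery(self.query_type, raw_query)


class QueryParserRegistryBuilder:
    """Collects parsers, keyed by the queryType they are responsible for."""

    def __init__(self, register_defaults=True):
        self._parsers = {}
        if register_defaults:
            self.register(SimpleParser("cql"))  # default query language

    def register(self, parser):
        if parser.query_type in self._parsers:
            raise ValueError(f"queryType {parser.query_type!r} already registered")
        self._parsers[parser.query_type] = parser
        return self

    def build(self):
        return dict(self._parsers)


def parse_query(registry, query_type, raw_query):
    parser = registry.get(query_type)
    if parser is None:
        raise LookupError(f"no parser for queryType {query_type!r}")
    return parser.parse(raw_query)
```

An extension then only registers its parser during initialization, e.g. `QueryParserRegistryBuilder().register(SimpleParser("lex")).build()`, and the request handling can dispatch on `queryType` without knowing the extension.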

Deployment

Deployment instructions for FCS Endpoint Tester/Validator, FCS SRU Aggregator and FCS Korp Endpoint

FCS Endpoint Protocol Conformance Tester

NOTE: This section covers the legacy FCS endpoint tester. For the updated and rewritten validator, see the section “FCS Endpoint Validator”.
Web app for testing endpoints for compliance with the FCS specification

Code: github.com/clarin-eric/fcs-endpoint-tester
Deployment: clarin.ids-mannheim.de/srutest
Java 8; Vaadin 7.7.15 (UI)

The installation uses SNAPSHOT versions of the SRU/FCS libraries and relies on functions that are normally reserved (not part of the regular releases) to validate the SRU/FCS protocols

FCS Endpoint Conformance Tester – Deployment

SRU/FCS SNAPSHOT libraries must be installed directly from Git

$ git clone https://github.com/clarin-eric/fcs-sru-client.git && cd fcs-sru-client
$ mvn install
$ cd ..
$ git clone https://github.com/clarin-eric/fcs-simple-client.git && cd fcs-simple-client
$ mvn install
$ cd ..

Build with Maven

$ git clone https://github.com/clarin-eric/fcs-endpoint-tester.git && cd fcs-endpoint-tester
$ mvn clean package

Deployment with Jetty on http://localhost:8080/

$ JETTY_VERSION="9.4.51.v20230217"
$ wget https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-distribution/${JETTY_VERSION}/jetty-distribution-${JETTY_VERSION}.zip && unzip jetty-distribution-${JETTY_VERSION}.zip && rm jetty-distribution-${JETTY_VERSION}.zip
$ cd jetty-distribution-${JETTY_VERSION}/
$ java -jar start.jar --add-to-start=http,deploy
$ cd webapps/ && cp ../../target/FCSEndpointTester-X.Y.Z-SNAPSHOT.war ROOT.war && cd ..
$ java -jar start.jar

FCS Endpoint Conformance Tester – Deployment (Docker)

Create Docker Image

$ git clone https://github.com/clarin-eric/fcs-endpoint-tester.git && cd fcs-endpoint-tester
$ docker build -t fcs-endpoint-tester .

Run Container

$ docker run --rm -it -p 8080:8080 fcs-endpoint-tester

FCS Endpoint Validator

This is an updated and completely rewritten SRU/FCS Endpoint Validator based on the FCS Endpoint Protocol Conformance Tester. In addition to more test cases, it allows inspecting HTTP requests/responses and storing validation results.
Web app for testing FCS endpoints for compliance with the SRU/FCS specification

Code: github.com/saw-leipzig/fcs-endpoint-validator
Deployment: fcs-validator.data.saw-leipzig.de
Multi-module maven project
- (standalone) JUnit5 test runner with test cases, Java 11
- Vaadin 24 UI with SpringBoot, Java 17

FCS Endpoint Validator – Deployment

Build with Maven

$ git clone https://github.com/saw-leipzig/fcs-endpoint-validator.git && cd fcs-endpoint-validator
$ mvn clean package install

Deployment with SpringBoot on http://localhost:8080/ (might automatically open a new browser tab)

$ cd fcs-endpoint-validator-ui/
$ mvn spring-boot:run

FCS Endpoint Validator – Deployment (Docker)

Download sources:

$ git clone https://github.com/saw-leipzig/fcs-endpoint-validator.git && cd fcs-endpoint-validator

Create docker-compose.yml deployment description:

version: '3'

services:
  fcs-endpoint-validator:
    build:
      context: .
      dockerfile: fcs-endpoint-validator-ui/Dockerfile
    container_name: fcs-endpoint-validator
    ports:
      # default, public 8080 to docker container 8080
      - 8080:8080
    restart: unless-stopped

Run Docker-Compose deployment:

$ docker compose build
$ docker compose down -v
$ docker compose up -d

FCS SRU Aggregator

Primary FCS client application
Central search interface for users; “aggregates” FCS search queries to/from distributed endpoints

Code: github.com/clarin-eric/fcs-sru-aggregator
Deployments:
- CLARIN: contentsearch.clarin.eu + (Alpha / Beta instances)
- Text+: fcs.text-plus.org (Alpha instance)
Registry of endpoints in Centre Registry + side loading
Deployment instructions found in the repo in DEPLOYMENT.md

FCS SRU Aggregator – Deployment

Build application (native)

$ git clone https://github.com/clarin-eric/fcs-sru-aggregator.git && cd fcs-sru-aggregator
$ ./build.sh --jar

Configuration (endpoint sideloading + logging) in aggregator_devel.yml (aggregator.yml for production deployment)

aggregatorParams → additionalFCSEndpoints
logging → loggers

Running on http://localhost:4019/

$ ./build.sh --run

FCS SRU Aggregator – Deployment (Docker)

Create Docker Image

$ git clone https://github.com/clarin-eric/fcs-sru-aggregator.git && cd fcs-sru-aggregator
$ docker build --tag=fcs-aggregator .

Run Docker Container

$ touch fcsAggregatorResources.json fcsAggregatorResources.backup.json
$ docker run -d --restart unless-stopped \
    -p 4019:4019 -p 5005:5005 \
    -v $(pwd)/aggregator.yml:/work/aggregator.yml:ro \
    -v $(pwd)/fcsAggregatorResources.json:/var/lib/aggregator/fcsAggregatorResources.json \
    -v $(pwd)/fcsAggregatorResources.backup.json:/var/lib/aggregator/fcsAggregatorResources.backup.json \
    fcs-aggregator

FCS Korp Endpoint

Reference endpoint for Korp corpus search engine
Example → Korp-API publicly accessible, no further configuration required for testing

Code:
- Java: github.com/clarin-eric/fcs-korp-endpoint
- Python: github.com/Querela/fcs-korp-endpoint-python
Deployment(s):
- Språkbanken (Göteborg): https://spraakbanken.gu.se/ws/fcs/2.0/endpoint/korp/sru
- CLARIN-DK-UCPH (Copenhagen S): https://alf.hum.ku.dk/korp/fcs/2.0/endpoint/sru
- …

FCS Korp Endpoint – Deployment (Java)

Build Application

$ git clone https://github.com/clarin-eric/fcs-korp-endpoint.git && cd fcs-korp-endpoint
$ mvn clean compile war:war

Then deploy with Jetty/Tomcat etc., analogous to the FCS Endpoint Tester

FCS Korp Endpoint – Deployment (Python)

Prepare Deployment

$ git clone https://github.com/Querela/fcs-korp-endpoint-python.git && cd fcs-korp-endpoint-python
$ python3 -m venv venv && source venv/bin/activate
$ python3 -m pip install -e .

Test Deployment (http://localhost:8080)

$ python3 -m korp_endpoint

Production deployment with Docker (http://localhost:5000)

$ docker build --progress=plain -t korpy .
$ docker run --rm -it -p 5000:5000 korpy
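
Once the endpoint is running, a quick smoke test is to request the Endpoint Description and a simple search. The sketch below only builds the request URLs. It assumes SRU 2.0 style requests (no `operation` parameter: a request without `query` is treated as explain), and the base URL and its path are assumptions that depend on how the endpoint is mounted. Both function names are hypothetical.

```python
from urllib.parse import urlencode


def build_explain_url(base_url, with_endpoint_description=True):
    """URL for an explain request, optionally asking for the Endpoint Description."""
    params = {}
    if with_endpoint_description:
        params["x-fcs-endpoint-description"] = "true"
    return f"{base_url}?{urlencode(params)}" if params else base_url


def build_search_url(base_url, query, pid=None, maximum_records=10):
    """URL for a basic (CQL) search, optionally restricted to one resource."""
    params = {
        "query": query,
        "queryType": "cql",
        "maximumRecords": str(maximum_records),
    }
    if pid is not None:
        params["x-fcs-context"] = pid  # restrict the search to this resource
    return f"{base_url}?{urlencode(params)}"
```

The resulting URLs can be opened in a browser or fetched with any HTTP client, e.g. `build_explain_url("http://localhost:5000")` for the Docker deployment above.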

Deployment Notes

When using Docker and localhost, network configurations may need to be adjusted so that the Docker container has access to the host
- → host.docker.internal

Resources

Publications

As part of Text+ → see Zotero.org tagged “FCS”
Listing in Text+ Awesome FCS list
In the context of CLARIN? → see Zotero.org “Federated Content Search” group
- CLARIN Federated Content Search (CLARIN-FCS) – Core Specification, 2014, Oliver Schonefeld et al.
- Federated Search: Towards a Common Search Infrastructure, 2012, Herman Stehouwer et al.
- Several workshops

1. OASIS: Organization for the Advancement of Structured Information Standards
