Search Infrastructure in CLARIN and Text+
Central client (search result “Aggregator” and web portal)
Decentralized endpoints at the data centers (local search engines on resources)
Erik Körner <koerner@saw-leipzig.de>
“Federated Content Search” at CLARIN
In short: Content Search over Distributed Resources
Also: Federated “Corpus Query Platform”
Search for patterns in distributed text collections
No central index!
Text resources include annotated corpora, full-texts etc.
FCS = interface specification, search infrastructure and software ecosystem
Usage of established standards and extensibility!
Interface Specification
Description of search protocol (query languages, formats and communication channels)
“for homogeneous access to heterogeneous search engines”
RESTful protocol
Search Infrastructure in CLARIN and Text+
Central client (search result “Aggregator” and web portal)
Decentralized endpoints at the data centers (local search engines on resources)
Software Ecosystem primarily in Java
Libraries (Java, Python, …)
Tools (Validator, Aggregator, Registry)
(Own) text resources
“Search engine” on those text resources
Minimum: full-text search
Deployment of publicly accessible FCS endpoint(s)
Pros
Integration of many resources, linking and comparison of results
Integration with other tools (Weblicht, Registry/VLO, Switchboard, …)
Same queries, formats, result presentation
No duplicate data storage or inconsistency
Cons
No control over resources
No deterministic results (e.g. links for publications)
No global ranking of results possible
Pros
Control over resources and search (ranking, fuzzy, …)
No duplication of data in a central index
Increased visibility in a larger resource catalog
Cons
Deployment of (additional) endpoint necessary
Aspect | Federated search | Central index |
---|---|---|
Data | ➕ At the endpoints | ➖ Duplicate data storage, possible inconsistency (age, updates); legally no transfer may be possible |
Updates to Data | ➕ Endpoints can react quickly | ➖ Difficult, e.g. removal of resources in the event of legal problems; updates entail longer delays, if at all possible |
Global Ranking | ➖ Very difficult/impossible | ➕ Quite possible (?), probably implicit assumption and normalization of data for indexing |
Faceted Search | ➖ Difficult (e.g. via external metadata; not explicitly intended) | ➕ Indexing allows clustering/classification according to topics and categories |
~ 2011 Started as Working Group in CLARIN
May 2011 EDC/FCS Workshop
~ 2011–2013 Initial version, now named FCS “Legacy”
SRU Scan for resources, BASIC Search (CQL/full-text), KWIC
April 2013 FCS Workshop
~ 2013/2014 Code and Spec for FCS Core 1.0
fcs-simple-endpoint:1.0.0, sru-server:1.5.0
BASIC Hits Data View, SRU Scan operation not used anymore
~ 2015/2016 Starting work on and Code for FCS Core 2.0
fcs-simple-endpoint:1.3.0, sru-server:1.8.0
Advanced Data Views (FCS-QL), …
June 2017 Official release of FCS Core 2.0 Spec
2022 FCS is focus in Text+ (Findability)
2023 New FCS maintainer in CLARIN
Migration of Source Code to GitHub.com, updated documentation
Python FCS endpoint libraries
Updated libraries & tools
Prototypes for LexFCS extension
2024
Experiments with Entity Search (extension)
Rewrite of FCS Endpoint Validator
SRU (Search/Retrieval via URL) / OASIS searchRetrieve
Standardized by Library of Congress (LoC) / OASIS
RESTful
Explain: Listing of resources
Languages, annotations, supported data views and formats etc.
SearchRetrieve: Search request
Data as XML
Extensions to the protocol explicitly allowed
different (optional) annotation layers
Full-text | The | cyclists | are | fast |
---|---|---|---|---|
Part of Speech | DET | NOUN | VERB | ADJ |
Lemmatisation | The | cyclist | is | fast |
Phonetic Transcription | … | … | … | … |
Orthographic Transcription | … | … | … | … |
[…] |
Based on CQP
Supports various annotation layers
Current version of the specification: FCS Core 2.0
Poster at Bazaar @ CLARIN2023 on the current status
😎 “Awesome FCS” List: github.com/clarin-eric/awesome-fcs with relevant links to specs, tools, libraries, implementations and much more
Additions by Text+ (e.g. on LexFCS/LexCQL/Forks/Software): gitlab.gwdg.de/textplus/ag-fcs-documents/-/blob/main/awesome-fcs.md
CLARIN specifications: github.com/clarin-eric/fcs-misc
Small ecosystem (Code on Github/Gitlab)
Software libraries (SRU/FCS, endpoint + client, Java/Python)
Aggregator (Code: Github, Text+ Fork)
Online Validator for endpoints (fcsvalidator, Code: Github (old), Github (new))
Endpoint Registry: centres.clarin.eu/fcs
Lexical Resources extension
First specification and implementation in Text+
Official extension of CLARIN → ~2024 Working Plan
AAI integration
Specification and implementation
Goal: Support access-restricted resources
Securing the aggregator via Shibboleth → Passing on AAI attributes to endpoints
Preliminary work from CLARIAH-DE, part of the Text+ work plan (IDS Mannheim, Uni/SAW Leipzig, preliminary work BBAW)
Syntactic Search
Entity Search
Optional metadata for each result
CLARIN-EU Taskforce
CLARIN ERIC working plan: “extending the protocol to cover additional data types (e.g. lexica) will be explored”
on the CLARIN 2024 Working Plan
Interest expressed from various countries
Preliminary work: “RABA” (Estonia): e.g. “Eesti Wordnet”
First specification and implementation in Text+
Specification on Zenodo: zenodo.org/records/7849754
Presentation at eLex 2023: “A Federated Search and Retrieval Platform for Lexical Resources in Text+ and CLARIN”
Aggregator: fcs.text-plus.org/?queryType=lex
CLARIN (contentsearch.clarin.eu, Registry)
209 Resources (94 in Advanced)
in 61 Languages
from 20 Institutions in 12 Countries
Text+ (fcs.text-plus.org)
53 Resources (17 in Advanced, 30 in Lexical)
in 6 Languages
from 9 Institutions in Germany
CLARIN
Alpha/Beta using Side-Loading in Aggregator
Stable/Long-Term: Entry in Centre Registry
CLARIN Account + form as a Centre
Including monitoring etc.
Text+
Side-Loading in Aggregator
WIP: Registry (index of endpoints)
Development of an alternative aggregator frontend as Web Component
Code: Vue.js Store + Vuetify Component (Dialog); Demo
Use of the Aggregator API
Restriction to subset of resources, e.g. for integration on own website
Faceting, alternative visualization
Java: Maven Archetype github.com/clarin-eric/fcs-endpoint-archetype
Java & Python (reference implementation Korp):
😎 “Awesome FCS” List: github.com/clarin-eric/awesome-fcs
List of reference implementations, endpoints, query parsers
Code for FCS SRU Aggregator and SRU/FCS Validator
❓ Can I host the endpoint myself?
❗ No → HelpDesk: CLARIN, Text+
❓ What type of data do I have?
❗ Raw text, Vertical/CONLL, TEI, …
❓ Which search engine do I use / can I use?
❗ KorAP, Korp/CWB, Lucene/Solr/ElasticSearch, BlackLab, (No)SketchEngine, …
❓ Customization or new development?
❗ List of existing endpoint implementations (Awesome List)
❓ Programming language?
❗ Java, Python, (PHP, XQuery)
❓ In-house development: Use of the reference libraries (Java, Python)
❗ Maven Archetype, Korp
❗ SRU + FCS specifications …
Korp FCS 2.0 - reference implementation, Korp corpus search
CQP/SRU bridge - Corpus Workbench (CWB)
KonText, fcs-noske-endpoint - (No)SketchEngine (CONLL/Vertical)
oclcsrw - SRW/SRU server for DSpace, Lucene and/or Pears/Newton
corpus_shell, SADE - MySQL PHP/DDC Perl, eXist/XQuery
arche-fcs - ARCHE Suite, php
Blacklab / MTAS - corpus search engines using Lucene/Solr
KorapSRU - KorAP (IDS)
Sources: clarin, awesome-fcs
Customization of reference implementation (Korp)
Development using CLARIN SRU/FCS libraries
“New” development specifications (for other languages)
FCS: github.com/clarin-eric/fcs-misc → “FCS Core 2.0”
Awesome List: github.com/clarin-eric/awesome-fcs
❗ Full-text search
❓ With Hit markers
❓ Corpus search (segmented text with annotations)
❕ Pagination (total number of hits)
❗ Resource PID
❓ Linking to result pages
Main focus on:
Version: FCS Core 2.0; maximum compatibility with FCS Core 1.0
SRU Server, FCS endpoint; not FCS client application development
Using the reference libraries
→ Java and Python
Possible (re-)use of existing endpoints
No:
Working through the specification; only the essential information
New or redevelopment of SRU/FCS protocols, libraries etc.
(e.g. in other languages)
SRU: Search/Retrieve via URL → LOC
Part 0. Overview Version 1.0
Part 1. Abstract Protocol Definition Version 1.0
Part 2. SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0
Part 3. SRU searchRetrieve Operation: APD Binding for SRU 2.0 Version 1.0
Part 4. APD Binding for OpenSearch 1.0 version 1.0
Part 5. CQL: The Contextual Query Language version 1.0
Part 6. SRU Scan Operation version 1.0
Part 7. SRU Explain Operation version 1.0
SRU (SRU: Search/Retrieve via URL) is a web service protocol supported over both SOAP and REST for client-server based search. SRU1.x was developed as a web service replacement for the NISO Z39.50 protocol. SRU2.0 is a revision to SRU which as well as including many enhancements to SRU1.2 was developed alongside the APD.
For the SRU protocol model, three operations are defined as part of its Processing Model:
SearchRetrieve Operation. The actual SearchRetrieve operation defined by the SRU protocol; A SearchRetrieve operation consists of a SearchRetrieve request from client to server followed by a SearchRetrieve response from server to client.
Scan Operation. Similar to SRU, the Scan protocol defines a request message and a response message for iterating through available search terms. A Scan operation consists of a Scan request followed by a Scan response.
Explain Operation. Every SRU or scan server provides an associated Explain document as part of its Description and Discovery Model, providing information about the server’s capabilities. A client may retrieve this document and use the information to self-configure and provide an appropriate interface to the user. When a client retrieves an Explain document, this constitutes an Explain operation.
Abstract Protocol Definition (APD) for “searchRetrieve operation”
Model for SearchRetrieve Operation
Describes Capabilities and General Characteristics of a Server or Search Engine, as well as how access should take place
Defines abstract Request parameters and Response elements
Binding
Describes corresponding names of the parameters and elements
static (for human), dynamic (for machine), …
Bindings: SRU 1.2, SRU 2.0, (OpenSearch)
Examples: “startPosition” (APD) → “startRecord” (SRU 2.0)
“recordPacking” (SRU 1.2) → “recordXMLEscaping” (SRU 2.0)
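The idea of a binding as a renaming of parameters can be illustrated with a small lookup table. This is only a sketch: the table contains just the one SRU 1.2 → SRU 2.0 pair named above, and the helper name `to_sru20` is hypothetical, not part of any SRU library.

```python
# Parameter names that differ between the SRU 1.2 and SRU 2.0 bindings.
# Only the pair mentioned in the text is listed here (assumption: the
# table would be extended for real use).
SRU12_TO_SRU20 = {"recordPacking": "recordXMLEscaping"}


def to_sru20(params: dict) -> dict:
    """Return a copy of the parameters using SRU 2.0 names."""
    return {SRU12_TO_SRU20.get(name, name): value for name, value in params.items()}


converted = to_sru20({"query": "cat", "recordPacking": "xml"})
print(converted)
```

Parameters without an entry in the table keep their name, so the function can be applied to any request without losing information.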
Data Model
Description of the data on which the search is to be executed
Query Model
Description of the construction of search queries
Processing Model
Description of how query is sent from client to server
Result Set Model
Structure of the results of a search
Diagnostics Model
Description of how errors are communicated from the server to the client
Description and Discovery Model
Description, for the discovery of the “Search Service”, self-description of the functionality of the service
Request / response flow (Client ↔ Server)
HTTP GET/POST with set of parameters (extensible)
Request processing on the server
Operations: searchRetrieve, scan, explain
Data model with result sets, records and associated schemas
Diagnostics: (non-)fatal for warnings and errors
No fixed serialization, XML for FCS
SRU Request (Client → Server) with Response (Server → Client)
Operations
SearchRetrieve
Scan
Explain
Server = Database for Client for search/retrieval
Database = Collection of Units of Data → Abstract Record
Abstract Records (or Response Records) in one/multiple formats by server
Format (or Item Type) = Record Schema
HTTP GET
Parameters encoded as “key=value”
UTF-8, %-escaping
Separation at “?”, “&”, “=”
HTTP POST
application/x-www-form-urlencoded
No character encoding necessary
No length restriction
HTTP SOAP (?)
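The GET encoding rules can be tried out with the Python standard library. The sketch below assumes a placeholder base URL (`https://sru.example.org/sru`) and a hypothetical helper `build_sru_url`; the percent-escaping itself is done by `urllib.parse`.

```python
from urllib.parse import quote, urlencode


def build_sru_url(base_url: str, **params: str) -> str:
    """Encode parameters as key=value pairs, joined by '&', after a '?'."""
    # quote_via=quote percent-escapes the UTF-8 bytes of each value
    # (and uses %20 instead of '+' for spaces).
    return base_url + "?" + urlencode(params, quote_via=quote)


url = build_sru_url(
    "https://sru.example.org/sru",  # placeholder endpoint, not a real service
    operation="searchRetrieve",
    queryType="cql",
    query='"användning"',
)
print(url)
```

The quotation marks and the non-ASCII character in the query value are escaped as `%22` and `%C3%A4`, which is the UTF-8 percent-escaping described above.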
“Request processing on the server”
Request
Number of records
Identifier for Record Schema (→ Records in Response)
Identifier for Response Schema (→ whole Response)
Response
Records in Result Set
Diagnostic Information
Result Set Identifier for requests for further results
Any “appropriate query language” can be used
Mandatory support of
“Contextual Query Language” (CQL)
Use of Parameters, some predefined by SRU 2.0
Parameters not defined in the protocol are also permitted
Parameter “query”
included in every query in some manner (“query” or by parameters not defined in the protocol)
Query with “queryType” (default “cql”)
Logical model → “Result Sets” are not mandatory
Query → Selection of suitable Records
Ordered list, non-modifiable set after creation
Sorting/order determined by server
for Client:
Set of abstract Records, counting starts with 1
Each record can be requested in its own format
Individual records can “disappear”, no reordering in the Result Set by the Server, but Diagnostic to inform
fatal
Execution of the query cannot be completed
e.g. invalid query
non-fatal
Processing impaired, but request can be completed
e.g. individual records are not available in the requested schema, server only sends the available ones and informs about the rest
surrogate
For single Records
non-surrogate
All records are available, but something went wrong, e.g. sorting
Or simply a warning
Must be available for HTTP GET via the base URL of the SRU server
→ Server Capabilities
In the client for self-configuration and to provide the corresponding user interface
Details on supported Query Types, CQL Context Sets, Diagnostic Sets, Records Schemas, Sorting options, defaults, …
No restriction on the serialization of responses
(for the entire message or single records)
Non-XML serialization is allowed
All parameters are optional, non-repeatable
query, startRecord, maximumRecords, recordXMLEscaping, recordSchema, resultSetTTL, stylesheet; Extension parameters
New in 2.0: queryType, sortKeys, renderedBy, httpAccept, responseType, recordPacking; Facet Parameters
All elements are optional, non-repeatable by default
numberOfRecords, resultSetId, records, nextRecordPosition, echoedSearchRetrieveRequest, diagnostics, extraResponseDataⓇ
New in 2.0: resultSetTTL, resultCountPrecisionⓇ, facetedResultsⓇ, searchResultAnalysisⓇ
(Ⓡ = repeatable)
query (Parameter)
Query
Mandatory if queryType is not specified
queryType (Parameter, SRU 2.0)
Optional, by default “cql”
Query Types must be listed in the Explain, with URL for definition and usage abbreviation
Reserved
cql
searchTerms (processing is left to the server, < SRU 2.0)
spraakbanken.gu.se/…/sru?query=cat (default, FCS 2.0, SRU 2.0)
spraakbanken.gu.se/…/sru?operation=searchRetrieve&version=1.2&query=cat (FCS 1.0, SRU 1.2)
spraakbanken.gu.se/…/sru?operation=searchRetrieve&queryType=cql&query=%22anv%C3%A4ndning%22 (FCS 2.0, SRU 2.0)
(FCS 2.0 with FCS-QL Query)
Query for the result range starting at startRecord with at most maximumRecords records
startRecord (Parameter)
Optional, positive integer, starting with 1
maximumRecords (Parameter)
Optional, non-negative integer
Server selects default if not specified
Server can respond with fewer records, never more
Response with total number (numberOfRecords) of records in the Result Set, with offset (nextRecordPosition) to next results
numberOfRecords (Element)
Number of results in the Result Set
If query fails, it must be “0”
nextRecordPosition (Element)
Position of the next record, if the last record in the response is not the last in the Result Set
If no further records, then this element must not appear
resultSetId (Element)
Optional, identifier for the Result Set, for referencing in the subsequent requests
resultSetTTL (Parameter / Element, Element in SRU 2.0 only)
Optional, in seconds
In request from Client, after how long the Result Set is no longer used
In response from Server, how long Result Set is available (“good-faith estimate”, → can be longer or shorter)
resultCountPrecision (Element, SRU 2.0)
URI: “info:srw/vocabulary/resultCountPrecision/1/…”
exact / unknown / estimate / maximum / minimum / current
spraakbanken.gu.se/…/sru?query=cat
→ 9220 results, next results starting from 251
spraakbanken.gu.se/…/sru?query=cat&startRecord=300&maximumRecords=10
→ More from 310
spraakbanken.gu.se/…/sru?query=cat&startRecord=10000&maximumRecords=10
→ Error, because “out of range”
spraakbanken.gu.se/…/sru?query=catsss
→ No results
spraakbanken.gu.se/…/sru?query=cat&maximumRecords=100000
→ Restricted to 1000 Records
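The paging behaviour seen in these example requests can be reproduced with a few lines of arithmetic. This is a sketch under assumptions: the default page size of 250 and the server limit of 1000 are read off the example responses above rather than taken from the specification, and `page` and `FirstRecordOutOfRange` are hypothetical names.

```python
from typing import Optional, Tuple

DEFAULT_MAXIMUM_RECORDS = 250  # assumption: matches "next results starting from 251"
SERVER_LIMIT = 1000            # assumption: matches "Restricted to 1000 Records"


class FirstRecordOutOfRange(Exception):
    """Stands in for SRU diagnostic 61 (first record position out of range)."""


def page(number_of_records: int, start_record: int = 1,
         maximum_records: Optional[int] = None) -> Tuple[range, Optional[int]]:
    """Return the record positions of one response and nextRecordPosition."""
    if start_record < 1 or (maximum_records is not None and maximum_records < 0):
        raise ValueError("startRecord must be >= 1, maximumRecords >= 0")
    if number_of_records == 0:
        return range(0), None  # empty Result Set, no next position
    if start_record > number_of_records:
        raise FirstRecordOutOfRange(start_record)
    size = DEFAULT_MAXIMUM_RECORDS if maximum_records is None else min(maximum_records, SERVER_LIMIT)
    last = min(start_record + size - 1, number_of_records)
    # nextRecordPosition is omitted (None) once the last record was delivered.
    next_position = last + 1 if last < number_of_records else None
    return range(start_record, last + 1), next_position


positions, next_position = page(9220, start_record=300, maximum_records=10)
print(list(positions), next_position)
```

Record positions are 1-based, so a request for 10 records starting at 300 covers positions 300 to 309 and continues at 310.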
recordXMLEscaping (Parameter, SRU 2.0)
If records are serialized as XML: with “string”, the Records are escaped (“<”, “>”, “&”); the default “xml” embeds the Records directly in the Response, e.g. for Stylesheets
recordPacking (Parameter, SRU 2.0)
In SRU 1.2 used to have the semantics of recordXMLEscaping
“packed” (default), Server should deliver Records with requested schema; “unpacked”, Server can determine the location of the application data in the Records itself (?)
httpAccept (Parameter, SRU 2.0)
Schema for Response, default is “application/sru+xml”
responseType (Parameter)
Schema for Response (in combination with httpAccept parameter)
recordSchema (Parameter)
Schema of the Records in Response, e.g. “http://clarin.eu/fcs/resource”
Identifier for schema from Explain Response
records (Element)
Contains Records / Surrogate Diagnostics
According to default Schema a list of “<record>” elements
stylesheet (Parameter)
URL to stylesheet, for display to the user
renderedBy (Parameter, SRU 2.0)
Where the stylesheet for the Response is rendered
“client” (default), URL of stylesheet parameter is simply echoed → “thin client” (in Web Browser)
“server”, should transform default SRU response with stylesheet (e.g. for httpAccept with HTML format)
spraakbanken.gu.se/…/sru?query=cat&recordXMLEscaping=string
→ Possible serialization error in Java library
(FCS 1.0, SRU 1.2, like recordXMLEscaping)
spraakbanken.gu.se/…/sru?query=cat&recordPacking=unpacked
→ No noticeable change here
…
Sorting (sortKeys) and Faceting not supported
Extensions possible in
Request via Extension Parameter (prefixed with “x-” and namespace identifier, e.g. “x-fcs-”)
Response in the “<extraResponseData>” Element
Response with extraResponseData, only if requested in Request with corresponding parameter, never unsolicited
Server can ignore the request, no obligation
Unknown extension parameters are to be ignored
Parameters “operation” and “version” only in SRU 1.1/SRU 1.2, removed in SRU 2.0 → Assumption of a separate endpoint for each SRU version
Heuristic for detecting the SRU version
searchRetrieve = query or queryType parameter
scan = scanClause parameter
explain = otherwise
Interoperability with older versions:
Use of operation/version parameters → SRU < 2.0
Caution with parameters with changed semantics
especially recordPacking
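The detection heuristic can be written down as two small functions. This is a sketch, not the code of the reference libraries: the function names are hypothetical, and falling back to version “1.2” when only `operation` is present is an assumption made for illustration.

```python
def detect_operation(params: dict) -> str:
    """Guess the SRU operation from the request parameters."""
    if "operation" in params:  # explicit parameter, SRU 1.1/1.2 style
        return params["operation"]
    if "query" in params or "queryType" in params:
        return "searchRetrieve"
    if "scanClause" in params:
        return "scan"
    return "explain"  # nothing else matched


def detect_sru_version(params: dict, default: str = "2.0") -> str:
    """Guess the SRU version; operation/version parameters imply SRU < 2.0."""
    if "version" in params:
        return params["version"]
    if "operation" in params:
        return "1.2"  # assumption: treat as the latest 1.x version
    return default


request = {"operation": "searchRetrieve", "version": "1.2", "query": "cat"}
print(detect_operation(request), detect_sru_version(request))
```

A request consisting only of the base URL carries no parameters at all and is therefore treated as an explain request in the default version.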
“Error handling”
Difference between (non-)fatal, (non-)surrogate → SRU 2.0 – Diagnostic Model
Schema: info:srw/schema/1/diagnostics-v1.1
Prefix: info:srw/diagnostic/1/
uri (ID), details (additional information, depending on Diagnostic), message
Information:
General information and notes (LOC, OASIS SRU 2.0)
List of Diagnostics (LOC, OASIS SRU 2.0)
Categories: General (1-9), CQL (10-49), Result Sets (50-60), Records (61-74), Sorting (80-96), Explain (100-102), Stylesheets (110-111), Scan (120-121)
Not limited to this list only, custom diagnostics possible
Number | Description | Details |
---|---|---|
1 | General system error | Debugging information (traceback) |
2 | System temporarily unavailable | |
3 | Authentication error | |
4 | Unsupported operation | |
5 | Unsupported version | Highest version supported |
6 | Unsupported parameter value | Name of parameter |
7 | Mandatory parameter not supplied | Name of missing parameter |
8 | Unsupported parameter | Name of the unsupported parameter |
9 | Unsupported combination of parameters | |
10 | Query syntax error | |
23 | Too many characters in term | Length of longest term |
26 | Non special character escaped in term | Character incorrectly escaped |
35 | Term contains only stopwords | Value |
37 | Unsupported boolean operator | Value |
38 | Too many boolean operators in query | Maximum number supported |
47 | Cannot process query; reason unknown | |
48 | Query feature unsupported | Feature |
60 | Result set not created: too many matching records | Maximum number |
61 | First record position out of range | |
64 | Record temporarily unavailable | |
65 | Record does not exist | |
66 | Unknown schema for retrieval | Schema URI or short name |
67 | Record not available in this schema | Schema URI or short name |
68 | Not authorized to send record | |
69 | Not authorized to send record in this schema | |
70 | Record too large to send | Maximum record size |
71 | Unsupported recordXMLEscaping value | |
80 | Sort not supported | |
110 | Stylesheets not supported | |
111 | Unsupported stylesheet | URL of stylesheet |
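A diagnostic can be modelled as a small record with the three fields `uri`, `details` and `message`. The sketch below builds the URI from the prefix given above and looks up the category from the number ranges listed earlier; the class and function names are hypothetical and do not mirror the reference libraries.

```python
from dataclasses import dataclass
from typing import Optional

DIAGNOSTIC_PREFIX = "info:srw/diagnostic/1/"

# Number ranges per category, as listed in the text.
CATEGORIES = [
    (1, 9, "General"), (10, 49, "CQL"), (50, 60, "Result Sets"),
    (61, 74, "Records"), (80, 96, "Sorting"), (100, 102, "Explain"),
    (110, 111, "Stylesheets"), (120, 121, "Scan"),
]


@dataclass(frozen=True)
class Diagnostic:
    uri: str
    details: Optional[str] = None
    message: Optional[str] = None


def make_diagnostic(number: int, details: Optional[str] = None,
                    message: Optional[str] = None) -> Diagnostic:
    """Create a standard SRU diagnostic from its number."""
    return Diagnostic(f"{DIAGNOSTIC_PREFIX}{number}", details, message)


def category(diagnostic: Diagnostic) -> Optional[str]:
    """Return the category name, or None for custom diagnostics."""
    if not diagnostic.uri.startswith(DIAGNOSTIC_PREFIX):
        return None
    suffix = diagnostic.uri[len(DIAGNOSTIC_PREFIX):]
    if not suffix.isdigit():
        return None
    number = int(suffix)
    for low, high, name in CATEGORIES:
        if low <= number <= high:
            return name
    return None


missing = make_diagnostic(7, details="query", message="Mandatory parameter not supplied")
print(missing.uri, category(missing))
```

Custom diagnostics with another prefix are accepted by the data class but have no category in this lookup.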
FCS = Description of capabilities, extensions according to SRU and operations
→ Use of SRU/CQL and extensions according to SRU
Interface specification = formats and transport protocol
Endpoint = bridge between client (FCS formats) and local search engine
Client = user interface, query input and result presentation
Discovery and Search mechanism
SRU Explain
Help and information for the client on accessing, requesting and processing results from the server
Information about endpoint
Capabilities: Basic Search, Advanced Search?
Resources for search
→ Endpoint Description (XML) via explain SRU Operation
XML according to the schema Endpoint-Description.xsd
<ed:EndpointDescription>
@version with “2”
<ed:Capabilities> (1)
<ed:SupportedDataViews> (1)
<ed:SupportedLayers> (1) (if Advanced Search Capability)
<ed:Resources> (1)
<ed:Capability>
Content: Capability Identifier, URI
http://clarin.eu/fcs/capability/basic-search
http://clarin.eu/fcs/capability/advanced-search
<ed:SupportedDataView>
Content: MIME type, e.g. application/x-clarin-fcs-hits+xml
@id → for referencing in <ed:Resource>
@delivery-policy: send-by-default / need-to-request
No duplicates (based on MIME type) allowed
<ed:SupportedLayer> (only for Advanced Search)
Content: Layer Identifier, e.g. “orth”
@id → for referencing in <ed:Resource>
@result-id → Referencing the layer in the Advanced Data View
@qualifier → Identifier in FCS-QL Search Term for the layer
@alt-value-info, @alt-value-info-uri: short description of the layer, e.g. for tagset, + URL with further information
No duplicates allowed based on @result-id
MIME type
<ed:Resource>
@pid: Persistent Identifier (e.g. MdSelfLink from CMDI Record)
<ed:Title> (1+) with @xml:lang, no duplicates, English required
<ed:Description> (0+) with @xml:lang, English required, should be at most 1 sentence
<ed:Institution> (0+) with @xml:lang, English required
<ed:LandingPageURI> (0/1) – link to the website of the resource (or institution) with more information
<ed:Languages> (1) with <ed:Language> content according to ISO 639-3 language codes
<ed:AvailableDataViews> (1) with @ref = list of IDs of the <ed:SupportedDataView> elements, e.g. “hits adv”
<ed:AvailableLayers> (1) (if Advanced Search Capability), with @ref = list of IDs of the <ed:SupportedLayer> elements, e.g. “word lemma pos”
<ed:Resources> (0/1) for sub resources
For <ed:AvailableDataViews> and <ed:AvailableLayers> sub-resources should support the same lists, a new declaration is still required
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="cmdi" delivery-policy="need-to-request">application/x-cmdi+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- top-level resource 1 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
    <!-- top-level resource 2 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0816">
      <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen</ed:Title>
      <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus</ed:Title>
      <ed:LandingPageURI>http://repos.example.org/corpus2.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits cmdi" />
      <ed:Resources>
        <!-- sub-resource 1 of top-level resource 2 -->
        <ed:Resource pid="http://hdl.handle.net/4711/0816-1">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (vor 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (before 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub1</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits cmdi" />
        </ed:Resource>
        <!-- sub-resource 2 of top-level resource 2 ... -->
      </ed:Resources>
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
    <ed:Capability>http://clarin.eu/fcs/capability/advanced-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="adv" delivery-policy="send-by-default">application/x-clarin-fcs-adv+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:SupportedLayers>
    <ed:SupportedLayer id="word" result-id="http://spraakbanken.gu.se/ns/fcs/layer/word">text</ed:SupportedLayer>
    <ed:SupportedLayer id="orth" result-id="http://endpoint.example.org/Layers/orth" type="empty">orth</ed:SupportedLayer>
    <ed:SupportedLayer id="lemma" result-id="http://spraakbanken.gu.se/ns/fcs/layer/lemma">lemma</ed:SupportedLayer>
    <ed:SupportedLayer id="pos" result-id="http://spraakbanken.gu.se/ns/fcs/layer/pos"
                       alt-value-info="SUC tagset"
                       alt-value-info-uri="https://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf"
                       qualifier="suc">pos</ed:SupportedLayer>
    <ed:SupportedLayer id="pos2" result-id="http://spraakbanken.gu.se/ns/fcs/layer/pos2"
                       alt-value-info="2nd tagset"
                       qualifier="t2">pos</ed:SupportedLayer>
  </ed:SupportedLayers>
  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="hdl:10794/suc">
      <ed:Title xml:lang="sv">SUC-korpusen</ed:Title>
      <ed:Title xml:lang="en">The SUC corpus</ed:Title>
      <ed:Description xml:lang="sv">Stockholm-Umeå-korpusen hos Språkbanken.</ed:Description>
      <ed:Description xml:lang="en">The Stockholm-Umeå corpus at Språkbanken.</ed:Description>
      <ed:LandingPageURI>https://spraakbanken.gu.se/resurser/suc</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>swe</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits adv" />
      <ed:AvailableLayers ref="word lemma pos pos2" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
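On the client side, an Endpoint Description like the ones above can be read with a standard XML parser. The following sketch uses Python's `xml.etree.ElementTree` on a shortened copy of the first example; the function name `summarize` and the shape of the returned dictionary are illustrative choices, and only top-level resources are collected.

```python
import xml.etree.ElementTree as ET

ED_NS = {"ed": "http://clarin.eu/fcs/endpoint-description"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # ElementTree key for xml:lang

# Shortened copy of the basic-search example, for illustration only.
EXAMPLE = """<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:Resources>
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>"""


def summarize(xml_text: str) -> dict:
    """Collect capabilities and top-level resources of an Endpoint Description."""
    root = ET.fromstring(xml_text)
    capabilities = [c.text for c in root.findall("ed:Capabilities/ed:Capability", ED_NS)]
    resources = []
    for res in root.findall("ed:Resources/ed:Resource", ED_NS):
        titles = {t.get(XML_LANG): t.text for t in res.findall("ed:Title", ED_NS)}
        views = res.find("ed:AvailableDataViews", ED_NS)
        resources.append({
            "pid": res.get("pid"),
            "titles": titles,
            # @ref is a whitespace-separated list of Data View IDs
            "data_views": views.get("ref", "").split() if views is not None else [],
        })
    return {"version": root.get("version"), "capabilities": capabilities, "resources": resources}


summary = summarize(EXAMPLE)
print(summary["resources"][0]["titles"]["en"])
```

Sub-resources would need a recursive walk over nested `<ed:Resources>` elements, which is left out here to keep the sketch short.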
SRU SearchRetrieve
Actual “Search”
Basic Search with CQL
Advanced Search with FCS-QL
Search results are serialized in Resource (Fragment) and in Data View formats
Implementation details → Chapter Resources and Data Views
x-fcs-endpoint-description (explain)
“true” - <sru:extraResponseData> of the Explain Response contains the Endpoint Description document
x-fcs-context (searchRetrieve)
Comma-separated list of PIDs
Restrict the search to resources identified by these PIDs
x-fcs-dataviews (searchRetrieve)
Comma-separated list of Data View identifiers
Endpoints should also deliver these need-to-request Data Views if requested
x-fcs-rewrites-allowed (searchRetrieve)
“true” - Endpoint can simplify query for higher recall
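On the endpoint side, the searchRetrieve extension parameters boil down to two comma-separated lists and one flag. A minimal sketch of reading them from an already decoded parameter dictionary follows; the function name and the keys of the returned dictionary are hypothetical.

```python
def parse_fcs_extra_params(params: dict) -> dict:
    """Extract the FCS searchRetrieve extension parameters from a request."""

    def split_list(value: str) -> list:
        # Comma-separated list; empty entries and surrounding spaces are dropped.
        return [item.strip() for item in value.split(",") if item.strip()]

    return {
        "context": split_list(params.get("x-fcs-context", "")),
        "dataviews": split_list(params.get("x-fcs-dataviews", "")),
        "rewrites_allowed": params.get("x-fcs-rewrites-allowed") == "true",
    }


extra = parse_fcs_extra_params({
    "query": "cat",
    "x-fcs-context": "hdl:10794/suc",
    "x-fcs-dataviews": "hits,cmdi",
})
print(extra)
```

An empty `context` list means the request did not restrict the search to particular resources.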
Complements to the SRU Diagnostics → SRU 2.0 – Diagnostics
Prefix: http://clarin.eu/fcs/diagnostic/
Refer to the Extra Request Parameters
Identifier URI | Description | Impact |
---|---|---|
| Persistent identifier passed by the Client for restricting the search is invalid. | non-fatal |
| Resource set too large. Query context automatically adjusted. | non-fatal |
| Resource set too large. Cannot perform Query. | fatal |
| Requested Data View not valid for this resource. | non-fatal |
| General query syntax error. | fatal |
| Query too complex. Cannot perform Query. | fatal |
| Query was rewritten. | non-fatal |
| General processing hint. | non-fatal |
“Clients MUST be compatible to CLARIN-FCS 1.0” (source)
Thus implementation of SRU 1.2 still required (?)
Restriction to Basic Search Capability
Processing of legacy XML namespaces (SRU Response, Diagnostics)
Heuristic for version detection (of endpoints)
Client: Explain request without version and operation parameters
Endpoint: SRU Response <sru:explainResponse>/<sru:version> with default SRU version
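A client could apply this heuristic by reading the version element from the Explain response and mapping it to an FCS version. The sketch below matches the element by its local name so that it does not depend on a particular SRU namespace; the sample response is a stripped-down illustration rather than real endpoint output, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# SRU version → FCS version, following the pairing listed in this document.
SRU_TO_FCS = {"2.0": "FCS 2.0", "1.2": "FCS 1.0", "1.1": "FCS 1.0"}


def detect_versions(explain_response_xml: str):
    """Return (SRU version, FCS version) found in an Explain response."""
    root = ET.fromstring(explain_response_xml)
    for element in root.iter():  # document order, root first
        local_name = element.tag.rsplit("}", 1)[-1]  # strip "{namespace}"
        if local_name == "version":
            sru_version = (element.text or "").strip()
            return sru_version, SRU_TO_FCS.get(sru_version)
    return None, None


# Illustrative minimal response, not copied from a real endpoint.
SAMPLE = (
    '<sru:explainResponse xmlns:sru="http://www.loc.gov/zing/srw/">'
    "<sru:version>1.2</sru:version>"
    "</sru:explainResponse>"
)
print(detect_versions(SAMPLE))
```

An unknown version string yields `None` for the FCS version, which leaves the decision on how to proceed to the caller.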
Versions
FCS 2.0 ↔ SRU 2.0
FCS 1.0 ↔ SRU 1.2 (SRU 1.1)
Currently no (?) support for FCS 2.0 only endpoints
For compatibility reasons support of Legacy FCS and FCS 1.0
Assumption that endpoints in FCS 2.0 also support earlier FCS Versions… (no issue with CLARIN SRU/FCS libraries)
→ FCS 2.0 only endpoints may therefore still receive FCS 1.0 (SRU 1.2) requests!
Aggregator sends searchRetrieve requests with only one resource PID in the x-fcs-context parameter for each resource requested
i.e. search across N resources of an endpoint → N separate search queries
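The fan-out into one request per resource can be sketched as follows. The endpoint URL is a placeholder and the function name is hypothetical; this illustrates the behaviour described above and is not taken from the Aggregator code.

```python
from urllib.parse import urlencode


def per_resource_requests(endpoint_url: str, query: str, pids: list) -> list:
    """Build one searchRetrieve URL per resource PID (N resources → N requests)."""
    return [
        endpoint_url + "?" + urlencode({"query": query, "x-fcs-context": pid})
        for pid in pids
    ]


urls = per_resource_requests(
    "https://sru.example.org/sru",  # placeholder endpoint
    "cat",
    ["hdl:10794/suc", "hdl:10794/other"],  # second PID is made up for the example
)
for request_url in urls:
    print(request_url)
```

Endpoint implementers can therefore expect many small requests instead of one request that lists several PIDs.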
Usage only in Legacy FCS,
originally for listing the available resources
Reserved for possible future use
Please ignore!
Development started ~2012
Modularized: Client/Server, SRU/FCS, Parser
in Java 1.8+ (EOL: end of 2030)
Extensive documentation, some tests (proven by being in use for a long time)
Artifacts in CLARIN Nexus, Code on Github
Server/endpoint: external dependencies on
Logging: slf4j
HTTP: javax.servlet:servlet-api
Parser: antlr4 (FCS-QL) / CQL
Build: maven
Deployment: jetty, tomcat, …
~ 2022: Translation of Java reference libraries to Python
Strong orientation towards the Java reference libraries
→ (almost) identical interfaces, class/function names
but: slight optimizations for Python, no 1:1 copy
Focus on (new) FCS endpoints → no clients!
Typed, documented; published on PyPI
Synchronous, minimal WSGI - allows embedding in existing apps
Python 3.8+
Dependencies on
XML parsing: lxml
HTTP/WSGI: werkzeug
Query Parser: PLY (CQL), ANTLR4 (FCS-QL)
Query Parser (CQL, FCS-QL)
FCS SRU Server
SRU configurations, versions, parameters, diagnostics, namespaces
XML SRU Writer
Request Parameter parser, SRUServer (request handler)
Abstract SRU interfaces (results, SRUSearchEngine)
Auth (Interface, WIP)
FCS Simple Endpoint
FCS configurations (Endpoint Description), parameters, diagnostics, namespaces
XML Endpoint Description parser, Record and Data View writer
SimpleEndpointSearchEngineBase (SRUSearchEngine
+ FCS extensions)
FCS Endpoint for XYZ
Implementation of abstract classes and bindings to search engine, query translation
Configuration: Endpoint Description, SRU Server Configuration
Deployment on Java Servlet Server or as WSGI app
SRUServerServlet
/ SRUServerApp
(web server)
Set default WebApp parameters
Parse the SRU Server Config
Create QueryParserRegistry
(CQL)
Initialize SRUSearchEngine
Create SRUServer
(with SearchEngine
+ configurations)
SRUSearchEngine
(user implementation, → SimpleEndpointSearchEngineBase
)
Further initialization of the QueryParserRegistry
(FCS-QL)
do_init
(user init)
Create Endpoint Description
[GET] request (incoming)
↳ SRUServerServlet
/ SRUServerApp
(web server)
↳ SRUServer
URL parameter evaluation
Multiplexing by operation: search/scan/explain
↳ SimpleEndpointSearchEngineBase
(user implementation)
Parse search query (CQL/FCS-QL) and send to search engine
Wrap result in SRUSearchResultSet
Possible diagnostics etc.
↲
optional error handling
XML output generation (SRU parameter)
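The multiplexing step in this flow can be sketched as follows. The names `detect_operation` and `handle_request` are hypothetical, and the detection rule (explicit `operation` first, otherwise inferred from `query` / `scanClause`) is an assumption about how a server might decide, not the libraries' actual code.

```python
OPERATIONS = ("searchRetrieve", "scan", "explain")


def detect_operation(params):
    """Pick the SRU operation for a dict of GET parameters."""
    explicit = params.get("operation")
    if explicit is not None:
        if explicit not in OPERATIONS:
            raise ValueError(f"unsupported operation: {explicit!r}")
        return explicit
    if "query" in params:
        return "searchRetrieve"
    if "scanClause" in params:
        return "scan"
    return "explain"  # no operation-specific parameter at all


def handle_request(params, engine):
    """Dispatch to the search engine method for the detected operation."""
    handlers = {
        "searchRetrieve": engine.search,
        "scan": engine.scan,
        "explain": engine.explain,
    }
    return handlers[detect_operation(params)](params)
```

Here `engine` stands in for the user implementation; any object with `search`, `scan` and `explain` methods works for experimenting.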
Servlet implementation for servlet container, doGet
handler, setup of SRUServer
wrapper/application executed by the endpoint operator
SRU protocol implementation, handleRequest
, error handling, XML output generation
Specific SRU GET parameter evaluation (parsing, validation; SRU versions) + possible FCS parameters (“x-
…”), SRU version detection
Actual implementation of createEndpointDescription
, do
* methods
Actual implementation, nextRecord
+ writeRecord
iterator and serialization of results
XYZSRUScanResultSet
, XYZSRUExplainResult
do not need to be implemented separately, default behavior is adequate
<?xml version="1.0" encoding="UTF-8"?>
<endpoint-config xmlns="http://www.clarin.eu/sru-server/1.0/">
<databaseInfo>
<title xml:lang="sv">Språkbankens korpusar</title>
<title xml:lang="en" primary="true">The Språkbanken corpora</title>
<description xml:lang="sv">Sök i Språkbankens korpusar.</description>
<description xml:lang="en" primary="true">Search in the Språkbanken corpora.</description>
<author xml:lang="en">Språkbanken (The Swedish Language Bank)</author>
<author xml:lang="sv" primary="true">Språkbanken</author>
</databaseInfo>
<indexInfo>
<set name="fcs" identifier="http://clarin.eu/fcs/resource">
<title xml:lang="se">Clarins innehållssökning</title>
<title xml:lang="en" primary="true">CLARIN Content Search</title>
</set>
<index search="true" scan="false" sort="false">
<title xml:lang="en" primary="true">Words</title>
<map primary="true">
<name set="fcs">words</name>
</map>
</index>
</indexInfo>
<schemaInfo>
<schema identifier="http://clarin.eu/fcs/resource" name="fcs"
sort="false" retrieve="true">
<title xml:lang="en" primary="true">CLARIN Content Search</title>
</schema>
</schemaInfo>
</endpoint-config>
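To see what such a configuration carries, it can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the example; the helper names are illustrative and not part of the FCS libraries.

```python
import xml.etree.ElementTree as ET

NS = {"sru": "http://www.clarin.eu/sru-server/1.0/"}

# shortened copy of the configuration shown above
CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<endpoint-config xmlns="http://www.clarin.eu/sru-server/1.0/">
  <databaseInfo>
    <title xml:lang="en" primary="true">The Språkbanken corpora</title>
  </databaseInfo>
  <indexInfo>
    <index search="true" scan="false" sort="false">
      <map primary="true"><name set="fcs">words</name></map>
    </index>
  </indexInfo>
</endpoint-config>"""


def primary_title(root):
    """Return the databaseInfo title flagged as primary, if any."""
    for title in root.findall("sru:databaseInfo/sru:title", NS):
        if title.get("primary") == "true":
            return title.text
    return None


def searchable_indexes(root):
    """Return the names of all indexes declared with search="true"."""
    names = []
    for index in root.findall("sru:indexInfo/sru:index", NS):
        if index.get("search") == "true":
            names.extend(n.text for n in index.findall("sru:map/sru:name", NS))
    return names


root = ET.fromstring(CONFIG.encode("utf-8"))
print(primary_title(root), searchable_indexes(root))
```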
WebApp parameters (web.xml or similar) - Korp example
SRU Version
SRU/FCS configurations
SRU (SRU Server Config) - Korp example →
databaseInfo
about endpoint, but no evaluation in client?
default: indexInfo
+ schemaInfo
Mandatory: database
field in serverInfo
!
FCS (Endpoint Description) - Korp example
FCS Version (1/2)
Capabilities, Layer, DataViews
Resources
http://clarin.eu/fcs/capability/basic-search
Mandatory
Query: Full-text search (Basic) with minimal CQL (AND/OR)
DataView: HITS
http://clarin.eu/fcs/capability/advanced-search
Optional
Query: FCS-QL (Structured search over annotation layers)
DataView: HITS and Advanced
Other capabilities possible
→ currently limited to Basic and Advanced Search!
Capabilities determine more than just the search modes!
Work in progress:
Authentication/authorization
Lexical search: …/lex-search
→ LexCQL, LexHITS
Syntactic search?
Note: according to XSD, capability URIs have the following schema
http://clarin.eu/fcs/capability/\w([\.\-]{0,1}\w)*
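The URI pattern can be tried out with a regular expression. This is a sketch only: Python's `re` dialect differs slightly from XSD patterns (for example in what `\w` covers), and the dots of the host name are escaped here, which is a little stricter than the pattern as quoted.

```python
import re

# Python rendering of the capability URI pattern quoted above
CAPABILITY_RE = re.compile(r"http://clarin\.eu/fcs/capability/\w([.\-]?\w)*")


def is_valid_capability(uri: str) -> bool:
    """True if the whole URI matches the capability pattern."""
    return CAPABILITY_RE.fullmatch(uri) is not None
```

For example, `…/basic-search` and `…/lex-search` pass, while a name with two consecutive hyphens or an empty name does not.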
cat
"cat"
cat AND dog
"grumpy cat"
"grumpy cat" AND dog
"grumpy cat" OR "lazy dog"
cat AND (mouse OR "lazy dog")
Mandatory!
Simple full-text search
Contextual Query Language (CQL) as query language
Endpoints
must support “term-only” queries
can support Boolean operators (AND
/OR
) and sub-queries
must abort in case of errors with appropriate diagnostics
can decide themselves what to search for
(text, normalization etc.)
Results serialized in Generic Hits (HITS) Data View
http://clarin.eu/fcs/capability/basic-search
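For experimenting with the example queries, the subset used in Basic Search (terms, quoted phrases, `AND`/`OR`, parentheses) can be parsed with a few lines of code. This is a toy sketch under that assumption, not a CQL implementation; real endpoints should rely on the CQL parser libraries. A `ValueError` marks the point where an endpoint would answer with a diagnostic.

```python
import re

# quoted phrase | parenthesis | bare word
TOKEN_RE = re.compile(r'\s*(?:"((?:[^"\\]|\\.)*)"|([()])|([^\s()"]+))')


def tokenize(query):
    tokens, pos = [], 0
    query = query.strip()
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if not match:
            raise ValueError(f"syntax error at offset {pos}")
        quoted, paren, word = match.groups()
        if quoted is not None:
            tokens.append(("TERM", quoted))
        elif paren:
            tokens.append((paren, paren))
        elif word.upper() in ("AND", "OR"):
            tokens.append(("BOOL", word.upper()))
        else:
            tokens.append(("TERM", word))
        pos = match.end()
    return tokens


def parse(query):
    """Parse into nested tuples, e.g. ("AND", "cat", ("OR", "a", "b"))."""
    tokens = tokenize(query)
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        kind, value = tokens[pos]
        pos += 1
        if kind == "TERM":
            return value
        if kind == "(":
            node = expression()
            if pos >= len(tokens) or tokens[pos][0] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        raise ValueError(f"unexpected token {value!r}")

    def expression():
        nonlocal pos
        node = operand()
        # booleans are combined left to right, without precedence
        while pos < len(tokens) and tokens[pos][0] == "BOOL":
            operator = tokens[pos][1]
            pos += 1
            node = (operator, node, operand())
        return node

    tree = expression()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos][1]!r}")
    return tree
```

An endpoint that only supports term-only queries could accept the result when it is a plain string and reject tuples with a "query feature unsupported" diagnostic.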
"walking"
[token = "walking"]
"Dog" /c
[word = "Dog" /c]
[pos = "NOUN"]
[pos != "NOUN"]
[lemma = "walk"]
"blaue|grüne" [pos = "NOUN"]
"dogs" []{3,} "cats" within s
[z:pos = "ADJ"]
[z:pos = "ADJ" & q:pos = "ADJ"]
Optional
Structured search in annotated data,
represented in annotation layers
→ Query language FCS-QL
Queries can combine different annotation layers
Endpoints should support as many annotation layers as possible
Results serialized in Advanced (ADV) Data View and Generic Hits (HITS) Data View
http://clarin.eu/fcs/capability/advanced-search
Annotation Layers, containing annotations of a certain type (e.g. text, POS tags, …)
Query supports combination of these layers
Each layer is segmented → search for individual lemma
No requirement as to how segmentation should be done
Assumption that segmentation is consistent across layers (for display in Advanced Data View)
Queries can combine segments for multi-token patterns
Endpoints must be able to parse FCS-QL completely!
Requests with unsupported operators or layers?
Generate errors with diagnostics, or
Rewrite queries if permitted by “x-fcs-rewrites-allowed
” (on request)
Searches are Case Sensitive (configurable in the query)
Searches (by endpoints) should take place on layers where it makes sense,
e.g. if there are several text or POS layers
Layer Type Identifier | Annotation Layer Description | Syntax | Examples (without quotes) |
---|---|---|---|
text | Textual representation of resource, also the layer that is used in Basic Search | String | "Dog", "cat", "walking", "better" |
lemma | Lemmatisation | String | "good", "walk", "dog" |
pos | Part-of-Speech annotations | Universal POS tags | "NOUN", "VERB", "ADJ" |
orth | Orthographic transcription of (mostly) spoken resources | String | "dug", "cat", "wolking" |
norm | Orthographic normalization of (mostly) spoken resources | String | "dog", "cat", "walking", "best" |
phonetic | Phonetic transcription | SAMPA | "'du:", "'vi:-d6 'ha:-b@n" |
Universal Dependencies, Universal POS tags v2.0
Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems. Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7
Identifies layers for FCS-QL and Advanced Data View
Other identifiers are not allowed, except for testing purposes
Custom identifiers must be prefixed with “x-
”
Results must be serialized in CLARIN FCS format
Resource (Fragment), Data View
XML → XSD
Important: 1 Hit = 1 Result Record
Do not combine multiple hits in one record
→ generate separate SRU records for each hit that reference the same resource
Multiple hit markers are allowed, e.g. for boolean expressions to highlight individual terms
Each “Hit” should be defined in a sentence context
Resource
“searchable and addressable entity” in the endpoint, e.g. text corpus
“self contained”, i.e. entire document, not a single sentence from a document
Addressable as a whole via Persistent Identifier or URI
Resource Fragment
Part of a Resource, e.g. single sentence, or time interval in audio transcription (for multi-modal corpora)
Should be addressable within a Resource (offset / ID)
Optional, but recommended
Data View
Serialization of a “Hit” in a Resource (Fragment)
Enables different representations, expandable
Endpoints should provide link to Resource (Fragment)
Persistent Identifier (PID) / URI
If direct linking is not possible, then e.g. website with description of the resource, corpus or collection
Link should be as specific as possible
PIDs preferred to URIs, both together recommended
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/00-15">
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<!-- data view payload omitted -->
</fcs:DataView>
</fcs:Resource>
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15">
<fcs:ResourceFragment>
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<!-- data view payload omitted -->
</fcs:DataView>
</fcs:ResourceFragment>
</fcs:Resource>
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource"
pid="http://hdl.handle.net/4711/08-15"
ref="http://repos.example.org/file/text_08_15.html">
<fcs:DataView type="application/x-cmdi+xml" (1)
pid="http://hdl.handle.net/4711/08-15-1"
ref="http://repos.example.org/file/08_15_1.cmdi">
<!-- data view payload omitted -->
</fcs:DataView>
<fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2" (2)
ref="http://repos.example.org/file/text_08_15.html#sentence2">
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<!-- data view payload omitted -->
</fcs:DataView>
</fcs:ResourceFragment>
</fcs:Resource>
1 | Specification of CMDI metadata for the resource |
2 | Hit is part of a larger resource “semantically more meaningful” |
Specification (with XSD schema, examples)
Specified in FCS Core 2.0
Advanced (ADV) Data View
Generic Hits (HITS) Data View
Additional Data Views such as Component Metadata (CMDI), Images (IMG), Geolocation (GEO) are included, but not used in the standard FCS client “Aggregator”
Mandatory “send-by-default
”
or optional “need-to-request
”
Generic Hits Data View is mandatory, must always be sent
Only send Data Views that
are explicitly requested with the (SRU) FCS parameter “x-fcs-dataviews”, or
have the delivery policy “send-by-default”
Invalid Data Views → non-fatal diagnostic for each requested Data View
http://clarin.eu/fcs/diagnostic/4
("Requested Data View not valid for this resource")
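The delivery rules can be sketched as a small selection function. The function name and the shape of its arguments are assumptions made for illustration; only the diagnostic URI is taken from the text above.

```python
INVALID_DATA_VIEW = "http://clarin.eu/fcs/diagnostic/4"


def select_data_views(supported, requested):
    """Return (views_to_send, diagnostics).

    `supported` maps a Data View id to its delivery policy, e.g.
    {"hits": "send-by-default", "cmdi": "need-to-request"};
    `requested` is the already split value of `x-fcs-dataviews`.
    """
    to_send = [view for view, policy in supported.items()
               if policy == "send-by-default"]
    diagnostics = []
    for view in requested:
        if view not in supported:
            # non-fatal: one diagnostic per invalid Data View
            diagnostics.append((INVALID_DATA_VIEW, view))
        elif view not in to_send:
            to_send.append(view)
    return to_send, diagnostics
```

With `hits` as send-by-default and `cmdi` as need-to-request, a request for `cmdi,geo` yields both `hits` and `cmdi` plus one diagnostic for `geo`.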
Description | The representation of the hit |
---|---|
MIME type | application/x-clarin-fcs-hits+xml |
Payload Disposition | inline |
Payload Delivery | send-by-default (required) |
Recommended Short Identifier | hits |
XML Schema |
Required implementation
Simplest serialization, (lossy) approximation of results
Each hit should only occur in a single sentence context (or similar)
Multiple hit annotations possible, e.g. for conjunctions in the query
<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy dog.
</hits:Result>
</fcs:DataView>
<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.
</hits:Result>
</fcs:DataView>
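A HITS Data View like the ones above can be produced from a text and a list of hit offsets. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name and the offset convention (end exclusive) are assumptions, and the reference libraries ship their own writers for this.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/resource"
HITS_NS = "http://clarin.eu/fcs/dataview/hits"
ET.register_namespace("fcs", FCS_NS)
ET.register_namespace("hits", HITS_NS)


def hits_data_view(text, hits):
    """Serialize `text` with highlighted hits.

    `hits` is a sorted list of non-overlapping (start, end) character
    offsets, end exclusive (an assumption of this sketch).
    """
    view = ET.Element(f"{{{FCS_NS}}}DataView",
                      {"type": "application/x-clarin-fcs-hits+xml"})
    result = ET.SubElement(view, f"{{{HITS_NS}}}Result")
    cursor, last = 0, None
    for start, end in hits:
        before = text[cursor:start]
        if last is None:
            result.text = before      # text before the first hit
        else:
            last.tail = before        # text between two hits
        last = ET.SubElement(result, f"{{{HITS_NS}}}Hit")
        last.text = text[start:end]
        cursor = end
    if last is None:
        result.text = text
    else:
        last.tail = text[cursor:]     # text after the last hit
    return ET.tostring(view, encoding="unicode")


sentence = "The quick brown fox jumps over the lazy dog."
print(hits_data_view(sentence, [(16, 19)]))
```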
Description | The representation of the hit |
---|---|
MIME type | application/x-clarin-fcs-kwic+xml |
Payload Disposition | inline |
Payload Delivery | send-by-default |
Recommended Short Identifier | kwic |
XML Schema | - |
Deprecated!
Only for compatibility with Legacy FCS clients
Example in CQP/SRU bridge
Mapping of
left and right context,
hits
Description | The representation of the hit for Advanced Search |
---|---|
MIME type | application/x-clarin-fcs-adv+xml |
Payload Disposition | inline |
Payload Delivery | send-by-default |
Recommended Short Identifier | adv |
XML Schema |
Serialization for Advanced Search for multimedia data (text, transcribed audio)
Presentation of structured information via multiple annotation layers
Annotations are streams (ranges) over the signal in a stand-off like format with start and end offsets (inclusive)
Segmentation via <Segment>
, annotations in <Span>
in <Layer>
It must be possible to align segments across all annotation layers
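The stand-off idea (one shared segmentation, inclusive offsets, layers aligned to it) can be sketched with plain data structures. The helper names, the 1-based starting offset and the single-space separator are assumptions for illustration; the sketch does not produce the Advanced Data View XML itself.

```python
def segment_tokens(tokens, first_offset=1):
    """Assign inclusive (start, end) offsets to space-separated tokens."""
    segments, position = [], first_offset
    for index, token in enumerate(tokens):
        end = position + len(token) - 1  # inclusive end offset
        segments.append({"id": f"s{index}", "start": position, "end": end})
        position = end + 2               # skip the single separating space
    return segments


def align_layer(segments, annotations):
    """Attach one annotation per segment so all layers share the segmentation."""
    if len(segments) != len(annotations):
        raise ValueError("layer is not aligned with the segmentation")
    return [(segment["id"], value) for segment, value in zip(segments, annotations)]


segments = segment_tokens(["The", "quick", "fox"])
pos_layer = align_layer(segments, ["DET", "ADJ", "NOUN"])
print(segments, pos_layer)
```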
→ see more examples (searchRetrieve
query)
endpoint: https://spraakbanken.gu.se/ws/fcs/2.0/endpoint/korp/sru
…?operation=searchRetrieve&queryType=fcs&query=%5bword%3d%22anv%C3%A4ndning%22%5d
→ FCS 2.0, FCS-QL: [ word = "användning" ]
, HITS + ADV
…?operation=searchRetrieve&queryType=cql&query=%22anv%C3%A4ndning%22
→ FCS 2.0, CQL: "användning"
, HITS
…?operation=searchRetrieve&version=1.2&query=cat ↔ …?query=cat → HITS
FCS 1.0, sru="http://www.loc.gov/zing/srw/"
FCS 2.0, sruResponse="http://docs.oasis-open.org/ns/search-ws/sruResponse"
more parameters: x-indent-response=1
/ x-fcs-dataviews=cmdi
/ x-fcs-context=11022/0000-0000-20DF-1
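The percent-encoded request URLs above do not need to be written by hand. A small sketch with Python's `urllib.parse`, where `search_url` is a hypothetical helper and the code only builds the URL without sending a request:

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

ENDPOINT = "https://spraakbanken.gu.se/ws/fcs/2.0/endpoint/korp/sru"


def search_url(query, query_type="fcs", **extra):
    """Build a searchRetrieve URL; `extra` carries optional parameters."""
    params = {"operation": "searchRetrieve", "queryType": query_type, "query": query}
    params.update(extra)
    # quote_via=quote percent-encodes spaces as %20 instead of "+"
    return ENDPOINT + "?" + urlencode(params, quote_via=quote)


url = search_url('[word="användning"]', **{"x-indent-response": "1"})
print(url)
```

The hexadecimal digits come out in upper case (`%5B` instead of `%5b`), which is equivalent.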
More resources in Awesome FCS List > Query Parsers
CQL (Contextual Query Language)
BNF grammar: www.loc.gov/standards/sru/cql/spec.html#bnf
Hand-written parser implementation in Java, Python, JS, …
Documentation: Java
Visualization in demo of JS parser
Validation for Text+ LexCQL
FCS-QL (Federated Content Search Query Language)
EBNF grammar: github.com/clarin-eric/fcs-misc (FCS Core 2.0)
Grammar visualization with ANTLR4 tools
Installation
pip install antlr4-tools
git clone https://github.com/clarin-eric/fcs-ql.git
cd fcs-ql/src/main/antlr4/eu/clarin/sru/fcs/qlparser
Visualization according to ANTLR4 > Getting Started
antlr4-parse src/fcsql/FCSParser.g4 src/fcsql/FCSLexer.g4 query -gui
[ word = "her.*" ] [ lemma = "Artznei" ] [ pos = "VERB" ]
^D
QueryNode (with child node “children”)
Expression (layer identifier, layer identifier qualifier, operator, regular expression + flags)
Wildcard
Group → 1 QueryNode; “(
” … “)
”
NOT → 1 QueryNode
AND, OR → list of QueryNodes
QueryDisjunction → list of QueryNodes
QuerySequence → list of QueryNodes → “list of QuerySegments”
QuerySegment (min, max) → Expression → “a single token”
QueryGroup (min, max) → QueryNode
Within-Query (SimpleWithin, QueryWithWithin) (Scope: sentence, utterance, paragraph, turn, text, session) (unused)
Parsed Query:
Query Sequence → with list of Query Segment
[ word = ".*her" ] [ lemma = "Artznei" ] [ pos = "VERB" ]
Query Segment → a token (can be repeatable)
[ word = "her.*" & ( word = "test" | word = "Apfel" ) ] [ pos = "ADV" ]{1,3}
Expression AND
[ word = "her.*" & word = "test" ]
Expression Group
Expression
Expression Group → Expression OR → list of Expression
[ ( word = "her.*" | word = "Test" ) ]
Expression → Layer Identifier, Operator, Regex (value)
[ word = "her.*" ]
Currently (Aggregator v3.9.1) only limited support for FCS-QL features
→ partly due to Visual Query Builder
Free text input / improved query builder planned for the future
Use appropriate diagnostics if query features are not supported
SRU: info:srw/diagnostic/1/48
- Query feature unsupported.
FCS: http://clarin.eu/fcs/diagnostic/10
- General query syntax error. - should be intercepted by FCS-QL parser library
FCS: http://clarin.eu/fcs/diagnostic/11
- Query too complex. Cannot perform Query.
Idea:
Let libraries parse raw queries (CQL, FCS-QL)
Recursively walk through the parsed query tree, “depth first”
Successively generate transformed query (for target system),
e.g. StringBuilder
in Java
Examples:
NoSketchEngine: CQL → CQL (Java), FCS-QL → CQL (Java)
Solr: CQL → Solr (Java), LexCQL → Solr (Java)
SolrQuery with highlighting, Custom hit prefixes/postfixes, use Solr result as pre-formatted Data View content (Code)
CQI Bridge: CQL → CQP (Java)
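The translation idea (walk the parsed tree depth first and build the target query step by step) can be sketched as follows. The node classes are simplified stand-ins for the parser's query tree, not the library classes, and the CQP-like output syntax and the layer mapping are assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Expression:              # e.g. word = "her.*"
    layer: str
    operator: str
    regex: str


@dataclass
class And:
    children: List["Node"]


@dataclass
class Or:
    children: List["Node"]


@dataclass
class Segment:                 # one token, optionally repeated {minimum,maximum}
    expression: "Node"
    minimum: int = 1
    maximum: int = 1


@dataclass
class Sequence:
    segments: List[Segment]


Node = Union[Expression, And, Or, Segment, Sequence]

# hypothetical mapping from FCS-QL layers to attributes of the target engine
LAYER_TO_ATTRIBUTE = {"word": "word", "text": "word", "lemma": "lemma", "pos": "pos"}


class UnsupportedQueryFeature(Exception):
    """Would be reported as SRU diagnostic info:srw/diagnostic/1/48."""


def translate(node):
    """Recursively (depth first) render the tree in the target syntax."""
    if isinstance(node, Expression):
        if node.layer not in LAYER_TO_ATTRIBUTE:
            raise UnsupportedQueryFeature(f"layer {node.layer!r}")
        return f'{LAYER_TO_ATTRIBUTE[node.layer]} {node.operator} "{node.regex}"'
    if isinstance(node, And):
        return "(" + " & ".join(translate(child) for child in node.children) + ")"
    if isinstance(node, Or):
        return "(" + " | ".join(translate(child) for child in node.children) + ")"
    if isinstance(node, Segment):
        repeat = "" if (node.minimum, node.maximum) == (1, 1) \
            else f"{{{node.minimum},{node.maximum}}}"
        return f"[{translate(node.expression)}]{repeat}"
    if isinstance(node, Sequence):
        return " ".join(translate(segment) for segment in node.segments)
    raise UnsupportedQueryFeature(type(node).__name__)
```

Raising a dedicated exception for unknown layers or node types keeps the mapping to diagnostics in one place, at the point where the endpoint catches it.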
ElasticSearch
Only BASIC Search with full-text queries, e.g. with Simple Query String
Solr
Only BASIC Search
ADVANCED Search with e.g. MTAS (“Multi Tier Annotation Search”)
In general: use actual Corpus Search Engine for ADVANCED Search
→ otherwise at most a single annotation layer (“text”) can be searched
Download & Installation: code.visualstudio.com
Extensions:
Java
redhat.vscode-xml (optional)
Python
Quality of Life
ms-vscode-remote.vscode-remote-extensionpack, ms-vscode.remote-explorer (for WSL or remote via SSH)
For *.war
/Jetty web application testing
No hot code swapping / do not make any changes between compilation and debugging!
VSCode Debug Setting:
Run and Debug > Add Configuration … > “Java: Attach by Process ID”
Run application with Maven:
MAVEN_OPTS="-Xdebug -Xnoagent -Djava.compiler=NONE -agentlib:jdwp=transport=dt_socket,server=y,address=5005" \
mvn [jetty:run-war|...]
launch.json
pytest: no predefined configuration in “Run and Debug” menu
file/module: as required
settings.json
pytest: coverage must be deactivated here!
{
"name": "Python: pytest",
"type": "python",
"request": "launch",
"console": "integratedTerminal",
"purpose": [
"debug-test"
],
"justMyCode": false
}
"python.testing.pytestArgs": [
".",
// disable coverage for debugging
"--no-cov",
// disable ansi color output (-vv)
"-q",
],
See [_guide_to_endpoint_development]
→ Using reference endpoint implementations
Using the Korp endpoint
Java: github.com/clarin-eric/fcs-sru-cqi-bridge (CQP/SRU bridge)
Java: project generation with Maven
Project template: github.com/clarin-eric/fcs-endpoint-archetype
Installation of the archetype in the local Maven repository, or
Configuration of the CLARIN Nexus as a remote repository
Project generation with Maven:
mvn archetype:generate \
-Pclarin \
-DarchetypeGroupId=eu.clarin.sru.fcs \
-DarchetypeArtifactId=fcs-endpoint-archetype \
-DarchetypeVersion=1.6.0 \
-DgroupId=[ id.group.fcs ] \
-DartifactId=[ my-cool-endpoint ] \
-Dversion=[ 1.0-SNAPSHOT ] \
-DinstitutionName=[ "My Institution" ]
all [
… ]
placeholders must be replaced with the appropriate values (enclose values with spaces in quotation marks)
if using the CLARIN remote repository, the custom profile is selected with -Pclarin
, see example maven configuration
if archetype is installed using git
, then use archetypeVersion=1.6.0-SNAPSHOT
(see details in pom.xml
)
Required class implementations
SimpleEndpointSearchEngineBase
SRUSearchResultSet
Wrapper or adapter for search engine (!)
Required configurations
sru-server-config.xml
endpoint-description.xml
Web app configurations
(Java: web.xml
, Python: key-value parameter dict)
Reference to implementation of the SimpleEndpointSearchEngineBase
Required SRU parameters (host
, port
, server
, …)
void doInit (ServletContext context, SRUServerConfig config, SRUQueryParserRegistry.Builder queryParsersBuilder, Map<String, String> params)
- Java, Python
Required implementation!
(optional) initialization of APIs, default values (PIDs), …
EndpointDescription createEndpointDescription (ServletContext context, SRUServerConfig config, Map<String, String> params)
- Java, Python
Required implementation!
Loading of EndpointDescription
(Java, Python)
embedded XML file (load with SimpleEndpointDescriptionParser
, Java, Python) or
construction dynamically, e.g. via API - example NoSketchEngine
SRUSearchResultSet search (SRUServerConfig config, SRURequest request, SRUDiagnosticList diagnostics)
Parse query (search request)
Check “queryType
” parameter, whether CQL, FCS-QL, …
Error: SRU_CANNOT_PROCESS_QUERY_REASON_UNKNOWN
Analyze ExtraRequestData
“x-fcs-context
” - requested resource (scope of search)
Diagnostic: FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID
- invalid PIDs
Error: SRU_UNSUPPORTED_PARAMETER_VALUE
- e.g. too many PIDs, no PIDs
“x-fcs-dataviews
” - requested Data Views
Diagnostic: FCS_DIAGNOSTIC_REQUESTED_DATA_VIEW_INVALID
Pagination → startRecord
(1) / maximumRecords
(-1)
Process search with (local) search engine
Wrap results in SRUSearchResultSet
“If in Doubt” → `SRU_GENERAL_SYSTEM_ERROR`
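The checks listed for `search` can be sketched independently of the libraries. Everything below is illustrative: `SRUError`, `read_search_parameters`, the default `queryType` of `cql` and the limit of one PID per request are assumptions; only the constant names are taken from the list above and used as plain strings.

```python
class SRUError(Exception):
    """Fatal error; `code` names the diagnostic constant listed above."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def read_search_parameters(params, known_pids, max_pids=1):
    """Validate request parameters; return (settings, non-fatal diagnostics)."""
    query_type = params.get("queryType", "cql")
    if query_type not in ("cql", "fcs"):
        raise SRUError("SRU_CANNOT_PROCESS_QUERY_REASON_UNKNOWN",
                       f"unsupported queryType: {query_type!r}")

    diagnostics = []  # non-fatal, reported alongside the results
    pids = []
    for pid in params.get("x-fcs-context", "").split(","):
        pid = pid.strip()
        if not pid:
            continue
        if pid in known_pids:
            pids.append(pid)
        else:
            diagnostics.append(("FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID", pid))
    if not pids or len(pids) > max_pids:
        raise SRUError("SRU_UNSUPPORTED_PARAMETER_VALUE",
                       f"x-fcs-context must name 1 to {max_pids} known resource(s)")

    settings = {
        "query_type": query_type,
        "pids": pids,
        "start_record": int(params.get("startRecord", 1)),       # default 1
        "maximum_records": int(params.get("maximumRecords", -1)),  # -1: no limit given
    }
    return settings, diagnostics
```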
Input: Parameters of search query
Query (translated for (local) search engine)
Resource (PID)
Pagination: offset + count, → startRecord
(1) / maximumRecords
(-1)
(Request object and Server configurations)
(all global/static objects, such as API adapters etc.)
Output: Details for response, results
Total number (optional, FCS 2.0 allows indication of accuracy)
List of results
with “hit highlighting” (Hits) (Basic + Advanced Search)
tokenized (using character offsets) for FCS-QL (Advanced Search) with optional Advanced Search annotation layers
Diagnostics
Wrapper for results
Total number of results
List of results (text with hit offsets; tokens + annotations)
Resource PID, URL to result details
SRUSearchResultSet
implementation
Iterator interface → nextRecord()
, writeRecord()
; curRecordCursor
protected NoSkESRUFCSSearchResultSet(..., MyResults results) {
    super(diagnostics);
    this.serverConfig = serverConfig;
    this.request = request;
    this.results = results;
    currentRecordCursor = -1; // cursor starts before the first record
}
// ...
public int getTotalRecordCount() { return (int) results.getTotal(); }
public int getRecordCount() { return results.getResults().size(); }

// advance the cursor; returns false once all records have been visited
public boolean nextRecord() throws SRUException {
    if (currentRecordCursor < (getRecordCount() - 1)) {
        currentRecordCursor++;
        return true;
    }
    return false;
}

// serialize the record at the current cursor position
public void writeRecord(XMLStreamWriter writer) throws XMLStreamException {
    MyResults.ResultEntry result = results.getResults().get(currentRecordCursor);
    XMLStreamWriterHelper.writeStartResource(writer, results.getPid(), null);
    XMLStreamWriterHelper.writeStartResourceFragment(writer, null, result.landingpage);
    // ...
    XMLStreamWriterHelper.writeEndResourceFragment(writer);
    XMLStreamWriterHelper.writeEndResource(writer);
}
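The cursor pattern behind `nextRecord()` / `writeRecord()` is easy to try out in isolation. The following Python sketch mirrors it with a hypothetical `ListResultSet`; it collects tuples instead of writing XML and is not the interface of the Python library.

```python
class ListResultSet:
    """Cursor-style result set over an in-memory list of entries."""

    def __init__(self, pid, entries, total=None):
        self.pid = pid
        self.entries = entries
        # the total may exceed the number of entries on the current page
        self.total = len(entries) if total is None else total
        self.cursor = -1  # before the first record

    def get_total_record_count(self):
        return self.total

    def get_record_count(self):
        return len(self.entries)

    def next_record(self):
        if self.cursor < self.get_record_count() - 1:
            self.cursor += 1
            return True
        return False

    def write_record(self, out):
        # stand-in for XML serialization of the current record
        out.append((self.pid, self.entries[self.cursor]))


# how a server would consume the result set
result_set = ListResultSet("hdl:1", ["first hit", "second hit"], total=42)
written = []
while result_set.next_record():
    result_set.write_record(written)
print(written)
```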
SRUXMLStreamWriter
- Java, Python
(internal), specifically for SRU “recordXmlEscaping
”
XMLStreamWriterHelper
- Java, Python (FCSRecordXMLStreamWriter
)
Boilerplate + help for writing Resource, ResourceFragment, Hits/Kwic Data View
AdvancedDataViewWriter
- Java, Python
Help with writing Advanced Data Views
addSpans
(content, layer, offset, hit?)
writeHitsDataView
, writeAdvancedDataView
FCS Version: 2
Capabilities: BASIC Search
Data Views: HITS
Resources: (min: 1)
Title
Description
LandingPage URL
Languages → one language (ISO 639-3)
<?xml version="1.0" encoding="UTF-8"?>
<EndpointDescription xmlns="http://clarin.eu/fcs/endpoint-description"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://clarin.eu/fcs/endpoint-description ../../schema/Core_2/Endpoint-Description.xsd"
version="2">
<Capabilities>
<Capability>http://clarin.eu/fcs/capability/basic-search</Capability>
</Capabilities>
<SupportedDataViews>
<SupportedDataView id="hits" delivery-policy="send-by-default" >application/x-clarin-fcs-hits+xml</SupportedDataView>
</SupportedDataViews>
<Resources>
<Resource pid="hdl:10794/sbkorpusar">
<Title xml:lang="sv">Språkbankens korpusar</Title>
<Title xml:lang="en">The Språkbanken corpora</Title>
<Description xml:lang="sv">Korpusarna hos Språkbanken.</Description>
<Description xml:lang="en">The corpora at Språkbanken.</Description>
<LandingPageURI >https://spraakbanken.gu.se/resurser/corpus</LandingPageURI>
<Languages>
<Language>swe</Language>
</Languages>
<AvailableDataViews ref="hits"/>
</Resource>
</Resources>
</EndpointDescription>
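Before deploying, an Endpoint Description can be sanity-checked against the minimum requirements listed above. The sketch below is an informal check with Python's standard XML parser on a shortened copy of the example; it does not replace validation against the XSD, and `check_description` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

ED = {"ed": "http://clarin.eu/fcs/endpoint-description"}
BASIC_SEARCH = "http://clarin.eu/fcs/capability/basic-search"

# shortened copy of the Endpoint Description shown above
DESCRIPTION = """\
<EndpointDescription xmlns="http://clarin.eu/fcs/endpoint-description" version="2">
  <Capabilities>
    <Capability>http://clarin.eu/fcs/capability/basic-search</Capability>
  </Capabilities>
  <SupportedDataViews>
    <SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</SupportedDataView>
  </SupportedDataViews>
  <Resources>
    <Resource pid="hdl:10794/sbkorpusar">
      <Title xml:lang="en">The Språkbanken corpora</Title>
      <LandingPageURI>https://spraakbanken.gu.se/resurser/corpus</LandingPageURI>
      <Languages><Language>swe</Language></Languages>
      <AvailableDataViews ref="hits"/>
    </Resource>
  </Resources>
</EndpointDescription>
"""


def check_description(xml_text):
    """Return a list of problems; an empty list means the minimum is met."""
    root = ET.fromstring(xml_text)
    problems = []
    capabilities = [c.text for c in root.findall("ed:Capabilities/ed:Capability", ED)]
    if BASIC_SEARCH not in capabilities:
        problems.append("capability basic-search is missing")
    view_ids = {v.get("id") for v in
                root.findall("ed:SupportedDataViews/ed:SupportedDataView", ED)}
    if "hits" not in view_ids:
        problems.append("HITS Data View is missing")
    resources = root.findall("ed:Resources/ed:Resource", ED)
    if not resources:
        problems.append("at least one resource is required")
    for resource in resources:
        pid = resource.get("pid")
        if resource.find("ed:Title", ED) is None:
            problems.append(f"{pid}: no title")
        if resource.find("ed:Languages/ed:Language", ED) is None:
            problems.append(f"{pid}: no language")
        available = resource.find("ed:AvailableDataViews", ED)
        refs = available.get("ref", "").split() if available is not None else []
        for ref in refs:
            if ref not in view_ids:
                problems.append(f"{pid}: unknown Data View {ref!r}")
    return problems


print(check_description(DESCRIPTION))
```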
SRU Server Configurations → Endpoint Configurations (sru-server-config.xml
)
databaseInfo
with general information about endpoint
default: indexInfo
+ schemaInfo
required: serverInfo
> database
(host
and port
by default)
Web server configuration
Optional adjustment of SRU / FCS parameters
Java: web.xml
Python: key-value dictionary
Using Maven (!) / pom.xml
<packaging>war</packaging>
Build Plugin:
org.apache.maven.plugins:maven-war-plugin[:2.6]
(?)
org.apache.maven.plugins:maven-compiler-plugin
Create WAR artifact
mvn clean compile war:war
mvn clean package
(also run tests etc.)
Deploy with Java Servlet Engine / HTTP server like Apache Tomcat / Eclipse Jetty / …
“make_app()
” method
→ provides configured WSGI SRUServerApp
(Python)
Deployment suggestion: gunicorn (Python WSGI HTTP server)
Example: fcs-korp-endpoint-python
as module with werkzeug test server
python3 -m korp_endpoint
gunicorn in Docker Container (Dockerfile)
gunicorn 'korp_endpoint.app:make_gunicorn_app()'
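What a `make_app()`-style factory returns is an ordinary WSGI callable. The sketch below shows that shape with the standard library only; `handler` is a hypothetical stand-in for the configured SRU server, and the real `make_app()` of an endpoint will look different.

```python
from urllib.parse import parse_qs


def make_app(handler):
    """Return a WSGI callable that passes GET parameters to `handler`.

    `handler` receives a dict of parameters and returns the XML response
    as a string.
    """
    def app(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        params = {key: values[0] for key, values in query.items()}
        body = handler(params).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "application/xml; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app


# a trivial handler for trying the app out
application = make_app(lambda params: "<ok>%s</ok>" % params.get("query", ""))
```

Any WSGI server can host such a callable, which is what makes embedding into existing applications straightforward.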
# Excerpt: methods of the wrapper class that embeds the SRUServer in a Flask app
def init(self, flask: Flask) -> None:
    self.server = self.build_fcs_server()
    # Flask URL rules must start with a leading slash
    flask.add_url_rule("/some-path/fcs", "some-path/fcs", self.handle)

def build_fcs_server(self) -> SRUServer:
    params = self.build_fcs_server_params()
    config = self.build_fcs_server_config(params)
    qpr_builder = SRUQueryParserRegistry.Builder(True)
    search_engine = KoshFCSEndpointSearchEngine(
        endpoint_description=self.build_fcs_endpointdescription(),
        # ... other parameters
    )
    search_engine.init(config, qpr_builder, params)
    return SRUServer(config, qpr_builder.build(), search_engine)

def handle(self) -> Response:
    LOGGER.debug("request: %s", request)  # Flask/Werkzeug Request
    LOGGER.debug("request?args: %s", request.args)
    response = Response()  # Flask/Werkzeug Response
    self.server.handle_request(request, response)
    return response
Supports own query languages, Data Views etc.
Example: LexFCS (FCS extension for lexical resources)
→ i.e. new query language and Data View
LexCQL - query language (CQL dialect)
LexHITS - HITS Data View extension
NOTE: This is about the now legacy FCS endpoint tester, see Section: FCS Endpoint Validator for the updated and rewritten validator!
WebApp for testing the compliance with the FCS specification of endpoints
Deployment: clarin.ids-mannheim.de/srutest
Java 8; Vaadin 7.7.15 (UI)
Installation uses SNAPSHOT versions of the SRU/FCS libraries, and normally reserved functions to validate the SRU/FCS protocols
SRU/FCS SNAPSHOT libraries must be installed directly from Git
$ git clone https://github.com/clarin-eric/fcs-sru-client.git && cd fcs-sru-client
$ mvn install
$ git clone https://github.com/clarin-eric/fcs-simple-client.git && cd fcs-simple-client
$ mvn install
Build with Maven
$ git clone https://github.com/clarin-eric/fcs-endpoint-tester.git && cd fcs-endpoint-tester
$ mvn clean package
Deployment with Jetty on http://localhost:8080/
$ JETTY_VERSION="9.4.51.v20230217"
$ wget https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-distribution/${JETTY_VERSION}/jetty-distribution-${JETTY_VERSION}.zip && unzip jetty-distribution-${JETTY_VERSION}.zip && rm jetty-distribution-${JETTY_VERSION}.zip
$ cd jetty-distribution-${JETTY_VERSION}/
$ java -jar start.jar --add-to-start=http,deploy
$ cd webapps/ && cp ../../target/FCSEndpointTester-X.Y.Z-SNAPSHOT.war ROOT.war && cd ..
$ java -jar start.jar
Create Docker Image
$ git clone https://github.com/clarin-eric/fcs-endpoint-tester.git && cd fcs-endpoint-tester
$ docker build -t fcs-endpoint-tester .
Run Container
$ docker run --rm -it -p 8080:8080 fcs-endpoint-tester
This is an updated and completely rewritten SRU/FCS Endpoint Validator based on [_fcs_endpoint_protocol_conformance_tester]. It adds more test cases and allows inspecting HTTP requests/responses and storing validation results.
WebApp for testing the compliance with the SRU/FCS specification of FCS endpoints
Deployment: fcs-validator.data.saw-leipzig.de
Multi-module maven project
(standalone) JUnit5 test runner with test cases, Java 11
Vaadin 24 UI with SpringBoot, Java 17
Build with Maven
$ git clone https://github.com/saw-leipzig/fcs-endpoint-validator.git && cd fcs-endpoint-validator
$ mvn clean package install
Deployment with SpringBoot on http://localhost:8080/ (might automatically open a new browser tab)
$ cd fcs-endpoint-validator-ui/
$ mvn spring-boot:run
Download sources:
$ git clone https://github.com/saw-leipzig/fcs-endpoint-validator.git && cd fcs-endpoint-validator
Create docker-compose.yml
deployment description:
version: '3'
services:
fcs-endpoint-validator:
build:
context: .
dockerfile: fcs-endpoint-validator-ui/Dockerfile
container_name: fcs-endpoint-validator
ports:
# default, public 8080 to docker container 8080
- 8080:8080
restart: unless-stopped
Run Docker-Compose deployment:
$ docker compose build
$ docker compose down -v
$ docker compose up -d
Primary FCS client application
Central search interface for users,
“aggregates” FCS search queries to/from distributed endpoints
Deployments:
CLARIN: contentsearch.clarin.eu + (Alpha / Beta instances)
Text+: fcs.text-plus.org (Alpha instance)
Registry of endpoints in Centre Registry + side loading
Deployment instructions found in the repo in DEPLOYMENT.md
Build application (native)
$ git clone https://github.com/clarin-eric/fcs-sru-aggregator.git && cd fcs-sru-aggregator
$ ./build.sh --jar
Configuration (endpoint sideloading + logging) in aggregator_devel.yml
(aggregator.yml
for production deployment)
aggregatorParams
→ additionalFCSEndpoints
logging
→ loggers
Running on http://localhost:4019/
$ ./build.sh --run
Create Docker Image
$ git clone https://github.com/clarin-eric/fcs-sru-aggregator.git && cd fcs-sru-aggregator
$ docker build --tag=fcs-aggregator .
Run Docker Container
$ touch fcsAggregatorResources.json fcsAggregatorResources.backup.json
$ docker run -d --restart unless-stopped \
-p 4019:4019 -p 5005:5005 \
-v $(pwd)/aggregator.yml:/work/aggregator.yml:ro \
-v $(pwd)/fcsAggregatorResources.json:/var/lib/aggregator/fcsAggregatorResources.json \
-v $(pwd)/fcsAggregatorResources.backup.json:/var/lib/aggregator/fcsAggregatorResources.backup.json \
fcs-aggregator
Reference endpoint for Korp corpus search engine
Example → Korp-API publicly accessible, no further configuration required for testing
Code:
Java: github.com/clarin-eric/fcs-korp-endpoint
Python: github.com/Querela/fcs-korp-endpoint-python
Deployment(s):
Språkbanken (Göteborg): https://spraakbanken.gu.se/ws/fcs/2.0/endpoint/korp/sru
CLARIN-DK-UCPH (Copenhagen S): https://alf.hum.ku.dk/korp/fcs/2.0/endpoint/sru
…
Build Application
$ git clone https://github.com/clarin-eric/fcs-korp-endpoint.git && cd fcs-korp-endpoint
$ mvn clean compile war:war
Deployment then with Jetty/Tomcat etc. analogous to the FCS Endpoint Tester
Prepare Deployment
$ git clone https://github.com/Querela/fcs-korp-endpoint-python.git && cd fcs-korp-endpoint-python
$ python3 -m venv venv && source venv/bin/activate
$ python3 -m pip install -e .
Test Deployment (http://localhost:8080)
$ python3 -m korp_endpoint
Production deployment with Docker (http://localhost:5000)
$ docker build --progress=plain -t korpy .
$ docker run --rm -it -p 5000:5000 korpy
When using Docker and localhost
, network configurations may need to be adjusted so that the Docker container has access to the host
→ host.docker.internal
Comprehensive collections of FCS related links:
github.com/clarin-eric/awesome-fcs,
gitlab.gwdg.de/textplus/ag-fcs-documents/-/blob/main/awesome-fcs.md
CLARIN overview page:
www.clarin.eu/content/federated-content-search-clarin-fcs-technical-details
CLARIN Code Github/Gitlab:
github.com/clarin-eric/?q=fcs,
gitlab.com/CLARIN-ERIC/?filter=fcs,
github.com/clarin-eric/fcs-misc/ (specs, docs, etc.) → overview page
As part of Text+ → see Zotero.org tagged “FCS”
Listing in Text+ Awesome FCS list
In the context of CLARIN? → see Zotero.org “Federated Content Search” group
CLARIN Federated Content Search (CLARIN-FCS) – Core Specification, 2014, Oliver Schonefeld et al.
Federated Search: Towards a Common Search Infrastructure, 2012, Herman Stehouwer et al.
Several workshops