Introducing the Federated Content Search (FCS)
-
Description, History & Glossary
What is the FCS?
-
“Federated Content Search” at CLARIN
In short: Content Search over Distributed Resources
Also: Federated “Corpus Query Platform”
-
Search for patterns in distributed text collections
-
No central index!
-
Text resources include annotated corpora, full-texts etc.
-
FCS = interface specification, search infrastructure and software ecosystem
-
Usage of established standards and extensibility!
What is included in the FCS?
Interface Specification
-
Description of search protocol (query languages, formats and communication channels)
“for homogeneous access to heterogeneous search engines”
-
RESTful protocol
Search Infrastructure in CLARIN and Text+
-
Central client (search result “Aggregator” and web portal)
-
Decentralized endpoints at the data centers (local search engines on resources)
Software Ecosystem primarily in Java
-
Libraries (Java, Python, …)
-
Tools (Validator, Aggregator, Registry)
Requirements for participation in the FCS
-
(Own) text resources
-
“Search engine” on those text resources
-
Minimum: full-text search
-
-
Deployment of publicly accessible FCS endpoint(s)
Pros and Cons for the FCS (as Infrastructure)
Pros
-
Integration of many resources, linking and comparison of results
-
Integration with other tools (Weblicht, Registry/VLO, Switchboard, …)
-
Same queries, formats, result presentation
-
No duplicate data storage, no inconsistency
Cons
-
No control over resources
-
No deterministic results (e.g. links for publications)
-
No global ranking of results possible
Pros and Cons for FCS Endpoints (Operators)
Pros
-
Control over resources and search (ranking, fuzzy, …)
-
No duplication of data in a central index
-
Increased visibility in a larger resource catalog
Cons
-
Deployment of (additional) endpoint necessary
Comparison of FCS with Central Index
| | FCS | Central Index |
|---|---|---|
| Data | ➕ At the endpoints | ➖ Duplicate data storage, possible inconsistency (age, updates); legally no transfer may be possible |
| Updates to Data | ➕ Endpoints can react quickly | ➖ Difficult, e.g. removal of resources in the event of legal problems; updates entail longer delays, if at all possible |
| Global Ranking | ➖ Very difficult/impossible | ➕ Quite possible (?), probably implicit assumption and normalization of data for indexing |
| Faceted Search | ➖ Difficult (e.g. via external metadata; not explicitly intended) | ➕ Indexing allows clustering/classification according to topics and categories |
History
-
~ 2011 Started as Working Group in CLARIN
-
May 2011 EDC/FCS Workshop
-
~ 2011–2013 Initial version, now named FCS “Legacy”
-
SRU Scan for resources, BASIC Search (CQL/full-text), KWIC
-
-
April 2013 FCS Workshop
-
~ 2013/2014 Code and Spec for FCS Core 1.0
-
fcs-simple-endpoint:1.0.0, sru-server:1.5.0
-
BASIC Hits Data View, SRU Scan operation not used anymore
-
-
much has disappeared into the annals of history …
-
https://github.com/clarin-eric/fcs-misc/tree/main/historical/documents
-
https://trac.clarin.eu/wiki/FCS/Specification?action=history
-
https://trac.clarin.eu/wiki/Taskforces/FCS/FCS-Specification-Draft?action=history
-
https://www.clarin.eu/event/2013/federated-content-search-workshop
-
EDC: European Demonstrator Case
-
~ 2015/2016 Starting work on and Code for FCS Core 2.0
-
fcs-simple-endpoint:1.3.0, sru-server:1.8.0
-
Advanced Data Views (FCS-QL), …
-
-
June 2017 Official release of FCS Core 2.0 Spec
-
2022 FCS is focus in Text+ (Findability)
-
2023 New FCS maintainer in CLARIN
-
Migration of Source Code to GitHub.com, updated documentation
-
Python FCS endpoint libraries
-
Updated libraries & tools
-
Prototypes for LexFCS extension
-
-
2024
-
Experiments with Entity Search (extension)
-
Rewrite of FCS Endpoint Validator
-
FCS Architecture
Communication Protocol
SRU (Search/Retrieval via URL) / OASIS searchRetrieve
-
Standardized by Library of Congress (LoC) / OASIS
-
RESTful
-
Explain: Listing of resources
-
Languages, annotations, supported data views and formats etc.
-
-
SearchRetrieve: Search request
-
-
Data as XML
-
Extensions to the protocol explicitly allowed
Basic Assumption on the Data Structure
-
different (optional) annotation layers
| Full-text | The | cyclists | are | fast |
|---|---|---|---|---|
| Part of Speech | DET | NOUN | VERB | ADJ |
| Lemmatisation | The | cyclist | is | fast |
| Phonetic Transcription | … | … | … | … |
| Orthographic Transcription | … | … | … | … |
| […] | | | | |
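The layered view above can be thought of as parallel, token-aligned sequences. The following is a minimal sketch of that idea in Python; the dictionary layout, the layer names and both helper functions are illustrative assumptions, not structures defined by the FCS specification.

```python
# Token-aligned annotation layers, using the example sentence from the table.
# Layer names ("text", "pos", "lemma") are illustrative, not official FCS identifiers.
layers = {
    "text":  ["The", "cyclists", "are", "fast"],
    "pos":   ["DET", "NOUN", "VERB", "ADJ"],
    "lemma": ["The", "cyclist", "is", "fast"],
}


def tokens_matching(layers, layer, value):
    """Return the token positions whose annotation on `layer` equals `value`."""
    return [i for i, v in enumerate(layers[layer]) if v == value]


def token_at(layers, position):
    """Collect all available annotations for one token position."""
    return {name: values[position] for name, values in layers.items()}


print(tokens_matching(layers, "pos", "NOUN"))
print(token_at(layers, 1))
```

Because every layer has one value per token, a search on one layer (here the part of speech) yields positions that can be looked up in any other layer.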
Explain: Resource Discovery
Explain: Resource Discovery (2)
Explain: Resource Structure
Query Language FCS-QL
-
Based on CQP
-
Supports various annotation layers
Visualization of Results
Visualization of Results (2)
Visualization of Results (3)
Current state of the FCS
-
Current version of the specification: FCS Core 2.0
-
Poster at Bazaar @ CLARIN2023 on the current status
-
😎 “Awesome FCS” List: github.com/clarin-eric/awesome-fcs with relevant links to specs, tools, libraries, implementations and much more
-
Additions by Text+ (e.g. on LexFCS/LexCQL/Forks/Software): gitlab.gwdg.de/textplus/ag-fcs-documents/-/blob/main/awesome-fcs.md
-
-
CLARIN specifications: github.com/clarin-eric/fcs-misc
-
Small ecosystem (Code on Github/Gitlab)
-
Software libraries (SRU/FCS, endpoint + client, Java/Python)
-
Aggregator (Code: Github, Text+ Fork)
-
Online Validator for endpoints (fcsvalidator, Code: Github (old), Github (new))
-
-
Endpoint Registry: centres.clarin.eu/fcs
Current Work
-
Lexical Resources extension
-
First specification and implementation in Text+
-
Official extension of CLARIN → ~2024 Working Plan
-
-
AAI integration
-
Specification and implementation
-
Goal: Support access-restricted resources
-
Securing the aggregator via Shibboleth → Passing on AAI attributes to endpoints
-
Preliminary work from CLARIAH-DE, part of the Text+ work plan (IDS Mannheim, Uni/SAW Leipzig, preliminary work BBAW)
-
-
Syntactic Search
-
Entity Search
-
Optional metadata for each result
Current status regarding Lexical Resources
-
CLARIN-EU Taskforce
-
CLARIN ERIC working plan: “extending the protocol to cover additional data types (e.g. lexica) will be explored”
-
on the CLARIN 2024 Working Plan
-
-
Interest expressed from various countries
-
Preliminary work: “RABA” (Estonia): e.g. “Eesti Wordnet”
-
First specification and implementation in Text+
-
Specification on Zenodo: zenodo.org/records/7849754
-
Presentation at eLex 2023: “A Federated Search and Retrieval Platform for Lexical Resources in Text+ and CLARIN”
-
Aggregator: fcs.text-plus.org/?queryType=lex
-
Current status of participants
CLARIN (contentsearch.clarin.eu, Registry)
-
209 Resources (94 in Advanced)
in 61 Languages
from 20 Institutions in 12 Countries
Text+ (fcs.text-plus.org)
-
53 Resources (17 in Advanced, 30 in Lexical)
in 6 Languages
from 9 Institutions in Germany
Integration in FCS Infrastructure
CLARIN
-
Alpha/Beta using Side-Loading in Aggregator
-
Stable/Long-Term: Entry in Centre Registry
-
CLARIN account + registration form as a Centre
-
Including monitoring etc.
-
Text+
-
Side-Loading in Aggregator
-
WIP: Registry (index of endpoints)
Alternative Ways of using FCS
-
Development of an alternative aggregator frontend as Web Component
-
Code: Vue.js Store + Vuetify Component (Dialog); Demo
-
Use of the Aggregator API
-
Restriction to subset of resources, e.g. for integration on own website
-
Faceting, alternative visualization
-
Bootstrapping Endpoint Development
-
Java: Maven Archetype github.com/clarin-eric/fcs-endpoint-archetype
-
Java & Python (reference implementation Korp):
-
😎 “Awesome FCS” List: github.com/clarin-eric/awesome-fcs
-
List of reference implementations, endpoints, query parsers
-
Code for FCS SRU Aggregator and SRU/FCS Validator
-
Guide to Endpoint Development
-
Important preliminary questions
-
Existing implementations, resources for new development
-
Prerequisites
Development Decisions
❓ Can I host the endpoint myself?
❗ No → HelpDesk: CLARIN, Text+
❓ What type of data do I have?
❗ Raw text, Vertical/CONLL, TEI, …
❓ Which search engine do I use / can I use?
❗ KorAP, Korp/CWB, Lucene/Solr/ElasticSearch, BlackLab, (No)SketchEngine, …
❓ Customization or new development?
❗ List of existing endpoint implementations (Awesome List)
❓ Programming language?
❗ Java, Python, (PHP, XQuery)
❓ In-house development: Use of the reference libraries (Java, Python)
❗ Maven Archetype, Korp
❗ SRU + FCS specifications …
Endpoint Implementations
-
Korp FCS 2.0 - reference implementation, Korp corpus search
-
CQP/SRU bridge - Corpus Workbench (CWB)
-
KonText, fcs-noske-endpoint - (No)SketchEngine (CONLL/Vertical)
-
oclcsrw - SRW/SRU server for DSpace, Lucene and/or Pears/Newton
-
corpus_shell, SADE - MySQL PHP/DDC Perl, eXist/XQuery
-
arche-fcs - ARCHE Suite, php
-
Blacklab / MTAS - corpus search engines using Lucene/Solr
-
KorapSRU - KorAP (IDS)
Sources: clarin, awesome-fcs
New Endpoint Development
-
Customization of reference implementation (Korp)
-
Development using CLARIN SRU/FCS libraries
-
Docs:
-
“New” development specifications (for other languages)
-
FCS: github.com/clarin-eric/fcs-misc → “FCS Core 2.0”
-
Awesome List: github.com/clarin-eric/awesome-fcs
Prerequisite for local search engine
❗ Full-text search
❓ With Hit markers
❓ Corpus search (segmented text with annotations)
❕ Pagination (total number of hits)
❗ Resource PID
❓ Linking to result pages
Fundamentals
-
SRU (Overview, APD/Models, Request Parameter, Diagnostics, …)
-
FCS (Discovery, Endpoint Description, Search, SRU Parameter, Diagnostics)
-
FCS Notes (Versions, Compatibility, Aggregator)
Disclaimer
Main focus on:
-
Version: FCS Core 2.0; maximum compatibility with FCS Core 1.0
-
SRU Server, FCS endpoint; not FCS client application development
-
Using the reference libraries
→ Java and Python
-
Possible (re-)use of existing endpoints
No:
-
Working through the specification; only the essential information
-
New or redevelopment of SRU/FCS protocols, libraries etc.
(e.g. in other languages)
SRU – History
SRU: Search/Retrieve via URL → LOC
searchRetrieve Version 1.0 – OASIS Standard
-
Part 0. Overview Version 1.0
-
Part 1. Abstract Protocol Definition Version 1.0
-
Part 2. SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0
-
Part 3. SRU searchRetrieve Operation: APD Binding for SRU 2.0 Version 1.0
-
Part 4. APD Binding for OpenSearch 1.0 version 1.0
-
Part 5. CQL: The Contextual Query Language version 1.0
-
Part 6. SRU Scan Operation version 1.0
-
Part 7. SRU Explain Operation version 1.0
-
grayed out: not relevant for us
-
crossed out: plays no role at all for the FCS
searchRetrieve: Part 0. – Overview Version 1.0
SRU (SRU: Search/Retrieve via URL) is a web service protocol supported over both SOAP and REST for client-server based search. SRU1.x was developed as a web service replacement for the NISO Z39.50 protocol. SRU2.0 is a revision to SRU which as well as including many enhancements to SRU1.2 was developed alongside the APD.
For the SRU protocol model, three operations are defined as part of its Processing Model:
-
SearchRetrieve Operation. The actual SearchRetrieve operation defined by the SRU protocol; A SearchRetrieve operation consists of a SearchRetrieve request from client to server followed by a SearchRetrieve response from server to client.
-
Scan Operation. Similar to SRU, the Scan protocol defines a request message and a response message for iterating through available search terms. A Scan operation consists of a Scan request followed by a Scan response.
-
Explain Operation. Every SRU or scan server provides an associated Explain document as part of its Description and Discovery Model, providing information about the server’s capabilities. A client may retrieve this document and use the information to self-configure and provide an appropriate interface to the user. When a client retrieves an Explain document, this constitutes an Explain operation.
-
SRW = search/retrieve for the web
searchRetrieve – APD and Bindings
-
Abstract Protocol Definition (APD) for “searchRetrieve operation”
-
Model for SearchRetrieve Operation
-
Describes Capabilities and General Characteristics of a Server or Search Engine, as well as how access should take place
-
Defines abstract Request parameters and Response elements
-
-
Binding
-
Describes corresponding names of the parameters and elements
-
static (for human), dynamic (for machine), …
-
Bindings: SRU 1.2, SRU 2.0, (OpenSearch)
-
Examples: “startPosition” (APD) → “startRecord” (SRU 2.0)
“recordPacking” (SRU 1.2) → “recordXMLEscaping” (SRU 2.0)
-
searchRetrieve – APD Abstract Models
Data Model
Description of the data on which the search is to be executed
Query Model
Description of the construction of search queries
Processing Model
Description of how query is sent from client to server
Result Set Model
Structure of the results of a search
Diagnostics Model
Description of how errors are communicated from the server to the client
Description and Discovery Model
Description, for the discovery of the “Search Service”, self-description of the functionality of the service
SRU 2.0 – Operation Model
-
SRU Request (Client → Server) with Response (Server → Client)
-
Operations
-
SearchRetrieve
-
Scan
-
Explain
-
SRU 2.0 – Data Model
-
Server = Database for Client for search/retrieval
-
Database = Collection of Units of Data → Abstract Record
-
Abstract Records (or Response Records) in one/multiple formats by server
-
Format (or Item Type) = Record Schema
SRU 2.0 – Protocol Model
-
HTTP GET
-
Parameters encoded as “key=value”
-
UTF-8
-
%-Escaping
-
Separation at “?”, “&”, “=”
-
-
HTTP POST
-
application/x-www-form-urlencoded
-
No character encoding necessary
-
No length restriction
-
-
HTTP SOAP (?)
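As an illustration of the HTTP GET encoding described above, here is a minimal sketch that builds a request URL with Python's standard library. The base URL `https://sru.example.org/fcs` and the helper name `build_sru_url` are hypothetical; the parameters mirror the query example used later in these slides.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, params):
    """Append SRU parameters to a base URL as a percent-encoded query string."""
    # urlencode() joins "key=value" pairs with "&", encodes text as UTF-8
    # and percent-escapes reserved characters such as quotes.
    return base_url + "?" + urlencode(params)


url = build_sru_url(
    "https://sru.example.org/fcs",  # hypothetical endpoint, not a real service
    {"operation": "searchRetrieve", "queryType": "cql", "query": '"användning"'},
)
print(url)
```

The quoted search term comes out as `%22anv%C3%A4ndning%22`, i.e. the UTF-8 bytes of each non-ASCII or reserved character are percent-escaped.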
SRU 2.0 – Processing Model
-
“Request processing on the server”
-
Request
-
Number of records
-
Identifier for Record Schema (→ Records in Response)
-
Identifier for Response Schema (→ whole Response)
-
-
Response
-
Records in Result Set
-
Diagnostic Information
-
Result Set Identifier for requests for further results
-
SRU 2.0 – Query Model
-
Any “appropriate query language” can be used
-
Mandatory support of
“Contextual Query Language” (CQL)
SRU 2.0 – Parameter Model
-
Use of Parameters, some predefined by SRU 2.0
-
Parameters not defined in the protocol are also permitted
-
Parameter “query”
-
included in every request in some manner (“query” or parameters not defined in the protocol)
-
Query with “queryType” (default “cql”)
-
SRU 2.0 – Result Set Model
-
Logical model → “Result Sets” are not mandatory
-
Query → Selection of suitable Records
-
Ordered list, non-modifiable set after creation
-
Sorting/order determined by server
-
-
for Client:
-
Set of abstract Records, counting starts with 1
-
Each record can be requested in its own format
-
Individual records can “disappear”, no reordering in the Result Set by the Server, but Diagnostic to inform
-
SRU 2.0 – Diagnostic Model
-
fatal
-
Execution of the query cannot be completed
-
e.g. invalid query
-
-
non-fatal
-
Processing impaired, but request can be completed
-
e.g. individual records are not available in the requested schema, server only sends the available ones and informs about the rest
-
surrogate
-
For single Records
-
-
non-surrogate
-
All records are available, but something went wrong, e.g. sorting
-
Or simply a warning
-
-
SRU 2.0 – Explain Model
-
Must be available for HTTP GET via the base URL of the SRU server
-
→ Server Capabilities
-
In the client for self-configuration and to provide the corresponding user interface
-
Details on supported Query Types, CQL Context Sets, Diagnostic Sets, Records Schemas, Sorting options, defaults, …
SRU 2.0 – Serialization Model
-
No restriction on the serialization of responses
(for the entire message or single records)
-
Non-XML serialization is allowed
searchRetrieve 2.0 – Request Parameter
-
All parameters are optional, non-repeatable
-
query, startRecord, maximumRecords, recordXMLEscaping, recordSchema, resultSetTTL, stylesheet; Extension parameters
-
New in 2.0: queryType, sortKeys, renderedBy, httpAccept, responseType, recordPacking; Facet Parameters
searchRetrieve 2.0 – Response Elements
-
All elements are optional, non-repeatable by default
-
numberOfRecords, resultSetId, records, nextRecordPosition, echoedSearchRetrieveRequest, diagnostics, extraResponseDataⓇ
-
New in 2.0: resultSetTTL, resultCountPrecisionⓇ, facetedResultsⓇ, searchResultAnalysisⓇ
(Ⓡ = repeatable)
searchRetrieve 2.0 – Query
-
query (Parameter)
-
Query
-
Mandatory if queryType is not specified
-
-
queryType (Parameter, SRU 2.0)
-
Optional, by default “cql”
-
Query Types must be listed in the Explain, with a URL to the definition and the abbreviation to use
-
Reserved
-
cql
-
searchTerms (processing is left to the server, < SRU 2.0)
-
-
searchRetrieve 2.0 – Query (Examples)
-
spraakbanken.gu.se/…/sru?query=cat
(default, FCS 2.0, SRU 2.0)
-
spraakbanken.gu.se/…/sru?operation=searchRetrieve&version=1.2&query=cat
(FCS 1.0, SRU 1.2)
-
spraakbanken.gu.se/…/sru?operation=searchRetrieve&queryType=cql&query=%22anv%C3%A4ndning%22
(FCS 2.0, SRU 2.0)
-
(FCS 2.0 with FCS-QL query)
searchRetrieve 2.0 – Pagination
-
Query for the result range starting at startRecord with at most maximumRecords records
-
startRecord (Parameter)
-
Optional, positive integer, starting with 1
-
-
maximumRecords (Parameter)
-
Optional, non-negative integer
-
Server selects default if not specified
-
Server can respond with fewer records, never more
-
-
Response with total number (numberOfRecords) of records in the Result Set, with offset (nextRecordPosition) to next results
-
numberOfRecords (Element)
-
Number of results in the Result Set
-
If query fails, it must be “0”
-
-
nextRecordPosition (Element)
-
Position of the next record, if the last record in the response is not the last in the Result Set
-
If no further records, then this element must not appear
-
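The pagination elements above can be read from a response as in the following minimal sketch. The sample XML is a hand-written, heavily trimmed response using the figures from the example slides (9220 results, next position 251); the namespace URI in it is an assumption, and the code deliberately matches elements by local name only. The function name `paging_info` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample; real responses also carry <records> and more.
SAMPLE = """<sru:searchRetrieveResponse
    xmlns:sru="http://docs.oasis-open.org/ns/search-ws/sruResponse">
  <sru:numberOfRecords>9220</sru:numberOfRecords>
  <sru:nextRecordPosition>251</sru:nextRecordPosition>
</sru:searchRetrieveResponse>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def paging_info(xml_text):
    """Return (total number of records, next start position or None)."""
    root = ET.fromstring(xml_text)
    values = {local_name(el.tag): (el.text or "").strip() for el in root}
    total = int(values.get("numberOfRecords", "0"))
    nxt = values.get("nextRecordPosition")
    # A missing nextRecordPosition means there are no further records.
    return total, (int(nxt) if nxt else None)


total, next_start = paging_info(SAMPLE)
print(total, next_start)
```

A client would pass `next_start` as `startRecord` in the following request and stop once it is `None`.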
searchRetrieve 2.0 – Result Set
-
resultSetId (Element)
-
Optional, identifier for the Result Set, for referencing in the subsequent requests
-
-
resultSetTTL (Parameter / Element, Element in SRU 2.0 only)
-
Optional, in seconds
-
In request from Client: time after which the Result Set will no longer be used
-
In response from Server: how long the Result Set is available (“good-faith estimate”, → can be longer or shorter)
-
-
resultCountPrecision (Element, SRU 2.0)
-
URI: “info:srw/vocabulary/resultCountPrecision/1/…”
-
exact / unknown / estimate / maximum / minimum / current
-
searchRetrieve 2.0 – Pagination (Cont.)
-
spraakbanken.gu.se/…/sru?query=cat
→ 9220 results, next results starting from 251
-
spraakbanken.gu.se/…/sru?query=cat&startRecord=300&maximumRecords=10
→ More from 310
-
spraakbanken.gu.se/…/sru?query=cat&startRecord=10000&maximumRecords=10
→ Error, because “out of range”
-
spraakbanken.gu.se/…/sru?query=catsss
→ No results
-
spraakbanken.gu.se/…/sru?query=cat&maximumRecords=100000
→ Restricted to 1000 Records
searchRetrieve 2.0 – Serialization
-
recordXMLEscaping (Parameter, SRU 2.0)
-
If records are serialized as XML, “string” escapes the Records (“<”, “>”, “&”); default is “xml”, which embeds the Records directly in the Response, e.g. for Stylesheets
-
-
recordPacking (Parameter, SRU 2.0)
-
In SRU 1.2 it had the semantics of recordXMLEscaping
-
“packed” (default), Server should deliver Records with requested schema; “unpacked”, Server can determine the location of the application data in the Records itself (?)
-
-
httpAccept (Parameter, SRU 2.0)
-
Schema for Response, default is “application/sru+xml”
-
-
responseType (Parameter)
-
Schema for Response (in combination with httpAccept parameter)
-
-
recordSchema (Parameter)
-
Schema of the Records in Response, e.g. “http://clarin.eu/fcs/resource”
-
Identifier for schema from Explain Response
-
-
records (Element)
-
Contains Records / Surrogate Diagnostics
-
According to default Schema a list of “<record>” elements
-
-
recordSchema with http://clarin.eu/fcs/resource can be used for multiplexing if several SRU functionalities are offered via one endpoint, e.g. also DFG Viewer or similar.
-
stylesheet (Parameter)
-
URL to stylesheet, for display to the user
-
renderedBy (Parameter, SRU 2.0)
-
Where the stylesheet for the Response is rendered
-
“client” (default), URL of stylesheet parameter is simply echoed → “thin client” (in Web Browser)
-
“server”, should transform default SRU response with stylesheet (e.g. for httpAccept with HTML format)
-
-
spraakbanken.gu.se/…/sru?query=cat&recordXMLEscaping=string
→ Possible serialization error in Java library
-
(FCS 1.0, SRU 1.2, like recordXMLEscaping)
-
spraakbanken.gu.se/…/sru?query=cat&recordPacking=unpacked
→ No noticeable change here
-
…
searchRetrieve 2.0 – Unsupported Parameters
-
Sorting (sortKeys) and Faceting not supported
SRU 2.0 – Extensions
-
Extensions possible in
-
Request via Extension Parameter
-
(prefixed with “x-” and namespace identifier, e.g. “x-fcs-”)
-
-
Response in the “<extraResponseData>” Element
-
Response with extraResponseData only if requested in Request with corresponding parameter, never unsolicited
-
Server can ignore the request, no obligation
-
-
Unknown extension parameters are to be ignored
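The rule that unknown extension parameters are ignored could be handled on the server side roughly as follows. This is a sketch under assumptions: the function name `split_extension_params` and the idea of configuring known prefixes are illustrative, not taken from the CLARIN libraries.

```python
def split_extension_params(params, known_prefixes=("x-fcs-",)):
    """Separate recognised extension parameters from unknown ones.

    Only names starting with "x-" are treated as extension parameters;
    unknown ones are returned separately so the caller can ignore them.
    """
    known, ignored = {}, {}
    for name, value in params.items():
        if not name.startswith("x-"):
            continue  # regular SRU parameter, not an extension
        target = known if name.startswith(known_prefixes) else ignored
        target[name] = value
    return known, ignored


known, ignored = split_extension_params(
    {"query": "cat", "x-fcs-context": "hdl:10794/suc", "x-other-flag": "1"}
)
print(known, ignored)
```

Here `x-other-flag` ends up in `ignored`, so the request is still processed instead of being rejected.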
SRU 2.0 – Backwards Compatibility
-
Parameters “operation” and “version” only in SRU 1.1/SRU 1.2, removed in SRU 2.0 → Assumption of a separate endpoint for each SRU version
-
Heuristic for detecting the operation (no “operation” parameter in SRU 2.0)
-
searchRetrieve = query or queryType parameter
-
scan = scanClause parameter
-
explain (otherwise)
-
-
Interoperability with older versions:
-
Use of operation/version parameters → SRU < 2.0
-
Caution with parameters with changed semantics, especially recordPacking
-
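The heuristic above can be written down as a small dispatch function. This is a sketch, not the logic of the reference libraries; the function names are hypothetical, and falling back to `explain` when no distinguishing parameter is present is how the list above is interpreted here.

```python
def detect_operation(params):
    """Guess the SRU operation from the request parameters."""
    if "operation" in params:  # explicit parameter, SRU 1.1/1.2 style
        return params["operation"]
    if "query" in params or "queryType" in params:
        return "searchRetrieve"
    if "scanClause" in params:
        return "scan"
    return "explain"  # assumed fallback: nothing else matched


def is_legacy_request(params):
    """True if the request uses parameters that only exist before SRU 2.0."""
    return "operation" in params or "version" in params


print(detect_operation({"query": "cat"}))
print(is_legacy_request({"operation": "searchRetrieve", "version": "1.2", "query": "cat"}))
```

An endpoint that serves both SRU 1.2 and SRU 2.0 from one URL would call `is_legacy_request` first and then interpret parameters such as `recordPacking` according to the detected version.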
SRU 2.0 – Diagnostics
-
“Error handling”
-
Difference between (non-)fatal, (non-)surrogate → SRU 2.0 – Diagnostic Model
-
Schema: info:srw/schema/1/diagnostics-v1.1
Prefix: info:srw/diagnostic/1/
-
uri (ID), details (additional information, depending on Diagnostic), message
-
-
Information:
-
General information and notes (LOC, OASIS SRU 2.0)
-
List of Diagnostics (LOC, OASIS SRU 2.0)
-
-
Categories: General (1-9), CQL (10-49), Result Sets (50-60), Records (61-74), Sorting (80-96), Explain (100-102), Stylesheets (110-111), Scan (120-121)
-
Not limited to this list only, custom diagnostics possible
| No. | Description | Details |
|---|---|---|
| 1 | General system error | Debugging information (traceback) |
| 2 | System temporarily unavailable | |
| 3 | Authentication error | |
| 4 | Unsupported operation | |
| 5 | Unsupported version | Highest version supported |
| 6 | Unsupported parameter value | Name of parameter |
| 7 | Mandatory parameter not supplied | Name of missing parameter |
| 8 | Unsupported parameter | Name of the unsupported parameter |
| 9 | Unsupported combination of parameters | |
| 10 | Query syntax error | |
| 23 | Too many characters in term | Length of longest term |
| 26 | Non special character escaped in term | Character incorrectly escaped |
| 35 | Term contains only stopwords | Value |
| 37 | Unsupported boolean operator | Value |
| 38 | Too many boolean operators in query | Maximum number supported |
| 47 | Cannot process query; reason unknown | |
| 48 | Query feature unsupported | Feature |
| 60 | Result set not created: too many matching records | Maximum number |
| 61 | First record position out of range | |
| 64 | Record temporarily unavailable | |
| 65 | Record does not exist | |
| 66 | Unknown schema for retrieval | Schema URI or short name |
| 67 | Record not available in this schema | Schema URI or short name |
| 68 | Not authorized to send record | |
| 69 | Not authorized to send record in this schema | |
| 70 | Record too large to send | Maximum record size |
| 71 | Unsupported recordXMLEscaping value | |
| 80 | Sort not supported | |
| 110 | Stylesheets not supported | |
| 111 | Unsupported stylesheet | URL of stylesheet |
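To make the numbering scheme concrete, the following sketch maps a diagnostic number to its category and builds the three fields (`uri`, `details`, `message`) named above. The ranges are copied from the category list in these slides; the function names, the dictionary representation and the `"custom"` label for numbers outside the listed ranges are assumptions for illustration.

```python
SRU_DIAGNOSTIC_PREFIX = "info:srw/diagnostic/1/"

# Number ranges per category, as listed on the slide (upper bounds inclusive there).
CATEGORIES = [
    (range(1, 10), "General"),
    (range(10, 50), "CQL"),
    (range(50, 61), "Result Sets"),
    (range(61, 75), "Records"),
    (range(80, 97), "Sorting"),
    (range(100, 103), "Explain"),
    (range(110, 112), "Stylesheets"),
    (range(120, 122), "Scan"),
]


def diagnostic_category(code):
    """Return the category name for a diagnostic number."""
    for codes, name in CATEGORIES:
        if code in codes:
            return name
    return "custom"  # assumed label for numbers outside the listed ranges


def make_diagnostic(code, message, details=None):
    """Build the uri/details/message triple as a plain dictionary."""
    return {"uri": f"{SRU_DIAGNOSTIC_PREFIX}{code}", "details": details, "message": message}


print(diagnostic_category(61))
print(make_diagnostic(7, "Mandatory parameter not supplied", "query"))
```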
FCS Interface Specification
-
FCS = description of capabilities, extensions according to SRU, and operations → use of SRU/CQL and extensions according to SRU
-
Interface specification = formats and transport protocol
-
Endpoint = bridge between client (FCS formats) and local search engine
-
Client = user interface, query input and result presentation
-
-
Discovery and Search mechanism
FCS – Discovery
-
SRU Explain
-
Help and information for the client on accessing, requesting and processing results from the server
-
-
Information about endpoint
-
Capabilities: Basic Search, Advanced Search?
-
Resources for search
→ Endpoint Description (XML) via explain SRU Operation
-
-
FCS 2.0 §3 CLARIN-FCS to SRU/CQL binding
FCS – Endpoint Description
-
XML according to the schema Endpoint-Description.xsd
-
<ed:EndpointDescription>
-
@version with “2”
-
<ed:Capabilities> (1)
-
<ed:SupportedDataViews> (1)
-
<ed:SupportedLayers> (1) (if Advanced Search Capability)
-
<ed:Resources> (1)
-
-
<ed:Capability>
-
Content: Capability Identifier, URI
-
http://clarin.eu/fcs/capability/basic-search
-
http://clarin.eu/fcs/capability/advanced-search
-
-
-
<ed:SupportedDataView>
-
Content: MIME type, e.g. application/x-clarin-fcs-hits+xml
-
@id → for referencing in <ed:Resource>
-
@delivery-policy: send-by-default / need-to-request
-
No duplicates (based on MIME type) allowed
-
-
<ed:SupportedLayer>
-
(only for Advanced Search)
-
Content: Layer Identifier, e.g. “orth”
-
@id → for referencing in <ed:Resource>
-
@result-id → Referencing the layer in the Advanced Data View
-
@qualifier → Identifier in FCS-QL Search Term for the layer
-
@alt-value-info, @alt-value-info-uri: short description of the layer, e.g. for tagset, + URL with further information
-
No duplicates allowed based on @result-id
-
-
<ed:Resource>
-
@pid: Persistent Identifier (e.g. MdSelfLink from CMDI Record)
-
<ed:Title> (1+) with @xml:lang, no duplicates, English required
-
<ed:Description> (0+) with @xml:lang, English required, should be at most 1 sentence
-
<ed:Institution> (0+) with @xml:lang, English required
-
<ed:LandingPageURI> (0/1) – link to the website of the resource (or institution) with more information
-
<ed:Languages> (1) with <ed:Language> content according to ISO 639-3 language codes
-
<ed:AvailableDataViews> (1) with @ref = list of IDs of the <ed:SupportedDataView> elements, e.g. “hits adv”
-
<ed:AvailableLayers> (1) (if Advanced Search Capability), with @ref = list of IDs of the <ed:SupportedLayer> elements, e.g. “word lemma pos”
-
<ed:Resources> (0/1) for sub resources
-
For <ed:AvailableDataViews> and <ed:AvailableLayers>, sub-resources should support the same lists; a new declaration is still required
-
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="cmdi" delivery-policy="need-to-request">application/x-cmdi+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- top-level resource 1 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
    <!-- top-level resource 2 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0816">
      <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen</ed:Title>
      <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus</ed:Title>
      <ed:LandingPageURI>http://repos.example.org/corpus2.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits cmdi" />
      <ed:Resources>
        <!-- sub-resource 1 of top-level resource 2 -->
        <ed:Resource pid="http://hdl.handle.net/4711/0816-1">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (vor 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (before 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub1</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits cmdi" />
        </ed:Resource>
        <!-- sub-resource 2 of top-level resource 2 ... -->
      </ed:Resources>
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
    <ed:Capability>http://clarin.eu/fcs/capability/advanced-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="adv" delivery-policy="send-by-default">application/x-clarin-fcs-adv+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:SupportedLayers>
    <ed:SupportedLayer id="word" result-id="http://spraakbanken.gu.se/ns/fcs/layer/word">text</ed:SupportedLayer>
    <ed:SupportedLayer id="orth" result-id="http://endpoint.example.org/Layers/orth" type="empty">orth</ed:SupportedLayer>
    <ed:SupportedLayer id="lemma" result-id="http://spraakbanken.gu.se/ns/fcs/layer/lemma">lemma</ed:SupportedLayer>
    <ed:SupportedLayer id="pos" result-id="http://spraakbanken.gu.se/ns/fcs/layer/pos"
                       alt-value-info="SUC tagset"
                       alt-value-info-uri="https://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf"
                       qualifier="suc">pos</ed:SupportedLayer>
    <ed:SupportedLayer id="pos2" result-id="http://spraakbanken.gu.se/ns/fcs/layer/pos2"
                       alt-value-info="2nd tagset"
                       qualifier="t2">pos</ed:SupportedLayer>
  </ed:SupportedLayers>
  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="hdl:10794/suc">
      <ed:Title xml:lang="sv">SUC-korpusen</ed:Title>
      <ed:Title xml:lang="en">The SUC corpus</ed:Title>
      <ed:Description xml:lang="sv">Stockholm-Umeå-korpusen hos Språkbanken.</ed:Description>
      <ed:Description xml:lang="en">The Stockholm-Umeå corpus at Språkbanken.</ed:Description>
      <ed:LandingPageURI>https://spraakbanken.gu.se/resurser/suc</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>swe</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits adv" />
      <ed:AvailableLayers ref="word lemma pos pos2" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
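A client might read such an Endpoint Description roughly as in the sketch below, which lists every resource (including sub-resources) with its PID, English title and available Data Views. The embedded document is a trimmed excerpt of the first example and is not schema-complete; the function name `list_resources` is hypothetical and only Python's standard `xml.etree.ElementTree` is used.

```python
import xml.etree.ElementTree as ET

ED = "http://clarin.eu/fcs/endpoint-description"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed excerpt for illustration only (Capabilities etc. are left out).
EXAMPLE = """<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Resources>
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>"""


def list_resources(xml_text):
    """Return (pid, English title, data view ids) for all resources, nested ones included."""
    root = ET.fromstring(xml_text)
    found = []
    for res in root.iter(f"{{{ED}}}Resource"):
        # findall() without a path prefix only looks at direct children,
        # so titles of sub-resources are not mixed into the parent.
        titles = {t.get(XML_LANG): t.text for t in res.findall(f"{{{ED}}}Title")}
        views = res.find(f"{{{ED}}}AvailableDataViews")
        refs = views.get("ref", "").split() if views is not None else []
        found.append((res.get("pid"), titles.get("en"), refs))
    return found


for pid, title, views in list_resources(EXAMPLE):
    print(pid, title, views)
```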
FCS – Search
-
SRU SearchRetrieve
-
Actual “Search”
-
Basic Search with CQL
-
Advanced Search with FCS-QL
-
-
Search results are serialized in Resource (Fragment) and in Data View formats
-
Implementation details → Chapter Resources and Data Views
FCS – SRU Extension Parameter
-
x-fcs-endpoint-description (explain)
-
“true”: <sru:extraResponseData> of the Explain Response contains the Endpoint Description document
-
-
x-fcs-context (searchRetrieve)
-
Comma-separated list of PIDs
-
Restrict the search to resources identified by these PIDs
-
-
x-fcs-dataviews (searchRetrieve)
-
Comma-separated list of Data View identifiers
-
Endpoints should also deliver these need-to-request Data Views if requested
-
-
x-fcs-rewrites-allowed (searchRetrieve)
-
“true”: Endpoint can simplify query for higher recall
-
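A minimal client-side sketch of how the extension parameters above could be attached to a searchRetrieve request. The endpoint URL and the helper name build_search_url are hypothetical, and only Python's standard library is used; this is not the Aggregator's actual implementation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base_url, query, query_type="cql",
                     pids=(), dataviews=(), rewrites_allowed=False):
    """Assemble a searchRetrieve URL including the FCS extension parameters."""
    params = {
        "operation": "searchRetrieve",
        "queryType": query_type,
        "query": query,
    }
    if pids:
        # x-fcs-context: comma-separated list of resource PIDs
        params["x-fcs-context"] = ",".join(pids)
    if dataviews:
        # x-fcs-dataviews: comma-separated list of Data View identifiers
        params["x-fcs-dataviews"] = ",".join(dataviews)
    if rewrites_allowed:
        params["x-fcs-rewrites-allowed"] = "true"
    # urlencode percent-encodes the query string values
    return base_url + "?" + urlencode(params)


url = build_search_url(
    "https://endpoint.example.org/sru",  # hypothetical endpoint address
    '[word = "cat"]',
    query_type="fcs",
    pids=["hdl:10794/suc"],
    dataviews=["cmdi"],
    rewrites_allowed=True,
)
print(url)
print(parse_qs(urlsplit(url).query)["x-fcs-context"])
```

Parameters that are not set are simply left out, so a plain Basic Search request stays a regular SRU request.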
FCS – Diagnostics
-
Complements to the SRU Diagnostics → SRU 2.0 – Diagnostics
-
Prefix:
http://clarin.eu/fcs/diagnostic/
-
Refer to the Extra Request Parameters
Identifier URI | Description | Impact |
---|---|---|
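http://clarin.eu/fcs/diagnostic/1 | Persistent identifier passed by the Client for restricting the search is invalid. | non-fatal |
http://clarin.eu/fcs/diagnostic/2 | Resource set too large. Query context automatically adjusted. | non-fatal |
http://clarin.eu/fcs/diagnostic/3 | Resource set too large. Cannot perform Query. | fatal |
http://clarin.eu/fcs/diagnostic/4 | Requested Data View not valid for this resource. | non-fatal |
http://clarin.eu/fcs/diagnostic/10 | General query syntax error. | fatal |
http://clarin.eu/fcs/diagnostic/11 | Query too complex. Cannot perform Query. | fatal |
http://clarin.eu/fcs/diagnostic/12 | Query was rewritten. | non-fatal |
http://clarin.eu/fcs/diagnostic/14 | General processing hint. | non-fatal |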
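The diagnostics can be kept in a small lookup table so that a client or endpoint can decide whether processing may continue. The following is only a sketch: the numeric suffixes follow the FCS Core 2.0 numbering as far as it is known here and should be checked against the specification, and the names FCS_DIAGNOSTICS and is_fatal are made up for this example.

```python
FCS_DIAGNOSTIC_PREFIX = "http://clarin.eu/fcs/diagnostic/"

# number -> (description, fatal?)
# Assumption: numbering as in FCS Core 2.0; verify against the specification.
FCS_DIAGNOSTICS = {
    1: ("Persistent identifier passed by the Client for restricting the search is invalid.", False),
    2: ("Resource set too large. Query context automatically adjusted.", False),
    3: ("Resource set too large. Cannot perform Query.", True),
    4: ("Requested Data View not valid for this resource.", False),
    10: ("General query syntax error.", True),
    11: ("Query too complex. Cannot perform Query.", True),
    12: ("Query was rewritten.", False),
    14: ("General processing hint.", False),
}


def is_fatal(uri):
    """Return True if the given FCS diagnostic URI denotes a fatal diagnostic."""
    if not uri.startswith(FCS_DIAGNOSTIC_PREFIX):
        raise ValueError(f"not an FCS diagnostic URI: {uri}")
    number = int(uri[len(FCS_DIAGNOSTIC_PREFIX):])
    if number not in FCS_DIAGNOSTICS:
        raise ValueError(f"unknown FCS diagnostic number: {number}")
    return FCS_DIAGNOSTICS[number][1]


print(is_fatal("http://clarin.eu/fcs/diagnostic/4"))
print(is_fatal("http://clarin.eu/fcs/diagnostic/10"))
```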
Versions and Backwards Compatibility
-
“Clients MUST be compatible to CLARIN-FCS 1.0” (source)
-
Thus implementation of SRU 1.2 still required (?)
-
Restriction to Basic Search Capability
-
Processing of legacy XML namespaces (SRU Response, Diagnostics)
-
-
Heuristic for version detection (of endpoints)
-
Client: Explain request without version and operation parameters
-
Endpoint: SRU Response <sru:explainResponse>/<sru:version> with default SRU version
-
Versions
-
FCS 2.0 ↔ SRU 2.0
-
FCS 1.0 ↔ SRU 1.2 (SRU 1.1)
-
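A sketch of the detection heuristic described above, seen from the client side: the Explain response is parsed and the namespace of its root element is mapped to an FCS/SRU generation. The two namespace URIs are the ones listed in the examples of this document; the function name detect_version and the returned labels are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Response namespaces as listed in this document
NAMESPACE_TO_FCS = {
    "http://www.loc.gov/zing/srw/": "FCS 1.0 (SRU 1.2)",
    "http://docs.oasis-open.org/ns/search-ws/sruResponse": "FCS 2.0 (SRU 2.0)",
}


def detect_version(explain_xml):
    """Return (label, version text) for an Explain response document."""
    root = ET.fromstring(explain_xml)
    # ElementTree encodes qualified names as "{namespace}localname"
    if not root.tag.startswith("{"):
        return ("unknown", None)
    namespace, _, _local_name = root.tag[1:].partition("}")
    # <version> child in the same namespace as the response root
    version_element = root.find(f"{{{namespace}}}version")
    version_text = version_element.text if version_element is not None else None
    return (NAMESPACE_TO_FCS.get(namespace, "unknown"), version_text)


legacy = (
    '<sru:explainResponse xmlns:sru="http://www.loc.gov/zing/srw/">'
    "<sru:version>1.2</sru:version></sru:explainResponse>"
)
print(detect_version(legacy))
```

A real client would send the Explain request without version and operation parameters first and feed the response body into such a function.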
Notes on FCS SRU Aggregator
-
Currently no (?) support for FCS 2.0 only endpoints
-
For compatibility reasons support of Legacy FCS and FCS 1.0
-
Assumption that endpoints in FCS 2.0 also support earlier FCS Versions… (no issue with CLARIN SRU/FCS libraries)
→ FCS 2.0 only endpoints may therefore still receive FCS 1.0 (SRU 1.2) requests!
-
-
Aggregator sends searchRetrieve requests with only one resource PID in the x-fcs-context parameter for each resource requested
-
i.e. search across N resources of an endpoint → N separate search queries
-
Reference Implementations
-
Java and Python, focus on FCS endpoints
-
Java class hierarchies, organization & structure, processes & lifecycles, configuration
CLARIN Reference Libraries (Java)
-
Development started ~2012
-
Modularized: Client/Server, SRU/FCS, Parser
-
in Java 1.8+ (EOL: end of 2030)
-
Extensive documentation, some tests (proven by being in use for a long time)
-
Artifacts in CLARIN Nexus, Code on Github
-
Server/endpoint: external dependencies to
-
Logging:
slf4j
-
HTTP:
javax.servlet:servlet-api
-
Parser:
antlr4
(FCS-QL) / CQL
-
-
Build: maven
-
Deployment: jetty, tomcat, …
CLARIN Reference Libraries (Python)
-
~ 2022: Translation of Java reference libraries to Python
-
Strong orientation towards the Java reference libraries
→ (almost) identical interfaces, class/function names
-
but: slight optimizations for Python, no 1:1 copy
-
Focus on (new) FCS endpoints → no clients!
-
Typed, documented; published on PyPI
-
Synchronous, minimal WSGI - allows embedding in existing apps
-
Python 3.8+
-
Dependencies to
-
XML parsing:
lxml
-
HTTP/WSGI:
werkzeug
-
Query Parser:
PLY
(CQL), ANTLR4
(FCS-QL)
-
CLARIN Reference Libraries
-
Note: concrete examples and implementations will follow in a later section, high-level overview here
FCS Endpoint – Design and structure
-
Query Parser (CQL, FCS-QL)
-
FCS SRU Server
-
SRU configurations, versions, parameters, diagnostics, namespaces
-
XML SRU Writer
-
Request Parameter parser, SRUServer (request handler)
-
Abstract SRU interfaces (results,
SRUSearchEngine
) -
Auth (Interface, WIP)
-
-
FCS Simple Endpoint
-
FCS configurations (Endpoint Description), parameters, diagnostics, namespaces
-
XML Endpoint Description parser, Record and Data View writer
-
SimpleEndpointSearchEngineBase (
SRUSearchEngine
+ FCS extensions)
-
-
FCS Endpoint for XYZ
-
Implementation of abstract classes and bindings to search engine, query translation
-
Configuration: Endpoint Description, SRU Server Configuration
-
Deployment on Java Servlet Server or as WSGI app
-
FCS Endpoint – Initialization
SRUServerServlet
/ SRUServerApp
(web server)
-
Set default WebApp parameters
-
Parse the SRU Server Config
-
Create
QueryParserRegistry
(CQL) -
Initialize
SRUSearchEngine
-
Create SRUServer (with SearchEngine + configurations)
SRUSearchEngine (user implementation, → SimpleEndpointSearchEngineBase)
-
Further initialization of the
QueryParserRegistry
(FCS-QL) -
do_init
(user init) -
Create Endpoint Description
FCS Endpoint – Communication Flow
[GET] request (incoming)
↳ SRUServerServlet
/ SRUServerApp
(web server)
↳ SRUServer
-
URL parameter evaluation
-
Multiplexing by operation: search/scan/explain
↳ SimpleEndpointSearchEngineBase
(user implementation)
-
Parse search query (CQL/FCS-QL) and send to search engine
-
Wrap result in
SRUSearchResultSet
-
Possible diagnostics etc.
↲
-
optional error handling
-
XML output generation (SRU parameter)
FCS Endpoint – Class Hierarchy
Servlet implementation for servlet container, doGet
handler, setup of SRUServer
wrapper/application executed by the endpoint operator
SRU protocol implementation, handleRequest
, error handling, XML output generation
Specific SRU GET parameter evaluation (parsing, validation; SRU versions) + possible FCS parameters (“x-
…”), SRU version detection
Actual implementation of createEndpointDescription
, do
* methods
Actual implementation, nextRecord
+ writeRecord
iterator and serialization of results
XYZSRUScanResultSet
, XYZSRUExplainResult
do not need to be implemented separately, default behavior is adequate
-
Diagnostic codes
-
Namespaces
-
Python: SRU parameter + values
-
Error handling, message (text description) of the diagnostic
Endpoint Configurations
<?xml version="1.0" encoding="UTF-8"?>
<endpoint-config xmlns="http://www.clarin.eu/sru-server/1.0/">
<databaseInfo>
<title xml:lang="sv">Språkbankens korpusar</title>
<title xml:lang="en" primary="true">The Språkbanken corpora</title>
<description xml:lang="sv">Sök i Språkbankens korpusar.</description>
<description xml:lang="en" primary="true">Search in the Språkbanken corpora.</description>
<author xml:lang="en">Språkbanken (The Swedish Language Bank)</author>
<author xml:lang="sv" primary="true">Språkbanken</author>
</databaseInfo>
<indexInfo>
<set name="fcs" identifier="http://clarin.eu/fcs/resource">
<title xml:lang="sv">Clarins innehållssökning</title>
<title xml:lang="en" primary="true">CLARIN Content Search</title>
</set>
<index search="true" scan="false" sort="false">
<title xml:lang="en" primary="true">Words</title>
<map primary="true">
<name set="fcs">words</name>
</map>
</index>
</indexInfo>
<schemaInfo>
<schema identifier="http://clarin.eu/fcs/resource" name="fcs"
sort="false" retrieve="true">
<title xml:lang="en" primary="true">CLARIN Content Search</title>
</schema>
</schemaInfo>
</endpoint-config>
WebApp Parameter (web.xml or similar) - Korp example
-
SRU Version
-
SRU/FCS configurations
SRU (SRU Server Config) - Korp example →
-
databaseInfo: information about the endpoint, but apparently not evaluated by the client?
-
default: indexInfo + schemaInfo
-
Mandatory: database field in serverInfo!
FCS (Endpoint Description) - Korp example
-
FCS Version (1/2)
-
Capabilities, Layer, DataViews
-
Resources
Resources and Data Views
-
Endpoint Capabilities, BASIC/ADVANCED Search, FCS-QL
-
Resource, Resource Fragment, Data View (Hits, Advanced)
-
Result serialization, query languages
Endpoint Description – Capabilities
http://clarin.eu/fcs/capability/basic-search
-
Mandatory
-
Query: Full-text search (Basic) with minimal CQL (AND/OR)
-
DataView: HITS
http://clarin.eu/fcs/capability/advanced-search
-
Optional
-
Query: FCS-QL (Structured search over annotation layers)
-
DataView: HITS and Advanced
-
Other capabilities possible
→ currently limited to Basic and Advanced Search!
-
Do not only determine search modes!
-
Work in progress:
-
Authentication/authorization
-
Lexical search: …/lex-search → LexCQL, LexHITS
-
Syntactic search?
-
-
Note: according to XSD, capability URIs have the following schema
http://clarin.eu/fcs/capability/\w([\.\-]{0,1}\w)*
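The pattern above can be tried out directly. The sketch below applies it with Python's re module; XSD patterns are implicitly anchored, which is approximated here with fullmatch. XSD and Python differ slightly in what \w covers, so this is an approximation and not a substitute for schema validation. The function name is hypothetical.

```python
import re

# Pattern copied verbatim from the note above
CAPABILITY_PATTERN = re.compile(r"http://clarin.eu/fcs/capability/\w([\.\-]{0,1}\w)*")


def is_valid_capability_uri(uri):
    """Check a capability URI against the pattern (whole string must match)."""
    return CAPABILITY_PATTERN.fullmatch(uri) is not None


for candidate in (
    "http://clarin.eu/fcs/capability/basic-search",
    "http://clarin.eu/fcs/capability/advanced-search",
    "http://clarin.eu/fcs/capability/bad--name",
):
    print(candidate, is_valid_capability_uri(candidate))
```

Separators ("." or "-") are only accepted between word characters, so doubled or trailing separators are rejected.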
BASIC Search
cat
"cat"
cat AND dog
"grumpy cat"
"grumpy cat" AND dog
"grumpy cat" OR "lazy dog"
cat AND (mouse OR "lazy dog")
-
Mandatory!
-
Simple full-text search
-
Contextual Query Language (CQL) as query language
-
Endpoints
-
must support “term-only” queries
-
can support Boolean operators (AND/OR) and sub-queries
-
must abort in case of errors with appropriate diagnostics
-
can decide themselves what to search for
(text, normalization etc.)
-
-
Results serialized in Generic Hits (HITS) Data View
http://clarin.eu/fcs/capability/basic-search
ADVANCED Search
"walking"
[token = "walking"]
"Dog" /c
[word = "Dog" /c]
[pos = "NOUN"]
[pos != "NOUN"]
[lemma = "walk"]
"blaue|grüne" [pos = "NOUN"]
"dogs" []{3,} "cats" within s
[z:pos = "ADJ"]
[z:pos = "ADJ" & q:pos = "ADJ"]
-
Optional
-
Structured search in annotated data,
represented in annotation layers
→ Query language FCS-QL
-
Queries can combine different annotation layers
-
Endpoints should support as many annotation layers as possible
-
-
Results serialized in Advanced (ADV) Data View and Generic Hits (HITS) Data View
http://clarin.eu/fcs/capability/advanced-search
FCS-QL
-
Annotation Layers, containing annotations of a certain type (e.g. text, POS tags, …)
-
Query supports combination of these layers
-
Each layer is segmented → search for individual lemma
-
No requirement as to how segmentation should be done
-
Assumption that segmentation is consistent across layers (for display in Advanced Data View)
-
Queries can combine segments for multi-token patterns
-
FCS-QL – Notes
-
Endpoints must be able to parse FCS-QL completely!
-
Requests with unsupported operators or layers?
-
Generate errors with diagnostics, or
-
Rewrite queries if permitted by “
x-fcs-rewrites-allowed
” (on request)
-
-
Searches are Case Sensitive (configurable in the query)
-
Searches (by endpoints) should take place on layers where it makes sense,
e.g. if there are several text or POS layers
FCS-QL – Layer Types
Layer Type Identifier | Annotation Layer Description | Syntax | Examples (without quotes) |
---|---|---|---|
text | Textual representation of resource, also the layer that is used in Basic Search | String | "Dog", "cat", "walking", "better" |
lemma | Lemmatisation | String | "good", "walk", "dog" |
pos | Part-of-Speech annotations | Universal POS tags | "NOUN", "VERB", "ADJ" |
orth | Orthographic transcription of (mostly) spoken resources | String | "dug", "cat", "wolking" |
norm | Orthographic normalization of (mostly) spoken resources | String | "dog", "cat", "walking", "best" |
phonetic | Phonetic transcription | SAMPA | "'du:", "'vi:-d6 'ha:-b@n" |
-
Universal Dependencies, Universal POS tags v2.0
-
Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems. Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7
FCS-QL – Layer Type Identifier
-
Identifies layers for FCS-QL and Advanced Data View
-
Other identifiers are not allowed, except for testing purposes
-
Custom identifiers must be prefixed with “
x-
”
Result Serialization
-
Results must be serialized in CLARIN FCS format
-
Resource (Fragment), Data View
-
XML → XSD
-
-
Important: 1 Hit = 1 Result Record
-
Do not combine multiple hits in one record
→ generate separate SRU records for each hit that reference the same resource
-
Multiple hit markers are allowed, e.g. for boolean expressions to highlight individual terms
-
Each “Hit” should be defined in a sentence context
-
Resource
-
“searchable and addressable entity” in the endpoint, e.g. text corpus
-
“self contained”, i.e. entire document, not a single sentence from a document
-
Addressable as a whole via Persistent Identifier or URI
Resource Fragment
-
Part of a Resource, e.g. single sentence, or time interval in audio transcription (for multi-modal corpora)
-
Should be addressable within a Resource (offset / ID)
-
Optional, but recommended
Data View
-
Serialization of a “Hit” in a Resource (Fragment)
-
Enables different representations, expandable
Result Serialization – Linking
-
Endpoints should provide link to Resource (Fragment)
-
Persistent Identifier (PID) / URI
-
If direct linking is not possible, then e.g. website with description of the resource, corpus or collection
-
Link should be as specific as possible
-
PIDs preferred to URIs, both together recommended
-
Result Serialization – Examples
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/00-15">
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<!-- data view payload omitted -->
</fcs:DataView>
</fcs:Resource>
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15">
<fcs:ResourceFragment>
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<!-- data view payload omitted -->
</fcs:DataView>
</fcs:ResourceFragment>
</fcs:Resource>
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource"
pid="http://hdl.handle.net/4711/08-15"
ref="http://repos.example.org/file/text_08_15.html">
<!-- (1) -->
<fcs:DataView type="application/x-cmdi+xml"
pid="http://hdl.handle.net/4711/08-15-1"
ref="http://repos.example.org/file/08_15_1.cmdi">
<!-- data view payload omitted -->
</fcs:DataView>
<!-- (2) -->
<fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2"
ref="http://repos.example.org/file/text_08_15.html#sentence2">
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<!-- data view payload omitted -->
</fcs:DataView>
</fcs:ResourceFragment>
</fcs:Resource>
1 | Specification of CMDI metadata for the resource |
2 | Hit is part of a larger resource “semantically more meaningful” |
Data Views
-
Specification (with XSD schema, examples)
-
Specified in FCS Core 2.0
-
Advanced (ADV) Data View
-
Generic Hits (HITS) Data View
-
-
Additional Data Views such as Component Metadata (CMDI), Images (IMG), Geolocation (GEO) are included, but not used in the standard FCS client “Aggregator”
-
Delivery policy: mandatory “send-by-default” or optional “need-to-request”
-
Generic Hits Data View is mandatory, must always be sent
-
Only send Data Views that
-
are explicitly requested with the (SRU) FCS parameter “x-fcs-dataviews”, or
-
have delivery policy “send-by-default”
-
-
Invalid Data Views → non-fatal diagnostic for each requested Data View
http://clarin.eu/fcs/diagnostic/4
("Requested Data View not valid for this resource")
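The delivery rules above can be sketched as a small selection function on the endpoint side: send every Data View with policy “send-by-default”, add explicitly requested ones, and report a non-fatal diagnostic for each requested Data View that is not valid for the resource. The data structures and the name select_dataviews are assumptions for illustration and do not reflect the reference libraries.

```python
DIAGNOSTIC_DATAVIEW_INVALID = "http://clarin.eu/fcs/diagnostic/4"


def select_dataviews(supported, available, requested):
    """Decide which Data Views to serialize for one resource.

    supported: dict mapping Data View id -> delivery policy
    available: Data View ids available for the resource
    requested: ids from the x-fcs-dataviews request parameter
    """
    # Data Views with policy send-by-default are always sent
    selected = [dv for dv in available if supported.get(dv) == "send-by-default"]
    diagnostics = []
    for dv in requested:
        if dv not in supported or dv not in available:
            # non-fatal: one diagnostic per invalid requested Data View
            diagnostics.append((DIAGNOSTIC_DATAVIEW_INVALID, dv))
        elif dv not in selected:
            selected.append(dv)
    return selected, diagnostics


supported = {
    "hits": "send-by-default",
    "adv": "send-by-default",
    "cmdi": "need-to-request",
}
print(select_dataviews(supported, ["hits", "cmdi"], ["cmdi", "geo"]))
```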
Hits Data View
Description | The representation of the hit |
---|---|
MIME type | application/x-clarin-fcs-hits+xml |
Payload Disposition | inline |
Payload Delivery | send-by-default ( |
Recommended Short Identifier | hits |
XML Schema | |
-
Required implementation
-
Simplest serialization, (lossy) approximation of results
-
Each hit should only occur in a single sentence context (or similar)
-
Multiple hit annotations possible, e.g. for conjunctions in the query
Hits Data View – Examples
<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy dog.
</hits:Result>
</fcs:DataView>
<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
<hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.
</hits:Result>
</fcs:DataView>
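The Hits Data View examples above can be produced from a plain text plus a list of hit offsets. The following sketch uses only xml.etree.ElementTree from the standard library; the function name build_hits_dataview and the offset convention (character offsets, end exclusive, sorted and non-overlapping) are assumptions for this example, not the API of the reference libraries.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/resource"
HITS_NS = "http://clarin.eu/fcs/dataview/hits"
HITS_MIME = "application/x-clarin-fcs-hits+xml"


def build_hits_dataview(text, hits):
    """Build a HITS Data View element; hits are (start, end) character offsets."""
    ET.register_namespace("fcs", FCS_NS)
    ET.register_namespace("hits", HITS_NS)
    dataview = ET.Element(f"{{{FCS_NS}}}DataView", {"type": HITS_MIME})
    result = ET.SubElement(dataview, f"{{{HITS_NS}}}Result")
    position = 0
    last_hit = None
    for start, end in hits:
        between = text[position:start]
        if last_hit is None:
            result.text = between   # text before the first hit
        else:
            last_hit.tail = between  # text between two hits
        last_hit = ET.SubElement(result, f"{{{HITS_NS}}}Hit")
        last_hit.text = text[start:end]
        position = end
    # remaining text after the last hit (or the whole text if there is no hit)
    if last_hit is None:
        result.text = text[position:]
    else:
        last_hit.tail = text[position:]
    return dataview


sentence = "The quick brown fox jumps over the lazy dog."
element = build_hits_dataview(sentence, [(16, 19), (40, 43)])
print(ET.tostring(element, encoding="unicode"))
```

Each call produces the payload for exactly one result record, in line with the “1 Hit = 1 Result Record” rule; several hit markers within that record are possible.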
KWIC Data View
Description | The representation of the hit |
---|---|
MIME type | application/x-clarin-fcs-kwic+xml |
Payload Disposition | inline |
Payload Delivery | send-by-default ( |
Recommended Short Identifier | kwic |
XML Schema | - |
-
Deprecated!
-
Only for compatibility with Legacy FCS clients
-
Example in CQP/SRU bridge
-
Mapping of
-
left and right context,
-
hits
-
Advanced Data View
Description | The representation of the hit for Advanced Search |
---|---|
MIME type | application/x-clarin-fcs-adv+xml |
Payload Disposition | inline |
Payload Delivery | send-by-default ( |
Recommended Short Identifier | adv |
XML Schema | |
-
Serialization for Advanced Search for multimedia data (text, transcribed audio)
-
Presentation of structured information via multiple annotation layers
-
Annotations are streams (ranges) over the signal in a stand-off like format with start and end offsets (inclusive)
-
Segmentation via <Segment>, annotations in <Span> in <Layer>
-
It must be possible to align segments across all annotation layers
-
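To illustrate the stand-off idea with inclusive start and end offsets, the following sketch derives one segment per token and then aligns an annotation layer to those segments. It assumes tokens joined by a single space and character offsets as unit; the function names are hypothetical and the actual XML attributes of <Segment> and <Span> are not shown here.

```python
def segment_offsets(tokens, separator=" "):
    """Return (token, start, end) triples with inclusive end offsets."""
    segments = []
    position = 0
    for token in tokens:
        start = position
        end = start + len(token) - 1  # inclusive end offset
        segments.append((token, start, end))
        position = end + 1 + len(separator)
    return segments


def layer_spans(segments, annotations):
    """Align one annotation value per segment, e.g. for a POS layer."""
    if len(segments) != len(annotations):
        raise ValueError("layer cannot be aligned with the segmentation")
    return [(start, end, value)
            for (_token, start, end), value in zip(segments, annotations)]


segments = segment_offsets(["The", "quick", "fox"])
print(segments)
print(layer_spans(segments, ["DET", "ADJ", "NOUN"]))
```

Because every layer reuses the same segment boundaries, the layers stay aligned for display in the Advanced Data View.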
Advanced Data View – Example
Advanced Data View – Example (2)
Advanced Data View – Presentation
Examples
→ see more examples (searchRetrieve
query)
endpoint: https://spraakbanken.gu.se/ws/fcs/2.0/endpoint/korp/sru
-
…?operation=searchRetrieve&queryType=fcs&query=%5bword%3d%22anv%C3%A4ndning%22%5d
→ FCS 2.0, FCS-QL: [ word = "användning" ], HITS + ADV
-
…?operation=searchRetrieve&queryType=cql&query=%22anv%C3%A4ndning%22
→ FCS 2.0, CQL: "användning", HITS
-
…?operation=searchRetrieve&version=1.2&query=cat ↔ …?query=cat → HITS
-
FCS 1.0,
sru="http://www.loc.gov/zing/srw/"
-
FCS 2.0,
sruResponse="http://docs.oasis-open.org/ns/search-ws/sruResponse"
-
-
more parameters:
x-indent-response=1
/x-fcs-dataviews=cmdi
/x-fcs-context=11022/0000-0000-20DF-1
Query Translation
-
Query Languages, Visualization
-
FCS-QL Details
-
Query Mapping
Query Languages
-
More resources in Awesome FCS List > Query Parsers
-
CQL (Contextual Query Language)
-
BNF grammar: www.loc.gov/standards/sru/cql/spec.html#bnf
-
Hand-written parser implementation in Java, Python, JS, …
-
Documentation: Java
-
Visualization in demo of JS parser
-
Validation for Text+ LexCQL
-
-
-
FCS-QL (Federated Content Search Query Language)
-
EBNF grammar: github.com/clarin-eric/fcs-misc (FCS Core 2.0)
-
Grammar visualization with ANTLR4 tools
-
FCS-QL – Visualization
-
Installation
pip install antlr4-tools
git clone https://github.com/clarin-eric/fcs-ql.git
cd fcs-ql/src/main/antlr4/eu/clarin/sru/fcs/qlparser
-
Visualization according to ANTLR4 > Getting Started
antlr4-parse src/fcsql/FCSParser.g4 src/fcsql/FCSLexer.g4 query -gui
[ word = "her.*" ] [ lemma = "Artznei" ] [ pos = "VERB" ]
^D
FCS-QL Query Nodes
QueryNode (with child node “children”)
-
Expression (layer identifier, layer identifier qualifier, operator, regular expression + flags)
-
Wildcard
-
Group → 1 QueryNode; “(” … “)”
-
NOT → 1 QueryNode
-
AND, OR → list of QueryNodes
-
-
QueryDisjunction → list of QueryNodes
-
QuerySequence → list of QueryNodes → “list of QuerySegments”
-
QuerySegment (min, max) → Expression → “a single token”
-
QueryGroup (min, max) → QueryNode
-
Within-Query (SimpleWithin, QueryWithWithin) (Scope: sentence, utterance, paragraph, turn, text, session) (unused)
-
grayed out: currently not supported by the FCS Aggregator for searching (in visual query builder)
FCS-QL Query Nodes – Aggregator
Parsed Query:
-
Query Sequence → with list of Query Segment
[ word = ".*her" ] [ lemma = "Artznei" ] [ pos = "VERB" ]
-
Query Segment → a token (can be repeatable)
[ word = "her.*" & ( word = "test" | word = "Apfel" ) ] [ pos = "ADV" ]{1,3}
-
Expression AND
[ word = "her.*" & word = "test" ]
-
Expression Group
-
Expression
-
-
Expression Group → Expression OR → list of Expression
[ ( word = "her.*" | word = "Test" ) ]
-
Expression → Layer Identifier, Operator, Regex (value)
[ word = "her.*" ]
-
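The node structure above can be mirrored with a few small classes. The sketch below is a simplified, hypothetical model (not the class hierarchy of the FCS-QL parser libraries) that covers the nodes used by the Aggregator and serializes them back into FCS-QL text. Regex values are inserted without escaping, and operator precedence inside nested AND/OR is not handled beyond the parentheses of the OR group.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Expression:
    layer: str
    operator: str
    regex: str
    qualifier: Optional[str] = None


@dataclass
class ExpressionAnd:
    operands: List["ExpressionNode"]


@dataclass
class ExpressionOr:
    operands: List["ExpressionNode"]


ExpressionNode = Union[Expression, ExpressionAnd, ExpressionOr]


@dataclass
class QuerySegment:
    expression: ExpressionNode
    min: int = 1
    max: int = 1


@dataclass
class QuerySequence:
    segments: List[QuerySegment]


def to_fcsql(node):
    """Serialize the simplified node tree into FCS-QL text."""
    if isinstance(node, Expression):
        layer = f"{node.qualifier}:{node.layer}" if node.qualifier else node.layer
        return f'{layer} {node.operator} "{node.regex}"'
    if isinstance(node, ExpressionAnd):
        return " & ".join(to_fcsql(operand) for operand in node.operands)
    if isinstance(node, ExpressionOr):
        return "( " + " | ".join(to_fcsql(operand) for operand in node.operands) + " )"
    if isinstance(node, QuerySegment):
        # only emit a quantifier if the segment is repeatable
        quantifier = "" if (node.min, node.max) == (1, 1) else f"{{{node.min},{node.max}}}"
        return f"[ {to_fcsql(node.expression)} ]{quantifier}"
    if isinstance(node, QuerySequence):
        return " ".join(to_fcsql(segment) for segment in node.segments)
    raise TypeError(f"unsupported node: {node!r}")


query = QuerySequence([
    QuerySegment(ExpressionAnd([
        Expression("word", "=", "her.*"),
        ExpressionOr([Expression("word", "=", "test"), Expression("word", "=", "Apfel")]),
    ])),
    QuerySegment(Expression("pos", "=", "ADV"), min=1, max=3),
])
print(to_fcsql(query))
```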
FCS-QL – Remarks
-
Currently (Aggregator v3.9.1) only limited support of all FCS-QL features
→ partly due to Visual Query Builder
-
Free text input / improved query builder planned for the future
-
Use appropriate diagnostics if query features are not supported
-
SRU: info:srw/diagnostic/1/48 - Query feature unsupported.
-
FCS: http://clarin.eu/fcs/diagnostic/10 - General query syntax error. (should be intercepted by FCS-QL parser library)
-
FCS: http://clarin.eu/fcs/diagnostic/11 - Query too complex. Cannot perform Query.
-
Query-Mapping
-
Idea:
-
Let libraries parse raw queries (CQL, FCS-QL)
-
Recursively walk through the parsed query tree, “depth first”
-
Successively generate transformed query (for target system),
e.g.
StringBuilder
in Java
-
-
Examples:
-
NoSketchEngine: CQL → CQL (Java), FCS-QL → CQL (Java)
-
Solr: CQL → Solr (Java), LexCQL → Solr (Java)
-
SolrQuery with highlighting, Custom hit prefixes/postfixes, use Solr result as pre-formatted Data View content (Code)
-
-
CQI Bridge: CQL → CQP (Java)
-
ElasticSearch
-
Only BASIC Search with full-text queries, e.g. with Simple Query String
-
-
Solr
-
Only BASIC Search
-
ADVANCED Search with e.g. MTAS (“Multi Tier Annotation Search”)
-
-
In general: use actual Corpus Search Engine for ADVANCED Search
→ otherwise at most a single annotation layer (“text”) can be searched
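The query mapping idea described above (parse, walk the tree depth-first, successively build the target query) can be sketched as follows. The tree classes Term and Boolean stand in for the output of a CQL parser and are hypothetical, as are the field name "text" and the Lucene/Solr-style target syntax; the list of string parts plays the role of the StringBuilder.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Term:
    value: str


@dataclass
class Boolean:
    operator: str  # "AND" or "OR"
    left: "QueryNode"
    right: "QueryNode"


QueryNode = Union[Term, Boolean]


def to_target_query(node, field="text"):
    """Translate the parsed query tree into a Lucene/Solr-style query string."""
    parts = []
    _walk(node, field, parts)
    return "".join(parts)


def _walk(node, field, parts):
    # depth-first traversal: left subtree, operator, right subtree
    if isinstance(node, Term):
        escaped = node.value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{field}:"{escaped}"')
    elif isinstance(node, Boolean):
        if node.operator not in ("AND", "OR"):
            raise ValueError(f"unsupported operator: {node.operator}")
        parts.append("(")
        _walk(node.left, field, parts)
        parts.append(f" {node.operator} ")
        _walk(node.right, field, parts)
        parts.append(")")
    else:
        raise ValueError(f"unsupported node: {node!r}")


# cat AND (mouse OR "lazy dog")
tree = Boolean("AND", Term("cat"), Boolean("OR", Term("mouse"), Term("lazy dog")))
print(to_target_query(tree))
```

Unsupported nodes or operators raise an error here; in an endpoint this is where a diagnostic such as “Query feature unsupported” would be reported instead.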
FCS Endpoint Development
-
VSCode settings, kickstart a project
-
Minimal FCS endpoint, search engine connection, result serialization
-
Deployment, Embedding, Extensibility
Visual Studio Code (suggestion)
-
Download & Installation: code.visualstudio.com
-
Extensions:
-
Java
-
redhat.vscode-xml (optional)
-
Python
-
Quality of Life
-
ms-vscode-remote.vscode-remote-extensionpack, ms-vscode.remote-explorer (for WSL or remote via SSH)
-
-
QoL = Quality of Life
Visual Studio Code – Debugging (Java)
-
For
*.war
/Jetty web application testing -
No hot code swapping / do not make any changes between compilation and debugging!
-
VSCode Debug Setting:
-
Run and Debug > Add Configuration … > “Java: Attach by Process ID”
-
-
Run application with Maven:
MAVEN_OPTS="-Xdebug -Xnoagent -Djava.compiler=NONE -agentlib:jdwp=transport=dt_socket,server=y,address=5005" \
mvn [jetty:run-war|...]
Visual Studio Code – Debugging (Python)
-
launch.json
-
pytest: no predefined configuration in “Run and Debug” menu
-
file/module: as required
-
-
settings.json
-
pytest: coverage must be deactivated here!
-
{
"name": "Python: pytest",
"type": "python",
"request": "launch",
"console": "integratedTerminal",
"purpose": [
"debug-test"
],
"justMyCode": false
}
"python.testing.pytestArgs": [
".",
// disable coverage for debugging
"--no-cov",
// disable ansi color output (-vv)
"-q",
],
Kickstart FCS Endpoint Project
-
See Guide to Endpoint Development
→ Using reference endpoint implementations
-
Using the Korp endpoint
-
Java: github.com/clarin-eric/fcs-sru-cqi-bridge (CQP/SRU bridge)
-
Java: project generation with Maven
-
Project template: github.com/clarin-eric/fcs-endpoint-archetype
-
CLARIN SRU/FCS Endpoint Archetype
-
Installation of the archetype in the local Maven repository, or
-
Configuration of the CLARIN Nexus as a remote repository
-
Project generation with Maven:
mvn archetype:generate \
-Pclarin \
-DarchetypeGroupId=eu.clarin.sru.fcs \
-DarchetypeArtifactId=fcs-endpoint-archetype \
-DarchetypeVersion=1.6.0 \
-DgroupId=[ id.group.fcs ] \
-DartifactId=[ my-cool-endpoint ] \
-Dversion=[ 1.0-SNAPSHOT ] \
-DinstitutionName=[ "My Institution" ]
-
all [ … ] placeholders must be replaced with the appropriate values (enclose values with spaces in quotation marks)
-
if using the CLARIN remote repository, the custom profile is selected with -Pclarin, see example maven configuration
-
if the archetype is installed using git, then use archetypeVersion=1.6.0-SNAPSHOT (see details in pom.xml)
Minimal FCS Endpoint
-
Required class implementations
-
SimpleEndpointSearchEngineBase
-
SRUSearchResultSet
-
Wrapper or adapter for search engine (!)
-
-
Required configurations
-
sru-server-config.xml
-
endpoint-description.xml
-
Web app configurations (Java: web.xml, Python: key-value parameter dict)
-
Reference to implementation of the SimpleEndpointSearchEngineBase
-
Required SRU parameters (host, port, server, …)
-
-
Minimal FCS Endpoint – Initialization
void doInit (ServletContext context, SRUServerConfig config, SRUQueryParserRegistry.Builder queryParsersBuilder, Map<String, String> params)
- Java, Python
-
Required implementation!
-
(optional) initialization of APIs, default values (PIDs), …
EndpointDescription createEndpointDescription (ServletContext context, SRUServerConfig config, Map<String, String> params)
- Java, Python
-
Required implementation!
-
Loading of
EndpointDescription
(Java, Python)-
embedded XML file (load with
SimpleEndpointDescriptionParser
, Java, Python) or -
construction dynamically, e.g. via API - example NoSketchEngine
-
Minimal FCS Endpoint – Scan/Explain
-
(theoretically) nothing to implement
→ Default handlers for “explain” and “scan” respond to requests automatically
-
Endpoint Description is always returned as an “explain” operation (in case of doubt)
Minimal FCS Endpoint – Search Request
SRUSearchResultSet search (SRUServerConfig config, SRURequest request, SRUDiagnosticList diagnostics)
-
Parse query (search request)
-
Check “queryType” parameter, whether CQL, FCS-QL, …
-
Error: SRU_CANNOT_PROCESS_QUERY_REASON_UNKNOWN
-
Analyze ExtraRequestData
-
“x-fcs-context” - requested resource (scope of search)
-
Diagnostic: FCS_DIAGNOSTIC_PERSISTENT_IDENTIFIER_INVALID - invalid PIDs
-
Error: SRU_UNSUPPORTED_PARAMETER_VALUE - e.g. too many PIDs, no PIDs
-
-
“x-fcs-dataviews” - requested Data Views
-
Diagnostic: FCS_DIAGNOSTIC_REQUESTED_DATA_VIEW_INVALID
-
-
-
Pagination → startRecord (1) / maximumRecords (-1)
-
Process search with (local) search engine
-
Wrap results in
SRUSearchResultSet
-
“If in Doubt” → SRU_GENERAL_SYSTEM_ERROR
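The parameter handling in the steps above can be sketched independently of the reference libraries. The function below takes the raw request parameters as a plain dict, applies the defaults named above (startRecord 1, maximumRecords -1) and converts them to offset and count for a local search engine. Treating -1 as “no explicit limit” is an assumption of this sketch, and the function and key names are hypothetical.

```python
def parse_search_params(params):
    """Extract pagination and FCS extension parameters from raw request parameters."""
    start_record = int(params.get("startRecord", "1"))
    maximum_records = int(params.get("maximumRecords", "-1"))
    if start_record < 1:
        # an endpoint would report an SRU diagnostic here
        raise ValueError("startRecord must be >= 1")

    def split_list(name):
        # comma-separated list, ignoring empty entries and whitespace
        return [item.strip() for item in params.get(name, "").split(",") if item.strip()]

    return {
        "offset": start_record - 1,  # SRU records are 1-based
        "count": None if maximum_records < 0 else maximum_records,
        "pids": split_list("x-fcs-context"),
        "dataviews": split_list("x-fcs-dataviews"),
    }


print(parse_search_params({
    "query": "cat",
    "startRecord": "11",
    "maximumRecords": "10",
    "x-fcs-context": "hdl:10794/suc",
}))
```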
Search Engine Integration
-
Input: Parameters of search query
-
Query (translated for (local) search engine)
-
Resource (PID)
-
Pagination: offset + count, →
startRecord
(1) /maximumRecords
(-1)
-
(Request object and Server configurations)
-
(all global/static objects, such as API adapters etc.)
-
-
Output: Details for response, results
-
Total number (optional, FCS 2.0 allows indication of accuracy)
-
List of results
-
with “hit highlighting” (Hits) (Basic + Advanced Search)
-
tokenized (using character offsets) for FCS-QL (Advanced Search) with optional Advanced Search annotation layers
-
-
Diagnostics
-
-
Wrapper for results
-
Total number of results
-
List of results (text with hit offsets; tokens + annotations)
-
Resource PID, URL to result details
-
-
SRUSearchResultSet
implementation-
Iterator interface →
nextRecord()
,writeRecord()
;curRecordCursor
-
// Constructor parameters are elided ("...") as on the original slide
protected NoSkESRUFCSSearchResultSet(..., MyResults results) {
    super(diagnostics);
    this.serverConfig = serverConfig;
    this.request = request;
    this.results = results;
    currentRecordCursor = -1; // cursor starts before the first record
}
// ...
@Override
public int getTotalRecordCount() { return (int) results.getTotal(); }
@Override
public int getRecordCount() { return results.getResults().size(); }
@Override
public boolean nextRecord() throws SRUException {
    // advance the cursor as long as records remain
    if (currentRecordCursor < (getRecordCount() - 1)) {
        currentRecordCursor++;
        return true;
    }
    return false;
}
@Override
public void writeRecord(XMLStreamWriter writer) throws XMLStreamException {
    MyResults.ResultEntry result = results.getResults().get(currentRecordCursor);
    XMLStreamWriterHelper.writeStartResource(writer, results.getPid(), null);
    XMLStreamWriterHelper.writeStartResourceFragment(writer, null, result.landingpage);
    // ... write the Data Views for the current hit
    XMLStreamWriterHelper.writeEndResourceFragment(writer);
    XMLStreamWriterHelper.writeEndResource(writer);
}
Result Serialization
-
SRUXMLStreamWriter
- Java, Python-
(internal), specifically for SRU “
recordXmlEscaping
”
-
-
XMLStreamWriterHelper
- Java, Python (FCSRecordXMLStreamWriter
)-
Boilerplate + help for writing Resource, ResourceFragment, Hits/Kwic Data View
-
-
AdvancedDataViewWriter
- Java, Python-
Help with writing Advanced Data Views
-
addSpans (content, layer, offset, hit?), writeHitsDataView, writeAdvancedDataView
-
Minimal Configuration – Endpoint Description
-
FCS Version: 2
-
Capabilities: BASIC Search
-
Data Views: HITS
-
Resources: (min: 1)
-
Title
-
Description
-
LandingPage URL
-
Languages → one language (ISO 639-3)
-
<?xml version="1.0" encoding="UTF-8"?>
<EndpointDescription xmlns="http://clarin.eu/fcs/endpoint-description"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://clarin.eu/fcs/endpoint-description ../../schema/Core_2/Endpoint-Description.xsd"
version="2">
<Capabilities>
<Capability>http://clarin.eu/fcs/capability/basic-search</Capability>
</Capabilities>
<SupportedDataViews>
<SupportedDataView id="hits" delivery-policy="send-by-default" >application/x-clarin-fcs-hits+xml</SupportedDataView>
</SupportedDataViews>
<Resources>
<Resource pid="hdl:10794/sbkorpusar">
<Title xml:lang="sv">Språkbankens korpusar</Title>
<Title xml:lang="en">The Språkbanken corpora</Title>
<Description xml:lang="sv">Korpusarna hos Språkbanken.</Description>
<Description xml:lang="en">The corpora at Språkbanken.</Description>
<LandingPageURI >https://spraakbanken.gu.se/resurser/corpus</LandingPageURI>
<Languages>
<Language>swe</Language>
</Languages>
<AvailableDataViews ref="hits"/>
</Resource>
</Resources>
</EndpointDescription>
Minimal Configuration – SRU
-
SRU Server Configurations → Endpoint Configurations (
sru-server-config.xml
)-
databaseInfo
with general information about endpoint -
default:
indexInfo
+schemaInfo
-
required: serverInfo > database (host and port by default)
-
-
Web server configuration
-
Optional adjustment of SRU / FCS parameters
-
Java:
web.xml
-
Python: key-value dictionary
-
-
default:
indexInfo
+schemaInfo
→ copy/paste from template/existing endpoints, configuration remains largely the same here
FCS Endpoint Deployment (Java)
-
Using Maven (!) /
pom.xml
-
<packaging>war</packaging>
-
Build Plugin:
-
org.apache.maven.plugins:maven-war-plugin[:2.6]
(?) -
org.apache.maven.plugins:maven-compiler-plugin
-
-
-
Create WAR artifact
-
mvn clean compile war:war
-
mvn clean package
(also run tests etc.)
-
-
Deploy with Java Servlet Engine / HTTP server like Apache Tomcat / Eclipse Jetty / …
-
TODO: Check if
maven-war-plugin
is no longer necessary?
FCS Endpoint Deployment (Python)
-
“make_app()” method → provides configured WSGI SRUServerApp (Python)
-
Deployment suggestion: gunicorn (Python WSGI HTTP server)
-
Example: fcs-korp-endpoint-python
-
as module with werkzeug test server
python3 -m korp_endpoint
-
gunicorn in Docker Container (Dockerfile)
gunicorn 'korp_endpoint.app:make_gunicorn_app()'
-
Embedded FCS Endpoint (Python)
# methods of the embedding application class (class definition omitted)
def init(self, flask: Flask) -> None:
    self.server = self.build_fcs_server()
    # URL rules in Flask/Werkzeug must start with a leading slash
    flask.add_url_rule("/some-path/fcs", "some-path/fcs", self.handle)

def build_fcs_server(self) -> SRUServer:
    params = self.build_fcs_server_params()
    config = self.build_fcs_server_config(params)
    qpr_builder = SRUQueryParserRegistry.Builder(True)
    search_engine = KoshFCSEndpointSearchEngine(
        endpoint_description=self.build_fcs_endpointdescription(),
        # ... other parameters
    )
    search_engine.init(config, qpr_builder, params)
    return SRUServer(config, qpr_builder.build(), search_engine)

def handle(self) -> Response:
    LOGGER.debug("request: %s", request)  # Flask/Werkzeug Request
    LOGGER.debug("request?args: %s", request.args)
    response = Response()  # Flask/Werkzeug Response
    self.server.handle_request(request, response)
    return response
FCS Endpoint – Extensibility
-
Supports own query languages, Data Views etc.
-
Example: LexFCS (FCS extension for lexical resources)
→ i.e. new query language and Data View
-
LexCQL - query language (CQL dialect)
-
LexHITS - HITS Data View extension
Deployment
-
Deployment instructions for FCS Endpoint Tester/Validator, FCS SRU Aggregator and FCS Korp Endpoint
FCS Endpoint Protocol Conformance Tester
-
NOTE: This is about the now legacy FCS endpoint tester, see Section: FCS Endpoint Validator for the updated and rewritten validator!
-
WebApp for testing the compliance with the FCS specification of endpoints
-
Deployment: clarin.ids-mannheim.de/srutest
-
Java 8; Vaadin 7.7.15 (UI)
-
Installation uses SNAPSHOT versions of the SRU/FCS libraries, and normally reserved functions to validate the SRU/FCS protocols
FCS Endpoint Conformance Tester – Deployment
SRU/FCS SNAPSHOT libraries must be installed directly from Git
$ git clone https://github.com/clarin-eric/fcs-sru-client.git && cd fcs-sru-client
$ mvn install
$ git clone https://github.com/clarin-eric/fcs-simple-client.git && cd fcs-simple-client
$ mvn install
Build with Maven
$ git clone https://github.com/clarin-eric/fcs-endpoint-tester.git && cd fcs-endpoint-tester
$ mvn clean package
Deployment with Jetty on http://localhost:8080/
$ JETTY_VERSION="9.4.51.v20230217"
$ wget https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-distribution/${JETTY_VERSION}/jetty-distribution-${JETTY_VERSION}.zip && unzip jetty-distribution-${JETTY_VERSION}.zip && rm jetty-distribution-${JETTY_VERSION}.zip
$ cd jetty-distribution-${JETTY_VERSION}/
$ java -jar start.jar --add-to-start=http,deploy
$ cd webapps/ && cp ../../target/FCSEndpointTester-X.Y.Z-SNAPSHOT.war ROOT.war && cd ..
$ java -jar start.jar
FCS Endpoint Conformance Tester – Deployment (Docker)
Create Docker Image
$ git clone https://github.com/clarin-eric/fcs-endpoint-tester.git && cd fcs-endpoint-tester
$ docker build -t fcs-endpoint-tester .
Run Container
$ docker run --rm -it -p 8080:8080 fcs-endpoint-tester
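After starting the web app (with Jetty or Docker), it can take a moment until it accepts connections. The following is a small sketch of a readiness check using only the Python standard library; the function name, timeouts and the rule "any HTTP status below 500 counts as reachable" are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_until_reachable(url, timeout=60.0, interval=2.0):
    """Poll url until it answers with a non-5xx status or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True  # 2xx/3xx response
        except urllib.error.HTTPError as error:
            if error.code < 500:
                return True  # server is up, even if the path is e.g. 404
        except OSError:
            pass  # connection refused, timeout, DNS failure, ...
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For the deployment above, a call such as wait_until_reachable("http://localhost:8080/") would return True once the tester responds.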
FCS Endpoint Validator
-
This is an updated and completely rewritten SRU/FCS Endpoint Validator based on the FCS Endpoint Protocol Conformance Tester. In addition to more test cases, it allows inspecting HTTP requests/responses and storing validation results.
-
Web app for testing whether FCS endpoints comply with the SRU/FCS specification
-
Deployment: fcs-validator.data.saw-leipzig.de
-
Multi-module Maven project
-
(standalone) JUnit5 test runner with test cases, Java 11
-
Vaadin 24 UI with SpringBoot, Java 17
-
FCS Endpoint Validator – Deployment
Build with Maven
$ git clone https://github.com/saw-leipzig/fcs-endpoint-validator.git && cd fcs-endpoint-validator
$ mvn clean package install
Deployment with Spring Boot on http://localhost:8080/ (might automatically open a new browser tab)
$ cd fcs-endpoint-validator-ui/
$ mvn spring-boot:run
FCS Endpoint Validator – Deployment (Docker)
Download sources:
$ git clone https://github.com/saw-leipzig/fcs-endpoint-validator.git && cd fcs-endpoint-validator
Create docker-compose.yml deployment description:
version: '3'
services:
  fcs-endpoint-validator:
    build:
      context: .
      dockerfile: fcs-endpoint-validator-ui/Dockerfile
    container_name: fcs-endpoint-validator
    ports:
      # default, public 8080 to docker container 8080
      - 8080:8080
    restart: unless-stopped
Run Docker-Compose deployment:
$ docker compose build
$ docker compose down -v
$ docker compose up -d
FCS SRU Aggregator
-
Primary FCS client application
-
Central search interface for users: sends FCS search queries to the distributed endpoints and “aggregates” their results
-
Deployments:
-
CLARIN: contentsearch.clarin.eu + (Alpha / Beta instances)
-
Text+: fcs.text-plus.org (Alpha instance)
-
-
Registry of endpoints in Centre Registry + side loading
-
Deployment instructions can be found in DEPLOYMENT.md in the repo
FCS SRU Aggregator – Deployment
Build application (native)
$ git clone https://github.com/clarin-eric/fcs-sru-aggregator.git && cd fcs-sru-aggregator
$ ./build.sh --jar
Configuration (endpoint sideloading + logging) in aggregator_devel.yml (aggregator.yml for production deployment)
-
aggregatorParams → additionalFCSEndpoints
-
logging → loggers
Running on http://localhost:4019/
$ ./build.sh --run
FCS SRU Aggregator – Deployment (Docker)
Create Docker Image
$ git clone https://github.com/clarin-eric/fcs-sru-aggregator.git && cd fcs-sru-aggregator
$ docker build --tag=fcs-aggregator .
Run Docker Container
$ touch fcsAggregatorResources.json fcsAggregatorResources.backup.json
$ docker run -d --restart unless-stopped \
-p 4019:4019 -p 5005:5005 \
-v $(pwd)/aggregator.yml:/work/aggregator.yml:ro \
-v $(pwd)/fcsAggregatorResources.json:/var/lib/aggregator/fcsAggregatorResources.json \
-v $(pwd)/fcsAggregatorResources.backup.json:/var/lib/aggregator/fcsAggregatorResources.backup.json \
fcs-aggregator
FCS Korp Endpoint
-
Reference endpoint for Korp corpus search engine
-
Example → Korp-API publicly accessible, no further configuration required for testing
-
Code:
-
Java: github.com/clarin-eric/fcs-korp-endpoint
-
Python: github.com/Querela/fcs-korp-endpoint-python
-
-
Deployment(s):
-
Språkbanken (Göteborg): https://spraakbanken.gu.se/ws/fcs/2.0/endpoint/korp/sru
-
CLARIN-DK-UCPH (Copenhagen S): https://alf.hum.ku.dk/korp/fcs/2.0/endpoint/sru
-
…
FCS Korp Endpoint – Deployment (Java)
Build Application
$ git clone https://github.com/clarin-eric/fcs-korp-endpoint.git && cd fcs-korp-endpoint
$ mvn clean compile war:war
Then deploy with Jetty, Tomcat, etc., analogous to the FCS Endpoint Tester
FCS Korp Endpoint – Deployment (Python)
Prepare Deployment
$ git clone https://github.com/Querela/fcs-korp-endpoint-python.git && cd fcs-korp-endpoint-python
$ python3 -m venv venv && source venv/bin/activate
$ python3 -m pip install -e .
Test Deployment (http://localhost:8080)
$ python3 -m korp_endpoint
Production deployment with Docker (http://localhost:5000)
$ docker build --progress=plain -t korpy .
$ docker run --rm -it -p 5000:5000 korpy
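Once an endpoint is running, its searchRetrieve responses can be checked with a few lines of code. The sketch below extracts the numberOfRecords value from an SRU response, trying the SRU 2.0 and the SRU 1.1/1.2 namespaces in turn. The function name is an assumption, and fetching the response is left to the caller.

```python
import xml.etree.ElementTree as ET

SRU_NAMESPACES = (
    "http://docs.oasis-open.org/ns/search-ws/sruResponse",  # SRU 2.0
    "http://www.loc.gov/zing/srw/",  # SRU 1.1/1.2
)


def number_of_records(sru_response_xml):
    """Return the hit count of an SRU searchRetrieve response, or None."""
    root = ET.fromstring(sru_response_xml)
    for namespace in SRU_NAMESPACES:
        # numberOfRecords is a direct child of searchRetrieveResponse
        element = root.find(f"{{{namespace}}}numberOfRecords")
        if element is not None and element.text:
            return int(element.text.strip())
    return None
```

A return value of None indicates that the response is not a regular searchRetrieve response, for example a differently structured error document.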
Deployment Notes
-
When using Docker and localhost, network configurations may need to be adjusted so that the Docker container has access to the host → host.docker.internal
-
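The localhost note can be illustrated with a small helper that rewrites loopback URLs (for example sideloaded endpoint URLs) so that they point to the Docker host instead. This is only a sketch: the function name is hypothetical, user credentials in the URL are not preserved, and on Linux the name host.docker.internal usually has to be made available first, for example with docker run --add-host=host.docker.internal:host-gateway.

```python
from urllib.parse import urlsplit, urlunsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def to_docker_host_url(url, docker_host="host.docker.internal"):
    """Replace a loopback host in url with the Docker host name."""
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return url  # non-local URLs are returned unchanged
    # Keep an explicit port if one was given
    netloc = docker_host if parts.port is None else f"{docker_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))
```

For instance, to_docker_host_url("http://localhost:8080/sru") yields "http://host.docker.internal:8080/sru", while public endpoint URLs stay as they are.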
Resources
Links
-
Comprehensive collections of FCS-related links:
github.com/clarin-eric/awesome-fcs,
gitlab.gwdg.de/textplus/ag-fcs-documents/-/blob/main/awesome-fcs.md
-
CLARIN overview page:
www.clarin.eu/content/federated-content-search-clarin-fcs-technical-details
-
CLARIN Code GitHub/GitLab:
github.com/clarin-eric/?q=fcs,
gitlab.com/CLARIN-ERIC/?filter=fcs,
github.com/clarin-eric/fcs-misc/ (specs, docs, etc.) → overview page
Publications
-
As part of Text+ → see Zotero.org tagged “FCS”
-
Listing in Text+ Awesome FCS list
-
In the context of CLARIN? → see Zotero.org “Federated Content Search” group
-
CLARIN Federated Content Search (CLARIN-FCS) – Core Specification, 2014, Oliver Schonefeld et al.
-
Federated Search: Towards a Common Search Infrastructure, 2012, Herman Stehouwer et al.
-
Several workshops
-