Introducing the Federated Content Search (FCS)
-
Description, History & Glossary
What is the FCS?
-
“Federated Content Search” at CLARIN
In short: Content Search over Distributed Resources
Also: Federated “Corpus Query Platform”
-
Search for patterns in distributed text collections
-
No central index!
-
Text resources include annotated corpora, full-texts etc.
-
FCS = interface specification, search infrastructure and software ecosystem
-
Usage of established standards and extensibility!
What is included in the FCS?
Interface Specification
-
Description of search protocol (query languages, formats and communication channels)
“for homogeneous access to heterogeneous search engines”
-
RESTful protocol
Search Infrastructure in CLARIN and Text+
-
Central client (search result “Aggregator” and web portal)
-
Decentralized endpoints at the data centers (local search engines on resources)
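
The split between a central client and decentralized endpoints can be illustrated with a small sketch. It is not the Aggregator's actual implementation: the endpoint names, the stub search functions and the `aggregate` helper are all hypothetical, and real endpoints would be queried over the network rather than called in-process. The sketch only shows the idea that one query is fanned out to every endpoint and that hits stay grouped by endpoint, since there is no central index.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical stand-in for a remote endpoint: a function that maps a
# query string to the hits found by that endpoint's local search engine.
Endpoint = Callable[[str], List[str]]


def make_stub_endpoint(documents: List[str]) -> Endpoint:
    """Create a toy endpoint doing case-insensitive full-text search."""
    def search(query: str) -> List[str]:
        needle = query.lower()
        return [doc for doc in documents if needle in doc.lower()]
    return search


def aggregate(endpoints: Dict[str, Endpoint], query: str) -> Dict[str, List[str]]:
    """Send the same query to every endpoint and group hits by endpoint.

    Hits are kept per endpoint instead of being merged into one ranked
    list, because the endpoints share no common scoring scheme.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        futures = {
            name: pool.submit(search, query)
            for name, search in endpoints.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Two illustrative data centers, each holding its own resources.
endpoints = {
    "center-a": make_stub_endpoint(["The cyclists are fast", "A slow train"]),
    "center-b": make_stub_endpoint(["Fast cars", "Cyclists on tour"]),
}
results = aggregate(endpoints, "cyclists")
for name, hits in results.items():
    print(name, hits)
```

Keeping results grouped per endpoint mirrors the trade-off listed later in the deck: each operator controls its own search and ranking, while a global ranking across endpoints is not available.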
Software Ecosystem primarily in Java
-
Libraries (Java, Python, …)
-
Tools (Validator, Aggregator, Registry)
Requirements for participation in the FCS
-
(Own) text resources
-
“Search engine” on those text resources
-
Minimum: full-text search
-
Deployment of publicly accessible FCS endpoint(s)
Pros and Cons for the FCS (as Infrastructure)
Pros
-
Integration of many resources, linking and comparison of results
-
Integration with other tools (Weblicht, Registry/VLO, Switchboard, …)
-
Same queries, formats, result presentation
-
No duplicate data storage and hence no inconsistency
Cons
-
No control over resources
-
No deterministic results (e.g. links for publications)
-
No global ranking of results possible
Pros and Cons for FCS Endpoints (Operators)
Pros
-
Control over resources and search (ranking, fuzzy, …)
-
No duplication of data, as a central index would require
-
Increased visibility in a larger resource catalog
Cons
-
Deployment of (additional) endpoint necessary
Comparison of FCS with Central Index
Criterion | FCS | Central Index |
---|---|---|
Data | ➕ At the endpoints | ➖ Duplicate data storage, possible inconsistency (age, updates); a transfer may not be legally possible |
Updates to Data | ➕ Endpoints can react quickly | ➖ Difficult, e.g. removal of resources in the event of legal problems; updates entail longer delays, if at all possible |
Global Ranking | ➖ Very difficult/impossible | ➕ Quite possible (?), probably implicit assumption and normalization of data for indexing |
Faceted Search | ➖ Difficult (e.g. via external metadata; not explicitly intended) | ➕ Indexing allows clustering/classification according to topics and categories |
History
-
~ 2011 Started as Working Group in CLARIN
-
May 2011 EDC/FCS Workshop
-
~ 2011–2013 Initial version, now named FCS “Legacy”
-
SRU Scan for resources, BASIC Search (CQL/full-text), KWIC
-
April 2013 FCS Workshop
-
~ 2013/2014 Code and Spec for FCS Core 1.0
-
fcs-simple-endpoint:1.0.0, sru-server:1.5.0
-
BASIC Hits Data View, SRU Scan operation not used anymore
-
-
much has disappeared into the annals of history …
-
https://github.com/clarin-eric/fcs-misc/tree/main/historical/documents
-
https://trac.clarin.eu/wiki/FCS/Specification?action=history
-
https://trac.clarin.eu/wiki/Taskforces/FCS/FCS-Specification-Draft?action=history
-
https://www.clarin.eu/event/2013/federated-content-search-workshop
-
EDC: European Demonstrator Case
-
~ 2015/2016 Starting work on and Code for FCS Core 2.0
-
fcs-simple-endpoint:1.3.0, sru-server:1.8.0
-
Advanced Data Views (FCS-QL), …
-
June 2017 Official release of FCS Core 2.0 Spec
-
2022 FCS becomes a focus in Text+ (Findability)
-
2023 New FCS maintainer in CLARIN
-
Migration of Source Code to GitHub.com, updated documentation
-
Python FCS endpoint libraries
-
Updated libraries & tools
-
Prototypes for LexFCS extension
-
2024
-
Experiments with Entity Search (extension)
-
Rewrite of FCS Endpoint Validator
FCS Architecture
Communication Protocol
SRU (Search/Retrieval via URL) / OASIS searchRetrieve
-
Standardized by Library of Congress (LoC) / OASIS
-
RESTful
-
Explain: Listing of resources
-
Languages, annotations, supported data views and formats etc.
-
-
SearchRetrieve: Search request
-
-
Data as XML
-
Extensions to the protocol explicitly allowed
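
Because SRU requests are plain URLs, the two operations can be sketched with the standard library alone. The endpoint address below is hypothetical, and the parameter set is an assumption that should be checked against the specification: `operation`, `query` and `maximumRecords` are SRU request parameters, and `x-fcs-endpoint-description` is the FCS extension parameter understood here to request the resource listing in an Explain response. Which parameters are required depends on the SRU version an endpoint speaks.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "https://fcs.example.org/sru"


def explain_url(base: str) -> str:
    """Build an Explain request that also asks for the FCS endpoint description."""
    params = {
        "operation": "explain",
        # Extension parameter; assumed here to trigger the resource listing.
        "x-fcs-endpoint-description": "true",
    }
    return f"{base}?{urlencode(params)}"


def search_url(base: str, query: str, max_records: int = 10) -> str:
    """Build a SearchRetrieve request for a full-text (CQL) query."""
    params = {
        "operation": "searchRetrieve",
        "query": query,
        "maximumRecords": max_records,
    }
    return f"{base}?{urlencode(params)}"


explain_request = explain_url(ENDPOINT)
search_request = search_url(ENDPOINT, '"fast cyclists"')
print(explain_request)
print(search_request)

# Decoding the query string again shows that the query survives URL encoding.
decoded = parse_qs(urlsplit(search_request).query)
print(decoded["query"])
```

Both requests would be answered with XML documents; fetching and parsing them is left out of the sketch because the response layout depends on the endpoint and the protocol version.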
Basic Assumption on the Data Structure
-
different (optional) annotation layers
Full-text | The | cyclists | are | fast |
---|---|---|---|---|
Part of Speech | DET | NOUN | VERB | ADJ |
Lemmatisation | The | cyclist | is | fast |
Phonetic Transcription | … | … | … | … |
Orthographic Transcription | … | … | … | … |
[…] | | | | |
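
The layered view of the example sentence can be modelled as token-aligned lists, one per annotation layer. This is only an illustrative data structure, not a format defined by the FCS: the layer keys (`text`, `pos`, `lemma`) and the helper functions are hypothetical, and the layers whose values are elided above are left out.

```python
from typing import Dict, List

# One list per annotation layer; position i in every list refers to the
# same token. Values are copied from the example sentence above.
layers: Dict[str, List[str]] = {
    "text": ["The", "cyclists", "are", "fast"],
    "pos": ["DET", "NOUN", "VERB", "ADJ"],
    "lemma": ["The", "cyclist", "is", "fast"],
}


def find_tokens(layers: Dict[str, List[str]], layer: str, value: str) -> List[int]:
    """Return the token positions where the given layer has the given value."""
    return [i for i, v in enumerate(layers[layer]) if v == value]


def token_at(layers: Dict[str, List[str]], index: int) -> Dict[str, str]:
    """Collect the annotations of a single token across all layers."""
    return {name: values[index] for name, values in layers.items()}


# Search on one layer, then read the matching token on all layers.
noun_positions = find_tokens(layers, "pos", "NOUN")
nouns = [token_at(layers, i) for i in noun_positions]
print(nouns)
```

Since every layer is optional, an endpoint offering only full-text would simply provide the `text` list; queries on other layers would then find nothing to match against.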