Introducing the Federated Content Search (FCS)

  • Description, History & Glossary

What is the FCS?

  • “Federated Content Search” at CLARIN

    In short: Content Search over Distributed Resources

    Also: Federated “Corpus Query Platform”

  • Search for patterns in distributed text collections

  • No central index!

  • Text resources include annotated corpora, full-texts etc.

  • FCS = interface specification, search infrastructure and software ecosystem

  • Built on established standards and designed to be extensible

What is included in the FCS?

Interface Specification

  • Description of search protocol (query languages, formats and communication channels)
    “for homogeneous access to heterogeneous search engines”

  • RESTful protocol

Search Infrastructure in CLARIN and Text+

  • Central client (search result “Aggregator” and web portal)

  • Decentralized endpoints at the data centers (local search engines on resources)
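
The split between a central client and decentralized endpoints can be illustrated with a small sketch. It is not the Aggregator's actual implementation: the two endpoint functions, their tiny corpora, and the names `ENDPOINTS` and `aggregate` are hypothetical stand-ins for remote search engines. The point is only that the same query is sent to every endpoint and the hits stay grouped per endpoint, since there is no central index to merge them into.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote endpoints: each one searches only its own data.
def endpoint_a(query):
    corpus = ["The cyclists are fast", "Slow trains"]
    return [s for s in corpus if query.lower() in s.lower()]

def endpoint_b(query):
    corpus = ["Fast cars and fast bikes"]
    return [s for s in corpus if query.lower() in s.lower()]

ENDPOINTS = {"center-a": endpoint_a, "center-b": endpoint_b}

def aggregate(query, endpoints=ENDPOINTS):
    """Send the same query to every endpoint and group the hits per endpoint."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(search, query) for name, search in endpoints.items()}
        return {name: future.result() for name, future in futures.items()}

results = aggregate("fast")
for name, hits in results.items():
    print(name, hits)
```

Each endpoint decides on its own what counts as a hit and in which order hits are returned, which is why the client can present results side by side but cannot rank them globally.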

Software Ecosystem primarily in Java

  • Libraries (Java, Python, …)

  • Tools (Validator, Aggregator, Registry)

Requirements for participation in the FCS

  • (Own) text resources

  • “Search engine” on those text resources

    • Minimum: full-text search

  • Deployment of publicly accessible FCS endpoint(s)

Pros and Cons for the FCS (as Infrastructure)

Pros

  • Integration of many resources, linking and comparison of results

  • Integration with other tools (Weblicht, Registry/VLO, Switchboard, …)

  • Same queries, formats, result presentation

  • No duplicate data storage and therefore no inconsistency

Cons

  • No control over resources

  • No deterministic results (e.g. links for publications)

  • No global ranking of results possible

Pros and Cons for FCS Endpoints (Operators)

Pros

  • Control over resources and search (ranking, fuzzy, …)

  • No duplication of data, as a central index would require

  • Increased visibility in a larger resource catalog

Cons

  • Deployment of (additional) endpoint necessary

Comparison of FCS with Central Index

Data

  • FCS: ➕ Stays at the endpoints

  • Central index: ➖ Duplicate data storage, possible inconsistency (age, updates); transfer may not be legally possible

Updates to Data

  • FCS: ➕ Endpoints can react quickly

  • Central index: ➖ Difficult, e.g. removal of resources in the event of legal problems; updates entail longer delays, if they are possible at all

Global Ranking

  • FCS: ➖ Very difficult/impossible

  • Central index: ➕ Quite possible (?), probably with implicit assumptions and normalization of the data for indexing

Faceted Search

  • FCS: ➖ Difficult (e.g. via external metadata; not explicitly intended)

  • Central index: ➕ Indexing allows clustering/classification according to topics and categories

History

  • ~ 2011 Started as Working Group in CLARIN

  • May 2011 EDC/FCS Workshop

  • ~ 2011–2013 Initial version, now named FCS “Legacy”

    • SRU Scan for resources, BASIC Search (CQL/full-text), KWIC

  • April 2013 FCS Workshop

  • ~ 2013/2014 Code and Spec for FCS Core 1.0

    • fcs-simple-endpoint:1.0.0, sru-server:1.5.0

    • BASIC Hits Data View, SRU Scan operation not used anymore

  • ~ 2015/2016 Work on and code for FCS Core 2.0 begins

    • fcs-simple-endpoint:1.3.0, sru-server:1.8.0

    • Advanced Data Views (FCS-QL), …

  • June 2017 Official release of FCS Core 2.0 Spec

  • 2022 FCS becomes a focus in Text+ (Findability)

  • 2023 New FCS maintainer in CLARIN

    • Migration of Source Code to GitHub.com, updated documentation

    • Python FCS endpoint libraries

    • Updated libraries & tools

    • Prototypes for LexFCS extension

  • 2024

    • Experiments with Entity Search (extension)

    • Rewrite of FCS Endpoint Validator

FCS Architecture

Communication Protocol

SRU (Search/Retrieval via URL) / OASIS searchRetrieve

  • Standardized by Library of Congress (LoC) / OASIS

    • RESTful

    • Explain: Listing of resources

      • Languages, annotations, supported data views and formats etc.

    • SearchRetrieve: Search request

  • Data as XML

  • Extensions to the protocol explicitly allowed
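
Because SRU requests are plain URLs, the two operations listed above can be sketched by building query strings. This is a minimal illustration, not a client library: the base URL is hypothetical, and the parameter names (`operation`, `query`, `maximumRecords`, and the FCS extension parameter `x-fcs-endpoint-description`) follow a reading of the SRU and FCS specifications that should be checked against the version an endpoint actually implements.

```python
from urllib.parse import urlencode

BASE_URL = "https://fcs.example.org/sru"  # hypothetical endpoint, for illustration only

def explain_url(base_url=BASE_URL):
    """Build an Explain request that also asks for the FCS endpoint description."""
    params = {"operation": "explain", "x-fcs-endpoint-description": "true"}
    return f"{base_url}?{urlencode(params)}"

def search_url(query, base_url=BASE_URL, maximum_records=10):
    """Build a SearchRetrieve request for a full-text query."""
    params = {
        "operation": "searchRetrieve",
        "query": query,
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"

print(explain_url())
print(search_url('"fast cyclists"'))
```

The responses to both requests are XML documents, so a real client would fetch these URLs and parse the result with an XML parser.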

Basic Assumption on the Data Structure

  • different (optional) annotation layers

Full-text                    The   cyclists   are    fast
Part of Speech               DET   NOUN       VERB   ADJ
Lemmatisation                The   cyclist    is     fast
Phonetic Transcription       …     …          …      …
Orthographic Transcription   …     …          …      …
[…]
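
The layer model above can be sketched as parallel lists that are aligned by token position. This is only an illustration of the assumption, not how any endpoint stores its data: the layer keys (`text`, `pos`, `lemma`) and the helper `find_tokens` are hypothetical, and the values are copied from the example sentence.

```python
# Each layer holds one value per token; layers are aligned by position.
LAYERS = {
    "text":  ["The", "cyclists", "are", "fast"],
    "pos":   ["DET", "NOUN", "VERB", "ADJ"],
    "lemma": ["The", "cyclist", "is", "fast"],
}

def find_tokens(layers, **constraints):
    """Return the positions where every named layer has the requested value."""
    length = len(layers["text"])
    return [
        i for i in range(length)
        if all(layers[name][i] == value for name, value in constraints.items())
    ]

# Tokens that are nouns with the lemma "cyclist".
positions = find_tokens(LAYERS, lemma="cyclist", pos="NOUN")
print([LAYERS["text"][i] for i in positions])
```

Since layers are optional, a resource that only offers full-text would carry just the `text` list, and queries on other layers would not be available for it.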

Explain: Resource Discovery

Endpoint Description

Explain: Resource Structure