metadata
title: WiktionaryDE
emoji: 🐠
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0

🇩🇪 WiktionaryDE - German Linguistics Hub

License: CC-BY-SA-3.0 Gradio

An advanced multi-tool for German linguistic analysis. It combines queries against a German Wiktionary database with several morphological engines and semantic knowledge bases in a single interface.

🎯 Overview

This Space aggregates multiple German NLP tools and databases to provide:

  • Deep morphological analysis of German words
  • Contextual sentence analysis with semantic ranking
  • Full inflection tables (declensions and conjugations)
  • Thesaurus and semantic relation discovery
  • Grammar and spelling checking

🛠️ Tools & Data Sources

Core Databases

  • Wiktionary Database: The 3.7 GB cstr/de-wiktionary-sqlite-normalized SQLite database, used as ground truth for lemmas, inflected forms, definitions, examples, and pronunciation
  • OdeNet (WordNet): German thesaurus for synonyms, antonyms, hypernyms, etc.
  • ConceptNet: Multilingual knowledge graph for semantic relations
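As an illustration of how an inflected form could be resolved against a SQLite database like the Wiktionary one, here is a minimal sketch. The table and column names (`entries`, `forms`, `lemma`, `pos`, `tags`) are hypothetical and chosen only for this example; the actual schema of cstr/de-wiktionary-sqlite-normalized may look quite different. The demo builds a tiny in-memory database so it runs without the 3.7 GB file.

```python
import sqlite3

# Hypothetical schema for illustration only; the real database layout may differ.
SCHEMA = """
CREATE TABLE entries (id INTEGER PRIMARY KEY, lemma TEXT, pos TEXT);
CREATE TABLE forms (entry_id INTEGER REFERENCES entries(id), form TEXT, tags TEXT);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for the Wiktionary database."""
    conn = sqlite3.connect(":memory:")
    # SQLite's built-in lower() only folds ASCII, so register a Unicode-aware helper.
    conn.create_function("casefold", 1, str.casefold)
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?)",
        [(1, "laufen", "verb"), (2, "Lauf", "noun")],
    )
    conn.executemany(
        "INSERT INTO forms VALUES (?, ?, ?)",
        [
            (1, "lauf", "imperative singular"),
            (1, "läuft", "3rd person singular present"),
            (2, "Lauf", "nominative singular"),
            (2, "Läufe", "nominative plural"),
        ],
    )
    return conn


def lookup_form(conn: sqlite3.Connection, word: str) -> list:
    """Return (lemma, pos, tags) for every stored form matching `word`, ignoring case."""
    return conn.execute(
        "SELECT e.lemma, e.pos, f.tags "
        "FROM forms AS f JOIN entries AS e ON e.id = f.entry_id "
        "WHERE casefold(f.form) = casefold(?) "
        "ORDER BY e.lemma",
        (word,),
    ).fetchall()
```

Matching case-insensitively means that looking up "lauf" returns both the noun "Lauf" and the verb "laufen", which mirrors the ambiguity the Word Encyclopedia is meant to surface.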

Morphological Engines

  • DWDSmor: High-precision FST-based analyzer from zentrum-lexikographie/dwdsmor-open
  • HanTa: Hanover Tagger for robust morphological analysis and lemmatization
  • spaCy-IWNLP: de_core_news_md combined with IWNLP for spaCy-based analysis
  • Pattern.de: Full inflection table generation

Additional Tools

  • LanguageTool: German grammar and spelling checks

📖 Main Features

1. Word Encyclopedia (DE)

The primary non-contextual tool for analyzing single words.

What it does:

  • Finds all possible analyses (e.g., "Lauf" as noun vs. "lauf" as verb)
  • Aggregates data from all engines and databases
  • Cross-validates results to filter out artifacts
  • Provides complete morphological, semantic, and inflectional information

Engine Options:

  • Wiktionary (Default): Most accurate, database-driven
  • DWDSmor: High-precision formal grammar
  • HanTa: Robust tagger-based
  • IWNLP: spaCy-based analysis

If the selected engine returns no result, the selector automatically falls back to the other engines.
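The fallback behaviour can be pictured roughly as follows. This is a minimal sketch, not the code in app.py: the function name `analyze_with_fallback`, the stub engines, and the choice to treat an exception as "no result" are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

Analysis = Dict[str, str]
Engine = Callable[[str], Optional[Analysis]]


def analyze_with_fallback(
    word: str, preferred: str, engines: Dict[str, Engine]
) -> Tuple[Optional[str], Optional[Analysis]]:
    """Try the preferred engine first, then the rest in dictionary order.

    Returns (engine_name, analysis), or (None, None) if every engine comes up empty.
    """
    order = [preferred] + [name for name in engines if name != preferred]
    for name in order:
        engine = engines.get(name)
        if engine is None:
            continue  # unknown engine name, skip it
        try:
            result = engine(word)
        except Exception:
            continue  # assumption: a crashing engine counts as "no result"
        if result:
            return name, result
    return None, None


# Hypothetical stand-ins for the real engines, for demonstration only.
def wiktionary_stub(word: str) -> Optional[Analysis]:
    known = {"Lauf": {"lemma": "Lauf", "pos": "NOUN"}}
    return known.get(word)


def hanta_stub(word: str) -> Optional[Analysis]:
    return {"lemma": word, "pos": "UNKNOWN"} if word.isalpha() else None


ENGINES: Dict[str, Engine] = {"wiktionary": wiktionary_stub, "hanta": hanta_stub}
```

With these stubs, `analyze_with_fallback("Haus", "wiktionary", ENGINES)` finds nothing in the Wiktionary stub and returns the result of the HanTa stub instead.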

2. Comprehensive Analyzer (DE)

Full sentence analysis with contextual disambiguation.

Features:

  • Uses spaCy to parse sentences and extract lemmas
  • Runs full Word Encyclopedia analysis on each lemma
  • Contextual Ranking: Uses sentence similarity to rank semantic senses by relevance to the full sentence
  • Provides integrated analysis of all words in context
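To make the contextual ranking idea concrete, the sketch below scores each candidate sense gloss against the full sentence and sorts by that score. It deliberately swaps in a simple bag-of-words cosine similarity so it runs with the standard library alone; the Space itself presumably uses a stronger sentence-similarity model, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_senses(sentence: str, glosses: List[str]) -> List[Tuple[float, str]]:
    """Return (score, gloss) pairs, most similar to the sentence first."""
    context = bag_of_words(sentence)
    scored = [(cosine(context, bag_of_words(gloss)), gloss) for gloss in glosses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Example: two senses of "Bank" ranked against a sentence about money.
ranking = rank_senses(
    "Ich bringe mein Geld zur Bank.",
    [
        "Sitzgelegenheit für mehrere Personen",
        "Geldinstitut, bei dem man Geld anlegt",
    ],
)
```

Here the financial sense ranks first because its gloss shares the token "Geld" with the sentence. Word overlap is a crude signal; an embedding-based similarity would also capture related words that do not match literally.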

3. Individual Engine Tabs

Direct access to raw outputs from:

  • Wiktionary
  • DWDSmor
  • HanTa
  • IWNLP

Useful for comparing individual engine outputs.

4. Component Tools

Raw access to specialized tools:

  • spaCy: Dependency parsing and NER
  • Grammar: LanguageTool checking
  • Inflections: Pattern.de inflection tables
  • Thesaurus: OdeNet relations
  • ConceptNet: Semantic knowledge graph

⚙️ Technical Details

  • SDK: Gradio 5.49.1 (the sdk_version set in the Space metadata)
  • Database Size: 3.7GB (Wiktionary sqlite)
  • Processing: Multi-engine pipeline with intelligent fallback
  • Quality Control (basic): Cross-validation between engines to filter out artifacts
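One simple way to cross-validate engine output is to keep only the analyses that at least two engines agree on. The sketch below shows that voting rule; it is an assumption about how such a filter could work, not a description of the rule this Space actually applies, and `cross_validate` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Reading = Tuple[str, str]  # (lemma, part of speech)


def cross_validate(
    results: Dict[str, List[Reading]], min_votes: int = 2
) -> Dict[Reading, List[str]]:
    """Keep readings proposed by at least `min_votes` distinct engines.

    Maps each surviving reading to the sorted list of engines that proposed it.
    """
    votes = defaultdict(set)
    for engine, readings in results.items():
        for reading in readings:
            votes[reading].add(engine)
    return {
        reading: sorted(names)
        for reading, names in votes.items()
        if len(names) >= min_votes
    }


# Illustrative engine outputs for the input "lauf" (made-up data).
engine_results = {
    "wiktionary": [("laufen", "VERB"), ("Lauf", "NOUN")],
    "hanta": [("laufen", "VERB")],
    "dwdsmor": [("laufen", "VERB"), ("Lauf", "NOUN"), ("Laufe", "NOUN")],
}
validated = cross_validate(engine_results)
```

In this example the reading ("Laufe", "NOUN") is proposed by a single engine only and is dropped as a likely artifact, while the other two readings survive.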

📝 License

The code for this Gradio interface is licensed under CC-BY-SA-3.0.

The underlying models and data sources retain their original licenses:

  • Wiktionary: CC-BY-SA
  • DWDSmor: Open license (zentrum-lexikographie)
  • HanTa: Various open licenses
  • spaCy models: MIT License
  • OdeNet: CC-BY-SA
  • ConceptNet: CC-BY-SA

Note: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty.