| title: WiktionaryDE | |
| emoji: π | |
| colorFrom: indigo | |
| colorTo: pink | |
| sdk: gradio | |
| sdk_version: 5.49.1 | |
| app_file: app.py | |
| pinned: false | |
| license: cc-by-sa-3.0 | |
| # 🇩🇪 WiktionaryDE - German Linguistics Hub | |
| A multi-tool for German linguistic analysis that combines German Wiktionary database queries with multiple morphological engines and semantic knowledge bases in a single interface. | |
| ## 🎯 Overview | |
| This Space aggregates multiple German NLP tools and databases to provide: | |
| - Deep morphological analysis of German words | |
| - Contextual sentence analysis with semantic ranking | |
| - Full inflection tables (declensions and conjugations) | |
| - Thesaurus and semantic relation discovery | |
| - Grammar and spelling checking | |
| ## 🛠️ Tools & Data Sources | |
| ### Core Databases | |
| - **Wiktionary Database**: 3.7GB `cstr/de-wiktionary-sqlite-normalized` database providing ground truth for lemmas, inflected forms, definitions, examples, and pronunciation | |
| - **OdeNet (WordNet)**: German thesaurus for synonyms, antonyms, hypernyms, etc. | |
| - **ConceptNet**: Multilingual knowledge graph for semantic relations | |
| ### Morphological Engines | |
| - **DWDSmor**: High-precision FST-based analyzer from `zentrum-lexikographie/dwdsmor-open` | |
| - **HanTa**: Hanover Tagger for robust morphological analysis and lemmatization | |
| - **spaCy-IWNLP**: spaCy's `de_core_news_md` model combined with the IWNLP lemmatizer | |
| - **Pattern.de**: Full inflection table generation | |
| ### Additional Tools | |
| - **LanguageTool**: German grammar and spelling checks | |
| ## Main Features | |
| ### 1. Word Encyclopedia (DE) | |
| The primary non-contextual tool for analyzing single words. | |
| **What it does:** | |
| - Finds all possible analyses (e.g., "Lauf" as noun vs. "lauf" as verb) | |
| - Aggregates data from all engines and databases | |
| - Cross-validates results to filter out artifacts | |
| - Provides complete morphological, semantic, and inflectional information | |
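The cross-validation step can be pictured as a simple vote between engines. The following is a minimal sketch, not the Space's actual implementation: the function name `cross_validate`, the `min_votes` threshold, the `trusted` parameter, and the engine keys are all hypothetical, and the sample analyses are made up for illustration.

```python
from collections import defaultdict


def cross_validate(results, min_votes=2, trusted=("wiktionary",)):
    """Keep (lemma, pos) analyses that enough engines agree on.

    `results` maps a (hypothetical) engine name to the set of
    (lemma, part-of-speech) pairs that engine proposed.
    """
    votes = defaultdict(set)
    for engine, analyses in results.items():
        for analysis in analyses:
            votes[analysis].add(engine)

    kept = []
    for analysis, engines in votes.items():
        # Assumption: a hit from a trusted engine is sufficient on its own;
        # otherwise at least `min_votes` engines must agree.
        if len(engines) >= min_votes or engines & set(trusted):
            kept.append(analysis)
    return sorted(kept)


# Made-up engine outputs; ("Laufe", "NOUN") stands in for an artifact.
results = {
    "wiktionary": {("Lauf", "NOUN"), ("laufen", "VERB")},
    "hanta": {("laufen", "VERB"), ("Lauf", "NOUN")},
    "iwnlp": {("Lauf", "NOUN"), ("Laufe", "NOUN")},
}
validated = cross_validate(results)
print(validated)
```

Here the analysis proposed by only one untrusted engine is dropped, while analyses confirmed by several engines survive.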
| **Engine Options:** | |
| - **Wiktionary** (Default): Most accurate, database-driven | |
| - **DWDSmor**: High-precision formal grammar | |
| - **HanTa**: Robust tagger-based | |
| - **IWNLP**: spaCy-based analysis | |
| The engine selector automatically falls back to other engines if no result is found. | |
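The fallback behaviour described above might look roughly like the sketch below. It assumes each engine is a callable that returns a list of analyses (empty when nothing is found); `analyze_with_fallback` and the stub engines are hypothetical names used only for illustration.

```python
def analyze_with_fallback(word, engines, preferred="wiktionary"):
    """Try the preferred engine first, then the others in dict order.

    `engines` maps a name to a callable returning a list of analyses.
    Returns (engine_name, analyses), or (None, []) if every engine fails.
    """
    order = [preferred] if preferred in engines else []
    order += [name for name in engines if name != preferred]

    for name in order:
        analyses = engines[name](word)
        if analyses:  # first non-empty result wins
            return name, analyses
    return None, []


# Stub engines standing in for the real analyzers (assumed interface).
engines = {
    "wiktionary": lambda w: [],
    "dwdsmor": lambda w: [],
    "hanta": lambda w: [{"lemma": w.lower(), "pos": "VERB"}],
    "iwnlp": lambda w: [{"lemma": w, "pos": "NOUN"}],
}
engine_used, analyses = analyze_with_fallback("Lauf", engines)
print(engine_used, analyses)
```

Returning the name of the engine that produced the result lets the interface show which source an analysis came from.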
| ### 2. Comprehensive Analyzer (DE) | |
| Full sentence analysis with contextual disambiguation. | |
| **Features:** | |
| - Uses spaCy to parse sentences and extract lemmas | |
| - Runs full Word Encyclopedia analysis on each lemma | |
| - **Contextual Ranking**: Uses sentence similarity to rank semantic senses by relevance to the full sentence | |
| - Provides integrated analysis of all words in context | |
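Contextual ranking can be illustrated with a small, self-contained sketch. The README does not say which similarity model the Space uses, so this example swaps in a plain bag-of-words cosine similarity as a stand-in for sentence embeddings; `rank_senses` and the two sense glosses are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_senses(sentence, senses):
    """Order sense glosses by similarity to the full sentence, best first."""
    context = bag_of_words(sentence)
    scored = [(cosine(context, bag_of_words(gloss)), gloss) for gloss in senses]
    return [gloss for _, gloss in sorted(scored, key=lambda p: p[0], reverse=True)]


sentence = "Ich bringe mein Geld zur Bank"
senses = [
    "Sitzgelegenheit für mehrere Personen",
    "Geldinstitut, das Geld verwaltet und Kredite vergibt",
]
ranked = rank_senses(sentence, senses)
print(ranked[0])
```

A real embedding model would also catch senses that share no literal words with the sentence; the word-overlap version only shows the ranking mechanics.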
| ### 3. Individual Engine Tabs | |
| Direct access to raw outputs from: | |
| - Wiktionary | |
| - DWDSmor | |
| - HanTa | |
| - IWNLP | |
| Useful for comparing individual engine outputs. | |
| ### 4. Component Tools | |
| Raw access to specialized tools: | |
| - **spaCy**: Dependency parsing and NER | |
| - **Grammar**: LanguageTool checking | |
| - **Inflections**: Pattern.de inflection tables | |
| - **Thesaurus**: OdeNet relations | |
| - **ConceptNet**: Semantic knowledge graph | |
| ## ⚙️ Technical Details | |
| - **SDK**: Gradio 5.49.1 | |
| - **Database Size**: 3.7GB (Wiktionary sqlite) | |
| - **Processing**: Multi-engine pipeline with intelligent fallback | |
| - **Quality Control**: Basic cross-validation between engines to filter artifacts | |
| ## License | |
| The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/). | |
| The underlying models and data sources retain their original licenses: | |
| - Wiktionary: CC-BY-SA | |
| - DWDSmor: Open license (zentrum-lexikographie) | |
| - HanTa: Various open licenses | |
| - spaCy models: MIT License | |
| - OdeNet: CC-BY-SA | |
| - ConceptNet: CC-BY-SA | |
| **Note**: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty. |