---
title: WiktionaryDE
emoji: π
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---
# 🇩🇪 WiktionaryDE - German Linguistics Hub
[License: CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)
[Built with Gradio](https://gradio.app/)
An advanced multi-tool for German linguistic analysis that combines German Wiktionary database queries with multiple morphological engines and semantic knowledge bases in a single, comprehensive interface.
## 🎯 Overview
This Space aggregates multiple German NLP tools and databases to provide:
- Deep morphological analysis of German words
- Contextual sentence analysis with semantic ranking
- Full inflection tables (declensions and conjugations)
- Thesaurus and semantic relation discovery
- Grammar and spelling checking
## 🛠️ Tools & Data Sources
### Core Databases
- **Wiktionary Database**: 3.7GB `cstr/de-wiktionary-sqlite-normalized` database providing ground truth for lemmas, inflected forms, definitions, examples, and pronunciation
- **OdeNet (WordNet)**: German thesaurus for synonyms, antonyms, hypernyms, etc.
- **ConceptNet**: Multilingual knowledge graph for semantic relations
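As a rough illustration of how an inflected form could be resolved to its lemma from a SQLite database, here is a minimal sketch. The table name `forms` and its columns are purely hypothetical and are not taken from the actual `cstr/de-wiktionary-sqlite-normalized` schema; the example builds a tiny in-memory database so it runs on its own.

```python
import sqlite3

def lookup_lemmas(conn, form):
    """Return distinct (lemma, pos) pairs for an inflected form.

    Assumes a hypothetical table forms(form, lemma, pos); the real
    Wiktionary database may be organized quite differently.
    """
    return conn.execute(
        "SELECT DISTINCT lemma, pos FROM forms "
        "WHERE form = ? ORDER BY lemma, pos",
        (form,),
    ).fetchall()

# Tiny in-memory stand-in for the 3.7GB database (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forms (form TEXT, lemma TEXT, pos TEXT)")
conn.executemany(
    "INSERT INTO forms VALUES (?, ?, ?)",
    [
        ("Häuser", "Haus", "noun"),
        ("lief", "laufen", "verb"),
        ("Lauf", "Lauf", "noun"),
    ],
)

print(lookup_lemmas(conn, "Häuser"))
```

The parameterized query (`?`) keeps user input out of the SQL string, which matters for a tool that accepts arbitrary words from a web form.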
### Morphological Engines
- **DWDSmor**: High-precision FST-based analyzer from `zentrum-lexikographie/dwdsmor-open`
- **HanTa**: Hanover Tagger for robust morphological analysis and lemmatization
- **spaCy-IWNLP**: `de_core_news_md` combined with IWNLP for spaCy-based analysis
- **Pattern.de**: Full inflection table generation
### Additional Tools
- **LanguageTool**: German grammar and spelling checks
## Main Features
### 1. Word Encyclopedia (DE)
The primary non-contextual tool for analyzing single words.
**What it does:**
- Finds all possible analyses (e.g., "Lauf" as noun vs. "lauf" as verb)
- Aggregates data from all engines and databases
- Cross-validates results to filter out artifacts
- Provides complete morphological, semantic, and inflectional information
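One simple way to cross-validate analyses is majority-style voting: keep an analysis only if several engines propose it. The sketch below assumes each engine returns `(lemma, pos)` tuples and uses a hypothetical `min_votes` threshold; the Space's actual filtering logic may differ.

```python
from collections import defaultdict

def cross_validate(candidates, min_votes=2):
    """Keep analyses that at least `min_votes` engines agree on.

    candidates: dict mapping engine name -> list of (lemma, pos) tuples.
    Returns a dict mapping each surviving analysis to the sorted list
    of engines that proposed it.
    """
    votes = defaultdict(set)
    for engine, analyses in candidates.items():
        for analysis in analyses:
            votes[analysis].add(engine)
    return {
        analysis: sorted(engines)
        for analysis, engines in votes.items()
        if len(engines) >= min_votes
    }

# Illustrative engine outputs (made up for this example)
candidates = {
    "wiktionary": [("Lauf", "NOUN"), ("laufen", "VERB")],
    "dwdsmor": [("Lauf", "NOUN"), ("laufen", "VERB")],
    "hanta": [("Lauf", "NOUN"), ("Lau", "NOUN")],
}

confirmed = cross_validate(candidates)
for analysis, engines in confirmed.items():
    print(analysis, "confirmed by", engines)
```

In this example the spurious analysis `("Lau", "NOUN")` is proposed by only one engine and is dropped, while both readings of "Lauf" survive.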
**Engine Options:**
- **Wiktionary** (Default): Most accurate, database-driven
- **DWDSmor**: High-precision formal grammar
- **HanTa**: Robust tagger-based
- **IWNLP**: spaCy-based analysis
The engine selector automatically falls back to other engines if no result is found.
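The fallback behavior can be sketched as a loop that tries the selected engine first and then the remaining ones. The function name `analyze_with_fallback` and the stub engines below are hypothetical placeholders, not the Space's real code.

```python
def analyze_with_fallback(word, engines, preferred):
    """Try the preferred engine first, then the others in dict order.

    engines: dict mapping engine name -> callable(word) returning a
    result dict, or None/empty when the engine finds nothing.
    Returns (engine_name, result), or (None, None) if all engines fail.
    """
    order = [preferred] if preferred in engines else []
    order += [name for name in engines if name != preferred]
    for name in order:
        result = engines[name](word)
        if result:  # None or an empty dict counts as "no result"
            return name, result
    return None, None

# Stub engines standing in for the real analyzers (assumptions)
def wiktionary_stub(word):
    return {"lemma": "Haus", "pos": "NOUN"} if word == "Haus" else None

def hanta_stub(word):
    return {"lemma": word.lower(), "pos": "UNKNOWN"}

ENGINES = {"wiktionary": wiktionary_stub, "hanta": hanta_stub}

print(analyze_with_fallback("Haus", ENGINES, "wiktionary"))
print(analyze_with_fallback("Xyzzy", ENGINES, "wiktionary"))
```

Returning the engine name alongside the result lets the interface show which engine actually produced the answer when a fallback occurred.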
### 2. Comprehensive Analyzer (DE)
Full sentence analysis with contextual disambiguation.
**Features:**
- Uses spaCy to parse sentences and extract lemmas
- Runs full Word Encyclopedia analysis on each lemma
- **Contextual Ranking**: Uses sentence similarity to rank semantic senses by relevance to the full sentence
- Provides integrated analysis of all words in context
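The idea behind contextual ranking can be illustrated with a deliberately simple stand-in: bag-of-words cosine similarity between the sentence and each sense gloss. This is not the similarity model the Space uses (which is not specified here); a real setup would more likely rely on sentence embeddings. The function names are hypothetical.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector for a text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_senses(sentence, glosses):
    """Sort sense glosses by similarity to the sentence, best first."""
    context = bow(sentence)
    return sorted(glosses, key=lambda g: cosine(context, bow(g)), reverse=True)

# Illustrative glosses for "Bank" (written for this example)
glosses = [
    "Sitzgelegenheit für mehrere Personen",
    "Geldinstitut, das Geld verwaltet und verleiht",
]
ranked = rank_senses("Ich bringe mein Geld zur Bank.", glosses)
print(ranked[0])
```

Here the financial sense ranks first only because the gloss shares the token "Geld" with the sentence. Word overlap is brittle, which is why embedding-based similarity is the more plausible choice in practice.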
### 3. Individual Engine Tabs
Direct access to raw outputs from:
- Wiktionary
- DWDSmor
- HanTa
- IWNLP
Useful for comparing individual engine outputs.
### 4. Component Tools
Raw access to specialized tools:
- **spaCy**: Dependency parsing and NER
- **Grammar**: LanguageTool checking
- **Inflections**: Pattern.de inflection tables
- **Thesaurus**: OdeNet relations
- **ConceptNet**: Semantic knowledge graph
## ⚙️ Technical Details
- **SDK**: Gradio 5.49.1
- **Database Size**: 3.7GB (Wiktionary sqlite)
- **Processing**: Multi-engine pipeline with intelligent fallback
- **Quality Control** (basic): Cross-validation between engines to filter artifacts
## License
The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).
The underlying models and data sources retain their original licenses:
- Wiktionary: CC-BY-SA
- DWDSmor: Open license (zentrum-lexikographie)
- HanTa: Various open licenses
- spaCy models: MIT License
- OdeNet: CC-BY-SA
- ConceptNet: CC-BY-SA
**Note**: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty.