---
title: WiktionaryDE
emoji: 🐠
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---
# 🇩🇪 WiktionaryDE - German Linguistics Hub
[![License: CC-BY-SA-3.0](https://img.shields.io/badge/License-CC--BY--SA%203.0-lightgrey.svg)](https://creativecommons.org/licenses/by-sa/3.0/)
[![Gradio](https://img.shields.io/badge/Gradio-5.49.1-orange)](https://gradio.app/)
An advanced multi-tool for German linguistic analysis. It combines queries against a German Wiktionary database with several morphological engines and semantic knowledge bases in a single interface.
## 🎯 Overview
This Space aggregates multiple German NLP tools and databases to provide:
- Deep morphological analysis of German words
- Contextual sentence analysis with semantic ranking
- Full inflection tables (declensions and conjugations)
- Thesaurus and semantic relation discovery
- Grammar and spelling checking
## πŸ› οΈ Tools & Data Sources
### Core Databases
- **Wiktionary Database**: the 3.7 GB `cstr/de-wiktionary-sqlite-normalized` SQLite database, used as ground truth for lemmas, inflected forms, definitions, examples, and pronunciation
- **OdeNet (WordNet)**: German thesaurus for synonyms, antonyms, hypernyms, etc.
- **ConceptNet**: Multilingual knowledge graph for semantic relations
### Morphological Engines
- **DWDSmor**: High-precision FST-based analyzer from `zentrum-lexikographie/dwdsmor-open`
- **HanTa**: Hanover Tagger for robust morphological analysis and lemmatization
- **spaCy-IWNLP**: `de_core_news_md` combined with IWNLP for spaCy-based analysis
- **Pattern.de**: Full inflection table generation
### Additional Tools
- **LanguageTool**: German grammar and spelling checks
## 📖 Main Features
### 1. Word Encyclopedia (DE)
The primary non-contextual tool for analyzing single words.
**What it does:**
- Finds all possible analyses (e.g., "Lauf" as noun vs. "lauf" as verb)
- Aggregates data from all engines and databases
- Cross-validates results to filter out artifacts
- Provides complete morphological, semantic, and inflectional information
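The cross-validation step can be pictured as a simple vote between engines. The sketch below is only an illustration under stated assumptions, not the Space's actual code: the function name `cross_validate`, the `(lemma, pos)` tuple format, the `min_votes` threshold, and the sample outputs are all hypothetical.

```python
from collections import defaultdict


def cross_validate(engine_outputs, min_votes=2):
    """Keep only (lemma, pos) analyses proposed by at least `min_votes` engines.

    `engine_outputs` maps an engine name to a list of (lemma, pos) tuples.
    This data shape is an assumption made for the example.
    """
    votes = defaultdict(set)
    for engine, analyses in engine_outputs.items():
        for lemma, pos in analyses:
            votes[(lemma, pos)].add(engine)
    # Analyses seen by a single engine only are treated as likely artifacts.
    return {
        analysis: sorted(names)
        for analysis, names in votes.items()
        if len(names) >= min_votes
    }


# Hypothetical engine outputs for the input "Lauf".
outputs = {
    "wiktionary": [("Lauf", "NOUN"), ("laufen", "VERB")],
    "dwdsmor": [("Lauf", "NOUN"), ("laufen", "VERB")],
    "hanta": [("Lauf", "NOUN"), ("Laufe", "NOUN")],  # second entry is a made-up artifact
}
confirmed = cross_validate(outputs)
```

Here `confirmed` keeps the noun and verb readings and drops the analysis that only one engine proposed. A real implementation might weight engines differently, for example by always trusting the Wiktionary database.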
**Engine Options:**
- **Wiktionary** (Default): Most accurate, database-driven
- **DWDSmor**: High-precision formal grammar
- **HanTa**: Robust tagger-based
- **IWNLP**: spaCy-based analysis
The engine selector automatically falls back to other engines if no result is found.
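A minimal sketch of such a fallback chain, assuming each engine can be wrapped as a function that returns an analysis or `None`. The names `analyze_with_fallback`, `wiktionary_stub`, and `hanta_stub` are hypothetical stand-ins and do not call the real engines.

```python
from typing import Callable, Dict, Optional, Tuple

# Assumed engine signature: take a word, return an analysis dict or None.
Engine = Callable[[str], Optional[dict]]


def analyze_with_fallback(
    word: str, preferred: str, engines: Dict[str, Engine]
) -> Tuple[Optional[str], Optional[dict]]:
    """Try the preferred engine first, then the remaining ones in dict order."""
    order = [preferred] + [name for name in engines if name != preferred]
    for name in order:
        result = engines[name](word)
        if result:  # None or an empty result triggers the next engine
            return name, result
    return None, None


# Toy stand-ins for real engines (hypothetical behaviour).
def wiktionary_stub(word: str) -> Optional[dict]:
    return {"lemma": "laufen", "pos": "VERB"} if word == "lauf" else None


def hanta_stub(word: str) -> Optional[dict]:
    return {"lemma": word.lower(), "pos": "UNKNOWN"}


ENGINES: Dict[str, Engine] = {"wiktionary": wiktionary_stub, "hanta": hanta_stub}
```

Calling `analyze_with_fallback("lauf", "wiktionary", ENGINES)` is answered by the first engine, while an unknown word falls through to the second. Returning the engine name together with the result lets the interface show which engine actually produced the analysis.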
### 2. Comprehensive Analyzer (DE)
Full sentence analysis with contextual disambiguation.
**Features:**
- Uses spaCy to parse sentences and extract lemmas
- Runs full Word Encyclopedia analysis on each lemma
- **Contextual Ranking**: Uses sentence similarity to rank semantic senses by relevance to the full sentence
- Provides integrated analysis of all words in context
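The contextual ranking idea can be sketched as follows. This example swaps in a plain bag-of-words cosine similarity so that it runs with the standard library alone; the Space itself may well use a sentence-embedding model instead. The function names and the sample glosses are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_senses(sentence, glosses):
    """Return (score, gloss) pairs, most similar to the sentence first."""
    context = bag_of_words(sentence)
    scored = [(cosine(context, bag_of_words(gloss)), gloss) for gloss in glosses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Two made-up glosses for the ambiguous noun "Bank".
sentence = "Ich bringe mein Geld zur Bank"
glosses = [
    "Sitzgelegenheit für mehrere Personen",
    "Geldinstitut, das Geld verwaltet und verleiht",
]
ranked = rank_senses(sentence, glosses)
```

In this toy case the financial sense ranks first because it shares the token "Geld" with the sentence. Word overlap is a weak signal, which is why embedding-based similarity is the more plausible choice in practice.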
### 3. Individual Engine Tabs
Direct access to raw outputs from:
- Wiktionary
- DWDSmor
- HanTa
- IWNLP
Useful for comparing individual engine outputs.
### 4. Component Tools
Raw access to specialized tools:
- **spaCy**: Dependency parsing and NER
- **Grammar**: LanguageTool checking
- **Inflections**: Pattern.de inflection tables
- **Thesaurus**: OdeNet relations
- **ConceptNet**: Semantic knowledge graph
## βš™οΈ Technical Details
- **SDK**: Gradio 5.49.1
- **Database Size**: 3.7 GB (Wiktionary SQLite)
- **Processing**: Multi-engine pipeline with intelligent fallback
- **Quality Control** (basic): Cross-validation between engines to filter out artifacts
## πŸ“ License
The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).
The underlying models and data sources retain their original licenses:
- Wiktionary: CC-BY-SA
- DWDSmor: Open license (zentrum-lexikographie)
- HanTa: Various open licenses
- spaCy models: MIT License
- OdeNet: CC-BY-SA
- ConceptNet: CC-BY-SA
**Note**: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty.