---
title: WiktionaryDE
emoji: 🐠
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---

# 🇩🇪 WiktionaryDE - German Linguistics Hub

[![License: CC-BY-SA-3.0](https://img.shields.io/badge/License-CC--BY--SA%203.0-lightgrey.svg)](https://creativecommons.org/licenses/by-sa/3.0/)
[![Gradio](https://img.shields.io/badge/Gradio-5.49.1-orange)](https://gradio.app/)

A multi-tool for German linguistic analysis that combines queries against a German Wiktionary database with several morphological engines and semantic knowledge bases in a single interface.

## 🎯 Overview

This Space aggregates multiple German NLP tools and databases to provide:
- Deep morphological analysis of German words
- Contextual sentence analysis with semantic ranking
- Full inflection tables (declensions and conjugations)
- Thesaurus and semantic relation discovery
- Grammar and spelling checking

## 🛠️ Tools & Data Sources

### Core Databases
- **Wiktionary Database**: 3.7GB `cstr/de-wiktionary-sqlite-normalized` database providing ground truth for lemmas, inflected forms, definitions, examples, and pronunciation
- **OdeNet (WordNet)**: German thesaurus for synonyms, antonyms, hypernyms, etc.
- **ConceptNet**: Multilingual knowledge graph for semantic relations
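
As a rough illustration of how a form-to-lemma lookup against a SQLite database might work, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`entries`, `forms`, `lemma`, `pos`, `tags`) are hypothetical and exist only for this example; the actual schema of `cstr/de-wiktionary-sqlite-normalized` may differ.

```python
import sqlite3

# Hypothetical two-table layout for illustration only;
# the real Wiktionary database schema may look different.
SCHEMA = """
CREATE TABLE entries (id INTEGER PRIMARY KEY, lemma TEXT, pos TEXT);
CREATE TABLE forms (entry_id INTEGER REFERENCES entries(id), form TEXT, tags TEXT);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?)",
        [(1, "Lauf", "noun"), (2, "laufen", "verb")],
    )
    conn.executemany(
        "INSERT INTO forms VALUES (?, ?, ?)",
        [
            (1, "Lauf", "nominative singular"),
            (1, "Läufe", "nominative plural"),
            (2, "lauf", "imperative singular"),
            (2, "läuft", "3rd person singular present"),
        ],
    )
    return conn


def lookup_form(conn: sqlite3.Connection, word: str) -> list:
    """Return (lemma, pos, tags) for every entry with a matching inflected form.

    Matching ignores ASCII case, so "Lauf" finds both the noun and the verb form.
    """
    rows = conn.execute(
        "SELECT e.lemma, e.pos, f.tags "
        "FROM forms f JOIN entries e ON e.id = f.entry_id "
        "WHERE lower(f.form) = lower(?) ORDER BY e.id",
        (word,),
    )
    return rows.fetchall()


conn = build_demo_db()
for lemma, pos, tags in lookup_form(conn, "Lauf"):
    print(lemma, pos, tags)
```

Note that SQLite's built-in `lower()` only folds ASCII letters, so a real lookup for words starting with an umlaut would likely need a normalized column or a custom collation.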

### Morphological Engines
- **DWDSmor**: High-precision FST-based analyzer from `zentrum-lexikographie/dwdsmor-open`
- **HanTa**: Hanover Tagger for robust morphological analysis and lemmatization
- **spaCy-IWNLP**: `de_core_news_md` combined with IWNLP for spaCy-based analysis
- **Pattern.de**: Full inflection table generation

### Additional Tools
- **LanguageTool**: German grammar and spelling checks

## 📖 Main Features

### 1. Word Encyclopedia (DE)
The primary non-contextual tool for analyzing single words.

**What it does:**
- Finds all possible analyses (e.g., "Lauf" as noun vs. "lauf" as verb)
- Aggregates data from all engines and databases
- Cross-validates results to filter out artifacts
- Provides complete morphological, semantic, and inflectional information
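
One simple way to cross-validate engine outputs is agreement voting: keep an analysis if a trusted source confirms it or if enough engines agree on it. The sketch below assumes each engine returns `(lemma, pos)` tuples; the function name, the `trusted` parameter, and the `min_votes` threshold are illustrative assumptions, not the rules this Space actually applies.

```python
from collections import defaultdict


def cross_validate(results, trusted="wiktionary", min_votes=2):
    """Filter analyses by agreement between engines.

    results maps an engine name to a list of (lemma, pos) tuples.
    An analysis is kept if the trusted engine produced it, or if at
    least min_votes different engines produced it.
    """
    votes = defaultdict(set)
    for engine, analyses in results.items():
        for analysis in analyses:
            votes[analysis].add(engine)
    kept = [
        analysis
        for analysis, engines in votes.items()
        if trusted in engines or len(engines) >= min_votes
    ]
    return sorted(kept)


# Made-up engine outputs for the input "Lauf"; ("Laufe", "NOUN") stands in
# for an artifact that only a single engine proposes.
results = {
    "wiktionary": [("Lauf", "NOUN"), ("laufen", "VERB")],
    "dwdsmor": [("Lauf", "NOUN"), ("laufen", "VERB")],
    "hanta": [("Lauf", "NOUN"), ("Laufe", "NOUN")],
}
print(cross_validate(results))
```

With these made-up inputs, the single-engine analysis is dropped while the two analyses confirmed by the database survive.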

**Engine Options:**
- **Wiktionary** (Default): Most accurate, database-driven
- **DWDSmor**: High-precision formal grammar
- **HanTa**: Robust tagger-based
- **IWNLP**: spaCy-based analysis

If the selected engine returns no result, the tool automatically falls back to the other engines.
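
The fallback behavior can be pictured as a simple chain: try the selected engine first, then the remaining ones in order, and stop at the first non-empty result. This is a minimal sketch under the assumption that each engine is a callable taking a word and returning a list of analyses; the function name and the stub engines are hypothetical.

```python
def analyze_with_fallback(word, engines, preferred):
    """Return (engine_name, analyses) from the first engine that yields a result.

    engines maps an engine name to a callable word -> list of analyses.
    The preferred engine is tried first, then the others in dict order.
    Returns (None, []) if every engine comes back empty or fails.
    """
    order = ([preferred] if preferred in engines else []) + [
        name for name in engines if name != preferred
    ]
    for name in order:
        try:
            analyses = engines[name](word)
        except Exception:
            # A crashing engine is treated the same as "no result".
            continue
        if analyses:
            return name, analyses
    return None, []


# Stub engines standing in for the real ones.
engines = {
    "wiktionary": lambda word: [],                  # no database hit
    "dwdsmor": lambda word: [("laufen", "VERB")],   # first engine with a result
    "hanta": lambda word: [("laufen", "VV")],
}
print(analyze_with_fallback("lauf", engines, preferred="wiktionary"))
```

Returning the engine name alongside the analyses lets the interface show which engine actually answered.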

### 2. Comprehensive Analyzer (DE)
Full sentence analysis with contextual disambiguation.

**Features:**
- Uses spaCy to parse sentences and extract lemmas
- Runs full Word Encyclopedia analysis on each lemma
- **Contextual Ranking**: Uses sentence similarity to rank semantic senses by relevance to the full sentence
- Provides integrated analysis of all words in context
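
To illustrate the idea behind contextual ranking, the sketch below scores each sense definition against the full sentence and sorts by that score. It uses plain bag-of-words cosine similarity as a dependency-free stand-in; a real setup would more likely compare sentence embeddings, and the function names and example glosses here are assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_senses(sentence, senses):
    """Order sense definitions by similarity to the sentence, most similar first."""
    context = bag_of_words(sentence)
    scored = [(cosine(context, bag_of_words(gloss)), gloss) for gloss in senses]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [gloss for _, gloss in scored]


# Two made-up glosses for "Bank" (financial institution vs. bench).
senses = [
    "Geldinstitut für Kredit und Konto",
    "Sitzgelegenheit im Park oder Garten",
]
for gloss in rank_senses("Ich sitze auf der Bank im Park", senses):
    print(gloss)
```

Word overlap is a crude signal and misses synonyms entirely, which is why an embedding-based similarity is the more plausible choice for real sentences.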

### 3. Individual Engine Tabs
Direct access to raw outputs from:
- Wiktionary
- DWDSmor
- HanTa
- IWNLP

Useful for comparing individual engine outputs.

### 4. Component Tools
Raw access to specialized tools:
- **spaCy**: Dependency parsing and NER
- **Grammar**: LanguageTool checking
- **Inflections**: Pattern.de inflection tables
- **Thesaurus**: OdeNet relations
- **ConceptNet**: Semantic knowledge graph

## ⚙️ Technical Details

- **SDK**: Gradio 5.49.1
- **Database Size**: 3.7GB (Wiktionary sqlite)
- **Processing**: Multi-engine pipeline with intelligent fallback
- **Quality Control**: Basic cross-validation between engines to filter artifacts

## 📝 License

The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).

The underlying models and data sources retain their original licenses:
- Wiktionary: CC-BY-SA
- DWDSmor: Open license (zentrum-lexikographie)
- HanTa: Various open licenses
- spaCy models: MIT License
- OdeNet: CC-BY-SA
- ConceptNet: CC-BY-SA


**Note**: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty.