cstr committed on
Commit
d91d571
·
verified ·
1 Parent(s): ed93d76

Update README.md

  1. README.md +98 -1
README.md CHANGED
@@ -10,4 +10,101 @@ pinned: false
10
  license: cc-by-sa-3.0
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
13
+ # 🇩🇪 WiktionaryDE - German Linguistics Hub
14
+
15
+ [![License: CC-BY-SA-3.0](https://img.shields.io/badge/License-CC--BY--SA%203.0-lightgrey.svg)](https://creativecommons.org/licenses/by-sa/3.0/)
16
+ [![Gradio](https://img.shields.io/badge/Gradio-4.31.0-orange)](https://gradio.app/)
17
+
18
+ An advanced multi-tool for German linguistic analysis that combines German Wiktionary database queries with multiple morphological engines and semantic knowledge bases in a single, comprehensive interface.
19
+
20
+ ## 🎯 Overview
21
+
22
+ This Space aggregates multiple German NLP tools and databases to provide:
23
+ - Deep morphological analysis of German words
24
+ - Contextual sentence analysis with semantic ranking
25
+ - Full inflection tables (declensions and conjugations)
26
+ - Thesaurus and semantic relation discovery
27
+ - Grammar and spelling checking
28
+
29
+ ## 🛠️ Tools & Data Sources
30
+
31
+ ### Core Databases
32
+ - **Wiktionary Database**: The 3.7 GB `cstr/de-wiktionary-sqlite-normalized` SQLite database, used as ground truth for lemmas, inflected forms, definitions, examples, and pronunciation
33
+ - **OdeNet (WordNet)**: German thesaurus for synonyms, antonyms, hypernyms, etc.
34
+ - **ConceptNet**: Multilingual knowledge graph for semantic relations
35
+
36
+ ### Morphological Engines
37
+ - **DWDSmor**: High-precision FST-based analyzer from `zentrum-lexikographie/dwdsmor-open`
38
+ - **HanTa**: Hanover Tagger for robust morphological analysis and lemmatization
39
+ - **spaCy-IWNLP**: `de_core_news_md` combined with IWNLP for spaCy-based analysis
40
+ - **Pattern.de**: Full inflection table generation
41
+
42
+ ### Additional Tools
43
+ - **LanguageTool**: German grammar and spelling checks
44
+
45
+ ## 📖 Main Features
46
+
47
+ ### 1. Word Encyclopedia (DE)
48
+ The primary non-contextual tool for analyzing single words.
49
+
50
+ **What it does:**
51
+ - Finds all possible analyses (e.g., "Lauf" as noun vs. "lauf" as verb)
52
+ - Aggregates data from all engines and databases
53
+ - Cross-validates results to filter out artifacts
54
+ - Provides complete morphological, semantic, and inflectional information
55
+
56
+ **Engine Options:**
57
+ - **Wiktionary** (Default): Most accurate, database-driven
58
+ - **DWDSmor**: High-precision formal grammar
59
+ - **HanTa**: Robust tagger-based
60
+ - **IWNLP**: spaCy-based analysis
61
+
62
+ The engine selector automatically falls back to other engines if no result is found.
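
The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the Space's actual code: the function `analyze_with_fallback`, the toy engines, and the result format are all hypothetical stand-ins for the real Wiktionary, DWDSmor, HanTa, and IWNLP wrappers.

```python
from typing import Callable, Dict, List, Optional, Tuple

Analysis = Dict[str, str]
Engine = Callable[[str], List[Analysis]]


def analyze_with_fallback(
    word: str, preferred: str, engines: Dict[str, Engine]
) -> Tuple[Optional[str], List[Analysis]]:
    """Try the preferred engine first, then the others in dict order."""
    order = [preferred] + [name for name in engines if name != preferred]
    for name in order:
        results = engines[name](word)
        if results:  # the first non-empty result wins
            return name, results
    return None, []  # no engine produced an analysis


# Toy engines (hypothetical): each one knows only a tiny vocabulary.
def toy_wiktionary(word: str) -> List[Analysis]:
    known = {"Haus": [{"lemma": "Haus", "pos": "NOUN"}]}
    return known.get(word, [])


def toy_hanta(word: str) -> List[Analysis]:
    known = {"lauf": [{"lemma": "laufen", "pos": "VERB"}]}
    return known.get(word, [])


ENGINES = {"wiktionary": toy_wiktionary, "hanta": toy_hanta}

# "lauf" is unknown to the toy Wiktionary engine, so the toy HanTa engine answers.
engine_used, analyses = analyze_with_fallback("lauf", "wiktionary", ENGINES)
print(engine_used, analyses)
```

Returning the name of the engine that answered lets the interface show which source a result came from.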
63
+
64
+ ### 2. Comprehensive Analyzer (DE)
65
+ Full sentence analysis with contextual disambiguation.
66
+
67
+ **Features:**
68
+ - Uses spaCy to parse sentences and extract lemmas
69
+ - Runs full Word Encyclopedia analysis on each lemma
70
+ - **Contextual Ranking**: Uses sentence similarity to rank semantic senses by relevance to the full sentence
71
+ - Provides integrated analysis of all words in context
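
The idea behind contextual ranking can be illustrated with a small sketch. It assumes each sense comes with a gloss (definition text) and scores every gloss against the sentence. A plain bag-of-words cosine similarity is used here as a simple stand-in; the README does not say which similarity model the Space uses, and the names `rank_senses`, `bag_of_words`, and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_senses(sentence: str, glosses: List[str]) -> List[Tuple[float, str]]:
    """Score each gloss against the sentence and sort by descending score."""
    context = bag_of_words(sentence)
    scored = [(cosine(context, bag_of_words(gloss)), gloss) for gloss in glosses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


sentence = "Ich bringe mein Geld zur Bank"
glosses = [
    "Sitzgelegenheit für mehrere Personen",
    "Institut, das Geld verwahrt und verleiht",
]
ranked = rank_senses(sentence, glosses)
for score, gloss in ranked:
    print(f"{score:.2f}  {gloss}")
```

In this toy case the financial sense of "Bank" ranks first because its gloss shares the word "Geld" with the sentence. Word overlap is fragile, so a real system would more plausibly compare dense sentence embeddings.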
72
+
73
+ ### 3. Individual Engine Tabs
74
+ Direct access to raw outputs from:
75
+ - Wiktionary
76
+ - DWDSmor
77
+ - HanTa
78
+ - IWNLP
79
+
80
+ Useful for comparing individual engine outputs.
81
+
82
+ ### 4. Component Tools
83
+ Raw access to specialized tools:
84
+ - **spaCy**: Dependency parsing and NER
85
+ - **Grammar**: LanguageTool checking
86
+ - **Inflections**: Pattern.de inflection tables
87
+ - **Thesaurus**: OdeNet relations
88
+ - **ConceptNet**: Semantic knowledge graph
89
+
90
+ ## ⚙️ Technical Details
91
+
92
+ - **SDK**: Gradio 4.31.0
93
+ - **Database Size**: 3.7 GB (Wiktionary SQLite)
94
+ - **Processing**: Multi-engine pipeline with automatic fallback when an engine returns no result
95
+ - **Quality Control**: Basic cross-validation between engines to filter artifacts
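
One simple way to cross-validate engine output is majority voting: keep a candidate analysis only if enough engines agree on it. The sketch below assumes that rule purely for illustration. The README does not specify the actual filtering logic, and `cross_validate`, the `min_votes` threshold, and the sample outputs are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Candidate = Tuple[str, str]  # (lemma, part of speech)


def cross_validate(
    per_engine: Dict[str, List[Candidate]], min_votes: int = 2
) -> List[Candidate]:
    """Keep candidates proposed by at least `min_votes` different engines."""
    votes: Counter = Counter()
    for candidates in per_engine.values():
        votes.update(set(candidates))  # each engine votes once per candidate
    return sorted(c for c, n in votes.items() if n >= min_votes)


# Made-up engine outputs for the input "Lauf"; ("Lau", "NOUN") is an artifact.
outputs = {
    "wiktionary": [("Lauf", "NOUN"), ("laufen", "VERB")],
    "dwdsmor": [("Lauf", "NOUN"), ("laufen", "VERB")],
    "hanta": [("Lauf", "NOUN"), ("Lau", "NOUN")],
}
confirmed = cross_validate(outputs)
print(confirmed)
```

A voting threshold trades recall for precision: rare but valid analyses that only one engine knows are dropped along with the artifacts.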
96
+
97
+ ## 📝 License
98
+
99
+ The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).
100
+
101
+ The underlying models and data sources retain their original licenses:
102
+ - Wiktionary: CC-BY-SA
103
+ - DWDSmor: Open license (zentrum-lexikographie)
104
+ - HanTa: Various open licenses
105
+ - spaCy models: MIT License
106
+ - OdeNet: CC-BY-SA
107
+ - ConceptNet: CC-BY-SA
108
+
109
+
110
+ **Note**: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty.