---
title: Word Enc De
emoji:
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-4.0
---
# 🏛️ Word Encyclopedia: German Linguistics Hub
A consolidated Gradio application that combines several NLP tools into a single web interface for in-depth linguistic analysis, primarily of German. It supports both broad, non-contextual analysis of individual words and deep, contextual analysis of full sentences.
[![License: CC BY-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-sa/4.0/)
## 🌟 Features
### 📖 Word Encyclopedia (DE) - Non-Contextual Analysis
The flagship tool for comprehensive word analysis. Enter a single German word and discover all its possible grammatical roles.
**Key Features:**
- **Multi-Engine Architecture**: Uses `HanTa` (primary) with automatic fallback to `spacy-iwnlp`
- **Artifact-Free**: Cross-validates all grammatical roles with OdeNet to eliminate false inflections
- **Comprehensive Coverage**: Handles ambiguous words (e.g., "Lauf" as a noun vs. a form of the verb "laufen", or "See" as masculine "der See" vs. feminine "die See")
- **Rich Data**:
  - Complete inflection tables (declension, conjugation) via `pattern.de`
  - Morpheme analysis via `HanTa`
  - Semantic senses from `OdeNet` (German WordNet)
  - Conceptual relations from `ConceptNet`
**Example:** Input "Lauf" → Finds both noun ("der Lauf" - the run) and verb ("laufen" - to run) with complete inflections.
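The primary/fallback flow with cross-validation described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `primary_engine`, `fallback_engine`, `is_known`, and `analyze_with_fallback` are hypothetical names, and the toy lookup tables stand in for `HanTa`, `spacy-iwnlp`, and OdeNet.

```python
from typing import Callable, Optional

# Hypothetical engine type: maps a word to (lemma, part-of-speech) candidates.
Engine = Callable[[str], list[tuple[str, str]]]


def primary_engine(word: str) -> list[tuple[str, str]]:
    # Toy stand-in for the primary analyzer; "Laufe" plays the role of an artifact.
    table = {"Lauf": [("Lauf", "NOUN"), ("laufen", "VERB"), ("Laufe", "NOUN")]}
    return table.get(word, [])


def fallback_engine(word: str) -> list[tuple[str, str]]:
    # Toy stand-in for the fallback analyzer.
    table = {"See": [("See", "NOUN")]}
    return table.get(word, [])


KNOWN_LEMMAS = {"Lauf", "laufen", "See"}  # toy stand-in for a lexicon lookup


def is_known(lemma: str) -> bool:
    return lemma in KNOWN_LEMMAS


def analyze_with_fallback(
    word: str,
    engines: list[tuple[str, Engine]],
    validator: Callable[[str], bool],
) -> tuple[Optional[str], list[tuple[str, str]]]:
    """Try each engine in order and keep only candidates the validator accepts."""
    for name, engine in engines:
        try:
            candidates = engine(word)
        except Exception:
            continue  # a failing engine simply hands over to the next one
        validated = [c for c in candidates if validator(c[0])]
        if validated:
            return name, validated
    return None, []


engines = [("primary", primary_engine), ("fallback", fallback_engine)]
print(analyze_with_fallback("Lauf", engines, is_known))
```

In this sketch an engine that returns only unvalidated candidates is treated like one that found nothing, so the next engine gets a chance.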
### 🚀 Comprehensive Analyzer (DE) - Contextual Analysis
Deep sentence-level analysis with context-aware semantic ranking.
**Key Features:**
- Lemma-by-lemma analysis of entire sentences
- Context-aware semantic ranking using spaCy sentence vectors
- Subject-Verb-Agreement (SVA) validation
- Grammar checking via LanguageTool
- Filters semantic senses by relevance to sentence context
**Example:** In "Der schnelle Hund läuft", ranks "fast" senses of "schnell" higher than "quick" based on context.
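Context-aware ranking of this kind usually comes down to comparing a sentence vector with one vector per candidate sense. The sketch below shows the idea with hand-written toy vectors and plain cosine similarity; the `rank_senses` helper, the glosses, and the numbers are assumptions for illustration. In the app, the vectors would come from spaCy (for example `doc.vector`), and its exact scoring is not documented here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_senses(sentence_vec: list[float], senses: dict[str, list[float]]) -> list[str]:
    """Return sense glosses ordered from most to least similar to the sentence."""
    scored = [(cosine(sentence_vec, vec), gloss) for gloss, vec in senses.items()]
    return [gloss for _, gloss in sorted(scored, reverse=True)]


# Toy 3-dimensional vectors; real spaCy vectors have a few hundred dimensions.
sentence_vec = [0.9, 0.1, 0.0]
senses = {
    "moving at high speed": [0.8, 0.2, 0.1],
    "done in a short time": [0.1, 0.9, 0.2],
}
ranking = rank_senses(sentence_vec, senses)
print(ranking)
```

Senses whose vectors point in roughly the same direction as the sentence vector end up first, which is what lets the analyzer filter out senses that do not fit the context.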
### 🔬 spaCy Analyzer - Multi-lingual
Direct access to morpho-syntactic parsing for multiple languages.
**Supported Languages:**
- German (`de_core_news_md`)
- English (`en_core_web_md`)
- Spanish (`es_core_news_md`)
- Ancient Greek (7 greCy models: PROIEL/Perseus, TRF/LG/SM variants)
**Outputs:**
- Dependency parsing visualization
- Named Entity Recognition (NER)
- Morphological analysis tables
- JSON export
### ✅ Grammar Check (DE)
Professional-grade grammar and style checking powered by `language-tool-python`.
### 📚 Inflections (DE)
Direct access to complete German inflection generation via `pattern.de`.
### 📖 Thesaurus (DE)
Query interface for OdeNet (German WordNet) with:
- Synonyms, antonyms
- Hypernyms, hyponyms
- Holonyms, meronyms
- Multiple sense disambiguation
### 🌐 ConceptNet
Direct API access to the ConceptNet 5 knowledge graph, with a robust parser that filters out self-referential results.
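As a rough illustration of what filtering self-referential results can look like, the sketch below drops edges whose start and end labels are both the query word. The `start`, `end`, and `rel` objects with a `label` field follow the edge format of ConceptNet 5 JSON responses; the `filter_edges` helper and the sample edges are hypothetical and not taken from the app.

```python
def filter_edges(word: str, edges: list[dict]) -> list[tuple[str, str, str]]:
    """Return (start, relation, end) triples, skipping edges that link the word to itself."""
    target = word.casefold()
    kept = []
    for edge in edges:
        start = edge.get("start", {}).get("label", "")
        end = edge.get("end", {}).get("label", "")
        rel = edge.get("rel", {}).get("label", "")
        if start.casefold() == target and end.casefold() == target:
            continue  # self-referential: carries no new information
        kept.append((start, rel, end))
    return kept


# Hand-written sample in the shape of the "edges" list of a ConceptNet response.
sample_edges = [
    {"start": {"label": "Hund"}, "rel": {"label": "IsA"}, "end": {"label": "Tier"}},
    {"start": {"label": "Hund"}, "rel": {"label": "Synonym"}, "end": {"label": "Hund"}},
    {"start": {"label": "Welpe"}, "rel": {"label": "RelatedTo"}, "end": {"label": "hund"}},
]
print(filter_edges("Hund", sample_edges))
```

The comparison uses `casefold()` so that differently cased labels for the same word are still recognized as self-references.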
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────┐
│                  Gradio Interface                   │
└─────────────────────────────────────────────────────┘
        ┌──────────────────┼──────────────────┐
        │                  │                  │
┌───────▼───────┐  ┌───────▼───────┐  ┌───────▼───────┐
│     Word      │  │ Comprehensive │  │     spaCy     │
│ Encyclopedia  │  │   Analyzer    │  │   Analyzer    │
│ (Non-Context) │  │ (Contextual)  │  │(Multi-lingual)│
└───────┬───────┘  └───────┬───────┘  └───────┬───────┘
        │                  │                  │
  ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
  │   HanTa   │      │   spaCy   │      │   spaCy   │
  │ (Primary) │      │ Sentence  │      │  Models   │
  └─────┬─────┘      │  Vectors  │      └───────────┘
        │            └─────┬─────┘
  ┌─────▼─────┐            │
  │  spaCy-   │            │
  │   IWNLP   │            │
  │(Fallback) │            │
  └─────┬─────┘            │
        │                  │
        └────────────┬─────┘
      ┌──────────────┼──────────────┐
      │              │              │
┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
│  Pattern  │  │  OdeNet   │  │ConceptNet │
│    .de    │  │ (WordNet) │  │    API    │
└───────────┘  └───────────┘  └───────────┘
```
## 🔌 API Endpoints
The application exposes Gradio API endpoints:
- `/api/get_morphology` - spaCy analysis
- `/api/check_grammar` - Grammar checking
- `/api/get_thesaurus` - OdeNet queries
- `/api/get_all_inflections` - Pattern.de inflections
- `/api/get_conceptnet` - ConceptNet queries
- `/api/comprehensive_analysis` - Full contextual analysis
- `/api/analyze_word` - Word encyclopedia
Access via the Gradio Client, passing the endpoint name without the `/api` prefix as `api_name`:
```python
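from gradio_client import Client

# Connect to a locally running instance (Gradio's default port)
client = Client("http://localhost:7860")
# Positional arguments follow the order of the endpoint's inputs
result = client.predict("Lauf", 3, api_name="/analyze_word")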
```
## 🐛 Troubleshooting
### "HanTa model file missing"
```bash
pip uninstall HanTa
pip install HanTa --no-cache-dir
```
### "LanguageTool failed to initialize"
Ensure Java is installed (required by LanguageTool):
```bash
java -version # Should show Java 8+
```
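The same check can be done from Python before starting the app. This is a small sketch under the assumption that a `java` executable on `PATH` is what LanguageTool will pick up; the `java_available` helper is hypothetical, and it only checks that Java runs, not which version is installed.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and `java -version` succeeds."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        result = subprocess.run(
            [java, "-version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if not java_available():
    print("Java not found: install a Java runtime before using the grammar checker.")
```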
### "OdeNet worker failed"
Check your internet connection; the app downloads the OdeNet data on first run.
### spaCy model not found
```bash
python -m spacy download de_core_news_md --force
```
## ⚖️ License and Attribution
This application is licensed under the **Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)** due to the ShareAlike requirements of its core data dependencies.
### Why CC BY-SA 4.0?
- **OdeNet Data**: CC BY-SA 4.0 (German WordNet)
- **ConceptNet 5 Data**: CC BY-SA 4.0
The ShareAlike clause requires derivative works to use the same license.
### Your Obligations
Under CC BY-SA 4.0, you must:
1. **Give credit**: Cite this work and all dependencies
2. **Indicate changes**: Note any modifications you make
3. **Share-Alike**: Distribute derivative works under CC BY-SA 4.0
4. **No additional restrictions**: Cannot add DRM or extra legal terms
## 🙏 Acknowledgments
This project would not be possible without:
- **[HanTa](https://github.com/wartaal/HanTa)** - High-accuracy morphological analysis
- **[spaCy](https://spacy.io/)** - Industrial-strength NLP framework
- **[IWNLP](https://www.iwnlp.com/)** - Comprehensive German lemmatization
- **[OdeNet](https://github.com/hdaSprachtechnologie/odenet)** - Open German WordNet
- **[pattern.de](https://github.com/clips/pattern)** - German linguistics tools
- **[ConceptNet](https://conceptnet.io/)** - Multilingual knowledge graph
- **[LanguageTool](https://languagetool.org/)** - Grammar checking
- **[greCy](https://github.com/CrispStrobe/greCy)** - Ancient Greek NLP models