	---
	title: Word Enc De
	emoji: ⚡
	colorFrom: red
	colorTo: red
	sdk: gradio
	sdk_version: 5.49.1
	app_file: app.py
	pinned: false
	license: cc-by-sa-4.0
	---

	# 🏛️ Word Encyclopedia: German Linguistics Hub

	A consolidated Gradio application that combines several NLP tools into a single web interface for in-depth linguistic analysis, mostly of German. It supports both broad non-contextual analysis of individual words and deep contextual analysis of full sentences.

	[![License: CC BY-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-sa/4.0/)

	## 🌟 Features

	### 📖 Word Encyclopedia (DE) - Non-Contextual Analysis
	The flagship tool for comprehensive word analysis. Enter a single German word and discover all its possible grammatical roles.

	Key Features:
	- Multi-Engine Architecture: Uses `HanTa` (primary) with automatic fallback to `spacy-iwnlp`
	- Artifact-Free: Cross-validates all grammatical roles with OdeNet to eliminate false inflections
	- Comprehensive Coverage: Handles ambiguous words (e.g., "Lauf" as noun vs "laufen" as verb, "See" as masculine/feminine)
	- Rich Data:
	  - Complete inflection tables (declension, conjugation) via `pattern.de`
	  - Morpheme analysis via `HanTa`
	  - Semantic senses from `OdeNet` (German WordNet)
	  - Conceptual relations from `ConceptNet`

	Example: Input "Lauf" → Finds both noun ("der Lauf" - the run) and verb ("laufen" - to run) with complete inflections.
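The primary-with-fallback lookup and the cross-validation step described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the engine callables, `analyze_with_fallback`, `cross_validate`, and the toy lexicon are all hypothetical stand-ins for `HanTa`, `spacy-iwnlp`, and an OdeNet lookup.

```python
from typing import Callable, Iterable

# An "engine" here is any callable mapping a word to candidate (lemma, pos) pairs.
Candidate = tuple[str, str]
Engine = Callable[[str], list[Candidate]]


def analyze_with_fallback(word: str, engines: Iterable[Engine]) -> list[Candidate]:
    """Return candidates from the first engine that yields a non-empty result."""
    for engine in engines:
        try:
            candidates = engine(word)
        except Exception:
            # Treat any engine failure as "no result" and try the next engine.
            continue
        if candidates:
            return candidates
    return []


def cross_validate(candidates: list[Candidate], lexicon: set[str]) -> list[Candidate]:
    """Keep only candidates whose lemma is attested in the reference lexicon."""
    return [(lemma, pos) for lemma, pos in candidates if lemma.lower() in lexicon]


# Toy stand-ins (hypothetical): the primary engine fails, the fallback answers.
def primary_engine(word: str) -> list[Candidate]:
    raise RuntimeError("model file missing")


def fallback_engine(word: str) -> list[Candidate]:
    return [("Lauf", "NOUN"), ("laufen", "VERB"), ("Laufe", "NOUN")]


toy_lexicon = {"lauf", "laufen"}  # stands in for a lexicon lookup such as OdeNet
roles = cross_validate(
    analyze_with_fallback("Lauf", [primary_engine, fallback_engine]), toy_lexicon
)
print(roles)  # [('Lauf', 'NOUN'), ('laufen', 'VERB')]
```

Keeping the engines behind a common callable signature means the fallback order is just the order of the list, and the lexicon filter stays independent of whichever engine produced the candidates.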

	### 🚀 Comprehensive Analyzer (DE) - Contextual Analysis
	Deep sentence-level analysis with context-aware semantic ranking.

	Key Features:
	- Lemma-by-lemma analysis of entire sentences
	- Context-aware semantic ranking using spaCy sentence vectors
	- Subject-Verb-Agreement (SVA) validation
	- Grammar checking via LanguageTool
	- Filters semantic senses by relevance to sentence context

	Example: In "Der schnelle Hund läuft", ranks "fast" senses of "schnell" higher than "quick" based on context.
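Context-aware ranking of this kind is commonly done by comparing a vector for the sentence with a vector for each candidate sense and sorting by cosine similarity. The sketch below assumes that approach and uses plain Python lists with made-up numbers instead of real spaCy vectors; `cosine` and `rank_senses` are illustrative names, not functions from the app.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_senses(context_vec: list[float], sense_vecs: dict[str, list[float]]):
    """Sort sense labels by similarity to the context vector, most similar first."""
    scored = [(label, cosine(context_vec, vec)) for label, vec in sense_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors (made up for illustration, not real embeddings).
context = [0.9, 0.1, 0.0]
senses = {
    "moving at high speed": [1.0, 0.0, 0.0],
    "done in a short time": [0.0, 1.0, 0.0],
}
ranking = rank_senses(context, senses)
print([label for label, _ in ranking])
```

With real models, the context vector would come from the parsed sentence and the sense vectors from each sense's definition or synonyms; a similarity threshold on the scores could then drop senses that are irrelevant to the sentence.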

	### 🔬 spaCy Analyzer - Multi-lingual
	Direct access to morpho-syntactic parsing for multiple languages.

	Supported Languages:
	- German (`de_core_news_md`)
	- English (`en_core_web_md`)
	- Spanish (`es_core_news_md`)
	- Ancient Greek (7 greCy models: PROIEL/Perseus, TRF/LG/SM variants)

	Outputs:
	- Dependency parsing visualization
	- Named Entity Recognition (NER)
	- Morphological analysis tables
	- JSON export
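As a rough sketch of how the morphology table and JSON export might be produced, the helper below reads standard spaCy token attributes (`text`, `lemma_`, `pos_`, `dep_`, `head`, `morph`). The function names `morphology_rows` and `analyze` are assumptions for illustration, and the stand-in tokens let the row builder run without downloading a model.

```python
import json
from types import SimpleNamespace


def morphology_rows(doc):
    """Build one row per token from a spaCy Doc (or any object with the same attributes)."""
    return [
        {
            "text": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "dep": token.dep_,
            "head": token.head.text,
            "morph": str(token.morph),
        }
        for token in doc
    ]


def analyze(text, model_name="de_core_news_md"):
    """Parse text with an installed spaCy model and return the rows as JSON."""
    import spacy  # imported lazily so the helper above works without spaCy

    nlp = spacy.load(model_name)
    return json.dumps(morphology_rows(nlp(text)), ensure_ascii=False, indent=2)


# Stand-in tokens so the row builder can be tried without a model download.
root = SimpleNamespace(text="läuft", lemma_="laufen", pos_="VERB", dep_="ROOT",
                       morph="Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
root.head = root  # in spaCy, the root token is its own head
subj = SimpleNamespace(text="Hund", lemma_="Hund", pos_="NOUN", dep_="sb",
                       morph="Case=Nom|Gender=Masc|Number=Sing", head=root)
rows = morphology_rows([subj, root])
print(json.dumps(rows, ensure_ascii=False))
```

With spaCy and the German model installed, `analyze("Der schnelle Hund läuft")` would return the same structure for real parser output.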

	### ✅ Grammar Check (DE)
	Professional-grade grammar and style checking powered by `language-tool-python`.

	### 📚 Inflections (DE)
	Direct access to complete German inflection generation via `pattern.de`.

	### 📖 Thesaurus (DE)
	Query interface for OdeNet (German WordNet) with:
	- Synonyms, antonyms
	- Hypernyms, hyponyms
	- Holonyms, meronyms
	- Multiple sense disambiguation

	### 🌐 ConceptNet
	Direct API access to the ConceptNet 5 knowledge graph, with a robust parser that filters out self-referential results.
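Filtering self-referential results could look roughly like the sketch below. It assumes the response follows the ConceptNet 5 JSON layout, with an `edges` list whose items carry `start`, `end`, and `rel` objects that each have a `label`. The function name `parse_edges` and the sample payload are invented for illustration, and the HTTP request itself is left out.

```python
def parse_edges(payload: dict, query: str) -> list[tuple[str, str]]:
    """Return (relation, other_term) pairs, skipping edges that link a term to itself."""
    query_norm = query.strip().lower()
    results = []
    for edge in payload.get("edges", []):
        start = edge.get("start", {}).get("label", "").strip()
        end = edge.get("end", {}).get("label", "").strip()
        relation = edge.get("rel", {}).get("label", "")
        if not start or not end or not relation:
            continue  # skip malformed edges
        if start.lower() == end.lower():
            continue  # self-referential: both ends are the same term
        # Report whichever end of the edge is not the query word.
        other = end if start.lower() == query_norm else start
        results.append((relation, other))
    return results


# Hand-written sample in the assumed response shape (not real API output).
sample = {
    "edges": [
        {"start": {"label": "Lauf"}, "end": {"label": "Rennen"}, "rel": {"label": "Synonym"}},
        {"start": {"label": "Lauf"}, "end": {"label": "lauf"}, "rel": {"label": "RelatedTo"}},
        {"start": {"label": "Wettlauf"}, "end": {"label": "Lauf"}, "rel": {"label": "IsA"}},
    ]
}
relations = parse_edges(sample, "Lauf")
print(relations)  # [('Synonym', 'Rennen'), ('IsA', 'Wettlauf')]
```

Comparing labels case-insensitively catches edges such as "Lauf" to "lauf", which would otherwise show up as a relation of the word to itself.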

	## 🏗️ Architecture

	```
	┌──────────────────────────────────────────────────────┐
	│                   Gradio Interface                   │
	└──────────────────────────────────────────────────────┘
	                           │
	        ┌──────────────────┼──────────────────┐
	        │                  │                  │
	┌───────▼────────┐ ┌───────▼───────┐ ┌────────▼────────┐
	│      Word      │ │ Comprehensive │ │      spaCy      │
	│  Encyclopedia  │ │   Analyzer    │ │    Analyzer     │
	│ (Non-Context)  │ │ (Contextual)  │ │ (Multi-lingual) │
	└───────┬────────┘ └───────┬───────┘ └────────┬────────┘
	        │                  │                  │
	  ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
	  │   HanTa   │      │   spaCy   │      │   spaCy   │
	  │ (Primary) │      │ Sentence  │      │  Models   │
	  └─────┬─────┘      │  Vectors  │      └───────────┘
	        │            └─────┬─────┘
	  ┌─────▼─────┐            │
	  │  spaCy-   │            │
	  │   IWNLP   │            │
	  │(Fallback) │            │
	  └─────┬─────┘            │
	        │                  │
	        └────────────┬─────┘
	                     │
	      ┌──────────────┼──────────────┐
	      │              │              │
	┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐
	│  Pattern   │ │   OdeNet   │ │ ConceptNet │
	│    .de     │ │ (WordNet)  │ │    API     │
	└────────────┘ └────────────┘ └────────────┘
	```

	## 🔌 API Endpoints

	The application exposes Gradio API endpoints:

	- `/api/get_morphology` - spaCy analysis
	- `/api/check_grammar` - Grammar checking
	- `/api/get_thesaurus` - OdeNet queries
	- `/api/get_all_inflections` - Pattern.de inflections
	- `/api/get_conceptnet` - ConceptNet queries
	- `/api/comprehensive_analysis` - Full contextual analysis
	- `/api/analyze_word` - Word encyclopedia

	Access via the Gradio Client (note that `api_name` takes the endpoint name without the `/api` prefix):
	```python
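	from gradio_client import Client

	# Connect to a locally running instance of the app
	client = Client("http://localhost:7860")
	# Query the Word Encyclopedia endpoint for the word "Lauf"
	result = client.predict("Lauf", 3, api_name="/analyze_word")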
	```

	## 🐛 Troubleshooting

	### "HanTa model file missing"
	```bash
	pip uninstall HanTa
	pip install HanTa --no-cache-dir
	```

	### "LanguageTool failed to initialize"
	Ensure Java is installed (required by LanguageTool):
	```bash
	java -version # Should show Java 8+
	```

	### "OdeNet worker failed"
	Check your internet connection. The app downloads OdeNet data on first run.

	### spaCy model not found
	```bash
	python -m spacy download de_core_news_md --force
	```
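The checks above can also be run in one go before starting the app. The following is a small preflight sketch using only the standard library; the function name `check_environment` and the default module list are assumptions based on the dependencies named in this README, and it only tests whether each item can be found, not whether it works.

```python
import importlib.util
import shutil

DEFAULT_MODULES = ("HanTa", "spacy", "language_tool_python", "de_core_news_md")


def check_environment(modules=DEFAULT_MODULES):
    """Report which prerequisites look available, without importing heavy packages."""
    report = {"java": shutil.which("java") is not None}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


status = check_environment()
for item, ok in status.items():
    print(f"{item}: {'ok' if ok else 'MISSING'}")
```

Any item reported as `MISSING` points to the matching fix in this section, for example reinstalling `HanTa` or downloading the spaCy model.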

	## ⚖️ License and Attribution

	This application is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) due to the ShareAlike requirements of its core data dependencies.

	### Why CC BY-SA 4.0?
	- OdeNet Data: CC BY-SA 4.0 (German WordNet)
	- ConceptNet 5 Data: CC BY-SA 4.0

	The ShareAlike clause requires derivative works to use the same license.

	### Your Obligations
	Under CC BY-SA 4.0, you must:
	1. Give credit: Cite this work and all dependencies
	2. Indicate changes: Note any modifications you make
	3. Share-Alike: Distribute derivative works under CC BY-SA 4.0
	4. No additional restrictions: Cannot add DRM or extra legal terms

	## 🙏 Acknowledgments

	This project would not be possible without:

	- [HanTa](https://github.com/wartaal/HanTa) - High-accuracy morphological analysis
	- [spaCy](https://spacy.io/) - Industrial-strength NLP framework
	- [IWNLP](https://www.iwnlp.com/) - Comprehensive German lemmatization
	- [OdeNet](https://github.com/hdaSprachtechnologie/odenet) - Open German WordNet
	- [pattern.de](https://github.com/clips/pattern) - German linguistics tools
	- [ConceptNet](https://conceptnet.io/) - Multilingual knowledge graph
	- [LanguageTool](https://languagetool.org/) - Grammar checking
	- [greCy](https://github.com/CrispStrobe/greCy) - Ancient Greek NLP models