---
title: WiktionaryEN
emoji: 🐠
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---
# πŸ‡¬πŸ‡§ Consolidated English Linguistics Hub
A unified, multi-engine interface for morpho-syntactic and semantic analysis of the English language. This application aggregates traditional rule-based NLP, statistical models, and knowledge graphs into a single Gradio dashboard.
## πŸš€ Features
* **Word Encyclopedia (Non-Contextual):** A robust fallback chain that analyzes a single word using **Wiktionary**, **HanTa**, **Stanza**, **NLTK**, and **TextBlob**.
* **Comprehensive Analyzer (Contextual):** Performs full-sentence analysis (POS tagging, dependency parsing) and ranks semantic senses based on context.
* **Grammar Checking:** Dual-engine support using **LanguageTool** (Python library) and **After The Deadline** (Java Server via API).
* **Knowledge Graphs:** Integration with **Open English WordNet (OEWN)**, **ConceptNet**, and **OpenBLP** for synonymy, antonymy, and semantic relations.
* **Inflection Generation:** Complete verb conjugation and noun declension tables via `pattern.en`.
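To illustrate the kind of rule-based inflection the last feature refers to, here is a toy pluralizer covering only a few regular English patterns. It is a minimal stand-in, not `pattern.en`'s implementation; the real library also handles irregular forms and full verb conjugation tables.

```python
# Toy regular-inflection rules standing in for pattern.en's pluralization;
# covers -es, consonant-y -> -ies, and the default -s. Irregulars (mouse,
# child, ...) are deliberately out of scope here.
def pluralize(noun):
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

print(pluralize("fox"))    # foxes
print(pluralize("city"))   # cities
print(pluralize("table"))  # tables
```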
## 🧩 Architecture
The app uses a "Waterfall" logic for robustness:
1. **Primary:** Wiktionary (SQL) & spaCy (Transformer/Statistical).
2. **Secondary:** HanTa (HMM) & OEWN (Graph).
3. **Fallback:** Pattern (Rules), NLTK (Statistical), TextBlob (Naive Bayes).
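The waterfall can be sketched as a simple fallback chain. The analyzer functions below are hypothetical stand-ins for the real engine adapters (Wiktionary/spaCy, HanTa/OEWN, Pattern/NLTK/TextBlob); only the control flow is the point.

```python
# Each engine returns a result dict, or None if it doesn't know the word.
def analyze_primary(word):
    # Stand-in for Wiktionary (SQL) / spaCy: pretend it only knows "run"
    return {"pos": "verb", "source": "primary"} if word == "run" else None

def analyze_secondary(word):
    # Stand-in for HanTa / OEWN: pretend it only knows "table"
    return {"pos": "noun", "source": "secondary"} if word == "table" else None

def analyze_fallback(word):
    # Stand-in for Pattern / NLTK / TextBlob: always answers something
    return {"pos": "unknown", "source": "fallback"}

ENGINES = [analyze_primary, analyze_secondary, analyze_fallback]

def waterfall(word):
    """Return the first non-None result from the engine chain."""
    for engine in ENGINES:
        result = engine(word)
        if result is not None:
            return result

print(waterfall("run"))    # answered by the primary engine
print(waterfall("table"))  # falls through to the secondary engine
print(waterfall("xyzzy"))  # bottoms out at the fallback
```

Because the last engine always answers, the chain never returns `None`; the app's robustness comes from this total ordering of engines.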
-----
### πŸ’‘ Why use these tools in the Era of LLMs?
#### 1. Determinism vs. Probabilistic Outcomes
If you ask an LLM, there is always a non-zero probability that its answer deviates from the ground truth. Traditional tools return the ground truth, *if* they "know" it, every single time. It is mathematically impossible for them to hallucinate outcomes that are not in their rule set or knowledge base.
#### 2. Structured Data vs. Unstructured Text
LLMs generate text. Tools like **WordNet** and **ConceptNet** generate *graphs*. If you are building a game or a search engine, or are interested in semantic/ontological relations, you often need a rigid hierarchy (e.g., *Dog is-a Animal*). Parsing an LLM's chatty response to extract this data is fragile, slow, and resource-intensive. Traditional knowledge graphs like WordNet provide traversable tree structures that computers can process almost instantly.
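The *Dog is-a Animal* example can be made concrete with a toy is-a table; a real app would walk WordNet's hypernym relations instead, but the traversal is the same idea.

```python
# Toy is-a edges; in practice these would come from WordNet hypernyms.
IS_A = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}

def hypernym_chain(word):
    """Walk the is-a edges from a word up to the root concept."""
    chain = [word]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

print(" -> ".join(hypernym_chain("dog")))
# dog -> canine -> carnivore -> mammal -> animal
```

No parsing, no prompting: the hierarchy is directly machine-readable, which is exactly what a game or search engine needs.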
#### 3. Latency and Cost
* **LLM:** Requires an API call (cost) or a massive local GPU (hardware). A single request can take seconds.
* **spaCy/HanTa:** Runs on a standard CPU in milliseconds. For analyzing millions of words (e.g., data mining a library of books), traditional tools are orders of magnitude faster and cheaper.
#### 4. Explainability and Debugging
When **LanguageTool** flags a grammar error, it points to a specific, coded rule (e.g., that the subject and verb of a sentence must agree in number). You know *exactly* why it flagged the error.
An LLM might fix the sentence, but its "explanation" is a generated post-rationalization, which may or may not be the actual linguistic reason it changed the token.
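A minimal sketch of such a coded rule, with a made-up rule id and a deliberately tiny verb list; this is the spirit of LanguageTool's rules, not its actual implementation:

```python
import re

# Hypothetical rule: third-person-singular subject followed by a bare
# verb form ("he run" instead of "he runs"). Rule id and verb list are
# illustrative only.
RULE_ID = "SUBJECT_VERB_AGREEMENT_3SG"
PATTERN = re.compile(r"\b(he|she|it)\s+(run|go|walk|eat)\b", re.IGNORECASE)

def check(sentence):
    """Return (rule_id, matched_text) for each violation, else []."""
    return [(RULE_ID, m.group(0)) for m in PATTERN.finditer(sentence)]

print(check("He run every morning."))   # flagged, with the exact rule id
print(check("He runs every morning."))  # clean
```

Every flag traces back to one named rule and one matched span, which is what makes rule-based checkers trivially debuggable.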
#### 5. Privacy (Air-Gapped)
You can run this hub entirely locally (including the optional AtD/ConceptNet APIs), so sensitive documents never have to leave your machine.
## πŸ“ License
The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).
The underlying models and data sources retain their original licenses.
**Note**: This is a simple educational tool and a work in progress. Some results may be inconsistent or faulty.