---
title: WiktionaryEN
emoji: 🐠
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---

# 🇬🇧 Consolidated English Linguistics Hub

A unified, multi-engine interface for morpho-syntactic and semantic analysis of the English language. This application aggregates traditional rule-based NLP, statistical models, and knowledge graphs into a single Gradio dashboard.

## 🚀 Features

* **Word Encyclopedia (Non-Contextual):** A robust fallback chain that analyzes a single word using **Wiktionary**, **HanTa**, **Stanza**, **NLTK**, and **TextBlob**.
* **Comprehensive Analyzer (Contextual):** Performs full-sentence analysis (POS tagging, dependency parsing) and ranks semantic senses based on context.
* **Grammar Checking:** Dual-engine support using **LanguageTool** (Python library) and **After The Deadline** (Java server via API).
* **Knowledge Graphs:** Integration with **Open English WordNet (OEWN)**, **ConceptNet**, and **OpenBLP** for synonymy, antonymy, and semantic relations.
* **Inflection Generation:** Complete verb conjugation and noun declension tables via `pattern.en`.

## 🧩 Architecture

The app uses a "waterfall" logic for robustness:

1. **Primary:** Wiktionary (SQL) & spaCy (transformer/statistical).
2. **Secondary:** HanTa (HMM) & OEWN (graph).
3. **Fallback:** Pattern (rules), NLTK (statistical), TextBlob (Naive Bayes).

-----

### 💡 Why use these tools in the era of LLMs?

#### 1. Determinism vs. Probabilistic Outcomes

If you ask an LLM, there is always a non-zero probability that its answer deviates from the expected "ground truth". Traditional tools, *if* they "know" the answer, return the ground truth every single time. It is mathematically impossible for them to hallucinate outcomes that are not in their rule set or knowledge base.

#### 2. Structured Data vs. Unstructured Text

LLMs generate text. Tools like **WordNet** and **ConceptNet** generate *graphs*.
If you are building a game or a search engine, or are interested in semantic/ontological relations, you often need a rigid hierarchy (e.g., *dog is-a animal*). Parsing an LLM's chatty response to extract this data is fragile, slow, and resource-intensive. Traditional knowledge graphs like WordNet provide traversable tree structures that computers can process almost instantly.

#### 3. Latency and Cost

* **LLM:** Requires an API call (cost) or a massive local GPU (hardware). A single request can take seconds.
* **spaCy/HanTa:** Runs on a standard CPU in milliseconds. For analyzing millions of words (e.g., data mining a library of books), traditional tools are orders of magnitude faster and cheaper.

#### 4. Explainability and Debugging

When **LanguageTool** flags a grammar error, it points to a specific, coded rule (e.g., that a sentence's subject and verb must agree). You know *exactly* why it flagged the error. An LLM might fix the sentence, but its "explanation" is a generated post-rationalization, which may or may not reflect the actual linguistic reason it changed the token.

#### 5. Privacy (Air-Gapped)

You can easily run this hub entirely locally, including the optional AtD/ConceptNet APIs, and process sensitive documents through these tools without any data leaving your machine.

## 📝 License

The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/). The underlying models and data sources retain their original licenses.

**Note**: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty.