---
title: WiktionaryEN
emoji: π
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---
# 🇬🇧 Consolidated English Linguistics Hub

A unified, multi-engine interface for morpho-syntactic and semantic analysis of the English language. This application aggregates traditional rule-based NLP, statistical models, and knowledge graphs into a single Gradio dashboard.

## Features
* **Word Encyclopedia (Non-Contextual):** A robust fallback chain that analyzes a single word using **Wiktionary**, **HanTa**, **Stanza**, **NLTK**, and **TextBlob**.
* **Comprehensive Analyzer (Contextual):** Performs full-sentence analysis (POS tagging, dependency parsing) and ranks semantic senses based on context.
* **Grammar Checking:** Dual-engine support using **LanguageTool** (Python library) and **After the Deadline** (Java server via API).
* **Knowledge Graphs:** Integration with **Open English WordNet (OEWN)**, **ConceptNet**, and **OpenBLP** for synonymy, antonymy, and semantic relations.
* **Inflection Generation:** Complete verb conjugation and noun declension tables via `pattern.en`.
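As a rough idea of what an inflection table looks like, here is a toy paradigm builder for *regular* verbs only. It is a pure-Python sketch invented for this README; `pattern.en` additionally handles irregular verbs, consonant doubling, *y → ies*, and full noun declension, none of which this sketch attempts.

```python
def regular_forms(verb):
    """Toy paradigm for a regular English verb.

    Invented for illustration: pattern.en covers irregulars and
    spelling rules that this sketch deliberately ignores.
    """
    ends_in_e = verb.endswith("e")
    return {
        "infinitive": verb,
        "3sg": verb + "s",                                # he/she/it walks
        "past": verb + ("d" if ends_in_e else "ed"),      # walked, baked
        "gerund": (verb[:-1] if ends_in_e else verb) + "ing",  # walking, baking
    }

print(regular_forms("walk"))
print(regular_forms("bake"))
```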
## 🧩 Architecture

The app uses a "waterfall" logic for robustness:

1. **Primary:** Wiktionary (SQL) & spaCy (Transformer/Statistical).
2. **Secondary:** HanTa (HMM) & OEWN (Graph).
3. **Fallback:** Pattern (Rules), NLTK (Statistical), TextBlob (Naive Bayes).
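The waterfall can be sketched as a simple fallback chain. The lookup functions below are hypothetical stand-ins for the real engine wrappers in `app.py`, with canned answers so the control flow is visible:

```python
# Minimal sketch of the "waterfall" fallback logic.
# All three lookup functions are invented stand-ins for the real
# Wiktionary / HanTa / Pattern wrappers in app.py.

def lookup_wiktionary(word):
    # Primary: pretend only one word is in the SQL dump.
    return {"engine": "wiktionary", "pos": "noun"} if word == "dog" else None

def lookup_hanta(word):
    # Secondary: the HMM tagger "knows" one more word.
    return {"engine": "hanta", "pos": "verb"} if word == "run" else None

def lookup_pattern(word):
    # Fallback: a rule-based guesser always returns something.
    return {"engine": "pattern", "pos": "unknown"}

ENGINES = [lookup_wiktionary, lookup_hanta, lookup_pattern]

def analyze(word):
    """Try each engine in priority order; return the first hit."""
    for engine in ENGINES:
        result = engine(word)
        if result is not None:
            return result
    return None

print(analyze("dog"))    # served by the primary engine
print(analyze("run"))    # falls through to the secondary engine
print(analyze("xyzzy"))  # caught by the rule-based fallback
```

The point of the chain is that a miss in a high-precision engine degrades gracefully into a lower-precision guess instead of an empty result.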
| ----- | |
### Why use these tools in the era of LLMs?

#### 1. Determinism vs. Probabilistic Outcomes
If you ask an LLM, there is always a non-zero probability that its answer deviates from the ground truth. Traditional tools return the ground truth, *if* they "know" it, every single time. By construction, they cannot hallucinate outcomes that are not in their rule set or knowledge base.
#### 2. Structured Data vs. Unstructured Text
LLMs generate text. Tools like **WordNet** and **ConceptNet** generate *graphs*. If you are building a game or a search engine, or are interested in semantic/ontological relations, you often need a rigid hierarchy (e.g., *Dog is-a Animal*). Parsing an LLM's chatty response to extract this data is fragile, slow, and resource-consuming. Traditional knowledge graphs like WordNet provide traversable tree structures that computers can process almost instantly.
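To illustrate why such graphs are cheap to traverse, here is a toy *is-a* hierarchy walked in pure Python. The mini-graph is invented for this example; the real app queries OEWN and ConceptNet, which work on the same principle at far larger scale:

```python
# Toy is-a hierarchy; the real app traverses OEWN/ConceptNet instead.
IS_A = {
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
    "animal": "organism",
}

def hypernym_chain(word):
    """Follow the is-a edges upward until the root is reached."""
    chain = [word]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

print(" is-a ".join(hypernym_chain("dog")))
# dog is-a canine is-a mammal is-a animal is-a organism
```

Each step is a single dictionary lookup, so the whole chain costs microseconds; no text generation or parsing is involved.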
#### 3. Latency and Cost

* **LLM:** Requires an API call (cost) or a massive local GPU (hardware). A single request can take seconds.
* **spaCy/HanTa:** Runs on a standard CPU in milliseconds. For analyzing millions of words (e.g., data mining a library of books), traditional tools are orders of magnitude faster and cheaper.
#### 4. Explainability and Debugging

When **LanguageTool** flags a grammar error, it points to a specific, coded rule (e.g., that sentence subject and verb must agree). You know *exactly* why it flagged the error.

An LLM might fix the sentence, but its "explanation" is a generated post-rationalization, which may or may not be the actual linguistic reason it changed the token.
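A coded rule is inspectable by construction. Below is a toy subject-verb agreement check in the spirit of (but vastly simpler than) LanguageTool's rule files; the rule ID and the tiny lexicon are invented for this sketch:

```python
# Toy agreement rule, loosely inspired by LanguageTool's coded rules.
# The rule ID and the mini-lexicon are invented for this sketch.
SINGULAR_SUBJECTS = {"he", "she", "it"}

def check_agreement(subject, verb):
    """Flag 'he walk'-style errors and name the rule that fired."""
    if subject in SINGULAR_SUBJECTS and not verb.endswith("s"):
        return {
            "rule": "TOY_SUBJECT_VERB_AGREEMENT",
            "message": f"Singular subject '{subject}' needs '{verb}s'.",
        }
    return None  # no rule fired

print(check_agreement("he", "walk"))   # flagged, with an explicit rule ID
print(check_agreement("he", "walks"))  # None: the sentence passes
```

When the check fires, the output names the exact rule, so the diagnosis is reproducible and debuggable rather than a post-hoc explanation.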
#### 5. Privacy (Air-Gapped)

You can run this hub entirely locally, including the optional AtD/ConceptNet APIs, so sensitive documents never have to leave your machine.
## License

The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).

The underlying models and data sources retain their original licenses.
**Note**: This is a simple educational tool and a work in progress. Some results may be inconsistent or faulty.