---
title: WiktionaryEN
emoji: ๐
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---
# 🇬🇧 Consolidated English Linguistics Hub
A unified, multi-engine interface for morpho-syntactic and semantic analysis of the English language. This application aggregates traditional rule-based NLP, statistical models, and knowledge graphs into a single Gradio dashboard.
## Features
* **Word Encyclopedia (Non-Contextual):** A robust fallback chain that analyzes a single word using **Wiktionary**, **HanTa**, **Stanza**, **NLTK**, and **TextBlob**.
* **Comprehensive Analyzer (Contextual):** Performs full-sentence analysis (POS tagging, dependency parsing) and ranks semantic senses based on context.
* **Grammar Checking:** Dual-engine support using **LanguageTool** (Python library) and **After The Deadline** (Java Server via API).
* **Knowledge Graphs:** Integration with **Open English WordNet (OEWN)**, **ConceptNet**, and **OpenBLP** for synonymy, antonymy, and semantic relations.
* **Inflection Generation:** Complete verb conjugation and noun declension tables via `pattern.en`.
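To illustrate the kind of table the inflection feature produces, here is a minimal, hand-rolled inflector for *regular* English verbs. This is a simplified sketch for illustration only, not the `pattern.en` API; the function name and rules are invented, and irregular verbs are out of scope:

```python
def inflect_regular_verb(base: str) -> dict:
    """Naive inflection table for regular English verbs only."""
    # Third-person singular: pass -> passes, try -> tries, walk -> walks
    if base.endswith(("s", "sh", "ch", "x", "z")):
        third = base + "es"
    elif base.endswith("y") and base[-2] not in "aeiou":
        third = base[:-1] + "ies"
    else:
        third = base + "s"
    # Gerund and simple past: move -> moving/moved, try -> trying/tried
    if base.endswith("e"):
        gerund, past = base[:-1] + "ing", base + "d"
    elif base.endswith("y") and base[-2] not in "aeiou":
        gerund, past = base + "ing", base[:-1] + "ied"
    else:
        gerund, past = base + "ing", base + "ed"
    return {"base": base, "3sg": third, "gerund": gerund, "past": past}
```

For example, `inflect_regular_verb("try")` yields `tries`, `trying`, `tried`. A real library handles consonant doubling (`stop` → `stopped`) and irregular verbs, which this toy does not.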
## 🧩 Architecture
The app uses a "Waterfall" logic for robustness:
1. **Primary:** Wiktionary (SQL) & spaCy (Transformer/Statistical).
2. **Secondary:** HanTa (HMM) & OEWN (Graph).
3. **Fallback:** Pattern (Rules), NLTK (Statistical), TextBlob (Naive Bayes).
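The waterfall above can be sketched as a simple fallback chain: each engine is tried in priority order, and the first one that returns a non-empty analysis wins. The engine names and toy lexicon below are stand-ins for illustration, not the app's actual functions:

```python
from typing import Callable, Optional

def waterfall(word: str, engines: list) -> dict:
    """Try each (name, analyzer) pair in priority order; return the first hit."""
    for name, analyze in engines:
        result = analyze(word)
        if result:  # engine "knows" the word
            return {"engine": name, **result}
    return {"engine": None, "note": "no engine recognized the word"}

# Toy stand-ins: a tiny dict plays the role of Wiktionary (primary),
# a suffix heuristic plays the role of a statistical fallback.
LEXICON = {"dog": {"pos": "NOUN"}, "run": {"pos": "VERB"}}

engines = [
    ("wiktionary", lambda w: LEXICON.get(w)),
    ("suffix-guess", lambda w: {"pos": "ADV"} if w.endswith("ly") else None),
]
```

So `waterfall("dog", engines)` is answered by the primary engine, while an out-of-lexicon word like `"quickly"` falls through to the heuristic.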
-----
### 💡 Why use these tools in the era of LLMs?
#### 1. Determinism vs. Probabilistic Outcome
If you ask an LLM, there is always a non-zero probability that its answer deviates from the ground truth. Traditional tools return the ground truth, *if* they "know" it, every single time. It is mathematically impossible for them to hallucinate outcomes that are not in their rule set or knowledge base.
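The contrast can be made concrete: a lookup-based tool is a pure function of its knowledge base, so repeated queries always agree, while any sampling step introduces a non-zero chance of drift. The tiny lexicon and the sampling caricature below are both invented for illustration:

```python
import random

LEXICON = {"dog": "NOUN", "run": "VERB"}  # fixed knowledge base

def rule_based_pos(word: str):
    """Deterministic: same input, same output, or None if unknown."""
    return LEXICON.get(word)

def sampled_pos(word: str) -> str:
    """Caricature of a probabilistic generator: it may drift from the truth."""
    return random.choice(["NOUN", "VERB", "ADJ"])
```

A thousand calls to `rule_based_pos("dog")` give a thousand identical answers; the same cannot be guaranteed for `sampled_pos`.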
#### 2. Structured Data vs. Unstructured Text
LLMs generate text. Tools like **WordNet** and **ConceptNet** generate *graphs*. If you are building a game or a search engine, or are interested in semantic or ontological relations, you often need a rigid hierarchy (e.g., *Dog is-a Animal*). Parsing an LLM's chatty response to extract this data is fragile, slow, and resource-intensive. Traditional knowledge graphs like WordNet provide traversable tree structures that computers can process almost instantly.
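A traversable *is-a* hierarchy can be queried in a few lines, with no text parsing at all. The mini-graph below is invented for illustration; WordNet's real API (e.g. via NLTK's synsets) is richer but follows the same principle:

```python
# Each concept points to its direct hypernym (its "is-a" parent).
IS_A = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}

def hypernym_chain(concept: str) -> list:
    """Walk the is-a edges up to the root of the hierarchy."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def is_a(concept: str, ancestor: str) -> bool:
    """True if `ancestor` appears anywhere above `concept` in the graph."""
    return ancestor in hypernym_chain(concept)[1:]
```

Here `is_a("dog", "animal")` is answered by a constant-time walk up four edges, which is the kind of query a game or search engine issues millions of times.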
#### 3. Latency and Cost
* **LLM:** Requires an API call (cost) or a massive local GPU (hardware). A single request can take seconds.
* **spaCy/HanTa:** Runs on a standard CPU in milliseconds. For analyzing millions of words (e.g., data mining a library of books), traditional tools are orders of magnitude faster and cheaper.
#### 4. Explainability and Debugging
When **LanguageTool** flags a grammar error, it points to a specific, coded rule (e.g., that sentence subject and verb must agree). You know *exactly* why it flagged the error.
An LLM might fix the sentence, but its "explanation" is a generated post-rationalization, which may or may not be the actual linguistic reason it changed the token.
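A rule-based checker of this kind can be sketched with a single coded rule: the pattern, the rule ID, and the explanation travel together, so every flag is traceable to the exact rule that fired. The rule ID and message format below are made up for illustration and are not LanguageTool's:

```python
import re

RULES = [
    # (rule_id, compiled pattern, human-readable explanation)
    ("SG_SUBJECT_PL_VERB",
     re.compile(r"\b(he|she|it)\s+(go|do|have|run|walk)\b", re.IGNORECASE),
     "Singular subject requires the -s/-es verb form (e.g. 'she goes')."),
]

def check(sentence: str) -> list:
    """Return every match together with the exact rule that flagged it."""
    flags = []
    for rule_id, pattern, message in RULES:
        for m in pattern.finditer(sentence):
            flags.append({"rule": rule_id, "span": m.span(), "message": message})
    return flags
```

Checking `"She go to school."` yields one flag carrying `SG_SUBJECT_PL_VERB` plus the character span, so the "why" is never a guess.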
#### 5. Privacy (Air-Gapped)
You can easily run this hub entirely locally (including the optional AtD/ConceptNet APIs), so sensitive documents can be processed without ever leaving your machine.
## License
The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).
The underlying models and data sources retain their original licenses.
**Note**: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty.