---
title: WiktionaryEN
emoji: ๐
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---
# 🇬🇧 Consolidated English Linguistics Hub
A unified, multi-engine interface for morpho-syntactic and semantic analysis of the English language. This application aggregates traditional rule-based NLP, statistical models, and knowledge graphs into a single Gradio dashboard.
## Features
* **Word Encyclopedia (Non-Contextual):** A robust fallback chain that analyzes a single word using **Wiktionary**, **HanTa**, **Stanza**, **NLTK**, and **TextBlob**.
* **Comprehensive Analyzer (Contextual):** Performs full-sentence analysis (POS tagging, dependency parsing) and ranks semantic senses based on context.
* **Grammar Checking:** Dual-engine support using **LanguageTool** (Python library) and **After The Deadline** (Java Server via API).
* **Knowledge Graphs:** Integration with **Open English WordNet (OEWN)**, **ConceptNet**, and **OpenBLP** for synonymy, antonymy, and semantic relations.
* **Inflection Generation:** Complete verb conjugation and noun declension tables via `pattern.en`.
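To illustrate the kind of table the inflection feature produces, here is a minimal, hand-rolled inflector for *regular* English verbs. This is a simplified sketch for illustration only, not the `pattern.en` API; the function name and rules are invented, and irregular verbs are out of scope:

```python
def inflect_regular_verb(base: str) -> dict:
    """Naive inflection table for regular English verbs only."""
    # Third-person singular: pass -> passes, try -> tries, walk -> walks
    if base.endswith(("s", "sh", "ch", "x", "z")):
        third = base + "es"
    elif base.endswith("y") and base[-2] not in "aeiou":
        third = base[:-1] + "ies"
    else:
        third = base + "s"
    # Gerund and simple past: move -> moving/moved, try -> trying/tried
    if base.endswith("e"):
        gerund, past = base[:-1] + "ing", base + "d"
    elif base.endswith("y") and base[-2] not in "aeiou":
        gerund, past = base + "ing", base[:-1] + "ied"
    else:
        gerund, past = base + "ing", base + "ed"
    return {"base": base, "3sg": third, "gerund": gerund, "past": past}
```

For example, `inflect_regular_verb("try")` yields `tries`, `trying`, `tried`. A real library handles consonant doubling (`stop` → `stopped`) and irregular verbs, which this toy does not.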
## 🧩 Architecture
The app uses a "Waterfall" logic for robustness:
1. **Primary:** Wiktionary (SQL) & spaCy (Transformer/Statistical).
2. **Secondary:** HanTa (HMM) & OEWN (Graph).
3. **Fallback:** Pattern (Rules), NLTK (Statistical), TextBlob (Naive Bayes).
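The waterfall above can be sketched as a simple fallback chain: each engine is tried in priority order, and the first one that returns a non-empty analysis wins. The engine names and toy lexicon below are stand-ins for illustration, not the app's actual functions:

```python
from typing import Callable, Optional

def waterfall(word: str, engines: list) -> dict:
    """Try each (name, analyzer) pair in priority order; return the first hit."""
    for name, analyze in engines:
        result = analyze(word)
        if result:  # engine "knows" the word
            return {"engine": name, **result}
    return {"engine": None, "note": "no engine recognized the word"}

# Toy stand-ins: a tiny dict plays the role of Wiktionary (primary),
# a suffix heuristic plays the role of a statistical fallback.
LEXICON = {"dog": {"pos": "NOUN"}, "run": {"pos": "VERB"}}

engines = [
    ("wiktionary", lambda w: LEXICON.get(w)),
    ("suffix-guess", lambda w: {"pos": "ADV"} if w.endswith("ly") else None),
]
```

So `waterfall("dog", engines)` is answered by the primary engine, while an out-of-lexicon word like `"quickly"` falls through to the heuristic.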
-----
### 💡 Why use these tools in the era of LLMs?
#### 1. Determinism vs. Probabilistic Outcome
If you ask an LLM, there is always a non-zero probability that its answer deviates from the ground truth. Traditional tools return the ground truth, *if* they "know" it, every single time. It is mathematically impossible for them to hallucinate outcomes that are not in their rule set or knowledge base.
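The contrast can be made concrete: a lookup-based tool is a pure function of its knowledge base, so repeated queries always agree, while any sampling step introduces a non-zero chance of drift. The tiny lexicon and the sampling caricature below are both invented for illustration:

```python
import random

LEXICON = {"dog": "NOUN", "run": "VERB"}  # fixed knowledge base

def rule_based_pos(word: str):
    """Deterministic: same input, same output, or None if unknown."""
    return LEXICON.get(word)

def sampled_pos(word: str) -> str:
    """Caricature of a probabilistic generator: it may drift from the truth."""
    return random.choice(["NOUN", "VERB", "ADJ"])
```

A thousand calls to `rule_based_pos("dog")` give a thousand identical answers; the same cannot be guaranteed for `sampled_pos`.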
#### 2. Structured Data vs. Unstructured Text
LLMs generate text. Tools like **WordNet** and **ConceptNet** generate *graphs*. If you are building a game or a search engine, or are interested in semantic or ontological relations, you often need a rigid hierarchy (e.g., *Dog is-a Animal*). Parsing an LLM's chatty response to extract this data is fragile, slow, and resource-intensive. Traditional knowledge graphs like WordNet provide traversable tree structures that computers can process almost instantly.
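A traversable *is-a* hierarchy can be queried in a few lines, with no text parsing at all. The mini-graph below is invented for illustration; WordNet's real API (e.g. via NLTK's synsets) is richer but follows the same principle:

```python
# Each concept points to its direct hypernym (its "is-a" parent).
IS_A = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}

def hypernym_chain(concept: str) -> list:
    """Walk the is-a edges up to the root of the hierarchy."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def is_a(concept: str, ancestor: str) -> bool:
    """True if `ancestor` appears anywhere above `concept` in the graph."""
    return ancestor in hypernym_chain(concept)[1:]
```

Here `is_a("dog", "animal")` is answered by a constant-time walk up four edges, which is the kind of query a game or search engine issues millions of times.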
#### 3. Latency and Cost
* **LLM:** Requires an API call (cost) or a massive local GPU (hardware). A single request can take seconds.
* **spaCy/HanTa:** Runs on a standard CPU in milliseconds. For analyzing millions of words (e.g., data mining a library of books), traditional tools are orders of magnitude faster and cheaper.
#### 4. Explainability and Debugging
When **LanguageTool** flags a grammar error, it points to a specific, coded rule (e.g., that sentence subject and verb must agree). You know *exactly* why it flagged the error.
An LLM might fix the sentence, but its "explanation" is a generated post-rationalization, which may or may not be the actual linguistic reason it changed the token.
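A rule-based checker of this kind can be sketched with a single coded rule: the pattern, the rule ID, and the explanation travel together, so every flag is traceable to the exact rule that fired. The rule ID and message format below are made up for illustration and are not LanguageTool's:

```python
import re

RULES = [
    # (rule_id, compiled pattern, human-readable explanation)
    ("SG_SUBJECT_PL_VERB",
     re.compile(r"\b(he|she|it)\s+(go|do|have|run|walk)\b", re.IGNORECASE),
     "Singular subject requires the -s/-es verb form (e.g. 'she goes')."),
]

def check(sentence: str) -> list:
    """Return every match together with the exact rule that flagged it."""
    flags = []
    for rule_id, pattern, message in RULES:
        for m in pattern.finditer(sentence):
            flags.append({"rule": rule_id, "span": m.span(), "message": message})
    return flags
```

Checking `"She go to school."` yields one flag carrying `SG_SUBJECT_PL_VERB` plus the character span, so the "why" is never a guess.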
#### 5. Privacy (Air-Gapped)
You can easily run this hub entirely locally (including the optional AtD/ConceptNet APIs), so sensitive documents can be processed without ever leaving your machine.
## License
The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).
The underlying models and data sources retain their original licenses.
**Note**: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty.