---
title: WiktionaryEN
emoji: 🐠
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---
# πŸ‡¬πŸ‡§ Consolidated English Linguistics Hub
A unified, multi-engine interface for morpho-syntactic and semantic analysis of the English language. This application aggregates traditional rule-based NLP, statistical models, and knowledge graphs into a single Gradio dashboard.
## πŸš€ Features
* **Word Encyclopedia (Non-Contextual):** A robust fallback chain that analyzes a single word using **Wiktionary**, **HanTa**, **Stanza**, **NLTK**, and **TextBlob**.
* **Comprehensive Analyzer (Contextual):** Performs full-sentence analysis (POS tagging, dependency parsing) and ranks semantic senses based on context.
* **Grammar Checking:** Dual-engine support using **LanguageTool** (Python library) and **After The Deadline** (Java Server via API).
* **Knowledge Graphs:** Integration with **Open English WordNet (OEWN)**, **ConceptNet**, and **OpenBLP** for synonymy, antonymy, and semantic relations.
* **Inflection Generation:** Complete verb conjugation and noun declension tables via `pattern.en`.
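To illustrate the kind of rule-based inflection the last feature refers to, here is a toy pluralizer covering only a few regular English patterns. It is a minimal stand-in, not `pattern.en`'s implementation; the real library also handles irregular forms and full verb conjugation tables.

```python
# Toy regular-inflection rules standing in for pattern.en's pluralization;
# covers -es, consonant-y -> -ies, and the default -s. Irregulars (mouse,
# child, ...) are deliberately out of scope here.
def pluralize(noun):
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

print(pluralize("fox"))    # foxes
print(pluralize("city"))   # cities
print(pluralize("table"))  # tables
```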
## 🧩 Architecture
The app uses a "Waterfall" logic for robustness:
1. **Primary:** Wiktionary (SQL) & spaCy (Transformer/Statistical).
2. **Secondary:** HanTa (HMM) & OEWN (Graph).
3. **Fallback:** Pattern (Rules), NLTK (Statistical), TextBlob (Naive Bayes).
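The waterfall can be sketched as a simple fallback chain. The analyzer functions below are hypothetical stand-ins for the real engine adapters (Wiktionary/spaCy, HanTa/OEWN, Pattern/NLTK/TextBlob); only the control flow is the point.

```python
# Each engine returns a result dict, or None if it doesn't know the word.
def analyze_primary(word):
    # Stand-in for Wiktionary (SQL) / spaCy: pretend it only knows "run"
    return {"pos": "verb", "source": "primary"} if word == "run" else None

def analyze_secondary(word):
    # Stand-in for HanTa / OEWN: pretend it only knows "table"
    return {"pos": "noun", "source": "secondary"} if word == "table" else None

def analyze_fallback(word):
    # Stand-in for Pattern / NLTK / TextBlob: always answers something
    return {"pos": "unknown", "source": "fallback"}

ENGINES = [analyze_primary, analyze_secondary, analyze_fallback]

def waterfall(word):
    """Return the first non-None result from the engine chain."""
    for engine in ENGINES:
        result = engine(word)
        if result is not None:
            return result

print(waterfall("run"))    # answered by the primary engine
print(waterfall("table"))  # falls through to the secondary engine
print(waterfall("xyzzy"))  # bottoms out at the fallback
```

Because the last engine always answers, the chain never returns `None`; the app's robustness comes from this total ordering of engines.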
-----
### πŸ’‘ Why use these tools in the Era of LLMs?
#### 1. Determinism vs. Probabilistic Outcomes
If you ask an LLM, there is always a non-zero probability that its answer deviates from the ground truth. Traditional tools return the ground truth, *if* they "know" it, every single time. It is mathematically impossible for them to hallucinate outcomes that are not in their rule set or knowledge base.
#### 2. Structured Data vs. Unstructured Text
LLMs generate text. Tools like **WordNet** and **ConceptNet** generate *graphs*. If you are building a game or a search engine, or are interested in semantic/ontological relations, you often need a rigid hierarchy (e.g., *Dog is-a Animal*). Parsing an LLM's chatty response to extract this data is fragile, slow, and resource-intensive. Traditional knowledge graphs like WordNet provide traversable tree structures that computers can process almost instantly.
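The *Dog is-a Animal* example can be made concrete with a toy is-a table; a real app would walk WordNet's hypernym relations instead, but the traversal is the same idea.

```python
# Toy is-a edges; in practice these would come from WordNet hypernyms.
IS_A = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}

def hypernym_chain(word):
    """Walk the is-a edges from a word up to the root concept."""
    chain = [word]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

print(" -> ".join(hypernym_chain("dog")))
# dog -> canine -> carnivore -> mammal -> animal
```

No parsing, no prompting: the hierarchy is directly machine-readable, which is exactly what a game or search engine needs.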
#### 3. Latency and Cost
* **LLM:** Requires an API call (cost) or a massive local GPU (hardware). A single request can take seconds.
* **spaCy/HanTa:** Runs on a standard CPU in milliseconds. For analyzing millions of words (e.g., data mining a library of books), traditional tools are orders of magnitude faster and cheaper.
#### 4. Explainability and Debugging
When **LanguageTool** flags a grammar error, it points to a specific, coded rule (e.g., that the subject and verb of a sentence must agree in number). You know *exactly* why it flagged the error.
An LLM might fix the sentence, but its "explanation" is a generated post-rationalization, which may or may not be the actual linguistic reason it changed the token.
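A minimal sketch of such a coded rule, with a made-up rule id and a deliberately tiny verb list; this is the spirit of LanguageTool's rules, not its actual implementation:

```python
import re

# Hypothetical rule: third-person-singular subject followed by a bare
# verb form ("he run" instead of "he runs"). Rule id and verb list are
# illustrative only.
RULE_ID = "SUBJECT_VERB_AGREEMENT_3SG"
PATTERN = re.compile(r"\b(he|she|it)\s+(run|go|walk|eat)\b", re.IGNORECASE)

def check(sentence):
    """Return (rule_id, matched_text) for each violation, else []."""
    return [(RULE_ID, m.group(0)) for m in PATTERN.finditer(sentence)]

print(check("He run every morning."))   # flagged, with the exact rule id
print(check("He runs every morning."))  # clean
```

Every flag traces back to one named rule and one matched span, which is what makes rule-based checkers trivially debuggable.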
#### 5. Privacy (Air-Gapped)
You can run this hub entirely locally (including the optional AtD/ConceptNet APIs), so sensitive documents never have to leave your machine.
## πŸ“ License
The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).
The underlying models and data sources retain their original licenses.
**Note**: This is a simple educational tool and a work in progress. Some results may be inconsistent or faulty.