---
title: Grammify
emoji: ⚡
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: app.py
pinned: false
license: apache-2.0
sdk_version: 1.51.0
---
# Grammify - Intelligent Grammar Correction System

## AI-Powered Grammar Error Detection Using Transformer Models

<div align="center">

![Python](https://img.shields.io/badge/Python-3.8+-blue?logo=python)
![Streamlit](https://img.shields.io/badge/Streamlit-1.51.0-FF4B4B?logo=streamlit)
![FastAPI](https://img.shields.io/badge/FastAPI-Latest-009688?logo=fastapi)
![Transformers](https://img.shields.io/badge/🤗%20Transformers-Seq2Seq-yellow)

[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Abdullahrasheed45/Grammify)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model](https://img.shields.io/badge/Model-T5--based-orange)]()

**NLP Application Project** | Deployed on Hugging Face Spaces

</div>

---

> An intelligent grammar correction application leveraging state-of-the-art Seq2Seq transformer models to detect and correct grammatical errors with real-time visual feedback and detailed linguistic error analysis.

## Overview

Grammify implements an advanced grammar correction system designed to enhance written communication across professional, academic, and personal contexts. Built on the Gramformer library and powered by a custom T5-based model, the system processes natural language input through a transformer architecture to identify and correct diverse grammatical errors with high accuracy and contextual awareness.

**Technical Context:** Full-stack NLP application integrating FastAPI microservices, a Streamlit frontend, and Hugging Face Transformers for production-grade grammar correction.

---

## Key Features

### Transformer-Based Architecture
- **Seq2Seq Deep Learning:** T5-based encoder-decoder architecture treats grammatical correction as sequence-to-sequence translation
- **Production Deployment:** FastAPI inference server with uvicorn workers for concurrent request handling
- **Real-time Processing:** ~2-3 second inference latency per sentence

### Grammar Error Coverage
The system corrects 15+ grammatical error types with high linguistic precision:

| Error Type | Description | Example Correction |
|------------|-------------|--------------------|
| Subject-Verb Agreement | Verb conjugation matching the subject | "Matt like fish" → "Matt likes fish" |
| Verb Tense Consistency | Temporal coherence in narratives | "I walk to the store and I bought milk" → "I walked to the store and bought milk" |
| Article Usage | Determiner selection (a/an/the) | Missing or incorrect articles |
| Pronoun Errors | Possessive vs. contraction | "They're house" → "Their house" |
| Preposition Selection | Contextual preposition choice | "Feel free reach out" → "Feel free to reach out" |
| Word Form | Part-of-speech corrections | "Life is shortest" → "Life is short" |
| Auxiliary Verbs | Modal and helping verb errors | "what be the reason" → "what is the reason" |
| Gerund/Infinitive | Verb form following verbs | "everyone leave" → "everyone leaving" |
| Pronoun Case | Subject/object pronoun usage | "How is you?" → "How are you?" |
| Punctuation | Apostrophes, commas, periods | "Its going to rain" → "It's going to rain" |

### Interactive Visualization
- **Color-Coded Annotations:** Visual highlighting distinguishes error types
  - **Red (Deletion):** Words/characters to remove
  - **Green (Addition):** Missing words/characters
  - **Yellow (Change):** Word replacements or modifications
- **Detailed Edit Tables:** Structured breakdown of each grammatical correction with token positions
- **Linguistic Error Classification:** ERRANT-based error type identification (morphology, syntax, orthography)
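
As a sketch of how an edit operation could map to a highlight color (the function and the exact color values are illustrative, not the app's actual code):

```python
# Map one ERRANT-style edit to a highlight color for annotated display.
# An empty corrected span means a deletion; an empty original span means
# an addition; otherwise the edit is a change.

def highlight_color(original_span: str, corrected_span: str) -> str:
    """Return the annotation color for one edit."""
    if corrected_span == "":
        return "red"      # deletion: remove the original words
    if original_span == "":
        return "green"    # addition: insert the missing words
    return "yellow"       # change: replace the original words

print(highlight_color("like", "likes"))  # yellow
```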

---

## System Performance

### Model Specifications
```
Model Architecture:  T5-based Seq2Seq Transformer
Model Tag:           Custom fine-tuned model
Tokenizer:           AutoTokenizer (SentencePiece)
Maximum Sequence:    128 tokens
Sampling Strategy:   Top-k (50) + Top-p (0.95)
Temperature:         1.0 (diverse generation)
Device:              CPU (GPU compatible)
Inference Latency:   ~2-3 seconds per sentence
Model Size:          ~220MB (full precision)
```

### Generation Parameters
```
Generation Configuration:
├── do_sample: True              # Stochastic sampling enabled
├── max_length: 128              # Maximum output tokens
├── top_k: 50                    # Top-k sampling threshold
├── top_p: 0.95                  # Nucleus sampling probability
├── early_stopping: True         # Stop at first EOS token
└── num_return_sequences: 1      # Single best candidate
```
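
To make the sampling parameters concrete, here is a minimal pure-Python sketch of how top-k and top-p (nucleus) filtering narrow a next-token distribution before sampling. The toy distribution and function are illustrative; in the app this is handled internally by `model.generate`.

```python
# Illustrative top-k + top-p filtering over a toy next-token distribution.
# Keep the k most probable tokens, then keep the smallest prefix of them
# whose cumulative probability reaches p; renormalize what survives.

def filter_top_k_top_p(probs: dict, k: int = 50, p: float = 0.95) -> dict:
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:          # nucleus reached: stop adding tokens
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

toy = {"is": 0.60, "be": 0.25, "was": 0.10, "are": 0.05}
print(filter_top_k_top_p(toy, k=3, p=0.90))
```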

### System Architecture Performance
| Component | Performance Metric |
|-----------|--------------------|
| FastAPI Server | Multi-worker uvicorn deployment |
| Startup Time | ~15-20 seconds (model loading) |
| Concurrent Requests | Handles 2+ simultaneous corrections |
| Port Configuration | 8080 (inference server) |
| Health Check | Socket-based port availability monitoring |
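
The socket-based health check can be sketched as follows (the function name and timeout are illustrative; the app's actual check may differ):

```python
# Probe whether the inference server's port accepts TCP connections.
import socket

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Poll until the FastAPI server on port 8080 is ready to serve requests.
if port_is_open("127.0.0.1", 8080):
    print("inference server is up")
```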

---

## Technical Architecture

### Seq2Seq Transformer Pipeline

```
Input Text: "what be the reason for everyone leave the company"
        ↓
Preprocessing: Add task prefix → "gec: what be the reason..."
        ↓
Tokenization: SentencePiece encoding → Token IDs
        ↓
T5 Encoder: Contextualized embeddings (512 dimensions)
        ↓
T5 Decoder: Autoregressive token-by-token generation
        ↓
Sampling: Top-k (50) + Top-p (0.95) filtering
        ↓
Detokenization: Token IDs → "what is the reason for everyone leaving the company"
        ↓
Post-processing: Remove special tokens, strip whitespace
        ↓
Output: Corrected sentence + confidence score
```

**Key Technical Design:**
- **Task Prefix:** `"gec: "` signals the grammar error correction task to the T5 model
- **Encoder-Decoder:** Bidirectional attention in the encoder, causal attention in the decoder
- **Sampling Strategy:** Balances diversity (top-p) and quality (top-k) for natural corrections
- **Early Stopping:** Terminates generation at the first end-of-sequence token for efficiency

### Error Analysis Pipeline

```
Original Sentence  → spaCy Tokenization
        ↓
Corrected Sentence → spaCy Tokenization
        ↓
ERRANT Alignment
        ↓
Edit Extraction & Classification
        ↓
┌──────────────┬──────────────┐
│  Highlights  │  Edit Table  │
│   (Visual)   │  (Tabular)   │
└──────────────┴──────────────┘
```

**ERRANT Framework Integration:**
- **Parse Trees:** spaCy dependency parsing for syntactic structure
- **Token Alignment:** Levenshtein-based sequence alignment
- **Edit Operations:** Insertions, deletions, substitutions, and transpositions
- **Linguistic Classification:** Maps edits to an error taxonomy (VERB:TENSE, DET, PREP, etc.)
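
The alignment step can be illustrated with Python's standard-library `difflib`, which yields the same kinds of edit operations. This is a simplified stand-in: ERRANT's actual aligner uses a linguistically weighted Levenshtein algorithm over spaCy tokens.

```python
# Extract insert/delete/replace operations between token sequences,
# approximating the alignment stage of the error-analysis pipeline.
from difflib import SequenceMatcher

def extract_edits(original: str, corrected: str) -> list:
    orig_tokens = original.split()
    corr_tokens = corrected.split()
    matcher = SequenceMatcher(a=orig_tokens, b=corr_tokens)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only spans that actually changed
            edits.append((op, " ".join(orig_tokens[i1:i2]),
                          " ".join(corr_tokens[j1:j2])))
    return edits

print(extract_edits("Matt like fish", "Matt likes fish"))
# → [('replace', 'like', 'likes')]
```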

### System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Streamlit Frontend                      │
│  • Interactive text input interface                         │
│  • Pre-loaded example selector                              │
│  • Visual error highlighting display                        │
│  • Expandable edit table components                         │
└─────────────────┬───────────────────────────────────────────┘
                  │ HTTP POST
┌─────────────────▼───────────────────────────────────────────┐
│                  FastAPI Inference Server                   │
│  • uvicorn ASGI server (port 8080)                          │
│  • Multi-worker request handling                            │
│  • Health check and monitoring                              │
└─────────────────┬───────────────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────────────┐
│                 Grammar Correction Engine                   │
│   ┌─────────────────────┬─────────────────────┐             │
│   │   T5 Transformer    │   ERRANT Analyzer   │             │
│   │  • Tokenization     │  • spaCy NLP        │             │
│   │  • Seq2Seq Gen      │  • Error taxonomy   │             │
│   └─────────────────────┴─────────────────────┘             │
└─────────────────────────────────────────────────────────────┘
```

**Microservices Design:**
- **Frontend Layer (Streamlit):** User interaction and visualization
- **API Layer (FastAPI):** Stateless request processing
- **Model Layer (Transformers):** Core correction logic
- **Analysis Layer (ERRANT):** Linguistic error identification

---

## Installation

### Prerequisites
```
Python 3.8+
4GB RAM minimum
Internet connection (initial model download)
```

### Backend Setup (FastAPI + Transformers)

```bash
# Clone repository
git clone https://huggingface.co/spaces/Abdullahrasheed45/Grammify
cd Grammify

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install Python dependencies
pip install -r requirements.txt

# Download spaCy language model
python -m spacy download en_core_web_sm

# Start the Streamlit application; the FastAPI inference server
# launches automatically on first run at http://0.0.0.0:8080
streamlit run app.py
# Application available at http://localhost:8501
```

### Docker Deployment (Optional)

```bash
# Build Docker image
docker build -t grammify:latest .

# Run container
docker run -p 8501:8501 -p 8080:8080 grammify:latest
```

### Hugging Face Spaces Deployment

Configure the Space metadata at the top of `README.md`:

```yaml
---
title: Grammify
emoji: ⚡
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: app.py
pinned: false
license: apache-2.0
sdk_version: 1.51.0
---
```

Then push to the Hugging Face Hub:

```bash
git push https://huggingface.co/spaces/YOUR_USERNAME/Grammify main
```

---

## Usage

### Interactive Web Application

The system provides a Streamlit-based interface with the following workflow:

**Basic Correction:**
1. **Choose Example** - Select from 14 pre-loaded grammatical error examples
2. **Custom Input** - Enter your own sentence in the text input field
3. **Automatic Processing** - Correction triggers on non-empty input
4. **View Results** - Corrected text displayed in a success banner
5. **Analyze Errors** - Expand "Show highlights" for color-coded annotations
6. **Inspect Edits** - Expand "Show edits" for a detailed error breakdown

**Example Workflow:**

```
# Input
"Matt like fish"

# Output (Success Banner)
"Matt likes fish"

# Highlights (Expandable)
Matt [like → likes (VERB:SVA)] fish

# Edit Table (Expandable)
| Type     | Original | Pos | Corrected | Pos |
|----------|----------|-----|-----------|-----|
| VERB:SVA | like     | 1-2 | likes     | 1-2 |
```

### API Integration

For programmatic access, use the FastAPI endpoint:

```python
import requests

# Make correction request (server listens on port 8080)
response = requests.get(
    "http://localhost:8080/correct",
    params={"input_sentence": "They're house is on fire"},
)

# Parse response
result = response.json()
corrected_text = result["scored_corrected_sentence"][0]
confidence = result["scored_corrected_sentence"][1]

print(f"Corrected: {corrected_text}")
# Output: "Corrected: Their house is on fire"
```

### Python Library Integration

```python
# Direct model usage (without server)
from gramformer import Gramformer

# Initialize the correction model (models=1 selects the corrector)
gf = Gramformer(models=1, use_gpu=False)

# Correct sentence
corrections = gf.correct(
    "Feel free reach out to me",
    max_candidates=1,
)

for corrected in corrections:
    print(corrected)
# Output: "Feel free to reach out to me"
```

---

## Technical Implementation

### File Structure
```
Grammify/
├── app.py               # Main Streamlit application
├── InferenceServer.py   # FastAPI inference server
├── requirements.txt     # Python dependencies
├── .gitattributes       # Git LFS configuration
└── README.md            # This documentation
```

### Core Dependencies

**requirements.txt Analysis:**

```
# NLP & Deep Learning
transformers         # Hugging Face model hub
torch                # PyTorch backend
sentencepiece        # Tokenization

# Web Frameworks
streamlit            # Interactive frontend
fastapi              # API server
uvicorn              # ASGI server

# Grammar Analysis
spacy                # Linguistic processing
errant               # Error annotation toolkit
nltk>=3.6            # Natural language toolkit

# Utilities
st-annotated-text    # Visual highlighting
bs4                  # HTML parsing for annotations
pandas               # Edit table generation
protobuf>=3.19.0     # Model serialization
requests             # HTTP client
```

### Key Code Components

#### 1. InferenceServer.py - Core Correction Logic

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Model initialization
correction_model_tag = "custom_grammar_model"
correction_tokenizer = AutoTokenizer.from_pretrained(correction_model_tag)
correction_model = AutoModelForSeq2SeqLM.from_pretrained(correction_model_tag)

# Correction function
def correct(input_sentence, max_candidates=1):
    correction_prefix = "gec: "
    input_sentence = correction_prefix + input_sentence
    input_ids = correction_tokenizer.encode(input_sentence, return_tensors='pt')

    preds = correction_model.generate(
        input_ids,
        do_sample=True,
        max_length=128,
        top_k=50,
        top_p=0.95,
        early_stopping=True,
        num_return_sequences=max_candidates,
    )

    # Deduplicate candidates (sampling can repeat outputs)
    corrected = set()
    for pred in preds:
        corrected.add(correction_tokenizer.decode(pred, skip_special_tokens=True).strip())

    # Sets are unordered and not indexable, so take an arbitrary element
    return (next(iter(corrected)), 0)  # Corrected sentence, dummy confidence
```

#### 2. app.py - Error Analysis Pipeline

```python
# ERRANT-based edit extraction
import errant
import spacy

# Initialize annotator
nlp = spacy.load("en_core_web_sm")
annotator = errant.load('en', nlp)

# Extract edits
orig = annotator.parse("Matt like fish")
cor = annotator.parse("Matt likes fish")
edits = annotator.annotate(orig, cor)

# Generate visual highlights and edit tables
for edit in edits:
    print(f"{edit.o_str} → {edit.c_str} ({edit.type})")
```
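
The edit table shown in the UI can be assembled from extracted edits with pandas; a minimal sketch with illustrative edit tuples (not the app's exact code):

```python
# Build the expandable edit table from (type, original, corrected,
# original-position, corrected-position) tuples produced upstream.
import pandas as pd

edits = [("VERB:SVA", "like", "likes", "1-2", "1-2")]  # illustrative data

edit_table = pd.DataFrame(
    edits,
    columns=["Type", "Original", "Corrected", "Orig Pos", "Corr Pos"],
)
print(edit_table.to_string(index=False))
```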

---

## Applications

### Professional Writing
- Email composition and review
- Business document proofreading
- Report and proposal refinement
- Professional communication enhancement

### Academic Support
- Essay and paper proofreading
- Research document editing
- Thesis and dissertation review
- Assignment quality improvement

### Content Creation
- Blog post editing
- Social media content refinement
- Marketing copy correction
- Documentation writing assistance

### Language Learning
- Grammar error identification for ESL students
- Writing practice feedback
- Language proficiency development
- Real-time correction for learners

---

## Limitations

The system has several constraints and areas for future improvement:

1. **Context Window:** Limited to 128 tokens per sentence; longer texts require segmentation
2. **Domain Specificity:** Trained primarily on general English; may underperform on highly technical or specialized vocabulary
3. **Stylistic Preservation:** Focuses on grammatical correctness rather than maintaining authorial voice or stylistic choices
4. **Confidence Scoring:** Current implementation provides binary correction without probabilistic confidence metrics
5. **Multi-Sentence Context:** Processes sentences independently; may miss inter-sentence coherence issues
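
One way to work around the 128-token window is to segment longer text into sentences and correct each independently. This is a simplified sketch: the regex splitter and the `correct` stub are illustrative stand-ins for spaCy sentence segmentation and the actual inference call.

```python
# Segment long input into sentences, correct each within the 128-token
# window, and rejoin the results.
import re

def correct(sentence: str) -> str:
    """Stub for the inference call; echoes its input here."""
    return sentence

def correct_long_text(text: str) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(correct(s) for s in sentences if s)

print(correct_long_text("I walk to the store. I bought milk."))
```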

---

## Future Directions

### Technical Enhancements
- Integration of larger T5 models (T5-large, T5-3B) for improved accuracy
- Multi-sentence context processing for discourse-level corrections
- Confidence score implementation using model perplexity
- GPU acceleration for faster inference
- Batch processing API for document-level corrections

### Feature Expansion
- Style-aware corrections (formal vs. informal)
- Domain-specific fine-tuning (legal, medical, technical writing)
- Multi-language support beyond English
- Browser extension for real-time writing assistance
- Mobile application development

### Model Optimization
- Knowledge distillation for a smaller deployment footprint
- Quantization-aware training for edge deployment
- Adaptive inference based on error density
- Custom fine-tuning on user-specific writing patterns

---

## Contributing

Contributions are welcome in the following areas:

**Technical Development:**
- Model architecture improvements and optimization
- Additional error type coverage and linguistic analysis
- Performance benchmarking and optimization
- Cross-platform deployment (mobile, browser extensions)

**Dataset Contributions:**
- Domain-specific grammar error corpora
- Multi-language grammar correction datasets
- Stylistic variation examples
- Real-world writing samples for evaluation

**Documentation:**
- Tutorial content and usage examples
- API documentation expansion
- Multi-language documentation
- Educational resources for grammar learning

---

## Acknowledgments

This project leverages several open-source tools and resources:

- **Gramformer Library** for the foundational grammar correction framework
- **Hugging Face Transformers** for model infrastructure and deployment
- **ERRANT Toolkit** (Bryant et al.) for error annotation and classification
- **spaCy Team** for linguistic processing capabilities
- **T5 Model Authors** (Google Research) for the transformer architecture
- **Hugging Face Spaces** for hosting and deployment infrastructure

---

## License

This project is released under the Apache License 2.0. See the LICENSE file for details.

---

## Contact

**Developer:** Muhammad Abdullah Rasheed

[![Website](https://img.shields.io/badge/Website-techvibes360.com-blue?logo=google-chrome)](https://techvibes360.com)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-0077B5?logo=linkedin)](https://www.linkedin.com/in/abdullahrasheed-/)
[![Email](https://img.shields.io/badge/Email-Contact-red?logo=gmail)](mailto:abdullahrasheed45@gmail.com)
[![Hugging Face](https://img.shields.io/badge/🤗-Profile-yellow)](https://huggingface.co/Abdullahrasheed45)

For technical questions, collaboration opportunities, or NLP application discussions, please reach out via the channels above.

---

<div align="center">

**Enhancing written communication through accessible AI technology**

*"Clear communication begins with correct grammar"*

</div>

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference