---
title: Grammify
emoji: ⚡
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: app.py
pinned: false
license: apache-2.0
sdk_version: 1.51.0
---
# Grammify - Intelligent Grammar Correction System
## AI-Powered Grammar Error Detection Using Transformer Models
<div align="center">
![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)
![FastAPI](https://img.shields.io/badge/FastAPI-009688?style=for-the-badge&logo=FastAPI&logoColor=white)
![Streamlit](https://img.shields.io/badge/Streamlit-FF4B4B?style=for-the-badge&logo=Streamlit&logoColor=white)
![Transformers](https://img.shields.io/badge/🤗%20Transformers-FFD21E?style=for-the-badge)
[![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-yellow?style=for-the-badge)](https://huggingface.co/spaces/Abdullahrasheed45/Grammify)
[![Apache License](https://img.shields.io/badge/License-Apache%202.0-blue.svg?style=for-the-badge)](https://opensource.org/licenses/Apache-2.0)
[![Model](https://img.shields.io/badge/Model-T5--Based-orange?style=for-the-badge)]()
**NLP Application Project** | Deployed on Hugging Face Spaces
</div>
---
> An intelligent grammar correction application leveraging state-of-the-art Seq2Seq transformer models to detect and correct grammatical errors with real-time visual feedback and detailed linguistic error analysis.
## Overview
Grammify implements an advanced grammar correction system designed to enhance written communication across professional, academic, and personal contexts. Built on the Gramformer library and powered by a custom T5-based model, the system processes natural language input through a transformer architecture to identify and correct diverse grammatical errors with high accuracy and contextual awareness.
**Technical Context:** Full-stack NLP application integrating FastAPI microservices, Streamlit frontend, and Hugging Face Transformers for production-grade grammar correction.
---
## Key Features
### Transformer-Based Architecture
- **Seq2Seq Deep Learning:** T5-based encoder-decoder architecture processes grammatical correction as sequence-to-sequence translation
- **Production Deployment:** FastAPI inference server with uvicorn workers for concurrent request handling
- **Real-time Processing:** ~2-3 second inference latency per sentence
### Grammar Error Coverage
The system targets 15+ grammatical error types. Representative examples:
| Error Type | Description | Example Correction |
|------------|-------------|-------------------|
| Subject-Verb Agreement | Verb conjugation matching subject | "Matt like fish" → "Matt likes fish" |
| Verb Tense Consistency | Temporal coherence in narratives | "I walk to the store and I bought milk" → "I walked to the store and bought milk" |
| Article Usage | Determiner selection (a/an/the) | Missing or incorrect articles |
| Pronoun Errors | Possessive vs. contraction | "They're house" → "Their house" |
| Preposition Selection | Contextual preposition choice | "Feel free reach out" → "Feel free to reach out" |
| Word Form | Part-of-speech corrections | "Life is shortest" → "Life is short" |
| Auxiliary Verbs | Modal and helping verb errors | "what be the reason" → "what is the reason" |
| Gerund/Infinitive | Verb form following verbs | "everyone leave" → "everyone leaving" |
| Pronoun-Verb Agreement | Verb form matching a pronoun subject | "How is you?" → "How are you?" |
| Punctuation | Apostrophes, commas, periods | "Its going to rain" → "It's going to rain" |
### Interactive Visualization
- **Color-Coded Annotations:** Visual highlighting system distinguishes error types
- **Red (Deletion):** Words/characters to remove
- **Green (Addition):** Missing words/characters
- **Yellow (Change):** Word replacements or modifications
- **Detailed Edit Tables:** Structured breakdown of each grammatical correction with token positions
- **Linguistic Error Classification:** ERRANT-based error type identification (morphology, syntax, orthography)
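
The color scheme above can be sketched with the standard library alone. This is only an illustration: it substitutes `difflib.SequenceMatcher` for the ERRANT alignment the app relies on, and the `highlight` function and `COLORS` mapping are hypothetical names, not taken from `app.py`.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Assumed mapping from difflib opcode tags to the README's color scheme.
COLORS = {"delete": "red", "insert": "green", "replace": "yellow"}

def highlight(original: str, corrected: str) -> List[Tuple[str, str, str]]:
    """Return (color, original_span, corrected_span) for each token-level edit."""
    src, tgt = original.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged tokens are not highlighted
        edits.append((COLORS[tag], " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits

print(highlight("Matt like fish", "Matt likes fish"))
# [('yellow', 'like', 'likes')]
```

A replacement maps to yellow, a pure insertion to green, and a pure deletion to red; ERRANT additionally attaches a linguistic type to each edit, which this sketch does not attempt.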
---
## System Performance
### Model Specifications
```
Model Architecture: T5-based Seq2Seq Transformer
Model Tag: Custom fine-tuned model
Tokenizer: AutoTokenizer (SentencePiece)
Maximum Sequence: 128 tokens
Sampling Strategy: Top-k (50) + Top-p (0.95)
Temperature: 1.0 (diverse generation)
Device: CPU (GPU compatible)
Inference Latency: ~2-3 seconds per sentence
Model Size: ~220MB (full precision)
```
### Generation Parameters
```python
# Keyword arguments passed to model.generate()
generation_config = {
    "do_sample": True,           # Stochastic sampling enabled
    "max_length": 128,           # Maximum output tokens
    "top_k": 50,                 # Top-k sampling threshold
    "top_p": 0.95,               # Nucleus sampling probability
    "early_stopping": True,      # Beam-search flag; no effect with a single beam
    "num_return_sequences": 1,   # One sampled candidate per input
}
```
### System Architecture Performance
| Component | Performance Metric |
|-----------|-------------------|
| FastAPI Server | Multi-worker uvicorn deployment |
| Startup Time | ~15-20 seconds (model loading) |
| Concurrent Requests | Handles 2+ simultaneous corrections |
| Port Configuration | 8080 (inference server) |
| Health Check | Socket-based port availability monitoring |
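
A socket-based port check like the one listed in the table can be written in a few lines. The sketch below is an assumption about how such a check might look, not the code in `app.py`; the function name `port_is_open` and the default host are illustrative.

```python
import socket

def port_is_open(host: str = "127.0.0.1", port: int = 8080, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0

print("inference server reachable:", port_is_open())
```

A frontend could poll this during the 15-20 second model-loading window and only send requests once it returns `True`.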
---
## Technical Architecture
### Seq2Seq Transformer Pipeline
```
Input Text:      "what be the reason for everyone leave the company"
Preprocessing:   Add task prefix → "gec: what be the reason..."
Tokenization:    SentencePiece encoding → Token IDs
T5 Encoder:      Contextualized embeddings (512 dimensions)
T5 Decoder:      Autoregressive generation with stochastic sampling
Sampling:        Top-k (50) + Top-p (0.95) filtering
Detokenization:  Token IDs → "what is the reason for everyone leaving the company"
Post-processing: Remove special tokens, strip whitespace
Output:          Corrected sentence + placeholder score (currently 0)
```
**Key Technical Design:**
- **Task Prefix:** `"gec: "` signals grammar error correction task to T5 model
- **Encoder-Decoder:** Bidirectional attention in encoder, causal attention in decoder
- **Sampling Strategy:** Balances diversity (top-p) and quality (top-k) for natural corrections
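- **Early Stopping:** `early_stopping=True` is passed to `generate`, but in Hugging Face Transformers this flag governs beam-search termination; with pure sampling, generation ends at the end-of-sequence token or at `max_length`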
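
To make the sampling strategy concrete, here is a small pure-Python sketch of top-k followed by top-p (nucleus) filtering over a toy next-token distribution. The function name and the toy probabilities are assumptions for illustration; the real filtering happens inside `generate` on model logits.

```python
import random
from typing import Dict

def top_k_top_p_filter(probs: Dict[str, float], top_k: int = 50,
                       top_p: float = 0.95) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative (renormalized) probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}

# Toy distribution over candidate next tokens (made-up numbers)
toy = {"is": 0.60, "was": 0.25, "be": 0.10, "are": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.85)
print(filtered)  # only "is" and "was" survive, renormalized to sum to 1
print(random.choices(list(filtered), weights=list(filtered.values()))[0])
```

Because the final token is drawn at random from the filtered set, repeated calls on the same input can return different corrections.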
### Error Analysis Pipeline
```
Original Sentence  → spaCy Tokenization
Corrected Sentence → spaCy Tokenization
        ↓
ERRANT Alignment
        ↓
Edit Extraction & Classification
        ↓
┌──────────────┬──────────────┐
│ Highlights   │ Edit Table   │
│ (Visual)     │ (Tabular)    │
└──────────────┴──────────────┘
```
**ERRANT Framework Integration:**
- **Parse Trees:** spaCy dependency parsing for syntactic structure
- **Token Alignment:** Levenshtein-based sequence alignment
- **Edit Operations:** Insertions, deletions, substitutions, and transpositions
- **Linguistic Classification:** Maps edits to error taxonomy (VERB:TENSE, DET, PREP, etc.)
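
The alignment step can be illustrated with a plain token-level Levenshtein alignment. This is a simplified sketch: ERRANT's own aligner is more elaborate, and the `align` function below is a hypothetical helper that handles only insertions, deletions, and substitutions (no transpositions, no linguistic classification).

```python
from typing import List, Tuple

def align(src: List[str], tgt: List[str]) -> List[Tuple[str, str, str]]:
    """Token-level Levenshtein alignment returning (op, source, target) triples."""
    n, m = len(src), len(tgt)
    # dist[i][j] = minimum number of edits turning src[:i] into tgt[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match / substitution
    # Walk back from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            op = "match" if src[i - 1] == tgt[j - 1] else "substitute"
            ops.append((op, src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", src[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", tgt[j - 1]))
            j -= 1
    return ops[::-1]

original = "what be the reason for everyone leave the company".split()
corrected = "what is the reason for everyone leaving the company".split()
print([op for op in align(original, corrected) if op[0] != "match"])
# [('substitute', 'be', 'is'), ('substitute', 'leave', 'leaving')]
```

Each non-match triple corresponds to one row of the edit table; ERRANT then assigns a type such as `VERB:TENSE` or `DET` to it.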
### System Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Streamlit Frontend │
│ • Interactive text input interface │
│ • Pre-loaded example selector │
│ • Visual error highlighting display │
│ • Expandable edit table components │
└─────────────────┬───────────────────────────────────────────┘
│ HTTP POST
┌─────────────────▼───────────────────────────────────────────┐
│ FastAPI Inference Server │
│ • uvicorn ASGI server (port 8080) │
│ • Multi-worker request handling │
│ • Health check and monitoring │
└─────────────────┬───────────────────────────────────────────┘
┌─────────────────▼───────────────────────────────────────────┐
│ Grammar Correction Engine │
│ ┌─────────────────────┬─────────────────────┐ │
│ │ T5 Transformer │ ERRANT Analyzer │ │
│ │ • Tokenization │ • spaCy NLP │ │
│ │ • Seq2Seq Gen │ • Error taxonomy │ │
│ └─────────────────────┴─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
**Microservices Design:**
- **Frontend Layer (Streamlit):** User interaction and visualization
- **API Layer (FastAPI):** Stateless request processing
- **Model Layer (Transformers):** Core correction logic
- **Analysis Layer (ERRANT):** Linguistic error identification
---
## Installation
### Prerequisites
```
Python 3.8+
4GB RAM minimum
Internet connection (initial model download)
```
### Backend Setup (FastAPI + Transformers)
```bash
# Clone repository
git clone https://huggingface.co/spaces/Abdullahrasheed45/Grammify
cd Grammify
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install Python dependencies
pip install -r requirements.txt
# Download spaCy language model
python -m spacy download en_core_web_sm
# Start FastAPI inference server (automatic on first run)
# Server launches at http://0.0.0.0:8080
# Start Streamlit application
streamlit run app.py
# Application available at http://localhost:8501
```
### Docker Deployment (Optional)
```bash
# Build Docker image
docker build -t grammify:latest .
# Run container
docker run -p 8501:8501 -p 8080:8080 grammify:latest
```
### Hugging Face Spaces Deployment
Configure the Space metadata at the top of `README.md`:

```yaml
---
title: Grammify
emoji: ⚡
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: app.py
pinned: false
license: apache-2.0
sdk_version: 1.51.0
---
```

Push to the Hugging Face Hub:

```bash
git push https://huggingface.co/spaces/YOUR_USERNAME/Grammify main
```
---
## Usage
### Interactive Web Application
The system provides a Streamlit-based interface with the following workflow:
**Basic Correction:**
1. **Choose Example** - Select from 14 pre-loaded grammatical error examples
2. **Custom Input** - Enter your own sentence in the text input field
3. **Automatic Processing** - Correction triggers on non-empty input
4. **View Results** - Corrected text displayed in success banner
5. **Analyze Errors** - Expand "Show highlights" for color-coded annotations
6. **Inspect Edits** - Expand "Show edits" for detailed error breakdown
**Example Workflow:**
```
# Input
"Matt like fish"
# Output (Success Banner)
"Matt likes fish"
# Highlights (Expandable)
Matt [like → likes (VERB:SVA)] fish
# Edit Table (Expandable)
| Type | Original | Pos | Corrected | Pos |
|------|----------|-----|-----------|-----|
| VERB:SVA | like | 1-2 | likes | 1-2 |
```
### API Integration
For programmatic access, use the FastAPI endpoint:
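```python
import requests

# Make correction request (use the loopback address to reach a local server;
# 0.0.0.0 is the address the server binds to, not one clients should call)
response = requests.get(
    "http://127.0.0.1:8080/correct",
    params={"input_sentence": "They're house is on fire"},
    timeout=30,
)
response.raise_for_status()

# Parse response
result = response.json()
corrected_text = result["scored_corrected_sentence"][0]
confidence = result["scored_corrected_sentence"][1]  # placeholder, currently 0
print(f"Corrected: {corrected_text}")
# Example output (sampling makes results vary): Corrected: Their house is on fire
```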
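
When consuming the endpoint from other code, it can help to validate the response shape before indexing into it. The helper below is a hypothetical sketch that assumes the body is a JSON object whose `scored_corrected_sentence` field is a two-element `[text, score]` pair, as in the example above; the sample `body` string is made up for demonstration.

```python
import json
from typing import Tuple

def parse_correction(payload: str) -> Tuple[str, float]:
    """Extract (corrected_text, score) from the JSON body of /correct."""
    data = json.loads(payload)
    pair = data.get("scored_corrected_sentence") if isinstance(data, dict) else None
    if not isinstance(pair, (list, tuple)) or len(pair) != 2:
        raise ValueError("unexpected response shape: %r" % (data,))
    text, score = pair
    return str(text), float(score)

# Hypothetical response body, shaped like the example above
body = '{"scored_corrected_sentence": ["Their house is on fire", 0]}'
print(parse_correction(body))
# ('Their house is on fire', 0.0)
```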
### Python Library Integration
```python
# Direct model usage (without server)
from gramformer import Gramformer

# Initialize model
gf = Gramformer(models=1, use_gpu=False)

# Correct sentence
corrections = gf.correct(
    "Feel free reach out to me",
    max_candidates=1
)
for corrected in corrections:
    print(corrected)
# Example output: Feel free to reach out to me
```
---
## Technical Implementation
### File Structure
```
Grammify/
├── app.py # Main Streamlit application
├── InferenceServer.py # FastAPI inference server
├── requirements.txt # Python dependencies
├── .gitattributes # Git LFS configuration
└── README.md # This documentation
```
### Core Dependencies
**requirements.txt Analysis:**
```
# NLP & Deep Learning
transformers         # Hugging Face model library
torch                # PyTorch backend
sentencepiece        # Tokenization

# Web Frameworks
streamlit            # Interactive frontend
fastapi              # API server
uvicorn              # ASGI server

# Grammar Analysis
spacy                # Linguistic processing
errant               # Error annotation toolkit
nltk>=3.6            # Natural language toolkit

# Utilities
st-annotated-text    # Visual highlighting
bs4                  # HTML parsing for annotations
pandas               # Edit table generation
protobuf>=3.19.0     # Model serialization
requests             # HTTP client
```
### Key Code Components
#### 1. InferenceServer.py - Core Correction Logic
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Model initialization
correction_model_tag = "custom_grammar_model"
correction_tokenizer = AutoTokenizer.from_pretrained(correction_model_tag)
correction_model = AutoModelForSeq2SeqLM.from_pretrained(correction_model_tag)

# Correction function
def correct(input_sentence, max_candidates=1):
    correction_prefix = "gec: "
    input_sentence = correction_prefix + input_sentence
    input_ids = correction_tokenizer.encode(input_sentence, return_tensors="pt")
    preds = correction_model.generate(
        input_ids,
        do_sample=True,
        max_length=128,
        top_k=50,
        top_p=0.95,
        early_stopping=True,
        num_return_sequences=max_candidates,
    )
    # Collect unique candidates in generation order (a set cannot be indexed)
    corrected = []
    for pred in preds:
        text = correction_tokenizer.decode(pred, skip_special_tokens=True).strip()
        if text not in corrected:
            corrected.append(text)
    return (corrected[0], 0)  # Corrected sentence, dummy confidence
```
#### 2. app.py - Error Analysis Pipeline
```python
# ERRANT-based edit extraction
import errant
import spacy

# Initialize annotator
nlp = spacy.load("en_core_web_sm")
annotator = errant.load("en", nlp)

# Extract edits
orig = annotator.parse("Matt like fish")
cor = annotator.parse("Matt likes fish")
edits = annotator.annotate(orig, cor)

# Generate visual highlights and edit tables
for edit in edits:
    print(f"{edit.o_str} → {edit.c_str} ({edit.type})")
```
---
## Applications
### Professional Writing
- Email composition and review
- Business document proofreading
- Report and proposal refinement
- Professional communication enhancement
### Academic Support
- Essay and paper proofreading
- Research document editing
- Thesis and dissertation review
- Assignment quality improvement
### Content Creation
- Blog post editing
- Social media content refinement
- Marketing copy correction
- Documentation writing assistance
### Language Learning
- Grammar error identification for ESL students
- Writing practice feedback
- Language proficiency development
- Real-time correction for learners
---
## Limitations
The system has several constraints and areas for future improvement:
1. **Context Window:** Limited to 128 tokens per sentence; longer texts require segmentation
2. **Domain Specificity:** Trained primarily on general English; may underperform on highly technical or specialized vocabulary
3. **Stylistic Preservation:** Focuses on grammatical correctness rather than maintaining authorial voice or stylistic choices
4. **Confidence Scoring:** The current implementation returns a placeholder score of 0 with each correction rather than a probabilistic confidence metric
5. **Multi-Sentence Context:** Processes sentences independently; may miss inter-sentence coherence issues
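
Given the per-sentence limit, longer inputs need to be segmented before correction. The sketch below shows one naive way to do that; the regex splitter, the function names, and the stand-in `fake_correct` are all assumptions for illustration, and a real pipeline would pass a function that calls the model.

```python
import re
from typing import Callable, List

def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def correct_document(text: str, correct_fn: Callable[[str], str]) -> str:
    """Correct each sentence independently and rejoin with single spaces."""
    return " ".join(correct_fn(s) for s in split_sentences(text))

# Stand-in corrector for demonstration only; a real one would call the model.
def fake_correct(sentence: str) -> str:
    return sentence.replace("Matt like fish", "Matt likes fish")

print(correct_document("Matt like fish. He eats it daily!", fake_correct))
# Matt likes fish. He eats it daily!
```

Because each sentence is corrected in isolation, this approach inherits the inter-sentence limitation described in item 5; abbreviations such as "e.g." would also need a more careful splitter.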
---
## Future Directions
### Technical Enhancements
- Integration of larger T5 models (T5-large, T5-3B) for improved accuracy
- Multi-sentence context processing for discourse-level corrections
- Confidence score implementation using model perplexity
- GPU acceleration for faster inference
- Batch processing API for document-level corrections
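
One way a perplexity-based confidence score could be defined is sketched below. It assumes per-token log-probabilities of the generated correction are available; how they would be obtained from the model is left open, and both function names are hypothetical.

```python
import math
from typing import Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean log-probability of the generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def confidence(token_logprobs: Sequence[float]) -> float:
    """Inverse perplexity, in (0, 1]; equals the geometric mean token probability."""
    return 1.0 / perplexity(token_logprobs)

# Made-up log-probabilities for a four-token correction
logprobs = [math.log(0.9), math.log(0.8), math.log(0.95), math.log(0.7)]
print(round(perplexity(logprobs), 3), round(confidence(logprobs), 3))
```

Inverse perplexity is only one possible mapping; it has the convenient property that a correction where every token had probability 1 scores exactly 1.0.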
### Feature Expansion
- Style-aware corrections (formal vs. informal)
- Domain-specific fine-tuning (legal, medical, technical writing)
- Multi-language support beyond English
- Browser extension for real-time writing assistance
- Mobile application development
### Model Optimization
- Knowledge distillation for smaller deployment footprint
- Quantization-aware training for edge deployment
- Adaptive inference based on error density
- Custom fine-tuning on user-specific writing patterns
---
## Contributing
Contributions are welcome in the following areas:
**Technical Development:**
- Model architecture improvements and optimization
- Additional error type coverage and linguistic analysis
- Performance benchmarking and optimization
- Cross-platform deployment (mobile, browser extensions)
**Dataset Contributions:**
- Domain-specific grammar error corpora
- Multi-language grammar correction datasets
- Stylistic variation examples
- Real-world writing samples for evaluation
**Documentation:**
- Tutorial content and usage examples
- API documentation expansion
- Multi-language documentation
- Educational resources for grammar learning
---
## Acknowledgments
This project leverages several open-source tools and resources:
- **Gramformer Library** for the foundational grammar correction framework
- **Hugging Face Transformers** for model infrastructure and deployment
- **ERRANT Toolkit** (Bryant et al.) for error annotation and classification
- **spaCy Team** for linguistic processing capabilities
- **T5 Model Authors** (Google Research) for the transformer architecture
- **Hugging Face Spaces** for hosting and deployment infrastructure
---
## License
This project is released under the Apache License 2.0. See LICENSE file for details.
---
## Contact
**Developer:** Muhammad Abdullah Rasheed
[![Portfolio](https://img.shields.io/badge/Portfolio-000000?style=for-the-badge&logo=About.me&logoColor=white)](https://techvibes360.com)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/abdullahrasheed-/)
[![Email](https://img.shields.io/badge/Email-D14836?style=for-the-badge&logo=gmail&logoColor=white)](mailto:abdullahrasheed45@gmail.com)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Profile-yellow?style=for-the-badge)](https://huggingface.co/Abdullahrasheed45)
For technical questions, collaboration opportunities, or NLP application discussions, please reach out via the channels above.
---
<div align="center">
**Enhancing written communication through accessible AI technology**
*"Clear communication begins with correct grammar"*
</div>
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference