Metadata-Version: 2.4
Name: indian-address-parser
Version: 2.0.0
Summary: Production-grade Indian address parsing using mBERT-CRF
Author-email: Kushagra <kushagra@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/kushagra/indian-address-parser
Project-URL: Documentation, https://github.com/kushagra/indian-address-parser#readme
Project-URL: Repository, https://github.com/kushagra/indian-address-parser
Project-URL: Issues, https://github.com/kushagra/indian-address-parser/issues
Keywords: nlp,ner,address-parsing,indian-addresses,bert,crf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.14
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.9.1
Requires-Dist: transformers>=4.57.6
Requires-Dist: tokenizers>=0.22.2
Requires-Dist: datasets>=4.5.0
Requires-Dist: seqeval>=1.2.2
Requires-Dist: numpy>=2.4.1
Requires-Dist: pandas>=2.3.3
Requires-Dist: scikit-learn>=1.8.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: pydantic>=2.12.5
Requires-Dist: indic-transliteration>=2.3.75
Requires-Dist: regex>=2026.1.15
Requires-Dist: rapidfuzz>=3.14.3
Provides-Extra: api
Requires-Dist: fastapi>=0.128.0; extra == "api"
Requires-Dist: uvicorn[standard]>=0.40.0; extra == "api"
Requires-Dist: gunicorn>=23.0.0; extra == "api"
Requires-Dist: python-multipart>=0.0.21; extra == "api"
Provides-Extra: demo
Requires-Dist: gradio>=6.3.0; extra == "demo"
Provides-Extra: training
Requires-Dist: accelerate>=1.12.0; extra == "training"
Requires-Dist: wandb>=0.24.0; extra == "training"
Requires-Dist: optuna>=4.7.0; extra == "training"
Provides-Extra: onnx
Requires-Dist: onnx>=1.20.1; python_version < "3.14" and extra == "onnx"
Requires-Dist: onnxruntime>=1.23.2; python_version < "3.14" and extra == "onnx"
Provides-Extra: dev
Requires-Dist: pytest>=9.0.2; extra == "dev"
Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=1.3.0; extra == "dev"
Requires-Dist: black>=26.1.0; extra == "dev"
Requires-Dist: ruff>=0.14.13; extra == "dev"
Requires-Dist: mypy>=1.19.1; extra == "dev"
Requires-Dist: pre-commit>=4.5.1; extra == "dev"
Provides-Extra: all
Requires-Dist: indian-address-parser[api,demo,dev,training]; extra == "all"
Provides-Extra: all-with-onnx
Requires-Dist: indian-address-parser[api,demo,dev,onnx,training]; extra == "all-with-onnx"
# Indian Address Parser
Production-grade NLP system for parsing unstructured Indian addresses into structured components using **mBERT-CRF** (Multilingual BERT with Conditional Random Field).
[![Python 3.14+](https://img.shields.io/badge/python-3.14+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
## Features
- **High Accuracy**: 94%+ F1 score on test data
- **Multilingual**: Supports Hindi (Devanagari) + English
- **Fast Inference**: < 30ms per address with ONNX optimization
- **15 Entity Types**: House Number, Floor, Block, Gali, Colony, Area, Khasra, Pincode, etc.
- **Delhi-specific**: Gazetteer with 100+ localities for improved accuracy
- **Production Ready**: REST API, Docker, Cloud Run deployment
## Demo
- **Interactive Demo**: [HuggingFace Spaces](https://huggingface.co/spaces/kushagra/indian-address-parser)
- **API Endpoint**: `https://indian-address-parser-xyz.run.app/docs`
## Quick Start
### Installation
```bash
pip install indian-address-parser
```
Or from source:
```bash
git clone https://github.com/kushagra/indian-address-parser.git
cd indian-address-parser
pip install -e ".[all]"
```
### Usage
```python
from address_parser import AddressParser
# Load parser (rules-only mode if model not available)
parser = AddressParser.rules_only()
# Or load trained model
# parser = AddressParser.from_pretrained("./models/address_ner")
# Parse address
result = parser.parse(
    "PLOT NO752 FIRST FLOOR, BLOCK H-3 KH NO 24/1/3/2/2/202, "
    "KAUNWAR SINGH NAGAR NEW DELHI, DELHI, 110041"
)
print(f"House Number: {result.house_number}")
print(f"Floor: {result.floor}")
print(f"Block: {result.block}")
print(f"Khasra: {result.khasra}")
print(f"Area: {result.area}")
print(f"Pincode: {result.pincode}")
```
**Output:**
```
House Number: PLOT NO752
Floor: FIRST FLOOR
Block: BLOCK H-3
Khasra: KH NO 24/1/3/2/2/202
Area: KAUNWAR SINGH NAGAR
Pincode: 110041
```
### Entity Types
| Entity | Description | Example |
|--------|-------------|---------|
| `HOUSE_NUMBER` | House/plot number | `H.NO. 123`, `PLOT NO752` |
| `FLOOR` | Floor level | `FIRST FLOOR`, `GF` |
| `BLOCK` | Block identifier | `BLOCK H-3`, `BLK A` |
| `SECTOR` | Sector number | `SECTOR 15` |
| `GALI` | Lane/gali number | `GALI NO. 5` |
| `COLONY` | Colony name | `BABA HARI DAS COLONY` |
| `AREA` | Area/locality | `KAUNWAR SINGH NAGAR` |
| `SUBAREA` | Sub-area | `TIKARI KALA` |
| `KHASRA` | Khasra number | `KH NO 24/1/3/2` |
| `PINCODE` | 6-digit PIN code | `110041` |
| `CITY` | City name | `NEW DELHI` |
| `STATE` | State name | `DELHI` |
## API Usage
### REST API
```bash
# Start API server
uvicorn api.main:app --host 0.0.0.0 --port 8080
# Parse single address
curl -X POST "http://localhost:8080/parse" \
  -H "Content-Type: application/json" \
  -d '{"address": "PLOT NO752 FIRST FLOOR, NEW DELHI, 110041"}'

# Batch parse
curl -X POST "http://localhost:8080/parse/batch" \
  -H "Content-Type: application/json" \
  -d '{"addresses": ["ADDRESS 1", "ADDRESS 2"]}'
```
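
If you prefer Python over curl, the same endpoint can be reached with the standard library alone. This is a minimal sketch assuming the local dev server started above; the response schema is whatever `/parse` returns.

```python
import json
from urllib import request

API_URL = "http://localhost:8080/parse"  # the local dev server started above

def build_parse_request(address: str) -> request.Request:
    """Construct the POST request the /parse endpoint expects."""
    body = json.dumps({"address": address}).encode("utf-8")
    return request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_parse_request("PLOT NO752 FIRST FLOOR, NEW DELHI, 110041")
# with request.urlopen(req) as resp:      # uncomment with the server running
#     print(json.load(resp))
```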
### Python API
```python
from address_parser import AddressParser
parser = AddressParser.from_pretrained("./models/address_ner")
# Single parse with timing
response = parser.parse_with_timing("NEW DELHI 110041")
print(f"Inference time: {response.inference_time_ms:.2f}ms")
# Batch parse
batch_response = parser.parse_batch([
    "PLOT NO 123, DWARKA, 110078",
    "H.NO. 456, LAJPAT NAGAR, 110024",
])
print(f"Average time: {batch_response.avg_inference_time_ms:.2f}ms")
```
## Training
### Data Preparation
Convert existing Label Studio annotations to BIO format:
```bash
python training/convert_data.py
```
This creates:
- `data/processed/train.jsonl`
- `data/processed/val.jsonl`
- `data/processed/test.jsonl`
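
The JSONL files hold one example per line in BIO format. The field names below are illustrative (the real ones depend on `training/convert_data.py`), but the invariant is fixed: exactly one label per token, with `B-`/`I-` prefixes marking entity starts and continuations.

```python
import json

# Illustrative record shape for the BIO-format JSONL files; the actual
# field names come from training/convert_data.py and may differ.
record = {
    "tokens": ["PLOT", "NO752", "NEW", "DELHI", "110041"],
    "labels": ["B-HOUSE_NUMBER", "I-HOUSE_NUMBER", "B-CITY", "I-CITY", "B-PINCODE"],
}

# BIO invariants: one label per token; every label is O or B-/I- prefixed.
assert len(record["tokens"]) == len(record["labels"])
assert all(l == "O" or l[:2] in ("B-", "I-") for l in record["labels"])

print(json.dumps(record))
```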
### Train Model
```bash
python training/train.py \
  --train data/processed/train.jsonl \
  --val data/processed/val.jsonl \
  --output models/address_ner \
  --model bert-base-multilingual-cased \
  --epochs 10 \
  --batch-size 16
```
### Data Augmentation
Augment training data for improved robustness:
```python
from training.augment import AddressAugmenter, augment_dataset
augmenter = AddressAugmenter(
    abbrev_prob=0.3,
    case_prob=0.2,
    typo_prob=0.1,
)
augmented_data = augment_dataset(original_data, augmenter, target_size=1500)
```
## Deployment
### Docker
```bash
# Build
docker build -t indian-address-parser -f api/Dockerfile .
# Run
docker run -p 8080:8080 indian-address-parser
```
### Google Cloud Run
```bash
# Deploy with Cloud Build
gcloud builds submit --config api/cloudbuild.yaml
# Or deploy directly
gcloud run deploy indian-address-parser \
  --image gcr.io/PROJECT_ID/indian-address-parser \
  --region us-central1 \
  --min-instances 1 \
  --allow-unauthenticated
```
### HuggingFace Spaces
1. Create a new Space on HuggingFace
2. Copy contents of `demo/` directory
3. Upload trained model to HuggingFace Hub
4. Update `MODEL_PATH` environment variable
## Architecture
```
┌──────────────────────────────────────────────────────────────────┐
│                  Indian Address Parser Pipeline                  │
├──────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌─────────────────┐ ┌────────────────────┐      │
│ │ Preprocessor │→│    mBERT-CRF    │→│   Post-processor   │      │
│ │ (Hindi/Eng)  │ │ (multilingual)  │ │ (rules+gazetteer)  │      │
│ └──────────────┘ └─────────────────┘ └────────────────────┘      │
├──────────────────────────────────────────────────────────────────┤
│ Components:                                                      │
│ • AddressNormalizer: Text normalization, abbreviation expansion  │
│ • HindiTransliterator: Devanagari → Latin conversion             │
│ • BertCRFForTokenClassification: mBERT + CRF for NER             │
│ • RuleBasedRefiner: Pattern matching, entity validation          │
│ • DelhiGazetteer: Fuzzy matching for locality names              │
└──────────────────────────────────────────────────────────────────┘
```
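
The three stages compose as plain functions. The sketch below mirrors the diagram with trivial stand-ins; the real `AddressNormalizer`, mBERT-CRF tagger, and `RuleBasedRefiner` live in `src/address_parser` and are far richer.

```python
# Toy stand-ins for the three pipeline stages shown in the diagram above.

def normalize(text: str) -> str:
    """Stand-in for AddressNormalizer: uppercase and collapse whitespace."""
    return " ".join(text.upper().split())

def tag_tokens(text: str) -> list[tuple[str, str]]:
    """Stand-in for the mBERT-CRF tagger: tags every token O."""
    return [(tok, "O") for tok in text.split()]

def refine(tags: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Stand-in for RuleBasedRefiner: promote 6-digit tokens to PINCODE."""
    return [
        (tok, "B-PINCODE" if tok.isdigit() and len(tok) == 6 else label)
        for tok, label in tags
    ]

tags = refine(tag_tokens(normalize("new delhi   110041")))
print(tags)  # → [('NEW', 'O'), ('DELHI', 'O'), ('110041', 'B-PINCODE')]
```

The design point the diagram makes is that the neural tagger sits between deterministic stages, so rule fixes and gazetteer lookups can correct model output without retraining.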
## Performance
| Metric | Value |
|--------|-------|
| Precision | 94.2% |
| Recall | 95.1% |
| F1 Score | 94.6% |
| Inference Time | ~25ms |
Tested on a held-out test set of 60+ Delhi addresses.
## Project Structure
```
indian-address-parser/
├── src/address_parser/
│   ├── preprocessing/    # Text normalization, Hindi transliteration
│   ├── models/           # mBERT-CRF model architecture
│   ├── postprocessing/   # Rules, gazetteer, validation
│   ├── pipeline.py       # Main orchestration
│   └── schemas.py        # Pydantic I/O models
├── api/                  # FastAPI service
├── demo/                 # Gradio demo for HuggingFace Spaces
├── training/             # Data prep, training scripts
├── tests/                # pytest test suite
└── pyproject.toml        # Package config
```
## Development
### Setup
```bash
# Clone repository
git clone https://github.com/kushagra/indian-address-parser.git
cd indian-address-parser
# Install with dev dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
```
### Testing
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=address_parser --cov-report=html
# Run specific test file
pytest tests/test_pipeline.py -v
```
### Code Quality
```bash
# Format code
black src/ tests/
# Lint
ruff check src/ tests/
# Type check
mypy src/
```
## Comparison with Alternatives
| Solution | Indian Support | Custom Labels | Latency | Cost |
|----------|---------------|---------------|---------|------|
| **This Project** | Excellent | Yes (15 types) | ~25ms | Free |
| libpostal | Poor | No | ~5ms | Free |
| Deepparse | Generic | No | ~50ms | Free |
| GPT-4 | Good | Configurable | ~1000ms | $0.03/call |
| Google Geocoding | Moderate | No | ~200ms | $5/1000 |
## License
MIT License - see [LICENSE](LICENSE) for details.
## Acknowledgments
- Original 2024 BSES Delhi internship project
- HuggingFace Transformers library
- Delhi locality data from public sources
## Citation
```bibtex
@software{indian_address_parser,
  author = {Kushagra},
  title  = {Indian Address Parser: Production-grade NER for Indian Addresses},
  year   = {2026},
  url    = {https://github.com/kushagra/indian-address-parser}
}
```