| Metadata-Version: 2.4
|
| Name: indian-address-parser
|
| Version: 2.0.0
|
| Summary: Production-grade Indian address parsing using mBERT-CRF
|
| Author-email: Kushagra <kushagra@gmail.com>
|
| License: MIT
|
| Project-URL: Homepage, https://github.com/kushagra/indian-address-parser
|
| Project-URL: Documentation, https://github.com/kushagra/indian-address-parser#readme
|
| Project-URL: Repository, https://github.com/kushagra/indian-address-parser
|
| Project-URL: Issues, https://github.com/kushagra/indian-address-parser/issues
|
| Keywords: nlp,ner,address-parsing,indian-addresses,bert,crf
|
| Classifier: Development Status :: 4 - Beta
|
| Classifier: Intended Audience :: Developers
|
| Classifier: License :: OSI Approved :: MIT License
|
| Classifier: Programming Language :: Python :: 3
|
| Classifier: Programming Language :: Python :: 3.14
|
| Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
| Classifier: Topic :: Text Processing :: Linguistic
|
| Requires-Python: >=3.14
|
| Description-Content-Type: text/markdown
|
| Requires-Dist: torch>=2.9.1
|
| Requires-Dist: transformers>=4.57.6
|
| Requires-Dist: tokenizers>=0.22.2
|
| Requires-Dist: datasets>=4.5.0
|
| Requires-Dist: seqeval>=1.2.2
|
| Requires-Dist: numpy>=2.4.1
|
| Requires-Dist: pandas>=2.3.3
|
| Requires-Dist: scikit-learn>=1.8.0
|
| Requires-Dist: tqdm>=4.67.1
|
| Requires-Dist: pydantic>=2.12.5
|
| Requires-Dist: indic-transliteration>=2.3.75
|
| Requires-Dist: regex>=2026.1.15
|
| Requires-Dist: rapidfuzz>=3.14.3
|
| Provides-Extra: api
|
| Requires-Dist: fastapi>=0.128.0; extra == "api"
|
| Requires-Dist: uvicorn[standard]>=0.40.0; extra == "api"
|
| Requires-Dist: gunicorn>=23.0.0; extra == "api"
|
| Requires-Dist: python-multipart>=0.0.21; extra == "api"
|
| Provides-Extra: demo
|
| Requires-Dist: gradio>=6.3.0; extra == "demo"
|
| Provides-Extra: training
|
| Requires-Dist: accelerate>=1.12.0; extra == "training"
|
| Requires-Dist: wandb>=0.24.0; extra == "training"
|
| Requires-Dist: optuna>=4.7.0; extra == "training"
|
| Provides-Extra: onnx
|
| Requires-Dist: onnx>=1.20.1; python_version < "3.14" and extra == "onnx"
|
| Requires-Dist: onnxruntime>=1.23.2; python_version < "3.14" and extra == "onnx"
|
| Provides-Extra: dev
|
| Requires-Dist: pytest>=9.0.2; extra == "dev"
|
| Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
|
| Requires-Dist: pytest-asyncio>=1.3.0; extra == "dev"
|
| Requires-Dist: black>=26.1.0; extra == "dev"
|
| Requires-Dist: ruff>=0.14.13; extra == "dev"
|
| Requires-Dist: mypy>=1.19.1; extra == "dev"
|
| Requires-Dist: pre-commit>=4.5.1; extra == "dev"
|
| Provides-Extra: all
|
| Requires-Dist: indian-address-parser[api,demo,dev,training]; extra == "all"
|
| Provides-Extra: all-with-onnx
|
| Requires-Dist: indian-address-parser[api,demo,dev,onnx,training]; extra == "all-with-onnx"
|
|
|
|
|
|
|
| # Indian Address Parser
|
| Production-grade NLP system for parsing unstructured Indian addresses into structured components using **mBERT-CRF** (Multilingual BERT with a Conditional Random Field layer).
|
|
|
| [Python 3.14+](https://www.python.org/downloads/)
|
| [MIT License](https://opensource.org/licenses/MIT)
|
|
|
|
|
|
|
| ## Features
|
| - **High Accuracy**: 94%+ F1 score on test data
|
| - **Multilingual**: Supports Hindi (Devanagari) + English
|
| - **Fast Inference**: < 30ms per address with ONNX optimization
|
| - **15 Entity Types**: House Number, Floor, Block, Gali, Colony, Area, Khasra, Pincode, etc.
|
| - **Delhi-specific**: Gazetteer with 100+ localities for improved accuracy
|
| - **Production Ready**: REST API, Docker, Cloud Run deployment
|
|
|
|
|
|
|
| ## Live Demo
|
| - **Interactive Demo**: [HuggingFace Spaces](https://huggingface.co/spaces/kushagra/indian-address-parser)
|
| - **API Endpoint**: `https://indian-address-parser-xyz.run.app/docs`
|
|
|
| ## Quick Start
|
|
|
| ### Installation
|
|
|
| ```bash
|
| pip install indian-address-parser
|
| ```
|
|
|
| Or from source:
|
|
|
| ```bash
|
| git clone https://github.com/kushagra/indian-address-parser.git
|
| cd indian-address-parser
|
| pip install -e ".[all]"
|
| ```
|
|
|
| ### Usage
|
|
|
| ```python
|
| from address_parser import AddressParser
|
|
|
| # Load parser (rules-only mode if model not available)
|
| parser = AddressParser.rules_only()
|
|
|
| # Or load trained model
|
| # parser = AddressParser.from_pretrained("./models/address_ner")
|
|
|
| # Parse address
|
| result = parser.parse(
|     "PLOT NO752 FIRST FLOOR, BLOCK H-3 KH NO 24/1/3/2/2/202, "
|     "KAUNWAR SINGH NAGAR NEW DELHI, DELHI, 110041"
| )
|
|
|
| print(f"House Number: {result.house_number}")
|
| print(f"Floor: {result.floor}")
|
| print(f"Block: {result.block}")
|
| print(f"Khasra: {result.khasra}")
|
| print(f"Area: {result.area}")
|
| print(f"Pincode: {result.pincode}")
|
| ```
|
|
|
| **Output:**
|
| ```
|
| House Number: PLOT NO752
|
| Floor: FIRST FLOOR
|
| Block: BLOCK H-3
|
| Khasra: KH NO 24/1/3/2/2/202
|
| Area: KAUNWAR SINGH NAGAR
|
| Pincode: 110041
|
| ```
|
|
|
| ### Entity Types
|
|
|
| | Entity | Description | Example |
| |--------|-------------|---------|
| | `HOUSE_NUMBER` | House/plot number | `H.NO. 123`, `PLOT NO752` |
| | `FLOOR` | Floor level | `FIRST FLOOR`, `GF` |
| | `BLOCK` | Block identifier | `BLOCK H-3`, `BLK A` |
| | `SECTOR` | Sector number | `SECTOR 15` |
| | `GALI` | Lane/gali number | `GALI NO. 5` |
| | `COLONY` | Colony name | `BABA HARI DAS COLONY` |
| | `AREA` | Area/locality | `KAUNWAR SINGH NAGAR` |
| | `SUBAREA` | Sub-area | `TIKARI KALA` |
| | `KHASRA` | Khasra number | `KH NO 24/1/3/2` |
| | `PINCODE` | 6-digit PIN code | `110041` |
| | `CITY` | City name | `NEW DELHI` |
| | `STATE` | State name | `DELHI` |
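Entities such as `PINCODE` and `KHASRA` follow regular surface patterns, which is what a rule-based fallback can exploit. A minimal sketch — the patterns and function below are illustrative, not the package's actual `RuleBasedRefiner` rules:

```python
import re

# Illustrative patterns only -- not the package's actual rule set.
PINCODE_RE = re.compile(r"\b[1-9]\d{5}\b")  # 6-digit Indian PIN, no leading zero
KHASRA_RE = re.compile(r"\bKH\.?\s*NO\.?\s*[\d/]+", re.IGNORECASE)

def extract_simple_entities(address: str) -> dict:
    """Pull out pattern-friendly entities from a raw address string."""
    pin = PINCODE_RE.search(address)
    khasra = KHASRA_RE.search(address)
    return {
        "pincode": pin.group() if pin else None,
        "khasra": khasra.group().strip() if khasra else None,
    }

print(extract_simple_entities(
    "PLOT NO752 FIRST FLOOR, BLOCK H-3 KH NO 24/1/3/2/2/202, NEW DELHI, 110041"
))
# → {'pincode': '110041', 'khasra': 'KH NO 24/1/3/2/2/202'}
```

Regex alone cannot disambiguate context-dependent entities like `AREA` vs `COLONY`, which is where the mBERT-CRF model earns its keep.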
|
|
|
| ## API Usage
|
|
|
| ### REST API
|
|
|
| ```bash
|
| # Start API server
|
| uvicorn api.main:app --host 0.0.0.0 --port 8080
|
|
|
| # Parse single address
|
| curl -X POST "http://localhost:8080/parse" \
|   -H "Content-Type: application/json" \
|   -d '{"address": "PLOT NO752 FIRST FLOOR, NEW DELHI, 110041"}'
|
|
|
| # Batch parse
|
| curl -X POST "http://localhost:8080/parse/batch" \
|   -H "Content-Type: application/json" \
|   -d '{"addresses": ["ADDRESS 1", "ADDRESS 2"]}'
|
| ```
|
|
|
| ### Python API
|
|
|
| ```python
|
| from address_parser import AddressParser
|
|
|
| parser = AddressParser.from_pretrained("./models/address_ner")
|
|
|
| # Single parse with timing
|
| response = parser.parse_with_timing("NEW DELHI 110041")
|
| print(f"Inference time: {response.inference_time_ms:.2f}ms")
|
|
|
| # Batch parse
|
| batch_response = parser.parse_batch([
|     "PLOT NO 123, DWARKA, 110078",
|     "H.NO. 456, LAJPAT NAGAR, 110024",
| ])
|
| print(f"Average time: {batch_response.avg_inference_time_ms:.2f}ms")
|
| ```
|
|
|
| ## Training
|
|
|
| ### Data Preparation
|
|
|
| Convert existing Label Studio annotations to BIO format:
|
|
|
| ```bash
|
| python training/convert_data.py
|
| ```
|
|
|
| This creates:
|
| - `data/processed/train.jsonl`
|
| - `data/processed/val.jsonl`
|
| - `data/processed/test.jsonl`
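Each line of these files is one JSON object pairing tokens with BIO labels. A sketch of what a record might look like — the field names `tokens` and `labels` are an assumption about `convert_data.py`'s output, shown for illustration only:

```python
import json

# Hypothetical record layout -- the actual field names emitted by
# convert_data.py may differ.
record = {
    "tokens": ["PLOT", "NO752", "FIRST", "FLOOR", "NEW", "DELHI", "110041"],
    "labels": ["B-HOUSE_NUMBER", "I-HOUSE_NUMBER", "B-FLOOR", "I-FLOOR",
               "B-CITY", "I-CITY", "B-PINCODE"],
}

line = json.dumps(record)   # one JSON object per line in train.jsonl
parsed = json.loads(line)

# BIO tagging requires a strict 1:1 token/label alignment.
assert len(parsed["tokens"]) == len(parsed["labels"])
print(parsed["labels"][0])  # → B-HOUSE_NUMBER
```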
|
|
|
| ### Train Model
|
|
|
| ```bash
|
| python training/train.py \
|     --train data/processed/train.jsonl \
|     --val data/processed/val.jsonl \
|     --output models/address_ner \
|     --model bert-base-multilingual-cased \
|     --epochs 10 \
|     --batch-size 16
|
| ```
|
|
|
| ### Data Augmentation
|
|
|
| Augment training data for improved robustness:
|
|
|
| ```python
|
| from training.augment import AddressAugmenter, augment_dataset
|
|
|
| augmenter = AddressAugmenter(
|     abbrev_prob=0.3,
|     case_prob=0.2,
|     typo_prob=0.1,
| )
|
|
|
| augmented_data = augment_dataset(original_data, augmenter, target_size=1500)
|
| ```
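The kind of perturbation behind `abbrev_prob` can be sketched as random abbreviation substitution. The mapping and logic below are illustrative only, not `AddressAugmenter`'s actual implementation:

```python
import random

# Illustrative abbreviation map -- not the package's actual table.
ABBREVIATIONS = {"ROAD": "RD", "BLOCK": "BLK", "NAGAR": "NGR", "FLOOR": "FLR"}

def abbreviate(address: str, prob: float = 0.3, seed: int = 0) -> str:
    """Randomly swap full words for abbreviations, mimicking abbrev_prob."""
    rng = random.Random(seed)
    out = []
    for token in address.split():
        if token in ABBREVIATIONS and rng.random() < prob:
            out.append(ABBREVIATIONS[token])
        else:
            out.append(token)
    return " ".join(out)

print(abbreviate("FIRST FLOOR BLOCK H-3 KAUNWAR SINGH NAGAR", prob=1.0))
# → FIRST FLR BLK H-3 KAUNWAR SINGH NGR
```

Training on such variants teaches the model that `BLK A` and `BLOCK A` denote the same entity.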
|
|
|
| ## Deployment
|
|
|
| ### Docker
|
|
|
| ```bash
|
| # Build
|
| docker build -t indian-address-parser -f api/Dockerfile .
|
|
|
| # Run
|
| docker run -p 8080:8080 indian-address-parser
|
| ```
|
|
|
| ### Google Cloud Run
|
|
|
| ```bash
|
| # Deploy with Cloud Build
|
| gcloud builds submit --config api/cloudbuild.yaml
|
|
|
| # Or deploy directly
|
| gcloud run deploy indian-address-parser \
|     --image gcr.io/PROJECT_ID/indian-address-parser \
|     --region us-central1 \
|     --min-instances 1 \
|     --allow-unauthenticated
|
| ```
|
|
|
| ### HuggingFace Spaces
|
|
|
| 1. Create a new Space on HuggingFace
|
| 2. Copy contents of `demo/` directory
|
| 3. Upload trained model to HuggingFace Hub
|
| 4. Update `MODEL_PATH` environment variable
|
|
|
| ## Architecture
|
|
|
| ```
|
| ┌──────────────────────────────────────────────────────────────────┐
| │                  Indian Address Parser Pipeline                  │
| ├──────────────────────────────────────────────────────────────────┤
| │  ┌──────────────┐   ┌────────────────┐   ┌────────────────────┐  │
| │  │ Preprocessor │──▶│   mBERT-CRF    │──▶│   Post-processor   │  │
| │  │ (Hindi/Eng)  │   │ (multilingual) │   │ (rules+gazetteer)  │  │
| │  └──────────────┘   └────────────────┘   └────────────────────┘  │
| ├──────────────────────────────────────────────────────────────────┤
| │  Components:                                                     │
| │  • AddressNormalizer: Text normalization, abbreviation expansion │
| │  • HindiTransliterator: Devanagari → Latin conversion            │
| │  • BertCRFForTokenClassification: mBERT + CRF for NER            │
| │  • RuleBasedRefiner: Pattern matching, entity validation         │
| │  • DelhiGazetteer: Fuzzy matching for locality names             │
| └──────────────────────────────────────────────────────────────────┘
|
| ```
|
|
|
| ## Performance
|
|
|
| | Metric | Value |
| |--------|-------|
| | Precision | 94.2% |
| | Recall | 95.1% |
| | F1 Score | 94.6% |
| | Inference Time | ~25ms |
|
|
|
| Tested on a held-out test set of 60+ Delhi addresses.
|
|
|
| ## Project Structure
|
|
|
| ```
|
| indian-address-parser/
|
| ├── src/address_parser/
| │   ├── preprocessing/    # Text normalization, Hindi transliteration
| │   ├── models/           # mBERT-CRF model architecture
| │   ├── postprocessing/   # Rules, gazetteer, validation
| │   ├── pipeline.py       # Main orchestration
| │   └── schemas.py        # Pydantic I/O models
| ├── api/                  # FastAPI service
| ├── demo/                 # Gradio demo for HuggingFace Spaces
| ├── training/             # Data prep, training scripts
| ├── tests/                # pytest test suite
| └── pyproject.toml        # Package config
|
| ```
|
|
|
| ## Development
|
|
|
| ### Setup
|
|
|
| ```bash
|
| # Clone repository
|
| git clone https://github.com/kushagra/indian-address-parser.git
|
| cd indian-address-parser
|
|
|
| # Install with dev dependencies
|
| pip install -e ".[dev]"
|
|
|
| # Install pre-commit hooks
|
| pre-commit install
|
| ```
|
|
|
| ### Testing
|
|
|
| ```bash
|
| # Run all tests
|
| pytest
|
|
|
| # Run with coverage
|
| pytest --cov=address_parser --cov-report=html
|
|
|
| # Run specific test file
|
| pytest tests/test_pipeline.py -v
|
| ```
|
|
|
| ### Code Quality
|
|
|
| ```bash
|
| # Format code
|
| black src/ tests/
|
|
|
| # Lint
|
| ruff check src/ tests/
|
|
|
| # Type check
|
| mypy src/
|
| ```
|
|
|
| ## Comparison with Alternatives
|
|
|
| | Solution | Indian Support | Custom Labels | Latency | Cost |
| |----------|---------------|---------------|---------|------|
| | **This Project** | Excellent | Yes (15 types) | ~25ms | Free |
| | libpostal | Poor | No | ~5ms | Free |
| | Deepparse | Generic | No | ~50ms | Free |
| | GPT-4 | Good | Configurable | ~1000ms | $0.03/call |
| | Google Geocoding | Moderate | No | ~200ms | $5/1000 |
|
|
|
| ## License
|
|
|
| MIT License - see [LICENSE](LICENSE) for details.
|
|
|
| ## Acknowledgments
|
|
|
| - Original 2024 BSES Delhi internship project
|
| - HuggingFace Transformers library
|
| - Delhi locality data from public sources
|
|
|
| ## Citation
|
|
|
| ```bibtex
|
| @software{indian_address_parser,
|
| author = {Kushagra},
|
| title = {Indian Address Parser: Production-grade NER for Indian Addresses},
|
| year = {2026},
|
| url = {https://github.com/kushagra/indian-address-parser}
|
| }
|
| ```
|
| |