Metadata-Version: 2.4
Name: indian-address-parser
Version: 2.0.0
Summary: Production-grade Indian address parsing using mBERT-CRF
Author-email: Kushagra <kushagra@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/kushagra/indian-address-parser
Project-URL: Documentation, https://github.com/kushagra/indian-address-parser#readme
Project-URL: Repository, https://github.com/kushagra/indian-address-parser
Project-URL: Issues, https://github.com/kushagra/indian-address-parser/issues
Keywords: nlp,ner,address-parsing,indian-addresses,bert,crf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.14
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.9.1
Requires-Dist: transformers>=4.57.6
Requires-Dist: tokenizers>=0.22.2
Requires-Dist: datasets>=4.5.0
Requires-Dist: seqeval>=1.2.2
Requires-Dist: numpy>=2.4.1
Requires-Dist: pandas>=2.3.3
Requires-Dist: scikit-learn>=1.8.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: pydantic>=2.12.5
Requires-Dist: indic-transliteration>=2.3.75
Requires-Dist: regex>=2026.1.15
Requires-Dist: rapidfuzz>=3.14.3
Provides-Extra: api
Requires-Dist: fastapi>=0.128.0; extra == "api"
Requires-Dist: uvicorn[standard]>=0.40.0; extra == "api"
Requires-Dist: gunicorn>=23.0.0; extra == "api"
Requires-Dist: python-multipart>=0.0.21; extra == "api"
Provides-Extra: demo
Requires-Dist: gradio>=6.3.0; extra == "demo"
Provides-Extra: training
Requires-Dist: accelerate>=1.12.0; extra == "training"
Requires-Dist: wandb>=0.24.0; extra == "training"
Requires-Dist: optuna>=4.7.0; extra == "training"
Provides-Extra: onnx
Requires-Dist: onnx>=1.20.1; python_version < "3.14" and extra == "onnx"
Requires-Dist: onnxruntime>=1.23.2; python_version < "3.14" and extra == "onnx"
Provides-Extra: dev
Requires-Dist: pytest>=9.0.2; extra == "dev"
Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=1.3.0; extra == "dev"
Requires-Dist: black>=26.1.0; extra == "dev"
Requires-Dist: ruff>=0.14.13; extra == "dev"
Requires-Dist: mypy>=1.19.1; extra == "dev"
Requires-Dist: pre-commit>=4.5.1; extra == "dev"
Provides-Extra: all
Requires-Dist: indian-address-parser[api,demo,dev,training]; extra == "all"
Provides-Extra: all-with-onnx
Requires-Dist: indian-address-parser[api,demo,dev,onnx,training]; extra == "all-with-onnx"
# Indian Address Parser
Production-grade NLP system for parsing unstructured Indian addresses into structured components using **mBERT-CRF** (Multilingual BERT with Conditional Random Field).
[![Python 3.14+](https://img.shields.io/badge/python-3.14+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
## Features
- **High Accuracy**: 94%+ F1 score on test data
- **Multilingual**: Supports Hindi (Devanagari) + English
- **Fast Inference**: < 30ms per address with ONNX optimization
- **15 Entity Types**: House Number, Floor, Block, Gali, Colony, Area, Khasra, Pincode, etc.
- **Delhi-specific**: Gazetteer with 100+ localities for improved accuracy
- **Production Ready**: REST API, Docker, Cloud Run deployment
## Demo
- **Interactive Demo**: [HuggingFace Spaces](https://huggingface.co/spaces/kushagra/indian-address-parser)
- **API Endpoint**: `https://indian-address-parser-xyz.run.app/docs`
## Quick Start
### Installation
```bash
pip install indian-address-parser
```
Or from source:
```bash
git clone https://github.com/kushagra/indian-address-parser.git
cd indian-address-parser
pip install -e ".[all]"
```
### Usage
```python
from address_parser import AddressParser
# Load parser (rules-only mode if model not available)
parser = AddressParser.rules_only()
# Or load trained model
# parser = AddressParser.from_pretrained("./models/address_ner")
# Parse address
result = parser.parse(
    "PLOT NO752 FIRST FLOOR, BLOCK H-3 KH NO 24/1/3/2/2/202, "
    "KAUNWAR SINGH NAGAR NEW DELHI, DELHI, 110041"
)
print(f"House Number: {result.house_number}")
print(f"Floor: {result.floor}")
print(f"Block: {result.block}")
print(f"Khasra: {result.khasra}")
print(f"Area: {result.area}")
print(f"Pincode: {result.pincode}")
```
**Output:**
```
House Number: PLOT NO752
Floor: FIRST FLOOR
Block: BLOCK H-3
Khasra: KH NO 24/1/3/2/2/202
Area: KAUNWAR SINGH NAGAR
Pincode: 110041
```
### Entity Types
| Entity | Description | Example |
|--------|-------------|---------|
| `HOUSE_NUMBER` | House/plot number | `H.NO. 123`, `PLOT NO752` |
| `FLOOR` | Floor level | `FIRST FLOOR`, `GF` |
| `BLOCK` | Block identifier | `BLOCK H-3`, `BLK A` |
| `SECTOR` | Sector number | `SECTOR 15` |
| `GALI` | Lane/gali number | `GALI NO. 5` |
| `COLONY` | Colony name | `BABA HARI DAS COLONY` |
| `AREA` | Area/locality | `KAUNWAR SINGH NAGAR` |
| `SUBAREA` | Sub-area | `TIKARI KALA` |
| `KHASRA` | Khasra number | `KH NO 24/1/3/2` |
| `PINCODE` | 6-digit PIN code | `110041` |
| `CITY` | City name | `NEW DELHI` |
| `STATE` | State name | `DELHI` |
## API Usage
### REST API
```bash
# Start API server
uvicorn api.main:app --host 0.0.0.0 --port 8080
# Parse single address
curl -X POST "http://localhost:8080/parse" \
  -H "Content-Type: application/json" \
  -d '{"address": "PLOT NO752 FIRST FLOOR, NEW DELHI, 110041"}'

# Batch parse
curl -X POST "http://localhost:8080/parse/batch" \
  -H "Content-Type: application/json" \
  -d '{"addresses": ["ADDRESS 1", "ADDRESS 2"]}'
```
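
If you prefer Python over curl, the same endpoint can be reached with the standard library alone. This is a minimal sketch assuming the local dev server started above; the response schema is whatever `/parse` returns.

```python
import json
from urllib import request

API_URL = "http://localhost:8080/parse"  # the local dev server started above

def build_parse_request(address: str) -> request.Request:
    """Construct the POST request the /parse endpoint expects."""
    body = json.dumps({"address": address}).encode("utf-8")
    return request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_parse_request("PLOT NO752 FIRST FLOOR, NEW DELHI, 110041")
# with request.urlopen(req) as resp:      # uncomment with the server running
#     print(json.load(resp))
```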
### Python API
```python
from address_parser import AddressParser
parser = AddressParser.from_pretrained("./models/address_ner")
# Single parse with timing
response = parser.parse_with_timing("NEW DELHI 110041")
print(f"Inference time: {response.inference_time_ms:.2f}ms")
# Batch parse
batch_response = parser.parse_batch([
    "PLOT NO 123, DWARKA, 110078",
    "H.NO. 456, LAJPAT NAGAR, 110024",
])
print(f"Average time: {batch_response.avg_inference_time_ms:.2f}ms")
```
## Training
### Data Preparation
Convert existing Label Studio annotations to BIO format:
```bash
python training/convert_data.py
```
This creates:
- `data/processed/train.jsonl`
- `data/processed/val.jsonl`
- `data/processed/test.jsonl`
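
The JSONL files hold one example per line in BIO format. The field names below are illustrative (the real ones depend on `training/convert_data.py`), but the invariant is fixed: exactly one label per token, with `B-`/`I-` prefixes marking entity starts and continuations.

```python
import json

# Illustrative record shape for the BIO-format JSONL files; the actual
# field names come from training/convert_data.py and may differ.
record = {
    "tokens": ["PLOT", "NO752", "NEW", "DELHI", "110041"],
    "labels": ["B-HOUSE_NUMBER", "I-HOUSE_NUMBER", "B-CITY", "I-CITY", "B-PINCODE"],
}

# BIO invariants: one label per token; every label is O or B-/I- prefixed.
assert len(record["tokens"]) == len(record["labels"])
assert all(l == "O" or l[:2] in ("B-", "I-") for l in record["labels"])

print(json.dumps(record))
```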
### Train Model
```bash
python training/train.py \
  --train data/processed/train.jsonl \
  --val data/processed/val.jsonl \
  --output models/address_ner \
  --model bert-base-multilingual-cased \
  --epochs 10 \
  --batch-size 16
```
### Data Augmentation
Augment training data for improved robustness:
```python
from training.augment import AddressAugmenter, augment_dataset
augmenter = AddressAugmenter(
    abbrev_prob=0.3,
    case_prob=0.2,
    typo_prob=0.1,
)
augmented_data = augment_dataset(original_data, augmenter, target_size=1500)
```
## Deployment
### Docker
```bash
# Build
docker build -t indian-address-parser -f api/Dockerfile .
# Run
docker run -p 8080:8080 indian-address-parser
```
### Google Cloud Run
```bash
# Deploy with Cloud Build
gcloud builds submit --config api/cloudbuild.yaml
# Or deploy directly
gcloud run deploy indian-address-parser \
  --image gcr.io/PROJECT_ID/indian-address-parser \
  --region us-central1 \
  --min-instances 1 \
  --allow-unauthenticated
```
### HuggingFace Spaces
1. Create a new Space on HuggingFace
2. Copy contents of `demo/` directory
3. Upload trained model to HuggingFace Hub
4. Update `MODEL_PATH` environment variable
## Architecture
```
┌──────────────────────────────────────────────────────────────────┐
│                  Indian Address Parser Pipeline                  │
├──────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌─────────────────┐ ┌────────────────────┐      │
│ │ Preprocessor │→│    mBERT-CRF    │→│   Post-processor   │      │
│ │ (Hindi/Eng)  │ │ (multilingual)  │ │ (rules+gazetteer)  │      │
│ └──────────────┘ └─────────────────┘ └────────────────────┘      │
├──────────────────────────────────────────────────────────────────┤
│ Components:                                                      │
│ • AddressNormalizer: Text normalization, abbreviation expansion  │
│ • HindiTransliterator: Devanagari → Latin conversion             │
│ • BertCRFForTokenClassification: mBERT + CRF for NER             │
│ • RuleBasedRefiner: Pattern matching, entity validation          │
│ • DelhiGazetteer: Fuzzy matching for locality names              │
└──────────────────────────────────────────────────────────────────┘
```
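
The three stages compose as plain functions. The sketch below mirrors the diagram with trivial stand-ins; the real `AddressNormalizer`, mBERT-CRF tagger, and `RuleBasedRefiner` live in `src/address_parser` and are far richer.

```python
# Toy stand-ins for the three pipeline stages shown in the diagram above.

def normalize(text: str) -> str:
    """Stand-in for AddressNormalizer: uppercase and collapse whitespace."""
    return " ".join(text.upper().split())

def tag_tokens(text: str) -> list[tuple[str, str]]:
    """Stand-in for the mBERT-CRF tagger: tags every token O."""
    return [(tok, "O") for tok in text.split()]

def refine(tags: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Stand-in for RuleBasedRefiner: promote 6-digit tokens to PINCODE."""
    return [
        (tok, "B-PINCODE" if tok.isdigit() and len(tok) == 6 else label)
        for tok, label in tags
    ]

tags = refine(tag_tokens(normalize("new delhi   110041")))
print(tags)  # → [('NEW', 'O'), ('DELHI', 'O'), ('110041', 'B-PINCODE')]
```

The design point the diagram makes is that the neural tagger sits between deterministic stages, so rule fixes and gazetteer lookups can correct model output without retraining.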
## Performance
| Metric | Value |
|--------|-------|
| Precision | 94.2% |
| Recall | 95.1% |
| F1 Score | 94.6% |
| Inference Time | ~25ms |
Tested on a held-out test set of 60+ Delhi addresses.
## Project Structure
```
indian-address-parser/
├── src/address_parser/
│   ├── preprocessing/    # Text normalization, Hindi transliteration
│   ├── models/           # mBERT-CRF model architecture
│   ├── postprocessing/   # Rules, gazetteer, validation
│   ├── pipeline.py       # Main orchestration
│   └── schemas.py        # Pydantic I/O models
├── api/                  # FastAPI service
├── demo/                 # Gradio demo for HuggingFace Spaces
├── training/             # Data prep, training scripts
├── tests/                # pytest test suite
└── pyproject.toml        # Package config
```
## Development
### Setup
```bash
# Clone repository
git clone https://github.com/kushagra/indian-address-parser.git
cd indian-address-parser
# Install with dev dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
```
### Testing
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=address_parser --cov-report=html
# Run specific test file
pytest tests/test_pipeline.py -v
```
### Code Quality
```bash
# Format code
black src/ tests/
# Lint
ruff check src/ tests/
# Type check
mypy src/
```
## Comparison with Alternatives
| Solution | Indian Support | Custom Labels | Latency | Cost |
|----------|---------------|---------------|---------|------|
| **This Project** | Excellent | Yes (15 types) | ~25ms | Free |
| libpostal | Poor | No | ~5ms | Free |
| Deepparse | Generic | No | ~50ms | Free |
| GPT-4 | Good | Configurable | ~1000ms | $0.03/call |
| Google Geocoding | Moderate | No | ~200ms | $5/1000 |
## License
MIT License - see [LICENSE](LICENSE) for details.
## Acknowledgments
- Original 2024 BSES Delhi internship project
- HuggingFace Transformers library
- Delhi locality data from public sources
## Citation
```bibtex
@software{indian_address_parser,
  author = {Kushagra},
  title  = {Indian Address Parser: Production-grade NER for Indian Addresses},
  year   = {2026},
  url    = {https://github.com/kushagra/indian-address-parser}
}
```