US Address Parser LoRA - USPS-Style Component Extraction

This repository contains a LoRA fine-tuned adapter for parsing US street addresses into structured JSON components.

The model was fine-tuned from Qwen/Qwen2.5-0.5B-Instruct using Alpaca-style instruction examples. The task is to split a full US address into components such as house number, street name, street suffix, directional fields, apartment/unit, city, state, and ZIP code.

Intended Use

Given an instruction that contains a US address, such as:

Split and validate the USPS-style address: 55 Brooksby Village Way, Danvers MA 1923

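The model is trained to return compact JSON (shown here pretty-printed for readability). Note that the input ZIP code 1923 is missing its leading zero and is normalized to 01923 in the output: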

{
  "HouseNumber": "55",
  "StreetPreDirection": "",
  "StreetName": "BROOKSBY VILLAGE",
  "StreetSuffix": "WAY",
  "StreetPostDirection": "",
  "Apt": "",
  "City": "DANVERS",
  "State": "MA",
  "ZipCode": "01923",
  "IsValidUSPSStyle": true,
  "ValidationNotes": ""
}
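
Because a small model can occasionally emit malformed or incomplete JSON, it helps to check the response before using it. The following is a minimal sketch, not part of the released notebook: the helper name parse_model_output is hypothetical, and the key set is copied from the example output above.

```python
import json

# Field names copied from the example output in this model card.
EXPECTED_KEYS = {
    "HouseNumber", "StreetPreDirection", "StreetName", "StreetSuffix",
    "StreetPostDirection", "Apt", "City", "State", "ZipCode",
    "IsValidUSPSStyle", "ValidationNotes",
}


def parse_model_output(text):
    """Return the parsed dict, or None if the text is not JSON with exactly the expected keys."""
    # Drop a ChatML end marker if the decoder left it in the text.
    cleaned = text.replace("<|im_end|>", "").strip()
    try:
        record = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or set(record) != EXPECTED_KEYS:
        return None
    return record
```

Returning None instead of raising keeps batch pipelines simple: rejected rows can be counted and routed to manual review.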

Important Validation Note

This model performs USPS-style structural parsing and normalization. It is not a USPS-certified or CASS-certified address validation system.

For production use, pair the model output with an authoritative USPS, CASS-certified, or licensed address-validation provider to confirm deliverability.

Training Data

The included Colab workflow generates approximately 10,000 Alpaca-format synthetic examples covering all 50 US states plus Washington, DC.

The generated dataset includes:

pre-directionals and post-directionals
apartments, units, suites, floors, and # unit markers
5-digit ZIP codes and ZIP+4
leading-zero ZIP normalization
multi-word street names
numbered streets
valid and intentionally invalid structural examples

Synthetic data is useful for learning schema and formatting behavior, but real labeled address data should be added for production-quality performance.
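
To make the data format concrete, here is a rough sketch of how one Alpaca-format record with leading-zero ZIP normalization could be built. The function names normalize_zip and make_alpaca_record are hypothetical, and placing the address inside the instruction field with an empty input field is an assumption based on the example prompt in this card; the actual Colab generator may differ.

```python
import json


def normalize_zip(raw):
    """Left-pad the 5-digit part of a ZIP or ZIP+4 with zeros."""
    base, sep, plus4 = raw.strip().partition("-")
    return base.zfill(5) + sep + plus4


def make_alpaca_record(address, components):
    """Build one Alpaca-style example with instruction, input, and output fields."""
    return {
        "instruction": f"Split and validate the USPS-style address: {address}",
        "input": "",  # assumption: the address lives in the instruction itself
        "output": json.dumps(components, separators=(",", ":")),  # compact JSON
    }


record = make_alpaca_record(
    "55 Brooksby Village Way, Danvers MA 1923",
    {"HouseNumber": "55", "ZipCode": normalize_zip("1923")},  # abbreviated field set
)
```

Writing one such record per line with json.dumps gives a JSONL file of the kind loaded in the upload notes.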

Training Setup

Base model: Qwen/Qwen2.5-0.5B-Instruct
Fine-tuning method: LoRA with PEFT
Training format: Alpaca-style instruction tuning
Core libraries: transformers, peft, torch, tiktoken
Default Colab target: T4 GPU or better
LoRA target modules: q_proj, k_proj, v_proj, o_proj

Evaluation

The notebook reports:

JSON parse rate
exact normalized match rate
component-level accuracy
ZIP accuracy
USPS-style validity agreement
validity confusion matrix
malformed-output inspection
training loss by optimizer step
validation loss by optimizer step
approximate examples seen per logged loss

Fill in your final numbers after training:

Metric	Value
JSON parse rate	TODO
Exact normalized match rate	TODO
ZIP accuracy	TODO
USPS-style validity agreement	TODO
Test examples evaluated	TODO
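
For readers who want to reproduce a few of these numbers outside the notebook, the sketch below computes JSON parse rate, exact match rate, and per-field accuracy from raw model strings and gold dictionaries. It is an illustration under stated assumptions: the function name evaluate is hypothetical, and fields are compared as-is, so any extra normalization the notebook applies before matching is not reproduced here.

```python
import json


def evaluate(predictions, references):
    """predictions: raw model output strings; references: gold dicts, same order."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty prediction and reference lists")
    parsed_ok = 0
    exact = 0
    field_hits = {}
    for raw, gold in zip(predictions, references):
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            pred = None
        if not isinstance(pred, dict):
            pred = None  # valid JSON that is not an object counts as a parse failure
        if pred is not None:
            parsed_ok += 1
            exact += int(pred == gold)
        for key, value in gold.items():
            hit = pred is not None and pred.get(key) == value
            field_hits[key] = field_hits.get(key, 0) + int(hit)
    n = len(references)
    return {
        "json_parse_rate": parsed_ok / n,
        "exact_match_rate": exact / n,
        "component_accuracy": {k: v / n for k, v in field_hits.items()},
    }
```

ZIP accuracy is then the "ZipCode" entry of component_accuracy, and validity agreement is the "IsValidUSPSStyle" entry.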

Inference Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-0.5B-Instruct"
adapter_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(adapter_id)
base = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()
model.config.use_cache = True

prompt = """<|im_start|>system
You are a USPS-style US address parser. Return only valid compact JSON with the requested fields.<|im_end|>
<|im_start|>user
### Instruction:
Split and validate the USPS-style address: 1600 Pennsylvania Ave NW, Washington DC 20500<|im_end|>
<|im_start|>assistant
"""

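# Tokenize and place the tensors on the same device as the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=180,
        do_sample=False,  # greedy decoding for deterministic output
        use_cache=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens; skipping special tokens drops the
# trailing <|im_end|> marker from the printed text.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True))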
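
To parse addresses other than the one hard-coded above, the prompt can be produced by a small helper. This sketch reproduces the ChatML layout of the example prompt exactly; the name build_prompt is hypothetical and not part of the repository.

```python
# System message copied from the inference example in this model card.
SYSTEM_PROMPT = (
    "You are a USPS-style US address parser. "
    "Return only valid compact JSON with the requested fields."
)


def build_prompt(address):
    """Wrap one address in the same ChatML layout as the example prompt."""
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        "<|im_start|>user\n"
        "### Instruction:\n"
        f"Split and validate the USPS-style address: {address}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt("1600 Pennsylvania Ave NW, Washington DC 20500")
```

Keeping the layout identical to the training prompt matters for a small fine-tuned model, since it has only seen this one instruction format.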

Upload Notes

After training in the Colab notebook, log in and push the trained adapter and tokenizer to the Hub:

from huggingface_hub import login
login()

model.push_to_hub("YOUR_USERNAME/us-address-parser-lora")
tokenizer.push_to_hub("YOUR_USERNAME/us-address-parser-lora")

If you want the dataset in a separate repository:

from datasets import Dataset
dataset = Dataset.from_json("address_alpaca_10k.jsonl")
dataset.push_to_hub("YOUR_USERNAME/us-address-alpaca-10k")

Limitations

The model is not an authoritative deliverability validator.
Synthetic training data may not represent all real-world edge cases.
Outputs should be parsed and validated downstream before use.
Production systems should include deterministic post-processing and external address validation.
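
As one possible form of deterministic post-processing, the sketch below runs a few structural checks on a parsed record. The rules (two-letter uppercase state, 5-digit ZIP or ZIP+4, non-empty house number and street name) and the name structural_problems are illustrative assumptions, not the notebook's validation logic, and they say nothing about deliverability.

```python
import re

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")  # 5-digit ZIP or ZIP+4
STATE_RE = re.compile(r"[A-Z]{2}")      # format only; not checked against a state list


def structural_problems(record):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if not str(record.get("HouseNumber", "")).strip():
        problems.append("missing house number")
    if not str(record.get("StreetName", "")).strip():
        problems.append("missing street name")
    if not STATE_RE.fullmatch(str(record.get("State", ""))):
        problems.append("state is not a two-letter uppercase code")
    if not ZIP_RE.fullmatch(str(record.get("ZipCode", ""))):
        problems.append("ZIP is not 5 digits or ZIP+4")
    return problems
```

Records that pass these checks can then be sent to an external address-validation provider for the authoritative answer.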

Responsible Use

This model is intended for address parsing, normalization assistance, data cleaning, and workflow prototyping. Do not rely on it as the sole source of truth for mailing, compliance, fraud detection, or other high-impact decisions.
