---
license: apache-2.0
base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
  - address-parsing
  - us-addresses
  - usps-style
  - alpaca
  - lora
  - peft
  - qwen2.5
language:
  - en
---

# US Address Parser LoRA - USPS-Style Component Extraction

This repository contains a LoRA adapter fine-tuned to parse US street addresses into structured JSON components.

The model was fine-tuned from `Qwen/Qwen2.5-0.5B-Instruct` using Alpaca-style instruction examples. The task is to split a full US address into components such as house number, street name, street suffix, directional fields, apartment/unit, city, state, and ZIP code.

## Intended Use

Given a US address instruction like:

```text
Split and validate the USPS-style address: 55 Brooksby Village Way, Danvers MA 1923
```

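The model is trained to return compact JSON (pretty-printed here for readability). Note that the ZIP code `1923` in the input is normalized to `01923`, restoring the dropped leading zero: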

```json
{
  "HouseNumber": "55",
  "StreetPreDirection": "",
  "StreetName": "BROOKSBY VILLAGE",
  "StreetSuffix": "WAY",
  "StreetPostDirection": "",
  "Apt": "",
  "City": "DANVERS",
  "State": "MA",
  "ZipCode": "01923",
  "IsValidUSPSStyle": true,
  "ValidationNotes": ""
}
```
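
Because the model emits free-form text, a caller may want to confirm that a response parses as JSON and carries the eleven fields shown above before using it. The following is a minimal sketch, assuming the field names in the example are the full schema; `parse_address_json` and `EXPECTED_FIELDS` are hypothetical names and are not part of this repository.

```python
import json

# Field names copied from the example output above (assumed to be the full schema).
EXPECTED_FIELDS = (
    "HouseNumber", "StreetPreDirection", "StreetName", "StreetSuffix",
    "StreetPostDirection", "Apt", "City", "State", "ZipCode",
    "IsValidUSPSStyle", "ValidationNotes",
)


def parse_address_json(raw: str) -> dict:
    """Parse model output; raise ValueError if it does not match the schema."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("output is not a JSON object")
    missing = [name for name in EXPECTED_FIELDS if name not in parsed]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(parsed["IsValidUSPSStyle"], bool):
        raise ValueError("IsValidUSPSStyle must be a boolean")
    return parsed


# Compact form of the example response shown above.
example = (
    '{"HouseNumber":"55","StreetPreDirection":"","StreetName":"BROOKSBY VILLAGE",'
    '"StreetSuffix":"WAY","StreetPostDirection":"","Apt":"","City":"DANVERS",'
    '"State":"MA","ZipCode":"01923","IsValidUSPSStyle":true,"ValidationNotes":""}'
)
address = parse_address_json(example)
print(address["StreetName"], address["ZipCode"])
```

Raising on a missing field, instead of filling in defaults, keeps malformed outputs visible to the caller.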

## Important Validation Note

This model performs **USPS-style structural parsing and normalization**. It is not a USPS-certified or CASS-certified address validation system.

For production use, pair the model output with an authoritative USPS, CASS-certified, or licensed address-validation provider to confirm deliverability.

## Training Data

The included Colab workflow generates approximately 10,000 Alpaca-format synthetic examples covering all 50 US states plus Washington, DC.

The generated dataset includes:

- pre-directionals and post-directionals
- apartments, units, suites, floors, and `#` unit markers
- 5-digit ZIP codes and ZIP+4
- leading-zero ZIP normalization
- multi-word street names
- numbered streets
- valid and intentionally invalid structural examples

Synthetic data is useful for learning schema and formatting behavior, but real labeled address data should be added for production-quality performance.
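
For orientation, one training record in Alpaca format might look like the sketch below. It is not the notebook's generator. It assumes the standard Alpaca keys (`instruction`, `input`, `output`), an empty `input` field, and the instruction wording from the Intended Use example. `to_alpaca_record` is a hypothetical helper.

```python
import json

# Instruction wording taken from the Intended Use example.
INSTRUCTION_PREFIX = "Split and validate the USPS-style address: "


def to_alpaca_record(address_text: str, components: dict) -> dict:
    """Build one Alpaca-style record whose output is compact JSON."""
    return {
        "instruction": INSTRUCTION_PREFIX + address_text,
        "input": "",  # assumption: the address is carried in the instruction
        "output": json.dumps(components, separators=(",", ":")),
    }


components = {
    "HouseNumber": "55",
    "StreetPreDirection": "",
    "StreetName": "BROOKSBY VILLAGE",
    "StreetSuffix": "WAY",
    "StreetPostDirection": "",
    "Apt": "",
    "City": "DANVERS",
    "State": "MA",
    "ZipCode": "01923",
    "IsValidUSPSStyle": True,
    "ValidationNotes": "",
}
record = to_alpaca_record("55 Brooksby Village Way, Danvers MA 1923", components)

# One line of a JSONL dataset file.
print(json.dumps(record))
```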

## Training Setup

- Base model: `Qwen/Qwen2.5-0.5B-Instruct`
- Fine-tuning method: LoRA with PEFT
- Training format: Alpaca-style instruction tuning
- Core libraries: `transformers`, `peft`, `torch`, `tiktoken`
- Default Colab target: T4 GPU or better
- LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`

## Evaluation

The notebook reports:

- JSON parse rate
- exact normalized match rate
- component-level accuracy
- ZIP accuracy
- USPS-style validity agreement
- validity confusion matrix
- malformed-output inspection
- training loss by optimizer step
- validation loss by optimizer step
- approximate examples seen per logged loss
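
The first few metrics can be approximated with a short scoring loop. This is a sketch under stated assumptions, not the notebook's evaluation code: "normalized" is taken here to mean stripping whitespace and upper-casing string values, and `normalize` and `score` are hypothetical names.

```python
import json


def normalize(record: dict) -> dict:
    """Assumed normalization: strip and upper-case every string value."""
    return {
        key: value.strip().upper() if isinstance(value, str) else value
        for key, value in record.items()
    }


def score(predictions: list[str], references: list[dict]) -> dict:
    """Compute JSON parse rate, exact normalized match rate, and ZIP accuracy."""
    parsed_ok = exact = zip_ok = 0
    for raw, reference in zip(predictions, references):
        try:
            predicted = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against every metric
        if not isinstance(predicted, dict):
            continue
        parsed_ok += 1
        exact += normalize(predicted) == normalize(reference)
        zip_ok += predicted.get("ZipCode") == reference.get("ZipCode")
    total = max(len(references), 1)  # avoid division by zero
    return {
        "json_parse_rate": parsed_ok / total,
        "exact_normalized_match_rate": exact / total,
        "zip_accuracy": zip_ok / total,
    }


predictions = ['{"ZipCode":"01923","State":"ma"}', "not json"]
references = [
    {"ZipCode": "01923", "State": "MA"},
    {"ZipCode": "20500", "State": "DC"},
]
metrics = score(predictions, references)
print(metrics)
```

Unparseable outputs are counted as misses for every metric, so the match rates cannot exceed the parse rate.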

Fill in your final numbers after training:

| Metric | Value |
|---|---:|
| JSON parse rate | TODO |
| Exact normalized match rate | TODO |
| ZIP accuracy | TODO |
| USPS-style validity agreement | TODO |
| Test examples evaluated | TODO |

## Inference Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-0.5B-Instruct"
adapter_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

# Use float16 on a GPU when one is available; fall back to float32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(adapter_id)
base = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
# Attach the LoRA adapter weights to the base model.
model = PeftModel.from_pretrained(base, adapter_id)
model.to(device)
model.eval()
model.config.use_cache = True

prompt = """<|im_start|>system
You are a USPS-style US address parser. Return only valid compact JSON with the requested fields.<|im_end|>
<|im_start|>user
### Instruction:
Split and validate the USPS-style address: 1600 Pennsylvania Ave NW, Washington DC 20500<|im_end|>
<|im_start|>assistant
"""

# Keep the inputs on the same device as the model.
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=180,
        do_sample=False,  # greedy decoding for deterministic output
        use_cache=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens; drop special tokens so the result is plain JSON text.
new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
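
When parsing many addresses, it can help to build the prompt with a small function so that the template stays identical across calls. The sketch below reproduces the prompt string from the example above; `build_prompt` and `SYSTEM_PROMPT` are hypothetical names introduced for illustration.

```python
# System message copied from the inference example above.
SYSTEM_PROMPT = (
    "You are a USPS-style US address parser. "
    "Return only valid compact JSON with the requested fields."
)


def build_prompt(address: str) -> str:
    """Wrap one address in the ChatML-style template used in the inference example."""
    return (
        "<|im_start|>system\n"
        f"{SYSTEM_PROMPT}<|im_end|>\n"
        "<|im_start|>user\n"
        "### Instruction:\n"
        f"Split and validate the USPS-style address: {address.strip()}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt("1600 Pennsylvania Ave NW, Washington DC 20500")
print(prompt)
```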

## Upload Notes

After training in the Colab notebook, push the trained adapter and tokenizer to the Hub:

```python
from huggingface_hub import login
login()

model.push_to_hub("YOUR_USERNAME/us-address-parser-lora")
tokenizer.push_to_hub("YOUR_USERNAME/us-address-parser-lora")
```

If you want the dataset in a separate repository:

```python
from datasets import Dataset
dataset = Dataset.from_json("address_alpaca_10k.jsonl")
dataset.push_to_hub("YOUR_USERNAME/us-address-alpaca-10k")
```

## Limitations

- The model is not an authoritative deliverability validator.
- Synthetic training data may not represent all real-world edge cases.
- Outputs should be parsed and validated downstream before use.
- Production systems should include deterministic post-processing and external address validation.
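
As an illustration of deterministic post-processing, the sketch below re-checks two fields with plain string rules after the model has produced its JSON. It covers only structural checks (state code shape, ZIP or ZIP+4 shape) and says nothing about deliverability. `normalize_zip` and `postprocess` are hypothetical helpers, and padding 3- and 4-digit ZIP codes with leading zeros is an assumption.

```python
import re

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")
STATE_RE = re.compile(r"[A-Z]{2}")


def normalize_zip(value: str) -> str:
    """Left-pad a 3- or 4-digit ZIP with zeros; keep any +4 suffix as-is."""
    base, sep, plus4 = value.strip().partition("-")
    if base.isdigit() and len(base) in (3, 4):
        base = base.zfill(5)
    return base + sep + plus4


def postprocess(parsed: dict) -> tuple[dict, list[str]]:
    """Return a cleaned copy of the parsed output plus a list of structural problems."""
    cleaned = dict(parsed)
    cleaned["State"] = str(parsed.get("State", "")).strip().upper()
    cleaned["ZipCode"] = normalize_zip(str(parsed.get("ZipCode", "")))
    problems = []
    if not STATE_RE.fullmatch(cleaned["State"]):
        problems.append("State is not a two-letter code")
    if not ZIP_RE.fullmatch(cleaned["ZipCode"]):
        problems.append("ZipCode is not ZIP or ZIP+4")
    return cleaned, problems


cleaned, problems = postprocess({"State": " ma", "ZipCode": "1923"})
print(cleaned, problems)
```

A record with a non-empty problem list can then be routed to an external validation provider or to manual review.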

## Responsible Use

This model is intended for address parsing, normalization assistance, data cleaning, and workflow prototyping. Do not rely on it as the sole source of truth for mailing, compliance, fraud detection, or other high-impact decisions.