---
license: apache-2.0
metrics:
- accuracy
base_model:
- google/gemma-3-270m
tags:
- unstructured-to-structured-data
- fine-tune
- salary-normalizer
- salary-parser
---
# Salary Normalizer
A fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) model that parses
and standardizes free-form salary text into structured JSON. Given an arbitrary salary
string and a country name, it extracts the currency symbol, ISO 4217 currency code, numeric
range, and pay cadence in a single inference pass.
## Model Details
| Property | Value |
|----------------|-------------------------------|
| Base model | `google/gemma-3-270m` |
| Fine-tune type | Supervised (instruction-tuned)|
| License | Apache 2.0 |
| Task | Structured information extraction |
## Intended Use
- Extract and normalize salary mentions from job descriptions or candidate profiles.
- Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`)
into a consistent schema for downstream analytics or storage.
**Out-of-scope use:** This model is not designed for general text generation or tasks
unrelated to salary parsing.
## Input Format
Prompts must follow this exact template:
```text
<start_of_turn>user
summarize salary: <SALARY_TEXT>
country: <COUNTRY_NAME><end_of_turn>
<start_of_turn>model
```
**Examples of valid salary text:**
- `$60k - $80k per year`
- `INR 12 LPA`
- `€45,000 annually`
- `12 to 12.5 US $ per hr`
Country names must match one of the [supported countries](#supported-countries) listed below.
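
The template above can be filled programmatically. The following is a minimal sketch; `build_prompt` is a hypothetical helper name, not part of the model's released code, and collapsing internal whitespace is an assumption made here so that stray newlines in scraped text cannot break the two-line layout.

```python
def build_prompt(salary_text: str, country: str) -> str:
    """Fill the documented prompt template with one salary string and one country."""
    # Assumption: collapse runs of whitespace (including newlines) to single spaces
    # so the "summarize salary:" and "country:" lines stay on one line each.
    salary_text = " ".join(salary_text.split())
    country = " ".join(country.split())
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary_text}\n"
        f"country: {country}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


print(build_prompt("INR 12 LPA", "India"))
```

The returned string can be passed directly to the tokenizer, as in the Usage section.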
## Output Schema
The model returns a JSON object with the following fields:
```json
{
  "currency": "$",
  "iso_code": "USD",
  "min_amount": 60000,
  "max_amount": 80000,
  "pay_rate": "ANNUALLY"
}
```
| Field | Type | Description |
|--------------|---------------|--------------------------------------------------------------|
| `currency` | `string` | Raw currency symbol or string as it appears in the input |
| `iso_code` | `string` | Standardized ISO 4217 currency code |
| `min_amount` | `int / float` | Lower bound of the salary range (annualized or as stated) |
| `max_amount` | `int / float` | Upper bound of the salary range (annualized or as stated) |
| `pay_rate` | `string` | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` |
> **Note:** `min_amount` and `max_amount` reflect normalized numeric values,
> not raw token extractions. For single-value salaries, both fields will hold the same value.
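
Because the output is generated text, it may be worth validating it against the schema before storing it. The sketch below uses only the standard library; `parse_salary_output` and `PAY_RATES` are hypothetical names introduced for illustration, and the `min_amount <= max_amount` check is an assumption derived from the "lower bound" and "upper bound" field descriptions rather than a documented guarantee.

```python
import json

# Allowed values for `pay_rate`, copied from the schema table.
PAY_RATES = {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}


def parse_salary_output(raw: str) -> dict:
    """Parse decoded model text and check it against the documented schema."""
    record = json.loads(raw.strip())  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    for key in ("currency", "iso_code", "pay_rate"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"{key} must be a string")
    for key in ("min_amount", "max_amount"):
        value = record.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number")
    if record["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unexpected pay_rate: {record['pay_rate']}")
    # Assumption: a well-formed range never has its bounds inverted.
    if record["min_amount"] > record["max_amount"]:
        raise ValueError("min_amount exceeds max_amount")
    return record


raw = '{"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}'
record = parse_salary_output(raw)
print(record["iso_code"], record["min_amount"], record["max_amount"])
```

Every failure surfaces as a `ValueError`, so a caller can catch one exception type and route the offending input to manual review.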
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Draup/salary-normalizer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

salary_text = "12 to 12.5 US $ per hr"
country = "United States"

# Build the prompt exactly as described under "Input Format".
prompt = (
    "<start_of_turn>user\n"
    f"summarize salary: {salary_text}\n"
    f"country: {country}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

# Greedy decoding; stop as soon as the model closes its turn.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>"),
)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(result)
# {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
```
## Supported Countries
The model supports salary parsing for the following 49 countries:
| | | | |
|---|---|---|---|
| Argentina | Australia | Austria | Belgium |
| Brazil | Canada | Chile | China |
| Colombia | Czechia | Denmark | Egypt |
| Finland | France | Germany | Hong Kong |
| Hungary | India | Indonesia | Ireland |
| Israel | Italy | Japan | Malaysia |
| Mexico | Netherlands | New Zealand | Norway |
| Pakistan | Peru | Philippines | Poland |
| Portugal | Romania | Russia | Saudi Arabia |
| Singapore | South Africa | South Korea | Spain |
| Sweden | Switzerland | Taiwan | Thailand |
| Turkey | United Arab Emirates | United Kingdom | United States |
| Vietnam | | | |
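
Since country names must match this list, a caller may want to check the name before building a prompt. The following is a minimal sketch; `SUPPORTED_COUNTRIES` and `is_supported` are hypothetical names, and the exact, case-sensitive comparison is an assumption, as the card does not say whether the model tolerates spelling or casing variants.

```python
# Country names copied verbatim from the table above.
SUPPORTED_COUNTRIES = frozenset({
    "Argentina", "Australia", "Austria", "Belgium",
    "Brazil", "Canada", "Chile", "China",
    "Colombia", "Czechia", "Denmark", "Egypt",
    "Finland", "France", "Germany", "Hong Kong",
    "Hungary", "India", "Indonesia", "Ireland",
    "Israel", "Italy", "Japan", "Malaysia",
    "Mexico", "Netherlands", "New Zealand", "Norway",
    "Pakistan", "Peru", "Philippines", "Poland",
    "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain",
    "Sweden", "Switzerland", "Taiwan", "Thailand",
    "Turkey", "United Arab Emirates", "United Kingdom", "United States",
    "Vietnam",
})


def is_supported(country: str) -> bool:
    """Return True if the name matches a supported country exactly (case-sensitive)."""
    return country.strip() in SUPPORTED_COUNTRIES


print(is_supported("United States"), is_supported("Czech Republic"))
```

Under this strict comparison, aliases such as "Czech Republic" or "USA" are rejected and would need to be mapped to the listed names upstream.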
## Limitations
- Performance may degrade on salary formats not well-represented in training data.
- Country context is used for currency disambiguation; incorrect country input may
produce inaccurate `iso_code` or `currency` values.
- The model is not multilingual: salary text is expected to be in English or use
standard numeric/symbol conventions.