---
license: apache-2.0
metrics:
- accuracy
base_model:
- google/gemma-3-270m
tags:
- unstructured-to-structured-data
- fine-tune
- salary-normalizer
- salary-parser
---

# Salary Normalizer

A fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) model that parses
and standardizes free-form salary text into structured JSON. Given an arbitrary salary
string and a country name, it extracts currency symbol, ISO code, numeric range, and
pay cadence in a single inference pass.

## Model Details

| Property       | Value                             |
|----------------|-----------------------------------|
| Base model     | `google/gemma-3-270m`             |
| Fine-tune type | Supervised (instruction-tuned)    |
| License        | Apache 2.0                        |
| Task           | Structured information extraction |

## Intended Use

- Extract and normalize salary mentions from job descriptions or candidate profiles.
- Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`)
  into a consistent schema for downstream analytics or storage.

**Out-of-scope use:** This model is not designed for general text generation or tasks
unrelated to salary parsing.


## Input Format

Prompts must follow this exact template, with `<SALARY_TEXT>` and `<COUNTRY_NAME>` replaced by the salary string and the country name:
```text
<start_of_turn>user
summarize salary: <SALARY_TEXT>
country: <COUNTRY_NAME><end_of_turn>
<start_of_turn>model
```

**Examples of valid salary text:**
- `$60k - $80k per year`
- `INR 12 LPA`
- `€45,000 annually`
- `12 to 12.5 US $ per hr`

Country names must match one of the [supported countries](#supported-countries) listed below.
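
Because the template is whitespace-sensitive, it can help to build prompts through a single helper rather than by hand. The sketch below is one way to do that; `build_prompt` is a hypothetical name, and collapsing internal whitespace is an assumption made here for safety, not documented model behavior.

```python
def build_prompt(salary_text: str, country: str) -> str:
    """Fill the prompt template with a salary string and a country name."""
    # Assumption: collapse runs of whitespace (including newlines) so that a
    # multi-line salary string cannot break the two-line template layout.
    salary_text = " ".join(salary_text.split())
    country = " ".join(country.split())
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary_text}\n"
        f"country: {country}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_prompt("INR 12 LPA", "India")
print(prompt)
```

The returned string ends with a newline after `<start_of_turn>model`, matching the template.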


## Output Schema

The model returns a JSON object with the following fields:
```json
{
  "currency": "$",
  "iso_code": "USD",
  "min_amount": 60000,
  "max_amount": 80000,
  "pay_rate": "ANNUALLY"
}
```

| Field        | Type          | Description                                                  |
|--------------|---------------|--------------------------------------------------------------|
| `currency`   | `string`      | Raw currency symbol or string as it appears in the input     |
| `iso_code`   | `string`      | Standardized ISO 4217 currency code                          |
| `min_amount` | `int / float` | Lower bound of the salary range (annualized or as stated)    |
| `max_amount` | `int / float` | Upper bound of the salary range (annualized or as stated)    |
| `pay_rate`   | `string`      | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` |

> **Note:** `min_amount` and `max_amount` reflect normalized numeric values,
> not raw token extractions. For single-value salaries, both fields will hold the same value.
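
Since the output is generated text, downstream code may want to validate it against this schema before storing it. The following is a minimal sketch, not part of the model's tooling: `Salary` and `parse_salary` are hypothetical names, and it assumes all five fields are always present and non-null, which the card does not state explicitly.

```python
import json
from dataclasses import dataclass
from typing import Union

# Allowed values, copied from the `pay_rate` row of the schema table.
PAY_RATES = frozenset(
    {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}
)
FIELDS = ("currency", "iso_code", "min_amount", "max_amount", "pay_rate")


@dataclass(frozen=True)
class Salary:
    currency: str
    iso_code: str
    min_amount: Union[int, float]
    max_amount: Union[int, float]
    pay_rate: str


def parse_salary(raw: str) -> Salary:
    """Parse model output into a Salary, raising ValueError on bad input."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name in ("currency", "iso_code", "pay_rate"):
        if not isinstance(data[name], str):
            raise ValueError(f"{name} must be a string")
    for name in ("min_amount", "max_amount"):
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number")
    if data["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unknown pay_rate: {data['pay_rate']!r}")
    if data["min_amount"] > data["max_amount"]:
        raise ValueError("min_amount exceeds max_amount")
    return Salary(**{name: data[name] for name in FIELDS})


salary = parse_salary(
    '{"currency": "$", "iso_code": "USD", "min_amount": 60000, '
    '"max_amount": 80000, "pay_rate": "ANNUALLY"}'
)
print(salary.iso_code, salary.min_amount, salary.max_amount)
```

Raising `ValueError` for every failure mode keeps the caller's error handling to a single `except` clause; extra keys in the output are ignored rather than rejected.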

## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub.
model_name = "Draup/salary-normalizer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

salary_text = "12 to 12.5 US $ per hr"
country = "United States"

# Build the prompt exactly as described in the Input Format section.
prompt = (
    "<start_of_turn>user\n"
    f"summarize salary: {salary_text}\n"
    f"country: {country}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

# Greedy decoding; generation stops at the end-of-turn token.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>"),
)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(result)
# {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
```

## Supported Countries

The model supports salary parsing for the following 49 countries:

| | | | |
|---|---|---|---|
| Argentina | Australia | Austria | Belgium |
| Brazil | Canada | Chile | China |
| Colombia | Czechia | Denmark | Egypt |
| Finland | France | Germany | Hong Kong |
| Hungary | India | Indonesia | Ireland |
| Israel | Italy | Japan | Malaysia |
| Mexico | Netherlands | New Zealand | Norway |
| Pakistan | Peru | Philippines | Poland |
| Portugal | Romania | Russia | Saudi Arabia |
| Singapore | South Africa | South Korea | Spain |
| Sweden | Switzerland | Taiwan | Thailand |
| Turkey | United Arab Emirates | United Kingdom | United States |
| Vietnam | | | |
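
Callers may want to reject unsupported country names before running inference. The sketch below copies the list above into a set and maps user input onto the canonical spelling; `normalize_country` is a hypothetical helper, and matching case-insensitively is a convenience assumed here, since the card does not say how the model reacts to other casings.

```python
# The 49 country names from the table above, in their canonical spelling.
SUPPORTED_COUNTRIES = frozenset({
    "Argentina", "Australia", "Austria", "Belgium", "Brazil", "Canada",
    "Chile", "China", "Colombia", "Czechia", "Denmark", "Egypt", "Finland",
    "France", "Germany", "Hong Kong", "Hungary", "India", "Indonesia",
    "Ireland", "Israel", "Italy", "Japan", "Malaysia", "Mexico",
    "Netherlands", "New Zealand", "Norway", "Pakistan", "Peru",
    "Philippines", "Poland", "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain", "Sweden",
    "Switzerland", "Taiwan", "Thailand", "Turkey", "United Arab Emirates",
    "United Kingdom", "United States", "Vietnam",
})

# Lookup table from a case-folded name to the canonical spelling.
_CANONICAL = {name.casefold(): name for name in SUPPORTED_COUNTRIES}


def normalize_country(name: str) -> str:
    """Return the canonical country name, or raise ValueError if unsupported."""
    cleaned = " ".join(name.split())  # trim and collapse whitespace
    try:
        return _CANONICAL[cleaned.casefold()]
    except KeyError:
        raise ValueError(f"unsupported country: {name!r}") from None


print(normalize_country("  united   states "))
```

Passing the canonical spelling to the model keeps inputs consistent with the names listed in this section.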

## Limitations

- Performance may degrade on salary formats that are not well represented in the training data.
- Country context is used for currency disambiguation; an incorrect country input may
  produce inaccurate `iso_code` or `currency` values.
- The model is not multilingual: salary text is expected to be in English or to use
  standard numeric/symbol conventions.
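
Given the dependence on country context, one inexpensive safeguard is to compare the returned `iso_code` with the currency normally used in the supplied country and flag disagreements for review. This is only a sketch: `EXPECTED_ISO` is a hypothetical, caller-maintained mapping (a small illustrative subset is shown), and a mismatch is not necessarily an error, because a salary can legitimately be quoted in a foreign currency.

```python
from typing import Optional

# Hypothetical, caller-maintained mapping; only an illustrative subset.
EXPECTED_ISO = {
    "United States": "USD",
    "India": "INR",
    "Germany": "EUR",
    "United Kingdom": "GBP",
    "Japan": "JPY",
}


def iso_matches_country(iso_code: str, country: str) -> Optional[bool]:
    """Compare a predicted ISO 4217 code with the country's usual currency.

    Returns True or False when the country is in EXPECTED_ISO, and None
    when there is no expectation to compare against.
    """
    expected = EXPECTED_ISO.get(country)
    if expected is None:
        return None
    return iso_code.strip().upper() == expected


print(iso_matches_country("USD", "United States"))
print(iso_matches_country("USD", "India"))
print(iso_matches_country("BRL", "Brazil"))
```

Treating `False` as "needs review" rather than "wrong" avoids discarding valid results such as a USD salary advertised for a role in India.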