---
license: apache-2.0
metrics:
- accuracy
base_model:
- google/gemma-3-270m
tags:
- unstructured-to-structured-data
- fine-tune
- salary-normalizer
- salary-parser
---
# Salary Normalizer
A fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) model that parses
and standardizes free-form salary text into structured JSON. Given an arbitrary salary
string and a country name, it extracts currency symbol, ISO code, numeric range, and
pay cadence in a single inference pass.
## Model Details
| Property | Value |
|----------------|-------------------------------|
| Base model | `google/gemma-3-270m` |
| Fine-tune type | Supervised (instruction-tuned)|
| License | Apache 2.0 |
| Task | Structured information extraction |
## Intended Use
- Extract and normalize salary mentions from job descriptions or candidate profiles.
- Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`)
  into a consistent schema for downstream analytics or storage.

**Out-of-scope use:** This model is not designed for general text generation or tasks
unrelated to salary parsing.
## Input Format
Prompts must follow this exact template:
```text
<start_of_turn>user
summarize salary: <SALARY_TEXT>
country: <COUNTRY_NAME><end_of_turn>
<start_of_turn>model
```
**Examples of valid salary text:**
- `$60k - $80k per year`
- `INR 12 LPA`
- `€45,000 annually`
- `12 to 12.5 US $ per hr`

Country names must match one of the [supported countries](#supported-countries) listed below.
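
The template above can be filled in with a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, not something shipped with the model, and stripping surrounding whitespace from the inputs is an assumption rather than a documented requirement.

```python
def build_prompt(salary_text: str, country: str) -> str:
    """Fill the documented prompt template with a salary string and a country name.

    Hypothetical helper for illustration; whitespace stripping is an assumption.
    """
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary_text.strip()}\n"
        f"country: {country.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


print(build_prompt("INR 12 LPA", "India"))
```

Keeping the template in one place avoids small drifts (a missing newline or turn marker) between call sites.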
## Output Schema
The model returns a JSON object with the following fields:
```json
{
  "currency": "$",
  "iso_code": "USD",
  "min_amount": 60000,
  "max_amount": 80000,
  "pay_rate": "ANNUALLY"
}
```
| Field | Type | Description |
|--------------|---------------|--------------------------------------------------------------|
| `currency` | `string` | Raw currency symbol or string as it appears in the input |
| `iso_code` | `string` | Standardized ISO 4217 currency code |
| `min_amount` | `int / float` | Lower bound of the salary range (annualized or as stated) |
| `max_amount` | `int / float` | Upper bound of the salary range (annualized or as stated) |
| `pay_rate` | `string` | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` |
> **Note:** `min_amount` and `max_amount` hold normalized numeric values rather than
> raw token extractions (for example, `$60k` is emitted as `60000`). For single-value
> salaries, both fields hold the same value.
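
Because the output is generated text, it may be worth validating it against the schema before storing it. The sketch below is one possible check, not part of the model's tooling: `parse_salary_output`, `REQUIRED_FIELDS`, and `PAY_RATES` are hypothetical names, and the `min_amount <= max_amount` rule is an assumption layered on top of the documented fields.

```python
import json

# Field names and pay-rate values are taken from the schema table above.
REQUIRED_FIELDS = ("currency", "iso_code", "min_amount", "max_amount", "pay_rate")
PAY_RATES = {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}


def parse_salary_output(raw: str) -> dict:
    """Parse the model's JSON text and check it against the documented schema."""
    record = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")

    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    if record["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unexpected pay_rate: {record['pay_rate']!r}")

    for key in ("min_amount", "max_amount"):
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be numeric, got {value!r}")

    # Assumption: the lower bound should never exceed the upper bound.
    if record["min_amount"] > record["max_amount"]:
        raise ValueError("min_amount exceeds max_amount")

    return record


sample = '{"currency": "$", "iso_code": "USD", "min_amount": 60000, "max_amount": 80000, "pay_rate": "ANNUALLY"}'
print(parse_salary_output(sample)["iso_code"])
```

Raising on the first violation keeps malformed generations out of downstream storage; a pipeline that prefers to log and continue could catch `ValueError` at the call site instead.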
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Draup/salary-normalizer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

salary_text = "12 to 12.5 US $ per hr"
country = "United States"

# Build the prompt using the exact template from "Input Format".
prompt = (
    "<start_of_turn>user\n"
    f"summarize salary: {salary_text}\n"
    f"country: {country}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

# Greedy decoding; generation stops when the model closes its turn.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>"),
)

# Decode only the newly generated tokens, skipping the prompt tokens.
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(result)
# {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
```
## Supported Countries
The model supports salary parsing for the following 49 countries:
| | | | |
|---|---|---|---|
| Argentina | Australia | Austria | Belgium |
| Brazil | Canada | Chile | China |
| Colombia | Czechia | Denmark | Egypt |
| Finland | France | Germany | Hong Kong |
| Hungary | India | Indonesia | Ireland |
| Israel | Italy | Japan | Malaysia |
| Mexico | Netherlands | New Zealand | Norway |
| Pakistan | Peru | Philippines | Poland |
| Portugal | Romania | Russia | Saudi Arabia |
| Singapore | South Africa | South Korea | Spain |
| Sweden | Switzerland | Taiwan | Thailand |
| Turkey | United Arab Emirates | United Kingdom | United States |
| Vietnam | | | |
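
Since country names must match this list, a caller may want to check the name before building a prompt. The following is a minimal sketch: `SUPPORTED_COUNTRIES` and `check_country` are hypothetical names, and treating the match as exact and case-sensitive is an assumption, since the card does not say how near-matches are handled.

```python
# Country names copied from the table above (49 entries).
SUPPORTED_COUNTRIES = frozenset({
    "Argentina", "Australia", "Austria", "Belgium",
    "Brazil", "Canada", "Chile", "China",
    "Colombia", "Czechia", "Denmark", "Egypt",
    "Finland", "France", "Germany", "Hong Kong",
    "Hungary", "India", "Indonesia", "Ireland",
    "Israel", "Italy", "Japan", "Malaysia",
    "Mexico", "Netherlands", "New Zealand", "Norway",
    "Pakistan", "Peru", "Philippines", "Poland",
    "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain",
    "Sweden", "Switzerland", "Taiwan", "Thailand",
    "Turkey", "United Arab Emirates", "United Kingdom", "United States",
    "Vietnam",
})


def check_country(country: str) -> str:
    """Return the trimmed country name, or raise if it is not in the supported list.

    Assumption: matching is exact and case-sensitive.
    """
    name = country.strip()
    if name not in SUPPORTED_COUNTRIES:
        raise ValueError(f"unsupported country: {name!r}")
    return name


print(check_country("United States"))
```

Rejecting unknown names up front is cheaper than discovering a wrong `iso_code` after inference.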
## Limitations
- Performance may degrade on salary formats not well-represented in training data.
- Country context is used for currency disambiguation; incorrect country input may
produce inaccurate `iso_code` or `currency` values.
- The model is not multilingual — salary text is expected to be in English or use
standard numeric/symbol conventions.