| --- |
| license: apache-2.0 |
| metrics: |
| - accuracy |
| base_model: |
| - google/gemma-3-270m |
| tags: |
| - unstructured-to-structured-data |
| - fine-tune |
| - salary-normalizer |
| - salary-parser |
| --- |
| |
| # Salary Normalizer |
|
|
| A fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) model that parses |
| and standardizes free-form salary text into structured JSON. Given an arbitrary salary |
| string and a country name, it extracts the currency symbol, ISO 4217 currency code, |
| numeric range, and pay cadence in a single inference pass. |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |----------------|-------------------------------| |
| | Base model | `google/gemma-3-270m` | |
| Fine-tune type | Supervised (instruction-tuned) |
| | License | Apache 2.0 | |
| | Task | Structured information extraction | |
|
|
| ## Intended Use |
|
|
| - Extract and normalize salary mentions from job descriptions or candidate profiles. |
| - Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`) |
| into a consistent schema for downstream analytics or storage. |
|
|
| **Out-of-scope use:** This model is not designed for general text generation or tasks |
| unrelated to salary parsing. |
|
|
|
|
| ## Input Format |
|
|
| Prompts must follow this exact template: |
| ```text |
| <start_of_turn>user |
| summarize salary: <SALARY_TEXT> |
| country: <COUNTRY_NAME><end_of_turn> |
| <start_of_turn>model |
| ``` |
|
|
| **Examples of valid salary text:** |
| - `$60k - $80k per year` |
| - `INR 12 LPA` |
| - `€45,000 annually` |
| - `12 to 12.5 US $ per hr` |
|
|
| Country names must match one of the [supported countries](#supported-countries) listed below. |
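
As a minimal sketch, the template above can be assembled with a small helper. The name `build_prompt` is hypothetical and is not part of the released model or of `transformers`. Collapsing stray whitespace is an assumption about sensible preprocessing, not a documented requirement.

```python
def build_prompt(salary_text: str, country: str) -> str:
    """Format one request using the prompt template from this section.

    Hypothetical helper for illustration only.
    """
    # Assumption: collapse newlines and repeated spaces so the input
    # cannot break the two-line "summarize salary / country" layout.
    salary_text = " ".join(salary_text.split())
    country = " ".join(country.split())
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary_text}\n"
        f"country: {country}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_prompt("  INR 12 LPA ", "India")
print(prompt)
```

The returned string can be passed directly to the tokenizer, which is expected to add the beginning-of-sequence token itself.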
|
|
|
|
| ## Output Schema |
|
|
| The model returns a JSON object with the following fields: |
| ```json |
| { |
| "currency": "$", |
| "iso_code": "USD", |
| "min_amount": 60000, |
| "max_amount": 80000, |
| "pay_rate": "ANNUALLY" |
| } |
| ``` |
|
|
| | Field | Type | Description | |
| |--------------|---------------|--------------------------------------------------------------| |
| | `currency` | `string` | Raw currency symbol or string as it appears in the input | |
| | `iso_code` | `string` | Standardized ISO 4217 currency code | |
| | `min_amount` | `int / float` | Lower bound of the salary range (annualized or as stated) | |
| | `max_amount` | `int / float` | Upper bound of the salary range (annualized or as stated) | |
| | `pay_rate` | `string` | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` | |
|
|
| > **Note:** `min_amount` and `max_amount` hold normalized numeric values rather than raw |
| > token extractions; for example, `$60k` becomes `60000`. For single-value salaries, both |
| > fields hold the same value. |
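
Because the output is generated text, it may be worth validating it before storing it. The following is a minimal sketch of such a check against the schema table above. `parse_salary_output` is a hypothetical helper, and the assumption that both amounts are always present as numbers is not confirmed by the model card.

```python
import json

# Allowed values copied from the `pay_rate` row of the schema table.
PAY_RATES = {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}
REQUIRED_FIELDS = ("currency", "iso_code", "min_amount", "max_amount", "pay_rate")


def parse_salary_output(raw: str) -> dict:
    """Parse the model's text output and check it against the documented schema."""
    record = json.loads(raw.strip())  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name in ("min_amount", "max_amount"):
        value = record[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number, got {value!r}")
    if record["min_amount"] > record["max_amount"]:
        raise ValueError("min_amount is greater than max_amount")
    if record["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unexpected pay_rate: {record['pay_rate']!r}")
    return record


example = (
    '{"currency": "$", "iso_code": "USD", "min_amount": 60000, '
    '"max_amount": 80000, "pay_rate": "ANNUALLY"}'
)
print(parse_salary_output(example)["iso_code"])  # USD
```

Raising `ValueError` for every failure mode keeps the caller's error handling to a single `except` clause.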
|
|
| ## Usage |
| ```python |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| |
| # Load the fine-tuned tokenizer and model from the Hugging Face Hub. |
| model_name = "Draup/salary-normalizer" |
| tokenizer = AutoTokenizer.from_pretrained(model_name) |
| model = AutoModelForCausalLM.from_pretrained(model_name) |
| |
| salary_text = "12 to 12.5 US $ per hr" |
| country = "United States" |
| |
| # Build the prompt with the exact template from "Input Format". |
| prompt = ( |
|     "<start_of_turn>user\n" |
|     f"summarize salary: {salary_text}\n" |
|     f"country: {country}<end_of_turn>\n" |
|     "<start_of_turn>model\n" |
| ) |
| |
| inputs = tokenizer(prompt, return_tensors="pt", truncation=True) |
| |
| # Greedy decoding; generation stops at the end of the model turn. |
| outputs = model.generate( |
|     **inputs, |
|     max_new_tokens=64, |
|     do_sample=False, |
|     eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>"), |
| ) |
| |
| # Decode only the newly generated tokens, skipping the prompt. |
| result = tokenizer.decode( |
|     outputs[0][inputs["input_ids"].shape[-1]:], |
|     skip_special_tokens=True, |
| ) |
| print(result) |
| # {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"} |
| ``` |
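
For downstream analytics it can help to put all records on a common annual basis. The sketch below assumes the amounts are returned at the cadence named in `pay_rate`, as in the hourly example above. The schema table says "annualized or as stated", so this should be verified on real outputs first. The conversion factors (40-hour weeks, 5-day weeks, 52 weeks per year) and the `annualize` helper are illustrative assumptions, not part of the model.

```python
from typing import Optional

# Assumed periods per year; these factors are NOT specified by the model card.
PERIODS_PER_YEAR = {
    "HOURLY": 2080,    # assumes 40 hours/week * 52 weeks
    "DAILY": 260,      # assumes 5 working days/week * 52 weeks
    "WEEKLY": 52,
    "BI-WEEKLY": 26,
    "MONTHLY": 12,
    "ANNUALLY": 1,
}


def annualize(record: dict) -> Optional[dict]:
    """Scale min/max amounts to a yearly figure, or return None if not convertible."""
    factor = PERIODS_PER_YEAR.get(record["pay_rate"])
    if factor is None:  # e.g. "OTHERS": no sensible conversion
        return None
    return {
        "iso_code": record["iso_code"],
        "min_annual": record["min_amount"] * factor,
        "max_annual": record["max_amount"] * factor,
    }


hourly = {
    "currency": "US $",
    "iso_code": "USD",
    "min_amount": 12,
    "max_amount": 12.5,
    "pay_rate": "HOURLY",
}
print(annualize(hourly))
# {'iso_code': 'USD', 'min_annual': 24960, 'max_annual': 26000.0}
```

No currency conversion is attempted here; records with different `iso_code` values remain in their own currencies.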
|
|
| ## Supported Countries |
|
|
| The model supports salary parsing for the following 49 countries: |
|
|
| | | | | | |
| |---|---|---|---| |
| | Argentina | Australia | Austria | Belgium | |
| | Brazil | Canada | Chile | China | |
| | Colombia | Czechia | Denmark | Egypt | |
| | Finland | France | Germany | Hong Kong | |
| | Hungary | India | Indonesia | Ireland | |
| | Israel | Italy | Japan | Malaysia | |
| | Mexico | Netherlands | New Zealand | Norway | |
| | Pakistan | Peru | Philippines | Poland | |
| | Portugal | Romania | Russia | Saudi Arabia | |
| | Singapore | South Africa | South Korea | Spain | |
| | Sweden | Switzerland | Taiwan | Thailand | |
| | Turkey | United Arab Emirates | United Kingdom | United States | |
| | Vietnam | | | | |
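
Since the country name has to match this list, a caller may want to check it before building a prompt. The sketch below copies the table into a tuple and matches case-insensitively. `canonical_country` is a hypothetical helper. Whether the model tolerates alternative spellings such as `USA` or `Czech Republic` is not stated, so this sketch simply rejects them.

```python
SUPPORTED_COUNTRIES = (
    "Argentina", "Australia", "Austria", "Belgium",
    "Brazil", "Canada", "Chile", "China",
    "Colombia", "Czechia", "Denmark", "Egypt",
    "Finland", "France", "Germany", "Hong Kong",
    "Hungary", "India", "Indonesia", "Ireland",
    "Israel", "Italy", "Japan", "Malaysia",
    "Mexico", "Netherlands", "New Zealand", "Norway",
    "Pakistan", "Peru", "Philippines", "Poland",
    "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain",
    "Sweden", "Switzerland", "Taiwan", "Thailand",
    "Turkey", "United Arab Emirates", "United Kingdom", "United States",
    "Vietnam",
)

# Case-insensitive lookup from a normalized key to the spelling in the table.
_BY_KEY = {name.casefold(): name for name in SUPPORTED_COUNTRIES}


def canonical_country(name: str) -> str:
    """Return the table spelling of `name`, or raise ValueError if unsupported."""
    key = " ".join(name.split()).casefold()
    try:
        return _BY_KEY[key]
    except KeyError:
        raise ValueError(f"unsupported country: {name!r}") from None


print(len(SUPPORTED_COUNTRIES))              # 49
print(canonical_country("united  kingdom"))  # United Kingdom
```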
|
|
| ## Limitations |
|
|
| - Performance may degrade on salary formats that are not well represented in the training data. |
| - Country context is used for currency disambiguation; an incorrect country input may |
| produce an inaccurate `iso_code` or `currency` value. |
| - The model is not multilingual; salary text is expected to be in English or to use |
| standard numeric/symbol conventions. |
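
One way to catch the currency-disambiguation failure described above is to compare the returned `iso_code` with the currency normally used in the supplied country. The sketch below uses a deliberately small, illustrative mapping drawn from general knowledge, not from the model's training data, and `currency_mismatch` is a hypothetical helper. A mismatch is only a hint for review, since a job may legitimately be paid in a foreign currency.

```python
# Illustrative subset only; extend as needed for the countries you process.
EXPECTED_ISO = {
    "United States": "USD",
    "India": "INR",
    "United Kingdom": "GBP",
    "Japan": "JPY",
    "Germany": "EUR",
    "France": "EUR",
}


def currency_mismatch(country: str, record: dict) -> bool:
    """Return True when iso_code differs from the country's usual currency."""
    expected = EXPECTED_ISO.get(country)
    # Countries missing from the mapping are never flagged.
    return expected is not None and record["iso_code"] != expected


print(currency_mismatch("India", {"iso_code": "USD"}))  # True
print(currency_mismatch("India", {"iso_code": "INR"}))  # False
```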