Draup/salary-normalizer

@@ -5,19 +5,86 @@ metrics:
 base_model:
 - google/gemma-3-270m
 tags:
-- undstructure-to-structured-data
 - fine-tune
 - salary-normalizer
 - salary-parser
 ---
-# Implementation Code
 ```
 from transformers import AutoTokenizer, AutoModelForCausalLM
 model_name = "Draup/salary-normalizer"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
@@ -32,14 +99,46 @@ prompt = (
 )
 inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
 outputs = model.generate(
     **inputs,
-    max_new_tokens=70,
     do_sample=False,
     eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>")
 )
-result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
 print(result)
-# Output: {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
-```

 base_model:
 - google/gemma-3-270m
 tags:
+- unstructured-to-structured-data
 - fine-tune
 - salary-normalizer
 - salary-parser
 ---
+# Salary Normalizer
+A fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) model that parses
+and standardizes free-form salary text into structured JSON. Given an arbitrary salary
+string and a country name, it extracts the currency symbol, ISO 4217 currency code,
+numeric range, and pay cadence in a single inference pass.
+## Model Details
+| Property       | Value                          |
+|----------------|-------------------------------|
+| Base model     | `google/gemma-3-270m`         |
+| Fine-tune type | Supervised (instruction-tuned)|
+| License        | Apache 2.0                    |
+| Task           | Structured information extraction |
+## Intended Use
+- Extract and normalize salary mentions from job descriptions or candidate profiles.
+- Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`)
+  into a consistent schema for downstream analytics or storage.
+**Out-of-scope use:** This model is not designed for general text generation or tasks
+unrelated to salary parsing.
+## Input Format
+Prompts must follow this exact template:
+```text
+<start_of_turn>user
+summarize salary: <SALARY_TEXT>
+country: <COUNTRY_NAME><end_of_turn>
+<start_of_turn>model
+```
+**Examples of valid salary text:**
+- `$60k - $80k per year`
+- `INR 12 LPA`
+- `€45,000 annually`
+- `12 to 12.5 US $ per hr`
+Country names must match one of the [supported countries](#supported-countries) listed below.
+## Output Schema
+The model returns a JSON object with the following fields:
+```json
+{
+  "currency": "$",
+  "iso_code": "USD",
+  "min_amount": 60000,
+  "max_amount": 80000,
+  "pay_rate": "ANNUALLY"
+}
 ```
+| Field        | Type          | Description                                                  |
+|--------------|---------------|--------------------------------------------------------------|
+| `currency`   | `string`      | Raw currency symbol or string as it appears in the input     |
+| `iso_code`   | `string`      | Standardized ISO 4217 currency code                          |
+| `min_amount` | `int / float` | Lower bound of the salary range (annualized or as stated)    |
+| `max_amount` | `int / float` | Upper bound of the salary range (annualized or as stated)    |
+| `pay_rate`   | `string`      | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` |
+> **Note:** `min_amount` and `max_amount` hold normalized numeric values rather than
+> raw token extractions; for example, `$60k` is expected to appear as `60000`.
+> For single-value salaries, both fields hold the same value.
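Because the model returns JSON as plain text, downstream code may want to check it before storing it. The following is a minimal sketch of such a check, assuming the output is a single JSON object containing exactly the fields in the table above. `parse_salary_output`, `REQUIRED_FIELDS`, and `PAY_RATES` are hypothetical names and are not part of the model repository; the `min_amount <= max_amount` check is likewise an assumption about what a sensible parse looks like.

```python
import json

# Field names and pay_rate values are copied from the Output Schema table.
REQUIRED_FIELDS = ("currency", "iso_code", "min_amount", "max_amount", "pay_rate")
PAY_RATES = {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}


def parse_salary_output(raw: str) -> dict:
    """Hypothetical helper: parse model output and check it against the schema."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output is not a JSON object")

    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    for name in ("currency", "iso_code", "pay_rate"):
        if not isinstance(data[name], str):
            raise ValueError(f"{name} should be a string")
    for name in ("min_amount", "max_amount"):
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} should be a number")

    if data["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unexpected pay_rate: {data['pay_rate']!r}")
    # Assumption: a lower bound above the upper bound indicates a bad parse.
    if data["min_amount"] > data["max_amount"]:
        raise ValueError("min_amount exceeds max_amount")
    return data


example = (
    '{"currency": "US $", "iso_code": "USD", '
    '"min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}'
)
print(parse_salary_output(example)["pay_rate"])  # HOURLY
```

Raising `ValueError` on any mismatch keeps malformed generations out of downstream tables; callers that prefer to keep partial results could return `None` instead.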
+## Usage
+```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 model_name = "Draup/salary-normalizer"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
 )
 inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
 outputs = model.generate(
     **inputs,
+    max_new_tokens=64,
     do_sample=False,
     eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>")
 )
+result = tokenizer.decode(
+    outputs[0][inputs["input_ids"].shape[-1]:],
+    skip_special_tokens=True
+)
 print(result)
+# {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
+```
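The usage snippet above relies on a `prompt` variable whose construction falls outside the lines shown in this diff. As a rough sketch, the prompt could be assembled from the template in the Input Format section as follows. `build_prompt` and `normalize_salary` are hypothetical helper names, and the trailing newline after `<start_of_turn>model` is an assumption based on Gemma's usual turn format rather than something this model card states.

```python
def build_prompt(salary_text: str, country: str) -> str:
    """Fill the documented prompt template (hypothetical helper)."""
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary_text}\n"
        f"country: {country}<end_of_turn>\n"
        # Assumption: the model turn starts on a fresh line, as in Gemma chat prompts.
        "<start_of_turn>model\n"
    )


def normalize_salary(tokenizer, model, salary_text: str, country: str) -> str:
    """Run one greedy generation and return only the newly generated text.

    `tokenizer` and `model` are the objects loaded in the usage snippet;
    the generation arguments mirror that snippet.
    """
    prompt = build_prompt(salary_text, country)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>"),
    )
    # Drop the prompt tokens so only the model's answer is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


print(build_prompt("12 to 12.5 US $ per hr", "United States"))
```

With the tokenizer and model already loaded, a call such as `normalize_salary(tokenizer, model, "INR 12 LPA", "India")` would return the raw JSON string for further parsing.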
+## Supported Countries
+The model supports salary parsing for the following 49 countries:
+| | | | |
+|---|---|---|---|
+| Argentina | Australia | Austria | Belgium |
+| Brazil | Canada | Chile | China |
+| Colombia | Czechia | Denmark | Egypt |
+| Finland | France | Germany | Hong Kong |
+| Hungary | India | Indonesia | Ireland |
+| Israel | Italy | Japan | Malaysia |
+| Mexico | Netherlands | New Zealand | Norway |
+| Pakistan | Peru | Philippines | Poland |
+| Portugal | Romania | Russia | Saudi Arabia |
+| Singapore | South Africa | South Korea | Spain |
+| Sweden | Switzerland | Taiwan | Thailand |
+| Turkey | United Arab Emirates | United Kingdom | United States |
+| Vietnam | | | |
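Since country names are expected to match this list, callers may find it useful to validate the country before building a prompt. The sketch below assumes an exact, case-sensitive match against the names in the table; `SUPPORTED_COUNTRIES` and `check_country` are hypothetical names, and whether the model tolerates other spellings is not stated here.

```python
# Names copied from the Supported Countries table.
SUPPORTED_COUNTRIES = frozenset({
    "Argentina", "Australia", "Austria", "Belgium",
    "Brazil", "Canada", "Chile", "China",
    "Colombia", "Czechia", "Denmark", "Egypt",
    "Finland", "France", "Germany", "Hong Kong",
    "Hungary", "India", "Indonesia", "Ireland",
    "Israel", "Italy", "Japan", "Malaysia",
    "Mexico", "Netherlands", "New Zealand", "Norway",
    "Pakistan", "Peru", "Philippines", "Poland",
    "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain",
    "Sweden", "Switzerland", "Taiwan", "Thailand",
    "Turkey", "United Arab Emirates", "United Kingdom", "United States",
    "Vietnam",
})


def check_country(name: str) -> str:
    """Return the trimmed name if it is in the supported list, else raise ValueError."""
    cleaned = name.strip()
    if cleaned not in SUPPORTED_COUNTRIES:
        raise ValueError(f"unsupported country: {name!r}")
    return cleaned


print(len(SUPPORTED_COUNTRIES))  # 49
```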
+## Limitations
+- Performance may degrade on salary formats that are not well represented in the training data.
+- Country context is used for currency disambiguation; incorrect country input may
+  produce inaccurate `iso_code` or `currency` values.
+- The model is not multilingual: salary text is expected to be in English or to use
+  standard numeric/symbol conventions.
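One way to catch the currency mismatches mentioned above is to compare the returned `iso_code` with the currency normally used in the given country. The sketch below is only a heuristic: `EXPECTED_ISO` is a small illustrative subset rather than a complete mapping, `iso_code_mismatch` is a hypothetical name, and a mismatch is not necessarily an error because a posting may legitimately quote a foreign currency.

```python
# Illustrative subset only; extend it with the countries relevant to your data.
EXPECTED_ISO = {
    "United States": "USD",
    "India": "INR",
    "United Kingdom": "GBP",
    "Japan": "JPY",
    "Germany": "EUR",
    "Canada": "CAD",
}


def iso_code_mismatch(country: str, parsed: dict) -> bool:
    """Return True when iso_code differs from the country's usual currency.

    Returns False for countries missing from EXPECTED_ISO, since the
    table offers no expectation for them.
    """
    expected = EXPECTED_ISO.get(country)
    return expected is not None and parsed.get("iso_code") != expected


print(iso_code_mismatch("India", {"iso_code": "USD"}))  # True
```

Flagged records could be routed to manual review rather than discarded.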