Draup-DS committed on
Commit 7b6ffe4 · verified · 1 Parent(s): ac2c360

Update README.md

  1. README.md +106 -7
README.md CHANGED
@@ -5,19 +5,86 @@ metrics:
  base_model:
  - google/gemma-3-270m
  tags:
- - undstructure-to-structured-data
  - fine-tune
  - salary-normalizer
  - salary-parser
  ---

- # Implementation Code

  ```
  from transformers import AutoTokenizer, AutoModelForCausalLM

  model_name = "Draup/salary-normalizer"
-
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)

@@ -32,14 +99,46 @@ prompt = (
  )

  inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
  outputs = model.generate(
      **inputs,
-     max_new_tokens=70,
      do_sample=False,
      eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>")
  )

- result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
  print(result)
- # Output: {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
- ```
@@ -5,19 +5,86 @@ metrics:
  base_model:
  - google/gemma-3-270m
  tags:
+ - unstructured-to-structured-data
  - fine-tune
  - salary-normalizer
  - salary-parser
  ---

+ # Salary Normalizer

+ A fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) model that parses
+ and standardizes free-form salary text into structured JSON. Given an arbitrary salary
+ string and a country name, it extracts currency symbol, ISO code, numeric range, and
+ pay cadence in a single inference pass.
+
+ ## Model Details
+
+ | Property | Value |
+ |----------------|-------------------------------|
+ | Base model | `google/gemma-3-270m` |
+ | Fine-tune type | Supervised (instruction-tuned)|
+ | License | Apache 2.0 |
+ | Task | Structured information extraction |
+
+ ## Intended Use
+
+ - Extract and normalize salary mentions from job descriptions or candidate profiles.
+ - Standardize heterogeneous salary formats (e.g., `12 LPA`, `$60k–$80k/yr`, `€45,000 p.a.`)
+ into a consistent schema for downstream analytics or storage.
+
+ **Out-of-scope use:** This model is not designed for general text generation or tasks
+ unrelated to salary parsing.
+
+
+ ## Input Format
+
+ Prompts must follow this exact template:
+ ```text
+ <start_of_turn>user
+ summarize salary: <SALARY_TEXT>
+ country: <COUNTRY_NAME><end_of_turn>
+ <start_of_turn>model
+ ```
+
+ **Examples of valid salary text:**
+ - `$60k - $80k per year`
+ - `INR 12 LPA`
+ - `€45,000 annually`
+ - `12 to 12.5 US $ per hr`
+
+ Country names must match one of the [supported countries](#supported-countries) listed below.
+
+
+ ## Output Schema
+
+ The model returns a JSON object with the following fields:
+ ```json
+ {
+   "currency": "$",
+   "iso_code": "USD",
+   "min_amount": 60000,
+   "max_amount": 80000,
+   "pay_rate": "ANNUALLY"
+ }
  ```
+
+ | Field | Type | Description |
+ |--------------|---------------|--------------------------------------------------------------|
+ | `currency` | `string` | Raw currency symbol or string as it appears in the input |
+ | `iso_code` | `string` | Standardized ISO 4217 currency code |
+ | `min_amount` | `int / float` | Lower bound of the salary range (annualized or as stated) |
+ | `max_amount` | `int / float` | Upper bound of the salary range (annualized or as stated) |
+ | `pay_rate` | `string` | One of: `HOURLY`, `DAILY`, `WEEKLY`, `BI-WEEKLY`, `MONTHLY`, `ANNUALLY`, `OTHERS` |
+
+ > **Note:** `min_amount` and `max_amount` reflect normalized numeric values,
+ > not raw token extractions. For single-value salaries, both fields will hold the same value.
+
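Reviewer note: the schema table added above lends itself to a small validation step on the decoded text. The following is a minimal sketch, not code from the repository; `parse_salary_output`, `PAY_RATES`, and `REQUIRED_FIELDS` are hypothetical names, and the checks that amounts are always numeric and that `min_amount <= max_amount` are assumptions drawn from the field descriptions rather than guarantees stated by the README.

```python
import json

# Allowed values copied from the `pay_rate` row of the schema table.
PAY_RATES = {"HOURLY", "DAILY", "WEEKLY", "BI-WEEKLY", "MONTHLY", "ANNUALLY", "OTHERS"}
REQUIRED_FIELDS = ("currency", "iso_code", "min_amount", "max_amount", "pay_rate")


def parse_salary_output(raw: str) -> dict:
    """Parse generated text and check it against the documented schema.

    Hypothetical helper. Raises ValueError if the text is not a JSON object
    or if a field is missing or malformed.
    """
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")

    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    for name in ("currency", "iso_code", "pay_rate"):
        if not isinstance(data[name], str):
            raise ValueError(f"{name} must be a string")

    for name in ("min_amount", "max_amount"):
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number")

    # ISO 4217 alphabetic codes are three uppercase letters.
    code = data["iso_code"]
    if not (len(code) == 3 and code.isalpha() and code.isupper()):
        raise ValueError(f"iso_code does not look like ISO 4217: {code!r}")

    if data["pay_rate"] not in PAY_RATES:
        raise ValueError(f"unexpected pay_rate: {data['pay_rate']!r}")

    # Assumption: the lower bound never exceeds the upper bound.
    if data["min_amount"] > data["max_amount"]:
        raise ValueError("min_amount is greater than max_amount")
    return data


if __name__ == "__main__":
    sample = (
        '{"currency": "US $", "iso_code": "USD", "min_amount": 12, '
        '"max_amount": 12.5, "pay_rate": "HOURLY"}'
    )
    print(parse_salary_output(sample)["pay_rate"])
```

Raising on any deviation, rather than repairing the output, keeps malformed generations from a 270M-parameter model out of downstream storage; callers that prefer a softer policy could catch `ValueError` and retry.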
+ ## Usage
+ ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

  model_name = "Draup/salary-normalizer"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)

@@ -32,14 +99,46 @@ prompt = (
  )

  inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
+
  outputs = model.generate(
      **inputs,
+     max_new_tokens=64,
      do_sample=False,
      eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_turn>")
  )

+ result = tokenizer.decode(
+     outputs[0][inputs["input_ids"].shape[-1]:],
+     skip_special_tokens=True
+ )
  print(result)
+ # {"currency": "US $", "iso_code": "USD", "min_amount": 12, "max_amount": 12.5, "pay_rate": "HOURLY"}
+ ```
+
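Reviewer note: the lines that build `prompt` fall between the two hunks and are not visible in this diff. As a rough sketch of how the template from the Input Format section could be filled, the hypothetical helper `build_prompt` below may be useful. Two things are assumptions rather than facts from the README: the trailing newline after `<start_of_turn>model`, and collapsing internal whitespace so each field stays on a single template line.

```python
def build_prompt(salary_text: str, country: str) -> str:
    """Fill the documented prompt template (hypothetical helper).

    Whitespace, including newlines, is collapsed to single spaces so that
    neither field can break the line structure of the template.
    """
    salary = " ".join(salary_text.split())
    country_name = " ".join(country.split())
    if not salary or not country_name:
        raise ValueError("salary_text and country must be non-empty")
    return (
        "<start_of_turn>user\n"
        f"summarize salary: {salary}\n"
        f"country: {country_name}<end_of_turn>\n"
        # Assumption: generation starts on the line after the model turn marker.
        "<start_of_turn>model\n"
    )


if __name__ == "__main__":
    print(build_prompt("12 to 12.5 US $ per hr", "United States"))
```

The returned string can be passed to `tokenizer(prompt, return_tensors="pt", truncation=True)` exactly as in the usage snippet above.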
+ ## Supported Countries
+
+ The model supports salary parsing for the following 49 countries:
+
+ | | | | |
+ |---|---|---|---|
+ | Argentina | Australia | Austria | Belgium |
+ | Brazil | Canada | Chile | China |
+ | Colombia | Czechia | Denmark | Egypt |
+ | Finland | France | Germany | Hong Kong |
+ | Hungary | India | Indonesia | Ireland |
+ | Israel | Italy | Japan | Malaysia |
+ | Mexico | Netherlands | New Zealand | Norway |
+ | Pakistan | Peru | Philippines | Poland |
+ | Portugal | Romania | Russia | Saudi Arabia |
+ | Singapore | South Africa | South Korea | Spain |
+ | Sweden | Switzerland | Taiwan | Thailand |
+ | Turkey | United Arab Emirates | United Kingdom | United States |
+ | Vietnam | | | |
+
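Reviewer note: since the README requires the country name to match this table, a caller may want to reject or canonicalize names before building a prompt. The sketch below copies the 49 names from the table; `SUPPORTED_COUNTRIES` and `normalize_country` are hypothetical names, and the case-insensitive matching is a convenience assumption (the README does not say how the model reacts to different casing, so the helper always returns the table's spelling).

```python
# The 49 names listed in the Supported Countries table.
SUPPORTED_COUNTRIES = frozenset({
    "Argentina", "Australia", "Austria", "Belgium",
    "Brazil", "Canada", "Chile", "China",
    "Colombia", "Czechia", "Denmark", "Egypt",
    "Finland", "France", "Germany", "Hong Kong",
    "Hungary", "India", "Indonesia", "Ireland",
    "Israel", "Italy", "Japan", "Malaysia",
    "Mexico", "Netherlands", "New Zealand", "Norway",
    "Pakistan", "Peru", "Philippines", "Poland",
    "Portugal", "Romania", "Russia", "Saudi Arabia",
    "Singapore", "South Africa", "South Korea", "Spain",
    "Sweden", "Switzerland", "Taiwan", "Thailand",
    "Turkey", "United Arab Emirates", "United Kingdom", "United States",
    "Vietnam",
})

# Case-folded lookup so "united states" maps back to "United States".
_LOOKUP = {name.casefold(): name for name in SUPPORTED_COUNTRIES}


def normalize_country(name: str) -> str:
    """Return the table's spelling of a supported country, else raise ValueError."""
    key = " ".join(name.split()).casefold()
    try:
        return _LOOKUP[key]
    except KeyError:
        raise ValueError(f"unsupported country: {name!r}") from None


if __name__ == "__main__":
    print(normalize_country("  united   states "))
```

Failing early on an unsupported name avoids the silent `iso_code` or `currency` errors that an out-of-list country could otherwise cause.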
+ ## Limitations
+
+ - Performance may degrade on salary formats not well-represented in training data.
+ - Country context is used for currency disambiguation; incorrect country input may
+ produce inaccurate `iso_code` or `currency` values.
+ - The model is not multilingual; salary text is expected to be in English or use
+ standard numeric/symbol conventions.