lbourdois committed on
Commit
e835408
·
verified ·
1 Parent(s): a310726

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve discoverability. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13.

Files changed (1)
  1. README.md +189 -180
README.md CHANGED
@@ -1,180 +1,189 @@
Removed front matter (the only substantive change in this commit is the `language` list; the README body is identical on both sides of the diff and is shown once below):

```yaml
---
license: mit
language:
- multilingual
tags:
- nlp
base_model: Qwen/Qwen2.5-0.5B
pipeline_tag: text-generation
---
```
Added front matter:

```yaml
---
license: mit
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
tags:
- nlp
base_model: Qwen/Qwen2.5-0.5B
pipeline_tag: text-generation
---
```
[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/NuExtract-1.5-tiny-GGUF

This is a quantized version of [numind/NuExtract-1.5-tiny](https://huggingface.co/numind/NuExtract-1.5-tiny) created using llama.cpp.
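The GGUF quants can be run locally with llama.cpp's CLI. A minimal sketch (the file name `NuExtract-1.5-tiny.Q4_K_M.gguf` is an assumption; substitute whichever quant you downloaded from this repo):

```shell
# Hypothetical quant file name; greedy decoding suits pure extraction
llama-cli -m NuExtract-1.5-tiny.Q4_K_M.gguf \
  --temp 0 -n 512 \
  -p $'<|input|>\n### Template:\n{"Name": ""}\n### Text:\nMy name is Ada.\n\n<|output|>'
```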
# Original Model Card

# NuExtract-tiny-v1.5 by NuMind 🔥

NuExtract-tiny-v1.5 is a fine-tuning of [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B), trained on a private high-quality dataset for structured information extraction. It supports long documents and several languages (English, French, Spanish, German, Portuguese, and Italian).

To use the model, provide an input text and a JSON template describing the information you need to extract.

Note: this model is trained to prioritize pure extraction, so in most cases all text generated by the model is present as-is in the original text.

We also provide a 3.8B version based on Phi-3.5-mini-instruct: [NuExtract-v1.5](https://huggingface.co/numind/NuExtract-v1.5)

Check out the [blog post](https://numind.ai/blog/nuextract-1-5---multilingual-infinite-context-still-small-and-better-than-gpt-4o).

Try the 3.8B model here: [Playground](https://huggingface.co/spaces/numind/NuExtract-v1.5)

## Benchmark

Zero-shot performance (English):

<p align="left">
<img src="english_bench.png" style="width: 600px; height: auto;">
</p>

Few-shot fine-tuning:

<p align="left">
<img src="fewshot_bench.png" style="width: 750px; height: auto;">
</p>
## Usage

To use the model:

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_NuExtract(model, tokenizer, texts, template, batch_size=1, max_length=10_000, max_new_tokens=4_000):
    template = json.dumps(json.loads(template), indent=4)
    prompts = [f"""<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>""" for text in texts]

    outputs = []
    with torch.no_grad():
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i+batch_size]
            batch_encodings = tokenizer(batch_prompts, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(model.device)

            pred_ids = model.generate(**batch_encodings, max_new_tokens=max_new_tokens)
            outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    return [output.split("<|output|>")[1] for output in outputs]

model_name = "numind/NuExtract-tiny-v1.5"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

text = """We introduce Mistral 7B, a 7-billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Code: <https://github.com/mistralai/mistral-src>
Webpage: <https://mistral.ai/news/announcing-mistral-7b/>"""

template = """{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of max token": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}"""

prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)
```
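The prompt assembly in `predict_NuExtract` can be sanity-checked without loading the model. A minimal sketch with a toy template and text (no GPU or model download needed):

```python
import json

# Rebuild the prompt exactly as predict_NuExtract does:
# normalize the JSON template, then wrap template and text
# in the <|input|>/<|output|> markers.
template = '{"Name": ""}'
text = "My name is Ada."

normalized = json.dumps(json.loads(template), indent=4)
prompt = f"<|input|>\n### Template:\n{normalized}\n### Text:\n{text}\n\n<|output|>"

print(prompt)
```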
Sliding window prompting:

```python
import json

MAX_INPUT_SIZE = 20_000
MAX_NEW_TOKENS = 6000

def clean_json_text(text):
    text = text.strip()
    text = text.replace("\\#", "#").replace("\\&", "&")
    return text

def predict_chunk(text, template, current, model, tokenizer):
    current = clean_json_text(current)

    input_llm = f"<|input|>\n### Template:\n{template}\n### Current:\n{current}\n### Text:\n{text}\n\n<|output|>" + "{"
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=MAX_INPUT_SIZE).to("cuda")
    output = tokenizer.decode(model.generate(**input_ids, max_new_tokens=MAX_NEW_TOKENS)[0], skip_special_tokens=True)

    return clean_json_text(output.split("<|output|>")[1])

def split_document(document, window_size, overlap):
    tokens = tokenizer.tokenize(document)
    print(f"\tLength of document: {len(tokens)} tokens")

    chunks = []
    if len(tokens) > window_size:
        for i in range(0, len(tokens), window_size - overlap):
            print(f"\t{i} to {i + len(tokens[i:i + window_size])}")
            chunk = tokenizer.convert_tokens_to_string(tokens[i:i + window_size])
            chunks.append(chunk)

            if i + len(tokens[i:i + window_size]) >= len(tokens):
                break
    else:
        chunks.append(document)

    print(f"\tSplit into {len(chunks)} chunks")
    return chunks

def handle_broken_output(pred, prev):
    try:
        if all(v in ["", []] for v in json.loads(pred).values()):
            # if empty json, return previous
            pred = prev
    except json.JSONDecodeError:
        # if broken json, return previous
        pred = prev

    return pred

def sliding_window_prediction(text, template, model, tokenizer, window_size=4000, overlap=128):
    # split text into overlapping chunks of window_size tokens
    chunks = split_document(text, window_size, overlap)

    # iterate over text chunks, carrying the extraction forward
    prev = template
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i}...")
        pred = predict_chunk(chunk, template, prev, model, tokenizer)

        # handle broken output
        pred = handle_broken_output(pred, prev)

        # iterate
        prev = pred

    return pred
```
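The chunking arithmetic in `split_document` (windows of `window_size` tokens advancing by `window_size - overlap`) can be sanity-checked without a tokenizer. A minimal sketch with toy numbers (`chunk_spans` is a hypothetical helper, not part of the model card):

```python
def chunk_spans(n_tokens, window_size, overlap):
    """Reproduce split_document's stride: windows of window_size
    tokens advancing by window_size - overlap."""
    if n_tokens <= window_size:
        return [(0, n_tokens)]
    spans = []
    for i in range(0, n_tokens, window_size - overlap):
        end = min(i + window_size, n_tokens)
        spans.append((i, end))
        if end >= n_tokens:
            break
    return spans

# Toy numbers for illustration (split_document defaults to 4000/128):
# consecutive windows share `overlap` tokens
print(chunk_spans(10, 4, 1))  # → [(0, 4), (3, 7), (6, 10)]
```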