amarinference committed on
Commit
5410bd0
·
verified ·
1 Parent(s): 62acf8e

Update README.md

Files changed (1)
  1. README.md +233 -164
README.md CHANGED
@@ -4,173 +4,242 @@ license: llama3.2
  base_model: meta-llama/Llama-3.2-3B-Instruct
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- [<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
- <details><summary>See axolotl config</summary>
-
- axolotl version: `0.10.0`
- ```yaml
- # base_model: NousResearch/Meta-Llama-3.1-8B
- # base_model: meta-llama/Meta-Llama-3.1-8B
- base_model: meta-llama/Llama-3.2-3B-Instruct
- # Automatically upload checkpoint and final model to HF
- # hub_model_id: username/custom_model_name
- is_llama_derived_model: true
- model_type: AutoModelForCausalLM
- tokenizer_type: AutoTokenizer
-
- plugins:
-   - axolotl.integrations.liger.LigerPlugin
- liger_rope: true
- liger_rms_norm: true
- liger_glu_activation: true
- liger_fused_linear_cross_entropy: true
-
-
- datasets:
-   - path: /workspace/final_html_dataset.jsonl
-     type: chat_template
-
-     # field_messages: messages
-     # message_property_mappings:
-     # role: role
-     # content: content
-     field_messages: conversations
-     message_property_mappings:
-       role: from
-       content: value
-
-
- train_on_inputs: false
- dataset_prepared_path: ./last_run_prepared
-
- # dataset_prepared_path: last_run_prepared
- # val_set_size: 0.02
- output_dir: ./outputs/out
-
- sequence_len: 128000
- sample_packing: true
- # eval_sample_packing: false
-
-
-
- # wandb_project:
- # wandb_entity:
- # wandb_watch:
- # wandb_name:
- # wandb_log_model:
- use_wandb: true
- wandb_name: "test_run"
-
-
- gradient_accumulation_steps: 2
- micro_batch_size: 1
- num_epochs: 1
- optimizer: adamw_torch_fused
- lr_scheduler: cosine
- learning_rate: 2e-5
-
- # sequence_parallel_degree: 4 # Set to the number of GPUs to split sequences across
- # flash_attention: true # SP requires flash attention
- # heads_k_stride: 1
-
- bf16: auto
- tf32: false
-
- gradient_checkpointing: true
- gradient_checkpointing_kwargs:
-   use_reentrant: false
- resume_from_checkpoint:
- logging_steps: 1
- # flash_attention: true
-
- warmup_ratio: 0.1
- evals_per_epoch: 2
- saves_per_epoch: 1
- weight_decay: 0.0
-
- flash_attention: true
- torch_dtype: bfloat16
-
- # save_strategy: "no"
- # eval_strategy: "no"
-
- load_in_8bit: false
- load_in_4bit: false
- device_map: auto
-
- special_tokens:
-   pad_token: <|finetune_right_pad_id|>
-   eos_token: <|eot_id|>
-
-
- # fsdp:
- # - full_shard
- # - auto_wrap
- # fsdp_config:
- # fsdp_limit_all_gathers: true
- # fsdp_sync_module_states: true
- # fsdp_offload_params: true
- # fsdp_use_orig_params: false
- # fsdp_cpu_ram_efficient_loading: true
- # fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
- # fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
- # fsdp_state_dict_type: FULL_STATE_DICT
- # fsdp_sharding_strategy: FULL_SHARD
- # fsdp_backward_prefetch: BACKWARD_PRE
- # special_tokens:
- # pad_token: <|finetune_right_pad_id|>
- # eos_token: <|eot_id|>
-
- # save_first_step: true # uncomment this to validate checkpoint saving works with your config
  ```
 
- </details><br>
-
- # outputs/out
-
- This model is a fine-tuned version of [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) on the /workspace/final_html_dataset.jsonl dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 1
- - eval_batch_size: 1
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 8
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 16
- - total_eval_batch_size: 8
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 208
- - training_steps: 2087
-
- ### Training results
 
- ### Framework versions
 
- - Transformers 4.52.3
- - Pytorch 2.8.0+cu126
- - Datasets 4.0.0
- - Tokenizers 0.21.4
 
  base_model: meta-llama/Llama-3.2-3B-Instruct
  ---
 
+ ### In order to use this:
+
+ Request the HTML of a page, then clean it using something like:
+
+ ```python
+ # On lxml 5.2 and later, this import needs the separate `lxml_html_clean` package.
+ from lxml.html.clean import Cleaner
+ import lxml.html as LH
+
+ # Drop scripts, JavaScript, and styling; keep all other attributes.
+ HTML_CLEANER = Cleaner(
+     scripts=True,
+     javascript=True,
+     style=True,
+     inline_style=True,
+     safe_attrs_only=False,
+ )
+
+
+ def strip_noise(html: str) -> str:
+     """Remove scripts, styles, and JavaScript from HTML using lxml."""
+     if not html or not html.strip():
+         return ""
+     try:
+         doc = LH.fromstring(html)
+         cleaned = HTML_CLEANER.clean_html(doc)
+         return LH.tostring(cleaned, encoding="unicode")
+     except Exception:
+         # Input that cannot be parsed or cleaned yields an empty string.
+         return ""
  ```
 
+ There are three parts to the prompt:
+ ```
+ {
+     "prompt_part_one": "You are going to be given a JSON schema following the standardized JSON Schema format. You are going to be given a HTML page and you are going to apply the schema to the HTML page however you see it as applicable and return the results in a JSON object. The schema is as follows:",
+     "prompt_part_two": "Here is the HTML page:",
+     "prompt_part_three": "MAKE SURE ITS VALID JSON."
+ }
+ ```
 
+ The schema, which is the JSON Schema draft-07 meta-schema, is:
+ ```
+ {
+     "$schema": "http://json-schema.org/draft-07/schema#",
+     "$id": "http://json-schema.org/draft-07/schema#",
+     "title": "Core schema meta-schema",
+     "definitions": {
+         "schemaArray": {
+             "type": "array",
+             "minItems": 1,
+             "items": { "$ref": "#" }
+         },
+         "nonNegativeInteger": {
+             "type": "integer",
+             "minimum": 0
+         },
+         "nonNegativeIntegerDefault0": {
+             "allOf": [
+                 { "$ref": "#/definitions/nonNegativeInteger" },
+                 { "default": 0 }
+             ]
+         },
+         "simpleTypes": {
+             "enum": [
+                 "array",
+                 "boolean",
+                 "integer",
+                 "null",
+                 "number",
+                 "object",
+                 "string"
+             ]
+         },
+         "stringArray": {
+             "type": "array",
+             "items": { "type": "string" },
+             "uniqueItems": true,
+             "default": []
+         }
+     },
+     "type": ["object", "boolean"],
+     "properties": {
+         "$id": {
+             "type": "string",
+             "format": "uri-reference"
+         },
+         "$schema": {
+             "type": "string",
+             "format": "uri"
+         },
+         "$ref": {
+             "type": "string",
+             "format": "uri-reference"
+         },
+         "$comment": {
+             "type": "string"
+         },
+         "title": {
+             "type": "string"
+         },
+         "description": {
+             "type": "string"
+         },
+         "default": true,
+         "readOnly": {
+             "type": "boolean",
+             "default": false
+         },
+         "writeOnly": {
+             "type": "boolean",
+             "default": false
+         },
+         "examples": {
+             "type": "array",
+             "items": true
+         },
+         "multipleOf": {
+             "type": "number",
+             "exclusiveMinimum": 0
+         },
+         "maximum": {
+             "type": "number"
+         },
+         "exclusiveMaximum": {
+             "type": "number"
+         },
+         "minimum": {
+             "type": "number"
+         },
+         "exclusiveMinimum": {
+             "type": "number"
+         },
+         "maxLength": { "$ref": "#/definitions/nonNegativeInteger" },
+         "minLength": { "$ref": "#/definitions/nonNegativeIntegerDefault0" },
+         "pattern": {
+             "type": "string",
+             "format": "regex"
+         },
+         "additionalItems": { "$ref": "#" },
+         "items": {
+             "anyOf": [
+                 { "$ref": "#" },
+                 { "$ref": "#/definitions/schemaArray" }
+             ],
+             "default": true
+         },
+         "maxItems": { "$ref": "#/definitions/nonNegativeInteger" },
+         "minItems": { "$ref": "#/definitions/nonNegativeIntegerDefault0" },
+         "uniqueItems": {
+             "type": "boolean",
+             "default": false
+         },
+         "contains": { "$ref": "#" },
+         "maxProperties": { "$ref": "#/definitions/nonNegativeInteger" },
+         "minProperties": { "$ref": "#/definitions/nonNegativeIntegerDefault0" },
+         "required": { "$ref": "#/definitions/stringArray" },
+         "additionalProperties": { "$ref": "#" },
+         "definitions": {
+             "type": "object",
+             "additionalProperties": { "$ref": "#" },
+             "default": {}
+         },
+         "properties": {
+             "type": "object",
+             "additionalProperties": { "$ref": "#" },
+             "default": {}
+         },
+         "patternProperties": {
+             "type": "object",
+             "additionalProperties": { "$ref": "#" },
+             "propertyNames": { "format": "regex" },
+             "default": {}
+         },
+         "dependencies": {
+             "type": "object",
+             "additionalProperties": {
+                 "anyOf": [
+                     { "$ref": "#" },
+                     { "$ref": "#/definitions/stringArray" }
+                 ]
+             }
+         },
+         "propertyNames": { "$ref": "#" },
+         "const": true,
+         "enum": {
+             "type": "array",
+             "items": true,
+             "minItems": 1,
+             "uniqueItems": true
+         },
+         "type": {
+             "anyOf": [
+                 { "$ref": "#/definitions/simpleTypes" },
+                 {
+                     "type": "array",
+                     "items": { "$ref": "#/definitions/simpleTypes" },
+                     "minItems": 1,
+                     "uniqueItems": true
+                 }
+             ]
+         },
+         "format": { "type": "string" },
+         "contentMediaType": { "type": "string" },
+         "contentEncoding": { "type": "string" },
+         "if": { "$ref": "#" },
+         "then": { "$ref": "#" },
+         "else": { "$ref": "#" },
+         "allOf": { "$ref": "#/definitions/schemaArray" },
+         "anyOf": { "$ref": "#/definitions/schemaArray" },
+         "oneOf": { "$ref": "#/definitions/schemaArray" },
+         "not": { "$ref": "#" }
+     },
+     "default": true
+ }
+ ```
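
The prompt parts above are plain strings, so a schema held as a Python object has to be turned into text before it is placed between them. The sketch below shows one way to do that with the standard library. The helper name `schema_to_prompt_text` and the toy `example_schema` are hypothetical and only illustrate the call; they are not part of the model card.

```python
import json


def schema_to_prompt_text(schema) -> str:
    """Return a schema as text that can be concatenated into a prompt.

    A string is checked with json.loads and returned unchanged; any other
    JSON-serializable value is dumped with json.dumps.
    """
    if isinstance(schema, str):
        json.loads(schema)  # raises json.JSONDecodeError if it is not valid JSON
        return schema
    return json.dumps(schema, indent=2)


# Hypothetical toy schema, used only to demonstrate the helper.
example_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {"title": {"type": "string"}},
}
schema_text = schema_to_prompt_text(example_schema)
```

Checking a string schema with `json.loads` up front is a design choice here: it surfaces a malformed schema before any model call is made.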
 
+ You can combine the prompt, schema, and HTML using something like:
+
+ ```python
+ def construct_messages(schema, html):
+     """Construct messages for the OpenAI API."""
+     # `response_prompt` is the dict holding the three prompt parts shown above;
+     # `schema` and `html` must both be strings.
+     user_prompt = (
+         response_prompt['prompt_part_one'] +
+         "\n\n" + schema + "\n\n" +
+         response_prompt['prompt_part_two'] +
+         "\n\n" + html + "\n\n" +
+         response_prompt['prompt_part_three']
+     )
+
+     messages = [
+         {"role": "system", "content": "You are a helpful assistant"},
+         {"role": "user", "content": user_prompt}
+     ]
+
+     return messages
+ ```
 
+ Here, `schema` is the schema copied from above and `html` is the output of the lxml cleaning function (`strip_noise`). The model's output should be the filled-out JSON.
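
Since the prompt asks for valid JSON, it may be worth checking the reply before using it. The following is a minimal sketch using only the standard library. The function name `parse_model_json` is hypothetical, and the handling of a surrounding Markdown code fence is an assumption about how a reply might be wrapped, not documented behavior of this model.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_model_json(reply: str):
    """Parse a model reply as JSON, tolerating a surrounding Markdown fence.

    Returns the decoded object, or None if the reply is not valid JSON.
    """
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing fence line.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

Returning `None` instead of raising leaves it to the caller to decide whether to retry the request or skip the page.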