inference-net/Schematron-3B

@@ -4,173 +4,242 @@ license: llama3.2
 base_model: meta-llama/Llama-3.2-3B-Instruct
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
-<details><summary>See axolotl config</summary>
-axolotl version: `0.10.0`
-```yaml
-# base_model: NousResearch/Meta-Llama-3.1-8B
-# base_model: meta-llama/Meta-Llama-3.1-8B
-base_model: meta-llama/Llama-3.2-3B-Instruct
-# Automatically upload checkpoint and final model to HF
-# hub_model_id: username/custom_model_name
-is_llama_derived_model: true
-model_type: AutoModelForCausalLM
-tokenizer_type: AutoTokenizer
-plugins:
-  - axolotl.integrations.liger.LigerPlugin
-liger_rope: true
-liger_rms_norm: true
-liger_glu_activation: true
-liger_fused_linear_cross_entropy: true
-datasets:
-  - path: /workspace/final_html_dataset.jsonl
-    type: chat_template
-    # field_messages: messages
-    # message_property_mappings:
-    #   role: role
-    #   content: content
-    field_messages: conversations
-    message_property_mappings:
-      role: from
-      content: value
-train_on_inputs: false
-dataset_prepared_path: ./last_run_prepared
-# dataset_prepared_path: last_run_prepared
-# val_set_size: 0.02
-output_dir: ./outputs/out
-sequence_len: 128000
-sample_packing: true
-# eval_sample_packing: false
-# wandb_project:
-# wandb_entity:
-# wandb_watch:
-# wandb_name:
-# wandb_log_model:
-use_wandb: true
-wandb_name: "test_run"
-gradient_accumulation_steps: 2
-micro_batch_size: 1
-num_epochs: 1
-optimizer: adamw_torch_fused
-lr_scheduler: cosine
-learning_rate: 2e-5
-# sequence_parallel_degree: 4  # Set to the number of GPUs to split sequences across
-# flash_attention: true  # SP requires flash attention
-# heads_k_stride: 1
-bf16: auto
-tf32: false
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-resume_from_checkpoint:
-logging_steps: 1
-# flash_attention: true
-warmup_ratio: 0.1
-evals_per_epoch: 2
-saves_per_epoch: 1
-weight_decay: 0.0
-flash_attention: true
-torch_dtype: bfloat16
-# save_strategy: "no"
-# eval_strategy: "no"
-load_in_8bit: false
-load_in_4bit: false
-device_map: auto
-special_tokens:
-  pad_token: <|finetune_right_pad_id|>
-  eos_token: <|eot_id|>
-# fsdp:
-#   - full_shard
-#   - auto_wrap
-# fsdp_config:
-#   fsdp_limit_all_gathers: true
-#   fsdp_sync_module_states: true
-#   fsdp_offload_params: true
-#   fsdp_use_orig_params: false
-#   fsdp_cpu_ram_efficient_loading: true
-#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-#   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
-#   fsdp_state_dict_type: FULL_STATE_DICT
-#   fsdp_sharding_strategy: FULL_SHARD
-#   fsdp_backward_prefetch: BACKWARD_PRE
-# special_tokens:
-#   pad_token: <|finetune_right_pad_id|>
-#   eos_token: <|eot_id|>
-# save_first_step: true  # uncomment this to validate checkpoint saving works with your config
 ```
-</details><br>
-# outputs/out
-This model is a fine-tuned version of [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) on the /workspace/final_html_dataset.jsonl dataset.
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 2e-05
-- train_batch_size: 1
-- eval_batch_size: 1
-- seed: 42
-- distributed_type: multi-GPU
-- num_devices: 8
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 16
-- total_eval_batch_size: 8
-- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: cosine
-- lr_scheduler_warmup_steps: 208
-- training_steps: 2087
-### Training results
-### Framework versions
-- Transformers 4.52.3
-- Pytorch 2.8.0+cu126
-- Datasets 4.0.0
-- Tokenizers 0.21.4

 base_model: meta-llama/Llama-3.2-3B-Instruct
 ---
+### In order to use this:
+Request the HTML from a page, then clean it using something like:
+```python
+import lxml.html as LH
+# On lxml >= 5.2 this import needs the separate lxml_html_clean package installed.
+from lxml.html.clean import Cleaner
+
+# Drop scripts, embedded JavaScript, and CSS; keep all other attributes.
+HTML_CLEANER = Cleaner(
+    scripts=True,
+    javascript=True,
+    style=True,
+    inline_style=True,
+    safe_attrs_only=False,
+)
+
+
+def strip_noise(html: str) -> str:
+    """Remove scripts, styles, and JavaScript from HTML using lxml."""
+    if not html or not html.strip():
+        return ""
+    try:
+        doc = LH.fromstring(html)
+        cleaned = HTML_CLEANER.clean_html(doc)
+        return LH.tostring(cleaned, encoding="unicode")
+    except Exception:
+        # Parsing or cleaning failed; return an empty string instead of raising.
+        return ""
 ```
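The first step above, requesting a page's HTML, is not shown in the card. A minimal sketch using only the standard library might look like the following; `fetch_html`, the User-Agent string, and the timeout value are illustrative assumptions, not part of the model's tooling.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and return its body as text (hypothetical helper)."""
    # Some servers reject urllib's default User-Agent; this value is a placeholder.
    request = urllib.request.Request(
        url, headers={"User-Agent": "schematron-example/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response headers, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be passed to `strip_noise` before it goes into the prompt.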
+There are three parts to the prompt:
+```
+{
+    "prompt_part_one": "You are going to be given a JSON schema following the standardized JSON Schema format. You are going to be given a HTML page and you are going to apply the schema to the HTML page however you see it as applicable and return the results in a JSON object. The schema is as follows:",
+    "prompt_part_two": "Here is the HTML page:",
+    "prompt_part_three": "MAKE SURE ITS VALID JSON."
+}
+```
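The `construct_messages` snippet later in this card reads these strings from a `response_prompt` mapping that is never defined. One way to define it in Python is sketched below, with the wording kept verbatim from the JSON above; splitting the first string across several lines is only for readability.

```python
# The three prompt parts, copied verbatim from the JSON above.
response_prompt = {
    "prompt_part_one": (
        "You are going to be given a JSON schema following the standardized JSON "
        "Schema format. You are going to be given a HTML page and you are going to "
        "apply the schema to the HTML page however you see it as applicable and "
        "return the results in a JSON object. The schema is as follows:"
    ),
    "prompt_part_two": "Here is the HTML page:",
    "prompt_part_three": "MAKE SURE ITS VALID JSON.",
}
```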
+The draft schema (the JSON Schema draft-07 meta-schema) is:
+```
+{
+    "$schema": "http://json-schema.org/draft-07/schema#",
+    "$id": "http://json-schema.org/draft-07/schema#",
+    "title": "Core schema meta-schema",
+    "definitions": {
+        "schemaArray": {
+            "type": "array",
+            "minItems": 1,
+            "items": { "$ref": "#" }
+        },
+        "nonNegativeInteger": {
+            "type": "integer",
+            "minimum": 0
+        },
+        "nonNegativeIntegerDefault0": {
+            "allOf": [
+                { "$ref": "#/definitions/nonNegativeInteger" },
+                { "default": 0 }
+            ]
+        },
+        "simpleTypes": {
+            "enum": [
+                "array",
+                "boolean",
+                "integer",
+                "null",
+                "number",
+                "object",
+                "string"
+            ]
+        },
+        "stringArray": {
+            "type": "array",
+            "items": { "type": "string" },
+            "uniqueItems": true,
+            "default": []
+        }
+    },
+    "type": ["object", "boolean"],
+    "properties": {
+        "$id": {
+            "type": "string",
+            "format": "uri-reference"
+        },
+        "$schema": {
+            "type": "string",
+            "format": "uri"
+        },
+        "$ref": {
+            "type": "string",
+            "format": "uri-reference"
+        },
+        "$comment": {
+            "type": "string"
+        },
+        "title": {
+            "type": "string"
+        },
+        "description": {
+            "type": "string"
+        },
+        "default": true,
+        "readOnly": {
+            "type": "boolean",
+            "default": false
+        },
+        "writeOnly": {
+            "type": "boolean",
+            "default": false
+        },
+        "examples": {
+            "type": "array",
+            "items": true
+        },
+        "multipleOf": {
+            "type": "number",
+            "exclusiveMinimum": 0
+        },
+        "maximum": {
+            "type": "number"
+        },
+        "exclusiveMaximum": {
+            "type": "number"
+        },
+        "minimum": {
+            "type": "number"
+        },
+        "exclusiveMinimum": {
+            "type": "number"
+        },
+        "maxLength": { "$ref": "#/definitions/nonNegativeInteger" },
+        "minLength": { "$ref": "#/definitions/nonNegativeIntegerDefault0" },
+        "pattern": {
+            "type": "string",
+            "format": "regex"
+        },
+        "additionalItems": { "$ref": "#" },
+        "items": {
+            "anyOf": [
+                { "$ref": "#" },
+                { "$ref": "#/definitions/schemaArray" }
+            ],
+            "default": true
+        },
+        "maxItems": { "$ref": "#/definitions/nonNegativeInteger" },
+        "minItems": { "$ref": "#/definitions/nonNegativeIntegerDefault0" },
+        "uniqueItems": {
+            "type": "boolean",
+            "default": false
+        },
+        "contains": { "$ref": "#" },
+        "maxProperties": { "$ref": "#/definitions/nonNegativeInteger" },
+        "minProperties": { "$ref": "#/definitions/nonNegativeIntegerDefault0" },
+        "required": { "$ref": "#/definitions/stringArray" },
+        "additionalProperties": { "$ref": "#" },
+        "definitions": {
+            "type": "object",
+            "additionalProperties": { "$ref": "#" },
+            "default": {}
+        },
+        "properties": {
+            "type": "object",
+            "additionalProperties": { "$ref": "#" },
+            "default": {}
+        },
+        "patternProperties": {
+            "type": "object",
+            "additionalProperties": { "$ref": "#" },
+            "propertyNames": { "format": "regex" },
+            "default": {}
+        },
+        "dependencies": {
+            "type": "object",
+            "additionalProperties": {
+                "anyOf": [
+                    { "$ref": "#" },
+                    { "$ref": "#/definitions/stringArray" }
+                ]
+            }
+        },
+        "propertyNames": { "$ref": "#" },
+        "const": true,
+        "enum": {
+            "type": "array",
+            "items": true,
+            "minItems": 1,
+            "uniqueItems": true
+        },
+        "type": {
+            "anyOf": [
+                { "$ref": "#/definitions/simpleTypes" },
+                {
+                    "type": "array",
+                    "items": { "$ref": "#/definitions/simpleTypes" },
+                    "minItems": 1,
+                    "uniqueItems": true
+                }
+            ]
+        },
+        "format": { "type": "string" },
+        "contentMediaType": { "type": "string" },
+        "contentEncoding": { "type": "string" },
+        "if": { "$ref": "#" },
+        "then": { "$ref": "#" },
+        "else": { "$ref": "#" },
+        "allOf": { "$ref": "#/definitions/schemaArray" },
+        "anyOf": { "$ref": "#/definitions/schemaArray" },
+        "oneOf": { "$ref": "#/definitions/schemaArray" },
+        "not": { "$ref": "#" }
+    },
+    "default": true
+}
+```
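The block above is the JSON Schema draft-07 meta-schema, which describes what a valid schema looks like. Assuming the model is meant to receive a concrete schema that conforms to it, a hypothetical extraction schema could be built and serialized as below; the `title`, `price`, and `tags` fields are invented for illustration and do not come from the model card.

```python
import json

# Hypothetical draft-07 schema describing fields to extract from a product page.
example_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Product name"},
        "price": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title"],
}

# construct_messages concatenates the schema into the prompt, so it must be a string.
schema_text = json.dumps(example_schema, indent=2)
```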
+You can combine the prompt, schema, and HTML together using something like:
+```python
+def construct_messages(schema, html):
+    """Construct messages for the OpenAI API.
+
+    `schema` is the JSON schema as a string and `html` is the cleaned HTML.
+    Assumes `response_prompt` is a dict holding the three prompt parts shown above.
+    """
+    user_prompt = (
+        response_prompt["prompt_part_one"]
+        + "\n\n" + schema + "\n\n"
+        + response_prompt["prompt_part_two"]
+        + "\n\n" + html + "\n\n"
+        + response_prompt["prompt_part_three"]
+    )
+    messages = [
+        {"role": "system", "content": "You are a helpful assistant"},
+        {"role": "user", "content": user_prompt},
+    ]
+    return messages
+```
+Here, `schema` is the schema copied from above and `html` is the output of the lxml cleaning function. The model's output should be the filled-out JSON.
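Since the prompt only asks the model for valid JSON, it may be worth checking the reply before using it. The sketch below is one assumed approach; `parse_model_output` is a hypothetical helper, and stripping a surrounding Markdown code fence is a precaution, not documented model behavior.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_model_output(text: str) -> dict:
    """Parse the model's reply as a JSON object, raising ValueError otherwise."""
    cleaned = text.strip()
    # Precaution: drop a surrounding Markdown code fence if the reply has one.
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()
        # Remove the opening fence line and, if present, the closing fence line.
        lines = lines[1:-1] if lines[-1].strip() == FENCE else lines[1:]
        cleaned = "\n".join(lines)
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type.
    result = json.loads(cleaned)
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object at the top level")
    return result
```

A caller could catch `ValueError` and retry the request or log the raw reply for inspection.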