---
library_name: transformers
base_model: meta-llama/Llama-3.1-8B-Instruct
---

<p align="center">
  <img alt="Schematron" src="https://huggingface.co/inference-net/Schematron-3B/resolve/main/Banner.png">
</p>

<p align="center">
  <a href="https://docs.inference.net/use-cases/json-extraction"><strong>Documentation</strong></a> ·
  <a href="https://inference.net/models/schematron-8b"><strong>Serverless API</strong></a> ·
  <a href="https://inference.net/blog/Schematron"><strong>Announcement blog</strong></a>
</p>

<br>

## Model Overview

Welcome to the Schematron series, [Inference.net's](https://inference.net/) long‑context extraction models, specialized in converting noisy HTML into clean, typed JSON that conforms to your custom schema. The Schematron series was purpose‑trained for web scraping, data ingestion, and transforming arbitrary pages into structured records.

We're releasing these models in two different sizes:

- **Schematron‑8B**: marginal quality lift on harder/longer pages
- **Schematron‑3B**: recommended default; near‑parity quality at roughly half the cost of Schematron‑8B

> [!NOTE]
> This model card is dedicated to the larger `Schematron-8B` model. Check out [`Schematron-3B`](https://huggingface.co/inference-net/Schematron-3B) for the smaller model.

## I/O at a glance

- **Input**: Cleaned HTML plus a JSON Schema (which can be generated from a typed model such as Pydantic or Zod)
- **Output**: Strictly valid JSON conforming to the provided schema (no narration)

> [!NOTE]
> The JSON Schema passed as input must be a valid [JSON Schema draft-07](https://json-schema.org/draft-07/schema) document.
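
For illustration, a minimal draft-07 schema might look like the following sketch. The field names here are hypothetical, not a required shape; any valid draft-07 schema works:

```python
import json

# Hypothetical draft-07 schema for a product listing page (illustrative only).
product_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Product name as shown on the page"},
        "price": {"type": "number", "description": "Listed price as a number"},
        "in_stock": {"type": "boolean"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "price"],
}

# Serialize the schema so it can be embedded in the prompt alongside the HTML.
schema_str = json.dumps(product_schema, indent=2)
```

Descriptive `description` strings double as instructions to the model, so being explicit there tends to pay off on ambiguous pages.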

## Highlights

- **Schema-first extraction**: 100% schema‑conformant JSON outputs
- **Long context**: Robust to lengthy, noisy HTML (up to 128K tokens)
- **Variants**: 3B (default, most cost‑efficient) · 8B (marginal quality lift at ~2× cost)

## Model Details

- **Family**: Schematron (3B and 8B)
- **Context window**: Up to 128K tokens
- **Input**: Cleaned or raw HTML and a JSON Schema
- **Output**: Strict JSON that conforms to the provided schema

## Benchmarks

### HTML-to-JSON Extraction Quality

We evaluated extraction quality using Gemini 2.5 Pro as a judge, scoring extractions on a 1-5 scale, where 5 represents a perfect extraction.

| Model | LLM-as-Judge Score |
|-------|-------------------|
| GPT-4.1 | 4.74 |
| **Schematron-8B** | **4.64** |
| **Schematron-3B** | **4.41** |
| Gemini-3B-Base | 2.24 |

### Web-Augmented Factuality on SimpleQA

We evaluated Schematron's real-world impact on LLM factuality using SimpleQA.

**Test Pipeline:**

1. **Query Generation**: The primary LLM (GPT-5 Nano or GPT-4.1) generates search queries and defines the extraction schema
2. **Web Search**: A search provider (SERP or Exa) retrieves relevant pages
3. **Structured Extraction**: Schematron extracts JSON data from the retrieved pages using the schema
4. **Answer Synthesis**: The primary LLM produces the final answer from the structured data



**Key findings:**

- Web search paired with JSON extraction improves factuality: adding Schematron with web retrieval improves GPT-5 Nano's accuracy from 8.54% to 82.87%, nearly a 10x improvement
- Search provider matters: Exa (82.9%) significantly outperforms SERP (64.2%) for factual retrieval, while also being more cost-effective
- Structured extraction beats raw HTML: processing raw HTML would require 100k+ tokens for 10 searches; Schematron's JSON extraction reduces this by orders of magnitude
- Small specialized models win: Schematron-8B (82.87%) outperforms the much larger Gemini 2.5 Flash (80.61%) on this task, showing that fine-tuning for well-defined tasks beats general-purpose models
- Performance scales with model quality: when paired with GPT-4.1, Schematron achieves 85.58% accuracy, showing the approach benefits from stronger base models

## Minimal Quickstart

Use these local snippets to prepare HTML and compose a schema‑guided prompt. The model returns strictly valid JSON; validate it against your schema downstream.

```python
import lxml.html as LH
from lxml.html.clean import Cleaner

# Cleaner that strips scripts, JavaScript, and styling while keeping other attributes.
HTML_CLEANER = Cleaner(
    scripts=True,
    javascript=True,
    style=True,
    inline_style=True,
    safe_attrs_only=False,
)


def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml."""
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        return ""
```

Compose messages with your schema and the cleaned HTML:

```python
def construct_messages(schema: str, html: str):
    """Construct messages for a schema‑guided extraction request."""
    response_prompt = {
        "prompt_part_one": (
            "You are going to be given a JSON schema following the standardized JSON "
            "Schema format. You are going to be given a HTML page and you are going "
            "to apply the schema to the HTML page however you see it as applicable "
            "and return the results in a JSON object. The schema is as follows:"
        ),
        "prompt_part_two": "Here is the HTML page:",
        "prompt_part_three": "MAKE SURE ITS VALID JSON.",
    }

    user_prompt = (
        response_prompt["prompt_part_one"]
        + "\n\n" + schema + "\n\n"
        + response_prompt["prompt_part_two"]
        + "\n\n" + html + "\n\n"
        + response_prompt["prompt_part_three"]
    )

    return [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": user_prompt},
    ]
```

> [!NOTE]
> In the [serverless API](https://inference.net/models/schematron-3b) there's no need to pass anything but the HTML. We handle the prompt formatting for you.

## Recommendations

- Use temperature 0 and JSON mode for deterministic, parseable output
- Validate responses against your schema (e.g., with Pydantic or Zod)
- Pre‑clean HTML (remove scripts/styles) when possible, but avoid over‑aggressive removal
- Cleaning the HTML with lxml is not required, but is recommended because it matches the training data
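
The validation step can stay lightweight. Below is a stdlib-only sketch of the idea; the `parse_and_check` helper is hypothetical, not part of any Schematron API, and in practice a Pydantic or Zod model gives you richer type checks:

```python
import json


def parse_and_check(raw: str, required: list[str]) -> dict:
    """Parse a model response and verify that required schema keys are present."""
    record = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    missing = [key for key in required if key not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return record


# Example: a response for a schema that requires 'title' and 'price'.
record = parse_and_check('{"title": "Widget", "price": 9.99}', ["title", "price"])
```

Failing fast here lets you retry the extraction or flag the page before bad records reach your datastore.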

## Limitations

- Static HTML only; render client‑side content upstream
- Very large pages may require truncation
- Extraction of ambiguous fields depends on schema clarity; be explicit in field descriptions
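
Truncation can be as simple as a character budget on the cleaned HTML. A minimal sketch, assuming roughly 4 characters per token (a heuristic, not an exact tokenizer count) and leaving headroom below the 128K-token window:

```python
def truncate_html(html: str, max_tokens: int = 120_000, chars_per_token: int = 4) -> str:
    """Crude length guard: approximate tokens by characters and drop the tail."""
    budget = max_tokens * chars_per_token
    return html if len(html) <= budget else html[:budget]
```

Cutting the tail works because most pages front-load their main content, but for long listings you may prefer smarter chunking.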

## Safety and Responsible Use

- Extracted data may include personal or sensitive information present in the page; handle and store it responsibly
- Respect site terms, robots.txt, and applicable laws
- Apply downstream validation and guardrails for compliance

## License

See the license in the metadata above.

## Support

- Docs: https://docs.inference.net/use-cases/json-extraction
- Email: support@inference.net