opensporks committed on
Commit
6721498
·
verified ·
1 Parent(s): 2d92c73

Update README.md

Files changed (1): README.md +134 -206
README.md CHANGED
@@ -3,233 +3,161 @@ library_name: transformers
3
  base_model: meta-llama/Llama-3.1-8B-Instruct
4
  ---
5
 
 
 
 
6
 
7
- ---
8
- language:
9
- - en
10
- license: apache-2.0
11
- tags:
12
- - VLM
13
- - video-understanding
14
- - image-captioning
15
- - gemma
16
- - json-mode
17
- - structured-output
18
- library_name: transformers
19
- base_model: meta-llama/Llama-3.1-8B-Instruct
20
- pipeline_tag: image-text-to-text
21
- model-index:
22
- - name: Schematron-8B
23
- results:
24
- - task:
25
- type: image-to-text
26
- name: Text Generation
27
- metrics:
28
- - name: Average Judge Score
29
- type: quality
30
- value: 3.53
31
- - name: ROUGE-1
32
- type: rouge-1
33
- value: 0.674
34
- - name: ROUGE-L
35
- type: rouge-l
36
- value: 0.520
37
- - name: BLEU
38
- type: bleu
39
- value: 0.267
40
- ---
41
-
42
- # ClipTagger-12b
43
 
44
- ![ClipTagger-12b](./assets/grass-x-inference.png)
45
 
46
- ## Model Description
 
47
 
48
- **ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.
49
 
50
- **ClipTagger-12b exceeds or matches the performance of GPT-4.1 and Claude 4 Sonnet, while costing 15x less per generation.**
 
51
 
52
- The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.
 
53
 
54
- ### Key Features
 
 
55
 
56
- - **Frontier-quality performance** - Comparable to top closed models in captioning quality
57
- - **Production-ready** - Battle-tested on trillion-scale video frame captioning workloads
58
- - **Schema-consistent JSON** - Reliable structured output for every frame
59
- - **Cost-efficient** - Optimized for high-throughput inference
60
- - **Open source** - Build and deploy without proprietary API dependencies
61
 
62
- ## Architecture
 
 
 
63
 
64
- ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.
65
-
66
- ### Technical Specifications
67
- - **Parameters**: 12 billion
68
- - **Base Architecture**: Gemma-12B
69
- - **Quantization**: FP8 (no quality loss vs bf16)
70
- - **Input**: Single video frame per request
71
- - **Output**: Structured JSON with fixed schema
72
- - **Supported Formats**: JPEG, PNG, WebP, GIF
73
- - **Max Image Size**: 1MB
74
-
75
- ## Training
76
-
77
- The model was trained on 1 million carefully curated single-frame samples from publicly available video data. Training employed knowledge distillation from a high-quality teacher model to ensure consistent, accurate outputs while maintaining the ability to generalize across diverse video content types.
78
-
79
- ### Training Process
80
- - **Dataset Size**: 1M video frames
81
- - **Training Method**: Teacher-student distillation
82
- - **Data Source**: Publicly available video content
83
- - **Focus**: Single-frame understanding with temporal awareness
84
 
85
  ## Benchmarks
86
 
87
- ClipTagger-12b performs **on par with or better than** the leading closed-source models across major evaluation metrics. Despite being open-source and significantly more cost-effective, it **outperforms Claude 4 Sonnet on every metric** and achieves **comparable quality to GPT-4.1**.
88
-
89
- Performance metrics on our internal evaluation set:
90
- | Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
91
- |-------|-----------------|---------|---------|---------|------|
92
- | cliptagger_12b | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
93
- | claude_4_sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
94
- | gpt_4.1 | 3.64 | 0.581 | 0.260 | 0.376 | 0.119 |
95
-
96
- We used Gemini-2.5-Pro as the judge model, which ranks ClipTagger-12b roughly equal to GPT-4.1, and better than Claude 4 Sonnet.
97
-
98
- <img src="./assets/judge-score.png" alt="Average Judge Score Comparison" width="100%" />
99
-
100
-
101
- FP8 quantization showed no measurable quality degradation compared to bf16 precision.
102
-
103
- ## Cost Comparison
104
-
105
- ClipTagger-12b delivers frontier-quality performance at a fraction of the cost of closed-source alternatives. Based on typical usage patterns (700 input tokens and 250 output tokens per generation), here's how the costs compare:
106
-
107
- <img src="./assets/cost.png" alt="Cost Comparison Per 1 Million Generations" width="100%" />
108
-
109
- ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost savings** compared to Claude 4 Sonnet, while maintaining comparable quality metrics.
110
-
111
- | Model | Input Cost/MTok | Output Cost/MTok | Cost per 1M Generations | Cost per Generation |
112
- | --------------- | --------------- | ---------------- | ----------------------- | ------------------- |
113
- | ClipTagger-12b | $0.30 | $0.50 | $335 | $0.000335 |
114
- | GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
115
- | Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |
116
-
117
-
118
- ## Usage
119
-
120
- ### API Access
121
-
122
- For production deployments, we recommend using our managed API service which includes advanced features like batch processing, webhooks, and automatic scaling:
123
-
124
- **[Run ClipTagger-12b via Inference.net API →](https://docs.inference.net/use-cases/video-understanding)**
125
-
126
- ### Required Prompts
127
-
128
- The model requires specific system and user prompts for optimal performance. Use these prompts exactly as shown:
129
-
130
- #### System Prompt
131
- ```
132
- You are an image annotation API trained to analyze YouTube video keyframes. You will be given instructions on the output format, what to caption, and how to perform your job. Follow those instructions. For descriptions and summaries, provide them directly and do not lead them with 'This image shows' or 'This keyframe displays...', just get right into the details.
133
- ```
134
-
135
- #### User Prompt
 
136
  ```
137
- You are an image annotation API trained to analyze YouTube video keyframes. You must respond with a valid JSON object matching the exact structure below.
138
-
139
- Your job is to extract detailed **factual elements directly visible** in the image. Do not speculate or interpret artistic intent, camera focus, or composition. Do not include phrases like "this appears to be", "this looks like", or anything about the image itself. Describe what **is physically present in the frame**, and nothing more.
140
-
141
- Return JSON in this structure:
142
-
143
- {
144
- "description": "A detailed, factual account of what is visibly happening (4 sentences max). Only mention concrete elements or actions that are clearly shown. Do not include anything about how the image is styled, shot, or composed. Do not lead the description with something like 'This image shows' or 'this keyframe is...', just get right into the details.",
145
- "objects": ["object1 with relevant visual details", "object2 with relevant visual details", ...],
146
- "actions": ["action1 with participants and context", "action2 with participants and context", ...],
147
- "environment": "Detailed factual description of the setting and atmosphere based on visible cues (e.g., interior of a classroom with fluorescent lighting, or outdoor forest path with snow-covered trees).",
148
- "content_type": "The type of content it is, e.g. 'real-world footage', 'video game', 'animation', 'cartoon', 'CGI', 'VTuber', etc.",
149
- "specific_style": "Specific genre, aesthetic, or platform style (e.g., anime, 3D animation, mobile gameplay, vlog, tutorial, news broadcast, etc.)",
150
- "production_quality": "Visible production level: e.g., 'professional studio', 'amateur handheld', 'webcam recording', 'TV broadcast', etc.",
151
- "summary": "One clear, comprehensive sentence summarizing the visual content of the frame. Like the description, get right to the point.",
152
- "logos": ["logo1 with visual description", "logo2 with visual description", ...]
153
- }
154
-
155
- Rules:
156
- - Be specific and literal. Focus on what is explicitly visible.
157
- - Do NOT include interpretations of emotion, mood, or narrative unless it's visually explicit.
158
- - No artistic or cinematic analysis.
159
- - Always include the language of any text in the image if present as an object, e.g. "English text", "Japanese text", "Russian text", etc.
160
- - Maximum 10 objects and 5 actions.
161
- - Return an empty array for 'logos' if none are present.
162
- - Always output strictly valid JSON with proper escaping.
163
- - Output **only the JSON**, no extra text or explanation.
164
- ```
165
-
166
- ### Inference Parameters
167
-
168
- - **Temperature**: 0.1 (recommended for consistency)
169
- - **Max Tokens**: 2000
170
- - **Response Format**: `{"type": "json_object"}`
171
-
172
- ### Output Schema
173
-
174
- The model outputs a fixed JSON structure with the following fields:
175
 
176
- ```json
177
- {
178
- "description": "string - Detailed factual description (max 4 sentences)",
179
- "objects": ["array of strings - Up to 10 objects with visual details"],
180
- "actions": ["array of strings - Up to 5 actions with context"],
181
- "environment": "string - Setting and atmosphere description",
182
- "content_type": "string - Type of visual content",
183
- "specific_style": "string - Genre or style classification",
184
- "production_quality": "string - Production level assessment",
185
- "summary": "string - Single sentence summary",
186
- "logos": ["array of strings - Detected logos with descriptions"]
187
- }
188
  ```
189
 
190
- ## Example Output
191
-
192
- Given a nature scene with a wooden boardwalk through grassland:
193
-
194
- ```json
195
- {
196
- "description": "A wooden boardwalk path extends from the foreground into the distance, cutting through a field of tall, vibrant green grass. The path is flanked on both sides by the dense grass. In the background, a line of trees is visible on the horizon under a blue sky with scattered white clouds.",
197
- "objects": [
198
- "Wooden boardwalk",
199
- "Tall green grass",
200
- "Blue sky",
201
- "White clouds",
202
- "Trees"
203
- ],
204
- "actions": [],
205
- "environment": "An outdoor, natural landscape, likely a marsh or wetland, on a clear day. The scene is characterized by a wooden boardwalk, lush green vegetation, and a bright blue sky with wispy clouds.",
206
- "content_type": "real-world footage",
207
- "specific_style": "landscape photography",
208
- "production_quality": "professional photography",
209
- "summary": "A wooden boardwalk path winds through a lush green field under a bright blue sky with scattered clouds.",
210
- "logos": []
211
- }
212
- ```
213
-
214
- ## Use Cases
215
 
216
- - **Video Search & Discovery** - Build searchable databases with structured metadata
217
- - **Content Moderation** - Automated content analysis and categorization
218
- - **Accessibility** - Generate consistent alt-text and scene descriptions
219
- - **Ad Verification** - Track product visibility and brand appearances
220
- - **Video Analytics** - Extract insights from large video collections
221
- - **Content Management** - Automatic tagging and organization of video libraries
222
 
223
- ## Interested in training your own model?
 
 
 
 
224
 
225
- Contact us at [partners@inference.net](mailto:partners@inference.net) for a free consultation with our research team.
 
 
 
226
 
227
- ## Support
228
-
229
- - **Documentation**: [docs.inference.net](https://docs.inference.net/use-cases/video-understanding)
230
- - **API Access**: Get $25 in free credits when you [sign up](https://inference.net/register) for an account
231
- - **Email**: support@inference.net
232
 
233
  ## License
 
234
 
235
- This model is released under the Apache-2.0 license, allowing for commercial use and modification with proper attribution.
 
 
 
3
  base_model: meta-llama/Llama-3.1-8B-Instruct
4
  ---
5
 
6
+ <p align="center">
7
+ <img alt="Schematron" src="https://huggingface.co/inference-net/Schematron-3B/resolve/main/Banner.png">
8
+ </p>
9
 
10
+ <p align="center">
11
+ <a href="https://docs.inference.net/use-cases/json-extraction"><strong>Documentation</strong></a> ·
12
+ <a href="https://inference.net/models/schematron-8b"><strong>Serverless API</strong></a> ·
13
+ <a href="https://inference.net/blog/Schematron"><strong>Announcement blog</strong></a>
14
+ </p>
15
 
16
+ <br>
17
 
18
+ ## Model Overview
19
+ Welcome to the Schematron series, [Inference.net's](https://inference.net/) long‑context extraction models specialized in converting noisy HTML into clean, typed JSON that conforms to your custom schema. The Schematron series was purpose‑trained for web scraping, data ingestion, and transforming arbitrary pages into structured records.
20
 
21
+ We're releasing these models in two different sizes:
22
 
23
+ - **Schematron‑8B**: marginal quality lift on harder/longer pages (~2× the cost of Schematron‑3B)
24
+ - **Schematron‑3B**: recommended default; near‑parity quality at ~50% of the cost of Schematron‑8B
25
 
26
+ > [!NOTE]
27
+ > This model card is dedicated to the larger `Schematron-8B` model. Check out [`Schematron-3B`](https://huggingface.co/inference-net/Schematron-3B) for the smaller model.
28
 
29
+ ## I/O at a glance
30
+ - **Input**: Cleaned HTML + a JSON Schema (which can be generated from typed models such as Pydantic or Zod)
31
+ - **Output**: Strictly valid JSON conforming to the provided schema (no narration)
32
 
33
+ > [!NOTE]
34
+ > The JSON Schema passed as input needs to conform to [JSON Schema draft-07](https://json-schema.org/draft-07/schema).
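For illustration, a schema like the following could be serialized and passed in the prompt (the product fields here are hypothetical, not a required layout):

```python
import json

# Illustrative draft-07 schema for a product page; all field names are
# made up for this example.
product_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Product title as shown on the page"},
        "price": {"type": ["number", "null"], "description": "Listed price, if visible"},
        "in_stock": {"type": ["boolean", "null"]},
    },
    "required": ["name"],
}

# Serialize and pass this string alongside the cleaned HTML in the prompt.
schema_str = json.dumps(product_schema, indent=2)
```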
 
 
 
35
 
36
+ ## Highlights
37
+ - **Schema-first extraction**: 100% schema‑conformant JSON outputs
38
+ - **Long context**: Robust to lengthy, noisy HTML (up to 128K tokens)
39
+ - **Variants**: 3B (default, most cost‑efficient) · 8B (marginal quality lift at ~2× cost)
40
 
41
+ ## Model Details
42
+ - **Family**: Schematron (3B and 8B)
43
+ - **Context window**: Up to 128K tokens
44
+ - **Input**: Cleaned or raw HTML and a JSON Schema
45
46
 
47
  ## Benchmarks
48
 
49
+ ### HTML-to-JSON Extraction Quality
50
+
51
+ We evaluated extraction quality using Gemini 2.5 Pro as a judge, scoring extractions from 1-5 where 5 represents perfect extraction.
52
+
53
+ | Model | LLM-as-Judge Score |
54
+ |-------|-------------------|
55
+ | GPT-4.1 | 4.74 |
56
+ | **Schematron-8B** | **4.64** |
57
+ | **Schematron-3B** | **4.41** |
58
+ | Gemini-3B-Base | 2.24 |
59
+
60
+ ### Web-Augmented Factuality on SimpleQA
61
+
62
+ We evaluated Schematron's real-world impact on LLM factuality using SimpleQA.
63
+
64
+ **Test Pipeline:**
65
+ 1. **Query Generation**: Primary LLM (GPT-5 Nano or GPT-4.1) generates search queries and defines extraction schema
66
+ 2. **Web Search**: Search provider (SERP or Exa) retrieves relevant pages
67
+ 3. **Structured Extraction**: Schematron extracts JSON data from retrieved pages using the schema
68
+ 4. **Answer Synthesis**: Primary LLM produces final answer from structured data
69
+
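The four steps above can be sketched as a single loop. Every helper below is an illustrative stand-in, not a real Inference.net or search-provider API:

```python
# Hypothetical sketch of the SimpleQA test pipeline; swap the stubs for
# real LLM, search, and Schematron calls.
def plan(question: str):
    """1. Query generation: primary LLM plans search queries + extraction schema."""
    return [question], {"type": "object"}

def search(query: str) -> str:
    """2. Web search: retrieve a relevant page (e.g., via Exa or SERP)."""
    return "<html><body>stub page</body></html>"

def extract(page: str, schema: dict) -> dict:
    """3. Structured extraction: Schematron turns HTML into schema-shaped JSON."""
    return {"snippet": page[:24]}

def synthesize(question: str, records: list) -> str:
    """4. Answer synthesis: primary LLM answers from the structured records."""
    return f"answer from {len(records)} page(s)"

def answer_with_web_extraction(question: str) -> str:
    queries, schema = plan(question)
    pages = [search(q) for q in queries]
    records = [extract(p, schema) for p in pages]
    return synthesize(question, records)
```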
70
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6626a246891c75742bd19aaf/mU_01IPsf0FvkXYNYstRZ.png)
71
+
72
+ **Key findings:**
73
+ - Web search paired with JSON extraction improves factuality: adding Schematron with web retrieval lifts GPT-5 Nano's accuracy from 8.54% to 82.87%, nearly a 10x improvement
74
+ - Search provider matters: Exa (82.9%) significantly outperforms SERP (64.2%) for factual retrieval, while also being more cost-effective
75
+ - Structured extraction beats raw HTML: Processing raw HTML would require 100k+ tokens for 10 searches; Schematron's JSON extraction reduces this by orders of magnitude
76
+ - Small specialized models win: Schematron-8B (82.87%) outperforms the much larger Gemini 2.5 Flash (80.61%) on this task, showing that fine-tuning for well-defined tasks beats general-purpose models
77
+ - Performance scales with model quality: When paired with GPT-4.1, Schematron achieves 85.58% accuracy, showing the approach benefits from stronger base models
78
+
79
+ ## Minimal Quickstart
80
+ Use these local snippets to prepare HTML and compose a schema‑guided prompt. The model returns strictly valid JSON; validate it against your schema downstream.
81
+
82
+ ```python
83
+ from lxml.html.clean import Cleaner  # on lxml >= 5.2 this requires the `lxml_html_clean` package
+ import lxml.html as LH
+
+ HTML_CLEANER = Cleaner(
+     scripts=True,         # drop <script> elements
+     javascript=True,      # strip JavaScript attributes and links
+     style=True,           # drop <style> elements
+     inline_style=True,    # strip style="..." attributes
+     safe_attrs_only=False,
+ )
+
+
+ def strip_noise(html: str) -> str:
+     """Remove scripts, styles, and JavaScript from HTML using lxml."""
+     if not html or not html.strip():
+         return ""
+     try:
+         doc = LH.fromstring(html)
+         cleaned = HTML_CLEANER.clean_html(doc)
+         return LH.tostring(cleaned, encoding="unicode")
+     except Exception:
+         return ""
106
  ```
107
 
108
+ Compose messages with your schema and cleaned HTML:
109
+
110
+ ```python
111
+ def construct_messages(schema: str, html: str):
+     """Construct messages for a schema‑guided extraction request."""
+     response_prompt = {
+         "prompt_part_one": (
+             "You are going to be given a JSON schema following the standardized JSON "
+             "Schema format. You are going to be given a HTML page and you are going "
+             "to apply the schema to the HTML page however you see it as applicable "
+             "and return the results in a JSON object. The schema is as follows:"
+         ),
+         "prompt_part_two": "Here is the HTML page:",
+         "prompt_part_three": "MAKE SURE ITS VALID JSON.",
+     }
+
+     user_prompt = (
+         response_prompt["prompt_part_one"]
+         + "\n\n" + schema + "\n\n"
+         + response_prompt["prompt_part_two"]
+         + "\n\n" + html + "\n\n"
+         + response_prompt["prompt_part_three"]
+     )
+
+     return [
+         {"role": "system", "content": "You are a helpful assistant"},
+         {"role": "user", "content": user_prompt},
+     ]
136
  ```
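A request for an OpenAI-compatible endpoint can then combine these messages with the recommended decoding settings (temperature 0, JSON mode). The model id below is a placeholder, not a confirmed identifier:

```python
def build_payload(messages: list[dict]) -> dict:
    """Assemble an OpenAI-style chat-completion payload.
    The model id is a placeholder; temperature 0 and JSON mode follow the
    recommendations in this card."""
    return {
        "model": "inference-net/schematron-8b",  # placeholder id
        "messages": messages,
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }

payload = build_payload([{"role": "user", "content": "schema + HTML prompt here"}])
```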
137
 
138
+ > [!NOTE]
139
+ > In the [serverless API](https://inference.net/models/schematron-8b) there's no need to pass anything but the HTML and your schema; we handle the prompt formatting for you.
140
141
 
142
+ ## Recommendations
143
+ - Temperature 0 and JSON mode for deterministic, parseable output
144
+ - Validate responses against your schema (e.g., Pydantic or Zod)
145
+ - Pre‑clean HTML (remove scripts/styles) when possible; avoid over‑aggressive removal
146
+ - Using lxml to clean the HTML is not required, but is recommended as it matches the training data.
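Validation from the second recommendation can be as small as a stdlib check. This is a minimal sketch; use jsonschema, Pydantic, or Zod for real coverage:

```python
import json

def validate_output(raw: str, schema: dict) -> dict:
    """Parse model output and check required fields.
    Minimal stdlib check only; a full validator should also check types."""
    data = json.loads(raw)
    missing = [f for f in schema.get("required", []) if f not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data

record = validate_output('{"name": "Widget"}', {"required": ["name"]})
```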
147
 
148
+ ## Limitations
149
+ - Static HTML only; render client‑side content upstream
150
+ - Very large pages may require truncation
151
+ - Ambiguous fields depend on schema clarity; be explicit in field descriptions
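For the truncation point above, a crude guard can budget characters against the 128K-token window. The 4-characters-per-token ratio is a rough assumption, not the model's real tokenizer:

```python
def truncate_html(html: str, max_tokens: int = 128_000, chars_per_token: int = 4) -> str:
    """Trim HTML to fit the context window using a rough chars-per-token
    heuristic; use the actual tokenizer for an exact budget."""
    budget = max_tokens * chars_per_token
    return html if len(html) <= budget else html[:budget]
```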
152
 
153
+ ## Safety and Responsible Use
154
+ - Extracted data may include personal or sensitive information present in the page—handle and store responsibly
155
+ - Respect site terms, robots.txt, and applicable laws
156
+ - Use downstream validation and guardrails for compliance
 
157
 
158
  ## License
159
+ See license in the metadata above.
160
 
161
+ ## Support
162
+ - Docs: https://docs.inference.net/use-cases/json-extraction
163
+ - Email: support@inference.net