How to use thearnabsarkar/json-semval-minilm-v1 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="thearnabsarkar/json-semval-minilm-v1")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
model = AutoModelForSequenceClassification.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
Hybrid rule+ML model that detects semantic JSON issues and suggests minimal fixes.
This model is fine-tuned from nreimers/MiniLM-L6-H384-uncased to classify semantic errors in JSON payloads and predict appropriate fix actions. It works in conjunction with a deterministic rules engine (JSON Schema validation) to provide a hybrid validation approach.
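A minimal sketch of how such a hybrid flow might be wired, assuming the deterministic rules run first and the classifier is consulted only when a rule fails. `check_types` is a hypothetical, stdlib-only stand-in for a real JSON Schema validator, and `classify` is a hypothetical placeholder for the model call; neither name comes from this repository.

```python
from typing import Callable

# Maps JSON Schema type names to Python types (small subset, for illustration only).
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_types(schema: dict, payload: dict) -> list[str]:
    """Hypothetical stand-in for a JSON Schema validator: top-level type checks only."""
    errors = []
    for key, spec in schema.get("properties", {}).items():
        if key not in payload:
            continue
        expected = _JSON_TYPES.get(spec.get("type"))
        value = payload[key]
        # bool is a subclass of int in Python, so reject it explicitly for numeric types.
        bool_as_number = isinstance(value, bool) and spec.get("type") in ("integer", "number")
        if expected and (not isinstance(value, expected) or bool_as_number):
            errors.append(f"{key}: expected {spec['type']}")
    return errors


def validate_hybrid(schema: dict, payload: dict, classify: Callable[[dict, dict], str]) -> dict:
    """Run deterministic rules first; ask the classifier only when a rule fails."""
    errors = check_types(schema, payload)
    if not errors:
        return {"valid": True, "errors": [], "predicted_error": None}
    return {"valid": False, "errors": errors, "predicted_error": classify(schema, payload)}


# Usage with a dummy classifier in place of the fine-tuned model.
schema = {"type": "object", "properties": {"age": {"type": "integer"}}}
result = validate_hybrid(schema, {"age": "25"}, classify=lambda s, p: "number_text")
print(result)
```

Running the rules first keeps valid payloads off the model path entirely, so the classifier only has to explain failures the schema has already detected.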
Dataset: thearnabsarkar/json-semval-synth-v1
Error types:
wrong_type - Incorrect data type
alias_key - Alternative field names
invalid_date - Malformed dates
enum_near_miss - Close but incorrect enum values
cross_field - Logical inconsistencies across fields
boolean_text - Text representations of booleans
number_text - Text representations of numbers
extra_key - Unexpected additional properties
Fix actions:
rename_key - Rename field to expected name
cast_number - Convert text to number
cast_bool - Convert text to boolean
parse_date_iso - Parse and normalize dates
map_enum - Fuzzy match to valid enum value
swap_dates - Fix inverted date ranges
fill_default - Use schema default value
Files:
model.safetensors - PyTorch model weights
model.onnx - ONNX export for fast CPU inference
config.json - Model configuration
tokenizer.json / tokenizer_config.json / vocab.txt - Tokenizer files
reports/metrics.json - Evaluation metrics
See reports/metrics.json in this repository for detailed evaluation metrics on the synthetic test set.
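To make the fix actions listed above concrete, here is a rough sketch of what each one could look like as a plain Python function. These are illustrative implementations only, not the code used by the project: the accepted boolean spellings, the date formats tried, the fuzzy-match cutoff, and the rule that `fill_default` applies only to missing keys are all assumptions.

```python
import difflib
from datetime import datetime


def rename_key(payload: dict, old: str, new: str) -> dict:
    """rename_key: move a value from an alias key to the expected key."""
    fixed = dict(payload)
    fixed[new] = fixed.pop(old)
    return fixed


def cast_number(value: str):
    """cast_number: convert numeric text to int, falling back to float."""
    text = value.strip()
    try:
        return int(text)
    except ValueError:
        return float(text)


def cast_bool(value: str) -> bool:
    """cast_bool: convert common boolean spellings (the accepted set is an assumption)."""
    text = value.strip().lower()
    if text in {"true", "yes", "y", "1"}:
        return True
    if text in {"false", "no", "n", "0"}:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def parse_date_iso(value: str, formats=("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")) -> str:
    """parse_date_iso: try a few assumed input formats and emit YYYY-MM-DD."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def map_enum(value: str, allowed: list[str], cutoff: float = 0.6) -> str:
    """map_enum: fuzzy-match a near miss to the closest allowed value."""
    matches = difflib.get_close_matches(value, allowed, n=1, cutoff=cutoff)
    if not matches:
        raise ValueError(f"no enum value close to {value!r}")
    return matches[0]


def swap_dates(payload: dict, start_key: str, end_key: str) -> dict:
    """swap_dates: swap two ISO date strings when the range is inverted."""
    fixed = dict(payload)
    if fixed[start_key] > fixed[end_key]:  # ISO dates sort lexicographically
        fixed[start_key], fixed[end_key] = fixed[end_key], fixed[start_key]
    return fixed


def fill_default(payload: dict, key: str, schema: dict) -> dict:
    """fill_default: insert the schema default (assumed to apply only when the key is missing)."""
    fixed = dict(payload)
    fixed.setdefault(key, schema["properties"][key]["default"])
    return fixed


print(cast_number("25"), cast_bool("Yes"), map_enum("actve", ["active", "inactive", "pending"]))
```

Each function returns a new value or a copied dict rather than mutating its input, which keeps a proposed fix easy to diff against the original payload before it is accepted.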
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import json
model = AutoModelForSequenceClassification.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
tokenizer = AutoTokenizer.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
schema = {"type": "object", "properties": {"age": {"type": "integer"}}}
payload = {"age": "25"}
input_text = f"Schema: {json.dumps(schema)} JSON: {json.dumps(payload)}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)
# Get predicted error type
predicted_class = outputs.logits.argmax(-1).item()
error_types = ["wrong_type", "alias_key", "invalid_date", "enum_near_miss",
"cross_field", "boolean_text", "number_text", "extra_key"]
print(f"Predicted error: {error_types[predicted_class]}")
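The argmax above discards how confident the classifier is. As a hedged sketch, the logits can be turned into probabilities with a softmax and gated on a confidence threshold before any fix is suggested. The logits and the 0.5 threshold below are made up for illustration; in practice the list would come from `outputs.logits[0].tolist()`. The label order is copied from the example above, and it may be worth confirming it against `model.config.id2label`.

```python
import math

ERROR_TYPES = ["wrong_type", "alias_key", "invalid_date", "enum_near_miss",
               "cross_field", "boolean_text", "number_text", "extra_key"]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits: list[float], threshold: float = 0.5):
    """Return (label, probability); label is None when below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = ERROR_TYPES[best] if probs[best] >= threshold else None
    return label, probs[best]


# Made-up logits for illustration; index 6 ("number_text") dominates.
example_logits = [0.1, -1.2, 0.0, -0.5, -0.3, 0.4, 4.0, -2.0]
label, prob = top_prediction(example_logits)
print(label, round(prob, 3))
```

Returning `None` for low-confidence predictions gives the caller a way to fall back to the rules engine's error message rather than act on an uncertain guess.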
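import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
# Assumes model.onnx has been downloaded from this repository to the working directory
session = ort.InferenceSession("model.onnx")
# Tokenize to NumPy arrays; input_text is built as in the PyTorch example
enc = tokenizer(input_text, return_tensors="np", truncation=True, max_length=512)
# Feed only the inputs the exported graph declares (names can differ between exports)
outputs = session.run(None, {i.name: enc[i.name].astype(np.int64) for i in session.get_inputs()})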
For the complete validation and auto-fixing pipeline, see:
This model is designed for:
@misc{json-semval-minilm-v1,
author = {Arnab Sarkar},
title = {JSON Semantic Validator - MiniLM v1},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/thearnabsarkar/json-semval-minilm-v1}
}
Base model: nreimers/MiniLM-L6-H384-uncased