# MasterControlAIML R1-Qwen2.5-1.5b SFT R1 JSON Unstructured-To-Structured LoRA Model

[![Unsloth](https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png)](https://github.com/unslothai/unsloth)

This repository provides a fine-tuned Qwen2.5 model optimized for transforming unstructured text into structured JSON output that conforms to a predefined schema. The model is fine-tuned from the base model **MasterControlAIML/DeepSeek-R1-Strategy-Qwen-2.5-1.5b-Unstructured-To-Structured** and uses LoRA for efficient adaptation.

> **Key Highlights:**
>
> - **Developed by:** [bhaviktheslider](https://github.com/bhaviktheslider)
> - **License:** [Apache-2.0](LICENSE)
> - **Fine-tuned from:** `MasterControlAIML/DeepSeek-R1-Strategy-Qwen-2.5-1.5b-Unstructured-To-Structured`
> - **Accelerated Training:** Achieved 2x faster training using [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

---

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
  - [Using Unsloth for Fast Inference](#using-unsloth-for-fast-inference)
  - [Using Transformers for Inference](#using-transformers-for-inference)
  - [Advanced Example with LangChain Prompt](#advanced-example-with-langchain-prompt)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgments](#acknowledgments)

---

## Overview

This model is tailored for tasks that require mapping unstructured text (e.g., manuals, QA documents) into a structured JSON format. It supports hierarchical data extraction driven by a given JSON Schema, ensuring that the generated outputs follow the exact structure and rules the schema defines.

---

## Features

- **Efficient Inference:** Uses the [Unsloth](https://github.com/unslothai/unsloth) library for fast model inference.
- **Structured Output:** Maps text inputs into a strict JSON schema with hierarchical relationships.
- **Flexible Integration:** Example code snippets show how to use both the Unsloth API and Hugging Face's Transformers.
- **Advanced Prompting:** Includes an example of using LangChain prompt templates for detailed, instruction-driven output.

---

## Installation

### Prerequisites

- **Python:** 3.8+
- **PyTorch:** preferably with CUDA support
- **Required Libraries:** `transformers`, `torch`, `unsloth`, `langchain` (for advanced usage)

### Installation Command

Install the required Python packages with:

```bash
pip install torch transformers unsloth langchain
```

---

## Quick Start

### Using Unsloth for Fast Inference

The Unsloth library lets you quickly load the model and run inference. Below is a basic example:

```python
from unsloth import FastLanguageModel
import torch

# Specify the model name
MODEL = "MasterControlAIML/DeepSeek-R1-Qwen2.5-1.5b-SFT-R1-JSON-Unstructured-To-Structured"

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,
)

# Prepare the model for inference
FastLanguageModel.for_inference(model)

# Define a prompt template
ALPACA_PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}
"""

# Example: create the input and generate output
instruction = "Provide a summary of the Quality Assurance Manual."
prompt = ALPACA_PROMPT.format(instruction, "")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=2000)

# Decode and print the generated text
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```
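In practice, the instruction usually embeds both the unstructured source text and the target schema. The snippet below is a minimal sketch of how such a prompt can be assembled on top of the objects loaded above; the sample text and the miniature schema are illustrative placeholders, not the schema the model was trained on.

```python
# Illustrative only: embed the raw text and the target schema in the
# instruction. Both values below are hypothetical placeholders.
unstructured_text = "Section 1: Overview. This manual describes QA procedures..."
mini_schema = '{"type": "object", "properties": {"title": {"type": "string"}}}'

instruction = (
    "Map the following text into the given JSON Schema.\n\n"
    f"Text:\n{unstructured_text}\n\n"
    f"Schema:\n{mini_schema}"
)

# Reuse the template, tokenizer, and model loaded above
prompt = ALPACA_PROMPT.format(instruction, "")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=2000)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```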
---

### Using Transformers for Inference

If you prefer to use Hugging Face's Transformers directly, here is an alternative example:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

MODEL = "MasterControlAIML/DeepSeek-R1-Qwen2.5-1.5b-SFT-R1-JSON-Unstructured-To-Structured"

# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

ALPACA_PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}
"""

# Define your text input
TEXT = ""

prompt = ALPACA_PROMPT.format(TEXT, "")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)

# Generate output with specific generation parameters
with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=2000,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        streamer=text_streamer,
        pad_token_id=tokenizer.pad_token_id,
    )

# Print the decoded output
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

---

### Advanced Example with LangChain Prompt

For advanced users, the repository includes an example that integrates with LangChain to map hierarchical text data into a JSON schema. It uses a prompt template that instructs the model to produce both the filled JSON object (`<answer>`) and the reasoning behind the mapping decisions (`<think>`).

```python
from langchain_core.prompts import PromptTemplate

SYSTEM_PROMPT = """
### Role:
You are an expert data extractor specializing in mapping hierarchical text data into a given JSON Schema.

### DATA INPUT:
- **Text:** ```{TEXT}```
- **Blank JSON Schema:** ```{SCHEMA}```

### TASK REQUIREMENT:
1. Analyze the given text and map all relevant information strictly into the provided JSON Schema.
2. Provide your output in **two mandatory sections**:
   - **<answer>:** The filled JSON object
   - **<think>:** Reasoning for the mapping decisions

### OUTPUT STRUCTURE:
<think>
/* Explanation of mapping logic */
</think>
<answer>
/* Completed JSON Object */
</answer>

### STRICT RULES FOR GENERATING OUTPUT:
1. **Both Tags Required:**
   - Always provide both the <think> and <answer> sections.
   - If reasoning is minimal, state: "Direct mapping from text to schema."
2. **JSON Schema Mapping:**
   - Strictly map the text data to the given JSON Schema without modification or omissions.
3. **Hierarchy Preservation:**
   - Maintain proper parent-child relationships and follow the schema's hierarchical structure.
4. **Correct Mapping of Attributes:**
   - Map key attributes, including `id`, `idc`, `idx`, `level_type`, and `component_type`.
5. **JSON Format Compliance:**
   - Escape quotes, replace newlines with `\\n`, avoid trailing commas, and use double quotes exclusively.
6. **Step-by-Step Reasoning:**
   - Explain your reasoning within the <think> tag.

### IMPORTANT:
If either the <think> or <answer> tag is missing, the response will be considered incomplete.
"""

# Create a prompt template with LangChain
system_prompt_template = PromptTemplate(template=SYSTEM_PROMPT, input_variables=["TEXT", "SCHEMA"])

# Format the prompt with your text and JSON schema
system_prompt_str = system_prompt_template.format(
    TEXT="Your detailed text input here...",
    SCHEMA="""{
    "type": "object",
    "properties": {
        "id": {"type": "string", "description": "Unique identifier."},
        "title": {"type": "string", "description": "Section title."},
        "level": {"type": "integer", "description": "Hierarchy level."},
        "level_type": {"type": "string", "enum": ["ROOT", "SECTION", "SUBSECTION", "DETAIL_N"], "description": "Hierarchy type."},
        "component": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "idc": {"type": "integer", "description": "Component ID."},
                    "component_type": {"type": "string", "enum": ["PARAGRAPH", "TABLE", "CALCULATION", "CHECKBOX"], "description": "Component type."},
                    "metadata": {"type": "string", "description": "Additional metadata."},
                    "properties": {"type": "object"}
                },
                "required": ["idc", "component_type", "metadata", "properties"]
            }
        },
        "children": {"type": "array", "items": {}}
    },
    "required": ["id", "title", "level", "level_type", "component", "children"]
}"""
)

# Use the system prompt with your inference code as shown in the previous examples.
```
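To actually run the formatted prompt, `system_prompt_str` can be fed through either inference path shown earlier. Below is a minimal sketch that reuses the Transformers setup from the previous section; wrapping the prompt in the same Alpaca template is an assumption here, so adapt it to your preferred prompt format.

```python
# Minimal sketch: reuse `tokenizer`, `model`, and `ALPACA_PROMPT` from the
# Transformers example and pass the formatted LangChain prompt through them.
prompt = ALPACA_PROMPT.format(system_prompt_str, "")
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```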
""" # Create a prompt template with LangChain system_prompt_template = PromptTemplate(template=SYSTEM_PROMPT, input_variables=["TEXT", "SCHEMA"]) # Format the prompt with your text and JSON schema system_prompt_str = system_prompt_template.format( TEXT="Your detailed text input here...", SCHEMA="""{ "type": "object", "properties": { "id": {"type": "string", "description": "Unique identifier."}, "title": {"type": "string", "description": "Section title."}, "level": {"type": "integer", "description": "Hierarchy level."}, "level_type": {"type": "string", "enum": ["ROOT", "SECTION", "SUBSECTION", "DETAIL_N"], "description": "Hierarchy type."}, "component": { "type": "array", "items": { "type": "object", "properties": { "idc": {"type": "integer", "description": "Component ID."}, "component_type": {"type": "string", "enum": ["PARAGRAPH", "TABLE", "CALCULATION", "CHECKBOX"], "description": "Component type."}, "metadata": {"type": "string", "description": "Additional metadata."}, "properties": {"type": "object"} }, "required": ["idc", "component_type", "metadata", "properties"] } }, "children": {"type": "array", "items": {}} }, "required": ["id", "title", "level", "level_type", "component", "children"] }""" ) # Use the system prompt with your inference code as shown in previous examples. ``` --- ## Contributing Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request if you would like to contribute to this project. --- ## License This project is licensed under the [Apache-2.0 License](LICENSE). --- ## Acknowledgments - **Unsloth:** For providing fast model inference capabilities. ([GitHub](https://github.com/unslothai/unsloth)) - **Hugging Face:** For the [Transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl) libraries. - **LangChain:** For advanced prompt management and integration. - And, of course, thanks to the community and contributors who helped shape this project. --- Enjoy using the model, and happy coding!