metadata
language:
  - en
license: llama3.1
base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
tags:
  - data-management
  - data-migration
  - sql
  - etl
  - grpo
  - reinforcement-learning
  - oracle-to-postgres
  - db2-to-snowflake
  - data-quality
  - schema-analysis
pipeline_tag: text-generation
datasets:
  - custom
model-index:
  - name: Agentic-Data-1
    results:
      - task:
          type: text-generation
          name: Data Management Tasks
        metrics:
          - type: composite
            value: 52
            name: Composite Score
          - type: reasoning
            value: 24
            name: Reasoning Quality
          - type: sql_validity
            value: 40
            name: SQL Validity

🚀 Agentic Data 1

The First Open-Source LLM Purpose-Built for Data Operations

SQL Migration • Schema Analysis • Data Quality • ETL Design • Performance Tuning


Built by DataManagement.AI — Powering enterprise data operations with intelligent AI agents.


🎯 What is Agentic Data 1?

Agentic Data 1 is the first open-source language model specifically designed for data management and migration tasks. While general-purpose LLMs like GPT-4 or Claude treat data operations as just another coding task, Agentic Data 1 understands the unique challenges of enterprise data ecosystems — from legacy Oracle databases to modern cloud data warehouses.

Built on DeepSeek-R1-Distill-Llama-8B and enhanced through a rigorous two-stage training pipeline (Supervised Fine-Tuning + GRPO Reinforcement Learning), it delivers specialist-grade performance at a fraction of the cost of frontier models.

💡 Why a Specialized Data Model?

| Challenge | General LLMs | Agentic Data 1 |
|---|---|---|
| Oracle → PostgreSQL migration | Basic syntax conversion | Deep understanding of Oracle-specific constructs (NVL, DECODE, ROWNUM, PL/SQL) |
| Schema normalization | Generic suggestions | Industry-aware normalization with proper foreign key design |
| Data quality rules | Surface-level checks | Comprehensive quality framework (duplicates, PII, referential integrity) |
| ETL pipeline design | Abstract descriptions | Practical, implementable pipelines with error handling and rollback |
| Query performance tuning | Basic index suggestions | Multi-strategy optimization (partitioning, materialized views, query rewriting) |
| Cost to operate | $3-30 per million tokens | Near-zero (self-hosted inference) |

🏗️ Training Pipeline

Agentic Data 1 uses a two-stage training approach that combines domain knowledge injection with reasoning reinforcement:

Stage 1: Supervised Fine-Tuning (SFT)
├── 1,000+ curated data management examples
├── Real-world migration scenarios
├── Multi-database dialect coverage
└── Expert-written chain-of-thought reasoning

Stage 2: Group Relative Policy Optimization (GRPO)
├── 500 RL training steps on NVIDIA H100
├── Reward: SQL parsability (30%) + Reasoning quality (25%) + Answer accuracy (45%)
├── 10 full epochs over training data
└── Result: 3× improvement in reasoning, +37% code parsability
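
The reward weighting in Stage 2 can be sketched as a weighted sum, followed by the group-relative normalization that gives GRPO its name. This is a minimal illustration, not the project's training code: the component names, the example scores, and the assumption that each component is scored in [0, 1] are all hypothetical.

```python
from statistics import mean, pstdev

# Weights taken from the Stage 2 reward description above.
WEIGHTS = {
    "sql_parsability": 0.30,
    "reasoning_quality": 0.25,
    "answer_accuracy": 0.45,
}


def composite_reward(scores: dict) -> float:
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four completions sampled from the same prompt.
group = [
    {"sql_parsability": 1.0, "reasoning_quality": 0.8, "answer_accuracy": 1.0},
    {"sql_parsability": 1.0, "reasoning_quality": 0.2, "answer_accuracy": 0.5},
    {"sql_parsability": 0.0, "reasoning_quality": 0.6, "answer_accuracy": 0.0},
    {"sql_parsability": 1.0, "reasoning_quality": 0.4, "answer_accuracy": 1.0},
]
rewards = [composite_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Completions that score above their group's mean get a positive advantage and are reinforced; those below the mean are discouraged.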

GRPO Training Results

| Metric | Before GRPO | After GRPO | Improvement |
|---|---|---|---|
| Reasoning Quality | 7.5% | 24.0% | +220% 🔥 |
| Performance Tuning | 42.5% | 86.3% | +103% |
| Schema Analysis | 41.2% | 63.1% | +53% |
| Data Quality | 68.8% | 75.0% | +9% |
| Inference Speed | 26.6s | 21.8s | 18% faster |
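
The Improvement column is a relative change, not a difference in percentage points. A short sketch that reproduces it from the before and after values in the table:

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


rows = {
    "Reasoning Quality": (7.5, 24.0),
    "Performance Tuning": (42.5, 86.3),
    "Schema Analysis": (41.2, 63.1),
    "Data Quality": (68.8, 75.0),
}
for name, (before, after) in rows.items():
    print(f"{name}: {relative_change(before, after):+.0f}%")

# Latency is lower-is-better, so the speedup is the relative decrease.
speedup = -relative_change(26.6, 21.8)
print(f"Inference Speed: {speedup:.0f}% faster")
```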

🔧 Use Cases

1. Database Migration

Transform your legacy database migration from weeks of manual work to hours of AI-assisted automation.

Supported Migration Paths:

| Source | Target | Coverage |
|---|---|---|
| Oracle | PostgreSQL | ✅ Full (DDL, DML, PL/SQL → PL/pgSQL) |
| DB2 | Snowflake | ✅ Full (SQL, stored procedures, data types) |
| MySQL | PostgreSQL | ✅ Full (AUTO_INCREMENT, ENUM, JSON, charset) |
| SQL Server | PostgreSQL | ✅ Functions, procedures, T-SQL conversion |
| Oracle | Snowflake | ✅ Including materialized views, sequences |
| Legacy COBOL/DB2 | Modern cloud | ✅ Schema extraction and modernization |

Example — Oracle to PostgreSQL:

prompt = """Convert this Oracle SQL to PostgreSQL:

SELECT employee_id, first_name,
  NVL(commission_pct, 0) as commission,
  DECODE(department_id, 10, 'Admin', 20, 'Marketing', 'Other') as dept,
  TO_CHAR(hire_date, 'DD-MON-YYYY') as hire_dt
FROM employees
WHERE ROWNUM <= 100;"""

Agentic Data 1 produces:

SELECT employee_id, first_name,
  COALESCE(commission_pct, 0) AS commission,
  CASE department_id
    WHEN 10 THEN 'Admin'
    WHEN 20 THEN 'Marketing'
    ELSE 'Other'
  END AS dept,
  TO_CHAR(hire_date, 'DD-Mon-YYYY') AS hire_dt
FROM employees
ORDER BY hire_date DESC
LIMIT 100;

Key conversions handled automatically:

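  • NVL() → COALESCE()
  • DECODE() → CASE WHEN
  • ROWNUM → LIMIT (the ORDER BY hire_date DESC in the output above is not part of the source query; drop it for a strict translation)
  • Oracle date formats → PostgreSQL date formats (keep 'MON' if the upper-case month abbreviation must be preserved; 'Mon' yields mixed case in PostgreSQL)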
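
One way to sanity-check the COALESCE, CASE, and LIMIT rewrites without a PostgreSQL instance is to run them against an in-memory SQLite database. This is only a rough stand-in: SQLite is not PostgreSQL, TO_CHAR is omitted because SQLite lacks it, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, "
    "commission_pct REAL, department_id INTEGER)"
)
# Invented sample rows covering NULL commissions and an unmapped department.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Ana", None, 10), (2, "Ben", 0.15, 20), (3, "Cy", None, 30)],
)

# Same COALESCE / CASE / LIMIT constructs as the converted query, minus TO_CHAR.
rows = conn.execute(
    """
    SELECT employee_id,
           COALESCE(commission_pct, 0) AS commission,
           CASE department_id
             WHEN 10 THEN 'Admin'
             WHEN 20 THEN 'Marketing'
             ELSE 'Other'
           END AS dept
    FROM employees
    ORDER BY employee_id
    LIMIT 100
    """
).fetchall()
print(rows)
```

NULL commissions come back as 0 and department 30 falls through to 'Other', matching what NVL and DECODE would return in Oracle.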

2. Schema Analysis & Normalization

Automatically detect denormalized schemas, suggest proper normal forms, and generate migration DDL.

prompt = """Analyze this schema and suggest normalization:

CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_name VARCHAR(100),
  customer_email VARCHAR(100),
  product_name VARCHAR(100),
  product_price DECIMAL(10,2),
  quantity INT
);"""

The model identifies:

  • Customer data repeated on every order (3NF violation: with a single-column key the table is already in 2NF)
  • Product data mixed with order data (3NF violation)
  • Missing foreign key relationships
  • Suggests proper customers, products, and order_items tables
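
A normalized layout along these lines could look like the DDL below, loaded into SQLite here only to confirm that it parses and that the foreign keys are enforced. The surrogate key names, the orders header table, and the constraints are illustrative assumptions, not the model's verbatim output.

```python
import sqlite3

# One possible normalized design; key names and constraints are assumptions.
NORMALIZED_DDL = """
CREATE TABLE customers (
  customer_id INTEGER PRIMARY KEY,
  customer_name VARCHAR(100) NOT NULL,
  customer_email VARCHAR(100) NOT NULL UNIQUE
);
CREATE TABLE products (
  product_id INTEGER PRIMARY KEY,
  product_name VARCHAR(100) NOT NULL,
  product_price DECIMAL(10,2) NOT NULL
);
CREATE TABLE orders (
  order_id INTEGER PRIMARY KEY,
  customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
);
CREATE TABLE order_items (
  order_id INTEGER NOT NULL REFERENCES orders(order_id),
  product_id INTEGER NOT NULL REFERENCES products(product_id),
  quantity INTEGER NOT NULL CHECK (quantity > 0),
  PRIMARY KEY (order_id, product_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(NORMALIZED_DDL)

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```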

3. Data Quality Assessment

Generate comprehensive data quality checks for any schema:

  • Duplicate detection — fuzzy matching on key fields
  • Referential integrity — orphan record identification
  • Format validation — email, phone, date patterns
  • Anomaly detection — statistical outliers in numeric fields
  • PII exposure — identify unmasked sensitive data
  • Completeness — NULL pattern analysis with thresholds
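
A few of these checks can be sketched as plain SQL run from Python. The tables, sample rows, and the loose email pattern below are hypothetical, and duplicate detection is shown as an exact match because fuzzy matching needs more machinery than fits here.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, email TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES
      (1, 'a@example.com'), (2, 'a@example.com'), (3, 'not-an-email'), (4, NULL);
    INSERT INTO orders VALUES (10, 1), (11, 99);
    """
)

# Duplicate detection: exact match on email.
duplicates = conn.execute(
    "SELECT email, COUNT(*) FROM customers WHERE email IS NOT NULL "
    "GROUP BY email HAVING COUNT(*) > 1"
).fetchall()

# Referential integrity: orders whose customer_id matches no customer.
orphans = conn.execute(
    "SELECT o.order_id FROM orders o LEFT JOIN customers c "
    "ON o.customer_id = c.customer_id WHERE c.customer_id IS NULL"
).fetchall()

# Completeness: share of NULL emails, to compare against a chosen threshold.
total, nulls = conn.execute(
    "SELECT COUNT(*), SUM(email IS NULL) FROM customers"
).fetchone()
null_ratio = nulls / total

# Format validation: deliberately loose email pattern.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
bad_emails = [
    email
    for (email,) in conn.execute("SELECT email FROM customers WHERE email IS NOT NULL")
    if not EMAIL_RE.match(email)
]
print(duplicates, orphans, null_ratio, bad_emails)
```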

4. ETL Pipeline Design

Get production-ready ETL architectures with:

  • Extraction strategies (full, incremental, CDC)
  • Transformation logic with business rules
  • Error handling and dead-letter queues
  • Rollback procedures and checkpointing
  • Performance optimization for large datasets (50M+ rows)
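
The incremental extraction, checkpointing, and dead-letter ideas above can be illustrated with an in-memory sketch. The row shape, the business rule in transform, and the batch size are assumptions; a real pipeline would persist the checkpoint and commit it in the same transaction as the target writes.

```python
from dataclasses import dataclass, field


@dataclass
class EtlState:
    checkpoint: int = 0  # highest source id processed so far
    target: list = field(default_factory=list)
    dead_letters: list = field(default_factory=list)


def transform(row: dict) -> dict:
    """Hypothetical business rule: amounts must be non-negative numbers."""
    amount = float(row["amount"])  # raises ValueError on malformed input
    if amount < 0:
        raise ValueError("negative amount")
    return {"id": row["id"], "amount_cents": round(amount * 100)}


def run_incremental(source: list, state: EtlState, batch_size: int = 2) -> None:
    """Process rows with id > checkpoint in batches, advancing the checkpoint per batch."""
    pending = sorted(
        (r for r in source if r["id"] > state.checkpoint), key=lambda r: r["id"]
    )
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        loaded = []
        for row in batch:
            try:
                loaded.append(transform(row))
            except (ValueError, KeyError) as exc:
                # Bad rows go to the dead-letter list instead of failing the batch.
                state.dead_letters.append({"row": row, "error": str(exc)})
        state.target.extend(loaded)
        state.checkpoint = batch[-1]["id"]


source = [
    {"id": 1, "amount": "10.50"},
    {"id": 2, "amount": "oops"},
    {"id": 3, "amount": "-4"},
    {"id": 4, "amount": "7"},
]
state = EtlState()
run_incremental(source, state)
run_incremental(source, state)  # nothing new past the checkpoint, so a no-op
print(state)
```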

5. Performance Tuning

The model's strongest capability after GRPO training (+103% improvement):

  • Index recommendations — composite, partial, covering indexes
  • Query rewriting — subquery elimination, join optimization
  • Partitioning strategies — range, hash, list partitioning
  • Materialized views — for heavy aggregation queries
  • EXPLAIN plan analysis — identify sequential scans, nested loops
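
The effect of an index recommendation on a query plan can be seen with a small experiment. SQLite's EXPLAIN QUERY PLAN is used here as a stand-in for PostgreSQL's EXPLAIN; the table, data, and index name are invented, and the plan text differs between engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

QUERY = "SELECT order_id, total FROM orders WHERE customer_id = 42"


def plan(sql: str) -> str:
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan(QUERY)  # full table scan, no usable index yet
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(QUERY)  # lookup through the new index

print("before:", plan_before)
print("after: ", plan_after)
```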

6. Real-Time Pipeline Architecture

Design event-driven data pipelines with:

  • Technology selection (Kafka, Flink, Spark Streaming)
  • Exactly-once processing semantics
  • Schema evolution and compatibility
  • Dead-letter handling and retry logic
  • Monitoring and alerting strategies
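
Retry logic and dead-letter handling can be sketched independently of any broker. The example below approximates exactly-once behavior through at-least-once delivery plus an idempotency check on event ids, which is a common simplification and not a full transactional guarantee. The event shape and handler are hypothetical, and backoff between retries is left out for brevity.

```python
def process_stream(events, handler, max_attempts=3):
    """Apply handler to each event once; retry failures, then dead-letter them."""
    seen_ids = set()  # idempotency: skip events that were already processed
    dead_letters = []
    for event in events:
        if event["id"] in seen_ids:
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                handler(event)
                seen_ids.add(event["id"])
                break
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letters.append({"event": event, "error": repr(exc)})
    return seen_ids, dead_letters


results = []
attempts = {}


def handler(event):
    """Hypothetical sink: fails once for 'flaky' payloads, always for 'poison'."""
    attempts[event["id"]] = attempts.get(event["id"], 0) + 1
    if event["payload"] == "poison":
        raise ValueError("unparseable payload")
    if event["payload"] == "flaky" and attempts[event["id"]] < 2:
        raise TimeoutError("transient failure")
    results.append(event["id"])


events = [
    {"id": "e1", "payload": "ok"},
    {"id": "e2", "payload": "flaky"},
    {"id": "e1", "payload": "ok"},  # redelivery of e1
    {"id": "e3", "payload": "poison"},
]
processed, dead = process_stream(events, handler)
print(results, dead)
```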

🏢 Industry Applications

Banking & Finance

  • Regulatory data migration (Basel III/IV compliance)
  • Core banking system modernization (mainframe → cloud)
  • Customer data platform consolidation
  • Anti-money laundering data quality

Insurance

  • Policy administration system migration
  • Claims data standardization
  • Actuarial data warehouse modernization
  • Regulatory reporting (Solvency II)

Healthcare & Pharma

  • EHR/EMR system migration
  • Clinical data quality validation
  • HIPAA-compliant data transformation
  • Research data lake design

Logistics & Supply Chain

  • Legacy ERP migration (SAP → cloud)
  • Real-time inventory data pipelines
  • Multi-source data reconciliation
  • IoT sensor data architecture

⚡ Quick Start

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "DataManagement-AI/Agentic-Data-1",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DataManagement-AI/Agentic-Data-1")

prompt = "Convert this Oracle SQL to PostgreSQL: SELECT NVL(salary, 0) FROM employees WHERE ROWNUM <= 10;"

messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
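
Because the base model is an R1 distillation, the decoded text may contain a reasoning section closed by a `</think>` tag before the final answer. That output format is an assumption about this fine-tune, not something the model card states. If it holds, a small helper like this hypothetical `split_reasoning` can separate the two parts:

```python
def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer); reasoning is empty if no </think> tag is present."""
    reasoning, sep, answer = text.partition("</think>")
    if not sep:
        return "", text.strip()
    # The opening tag may be absent when the chat template already emitted it.
    return reasoning.replace("<think>", "", 1).strip(), answer.strip()


sample = (
    "<think>NVL maps to COALESCE.</think>\n"
    "SELECT COALESCE(salary, 0) FROM employees LIMIT 10;"
)
reasoning, answer = split_reasoning(sample)
print(answer)
```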

4-Bit Quantized (Recommended for Production)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "DataManagement-AI/Agentic-Data-1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DataManagement-AI/Agentic-Data-1")

With vLLM (High-Throughput API Server)

pip install vllm
vllm serve DataManagement-AI/Agentic-Data-1 --dtype auto --max-model-len 4096

Then query the OpenAI-compatible endpoint from Python:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="DataManagement-AI/Agentic-Data-1",
    messages=[{"role": "user", "content": "Convert Oracle NVL to PostgreSQL equivalent"}],
)
print(response.choices[0].message.content)

💰 Cost Comparison

Running your own Agentic Data 1 vs using commercial LLM APIs:

| Model | Input $/M tokens | Output $/M tokens | Monthly Cost (100 active users) |
|---|---|---|---|
| GPT-4 Turbo | $10.00 | $30.00 | $11,500 |
| Claude Sonnet 3.5 | $3.00 | $15.00 | $1,015 |
| Claude Haiku | $0.25 | $1.25 | $440 |
| Agentic Data 1 (self-hosted) | ~$0.003 | ~$0.003 | $330 (GPU only) |

Based on the monthly figures above: about 97% lower cost than GPT-4 Turbo, 67% lower than Claude Sonnet 3.5, and 25% lower than Claude Haiku.
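
The relative savings follow directly from the Monthly Cost column of the table. A short sketch of that arithmetic, using only the figures listed there:

```python
# Monthly cost for 100 active users, copied from the table above (USD).
monthly_cost = {
    "GPT-4 Turbo": 11_500,
    "Claude Sonnet 3.5": 1_015,
    "Claude Haiku": 440,
}
SELF_HOSTED = 330  # Agentic Data 1, GPU only


def reduction_pct(baseline: float, candidate: float) -> float:
    """Percentage saved by paying `candidate` instead of `baseline`."""
    return (baseline - candidate) / baseline * 100


for name, cost in monthly_cost.items():
    print(f"vs {name}: {reduction_pct(cost, SELF_HOSTED):.1f}% lower")
```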


🤝 Part of the DataManagement.AI Ecosystem

Agentic Data 1 powers the AI backbone of the DataManagement.AI platform — an enterprise-grade data operations platform featuring 8 specialized AI agents:

| Agent | Function |
|---|---|
| Profile AI | Automated data profiling and pattern detection |
| Map AI | Intelligent source-to-target schema mapping |
| Discovery AI | Data landscape exploration and dependency analysis |
| Cleanse AI | Automated data cleansing and deduplication |
| Quality AI | Continuous data quality monitoring |
| Transform AI | Complex data transformations with business rules |
| Reconcile AI | Post-migration validation and reconciliation |
| Damian | End-to-end migration advisor and automation |

Start Free Trial • Schedule a Demo • Learn More


📋 Model Specifications

| Specification | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | 8.03 Billion |
| Context Length | 4,096 tokens |
| Training Data | 1,000+ curated data management examples |
| Base Model | DeepSeek-R1-Distill-Llama-8B |
| Training Method | SFT + GRPO (500 steps, NVIDIA H100) |
| Precision | BFloat16 |
| License | Llama 3.1 Community License |
| Model Size | ~16 GB (FP16) / ~4 GB (4-bit quantized) |
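
The model size figures follow from the parameter count and the bits stored per weight. The estimate below covers weights only; the KV cache, activations, and quantization metadata add to it in practice.

```python
PARAMS = 8.03e9  # parameter count from the specification table


def weight_size_gb(params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_param / 8 / 1e9


print(f"16-bit: ~{weight_size_gb(PARAMS, 16):.1f} GB")
print(f"4-bit:  ~{weight_size_gb(PARAMS, 4):.1f} GB")
```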

⚠️ Limitations

  • Optimized for data management tasks — not a general-purpose chatbot
  • Best results with structured prompts that include schema definitions or SQL code
  • May hallucinate table/column names not provided in the prompt
  • Performance on non-English content is limited
  • Not suitable for real-time production without proper guardrails
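
One simple guardrail against hallucinated table or column names is to compile the generated SQL against an empty copy of the real schema before running it anywhere. The sketch below uses SQLite for this, which is an assumption of convenience: it only accepts SQL that SQLite can parse, so dialect-specific PostgreSQL or Snowflake syntax would need the target engine instead. The function name and sample schema are hypothetical.

```python
import sqlite3
from typing import Optional


def validate_against_schema(schema_ddl: str, generated_sql: str) -> Optional[str]:
    """Compile generated_sql against an empty schema copy; return an error or None."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        # EXPLAIN compiles the statement, resolving names, without running the query.
        conn.execute("EXPLAIN " + generated_sql)
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()


SCHEMA = "CREATE TABLE employees (employee_id INTEGER, salary REAL);"

print(validate_against_schema(SCHEMA, "SELECT employee_id FROM employees"))
print(validate_against_schema(SCHEMA, "SELECT bonus FROM employees"))
print(validate_against_schema(SCHEMA, "SELECT employee_id FROM staff"))
```

A returned error string can be fed back to the model as a correction prompt or used to reject the output outright.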

📖 Citation

@misc{agentic-data-1,
  title={Agentic Data 1: A Domain-Specific LLM for Data Management and Migration},
  author={DataManagement-AI},
  year={2026},
  url={https://huggingface.co/DataManagement-AI/Agentic-Data-1}
}

Built with ❤️ by DataManagement.AI

Website • Data Migration • Contact