---
base_model:
- unsloth/Meta-Llama-3.1-70B-Instruct
library_name: peft
datasets:
- ARM-Development/11k_Tabular
language:
- en
---

## Model Card for `sciencebase-metadata-llama3-70b` *(v1.0)*

### Model Details

| Field | Value |
|-------|-------|
| **Developed by** | Quan Quy, Travis Ping, Tudor Garbulet, Chirag Shah, Austin Aguilar |
| **Contact** | quyqm@ornl.gov • pingts@ornl.gov • garbuletvt@ornl.gov • shahch@ornl.gov • aguilaral@ornl.gov |
| **Funded by** | U.S. Geological Survey (USGS) & Oak Ridge National Laboratory – ARM Data Center |
| **Model type** | Autoregressive LLM, instruction-tuned for *structured → metadata* generation |
| **Base model** | `meta-llama/Llama-3.1-70B-Instruct` |
| **Languages** | English |
| **Finetuned from** | `unsloth/Meta-Llama-3.1-70B-Instruct` |

### Model Description
Fine-tuned on ≈9,000 ScienceBase “data → metadata” pairs to automate the creation of FGDC/ISO-style metadata records for scientific datasets.

### Model Sources

| Resource | Link |
|----------|------|
| **Repository** | <https://huggingface.co/ARM-Development/Llama-3.3-70B-tabular-1.0> |
| **Demo** | <https://colab.research.google.com/drive/1saCEFhkBYDhQWkdTwnwiE_-AiWmD6p0f#scrollTo=WeniLP-Ah1QL> |

---

## Uses

### Direct Use
Generate schema-compliant metadata text from a JSON/CSV representation of a ScienceBase item.

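A minimal sketch of how an item might be serialized into an instruction prompt. The template text, section markers, and field names below are illustrative assumptions; the exact prompt format used during fine-tuning is not documented in this card.

```python
import json

# Illustrative instruction template (an assumption -- the exact prompt
# format used during fine-tuning is not published in this card).
PROMPT_TEMPLATE = (
    "Below is a ScienceBase item serialized as JSON. Generate a "
    "schema-compliant FGDC/ISO-style metadata record for it.\n\n"
    "### Item\n{item}\n\n### Metadata\n"
)

def build_prompt(item: dict) -> str:
    """Serialize a ScienceBase item and wrap it in the instruction template."""
    return PROMPT_TEMPLATE.format(item=json.dumps(item, indent=2, sort_keys=True))

# Hypothetical minimal item; real ScienceBase records carry many more fields.
item = {"title": "Streamflow observations, 2020", "contact": "USGS", "format": "CSV"}
prompt = build_prompt(item)
```

The resulting string can be passed to the model's tokenizer and `generate` call as in the linked Colab demo.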
### Downstream Use
Integrate the model as a micro-service in data-repository pipelines.

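One way such a micro-service boundary might look, as a sketch only: the card does not prescribe a service interface, so the function and field names here are hypothetical, and the text generator is injected so the handler can be exercised without loading the 70B model.

```python
from typing import Callable

def handle_item(item: dict, generate: Callable[[str], str]) -> dict:
    """Validate an incoming ScienceBase item and return a metadata payload."""
    if "title" not in item:
        return {"status": "error", "detail": "item is missing a 'title' field"}
    # In a real deployment, `generate` would wrap the fine-tuned model.
    metadata = generate(f"Generate metadata for: {item['title']}")
    return {"status": "ok", "metadata": metadata}

# Stub generator standing in for the fine-tuned model.
stub = lambda prompt: "<metadata/>"
response = handle_item({"title": "Soil moisture grids"}, stub)
```

Keeping the model behind a plain function like this makes the pipeline integration testable independently of GPU inference.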
### Out-of-Scope
Open-ended content generation or any application outside metadata curation.

---

## Bias, Risks, and Limitations
* Domain-specific bias toward ScienceBase field names.
* Possible hallucination of fields when prompts are underspecified.

---

## Training Details

### Training Data
* ≈9,000 ScienceBase records with curated metadata.

### Training Procedure

| Hyper-parameter | Value |
|-----------------|-------|
| Max sequence length | 20,000 |
| Precision | fp16 / bf16 (auto) |
| Quantisation | 4-bit QLoRA (`load_in_4bit=True`) |
| LoRA rank / α | 16 / 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Optimiser | `adamw_8bit` |
| LR / schedule | 2 × 10⁻⁴, linear |
| Epochs | 1 |
| Effective batch | 4 (1 GPU × grad-acc 4) |
| Trainer | `trl` SFTTrainer + `peft` 0.15.2 |

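As a config fragment, the table above maps onto a `peft` + `trl` setup roughly as follows. This is a sketch under the stated hyperparameters, not the authors' actual training script; model and dataset loading are elided.

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA rank/alpha and target modules from the table above.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

sft_cfg = SFTConfig(
    max_seq_length=20_000,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch of 4 on one GPU
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    optim="adamw_8bit",
    output_dir="outputs",
)

# trainer = SFTTrainer(model=model, train_dataset=dataset,
#                      peft_config=lora_cfg, args=sft_cfg)
# trainer.train()
```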
### Hardware & Runtime

| Field | Value |
|-------|-------|
| GPU | 1 × NVIDIA A100 80 GB |
| Total training hours | ~120 hours |
| Cloud/HPC provider | ARM Cumulus HPC |

### Software Stack

| Package | Version |
|---------|---------|
| Python | 3.12.9 |
| PyTorch | 2.6.0 + CUDA 12.4 |
| Transformers | 4.51.3 |
| Accelerate | 1.6.0 |
| PEFT | 0.15.2 |
| Unsloth | 2025.3.19 |
| BitsAndBytes | 0.45.5 |
| TRL | 0.15.2 |
| Xformers | 0.0.29.post3 |
| Datasets | 3.5.0 |
| … | … |

---

## Evaluation
*Evaluation still in progress.*

---

## Technical Specifications

### Architecture & Objective
QLoRA-tuned `Llama-3.1-70B-Instruct`; causal-LM objective with structured-to-text instruction prompts.

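The causal-LM objective is standard next-token prediction: minimize the average negative log-probability the model assigns to each target token given its prefix. A toy illustration over a hypothetical 3-token vocabulary (numbers are made up for the example):

```python
import math

# Each row holds the model's probabilities for the *next* token given the
# prefix so far; `targets` are the tokens that actually follow.
probs = [
    [0.7, 0.2, 0.1],  # P(next token | prefix of length 1)
    [0.1, 0.8, 0.1],  # P(next token | prefix of length 2)
]
targets = [0, 1]  # tokens that actually appear next

# Causal-LM loss: mean negative log-likelihood of the observed next tokens.
nll = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
```

During fine-tuning this loss is computed only over the metadata portion of each structured-to-text pair, with the prompt serving as context.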
---

## Model Card Authors
Quan Quy, Travis Ping, Tudor Garbulet, Chirag Shah, Austin Aguilar

---