---
base_model:
- unsloth/Meta-Llama-3.1-70B-Instruct
library_name: peft
datasets:
- ARM-Development/11k_Tabular
language:
- en
---

## Model Card for `sciencebase-metadata-llama3-70b` *(v1.0)*

### Model Details

| Field | Value |
|-------|-------|
| **Developed by** | Quan Quy, Travis Ping, Tudor Garbulet, Chirag Shah, Austin Aguilar |
| **Contact** | quyqm@ornl.gov • pingts@ornl.gov • garbuletvt@ornl.gov • shahch@ornl.gov • aguilaral@ornl.gov |
| **Funded by** | U.S. Geological Survey (USGS) & Oak Ridge National Laboratory – ARM Data Center |
| **Model type** | Autoregressive LLM, instruction-tuned for *structured → metadata* generation |
| **Base model** | `meta-llama/Llama-3.1-70B-Instruct` |
| **Languages** | English |
| **Finetuned from** | `unsloth/Meta-Llama-3.1-70B-Instruct` |

### Model Description
Fine-tuned on ≈9,000 ScienceBase “data → metadata” pairs to automate the creation of FGDC/ISO-style metadata records for scientific datasets.

### Model Sources

| Resource | Link |
|----------|------|
| **Repository** | <https://huggingface.co/ARM-Development/Llama-3.3-70B-tabular-1.0> |
| **Demo** | <https://colab.research.google.com/drive/1saCEFhkBYDhQWkdTwnwiE_-AiWmD6p0f#scrollTo=WeniLP-Ah1QL> |

---

## Uses

### Direct Use
Generate schema-compliant metadata text from a JSON/CSV representation of a ScienceBase item.

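A minimal sketch of how an item might be serialized into an instruction prompt. The template text, section markers, and field names below are illustrative assumptions; the exact prompt format used during fine-tuning is not documented in this card.

```python
import json

# Illustrative instruction template (an assumption -- the exact prompt
# format used during fine-tuning is not published in this card).
PROMPT_TEMPLATE = (
    "Below is a ScienceBase item serialized as JSON. Generate a "
    "schema-compliant FGDC/ISO-style metadata record for it.\n\n"
    "### Item\n{item}\n\n### Metadata\n"
)

def build_prompt(item: dict) -> str:
    """Serialize a ScienceBase item and wrap it in the instruction template."""
    return PROMPT_TEMPLATE.format(item=json.dumps(item, indent=2, sort_keys=True))

# Hypothetical minimal item; real ScienceBase records carry many more fields.
item = {"title": "Streamflow observations, 2020", "contact": "USGS", "format": "CSV"}
prompt = build_prompt(item)
```

The resulting string can be passed to the model's tokenizer and `generate` call as in the linked Colab demo.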
### Downstream Use
Integrate the model as a micro-service in data-repository pipelines.

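One way such a micro-service boundary might look, as a sketch only: the card does not prescribe a service interface, so the function and field names here are hypothetical, and the text generator is injected so the handler can be exercised without loading the 70B model.

```python
from typing import Callable

def handle_item(item: dict, generate: Callable[[str], str]) -> dict:
    """Validate an incoming ScienceBase item and return a metadata payload."""
    if "title" not in item:
        return {"status": "error", "detail": "item is missing a 'title' field"}
    # In a real deployment, `generate` would wrap the fine-tuned model.
    metadata = generate(f"Generate metadata for: {item['title']}")
    return {"status": "ok", "metadata": metadata}

# Stub generator standing in for the fine-tuned model.
stub = lambda prompt: "<metadata/>"
response = handle_item({"title": "Soil moisture grids"}, stub)
```

Keeping the model behind a plain function like this makes the pipeline integration testable independently of GPU inference.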
### Out-of-Scope
Open-ended content generation or any application outside metadata curation.

---

## Bias, Risks, and Limitations
* Domain-specific bias toward ScienceBase field names.
* Possible hallucination of fields when prompts are underspecified.

---

## Training Details

### Training Data
* ≈9,000 ScienceBase records with curated metadata.

### Training Procedure

| Hyper-parameter | Value |
|-----------------|-------|
| Max sequence length | 20,000 |
| Precision | fp16 / bf16 (auto) |
| Quantisation | 4-bit QLoRA (`load_in_4bit=True`) |
| LoRA rank / α | 16 / 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Optimiser | `adamw_8bit` |
| LR / schedule | 2 × 10⁻⁴, linear |
| Epochs | 1 |
| Effective batch | 4 (1 GPU × grad-acc 4) |
| Trainer | `trl` SFTTrainer + `peft` 0.15.2 |

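As a config fragment, the table above maps onto a `peft` + `trl` setup roughly as follows. This is a sketch under the stated hyperparameters, not the authors' actual training script; model and dataset loading are elided.

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA rank/alpha and target modules from the table above.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

sft_cfg = SFTConfig(
    max_seq_length=20_000,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch of 4 on one GPU
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    optim="adamw_8bit",
    output_dir="outputs",
)

# trainer = SFTTrainer(model=model, train_dataset=dataset,
#                      peft_config=lora_cfg, args=sft_cfg)
# trainer.train()
```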
### Hardware & Runtime

| Field | Value |
|-------|-------|
| GPU | 1 × NVIDIA A100 80 GB |
| Total training hours | ~120 hours |
| Cloud/HPC provider | ARM Cumulus HPC |

### Software Stack

| Package | Version |
|---------|---------|
| Python | 3.12.9 |
| PyTorch | 2.6.0 + CUDA 12.4 |
| Transformers | 4.51.3 |
| Accelerate | 1.6.0 |
| PEFT | 0.15.2 |
| Unsloth | 2025.3.19 |
| BitsAndBytes | 0.45.5 |
| TRL | 0.15.2 |
| Xformers | 0.0.29.post3 |
| Datasets | 3.5.0 |
| … | … |

---

## Evaluation
*Evaluation still in progress.*

---

## Technical Specifications

### Architecture & Objective
QLoRA-tuned `Llama-3.1-70B-Instruct`; causal-LM objective with structured-to-text instruction prompts.

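The causal-LM objective is standard next-token prediction: minimize the average negative log-probability the model assigns to each target token given its prefix. A toy illustration over a hypothetical 3-token vocabulary (numbers are made up for the example):

```python
import math

# Each row holds the model's probabilities for the *next* token given the
# prefix so far; `targets` are the tokens that actually follow.
probs = [
    [0.7, 0.2, 0.1],  # P(next token | prefix of length 1)
    [0.1, 0.8, 0.1],  # P(next token | prefix of length 2)
]
targets = [0, 1]  # tokens that actually appear next

# Causal-LM loss: mean negative log-likelihood of the observed next tokens.
nll = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
```

During fine-tuning this loss is computed only over the metadata portion of each structured-to-text pair, with the prompt serving as context.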
---

## Model Card Authors
Quan Quy, Travis Ping, Tudor Garbulet, Chirag Shah, Austin Aguilar

---