|
|
--- |
|
|
library_name: transformers |
|
|
license: other |
|
|
base_model: Qwen/Qwen3-4B |
|
|
tags: |
|
|
- llama-factory |
|
|
- full |
|
|
- generated_from_trainer |
|
|
model-index: |
|
|
- name: train_2025-05-04-15-25-21 |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
|
|
|
# train_2025-05-04-15-25-21 |
|
|
|
|
|
This model is a fine-tuned version of [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) on the wikipedia_zh and petro_books datasets. |
|
|
|
|
|
## Model description |
|
|
|
|
|
Gaia-Petro-LLM is a large language model specialized in the oil and gas industry, fine-tuned from Qwen/Qwen3-4B. It was further pre-trained on a curated 20GB corpus of petroleum engineering texts, including technical documents, academic papers, and domain literature. The model is designed to support domain experts, researchers, and engineers in petroleum-related tasks, providing high-quality, domain-specific language understanding and generation. |
|
|
## Model Details |
|
|
- Base Model: Qwen/Qwen3-4B |
- Domain: Oil & Gas / Petroleum Engineering |
- Corpus Size: ~20 GB (petroleum engineering) |
- Languages: Primarily Chinese; domain-specific English supported |
- Repository: my2000cup/Gaia-Petro-LLM |
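
To verify that a downloaded checkpoint matches this spec, you can inspect its configuration; a minimal sketch, assuming the repository id listed above: |

```python |
# Quick sanity check of the checkpoint's architecture. |
# Assumes the repository id listed above; adjust if you load a local copy. |
from transformers import AutoConfig |

config = AutoConfig.from_pretrained("my2000cup/Gaia-Petro-LLM") |
print(config.model_type)                              # expected: "qwen3" |
print(config.num_hidden_layers, config.hidden_size)   # layer count and hidden size |
``` |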
|
|
## Intended uses & limitations |
|
|
|
|
|
Intended uses: |

- Technical Q&A in petroleum engineering |
- Document summarization for oil & gas reports |
- Knowledge extraction from unstructured domain texts |
- Education & training in oil & gas technologies |

Limitations: |

- Not suitable for general-domain tasks outside oil & gas. |
- May not be up to date with the latest industry developments (post-2023). |
- Not to be used for critical, real-time decision-making without expert review. |
|
|
|
|
|
## Training and evaluation data |
|
|
|
|
|
The model was further pre-trained on an in-house text corpus (~20GB) collected from: |
|
|
|
|
|
- Wikipedia (Chinese, petroleum-related entries) |
- Open petroleum engineering books and literature |
- Technical standards and manuals |
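
The corpus itself is not distributed with the model. For readers who want to reproduce a comparable continued pre-training setup, the sketch below shows one way to tokenize and pack raw domain text into fixed-length blocks; the file names, JSON-lines layout, and 4096-token block size are illustrative assumptions, not the pipeline actually used here. |

```python |
# Hypothetical sketch: packing raw domain text into fixed-length token blocks |
# for continued pre-training. File names and block size are assumptions. |
from datasets import load_dataset |
from transformers import AutoTokenizer |

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B") |

# Raw corpus files (hypothetical paths): one JSON-lines record per document, |
# each with a "text" field. |
raw = load_dataset( |
    "json", |
    data_files={"train": ["wikipedia_zh.jsonl", "petro_books.jsonl"]}, |
)["train"] |

def tokenize_and_chunk(batch, block_size=4096): |
    # Concatenate documents, then split into equal-length token blocks. |
    ids = [] |
    for text in batch["text"]: |
        ids.extend(tokenizer(text)["input_ids"]) |
    blocks = [ids[i:i + block_size] |
              for i in range(0, len(ids) - block_size + 1, block_size)] |
    return {"input_ids": blocks} |

train_blocks = raw.map( |
    tokenize_and_chunk, |
    batched=True, |
    remove_columns=raw.column_names, |
) |
print(train_blocks) |
``` |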
|
|
|
|
|
## Usage |
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
# Replace with your model repository |
|
|
model_name = "my2000cup/Gaia-LLM-4B" |
|
|
|
|
|
# Load tokenizer and model |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype="auto", |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
# Prepare a petroleum engineering prompt |
|
|
prompt = "What are the main challenges in enhanced oil recovery (EOR) methods?" |
|
|
messages = [ |
|
|
{"role": "user", "content": prompt} |
|
|
] |
|
|
text = tokenizer.apply_chat_template( |
|
|
messages, |
|
|
tokenize=False, |
|
|
add_generation_prompt=True, |
|
|
enable_thinking=True # Optional: enables model's 'thinking' mode |
|
|
) |
|
|
model_inputs = tokenizer([text], return_tensors="pt").to(model.device) |
|
|
|
|
|
# Generate the model's response |
|
|
generated_ids = model.generate( |
|
|
**model_inputs, |
|
|
max_new_tokens=1024 # adjust as needed |
|
|
) |
|
|
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() |
|
|
|
|
|
# Optional: parse 'thinking' content, if your template uses it |
|
|
try: |
|
|
# Find the index of the </think> token (ID may differ in your tokenizer!) |
|
|
think_token_id = 151668 # double-check this ID in your tokenizer |
|
|
index = len(output_ids) - output_ids[::-1].index(think_token_id) |
|
|
except ValueError: |
|
|
index = 0 |
|
|
|
|
|
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n") |
|
|
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n") |
|
|
|
|
|
print("Thinking content:", thinking_content) |
|
|
print("Answer:", content) |
|
|
``` |
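
If you do not need the reasoning trace, the Qwen3 chat template also accepts `enable_thinking=False`; continuing from the snippet above, the answer can then be decoded directly: |

```python |
# Variant without the 'thinking' trace: render the prompt with |
# enable_thinking=False and decode the completion directly. |
text = tokenizer.apply_chat_template( |
    messages, |
    tokenize=False, |
    add_generation_prompt=True, |
    enable_thinking=False, |
) |
model_inputs = tokenizer([text], return_tensors="pt").to(model.device) |
generated_ids = model.generate(**model_inputs, max_new_tokens=512) |
answer = tokenizer.decode( |
    generated_ids[0][len(model_inputs.input_ids[0]):], |
    skip_special_tokens=True, |
) |
print("Answer:", answer) |
``` |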
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training: |
|
|
- learning_rate: 2e-05 |
|
|
- train_batch_size: 1 |
|
|
- eval_batch_size: 8 |
|
|
- seed: 42 |
|
|
- gradient_accumulation_steps: 8 |
|
|
- total_train_batch_size: 8 |
|
|
- optimizer: adamw_torch (betas=(0.9, 0.999), epsilon=1e-08; no additional optimizer arguments) |
|
|
- lr_scheduler_type: cosine |
|
|
- lr_scheduler_warmup_steps: 16 |
|
|
- num_epochs: 3.0 |
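
For orientation, these values map roughly onto `transformers.TrainingArguments` as sketched below. The model was trained with LLaMA-Factory, so this is an approximation; the output directory and mixed-precision flag are assumptions rather than the recorded configuration. |

```python |
# Approximate reconstruction of the training configuration listed above. |
# output_dir and bf16 are assumptions; the other values mirror the list. |
from transformers import TrainingArguments |

training_args = TrainingArguments( |
    output_dir="train_2025-05-04-15-25-21",  # assumed; taken from the run label |
    learning_rate=2e-5, |
    per_device_train_batch_size=1, |
    per_device_eval_batch_size=8, |
    seed=42, |
    gradient_accumulation_steps=8,   # effective train batch size: 1 * 8 = 8 |
    optim="adamw_torch", |
    adam_beta1=0.9, |
    adam_beta2=0.999, |
    adam_epsilon=1e-8, |
    lr_scheduler_type="cosine", |
    warmup_steps=16, |
    num_train_epochs=3.0, |
    bf16=True,                       # assumed mixed-precision setting |
) |
``` |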
|
|
|
|
|
### Training results |
|
|
|
|
|
|
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.51.3 |
|
|
- Pytorch 2.6.0+cu124 |
|
|
- Datasets 3.5.0 |
|
|
- Tokenizers 0.21.1 |
|
|
|