---
language:
- zh
library_name: transformers
pipeline_tag: text-generation
license: mit
datasets:
- telecomadm1145/esjzone_novel_cn
tags:
- mamba2
---
# mamba2_exp2
**mamba2_exp2** is a **Mamba2**-architecture model with approximately **0.2 billion parameters**, pre-trained on a dataset of Chinese light novels from esjzone. It is intended for Chinese text generation and story-continuation tasks.
## Model Details
### Model Description
This model utilizes the Mamba2 state-space model architecture, designed for efficient inference. It was pre-trained from scratch on a corpus of uncleaned Chinese light novels.
**Note:** This is a **base model** (pre-trained only): it has **not** undergone supervised instruction tuning (SFT) or preference alignment (RLHF). It is best suited for continuing text from a prompt rather than answering questions or following complex instructions.
- **Developed by:** telecomadm1145
- **Model type:** Mamba2 (State Space Model)
- **Language(s) (NLP):** Chinese (zh)
- **License:** MIT
- **Finetuned from model:** None (Trained from scratch)
- **Model Size:** ~0.2B parameters (a quick verification sketch follows this list)
- **Context Length:** 1024 tokens
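The parameter count above is approximate. A quick way to verify it once the model is downloaded, a minimal sketch (this only counts tensors and is not specific to Mamba2):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "telecomadm1145/mamba2_exp2", trust_remote_code=True
)

# Sum the element counts of all parameter tensors
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")  # expected to land near 0.2B
```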
### Model Sources
- **Repository:** [https://huggingface.co/telecomadm1145/mamba2_exp2](https://huggingface.co/telecomadm1145/mamba2_exp2)
- **Dataset:** [telecomadm1145/esjzone_novel_cn](https://huggingface.co/datasets/telecomadm1145/esjzone_novel_cn)
## Uses
### Direct Use
The model is designed for:
- **Creative Writing:** Generating light novel-style stories.
- **Text Completion:** Continuing a given text narrative in Chinese.
- **Style Imitation:** Mimicking the tropes and writing styles found in web novels.
### Out-of-Scope Use
- **Factual Question Answering:** Since it is trained on fiction, it will likely hallucinate facts.
- **Instruction Following:** It has not been fine-tuned to follow commands (e.g., "Write a summary of...").
- **Code Generation:** Not trained on code.
- **Long-context retrieval:** The model was trained with a context window of 1024 tokens; performance may degrade significantly beyond this length (a truncation sketch follows this list).
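If a prompt does exceed the training window, truncating it while keeping the most recent text is a reasonable safeguard for continuation tasks. A minimal sketch, not an official recommendation; the placeholder prompt and the reserved generation budget are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("telecomadm1145/mamba2_exp2")
tokenizer.truncation_side = "left"  # keep the end of the prompt, which matters most for continuation

long_text = "<a long Chinese prompt that may exceed the training window>"
max_new_tokens = 100  # illustrative generation budget

# Keep prompt + generated tokens within the 1024-token training window
inputs = tokenizer(
    long_text,
    return_tensors="pt",
    truncation=True,
    max_length=1024 - max_new_tokens,
)
```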
## Bias, Risks, and Limitations
- **Dataset Quality:** The training data consists of **uncleaned** web novels. Consequently, the model may generate text containing typos, grammatical errors, or non-standard formatting present in the source material.
- **Content Warnings:** The model may generate content that includes violence, mature themes, or offensive language, reflecting the nature of some web fiction genres.
- **Hallucinations:** As a fiction-focused model it freely invents content and should not be used as a knowledge base.
## How to Get Started with the Model
Use the code below to get started with the model.
**Note:** For the optimized Mamba2 CUDA kernels you may need to install `mamba-ssm` and `causal-conv1d`; without them, `transformers` falls back to a slower pure-PyTorch implementation.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and tokenizer (trust_remote_code=True allows custom model code from the repo)
model_id = "telecomadm1145/mamba2_exp2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Move to GPU if available and switch to inference mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Generate a continuation of the prompt
text = "<replace your prompt here>"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
- **Dataset Name:** [esjzone_novel_cn](https://huggingface.co/datasets/telecomadm1145/esjzone_novel_cn)
- **Data Type:** Chinese Light Novels (轻小说).
- **Data Size:** Approximately 1GB.
- **Preprocessing:** None; the raw, **uncleaned** text was used as-is during training (a loading sketch follows this list).
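To inspect the corpus yourself, it can be loaded with the `datasets` library. A minimal sketch; the `train` split and the column layout are assumptions about the dataset, so check the schema first:

```python
from datasets import load_dataset

# Load the raw, uncleaned light-novel corpus (the split name is an assumption)
ds = load_dataset("telecomadm1145/esjzone_novel_cn", split="train")

print(ds.column_names)  # confirm which column holds the novel text
print(ds[0])
```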
### Training Procedure
#### Training Hyperparameters
- **Context Length:** 1024 tokens
- **Training Stage:** Pre-training (Causal Language Modeling); a data-packing sketch follows this list.
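The exact training script is not published in this card. As an illustration only, causal-LM pre-training at a 1024-token context is typically preceded by packing the tokenized corpus into fixed-length blocks; the sketch below assumes the dataset exposes a `text` column and uses the model's own tokenizer:

```python
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 1024  # matches the training context length above

tokenizer = AutoTokenizer.from_pretrained("telecomadm1145/mamba2_exp2")
ds = load_dataset("telecomadm1145/esjzone_novel_cn", split="train")

def tokenize_fn(batch):
    return tokenizer(batch["text"])  # "text" is an assumed column name

def group_texts(batch):
    # Concatenate all documents, then split into fixed 1024-token blocks,
    # the usual packing step for causal language modeling
    concatenated = list(chain.from_iterable(batch["input_ids"]))
    total_len = (len(concatenated) // BLOCK_SIZE) * BLOCK_SIZE
    blocks = [concatenated[i : i + BLOCK_SIZE] for i in range(0, total_len, BLOCK_SIZE)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

tokenized = ds.map(tokenize_fn, batched=True, remove_columns=ds.column_names)
lm_dataset = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)
```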
#### Speeds, Sizes, Times
- **Hardware:** 2x NVIDIA T4 GPUs
- **Training Duration:** ~23 hours
- **Model Parameters:** ~0.2 Billion
## Environmental Impact
- **Hardware Type:** NVIDIA T4 x2
- **Hours used:** 23 hours
- **Compute Region:** [Unknown/Cloud]
## Technical Specifications
### Model Architecture and Objective
The model follows the **Mamba2** architecture, which is a type of State Space Model (SSM) designed to handle sequences efficiently. The objective was standard Causal Language Modeling (predicting the next token) on a dataset of fiction.
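In `transformers`, this objective can be reproduced directly: passing `labels=input_ids` returns the shifted next-token cross-entropy loss, which also gives a quick perplexity sanity check. A minimal sketch; the sample sentence is arbitrary:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "telecomadm1145/mamba2_exp2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

text = "少女抬起头，望向远处的星空。"  # arbitrary light-novel-style sample
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels equal to input_ids, the model computes the causal LM loss internally
    out = model(**inputs, labels=inputs["input_ids"])

print(f"loss = {out.loss.item():.3f}, perplexity ~ {torch.exp(out.loss).item():.1f}")
```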
---