Text Generation
Transformers
Safetensors
qwen3
Generated from Trainer
conversational
text-generation-inference
Instructions to use Abner0803/qwen_nq_compressed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Abner0803/qwen_nq_compressed with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Abner0803/qwen_nq_compressed")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Abner0803/qwen_nq_compressed")
model = AutoModelForCausalLM.from_pretrained("Abner0803/qwen_nq_compressed")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use Abner0803/qwen_nq_compressed with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Abner0803/qwen_nq_compressed"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Abner0803/qwen_nq_compressed",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/Abner0803/qwen_nq_compressed
- SGLang
How to use Abner0803/qwen_nq_compressed with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Abner0803/qwen_nq_compressed" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Abner0803/qwen_nq_compressed",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Abner0803/qwen_nq_compressed" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Abner0803/qwen_nq_compressed",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use Abner0803/qwen_nq_compressed with Docker Model Runner:
docker model run hf.co/Abner0803/qwen_nq_compressed
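The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server from the section above is listening on `localhost:8000` (pass `base_url="http://localhost:30000"` for the SGLang server instead). The helper names `build_chat_request` and `chat` are hypothetical and not part of any library; the response is assumed to follow the usual OpenAI-compatible chat completion shape.

```python
import json
import urllib.request

MODEL_ID = "Abner0803/qwen_nq_compressed"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8000", timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response body with a `choices` list.
    """
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` should return the reply text. Building the request separately from sending it keeps the payload easy to inspect without a live server.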
| [2026-05-21 05:26:08,359] [DEBUG] [axolotl.utils.config.log_gpu_memory_usage:127] [PID:3208933] baseline 0.000GB () | |
| [2026-05-21 05:26:08,361] [INFO] [axolotl.cli.config.load_cfg:341] [PID:3208933] config: | |
| { | |
| "activation_offloading": false, | |
| "axolotl_config_path": "configs/axolotl_qwen3-1.7b_nq-text-title.yml", | |
| "base_model": "Qwen/Qwen3-1.7B-Base", | |
| "base_model_config": "Qwen/Qwen3-1.7B-Base", | |
| "batch_size": 128, | |
| "bf16": true, | |
| "capabilities": { | |
| "bf16": true, | |
| "compute_capability": "sm_120", | |
| "fp8": true, | |
| "n_gpu": 1, | |
| "n_node": 1, | |
| "tf32": true | |
| }, | |
| "context_parallel_size": 1, | |
| "dataloader_num_workers": 2, | |
| "dataset_num_proc": 128, | |
| "datasets": [ | |
| { | |
| "chat_template": "tokenizer_default_fallback_chatml", | |
| "field_messages": "conversations", | |
| "message_property_mappings": { | |
| "content": "content", | |
| "role": "role" | |
| }, | |
| "path": "nq_text_compressed_axolotl/train_with_pseudo_axolotl.jsonl", | |
| "roles": { | |
| "assistant": [ | |
| "assistant", | |
| "gpt", | |
| "model" | |
| ], | |
| "system": [ | |
| "system" | |
| ], | |
| "user": [ | |
| "user", | |
| "human" | |
| ] | |
| }, | |
| "trust_remote_code": false, | |
| "type": "chat_template" | |
| } | |
| ], | |
| "ddp": false, | |
| "device": "cuda:0", | |
| "dion_rank_fraction": 1.0, | |
| "dion_rank_multiple_of": 1, | |
| "eaft_alpha": 1.0, | |
| "eaft_k": 20, | |
| "env_capabilities": { | |
| "torch_version": "2.8.0" | |
| }, | |
| "eval_batch_size": 4, | |
| "eval_causal_lm_metrics": [ | |
| "sacrebleu", | |
| "comet", | |
| "ter", | |
| "chrf" | |
| ], | |
| "eval_max_new_tokens": 128, | |
| "eval_table_size": 0, | |
| "experimental_skip_move_to_device": true, | |
| "flash_attention": false, | |
| "flex_attention": false, | |
| "fp16": false, | |
| "generate_samples": false, | |
| "generation_do_sample": true, | |
| "generation_max_new_tokens": 50, | |
| "generation_prompt_ratio": 0.5, | |
| "generation_temperature": 0.7, | |
| "gradient_accumulation_steps": 32, | |
| "gradient_checkpointing": true, | |
| "gradient_checkpointing_kwargs": { | |
| "use_reentrant": true | |
| }, | |
| "include_tkps": true, | |
| "layer_offloading": false, | |
| "learning_rate": 0.0001, | |
| "lisa_layers_attribute": "model.layers", | |
| "load_best_model_at_end": false, | |
| "load_in_4bit": false, | |
| "load_in_8bit": false, | |
| "local_rank": 0, | |
| "logging_steps": 50, | |
| "lora_dropout": 0.0, | |
| "loraplus_lr_embedding": 1e-06, | |
| "lr_scheduler": "cosine", | |
| "mean_resizing_embeddings": false, | |
| "merge_method": "memory_efficient", | |
| "micro_batch_size": 4, | |
| "model_config_type": "qwen3", | |
| "num_epochs": 10.0, | |
| "num_generation_samples": 3, | |
| "optimizer": "adamw_torch", | |
| "otel_metrics_host": "localhost", | |
| "otel_metrics_port": 8000, | |
| "output_dir": "./checkpoint/Qwen3-1.7B-nq_text_compressed-with_pseudo-lr1e-4-10epochs", | |
| "pad_to_sequence_len": false, | |
| "pretrain_multipack_attn": true, | |
| "profiler_steps_start": 0, | |
| "qlora_sharded_model_loading": false, | |
| "quantize_moe_experts": false, | |
| "ray_num_workers": 1, | |
| "resources_per_worker": { | |
| "GPU": 1 | |
| }, | |
| "sample_packing": false, | |
| "sample_packing_bin_size": 200, | |
| "sample_packing_group_size": 100000, | |
| "save_only_model": false, | |
| "save_safetensors": true, | |
| "save_strategy": "epoch", | |
| "save_total_limit": 3, | |
| "sdp_attention": true, | |
| "sequence_len": 512, | |
| "shuffle_before_merging_datasets": false, | |
| "shuffle_merged_datasets": true, | |
| "skip_prepare_dataset": false, | |
| "special_tokens": { | |
| "eos_token": "<|im_end|>" | |
| }, | |
| "streaming_multipack_buffer_size": 10000, | |
| "strict": false, | |
| "tensor_parallel_size": 1, | |
| "tf32": false, | |
| "tiled_mlp_use_original_mlp": true, | |
| "tokenizer_config": "Qwen/Qwen3-1.7B-Base", | |
| "tokenizer_save_jinja_files": true, | |
| "torch_dtype": "torch.bfloat16", | |
| "train_on_inputs": false, | |
| "trl": { | |
| "async_prefetch": false, | |
| "log_completions": false, | |
| "mask_truncated_completions": false, | |
| "ref_model_mixup_alpha": 0.9, | |
| "ref_model_sync_steps": 64, | |
| "replay_buffer_size": 0, | |
| "replay_recompute_logps": true, | |
| "reroll_max_groups": 1, | |
| "reroll_start_fraction": 1.0, | |
| "reward_num_workers": 1, | |
| "scale_rewards": true, | |
| "skip_zero_advantage_batches": true, | |
| "sync_ref_model": false, | |
| "use_data_producer": false, | |
| "use_vllm": false, | |
| "vllm_lora_sync": false, | |
| "vllm_server_host": "0.0.0.0", | |
| "vllm_server_port": 8000 | |
| }, | |
| "use_otel_metrics": false, | |
| "use_ray": false, | |
| "use_wandb": true, | |
| "val_set_size": 0.0, | |
| "vllm": { | |
| "device": "auto", | |
| "dtype": "auto", | |
| "gpu_memory_utilization": 0.9, | |
| "host": "0.0.0.0", | |
| "port": 8000 | |
| }, | |
| "wandb_entity": "abnerden0803-national-taiwan-university", | |
| "wandb_name": "qwen3-1.7b-nq_text_compressed-pseudo-lr1e-4-10epochs", | |
| "wandb_project": "ICLGR-NQ", | |
| "warmup_ratio": 0.1, | |
| "weight_decay": 0.0, | |
| "world_size": 1, | |
| "xformers_attention": false | |
| } | |
| [2026-05-21 05:26:10,257] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:307] [PID:3208933] EOS: 151645 / <|im_end|> | |
| [2026-05-21 05:26:10,257] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:308] [PID:3208933] BOS: None / None | |
| [2026-05-21 05:26:10,257] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:309] [PID:3208933] PAD: 151643 / <|endoftext|> | |
| [2026-05-21 05:26:10,257] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:310] [PID:3208933] UNK: None / None | |
| [2026-05-21 05:26:10,259] [INFO] [axolotl.utils.data.shared.load_preprocessed_dataset:480] [PID:3208933] Unable to find prepared dataset in last_run_prepared/a8d61713fe28909dcab9370999e181f6 | |
| [2026-05-21 05:26:10,259] [INFO] [axolotl.utils.data.sft._load_raw_datasets:320] [PID:3208933] Loading raw datasets... | |
| [2026-05-21 05:26:10,259] [WARNING] [axolotl.utils.data.sft._load_raw_datasets:322] [PID:3208933] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset using `axolotl preprocess path/to/config.yml`. | |
| [2026-05-21 05:26:10,953] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:3208933] Loading dataset: nq_text_compressed_axolotl/train_with_pseudo_axolotl.jsonl with base_type: chat_template and prompt_style: None | |
| [2026-05-21 05:26:10,956] [INFO] [axolotl.prompt_strategies.chat_template.__call__:998] [PID:3208933] Using chat template: | |
| --- | |
| {%- if tools %} | |
| {{- '<|im_start|>system\n' }} | |
| {%- if messages[0].role == 'system' %} | |
| {{- messages[0].content + '\n\n' }} | |
| {%- endif %} | |
| {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }} | |
| {%- for tool in tools %} | |
| {{- "\n" }} | |
| {{- tool | tojson }} | |
| {%- endfor %} | |
| {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }} | |
| {%- else %} | |
| {%- if messages[0].role == 'system' %} | |
| {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }} | |
| {%- endif %} | |
| {%- endif %} | |
| {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %} | |
| {%- for message in messages[::-1] %} | |
| {%- set index = (messages|length - 1) - loop.index0 %} | |
| {%- if ns.multi_step_tool and message.role == "user" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %} | |
| {%- set ns.multi_step_tool = false %} | |
| {%- set ns.last_query_index = index %} | |
| {%- endif %} | |
| {%- endfor %} | |
| {%- for message in messages %} | |
| {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} | |
| {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }} | |
| {%- elif message.role == "assistant" %} | |
| {%- set content = message.content %} | |
| {%- set reasoning_content = '' %} | |
| {%- if message.reasoning_content is defined and message.reasoning_content is not none %} | |
| {%- set reasoning_content = message.reasoning_content %} | |
| {%- else %} | |
| {%- if '</think>' in message.content %} | |
| {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} | |
| {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %} | |
| {%- endif %} | |
| {%- endif %} | |
| {%- if loop.index0 > ns.last_query_index %} | |
| {%- if loop.last or (not loop.last and reasoning_content) %} | |
| {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }} | |
| {%- else %} | |
| {{- '<|im_start|>' + message.role + '\n' + content }} | |
| {%- endif %} | |
| {%- else %} | |
| {{- '<|im_start|>' + message.role + '\n' + content }} | |
| {%- endif %} | |
| {%- if message.tool_calls %} | |
| {%- for tool_call in message.tool_calls %} | |
| {%- if (loop.first and content) or (not loop.first) %} | |
| {{- '\n' }} | |
| {%- endif %} | |
| {%- if tool_call.function %} | |
| {%- set tool_call = tool_call.function %} | |
| {%- endif %} | |
| {{- '<tool_call>\n{"name": "' }} | |
| {{- tool_call.name }} | |
| {{- '", "arguments": ' }} | |
| {%- if tool_call.arguments is string %} | |
| {{- tool_call.arguments }} | |
| {%- else %} | |
| {{- tool_call.arguments | tojson }} | |
| {%- endif %} | |
| {{- '}\n</tool_call>' }} | |
| {%- endfor %} | |
| {%- endif %} | |
| {{- '<|im_end|>\n' }} | |
| {%- elif message.role == "tool" %} | |
| {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %} | |
| {{- '<|im_start|>user' }} | |
| {%- endif %} | |
| {{- '\n<tool_response>\n' }} | |
| {{- message.content }} | |
| {{- '\n</tool_response>' }} | |
| {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %} | |
| {{- '<|im_end|>\n' }} | |
| {%- endif %} | |
| {%- endif %} | |
| {%- endfor %} | |
| {%- if add_generation_prompt %} | |
| {{- '<|im_start|>assistant\n' }} | |
| {%- if enable_thinking is defined and enable_thinking is false %} | |
| {{- '<think>\n\n</think>\n\n' }} | |
| {%- endif %} | |
| {%- endif %} | |
| --- | |
| [2026-05-21 05:26:17,621] [INFO] [axolotl.utils.data.utils._log_dataset_stats:212] [PID:3208933] min_input_len: 16 | |
| [2026-05-21 05:26:17,621] [INFO] [axolotl.utils.data.utils._log_dataset_stats:213] [PID:3208933] max_input_len: 987 | |
| Dropping Invalid Sequences (<None or >512) (num_proc=128): 100%| 748586/748586 [00:02<00:00, 249715.33 examples/s] | |
| [2026-05-21 05:26:21,561] [INFO] [axolotl.utils.data.utils._drop_outside_range:306] [PID:3208933] Dropped 467 sequences outside valid range ([None, 512]) | |
| Saving the dataset (128/128 shards): 100%| 748119/748119 [00:04<00:00, 158325.36 examples/s] | |
| [2026-05-21 05:26:30,874] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:420] [PID:3208933] total_num_tokens: 41_357_188 | |
| [2026-05-21 05:26:33,832] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:438] [PID:3208933] `total_supervised_tokens: 4_704_021` | |
| [2026-05-21 05:26:33,833] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:521] [PID:3208933] total_num_steps: 58447 | |
| [2026-05-21 05:26:33,833] [INFO] [axolotl.utils.data.sft._prepare_standard_dataset:121] [PID:3208933] Maximum number of steps set at 58447 | |
| [2026-05-21 05:26:33,887] [DEBUG] [axolotl.train.setup_model_and_tokenizer:70] [PID:3208933] loading tokenizer... Qwen/Qwen3-1.7B-Base | |
| [2026-05-21 05:26:35,407] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:307] [PID:3208933] EOS: 151645 / <|im_end|> | |
| [2026-05-21 05:26:35,407] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:308] [PID:3208933] BOS: None / None | |
| [2026-05-21 05:26:35,407] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:309] [PID:3208933] PAD: 151643 / <|endoftext|> | |
| [2026-05-21 05:26:35,407] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:310] [PID:3208933] UNK: None / None | |
| [2026-05-21 05:26:35,408] [DEBUG] [axolotl.train.setup_model_and_tokenizer:81] [PID:3208933] Loading model | |
| [2026-05-21 05:26:35,615] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:75] [PID:3208933] Patched OptimState8bit for torch.compile compatibility | |
| [2026-05-21 05:26:35,615] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:122] [PID:3208933] Patched OptimState4bit for torch.compile compatibility | |
| [2026-05-21 05:26:35,615] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:154] [PID:3208933] Patched OptimStateFp8 for torch.compile compatibility | |
| [2026-05-21 05:26:35,621] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:94] [PID:3208933] Patched Trainer.evaluation_loop with nanmean loss calculation | |
| [2026-05-21 05:26:35,622] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:148] [PID:3208933] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation | |
| Loading weights: 100%|██████████| 310/310 [00:00<00:00, 10198.32it/s] | |
| [2026-05-21 05:26:36,736] [DEBUG] [axolotl.loaders.model.log_gpu_memory_usage:127] [PID:3208933] Memory usage after model load 0.000GB () | |
| [2026-05-21 05:26:41,122] [INFO] [axolotl.train.save_initial_configs:421] [PID:3208933] Pre-saving tokenizer to ./checkpoint/Qwen3-1.7B-nq_text_compressed-with_pseudo-lr1e-4-10epochs... | |
| [2026-05-21 05:26:41,234] [INFO] [axolotl.train.save_initial_configs:426] [PID:3208933] Pre-saving model config to ./checkpoint/Qwen3-1.7B-nq_text_compressed-with_pseudo-lr1e-4-10epochs... | |
| [2026-05-21 05:26:41,238] [INFO] [axolotl.train.execute_training:222] [PID:3208933] Starting trainer... | |
| wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from /mnt/raid0/home/abner/.netrc. | |
| wandb: Currently logged in as: abnerden0803 (abnerden0803-national-taiwan-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin | |
| wandb: Tracking run with wandb version 0.26.0 | |
| wandb: Run data is saved locally in /mnt/raid0/home/abner/git/ICLGR/wandb/run-20260521_052641-0zmanqq0 | |
| wandb: Run `wandb offline` to turn off syncing. | |
| wandb: Syncing run qwen3-1.7b-nq_text_compressed-pseudo-lr1e-4-10epochs | |
| wandb: View project at https://wandb.ai/abnerden0803-national-taiwan-university/ICLGR-NQ | |
| wandb: View run at https://wandb.ai/abnerden0803-national-taiwan-university/ICLGR-NQ/runs/0zmanqq0 | |
| wandb: WARNING Saving files without folders. If you want to preserve subdirectories pass base_path to wandb.save, i.e. wandb.save("/mnt/folder/file.h5", base_path="/mnt") | |
| wandb: WARNING Symlinked 1 file into the W&B run directory; call wandb.save again to sync new files. | |
| [2026-05-21 05:26:44,641] [INFO] [axolotl.utils.callbacks.on_train_begin:757] [PID:3208933] The Axolotl config has been saved to the WandB run under files. | |
| 0%| | 50/58447 [04:29<89:21:57, 5.51s/it] | |
| {'loss': '6.238', 'grad_norm': '2.27', 'learning_rate': '8.385e-07', 'ppl': '512', 'memory/max_active (GiB)': '19.7', 'memory/max_allocated (GiB)': '19.7', 'memory/device_reserved (GiB)': '20.59', 'tokens/train_per_sec_per_gpu': '4.694', 'tokens/total': 800316, 'tokens/trainable': 40263, 'epoch': '0.008555'} | |
| 0%| | 55/58447 [04:56<87:51:41, 5.42s/it] | |
| Process Process-2: | |
| Traceback (most recent call last): | |
| File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap | |
| self.run() | |
| File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run | |
| self._target(*self._args, **self._kwargs) | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 315, in _worker_loop | |
| r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/multiprocessing/queues.py", line 113, in get | |
| if not self._poll(timeout): | |
| ^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/multiprocessing/connection.py", line 257, in poll | |
| return self._poll(timeout) | |
| ^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/multiprocessing/connection.py", line 440, in _poll | |
| r = wait([self], timeout) | |
| ^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/multiprocessing/connection.py", line 1136, in wait | |
| ready = selector.select(timeout) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/selectors.py", line 415, in select | |
| fd_event_list = self._selector.poll(timeout) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/axolotl/train.py", line 175, in <lambda> | |
| lambda signum, frame: terminate_handler(signum, frame, _model_weakref), | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/axolotl/train.py", line 167, in terminate_handler | |
| _model.save_pretrained(cfg.output_dir) | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 3352, in save_pretrained | |
| state_dict = remove_tied_weights_from_state_dict(state_dict, model_to_save) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 438, in remove_tied_weights_from_state_dict | |
| shared_names, disjoint_names = _find_disjoint(shared_ptrs.values(), state_dict) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 352, in _find_disjoint | |
| areas.append((tensor.data_ptr(), _end_ptr(tensor), name)) | |
| ^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 328, in _end_ptr | |
| stop = tensor.view(-1)[-1].data_ptr() + tensor.element_size() | |
| ~~~~~~~~~~~~~~~^^^^ | |
| torch.AcceleratorError: CUDA error: initialization error | |
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. | |
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1 | |
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. | |
| Process Process-1: | |
| Traceback (most recent call last): | |
| File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap | |
| self.run() | |
| File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run | |
| self._target(*self._args, **self._kwargs) | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 315, in _worker_loop | |
| r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/multiprocessing/queues.py", line 113, in get | |
| if not self._poll(timeout): | |
| ^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/multiprocessing/connection.py", line 257, in poll | |
| return self._poll(timeout) | |
| ^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/multiprocessing/connection.py", line 440, in _poll | |
| r = wait([self], timeout) | |
| ^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/multiprocessing/connection.py", line 1136, in wait | |
| ready = selector.select(timeout) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/lib/python3.12/selectors.py", line 415, in select | |
| fd_event_list = self._selector.poll(timeout) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/axolotl/train.py", line 175, in <lambda> | |
| lambda signum, frame: terminate_handler(signum, frame, _model_weakref), | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/axolotl/train.py", line 167, in terminate_handler | |
| _model.save_pretrained(cfg.output_dir) | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 3352, in save_pretrained | |
| state_dict = remove_tied_weights_from_state_dict(state_dict, model_to_save) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 438, in remove_tied_weights_from_state_dict | |
| shared_names, disjoint_names = _find_disjoint(shared_ptrs.values(), state_dict) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 352, in _find_disjoint | |
| areas.append((tensor.data_ptr(), _end_ptr(tensor), name)) | |
| ^^^^^^^^^^^^^^^^ | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 328, in _end_ptr | |
| stop = tensor.view(-1)[-1].data_ptr() + tensor.element_size() | |
| ~~~~~~~~~~~~~~~^^^^ | |
| torch.AcceleratorError: CUDA error: initialization error | |
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. | |
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1 | |
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. | |
| Writing model shards: 0%| | 0/1 [00:00<?, ?it/s] | |
| Writing model shards: 100%|██████████| 1/1 [00:13<00:00, 13.44s/it] | |
| Exception ignored in: <generator object tqdm.__iter__ at 0x74400ae771c0> | |
| Traceback (most recent call last): | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1196, in __iter__ | |
| self.close() | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1265, in close | |
| def close(self): | |
| File "/mnt/raid0/home/abner/git/conv-gr-new-dataset/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler | |
| _error_if_any_worker_fails() | |
| RuntimeError: DataLoader worker (pid 3210856) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace. | |
| Writing model shards: 0%| | 0/1 [00:14<?, ?it/s] | |
| 0%| | 55/58447 [05:11<91:52:41, 5.66s/it] | |