[xcodec2 model mismatch]
I am getting this error:
"You are using a model of type xcodec2 to instantiate a model of type xcodec. This is not supported for all configurations of models and can yield errors."
It seems that, as of now, the xcodec2 model is not natively supported by Hugging Face Transformers.
Should I try to customize AutoConfig and Trainer at the code level?
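If you do go the registration route, a minimal sketch looks like the following. Note that XCodec2DemoConfig and the "xcodec2-demo" model type are hypothetical names for illustration; the real classes for this checkpoint live in configuration_bigcodec.py / modeling_xcodec2.py and are normally picked up via trust_remote_code=True.

```python
# Hedged sketch: registering a custom config class so AutoConfig recognizes
# a model type that transformers does not ship natively.
# "xcodec2-demo" and XCodec2DemoConfig are hypothetical illustration names.
from transformers import AutoConfig, PretrainedConfig

class XCodec2DemoConfig(PretrainedConfig):
    model_type = "xcodec2-demo"  # must match the name passed to register()

AutoConfig.register("xcodec2-demo", XCodec2DemoConfig)

# AutoConfig can now build this config from its model type string
cfg = AutoConfig.for_model("xcodec2-demo")
print(type(cfg).__name__)  # XCodec2DemoConfig
```

Registration only teaches the Auto* classes about the model type; it does not change how checkpoint weight names are matched, so it may not be enough on its own.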
It's working for me with xcodec2 version 0.1.3, but with 0.1.5 it fails.
It also fails with xcodec2 version 0.1.3 and the latest transformers. By "fails" I mean it shows the message below:
You are using a model of type xcodec2 to instantiate a model of type xcodec. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at /data1/nowshad/models/xcodec2 were not used when initializing XCodec2Model: ['CodecEnc.conv_blocks.1.block.0.block.0.act.beta', 'CodecEnc.conv_blocks.1.block.0.block.2.act.beta', 'CodecEnc.conv_blocks.1.block.1.block.0.act.beta', 'CodecEnc.conv_blocks.1.block.1.block.2.act.beta', 'CodecEnc.conv_blocks.1.block.2.block.0.act.beta', 'CodecEnc.conv_blocks.1.block.2.block.2.act.beta', 'CodecEnc.conv_blocks.1.block.3.act.beta', 'CodecEnc.conv_blocks.2.block.0.block.0.act.beta', 'CodecEnc.conv_blocks.2.block.0.block.2.act.beta', 'CodecEnc.conv_blocks.2.block.1.block.0.act.beta', 'CodecEnc.conv_blocks.2.block.1.block.2.act.beta', 'CodecEnc.conv_blocks.2.block.2.block.0.act.beta', 'CodecEnc.conv_blocks.2.block.2.block.2.act.beta', 'CodecEnc.conv_blocks.2.block.3.act.beta', 'CodecEnc.conv_blocks.3.block.0.block.0.act.beta', 'CodecEnc.conv_blocks.3.block.0.block.2.act.beta', 'CodecEnc.conv_blocks.3.block.1.block.0.act.beta', 'CodecEnc.conv_blocks.3.block.1.block.2.act.beta', 'CodecEnc.conv_blocks.3.block.2.block.0.act.beta', 'CodecEnc.conv_blocks.3.block.2.block.2.act.beta', 'CodecEnc.conv_blocks.3.block.3.act.beta', 'CodecEnc.conv_blocks.4.block.0.block.0.act.beta', 'CodecEnc.conv_blocks.4.block.0.block.2.act.beta', 'CodecEnc.conv_blocks.4.block.1.block.0.act.beta', 'CodecEnc.conv_blocks.4.block.1.block.2.act.beta', 'CodecEnc.conv_blocks.4.block.2.block.0.act.beta', 'CodecEnc.conv_blocks.4.block.2.block.2.act.beta', 'CodecEnc.conv_blocks.4.block.3.act.beta', 'CodecEnc.conv_blocks.5.block.0.block.0.act.beta', 'CodecEnc.conv_blocks.5.block.0.block.2.act.beta', 'CodecEnc.conv_blocks.5.block.1.block.0.act.beta', 'CodecEnc.conv_blocks.5.block.1.block.2.act.beta', 'CodecEnc.conv_blocks.5.block.2.block.0.act.beta', 'CodecEnc.conv_blocks.5.block.2.block.2.act.beta', 'CodecEnc.conv_blocks.5.block.3.act.beta', 'CodecEnc.conv_final_block.0.act.beta']
- This IS expected if you are initializing XCodec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XCodec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XCodec2Model were not initialized from the model checkpoint at /data1/nowshad/models/xcodec2 and are newly initialized: ['CodecEnc.conv_blocks.1.block.0.block.0.act.bias', 'CodecEnc.conv_blocks.1.block.0.block.2.act.bias', 'CodecEnc.conv_blocks.1.block.1.block.0.act.bias', 'CodecEnc.conv_blocks.1.block.1.block.2.act.bias', 'CodecEnc.conv_blocks.1.block.2.block.0.act.bias', 'CodecEnc.conv_blocks.1.block.2.block.2.act.bias', 'CodecEnc.conv_blocks.1.block.3.act.bias', 'CodecEnc.conv_blocks.2.block.0.block.0.act.bias', 'CodecEnc.conv_blocks.2.block.0.block.2.act.bias', 'CodecEnc.conv_blocks.2.block.1.block.0.act.bias', 'CodecEnc.conv_blocks.2.block.1.block.2.act.bias', 'CodecEnc.conv_blocks.2.block.2.block.0.act.bias', 'CodecEnc.conv_blocks.2.block.2.block.2.act.bias', 'CodecEnc.conv_blocks.2.block.3.act.bias', 'CodecEnc.conv_blocks.3.block.0.block.0.act.bias', 'CodecEnc.conv_blocks.3.block.0.block.2.act.bias', 'CodecEnc.conv_blocks.3.block.1.block.0.act.bias', 'CodecEnc.conv_blocks.3.block.1.block.2.act.bias', 'CodecEnc.conv_blocks.3.block.2.block.0.act.bias', 'CodecEnc.conv_blocks.3.block.2.block.2.act.bias', 'CodecEnc.conv_blocks.3.block.3.act.bias', 'CodecEnc.conv_blocks.4.block.0.block.0.act.bias', 'CodecEnc.conv_blocks.4.block.0.block.2.act.bias', 'CodecEnc.conv_blocks.4.block.1.block.0.act.bias', 'CodecEnc.conv_blocks.4.block.1.block.2.act.bias', 'CodecEnc.conv_blocks.4.block.2.block.0.act.bias', 'CodecEnc.conv_blocks.4.block.2.block.2.act.bias', 'CodecEnc.conv_blocks.4.block.3.act.bias', 'CodecEnc.conv_blocks.5.block.0.block.0.act.bias', 'CodecEnc.conv_blocks.5.block.0.block.2.act.bias', 'CodecEnc.conv_blocks.5.block.1.block.0.act.bias', 'CodecEnc.conv_blocks.5.block.1.block.2.act.bias', 'CodecEnc.conv_blocks.5.block.2.block.0.act.bias', 'CodecEnc.conv_blocks.5.block.2.block.2.act.bias', 'CodecEnc.conv_blocks.5.block.3.act.bias', 'CodecEnc.conv_final_block.0.act.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Because some weights are not initialized properly, the encoded values come out as 0.
OK, the issue is not with xcodec2 but with the latest transformers. As the earlier warning message shows (the weights not getting loaded properly), after modifying the keys in the model.safetensors file it works without any issues with the latest transformers as well as the latest xcodec2.
Basically, the latest transformers expects the key names with "beta" converted to "bias" (for example, the key CodecEnc.conv_blocks.1.block.0.block.0.act.beta is present in the original safetensors file, but transformers expects CodecEnc.conv_blocks.1.block.0.block.0.act.bias). After renaming, it works as expected.
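A minimal sketch of that rename (pure-Python demo on a plain dict; the ".act.beta" suffix and the demo key come from the warning above, and with a real checkpoint you would load and save the tensors with safetensors.torch's load_file/save_file):

```python
# Sketch of the beta -> bias key rename described above. On a real checkpoint
# you would wrap it with safetensors, e.g.:
#   from safetensors.torch import load_file, save_file
#   sd = load_file("model.safetensors")
#   save_file(rename_beta_to_bias(sd), "model.fixed.safetensors")  # hypothetical output name
def rename_beta_to_bias(state_dict):
    renamed = {}
    for key, value in state_dict.items():
        # Only the affected activation parameters end in ".act.beta"
        # (see the warning message above)
        if key.endswith(".act.beta"):
            key = key[: -len("beta")] + "bias"
        renamed[key] = value
    return renamed

demo = {
    "CodecEnc.conv_blocks.1.block.0.block.0.act.beta": "tensor",
    "CodecEnc.conv_blocks.1.block.0.block.0.conv.weight": "tensor",
}
print(sorted(rename_beta_to_bias(demo)))
# ['CodecEnc.conv_blocks.1.block.0.block.0.act.bias',
#  'CodecEnc.conv_blocks.1.block.0.block.0.conv.weight']
```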
After digging into the audio encoder, I can generate audio successfully with the following test code:
import omegaconf
import typing
import torch
import soundfile as sf
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, PreTrainedModel
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel
from accelerate import init_empty_weights
from collections import OrderedDict
import collections
from configuration_bigcodec import BigCodecConfig
from xcodec2.modeling_xcodec2 import XCodec2Model
from safetensors.torch import load_file, save_file

ROOT = '/aifs4su/yiakwy/shared/audio'
PROJECT_ROOT = '/home/yiakwy/workspace/Github/sglang/3rdparty/X-Codec-2.0'

model_path = f"{ROOT}/xcodec2"

Debug = True


def find_meta_tensors(module, prefix=""):
    metas = []
    for name, param in module.named_parameters(recurse=True):
        if param.is_meta:
            metas.append(prefix + name)
    for name, buf in module.named_buffers(recurse=True):
        if buf.is_meta:
            metas.append(prefix + name)
    return metas


if Debug:
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

    # torch >= 2.6
    torch.serialization.add_safe_globals([list])
    torch.serialization.add_safe_globals([dict])
    torch.serialization.add_safe_globals([int])
    torch.serialization.add_safe_globals([typing.Any])
    torch.serialization.add_safe_globals([collections.defaultdict])
    torch.serialization.add_safe_globals([omegaconf.base.ContainerMetadata])
    torch.serialization.add_safe_globals([omegaconf.listconfig.ListConfig])
    torch.serialization.add_safe_globals([omegaconf.dictconfig.DictConfig])
    torch.serialization.add_safe_globals([omegaconf.nodes.AnyNode])
    torch.serialization.add_safe_globals([omegaconf.base.Metadata])

    ckpt = f"{model_path}/ckpt/epoch=4-step=1400000.ckpt"
    state_dict = torch.load(ckpt, map_location="cpu", weights_only=True)
    state_dict = state_dict['state_dict']

    filtered_state_dict_codec = OrderedDict()
    filtered_state_dict_semantic_encoder = OrderedDict()
    filtered_state_dict_gen = OrderedDict()
    filtered_state_dict_fc_post_a = OrderedDict()
    filtered_state_dict_fc_prior = OrderedDict()

    for key, value in state_dict.items():
        if key.startswith('CodecEnc.'):
            print(f"adding key = {key} ...")
            new_key = key[len('CodecEnc.'):]
            if "beta" in new_key:
                old_key = new_key
                new_key = new_key.replace("beta", "bias")
                print(f"updating key from {old_key} to {new_key}")
            filtered_state_dict_codec[new_key] = value
        elif key.startswith('generator.'):
            new_key = key[len('generator.'):]
            filtered_state_dict_gen[new_key] = value
        elif key.startswith('fc_post_a.'):
            new_key = key[len('fc_post_a.'):]
            filtered_state_dict_fc_post_a[new_key] = value
        elif key.startswith('SemanticEncoder_module.'):
            new_key = key[len('SemanticEncoder_module.'):]
            filtered_state_dict_semantic_encoder[new_key] = value
        elif key.startswith('fc_prior.'):
            new_key = key[len('fc_prior.'):]
            filtered_state_dict_fc_prior[new_key] = value

    model.semantic_model = Wav2Vec2BertModel.from_pretrained(f"{config.Wav2Vec2BertModelROOT}/w2v-bert-2.0", output_hidden_states=True)
    model.semantic_model.eval().cuda()

    model.SemanticEncoder_module.load_state_dict(filtered_state_dict_semantic_encoder, strict=False, assign=True)
    model.SemanticEncoder_module.eval().cuda()

    model.CodecEnc.load_state_dict(filtered_state_dict_codec, strict=False, assign=True)
    metas = find_meta_tensors(model.CodecEnc)
    model.CodecEnc.eval().cuda()

    model.generator.load_state_dict(filtered_state_dict_gen, strict=False, assign=True)
    model.generator.eval().cuda()

    model.fc_prior.load_state_dict(filtered_state_dict_fc_prior, strict=False, assign=True)
    model.fc_prior.eval().cuda()

    model.fc_post_a.load_state_dict(filtered_state_dict_fc_post_a, strict=False, assign=True)
    model.fc_post_a.eval().cuda()

    model.feature_extractor = AutoFeatureExtractor.from_pretrained(f"{config.Wav2Vec2BertModelROOT}/w2v-bert-2.0")

    output_dir = '/aifs4su/yiakwy/shared/audio/test_cvt_hf_xcodec2'

    model.eval().cuda()

    wav, sr = sf.read(f"{ROOT}/xcodec2/test.flac")
    wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

    with torch.no_grad():
        vq_code = model.encode_code(input_waveform=wav_tensor)
        print("Code:", vq_code)
        recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

    sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
    print("Done! Check reconstructed.wav")
...
This code generates correct reconstructed audio, subject to the following modifications:
- load the .pt/.ckpt files directly
- convert CodecEnc.conv_final_block.1.beta to conv_final_block.1.bias in the new models
We are updating the Hugging Face modules so that people can use this more easily. Let me know if you have any input.
Env:
Torch: 2.8
Local config:
{
"Wav2Vec2BertModelROOT": "/aifs4su/yiakwy/shared/audio/",
"architectures": [
"XCodec2Model"
],
"auto_map": {
"AutoConfig": "configuration_bigcodec.BigCodecConfig",
"AutoModelForCausalLM": "modeling_xcodec2.XCodec2Model"
},
"codec_decoder_hidden_size": 1024,
"codec_encoder_hidden_size": 1024,
"model_type": "bigcodec",
"semantic_hidden_size": 1024,
"torch_dtype": "float32",
"transformers_version": "4.51.3",
"use_vocos": true
}
Thank you.