
Problem when loading audio #53

@jiangshenyuan

Description
Thanks for your great work!
I encountered the following error when using the model to generate an answer from audio input. The model version is ming-lite-omni-v1.5.
Traceback (most recent call last):
  File "/workspace/audio_eval_llms/evals/ming_lite_omni1_5.py", line 247, in
    main(args)
  File "/workspace/audio_eval_llms/evals/ming_lite_omni1_5.py", line 219, in main
    generated_ids = model.generate(
  File "/anaconda3/envs/ming_lite_omni/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Ming/modeling_bailingmm.py", line 655, in generate
    audio_embeds, audio_embeds_lengths = self.extract_audio_feature(
  File "/workspace/Ming/modeling_bailingmm.py", line 313, in extract_audio_feature
    audio_embeds, _, audio_embeds_lengths = encode_audio_segments(
  File "/workspace/Ming/modeling_utils.py", line 913, in encode_audio_segments
    audio_feats_seg = encoder(feat_segs_batch)
  File "/anaconda3/envs/ming_lite_omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/anaconda3/envs/ming_lite_omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/Ming/modeling_whisper_encoder.py", line 24, in forward
    x = F.gelu(self.conv1(x))
  File "/anaconda3/envs/ming_lite_omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/anaconda3/envs/ming_lite_omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/anaconda3/envs/ming_lite_omni/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 375, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/anaconda3/envs/ming_lite_omni/lib/python3.10/site-packages/whisper/model.py", line 57, in _conv_forward
    return super()._conv_forward(
  File "/anaconda3/envs/ming_lite_omni/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 370, in _conv_forward
    return F.conv1d(
RuntimeError: Given groups=1, weight of size [1280, 128, 3], expected input[1, 560, 353] to have 128 channels, but got 560 channels instead
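For reference, the mismatch can be reproduced in isolation with plain PyTorch. The shapes below come straight from the traceback (conv weight [1280, 128, 3], input [1, 560, 353]); the Conv1d here is a standalone stand-in for the encoder's first layer, not the actual Ming code:

```python
import torch
import torch.nn as nn

# Stand-in for the encoder's first convolution: weight shape [1280, 128, 3],
# i.e. out_channels=1280, in_channels=128 (128 mel bins), kernel_size=3.
conv1 = nn.Conv1d(in_channels=128, out_channels=1280, kernel_size=3, padding=1)

# Features shaped [batch, 128, n_frames] pass through as expected:
ok = conv1(torch.randn(1, 128, 353))
print(ok.shape)  # torch.Size([1, 1280, 353])

# Features with 560 channels, as in the traceback, raise the same RuntimeError:
try:
    conv1(torch.randn(1, 560, 353))
except RuntimeError as e:
    print(e)
```

So the features reaching the encoder have 560 channels where the Whisper-style front end expects 128 mel bins; the channel dimension of the audio features does not match what the encoder's conv1 was built for.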

It seems the new version of Ming uses a Whisper audio encoder that cannot correctly handle the original data format.
How can I make this work?
Looking forward to your reply, thanks.
