Great work. I had an issue running this in Colab

by smjain - opened Mar 18, 2023

Mar 18, 2023

/usr/local/lib/python3.9/dist-packages/bitsandbytes/functional.py in transform(A, to_order, from_order, out, transpose, state, ld)
1696
1697 def transform(A, to_order, from_order='row', out=None, transpose=False, state=None, ld=None):
-> 1698 prev_device = pre_call(A.device)
1699 if state is None: state = (A.shape, from_order)
1700 else: from_order = state[1]

AttributeError: 'NoneType' object has no attribute 'device'
Can you please check?

mrm8488

Owner Mar 18, 2023

•

edited Mar 18, 2023

Look here: https://huggingface.co/mrm8488/Alpacoom/discussions/3 @smjain

smjain

Mar 18, 2023

Thanks. I checked and got it working.

smjain

Mar 18, 2023

•

edited Mar 18, 2023

I see another issue. On each call to generate, GPU memory usage keeps growing. It's not getting released.

mrm8488

Owner Mar 18, 2023

cc: @ybelkada

spuun

Mar 22, 2023

Hi!
I've been running this model for the past couple days, really nice model, tysm for open-sourcing it! 😊
Anyway, currently having the same issue with the VRAM usage, any development on this?
If it's of any help, I don't see an increase on every call from the looks of it, just occasionally.

spuun

Mar 23, 2023

Messed around with it today, seems like adding a

    torch.cuda.empty_cache()
    import gc; gc.collect()

to the generate() function helped! :)

mrm8488

Owner Mar 23, 2023

Thank you so much @spuun! Could you share the code snippet here for the rest of the folks?

spuun

Mar 23, 2023

•

edited Mar 24, 2023

Sure! It's really just adding those calls into the function (idt the place you put them matters tbh, they're just garbage collector calls, added two of them just to make sure).

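import gc

import torch
from transformers import GenerationConfig

# `model`, `tokenizer`, and `generate_prompt` are assumed to be defined earlier in the notebook.
def generate(
        instruction,
        input=None,
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,
        **kwargs,
):
    # Clean up anything left over from previous calls. Collect garbage first so
    # that empty_cache() can release the blocks those objects were holding.
    gc.collect()
    torch.cuda.empty_cache()
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    # Drop references to the GPU tensors, then clean up again before returning.
    del generation_output, s, input_ids
    gc.collect()
    torch.cuda.empty_cache()
    return output.split("### Response:")[1].strip().split("Below")[0]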
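If you'd rather not edit generate() itself, the same cleanup can be wrapped around it. This is only a minimal sketch: `free_gpu_memory` and `with_gpu_cleanup` are hypothetical helper names, not part of the model repo or of PyTorch, and the `try`/`except` around the torch import is only there so the snippet also loads on a machine without PyTorch.

```python
import functools
import gc

try:
    import torch
except ImportError:  # assumption: lets the sketch load without PyTorch installed
    torch = None


def free_gpu_memory():
    """Collect unreachable Python objects, then release PyTorch's cached CUDA blocks."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def with_gpu_cleanup(fn):
    """Hypothetical decorator: run free_gpu_memory() after fn, even if fn raises."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        finally:
            free_gpu_memory()
    return wrapper
```

Usage would be something like `generate = with_gpu_cleanup(generate)`. The `finally` means the cleanup also runs when generation fails partway (for example on an out-of-memory error), which the inline version does not cover.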

Before: [image]

After: [image]

What I've found is that this seems to occur only with large prompts. I'm not sure where the threshold that triggers it is, but from what I can tell, the size of the prompt is really what does it.
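To narrow down where that threshold is, one option is to record memory before and after each call. A rough sketch, assuming PyTorch's `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` counters; `cuda_memory_mib` and `profile_call` are hypothetical helper names introduced here for illustration.

```python
import gc

try:
    import torch
except ImportError:  # assumption: lets the sketch load without PyTorch installed
    torch = None


def cuda_memory_mib():
    """Return (allocated, reserved) CUDA memory in MiB, or (0.0, 0.0) without a GPU."""
    if torch is None or not torch.cuda.is_available():
        return 0.0, 0.0
    mib = 1024 ** 2
    return torch.cuda.memory_allocated() / mib, torch.cuda.memory_reserved() / mib


def profile_call(fn, *args, **kwargs):
    """Call fn and report how allocated and reserved memory changed across the call."""
    gc.collect()
    before_alloc, before_reserved = cuda_memory_mib()
    result = fn(*args, **kwargs)
    after_alloc, after_reserved = cuda_memory_mib()
    stats = {
        "allocated_delta_mib": after_alloc - before_alloc,
        "reserved_delta_mib": after_reserved - before_reserved,
    }
    return result, stats
```

Calling `profile_call(generate, instruction)` with prompts of increasing length should show where the jump happens. Allocated memory is what live tensors occupy, while reserved memory also includes blocks the caching allocator keeps around for reuse. If only the reserved number grows, it is the cache rather than a true leak, which would fit with `torch.cuda.empty_cache()` helping.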
