5 Comments
Dec 19, 2023 · Liked by Benjamin Marie

Hi Benjamin,

Do we know how the inference computation is performed when the device map spreads the model over VRAM/RAM/disk? Is everything transferred to the GPU for computation, or is inference performed on the CPU for layers whose parameters sit in RAM or on disk? Thanks!

author

I don't know how it works. I didn't see anything about that in the Accelerate documentation. I would guess that each part of the model is read from wherever it is stored, while the computation happens on the fastest device.

Dec 20, 2023 · Liked by Benjamin Marie

Thanks! I'll ask people at HF.

The reason I was asking: at some point you mentioned setting `max_memory` for the VRAM and leaving some room to avoid OOM. I wonder whether the OOM was caused, on top of everything else loaded in VRAM, by parts of the model stored in RAM/on disk being moved into VRAM for computation. I'll post the answer here when I receive it.
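
For reference, something like this is what I had in mind (a minimal sketch; the model name and memory figures are placeholders, to be adjusted to the actual hardware):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; replace with the model you actually use.
model_id = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                         # let Accelerate split the model over GPU/CPU/disk
    max_memory={0: "20GiB", "cpu": "64GiB"},   # cap GPU 0 below its real capacity to leave headroom
    offload_folder="offload",                  # layers that don't fit in RAM get written here
)
```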

Dec 20, 2023 · Liked by Benjamin Marie

From this video, it looks like everything is computed on the GPU: weights are moved from wherever they are stored into VRAM as inference proceeds. That's another good reason to leave some free space in VRAM when setting `max_memory`.

https://youtu.be/MWCSGj9jEAo
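
You can also inspect where each layer actually ended up after loading; a small sketch, assuming the model was loaded with `device_map="auto"` as above:

```python
# After loading with device_map="auto", transformers records the final placement.
# Keys are module names, values are devices: a GPU index, "cpu", or "disk".
for module_name, device in model.hf_device_map.items():
    print(f"{module_name} -> {device}")
```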

author

It also depends on the task we want to do with the model. I think `device_map` only takes care of loading the model; if we later do batch decoding, we need to store the batches somewhere with free space. For fine-tuning, as far as I know, it doesn't reserve space for the optimizer states either.
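
For instance, batch decoding itself needs VRAM on top of the weights, since the input batch, the activations, and the KV cache all grow with the batch size and generated length. A rough sketch of what I mean, reusing the hypothetical `model_id` and `model` loaded above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-padding for decoder-only generation

prompts = ["First prompt.", "Second prompt.", "Third prompt."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(0)  # the batch goes to GPU 0

# Generation allocates activations and the KV cache in VRAM,
# which device_map/max_memory does not account for.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```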
