Have you ever tried quantizing a BERT or RoBERTa classifier model with GGUF? I'm curious how its performance on CPU compares to that of an unquantized classifier on GPU.
No, I have never tried. I guess these classifiers would be very fast on CPU but I would be worried about the accuracy of the models after quantization since BERT and RoBERTa are already small.
Good point. Do you know of a way to make a Roberta classifier run quickly on CPU without quantization?
Convert it to GGUF in FP16 (so, no quantization; just run llama.cpp's convert_hf_to_gguf.py script) and run it with llama.cpp. If you have a recent CPU, it should be quite fast.
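Roughly something like this (a minimal sketch; the paths are placeholders, and the converter script name and flags match recent llama.cpp versions but may differ in yours):

```python
import subprocess

# Convert a local Hugging Face model directory to a single GGUF file at FP16
# (no quantization) using llama.cpp's converter script. Both paths below are
# placeholders for your own llama.cpp checkout and model directory.
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "path/to/roberta-classifier",               # HF model directory (placeholder)
        "--outfile", "roberta-classifier-f16.gguf", # output GGUF file
        "--outtype", "f16",                         # keep FP16 weights, no quantization
    ],
    check=True,
)
```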
ARGH! The GGUF converter doesn't support BERT or RoBERTa classifiers:
NotImplementedError: Architecture 'RobertaForSequenceClassification' not supported!
This seems to work for BERT models: https://github.com/skeskinen/bert.cpp?tab=readme-ov-file#converting-models-to-ggml-format
Maybe I'll just use TEI on CPU. https://github.com/huggingface/text-embeddings-inference
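Rough sketch of how I'd query it once the TEI server is up (assuming the default HTTP port 8080 and the /predict route TEI uses for classifiers; the exact request/response shape may differ by TEI version):

```python
import requests

# Send a small batch of texts to a local text-embeddings-inference (TEI)
# server that is serving a sequence classification model.
# localhost:8080 and the /predict route are assumptions about the setup.
TEI_URL = "http://localhost:8080/predict"

def classify(texts):
    resp = requests.post(TEI_URL, json={"inputs": texts})
    resp.raise_for_status()
    return resp.json()  # expected: label/score predictions per input

print(classify(["I love this movie!", "This was a waste of time."]))
```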
Okay, thanks. I'll give that a shot... and then report back how the Threadripper 3970X (CPU) compares to the A6000 (GPU).
Using TEI with gRPC, I classified 500 strings on CPU and GPU.
CPU: 10.313s (Threadripper 3970X)
GPU: 0.233s (RTX A6000)
I think I'll stick to the GPU. :)
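If anyone wants to run a similar comparison, this is roughly the shape of the timing, sketched over TEI's HTTP route rather than the gRPC client I used (the URL, input strings, and batch size are placeholders, so absolute numbers will differ):

```python
import time
import requests

# Time how long a TEI instance takes to classify 500 strings, sent in small
# batches (TEI limits the per-request batch size by default). Point TEI_URL
# at the CPU-backed or GPU-backed server to compare the two.
TEI_URL = "http://localhost:8080/predict"  # placeholder endpoint
texts = [f"example input {i}" for i in range(500)]  # placeholder inputs
BATCH = 32  # assumed to stay under TEI's default client batch limit

start = time.perf_counter()
for i in range(0, len(texts), BATCH):
    resp = requests.post(TEI_URL, json={"inputs": texts[i:i + BATCH]})
    resp.raise_for_status()
print(f"Classified {len(texts)} strings in {time.perf_counter() - start:.3f}s")
```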
Yes, I think the CPU is only a good option if the GPU can't load the full model.