16GB of VRAM is the practical minimum if you don't want to be stuck running crap-tier local LLMs; with 16GB you can comfortably run 14B and 24B parameter models at Q4/Q6 quantization and end up with a local LLM that is actually usable. NVIDIA's latest offering, the RTX 5060 Ti, delivers exactly that: 16GB of VRAM and 448GB/s of memory bandwidth, so it can serve the dual purpose of gaming and local LLMs. Let's check the expected inference speed (tok/s) without the hardware optimizations (FP4) that are not yet available in the local LLM software stack.

RTX 5060 Ti 16GB Model Inference Speed for Local LLM Use
Because the 5060 Ti has 16GB of VRAM, we ran these calculations with an even better quant, Q6_K, which comes in at around 12.12GB. On a memory-bandwidth-bound GPU, the theoretical maximum generation speed is roughly the memory bandwidth divided by the model size, so 448GB/s over 12.12GB gives about 37 tok/s. Since that is the theoretical max, the real number will be lower.
| Model | Quant | Size |
|---|---|---|
| Qwen2.5-Coder-14B-Q6_K.gguf | Q6_K | 12.12GB |
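As a quick sanity check on that napkin math, here is a minimal sketch. The bandwidth and file-size figures are the ones quoted above; the formula (bandwidth divided by model size) ignores compute time and KV-cache reads, so it is an upper bound only.

```python
# Napkin math: on a memory-bandwidth-bound GPU, generating each token
# requires reading roughly the whole model from VRAM once, so the
# theoretical ceiling is bandwidth / model size.
def theoretical_max_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

BANDWIDTH_GB_S = 448.0   # RTX 5060 Ti memory bandwidth
MODEL_SIZE_GB = 12.12    # Qwen2.5-Coder-14B Q6_K GGUF

print(f"{theoretical_max_tok_s(BANDWIDTH_GB_S, MODEL_SIZE_GB):.1f} tok/s")  # ~37 tok/s
```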
Now, let's check the 24B parameter numbers.
| Model | Quant | Size |
|---|---|---|
| Mistral-Small-24B-Instruct-2501-Q4_K_M.gguf | Q4_K_M | 14.33GB |
The above GGUF will fill about 14.33GB, and with a little more taken by the OS there is still some headroom left for context. With this 24B model, the same calculation (448 / 14.33) gives around 31.2 tok/s of theoretical inference speed, which should be a joy to use.
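For the 24B case, the same napkin math plus a rough VRAM budget looks like this. The OS/driver overhead figure is an assumption for illustration (it varies a lot between setups), not a measured number.

```python
# Same bandwidth-bound estimate applied to the 24B Q4_K_M model,
# plus a rough budget of what VRAM is left for context (KV cache).
BANDWIDTH_GB_S = 448.0    # RTX 5060 Ti memory bandwidth
TOTAL_VRAM_GB = 16.0
MODEL_SIZE_GB = 14.33     # Mistral-Small-24B Q4_K_M GGUF
OS_OVERHEAD_GB = 0.6      # assumption: desktop/driver usage differs per system

tok_s = BANDWIDTH_GB_S / MODEL_SIZE_GB                          # ~31 tok/s ceiling
context_headroom_gb = TOTAL_VRAM_GB - MODEL_SIZE_GB - OS_OVERHEAD_GB  # ~1.1GB left

print(f"theoretical max: {tok_s:.1f} tok/s, headroom for context: {context_headroom_gb:.2f}GB")
```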