Key GPU specifications to evaluate for LLM inference:

  • CUDA Cores: The general-purpose processing units of the GPU. Higher CUDA core counts generally translate to better parallel throughput.
  • Tensor Cores: Specialized units designed for deep learning workloads, accelerating the matrix multiplications at the heart of neural network operations.
  • VRAM (Video RAM): The memory available to the GPU for storing model weights, activations, and the KV cache. For LLM inference, the model must fit in VRAM, so more VRAM allows larger models and longer contexts.
  • Clock Frequency: The speed at which the GPU operates, measured in MHz. Higher frequencies generally yield better performance at the same core count.
  • Price: The cost of the GPU is a crucial factor, especially for businesses or research labs with budget constraints. It’s essential to balance performance needs with affordability.
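To make the VRAM point concrete, here is a rough back-of-the-envelope sketch for estimating whether a model fits on a given GPU. The 1.2× overhead factor for activations and KV cache is an assumption, not a fixed rule; real usage varies with batch size and context length.

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_param: float = 2.0,   # fp16/bf16 weights
                     overhead: float = 1.2) -> float:  # assumed margin for KV cache/activations
    """Rule-of-thumb VRAM estimate in GB for LLM inference."""
    return params_billions * bytes_per_param * overhead

# A 7B-parameter model in fp16: 7 * 2 * 1.2 = 16.8 GB,
# so it will not fit on a 12 GB card without quantization.
print(round(estimate_vram_gb(7), 1))
```

Dropping `bytes_per_param` to 0.5 (4-bit quantization) shows why quantized models fit on far smaller consumer GPUs.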

NVIDIA GPUs ranked by their suitability for LLM inference