Best GPU for LLMs — AI GPU Comparison
Compare the NVIDIA GPUs people actually use to run and fine-tune open-weight LLMs — VRAM, bandwidth, FP16 tensor TFLOPS, power and price. Pick a model and quantization and see instantly which card runs it, on how many GPUs, and the best value buy. Specs cited, runs in your browser.
How it works
Every hardware number in the table — VRAM, memory bandwidth, FP16/BF16 tensor TFLOPS, board power and launch year — is a static value taken straight from the NVIDIA datasheet linked for each card. Nothing about the hardware is guessed. The only things computed in your browser are how much memory a model needs, whether it fits, the TFLOPS-per-dollar value, and the rupee price.
VRAM a model needs. The tool uses the same bytes-per-parameter convention as the LLM VRAM Calculator:
- weights_GB = params_billion × bytes_per_param
- bytes_per_param = 2 (FP16), 1 (8-bit), 0.5 (4-bit)
- total_GB = weights_GB × 1.2
The ×1.2 covers activations, the CUDA context and a modest KV cache for short context. It is a deliberate approximation so you get a fast hardware shortlist; for exact context-length and batch-size KV-cache math, the VRAM calculator is the precise tool and is linked throughout.
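The three-line formula above can be sketched in Python as follows. This is a minimal illustration rather than the tool's actual source: the function name `vram_needed_gb` and the quantization keys (`"fp16"`, `"8bit"`, `"4bit"`) are assumptions chosen for readability.

```python
# Bytes per parameter for each quantization level, as listed in the formula above.
BYTES_PER_PARAM = {"fp16": 2.0, "8bit": 1.0, "4bit": 0.5}

# Flat multiplier covering activations, the CUDA context and a short-context KV cache.
OVERHEAD = 1.2


def vram_needed_gb(params_billion: float, quant: str) -> float:
    """Approximate total VRAM in GB (hypothetical helper, not the tool's code)."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb * OVERHEAD


if __name__ == "__main__":
    # Print the estimate for a 70-billion-parameter model at each quantization.
    for quant in ("fp16", "8bit", "4bit"):
        print(f"70B @ {quant}: {vram_needed_gb(70, quant):.1f} GB")
```

For a 70-billion-parameter model this works out to roughly 168 GB at FP16, 84 GB at 8-bit and 42 GB at 4-bit.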
Does it fit? The tool divides the VRAM needed by the card's memory and rounds up: cards = ceil(total_GB / gpu_vram_GB). One card means it fits, two to eight means multi-GPU, and more than eight is flagged as won't-fit. The exactly-full case (needed equals capacity) counts as one card, not two.
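The fit rule can be sketched like this. The function names, the label strings and the small `EPS` tolerance are assumptions made for the example; the tolerance is only there so that floating-point rounding cannot push an exactly-full case up to two cards, and the tool itself may handle that differently.

```python
import math

MAX_GPUS = 8   # more than this many cards is flagged as won't-fit
EPS = 1e-9     # assumed tolerance so "needed == capacity" stays at one card


def cards_needed(total_gb: float, gpu_vram_gb: float) -> int:
    """Number of cards required: ceil(total_GB / gpu_vram_GB), at least 1."""
    ratio = total_gb / gpu_vram_gb
    return max(1, math.ceil(ratio - EPS))


def fit_label(cards: int) -> str:
    """Map a card count to the three outcomes described in the text."""
    if cards == 1:
        return "fits"
    if cards <= MAX_GPUS:
        return "multi-GPU"
    return "won't-fit"


if __name__ == "__main__":
    # 168 GB needed on an 80 GB card, then on a 24 GB card.
    for vram in (80, 24):
        n = cards_needed(168.0, vram)
        print(f"{vram} GB card: {n} card(s), {fit_label(n)}")
```

With 168 GB needed, an 80 GB card gives ceil(2.1) = 3 cards and a 24 GB card gives exactly 7, both in the multi-GPU range.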
Tensor TFLOPS, made comparable. Vendors quote tensor throughput several ways — with or without 2:1 sparsity, FP16 versus FP32 accumulate — which can make a card look two to four times faster than another on paper. Every figure here is recorded on one basis: dense FP16/BF16 with FP32 accumulate, no sparsity. The data-center anchors are datasheet-exact under that convention (A100 = 312, H100 SXM = 989, L40S = 362 TFLOPS), and consumer and workstation cards use the matching whitepaper figure, so the value column is apples-to-apples.
Value and price. The value column is TFLOPS / price_usd, and the highest value in your current filter is highlighted. Consumer cards have no public-cloud overhead, so they tend to win this metric by a wide margin. Prices are launch MSRP for consumer cards; data-center cards and the L40S have no public MSRP, so they carry an approximate street price flagged with an asterisk and dated 2026-06-14. Sri Lankan rupee (LKR) figures use a reference rate of Rs 300 to the US dollar.
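A minimal sketch of the value and rupee calculations, assuming each card is a small record with a TFLOPS figure and a USD price. The card names and numbers below are made up purely for illustration and do not correspond to any real GPU or to the tool's data; the helper names are likewise hypothetical.

```python
USD_TO_LKR = 300  # reference rate stated in the text: Rs 300 per US dollar

# Illustrative entries only: names, TFLOPS and prices are invented for the example.
CARDS = [
    {"name": "Card A", "tflops": 100.0, "price_usd": 1000.0},
    {"name": "Card B", "tflops": 300.0, "price_usd": 10000.0},
    {"name": "Card C", "tflops": 900.0, "price_usd": 30000.0},
]


def value(card: dict) -> float:
    """Value metric: dense FP16 tensor TFLOPS per US dollar."""
    return card["tflops"] / card["price_usd"]


def price_lkr(card: dict) -> float:
    """USD price converted to rupees at the fixed reference rate."""
    return card["price_usd"] * USD_TO_LKR


def best_value(cards: list) -> dict:
    """The card with the highest TFLOPS-per-dollar among those passed in."""
    return max(cards, key=value)


if __name__ == "__main__":
    for card in CARDS:
        print(card["name"], f"{value(card):.3f} TFLOPS/$", f"Rs {price_lkr(card):,.0f}")
    print("Best value:", best_value(CARDS)["name"])
```

Passing only the currently filtered cards to `best_value` mirrors the behaviour described above, where the highlight follows the active filter rather than the full table.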
Worked examples
Frequently asked questions
Sources & references
- NVIDIA — GeForce RTX 4090 specifications
- NVIDIA — A100 Tensor Core GPU datasheet (PDF)
- NVIDIA — H100 Tensor Core GPU datasheet (PDF)
- NVIDIA — L40S datasheet
- NVIDIA — H200 Tensor Core GPU
- Meta Llama 3 70B model card — parameter count
GPU specifications and the USD→LKR reference rate were last cross-checked against these sources on 2026-06-14. Tensor TFLOPS are recorded as dense FP16/BF16 (FP32 accumulate, no sparsity) so every card is comparable. Prices for data-center cards are approximate street prices, not official MSRP.