induwara.lk

Best GPU for LLMs — AI GPU Comparison

Compare the NVIDIA GPUs people actually use to run and fine-tune open-weight LLMs — VRAM, bandwidth, FP16 tensor TFLOPS, power and price. Pick a model and quantization and see instantly which card runs it, on how many GPUs, and the best value buy. Specs cited, runs in your browser.

By Induwara Ashinsana · Updated Jun 14, 2026
Compare GPUs for LLMs (10 cards)
NVIDIA specs · verified 2026-06-14
VRAM needed: 4.8 GB (Llama 3 8B at 4-bit = 8.03B × 0.5 × 1.2)

Best value: RTX 4090 24 GB. It runs Llama 3 8B at 4-bit (4.8 GB) on one card at the highest TFLOPS per dollar in this list (0.103 TFLOPS/$).

| GPU | Class | Runs Llama 3 8B? | VRAM | Bandwidth | TFLOPS | TDP | Year | Price (USD) | Price (LKR) | Value (TFLOPS/$) |
| A100 80 GB | Data-center | Fits on 1 card | 80 GB | 2,039 GB/s | 312 | 400 W | 2020 | $15,000.00* | Rs 4,500,000 | 0.021 |
| H100 80 GB (SXM) | Data-center | Fits on 1 card | 80 GB | 3,350 GB/s | 989 | 700 W | 2022 | $28,000.00* | Rs 8,400,000 | 0.035 |
| RTX 4090 24 GB | Consumer | Fits on 1 card | 24 GB | 1,008 GB/s | 165 | 450 W | 2022 | $1,599.00 | Rs 479,700 | 0.103 (best) |
Specs from NVIDIA datasheets, cited below.
* Approximate street price (no public MSRP).
USD → LKR at Rs 300/$ (2026-06-14).

How it works

Every hardware number in the table — VRAM, memory bandwidth, FP16/BF16 tensor TFLOPS, board power and launch year — is a static value taken straight from the NVIDIA datasheet linked for each card. Nothing about the hardware is guessed. The only things computed in your browser are how much memory a model needs, whether it fits, the TFLOPS-per-dollar value, and the rupee price.

VRAM a model needs. The tool uses the same bytes-per-parameter convention as the LLM VRAM Calculator:

  • weights_GB = params_billion × bytes_per_param
  • bytes_per_param = 2 (FP16), 1 (8-bit), 0.5 (4-bit)
  • total_GB = weights_GB × 1.2

The ×1.2 covers activations, the CUDA context and a modest KV cache for short context. It is a deliberate approximation so you get a fast hardware shortlist; for exact context-length and batch-size KV-cache math, the VRAM calculator is the precise tool and is linked throughout.
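The estimate above can be sketched in a few lines of Python. This is a minimal illustration of the stated formula, not the tool's actual source; the names `BYTES_PER_PARAM`, `OVERHEAD` and `vram_needed_gb` are made up for the example.

```python
# Bytes per parameter for each quantization, as listed above.
BYTES_PER_PARAM = {"fp16": 2.0, "8bit": 1.0, "4bit": 0.5}

# Flat multiplier for activations, CUDA context and a short-context KV cache.
OVERHEAD = 1.2


def vram_needed_gb(params_billion, quant):
    """Approximate VRAM in GB: weights times the 1.2 overhead factor."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb * OVERHEAD


print(round(vram_needed_gb(8.03, "4bit"), 1))  # Llama 3 8B at 4-bit
print(round(vram_needed_gb(8.03, "fp16"), 1))  # Llama 3 8B at FP16
print(round(vram_needed_gb(70.6, "4bit"), 1))  # Llama 3 70B at 4-bit
```

The three calls reproduce the 4.8 GB, 19.3 GB and 42.4 GB figures used elsewhere on this page.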

Does it fit? The tool divides the VRAM needed by the card's memory and rounds up: cards = ceil(total_GB / gpu_vram_GB). One card means it fits, two to eight means multi-GPU, and more than eight is flagged as won't-fit. The exactly-full case (needed equals capacity) counts as one card, not two.
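A small sketch of that fit rule, assuming the thresholds described above (one card fits, up to eight is multi-GPU, beyond that is won't-fit). The function names and label strings are illustrative, not taken from the tool.

```python
import math

MAX_CARDS = 8  # above this the configuration is flagged as won't-fit


def cards_needed(total_gb, gpu_vram_gb):
    """Number of cards: VRAM needed divided by per-card memory, rounded up."""
    return math.ceil(total_gb / gpu_vram_gb)


def fit_label(total_gb, gpu_vram_gb):
    """Map the card count onto the three outcomes described above."""
    cards = cards_needed(total_gb, gpu_vram_gb)
    if cards <= 1:
        return "fits on 1 card"
    if cards <= MAX_CARDS:
        return f"needs {cards} cards"
    return "won't fit"


print(fit_label(19.3, 24))   # Llama 3 8B FP16 on an RTX 4090
print(fit_label(19.3, 12))   # same model on an RTX 3060
print(fit_label(972.0, 80))  # hypothetical 972 GB requirement on 80 GB cards
```

Because `math.ceil(1.0)` is 1, a model that needs exactly the card's capacity is reported as fitting on one card.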

Tensor TFLOPS, made comparable. Vendors quote tensor throughput several ways — with or without 2:1 sparsity, FP16 versus FP32 accumulate — which can make a card look two to four times faster than another on paper. Every figure here is recorded on one basis: dense FP16/BF16 with FP32 accumulate, no sparsity. The data-center anchors are datasheet-exact under that convention (A100 = 312, H100 SXM = 989, L40S = 362 TFLOPS), and consumer and workstation cards use the matching whitepaper figure, so the value column is apples-to-apples.
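As a rough illustration of the normalization, a figure quoted with 2:1 structured sparsity can be halved to get the dense number. The helper below is a hypothetical sketch, and the 624 input is the A100's sparsity-inclusive FP16 figure, which does not appear in the text above.

```python
def dense_tflops(quoted_tflops, quoted_with_sparsity):
    """Convert a vendor-quoted tensor figure to the dense basis.

    Assumes the only adjustment needed is undoing the 2x that
    2:1 structured sparsity adds to the headline number.
    """
    if quoted_with_sparsity:
        return quoted_tflops / 2
    return quoted_tflops


print(dense_tflops(624, quoted_with_sparsity=True))   # A100 sparse figure
print(dense_tflops(312, quoted_with_sparsity=False))  # already dense
```

Both calls land on 312, the dense A100 value used in the table.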

Value and price. The value column is TFLOPS / price_usd, and the highest value in your current filter is highlighted. Consumer cards have no public-cloud overhead so they win this metric handily. Prices are launch MSRP for consumer cards; data-center cards and the L40S have no public MSRP, so they carry an approximate street price flagged with an asterisk and dated 2026-06-14. Rupee figures use a reference rate of Rs 300 to the US dollar.
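The value and rupee columns can be reproduced with a short sketch. The card data is copied from the table above; `CARDS`, `value` and `price_lkr` are illustrative names, not the tool's own.

```python
LKR_PER_USD = 300  # reference rate quoted in the text

# (dense FP16/BF16 TFLOPS, USD price) copied from the table above
CARDS = {
    "A100 80 GB": (312, 15000.00),
    "H100 80 GB (SXM)": (989, 28000.00),
    "RTX 4090 24 GB": (165, 1599.00),
}


def value(tflops, price_usd):
    """TFLOPS per US dollar; higher is better."""
    return tflops / price_usd


def price_lkr(price_usd):
    """Convert a USD price to rupees at the fixed reference rate."""
    return price_usd * LKR_PER_USD


# The card with the highest TFLOPS per dollar is the one highlighted.
best = max(CARDS, key=lambda name: value(*CARDS[name]))

for name, (tflops, usd) in CARDS.items():
    marker = " (best)" if name == best else ""
    print(f"{name}: {value(tflops, usd):.3f} TFLOPS/$, "
          f"Rs {price_lkr(usd):,.0f}{marker}")
```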

Worked examples

Llama 3 8B at FP16 — RTX 4090 vs RTX 3060

  1. VRAM needed: 8.03B × 2 bytes = 16.06 GB → × 1.2 = 19.3 GB
  2. RTX 4090 24 GB: ceil(19.3 / 24) = 1 → fits on one card
  3. RTX 3060 12 GB: ceil(19.3 / 12) = 2 → needs 2 cards at FP16
  4. Drop to 4-bit: 8.03B × 0.5 × 1.2 = 4.8 GB → fits the 3060 easily

Llama 3 70B at 4-bit — RTX 4090 vs A100 80 GB vs H100 80 GB

  1. VRAM needed: 70.6B × 0.5 = 35.3 GB → × 1.2 = 42.4 GB
  2. RTX 4090 24 GB: ceil(42.4 / 24) = 2 → needs 2× RTX 4090
  3. A100 80 GB: ceil(42.4 / 80) = 1 → fits on one card
  4. H100 80 GB: fits on one card, ~3× the BF16 TFLOPS of the A100 (989 vs 312)

Edge case — a 10B model exactly fills a 24 GB card

  1. VRAM needed: 10B × 2 × 1.2 = 24.0 GB (exactly the card's capacity)
  2. RTX 4090 24 GB: ceil(24.0 / 24) = ceil(1.0) = 1 → fits, not 2
  3. This off-by-one boundary is the most common bug in fit tools
  4. One byte more and ceil rounds to 2 cards — the math is exact here
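One way to keep this boundary exact in code is to avoid floating point and work in whole bytes. The sketch below is an assumption about how such a check could be written, not the tool's implementation. It takes GB as 10^9 bytes (matching params_billion × bytes_per_param = GB) and expresses precision in half-bytes so that 4-bit stays an integer. It steps by one extra parameter rather than one byte, which is enough to cross the boundary.

```python
GB = 10**9  # decimal gigabyte, matching the formula's units

# Half-bytes per parameter: FP16 = 4, 8-bit = 2, 4-bit = 1.
HALF_BYTES = {"fp16": 4, "8bit": 2, "4bit": 1}


def cards_needed_exact(params, quant, gpu_vram_gb):
    """Integer-only ceil(total_bytes / capacity_bytes).

    total_bytes = params * (half_bytes / 2) * 1.2
                = params * half_bytes * 3 / 5
    so both sides are scaled by 5 to stay in whole numbers.
    """
    numerator = params * HALF_BYTES[quant] * 3
    denominator = 5 * gpu_vram_gb * GB
    return -(-numerator // denominator)  # ceiling division on integers


print(cards_needed_exact(10 * 10**9, "fp16", 24))      # exactly full
print(cards_needed_exact(10 * 10**9 + 1, "fp16", 24))  # one parameter more
```

The first call returns 1 and the second returns 2, which is the behaviour the edge case describes.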

Sources & references

GPU specifications and the USD→LKR reference rate were last cross-checked against these sources on 2026-06-14. Tensor TFLOPS are recorded as dense FP16/BF16 (FP32 accumulate, no sparsity) so every card is comparable. Prices for data-center cards are approximate street prices, not official MSRP.

Found a spec that's out of date, or want another GPU added?

Email me at [email protected] — most fixes ship within 24 hours.