What AI model can my computer actually run?

Rough rule: a local model needs about half its parameter count in gigabytes of memory once it's quantized to 4-bit. So 8GB of VRAM comfortably runs a 4–8B model, 16GB runs a 14B, 24GB runs a 32B, and 48GB+ gets you into 70B territory. On a Mac, the number that matters is your unified memory.

Last updated 2026-06-14 · Physea Labs

The honest answer to “what can my computer run?” comes down to one number: how much memory your graphics card has. On a Mac, it’s how much total memory the machine has. Everything else follows from that.

The one rule worth memorizing

A model’s size is measured in parameters, in billions (a “7B” model has seven billion). Quantize it to 4-bit, which is the normal way to run things locally, and it needs roughly half its parameter count in gigabytes, plus about 20% for working room.

That gives a simple ladder:

Your memory	Comfortable model size	Examples of what fits
8 GB	4–8B	Small Qwen, Llama, Gemma models
16 GB	up to 14B	Mid-size general models
24 GB	up to 32B	Strong all-rounders
48 GB+	up to 70B	Near-frontier open models

On a PC, “your memory” means the VRAM on your graphics card. On a Mac, it means the machine’s unified memory, which the chip shares between CPU and GPU. That’s why a Mac with 64GB can run models that would need an expensive dedicated GPU on a PC.
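The rule and the ladder above can be sketched in a few lines of Python. This is a minimal illustration, not a tool: the function names are made up for the example, the 0.5 GB per billion parameters and 20% overhead are simply the rule-of-thumb figures from this article, and memory for longer context windows is not modeled.

```python
def estimate_memory_gb(params_billions, gb_per_billion=0.5, overhead=0.20):
    """Rule-of-thumb memory for a 4-bit model: half the parameter count in GB, plus ~20%."""
    return round(params_billions * gb_per_billion * (1 + overhead), 1)


def fits(params_billions, memory_gb):
    """True if the estimate is within the available VRAM or unified memory."""
    return estimate_memory_gb(params_billions) <= memory_gb


# Top rung of each row in the ladder: (memory in GB, model size in billions).
ladder = [(8, 8), (16, 14), (24, 32), (48, 70)]
for memory_gb, size in ladder:
    need = estimate_memory_gb(size)
    print(f"{size}B needs about {need} GB; fits in {memory_gb} GB: {fits(size, memory_gb)}")
```

The gap between each estimate and the memory on that rung is the headroom left for context, which is why the ladder is more conservative than the raw formula.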

What quantization actually does

Quantization sounds technical but the idea is everyday. A model is billions of numbers. Storing each one at full precision is accurate but heavy. Quantization stores them at lower precision instead, the same way a JPEG throws away detail you won’t miss to make a photo smaller.
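To make the idea concrete, here is a toy sketch of 4-bit quantization in Python. It is deliberately simplified and is an assumption for illustration: it uses one shared scale for a small group of numbers and rounds each to one of 16 integer levels. Real schemes such as Q4_K_M are more elaborate, and the sample weights are invented.

```python
def quantize_4bit(values):
    """Map floats to integers in [-8, 7] using one shared scale (toy scheme)."""
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.02]  # made-up values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

# The largest gap between an original number and its restored version.
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(f"worst rounding error: {worst_error:.3f}")
```

Each code fits in 4 bits instead of 16 or 32, and the restored values are close to the originals but not identical. That small error is the detail being thrown away.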

The setting you’ll see most is Q4_K_M. It’s the common default for a reason: against full precision it cuts the memory a model needs by about 70–75%, to roughly a quarter of the original size, while losing only a few percent of quality, a drop that is usually invisible in normal use. There are smaller, more aggressive settings if you’re tight on memory, and larger ones if you have room to spare.
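The 70–75% figure follows from simple arithmetic on bits per number. The sketch below assumes that "full precision" means 16 bits per weight, and that Q4_K_M averages a little under 5 bits per weight once its bookkeeping data is counted. The 4.85 value is an approximate assumption for illustration, not a figure from this article.

```python
def memory_saving(bits_quantized, bits_full=16):
    """Fraction of memory saved when each weight shrinks from bits_full to bits_quantized."""
    return 1 - bits_quantized / bits_full


plain_4bit = memory_saving(4.0)   # exactly 4 bits per weight
q4_k_m = memory_saving(4.85)      # assumed average for Q4_K_M

print(f"pure 4-bit saving: {plain_4bit:.0%}")
print(f"Q4_K_M saving (assumed 4.85 bits): {q4_k_m:.1%}")
```

Under these assumptions the two results bracket the 70–75% range quoted above.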

A caveat about “mixture of experts” models

Some newer models are built as a “mixture of experts” (MoE). They have a huge total parameter count but only use a fraction of those parameters for any given token. These can be faster than their size suggests. The catch: the memory math still works on the total size, not the active part, so a big MoE model still needs the memory to hold all of it.
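A short sketch shows how far apart the two numbers can be. The model here is hypothetical: the 47B total and 13B active figures are chosen only for illustration, and the estimate reuses this article's rule of thumb (half the parameter count in GB, plus about 20%).

```python
def estimate_memory_gb(params_billions, gb_per_billion=0.5, overhead=0.20):
    """Rule-of-thumb memory for a 4-bit model, in GB."""
    return round(params_billions * gb_per_billion * (1 + overhead), 1)


# Hypothetical MoE model; these sizes are illustrative, not a real release.
total_params_b = 47    # every expert combined
active_params_b = 13   # parameters actually used per token

needed = estimate_memory_gb(total_params_b)        # what must be held in memory
misleading = estimate_memory_gb(active_params_b)   # what the active part alone would suggest

print(f"memory to plan for: about {needed} GB")
print(f"estimate from active parameters only: about {misleading} GB")
```

Planning around the smaller number would suggest the model fits on an 8 GB card, when the estimate from the total size is larger than 24 GB.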

What you give up, and what you gain

A model running on your own machine won’t match the very best hosted models on the hardest problems. That gap is real, and worth being honest about. What you get in return is that everything stays local: nothing you type leaves your computer, there’s no per-token bill, and it works offline. For a lot of everyday work (drafting, summarizing, coding help, private notes), a 14B to 70B local model is more than enough. And it’s yours.

If you want a model that’s built from the start to run on your hardware and keep adapting to your work, that’s what Sia is for.

Common questions

How much memory does a local model need?

At 4-bit quantization, a rule of thumb: memory in GB is roughly the parameter count in billions times 0.5, plus about 20% overhead. A 14B model needs ~8–9GB. A 70B model needs ~42GB. Longer context windows add more on top of that.

What is quantization, in plain terms?

Quantization shrinks a model by storing its numbers at lower precision, like saving a photo as a smaller JPEG. The common 'Q4_K_M' setting cuts memory by about 70–75% versus full precision (down to roughly a quarter of the size) for a quality drop of a few percent that most people never notice. It's what makes local models practical.

Can I run a model on a laptop with no graphics card?

Yes, the smaller ones. A 4–8B model runs on plenty of laptops using regular RAM and the CPU, just more slowly. Macs do especially well here because their unified memory is shared between CPU and GPU.

Is a local model as good as ChatGPT or Claude?

Not at the top end. The best hosted models are still ahead on the hardest tasks. But modern open models in the 14–70B range are genuinely useful for everyday work, and they run entirely on your machine with nothing leaving it.
