What are the best local AI models to run in 2026?

For most people running models locally in 2026, Qwen3 is the safest starting point. It spans tiny to flagship sizes and uses a permissive Apache 2.0 license. QwQ-32B leads on reasoning and math, Llama 4 Scout offers a giant 10M-token context (though using it all takes many GPUs, not one), and Mistral Small is the efficient multilingual pick. The right one depends on your hardware tier.

Last updated 2026-06-14 · Physea Labs

Open weights moved fast, and the honest summary for 2026 is that you have an embarrassment of good options. Finding a capable local model is the easy part. The trick is matching one to your hardware and to what you actually do all day. Here’s how the five main families stack up.

The five families at a glance

Family	Maker	Best local pick	License	Strength
Qwen3	Alibaba	32B / 30B-A3B	Apache 2.0	All-round, many sizes
Llama 4	Meta	Scout (weights on 1x H100 at INT4)	Community*	10M context, but full window needs many GPUs
DeepSeek	DeepSeek	R1-Distill-Qwen-32B / V3.2	MIT	Reasoning & math
Mistral	Mistral	Small 3.1 / Medium 3.5	Apache 2.0 / Modified MIT	Efficient, multilingual
Gemma	Google	27B	Gemma	Multimodal, 140+ languages

* Llama 4’s Community License is free for most users, but it requires a separate Meta license above 700M monthly active users, and it currently blocks EU-domiciled users. The Apache 2.0 and MIT families (Qwen, DeepSeek, most Mistral) have no such strings.

Qwen3, the default answer

If you don’t have a strong reason to pick something else, start with Qwen3. It’s the most complete lineup: a 0.6B model small enough for a phone, a 4B that’s the best “small” all-rounder, mid-size 8B/14B options, and a 32B dense flagship that holds its own as a local top pick. The Apache 2.0 license means real commercial use with no asterisks. For most people on a 16–24GB GPU, the largest Qwen3 you can fit is the right answer.

There’s also Qwen3-30B-A3B, a Mixture-of-Experts model that only activates 3B of its 30B weights per token. It feels much faster than a 30B dense model and still punches above a typical small model. Pick it if speed matters and you have 24GB to hold it.

DeepSeek, when the task is reasoning

Reach for a dedicated reasoning model when correctness on hard problems matters more than breadth. The cross-family standout here is QwQ-32B from Alibaba: a 32B reasoning model that’s strong on tough math and reasoning, Apache 2.0 licensed, and small enough at 4-bit to fit a single 24GB card such as an RTX 4090. If you specifically want MIT terms, DeepSeek-R1-Distill-Qwen-32B is the close alternative, and Qwen3-32B is another Apache 2.0 option. The bigger DeepSeek-V3.2 is frontier-class, but it needs a data center (8x H200). Impressive, just not something you’ll run at home.
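The "fits a 24GB card at 4-bit" claim comes down to simple arithmetic: parameters times bits per weight, divided by eight. Here is a minimal sketch of that estimate. The function names and the 4GB headroom for cache and runtime overhead are illustrative assumptions, not figures from any particular runtime.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billions: float, bits_per_weight: float, vram_gb: float,
         headroom_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed headroom fit in vram_gb.

    headroom_gb is a rough placeholder for KV cache and runtime overhead;
    real overhead depends on context length and the inference engine.
    """
    needed = weight_memory_gb(params_billions, bits_per_weight) + headroom_gb
    return needed <= vram_gb


if __name__ == "__main__":
    print(weight_memory_gb(32, 4))  # a 32B model at 4-bit
    print(fits(32, 4, 24))          # 32B at 4-bit on a 24GB card
    print(fits(32, 16, 24))         # the same model at 16-bit
```

The same arithmetic gives 54.5GB for a 109B model at 4 bits, which lines up with the ~55GB figure quoted for Scout at INT4.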

Llama 4 Scout, when context is everything

Scout’s headline is a 10-million-token context window, by far the largest of anything open. It’s a 109B-total Mixture-of-Experts design with 17B active, so it’s faster than the total size suggests. Here’s the catch, though: two separate things get conflated. Scout’s weights fit on a single H100 only at INT4 (~55GB), which is the “runs on one card” claim you’ll see. Actually using the 10M-token window is a different story: that needs an enormous KV cache, and no single 80GB card can hold it. In practice one H100 gives you only about 35K tokens of context. To feed a whole codebase in one shot at the full 10M, you’re looking at many GPUs, roughly 32 or more H100s. The giant window is real, but it’s a multi-GPU feature, not a one-card trick. Watch the license too: the Community terms carry the 700M-user ceiling and the EU restriction.
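To see why the window outgrows a single card, it helps to write out how a KV cache scales with context. The sketch below uses the standard formula for a transformer that caches keys and values in every layer. The layer count, KV head count, and head dimension are illustrative placeholders, not Scout’s published configuration, and the result ignores cache quantization and any attention variants that shrink the cache.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token adds (keys and values, all layers)."""
    # Factor of 2: one key vector and one value vector per head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes) for `tokens`."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return per_token * tokens / 1e9


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to show the scaling.
    cfg = dict(layers=48, kv_heads=8, head_dim=128)
    for tokens in (32_000, 1_000_000, 10_000_000):
        print(f"{tokens:>10} tokens: {kv_cache_gb(tokens, **cfg):.1f} GB")
```

The cache grows linearly with tokens, so whatever the real per-token cost is, multiplying the context by a few hundred multiplies the memory by the same factor. That is the gap between a one-card context and the full window.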

Mistral and Gemma, the specialists

Mistral Small 3.1 (~24B) is the efficient multimodal pick. It runs on 16–24GB and handles many languages well. Mistral Medium 3.5 is a balanced choice for agent harnesses if you have multi-GPU. Gemma 3 27B is Google’s open multimodal model, strong across 140+ languages and comfortable on a 24GB card. Just note that the Gemma license asks you to accept Google’s terms before use.

A word on Mixture-of-Experts

Several of these are “Mixture-of-Experts”: Qwen3-30B-A3B, Llama 4 Scout, the big DeepSeek and Mistral models. The idea is simple. Only a slice of the model activates per token, so it computes faster than its total size implies. The part that’s easy to miss is memory: you still need to store all the weights. A 30B-A3B model does roughly the per-token compute of a 3B but needs the memory of a 30B. Size your hardware to total parameters, and expect your speed from the active ones.
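The naming convention makes this easy to read off a model name. As a small sketch, the helper below pulls the total and active parameter counts out of names that follow the `<total>B-A<active>B` pattern. The function and type names are made up for illustration, and names that do not follow the pattern (dense models, or MoE models named differently, like Llama 4 Scout) simply return `None`.

```python
import re
from typing import NamedTuple, Optional


class MoESize(NamedTuple):
    total_b: float   # total parameters, in billions (drives memory)
    active_b: float  # active parameters per token, in billions (drives speed)


# Matches e.g. "235B-A22B" or "30B-A3B" anywhere in the name.
_MOE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", re.IGNORECASE)


def parse_moe_name(name: str) -> Optional[MoESize]:
    """Return (total, active) in billions, or None if the name has no A-suffix."""
    match = _MOE_PATTERN.search(name)
    if match is None:
        return None
    return MoESize(float(match.group(1)), float(match.group(2)))


if __name__ == "__main__":
    for name in ("Qwen3-235B-A22B", "Qwen3-30B-A3B", "Qwen3-32B"):
        size = parse_moe_name(name)
        if size is None:
            print(f"{name}: no active-parameter suffix")
        else:
            share = size.active_b / size.total_b
            print(f"{name}: store {size.total_b:g}B, compute {size.active_b:g}B "
                  f"per token ({share:.0%} active)")
```

Use the first number when checking whether the model fits in memory and the second when guessing how fast it will feel.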

Picking by hardware

The cleanest filter is your memory. Roughly: 8GB runs a 4–8B model, 16GB a 14B, 24GB a 32B, and 48GB+ opens the 70B class. The full ladder, including how Macs use unified memory to punch above PCs, is in the “what AI model can my computer run” guide.
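The rough ladder above can be written as a small lookup. This sketch encodes only the four tiers from this section. The function name is made up, and the tiers assume quantized weights, as in the rest of this guide.

```python
from typing import Optional

# (minimum memory in GB, largest model class that tier runs), highest first.
LADDER = [
    (48, "70B class"),
    (24, "32B"),
    (16, "14B"),
    (8, "4-8B"),
]


def largest_model_class(memory_gb: float) -> Optional[str]:
    """Return the largest model class for the given memory, or None below 8GB."""
    for min_gb, label in LADDER:
        if memory_gb >= min_gb:
            return label
    return None


if __name__ == "__main__":
    for gb in (6, 8, 12, 16, 24, 64):
        print(f"{gb}GB -> {largest_model_class(gb)}")
```

Memory between two tiers falls back to the lower one, so a 12GB card is treated like an 8GB card here.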

If you’d rather not pick and maintain a model at all, Sia is built to run on your hardware and keep adapting to your work, and Liminal is trained specifically on the software you already use. And if a task genuinely needs frontier quality, the frontier vs local guide covers when to reach for a hosted model instead.

Common questions

Which local model should I start with?

Qwen3 is the easiest first choice. It comes in everything from a 0.6B phone model to a 32B local flagship, uses the permissive Apache 2.0 license, and is a strong all-rounder. Pick the largest Qwen3 size your memory can hold.

What is the best local model for reasoning and math?

QwQ-32B (Alibaba) is the standout for reasoning and math. It posts strong results on the hardest math and reasoning benchmarks and still fits on a single 24GB GPU at 4-bit. It uses the permissive Apache 2.0 license, so commercial use is unrestricted. If you want an MIT option, DeepSeek-R1-Distill-Qwen-32B is a close alternative, and Qwen3-32B (Apache 2.0) is another.

Which local model has the biggest context window?

Llama 4 Scout, at 10 million tokens on paper. But there's a catch: Scout's weights fit on a single H100 only at INT4 (~55GB), and actually filling that 10M-token window needs a huge KV cache no single 80GB card can hold. One H100 handles only about 35K tokens of context; the full 10M takes many GPUs (roughly 32+ H100s). So the giant window is real, but it isn't a one-GPU feature.

What does 'A22B' or 'A3B' mean in a model name?

It's the active parameter count for a Mixture-of-Experts model. Qwen3-235B-A22B has 235B total weights but only uses 22B per token, so it runs faster than its size implies. You still need memory for all 235B, though.
