Breaking: Z.AI GLM-4.7 Flash Delivers Frontier-Level Coding Performance at Consumer Prices
AI Hardware • January 29, 2026
GLM-4.7 Flash: 30B Dense Model with 128K Context
Z.AI’s GLM-4.7 Flash is a 30-billion parameter dense model designed for cost-effective coding workflows. Unlike its larger 355B Mixture-of-Experts sibling, Flash uses a streamlined dense structure for predictable latency and easier deployment. It achieves strong results on SWE-bench Verified, punching above its weight class against proprietary models that cost 5x more to run.
Mixture-of-Experts Design: 30B Total, 3B Active
GLM-4.7-Flash employs a sparse MoE design with only ~3B active parameters per token, making it 10x faster than dense 30B models while retaining the full knowledge base. Community testing shows GLM-4.7 excels at UI generation and tool calling, with users reporting “best 70B-or-less model” experiences. It runs at 60-80+ tokens/second on Apple Silicon and NVIDIA GPUs.
Hardware Testing: Runs Comfortably on 24GB GPUs
Hardware testing shows GLM-4.7 Flash requires just 17GB VRAM at 4K context, scaling to 23GB at 65K context. This means a single RTX 3090 or 4090 can handle very large contexts without multi-GPU setups. Performance on RTX 3090: 93 t/s generation at 4K context, maintaining 27 t/s even at 65K context. RTX 5090 pushes this to 158 t/s at 4K and 95 t/s at 65K.
Official Benchmarks: 73.8% on SWE-bench Verified
Z.AI’s official benchmarks show GLM-4.7 achieves 73.8% on SWE-bench Verified (+5.8% over GLM-4.6) and 66.7% on SWE-bench Multilingual (+12.9%). Terminal Bench 2.0 scores improved to 41% (+16.5%). The model also shows significant gains in UI quality, producing cleaner, more modern webpages with accurate layout and sizing. Tool using capabilities improved substantially on τ²-Bench.
Local Deployment: llama.cpp and OpenCode Integration
GLM-4.7 Flash can be run entirely locally using llama.cpp with CUDA support. The Q4_K_XL quantization requires 17GB storage and fits in 16-24GB VRAM. Performance reaches around 100 tokens per second on RTX 3090. The model integrates seamlessly with OpenCode AI coding agent, enabling fully local automated code generation, execution, and testing workflows without cloud dependencies.
Pricing: Aggressively Cheap at $0.07-$0.60 per Million Tokens
The strongest selling point is price. Using the Z.AI API, the model costs roughly $0.07 per million input tokens and $0.40 per million output tokens in the standard tier, with a free tier available. This is 70-80% lower than comparable tiers from OpenAI or Anthropic. For automated agents running in loops or processing massive datasets, those savings add up quickly.