My 2026 Q3 AI Plan

Have to pay to play.
Jun 30, 2026

Technology typically gets better and cheaper over time, but with AI we are seeing the opposite. Costs are increasing across the board and will continue to, since demand is rising and pricing is still heavily subsidized. I have never been one for subscriptions, if I can help it. It's the whole death by a thousand paper cuts thing. With subscriptions, I now need to spend $100 a month on AI, and realistically more like $200. So I wanted a solution that works for me.

I looked at all of the hardware in play, including what I can get with Nvidia Inception program pricing. Here's what the hardware looks like:

RTX 4090 - 24 GB, ~1.3 GHz memory. My baseline.
RTX 5090 - 32 GB, ~1.75 GHz memory and more CUDA cores.
RTX 6000 Pro - 96 GB, otherwise the same as a 5090.
RTX 6000 - 48 GB, basically a 4090 with more memory.
RTX 5000 Pro - 48 or 72 GB, about the same as the 6000 Pro Max-Q (much lower power), but cut down.

Originally, I thought a 6000 Pro was the way to go. Even after my discounts, it comes in at about $10,000. But I researched and researched and researched some more. My conclusion was that the extra VRAM doesn't do all that much for me. I currently run Qwen3.6-35b-a3b (on a not so great quant), so it has plenty of room for context (about 200k). It fits better than the 27b variant and is very fast, but it's not as good, by a decent margin. So the goal is to get to Q4_K_M, the sweet spot. You don't lose too much quality, and with the better 27b model, it brings it in line with Sonnet 4.6, my daily driver. I don't really need more AI horsepower than that.

After running some tests, I can get better performance out of Qwen3.6-27b using MTP, at the cost of some extra VRAM, but I can offset that without issues by using an FP8 KV cache. That is to say, I can run a Sonnet 4.6 equivalent model on 32 GB of VRAM with large context for whatever my electricity costs are. I can upgrade from the lesser 35b MoE model by switching to an RTX 5090, so that's exactly what I am doing.
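To make the VRAM trade-off above a bit more concrete, here is a rough back-of-the-envelope sketch. The architecture numbers (attention layers, KV heads, head size) are placeholders I made up for illustration, not the real Qwen3.6-27b config, and the ~4.8 bits per weight for Q4_K_M is an assumed average. The only point is that the KV cache scales with bytes per value, so going from FP16 to FP8 halves it.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def kv_cache_gb(attn_layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float) -> float:
    """Approximate KV cache size in GB for a single sequence."""
    # 2 = one key tensor plus one value tensor per attention layer
    values = 2 * attn_layers * kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Placeholder architecture numbers, NOT the real Qwen3.6-27b config.
ARCH = dict(attn_layers=16, kv_heads=4, head_dim=256, context_tokens=200_000)

fp16_cache = kv_cache_gb(**ARCH, bytes_per_value=2)  # 2 bytes per value
fp8_cache = kv_cache_gb(**ARCH, bytes_per_value=1)   # 1 byte per value

# ~4.8 bits per weight is an assumed average for a Q4_K_M quant.
weights = weights_gb(27, 4.8)

print(f"weights:       {weights:.1f} GB")
print(f"KV cache FP16: {fp16_cache:.1f} GB")
print(f"KV cache FP8:  {fp8_cache:.1f} GB")
print(f"saved by FP8:  {fp16_cache - fp8_cache:.1f} GB")
```

Swap in the real numbers from the model's config to get a usable estimate. This ignores the MTP overhead, activations, and runtime buffers, so treat it as a lower bound.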

I need a bigger PSU, but after selling my 4090, the project costs me about $2,500, and that takes into account eBay fees, cash back, tax, everything. If I would otherwise spend $150 a month on AI, the break-even point is 16.7 months, which is pretty solid. I use it for other things though, like the occasional gaming and a chat bot, which only decreases the time to break even, since I no longer need to maintain a separate Claude subscription for that.
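The break-even math is simple enough to sketch. This uses the $2,500 net cost from above and tries the $100, $150, and $200 monthly figures mentioned in this post. The function name is my own, and it deliberately ignores electricity, so the real number will be a bit longer.

```python
def break_even_months(net_hardware_cost: float, monthly_spend_replaced: float) -> float:
    """Months until the hardware has paid for itself (electricity not included)."""
    if monthly_spend_replaced <= 0:
        raise ValueError("monthly_spend_replaced must be positive")
    return net_hardware_cost / monthly_spend_replaced


NET_COST = 2500  # after selling the 4090, fees, cash back, and tax

for monthly in (100, 150, 200):
    months = break_even_months(NET_COST, monthly)
    print(f"${monthly}/month replaced -> break even in {months:.1f} months")
```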

I've never needed Opus level performance. It's far too expensive to be worth it, but the Chinese models are very competitive now and cost a fraction as much to run. To me, the best course of action seems to be hooking up OpenRouter with VS Code, giving me access to GLM 5.2. Both Opus 4.8 and GLM 5.2 have a 1M context window, but GLM performs at a Fable level. The price difference? Huge. With 1M tokens in and 1M tokens out, that's $60 to use Fable, $30 for Opus, and only $4 for GLM. So if and when I do need more AI horsepower, it will cost me far less than my $100 - $200 worth of subscriptions while being far more capable.
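To put those prices in perspective, here is a small sketch comparing them. It only uses the combined figures quoted above (cost for 1M tokens in plus 1M tokens out), since I haven't split them into separate input and output rates. The $150 monthly budget is just an example figure.

```python
# Cost in dollars for 1M tokens in plus 1M tokens out, as quoted above.
COST_PER_1M_IN_AND_OUT = {
    "Fable": 60,
    "Opus": 30,
    "GLM": 4,
}


def rounds_for_budget(budget: float, cost_per_round: float) -> float:
    """How many '1M in + 1M out' rounds a budget covers."""
    return budget / cost_per_round


def times_more_expensive(model: str, baseline: str = "GLM") -> float:
    """How many times more a model costs than the baseline for the same tokens."""
    return COST_PER_1M_IN_AND_OUT[model] / COST_PER_1M_IN_AND_OUT[baseline]


BUDGET = 150  # example monthly budget in dollars

for name, cost in COST_PER_1M_IN_AND_OUT.items():
    rounds = rounds_for_budget(BUDGET, cost)
    ratio = times_more_expensive(name)
    print(f"{name}: {rounds:.1f} rounds for ${BUDGET}, {ratio:.1f}x the cost of GLM")
```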

Fortunately, GLM 5.2 is open source, but unfortunately it won't fit on any consumer or prosumer level hardware. Even at a 1 bit quant (ouch), it needs over 200 GB of memory. 512 GB would actually be great, but there's not a lot of hardware out there that can run it. Maybe the next generation of Mac Studios, but it won't be cheap. If there comes a point where I need to run something like that, I can cross that bridge then. But I suspect we won't ever need that much memory. There are advancements every day that bring more and more capable models to consumer hardware, and with z.ai making as much progress as they are, as the 3rd large Chinese competitor now, it won't take very long for it to trickle down. If we are able to shrink it down to 128 GB or so, it could be worth an investment in the next M Ultra chip. The 5090 can be used for super fast Sonnet level work and the Mac could be used for the most difficult of tasks. Time will tell though!