Running VS Code Copilot on Your Own Hardware

It's not as hard as you think!
May 13, 2026

I am a heavy VS Code Copilot user, but recent billing changes led me to re-evaluate my own usage. I can easily hit 50% of my monthly cap on their highest plan within 10 days. It's worrisome, and once the adoption discounts run out, some companies have found their bills suddenly 3x. So it's an interesting problem for me, and all the way up to the enterprise. I set out to find a good solution, even if it only serves as a backup. Given the landscape, I think it's likely many companies will end up with some kind of hybrid architecture.

One of the big problems is that AI is changing fast, on both the hardware and software fronts. The industry is going full steam ahead on making enterprise-quality models run on consumer hardware. I've covered this before, and I think next year will really bring it to life at a much more affordable price point, unless GPU and memory inventory stays scarce and prices stay high. The enterprise is interested in these changes too: it means they can serve similar-quality models at a much lower operating cost.

I already knew the M5 Pro and Ultra weren't going to cut it, so when I purchased my new laptop I did not opt for 128 GB of memory. The M5 Ultra should run Llama 3.1 8B at about 160 tokens per second, which would bring it in line with a 4090. Not bad at all! But we won't see it in notebook form; it will likely be limited to the Mac Studios, and the prices will hurt.

So I have watched prices and performance closely. I thought about upgrading to a 5090, or maybe an RTX 6000 or 6000 Pro. I even considered used V100s. The 5090 improves on some things but sets you back in other ways, and it's due for a replacement next year. The 6000 is a good choice, but I'd really only gain VRAM, and the 6000 Pro looks great, but it's $10,000. The V100s are enticing, but they are aging quickly. Qwen 3.6 27b can fit comfortably in 24 GB of VRAM and supposedly keeps up with Claude, so this became my goal: use my 4090 and Qwen.

Before I get into the setup: I was already comfortable tuning models, so if you need help on your quest, feel free to message me on LinkedIn and I'm happy to help. There are a lot of moving pieces, the AI world changes every day, and finding what works best for your hardware can take some time and patience.

Now to the brass tacks of making it all work.

Using Copilot means I don't have to do much other than host a model. My machine is hardwired on a 2.5G network, so latency is kept to a minimum, even better than using a cloud-hosted model! For chatty interactions that's a big win. I used LM Studio to host the model and expose an OpenAI-compatible endpoint. There are more efficient ways, especially given that it uses llama.cpp under the hood, but it's good enough. Make sure you are offloading 100% of everything to your GPU: check all of the boxes, and when you load your model, the layers slider should be all the way to the right.
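If you want to sanity-check the endpoint before touching VS Code, a few lines of Python against the OpenAI-compatible API will do it. This is just a sketch: it assumes LM Studio's default port of 1234, and the model name is a placeholder, so use whatever ID your server actually reports.

```python
# Sanity check that the LM Studio endpoint is reachable and streams tokens.
# Assumes LM Studio's default port (1234); the model ID is a placeholder,
# so substitute whatever name your server reports for the loaded model.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.xxx:1234/v1",  # your LM Studio machine
    api_key="lm-studio",  # LM Studio ignores the key, but the client wants one
)

stream = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # placeholder ID
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

If that streams text back at a reasonable clip, the server side is fine, and anything weird later is happening on the editor side.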

VS Code Insiders has a key feature, `chatLanguageModels.json`, which should land in the main app any day now. You can call the model whatever you want and give it any ID you want. The most important thing is to set the URL and specify whether it can make tool calls. The URL should look something like `http://192.168.1.xxx:1234/v1`. You can choose to use HTTPS or enable API keys for more security if you would like. You can now choose your model from the list. That's it: Copilot now runs off of your local hardware!

Here is where your tuning and tweaking skills come into play. You won't have to adjust things like temperature, but picking the right model and context size is key. I was getting about 33 tokens per second using Qwen 3.6 27b with a Q4_K_M quantization. It's... fine, but far too slow for my taste, so I switched to Qwen3.6-35B-A3B-UD-IQ4_XS. What does this change? A lot. 35 billion parameters increases the VRAM requirements, but having only 3B active per token keeps it manageable. UD is Unsloth's Dynamic quantization, which keeps the most sensitive layers at higher precision while quantizing the rest more aggressively. And IQ4_XS reduces the memory footprint even further than Q4_K_M, with minimal loss. What you end up with is about 130 tokens per second. It's fast and more competent than the baseline 27b model. Pretty amazing stuff.
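To make the trade-off concrete, here's the rough back-of-envelope math I use for weight memory. The bits-per-weight figures are my own approximations for each quant type, not exact numbers pulled from the GGUF files.

```python
# Back-of-envelope weight memory: parameters (billions) * bits-per-weight / 8 = GB.
# The bits-per-weight values are rough averages I'm assuming for each quant;
# real GGUF files can differ by a gigabyte or so either way.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

print(f"27B @ Q4_K_M (~4.8 bpw): {weight_gb(27, 4.8):.1f} GB")
print(f"35B @ IQ4_XS (~4.3 bpw): {weight_gb(35, 4.3):.1f} GB")
# Both fit in 24 GB only if the KV cache and runtime overhead stay small,
# which is exactly why the context size matters so much below.
```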

But I had one problem. I could make a call to it and it ran great, but any follow-up question spiked my GPU and it would just hang. If it did complete, it was running at 3 tokens per second. Ouch. So I ran the same test in VS Code and it was strange: it would start to output text, stop, then try again and again in a loop until it failed. When hosting a model like this there aren't a lot of dials to turn, since Copilot handles most of it. That more or less left me with tweaking my context.

Originally, I cranked the context window up as high as it would let me. LM Studio doesn't really know how much memory a model will use; it's all estimated. But since I was monitoring my PC's resources, I noticed my VRAM usage was going above the 24 GB I have, which means it was spilling over into RAM and slowing the whole thing down. So I dialed the context size back to 150,000 tokens. That's more than enough for the work I do -- I manage my contexts pretty well, and I always have my primary hosted models that allow for much more. I did tweak my `chatLanguageModels.json` to allow for 128k input tokens, reserving the remaining 22k for output. I might adjust this a bit, the output can probably be smaller, but we will see how it behaves.
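The reason context size bites so hard is the KV cache, which grows linearly with the number of tokens. A rough estimator looks like the sketch below. The layer, head, and dimension defaults are placeholders rather than Qwen's actual config, and KV-cache quantization, if you enable it, shrinks these numbers further, so treat the output as an illustration of the scaling, not a spec.

```python
# Rough KV-cache size: 2 (K and V) * layers * KV heads * head dim * bytes * tokens.
# The layer/head/dim defaults are placeholders, NOT Qwen's real config; pull the
# actual values from the model card or the GGUF metadata LM Studio shows.
def kv_cache_gb(tokens: int, n_layers: int = 48, n_kv_heads: int = 4,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens / 1e9

for ctx in (32_000, 150_000, 262_000):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```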

Now that I have it all set up the way I like (good-sized context and great performance), the next step is to test it, test it, and test it. With every model you have to learn how it behaves, and this is no different, but now I have more control over it. I can also host it other ways to tune it more to my liking.
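My testing loop is nothing fancy, just a small script I rerun after each settings change to see whether throughput moved. Same endpoint and placeholder model ID as in the earlier snippet.

```python
# Tiny throughput check to rerun after each settings change: one request,
# then completion tokens divided by wall-clock time.
import time

from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.xxx:1234/v1", api_key="lm-studio")

def tokens_per_second(prompt: str, model: str = "qwen3.6-35b-a3b") -> float:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    elapsed = time.perf_counter() - start
    return resp.usage.completion_tokens / elapsed

print(f"{tokens_per_second('Explain speculative decoding in one paragraph.'):.1f} tok/s")
```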

And lastly, there are other ways to get more performance out of a model, namely speculative decoding. With it, the 27b model's performance can match the 35b variant I am using, but at this time LM Studio doesn't support the feature for Qwen 3.6 (it's that new), so I haven't tested it. There is an accuracy penalty, though, and the added intelligence of the 35b variant is appealing. I could use speculative decoding with the 35b, but the draft model eats into VRAM, so I'd have to give it a smaller context window. Lots of choices, but I think I've gotten as close to an ideal setup as I can at the moment, so those experiments are for another day.

As always, I hope you enjoyed and happy hacking!