I have a desktop PC with a strong GPU. My main work happens on a MacBook. I wanted the MacBook to use a coding assistant that runs on the desktop, not in the cloud: no API bills, it works offline, and my code stays private.
This is not a polished “in 10 easy steps” post. It’s a story of what I actually did. Most steps had a problem.
Table of contents
- What I started with
- Step 0: Ollama or vLLM
- Step 1: WSL2
- Step 2: Mirrored networking
- Step 3: A separate disk for models
- Step 4: vLLM in a Python venv
- Step 5: The model - Qwen3.5-35B-A3B GPTQ
- Step 6: The vLLM serve script
- Step 7: It works in WSL. It does not work from Windows.
- Step 8: The first Opencode session
- Step 9: Windows restart breaks everything
- Step 10: I gave up on autostart
- Step 11: One more rabbit hole - Claude Code
- Where I ended up
What I started with
The desktop:
- Windows 11
- NVIDIA RTX 5090, 32 GB VRAM
- Intel Core Ultra 9 285K
- Wired to my home router
The MacBook is on the same Wi-Fi. I planned to use Opencode on it - a terminal AI coding agent that talks to any OpenAI-compatible server.
Before doing anything, I wrote down the constraints:
- LAN only. No public exposure, no tunnels.
- No money on cloud APIs.
- Quality good enough for real coding tasks.
- The kids use the same desktop for games (the setup must give the GPU back when I’m not using it).
Step 0: Ollama or vLLM
The “easy path” answer is Ollama. One installer on Windows. No Linux, no Python venv. For most people that’s the right pick.
I started research there.
Coding agents resend a lot of context every turn - the whole system prompt, tool schemas, file reads. By turn five of a real session, the prompt is 30k+ tokens and the model only generates a couple thousand. Repeating that prefix every turn is slow.
vLLM has prefix caching that survives across requests. Ollama (as of writing) only has single-slot caching that gets evicted when the model unloads. For an agent workflow, having that cache layer was a big deal.
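To make this concrete, here's a back-of-envelope sketch (illustrative numbers, not measurements) of how many prompt tokens the server has to process over a ten-turn session with and without a cross-request prefix cache:

```shell
#!/usr/bin/env bash
# Toy model of a 10-turn agent session: a ~30k-token shared prefix
# (system prompt, tool schemas) plus ~2k tokens of new context per turn.
prefix=30000; per_turn=2000; turns=10

no_cache=0; with_cache=0
for (( t = 1; t <= turns; t++ )); do
  # Without prefix caching, the whole growing prompt is re-processed.
  no_cache=$(( no_cache + prefix + t * per_turn ))
  # With a cross-request prefix cache, only the new suffix is processed
  # after the first turn.
  if (( t == 1 )); then
    with_cache=$(( with_cache + prefix + per_turn ))
  else
    with_cache=$(( with_cache + per_turn ))
  fi
done

echo "prompt tokens processed without cache: $no_cache"
echo "prompt tokens processed with cache:    $with_cache"
```

Roughly an 8x difference in prompt processing for this toy session, and real agent sessions run longer than ten turns.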
I decided to take the harder path: vLLM. But vLLM doesn't run directly on Windows; it needs Linux, which means WSL2.
If you only do single chat turns, Ollama is the better trade. The rest of this post would still apply - just skip the WSL section.
Step 1: WSL2
WSL2 is a thin Linux VM that ships with Windows.
Open PowerShell as Administrator and run:
```powershell
wsl --install -d Ubuntu-24.04
```

After Ubuntu was installed, I checked the WSL version and got my first surprise:
```
Windows Subsystem for Linux 2.6.3
```

vLLM on Blackwell needs WSL 2.7.0+ to get CUDA-graph capture working without the `--enforce-eager` slowdown.
The Microsoft Store still had 2.6.3 as the current version. I updated WSL through the pre-release channel:
```powershell
wsl --update --pre-release
wsl --version
```

That gave me 2.7.x.
A reboot and `wsl --version` confirmed it. Lesson: always check the version before you trust a guide.
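If I were scripting this today, I'd gate the rest of the setup on that version check. A sketch in bash (the `need`/`have` values are from above; on Windows you'd parse `have` out of the real `wsl --version` output):

```shell
#!/usr/bin/env bash
# True when version $1 >= version $2, using sort -V for the comparison.
version_ge() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

need="2.7.0"
have="2.6.3"   # example value; substitute the real `wsl --version` output

if version_ge "$have" "$need"; then
  echo "WSL $have is new enough"
else
  echo "WSL $have is too old, need >= $need (try: wsl --update --pre-release)"
fi
```

`sort -V` handles multi-digit components correctly, so 2.10.x sorts after 2.7.x.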
Step 2: Mirrored networking
Mirrored networking is the cleaner WSL networking mode in newer versions: instead of running its own NAT, WSL shares the host's network identity.
I wrote C:\Users\<me>\.wslconfig:
```ini
[wsl2]
networkingMode=mirrored
dnsTunneling=true
autoProxy=true
firewall=true
```

Restarted WSL. It started but printed:
```
wsl: The wsl2.localhostForwarding setting has no effect when using mirrored networking mode
```

That warning is harmless: `localhostForwarding` is for NAT mode, and with mirrored mode it's redundant. I skipped the warning and moved on.
I'll spoil the ending here: mirrored mode didn't fully solve my problem. I ended up using NAT mode plus a port proxy anyway, because mirrored mode had its own quirks I didn't want to fight; more on that below.
Step 3: A separate disk for models
Models are big. I didn’t want them clogging C:\. I put them on E:\ as a virtual disk that WSL could mount.
```powershell
New-Item -ItemType Directory -Path E:\wsl-vllm -Force
$vhd = "E:\wsl-vllm\wsl-models.vhdx"
New-VHD -Path $vhd -Dynamic -SizeBytes 500GB
```

Then I tried to mount it:
```powershell
wsl --mount --vhd E:\wsl-vllm\wsl-models.vhdx --bare
```

and got this error:
```
The system cannot find the file specified.
Error code: Wsl/ERROR_FILE_NOT_FOUND
```

The file existed; I could `ls` it. I shut WSL down, recreated the directory, ran `wsl --shutdown`, tried again - same error.
Eventually I found the cause: the VHD path was case-sensitive in some places and I had a stray issue with the parent folder.
I deleted everything, recreated cleanly and mounted again and it worked.
Inside WSL, running lsblk showed the new device:
```
sde 8:64 0 500G 0 disk
```
```shell
sudo mkfs.ext4 /dev/sde
sudo mkdir -p /mnt/models
echo '/dev/sde /mnt/models ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount -a
```

For the VHD to be available after every Windows restart, I scheduled the mount at logon. Open a new PowerShell as Administrator and run:
```powershell
$action = New-ScheduledTaskAction -Execute "wsl.exe" `
  -Argument "--mount --vhd E:\wsl-vllm\wsl-models.vhdx --bare"
$trigger = New-ScheduledTaskTrigger -AtLogOn
Register-ScheduledTask -TaskName "WSL-MountModels" `
  -Action $action -Trigger $trigger -RunLevel Highest -Force
```

The first time I ran Register-ScheduledTask it said the task already existed, so I added -Force (you can also run Unregister-ScheduledTask first).
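In hindsight I'd also make the serve script refuse to start when the models disk isn't mounted, instead of failing later with a confusing model-not-found error. A small guard, sketched in bash (`require_mounted` is my own helper name, not a vLLM or WSL feature):

```shell
#!/usr/bin/env bash
# Guard for the serve script: refuse to start when the models disk
# isn't actually mounted. `mountpoint -q` exits 0 only for a real mount.
require_mounted() {
  local dir="$1"
  if ! mountpoint -q "$dir"; then
    echo "ERROR: $dir is not mounted - run the WSL-MountModels task" >&2
    return 1
  fi
  echo "models disk OK: $dir"
}

# At the top of serve-qwen.sh this would be:
#   require_mounted /mnt/models || exit 1
```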
Step 4: vLLM in a Python venv
We need to prepare the system and environment to run vLLM. Open PowerShell, run `wsl`, and once inside the WSL terminal, run:
```shell
sudo apt update
sudo apt install -y python3.12-venv build-essential

mkdir ~/llm && cd ~/llm
python3.12 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install vllm
```

A few GB of CUDA wheels later, it was installed.
nvidia-smi from inside WSL showed the 5090 with the right driver and CUDA version.
Good - the GPU passthrough worked.
Step 5: The model - Qwen3.5-35B-A3B GPTQ
I picked Qwen3.5-35B-A3B-Instruct (GPTQ-Int4). It has 35B total parameters, but only ~3B are active per token (MoE).
On a single GPU, that means it’s noticeably faster than a dense 32B at the same quality. Roughly 20 GB on disk, fits in 32 GB VRAM with room for context.
We'll download the model from Hugging Face. First install the CLI and log in with a read-only token:
```shell
pip install huggingface_hub
huggingface-cli login   # paste a read-only HF token

huggingface-cli download Qwen/Qwen3.5-35B-A3B-Instruct-GPTQ-Int4 \
  --local-dir /mnt/models/qwen3.5-35b-a3b-gptq
```

I almost grabbed Qwen3.6 because I saw a tweet about it. It wasn't actually published yet at the time - only the announcement.
Note: You should verify the Hugging Face page exists before planning to use a model.
Step 6: The vLLM serve script
I wrote ~/llm/serve-qwen.sh based on the guide I had:
```shell
#!/usr/bin/env bash
set -euo pipefail
source ~/llm/.venv/bin/activate

vllm serve /mnt/models/qwen3.5-35b-a3b-gptq \
  --served-model-name qwen3.5-35b \
  --host 0.0.0.0 --port 8000 \
  --api-key "$VLLM_API_KEY" \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --disable-log-requests
```

On the first run, it printed:
```
vllm: error: unrecognized arguments: --disable-log-requests
```

That flag had been removed in my installed vLLM version, so I dropped it from the script.
Next run:
```
vllm: error: unrecognized arguments: false
```

Some other flag wanted a value, not a true/false. Fixed it.
Note: always run `vllm serve --help` against your actual installed version; don't trust guides.
After two more iterations, this is what worked:
```shell
#!/usr/bin/env bash
set -euo pipefail
source ~/llm/.venv/bin/activate

vllm serve /mnt/models/qwen3.5-35b-a3b-gptq \
  --served-model-name qwen3.5-35b \
  --host 0.0.0.0 --port 8000 \
  --api-key "$VLLM_API_KEY" \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder
```

I generated a random value for VLLM_API_KEY and saved it for later:
```shell
export VLLM_API_KEY="sk-local-$(openssl rand -hex 16)"
echo $VLLM_API_KEY   # save this
```

The first launch took about 2 minutes.
When I saw `Application startup complete.` I felt happy, but that didn't last long.
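Since startup takes a couple of minutes, a small polling helper saves you from poking the server by hand. A sketch, assuming the port and API key from the script above (`wait_for_vllm` is my own helper, not part of vLLM):

```shell
#!/usr/bin/env bash
# Poll an OpenAI-compatible endpoint until it answers or a timeout hits.
# Usage: wait_for_vllm <base_url> <timeout_seconds>
wait_for_vllm() {
  local url="$1" timeout="$2" waited=0
  while (( waited < timeout )); do
    if curl -sf -o /dev/null --max-time 2 \
         -H "Authorization: Bearer ${VLLM_API_KEY:-unset}" \
         "$url/v1/models"; then
      echo "vLLM is up after ${waited}s"
      return 0
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  echo "gave up after ${timeout}s" >&2
  return 1
}

# wait_for_vllm http://127.0.0.1:8000 180
```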
Step 7: It works in WSL. It does not work from Windows.
A curl from a second WSL terminal worked:
```shell
curl -H "Authorization: Bearer $VLLM_API_KEY" http://127.0.0.1:8000/v1/models
```

JSON came back. The model was reachable from inside WSL.
Then I tried from a Windows PowerShell:
```powershell
curl.exe -H "Authorization: Bearer ..." http://192.168.50.94:8000/v1/models
```

And I got an error:
```
curl: (7) Failed to connect to 192.168.50.94 port 8000 after 2046 ms: Could not connect to server
```

The vLLM logs were silent - nothing after `Application startup complete.`. The request never reached vLLM; something in the network stack between Windows and WSL was blocking it.
This is where I lost a few hours.
7.1 Network profile was Public
Get-NetConnectionProfile told me my Ethernet was classified as Public. Most LAN-friendly firewall rules only fire on Private. So I switched it:
```powershell
Set-NetConnectionProfile -InterfaceAlias "Ethernet" -NetworkCategory Private
```

Windows reclassifies your network sometimes. Re-check this any time something just stops working.
7.2 The Hyper-V firewall
I had a Defender rule for port 8000. With mirrored networking, WSL traffic goes through a separate Hyper-V firewall layer that defaults to block.
```powershell
New-NetFirewallHyperVRule -Name "WSL-vLLM-8000" `
  -DisplayName "vLLM (WSL) 8000/TCP" -Direction Inbound `
  -VMCreatorId '{40E0AC32-XXXX-XXXX-XXXX-2B479E8F2E90}' `
  -Protocol TCP -LocalPorts 8000
```

I had to grab the right VMCreatorId first:
```powershell
Get-NetFirewallHyperVVMCreator
```

For me it returned the WSL group ID, which I used with New-NetFirewallHyperVRule.
7.3 The Defender firewall
Scoped to my LAN subnet only:
```powershell
New-NetFirewallRule -DisplayName "vLLM LAN 8000" `
  -Direction Inbound -Action Allow -Protocol TCP -LocalPort 8000 `
  -Profile Private -RemoteAddress 192.168.50.0/24
```

I don't want random devices on the network hitting my LLM - just my MacBook and laptop.
7.4 The WSL IP problem and the port proxy
After all of the above, Windows-to-Windows curl at 127.0.0.1:8000 started working. MacBook-to-Windows still didn’t.
Then I learned the next thing: WSL has its own private IP (mine was 172.25.125.199) and traffic arriving at the Windows host on 192.168.50.94:8000 doesn’t automatically forward to that.
The fix was to use netsh portproxy.
WSL’s IP changes when WSL restarts, so I needed both a one-time forward and a script that refreshes it on every login.
```powershell
$wslIp = (wsl hostname -I).Trim().Split(" ")[0]
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=8000 `
  connectaddress=$wslIp connectport=8000
```

That made the MacBook curl work. To keep it working across restarts, I saved this as E:\wsl-models\wsl-portproxy.ps1:
```powershell
$logFile = "E:\wsl-models\wsl-portproxy.log"
function Log($msg) {
  "$((Get-Date).ToString('yyyy-MM-dd HH:mm:ss')) | $msg" | Out-File $logFile -Append
}
Log "Script started"

$wslIp = $null
for ($i = 1; $i -le 5; $i++) {
  $wslIp = (wsl hostname -I 2>$null).Trim().Split(" ")[0]
  if ($wslIp -and $wslIp -ne "") { break }
  Log "Attempt $i`: WSL IP not ready, retrying"
  Start-Sleep -Seconds 5
}
Log "Detected WSL IP: $wslIp"

netsh interface portproxy reset

netsh interface portproxy add v4tov4 `
  listenaddress=0.0.0.0 listenport=8000 connectaddress=$wslIp connectport=8000
Log "Portproxy refresh complete: 0.0.0.0:8000 -> $wslIp`:8000"
```

And I scheduled this script to run at logon:
```powershell
$action = New-ScheduledTaskAction -Execute "powershell.exe" `
  -Argument "-ExecutionPolicy Bypass -File E:\wsl-models\wsl-portproxy.ps1"
$trigger = New-ScheduledTaskTrigger -AtLogOn

$principal = New-ScheduledTaskPrincipal -UserId "$env:USERNAME" -RunLevel Highest

Register-ScheduledTask -TaskName "WSL-Portproxy-8000" `
  -Action $action -Trigger $trigger -Principal $principal -Force
```

The first time I tried to register it I got Access is denied - I had opened PowerShell as a normal user. Re-opened as Administrator, ran it again, and it registered.
After all four pieces were in place, I ran the curl from my MacBook:
```shell
curl -H "Authorization: Bearer $VLLM_API_KEY" http://192.168.50.94:8000/v1/models | jq
```

I got a correct JSON response. It took the rest of the day to get to that one moment.
Step 8: The first Opencode session
On the MacBook I created the Opencode config at ~/.config/Opencode/Opencode.json:
```json
{
  "$schema": "https://Opencode.ai/config.json",
  "model": "lan-vllm/qwen3.5-35b",
  "small_model": "lan-vllm/qwen3.5-35b",
  "provider": {
    "lan-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "vLLM (LAN - RTX 5090)",
      "options": {
        "baseURL": "http://192.168.50.94:8000/v1",
        "apiKey": "{env:VLLM_API_KEY}"
      },
      "models": {
        "qwen3.5-35b": {
          "name": "Qwen3.5 35B-A3B (GPTQ)",
          "limit": { "context": 131072, "output": 8192 }
        }
      }
    }
  }
}
```

Quick notes:
- `baseURL` points at the Windows host; the portproxy forwards into WSL.
- `{env:VLLM_API_KEY}` reads from my MacBook shell, so the key isn't in the file.
- `limit.output: 8192` - this is important, see the next problem.
Added the API key to my shell:
```shell
echo 'export VLLM_API_KEY=sk-local-...' >> ~/.zshrc
source ~/.zshrc
```

Then `cd ~/my-project && opencode`. Inside Opencode, /models showed lan-vllm/qwen3.5-35b. I picked it, asked it to summarize the README file, and it did.
I worked with it for a few minutes, then it threw this:
```
This model's maximum context length is 65536 tokens. However, you requested
16384 output tokens and your prompt contains at least 49153 input tokens,
for a total of at least 65537 tokens.
```

Opencode was asking for 16k of output, but the prompt was already 49k, and together they overshoot the 65,536-token limit. To fix it, I dropped output to 8192 in the Opencode config and bumped --max-model-len to 131072 in the vLLM script (Qwen3.5 supports it natively), making both changes at once.
After that, it worked well.
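The budget check vLLM applies here is simple enough to write out. Using the numbers from the error message above:

```shell
#!/usr/bin/env bash
# vLLM's check: prompt tokens + requested output must fit in max_model_len.
max_model_len=65536
prompt_tokens=49153      # from the error message
requested_output=16384   # what Opencode asked for

total=$(( prompt_tokens + requested_output ))
echo "requested total: $total (limit $max_model_len)"
if (( total > max_model_len )); then
  echo "over budget by $(( total - max_model_len )) token(s)"
fi

# The fix: cap output at 8192 so the same prompt fits
# (and/or raise --max-model-len; I did both).
fixed_total=$(( prompt_tokens + 8192 ))
echo "with output=8192: $fixed_total of $max_model_len used"
```

Note how close the original numbers were: the request overshot the window by a single token.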
Step 9: Windows restart breaks everything
I cleaned up and added a couple of nice-to-haves: firewall logging on, model storage backup plan, a systemd service in WSL so I wouldn’t have to launch the script by hand.
Then I restarted Windows to confirm everything came back automatically.
And it didn’t.
Trying to run curl from the MacBook:
```
curl: (56) Recv failure: Connection reset by peer
```

After some digging:

- The `WSL-MountModels` scheduled task had run, but `lsblk` showed nothing at `/mnt/models`.
- The systemd `vllm` service was failing with "unavailable resources or another system error".
- `tail -f /var/log/vllm.log` showed vLLM was actually starting eventually, just slowly.
- The portproxy script had run too early - before WSL had a stable IP - and forwarded to nothing.
I patched the auto-mount task, fixed the systemd unit, added retries to the portproxy script.
After another reboot, everything came up clean: MacBook curl worked.
Step 10: I gave up on autostart
My “kids first” rule kicked in at this moment. Even with everything coming up automatically, vLLM was holding 20+ GB of VRAM on every boot.
The kids would turn on the desktop to play a game and find it stuttering.
So I disabled the autostart entirely:
```powershell
Disable-ScheduledTask -TaskName "WSL-KeepAlive"
```

I left WSL-MountModels and WSL-Portproxy-8000 enabled - they're cheap and they make my own startup faster - but vLLM only runs when I run it.
My new ritual when I want to use it:
- Open PowerShell, type `wsl`.
- Inside WSL: `~/llm/serve-qwen.sh`.
- Wait for `Application startup complete.` (about 90 seconds).
- Open Opencode on the MacBook.
When I’m done, Ctrl+C kills vLLM and the GPU is free.
Slightly more friction, but worth it for a happy household.
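If the manual ritual ever gets annoying, a tiny wrapper can make the "GPU comes back" guarantee explicit: start the server, and kill it on any exit, not just a clean Ctrl+C. A hedged sketch (`run_until_exit` is my own helper; the serve script path is the one from Step 6):

```shell
#!/usr/bin/env bash
# Start a long-lived command and make sure it is killed when this shell
# exits for any reason - clean Ctrl+C, an error, or the terminal going away.
run_until_exit() {
  "$@" &                # start the server in the background
  local pid=$!
  # On any exit path, tear the child down and reap it.
  trap 'kill "$pid" 2>/dev/null; wait "$pid" 2>/dev/null' EXIT INT TERM
  wait "$pid"           # block until the server stops
}

# run_until_exit ~/llm/serve-qwen.sh
```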
Step 11: One more rabbit hole - Claude Code
I thought: “could I point Claude Code at this too?” so I set:
```shell
export ANTHROPIC_BASE_URL=http://192.168.50.94:8000/v1
claude
```

Chat worked. I closed Claude Code, closed the terminal, opened a new one - no ANTHROPIC_BASE_URL was set. But the vLLM logs kept showing requests to `/v1/v1/messages?beta=true` and `/v1/api/event_logging/batch`.
Some background process was holding the old environment. I tracked it down through ps, found stale telemetry batchers, killed them.
Note: do not set `ANTHROPIC_BASE_URL` globally - Claude Code does too much background work for that to be safe.
So Claude Code is back to talking to the Anthropic API. Opencode is what I use against my local vLLM.
Where I ended up
After all the iterations:
- A coding model that runs on my home GPU, talks over my home network and costs nothing per request.
- A 90-second start ritual when I want it, GPU is free for the kids the rest of the time.
- My code doesn’t leave the house.
- First-hand knowledge of every layer between my MacBook and the model; when something breaks now, I know which corner to look in (or at least I've assured myself that I do).
The whole journey took longer than I expected; the model and the LLM server were the easy parts. Storage and networking ate most of the time. If you have similar hardware sitting at home, plan for that.
If you’re going to follow along, the order that worked for me was:
- Get WSL2 to a 2.7+ version before doing anything else.
- Move models to a separate disk early - don’t fill C:\.
- Get vLLM running first; talk to it locally with `curl`.
- Only then worry about reaching it from another device.
- Always run `vllm serve --help` against your installed version, and learn the current flags.
Now, time to use this thing for something fun.
Go make something ◝(ᵔᵕᵔ)◜