Local AI Voice Assistant with Home Assistant, Ollama, and Nvidia GPU on Proxmox

Fully local AI-powered voice assistant using Home Assistant Voice Preview Edition, Ollama running on a GTX 1080 Ti in a Proxmox LXC, with weather forecast capabilities. No cloud, no ongoing costs.


Architecture Overview

+---------------------+      speech (Wyoming)      +---------------------+     inference      +---------------------+
|                     |                            |                     |                    |                     |
|   HA Voice PE       | -------------------------> |   Home Assistant    | -----------------> |   Ollama LXC        |
|   (ESP32 device)    |  Whisper STT / Piper TTS   |   Assist Pipeline   |   HTTP :11434      |   GTX 1080 Ti       |
|                     |                            |                     |                    |   qwen3:4b-instruct |
+---------------------+                            +---------------------+                    +---------------------+
                                                              |
                                                              | tool calls
                                                      +-------+-------+
                                                      |               |
                                               +------+------+ +------+------+
                                               |             | |             |
                                               | llm_intents | | Open-Meteo  |
                                               | Brave Search| | Weather     |
                                               +-------------+ +-------------+

Background

Home Assistant’s built-in Assist handles simple home-control commands well, but falls apart on anything conversational: general-knowledge questions, weather forecasts, anything requiring real-world context. The solution is to route unrecognized commands to a locally hosted LLM via Ollama, keeping everything on-premises with no data leaving the network.

The challenge on Proxmox is getting Nvidia GPU passthrough working in an LXC container. Unlike full VM passthrough, LXC device passthrough lets the GPU be shared with other containers (Jellyfin, Frigate, etc.) while still giving Ollama near-native CUDA performance. The catch: the driver version on the host and inside the LXC must match exactly. This is the most common failure point and the one that requires the most troubleshooting.


Stack

Component                               Role
Home Assistant Voice Preview Edition    Microphone / speaker / wake word
Faster Whisper (HA add-on)              Local speech-to-text
Piper (HA add-on)                       Local text-to-speech
Ollama (Proxmox LXC)                    Local LLM inference server
qwen3:4b-instruct                       Conversation model — fast, tool-capable
GTX 1080 Ti (11GB VRAM)                 GPU acceleration
llm_intents + Brave Search API          Web search tool for Ollama
Open-Meteo                              Free weather integration, no API key

Implementation

01 — Install Ollama on Proxmox via Community Script

Run from the Proxmox host shell. Review the script source before executing — it runs as root.

bash -c "$(curl -fsSL https://raw.githubusercontent.com/community-scripts/ProxmoxVE/main/ct/ollama.sh)"

When prompted, select a privileged container. Nvidia GPU passthrough requires privileged mode due to device node permissions — unprivileged containers remap UIDs/GIDs in a way that breaks CUDA device access. Allocate at least 4 cores, 8GB RAM, and 40GB of disk (models are large).
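
To sanity-check the result from the host, dump the container config with pct config (<CTID> is the ID the script assigned):

# Privileged containers show "unprivileged: 0" or omit the unprivileged line entirely
pct config <CTID> | grep -E "unprivileged|cores|memory|rootfs"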

02 — Install Nvidia Drivers on the Proxmox Host

Add non-free repositories and install the driver:

# Add non-free to all three repo lines
nano /etc/apt/sources.list
# Append: non-free non-free-firmware to each deb line

apt update
apt install -y pve-headers-$(uname -r) nvidia-driver nvidia-modprobe
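
For reference, a finished deb line should look roughly like this (assumes Proxmox 8 on Debian 12/bookworm; your mirror may differ):

deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware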

Blacklist nouveau to prevent it from claiming the GPU before the Nvidia driver loads:

1echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
2echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
3update-initramfs -u
4reboot
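
After the reboot, confirm nouveau stayed out of the way and the Nvidia modules loaded:

# First command should print nothing; second should list nvidia modules
lsmod | grep nouveau
lsmod | grep nvidia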

If the GPU was previously configured for VM passthrough, vfio-pci will have claimed it. Check and remove any vfio bindings:

# Check which driver is bound
lspci -k | grep -A3 "09:00.0"

# If "Kernel driver in use: vfio-pci", remove the binding:
# - Comment out the vfio ids line in /etc/modprobe.d/vfio.conf
# - Remove vfio entries from /etc/modules
# - Remove "blacklist nvidia" from /etc/modprobe.d/blacklist.conf and pve-blacklist.conf
update-initramfs -u
reboot

03 — Driver Version Matching

The host and LXC must run the exact same Nvidia driver version. The community script provisions the LXC with Ubuntu 24.04 (noble), which may pull a newer driver than Debian’s repos offer on the host.

If there is a mismatch (Failed to initialize NVML: Driver/library version mismatch), the cleanest fix is to install the matching runfile driver directly from Nvidia on the host:

# Remove the apt-managed driver first (quote the glob so the shell doesn't expand it)
apt remove --purge 'nvidia*' -y
apt autoremove -y

# Install kernel headers
apt install -y pve-headers-$(uname -r)

# Download and install the matching runfile (replace version as needed)
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/535.288.01/NVIDIA-Linux-x86_64-535.288.01.run
sh NVIDIA-Linux-x86_64-535.288.01.run --no-questions --ui=none
reboot

Verify both sides match after reboot:

# Host
nvidia-smi | grep "Driver Version"

# Inside LXC
pct enter <CTID>
nvidia-smi | grep "Driver Version"

04 — Create Nvidia Device Nodes and Persist Them

On a headless host, the /dev/nvidia* device nodes are not created automatically at boot. Create a systemd service to initialize them before the LXC starts:

nano /etc/systemd/system/nvidia-modprobe.service

[Unit]
Description=NVIDIA modprobe
After=multi-user.target
Before=pve-guests.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-modprobe -u -c=0
ExecStart=/bin/mkdir -p /dev/nvidia-caps
ExecStart=/bin/bash -c 'mknod /dev/nvidia-caps/nvidia-cap1 c 236 1 2>/dev/null || true'
ExecStart=/bin/bash -c 'mknod /dev/nvidia-caps/nvidia-cap2 c 236 2 2>/dev/null || true'
ExecStart=/bin/chmod 666 /dev/nvidia-caps/nvidia-cap1
ExecStart=/bin/chmod 666 /dev/nvidia-caps/nvidia-cap2

[Install]
WantedBy=multi-user.target

Reload systemd and enable the service:

systemctl daemon-reload
systemctl enable nvidia-modprobe.service
systemctl start nvidia-modprobe.service
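
After starting the service, verify the nodes exist on the host. If the mknod calls fail, the nvidia-caps major number may differ from 236 on your system; it is listed in /proc/devices:

ls -l /dev/nvidia* /dev/nvidia-caps/
grep nvidia-caps /proc/devices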

05 — LXC Configuration

The community script automatically adds the correct device passthrough entries to the LXC config using Proxmox’s dev syntax. Verify /etc/pve/lxc/<CTID>.conf contains:

unprivileged: 0
dev0: /dev/nvidia0,gid=44
dev1: /dev/nvidiactl,gid=44
dev2: /dev/nvidia-uvm,gid=44
dev3: /dev/nvidia-uvm-tools,gid=44
dev4: /dev/nvidia-caps/nvidia-cap1,gid=44
dev5: /dev/nvidia-caps/nvidia-cap2,gid=44

Note: nvidia-modeset is not required for Ollama inference and can be omitted if the device node is not present (common with runfile-installed drivers on headless systems).
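
After any config change, restart the container and confirm the GPU is visible inside it:

pct stop <CTID> && pct start <CTID>
pct exec <CTID> -- nvidia-smi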

06 — Pull a Model and Verify GPU Acceleration

Inside the LXC:

ollama pull qwen3:4b-instruct
ollama run qwen3:4b-instruct "how many quarts in a gallon"

While it runs, monitor GPU usage on the host:

watch -n 1 nvidia-smi

Memory-Usage should jump to ~3GB and GPU-Util should spike during generation. Response time for short answers should be 1–3 seconds.
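
Ollama can also report where the model is loaded. Inside the LXC, the PROCESSOR column should read GPU rather than CPU (column name may vary by Ollama version):

# Show loaded models and whether they sit in GPU or CPU memory
ollama ps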

07 — Connect Ollama to Home Assistant

In Home Assistant, go to Settings → Devices & Services → Add Integration → Ollama and enter the LXC’s IP:

http://<LXC_IP>:11434
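
Before adding the integration, a quick reachability test from any machine on the network confirms the API is listening (the /api/tags endpoint lists pulled models):

curl http://<LXC_IP>:11434/api/tags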

Configure the integration:

Setting                   Value
Model                     qwen3:4b-instruct
Keep alive                300 (seconds)
Context window            8192
Control Home Assistant    Off (initially)
Think before responding   Off

Set the system prompt to something voice-optimized:

You are a voice assistant for Home Assistant.
Answer questions about the world truthfully.
Answer in plain text only. No markdown, no bullet points, no asterisks.
Keep answers short and conversational, as your response will be spoken aloud.
The current time is {{ now() }}.
The location is [Your City], [Your State].
When asked about sensors or devices, always use available tools to look up
current state before answering. Never guess the state of a device or sensor.

08 — Configure Voice Pipeline

In Settings → Voice Assistants, set the conversation agent to Ollama, speech-to-text to Faster Whisper, and text-to-speech to Piper. Assign the assistant to the Voice Preview Edition device.

09 — Web Search via llm_intents

Install via HACS by adding the custom repository https://github.com/skye-harris/llm_intents. After installing and restarting HA, add the integration and configure it with a Brave Search API key (free tier: 1,000 searches/month). Enable the search tool in the Ollama integration’s tool settings. The model must support tool use — qwen3:4b-instruct does.
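
To verify the API key before wiring it into llm_intents, query Brave's web search endpoint directly (endpoint and header per Brave's API docs; substitute your key):

curl -s "https://api.search.brave.com/res/v1/web/search?q=test" \
  -H "Accept: application/json" \
  -H "X-Subscription-Token: <YOUR_API_KEY>"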

10 — Weather Forecasts

Add the Open-Meteo integration (no API key required). Expose the weather.home entity via Settings → Voice Assistants → Expose. For richer forecast responses, the weather entity state and forecast data can be passed to Ollama via automations using the weather.get_forecasts action.
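
Open-Meteo's API can also be queried directly, which is useful for checking what forecast data is available before building automations (public endpoint, no key; substitute your coordinates):

curl -s "https://api.open-meteo.com/v1/forecast?latitude=40.71&longitude=-74.01&daily=temperature_2m_max,temperature_2m_min,precipitation_sum&forecast_days=3&timezone=auto"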


Limitations

Nvidia GPU passthrough in an LXC is more fragile than full VM passthrough. The host and container driver versions must match exactly, and a mismatch can appear without warning after driver updates: apt upgrade inside the LXC can pull a newer nvidia-utils version than the host is running, causing Ollama to silently fall back to CPU. Pinning the nvidia packages in the LXC with apt-mark hold prevents this.
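
A sketch of the hold, run inside the LXC (package names depend on how the driver was installed, so list them first; the 535 names below are examples):

# Find the installed nvidia packages, then hold them at the current version
dpkg -l | grep -i nvidia
apt-mark hold nvidia-utils-535 libnvidia-compute-535   # example names; match yours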

The runfile-installed driver on the host is not managed by apt, meaning Proxmox kernel updates may require reinstalling the driver manually if the kernel module no longer matches the running kernel.

The 4b model is fast but makes more mistakes than larger models on complex reasoning or ambiguous entity lookups. Keeping the exposed entity list lean and using natural aliases improves reliability significantly.


Outcomes

Inference         GPU-accelerated, 1–3 second response time for short answers
Privacy           Fully local — no data leaves the network
Cost              $0/month (Brave Search API free tier, everything else self-hosted)
Model             qwen3:4b-instruct — fast, tool-capable, fits in 11GB VRAM
GPU               GTX 1080 Ti, ~3GB VRAM used during inference
Voice hardware    Home Assistant Voice Preview Edition
