Question 1

How do I run a local LLM for Home Assistant voice control with Ollama?

Accepted Answer

My setup is Ollama hosting the models on a Mac Mini M4 Pro, with Home Assistant connecting to it over the LAN. The key is exposing Ollama to the network in the Ollama settings and giving the Mac a static IP so Home Assistant can always reach it. After that, you add the Ollama integration in Home Assistant and point it at your Mac’s IP plus the port (11434 by default).
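Before adding the integration, it can help to confirm the Mac is reachable from another machine on the LAN. The following is a minimal sketch, not part of the setup described above: it assumes Ollama's default port 11434 and uses Ollama's `/api/tags` endpoint, which lists installed models. The helper names and the example IP are hypothetical.

```python
import json
import urllib.request

OLLAMA_PORT = 11434  # Ollama's default port; change it if yours differs


def ollama_base_url(host: str, port: int = OLLAMA_PORT) -> str:
    """Build the URL you would also enter in the Home Assistant integration."""
    return f"http://{host}:{port}"


def parse_model_names(payload: dict) -> list[str]:
    """Pull model names out of an /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_models(host: str, port: int = OLLAMA_PORT, timeout: float = 5.0) -> list[str]:
    """Ask the Ollama server which models it has installed.

    Raises an exception from urllib if the host is unreachable, which usually
    means Ollama is not exposed to the network or the IP has changed.
    """
    url = f"{ollama_base_url(host, port)}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_model_names(json.load(resp))
```

Calling `list_models("192.168.1.50")` (substituting your Mac's static IP) from another computer should return the installed model names if the network exposure is working.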

Question 2

What hardware am I using to run local LLMs for Home Assistant?

Accepted Answer

I’m running everything on a Mac Mini M4 Pro dedicated to LLM use. Apple Silicon is great here because the GPU draws on unified memory shared with the CPU, which gives it more headroom for model weights than you’d typically get with VRAM-limited setups. It’s also insanely efficient: my box idles under 5 watts and peaks around 65 watts under full load.

Question 3

What is my cached-response strategy in Home Assistant?

Accepted Answer

Instead of asking the model to reason through a huge pile of entities every time, I push logic into Home Assistant scripts and predefined pathways. Then I cache the results of common queries so the voice model can respond instantly. For example, I have a bigger model generate a clean spoken weather summary every hour, cache it, and let a smaller model serve it up fast.
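The caching idea can be illustrated with a small time-based cache. This is only a sketch of the pattern in plain Python, not how it is wired up inside Home Assistant; the class name `CachedResponses` and the `generate` callback (standing in for the slow call to the bigger model) are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class CachedResponses:
    """Keep generated text around for a fixed time so repeat queries are instant."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without waiting
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, generate: Callable[[], str]) -> str:
        """Return the cached text for key, regenerating it only when it has expired."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no model call needed
        text = generate()  # slow path: ask the bigger model for a new summary
        self._store[key] = (now, text)
        return text
```

With `ttl_seconds=3600`, the expensive summary is produced at most once an hour, and every request in between is served from memory.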

Question 4

Which Ollama models work best for Home Assistant conversation agents?

Accepted Answer

At the time I recorded this, my favorite for Home Assistant was Qwen 3, either 4B or 8B depending on your hardware. I’m currently leaning on Qwen 3 4B for real-time voice because it’s faster, even though it occasionally hallucinates. For heavier tasks where latency matters less, I’ve been using a larger Mistral 24B model and letting it do the slower, smarter work.

Question 5

How fast does an LLM need to be for Home Assistant voice?

Accepted Answer

In my experience, anything above 10 tokens per second is technically usable, but for Home Assistant voice it needs to be much faster to feel good. For me, around 50 tokens per second is the minimum for real-time voice interactions. Bigger models like 32B can be usable for background tasks, but they’re usually too slow for snappy voice.
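To see why the gap between 10 and 50 tokens per second matters, here is the arithmetic as a tiny sketch. The 40-token reply length is an assumption chosen for illustration, and the estimate covers generation time only (not speech-to-text, prompt processing, or text-to-speech).

```python
def seconds_to_generate(reply_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating a reply of the given length at a given speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return reply_tokens / tokens_per_second


# Assumed reply length for a short spoken answer (illustrative only).
REPLY_TOKENS = 40

slow = seconds_to_generate(REPLY_TOKENS, 10)  # 40 / 10 = 4.0 seconds
fast = seconds_to_generate(REPLY_TOKENS, 50)  # 40 / 50 = 0.8 seconds
```

Under that assumption, a 10 tokens-per-second model spends about four seconds on the reply alone, while 50 tokens per second brings it under one second.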

Question 6

How do I test Ollama performance on my machine?

Accepted Answer

I test directly in Terminal with a command like `ollama run <model> --verbose`. The verbose output includes the eval rate (tokens per second), which is the stat I care about most when judging whether a model will feel responsive. I also use a broad test prompt to see how well the model reasons before I trust it in automations.
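The same number can be computed from Ollama's HTTP API if you want to script the comparison. This is a sketch rather than the method used above: it relies on the `eval_count` and `eval_duration` (nanoseconds) fields returned by `/api/generate`; the function names are hypothetical.

```python
import json
import urllib.request


def eval_rate(response: dict) -> float:
    """Tokens per second from an Ollama generate response.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, reported in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def measure(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> float:
    """Run one non-streaming generation and return its eval rate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return eval_rate(json.load(resp))
```

Running `measure("http://localhost:11434", "<model>", "<prompt>")` for each model you are considering gives a like-for-like speed comparison on the same prompt.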

Question 7

How do I visualize LLM activity with WLED?

Accepted Answer

I used a pre-built WLED animation called SOAP and automated its speed based on the Mac Mini’s power draw. As the LLM ramps up, wattage increases and the animation smoothly speeds up, then eases back down when the box is idle. Since the Mac Mini is dedicated to LLMs, it’s a pretty spot-on “LLM workload meter.”
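The mapping from wattage to animation speed is a simple linear scale. Below is a sketch of that idea outside Home Assistant, not the actual automation: it assumes the 5 W idle and 65 W peak figures from the hardware answer, and it uses WLED's JSON API, where a segment's `sx` field sets effect speed from 0 to 255. The function names are hypothetical, and any smoothing between updates is left out.

```python
import json
import urllib.request


def watts_to_wled_speed(watts: float, idle_watts: float = 5.0, peak_watts: float = 65.0) -> int:
    """Scale power draw linearly onto WLED's 0-255 effect speed range."""
    fraction = (watts - idle_watts) / (peak_watts - idle_watts)
    fraction = min(1.0, max(0.0, fraction))  # clamp readings outside the expected range
    return round(fraction * 255)


def wled_speed_payload(speed: int) -> dict:
    """JSON body that sets the effect speed on the first segment."""
    return {"seg": [{"sx": speed}]}


def send_speed(wled_host: str, speed: int, timeout: float = 5.0) -> None:
    """POST the new speed to a WLED device's JSON state endpoint."""
    body = json.dumps(wled_speed_payload(speed)).encode()
    request = urllib.request.Request(
        f"http://{wled_host}/json/state",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        resp.read()
```

Clamping keeps the animation at its slowest setting when the box idles below 5 W and at full speed if a reading briefly exceeds the expected peak.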

Question 8

What is Open WebUI and why am I using it with Ollama?

Accepted Answer

Open WebUI is a local ChatGPT-style interface that talks to your Ollama models. I use it to switch between models quickly, run the same prompt across different sizes, and compare performance side-by-side. If you’ve already got Ollama running and you’re comfortable following GitHub instructions, it’s a really nice next step.

Run a Local LLM: Ollama + Home Assistant
