Qwen 3 32B
Default base model — strong on multilingual reasoning, instruction-following.
Apache 2.0
DeepSeek-R1-Distill-Qwen-32B
Reasoning lane — chain-of-thought for deviation root-cause work.
MIT
Ollama / vLLM
Inference runtime — Ollama on Apple Silicon, vLLM on GPU servers.
MIT / Apache 2.0
LiteLLM
OpenAI-compatible gateway. /v1/chat/completions, /v1/embeddings.
MIT
Qdrant
Vector DB and RAG. On-disk, snapshot-friendly, your documents.
Apache 2.0
Langfuse
Audit and tracing — every prompt, every response, hash-anchored.
MIT
Open WebUI
Chat UI plus per-vertical workflow surfaces.
MIT
Keycloak
SSO via SAML/OIDC. RBAC. MFA. Your IdP, your roles.
Apache 2.0
Prometheus + Grafana
Latency, error rate, token throughput, GPU temp on dashboards you own.
Apache 2.0