Open-Source LLM Deployment: Running Llama 3, Mistral, and Gemma on Your Infrastructure

Self-hosting open-source LLMs gives you complete control over your data, costs, and latency. No rate limits. No API price changes. No data leaving your infrastructure. But deploying these models efficiently requires understanding quantization, serving infrastructure, and GPU optimization. Here's how to do it right.
Hardware Requirements
The first decision is what hardware you need. The good news: you don't need a cluster of H100s to run useful models.
| Model | Size | RAM/VRAM (FP16) | Quantized (Q4_K_M) | Recommended GPU | |-------|------|-----------------|-------------------|-----------------| | Llama 3.2 3B | 3B params | 6 GB | 2.5 GB | RTX 3060 (12GB) | | Gemma 2 9B | 9B params | 18 GB | 6 GB | RTX 3090 (24GB) | | Mistral Small 7B | 7B params | 14 GB | 4.5 GB | RTX 3090 (24GB) | | Llama 3.1 8B | 8B params | 16 GB | 5.5 GB | RTX 3090 (24GB) | | Mistral Large 22B | 22B params | 44 GB | 14 GB | 2x RTX 3090 | | Llama 3.1 70B | 70B params | 140 GB | 42 GB | 2x A100 (80GB) | | Mixtral 8x22B | 141B MoE | 84 GB (active) | 25 GB | 2x A100 |
Rule of thumb: A single RTX 3090 (24GB, ~$700 used) handles most 7B-9B models at Q4 quantization with excellent throughput (30-50 tokens/second). For 70B-class models, you need at least 2x A100 or 4x RTX 4090.
Quantization: Making Models Fit
Quantization reduces model precision from 16-bit floats to 4-bit or 8-bit integers, dramatically reducing memory requirements with minimal quality loss:
# Quantize Llama 3.1 8B to Q4_K_M using llama.cpp
./llama.cpp/build/bin/quantize \
Meta-Llama-3.1-8B-Instruct.gguf \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
q4_K_M
Quantization quality ladder (best to worst):
- Q8_0: Virtually no quality loss, 2x memory reduction
- Q6_K: Excellent quality, 2.5x reduction (recommended for production)
- Q5_K_M: Very good quality, 3x reduction (best balance)
- Q4_K_M: Good quality, 4x reduction (most popular)
- Q3_K_M: Noticeable but acceptable for simple tasks
- Q2_K: Significant degradation, only for testing
For production systems, use Q5_K_M or Q4_K_M. The quality gap between FP16 and Q4_K_M is typically <2% on standard benchmarks.
Serving with vLLM
vLLM is the gold standard for production LLM serving. It includes PagedAttention for efficient KV-cache management, continuous batching, and OpenAI-compatible API:
# Start vLLM server with Llama 3.1 8B
vllm serve /models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 32 \
--enforce-eager \
--api-key sk-your-key \
--port 8000
Performance tuning tips:
--max-num-seqs 32: Balances throughput and latency for chat use cases--gpu-memory-utilization 0.90: Leaves headroom for KV-cache growth--enforce-eager: Reduces first-token latency (disables CUDA graphs)--tensor-parallel-size N: Set to number of GPUs for larger models
Alternative Serving Options
Ollama is the easiest way to get started—a single command serves any GGUF model:
ollama pull llama3.1:8b-q4_K_M
ollama serve # Runs on localhost:11434
Ollama is great for development, single-user, or low-throughput scenarios. For production, vLLM delivers 3-5x higher throughput.
llama.cpp offers the best flexibility for CPU + GPU hybrid setups. It can split layers between GPU and CPU, useful when VRAM is insufficient.
./llama-server \
-m Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-ngl 32 # Offload 32 layers to GPU, rest stay on CPU
--host 0.0.0.0 \
--port 8080
Cost Analysis: Self-Hosted vs API
For a 7B model serving 1M tokens/day:
| Cost Factor | Self-Hosted (RTX 3090) | OpenAI GPT-4o-mini | |-------------|----------------------|-------------------| | Hardware (one-time) | $700 | $0 | | Power (annual) | ~$200 | $0 | | GPU amortization (3yr) | $0.64/day | $0 | | API cost (1M tok/day) | $0 | $20/day | | Maintenance | ~$50/month | $0 | | Monthly Total | ~$100-150 | ~$600 |
At 1M tokens/day, self-hosting pays for the GPU in the first 6-8 weeks. Beyond that, it's 4-6x cheaper than API-based alternatives for equivalent model quality.
Monitoring and Observability
Your self-hosted model needs the same monitoring as any production service:
# Prometheus metrics for vLLM
from prometheus_client import Histogram, Counter
llm_request_duration = Histogram(
'llm_request_duration_seconds',
'LLM request latency',
['model', 'endpoint'],
buckets=[0.1, 0.5, 1, 2, 5, 10, 30]
)
llm_tokens_per_second = Histogram(
'llm_tokens_per_second',
'Generation throughput',
['model']
)
llm_errors = Counter(
'llm_errors_total',
'Total LLM errors',
['model', 'error_type']
)
Set alerts for GPU memory utilization > 90%, request latency > 5 seconds P99, and error rate > 1%.
At SoniNow, we help clients deploy and optimize open-source LLMs on their infrastructure. Our AI automation services cover hardware planning, model quantization, serving configuration, and production monitoring.
Self-hosting is the right choice when data privacy, cost predictability, or latency matter most. Contact us to design a self-hosted LLM infrastructure tailored to your workload.
Related Insights

Building AI Chatbots for Customer Support: A Complete Technical Guide
A technical guide to building AI-powered customer support chatbots including LLM integration, RAG architecture, conversation design, escalation workflows, and performance monitoring.

AI-Generated Code: Using LLMs for Development Workflows in 2026
Learn how to effectively use AI-generated code in development workflows including prompt patterns for code, review strategies, security considerations, and integration with CI/CD.

Building AI Agents That Actually Work: Architecture and Orchestration Patterns
Learn production architecture patterns for building reliable AI agents including task planning, tool use, memory systems, reflection loops, and human-in-the-loop workflows.