Open-Source LLM Deployment: Running Llama 3, Mistral, and Gemma on Your Infrastructure | SoniNow Blog

Limited TimeLearn More

llmopen sourcellamamistralself-hosting

Open-Source LLM Deployment: Running Llama 3, Mistral, and Gemma on Your Infrastructure

Published

2026-06-23

Read Time

4 mins

Open-Source LLM Deployment: Running Llama 3, Mistral, and Gemma on Your Infrastructure

Self-hosting open-source LLMs gives you complete control over your data, costs, and latency. No rate limits. No API price changes. No data leaving your infrastructure. But deploying these models efficiently requires understanding quantization, serving infrastructure, and GPU optimization. Here's how to do it right.

Hardware Requirements

The first decision is what hardware you need. The good news: you don't need a cluster of H100s to run useful models.

| Model | Size | RAM/VRAM (FP16) | Quantized (Q4_K_M) | Recommended GPU | |-------|------|-----------------|-------------------|-----------------| | Llama 3.2 3B | 3B params | 6 GB | 2.5 GB | RTX 3060 (12GB) | | Gemma 2 9B | 9B params | 18 GB | 6 GB | RTX 3090 (24GB) | | Mistral Small 7B | 7B params | 14 GB | 4.5 GB | RTX 3090 (24GB) | | Llama 3.1 8B | 8B params | 16 GB | 5.5 GB | RTX 3090 (24GB) | | Mistral Large 22B | 22B params | 44 GB | 14 GB | 2x RTX 3090 | | Llama 3.1 70B | 70B params | 140 GB | 42 GB | 2x A100 (80GB) | | Mixtral 8x22B | 141B MoE | 84 GB (active) | 25 GB | 2x A100 |

Rule of thumb: A single RTX 3090 (24GB, ~$700 used) handles most 7B-9B models at Q4 quantization with excellent throughput (30-50 tokens/second). For 70B-class models, you need at least 2x A100 or 4x RTX 4090.

Quantization: Making Models Fit

Quantization reduces model precision from 16-bit floats to 4-bit or 8-bit integers, dramatically reducing memory requirements with minimal quality loss:

# Quantize Llama 3.1 8B to Q4_K_M using llama.cpp
./llama.cpp/build/bin/quantize \
  Meta-Llama-3.1-8B-Instruct.gguf \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  q4_K_M

Quantization quality ladder (best to worst):

  • Q8_0: Virtually no quality loss, 2x memory reduction
  • Q6_K: Excellent quality, 2.5x reduction (recommended for production)
  • Q5_K_M: Very good quality, 3x reduction (best balance)
  • Q4_K_M: Good quality, 4x reduction (most popular)
  • Q3_K_M: Noticeable but acceptable for simple tasks
  • Q2_K: Significant degradation, only for testing

For production systems, use Q5_K_M or Q4_K_M. The quality gap between FP16 and Q4_K_M is typically <2% on standard benchmarks.

Serving with vLLM

vLLM is the gold standard for production LLM serving. It includes PagedAttention for efficient KV-cache management, continuous batching, and OpenAI-compatible API:

# Start vLLM server with Llama 3.1 8B
vllm serve /models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 \
  --enforce-eager \
  --api-key sk-your-key \
  --port 8000

Performance tuning tips:

  • --max-num-seqs 32: Balances throughput and latency for chat use cases
  • --gpu-memory-utilization 0.90: Leaves headroom for KV-cache growth
  • --enforce-eager: Reduces first-token latency (disables CUDA graphs)
  • --tensor-parallel-size N: Set to number of GPUs for larger models

Alternative Serving Options

Ollama is the easiest way to get started—a single command serves any GGUF model:

ollama pull llama3.1:8b-q4_K_M
ollama serve  # Runs on localhost:11434

Ollama is great for development, single-user, or low-throughput scenarios. For production, vLLM delivers 3-5x higher throughput.

llama.cpp offers the best flexibility for CPU + GPU hybrid setups. It can split layers between GPU and CPU, useful when VRAM is insufficient.

./llama-server \
  -m Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 32  # Offload 32 layers to GPU, rest stay on CPU
  --host 0.0.0.0 \
  --port 8080

Cost Analysis: Self-Hosted vs API

For a 7B model serving 1M tokens/day:

| Cost Factor | Self-Hosted (RTX 3090) | OpenAI GPT-4o-mini | |-------------|----------------------|-------------------| | Hardware (one-time) | $700 | $0 | | Power (annual) | ~$200 | $0 | | GPU amortization (3yr) | $0.64/day | $0 | | API cost (1M tok/day) | $0 | $20/day | | Maintenance | ~$50/month | $0 | | Monthly Total | ~$100-150 | ~$600 |

At 1M tokens/day, self-hosting pays for the GPU in the first 6-8 weeks. Beyond that, it's 4-6x cheaper than API-based alternatives for equivalent model quality.

Monitoring and Observability

Your self-hosted model needs the same monitoring as any production service:

# Prometheus metrics for vLLM
from prometheus_client import Histogram, Counter

llm_request_duration = Histogram(
    'llm_request_duration_seconds',
    'LLM request latency',
    ['model', 'endpoint'],
    buckets=[0.1, 0.5, 1, 2, 5, 10, 30]
)

llm_tokens_per_second = Histogram(
    'llm_tokens_per_second',
    'Generation throughput',
    ['model']
)

llm_errors = Counter(
    'llm_errors_total',
    'Total LLM errors',
    ['model', 'error_type']
)

Set alerts for GPU memory utilization > 90%, request latency > 5 seconds P99, and error rate > 1%.

At SoniNow, we help clients deploy and optimize open-source LLMs on their infrastructure. Our AI automation services cover hardware planning, model quantization, serving configuration, and production monitoring.

Self-hosting is the right choice when data privacy, cost predictability, or latency matter most. Contact us to design a self-hosted LLM infrastructure tailored to your workload.