
Grafana + Prometheus + Loki — A Homelab Monitoring Stack

Three headline tools (plus Alertmanager for notifications), one pane of glass:

| Component | Job | What it stores |
|---|---|---|
| Prometheus | Scrapes numerical metrics from exporters every 15s | Time-series numbers (CPU %, req/s, bytes…) |
| Loki | Ingests log lines shipped to it by Promtail | Indexed log streams (labelled, not full-text) |
| Grafana | Queries both and draws graphs / tables / logs | Nothing — it's a view layer |
| Alertmanager | Takes alerts fired by Prometheus rules, routes to Discord / email / Slack | In-memory + on-disk silences |

Supporting cast — the exporters:

| Exporter | Where it runs | What it exposes |
|---|---|---|
| node-exporter | On the host; sees /proc, /sys, / | CPU, memory, disk, network, load averages, filesystems |
| cAdvisor | Reads /var/lib/docker + cgroups | Per-container CPU, memory, network, disk IO |
| pihole-exporter | Talks to the Pi-hole admin API | Queries/s, block rate, top clients, top domains |

Together, the stack answers questions like "why is my NAS IO weird at 3am", "which container is eating all the RAM", "is Pi-hole actually blocking anything" — with graphs that go back weeks.

The whole stack lives in a single compose.yaml:

/opt/stacks/monitoring/compose.yaml
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    mem_limit: 256m
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  pihole-exporter:
    image: ekofr/pihole-exporter:latest
    container_name: pihole-exporter
    restart: unless-stopped
    environment:
      - PIHOLE_HOSTNAME=192.168.1.80
      - PIHOLE_PORT=80
      - PORT=9617
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    user: "${PUID}:${PGID}"
    volumes:
      - /mnt/nfs/docker/docker/prometheus/config:/etc/prometheus
      - /mnt/nfs/docker/docker/prometheus/data:/prometheus
    networks:
      - monitoring
    restart: unless-stopped
    depends_on:
      - node-exporter
      - cadvisor
      - pihole-exporter

  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    user: "0" # needs root to write to NFS
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - /mnt/nfs/docker/docker/loki/config:/etc/loki
      - /mnt/nfs/docker/docker/loki/data:/loki
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    command: -config.file=/etc/promtail/promtail-config.yaml
    volumes:
      - /mnt/nfs/docker/docker/promtail/config:/etc/promtail
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    networks:
      - monitoring
    depends_on:
      - loki

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    user: "0"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    volumes:
      - /mnt/nfs/docker/docker/alertmanager/config:/etc/alertmanager
      - /mnt/nfs/docker/docker/alertmanager/data:/alertmanager
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD}
      - GF_SERVER_ROOT_URL=https://grafana.falseviking.uk
      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/home.json
    ports:
      - "3001:3000"
    volumes:
      - /mnt/nfs/docker/docker/grafana:/var/lib/grafana
      - /mnt/nfs/docker/docker/grafana/provisioning:/etc/grafana/provisioning
    networks:
      - monitoring
      - proxy
    restart: unless-stopped
    depends_on:
      - prometheus
      - loki

networks:
  monitoring:
    driver: bridge
  proxy:
    external: true

Prometheus’s whole job is: pull metrics from a list of URLs every N seconds and store them. Configure it via prometheus.yml:

/mnt/nfs/docker/docker/prometheus/config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: pihole
    static_configs:
      - targets: ['pihole-exporter:9617']
  - job_name: crowdsec
    static_configs:
      - targets: ['192.168.1.210:6060']
  - job_name: traefik
    static_configs:
      - targets: ['192.168.1.210:8080']

The pattern is always the same:

  1. Something exposes /metrics on an HTTP port (exporters, Traefik with --metrics.prometheus=true, CrowdSec with prometheus.level: full, etc.).
  2. You add a job_name + targets entry.
  3. Prometheus scrapes it every 15s.

Targets can be inside Docker (service-name:port) or outside (192.168.x.y:port) — Prometheus doesn’t care.
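Whatever the target, a scrape response is just plain text in the Prometheus exposition format — one metric per line, optional labels in braces, a value at the end. A minimal sketch of reading that format (the `parse_metrics` helper and the sample payload are illustrative, not part of the stack):

```python
import re

# Illustrative sample of what an exporter's /metrics endpoint returns.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.52
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 4.2e+10
"""

LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Parse exposition-format lines into (name, labels, value) tuples."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE_RE.match(line)
        if m:
            name, raw_labels, value = m.groups()
            labels = dict(
                pair.split('=', 1) for pair in (raw_labels or '').split(',') if pair
            )
            labels = {k: v.strip('"') for k, v in labels.items()}
            out.append((name, labels, float(value)))
    return out

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

This is why adding a new target is so cheap: anything that can print lines like these over HTTP is scrapeable.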

Loki stores logs. Promtail ships them to Loki. On this host, Promtail discovers containers via the Docker socket and auto-ships every container’s stdout/stderr:

/mnt/nfs/docker/docker/promtail/config/promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'logstream'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: 'compose_project'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: 'compose_service'

What you get in Loki:

  • Labels: container, logstream (stdout/stderr), compose_project, compose_service
  • Full log bodies, searchable with LogQL
  • Automatic pickup — spin up a new container and its logs just appear

A typical LogQL query:

{container="traefik"} |= "error"

— which is, on purpose, very close to how Prometheus queries feel. Labels on the left to select streams, filters on the right.
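A few more queries, sketched against the labels the relabel_configs above attach (the label values are examples from this stack):

```logql
# Everything a compose project wrote to stderr
{compose_project="monitoring", logstream="stderr"}

# Error lines across all containers, case-insensitive regex filter
{container=~".+"} |~ "(?i)error"

# Rate of error lines per container over the last 5 minutes (a metric query —
# graphable in a panel, not just browsable in Explore)
sum by (container) (rate({container=~".+"} |= "error" [5m]))
```

The third form is the bridge between logs and metrics: once a log query produces numbers, it can drive panels and alerts like any PromQL expression.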

Grafana has a database-backed config (users, dashboards, datasources) but you should provision it via files. That way the whole stack can be destroyed and rebuilt without clicks.

Datasources:

/mnt/nfs/docker/docker/grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: PBFA97CFB590B2093
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
  - name: Loki
    type: loki
    uid: P8E80F9AEF21F6940
    access: proxy
    url: http://loki:3100
    editable: true

The uid is important. Dashboards reference datasources by UID — so if you share a dashboard JSON between stacks, keep the UIDs stable or you’ll get “datasource not found” on every panel.
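If a shared dashboard JSON references a different UID, you can rewrite it in place rather than re-picking the datasource on every panel. A rough sketch, assuming the field layout of recent Grafana exports, where each panel's datasource is a `{"type": ..., "uid": ...}` object (the `rewrite_datasource_uids` helper and the sample fragment are mine, not Grafana's):

```python
import json

def rewrite_datasource_uids(obj, mapping):
    """Recursively replace datasource UIDs anywhere in a dashboard JSON tree.

    mapping: {"old-uid": "new-uid"}
    """
    if isinstance(obj, dict):
        ds = obj.get("datasource")
        if isinstance(ds, dict) and ds.get("uid") in mapping:
            ds["uid"] = mapping[ds["uid"]]
        for v in obj.values():
            rewrite_datasource_uids(v, mapping)
    elif isinstance(obj, list):
        for v in obj:
            rewrite_datasource_uids(v, mapping)
    return obj

# Illustrative dashboard fragment with one foreign UID
dash = {
    "panels": [
        {"title": "CPU", "datasource": {"type": "prometheus", "uid": "old123"}},
        {"title": "Logs", "datasource": {"type": "loki", "uid": "loki456"}},
    ]
}
rewrite_datasource_uids(dash, {"old123": "PBFA97CFB590B2093"})
print(json.dumps(dash, indent=2))
```

Run it over the downloaded JSON before dropping the file into the provisioned dashboards folder.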

Dashboards — point a provider at a folder:

/mnt/nfs/docker/docker/grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false

Anything dropped in /var/lib/grafana/dashboards/*.json is picked up automatically. This is where Grafana.com dashboard imports end up — download the JSON from grafana.com/dashboards, drop it in, done.

Grafana.com has thousands; most are overbuilt. A minimal homelab kit:

| Dashboard | ID | What it shows |
|---|---|---|
| Node Exporter Full | 1860 | Host-level CPU / memory / disk / network — everything |
| Docker & System Monitoring | 893 | Per-container resource graphs from cAdvisor |
| Pi-hole | 10176 | Block rate, query clients, top domains |
| CrowdSec Engine | 21419 | Active decisions, bucket activity, LAPI calls |
| Traefik 2 | 17346 | Request rate, latency, 4xx/5xx by router |

After that, build your own. Every panel is a PromQL or LogQL query — the dashboards are just saved queries with graph configs.
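Some starting points for custom panels, using metric names node-exporter and cAdvisor actually expose (thresholds and label matchers are up to you):

```promql
# Per-container memory working set (cAdvisor)
container_memory_working_set_bytes{name=~".+"}

# Per-container CPU as a percentage of one core (cAdvisor)
sum by (name) (rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) * 100

# Host network throughput, bytes/s received, excluding loopback (node-exporter)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
```

Paste any of these into Explore first; once the shape looks right, save it as a panel.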

Prometheus evaluates alert rules every evaluation_interval and hands firing alerts to Alertmanager. The trick is to write alerts that only fire on things you’d actually wake up for.

/mnt/nfs/docker/docker/prometheus/config/rules/alerts.yml
groups:
  - name: container_alerts
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"}) > 60
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"
      - alert: ContainerHighCPU
        expr: (sum(rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) by (name) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high CPU usage"
  - name: host_alerts
    rules:
      - alert: DiskSpaceHigh
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
      - alert: DiskSpaceHighNFS
        expr: (1 - (node_filesystem_avail_bytes{mountpoint=~"/mnt/nfs.*"} / node_filesystem_size_bytes{mountpoint=~"/mnt/nfs.*"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
  - name: security_alerts
    rules:
      - alert: CrowdSecNewBan
        expr: increase(cs_active_decisions[5m]) > 0
        for: 0m
        labels:
          severity: info

The for: clause is the anti-flap filter — a condition must hold for N minutes before firing. ContainerHighCPU with for: 5m means a single spike won’t page you; a sustained load will.

{{ $labels.name }} and {{ $value }} are Go templates expanded when the alert fires — they end up in the Alertmanager notification.

Prometheus fires; Alertmanager routes. Minimal config:

alertmanager.yml
route:
  receiver: discord
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: discord
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/XXX/YYY
        title: '{{ .GroupLabels.alertname }}'
        message: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'

Grafana binds :3001 on the host (mapped to its internal :3000). Route grafana.example.com → grafana:3000 in your reverse proxy over HTTP. Set GF_SERVER_ROOT_URL to the full public URL so login redirects work.

If you front it with SSO (Authentik, Keycloak), you also want:

environment:
  - GF_AUTH_PROXY_ENABLED=true
  - GF_AUTH_PROXY_HEADER_NAME=X-authentik-username
  - GF_AUTH_PROXY_HEADER_PROPERTY=username
  - GF_AUTH_PROXY_AUTO_SIGN_UP=true
  - GF_AUTH_PROXY_WHITELIST=<your-proxy-ip>

…and then configure the reverse proxy to inject the username header only on authenticated requests.
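With Traefik, for instance, that injection is what the forwardAuth middleware does — a sketch of the dynamic config, assuming a standard Authentik embedded-outpost setup (the outpost address is an assumption; substitute your own):

```yaml
# Traefik dynamic config (sketch). forwardAuth sends each request to
# Authentik first and, on success, copies the listed response headers
# through to Grafana — unauthenticated requests never reach it.
http:
  middlewares:
    authentik:
      forwardAuth:
        # Assumed Authentik outpost address — adjust for your deployment
        address: http://authentik-outpost:9000/outpost.goauthentik.io/auth/traefik
        trustForwardHeader: true
        authResponseHeaders:
          - X-authentik-username
```

Attach the middleware to the Grafana router only; any route without it won't carry the header, so Grafana falls back to its normal login.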

10. Day-to-Day: What Grafana Is Actually For


Once it’s up, three workflows dominate:

1. “Is everything OK right now?” Open the home dashboard. Glance at CPU, memory, disk, container status. Takes 5 seconds.

2. “Why did X break at 03:15?” Open Explore → pick Loki → filter {container="X"} with a time range covering 03:15. Logs for exactly that window. Follow the timestamps into Prometheus (Explore → Prometheus, same time range) to see if it was CPU-starved, memory-exhausted, or something external.

3. “Has Z been getting slower?” Build a panel with the right PromQL. Widen the time range to 30 days. See the trend. If it’s real, dig in.

The two most valuable keybindings in Grafana:

  • g+h — jump to the home dashboard from anywhere
  • t+z — zoom out the time range (t+← shifts the range earlier; good for “what happened before now”)

All three components store data:

| Component | Default retention | Where it goes |
|---|---|---|
| Prometheus | 15 days | /prometheus volume — TSDB files |
| Loki | Unlimited unless configured | /loki — chunks (compressed log data) + index |
| Grafana | Forever | /var/lib/grafana — SQLite with dashboards + settings |

Prometheus grows roughly linearly with the number of series you scrape. Loki grows with log volume — a talkative Plex plus a busy Traefik can easily generate hundreds of MB per day. Set retention up front:

Prometheus: add --storage.tsdb.retention.time=30d to the command args.

Loki: set limits_config.retention_period: 720h + a compactor block with delete_request_store: filesystem.
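In Loki's config those two settings look roughly like this (a sketch for a single-node filesystem deployment; `delete_request_store` is required once `retention_enabled` is on in Loki 3.x — adjust paths to your volume layout):

```yaml
# loki-config.yaml (fragment) — 30-day retention enforced by the compactor
limits_config:
  retention_period: 720h          # 30 days
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem
```

Without a compactor block, `retention_period` is a limit Loki never actually enforces — chunks just accumulate on disk.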

A Prometheus target is red in Status → Targets. Either the network path is broken (try docker exec prometheus wget -qO- http://target:port/metrics) or the target doesn’t expose /metrics.

“No data” in a Grafana panel but Prometheus has it. Wrong datasource UID. Edit the panel, re-pick the datasource.

Loki says “ingester not ready” / “too many outstanding requests”. Usually NFS being slow. Check docker logs loki for IO timeouts.

Dashboards vanish on restart. The Grafana DB is on a bind mount that lost its permissions. chmod 777 the Grafana data dir (or pin the container to UID 472).

Alerts stuck in “pending” and never fire. The alert’s for: hasn’t elapsed. Or the rule is evaluating but never matching — check Alerts → expression in the Prometheus UI, run the raw PromQL in Explore.