# Grafana + Prometheus + Loki — A Homelab Monitoring Stack
## 1. What You’re Building

Three separate tools, one pane of glass:
| Component | Job | What it stores |
|---|---|---|
| Prometheus | Scrapes numerical metrics from exporters every 15s | Time-series numbers (CPU %, req/s, bytes…) |
| Loki | Ingests log lines shipped to it by Promtail | Indexed log streams (labelled, not full-text) |
| Grafana | Queries both and draws graphs / tables / logs | Nothing — it’s a view layer |
| Alertmanager | Takes alerts fired by Prometheus rules, routes to Discord / email / Slack | In-memory + on-disk silences |
Supporting cast — the exporters:
| Exporter | Where it runs | What it exposes |
|---|---|---|
| node-exporter | On the host, sees /proc, /sys, / | CPU, memory, disk, network, load averages, filesystems |
| cAdvisor | Reads /var/lib/docker + cgroups | Per-container CPU, memory, network, disk IO |
| pihole-exporter | Talks to the Pi-hole admin API | Queries/s, block rate, top clients, top domains |
Together they answer questions like “why is my NAS IO weird at 3am?”, “which container is eating all the RAM?”, and “is Pi-hole actually blocking anything?” — with graphs that go back weeks.
## 2. The Compose File

The whole stack lives in a single `compose.yaml`:
```yaml
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    mem_limit: 256m
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  pihole-exporter:
    image: ekofr/pihole-exporter:latest
    container_name: pihole-exporter
    restart: unless-stopped
    environment:
      - PIHOLE_HOSTNAME=192.168.1.80
      - PIHOLE_PORT=80
      - PORT=9617
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    user: "${PUID}:${PGID}"
    volumes:
      - /mnt/nfs/docker/docker/prometheus/config:/etc/prometheus
      - /mnt/nfs/docker/docker/prometheus/data:/prometheus
    networks:
      - monitoring
    restart: unless-stopped
    depends_on:
      - node-exporter
      - cadvisor
      - pihole-exporter

  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    user: "0"   # needs root to write to NFS
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - /mnt/nfs/docker/docker/loki/config:/etc/loki
      - /mnt/nfs/docker/docker/loki/data:/loki
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    command: -config.file=/etc/promtail/promtail-config.yaml
    volumes:
      - /mnt/nfs/docker/docker/promtail/config:/etc/promtail
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    networks:
      - monitoring
    depends_on:
      - loki

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    user: "0"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    volumes:
      - /mnt/nfs/docker/docker/alertmanager/config:/etc/alertmanager
      - /mnt/nfs/docker/docker/alertmanager/data:/alertmanager
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD}
      - GF_SERVER_ROOT_URL=https://grafana.falseviking.uk
      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/home.json
    ports:
      - "3001:3000"
    volumes:
      - /mnt/nfs/docker/docker/grafana:/var/lib/grafana
      - /mnt/nfs/docker/docker/grafana/provisioning:/etc/grafana/provisioning
    networks:
      - monitoring
      - proxy
    restart: unless-stopped
    depends_on:
      - prometheus
      - loki

networks:
  monitoring:
    driver: bridge
  proxy:
    external: true
```

## 3. Prometheus Configuration
Prometheus’s whole job is: pull metrics from a list of URLs every N seconds and store them. Configure it via `prometheus.yml`:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: pihole
    static_configs:
      - targets: ['pihole-exporter:9617']

  - job_name: crowdsec
    static_configs:
      - targets: ['192.168.1.210:6060']

  - job_name: traefik
    static_configs:
      - targets: ['192.168.1.210:8080']
```

The pattern is always the same:
- Something exposes `/metrics` on an HTTP port (exporters, Traefik with `--metrics.prometheus=true`, CrowdSec with `prometheus.level: full`, etc.).
- You add a `job_name` + `targets` entry.
- Prometheus scrapes it every 15s.
Targets can be inside Docker (`service-name:port`) or outside (`192.168.x.y:port`) — Prometheus doesn’t care.
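As a sketch, wiring in one more exporter (say, a hypothetical qBittorrent exporter on another host; the job name, IP, and port are placeholders, not part of this stack) is just another stanza under `scrape_configs`:

```yaml
scrape_configs:
  # Hypothetical extra target; name, IP, and port are placeholders
  - job_name: qbittorrent
    static_configs:
      - targets: ['192.168.1.50:8000']
```

Restart Prometheus (or send it SIGHUP, which triggers a config reload) and the new target appears under Status → Targets.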
## 4. Loki + Promtail for Logs

Loki stores logs. Promtail ships them to Loki. On this host, Promtail discovers containers via the Docker socket and auto-ships every container’s stdout/stderr:
```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'logstream'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: 'compose_project'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: 'compose_service'
```

What you get in Loki:
- Labels: `container`, `logstream` (stdout/stderr), `compose_project`, `compose_service`
- Full log bodies, searchable with LogQL
- Automatic pickup — spin up a new container and its logs just appear
A typical LogQL query:

```logql
{container="traefik"} |= "error"
```

It is, on purpose, very close to how Prometheus queries feel: labels on the left to select streams, filters on the right.
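The same shape scales up to light analytics. These are generic LogQL forms (the label values here are illustrative, not queries taken from this setup):

```logql
# All error lines from one compose project
{compose_project="monitoring"} |= "error"

# Error count per container over 5m windows; graphable in a panel
sum by (container) (count_over_time({container=~".+"} |= "error" [5m]))
```

The second form is how you turn logs into a metric-shaped graph without touching Prometheus at all.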
## 5. Grafana Provisioning

Grafana has a database-backed config (users, dashboards, datasources), but you should provision it via files. That way the whole stack can be destroyed and rebuilt without clicks.
Datasources:
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: PBFA97CFB590B2093
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Loki
    type: loki
    uid: P8E80F9AEF21F6940
    access: proxy
    url: http://loki:3100
    editable: true
```

The `uid` is important. Dashboards reference datasources by UID — so if you share a dashboard JSON between stacks, keep the UIDs stable or you’ll get “datasource not found” on every panel.
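For reference, this is roughly what that reference looks like inside a dashboard’s JSON in recent Grafana schema versions (heavily abbreviated):

```json
{
  "panels": [
    {
      "title": "CPU",
      "datasource": { "type": "prometheus", "uid": "PBFA97CFB590B2093" }
    }
  ]
}
```

If the `uid` in the JSON doesn’t match a provisioned datasource, that panel is the one that breaks.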
Dashboards — point a provider at a folder:
```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
```

Anything dropped into `/var/lib/grafana/dashboards/*.json` is picked up automatically. This is where Grafana.com dashboard imports end up — download the JSON from grafana.com/dashboards, drop it in, done.
## 6. Useful Dashboards to Start With

Grafana.com has thousands; most are overbuilt. A minimal homelab kit:
| Dashboard | ID | What it shows |
|---|---|---|
| Node Exporter Full | 1860 | Host-level CPU / memory / disk / network — everything |
| Docker & System Monitoring | 893 | Per-container resource graphs from cAdvisor |
| Pi-hole | 10176 | Block rate, query clients, top domains |
| CrowdSec Engine | 21419 | Active decisions, bucket activity, LAPI calls |
| Traefik 2 | 17346 | Request rate, latency, 4xx/5xx by router |
After that, build your own. Every panel is a PromQL or LogQL query — the dashboards are just saved queries with graph configs.
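A few starting-point PromQL queries for home-built panels. These use standard node-exporter and cAdvisor metric names, so they should match what this stack scrapes:

```promql
# Per-container memory working set (cAdvisor)
sum by (name) (container_memory_working_set_bytes{name=~".+"})

# Host CPU usage %, derived from the idle counter (node-exporter)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Network receive throughput per interface, bytes/s
rate(node_network_receive_bytes_total[5m])
```

Paste any of these into Explore first; once the shape looks right, save it as a panel.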
## 7. Alert Rules That Matter

Prometheus evaluates alert rules every `evaluation_interval` and hands firing alerts to Alertmanager. The trick is to write alerts that only fire on things you’d actually wake up for.
```yaml
groups:
  - name: container_alerts
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"}) > 60
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: ContainerHighCPU
        expr: (sum(rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) by (name) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high CPU usage"

  - name: host_alerts
    rules:
      - alert: DiskSpaceHigh
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 5m
        labels:
          severity: warning

      - alert: DiskSpaceHighNFS
        expr: (1 - (node_filesystem_avail_bytes{mountpoint=~"/mnt/nfs.*"} / node_filesystem_size_bytes{mountpoint=~"/mnt/nfs.*"})) * 100 > 85
        for: 5m
        labels:
          severity: warning

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning

  - name: security_alerts
    rules:
      - alert: CrowdSecNewBan
        expr: increase(cs_active_decisions[5m]) > 0
        for: 0m
        labels:
          severity: info
```

The `for:` clause is the anti-flap filter — a condition must hold for N minutes before firing. `ContainerHighCPU` with `for: 5m` means a single spike won’t page you; a sustained load will.
`{{ $labels.name }}` and `{{ $value }}` are Go templates expanded when the alert fires — they end up in the Alertmanager notification.
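For example, a `description` annotation can embed the measured value; `humanize` is one of Prometheus’s built-in template functions. This extra annotation is an illustration, not part of the rules above:

```yaml
annotations:
  summary: "Container {{ $labels.name }} high CPU usage"
  description: "CPU at {{ $value | humanize }}% for the last 5 minutes"
```

The expanded `description` then flows through to whatever receiver Alertmanager routes it to.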
## 8. Alertmanager Routing

Prometheus fires; Alertmanager routes. Minimal config:
```yaml
route:
  receiver: discord
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: discord
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/XXX/YYY
        title: '{{ .GroupLabels.alertname }}'
        message: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
```

## 9. Reverse Proxy
Grafana binds `:3001` on the host (mapped to its internal `:3000`). Route `grafana.example.com` → `grafana:3000` in your reverse proxy over HTTP. Set `GF_SERVER_ROOT_URL` to the full public URL so login redirects work.
If you front it with SSO (Authentik, Keycloak), you also want:

```yaml
environment:
  - GF_AUTH_PROXY_ENABLED=true
  - GF_AUTH_PROXY_HEADER_NAME=X-authentik-username
  - GF_AUTH_PROXY_HEADER_PROPERTY=username
  - GF_AUTH_PROXY_AUTO_SIGN_UP=true
  - GF_AUTH_PROXY_WHITELIST=<your-proxy-ip>
```

…and then configure the reverse proxy to inject the username header only on authenticated requests.
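With Traefik and Authentik, that header injection is typically a forwardAuth middleware along these lines; the outpost address is an assumption about your Authentik deployment, so adjust it:

```yaml
http:
  middlewares:
    authentik:
      forwardAuth:
        # Authentik's embedded-outpost forward-auth endpoint (address is an assumption)
        address: http://authentik-server:9000/outpost.goauthentik.io/auth/traefik
        trustForwardHeader: true
        # Only headers listed here are copied from the auth response to the upstream
        authResponseHeaders:
          - X-authentik-username
```

Because `authResponseHeaders` whitelists what gets forwarded, the username header only ever reaches Grafana on requests Authentik has approved.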
## 10. Day-to-Day: What Grafana Is Actually For

Once it’s up, three workflows dominate:
1. **“Is everything OK right now?”** Open the home dashboard. Glance at CPU, memory, disk, container status. Takes 5 seconds.
2. **“Why did X break at 03:15?”** Open Explore → pick Loki → filter `{container="X"}` with a time range covering 03:15. You get logs for exactly that window. Follow the timestamps into Prometheus (Explore → Prometheus, same time range) to see if it was CPU-starved, memory-exhausted, or something external.
3. **“Has Z been getting slower?”** Build a panel with the right PromQL. Widen the time range to 30 days. See the trend. If it’s real, dig in.
The two most valuable keybindings in Grafana:
- `d+h` — jump to the home dashboard from anywhere
- `t+z` — zoom out the time range (`t+w` shifts earlier; good for “what happened before now”)
## 11. Storage & Retention

All three components store data:
| Component | Default retention | Where it goes |
|---|---|---|
| Prometheus | 15 days | /prometheus volume — TSDB files |
| Loki | Unlimited unless configured | /loki — chunks (compressed log data) + index |
| Grafana | Forever | /var/lib/grafana — SQLite with dashboards + settings |
Prometheus grows roughly linearly with the number of series you scrape. Loki grows with log volume — a talkative Plex plus a busy Traefik can easily generate hundreds of MB/day. Set retention up front:

- Prometheus: add `--storage.tsdb.retention.time=30d` to the command args.
- Loki: set `limits_config.retention_period: 720h` plus a compactor block with `delete_request_store: filesystem`.
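A minimal Loki retention sketch matching those settings, assuming filesystem storage and single-binary Loki; the compactor working directory is illustrative:

```yaml
limits_config:
  retention_period: 720h   # 30 days

compactor:
  working_directory: /loki/compactor   # illustrative path inside the /loki volume
  retention_enabled: true              # without this, retention_period is ignored
  delete_request_store: filesystem     # required once retention_enabled is true
```

Note that `retention_period` alone does nothing; the compactor with `retention_enabled: true` is what actually deletes old chunks.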
## 12. Troubleshooting

**Prometheus target is red in Status → Targets.** Either the network path is broken (test with `docker exec prometheus wget -qO- http://target:port/metrics`) or the target doesn’t expose `/metrics`.
**“No data” in a Grafana panel but Prometheus has it.** Wrong datasource UID. Edit the panel and re-pick the datasource.
**Loki says “ingester not ready” / “too many outstanding requests”.** Usually NFS being slow. Check `docker logs loki` for IO timeouts.
**Dashboards vanish on restart.** The Grafana DB is on a bind mount that lost its permissions. Pin the container to UID 472 (Grafana’s default user), or, crudely, `chmod 777` the Grafana data dir.
**Alerts stuck in “pending” and never fire.** The alert’s `for:` hasn’t elapsed, or the rule is evaluating but never matching — check the expression under Alerts in the Prometheus UI, and run the raw PromQL in Explore.
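One handy trick when debugging rules: Prometheus exposes its own alert state as a synthetic `ALERTS` series, so you can query it like any other metric in Explore:

```promql
# All alerts currently pending or firing, with their labels
ALERTS

# Just the ones stuck in pending
ALERTS{alertstate="pending"}
```

If an alert shows up here with `alertstate="pending"` but never transitions to `firing`, the condition is flapping below its `for:` window.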