
Grafana + Prometheus + Loki — A Homelab Monitoring Stack

Three headline tools (plus Alertmanager for notifications), one pane of glass:

| Component | Job | What it stores |
|---|---|---|
| Prometheus | Scrapes numerical metrics from exporters every 15s | Time-series numbers (CPU %, req/s, bytes…) |
| Loki | Ingests log lines shipped to it by Promtail | Indexed log streams (labelled, not full-text) |
| Grafana | Queries both and draws graphs / tables / logs | Nothing — it's a view layer |
| Alertmanager | Takes alerts fired by Prometheus rules, routes to Discord / email / Slack | In-memory + on-disk silences |

Supporting cast — the exporters:

| Exporter | Where it runs | What it exposes |
|---|---|---|
| node-exporter | On the host; sees /proc, /sys, / | CPU, memory, disk, network, load averages, filesystems |
| cAdvisor | Reads /var/lib/docker + cgroups | Per-container CPU, memory, network, disk IO |
| pihole-exporter | Talks to the Pi-hole admin API | Queries/s, block rate, top clients, top domains |

Together, the stack answers questions like "why is my NAS IO weird at 3am", "which container is eating all the RAM", "is Pi-hole actually blocking anything" — with graphs that go back weeks.

The whole stack lives in a single compose.yaml:

/opt/stacks/monitoring/compose.yaml
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    mem_limit: 256m
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  pihole-exporter:
    image: ekofr/pihole-exporter:latest
    container_name: pihole-exporter
    restart: unless-stopped
    environment:
      - PIHOLE_HOSTNAME=192.168.1.80
      - PIHOLE_PORT=80
      - PORT=9617
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    user: "${PUID}:${PGID}"
    volumes:
      - /mnt/nfs/docker/docker/prometheus/config:/etc/prometheus
      - /mnt/nfs/docker/docker/prometheus/data:/prometheus
    networks:
      - monitoring
    restart: unless-stopped
    depends_on:
      - node-exporter
      - cadvisor
      - pihole-exporter

  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    user: "0" # needs root to write to NFS
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - /mnt/nfs/docker/docker/loki/config:/etc/loki
      - /mnt/nfs/docker/docker/loki/data:/loki
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    command: -config.file=/etc/promtail/promtail-config.yaml
    volumes:
      - /mnt/nfs/docker/docker/promtail/config:/etc/promtail
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    networks:
      - monitoring
    depends_on:
      - loki

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    user: "0"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    volumes:
      - /mnt/nfs/docker/docker/alertmanager/config:/etc/alertmanager
      - /mnt/nfs/docker/docker/alertmanager/data:/alertmanager
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD}
      - GF_SERVER_ROOT_URL=https://grafana.falseviking.uk
      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/home.json
    ports:
      - "3001:3000"
    volumes:
      - /mnt/nfs/docker/docker/grafana:/var/lib/grafana
      - /mnt/nfs/docker/docker/grafana/provisioning:/etc/grafana/provisioning
    networks:
      - monitoring
      - proxy
    restart: unless-stopped
    depends_on:
      - prometheus
      - loki

networks:
  monitoring:
    driver: bridge
  proxy:
    external: true

Prometheus’s whole job is: pull metrics from a list of URLs every N seconds and store them. Configure it via prometheus.yml:

/mnt/nfs/docker/docker/prometheus/config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: pihole
    static_configs:
      - targets: ['pihole-exporter:9617']
  - job_name: crowdsec
    static_configs:
      - targets: ['192.168.1.210:6060']
  - job_name: traefik
    static_configs:
      - targets: ['192.168.1.210:8080']

The pattern is always the same:

  1. Something exposes /metrics on an HTTP port (exporters, Traefik with --metrics.prometheus=true, CrowdSec with prometheus.level: full, etc.).
  2. You add a job_name + targets entry.
  3. Prometheus scrapes it every 15s.

Targets can be inside Docker (service-name:port) or outside (192.168.x.y:port) — Prometheus doesn’t care.
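Whatever the target, a scrape response is just plain text in the Prometheus exposition format — one metric per line, optional labels in braces, a value at the end. A minimal sketch of reading that format (the `parse_metrics` helper and the sample payload are illustrative, not part of the stack):

```python
import re

# Illustrative sample of what an exporter's /metrics endpoint returns.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.52
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 4.2e+10
"""

LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Parse exposition-format lines into (name, labels, value) tuples."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE_RE.match(line)
        if m:
            name, raw_labels, value = m.groups()
            labels = dict(
                pair.split('=', 1) for pair in (raw_labels or '').split(',') if pair
            )
            labels = {k: v.strip('"') for k, v in labels.items()}
            out.append((name, labels, float(value)))
    return out

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

This is why adding a new target is so cheap: anything that can print lines like these over HTTP is scrapeable.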

Loki stores logs. Promtail ships them to Loki. On this host, Promtail discovers containers via the Docker socket and auto-ships every container’s stdout/stderr:

/mnt/nfs/docker/docker/promtail/config/promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'logstream'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: 'compose_project'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: 'compose_service'

What you get in Loki:

  • Labels: container, logstream (stdout/stderr), compose_project, compose_service
  • Full log bodies, searchable with LogQL
  • Automatic pickup — spin up a new container and its logs just appear

A typical LogQL query:

{container="traefik"} |= "error"

— which is, on purpose, very close to how Prometheus queries feel. Labels on the left to select streams, filters on the right.
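A few more queries, sketched against the labels the relabel_configs above attach (the label values are examples from this stack):

```logql
# Everything a compose project wrote to stderr
{compose_project="monitoring", logstream="stderr"}

# Error lines across all containers, case-insensitive regex filter
{container=~".+"} |~ "(?i)error"

# Rate of error lines per container over the last 5 minutes (a metric query —
# graphable in a panel, not just browsable in Explore)
sum by (container) (rate({container=~".+"} |= "error" [5m]))
```

The third form is the bridge between logs and metrics: once a log query produces numbers, it can drive panels and alerts like any PromQL expression.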

Grafana has a database-backed config (users, dashboards, datasources) but you should provision it via files. That way the whole stack can be destroyed and rebuilt without clicks.

Datasources:

/mnt/nfs/docker/docker/grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: PBFA97CFB590B2093
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
  - name: Loki
    type: loki
    uid: P8E80F9AEF21F6940
    access: proxy
    url: http://loki:3100
    editable: true

The uid is important. Dashboards reference datasources by UID — so if you share a dashboard JSON between stacks, keep the UIDs stable or you’ll get “datasource not found” on every panel.
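If a shared dashboard JSON references a different UID, you can rewrite it in place rather than re-picking the datasource on every panel. A rough sketch, assuming the field layout of recent Grafana exports, where each panel's datasource is a `{"type": ..., "uid": ...}` object (the `rewrite_datasource_uids` helper and the sample fragment are mine, not Grafana's):

```python
import json

def rewrite_datasource_uids(obj, mapping):
    """Recursively replace datasource UIDs anywhere in a dashboard JSON tree.

    mapping: {"old-uid": "new-uid"}
    """
    if isinstance(obj, dict):
        ds = obj.get("datasource")
        if isinstance(ds, dict) and ds.get("uid") in mapping:
            ds["uid"] = mapping[ds["uid"]]
        for v in obj.values():
            rewrite_datasource_uids(v, mapping)
    elif isinstance(obj, list):
        for v in obj:
            rewrite_datasource_uids(v, mapping)
    return obj

# Illustrative dashboard fragment with one foreign UID
dash = {
    "panels": [
        {"title": "CPU", "datasource": {"type": "prometheus", "uid": "old123"}},
        {"title": "Logs", "datasource": {"type": "loki", "uid": "loki456"}},
    ]
}
rewrite_datasource_uids(dash, {"old123": "PBFA97CFB590B2093"})
print(json.dumps(dash, indent=2))
```

Run it over the downloaded JSON before dropping the file into the provisioned dashboards folder.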

Dashboards — point a provider at a folder:

/mnt/nfs/docker/docker/grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false

Anything dropped in /var/lib/grafana/dashboards/*.json is picked up automatically. This is where Grafana.com dashboard imports end up — download the JSON from grafana.com/dashboards, drop it in, done.

Grafana.com has thousands; most are overbuilt. A minimal homelab kit:

| Dashboard | ID | What it shows |
|---|---|---|
| Node Exporter Full | 1860 | Host-level CPU / memory / disk / network — everything |
| Docker & System Monitoring | 893 | Per-container resource graphs from cAdvisor |
| Pi-hole | 10176 | Block rate, query clients, top domains |
| CrowdSec Engine | 21419 | Active decisions, bucket activity, LAPI calls |
| Traefik 2 | 17346 | Request rate, latency, 4xx/5xx by router |

After that, build your own. Every panel is a PromQL or LogQL query — the dashboards are just saved queries with graph configs.
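Some starting points for custom panels, using metric names node-exporter and cAdvisor actually expose (thresholds and label matchers are up to you):

```promql
# Per-container memory working set (cAdvisor)
container_memory_working_set_bytes{name=~".+"}

# Per-container CPU as a percentage of one core (cAdvisor)
sum by (name) (rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) * 100

# Host network throughput, bytes/s received, excluding loopback (node-exporter)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
```

Paste any of these into Explore first; once the shape looks right, save it as a panel.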

Prometheus evaluates alert rules every evaluation_interval and hands firing alerts to Alertmanager. The trick is to write alerts that only fire on things you’d actually wake up for.

/mnt/nfs/docker/docker/prometheus/config/rules/alerts.yml
groups:
  - name: container_alerts
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"}) > 60
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"
      - alert: ContainerHighCPU
        expr: (sum(rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) by (name) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high CPU usage"
  - name: host_alerts
    rules:
      - alert: DiskSpaceHigh
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
      - alert: DiskSpaceHighNFS
        expr: (1 - (node_filesystem_avail_bytes{mountpoint=~"/mnt/nfs.*"} / node_filesystem_size_bytes{mountpoint=~"/mnt/nfs.*"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
  - name: security_alerts
    rules:
      - alert: CrowdSecNewBan
        expr: increase(cs_active_decisions[5m]) > 0
        for: 0m
        labels:
          severity: info

The for: clause is the anti-flap filter — a condition must hold for N minutes before firing. ContainerHighCPU with for: 5m means a single spike won’t page you; a sustained load will.

{{ $labels.name }} and {{ $value }} are Go templates expanded when the alert fires — they end up in the Alertmanager notification.

Prometheus fires; Alertmanager routes. Minimal config:

alertmanager.yml
route:
  receiver: discord
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: discord
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/XXX/YYY
        title: '{{ .GroupLabels.alertname }}'
        message: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'

Grafana binds :3001 on the host (mapped to its internal :3000). Route grafana.example.com → grafana:3000 in your reverse proxy over HTTP. Set GF_SERVER_ROOT_URL to the full public URL so login redirects work.

If you front it with SSO (Authentik, Keycloak), you also want:

environment:
  - GF_AUTH_PROXY_ENABLED=true
  - GF_AUTH_PROXY_HEADER_NAME=X-authentik-username
  - GF_AUTH_PROXY_HEADER_PROPERTY=username
  - GF_AUTH_PROXY_AUTO_SIGN_UP=true
  - GF_AUTH_PROXY_WHITELIST=<your-proxy-ip>

…and then configure the reverse proxy to inject the username header only on authenticated requests.
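With Traefik, for instance, that injection is what the forwardAuth middleware does — a sketch of the dynamic config, assuming a standard Authentik embedded-outpost setup (the outpost address is an assumption; substitute your own):

```yaml
# Traefik dynamic config (sketch). forwardAuth sends each request to
# Authentik first and, on success, copies the listed response headers
# through to Grafana — unauthenticated requests never reach it.
http:
  middlewares:
    authentik:
      forwardAuth:
        # Assumed Authentik outpost address — adjust for your deployment
        address: http://authentik-outpost:9000/outpost.goauthentik.io/auth/traefik
        trustForwardHeader: true
        authResponseHeaders:
          - X-authentik-username
```

Attach the middleware to the Grafana router only; any route without it won't carry the header, so Grafana falls back to its normal login.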

10. Day-to-Day: What Grafana Is Actually For


Once it’s up, three workflows dominate:

1. “Is everything OK right now?” Open the home dashboard. Glance at CPU, memory, disk, container status. Takes 5 seconds.

2. “Why did X break at 03:15?” Open Explore → pick Loki → filter {container="X"} with a time range covering 03:15. Logs for exactly that window. Follow the timestamps into Prometheus (Explore → Prometheus, same time range) to see if it was CPU-starved, memory-exhausted, or something external.

3. “Has Z been getting slower?” Build a panel with the right PromQL. Widen the time range to 30 days. See the trend. If it’s real, dig in.

The two most valuable keybindings in Grafana:

  • g+h — jump to the home dashboard from anywhere
  • t+z — zoom out the time range (t+← shifts the range earlier; good for “what happened before now”)

All three components store data:

| Component | Default retention | Where it goes |
|---|---|---|
| Prometheus | 15 days | /prometheus volume — TSDB files |
| Loki | Unlimited unless configured | /loki — chunks (compressed log data) + index |
| Grafana | Forever | /var/lib/grafana — SQLite with dashboards + settings |

Prometheus grows roughly linearly with the number of series you scrape. Loki grows with log volume — a talkative Plex plus a busy Traefik can easily generate hundreds of MB per day. Set retention up front:

Prometheus: add --storage.tsdb.retention.time=30d to the command args.

Loki: set limits_config.retention_period: 720h + a compactor block with delete_request_store: filesystem.
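In Loki's config those two settings look roughly like this (a sketch for a single-node filesystem deployment; `delete_request_store` is required once `retention_enabled` is on in Loki 3.x — adjust paths to your volume layout):

```yaml
# loki-config.yaml (fragment) — 30-day retention enforced by the compactor
limits_config:
  retention_period: 720h          # 30 days
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem
```

Without a compactor block, `retention_period` is a limit Loki never actually enforces — chunks just accumulate on disk.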

A Prometheus target is red in Status → Targets. Either the network path is broken (try docker exec prometheus wget -qO- http://target:port/metrics) or the target doesn’t expose /metrics.

“No data” in a Grafana panel but Prometheus has it. Wrong datasource UID. Edit the panel, re-pick the datasource.

Loki says “ingester not ready” / “too many outstanding requests”. Usually NFS being slow. Check docker logs loki for IO timeouts.

Dashboards vanish on restart. The Grafana DB is on a bind mount that lost its permissions. chmod 777 the Grafana data dir (or pin the container to UID 472).

Alerts stuck in “pending” and never fire. The alert’s for: hasn’t elapsed. Or the rule is evaluating but never matching — check Alerts → expression in the Prometheus UI, run the raw PromQL in Explore.