Writeup still a WIP, please pardon the dust.
Below is mostly braindumps & rough commands for creating/tweaking these services. Formal writeup coming soon!
# First node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.6+k3s2 INSTALL_K3S_EXEC="server --cluster-init" sh -
export NODE_TOKEN=$(cat /var/lib/rancher/k3s/server/node-token)
# Remaining nodes
curl -sfL https://get.k3s.io | K3S_TOKEN=$NODE_TOKEN INSTALL_K3S_VERSION=v1.29.6+k3s2 INSTALL_K3S_EXEC="server --server https://<server node ip>:6443 --kubelet-arg=allowed-unsafe-sysctls=net.ipv4.*,net.ipv6.conf.all.forwarding" sh -
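Once the remaining nodes have joined, a quick check that they all registered and report Ready can save some head-scratching later. A minimal sketch, assuming kubectl is already pointed at the cluster (e.g. KUBECONFIG=/etc/rancher/k3s/k3s.yaml); the function name count_ready_nodes is hypothetical:

```shell
# Count nodes whose STATUS column is exactly "Ready".
# A cordoned node shows "Ready,SchedulingDisabled" and is not counted.
count_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

Compare the printed number against the number of nodes you installed.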
https://docs.k3s.io/upgrades/automated
$ sudo crictl rmi --prune
(Shouldn't be a problem on newer Debian, where rsyslog is not in use.)
In /etc/systemd/journald.conf, set "SystemMaxUse=100M"
In /etc/logrotate.conf, set "size 100M"
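The journald setting can also be applied without opening an editor. A rough sketch, assuming GNU sed (as on Debian) and a file that either already has a SystemMaxUse line (possibly commented out) or none at all; set_journald_max_use is a hypothetical helper name:

```shell
# Set SystemMaxUse in a journald.conf-style file: replace an existing
# (possibly commented-out) entry, or append one if none exists.
# Assumes GNU sed for -i and -E.
set_journald_max_use() {
  conf="$1"
  size="$2"
  if grep -Eq '^#?SystemMaxUse=' "$conf"; then
    sed -i -E "s/^#?SystemMaxUse=.*/SystemMaxUse=${size}/" "$conf"
  else
    printf 'SystemMaxUse=%s\n' "$size" >> "$conf"
  fi
}
```

From a root shell: set_journald_max_use /etc/systemd/journald.conf 100M, then systemctl restart systemd-journald.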
https://github.com/containerd/containerd/blob/main/docs/content-flow.md
containerd really doesn't want you batch-deleting snapshots.
https://github.com/k3s-io/k3s/issues/1905#issuecomment-820554037
for sha in $(sudo k3s ctr snapshot usage | awk '{print $1}'); do sudo k3s ctr snapshot rm $sha && echo $sha; done
Run this a few times until it stops returning results.
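The "run it a few times" step can be wrapped in a loop that stops once a pass removes nothing. This is only a sketch: prune_snapshots is a hypothetical name, the pass limit is arbitrary, and it assumes the first line of the usage output is a header row.

```shell
# Repeat the removal pass until a pass removes nothing, or max_passes is hit.
# Removing a snapshot that still has children fails; it is retried on the
# next pass, once its children are gone.
prune_snapshots() {
  max_passes="${1:-10}"
  pass=0
  while [ "$pass" -lt "$max_passes" ]; do
    removed=0
    # NR > 1 skips the header row (assumed to be the first output line).
    for sha in $(sudo k3s ctr snapshot usage | awk 'NR > 1 { print $1 }'); do
      if sudo k3s ctr snapshot rm "$sha" 2>/dev/null; then
        echo "$sha"
        removed=$((removed + 1))
      fi
    done
    [ "$removed" -eq 0 ] && break
    pass=$((pass + 1))
  done
}
```

Snapshots still in use keep failing to delete, so the loop ends when only those remain.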
Uses traefik, the k3s default.
externalTrafficPolicy: Local is used to preserve forwarded IPs.
A cluster-ingress=true label is given to the node my router is pointing to. Some services use a nodeAffinity to request it.
For traefik, this is a harmless optimization to reduce traffic hairpinning. For pods with hostNetwork: true, this ensures they run on the node with the right IP.
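Roughly what the label and affinity look like in practice. A sketch only: the helper names are hypothetical, and whether a given service uses a required or a preferred affinity is an assumption here (required is shown).

```shell
# Label the node the router forwards to (node name supplied by the caller).
label_ingress_node() {
  kubectl label node "$1" cluster-ingress=true --overwrite
}

# Print a pod-spec "affinity" fragment that requires the labelled node.
ingress_affinity_yaml() {
  cat <<'EOF'
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cluster-ingress
              operator: In
              values:
                - "true"
EOF
}
```

The fragment goes under spec.template.spec of a Deployment or DaemonSet.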
See rook/rook-ceph-operator-values.yaml and rook/rook-ceph-cluster-values.yaml.
https://rook.io/docs/rook/latest-release/Upgrade/rook-upgrade/?h=upgrade
https://rook.io/docs/rook/latest-release/Upgrade/ceph-upgrade/?h=upgrade
ceph osd metadata <id> | grep -e '"hostname"' -e '"bluestore_bdev_dev_node"'
$ ceph osd metadata osd.1 | grep -e '"hostname"' -e '"bluestore_bdev_dev_node"'
"bluestore_bdev_dev_node": "/dev/sdd",
"hostname": "node1",
My setup divides k8s nodes into ceph & non-ceph nodes (using the label storage-node=true).
Ensure labels & a toleration are set properly, so non-rook nodes can still run PV plugin DaemonSets. I accomplished this with a storage-node=false label on non-rook nodes, with a toleration checking for storage-node.
Otherwise, any pod scheduled on a non-ceph node won't be able to mount ceph-backed PVCs.
See rook-ceph-cluster-values.yaml->cephClusterSpec->placement for an example.
EC-backed filesystems require a regular replicated pool as a default.
https://lists.ceph.io/hyperkitty/list/[email protected]/thread/QI42CLL3GJ6G7PZEMAD3CXBHA5BNWSYS/ https://tracker.ceph.com/issues/42450
Then use setfattr to point a directory on the filesystem at the EC-backed pool. Any new data written to the directory will go to the EC-backed pool.
setfattr -n ceph.dir.layout.pool -v cephfs-erasurecoded /mnt/cephfs/my-erasure-coded-dir
https://docs.ceph.com/en/quincy/cephfs/file-layouts/
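To confirm the layout change took, the attribute can be read back right after setting it. A small sketch, assuming the attr package's setfattr/getfattr and a mounted CephFS; set_dir_pool is a hypothetical helper name:

```shell
# Point a CephFS directory at a data pool, then read the attribute back.
# Prints the pool name the directory now reports.
set_dir_pool() {
  dir="$1"
  pool="$2"
  setfattr -n ceph.dir.layout.pool -v "$pool" "$dir" || return 1
  getfattr --only-values -n ceph.dir.layout.pool "$dir"
}
```

For example: set_dir_pool /mnt/cephfs/my-erasure-coded-dir cephfs-erasurecoded. Existing files keep their old layout; only new writes land in the new pool.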
Create CephFilesystem
Create SC backed by Filesystem & Pool
Ensure the CSI subvolumegroup was created. If not, ceph fs subvolumegroup create <fsname> csi
Create PVC without a specified PV: PV will be auto-created
Super important: Set created PV's persistentVolumeReclaimPolicy to Retain
Save the PV yaml, remove any extra information (see rook/data/data-static-pv.yaml for an example of what's required). Give it a more descriptive name.
Delete the PVC and the PV.
Apply your new PV YAML. Create a new PVC, pointing at this new PV.
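The Retain and save steps above can be done from the command line. A sketch with hypothetical helper names; the patch and jsonpath calls are plain kubectl, nothing rook-specific:

```shell
# Print the name of the PV bound to a PVC (namespace, then PVC name).
pv_for_pvc() {
  kubectl -n "$1" get pvc "$2" -o jsonpath='{.spec.volumeName}'
}

# Switch a PV's reclaim policy to Retain so deleting its PVC keeps the data.
retain_pv() {
  kubectl patch pv "$1" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}
```

Something like pv=$(pv_for_pvc <namespace> <pvc name>); retain_pv "$pv"; kubectl get pv "$pv" -o yaml > my-pv.yaml covers everything up to the manual YAML cleanup.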
Grow resources->storage on PV
Grow resources->storage on PVC
Verify the new limit: getfattr -n ceph.quota.max_bytes /mnt/volumes/csi/csi-vol-<uuid>/<uuid>
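Both edits can be applied as patches instead of hand-editing. A sketch, assuming a statically created PV as described above and a StorageClass that permits expansion; grow_static_volume is a hypothetical helper name:

```shell
# Raise the size on the PV first, then on the PVC bound to it.
# Arguments: PV name, PVC namespace, PVC name, new size (e.g. 20Gi).
grow_static_volume() {
  pv="$1"; ns="$2"; pvc="$3"; size="$4"
  kubectl patch pv "$pv" -p "{\"spec\":{\"capacity\":{\"storage\":\"$size\"}}}" &&
  kubectl -n "$ns" patch pvc "$pvc" -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$size\"}}}}"
}
```

Afterwards, the getfattr check on ceph.quota.max_bytes shows whether the new quota actually reached CephFS.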
for i in $(ceph osd pool ls); do echo "$i: $(ceph osd pool get "$i" crush_rule)"; done
On EC-backed pools, device class information is in the erasure code profile, not the crush rule. https://docs.ceph.com/en/latest/dev/erasure-coded-pool/
for i in $(ceph osd erasure-code-profile ls); do echo "$i: $(ceph osd erasure-code-profile get "$i")"; done
If hostNetwork is enabled on the cluster, ensure rook-ceph-operator is not running with hostNetwork enabled. It doesn't need host network access to orchestrate the cluster, and running with it impedes orchestration of objectstores & associated resources.
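A quick way to check what the operator is actually running with. A sketch, assuming the operator is a Deployment named rook-ceph-operator in the rook-ceph namespace; the function name is hypothetical:

```shell
# Succeeds (exit 0) only if the operator pod template sets hostNetwork: true.
operator_uses_host_network() {
  value=$(kubectl -n rook-ceph get deployment rook-ceph-operator \
    -o jsonpath='{.spec.template.spec.hostNetwork}')
  [ "$value" = "true" ]
}
```

For example: operator_uses_host_network && echo "operator is on the host network, fix the operator values".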
This is great for setting up easy public downloads.
Create a user (rook/buckets/user-josh.yaml)
Get the user's credentials: kubectl -n rook-ceph get secret rook-ceph-object-user-ceph-objectstore-josh -o go-template='{{range $k,$v := .data}}{{printf "%s: " $k}}{{if not $v}}{{$v}}{{else}}{{$v | base64decode}}{{end}}{{"\n"}}{{end}}'
Create a bucket (rook/buckets/bucket.py::create_bucket)
Set a public-read policy (rook/buckets/bucket.py::set_public_read_policy)
Upload file
from bucket import *
conn = connect()
conn.upload_file('path/to/s3-bucket-listing/index.html', 'public', 'index.html', ExtraArgs={'ContentType': 'text/html'})
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda-repo-debian11-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-debian11-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-debian11-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt install cuda nvidia-container-runtime nvidia-kernel-dkms
sudo apt install --reinstall nvidia-kernel-dkms
sudo vi /etc/modprobe.d/blacklist-nvidia-nouveau.conf
blacklist nouveau
options nouveau modeset=0
sudo update-initramfs -u
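The blacklist file can also be written without an editor, and it's worth checking whether nouveau is still loaded after a reboot. A sketch with hypothetical helper names:

```shell
# Write the two nouveau blacklist lines to the given modprobe.d file.
write_nouveau_blacklist() {
  printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$1"
}

# Succeeds (exit 0) if the nouveau module is currently loaded.
nouveau_loaded() {
  lsmod | grep -q '^nouveau'
}
```

From a root shell: write_nouveau_blacklist /etc/modprobe.d/blacklist-nvidia-nouveau.conf, then update-initramfs -u and reboot. nouveau_loaded should fail afterwards.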
Copy the containerd config template from https://github.com/k3s-io/k3s/blob/v1.24.2%2Bk3s2/pkg/agent/templates/templates_linux.go into /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl (substitute your k3s version)
Edit the file to add a [plugins.cri.containerd.runtimes.runc.options] section:
<... snip>
conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
{{end}}
[plugins.cri.containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes.runc.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
{{ if .PrivateRegistryConfig }}
<... snip>
& then systemctl restart k3s
Label your GPU-capable nodes: kubectl label nodes <node name> gpu-node=true
& then install the nvidia device plugin:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm upgrade -i nvdp nvdp/nvidia-device-plugin --version=0.12.2 --namespace nvidia-device-plugin --create-namespace --set-string nodeSelector.gpu-node=true
Ensure the pods in the namespace are Running.
Test GPU passthrough by applying examples/cuda-pod.yaml, then exec-ing into it & running nvidia-smi.
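The passthrough test as a single step. A sketch: the pod name is whatever examples/cuda-pod.yaml defines (passed in as an argument here), and the timeout is an arbitrary choice; gpu_smoke_test is a hypothetical name.

```shell
# Apply the example pod, wait for it to be Ready, then run nvidia-smi in it.
# Argument: the pod name defined in examples/cuda-pod.yaml.
gpu_smoke_test() {
  pod="$1"
  kubectl apply -f examples/cuda-pod.yaml &&
  kubectl wait --for=condition=Ready "pod/$pod" --timeout=120s &&
  kubectl exec "$pod" -- nvidia-smi
}
```

If nvidia-smi prints the GPU table, the runtime and device plugin are wired up.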
https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing
version: v1
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
resources:
- name: nvidia.com/gpu
replicas: 5
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin ... --set-file config.map.config=nvidia-device-plugin-config.yaml
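To see whether time-slicing took effect, check how many nvidia.com/gpu resources the node advertises. A sketch; gpu_allocatable is a hypothetical name, and the expectation that the count scales with replicas is based on the linked device plugin docs rather than verified here:

```shell
# Print the number of nvidia.com/gpu resources a node reports as allocatable.
# The backslash escapes the dot in the resource name for jsonpath.
gpu_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

With replicas: 5 and a single physical GPU, the expectation is that this goes from 1 to 5 once the plugin pod restarts.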
https://docs.ceph.com/en/latest/man/8/mount.ceph/
sudo mount -t ceph user@<cluster FSID>.<filesystem name>=/ /mnt/ceph -o secret=<secret key>,x-systemd.requires=ceph.target,x-systemd.mount-timeout=5min,_netdev,mon_addr=192.168.1.1
sudo vi /etc/fstab
192.168.1.1,192.168.1.2:/ /ceph ceph name=admin,secret=<secret key>,x-systemd.mount-timeout=5min,_netdev,mds_namespace=data
$ cat /etc/ceph/ceph.conf
[global]
fsid = <my cluster uuid>
mon_host = [v2:192.168.1.1:3300/0,v1:192.168.1.1:6789/0] [v2:192.168.1.2:3300/0,v1:192.168.1.2:6789/0]
$ cat /etc/ceph/ceph.client.admin.keyring
[client.admin]
key = <my key>
caps mds = "allow *"
caps mgr = "allow *"
caps mon = "allow *"
caps osd = "allow *"
sudo vi /etc/fstab
none /ceph fuse.ceph ceph.id=admin,ceph.client_fs=data,x-systemd.requires=ceph.target,x-systemd.mount-timeout=5min,_netdev 0 0
https://unix.stackexchange.com/questions/554908/disable-spectre-and-meltdown-mitigations
https://rpi4cluster.com/monitoring/monitor-intro/, + what's in the monitoring folder.
Tried https://github.com/prometheus-operator/kube-prometheus. The only way to persist dashboards is to add them to Jsonnet & apply the generated configmap. I'm not ready for that kind of IaC commitment in a homelab.
kubectl expose svc/some-service --name=some-service-external --port 1234 --target-port 1234 --type LoadBalancer
Service will then be available on port 1234 of any k8s node.
An A record for lan.jibby.org & *.lan.jibby.org points to an internal IP.
To be safe, a middleware is included to filter out source IPs outside of the LAN network & k3s CIDR. See traefik/middleware-lanonly.yaml.
Then, internal services can be exposed with an IngressRoute, as a subdomain of lan.jibby.org. See sonarr.yaml's IngressRoute.
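For a rough idea of the shape without opening sonarr.yaml, here is a generator for a LAN-only IngressRoute. Heavily assumption-laden: the middleware name lanonly, the websecure entrypoint, the traefik.io/v1alpha1 apiVersion (older Traefik releases use traefik.containo.us/v1alpha1), and the function name are all guesses for illustration, not copied from the repo.

```shell
# Print an IngressRoute exposing <name>.lan.jibby.org -> Service <name>:<port>.
# Middleware name, entrypoint, and apiVersion are assumptions (see above).
lan_ingressroute_yaml() {
  name="$1"; port="$2"
  cat <<EOF
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ${name}
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(\`${name}.lan.jibby.org\`)
      middlewares:
        - name: lanonly
      services:
        - name: ${name}
          port: ${port}
EOF
}
```

For example: lan_ingressroute_yaml sonarr 8989 | kubectl apply -f -. If the middleware lives in another namespace, its reference needs a namespace field too.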
My backups target is a machine running
KUBECONFIG=/etc/rancher/k3s/k3s.yaml velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.0 \
--bucket velero \
--secret-file ./credentials-velero \
--use-volume-snapshots=true \
--default-volumes-to-fs-backup \
--use-node-agent \
--backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://172.16.69.234:9000 \
--snapshot-location-config region="default"
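Once installed, a one-off backup is a decent smoke test of the storage location. A sketch, assuming KUBECONFIG is exported as in the install command; backup_namespace and the timestamped naming scheme are hypothetical:

```shell
# Back up a single namespace under a timestamped name and wait for completion.
backup_namespace() {
  ns="$1"
  name="${ns}-$(date +%Y%m%d%H%M%S)"
  velero backup create "$name" --include-namespaces "$ns" --wait
}
```

velero backup describe <name> --details afterwards shows whether the volume backups were picked up.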
Had to remove resources: from the daemonset.
kubectl -n velero edit backupstoragelocation default
https://velero.io/docs/v1.3.0/restore-reference/
Velero does not support hostPath PVCs, but works just fine with the openebs-hostpath storageClass.
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install openebs --namespace openebs openebs/openebs --create-namespace --set localprovisioner.basePath=/k3s-storage/openebs
This is a nice PVC option for simpler backup target setups.
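A PVC against that storage class looks roughly like this. A sketch: the helper name is hypothetical, and ReadWriteOnce is assumed since hostpath volumes are tied to one node.

```shell
# Print a PVC manifest using the openebs-hostpath storage class.
# Arguments: PVC name, requested size (e.g. 50Gi).
hostpath_pvc_yaml() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: openebs-hostpath
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $2
EOF
}
```

For example: hostpath_pvc_yaml backups 50Gi | kubectl apply -f -. The data ends up under the basePath set in the helm install.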