Complete server deployment config
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --cluster-init" sh -
export NODE_TOKEN=$(cat /var/lib/rancher/k3s/server/node-token)
curl -sfL https://get.k3s.io | K3S_TOKEN=$NODE_TOKEN INSTALL_K3S_EXEC="server --server https://192.168.122.87:6443" INSTALL_K3S_VERSION=v1.23.6+k3s1 sh -
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm upgrade --install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph --version 1.9.2 -f rook-ceph-values.yaml
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install --create-namespace --namespace rook-ceph rook-ceph-cluster --set operatorNamespace=rook-ceph rook-release/rook-ceph-cluster --version 1.9.2 -f rook-ceph-cluster-values.yaml
Create CephFilesystem
Create SC backed by Filesystem & Pool
Ensure the CSI subvolumegroup was created. If not, ceph fs subvolumegroup create <fsname> csi
Create PVC without a specified PV: PV will be auto-created
Set the created PV's persistentVolumeReclaimPolicy to Retain
Create a new, better-named PVC
If important data is on CephBlockPool-backed PVCs, don't forget to set the PV's persistentVolumeReclaimPolicy to Retain.
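To adopt the retained PV under the new name: delete the old PVC, clear the retained PV's spec.claimRef (e.g. with kubectl edit) so it becomes bindable again, then create a PVC that names the PV explicitly. A minimal sketch with hypothetical names (the PV name, storage class & size are whatever your cluster created):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-cephfs                    # the better-named PVC (hypothetical)
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ceph-filesystem     # must match the PV's storageClassName
  volumeName: pvc-0123abcd              # the auto-created PV being adopted
  resources:
    requests:
      storage: 100Gi                    # must fit within the PV's capacity
```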
If your setup divides k8s nodes into ceph & non-ceph nodes (using a label like storage-node=true), ensure labels & a toleration are set properly (storage-node=false on the non-ceph nodes, with a toleration checking for storage-node) so non-ceph nodes still run the PV plugin DaemonSets.
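A sketch of what that could look like in rook-ceph-values.yaml, assuming the operator chart's csi.pluginTolerations / csi.provisionerTolerations values (key names vary by chart version, so check its values.yaml; node placement can be tuned alongside with csi.pluginNodeAffinity):

```yaml
csi:
  # Tolerate the storage-node taint so the CSI plugin DaemonSet lands on every
  # node; non-ceph nodes still need the plugin to mount rook-backed PVs.
  pluginTolerations:
    - key: storage-node
      operator: Exists
  provisionerTolerations:
    - key: storage-node
      operator: Exists
```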
EC-backed filesystems require a regular replicated pool as a default
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/Y6T7OVTC4XAAWMFTK3MYGC7TB6G47OCH/ https://tracker.ceph.com/issues/42450
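For illustration, a CephFilesystem along those lines (a sketch based on the upstream Rook examples, not necessarily this repo's manifest) pairs a small replicated default data pool with the EC pool that actually holds the data; the StorageClass is then pointed at the EC data pool:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: ec-fs                    # hypothetical name
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - replicated:                # first data pool: the required replicated default
        size: 3
    - erasureCoded:              # EC pool holding the bulk of the data
        dataChunks: 2
        codingChunks: 1
  metadataServer:
    activeCount: 1
    activeStandby: true
```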
If hostNetwork is enabled on the cluster, ensure rook-ceph-operator is not running with hostNetwork enabled. It doesn't need host network access to orchestrate the cluster, & running it that way impedes orchestration of objectstores & associated resources.
This is great for setting up easy public downloads.
kubectl -n rook-ceph get secret rook-ceph-object-user-ceph-objectstore-josh -o go-template='{{range $k,$v := .data}}{{printf "%s: " $k}}{{if not $v}}{{$v}}{{else}}{{$v | base64decode}}{{end}}{{"\n"}}{{end}}'
Create a bucket (rook/buckets/bucket.py::create_bucket)
Set a public read policy (rook/buckets/bucket.py::set_public_read_policy)
Upload file:
from bucket import *
conn = connect()
conn.upload_file('path/to/s3-bucket-listing/index.html', 'public', 'index.html', ExtraArgs={'ContentType': 'text/html'})
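For reference, the secret read above is the kind Rook creates for a CephObjectStoreUser; a minimal sketch matching that secret name (user josh on object store ceph-objectstore; not necessarily this repo's manifest):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: josh
  namespace: rook-ceph
spec:
  store: ceph-objectstore        # the CephObjectStore backing the bucket
  displayName: josh
```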
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda-repo-debian11-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-debian11-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-debian11-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt install cuda nvidia-container-runtime nvidia-kernel-dkms
sudo apt install --reinstall nvidia-kernel-dkms
sudo vi /etc/modprobe.d/blacklist-nvidia-nouveau.conf, adding:
blacklist nouveau
options nouveau modeset=0
sudo update-initramfs -u
Copy the containerd config template from https://github.com/k3s-io/k3s/blob/v1.24.2%2Bk3s2/pkg/agent/templates/templates_linux.go into /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl (substituting your k3s version)
Edit the file:
<... snip>
  conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
{{end}}

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins.cri.containerd.runtimes.runc.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"

{{ if .PrivateRegistryConfig }}
<... snip>
& then systemctl restart k3s
Label your GPU-capable nodes: kubectl label nodes <node name> gpu-node=true
& then install the nvidia device plugin:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm upgrade -i nvdp nvdp/nvidia-device-plugin --version=0.12.2 --namespace nvidia-device-plugin --create-namespace --set-string nodeSelector.gpu-node=true
Ensure the pods in the namespace are Running.
Test GPU passthrough by applying examples/cuda-pod.yaml, then exec-ing into it & running nvidia-smi.
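If you're writing such a pod from scratch, a minimal sketch (not necessarily what examples/cuda-pod.yaml contains; image tag & names are assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  nodeSelector:
    gpu-node: "true"             # matches the label applied above
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.6.2-base-ubuntu20.04   # any CUDA base image works
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1      # request one GPU from the device plugin
```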
https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing
version: v1
sharing:
  timeSlicing:
    renameByDefault: false
    failRequestsGreaterThanOne: false
    resources:
      - name: nvidia.com/gpu
        replicas: 5
helm upgrade -i nvdp nvdp/nvidia-device-plugin ... --set-file config.map.config=nvidia-device-plugin-config.yaml
sudo apt install ceph-fuse
sudo vi /etc/fstab
192.168.1.1,192.168.1.2:/ /ceph ceph name=admin,secret=<secret key>,x-systemd.mount-timeout=5min,_netdev,mds_namespace=data
https://unix.stackexchange.com/questions/554908/disable-spectre-and-meltdown-mitigations
https://rpi4cluster.com/monitoring/k3s-grafana/
Tried https://github.com/prometheus-operator/kube-prometheus. The only way to persist dashboards is to add them to Jsonnet & apply the generated configmap.
kubectl expose, for accessing internal services:
kubectl expose svc/some-service --name=some-service-external --port 1234 --target-port 1234 --type LoadBalancer
The service will then be available on port 1234 of any k8s node.
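Equivalently, roughly the Service that command generates (hypothetical names; kubectl expose copies the selector from the existing service):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: some-service-external
spec:
  type: LoadBalancer             # k3s' built-in servicelb exposes this on the nodes
  selector:
    app: some-service            # whatever the original service selects
  ports:
    - port: 1234
      targetPort: 1234
```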