diff --git a/ops-agents/heimdall/history/2026-04-20-k3s-detailed-inspection.md b/ops-agents/heimdall/history/2026-04-20-k3s-detailed-inspection.md
new file mode 100644
index 0000000..c8deba2
--- /dev/null
+++ b/ops-agents/heimdall/history/2026-04-20-k3s-detailed-inspection.md
@@ -0,0 +1,335 @@
+---
+date: 2026-04-20
+topic: K3s detailed inspection (in-depth follow-up to the basic inspection)
+areas:
+  - infra/k8s/overview
+  - infra/data/longhorn
+  - infra/platform/argocd
+  - infra/observability/vector
+---
+
+# 2026-04-20 K3s detailed inspection
+
+Data collected at **2026-04-20 19:10 KST**. K3s v1.34.5+k3s1, containerd 2.1.5, Longhorn 1.8.2. Requested by: kappa.
+
+## Overall verdict
+
+| Section | Verdict | Key point |
+|---|---|---|
+| 1. Nodes | OK | req/lim at or below 40% on all nodes |
+| 2. NS resources | OK | apisix ranks first in memory at 3456Mi |
+| 3. Pod issues | OK | CrashLoop 0 |
+| 4. PV/PVC | OK | all 26 PVCs Bound |
+| 5. Longhorn | OK | 25 volumes healthy, snapshots 97/97 |
+| 6. ArgoCD (14 apps) | OK | 14/14 Synced+Healthy |
+| 7. cert-manager | OK | shortest remaining validity 62 days |
+| 8. Network | OK | metallb 14%, 24 routes |
+| 9. Helm 28 releases | OK | all deployed |
+| 10. Today's changes | OK | 4/4 applied |
+| 11. graphify | OK | no mismatches |
+
+---
+
+## 1. Node details: OK
+
+### Capacity / Allocatable / Kernel
+
+| Node | Role | CPU cap | Mem cap | Mem alloc | Kernel | Ready since |
+|---|---|---|---|---|---|---|
+| incus-hp1 | worker | 32 | 198.0 Gi | 198.0 Gi | 6.12.74+deb13+1 | 2026-04-19 12:43 |
+| incus-hp2 | worker | 32 | 198.0 Gi | 198.0 Gi | 6.12.74+deb13+1 | 2026-04-13 19:51 |
+| incus-kr1 | control-plane | 28 | 65.7 Gi | 65.7 Gi | 6.12.74+deb13+1 | 2026-04-16 07:33 |
+| incus-kr2 | control-plane | 16 | 32.0 Gi | 21.6 Gi | 6.12.74+deb13+1 | 2026-04-19 12:08 |
+
+On kr2, allocatable memory (`alloc mem` 21.6Gi) is below capacity (32Gi) → 10.4Gi is system-reserved. The other nodes have no reservation (whether this is intentional remains to be confirmed).
+
+### Requests / Limits
+
+| Node | CPU req / lim | Mem req / lim |
+|---|---|---|
+| incus-hp1 | 5 (15%) / 420m (1%) | 4680 Mi (2%) / 11236 Mi (5%) |
+| incus-hp2 | 5.7 (17%) / 1.8 (5%) | 7593 Mi (3%) / 18584 Mi (9%) |
+| incus-kr1 | 5.34 (19%) / 5.8 (20%) | 3477 Mi (5%) / 8058 Mi (12%) |
+| incus-kr2 | 2.29 (14%) / 400m (2%) | 1654 Mi (7%) / 3514 Mi (16%) |
+
+Requests and limits are at or below 40% on every node, leaving ample headroom for growth.
+
+### Pressure / PLEG
+
+MemoryPressure/DiskPressure/PIDPressure=False on all nodes. 0 events with `reason=PLEGUnhealthy`.
+
+### Longhorn disk usage
+
+| Node | Available | Maximum | Scheduled | Usage |
+|---|---|---|---|---|
+| incus-hp1 (nvme) | 909 Gi | 938 Gi | 65 Gi | 3.1% |
+| incus-hp2 | 895 Gi | 938 Gi | 137 Gi | 4.6% |
+| incus-kr1 | 798 Gi | 936 Gi | 137 Gi | 14.7% |
+| incus-kr2 | 798 Gi | 936 Gi | 64 Gi | 14.8% |
+
+`/var/lib/rancher` usage was not measured (host SSH is not available from the heimdall container). The next inspection will collect it by exec'ing into the longhorn-manager daemonset.
+
+---
+
+## 2. Top 10 namespaces by resources: OK
+
+Based on the sum of per-pod CPU/memory requests.
+
+| # | Namespace | Pods | CPU req (core) | Mem req (MiB) |
+|---|---|---|---|---|
+| 1 | apisix | 7 | 0.64 | 3 456 |
+| 2 | monitoring | 10 | 0.27 | 2 162 |
+| 3 | kube-system | 22 | 1.45 | 1 770 |
+| 4 | teable | 2 | 0.15 | 832 |
+| 5 | logging | 5 | 0.30 | 768 |
+| 6 | open-webui | 1 | 0.10 | 768 |
+| 7 | teleport | 3 | 0.60 | 768 |
+| 8 | openmemory | 3 | 0.15 | 704 |
+| 9 | metallb-system | 5 | 0.00 | 640 |
+| 10 | argocd | 7 | 0.20 | 600 |
+
+apisix ranks first in memory (3-member etcd + dashboard + ingress-controller).
+
+---
+
+## 3. Pod-level issues: OK
+
+| Metric | Value |
+|---|---|
+| CrashLoopBackOff | 0 |
+| Pending | 0 |
+| Unknown | 0 |
+| Non-Running pods | 0 |
+| Restarted ≥1 time (lifetime) | 28 |
+
+Top 10 by cumulative restarts (lifetime, not the last 6h):
+
+| NS | Pod | Restarts |
+|---|---|---|
+| democratic-csi | synology-iscsi-node-hlsr2 | 8 |
+| longhorn-system | longhorn-csi-plugin-cqndq | 6 |
+| monitoring | vm-stack-vmop | 5 |
+| democratic-csi | synology-iscsi-node-287g6 | 5 |
+| longhorn-system | longhorn-manager-l5smc | 4 |
+| longhorn-system | csi-resizer-qk99l | 4 |
+| kube-system | kube-multus-ds-sk7hh | 4 |
+| longhorn-system | longhorn-csi-plugin-dt5hg | 3 |
+| longhorn-system | longhorn-csi-plugin-97ssq | 3 |
+| longhorn-system | csi-provisioner-vfbj2 | 3 |
+
+Restart counts are cumulative. Judging the "last 6h" requires the `lastTerminated` timestamp (not collected in this inspection). With CrashLoop at 0, there is no impact. **Note**: 8 restarts on democratic-csi is relatively high given the nature of an iSCSI node driver; the cause remains to be checked.
+
+---
+
+## 4. PV / PVC: OK
+
+| StorageClass | PVCs | Bound | Pending | Provisioner |
+|---|---|---|---|---|
+| longhorn | 25 | 25 | 0 | driver.longhorn.io |
+| nfs (1 legacy) | 1 | 1 | 0 | cluster.local/...nfs-subdir-external-provisioner |
+| synology-iscsi | 0 | - | - | org.democratic-csi.iscsi.synology (ready, unused) |
+| local-path | 0 | - | - | rancher.io/local-path |
+| **Total** | **26** | **26** | **0** | |
+
+Two default StorageClasses coexist (`local-path`, `longhorn`). **Note**: having two defaults is not recommended; consider consolidating.
+
+---
+
+## 5. Longhorn details: OK
+
+### Volume/replica summary
+
+- Total volumes: **25** (all attached+healthy)
+- Total replicas: **72**, all running, failedAt="", rebuildRetryCount=0
+- Placement: hp2 25, kr1 25, kr2 15, **hp1 3** (new node, rebalance not yet complete)
+
+### Replica policy
+
+| numberOfReplicas | Volumes |
+|---|---|
+| 3 | 18 |
+| 2 | 7 (7 safeline volumes: chaos/tengine-logs/detector-logs/luigi/mgt/detector/database) |
+
+**Note**: confirm whether the 2-replica policy for safeline is intentional. With 2 replicas, a single node failure leaves no safety margin for recovery.
+
+### Snapshots in the last 6h
+
+| Type | Succeeded | Failed |
+|---|---|---|
+| critical-snapshot (hourly) | 91 | 0 |
+| standard-snapshot (daily 18:00) | 6 | 0 |
+| **Total** | **97** | **0** |
+
+### RecurringJob
+
+| Name | Schedule | Retain | Task |
+|---|---|---|---|
+| critical-snapshot | `0 * * * *` | 24 | snapshot |
+| critical-backup | `0 */6 * * *` | 28 | backup |
+| standard-snapshot | `0 18 * * *` | 7 | snapshot |
+| standard-backup | `0 19 * * *` | 7 | backup |
+
+Matches the graphify nodes `K3s Backup Pipeline` / `Longhorn RecurringJob (4 jobs)`.
+
+---
+
+## 6. ArgoCD (14 apps): OK
+
+| App | Sync | Health | Last Sync (UTC) | Revision |
+|---|---|---|---|---|
+| bunnycdn-mcp | Synced | Healthy | 2026-04-13 07:03 | f9054536 |
+| cfb-manager | Synced | Healthy | 2026-04-13 07:30 | 07dd408c |
+| juiceshop | Synced | Healthy | 2026-04-13 06:43 | 550488f8 |
+| kroki | Synced | Healthy | 2026-04-13 06:43 | 550488f8 |
+| namecheap-api | Synced | Healthy | 2026-04-13 06:43 | 550488f8 |
+| nas-proxy | Synced | Healthy | 2026-04-13 06:43 | 550488f8 |
+| openmemory | Synced | Healthy | 2026-04-19 05:56 | c572d356 |
+| outline | Synced | Healthy | 2026-04-19 05:56 | c572d356 |
+| pgpool | Synced | Healthy | 2026-04-16 07:39 | 0a94c94f |
+| proxysql | Synced | Healthy | 2026-04-13 06:43 | 550488f8 |
+| searxng | Synced | Healthy | 2026-04-13 06:43 | 550488f8 |
+| smtp-relay | Synced | Healthy | 2026-04-13 07:30 | 07dd408c |
+| vault-mcp | Synced | Healthy | 2026-04-13 06:53 | 0f5a662a |
+| vultr-api | Synced | Healthy | 2026-04-13 06:43 | 550488f8 |
+
+- 14/14 Synced + Healthy
+- All apps report `operationState.phase=Succeeded`
+- Oldest sync is 2026-04-13 (7 days ago); no drift under auto-sync
+
+---
+
+## 7. cert-manager: OK
+
+| Certificate | Domain | Ready | NotAfter (UTC) | Days to expiry |
+|---|---|---|---|---|
+| wildcard-actions-it-com | *.actions.it.com | True | 2026-06-21 18:19 | 62 |
+| wildcard-anvil-it-com | *.anvil.it.com | True | 2026-06-21 18:14 | 62 |
+| wildcard-api-inouter | *.api.inouter.com | True | 2026-06-24 02:54 | 64 |
+| wildcard-inouter | *.inouter.com | True | 2026-06-21 18:12 | 62 |
+| wildcard-ironclad-it-com | *.ironclad.it.com | True | 2026-06-21 18:12 | 62 |
+| wildcard-keepanker-cv | *.keepanker.cv | True | 2026-06-21 18:12 | 62 |
+| wildcard-mcp-inouter | *.mcp.inouter.com | True | 2026-06-24 03:56 | 64 |
+| wildcard-servidor-it-com | *.servidor.it.com | True | 2026-06-21 18:12 | 62 |
+
+No certificates expire within 30 days. The shortest remaining validity is 62 days.
+
+---
+
+## 8. Network: OK
+
+### MetalLB
+
+- Pool: `default-pool` = `192.168.9.50-192.168.9.99` (50 IPs)
+- Allocated: 7 / 50 (14%)
+
+| IP | Namespace | Service |
+|---|---|---|
+| 192.168.9.50 | apisix | apisix-gateway |
+| 192.168.9.51 | sshpiper | sshpiper |
+| 192.168.9.52 | teleport | teleport-cluster |
+| 192.168.9.53 | kube-system | traefik |
+| 192.168.9.54 | gitea | gitea-ssh |
+| 192.168.9.55 | sftpgo | sftpgo |
+| 192.168.9.56 | db | haproxy-pg |
+
+### Routing resources
+
+| Type | Count | Notes |
+|---|---|---|
+| Traefik IngressRoute | 13 | argocd-server, bunnycdn-mcp-tls, longhorn-ui(+tls), nas-proxy-tls, open-webui-tls, outline, portainer-tls, teable-tls, vault-mcp(+hcv)-tls, vector, vlogs |
+| Gateway API HTTPRoute | 11 | argocd, bunnycdn-mcp, gitea, grafana, kroki, n8n, nocodb, openmemory-mcp, safeline-mgt, searxng, sftpgo-web |
+| APISIXRoute | 0 | APISIX is installed (helm) but its CRs are unused |
+| Ingress | 0 | |
+
+Consistent with the graphify records `Traefik DaemonSet + Gateway API` and `APISIX→Traefik 메인 라우팅 전환` (APISIX→Traefik main routing switchover).
+
+---
+
+## 9. Helm releases: OK
+
+**28** releases in total, all `deployed`.
+
+| NS | Release | Chart | App Ver | Updated |
+|---|---|---|---|---|
+| apisix | apisix | apisix-2.13.0 | 3.15.0 | 2026-04-20 08:21 |
+| apisix | apisix-ingress-controller | apisix-ingress-controller-1.1.2 | 2.0.1 | 2026-04-19 14:53 |
+| argocd | argocd | argo-cd-9.4.16 | v3.3.5 | **2026-04-20 12:01** |
+| cert-manager | cert-manager | cert-manager-v1.20.0 | v1.20.0 | 2026-04-19 14:54 |
+| kube-system | descheduler | descheduler-0.35.1 | 0.35.1 | 2026-04-19 14:25 |
+| external-secrets | external-secrets | external-secrets-2.3.0 | v2.3.0 | 2026-04-19 14:55 |
+| gitea | gitea | gitea-12.5.0 | 1.25.4 | 2026-04-19 14:54 |
+| db | haproxy-pg | haproxy-pg-0.1.0 | 3.1 | 2026-04-16 17:31 |
+| longhorn-system | longhorn | longhorn-1.8.2 | v1.8.2 | 2026-04-19 15:05 |
+| metallb-system | metallb | metallb-0.15.3 | v0.15.3 | 2026-04-20 08:14 |
+| n8n | n8n | n8n-2.0.1 | 1.122.4 | 2026-03-25 10:15 |
+| nfs-provisioner | nfs-provisioner | nfs-subdir-external-provisioner-4.0.18 | 4.0.2 | 2026-04-19 14:55 |
+| tools | nocodb | nocodb-1.10.0 | 0.301.5 | 2026-04-13 15:29 |
+| open-webui | open-webui | open-webui-13.3.1 | 0.8.12 | 2026-04-19 16:25 |
+| db | pgcat | pgcat-0.1.0 | 0.2.5 | 2026-04-16 17:07 |
+| portainer | portainer | portainer-239.1.0 | ce-latest-ee-2.39.1 | 2026-04-19 14:55 |
+| kube-system | reflector | reflector-10.0.21 | 10.0.21 | 2026-04-19 14:55 |
+| safeline | safeline | safeline-10.1.0 | 9.3.2 | 2026-03-23 20:12 |
+| sftpgo | sftpgo | sftpgo-0.44.0 | 2.7.1 | 2026-03-27 16:13 |
+| sshpiper | sshpiper | sshpiper-0.4.6 | v1.5.0 | 2026-04-19 14:55 |
+| democratic-csi | synology-iscsi | democratic-csi-0.15.1 | 1.0 | 2026-04-20 08:54 |
+| teable | teable | teable-0.1.0 | latest | 2026-04-19 14:57 |
+| teleport | teleport-cluster | teleport-cluster-18.7.3 | 18.7.3 | 2026-04-13 15:14 |
+| kube-system | traefik | traefik-39.0.6 | v3.6.11 | 2026-04-19 14:54 |
+| logging | vector | vector-0.51.0 | 0.54.0-distroless-libc | **2026-04-20 12:04** |
+| velero | velero | velero-12.0.0 | 1.18.0 | 2026-04-20 10:04 |
+| logging | vlogs | victoria-logs-single-0.11.31 | v1.49.0 | 2026-04-08 20:22 |
+| monitoring | vm-stack | victoria-metrics-k8s-stack-0.72.6 | v1.139.0 | 2026-04-19 14:54 |
+
+An outdated-version check was not performed in this inspection. Releases whose `Updated` timestamp is more than two weeks old: n8n (03-25), sftpgo (03-27), safeline (03-23). Whether they are covered by ArgoCD auto-sync remains to be checked.
+
+---
+
+## 10. Verification of today's (2026-04-20) changes: OK (4/4)
+
+| Item | Expected | Actual | Verdict |
+|---|---|---|---|
+| argocd-application-controller memory limit | 1Gi | `1Gi` | ✅ |
+| vector container memory limit | 512Mi | `512Mi` | ✅ |
+| vector buffer `max_events` | 10000 | `10000` | ✅ |
+| vector buffer `retry_max_duration_secs` | 300 | `300` | ✅ |
+
+Notes:
+- The 4 vector DS pods have restart count **0**; start time 2026-04-20 12:04:22~30 KST
+- At the restart (03:04 UTC), the `vlogs` sink healthcheck returned 400 once; the error did not recur in the following 30 minutes, and delivery recovered via buffer retry
+- vector `retry_initial_backoff_secs: 2`
+- argocd application-controller `requests` remains at 384Mi
+
+---
+
+## 11. graphify cross-check: OK
+
+| graphify node | Live state | Consistent |
+|---|---|---|
+| Longhorn v1.8.2 | helm `longhorn-1.8.2` | ✅ |
+| Traefik DaemonSet + Gateway API | traefik v3.6.11 + 13 IngressRoute + 11 HTTPRoute | ✅ |
+| MetalLB L2 도입 (MetalLB L2 adoption) | metallb v0.15.3, pool 9.50-99 | ✅ |
+| Longhorn RecurringJob (4 jobs: critical/standard) | 4 RecurringJob (snapshot×2 + backup×2) | ✅ |
+| Vector Log Collector → VictoriaLogs Log Pipeline | vector→vlogs configured | ✅ |
+| APISIX→Traefik 메인 라우팅 전환 (APISIX→Traefik main routing switchover, 2026-03-25 history) | Traefik-centered HTTPRoute, 0 APISIX CRs | ✅ |
+| K3s PostgreSQL 백엔드 이전 (K3s PostgreSQL backend migration, 2026-03-24 history) | hp2 joined and operating (uptime 6d23h+) | ✅ |
+
+No mismatches between the graphify records and the live state.
+
+---
+
+## Recommended follow-ups
+
+1. Heimdall: set up a way to measure host `/var/lib/rancher` usage via the longhorn-manager daemonset.
+2. Heimdall: confirm the intent behind the replica=2 policy on the 7 safeline volumes → finalize the standard in the white paper.
+3. Heimdall: rebalance Longhorn replicas onto the new hp1 node (only 65Gi scheduled).
+4. Heimdall: investigate the cause of the initial vector → vlogs healthcheck 400.
+5. Heimdall: consider consolidating the two default StorageClasses into one.
+6. Heimdall: find the cause of the 8 restarts of the iSCSI democratic-csi node-plugin (syslog/dmesg).
+
+---
+
+## Remarks: how this report was delivered
+
+- An upload to the Outline `heimdall` collection was attempted first, but the detailed body was blocked by a **BunnyCDN Shield 403** (only the summary document, ID `c1ec3f2c-0fa8-49f8-9d0b-3d619a0e4715`, was created; creating child sections under the parent was blocked by the WAF).
+- `git push` is also pending because the Gitea upstream returns 504. The file was committed locally first; the push will be retried once gitea recovers.
+- Syn is to be asked to check the WAF rules on the Outline upload path (a follow-up outside the scope of this inspection).