pgpool-II PoC (switch n8n only) + add section to postgresql-ha.md

2026-04-16 08:25:02 +09:00
parent 125413d083
commit 0d59adb95f
2 changed files with 180 additions and 0 deletions
--- a/history/2026-04-16-pgpool-n8n-poc.md
+++ b/history/2026-04-16-pgpool-n8n-poc.md
@@ -0,0 +1,156 @@
 ---
 date: 2026-04-16
 topic: pgpool-II PoC (n8n-only switch)
 areas:
  - infra/postgresql-ha.md
 tags: [history, pgpool, pgcat, patroni, postgresql, poc]
 ---
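 n8n is vulnerable to pool zombies during Patroni failovers and brief etcd outages, and pgcat's handling of client-side connections cannot resolve this. As a PoC, only n8n was switched to pgpool-II in streaming_replication mode; in both verification scenarios the result was `zero errors or self-recovery in <2 s`.
 ## Switch path
 - Before: n8n → pgcat.db.svc:6432 → OpenWrt HAProxy 192.168.9.1:5432 → Patroni leader
 - After: n8n → pgpool.db.svc:9999 → direct to the 3 Patroni nodes
 NocoDB and Outline stay on pgcat. Whether to expand will be decided after 7 days of observation.
 ## Image selection
 - kappa's original spec: `pgpool/pgpool:4.6.3 (official)`, which does not exist. The highest tag in the `pgpool/pgpool` repository is **4.4.3** (abandoned since 2024); later versions exist only in `bitnamilegacy/pgpool`
 - First attempt: Bitnami 4.6.3. Its env-based config automation re-hashed the scram-sha-256 password into pgpool's internal format, which broke backend authentication. Debugging was abandoned (per kappa's instruction)
 - Second attempt, adopted: **`pgpool/pgpool:4.4.3`** (official) + a minimal ConfigMap mount
 ## Final configuration (`helm-charts/pgpool/`)
 Managed by ArgoCD. Raw manifests live in the `pgpool/` directory of the `kaffa/helm-charts` repo. The Application manifest was bootstrapped once with `kubectl apply`.
 ### Core configuration
 `pgpool.conf` (~30 lines):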
 ```
 listen_addresses = '*'
 port = 9999
 backend_clustering_mode = 'streaming_replication'
 backend_hostname0 = '10.100.2.5'   # postgres-1
 backend_port0 = 5432
 backend_weight0 = 1
 backend_flag0 = 'ALLOW_TO_FAILOVER'
 # ... backend1=postgres-2, backend2=postgres-3 follow the same pattern
 sr_check_user = 'sr_check'
 sr_check_password = '<plaintext>'
 sr_check_database = 'postgres'
 sr_check_period = 10
 health_check_user = 'sr_check'
 health_check_password = '<plaintext>'
 health_check_period = 10
 health_check_timeout = 5
 failover_on_backend_error = off    # Patroni performs the promotion
 failover_command = ''
 follow_primary_command = ''
 use_watchdog = off
 enable_pool_hba = on
 allow_clear_text_frontend_auth = on
 load_balance_mode = off            # all queries go to the primary
 num_init_children = 32
 max_pool = 4
 connection_cache = on
 ```
 `pool_hba.conf`:
 ```
 local all all                  trust
 host  all all 0.0.0.0/0        password
 host  all all ::/0             password
 ```
 ### Authentication design essentials
 Postgres uses `password_encryption = scram-sha-256` cluster-wide, so every role password is stored as a SCRAM hash. For pgpool to authenticate to a backend with scram-sha-256, it needs the plaintext password.
 - `pool_passwd` is **not used**: when pgpool automatically converts a plaintext entry to md5, the backend SCRAM exchange rejects it, producing `failed to authenticate with backend using SCRAM`
 - `allow_clear_text_frontend_auth=on` + the pool_hba `password` method → the client sends the password in plaintext → pgpool uses that value as-is in the backend SCRAM challenge-response
 - Because this is traffic inside a K8s Service, clear text is acceptable within the cluster
 - Long term: consider adopting pgpool PGPOOLKEYFILE + AES-encrypted passwords (the approach Bitnami uses)
 ### Secret handling
 `pgpool-secrets` K8s Secret (created manually with `kubectl create`, not tracked in git):
 - `sr_check_password`: 16-byte random hex
 - `n8n_password`: `n8n` (same as pgcat)
 entrypoint.sh uses sed to substitute `__SR_CHECK_PASSWORD__` in `pgpool.conf.tmpl` with the value from the environment and renders `pgpool.conf` into an emptyDir. On promotion to regular operation, this moves to ExternalSecret + Vault.
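The rendering step described above could look roughly like the following. This is a minimal sketch, not the actual entrypoint.sh: the function name `render_pgpool_conf`, the env variable name `SR_CHECK_PASSWORD`, and the paths in the usage comment are assumptions. Only the `__SR_CHECK_PASSWORD__` placeholder and the `pgpool.conf.tmpl` → `pgpool.conf` flow come from this note.

```shell
#!/bin/sh
# Hypothetical sketch of the template-rendering step in entrypoint.sh.
# SR_CHECK_PASSWORD is an assumed env name (injected from the pgpool-secrets Secret).

render_pgpool_conf() {
  tmpl="$1"   # path to pgpool.conf.tmpl (e.g. a read-only ConfigMap mount)
  out="$2"    # path of the rendered pgpool.conf (e.g. inside an emptyDir)

  # Refuse to render a config with an empty password.
  if [ -z "${SR_CHECK_PASSWORD:-}" ]; then
    echo "SR_CHECK_PASSWORD is not set" >&2
    return 1
  fi

  # The password is random hex, so it contains no characters special to sed.
  # umask 077 keeps the rendered file readable by the owner only.
  ( umask 077 && sed "s|__SR_CHECK_PASSWORD__|${SR_CHECK_PASSWORD}|g" "$tmpl" > "$out" )
}

# Example (assumed paths):
#   render_pgpool_conf /config/pgpool.conf.tmpl /etc/pgpool/pgpool.conf
#   exec pgpool -n -f /etc/pgpool/pgpool.conf
```

A password containing `|`, `&`, or `\` would need escaping before being handed to sed; the hex format avoids that.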
 ### Creating the sr_check role in Patroni
 ```sql
 CREATE ROLE sr_check WITH LOGIN REPLICATION PASSWORD '<hex>';
 GRANT pg_monitor TO sr_check;
 ```
 Executed on postgres-2 (the leader at the time); the role propagated automatically to postgres-1/postgres-3 via async streaming. pg_hba already had `host all all 0.0.0.0/0 md5`, so no further change was needed.
 ## Verification scenarios
 ### 1. Patroni switchover (postgres-3 → postgres-2)
 ```
 T0           T1(+4.0s)             T1+2s
 switchover → done                  routing to new primary established
 ```
 - n8n log: `failed to create a backend 2 connection` × 1 → `Database connection recovered`
 - pgpool `SHOW POOL_NODES`: `last_status_change` was updated at the switchover time, and `role=primary` moved to the new node
 - n8n.inouter.com stayed at 200
 - **Automatic recovery in ~2 s**
 For comparison: in the same scenario under pgcat, the client-side pg pool reused idle sockets and went zombie → a pod restart was required (2026-04-15 19:53 UTC incident, 1038 errors).
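The `SHOW POOL_NODES` check used above can be reduced to one summary line per backend. The helper below is only a sketch: the function name `pool_nodes_summary` is made up, and the connection parameters in the usage comment (user and database `n8n`) are assumptions. It reads psql's expanded unaligned output and picks fields by name, so it does not depend on column positions.

```shell
# Read `psql -x -A -t -c 'SHOW POOL_NODES'` output on stdin and print
# "hostname role status" for each backend node.
pool_nodes_summary() {
  awk -F'|' '
    function flush() {
      if (host != "") print host, role, status
      host = ""; role = ""; status = ""
    }
    $1 == "hostname" { host = $2 }
    $1 == "role"     { role = $2 }     # exact match, so pg_role is ignored
    $1 == "status"   { status = $2 }   # exact match, so pg_status is ignored
    NF == 0          { flush() }       # blank line separates records
    END              { flush() }
  '
}

# Example (assumed credentials and database):
#   PGPASSWORD=... psql -h pgpool.db.svc -p 9999 -U n8n -d n8n \
#     -x -A -t -c 'SHOW POOL_NODES' | pool_nodes_summary
```

Running this before and after a switchover should show the `primary` role moving from one hostname to another.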
 ### 2. mbp etcd stopped for 60 s
 ```
 T0           T+30s mid-outage write    T+60s            T+70s
 mbp stop  →  SELECT OK, write OK    →  mbp start     →  check: writes still OK
 ```
 - n8n errors: **0**, HTTP 200 maintained
 - pgpool watches the Patroni-managed backends directly. The etcd quorum held with 2 nodes (NAS + jp1), so the Patroni leader did not change and the pgpool path was unaffected
 For comparison: on the pgcat path, a 56-second mbp etcd outage caused an n8n `Database connection timed out` cascade → 503 → mandatory pod restart (this afternoon's incident).
 ## Deployment artifacts
 - helm-charts directory: `pgpool/` (`configmap.yaml`, `deployment.yaml`, `service.yaml`, `pdb.yaml`)
 - ArgoCD Application: `pgpool` (namespace `argocd`, source path `pgpool`, selfHeal + prune)
 - K8s resources: namespace `db` (coexisting with pgcat)
 ## Rollback
 In the n8n ConfigMap `n8n-app-config`, revert `DB_POSTGRESDB_HOST` from `pgpool.db.svc.cluster.local:9999` to `pgcat.db.svc.cluster.local:6432`, then run a rollout restart. Takes ~30 seconds.
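As a rough illustration, the rollback could be scripted as below. The namespace and Deployment name for n8n (`n8n` / `n8n`) and the helper names are assumptions, not taken from this note; the ConfigMap name, key, and value are copied from the paragraph above. If the deployment keeps the port in a separate key, the value would need to be split accordingly.

```shell
# Assumptions: n8n runs as Deployment "n8n" in namespace "n8n"; override via env.
N8N_NAMESPACE="${N8N_NAMESPACE:-n8n}"
N8N_DEPLOYMENT="${N8N_DEPLOYMENT:-n8n}"

# Build the JSON merge patch that sets DB_POSTGRESDB_HOST to the given value.
db_host_patch() {
  printf '{"data":{"DB_POSTGRESDB_HOST":"%s"}}' "$1"
}

# Point n8n back at pgcat, restart it, and wait for the rollout to finish.
rollback_to_pgcat() {
  kubectl -n "$N8N_NAMESPACE" patch configmap n8n-app-config \
    --type merge -p "$(db_host_patch 'pgcat.db.svc.cluster.local:6432')" &&
  kubectl -n "$N8N_NAMESPACE" rollout restart "deployment/$N8N_DEPLOYMENT" &&
  kubectl -n "$N8N_NAMESPACE" rollout status "deployment/$N8N_DEPLOYMENT" --timeout=120s
}
```

If the ConfigMap is itself reconciled from git with self-heal enabled, a direct `kubectl patch` may be reverted, and the same change would go through the repo instead.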
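 ## 7-day observation plan (expires 2026-04-23)
 - Whether the n8n error window stays <5 s when a natural Patroni failover occurs
 - Weekly sampling of pgpool `SHOW STATS` / `SHOW POOL_NODES`
 - Daily n8n DB error count (target <10/day outside failovers)
 - pgpool resource usage (num_init_children=32 × max_pool=4 = at most 128 backend connections per pod; × 2 pods = 256 total. The current average idle level of n8n still needs to be checked)
 If observations are positive, migrate in stages: NocoDB → pgpool, then Outline → pgpool. If negative, tear down pgpool and return to pgcat.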
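The connection ceiling in the resource-usage item can be checked with a few lines of shell arithmetic. pgpool's documentation gives the sizing rule `max_pool × num_init_children <= max_connections - superuser_reserved_connections`; the sketch below multiplies the left side by the pod count. The function names are made up, and `MAX_CONNECTIONS` and `RESERVED` are placeholder values, not settings read from this cluster.

```shell
# Upper bound on backend connections: num_init_children * max_pool * pods.
pgpool_conn_ceiling() {
  echo $(( $1 * $2 * $3 ))
}

# Succeeds if the ceiling fits under max_connections minus reserved superuser slots.
fits_backend_limit() {
  [ "$1" -le $(( $2 - $3 )) ]
}

ceiling=$(pgpool_conn_ceiling 32 4 2)   # 32 * 4 * 2 = 256
echo "pgpool connection ceiling: $ceiling"

# Placeholder values; replace with the real Postgres settings.
MAX_CONNECTIONS="${MAX_CONNECTIONS:-300}"
RESERVED="${RESERVED:-3}"
if fits_backend_limit "$ceiling" "$MAX_CONNECTIONS" "$RESERVED"; then
  echo "within backend limit"
else
  echo "exceeds backend limit"
fi
```

Moving NocoDB and Outline onto pgpool would add load against the same ceiling, so this check is worth rerunning before each stage.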
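 ## References
 - helm-charts commit series: a6f5991 (initial) → e1dcd6d (switch to bitnami) → 213babb (scram) → 13dfae1 (port fix) → d3dde47 (detach_false_primary) → bc6faae (pivot to official image) → 74ca477 (remove pool_passwd) → **9bc3a24 (clear-text auth, final)**
 - Prior investigation: n8n pool zombie `230ec530-b8a6-406c-9165-35c9eb2d8282`
 - pgcat follow-up: TCP keepalive `129fbf50-e69b-47fd-ad55-3f5ff9066caf`
 - Raising retry_timeout → n8n 503 incident during an mbp etcd hiccup (this afternoon) → direct motivation for the pgpool PoC
 ## Unresolved / to share with Syn
 - `pgpool/pgpool` has been abandoned since 4.4.3, so the future of the official pgpool image is uncertain. Alternatives: keep using Bitnami legacy, or build a custom image ourselves
 - Plaintext workaround that bypasses pool_passwd: the recommended approach for pgpool backend auth against scram-sha-256 backends is AES encryption. Hardening is needed after the 1-week observation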
--- a/infra/postgresql-ha.md
+++ b/infra/postgresql-ha.md
@@ -173,6 +173,30 @@ tcp_keepalives_count = 3
 Patroni failover incident history: [[../history/2026-04-08-patroni-failover-incident|2026-04-08 pgcat/nocodb/outline read-only incident]] · [[../history/2026-04-15-pgcat-ha-promotion|2026-04-15 pgcat HA promotion Step 0]]
 ## pgpool-II PoC (n8n only)
 As of 2026-04-16, **only n8n** goes through pgpool (NocoDB and Outline stay on pgcat). Whether to expand will be decided after 1 week of observation.
 ### Architecture
 ```
 n8n → pgpool.db.svc.cluster.local:9999 → direct to the 3 Patroni nodes (bypassing HAProxy)
 ```
 - Image: `pgpool/pgpool:4.4.3` (official), the latest tag in `pgpool/pgpool`. 4.5/4.6 exist only in Bitnami's `bitnamilegacy/pgpool`, which was dropped because of the complexity of its env wrapper
 - replicas=2 on kr1 + kr2 (`podAntiAffinity` topologyKey hostname), PDB `minAvailable: 1`
 - Auth: `allow_clear_text_frontend_auth=on` + `pool_hba.conf` method `password`. The client sends plaintext → pgpool uses it as-is in the backend scram-sha-256 challenge-response. `pool_passwd` is not used (pgpool automatically hashes plaintext entries to md5, which postgres rejects → authentication failure)
 - sr_check/health_check: `sr_check_user=sr_check` (REPLICATION + pg_monitor); the plaintext password is injected by the entrypoint via sed
 - `backend_clustering_mode = streaming_replication` (works alongside Patroni). `failover_on_backend_error = off`: Patroni performs the promotion, and pgpool only detects the role change and updates routing
 - `load_balance_mode = off`: all queries go to the primary (read-your-writes consistency for n8n)
 ### Verification results (2026-04-16)
 - **Patroni switchover** (postgres-3→postgres-2, TL9→10): n8n error window ~2 s, with 1 occurrence of `failed to create a backend 2 connection` → immediately followed by `Database connection recovered`. In pgpool `SHOW POOL_NODES`, `last_status_change` was updated at the switchover time, confirming automatic re-detection of the primary
 - **mbp etcd down for 60 s**: n8n errors **0**, n8n.inouter.com stayed at 200. Under pgcat, the same scenario caused n8n pool zombies → 503 / pod restart required
 Detailed configuration and verification logs: [[../history/2026-04-16-pgpool-n8n-poc|history]]
 ## APISIX etcd usage
 | Site | etcd | prefix | Notes |