Skuber Alert Configuration Guide (PromQL)

Overview

Skuber supports PromQL-based alerts. Alert rules that you used with Prometheus Alertmanager can be migrated to Skuber Alert.

⚠️ Important: Skuber PromQL Syntax

Prometheus vs. Skuber Differences

Skuber uses OpenTelemetry metrics, whose names are stored in dot notation (.).
To query a dot-notation metric in PromQL, use the {__name__="metric.name"} syntax.

Environment	Metric name	PromQL query
Prometheus	`kube_pod_container_status_restarts_total`	`kube_pod_container_status_restarts_total > 3`
Skuber	`k8s.container.restarts`	`{__name__="k8s.container.restarts"}`

Threshold Configuration

Important: In Skuber, do not include the threshold condition in the query!

❌ Wrong: {__name__="k8s.container.restarts"} > 3
✅ Correct:
   - Query: {__name__="k8s.container.restarts"}
   - Alert Threshold (set in the UI): 3

Set the threshold separately in the "Alert Threshold" field of the Skuber Alert UI.
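The split between query and threshold can be sketched in Python as follows. This is a minimal illustration, not part of Skuber: `split_expression`, `SkuberAlert`, and the regular expression are hypothetical helpers, and the sketch assumes the Prometheus expression is a bare metric name, a single comparison operator, and a number. The Skuber metric name must be supplied by hand from the mapping table.

```python
import re
from typing import NamedTuple


class SkuberAlert(NamedTuple):
    query: str        # selector to paste into the Query field
    condition: str    # label for the Condition dropdown
    threshold: float  # value for the Alert Threshold field


# Comparison operators mapped to the Condition names used in this guide.
CONDITIONS = {">": "Above threshold", "<": "Below threshold", "==": "Equals"}

# Hypothetical pattern: a bare metric name, one operator, one number.
_EXPR = re.compile(r"^\s*([A-Za-z_:][\w:.]*)\s*(==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def split_expression(expr: str, skuber_metric: str) -> SkuberAlert:
    """Split 'metric > 3' into a threshold-free query plus UI settings."""
    match = _EXPR.match(expr)
    if match is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    _, operator, value = match.groups()
    query = f'{{__name__="{skuber_metric}"}}'
    return SkuberAlert(query, CONDITIONS[operator], float(value))


alert = split_expression(
    "kube_pod_container_status_restarts_total > 3",
    "k8s.container.restarts",
)
print(alert.query)
print(alert.condition, alert.threshold)
```

Expressions with label matchers, functions, or operators such as `>=` are rejected by this sketch and need to be converted manually.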

1. Pod Restart Alert

PromQL Query

{__name__="k8s.container.restarts"}

Severity-Specific Thresholds

Severity	Query	Threshold (set in UI)	Description
Warning	`{__name__="k8s.container.restarts"}`	1	1 or more restarts
Critical	`{__name__="k8s.container.restarts"}`	5	5 or more restarts

Namespace Filtering (Optional)

{__name__="k8s.container.restarts", k8s_namespace_name="production"}
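When many alerts share the same shape, the selector string can be generated rather than typed. The following Python sketch uses a hypothetical helper, `build_selector`, which is not a Skuber API. It assumes label values contain no double quotes or backslashes, because it does no escaping.

```python
def build_selector(metric: str, **labels: str) -> str:
    """Build a {__name__="..."} selector with optional label matchers."""
    matchers = [f'__name__="{metric}"']
    # Label matchers are appended in the order they were passed in.
    matchers += [f'{key}="{value}"' for key, value in labels.items()]
    return "{" + ", ".join(matchers) + "}"


print(build_selector("k8s.container.restarts"))
print(build_selector("k8s.container.restarts", k8s_namespace_name="production"))
```

The second call produces the same namespace-filtered query shown above.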

Prometheus → Skuber Conversion Example

Prometheus Alertmanager	Skuber
`kube_pod_container_status_restarts_total > 3`	Query: `{__name__="k8s.container.restarts"}` + Threshold: 3

Example Alert Message

Title: Pod restart detected
Description: A container has restarted. Check the namespace and the Pod status.

2. Pod Pending Alert

PromQL Query

{__name__="k8s.pod.phase"}

Note: k8s.pod.phase value mapping

1: Pending
2: Running
3: Succeeded
4: Failed
5: Unknown
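For scripts that post-process query results, the mapping above can be kept as an enumeration. This is an illustrative Python sketch: `PodPhase` and `phase_name` are hypothetical names, and the numeric values are taken directly from the list above.

```python
from enum import IntEnum


class PodPhase(IntEnum):
    """Numeric values reported by k8s.pod.phase, as listed in this guide."""
    PENDING = 1
    RUNNING = 2
    SUCCEEDED = 3
    FAILED = 4
    UNKNOWN = 5


def phase_name(value: float) -> str:
    """Translate a sample value into its phase name."""
    return PodPhase(int(value)).name.capitalize()


print(phase_name(1))
print(phase_name(4))
```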

Pending Detection Settings

Severity	Query	Threshold (set in UI)	Condition
Warning	`{__name__="k8s.pod.phase"}`	1	Equals (==)

Prometheus → Skuber Conversion Example

Prometheus Alertmanager	Skuber
`kube_pod_status_phase{phase="Pending"} == 1`	Query: `{__name__="k8s.pod.phase"}` + Threshold: 1 + Condition: Equals

Example Alert Message

Title: Pod stuck in Pending
Description: The Pod is in the Pending state. Check for insufficient resources or scheduling problems.

3. Volume (PVC) Alert

Available Capacity Query

{__name__="k8s.volume.available"}

Total Capacity Query

{__name__="k8s.volume.capacity"}

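Severity-Specific Settings

Note: If you need a ratio, either configure two separate alerts or
use the Formula feature, if Skuber supports it.

Severity	Monitoring method	Description
Warning	`{__name__="k8s.volume.available"}` + Threshold (low value)	Available capacity at or below a given number of bytes
Critical	`{__name__="k8s.volume.available"}` + Threshold (lower value)	Available capacity critically low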
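Because the threshold for `k8s.volume.available` is an absolute byte count, a percentage target has to be converted per volume. The Python sketch below shows that arithmetic. It assumes the volume capacity is known in advance; the helper name and the 15% and 5% targets are illustrative choices, not values from Skuber.

```python
GIB = 1024 ** 3


def available_threshold_bytes(capacity_bytes: int, min_free_percent: int) -> int:
    """Return the byte count that corresponds to a minimum free percentage."""
    if not 0 < min_free_percent < 100:
        raise ValueError("min_free_percent must be between 1 and 99")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return capacity_bytes * min_free_percent // 100


capacity = 100 * GIB  # hypothetical 100 GiB volume
warning = available_threshold_bytes(capacity, 15)
critical = available_threshold_bytes(capacity, 5)
print(warning, critical)
```

The resulting numbers go into the Alert Threshold field with a Below condition. Volumes of different sizes need separate alerts under this approach.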

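Example Alert Message

Title: PVC low-capacity warning
Description: The volume is running low on available capacity. Consider expanding it.

4. Conntrack Alert

⚠️ Not currently collected

To collect conntrack metrics, the hostmetrics receiver of the OTel Collector needs the following configuration: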
receivers:
  hostmetrics:
    scrapers:
      network:
        include:
          interfaces: [".*"]
        metrics:
          system.network.conntrack.count:
            enabled: true
          system.network.conntrack.max:
            enabled: true

5. Node NotReady Alert

PromQL Query

{__name__="k8s.node.condition_ready"}

Settings

Severity	Query	Threshold (set in UI)	Condition
Critical	`{__name__="k8s.node.condition_ready"}`	0	Equals (==)

Example Alert Message

Title: Node NotReady
Description: The node is in the NotReady state. Check it immediately.

6. Node Memory Alert

PromQL Query

{__name__="system.memory.utilization"}

Severity-Specific Settings

Severity	Query	Threshold (set in UI)	Description
Warning	`{__name__="system.memory.utilization"}`	0.85	Memory above 85%
Critical	`{__name__="system.memory.utilization"}`	0.95	Memory above 95%

Example Alert Message

Title: Node memory low
Description: Memory utilization on the node is high.

7. Node CPU Alert

PromQL Query

{__name__="system.cpu.utilization"}

Severity-Specific Settings

Severity	Query	Threshold (set in UI)	Description
Warning	`{__name__="system.cpu.utilization"}`	0.8	CPU above 80%
Critical	`{__name__="system.cpu.utilization"}`	0.95	CPU above 95%

Example Alert Message

Title: Node CPU overload
Description: CPU utilization on the node is high.

8. Node Disk Alert

PromQL Query

{__name__="system.filesystem.utilization"}

Severity-Specific Settings

Severity	Query	Threshold (set in UI)	Description
Warning	`{__name__="system.filesystem.utilization"}`	0.85	Disk above 85%
Critical	`{__name__="system.filesystem.utilization"}`	0.95	Disk above 95%

Example Alert Message

Title: Node disk low
Description: Disk utilization on the node is high. Cleanup is required.

9. Deployment Replica Alert

Available Replicas Query

{__name__="kube_deployment_status_replicas_available"}

Desired Replicas Query

{__name__="kube_deployment_spec_replicas"}

Example Alert Message

Title: Deployment replica mismatch
Description: The Deployment's available replicas do not match its desired replicas.
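This alert compares two metrics rather than one metric against a fixed threshold. The comparison can be sketched in Python as below, assuming the results of the two queries have already been fetched into dictionaries keyed by Deployment name. The function name and the sample data are hypothetical.

```python
from typing import Dict, Tuple


def find_mismatches(
    available: Dict[str, int], desired: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return {deployment: (available, desired)} where the two values differ."""
    return {
        name: (available.get(name, 0), want)
        for name, want in desired.items()
        # A Deployment missing from `available` is treated as 0 replicas.
        if available.get(name, 0) != want
    }


# Hypothetical sample values for illustration only.
desired = {"api": 3, "worker": 2, "cron": 1}
available = {"api": 3, "worker": 1}
print(find_mismatches(available, desired))
```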

Creating a Skuber Alert

Step 1: Open the Alert Menu

Skuber UI → Alerts → + New Alert

Step 2: Select the Alert Type

Select Metric Based Alert

Step 3: Configure the Query

Click the Query Builder tab
Select PromQL mode
Enter the query in the form {__name__="metric.name"}

Step 4: Configure the Condition

Setting	Value	Description
Condition	Above threshold / Below threshold / Equals	Condition type
Alert Threshold	(numeric value)	Enter the threshold value here
For	5m	Fires after the condition holds for 5 minutes

Key point: Set the threshold condition in the Alert Threshold field, not in the query!
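How Condition, Alert Threshold, and For interact can be illustrated with a small Python model. This is a sketch of the general "condition must hold continuously for a duration" idea and is not Skuber's actual evaluation logic, which may differ in details such as evaluation interval and missing-data handling. The function and variable names are hypothetical.

```python
import operator
from datetime import datetime, timedelta
from typing import List, Tuple

# Condition names as they appear in this guide, mapped to comparisons.
COMPARATORS = {
    "Above threshold": operator.gt,
    "Below threshold": operator.lt,
    "Equals": operator.eq,
}


def fires(
    samples: List[Tuple[datetime, float]],
    condition: str,
    threshold: float,
    for_duration: timedelta,
) -> bool:
    """Return True if the condition has held without interruption for at
    least `for_duration`, measured up to the most recent sample.
    `samples` must be sorted by timestamp."""
    compare = COMPARATORS[condition]
    breach_start = None
    for timestamp, value in samples:
        if compare(value, threshold):
            if breach_start is None:
                breach_start = timestamp  # start of the current breach
        else:
            breach_start = None  # condition cleared, reset the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_duration


start = datetime(2024, 1, 1, 12, 0)
steady = [(start + timedelta(minutes=i), 0.90) for i in range(6)]
dipped = list(steady)
dipped[2] = (dipped[2][0], 0.80)  # one sample drops below the threshold

print(fires(steady, "Above threshold", 0.85, timedelta(minutes=5)))
print(fires(dipped, "Above threshold", 0.85, timedelta(minutes=5)))
```

In this model a single sample below the threshold resets the timer, which is why short spikes do not trigger an alert configured with For: 5m.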

Step 5: Enter Alert Details

Field	Example
Alert Name	Pod Restart Alert
Severity	Warning / Critical
Labels	team=platform, env=production

Step 6: Configure the Notification Channel

Supported channels:

Slack
PagerDuty
Email
Webhook
MS Teams
Opsgenie

Step 7: Save

Click Save Alert

Alert Summary

Alert	PromQL Query	Threshold (UI)	Condition	Severity	For
Pod restart (warning)	`{__name__="k8s.container.restarts"}`	1	Above	Warning	0m
Pod restart (critical)	`{__name__="k8s.container.restarts"}`	5	Above	Critical	0m
Pod Pending	`{__name__="k8s.pod.phase"}`	1	Equals	Warning	5m
PVC low available capacity	`{__name__="k8s.volume.available"}`	(bytes)	Below	Warning	5m
Node NotReady	`{__name__="k8s.node.condition_ready"}`	0	Equals	Critical	1m
Node Memory (warning)	`{__name__="system.memory.utilization"}`	0.85	Above	Warning	5m
Node Memory (critical)	`{__name__="system.memory.utilization"}`	0.95	Above	Critical	5m
Node CPU (warning)	`{__name__="system.cpu.utilization"}`	0.8	Above	Warning	5m
Node CPU (critical)	`{__name__="system.cpu.utilization"}`	0.95	Above	Critical	5m
Node Disk (warning)	`{__name__="system.filesystem.utilization"}`	0.85	Above	Warning	5m
Node Disk (critical)	`{__name__="system.filesystem.utilization"}`	0.95	Above	Critical	5m

Prometheus → Skuber Migration Checklist

Conversion Rules

Item	Prometheus	Skuber
Metric name	underscore (`_`)	dot notation (`.`)
PromQL syntax	`metric_name > 3`	`{__name__="metric.name"}`
Threshold	Included in the query	UI Alert Threshold field
Comparison operator	Included in the query (`>`, `==`, `<`)	UI Condition dropdown

Key Metric Mappings

Prometheus metric	Skuber metric
`kube_pod_container_status_restarts_total`	`k8s.container.restarts`
`kube_pod_status_phase`	`k8s.pod.phase`
`kubelet_volume_stats_available_bytes`	`k8s.volume.available`
`kubelet_volume_stats_capacity_bytes`	`k8s.volume.capacity`
`kube_node_status_condition`	`k8s.node.condition_ready`
`node_memory_MemAvailable_bytes`	`system.memory.utilization`
`node_cpu_seconds_total`	`system.cpu.utilization`
`node_filesystem_avail_bytes`	`system.filesystem.utilization`

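Metrics Not Collected

Metric	Status	How to collect
Conntrack	❌ Not collected	Enable the conntrack metrics in the network scraper of the OTel Collector hostmetrics receiver
PVC Pending state	❌ Not collected	Requires configuring collection of the full kube-state-metrics metric set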

Troubleshooting

"No Data" or "invalid promql query" Errors

Check the syntax: use the {__name__="metric.name"} form
Separate the threshold: do not include conditions such as > 3 in the query
Check the metric name: use dot notation (.)

❌ Wrong: k8s_container_restarts > 3
❌ Wrong: k8s.container.restarts > 3
❌ Wrong: {__name__="k8s.container.restarts"} > 3
✅ Correct: {__name__="k8s.container.restarts"}  (set the threshold in the UI)
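The checks above can be automated before pasting a query into the UI. The following Python sketch is a rough lint based only on the rules in this guide: `check_query` is a hypothetical helper, it does not parse PromQL, and it cannot tell whether a metric actually exists.

```python
import re
from typing import List


def check_query(query: str) -> List[str]:
    """Return a list of problems found in a Skuber alert query."""
    problems = []
    stripped = query.strip()
    if not stripped.startswith("{__name__="):
        problems.append('wrap the metric name in {__name__="..."}')
    # Blank out quoted strings so operators inside label values are ignored.
    outside_quotes = re.sub(r'"[^"]*"', '""', stripped)
    if re.search(r"==|!=|[<>]", outside_quotes):
        problems.append("move the comparison to the Alert Threshold field")
    return problems


for candidate in (
    "k8s_container_restarts > 3",
    "k8s.container.restarts > 3",
    '{__name__="k8s.container.restarts"} > 3',
    '{__name__="k8s.container.restarts"}',
):
    print(candidate, "->", check_query(candidate) or "ok")
```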

Metrics Cannot Be Queried

Check that kube-state-metrics is installed

kubectl get pods -n skuber-observability | grep kube-state-metrics

Check that the OTel Collector is collecting the metrics

kubectl logs -n skuber-observability -l app.kubernetes.io/name=k8s-infra | grep k8s.container

Check that the metrics exist in ClickHouse

SELECT DISTINCT metric_name
FROM signoz_metrics.time_series_v4
WHERE metric_name LIKE '%k8s%'
ORDER BY metric_name;

Alerts Do Not Fire

Check the Alert Threshold value
Check the Condition (Above/Below/Equals)
Check the For duration
Check that the Notification Channel is connected

Environment Requirements

DOT_METRICS_ENABLED Environment Variable

To use dot-notation metrics in Skuber, the DOT_METRICS_ENABLED=true environment variable must be set.

- name: DOT_METRICS_ENABLED
  value: "true"

This setting is included by default in Skuber deployments.
