Consul server unable to join Consul Cluster

Before I explain my problem, I should point out that I don't have much experience with Consul, so please be patient with me :D I need your help figuring out what is wrong with the Consul I have deployed on my Azure AKS. My infrastructure looks like this:

  • 3 servers on AKS (Consul version 1.8.4)
  • 6 clients running on VMs (Consul version 1.8.0)
  • The AKS cluster is private

Everything was working fine, but suddenly the pods started dying one after another. I redeployed Consul on AKS, and now only two of the three Consul servers are running. The third server stays in a Running state for about 30 s, then gets OOMKilled and goes into the CrashLoopBackOff state. When I run the command consul members, I get all the servers and clients; the problematic pod is shown as "left", while the others are shown as "alive". I have also tried running the command consul join {ip address}, but this gives me the following error message:

# consul join 10.0.0.135
Error joining address '10.0.0.135': Unexpected response code: 500 (1 error occurred:
    * Failed to join 10.0.0.135: dial tcp 10.0.0.135:8301: connect: connection refused

)
Failed to join any nodes.
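
A "connection refused" on port 8301 means that nothing was accepting TCP connections at that address when the join was attempted, which would be consistent with the pod being down between restarts. A minimal Python sketch for probing that port from another machine is shown here; the helper name port_open and the 2-second timeout are illustrative assumptions, not part of Consul.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling port_open("10.0.0.135", 8301) from one of the healthy servers would repeat the check that consul join failed on. Serf LAN gossip also uses UDP on 8301, which this TCP-only probe does not cover.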

I have attached the YAML file of my Consul StatefulSet and the error log from the problematic pod.

I should point out that I have been running this infrastructure for 2 months and everything was fine, with all pods healthy and running. I have been dealing with this issue for the last 3 days, searching the Internet for a way to fix it, but without result.

Could you help me figure out why this suddenly started happening, and help me resolve it?

Thanks in advance for your time,

Mike


YAML file

kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: consul-consul-server
  namespace: consul
  selfLink: /apis/apps/v1/namespaces/consul/statefulsets/consul-consul-server
  uid: ddfb4383-8545-457d-8c3a-5dc7ec04f9f2
  resourceVersion: '19294440'
  generation: 11
  creationTimestamp: '2020-10-26T10:19:25Z'
  labels:
    app: consul
    app.kubernetes.io/managed-by: Helm
    chart: consul-helm
    component: server
    heritage: Helm
    release: consul
  annotations:
    meta.helm.sh/release-name: consul
    meta.helm.sh/release-namespace: consul
spec:
  replicas: 3
  selector:
    matchLabels:
      app: consul
      chart: consul-helm
      component: server
      hasDNS: 'true'
      release: consul
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: consul
        chart: consul-helm
        component: server
        hasDNS: 'true'
        release: consul
      annotations:
        consul.hashicorp.com/config-checksum: ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356
        consul.hashicorp.com/connect-inject: 'false'
    spec:
      volumes:
        - name: config
          configMap:
            name: consul-consul-server-config
            defaultMode: 420
      containers:
        - name: consul
          image: 'consul:1.8.4'
          command:
            - /bin/sh
            - '-ec'
            - |
              CONSUL_FULLNAME="consul-consul"

              exec /bin/consul agent \
                -advertise="${HOST_IP}" \
                -bind=0.0.0.0 \
                -bootstrap-expect=3 \
                -client=0.0.0.0 \
                -config-dir=/consul/config \
                -datacenter=dc1 \
                -data-dir=/consul/data \
                -domain=consul \
                -hcl="connect { enabled = true }" \
                -ui \
                -retry-join=${CONSUL_FULLNAME}-server-0.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
                -retry-join=${CONSUL_FULLNAME}-server-1.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
                -retry-join=${CONSUL_FULLNAME}-server-2.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
                -server
          ports:
            - name: http
              hostPort: 8500
              containerPort: 8500
              protocol: TCP
            - name: serflan
              hostPort: 8301
              containerPort: 8301
              protocol: TCP
            - name: serfwan
              hostPort: 8302
              containerPort: 8302
              protocol: TCP
            - name: server
              hostPort: 8300
              containerPort: 8300
              protocol: TCP
            - name: dns-tcp
              hostPort: 8600
              containerPort: 8600
              protocol: TCP
            - name: dns-udp
              hostPort: 8600
              containerPort: 8600
              protocol: UDP
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.hostIP
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
          resources:
            limits:
              cpu: 800m
              memory: 800Mi
            requests:
              cpu: 800m
              memory: 800Mi
          volumeMounts:
            - name: data-consul
              mountPath: /consul/data
            - name: config
              mountPath: /consul/config
          readinessProbe:
            exec:
              command:
                - /bin/sh
                - '-ec'
                - |
                  curl http://127.0.0.1:8500/v1/status/leader \
                  2>/dev/null | grep -E '".+"'
            initialDelaySeconds: 5
            timeoutSeconds: 5
            periodSeconds: 3
            successThreshold: 1
            failureThreshold: 2
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - consul leave
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: consul-consul-server
      serviceAccount: consul-consul-server
      hostNetwork: true
      securityContext:
        fsGroup: 1000
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: consul
                  component: server
                  release: consul
              topologyKey: kubernetes.io/hostname
      schedulerName: default-scheduler
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: data-consul
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        volumeMode: Filesystem
      status:
        phase: Pending
  serviceName: consul-consul-server
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  revisionHistoryLimit: 10
status:
  observedGeneration: 11
  replicas: 3
  readyReplicas: 2
  currentReplicas: 3
  updatedReplicas: 3
  currentRevision: consul-consul-server-5f84b7b657
  updateRevision: consul-consul-server-5f84b7b657
  collisionCount: 0
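
One detail of the manifest above may be relevant to the DNS errors in the log that follows: the pod template sets hostNetwork: true together with dnsPolicy: ClusterFirst. The Kubernetes documentation states that a hostNetwork pod with ClusterFirst falls back to the behavior of the Default policy (the node's resolver), and that ClusterFirstWithHostNet has to be set explicitly to use cluster DNS. The log shows lookups going to 168.63.129.16:53, the Azure-provided DNS address, which fits that rule. Whether this is what broke the cluster here is an assumption. The sketch below encodes only that one rule; the function name effective_dns_policy is hypothetical.

```python
def effective_dns_policy(pod_spec: dict) -> str:
    """Return the DNS policy Kubernetes effectively applies to a pod spec.

    Encodes one documented rule: with hostNetwork enabled, "ClusterFirst"
    behaves like "Default"; "ClusterFirstWithHostNet" must be set explicitly.
    """
    # "ClusterFirst" is the API default when dnsPolicy is omitted.
    policy = pod_spec.get("dnsPolicy", "ClusterFirst")
    if pod_spec.get("hostNetwork", False) and policy == "ClusterFirst":
        return "Default"
    return policy


# Only the two relevant fields from the StatefulSet's pod template.
consul_server_spec = {"hostNetwork": True, "dnsPolicy": "ClusterFirst"}
print(effective_dns_policy(consul_server_spec))  # prints: Default
```

If this turns out to be the cause, the change to try would be dnsPolicy: ClusterFirstWithHostNet in the pod template, applied through the Helm chart values rather than by editing the StatefulSet directly, since the object is managed by Helm.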

Error log

==> Starting Consul agent...
           Version: '1.8.4'
           Node ID: '017e63cc-bedf-c4c3-2a3c-0dfc0c05594a'
         Node name: 'aks-nodepool1-12257257-vmss000002'
        Datacenter: 'dc1' (Segment: '')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 10.0.0.153 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

    2020-12-19T09:27:59.021Z [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
    2020-12-19T09:27:59.044Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
    2020-12-19T09:27:59.075Z [WARN]  agent.server.snapshot: found temporary snapshot: name=1164-491656-1605996371085.tmp
    2020-12-19T09:27:59.075Z [WARN]  agent.server.snapshot: found temporary snapshot: name=1796-1455900-1608237651961.tmp
    2020-12-19T09:27:59.084Z [WARN]  agent.server.snapshot: found temporary snapshot: name=7621-1555326-1608364051682.tmp
    2020-12-19T09:28:05.099Z [INFO]  agent.server.raft: restored from snapshot: id=7621-1538935-1608350048113
    2020-12-19T09:28:33.735Z [INFO]  agent.server.raft: initial configuration: index=1560707 servers="[{Suffrage:Voter ID:8bdce7bb-464f-19e6-7a36-c165917790a4 Address:10.0.0.173:8300} {Suffrage:Voter ID:804735ae-e812-a843-96a1-7140a17909b6 Address:10.0.0.143:8300}]"
    2020-12-19T09:28:33.735Z [INFO]  agent.server.raft: entering follower state: follower="Node at 10.0.0.153:8300 [Follower]" leader=
    2020-12-19T09:28:33.749Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: aks-nodepool1-12257257-vmss000002.dc1 10.0.0.153
    2020-12-19T09:28:33.749Z [INFO]  agent.server.serf.wan: serf: Attempting re-join to previously known node: aks-nodepool1-12257257-vmss000000.dc1: 10.0.0.173:8302
    2020-12-19T09:28:33.752Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: aks-nodepool1-12257257-vmss000000.dc1 10.0.0.173
    2020-12-19T09:28:33.752Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: aks-nodepool1-12257257-vmss000001.dc1 10.0.0.143
    2020-12-19T09:28:33.752Z [INFO]  agent.server.serf.wan: serf: Re-joined to previously known node: aks-nodepool1-12257257-vmss000000.dc1: 10.0.0.173:8302
    2020-12-19T09:28:33.764Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: aks-nodepool1-12257257-vmss000002 10.0.0.153
    2020-12-19T09:28:33.764Z [INFO]  agent.router: Initializing LAN area manager
    2020-12-19T09:28:33.764Z [INFO]  agent.server.serf.lan: serf: Attempting re-join to previously known node: mapserver-failover: 10.0.0.54:8301
    2020-12-19T09:28:33.764Z [INFO]  agent.server: Handled event for server in area: event=member-join server=aks-nodepool1-12257257-vmss000002.dc1 area=wan
    2020-12-19T09:28:33.764Z [INFO]  agent.server: Handled event for server in area: event=member-join server=aks-nodepool1-12257257-vmss000000.dc1 area=wan
    2020-12-19T09:28:33.764Z [INFO]  agent.server: Handled event for server in area: event=member-join server=aks-nodepool1-12257257-vmss000001.dc1 area=wan
    2020-12-19T09:28:33.764Z [INFO]  agent.server: Adding LAN server: server="aks-nodepool1-12257257-vmss000002 (Addr: tcp/10.0.0.153:8300) (DC: dc1)"
    2020-12-19T09:28:33.764Z [INFO]  agent.server: Raft data found, disabling bootstrap mode
    2020-12-19T09:28:33.769Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: image-failover 10.0.0.57
    2020-12-19T09:28:33.769Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: aks-nodepool1-12257257-vmss000000 10.0.0.173
    2020-12-19T09:28:33.769Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: mapserver 10.0.0.53
    2020-12-19T09:28:33.769Z [INFO]  agent.server: Adding LAN server: server="aks-nodepool1-12257257-vmss000000 (Addr: tcp/10.0.0.173:8300) (DC: dc1)"
    2020-12-19T09:28:33.769Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: image 10.0.0.56
    2020-12-19T09:28:33.770Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: mapserver-failover 10.0.0.54
    2020-12-19T09:28:33.770Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: aks-nodepool1-12257257-vmss000001 10.0.0.143
    2020-12-19T09:28:33.770Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: web-server-01 10.0.0.36
    2020-12-19T09:28:33.770Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: web-server-02 10.0.0.37
    2020-12-19T09:28:33.770Z [INFO]  agent.server: Adding LAN server: server="aks-nodepool1-12257257-vmss000001 (Addr: tcp/10.0.0.143:8300) (DC: dc1)"
    2020-12-19T09:28:33.770Z [INFO]  agent.server.serf.lan: serf: Re-joined to previously known node: mapserver-failover: 10.0.0.54:8301
    2020-12-19T09:28:33.778Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=tcp
    2020-12-19T09:28:33.778Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=udp
    2020-12-19T09:28:33.778Z [INFO]  agent: Started HTTP server: address=[::]:8500 network=tcp
    2020-12-19T09:28:33.778Z [INFO]  agent: started state syncer
==> Consul agent running!
    2020-12-19T09:28:33.779Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
    2020-12-19T09:28:33.779Z [INFO]  agent: Joining cluster...: cluster=LAN
    2020-12-19T09:28:33.779Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-consul-server-0.consul-consul-server.consul.svc, consul-consul-server-1.consul-consul-server.consul.svc, consul-consul-server-2.consul-consul-server.consul.svc]
    2020-12-19T09:28:33.932Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-consul-server-0.consul-consul-server.consul.svc: lookup consul-consul-server-0.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    2020-12-19T09:28:33.972Z [WARN]  agent.server.raft: failed to get previous log: previous-index=1561367 last-index=1561020 error="log not found"
    2020-12-19T09:28:34.101Z [INFO]  agent: Synced node info
    2020-12-19T09:28:34.114Z [INFO]  agent: Synced service: service=app_webserver_1
    2020-12-19T09:28:34.126Z [INFO]  agent: Synced service: service=administration_webserver_1
    2020-12-19T09:28:34.184Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-consul-server-1.consul-consul-server.consul.svc: lookup consul-consul-server-1.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    2020-12-19T09:28:34.184Z [INFO]  agent: Synced service: service=app_webserver_2
    2020-12-19T09:28:34.378Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-consul-server-2.consul-consul-server.consul.svc: lookup consul-consul-server-2.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    2020-12-19T09:28:34.378Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0 error="3 errors occurred:
    * Failed to resolve consul-consul-server-0.consul-consul-server.consul.svc: lookup consul-consul-server-0.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    * Failed to resolve consul-consul-server-1.consul-consul-server.consul.svc: lookup consul-consul-server-1.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    * Failed to resolve consul-consul-server-2.consul-consul-server.consul.svc: lookup consul-consul-server-2.consul-consul-server.consul.svc on 168.63.129.16:53: no such host

"
     
    2020-12-19T09:30:04.891Z [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: aks-nodepool1-12257257-vmss000002)
    2020-12-19T09:30:05.287Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=
    2020-12-19T09:30:05.927Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect web-server-02 has failed, no acks received
    2020-12-19T09:30:06.786Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: web-server-01 10.0.0.36
    2020-12-19T09:30:07.390Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:30:08.039Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:30:08.427Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:30:14.487Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:30:15.975Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:30:15.976Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:30:16.134Z [WARN]  agent.server.memberlist.lan: memberlist: Was able to connect to aks-nodepool1-12257257-vmss000000 but other probes failed, network may be misconfigured
    2020-12-19T09:30:21.533Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:30:23.290Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:30:23.426Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect image-failover has failed, no acks received
    2020-12-19T09:30:23.581Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:30:27.987Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:30:29.988Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:30:30.094Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:30:32.472Z [WARN]  agent.server.memberlist.lan: memberlist: Was able to connect to image but other probes failed, network may be misconfigured
    2020-12-19T09:30:32.534Z [INFO]  agent.server.memberlist.lan: memberlist: Marking web-server-02 as failed, suspect timeout reached (0 peer confirmations)
    2020-12-19T09:30:32.675Z [INFO]  agent.server.serf.lan: serf: EventMemberFailed: web-server-02 10.0.0.37
    2020-12-19T09:30:34.832Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:30:35.542Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-consul-server-0.consul-consul-server.consul.svc, consul-consul-server-1.consul-consul-server.consul.svc, consul-consul-server-2.consul-consul-server.consul.svc]
    2020-12-19T09:30:36.929Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:30:37.836Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:30:40.096Z [WARN]  agent.server.memberlist.lan: memberlist: Was able to connect to mapserver-failover but other probes failed, network may be misconfigured
    2020-12-19T09:30:43.635Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:30:44.534Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: web-server-02 10.0.0.37
    2020-12-19T09:30:45.086Z [INFO]  agent.server.memberlist.wan: memberlist: Suspect aks-nodepool1-12257257-vmss000000.dc1 has failed, no acks received
    2020-12-19T09:30:45.836Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:30:47.386Z [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: aks-nodepool1-12257257-vmss000002)
    2020-12-19T09:30:49.724Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-consul-server-0.consul-consul-server.consul.svc: lookup consul-consul-server-0.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    2020-12-19T09:30:50.539Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:30:51.040Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:30:51.334Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect image-failover has failed, no acks received
    2020-12-19T09:30:53.929Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:30:54.933Z [WARN]  agent.server.memberlist.wan: memberlist: Refuting a suspect message (from: aks-nodepool1-12257257-vmss000000.dc1)
    2020-12-19T09:30:55.723Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-consul-server-1.consul-consul-server.consul.svc: lookup consul-consul-server-1.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    2020-12-19T09:30:58.039Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:30:58.631Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:31:01.087Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:31:02.088Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-consul-server-2.consul-consul-server.consul.svc: lookup consul-consul-server-2.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    2020-12-19T09:31:02.534Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0 error="3 errors occurred:
    * Failed to resolve consul-consul-server-0.consul-consul-server.consul.svc: lookup consul-consul-server-0.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    * Failed to resolve consul-consul-server-1.consul-consul-server.consul.svc: lookup consul-consul-server-1.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    * Failed to resolve consul-consul-server-2.consul-consul-server.consul.svc: lookup consul-consul-server-2.consul-consul-server.consul.svc on 168.63.129.16:53: no such host
    2020-12-19T09:31:05.834Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect mapserver-failover has failed, no acks received
    2020-12-19T09:31:05.924Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:31:07.886Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:31:11.930Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:31:12.535Z [WARN]  agent.server.memberlist.wan: memberlist: Was able to connect to aks-nodepool1-12257257-vmss000000.dc1 but other probes failed, network may be misconfigured
    2020-12-19T09:31:12.544Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:31:14.679Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:31:15.384Z [WARN]  agent.server.memberlist.lan: memberlist: Was able to connect to web-server-01 but other probes failed, network may be misconfigured
    2020-12-19T09:31:18.586Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:31:19.085Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:31:21.229Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:31:22.583Z [WARN]  agent.server.memberlist.lan: memberlist: Was able to connect to aks-nodepool1-12257257-vmss000000 but other probes failed, network may be misconfigured
    2020-12-19T09:31:22.723Z [INFO]  agent: Synced check: check=service:administration_webserver_1
    2020-12-19T09:31:23.137Z [INFO]  agent: Synced check: check=service:app_webserver_1
    2020-12-19T09:31:23.285Z [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: mapserver-failover)
    2020-12-19T09:31:23.729Z [INFO]  agent: Synced check: check=service:app_webserver_2
    2020-12-19T09:31:25.229Z [WARN]  agent: Check is now critical: check=service:administration_webserver_1
    2020-12-19T09:31:25.579Z [WARN]  agent: Check is now critical: check=service:app_webserver_2
    2020-12-19T09:31:27.485Z [WARN]  agent: Check is now critical: check=service:app_webserver_1
    2020-12-19T09:31:28.675Z [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: web-server-01)
    2020-12-19T09:31:31.137Z [INFO]  agent: Synced check: check=service:administration_webserver_1
    2020-12-19T09:31:31.532Z [INFO]  agent: Synced check: check=service:app_webserver_2
    2020-12-19T09:31:31.698Z [INFO]  agent.server.fsm: snapshot created: duration=5.743467ms
    2020-12-19T09:31:32.922Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:35822: write: broken pipe"
    2020-12-19T09:31:32.927Z [WARN]  agent.server.raft: skipping application of old log: index=1561084
    2020-12-19T09:31:33.038Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:35824: write: broken pipe"
    2020-12-19T09:31:33.091Z [WARN]  agent.server.raft: skipping application of old log: index=1561084
    2020-12-19T09:31:33.171Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:35948: write: broken pipe"
    2020-12-19T09:31:33.232Z [ERROR] agent.server.raft: failed to take snapshot: error="cannot take snapshot now, wait until the configuration entry at 1560707 has been applied (have applied 1547895)"
    2020-12-19T09:31:33.292Z [WARN]  agent.server.raft: skipping application of old log: index=1561084
    2020-12-19T09:31:33.377Z [WARN]  agent.server.raft: failed to get previous log: previous-index=1561406 last-index=1561084 error="log not found"
    2020-12-19T09:31:33.378Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:36164: write: broken pipe"
    2020-12-19T09:31:33.626Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:36064: write: broken pipe"
    2020-12-19T09:31:33.627Z [INFO]  agent: Synced check: check=service:app_webserver_1
    2020-12-19T09:31:33.677Z [WARN]  agent.server.raft: skipping application of old log: index=156108
    2020-12-19T09:31:33.725Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-consul-server-0.consul-consul-server.consul.svc, consul-consul-server-1.consul-consul-server.consul.svc, consul-consul-server-2.consul-consul-server.consul.svc]
    2020-12-19T09:31:33.831Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:36234: write: broken pipe"
    2020-12-19T09:31:33.833Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:36340: write: broken pipe"
    2020-12-19T09:31:33.833Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:36444: write: broken pipe"
    2020-12-19T09:31:33.833Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:36656: write: broken pipe"
    2020-12-19T09:31:33.833Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 10.0.0.153:8300->10.0.0.143:36440: write: broken pipe"
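
A log this long is easier to read once it is grouped by level and component. The following Python sketch is one way to do that, assuming the line layout seen above (timestamp, bracketed level, component, message); the names LINE_RE, DNS_RE, and summarize are hypothetical helpers, and the two sample lines are copied from the log.

```python
import re
from collections import Counter

# Matches lines such as:
#   2020-12-19T09:27:59.021Z [WARN]  agent: bootstrap_expect > 0: ...
LINE_RE = re.compile(r"^\s*(\S+Z) \[(\w+)\]\s+([\w.]+): (.*)$")
# Matches the resolver failure reported by memberlist.
DNS_RE = re.compile(r"Failed to resolve (\S+): lookup \S+ on (\S+): no such host")


def summarize(lines):
    """Count (level, component) pairs and collect (host name, resolver) failures."""
    counts = Counter()
    unresolved = set()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # banner lines and multi-line error continuations
        _, level, component, message = match.groups()
        counts[(level, component)] += 1
        dns = DNS_RE.search(message)
        if dns:
            unresolved.add((dns.group(1), dns.group(2)))
    return counts, unresolved


sample = [
    "    2020-12-19T09:28:33.932Z [WARN]  agent.server.memberlist.lan: memberlist: "
    "Failed to resolve consul-consul-server-0.consul-consul-server.consul.svc: "
    "lookup consul-consul-server-0.consul-consul-server.consul.svc "
    "on 168.63.129.16:53: no such host",
    '    2020-12-19T09:31:32.922Z [ERROR] agent.server.raft: failed to flush response: '
    'error="write tcp 10.0.0.153:8300->10.0.0.143:35822: write: broken pipe"',
]
counts, unresolved = summarize(sample)
for key, n in counts.most_common():
    print(key, n)
for name, resolver in sorted(unresolved):
    print(name, "via", resolver)
```

Run over the full log, this would separate the three recurring themes: DNS lookups failing against 168.63.129.16:53, memberlist probe warnings, and the raft errors towards 10.0.0.143.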

Asked 3 years, 4 months, 29 days ago - by debugduke


2 Answers:

  • Based on the information provided, it seems that the Consul server running on Azure AKS is having trouble joining the Consul cluster. The main reason appears to be that the Consul server cannot resolve the names of the other cluster members because of DNS resolution problems.

    Here are some key points to address when trying to solve the problem:

    • Host name resolution problem: The errors indicate that the Consul server cannot resolve the host names of the other cluster members. This could be related to the DNS configuration in the AKS environment.
    • DNS configuration in AKS: Make sure that DNS in your AKS environment is configured correctly. You may need to review the DNS configuration and confirm that the host names of the Consul cluster nodes resolve correctly.
    • Consul cluster configuration: Review the Consul cluster configuration to make sure that the host names and IP addresses in the node configuration are correct and consistent with each other.
    • Retry Join configuration: Review the Retry Join settings in the Consul configuration to make sure that the host names are defined correctly and can be resolved. You may need to update these settings so that the nodes join the cluster correctly.
    • Security configuration: Make sure that no security settings are blocking communication between the Consul cluster nodes.

    In general, I would recommend reviewing the DNS configuration in your AKS environment, as well as the Consul configuration, to make sure the nodes can communicate and join the cluster properly. If the problem persists, you can also ask Azure support or the Consul community for more specific help with this problem.
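
    The name resolution check described in the points above can be sketched in Python as follows. The function names resolve and retry_join_names are hypothetical; the name pattern is copied from the -retry-join flags in the question's StatefulSet. The result is only meaningful when the code runs with the same resolver the Consul server uses, for example inside the pod or on the node.

```python
import socket
from typing import List


def resolve(name: str) -> List[str]:
    """Return the sorted IP addresses for name, or an empty list if lookup fails."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def retry_join_names(fullname: str, namespace: str, replicas: int) -> List[str]:
    """Build the host names used by the -retry-join flags in the StatefulSet."""
    return [
        f"{fullname}-server-{i}.{fullname}-server.{namespace}.svc"
        for i in range(replicas)
    ]
```

    Looping over retry_join_names("consul-consul", "consul", 3) and calling resolve on each name shows which of the three server names, if any, the current resolver can answer. An empty list for all three would match the "no such host" warnings in the log.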

    Answered on Dec 20, 2020 at 21:08 - by Gpt

    Upvotes: 0 | Downvotes: 0

  • The problem is that the host names of the other Consul servers cannot be resolved.

    Solution:

    • Verify that the Consul servers can resolve each other's host names.
    • Make sure the Consul servers can communicate with each other on ports 8301 and 8302.
    • Try restarting the Consul servers.

    Here are some more specific steps you can follow:

    • Verify DNS resolution: Run the following command on each Consul server:
    
    nslookup consul-consul-server-0.consul-consul-server.consul.svc
    

    It should return the correct IP address of the Consul server.

    • Verify network communication: Run the following command on each Consul server:
    
    nc -vz consul-consul-server-0.consul-consul-server.consul.svc 8301
    

    It should connect successfully to the other Consul server.

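    • Restart the Consul servers: On a host where Consul runs as a systemd service, run the following command. The servers in the question run as pods in a StatefulSet, so there the equivalent is to restart the pods (for example with kubectl rollout restart statefulset consul-consul-server -n consul):
    
    systemctl restart consul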
    

    After following these steps, try joining the Consul servers again. If the problem persists, you can consult the official Consul documentation for more troubleshooting information:

    https://www.consul.io/docs/agent/join-cluster.html#resolving-issues

    Answered on Dec 20, 2020 at 21:17 - by Gemini

    Upvotes: 0 | Downvotes: 0