<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>绿色记忆 &#187; Cloud</title>
	<atom:link href="https://blog.gmem.cc/category/work/cloud/feed" rel="self" type="application/rss+xml" />
	<link>https://blog.gmem.cc</link>
	<description></description>
	<lastBuildDate>Tue, 12 May 2026 13:32:40 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.9.14</generator>
	<item>
		<title>DevPod on Kubernetes: turning devcontainer.json into a persistent remote workspace</title>
		<link>https://blog.gmem.cc/devpod</link>
		<comments>https://blog.gmem.cc/devpod#comments</comments>
		<pubDate>Fri, 10 Apr 2026 07:22:22 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[Cloud]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=42107</guid>
		<description><![CDATA[<p>DevPod is an open source workspace manager for reproducible development environments across Docker, Kubernetes, SSH hosts, and several cloud backends. This note <a class="read-more" href="https://blog.gmem.cc/devpod">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/devpod">DevPod on Kubernetes: turning devcontainer.json into a persistent remote workspace</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><p>DevPod is an open source workspace manager for reproducible development environments across Docker, Kubernetes, SSH hosts, and several cloud backends. This note documents a full Kubernetes-based remote development setup with DevPod, including persistent volume strategy, custom images, file sync, IDE integration, and the GPU issues that tend to burn the most time.</p>
<div class="blog_h1"><span class="graybg">What DevPod is</span></div>
<p>DevPod, from Loft Labs, separates environment definition from the infrastructure that runs it. The developer describes the environment in <pre class="crayon-plain-tag">devcontainer.json</pre>, including the base image, toolchain, ports, and lifecycle hooks. DevPod then creates and manages the matching workspace on the selected <span style="background-color: #c0c0c0;">Provider</span>.</p>
<p>Three terms matter more than anything else:</p>
<ul>
<li>Provider: the infrastructure backend. DevPod supports Docker, Kubernetes, SSH, and several cloud platforms.</li>
<li>Workspace: an isolated development environment instance, usually backed by a container or VM on the provider.</li>
<li><pre class="crayon-plain-tag">devcontainer.json</pre>: a Dev Container specification file that defines the image, lifecycle hooks, port forwarding, and editor behavior.</li>
</ul>
<p>Compared with GitHub Codespaces or Gitpod, DevPod is a client-side tool rather than a hosted platform. On a self-managed Kubernetes cluster, that means you keep control over networking, storage, security policy, and node placement.</p>
<div class="blog_h1"><span class="graybg">Kubernetes provider architecture</span></div>
<p>When Kubernetes is the provider, DevPod creates a Pod to host the workspace. Most setups end up with three files:</p>
<ol>
<li><pre class="crayon-plain-tag">devcontainer.json</pre>, which defines the image, workspace directory, forwarded ports, and lifecycle commands.</li>
<li><pre class="crayon-plain-tag">pod-manifest.yaml</pre>, which carries the Kubernetes-native parts such as security context, resource limits, and volume mounts.</li>
<li>An orchestration script such as <pre class="crayon-plain-tag">devpod.sh</pre>, which wraps <pre class="crayon-plain-tag">devpod up</pre>, file sync, and environment bootstrap. That script is usually the glue that makes the workflow tolerable.</li>
</ol>
<div class="blog_h2"><span class="graybg">Workspace lifecycle</span></div>
<p>A typical flow looks like this:</p>
<pre class="crayon-plain-tag"># Create and start the workspace, which creates a Pod on Kubernetes
devpod up . --ide none --provider kubernetes

# Sync local source code to the remote workspace
rsync -az --exclude='node_modules' ./project/ remote:/workspace/project/

# Enter the development environment
devpod ssh my-workspace

# Stop the workspace, which removes the Pod but keeps the PVC
devpod stop my-workspace

# Delete everything, including the Pod and the PVC
devpod delete my-workspace</pre>
<p>What matters is how <pre class="crayon-plain-tag">devpod stop</pre> behaves. It removes the Pod but keeps the PVC. The next <pre class="crayon-plain-tag">devpod up</pre> recreates the Pod and reattaches the same volume, so the data survives Pod recreation.</p>
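<p>A quick way to see this from the cluster side is to check the namespace before and after a stop. This is only a sketch; the workspace name, namespace, and label are the ones used elsewhere in this note and may differ in your setup:</p>
<pre class="crayon-plain-tag"># Stop the workspace, then confirm the Pod is gone while the PVC remains
devpod stop my-workspace
kubectl get pods -n devpod -l app=devpod   # should return nothing
kubectl get pvc -n devpod                  # the workspace volume should still be listed</pre>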
<div class="blog_h2"><span class="graybg">Managing multiple environments</span></div>
<p>The simplest way to split environments is to keep a separate Pod manifest for each one and switch them in a wrapper script:</p>
<pre class="crayon-plain-tag"># Example orchestration logic: select a manifest and disk size by environment
case "$ENV" in
  prod) MANIFEST="pod-manifest.yaml";      DISK="300Gi" ;;
  dev)  MANIFEST="pod-manifest-dev.yaml";  DISK="50Gi"  ;;
  test) MANIFEST="pod-manifest-test.yaml"; DISK="500Gi" ;;
esac

devpod up . --ide none \
  --provider kubernetes \
  --provider-option DISK_SIZE="$DISK" \
  --provider-option POD_MANIFEST="$MANIFEST"</pre>
<p>This lets each environment define its own node selectors, quotas, and security policy while still sharing one <pre class="crayon-plain-tag">devcontainer.json</pre> and one base image.</p>
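<p>As a sketch of what the per-environment manifests might differ on, the dev manifest could pin workspaces to a smaller node pool with tighter limits. The node label and resource values below are illustrative, not taken from a real cluster:</p>
<pre class="crayon-plain-tag"># pod-manifest-dev.yaml (illustrative fragment)
spec:
  nodeSelector:
    node-pool: dev
  containers:
    - name: devpod
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "4"
          memory: "16Gi"</pre>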
<div class="blog_h1"><span class="graybg">Persistent volume strategy</span></div>
<p>Where you mount the PVC decides what survives a Pod rebuild.</p>
<div class="blog_h2"><span class="graybg">Recommended: mount the PVC at $HOME</span></div>
<p>Mount the PVC at the container's <pre class="crayon-plain-tag">$HOME</pre>, for example <pre class="crayon-plain-tag">/root</pre>. In most setups, that is the least painful option. There are a few reasons to prefer it:</p>
<ul>
<li>The IDE server side, such as VS Code Server or Cursor Server, installs itself under <pre class="crayon-plain-tag">~/.vscode-server</pre> or <pre class="crayon-plain-tag">~/.cursor-server</pre>. Those directories land on persistent storage automatically.</li>
<li>Toolchain state such as <pre class="crayon-plain-tag">~/.nvm</pre> and <pre class="crayon-plain-tag">~/.local/bin</pre> does not need extra symlink work.</li>
<li>Shell files such as <pre class="crayon-plain-tag">~/.bashrc</pre> also persist, so environment setup happens once instead of on every Pod restart.</li>
</ul>
<p>If the PVC is mounted somewhere else, such as <pre class="crayon-plain-tag">/workspace</pre>, you usually end up adding symlinks or reinstalling tooling whenever the Pod comes back.</p>
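<p>If you are stuck with a non-$HOME mount, a hedged workaround is to relink the directories the IDE and toolchain expect. The directory list below is an assumption; adjust it to whatever actually needs to survive:</p>
<pre class="crayon-plain-tag"># Sketch: PVC mounted at /workspace instead of $HOME
for d in .cursor-server .vscode-server .nvm .local .config; do
  mkdir -p "/workspace/$d"
  ln -sfn "/workspace/$d" "$HOME/$d"   # rerun on each Pod start, e.g. from postStartCommand
done</pre>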
<div class="blog_h2"><span class="graybg">Example directory layout</span></div>
<pre class="crayon-plain-tag">/root/                          # PVC mount point = $HOME
├── .cursor-server/             # IDE server and extensions, persistent
│   ├── cli/                    # Server binaries, disposable
│   └── extensions/             # Installed extensions, keep these
├── .nvm/                       # Node.js version manager, persistent
├── .local/bin/                 # kubectl and other tools, persistent
├── .bashrc                     # Shell configuration, persistent
├── Projects/
│   ├── my-project/             # Project source code
│   └── shared-libs/            # Shared libraries
└── .config/                    # Tool configuration</pre>
<div class="blog_h1"><span class="graybg">Common commands</span></div>
<p>DevPod manages the whole workspace lifecycle through the <pre class="crayon-plain-tag">devpod</pre> CLI. These are the commands that tend to matter in daily use.</p>
<div class="blog_h2"><span class="graybg">Provider management</span></div>
<p>Add and configure the provider first:</p>
<pre class="crayon-plain-tag"># Add the Kubernetes provider
devpod provider add kubernetes

# List configured providers
devpod provider list

# Set provider options such as the namespace and Pod manifest path
devpod provider set-options kubernetes \
  --option KUBERNETES_NAMESPACE=devpod \
  --option POD_MANIFEST=pod-manifest.yaml</pre>
<div class="blog_h2"><span class="graybg">Workspace lifecycle</span></div>
<pre class="crayon-plain-tag"># Create and start a workspace
# --ide none skips automatic IDE attach and works well in scripts
devpod up . --provider kubernetes --ide none

# List workspace state
devpod list

# SSH into the workspace
devpod ssh my-workspace

# Stop the workspace, which removes the Pod and keeps the PVC
devpod stop my-workspace

# Delete the workspace and the PVC
devpod delete my-workspace</pre>
<p><pre class="crayon-plain-tag">stop</pre> only removes the Pod. Everything on the PVC, including extensions, toolchain state, and source code, stays in place. The next <pre class="crayon-plain-tag">up</pre> recreates the Pod and reattaches the volume, so the environment comes back quickly.</p>
<div class="blog_h2"><span class="graybg">Useful provider options</span></div>
<p>The Kubernetes provider accepts extra parameters through <pre class="crayon-plain-tag">--provider-option</pre>:</p>
<pre class="crayon-plain-tag">devpod up . --provider kubernetes --ide none \
  --provider-option DISK_SIZE=100Gi \
  --provider-option POD_MANIFEST=pod-manifest-test.yaml \
  --provider-option KUBERNETES_NAMESPACE=devpod</pre>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 35%;">Option</td>
<td>Description</td>
</tr>
</thead>
<tbody>
<tr>
<td>DISK_SIZE</td>
<td>PVC size, for example 50Gi or 300Gi.</td>
</tr>
<tr>
<td>POD_MANIFEST</td>
<td>Path to the custom Pod manifest.</td>
</tr>
<tr>
<td>KUBERNETES_NAMESPACE</td>
<td>Target namespace for workspace Pods.</td>
</tr>
</tbody>
</table>
<div class="blog_h2"><span class="graybg">Status checks and debugging</span></div>
<pre class="crayon-plain-tag"># Show detailed workspace status
devpod status my-workspace

# Inspect the underlying Pod directly
kubectl get pod -n devpod -l app=devpod

# Show Pod events when startup fails
kubectl describe pod my-workspace -n devpod</pre>
<div class="blog_h1"><span class="graybg">Configuration details</span></div>
<p><pre class="crayon-plain-tag">devcontainer.json</pre> is the core Dev Container file. It defines the image, lifecycle hooks, forwarded ports, editor customization, and the rest of the workspace contract. DevPod fully supports that specification. The file usually lives at <pre class="crayon-plain-tag">.devcontainer/devcontainer.json</pre>.</p>
<p>A full example for remote development on Kubernetes:</p>
<pre class="crayon-plain-tag">{
  "name": "my-workspace",

  // Use a custom image with all tools preinstalled
  "image": "registry.example.com/dev/ubuntu:22.04-tools",

  // Skip first-run installation work
  "onCreateCommand": "true",

  // Mount the PVC at $HOME so IDE state and extensions persist
  // workspaceMount is left empty on purpose. DevPod v0.6.x has a known
  // .devpodignore issue, so large monorepos can get uploaded in full.
  "workspaceFolder": "/root",

  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance",
        "ms-python.debugpy",
        "redhat.vscode-yaml",
        "ms-kubernetes-tools.vscode-kubernetes-tools"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python",
        "editor.formatOnSave": true,
        "terminal.integrated.defaultProfile.linux": "bash"
      }
    }
  },

  "forwardPorts": [8000, 8080, 5432, 6379],
  "portsAttributes": {
    "8000": { "label": "API Server" },
    "8080": { "label": "Web UI" },
    "5432": { "label": "PostgreSQL", "onAutoForward": "silent" },
    "6379": { "label": "Redis", "onAutoForward": "silent" }
  },
  "otherPortsAttributes": {
    "onAutoForward": "silent"
  }
}</pre>
<div class="blog_h2"><span class="graybg">Images and builds</span></div>
<p>You can define the base container either by pointing directly at an image or by building one from a Dockerfile.</p>
<p>The <pre class="crayon-plain-tag">image</pre> field accepts any OCI image, including Docker Hub, GHCR, or a private registry. For remote development on Kubernetes, a <span style="background-color: #c0c0c0;">prebuilt image</span> usually saves trouble. Baking the whole toolchain into the image cuts startup time from minutes to seconds.</p>
<p>If you need to customize the image, use <pre class="crayon-plain-tag">build</pre>:</p>
<pre class="crayon-plain-tag">{
  "build": {
    "dockerfile": "Dockerfile",
    "context": "..",
    "args": {
      "PYTHON_VERSION": "3.11"
    }
  }
}</pre>
<p><pre class="crayon-plain-tag">context</pre> defaults to <pre class="crayon-plain-tag">"."</pre>, which means the directory that contains <pre class="crayon-plain-tag">devcontainer.json</pre>. Setting it to <pre class="crayon-plain-tag">".."</pre> lets the Dockerfile reference files from the project root.</p>
<div class="blog_h2"><span class="graybg">workspaceFolder and workspaceMount</span></div>
<p><pre class="crayon-plain-tag">workspaceFolder</pre> is the directory the IDE opens by default after it connects. On Kubernetes, it usually makes sense to point it at the PVC mount, for example <pre class="crayon-plain-tag">/root</pre>, so the workspace path and the persistent path are the same thing.</p>
<p><pre class="crayon-plain-tag">workspaceMount</pre> controls how local source code gets mounted into the container. It is useful in local Docker workflows. In remote Kubernetes workflows, it is often better to leave it empty. DevPod v0.6.x has a known issue in <a href="https://github.com/loft-sh/devpod/issues/1885">#1885</a> where <pre class="crayon-plain-tag">.devpodignore</pre> can be ignored during streaming upload, which means a large workspace, including <pre class="crayon-plain-tag">venv</pre> and <pre class="crayon-plain-tag">node_modules</pre>, can get pushed in full. A custom <pre class="crayon-plain-tag">rsync</pre> step gives you much better control.</p>
<div class="blog_h2"><span class="graybg">Lifecycle hooks</span></div>
<p>The Dev Container spec defines six lifecycle hooks, in this order:</p>
<pre class="crayon-plain-tag">initializeCommand     # runs on the host, every startup
  ↓
onCreateCommand       # runs once after first container creation
  ↓
updateContentCommand  # runs after content updates, at least once
  ↓
postCreateCommand     # runs after user assignment, user secrets available
  ↓
postStartCommand      # runs after each container start
  ↓
postAttachCommand     # runs after each IDE attach</pre>
<p>Each hook accepts three forms:</p>
<ul>
<li>String: executed through <pre class="crayon-plain-tag">/bin/sh</pre>.</li>
<li>Array: executed directly without a shell, which is safer.</li>
<li>Object: multiple named commands executed in parallel, useful when several services need to start together.</li>
</ul>
<pre class="crayon-plain-tag">{
  "postAttachCommand": {
    "api-server": "cd /root/api &amp;&amp; python -m uvicorn main:app --port 8000",
    "worker": "cd /root/worker &amp;&amp; python -m celery -A tasks worker"
  }
}</pre>
<p>A few practical rules help here:</p>
<ul>
<li>If all tools are already in the image, set <pre class="crayon-plain-tag">onCreateCommand</pre> to <pre class="crayon-plain-tag">"true"</pre> and skip it.</li>
<li><pre class="crayon-plain-tag">postStartCommand</pre> is a good place for startup checks or light warmup.</li>
<li>The <pre class="crayon-plain-tag">waitFor</pre> field decides which phase must finish before the IDE attaches. The default is <pre class="crayon-plain-tag">"updateContentCommand"</pre>.</li>
</ul>
<div class="blog_h2"><span class="graybg">IDE customization</span></div>
<p>You can declare extensions and settings under <pre class="crayon-plain-tag">customizations.vscode</pre>, and they are applied automatically when the IDE connects:</p>
<pre class="crayon-plain-tag">"customizations": {
  "vscode": {
    "extensions": [
      "ms-python.python",
      "ms-python.vscode-pylance",
      "ms-python.debugpy",
      "redhat.vscode-yaml",
      "ms-kubernetes-tools.vscode-kubernetes-tools"
    ],
    "settings": {
      "python.defaultInterpreterPath": "/usr/local/bin/python",
      "editor.formatOnSave": true,
      "terminal.integrated.defaultProfile.linux": "bash"
    }
  }
}</pre>
<p>Extensions listed under <pre class="crayon-plain-tag">extensions</pre> install automatically on first attach. With the PVC mounted at <pre class="crayon-plain-tag">$HOME</pre>, you only pay that cost once. Settings defined here override local editor settings, which helps keep behavior consistent across a team.</p>
<div class="blog_h2"><span class="graybg">Port forwarding</span></div>
<p>Ports listed in <pre class="crayon-plain-tag">forwardPorts</pre> are forwarded automatically after the IDE connects. When a service starts inside the container, you can usually hit it on local <pre class="crayon-plain-tag">localhost</pre> without extra setup.</p>
<p><pre class="crayon-plain-tag">portsAttributes</pre> lets you define a label and behavior per port:</p>
<pre class="crayon-plain-tag">"forwardPorts": [8000, 8080, 5432, 6379],
"portsAttributes": {
  "8000": { "label": "API Server" },
  "8080": { "label": "Web UI", "onAutoForward": "openBrowser" },
  "5432": { "label": "PostgreSQL", "onAutoForward": "silent" },
  "6379": { "label": "Redis", "onAutoForward": "silent" }
},
"otherPortsAttributes": {
  "onAutoForward": "silent"
}</pre>
<p><pre class="crayon-plain-tag">onAutoForward</pre> controls the first reaction when DevPod sees the port: <pre class="crayon-plain-tag">"notify"</pre> shows a notification, <pre class="crayon-plain-tag">"openBrowser"</pre> opens a browser, <pre class="crayon-plain-tag">"silent"</pre> forwards quietly, and <pre class="crayon-plain-tag">"ignore"</pre> does nothing. <pre class="crayon-plain-tag">otherPortsAttributes</pre> sets the default for ports you did not list explicitly.</p>
<div class="blog_h2"><span class="graybg">Environment variables</span></div>
<p>The Dev Container spec splits environment variables into two layers:</p>
<ul>
<li><pre class="crayon-plain-tag">containerEnv</pre>: set on the container itself, visible to all processes, and fixed for the life of that container.</li>
<li><pre class="crayon-plain-tag">remoteEnv</pre>: only visible to IDE-launched processes such as terminals, tasks, and debuggers. This layer can reference <pre class="crayon-plain-tag">${containerEnv:VAR}</pre> and does not require a container rebuild when changed.</li>
</ul>
<pre class="crayon-plain-tag">{
  "containerEnv": {
    "PYTHONPATH": "/root/libs/common:/root/libs/shared"
  },
  "remoteEnv": {
    "PATH": "${containerEnv:PATH}:/root/.local/bin"
  }
}</pre>
<p>Both fields also support <pre class="crayon-plain-tag">${localEnv:VAR}</pre>, which reads an environment variable from the host, for example <pre class="crayon-plain-tag">${localEnv:HOME}</pre>.</p>
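<p>As a small illustration with made-up variable names, a host value can be passed straight through, and the default-value form covers machines where the variable is unset:</p>
<pre class="crayon-plain-tag">{
  "remoteEnv": {
    "GIT_AUTHOR_NAME": "${localEnv:GIT_AUTHOR_NAME}",
    "EDITOR": "${localEnv:EDITOR:vim}"
  }
}</pre>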
<div class="blog_h2"><span class="graybg">Features</span></div>
<p>Dev Container Features are reusable Dockerfile fragments distributed as OCI artifacts. The <pre class="crayon-plain-tag">features</pre> field lets you add tools without editing the base image directly:</p>
<pre class="crayon-plain-tag">{
  "features": {
    "ghcr.io/devcontainers/features/docker-in-docker:2": {},
    "ghcr.io/devcontainers/features/kubectl-helm-minikube:1": {
      "version": "latest"
    },
    "ghcr.io/devcontainers/features/node:1": {
      "version": "22"
    }
  }
}</pre>
<p>You can browse the available features at <a href="https://containers.dev/features">containers.dev/features</a>. For Kubernetes-based remote development, though, baking tools into the base image is usually better than paying installation time on every new workspace. Features fit local Docker prototypes better than long-lived remote workspaces.</p>
<div class="blog_h2"><span class="graybg">Container behavior controls</span></div>
<p>A few fields change how the container behaves at runtime:</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 30%;">Field</td>
<td style="width: 20%;">Default</td>
<td>Description</td>
</tr>
</thead>
<tbody>
<tr>
<td>overrideCommand</td>
<td>true</td>
<td>Overrides the image command with an infinite loop so the container stays alive. This default usually makes sense for custom development images.</td>
</tr>
<tr>
<td>shutdownAction</td>
<td>stopContainer</td>
<td>What happens when the IDE closes. Options include stopContainer and none. For Kubernetes, none is often the better choice.</td>
</tr>
<tr>
<td>init</td>
<td>false</td>
<td>Uses tini as PID 1 to reap zombie processes.</td>
</tr>
<tr>
<td>privileged</td>
<td>false</td>
<td>Enables privileged mode. In Docker workflows this can be set here. In Kubernetes, it belongs in the Pod manifest.</td>
</tr>
<tr>
<td>containerUser</td>
<td>root or the Dockerfile USER</td>
<td>The user for all container operations.</td>
</tr>
<tr>
<td>remoteUser</td>
<td>same as containerUser</td>
<td>The user for IDE terminals and tasks. It can differ from containerUser.</td>
</tr>
</tbody>
</table>
<div class="blog_h2"><span class="graybg">Predefined variables</span></div>
<p>String values in <pre class="crayon-plain-tag">devcontainer.json</pre> can use these predefined variables:</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 40%;">Variable</td>
<td>Meaning</td>
</tr>
</thead>
<tbody>
<tr>
<td>${localEnv:VAR_NAME}</td>
<td>Host environment variable, with optional default value syntax: ${localEnv:VAR:default}</td>
</tr>
<tr>
<td>${containerEnv:VAR_NAME}</td>
<td>Container environment variable, available only inside remoteEnv</td>
</tr>
<tr>
<td>${localWorkspaceFolder}</td>
<td>Workspace path on the host</td>
</tr>
<tr>
<td>${containerWorkspaceFolder}</td>
<td>Workspace path inside the container</td>
</tr>
<tr>
<td>${devcontainerId}</td>
<td>Stable unique identifier for the container</td>
</tr>
</tbody>
</table>
<div class="blog_h1"><span class="graybg">Customizing the base image</span></div>
<p>You can point the Dev Container <pre class="crayon-plain-tag">image</pre> field at any public image, but for remote development on Kubernetes it is usually worth building a dedicated base image with the toolchain, language runtimes, and system libraries locked into image layers.</p>
<p>That pays off in a few ways:</p>
<ul>
<li>The Pod is usable as soon as it starts. You do not wait for <pre class="crayon-plain-tag">onCreateCommand</pre> to install half the environment.</li>
<li>Environment consistency improves because everyone shares the same image instead of replaying installation steps in slightly different conditions.</li>
<li>When the Pod is rebuilt, the toolchain comes back with it. You are not depending on package manager availability at workspace creation time.</li>
</ul>
<div class="blog_h2"><span class="graybg">Dockerfile layering rules</span></div>
<p>Good layering makes build caching much more effective. Put low-churn tools in lower layers and faster-moving pieces in upper layers. End each <pre class="crayon-plain-tag">RUN</pre> block with <pre class="crayon-plain-tag">apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/*</pre> to keep layers smaller, and use <pre class="crayon-plain-tag">--no-install-recommends</pre> to avoid pulling in packages you do not need.</p>
<p>The following example builds a development image with Python 3.11, common system tools, and the NVIDIA CUDA runtime:</p>
<pre class="crayon-plain-tag">FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive

# Layer 1: system tools, Python 3.11, and all PPAs
RUN apt-get update &amp;&amp; \
    apt-get install -y --no-install-recommends \
      software-properties-common gnupg2 wget curl ca-certificates &amp;&amp; \
    add-apt-repository -y ppa:deadsnakes/ppa &amp;&amp; \
    add-apt-repository -y ppa:graphics-drivers/ppa &amp;&amp; \
    wget -qO /tmp/cuda-keyring.deb \
      https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb &amp;&amp; \
    dpkg -i /tmp/cuda-keyring.deb &amp;&amp; rm /tmp/cuda-keyring.deb &amp;&amp; \
    apt-get update &amp;&amp; \
    apt-get install -y --no-install-recommends \
      python3.11 python3.11-venv python3.11-dev python3-pip \
      git make vim jq postgresql-client \
      openssh-server procps iproute2 iputils-ping \
      rsync htop telnet &amp;&amp; \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 &amp;&amp; \
    update-alternatives --install /usr/bin/python  python  /usr/bin/python3.11 1 &amp;&amp; \
    apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/*

# Layer 2: NVIDIA driver tools such as nvidia-smi
RUN apt-get update &amp;&amp; \
    apt-get install -y --no-install-recommends nvidia-utils-580-server &amp;&amp; \
    apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/*

# Layer 3: CUDA runtime libraries
RUN apt-get update &amp;&amp; \
    apt-get install -y --no-install-recommends cuda-libraries-12-8 &amp;&amp; \
    apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/*</pre>
<p>Several design choices here matter:</p>
<ul>
<li>Add PPAs and GPG keys in layer 1, before <pre class="crayon-plain-tag">update-alternatives</pre> changes the default Python. If you switch Python first, <pre class="crayon-plain-tag">add-apt-repository</pre> can fail with <pre class="crayon-plain-tag">No module named 'apt_pkg'</pre> because the <pre class="crayon-plain-tag">apt_pkg</pre> binding expects the system Python.</li>
<li>Keep NVIDIA tools and CUDA libraries in separate layers. That way a driver update only rebuilds one layer.</li>
<li>Install <pre class="crayon-plain-tag">nvidia-utils-xxx-server</pre>, not <pre class="crayon-plain-tag">nvidia-utils-xxx</pre>. On Ubuntu, the latter can be a transitional dummy package without the actual <pre class="crayon-plain-tag">nvidia-smi</pre> binary.</li>
<li>Pick <pre class="crayon-plain-tag">cuda-libraries-12-8</pre>, roughly 1.2 GB, instead of the full <pre class="crayon-plain-tag">cuda-toolkit-12-8</pre>, which is closer to 10 GB. Most development environments need the runtime more often than the full compiler and debugger stack.</li>
</ul>
<div class="blog_h2"><span class="graybg">How the image and devcontainer.json work together</span></div>
<p>Once the image already contains the full toolchain, <pre class="crayon-plain-tag">devcontainer.json</pre> becomes much simpler:</p>
<pre class="crayon-plain-tag">{
  "image": "registry.example.com/dev/ubuntu:22.04-cuda12.8",
  "onCreateCommand": "true",
  "workspaceFolder": "/root"
}</pre>
<p>Setting <pre class="crayon-plain-tag">onCreateCommand</pre> to <pre class="crayon-plain-tag">"true"</pre> means there is nothing left to install at first startup. The Pod is ready immediately after creation.</p>
<div class="blog_h1"><span class="graybg">Customizing the Pod spec</span></div>
<p>The Pod manifest is the core Kubernetes-side configuration. It controls the things DevPod cannot express through <pre class="crayon-plain-tag">devcontainer.json</pre>.</p>
<div class="blog_h2"><span class="graybg">Template variables</span></div>
<p>DevPod renders the Pod manifest as a template before it creates the Pod. These placeholders are commonly used:</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 30%;">Variable</td>
<td>Meaning</td>
</tr>
</thead>
<tbody>
<tr>
<td>{{.WorkspaceId}}</td>
<td>Workspace name, often reused as the Pod name and label value.</td>
</tr>
<tr>
<td>{{.Image}}</td>
<td>Image declared in <pre class="crayon-plain-tag">devcontainer.json</pre>.</td>
</tr>
</tbody>
</table>
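<p>Putting the two placeholders together, a minimal manifest might look like the sketch below. How DevPod merges this template with its own generated Pod spec depends on the DevPod version, so treat the surrounding fields as illustrative rather than authoritative:</p>
<pre class="crayon-plain-tag">apiVersion: v1
kind: Pod
metadata:
  name: "{{.WorkspaceId}}"
  labels:
    app: devpod
    workspace: "{{.WorkspaceId}}"
spec:
  containers:
    - name: devpod
      image: "{{.Image}}"
      # securityContext, resources, and volumeMounts go here</pre>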
<div class="blog_h2"><span class="graybg">Security context</span></div>
<p>Remote development containers often need looser permissions than production containers. These are the settings that come up most often:</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 35%;">Setting</td>
<td>Use</td>
<td>Risk</td>
</tr>
</thead>
<tbody>
<tr>
<td>privileged: true</td>
<td>Docker-in-Docker, device access, debugging tools</td>
<td>Full access to host kernel capabilities</td>
</tr>
<tr>
<td>SYS_ADMIN</td>
<td><pre class="crayon-plain-tag">mount</pre> and cgroup operations</td>
<td>Medium</td>
</tr>
<tr>
<td>SYS_PTRACE</td>
<td><pre class="crayon-plain-tag">strace</pre>, <pre class="crayon-plain-tag">gdb</pre>, and similar debugging</td>
<td>Low</td>
</tr>
<tr>
<td>NET_ADMIN</td>
<td>Network debugging and iptables work</td>
<td>Medium</td>
</tr>
<tr>
<td>hostNetwork: true</td>
<td>Direct use of the host network stack, which avoids CNI overhead</td>
<td>Port conflicts and loss of network isolation</td>
</tr>
<tr>
<td>hostPID: true</td>
<td>Inspect host processes for system-level debugging</td>
<td>Loss of process isolation</td>
</tr>
</tbody>
</table>
<p>Loosen permissions only where the workspace actually needs them, and keep these Pods isolated to dedicated namespaces or nodes so they do not interfere with production workloads.</p>
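<p>As a sketch of the "loosen only what you need" idea, granting a single capability for debugging looks like this in the Pod manifest (the container name is illustrative):</p>
<pre class="crayon-plain-tag">containers:
  - name: devpod
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]   # enough for strace and gdb, without full privileged mode</pre>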
<div class="blog_h2"><span class="graybg">Resource requests and limits</span></div>
<pre class="crayon-plain-tag">resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "16"
    memory: "64Gi"</pre>
<p>Set <pre class="crayon-plain-tag">requests</pre> low enough to keep scheduling realistic, and <pre class="crayon-plain-tag">limits</pre> high enough to leave room for bursts. Development environments rarely sit at peak usage all day, but builds and test runs can spike hard for a short time.</p>
<div class="blog_h1"><span class="graybg">File sync strategy</span></div>
<div class="blog_h2"><span class="graybg">Default DevPod sync vs custom rsync</span></div>
<p>DevPod includes a built-in sync path through <pre class="crayon-plain-tag">devpod up</pre>, and it works fine for small projects. On large multi-repo workspaces, with dozens of subprojects and millions of files, two problems show up fast:</p>
<ul>
<li>The first sync can take a very long time, and exclusion control is limited.</li>
<li>DevPod may try to upload the entire <pre class="crayon-plain-tag">workspaceFolder</pre>, including directories you do not want remotely, such as <pre class="crayon-plain-tag">node_modules</pre> and <pre class="crayon-plain-tag">.git</pre>.</li>
</ul>
<p>The usual way around this is to launch DevPod with <pre class="crayon-plain-tag">--ide none</pre>, skip automatic sync, and then run your own <pre class="crayon-plain-tag">rsync</pre> command with explicit include and exclude rules.</p>
<div class="blog_h2"><span class="graybg">The stub directory trick</span></div>
<p>Even with <pre class="crayon-plain-tag">--ide none</pre>, DevPod still tries to sync the local directory that matches <pre class="crayon-plain-tag">workspaceFolder</pre> during <pre class="crayon-plain-tag">devpod up</pre>. If that directory is large, the initial startup can still crawl. One workaround is to create a temporary empty directory and use that for the initial workspace creation:</p>
<pre class="crayon-plain-tag">STUB_DIR=$(mktemp -d)
devpod up "$STUB_DIR" --ide none --provider K8s ...
rm -rf "$STUB_DIR"
# Then sync the real source tree with rsync</pre>
<div class="blog_h2"><span class="graybg">rsync in practice</span></div>
<pre class="crayon-plain-tag">SSH_CMD="ssh my-workspace.devpod"

rsync -az \
  --exclude='node_modules' \
  --exclude='.git' \
  --exclude='__pycache__' \
  --exclude='venv' \
  --exclude='.venv' \
  --exclude='dist' \
  --exclude='.next' \
  --exclude='.temp' \
  --exclude='.logs' \
  --exclude='.vscode/sessions.json' \
  --copy-unsafe-links \
  ./my-project/ my-workspace.devpod:/root/Projects/my-project/</pre>
<p>The most useful flags here are:</p>
<ul>
<li><pre class="crayon-plain-tag">-az</pre>: archive mode plus compression. Do not add <pre class="crayon-plain-tag">--progress</pre> when you have a large number of small files. The extra output can slow the SSH stream badly enough to trigger a broken pipe.</li>
<li><pre class="crayon-plain-tag">--copy-unsafe-links</pre>: dereferences symlinks that point outside the synced tree. In multi-repo setups this is useful because shared directories linked from elsewhere often do not resolve correctly on the remote side.</li>
<li><pre class="crayon-plain-tag">--exclude</pre>: keep anything noisy or disposable out of the remote workspace. <pre class="crayon-plain-tag">.vscode/sessions.json</pre> changes constantly and tends to fight with remote state, so it should stay out.</li>
</ul>
<div class="blog_h1"><span class="graybg">Remote IDE access</span></div>
<p>VS Code and Cursor run remote development by installing a server-side component inside the container. The local editor talks to that server through SSH.</p>
<div class="blog_h2"><span class="graybg">How the server gets installed</span></div>
<p>The server build has to match the local IDE version, usually by commit hash. The installation flow looks like this:</p>
<ol>
<li>Read the current commit hash from the local IDE.</li>
<li>Download the matching server bundle.</li>
<li>Transfer and unpack it to <pre class="crayon-plain-tag">~/.cursor-server/cli/servers/Stable-{commit}/</pre> on the remote side.</li>
</ol>
<p>The wrapper script should make installation idempotent:</p>
<pre class="crayon-plain-tag">COMMIT=$(get_ide_commit_hash)
SERVER_BIN="$HOME/.cursor-server/cli/servers/Stable-$COMMIT/server/bin/code-server"

if $SSH_CMD "test -x $SERVER_BIN"; then
  echo "Server already installed"
else
  # Download and install the server
  install_ide_server "$COMMIT"
fi</pre>
<div class="blog_h2"><span class="graybg">Extension persistence</span></div>
<p>Extensions live under <pre class="crayon-plain-tag">~/.cursor-server/extensions/</pre> or <pre class="crayon-plain-tag">~/.vscode-server/extensions/</pre>. If the PVC is mounted at <pre class="crayon-plain-tag">$HOME</pre>, those extensions persist automatically.</p>
<p>A common mistake is wiping the whole <pre class="crayon-plain-tag">~/.cursor-server</pre> directory during a server reinstall. That blows away every installed extension. The safer cleanup target is the server binary directory only:</p>
<pre class="crayon-plain-tag"># Wrong: removes extensions too
rm -rf ~/.cursor-server

# Right: remove only the server binaries
rm -rf ~/.cursor-server/cli</pre>
<div class="blog_h2"><span class="graybg">Bulk extension sync</span></div>
<p>When you first prepare a remote environment, it can be faster to sync already installed local extensions than to redownload everything from the marketplace:</p>
<pre class="crayon-plain-tag">rsync -az \
  ~/.cursor-server/extensions/ \
  my-workspace.devpod:~/.cursor-server/extensions/</pre>
<p>After the sync, check for broken symlinks. Some extensions include links to a local Node.js path that does not exist remotely. Those need to be replaced with real files.</p>
<pre class="crayon-plain-tag"># Find broken symlinks on the remote side
find ~/.cursor-server/extensions/ -type l ! -exec test -e {} \; -print

# Replace each broken link with a real copy of the target file
# fetched from the local machine</pre>
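<p>One hedged way to script that replacement from the local machine, assuming the broken links point at absolute paths that also exist locally and are regular files:</p>
<pre class="crayon-plain-tag">SSH_HOST=my-workspace.devpod

# List broken links on the remote side, then push the matching local file into place
ssh "$SSH_HOST" 'find ~/.cursor-server/extensions/ -type l ! -exec test -e {} \; -print' |
while IFS= read -r link; do
  target=$(ssh "$SSH_HOST" "readlink \"$link\"")     # where the dangling link points
  if [ -f "$target" ]; then                          # only handle targets that exist locally
    ssh "$SSH_HOST" "rm \"$link\""
    rsync -a "$target" "$SSH_HOST:$link"
  fi
done</pre>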
<div class="blog_h2"><span class="graybg">Why the first connection is slow</span></div>
<p>The first IDE attach to a fresh workspace often takes anywhere from 30 seconds to several minutes because the IDE still has to:</p>
<ul>
<li>Establish the SSH tunnel, which adds some overhead through DevPod's SSH layer.</li>
<li>Download and install the server if it is not already present.</li>
<li>Initialize the installed extensions.</li>
</ul>
<p>Later connections are much faster because the server and extensions are already sitting on the PVC.</p>
<div class="blog_h1"><span class="graybg">GPU access inside Kubernetes workspaces</span></div>
<p>GPU access on Kubernetes depends on several moving parts, including the host driver, the device plugin, and the container runtime hook. If any one of them is wrong, the container will come up without usable GPU devices.</p>
<div class="blog_h2"><span class="graybg">How the NVIDIA device plugin works</span></div>
<p>The NVIDIA <span style="background-color: #c0c0c0;">Device Plugin</span> runs as a DaemonSet on GPU nodes and registers the extended resource <pre class="crayon-plain-tag">nvidia.com/gpu</pre> with Kubernetes. A Pod requests GPUs by declaring the count in <pre class="crayon-plain-tag">resources.limits</pre>:</p>
<pre class="crayon-plain-tag">resources:
  limits:
    nvidia.com/gpu: "4"
  requests:
    nvidia.com/gpu: "4"</pre>
<p>The scheduler places the Pod on a node with enough GPU capacity, and the device plugin injects the actual device nodes such as <pre class="crayon-plain-tag">/dev/nvidia0</pre>.</p>
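<p>Before debugging anything inside the container, it is worth confirming that the node actually advertises the resource. The node name is a placeholder, and the device plugin DaemonSet name varies by install method:</p>
<pre class="crayon-plain-tag"># Capacity and Allocatable should both list nvidia.com/gpu
kubectl describe node gpu-node-1 | grep -A 2 'nvidia.com/gpu'

# The device plugin DaemonSet should have one ready Pod per GPU node
kubectl get daemonset -A | grep nvidia-device-plugin</pre>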
<div class="blog_h2"><span class="graybg">runtimeClassName: nvidia</span></div>
<p>Requesting GPU resources is not enough. Kubernetes also has to know which container runtime class should handle GPU device setup. That happens through the Pod's <pre class="crayon-plain-tag">runtimeClassName</pre> field:</p>
<pre class="crayon-plain-tag">spec:
  runtimeClassName: nvidia
  containers:
    - name: devpod
      # ...</pre>
<p>If you omit <pre class="crayon-plain-tag">runtimeClassName</pre>, the Pod may still get GPU quota, but the runtime will not call NVIDIA's prestart hook. The result is simple: no <pre class="crayon-plain-tag">/dev/nvidia*</pre> devices inside the container. This is one of the most common failure modes.</p>
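<p>Two quick checks catch this case; the names below are the ones used throughout this note:</p>
<pre class="crayon-plain-tag"># The RuntimeClass must exist in the cluster
kubectl get runtimeclass nvidia

# And the workspace Pod must actually reference it
kubectl get pod my-workspace -n devpod -o jsonpath='{.spec.runtimeClassName}'</pre>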
<div class="blog_h2"><span class="graybg">AppArmor blocking</span></div>
<p><pre class="crayon-plain-tag">privileged: true</pre> does not mean AppArmor is <pre class="crayon-plain-tag">unconfined</pre>. On nodes with AppArmor enabled, a privileged container can still be blocked by the default profile, such as <pre class="crayon-plain-tag">cri-containerd.apparmor.d</pre>, when it tries to access GPU device nodes.</p>
<p>The fix is to declare an unconfined AppArmor profile in the Pod annotations:</p>
<pre class="crayon-plain-tag">metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/devpod: unconfined</pre>
<p>Here <pre class="crayon-plain-tag">devpod</pre> is the container name. The annotation must match it exactly.</p>
<div class="blog_h2"><span class="graybg">The NVIDIA_VISIBLE_DEVICES trap</span></div>
<p>It is tempting to set <pre class="crayon-plain-tag">NVIDIA_VISIBLE_DEVICES=all</pre> in the Pod manifest to expose every GPU. In a setup that already uses <pre class="crayon-plain-tag">runtimeClassName: nvidia</pre>, that usually backfires. A manually set value can interfere with the device plugin's own injection logic.</p>
<p>The NVIDIA container runtime behaves like this:</p>
<ul>
<li>If <pre class="crayon-plain-tag">NVIDIA_VISIBLE_DEVICES</pre> comes from the device plugin, the runtime mounts exactly the devices that value names.</li>
<li>If the manifest hardcodes <pre class="crayon-plain-tag">NVIDIA_VISIBLE_DEVICES=all</pre>, that value overrides the plugin-managed one and can break the mapping step.</li>
</ul>
<p>The safer approach is to leave <pre class="crayon-plain-tag">NVIDIA_VISIBLE_DEVICES</pre> alone and let the device plugin manage it. Keeping <pre class="crayon-plain-tag">NVIDIA_DRIVER_CAPABILITIES=all</pre> is fine if the container needs full driver capability access.</p>
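<p>In the Pod manifest, that translates into an env block like the following sketch, with the visibility variable deliberately absent:</p>
<pre class="crayon-plain-tag">containers:
  - name: devpod
    env:
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: "all"
      # no NVIDIA_VISIBLE_DEVICES here; the device plugin injects it</pre>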
<div class="blog_h2"><span class="graybg">Getting nvidia-smi inside the container</span></div>
<p><pre class="crayon-plain-tag">nvidia-smi</pre> is the fastest way to confirm GPU visibility. There is one common trap when you install it inside the container: on some Linux distributions, packages named <pre class="crayon-plain-tag">nvidia-utils-xxx</pre> are only transitional dummy packages. They install successfully but do not include the real <pre class="crayon-plain-tag">nvidia-smi</pre> binary.</p>
<p>On Ubuntu 22.04, the reliable path is:</p>
<ol>
<li>Add <pre class="crayon-plain-tag">ppa:graphics-drivers/ppa</pre>.</li>
<li>Install <pre class="crayon-plain-tag">nvidia-utils-xxx-server</pre>, with the <pre class="crayon-plain-tag">-server</pre> suffix.</li>
</ol>
<p>If changing the image is inconvenient, one temporary workaround is to mount host driver tools and libraries into the container with <pre class="crayon-plain-tag">hostPath</pre>:</p>
<pre class="crayon-plain-tag">volumeMounts:
  - name: host-root
    mountPath: /host
    readOnly: true
volumes:
  - name: host-root
    hostPath:
      path: /</pre>
<p>After startup, add <pre class="crayon-plain-tag">/host/usr/lib/x86_64-linux-gnu</pre> to <pre class="crayon-plain-tag">LD_LIBRARY_PATH</pre> and call <pre class="crayon-plain-tag">/host/usr/bin/nvidia-smi</pre> directly. It works, but it is still a workaround. The long-term fix is to bake the required driver tools into the image.</p>
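<p>With that mount in place, the check itself is two lines inside the container. This is still the workaround path, not the recommended image-based fix:</p>
<pre class="crayon-plain-tag">export LD_LIBRARY_PATH=/host/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
/host/usr/bin/nvidia-smi</pre>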
<div class="blog_h2"><span class="graybg">How to debug NVML Unknown Error</span></div>
<p>If <pre class="crayon-plain-tag">nvidia-smi</pre> returns <pre class="crayon-plain-tag">Failed to initialize NVML: Unknown Error</pre>, check things in this order:</p>
<ol>
<li>AppArmor. Confirm the Pod annotation is <pre class="crayon-plain-tag">unconfined</pre>, and inspect the actual container profile with <pre class="crayon-plain-tag">cat /proc/1/attr/current</pre>.</li>
<li>Device nodes. Check whether <pre class="crayon-plain-tag">ls /dev/nvidia*</pre> returns anything. If not, the problem is in the runtime or the device plugin.</li>
<li>Runtime class. Confirm the Pod spec sets <pre class="crayon-plain-tag">runtimeClassName: nvidia</pre> and that the cluster actually has that RuntimeClass.</li>
<li>Environment variables. Verify that <pre class="crayon-plain-tag">NVIDIA_VISIBLE_DEVICES</pre> was not overridden manually.</li>
<li>Driver versions. Make sure the user-space NVIDIA libraries in the container are compatible with the host kernel driver.</li>
</ol>
<div class="blog_h1"><span class="graybg">Common failure modes</span></div>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 30%;">Symptom</td>
<td style="width: 30%;">Cause</td>
<td>Fix</td>
</tr>
</thead>
<tbody>
<tr>
<td>Pod enters Dead or Failed state</td>
<td>OOM, node issues, or a bad manifest</td>
<td>Run <pre class="crayon-plain-tag">devpod stop</pre>, fix the manifest, then run <pre class="crayon-plain-tag">devpod up</pre> again. The PVC stays intact.</td>
</tr>
<tr>
<td>SSH exits with code 255</td>
<td>The Pod is not ready yet, or the SSH tunnel dropped</td>
<td>Check Pod state and retry after it reaches Running. If server installation was interrupted, rerun the installation step manually.</td>
</tr>
<tr>
<td><pre class="crayon-plain-tag">rsync</pre> reports Broken pipe</td>
<td>Progress output flooded the SSH channel</td>
<td>Use <pre class="crayon-plain-tag">rsync -az</pre> without <pre class="crayon-plain-tag">--progress</pre> or <pre class="crayon-plain-tag">--info=progress2</pre>.</td>
</tr>
<tr>
<td><pre class="crayon-plain-tag">add-apt-repository</pre> fails with <pre class="crayon-plain-tag">No module named 'apt_pkg'</pre></td>
<td>The default Python was switched before repository setup</td>
<td>Add all PPAs before calling <pre class="crayon-plain-tag">update-alternatives</pre>.</td>
</tr>
<tr>
<td>IDE extensions disappear after a Pod rebuild</td>
<td>The reinstall script removed the extensions directory</td>
<td>Delete only the <pre class="crayon-plain-tag">cli/</pre> subtree and keep <pre class="crayon-plain-tag">extensions/</pre>.</td>
</tr>
<tr>
<td><pre class="crayon-plain-tag">nvidia-smi: command not found</pre></td>
<td>A transitional dummy package was installed</td>
<td>Install <pre class="crayon-plain-tag">nvidia-utils-xxx-server</pre> from <pre class="crayon-plain-tag">ppa:graphics-drivers/ppa</pre>.</td>
</tr>
<tr>
<td>NVML Unknown Error</td>
<td>AppArmor, runtime class, device injection, or environment override issues</td>
<td>Debug in this order: AppArmor, device nodes, <pre class="crayon-plain-tag">runtimeClassName</pre>, then environment variables.</td>
</tr>
<tr>
<td><pre class="crayon-plain-tag">/dev/nvidia*</pre> does not exist</td>
<td>Missing <pre class="crayon-plain-tag">runtimeClassName: nvidia</pre> or a broken device plugin</td>
<td>Confirm the RuntimeClass exists and the device plugin DaemonSet is healthy.</td>
</tr>
</tbody>
</table>
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/devpod">DevPod on Kubernetes: turning devcontainer.json into a persistent remote workspace</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/devpod/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Replacing Docker Desktop with Colima on macOS</title>
		<link>https://blog.gmem.cc/colima-on-macos</link>
		<comments>https://blog.gmem.cc/colima-on-macos#comments</comments>
		<pubDate>Sun, 15 Mar 2026 03:29:02 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[IaaS]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=42245</guid>
		<description><![CDATA[<p>Colima is one of the cleanest ways to run containers locally on macOS. It starts a Linux virtual machine through Lima, runs <a class="read-more" href="https://blog.gmem.cc/colima-on-macos">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/colima-on-macos">Replacing Docker Desktop with Colima on macOS</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><p>Colima is one of the cleanest ways to run containers locally on macOS. It starts a Linux virtual machine through Lima, runs Docker, containerd, and optional k3s Kubernetes inside that VM, then exposes the result to host-side tools such as <pre class="crayon-plain-tag">docker</pre> and <pre class="crayon-plain-tag">kubectl</pre>. This note covers how Colima works on macOS, how to install it, which settings matter in practice, how to verify the setup, and which operational details usually trip people up.</p>
<div class="blog_h1"><span class="graybg">Containers on macOS</span></div>
<p>Containers are not lightweight macOS processes. They depend on Linux kernel features such as namespaces, cgroups, and OverlayFS. macOS does not provide those interfaces, so Linux containers on macOS always run on top of a Linux virtual machine.</p>
<p>That is the right starting point for understanding Colima. It does not bypass virtualization. It makes that layer lighter and easier to work with. Lima manages the Linux VM. Colima configures the container runtime inside it and ties that runtime into the host command-line workflow.</p>
<div class="blog_h1"><span class="graybg">What Colima is</span></div>
<p><span style="background-color: #c0c0c0;">Colima</span> is best understood as a developer-friendly layer on top of Lima. Lima handles VM lifecycle, file sharing, and port forwarding. Colima takes care of the container runtime and exposes it to the tools you already use on the host.</p>
<p>Three properties matter most in day-to-day use:</p>
<ul>
<li>It gives macOS a local environment for Docker, containerd, and optional Kubernetes.</li>
<li>It works with the host CLI instead of forcing everything through a desktop application workflow.</li>
<li>It supports multiple profiles, with each profile backed by its own VM. That makes it easy to split a lightweight Docker setup from a heavier Kubernetes setup.</li>
</ul>
<div class="blog_h1"><span class="graybg">Why Colima</span></div>
<p>For local container development on macOS, the real question is usually not whether containers can run. They can. The question is whether the environment is easy to reason about. Colima is appealing for three simple reasons.</p>
<ul>
<li>The structure is clear. Host CLI, Linux VM, and container runtime are separate layers, which makes troubleshooting easier.</li>
<li>The controls are explicit. CPU, memory, disk, architecture, Kubernetes, networking, and mount behavior can all be configured through flags or YAML.</li>
<li>It fits an engineering workflow better than a GUI-first workflow. Scripts, profiles, and repeatable setup steps all work naturally.</li>
</ul>
<p>If Docker Desktop is already installed, you do not necessarily have to remove it first. What matters more is knowing which Docker context is active; otherwise, commands may end up talking to the wrong daemon.</p>
<div class="blog_h1"><span class="graybg">Installation</span></div>
<p>On macOS, the simplest installation path is Homebrew. If you use the Docker runtime, you need the Docker CLI on the host. If you want local Kubernetes, you also need <pre class="crayon-plain-tag">kubectl</pre>.</p>
<pre class="crayon-plain-tag">brew install colima docker kubectl</pre>
<p>The first startup can stay close to the defaults. The goal is just to confirm that the host CLI can talk to the VM-backed runtime.</p>
<pre class="crayon-plain-tag">colima start
docker run --rm hello-world
docker ps</pre>
<p>If you only need a Docker daemon, that is enough. If you also want local Kubernetes, enable it at startup:</p>
<pre class="crayon-plain-tag">colima start --kubernetes
kubectl get nodes</pre>
<p>If more than one Docker daemon exists on the machine, check the active context before assuming anything is broken:</p>
<pre class="crayon-plain-tag">docker context ls
docker context use colima</pre>
<div class="blog_h1"><span class="graybg">Common configuration</span></div>
<p>Colima accepts both command-line flags and persistent YAML configuration. In practice, <pre class="crayon-plain-tag">colima start --edit</pre> is usually the safest entry point because it opens the current profile configuration, lets you change it, and then starts the instance.</p>
<p>The example below is a sensible local-development baseline. It removes private registry assumptions and keeps only the settings that are useful in a public, general-purpose setup.</p>
<pre class="crayon-plain-tag"># Resource sizing. The defaults are fine for a single container,
# but small once you run an app stack and k3s together.
cpu: 4
memory: 8
disk: 100

# Immutable creation-time settings. Use the host architecture
# and keep Docker as the container runtime.
arch: host
runtime: docker

# Single-node k3s. Disable the default Traefik install so it
# does not collide with whatever ingress stack you already use.
kubernetes:
  enabled: true
  version: v1.35.0+k3s1
  k3sArgs:
    - --disable=traefik

# Give the VM a host-reachable address for debugging and direct checks.
network:
  address: true
  mode: shared

# On newer macOS versions, prefer Apple's virtualization framework.
vmType: vz

# On Apple Silicon, enable Rosetta for linux/amd64 userland binaries.
rosetta: true

# VZ plus virtiofs is a common high-performance combination on macOS.
mountType: virtiofs

# Make Docker and Kubernetes contexts active on startup.
autoActivate: true

# Install a small set of debugging tools inside the VM.
# Provision scripts should stay idempotent.
provision:
  - mode: system
    script: |
      apt-get update
      apt-get install -y vim curl htop git make dnsutils net-tools iputils-ping telnet</pre>
<div class="blog_h2"><span class="graybg">Configuration overview</span></div>
<p>The official documentation groups Colima settings into resources, VM settings, runtime settings, networking, mounts, SSH, provisioning, and environment variables. The table below summarizes the current upstream template, plus <pre class="crayon-plain-tag">rootDisk</pre>, which is documented separately in the configuration guide.</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 24%;">Key</td>
<td style="width: 16%;">Default</td>
<td style="width: 44%;">Meaning</td>
<td style="width: 16%;">Notes</td>
</tr>
</thead>
<tbody>
<tr>
<td>cpu</td>
<td>2</td>
<td>Number of vCPUs assigned to the VM.</td>
<td>Resource setting</td>
</tr>
<tr>
<td>memory</td>
<td>2</td>
<td>Memory assigned to the VM, in GiB.</td>
<td>Resource setting</td>
</tr>
<tr>
<td>disk</td>
<td>100</td>
<td>Container data disk size, in GiB.</td>
<td>Can only be increased after creation</td>
</tr>
<tr>
<td>rootDisk</td>
<td>20</td>
<td>Root filesystem disk size for the VM, in GiB.</td>
<td>Documented in the config guide</td>
</tr>
<tr>
<td>arch</td>
<td>host</td>
<td>VM architecture, either the host architecture or an explicit override.</td>
<td>Immutable after creation</td>
</tr>
<tr>
<td>runtime</td>
<td>docker</td>
<td>Container runtime.</td>
<td>Immutable after creation</td>
</tr>
<tr>
<td>modelRunner</td>
<td>docker</td>
<td>Backend used for AI model execution.</td>
<td>AI workload setting</td>
</tr>
<tr>
<td>hostname</td>
<td>null</td>
<td>Custom VM hostname.</td>
<td>Defaults to colima or colima-&lt;profile&gt;</td>
</tr>
<tr>
<td>kubernetes.enabled</td>
<td>false</td>
<td>Turns the built-in k3s cluster on or off.</td>
<td>Kubernetes group</td>
</tr>
<tr>
<td>kubernetes.version</td>
<td>latest stable</td>
<td>k3s version, which must match an actual k3s release string.</td>
<td>Kubernetes group</td>
</tr>
<tr>
<td>kubernetes.k3sArgs</td>
<td>--disable=traefik</td>
<td>Extra arguments passed to the k3s server.</td>
<td>Kubernetes group</td>
</tr>
<tr>
<td>kubernetes.port</td>
<td>0</td>
<td>Kubernetes API listen port. A value of 0 means "pick a free port".</td>
<td>Kubernetes group</td>
</tr>
<tr>
<td>autoActivate</td>
<td>true</td>
<td>Makes Docker and Kubernetes contexts active on startup.</td>
<td>Client-side behavior</td>
</tr>
<tr>
<td>network.address</td>
<td>false</td>
<td>Assigns a host-reachable IP address to the VM.</td>
<td>macOS only</td>
</tr>
<tr>
<td>network.mode</td>
<td>shared</td>
<td>Network mode. The docs list shared and bridged.</td>
<td>macOS only</td>
</tr>
<tr>
<td>network.interface</td>
<td>en0</td>
<td>Host network interface used in bridged mode.</td>
<td>Only used with bridged mode</td>
</tr>
<tr>
<td>network.preferredRoute</td>
<td>false</td>
<td>Uses the assigned VM IP as the preferred route.</td>
<td>Requires address=true</td>
</tr>
<tr>
<td>network.dns</td>
<td>[]</td>
<td>Custom DNS resolvers for the VM.</td>
<td>Network group</td>
</tr>
<tr>
<td>network.dnsHosts</td>
<td>host.docker.internal: host.lima.internal</td>
<td>Built-in DNS host mapping.</td>
<td>Network group</td>
</tr>
<tr>
<td>network.hostAddresses</td>
<td>false</td>
<td>Replicates host IP addresses into the VM for more specific port forwarding behavior.</td>
<td>Network group</td>
</tr>
<tr>
<td>network.gatewayAddress</td>
<td>192.168.5.2</td>
<td>Gateway address for the VM network.</td>
<td>Last octet must be 2</td>
</tr>
<tr>
<td>forwardAgent</td>
<td>false</td>
<td>Forwards the host SSH agent into the VM.</td>
<td>SSH group</td>
</tr>
<tr>
<td>docker</td>
<td>{}</td>
<td>Configuration block mapped directly into Docker daemon.json.</td>
<td>Advanced setting</td>
</tr>
<tr>
<td>vmType</td>
<td>qemu</td>
<td>Virtualization backend.</td>
<td>Immutable after creation</td>
</tr>
<tr>
<td>portForwarder</td>
<td>ssh</td>
<td>Port forwarding mechanism. Valid values are ssh, grpc, and none.</td>
<td>Network group</td>
</tr>
<tr>
<td>rosetta</td>
<td>false</td>
<td>Enables amd64 userland emulation on Apple Silicon.</td>
<td>Requires VZ</td>
</tr>
<tr>
<td>binfmt</td>
<td>true</td>
<td>Enables foreign-architecture binary emulation.</td>
<td>Cross-architecture compatibility</td>
</tr>
<tr>
<td>nestedVirtualization</td>
<td>false</td>
<td>Turns nested virtualization on.</td>
<td>Requires newer Apple Silicon and VZ</td>
</tr>
<tr>
<td>mountType</td>
<td>sshfs on qemu, virtiofs on vz</td>
<td>Host-to-VM mount driver.</td>
<td>Immutable after creation</td>
</tr>
<tr>
<td>mountInotify</td>
<td>false</td>
<td>Propagates inotify file events into the VM.</td>
<td>Experimental</td>
</tr>
<tr>
<td>cpuType</td>
<td>host</td>
<td>CPU type used by QEMU.</td>
<td>QEMU only</td>
</tr>
<tr>
<td>provision</td>
<td>[]</td>
<td>Provision scripts executed during startup.</td>
<td>Should be idempotent</td>
</tr>
<tr>
<td>sshConfig</td>
<td>true</td>
<td>Controls whether the host <pre class="crayon-plain-tag">~/.ssh/config</pre> is updated automatically.</td>
<td>SSH group</td>
</tr>
<tr>
<td>sshPort</td>
<td>0</td>
<td>SSH server port inside the VM. A value of 0 means a random port.</td>
<td>SSH group</td>
</tr>
<tr>
<td>mounts</td>
<td>[]</td>
<td>Extra host directory mounts. Setting it to null disables mounts completely.</td>
<td>Mount group</td>
</tr>
<tr>
<td>diskImage</td>
<td>""</td>
<td>Path to a custom VM disk image.</td>
<td>Advanced setting</td>
</tr>
<tr>
<td>env</td>
<td>{}</td>
<td>Environment variables injected into the VM.</td>
<td>Environment variable group</td>
</tr>
</tbody>
</table>
<div class="blog_h2"><span class="graybg">Template and instance configuration</span></div>
<p>The official docs effectively give Colima three configuration entry points. The first is <pre class="crayon-plain-tag">colima start --edit</pre>, which edits the current instance configuration. The second is <pre class="crayon-plain-tag">colima template</pre>, which edits the default template used by future instances. The third is environment variables such as <pre class="crayon-plain-tag">COLIMA_HOME</pre>, <pre class="crayon-plain-tag">COLIMA_PROFILE</pre>, and <pre class="crayon-plain-tag">DOCKER_CONFIG</pre>, which change the config root, the active profile, and the Docker client config directory.</p>
<pre class="crayon-plain-tag"># Edit the current profile
colima start --edit

# Edit the default template
colima template

# Pick a specific editor
colima start --edit --editor code
colima template --editor code</pre>
<p>It also helps to remember the config file locations:</p>
<ul>
<li>Default profile: <pre class="crayon-plain-tag">~/.colima/default/colima.yaml</pre></li>
<li>Named profile: <pre class="crayon-plain-tag">~/.colima/&lt;profile&gt;/colima.yaml</pre></li>
<li>Default template: <pre class="crayon-plain-tag">~/.colima/_templates/default.yaml</pre></li>
</ul>
<p>The docs also call out four settings as immutable after instance creation: <span style="background-color: #c0c0c0;">arch</span>, <span style="background-color: #c0c0c0;">runtime</span>, <span style="background-color: #c0c0c0;">vmType</span>, and <span style="background-color: #c0c0c0;">mountType</span>. If you need to change any of them, restart is not enough. Delete the instance and recreate it with the new values.</p>
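<p>A minimal sketch of that recreate cycle, assuming the goal is to move an existing default-profile instance to the VZ backend with virtiofs mounts (the specific flag values are illustrative, not a recommendation):</p>
<pre class="crayon-plain-tag"># Immutable settings (arch, runtime, vmType, mountType) only change at creation time
colima stop
colima delete            # add --data to also wipe images and volumes
colima start --vm-type vz --mount-type virtiofs</pre>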
<div class="blog_h1"><span class="graybg">Verification</span></div>
<p>Once the configuration is in place, start with the VM status:</p>
<pre class="crayon-plain-tag">colima status</pre>
<p>If <pre class="crayon-plain-tag">network.address</pre> is enabled and <pre class="crayon-plain-tag">jq</pre> is installed on the host, you can pull out the VM IP directly:</p>
<pre class="crayon-plain-tag">export COLIMA_VM_IP=$(colima status -j | jq -r .ip_address)
echo "$COLIMA_VM_IP"
ping "$COLIMA_VM_IP"</pre>
<p>Then verify both the Docker and Kubernetes control paths:</p>
<pre class="crayon-plain-tag">docker ps
kubectl config get-contexts
kubectl get nodes</pre>
<p>If you need to inspect the underlying VM directly, SSH into it:</p>
<pre class="crayon-plain-tag">colima ssh</pre>
<div class="blog_h1"><span class="graybg">Operational commands</span></div>
<p>The official command reference has a clear shape. <pre class="crayon-plain-tag">start</pre> handles creation and startup. Lifecycle commands handle stop, restart, and delete. Status and connection commands let you inspect and enter the VM. On top of that, Colima also exposes helper commands for Kubernetes, containerd, templates, upgrades, shell completion, and AI model runners.</p>
<pre class="crayon-plain-tag"># Start the default profile
colima start

# Start with Kubernetes enabled
colima start --kubernetes

# List all profiles
colima list

# Stop the current instance
colima stop

# Delete the current instance and its container data
colima delete --data --force</pre>
<div class="blog_h2"><span class="graybg">Command list</span></div>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 26%;">Command</td>
<td style="width: 20%;">Typical form</td>
<td>Purpose</td>
</tr>
</thead>
<tbody>
<tr>
<td>start</td>
<td>colima start [profile]</td>
<td>Creates or starts a profile. Most runtime and VM settings are applied here.</td>
</tr>
<tr>
<td>stop</td>
<td>colima stop [profile]</td>
<td>Stops an instance.</td>
</tr>
<tr>
<td>restart</td>
<td>colima restart [profile]</td>
<td>Restarts an instance.</td>
</tr>
<tr>
<td>delete</td>
<td>colima delete [profile]</td>
<td>Deletes an instance, with optional data removal.</td>
</tr>
<tr>
<td>status</td>
<td>colima status [profile]</td>
<td>Shows instance state, runtime, architecture, mount type, socket path, and related details.</td>
</tr>
<tr>
<td>list</td>
<td>colima list</td>
<td>Lists all profiles.</td>
</tr>
<tr>
<td>ssh</td>
<td>colima ssh [profile] -- command</td>
<td>Opens an SSH session or runs a single command inside the VM.</td>
</tr>
<tr>
<td>ssh-config</td>
<td>colima ssh-config [profile]</td>
<td>Prints the SSH configuration for the VM.</td>
</tr>
<tr>
<td>kubernetes start</td>
<td>colima kubernetes start [profile]</td>
<td>Enables Kubernetes on a running instance.</td>
</tr>
<tr>
<td>kubernetes stop</td>
<td>colima kubernetes stop [profile]</td>
<td>Stops Kubernetes.</td>
</tr>
<tr>
<td>kubernetes reset</td>
<td>colima kubernetes reset [profile]</td>
<td>Resets the built-in Kubernetes cluster.</td>
</tr>
<tr>
<td>model run</td>
<td>colima model run &lt;model&gt;</td>
<td>Runs an AI model.</td>
</tr>
<tr>
<td>model serve</td>
<td>colima model serve &lt;model&gt;</td>
<td>Serves an AI model through a web UI.</td>
</tr>
<tr>
<td>nerdctl</td>
<td>colima nerdctl -- &lt;command&gt;</td>
<td>Forwards nerdctl commands when the runtime is containerd.</td>
</tr>
<tr>
<td>nerdctl install</td>
<td>colima nerdctl install</td>
<td>Installs a standalone nerdctl binary for direct use.</td>
</tr>
<tr>
<td>template</td>
<td>colima template</td>
<td>Generates or edits the default configuration template.</td>
</tr>
<tr>
<td>update</td>
<td>colima update</td>
<td>Updates Colima itself.</td>
</tr>
<tr>
<td>prune</td>
<td>colima prune [profile]</td>
<td>Removes unused data to free disk space.</td>
</tr>
<tr>
<td>version</td>
<td>colima version</td>
<td>Prints version information.</td>
</tr>
<tr>
<td>completion</td>
<td>colima completion [shell]</td>
<td>Generates shell completion scripts.</td>
</tr>
</tbody>
</table>
<div class="blog_h2"><span class="graybg">start flag groups</span></div>
<p><pre class="crayon-plain-tag">colima start</pre> is where most of the surface area lives. The official docs group its flags into nine categories: resources, runtime, VM, networking, mounts, Kubernetes, SSH, DNS, and configuration.</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 18%;">Group</td>
<td style="width: 32%;">Flags</td>
<td>Description</td>
</tr>
</thead>
<tbody>
<tr>
<td>Resources</td>
<td>--cpus, --memory, --disk, --root-disk</td>
<td>Sets CPU, memory, container data disk, and root disk size.</td>
</tr>
<tr>
<td>Runtime</td>
<td>--runtime, --activate</td>
<td>Selects the runtime and controls whether contexts are activated automatically.</td>
</tr>
<tr>
<td>VM</td>
<td>--arch, --vm-type, --cpu-type, --hostname, --disk-image, --vz-rosetta, --nested-virtualization, --binfmt, --foreground</td>
<td>Controls architecture, virtualization backend, CPU model, disk image, and foreground mode.</td>
</tr>
<tr>
<td>Networking</td>
<td>--network-address, --network-host-addresses, --network-mode, --network-interface, --network-preferred-route, --gateway-address, --port-forwarder</td>
<td>Controls reachable IPs, bridged mode, routing, gateway behavior, and port forwarding.</td>
</tr>
<tr>
<td>Mounts</td>
<td>--mount, --mount-type, --mount-inotify</td>
<td>Controls host directory mounts and file event propagation.</td>
</tr>
<tr>
<td>Kubernetes</td>
<td>--kubernetes, --kubernetes-version, --k3s-arg, --k3s-listen-port</td>
<td>Enables k3s, selects a version, and passes extra server arguments.</td>
</tr>
<tr>
<td>SSH</td>
<td>--ssh-agent, --ssh-config, --ssh-port</td>
<td>Controls SSH agent forwarding, host SSH config generation, and the SSH port.</td>
</tr>
<tr>
<td>DNS</td>
<td>--dns, --dns-host</td>
<td>Sets DNS resolvers and custom host mappings.</td>
</tr>
<tr>
<td>Configuration</td>
<td>--edit, --editor, --template, --save-config, --env</td>
<td>Controls config editing, editor choice, template use, persistence of flags, and VM environment variables.</td>
</tr>
</tbody>
</table>
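<p>As a rough example of how these groups combine in practice, the following invocation mixes resources, VM, mounts, Kubernetes, and networking flags into one profile. The sizes and backend choices are illustrative only:</p>
<pre class="crayon-plain-tag">colima start \
  --cpus 4 --memory 8 --disk 100 \
  --vm-type vz --mount-type virtiofs \
  --kubernetes \
  --network-address</pre>
<p>Because --vm-type and --mount-type are creation-time settings, running this against an existing instance with different values requires deleting and recreating it first.</p>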
<div class="blog_h2"><span class="graybg">Other command flags</span></div>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 24%;">Command</td>
<td style="width: 24%;">Flags</td>
<td>Meaning</td>
</tr>
</thead>
<tbody>
<tr>
<td>delete</td>
<td>--data, --force</td>
<td><pre class="crayon-plain-tag">--data</pre> removes images, volumes, and related data. <pre class="crayon-plain-tag">--force</pre> skips confirmation.</td>
</tr>
<tr>
<td>list</td>
<td>--json</td>
<td>Outputs the profile list as JSON.</td>
</tr>
<tr>
<td>ssh</td>
<td>-- command</td>
<td>Runs a single command in the VM instead of opening an interactive shell.</td>
</tr>
<tr>
<td>model run / serve</td>
<td>--profile, --runner, --port</td>
<td>Selects the profile, the model runner backend, and the web UI port for <pre class="crayon-plain-tag">serve</pre>.</td>
</tr>
<tr>
<td>completion</td>
<td>bash, zsh, fish, powershell</td>
<td>Generates completion scripts for the selected shell.</td>
</tr>
</tbody>
</table>
<p>If a creation-time setting such as architecture, runtime, VM type, or mount driver does not change after a restart, that usually means nothing is wrong with the syntax. Those settings belong to instance creation, so the fix is to delete the instance and recreate it.</p>
<div class="blog_h1"><span class="graybg">Common issues</span></div>
<div class="blog_h2"><span class="graybg">Docker context</span></div>
<p>A large share of "Cannot connect to the Docker daemon" errors have nothing to do with Colima failing to start. The local <pre class="crayon-plain-tag">docker</pre> CLI is often still attached to a different context. Check <pre class="crayon-plain-tag">docker context ls</pre> first, then switch to <pre class="crayon-plain-tag">colima</pre> if needed.</p>
<div class="blog_h2"><span class="graybg">Image visibility</span></div>
<p>With the Docker runtime, images built or pulled inside one Colima instance are directly visible to Kubernetes in that same instance. That is one of the nicer parts of the setup because local builds do not need to be pushed to a remote registry just to test them. If you switch to the containerd runtime, the image workflow changes with it, and debugging should follow containerd namespaces rather than Docker assumptions.</p>
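<p>A small sketch of that workflow with an illustrative image name: build against the Colima Docker daemon, then reference the same tag from the built-in k3s cluster and tell the kubelet not to pull from a registry.</p>
<pre class="crayon-plain-tag"># Build locally against the Colima Docker daemon
docker build -t myapp:dev .

# Use the local image directly from k3s
kubectl run myapp --image=myapp:dev --image-pull-policy=Never
kubectl get pod myapp</pre>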
<div class="blog_h2"><span class="graybg">VM IP and port publishing</span></div>
<p><pre class="crayon-plain-tag">network.address: true</pre> makes the VM reachable from the host, which is useful for debugging, but it should not become a substitute for normal service exposure. Application traffic should still use container port publishing with <pre class="crayon-plain-tag">-p HOST:CONTAINER</pre>, or the usual Kubernetes Service and Ingress paths.</p>
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/colima-on-macos">Replacing Docker Desktop with Colima on macOS</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/colima-on-macos/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Kubernetes GPU Sharing</title>
		<link>https://blog.gmem.cc/k8s-gpu-sharing</link>
		<comments>https://blog.gmem.cc/k8s-gpu-sharing#comments</comments>
		<pubDate>Fri, 06 Mar 2026 06:52:07 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[Cloud]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=42345</guid>
		<description><![CDATA[<p>GPU sharing in Kubernetes depends on what the NVIDIA device plugin advertises to the scheduler, what isolation the underlying mechanism really provides, <a class="read-more" href="https://blog.gmem.cc/k8s-gpu-sharing">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/k8s-gpu-sharing">Kubernetes GPU Sharing</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><p>GPU sharing in Kubernetes depends on what the NVIDIA device plugin advertises to the scheduler, what isolation the underlying mechanism really provides, and what the installed hardware can support. This note uses a real production scheduling stall to walk through the GPU inventory, the practical differences between Time-Slicing and MIG, the constraints imposed by the current cluster hardware, and the rollout that expanded schedulable GPU slots from 2 to 11.</p>
<div class="blog_h1"><span class="graybg">Background</span></div>
<p>GPUs are expensive and scarce. Kubernetes treats them as exclusive resources by default, so one Pod usually occupies one whole physical GPU even when the workload uses only a small fraction of the card. As concurrent GPU-backed services grew, especially for LLM inference, NLP, and PII detection, that default model turned into both wasted capacity and a concrete scheduling problem.</p>
<p>The production cluster had 8 nodes, 2 of them with GPUs. An NLP inference service was scaled to multiple replicas, each requesting <pre class="crayon-plain-tag">nvidia.com/gpu: 1</pre>. Both physical GPUs were already occupied, so new replicas sat in Pending for more than 28 days with repeated events such as <pre class="crayon-plain-tag">0/8 nodes are available: 8 Insufficient nvidia.com/gpu</pre>. That failure forced a deeper evaluation of GPU sharing options.</p>
<p>NVIDIA exposes two mainstream sharing paths for Kubernetes: Time-Slicing and MIG. They solve different problems. They also depend on very different hardware assumptions, which means the hardware survey has to come first.</p>
<div class="blog_h1"><span class="graybg">Cluster Survey</span></div>
<div class="blog_h2"><span class="graybg">Node And GPU Inventory</span></div>
<p>All 8 nodes in the cluster were Ready and running on Tencent Cloud TKE. GPU nodes were identified through the node label <pre class="crayon-plain-tag">nvidia-device-enable=enable</pre>, then a privileged Pod entered the host namespace and ran <pre class="crayon-plain-tag">nvidia-smi -q</pre> to confirm the exact models:</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="text-align: center;">Instance Type</td>
<td style="text-align: center;">GPU Model</td>
<td style="text-align: center;">Architecture</td>
<td style="text-align: center;">Memory</td>
<td style="text-align: center;">Compute Capability</td>
<td style="text-align: center;">Driver</td>
<td style="text-align: center;">MIG Support</td>
</tr>
</thead>
<tbody>
<tr>
<td>PNV5b.8XLARGE96</td>
<td>NVIDIA L20</td>
<td>Ada Lovelace</td>
<td>46068 MiB</td>
<td>8.9</td>
<td>570.158.01</td>
<td>No</td>
</tr>
<tr>
<td>GN7.2XLARGE32</td>
<td>Tesla T4</td>
<td>Turing</td>
<td>15360 MiB</td>
<td>7.5</td>
<td>570.158.01</td>
<td>No</td>
</tr>
</tbody>
</table>
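<p>A privileged inspection Pod of that kind can be sketched with the common node-shell pattern: pin a privileged Pod to the GPU node with <pre class="crayon-plain-tag">hostPID</pre> and use nsenter to run the host's nvidia-smi. This is a rough sketch rather than the exact Pod used for the survey, and &lt;gpu-node&gt; is a placeholder:</p>
<pre class="crayon-plain-tag">kubectl run gpu-inspect --rm -it --restart=Never --image=alpine \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"&lt;gpu-node&gt;","hostPID":true,
    "containers":[{"name":"gpu-inspect","image":"alpine","stdin":true,"tty":true,
      "securityContext":{"privileged":true},
      "command":["nsenter","-t","1","-m","-u","-i","-n","--","nvidia-smi","-q"]}]}}'</pre>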
<p>Both nodes were on the same driver release and neither card supported MIG. That distinction matters. The L20 has a higher Compute Capability than Ampere, but MIG support does not track Compute Capability monotonically. The Ada Lovelace L-series does not support MIG, while Ampere parts such as A100 and A30, and Hopper parts such as H100 and H20, do.</p>
<div class="blog_h2"><span class="graybg">Existing Device Plugin State</span></div>
<p>The cluster was running the TKE-provided <pre class="crayon-plain-tag">nvidia-device-plugin:v0.14.5</pre> as a DaemonSet with the startup arguments <pre class="crayon-plain-tag">--mig-strategy=single --fail-on-init-error=false --pass-device-specs=true</pre>. There was no <pre class="crayon-plain-tag">--config-file</pre> flag and no Time-Slicing ConfigMap. Each GPU node registered exactly one <pre class="crayon-plain-tag">nvidia.com/gpu</pre> resource with Kubernetes, so the cluster had 2 schedulable GPU slots in total, both already consumed. The DaemonSet metadata included <pre class="crayon-plain-tag">meta.helm.sh/release-name: nvidia-gpu</pre>, which showed that it had originally been installed through Helm even though Helm CLI was not present in the current environment.</p>
<div class="blog_h1"><span class="graybg">Mechanisms</span></div>
<div class="blog_h2"><span class="graybg">Time-Slicing</span></div>
<p>Time-Slicing is an oversubscription feature implemented by the NVIDIA k8s-device-plugin. An administrator defines a replica count for each GPU resource in a ConfigMap, and the device plugin advertises that GPU to Kubernetes as multiple schedulable resources. Under the hood, workloads share the same physical GPU and CUDA time-slices execution across processes.</p>
<p>Kubernetes itself does not understand the semantics of GPU sharing. It only sees whatever extended resources the plugin exposes, such as <pre class="crayon-plain-tag">nvidia.com/gpu</pre> or <pre class="crayon-plain-tag">nvidia.com/gpu.shared</pre>. From the Pod's point of view, the declaration pattern does not change: GPU resources belong in <pre class="crayon-plain-tag">resources.limits</pre>, and if <pre class="crayon-plain-tag">requests</pre> is also present, it must match the limit value.</p>
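<p>A minimal Pod fragment for that pattern, with a placeholder image name; note that when <pre class="crayon-plain-tag">requests</pre> is set it has to equal the limit:</p>
<pre class="crayon-plain-tag">apiVersion: v1
kind: Pod
metadata:
  name: nlp-inference
spec:
  containers:
  - name: app
    image: nlp-inference:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1         # one schedulable slot, shared under Time-Slicing
      requests:
        nvidia.com/gpu: 1         # must match the limit</pre>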
<p>The tradeoff is blunt. Time-Slicing does not isolate memory, so all replicas on the same GPU share one physical memory pool. It also does not guarantee a proportional share of compute. Asking for multiple shared GPUs does not mean the workload gets a linear share of throughput. NVIDIA recommends <pre class="crayon-plain-tag">failRequestsGreaterThanOne: true</pre> so that a request larger than 1 fails with <pre class="crayon-plain-tag">UnexpectedAdmissionError</pre> instead of creating the false impression of an exclusive quota.</p>
<p>The upside is operational simplicity. On existing non-MIG hardware, it usually takes only a ConfigMap plus a device-plugin restart. The downside is weak isolation, a shared fault domain, and limited observability. In Time-Slicing mode, DCGM-Exporter cannot reliably attribute metrics to individual containers and mostly reports at the physical GPU level.</p>
<div class="blog_h2"><span class="graybg">MIG</span></div>
<p>MIG, or Multi-Instance GPU, is NVIDIA's hardware partitioning model introduced on Ampere-class GPUs and later architectures that support it. A physical GPU is divided into GPU Instances, each with its own SM slices, memory partition, cache and bandwidth share, DMA engines, and hardware fault boundary. That is the isolation Time-Slicing cannot provide.</p>
<p>How MIG appears in Kubernetes depends on the configured strategy. Under <pre class="crayon-plain-tag">single</pre>, the resource name stays as <pre class="crayon-plain-tag">nvidia.com/gpu</pre>, but each advertised unit maps to one same-profile MIG instance. Under <pre class="crayon-plain-tag">mixed</pre>, each MIG profile is exposed as its own resource type such as <pre class="crayon-plain-tag">nvidia.com/mig-1g.12gb</pre>. That model is far cleaner for isolation, but it depends on MIG-capable hardware. Changing MIG profiles also requires node-level maintenance. On Hopper, GPU reset support makes this less disruptive than it was on Ampere, but it is still not a zero-touch change.</p>
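<p>Under the <pre class="crayon-plain-tag">mixed</pre> strategy, the container spec targets the profile resource directly. A minimal fragment, assuming a 1g.12gb instance has already been created on the node:</p>
<pre class="crayon-plain-tag">    resources:
      limits:
        nvidia.com/mig-1g.12gb: 1</pre>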
<div class="blog_h2"><span class="graybg">Comparison</span></div>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="text-align: center;">Dimension</td>
<td style="text-align: center;">Time-Slicing</td>
<td style="text-align: center;">MIG</td>
</tr>
</thead>
<tbody>
<tr>
<td>Memory isolation</td>
<td>None, all replicas share physical memory</td>
<td>Hardware-level, per-instance</td>
</tr>
<tr>
<td>Fault domain</td>
<td>Shared within one physical GPU</td>
<td>Isolated at the instance level</td>
</tr>
<tr>
<td>Kubernetes resource shape</td>
<td><pre class="crayon-plain-tag">nvidia.com/gpu</pre> or <pre class="crayon-plain-tag">nvidia.com/gpu.shared</pre></td>
<td><pre class="crayon-plain-tag">single</pre> uses <pre class="crayon-plain-tag">nvidia.com/gpu</pre>; <pre class="crayon-plain-tag">mixed</pre> uses <pre class="crayon-plain-tag">nvidia.com/mig-*</pre></td>
</tr>
<tr>
<td>Workload changes</td>
<td>Often none when <pre class="crayon-plain-tag">renameByDefault=false</pre></td>
<td><pre class="crayon-plain-tag">mixed</pre> requires Pods to request explicit MIG resources</td>
</tr>
<tr>
<td>Hardware support</td>
<td>Broad, works on existing full-GPU resources</td>
<td>Only on MIG-capable GPU models</td>
</tr>
<tr>
<td>Metric attribution</td>
<td>Mostly physical-GPU level</td>
<td>Can be modeled around MIG resources</td>
</tr>
<tr>
<td>Operational complexity</td>
<td>Low, usually just ConfigMap plus plugin restart</td>
<td>Moderate, requires lifecycle management for GPU Instances</td>
</tr>
<tr>
<td>Composition</td>
<td>Can be applied to full GPUs and to <pre class="crayon-plain-tag">mixed</pre> MIG resources</td>
<td>Can serve as the lower-level partitioning layer before Time-Slicing</td>
</tr>
</tbody>
</table>
<div class="blog_h1"><span class="graybg">H20 And MIG</span></div>
<p>The H20 is a Hopper GH100 part with Compute Capability 9.0 and 96 GB of HBM3e. NVIDIA's MIG documentation lists it as supporting up to 7 MIG instances. That makes it a relevant medium-term target even though it was not present in the current cluster.</p>
<p>Typical H20 partition shapes look like this:</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="text-align: center;">Profile</td>
<td style="text-align: center;">SM Share</td>
<td style="text-align: center;">Memory</td>
<td style="text-align: center;">Instances Per Card</td>
<td style="text-align: center;">Typical Use</td>
</tr>
</thead>
<tbody>
<tr>
<td>1g.12gb</td>
<td>1/7</td>
<td>12GB</td>
<td>7</td>
<td>Inference for models up to roughly 7B</td>
</tr>
<tr>
<td>2g.24gb</td>
<td>2/7</td>
<td>24GB</td>
<td>3</td>
<td>Mid-size models around 13B</td>
</tr>
<tr>
<td>3g.47gb</td>
<td>3/7</td>
<td>47GB</td>
<td>2</td>
<td>Models around 30B</td>
</tr>
<tr>
<td>4g.47gb</td>
<td>4/7</td>
<td>47GB</td>
<td>1</td>
<td>Single larger model instance</td>
</tr>
<tr>
<td>7g.94gb</td>
<td>7/7</td>
<td>94GB</td>
<td>1</td>
<td>Full-card style allocation for 70B+</td>
</tr>
</tbody>
</table>
<p>Time-Slicing and MIG can be combined. A common pattern is to partition the H20 with MIG first, then oversubscribe a specific MIG resource with Time-Slicing. In that case the ConfigMap needs <pre class="crayon-plain-tag">migStrategy: mixed</pre> and the resource name must target the MIG profile directly:</p>
<pre class="crayon-plain-tag">sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/mig-1g.12gb
        replicas: 2</pre>
<div class="blog_h1"><span class="graybg">Configuration Reference</span></div>
<div class="blog_h2"><span class="graybg">Time-Slicing ConfigMap By GPU Model</span></div>
<p>The ConfigMap can hold multiple keys, with each key representing one node configuration. <pre class="crayon-plain-tag">any</pre> acts as the fallback. Other keys are selected through node labels:</p>
<pre class="crayon-plain-tag">apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 2
  l20: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 8
  t4: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 3</pre>
<p>With <pre class="crayon-plain-tag">renameByDefault: false</pre>, the resource name stays as <pre class="crayon-plain-tag">nvidia.com/gpu</pre>. The node labels then pick up a <pre class="crayon-plain-tag">-SHARED</pre> suffix, for example <pre class="crayon-plain-tag">nvidia.com/gpu.product=Tesla-T4-SHARED</pre>, which makes it possible to distinguish shared and non-shared nodes with selectors. Replica counts were chosen from measured per-process memory usage, described later in the rollout record.</p>
<div class="blog_h2"><span class="graybg">Device Plugin DaemonSets By GPU Model</span></div>
<p>Version <pre class="crayon-plain-tag">v0.14.5</pre> does not support per-node dynamic config selection in the way newer operator-managed deployments do. The practical solution was to run two DaemonSets, each with its own <pre class="crayon-plain-tag">--config-file</pre> and its own node selector:</p>
<pre class="crayon-plain-tag"># Label nodes by GPU model
kubectl label node &lt;l20-node&gt; nvidia.com/device-plugin.config=l20
kubectl label node &lt;t4-node&gt;  nvidia.com/device-plugin.config=t4

# Patch the existing DaemonSet so it runs only on T4 nodes and uses the t4 config
kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/args/3",
   "value":"--config-file=/etc/nvidia/time-slicing-config/t4"},
  {"op":"add","path":"/spec/template/spec/nodeSelector/nvidia.com~1device-plugin.config",
   "value":"t4"}
]'
kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system</pre>
<p>A dedicated L20 DaemonSet then reused the same ConfigMap but pointed at the <pre class="crayon-plain-tag">l20</pre> key:</p>
<pre class="crayon-plain-tag">apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset-l20
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds-l20
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds-l20
    spec:
      nodeSelector:
        nvidia-device-enable: enable
        nvidia.com/device-plugin.config: l20
      tolerations:
      - operator: Exists
      priorityClassName: system-node-critical
      containers:
      - name: nvidia-device-plugin-ctr
        image: sgccr.ccs.tencentyun.com/tkeimages/nvidia-device-plugin:v0.14.5
        command: [nvidia-device-plugin]
        args:
        - --fail-on-init-error=false
        - --mig-strategy=single
        - --pass-device-specs=true
        - --config-file=/etc/nvidia/time-slicing-config/l20
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: utility,compute
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          capabilities:
            drop: [ALL]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: time-slicing-config
          mountPath: /etc/nvidia/time-slicing-config
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: time-slicing-config
        configMap:
          name: time-slicing-config</pre>
<div class="blog_h2"><span class="graybg">MIG On Future H20 Nodes</span></div>
<p>On Hopper hardware, MIG reconfiguration is less painful than it was on earlier generations because GPU reset support is better. The operational sequence is still node maintenance first, GPU reconfiguration second, scheduler re-entry last:</p>
<pre class="crayon-plain-tag">kubectl drain &lt;h20-node&gt; --ignore-daemonsets --delete-emptydir-data

# SSH into the H20 node
sudo nvidia-smi -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
nvidia-smi -L

kubectl uncordon &lt;h20-node&gt;</pre>
<p>From the Kubernetes side, the three MIG strategies remain <pre class="crayon-plain-tag">single</pre>, <pre class="crayon-plain-tag">mixed</pre>, and <pre class="crayon-plain-tag">none</pre>. <pre class="crayon-plain-tag">single</pre> keeps the traditional <pre class="crayon-plain-tag">nvidia.com/gpu</pre> resource shape when all instances on a node share one profile. <pre class="crayon-plain-tag">mixed</pre> exposes explicit <pre class="crayon-plain-tag">nvidia.com/mig-*</pre> resources and requires workloads to request them directly. For a new MIG deployment, Helm is the cleaner path:</p>
<pre class="crayon-plain-tag">helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.17.1 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set migStrategy=single \
  --set gfd.enabled=true</pre>
<div class="blog_h1"><span class="graybg">Rollout Record</span></div>
<div class="blog_h2"><span class="graybg">Step 1: Inspect The Baseline</span></div>
<pre class="crayon-plain-tag"># Check the current DaemonSet arguments
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system -o yaml | grep -A 15 "containers:"

# Check whether any Time-Slicing ConfigMap already exists
kubectl get configmap -n kube-system | grep nvidia

# Check current GPU slot counts
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"</pre>
<p>At baseline there was no ConfigMap, no <pre class="crayon-plain-tag">--config-file</pre>, and only one schedulable GPU slot per GPU node.</p>
<div class="blog_h2"><span class="graybg">Step 2: Create The ConfigMap And Patch The DaemonSet</span></div>
<p>The first pass used the <pre class="crayon-plain-tag">any</pre> key with <pre class="crayon-plain-tag">replicas=2</pre>. The goal at that stage was not model-specific tuning. It was to verify that the plugin picked up Time-Slicing at all.</p>
<pre class="crayon-plain-tag">kubectl apply -f time-slicing-config.yaml

kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --type=json -p='[
  {"op":"add","path":"/spec/template/spec/volumes/-",
   "value":{"name":"time-slicing-config","configMap":{"name":"time-slicing-config"}}},
  {"op":"add","path":"/spec/template/spec/containers/0/volumeMounts/-",
   "value":{"name":"time-slicing-config","mountPath":"/etc/nvidia/time-slicing-config"}},
  {"op":"add","path":"/spec/template/spec/containers/0/args/-",
   "value":"--config-file=/etc/nvidia/time-slicing-config/any"}
]'

kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system
kubectl rollout status daemonset/nvidia-device-plugin-daemonset -n kube-system --timeout=120s</pre>
<p>One operational detail matters here: the device plugin does not watch ConfigMap updates automatically. Editing the ConfigMap alone is not enough. A DaemonSet restart is required before a new Time-Slicing configuration takes effect.</p>
<div class="blog_h2"><span class="graybg">Step 3: Verify The First Expansion</span></div>
<p>The restart finished in about 6 seconds. After that, node capacity showed the first slot expansion:</p>
<pre class="crayon-plain-tag">kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU-CAP:.status.capacity.nvidia\.com/gpu,GPU-ALLOC:.status.allocatable.nvidia\.com/gpu"
# L20 node   2   2
# T4  node   2   2</pre>
<p>The plugin logs also confirmed that the Time-Slicing configuration had loaded:</p>
<pre class="crayon-plain-tag">kubectl logs -n kube-system &lt;device-plugin-pod&gt; --tail=5
# "timeSlicing": {"failRequestsGreaterThanOne": true, "resources": [{"replicas": 2}]}
# Registered device plugin for 'nvidia.com/gpu' with Kubelet</pre>
<p>The workload side told the more useful story. The blocked object was not just one Pod. It was a Rolling Update that had been stalled for 28 days. The Deployment had already created a new ReplicaSet, but the first replacement Pod could not schedule because no GPU was free. Since the default Deployment strategy waits for the new ReplicaSet to become ready before shrinking the old one, the entire release froze in the middle. Once Time-Slicing expanded capacity, scheduling completed within 43 seconds and the Deployment resumed immediately:</p>
<pre class="crayon-plain-tag">kubectl describe pod &lt;pending-pod&gt; -n &lt;ns&gt; | grep -A 3 "Events:"
# Warning  FailedScheduling  (x1303 over 4d12h)  0/8 nodes are available: 8 Insufficient nvidia.com/gpu.
# Normal   Scheduled         43s                  Successfully assigned &lt;pod&gt; to &lt;gpu-node&gt;</pre>
<div class="blog_h2"><span class="graybg">Step 4: Tune Replica Counts By GPU Model</span></div>
<p>After the mechanism worked, the next step was to size replicas from actual memory usage rather than intuition. A privileged Pod was used to measure the live memory footprint on both GPU nodes:</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="text-align: center;">GPU</td>
<td style="text-align: center;">Total Memory</td>
<td style="text-align: center;">Measured Per-Process Usage</td>
<td style="text-align: center;">Final Replicas</td>
<td style="text-align: center;">Theoretical Headroom Per Slot</td>
</tr>
</thead>
<tbody>
<tr>
<td>NVIDIA L20</td>
<td>46068 MiB</td>
<td>4621 MiB (about 10%)</td>
<td>8</td>
<td>about 1137 MiB</td>
</tr>
<tr>
<td>Tesla T4</td>
<td>15360 MiB</td>
<td>4401 MiB (about 29%)</td>
<td>3</td>
<td>about 719 MiB</td>
</tr>
</tbody>
</table>
<p>The L20 made the underutilization obvious. With <pre class="crayon-plain-tag">replicas=2</pre>, each slot effectively had about 22 GB available while the measured process used only about 4.6 GB. That was far too conservative. Raising the L20 to 8 slots pushed theoretical memory utilization much closer to a useful level.</p>
<p>This stage also exposed a version-specific limitation. The <pre class="crayon-plain-tag">v0.14.5</pre> plugin cannot point <pre class="crayon-plain-tag">--config-file</pre> at a directory for dynamic per-node selection. On this version, doing so crashes the Pod:</p>
<pre class="crayon-plain-tag"># Pod CrashLoopBackOff, log output:
# E unable to load config: unable to finalize config: unable to parse config file:
#   read error: read /etc/nvidia/time-slicing-config: is a directory</pre>
<p>That selection mechanism depends on the config-manager sidecar used by GPU Operator deployments. The bare device plugin does not have it. In practice, that forced the two-DaemonSet layout: one bound to T4 nodes and one bound to L20 nodes, each with an explicit config file target.</p>
<p>The final GPU slot layout looked like this:</p>
<pre class="crayon-plain-tag">kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU-CAP:.status.capacity.nvidia\.com/gpu,GPU-ALLOC:.status.allocatable.nvidia\.com/gpu"
# L20 node   8   8
# T4  node   3   3</pre>
<div class="blog_h2"><span class="graybg">Outstanding Issues</span></div>
<p>One risk remains around ownership. The original DaemonSet is TKE-managed. During cluster upgrades or node-group operations, the control plane may reconcile that DaemonSet back to its original form and wipe out manual patches. The current mitigation is documentation and repeatability. The cleaner long-term answer is to move to TKE's native GPU sharing feature or deploy GPU Operator and stop patching the managed object directly.</p>
<p>Observability is also still weak. The cluster already runs <pre class="crayon-plain-tag">nvidia-gpu-exporter</pre>, but in Time-Slicing mode the metrics still aggregate at the physical GPU level. Per-Pod memory and compute attribution remains limited. That is one reason MIG is still the better long-term target when the hardware supports it.</p>
<div class="blog_h1"><span class="graybg">Final State</span></div>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="text-align: center;">Item</td>
<td style="text-align: center;">Before</td>
<td style="text-align: center;">After</td>
</tr>
</thead>
<tbody>
<tr>
<td>Total GPU slots</td>
<td>2 full physical GPUs</td>
<td>11 slots (L20 x 8 + T4 x 3)</td>
</tr>
<tr>
<td>L20 utilization by memory</td>
<td>about 10% (1 process on 46 GB)</td>
<td>about 80% at theoretical full slot usage</td>
</tr>
<tr>
<td>T4 utilization by memory</td>
<td>about 29% (1 process on 15 GB)</td>
<td>about 86% at theoretical full slot usage</td>
</tr>
<tr>
<td>Pending Pods</td>
<td>1 Pod stuck for 28 days</td>
<td>0</td>
</tr>
<tr>
<td>Blocked Rolling Update</td>
<td>Frozen for 28 days</td>
<td>Completed, new version fully ready</td>
</tr>
<tr>
<td>DaemonSet count</td>
<td>1 generic DaemonSet</td>
<td>2 model-specific DaemonSets</td>
</tr>
<tr>
<td>Memory and fault isolation</td>
<td>None</td>
<td>Still none under Time-Slicing</td>
</tr>
<tr>
<td>Container-level GPU metrics</td>
<td>None</td>
<td>Still limited, pending MIG-capable hardware</td>
</tr>
</tbody>
</table>
<div class="blog_h1"><span class="graybg">References</span></div>
<ul>
<li><a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/">NVIDIA MIG User Guide r580</a></li>
<li><a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html">GPU Operator: Time-Slicing GPUs in Kubernetes</a></li>
<li><a href="https://github.com/NVIDIA/k8s-device-plugin/tree/v0.14.5">NVIDIA k8s-device-plugin README (v0.14.5)</a></li>
<li><a href="https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html">MIG Support in Kubernetes</a></li>
<li><a href="https://www.tencentcloud.com/document/product/560/19701">Tencent Cloud GPU Instance Families (PNV5b / GN7)</a></li>
</ul>
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/k8s-gpu-sharing">Kubernetes GPU Sharing</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/k8s-gpu-sharing/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Investigating and Solving the Issue of Failed Certificate Request with ZeroSSL and Cert-Manager</title>
		<link>https://blog.gmem.cc/investigating-solving-issue-failed-certificate-request-zerossl-cert-manager</link>
		<comments>https://blog.gmem.cc/investigating-solving-issue-failed-certificate-request-zerossl-cert-manager#comments</comments>
		<pubDate>Mon, 14 Oct 2024 06:45:45 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[K8S]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=40243</guid>
		<description><![CDATA[<p>In this blog post, I will walk through my journey investigating and resolving an issue where my certificate request from ZeroSSL, using <a class="read-more" href="https://blog.gmem.cc/investigating-solving-issue-failed-certificate-request-zerossl-cert-manager">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/investigating-solving-issue-failed-certificate-request-zerossl-cert-manager">Investigating and Solving the Issue of Failed Certificate Request with ZeroSSL and Cert-Manager</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><p>In this blog post, I will walk through my journey investigating and resolving an issue where my certificate request from ZeroSSL, using Cert-Manager, remained in a "not ready" state for over two days. I'll cover the tools involved, provide background information, and show how I eventually identified and fixed the problem.</p>
<div class="blog_h1"><span class="graybg">Background on ACME</span></div>
<div class="blog_h2"><span class="graybg">ACME</span></div>
<p>ACME (Automatic Certificate Management Environment) is a protocol developed by the Internet Security Research Group (ISRG) to automate the process of obtaining and managing SSL/TLS certificates from Certificate Authorities (CAs). It simplifies the traditionally manual steps involved in certificate issuance by using an automated process. ACME is widely used by services like Let’s Encrypt and ZeroSSL to secure websites with HTTPS.</p>
<p>The ACME protocol automates interactions between a client (such as Cert-Manager or Certbot) and a CA, allowing the client to request, renew, and manage certificates without human intervention. ACME operates through a series of challenges that prove ownership of the domain for which a certificate is requested. Once the ownership is verified, the CA can issue the certificate.</p>
<div class="blog_h3"><span class="graybg">HTTP-01 Challenge</span></div>
<p>In this challenge, the client proves ownership of a domain by hosting a specific file at a designated path (e.g., http://gmem.cc/.well-known/acme-challenge/). The CA attempts to retrieve this file, and if successful, the challenge is validated. This challenge is commonly used for publicly accessible web servers.</p>
<div class="blog_h3"><span class="graybg">DNS-01 Challenge</span></div>
<p>In this challenge, the client proves domain ownership by creating a special DNS TXT record for the domain. The CA checks the DNS record to confirm ownership. DNS-01 is typically used for wildcard certificates or when the server is not publicly accessible because it doesn’t rely on serving files over HTTP.</p>
<div class="blog_h3"><span class="graybg">TLS-ALPN-01 Challenge</span></div>
<p>This challenge requires the client to prove control of a domain by configuring a TLS server with a special certificate during the ACME validation process. The CA then connects to the server via TLS and checks the certificate. This challenge is less common and usually used in specialized environments.</p>
<div class="blog_h1"><span class="graybg">Background on Cert-Manager</span></div>
<p>Cert-Manager is an open-source Kubernetes add-on that automates the management, issuance, and renewal of certificates within Kubernetes clusters. It integrates with various Certificate Authorities (CAs) and protocols, including ACME (used by providers like Let’s Encrypt and ZeroSSL). Cert-Manager is widely used to ensure that certificates remain valid and secure without manual intervention.</p>
<div class="blog_h2"><span class="graybg">Cert-Manager Components</span></div>
<p>When deploying Cert-Manager in Kubernetes, several key components work together to handle certificate management.</p>
<div class="blog_h3"><span class="graybg">cert-manager</span></div>
<p>The core component of the Cert-Manager system, responsible for managing the lifecycle of certificates and interacting with Issuers (such as ACME servers like Let’s Encrypt or ZeroSSL). It runs as a Kubernetes controller and is responsible for:</p>
<ol>
<li>Watching Certificate, CertificateRequest, Issuer, ClusterIssuer, Order, and Challenge resources.</li>
<li>Requesting certificates from CAs.</li>
<li>Automatically renewing certificates before expiration.</li>
<li>Handling the interactions with external CAs (via ACME, Vault, Venafi, etc.).</li>
</ol>
<p>This component performs the actual management of certificates, from creation to renewal, ensuring that the requested certificates are stored securely as Kubernetes Secrets.</p>
<div class="blog_h3"><span class="graybg">cert-manager-cainjector</span></div>
<p>The CA Injector is an additional component that works alongside Cert-Manager to inject CA data into other Kubernetes resources. It primarily operates on Kubernetes ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources, injecting certificates into them automatically. This is necessary for:</p>
<ol>
<li>Mutating admission controllers that require TLS certificates for secure communication.</li>
<li>Ensuring that Kubernetes components relying on CA certificates have up-to-date CA data.</li>
</ol>
<p>The cainjector is critical in environments where certain Kubernetes components (e.g., webhooks) require their certificates to be signed by a trusted CA.</p>
<div class="blog_h3"><span class="graybg">cert-manager-webhook</span></div>
<p>The Webhook component provides an admission controller that validates Cert-Manager resources like Certificate, Issuer, ClusterIssuer, and CertificateRequest upon creation or update. It ensures that the Cert-Manager resources are correctly configured by:</p>
<ol>
<li>Validating resources before they are accepted into the Kubernetes API (syntax and structure).</li>
<li>Mutating resources to provide defaults (for example, setting default values in a Certificate resource).</li>
<li>Providing a layer of security and correctness by ensuring invalid configurations are caught early.</li>
</ol>
<p>The webhook helps catch configuration issues early, improving the reliability of certificate management workflows.</p>
<div class="blog_h3"><span class="graybg">cert-manager-webhook-dnspod</span></div>
<p>This webhook, which is maintained by the <a href="https://github.com/imroc/cert-manager-webhook-dnspod">community</a>, specifically handles DNS-01 challenges for domains hosted in Tencent Cloud’s DNSPod. When Cert-Manager requests a certificate using the DNS-01 challenge, it needs to create a DNS TXT record in the domain's DNS zone. cert-manager-webhook-dnspod facilitates this by interacting with the DNSPod API to manage DNS records.</p>
<p>Cert-Manager invokes this webhook when it needs to solve a DNS-01 challenge using DNSPod as the DNS provider. The webhook receives instructions from Cert-Manager, communicates with the DNSPod API to create or delete DNS TXT records, and reports back to Cert-Manager when the challenge is complete.</p>
<p>In this post the webhook was created with the following manifest:</p>
<pre class="crayon-plain-tag">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager-webhook-dnspod
  namespace: argocd
spec:
  destination:
    namespace: cert-manager
    server: https://kubernetes.default.svc
  project: default
  source:
    repoURL: https://github.com/imroc/cert-manager-webhook-dnspod
    targetRevision: master
    path: charts/cert-manager-webhook-dnspod
    helm:
      releaseName: cert-manager-webhook-dnspod
      values: |
        groupName: acme.dnspod.tencent.com
        clusterIssuer:
          name: letsencrypt-issuer
          secretId: DNSPOD_SECRETID
          secretKey: DNSPOD_SECRETKEY
          email: gmem@me.com
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
    automated:
      prune: true
      selfHeal: true
 </pre>
<div class="blog_h2"><span class="graybg">Cert-Manager CRDs</span></div>
<p>Cert-Manager relies on several types of custom resources that work together to manage the certificate lifecycle:</p>
<ol>
<li>Issuer/ClusterIssuer:
<ol>
<li>An Issuer or ClusterIssuer is a Cert-Manager resource that defines how certificates should be requested from a Certificate Authority (CA). The difference between them is that an Issuer is scoped to a single namespace, whereas a ClusterIssuer can be used cluster-wide.</li>
<li>Issuers can be configured to use different CA backends, such as ACME, Vault, or self-signed certificates.</li>
</ol>
</li>
<li>Certificate: The Certificate resource defines which certificates should be issued and managed by Cert-Manager. It specifies details such as the domain names, issuer to use, renewal period, and where to store the certificate (usually in a Kubernetes secret).</li>
<li>CertificateRequest: When a Certificate resource is created, Cert-Manager generates a CertificateRequest. This resource represents the actual request to the Issuer for a certificate. Cert-Manager manages these requests and handles the approval and signing process.</li>
<li>Order: When Cert-Manager requests a certificate from an ACME-based Issuer, it creates an Order resource to track the status of the certificate issuance. The Order keeps track of challenges and interactions with the ACME server.</li>
<li>Challenge: A Challenge resource represents the ACME challenge issued by the CA (such as DNS-01 or HTTP-01). The challenge proves domain ownership by requiring the client (Cert-Manager) to respond to a domain validation request from the CA.</li>
</ol>
<div class="blog_h2"><span class="graybg">Cert-Manager  Challenge Workflows</span></div>
<div class="blog_h3"><span class="graybg">HTTP-01 Challenge<br /></span></div>
<p>The HTTP-01 challenge is used when the domain is publicly accessible over HTTP. The CA validates ownership of the domain by requesting a specific file over HTTP. Cert-Manager sets up the challenge response using a Kubernetes Ingress resource.</p>
<ol>
<li>Create an Issuer or ClusterIssuer: The Issuer defines the connection to the CA (such as Let’s Encrypt or ZeroSSL) and specifies that ACME should be used with the HTTP-01 challenge.</li>
<li>Create a Certificate Resource: A Certificate resource is created that specifies the domain names for which a certificate is needed and the Issuer to use.</li>
<li>Cert-Manager Creates a CertificateRequest: Cert-Manager generates a CertificateRequest based on the Certificate resource.</li>
<li>Cert-Manager Creates an Order: Cert-Manager creates an Order resource to track the status of the certificate request with the ACME server.</li>
<li>Cert-Manager Creates an HTTP-01 Challenge: An HTTP-01 challenge is created, and Cert-Manager configures an Ingress resource to serve the challenge response at the path /.well-known/acme-challenge/.</li>
<li>ACME Server Attempts to Validate: The CA attempts to access the challenge file via the HTTP URL (e.g., http://example.com/.well-known/acme-challenge/&lt;token&gt;). If successful, the challenge is validated.</li>
<li>Certificate Issued: Once the challenge is validated, the CA issues the certificate, and Cert-Manager stores it in the specified Kubernetes Secret.</li>
</ol>
<div class="blog_h3"><span class="graybg">DNS-01 Challenge </span></div>
<p>The DNS-01 challenge is used when domain ownership must be validated via DNS records. This method is often preferred for wildcard certificates or domains that are not publicly accessible over HTTP.</p>
<ol>
<li>Create an Issuer or ClusterIssuer: The Issuer specifies that ACME should be used with the DNS-01 challenge and includes configuration for interacting with the DNS provider (e.g., AWS Route53, Cloudflare, or a custom webhook like dnspod).</li>
<li>Create a Certificate Resource: A Certificate resource is created that defines the domains for which the certificate is needed and the Issuer to use.</li>
<li>Cert-Manager Creates a CertificateRequest: Cert-Manager generates a CertificateRequest based on the Certificate resource.</li>
<li>Cert-Manager Creates an Order: Cert-Manager creates an Order resource to track the status of the certificate request with the ACME server.</li>
<li>Cert-Manager Creates a DNS-01 Challenge: A DNS-01 challenge is created, and Cert-Manager interacts with the configured DNS provider to automatically create a special TXT record for the domain (e.g., _acme-challenge.example.com).</li>
<li>ACME Server Attempts to Validate: The CA checks for the presence of the _acme-challenge.example.com TXT record in the domain’s DNS records. If the correct value is found, the challenge is validated.</li>
<li>Certificate Issued: Once the DNS challenge is validated, the CA issues the certificate, and Cert-Manager stores it in the specified Kubernetes Secret.</li>
</ol>
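<p>Put concretely, the kind of Certificate that drives this DNS-01 flow looks roughly like the following. The names here are reconstructed from the resources that appear later in this post, so treat them as illustrative rather than the exact manifest used:</p>
<pre class="crayon-plain-tag">apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-ssl
  namespace: istio-system
spec:
  secretName: wildcard-ssl
  dnsNames:
    - "*.gmem.cc"
  issuerRef:
    name: letsencrypt-issuer
    kind: ClusterIssuer</pre>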
<div class="blog_h1"><span class="graybg">The Problem, Investigation and Fix<br /></span></div>
<div class="blog_h2"><span class="graybg">Problem Statement </span></div>
<p>I created a ClusterIssuer and a Certificate for the wildcard domain *.gmem.cc. However, after waiting for more than two days, the certificate status still showed as not ready, with the following condition in the certificate's status:</p>
<pre class="crayon-plain-tag">status:
  conditions:
    - lastTransitionTime: "2024-10-14T06:02:10Z"
      message: Issuing certificate as Secret does not exist
      observedGeneration: 1
      reason: DoesNotExist
      status: "False"
      type: Ready</pre>
<p>I checked on DNSPod and found a <pre class="crayon-plain-tag">TXT</pre> record <pre class="crayon-plain-tag">_acme-challenge.gmem.cc</pre> with the TTL set to 600 seconds.</p>
<div class="blog_h2"><span class="graybg">Initial Investigation</span></div>
<p>To diagnose the issue, I checked the status of the related Cert-Manager resources: <pre class="crayon-plain-tag">CertificateRequest</pre>, <pre class="crayon-plain-tag">Order</pre>, and <pre class="crayon-plain-tag">Challenge</pre>. These are the key resources that Cert-Manager uses to interact with ACME and handle certificate issuance. Here’s an overview of my findings:</p>
<p>CertificateRequest:</p>
<pre class="crayon-plain-tag">status:
  conditions:
    - lastTransitionTime: "2024-10-14T06:02:14Z"
      message: Certificate request has been approved by cert-manager.io
      reason: cert-manager.io
      status: "True"
      type: Approved
    - lastTransitionTime: "2024-10-14T06:02:14Z"
      message: 'Waiting on certificate issuance from order istio-system/wildcard-ssl-5pgsr-3353861729: "pending"'
      reason: Pending
      status: "False"
      type: Ready</pre>
<p>The certificate request was approved, but the system was waiting for the certificate issuance to complete.</p>
<p>Order: </p>
<pre class="crayon-plain-tag">status:
  state: pending
  authorizations:
    - challenges:
        - token: iQMwrfsFRmJ_MytUY3N4NW6QehtTn0-IEvJWAmYEw_k
          type: dns-01
          url: https://acme.zerossl.com/v2/DV90/chall/DDRBMBd9jnJo_W4EcQfSWQ
      wildcard: true</pre>
<p>The order was still in a "pending" state, and the DNS-01 challenge had not been completed yet.</p>
<p>Challenge:</p>
<pre class="crayon-plain-tag">status:
  presented: true
  processing: true
  reason: 'Waiting for DNS-01 challenge propagation: DNS record for "gmem.cc" not yet propagated'
  state: pending</pre>
<p>The challenge was waiting for DNS propagation, even though the <pre class="crayon-plain-tag">TXT</pre> record mentioned above had already been created two days earlier.</p>
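<p>For reference, the whole chain above can be inspected with plain kubectl. A sketch, assuming the istio-system namespace that appears in the Order name:</p>
<pre class="crayon-plain-tag">kubectl get certificate,certificaterequest,order,challenge -n istio-system
kubectl describe order -n istio-system
kubectl describe challenge -n istio-system</pre>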
<div class="blog_h2"><span class="graybg">Global DNS Propagration Check</span></div>
<p>Many different factors can delay DNS propagation, so the first check is whether the <pre class="crayon-plain-tag">TXT</pre> record <pre class="crayon-plain-tag">_acme-challenge.gmem.cc</pre> has been synchronized around the world, for example with <a href="https://www.whatsmydns.net/">whatsmydns.net</a>:</p>
<p style="padding-left: 30px;">https://www.whatsmydns.net/#TXT/_acme-challenge.gmem.cc</p>
<p>In this case, the check result was that the record had been fully propagated.</p>
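<p>The same check can be done from a terminal. A sketch using dig against a public resolver and against one of the DNSPod authoritative servers that show up later in the logs:</p>
<pre class="crayon-plain-tag">dig +short TXT _acme-challenge.gmem.cc @8.8.8.8
dig +short TXT _acme-challenge.gmem.cc @a.dnspod.com</pre>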
<div class="blog_h2"><span class="graybg">Cert-Manager Logs</span></div>
<p>I then checked the logs for the Cert-Manager pod to see if any errors were being reported. The relevant and repeating error message from the logs was:</p>
<p style="padding-left: 30px;">E1014 05:52:58.647839 1 sync.go:190] "cert-manager/challenges: propagation check failed" err="DNS record for \"gmem.cc\" not yet propagated"</p>
<p>At this point, it seemed that Cert-Manager was unable to see the propagated DNS records, even though I had confirmed their existence.</p>
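<p>For the record, pulling those logs and raising the verbosity is straightforward. This sketch assumes a Helm-style install where the controller Deployment is named cert-manager in the cert-manager namespace:</p>
<pre class="crayon-plain-tag"># Follow the controller logs and filter for the propagation check
kubectl logs -n cert-manager deploy/cert-manager --tail=200 | grep -i propagat

# Append --v=5 to the controller args to get the detailed DNS lookups quoted below
kubectl patch deployment cert-manager -n cert-manager --type=json -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--v=5"}
]'</pre>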
<p>After enabling verbose logging (log level --v=5) in Cert-Manager, I finally discovered the root cause. The detailed logs revealed that Cert-Manager was checking the DNS record at an intermediate CNAME:</p>
<p style="padding-left: 30px;">I1014 06:20:33.919849 1 dns.go:116] "cert-manager/challenges/Check: checking DNS propagation"<br />I1014 06:20:33.921190 1 wait.go:90] Updating FQDN: _acme-challenge.gmem.cc. with its CNAME: lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com.<br />I1014 06:25:35.886683 1 wait.go:298] Searching fqdn "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com." using seed nameservers [10.231.18.121:53]<br />I1014 06:25:35.886696 1 wait.go:329] Returning cached zone record "sg-tencentclb.com." for fqdn "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com."<br />I1014 06:20:33.987786 1 wait.go:141] Looking up TXT records for "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com."</p>
<p>The challenge was failing because the DNS record for <pre class="crayon-plain-tag">*.gmem.cc</pre> was a CNAME pointing to another domain, lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com, which caused Cert-Manager to look up the TXT record at the wrong name.</p>
<div class="blog_h2"><span class="graybg">The Fix</span></div>
<p>The solution was to<span style="background-color: #c0c0c0;"> remove the wildcard record *.gmem.cc that was CNAMEd to the Tencent Cloud load balancer address.</span> After a few minutes, the DNS cache was invalidated and Cert-Manager finally logged something different:</p>
<p style="padding-left: 30px;">I1014 06:25:45.974354 1 wait.go:298] Searching fqdn "_acme-challenge.gmem.cc." using seed nameservers [10.231.18.121:53]<br />I1014 06:25:46.453434 1 wait.go:383] Returning discovered zone record "gmem.cc." for fqdn "_acme-challenge.gmem.cc."<br />I1014 06:25:46.454660 1 wait.go:316] Returning authoritative nameservers [c.dnspod.com., a.dnspod.com., b.dnspod.com.]<br />I1014 06:25:46.462705 1 wait.go:141] Looking up TXT records for "_acme-challenge.gmem.cc."</p>
<p>indicating that the correct TXT record was found. The subsequent logs also revealed some details about how Cert-Manager works:</p>
<p style="padding-left: 30px;">I1014 06:25:47.050500 1 dns.go:128] "cert-manager/challenges/Check: waiting DNS record TTL to allow the DNS01 record to propagate for domain" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" dnsName="gmem.cc" type="DNS-01" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" domain="gmem.cc" ttl=60 fqdn="_acme-challenge.gmem.cc."</p>
<p style="padding-left: 60px;">This line indicated that after Cert-Manager validated the TXT record locally, it would wait for TTL ( 60 here ) seconds, just in case that Zero SSL server hadn't been able to see te record.</p>
<p style="padding-left: 30px;">014 06:26:47.051650 1 sync.go:359] "cert-manager/challenges/acceptChallenge: accepting challenge with ACME server" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" dnsName="gmem.cc" type="DNS-01"<br />I1014 06:26:47.051665 1 logger.go:81] "cert-manager/acme-middleware: Calling Accept"<br />I1014 06:26:49.708221 1 sync.go:376] "cert-manager/challenges/acceptChallenge: waiting for authorization for domain" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" dnsName="gmem.cc" type="DNS-01"<br />I1014 06:26:49.708250 1 logger.go:99] "cert-manager/acme-middleware: Calling WaitAuthorization"<br />I1014 06:26:50.194610 1 logs.go:199] "cert-manager/controller: Event(v1.ObjectReference{Kind:\"Challenge\", Namespace:\"istio-system\", Name:\"wildcard-ssl-5pgsr-3353861729-3026093647\", UID:\"a4b2f2d7-6185-4072-92f0-b7a89418cdb1\", APIVersion:\"acme.cert-manager.io/v1\", ResourceVersion:\"463570645\", FieldPath:\"\"}): type: 'Normal' reason: 'DomainVerified' Domain \"gmem.cc\" verified with \"DNS-01\" validation"</p>
<p style="padding-left: 60px;">After the wait, Cert-Manager called Zero SSL server for Accept and WaitAuthorization operation and the server verified that we were the owner of the domain name.</p>
<div class="blog_h1"><span class="graybg">Conclusion</span></div>
<p>In this case, the root cause of the certificate issuance failure was a CNAME record interfering with Cert-Manager's DNS-01 challenge. It had nothing to do with the ACME server; it stemmed from how Cert-Manager performs its own propagation check.</p>
<p>To avoid similar issues in the future, we need to stop using wildcard domain records.</p>
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/investigating-solving-issue-failed-certificate-request-zerossl-cert-manager">Investigating and Solving the Issue of Failed Certificate Request with ZeroSSL and Cert-Manager</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/investigating-solving-issue-failed-certificate-request-zerossl-cert-manager/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Kubernetes Migration</title>
		<link>https://blog.gmem.cc/k8s-migration</link>
		<comments>https://blog.gmem.cc/k8s-migration#comments</comments>
		<pubDate>Tue, 27 Dec 2022 11:37:50 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[PaaS]]></category>
		<category><![CDATA[K8S]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=39115</guid>
		<description><![CDATA[<p>Migrating a Kubernetes cluster from one cloud provider to another usually breaks into three separate problems: moving Kubernetes resources, moving the data <a class="read-more" href="https://blog.gmem.cc/k8s-migration">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/k8s-migration">Kubernetes Migration</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><p>Migrating a Kubernetes cluster from one cloud provider to another usually breaks into three separate problems: moving Kubernetes resources, moving the data attached to workloads, and moving the container images those workloads depend on.</p>
<ol>
<li>Kubernetes resource migration</li>
<li>Persistent volume migration</li>
<li>Container image migration</li>
</ol>
<p>Kubernetes resources and persistent volumes can be handled with <a href="https://velero.io/">Velero</a>. Image registry migration is simpler in most cases. Common open source options include Alibaba Cloud's <a href="https://github.com/AliyunContainerService/image-syncer">image-syncer</a> and Tencent Cloud's <a href="https://github.com/tkestack/image-transfer">image-transfer</a>.</p>
<div class="blog_h1"><span class="graybg">Velero</span></div>
<div class="blog_h2"><span class="graybg">Overview</span></div>
<p>Velero is an open source backup and restore system built for Kubernetes. A common cross-cloud migration pattern is to back up the source cluster and restore that backup into the target cluster.</p>
<p>Velero consists of two parts:</p>
<ol>
<li>Server-side components running inside the Kubernetes clusters being backed up or restored</li>
<li>A CLI client</li>
</ol>
<p>The server side is a collection of controllers that watch Velero custom resources for backup and restore operations. The CLI mostly saves you from writing those custom resources by hand.</p>
<div class="blog_h3"><span class="graybg">Notable newer capabilities</span></div>
<p>Compared with the version we reviewed in the earlier note on <a href="/problem-detection-and-auto-repairing-in-k8s#velero">Kubernetes failure detection and self-healing</a>, Velero has added several capabilities that matter in real migrations:</p>
<ol>
<li>ReadWriteMany volumes are no longer backed up repeatedly.</li>
<li>Cloud provider plugins have been split out from the core Velero repository.</li>
<li>Restic-based persistent volume backups are always incremental, even when Pods move.</li>
<li>Namespace cloning can automatically clone the related persistent volumes.</li>
<li>CSI-backed persistent volumes are supported, including the mainstream AWS, Azure, and GCP cases.</li>
<li>Backup and restore progress reporting is supported.</li>
<li>Velero can back up all API versions of a resource.</li>
<li>Volume backup through Restic can be enabled by default with <pre class="crayon-plain-tag">--default-volumes-to-restic</pre>.</li>
<li><pre class="crayon-plain-tag">restoreStatus</pre> can be used to control which resource status fields are restored.</li>
<li><pre class="crayon-plain-tag">--existing-resource-policy</pre> can change restore behavior when a resource already exists. The default is to skip existing resources, except for ServiceAccounts. Setting it to <pre class="crayon-plain-tag">update</pre> makes Velero update existing resources instead.</li>
<li>Since 1.10, Velero supports Kopia as an alternative to Restic. Kopia often performs better on large backup sets or very large file counts.</li>
</ol>
<div class="blog_h3"><span class="graybg">Backup flow</span></div>
<p>Velero supports both on-demand and scheduled backups. In both cases it collects Kubernetes resources, applies filters if requested, packages the result, and uploads it to an object storage backend.</p>
<p>A typical backup flow looks like this:</p>
<ol>
<li>The user runs <pre class="crayon-plain-tag">velero backup create</pre>, which creates a <pre class="crayon-plain-tag">Backup</pre> resource.</li>
<li><pre class="crayon-plain-tag">BackupController</pre> sees the new Backup resource and validates it.</li>
<li>If validation succeeds, the controller runs the backup. By default, Velero creates snapshots for all persistent volumes. Use <pre class="crayon-plain-tag">--snapshot-volumes=false</pre> to change that behavior.</li>
<li>The controller uploads the backup data to object storage.</li>
</ol>
<p>When Velero backs up resources, it stores them using the preferred API version. If the source API server exposes two versions of a group, for example <pre class="crayon-plain-tag">teleport/v1alpha1</pre> and <pre class="crayon-plain-tag">teleport/v1</pre>, and <pre class="crayon-plain-tag">v1</pre> is the preferred version, the backup stores the resource in <pre class="crayon-plain-tag">v1</pre> form. The target cluster does not have to prefer that version, but it must support it. That is one reason restore can fail across clusters with different Kubernetes or CRD versions.</p>
<p>Backups can have a retention period through <pre class="crayon-plain-tag">--ttl</pre>. When that retention window expires, Velero deletes the Kubernetes backup records, the backup files, the snapshots, and the related Restore objects. If garbage collection fails, Velero adds a <pre class="crayon-plain-tag">velero.io/gc-failure=REASON</pre> label to the Backup object.</p>
<p>There is one important caveat for cross-cloud migration: snapshot-based volume backup is not enough. A snapshot created on cloud A is not something you can usually restore directly on cloud B.</p>
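<p>To make the flow concrete, a namespace-scoped backup that relies on file-system backup rather than snapshots might look like the following sketch; the backup name and namespace are placeholders:</p>
<pre class="crayon-plain-tag">velero backup create app-backup \
    --include-namespaces app \
    --default-volumes-to-fs-backup \
    --snapshot-volumes=false \
    --ttl 72h

# watch progress and inspect the result
velero backup describe app-backup --details
velero backup logs app-backup</pre>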
<div class="blog_h3"><span class="graybg">Restore flow</span></div>
<p>Restore takes a previous backup, including Kubernetes resources and volume data, and replays it into the target cluster. The target cluster can be the source cluster itself, and the restore can be filtered so only part of the backup is restored.</p>
<p>Restored Kubernetes resources receive the label <pre class="crayon-plain-tag">velero.io/restore-name=RESTORE_NAME</pre>. By default, the restore name is <pre class="crayon-plain-tag">BACKUP_NAME-TIMESTAMP</pre>, where the timestamp format is <pre class="crayon-plain-tag">YYYYMMDDhhmmss</pre>.</p>
<p>A typical restore flow looks like this:</p>
<ol>
<li>The user runs <pre class="crayon-plain-tag">velero restore create</pre>, which creates a <pre class="crayon-plain-tag">Restore</pre> resource.</li>
<li><pre class="crayon-plain-tag">RestoreController</pre> sees the Restore object and validates it.</li>
<li>If validation succeeds, the controller reads the backup metadata from object storage and performs prechecks, including API version checks, to see whether the resources can run on the new cluster.</li>
<li>The controller restores resources one by one.</li>
</ol>
<p>By default, Velero does not delete or overwrite existing objects in the target cluster. If a resource already exists, Velero skips it. Setting <pre class="crayon-plain-tag">--existing-resource-policy=update</pre> tells Velero to try to update matching existing resources instead.</p>
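<p>A matching restore on the target cluster, again only as a sketch with placeholder names, could look like this:</p>
<pre class="crayon-plain-tag">velero restore create app-restore \
    --from-backup app-backup \
    --include-namespaces app \
    --existing-resource-policy=update

velero restore describe app-restore --details</pre>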
<div class="blog_h3"><span class="graybg">Object storage as source of truth</span></div>
<p>The object storage backend is Velero's single source of truth. That has two practical consequences:</p>
<ol>
<li>If object storage contains backup data but the Kubernetes API does not contain the matching Backup resource, Velero recreates the Backup object.</li>
<li>If Kubernetes contains a Backup resource but object storage does not contain the matching backup data, Velero deletes the Backup object.</li>
</ol>
<p>This is also why cross-cloud migration works at all. The source and target clusters do not need to talk directly to each other. Object storage becomes the only shared medium.</p>
<p>The CRD that defines where backup metadata is stored is <pre class="crayon-plain-tag">BackupStorageLocation</pre>. It points to a bucket or a prefix inside a bucket. Velero stores backup metadata there, and file-system-based volume backups through Restic or Kopia also live there. Snapshot-based volume backups do not live in that bucket, because the snapshot implementation is controlled by the cloud provider.</p>
<p>Each Backup can use one <pre class="crayon-plain-tag">BackupStorageLocation</pre>.</p>
<div class="blog_h3"><span class="graybg">Snapshot locations</span></div>
<p>Snapshot-related information is stored in <pre class="crayon-plain-tag">VolumeSnapshotLocation</pre>. The actual fields depend on the cloud plugin, because snapshot implementation is provider-specific.</p>
<p>Each Backup can use one <pre class="crayon-plain-tag">VolumeSnapshotLocation</pre> per volume snapshot provider.</p>
<div class="blog_h3"><span class="graybg">Providers and plugins</span></div>
<p>Velero uses a plugin model that keeps storage and cloud provider integrations outside the core project.</p>
<div class="blog_h3"><span class="graybg">Hooks</span></div>
<p>Velero also exposes hooks around the standard backup and restore flow.</p>
<p>Backup hooks run during backup. One standard use is telling a database to flush in-memory buffers before a snapshot or file backup starts.</p>
<p>Restore hooks run during restore. They are often used for initialization steps that need to happen before the application starts normally.</p>
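<p>Backup hooks are usually attached as Pod annotations. A minimal sketch, assuming a hypothetical Pod named my-db in namespace app whose data lives under /var/lib/data:</p>
<pre class="crayon-plain-tag"># Freeze the filesystem before backup and unfreeze it afterwards
kubectl -n app annotate pod/my-db \
    pre.hook.backup.velero.io/container=my-db \
    pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/var/lib/data"]' \
    post.hook.backup.velero.io/container=my-db \
    post.hook.backup.velero.io/command='["/sbin/fsfreeze", "--unfreeze", "/var/lib/data"]'</pre>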
<div class="blog_h2"><span class="graybg">Installation</span></div>
<div class="blog_h3"><span class="graybg">Installing the CLI</span></div>
<p>Install the Velero CLI binary, extract it, and place <pre class="crayon-plain-tag">velero</pre> on <pre class="crayon-plain-tag">$PATH</pre>. To enable shell completion:</p>
<pre class="crayon-plain-tag">echo 'source <(velero completion bash)' >> ~/.bashrc</pre>
<p>Client-side configuration can be adjusted like this:</p>
<pre class="crayon-plain-tag"># Enable client features
velero client config set features=EnableCSI

# Disable color output
velero client config set colorized=false</pre>
<div class="blog_h3"><span class="graybg">Installing the server components</span></div>
<p>The CLI can also install the server components:</p>
<pre class="crayon-plain-tag">velero install \
    --namespace=teleport-system \
    --use-node-agent \
    --default-volumes-to-fs-backup \
    --features=EnableCSI,EnableAPIGroupVersions \
    --velero-pod-cpu-request=500m \
    --velero-pod-mem-request=128Mi \
    --velero-pod-cpu-limit=1 \
    --velero-pod-mem-limit=512Mi \
    --node-agent-pod-cpu-request=500m \
    --node-agent-pod-mem-request=512Mi \
    --node-agent-pod-cpu-limit=1 \
    --node-agent-pod-mem-limit=1Gi \
    --provider aws \
    --bucket backups \
    --secret-file ./aws-iam-creds \
    --backup-location-config region=us-east-2 \
    --snapshot-location-config region=us-east-2 \
    --no-default-backup-location \
    --dry-run -o yaml</pre>
<p>Several flags in that example matter in migration scenarios:</p>
<ul>
<li><pre class="crayon-plain-tag">--use-node-agent</pre> enables file-system-based backup support.</li>
<li><pre class="crayon-plain-tag">--default-volumes-to-fs-backup</pre> makes file-system backup the default for Pod volumes. Without it, volumes normally have to be selected through annotations.</li>
<li><pre class="crayon-plain-tag">--features=EnableCSI,EnableAPIGroupVersions</pre> turns on feature gates that matter in newer storage and API-version scenarios.</li>
<li>Resource request and limit flags often need adjustment when file-system backup is used heavily; the values shown in the example above are only placeholders.</li>
</ul>
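<p>For reference, the annotation-based opt-in mentioned above looks like this; the namespace, Pod name, and volume names are placeholders:</p>
<pre class="crayon-plain-tag"># Select specific Pod volumes for file-system backup when
# --default-volumes-to-fs-backup is not set
kubectl -n app annotate pod/my-app backup.velero.io/backup-volumes=data,logs</pre>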
<p>After installation, you can configure default backup and snapshot locations:</p>
<pre class="crayon-plain-tag">velero backup-location create backups-primary \
    --provider aws \
    --bucket velero-backups \
    --config region=us-east-1 \
    --default

velero server --default-volume-snapshot-locations="PROVIDER-NAME:LOCATION-NAME,PROVIDER2-NAME:LOCATION2-NAME"</pre>
<p>You can also add extra snapshot providers after the initial install:</p>
<pre class="crayon-plain-tag">velero plugin add registry/image:version

velero snapshot-location create NAME \
    --provider PROVIDER-NAME \
    [--config PROVIDER-CONFIG]</pre>
<div class="blog_h2"><span class="graybg">A cross-cloud migration test</span></div>
<p>One practical test setup is to create one Kubernetes cluster on Alibaba Cloud as the source cluster and another on Tencent Cloud as the target cluster, then use Velero to move workloads across them.</p>
<div class="blog_h3"><span class="graybg">Creating the clusters</span></div>
<p>Create the clusters through the two cloud consoles. The exact steps depend on the providers and are not the point here.</p>
<div class="blog_h3"><span class="graybg">Migrating stateless workloads</span></div>
<p>The most common Kubernetes use case is still stateless workloads. Stateful infrastructure such as databases is often delegated to cloud PaaS products instead of being hosted inside the cluster. That reality makes Kubernetes migration much easier, because volume migration often drops out of scope.</p>
<p>A simple test case is an Nginx Deployment plus a Service. Start by creating the resources on the source cluster:</p>
<pre class="crayon-plain-tag">apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:1.7.9
        name: nginx
        ports:
        - containerPort: 80

---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer</pre>
<p>For this kind of workload, the migration path is fairly direct: back up the namespace or selected resources from the source cluster, restore them into the target cluster, and then verify that the restored Deployment, Service, and related objects match expectations. The harder cases show up later, when CRDs, API version skew, storage classes, and cloud-specific integrations enter the picture.</p>
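<p>Assuming the Nginx resources live in the <pre class="crayon-plain-tag">default</pre> namespace and both clusters point at the same object storage, the round trip can be as small as this sketch; the backup name is a placeholder:</p>
<pre class="crayon-plain-tag"># On the source cluster
velero backup create nginx-backup --include-namespaces default

# On the target cluster
velero restore create --from-backup nginx-backup

# Verify on the target cluster
kubectl get deployment,service nginx</pre>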
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/k8s-migration">Kubernetes Migration</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/k8s-migration/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Terraform: a practical guide to infrastructure as code</title>
		<link>https://blog.gmem.cc/terraform</link>
		<comments>https://blog.gmem.cc/terraform#comments</comments>
		<pubDate>Wed, 20 Oct 2021 02:15:51 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[IaaS]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=38551</guid>
		<description><![CDATA[<p>Terraform is an infrastructure-as-code tool. You describe the target infrastructure in configuration files, and Terraform compares that description with real infrastructure, builds <a class="read-more" href="https://blog.gmem.cc/terraform">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/terraform">Terraform: a practical guide to infrastructure as code</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><p>Terraform is an infrastructure-as-code tool. You describe the target infrastructure in configuration files, and Terraform compares that description with real infrastructure, builds a plan, and then creates, updates, or deletes objects until the two match. The real job is not "writing cloud scripts." It is keeping an explicit model of infrastructure state.</p>
<div class="blog_h1"><span class="graybg">What Terraform manages</span></div>
<p>Terraform can manage far more than basic IaaS objects. A Terraform configuration may include virtual machines, networks, DNS records, IAM bindings, managed databases, and even SaaS resources. The boundary is the provider model: if a provider can create, read, update, and delete a resource type, Terraform can manage it.</p>
<p>The CLI workflow has three moving parts:</p>
<ol>
<li>The Terraform CLI itself.</li>
<li>Configuration files written in the Terraform language, which is based on HCL.</li>
<li>Providers, which are plugins that talk to cloud or service APIs.</li>
</ol>
<p>Terraform reads the configuration, builds an execution plan, and decides which objects must be created, changed, replaced, or removed. It also tracks dependencies between resources and applies changes in parallel where that is safe.</p>
<div class="blog_h1"><span class="graybg">CLI basics</span></div>
<div class="blog_h2"><span class="graybg">Installing the CLI</span></div>
<p>Install Terraform from the official downloads page and place the binary on <pre class="crayon-plain-tag">$PATH</pre>.</p>
<div class="blog_h2"><span class="graybg">Useful global behavior</span></div>
<p>Terraform supports <pre class="crayon-plain-tag">-chdir=DIR</pre> to run commands against a different working directory. That is handy in scripts and monorepos.</p>
<p>Shell completion can be installed with <pre class="crayon-plain-tag">terraform -install-autocomplete</pre> and removed with <pre class="crayon-plain-tag">terraform -uninstall-autocomplete</pre>.</p>
<div class="blog_h2"><span class="graybg">Resource addresses</span></div>
<p>Many subcommands accept resource addresses. A few common forms are:</p>
<pre class="crayon-plain-tag"># resource_type.resource_name
aws_instance.foo

# indexed resource instance
aws_instance.bar[1]

# resource inside nested child modules
module.foo.module.bar.aws_instance.baz</pre>
<div class="blog_h2"><span class="graybg">CLI configuration file</span></div>
<p>The CLI configuration file path can be set with <pre class="crayon-plain-tag">TF_CLI_CONFIG_FILE</pre>. On non-Windows systems, the default path is <pre class="crayon-plain-tag">$HOME/.terraformrc</pre>. This file can configure plugin caching, credentials, and provider installation behavior.</p>
<pre class="crayon-plain-tag">plugin_cache_dir   = "$HOME/.terraform.d/plugin-cache"
disable_checkpoint = true

credentials "app.terraform.io" {
  token = "xxxxxx.atlasv1.zzzzzzzzzzzzz"
}

provider_installation {
  filesystem_mirror {
    path    = "/usr/share/terraform/providers"
    include = ["example.com/*/*"]
  }

  direct {
    exclude = ["example.com/*/*"]
  }

  dev_overrides {
    "hashicorp.com/edu/hashicups-pf" = "$(go env GOBIN)"
  }
}</pre>
<p><pre class="crayon-plain-tag">dev_overrides</pre> is mainly for provider development. It lets you test a local provider binary without going through the full registry and checksum flow.</p>
<div class="blog_h1"><span class="graybg">Core commands</span></div>
<div class="blog_h2"><span class="graybg">init</span></div>
<p><pre class="crayon-plain-tag">terraform init</pre> prepares the working directory. Terraform commands are expected to run from a directory that contains Terraform configuration files. Initialization downloads providers and modules, sets up the backend, and creates local working data.</p>
<p>After initialization, the directory usually contains:</p>
<ul>
<li><pre class="crayon-plain-tag">.terraform/</pre>, which stores provider and module downloads.</li>
<li><pre class="crayon-plain-tag">terraform.tfstate</pre> when the local backend is used.</li>
<li><pre class="crayon-plain-tag">terraform.tfstate.d/</pre> when multiple workspaces are used with the local backend.</li>
</ul>
<p>Some changes require re-running initialization, especially provider version changes, module source changes, and backend configuration changes.</p>
<p><pre class="crayon-plain-tag">terraform get</pre> can download modules without doing the full set of <pre class="crayon-plain-tag">init</pre> tasks. <pre class="crayon-plain-tag">terraform init -upgrade</pre> upgrades providers and modules to newer versions that still satisfy the version constraints.</p>
<div class="blog_h2"><span class="graybg">validate</span></div>
<p><pre class="crayon-plain-tag">terraform validate</pre> checks whether the configuration is syntactically and structurally valid.</p>
<div class="blog_h2"><span class="graybg">plan</span></div>
<p><pre class="crayon-plain-tag">terraform plan</pre> shows the changes Terraform would like to make. It compares the desired state from configuration with the current state of the infrastructure, using both the state file and provider API reads.</p>
<p>Terraform's core execution loop is built around three commands: <pre class="crayon-plain-tag">plan</pre>, <pre class="crayon-plain-tag">apply</pre>, and <pre class="crayon-plain-tag">destroy</pre>.</p>
<div class="blog_h3"><span class="graybg">Saving a plan</span></div>
<pre class="crayon-plain-tag">terraform plan -out=FILE</pre>
<p>A saved plan can later be passed to <pre class="crayon-plain-tag">terraform apply</pre>.</p>
<div class="blog_h3"><span class="graybg">Planning modes</span></div>
<ol>
<li>Destroy mode, enabled by <pre class="crayon-plain-tag">-destroy</pre>, builds a plan that removes everything tracked by the current configuration.</li>
<li>Refresh-only mode, enabled by <pre class="crayon-plain-tag">-refresh-only</pre>, updates state and root outputs to match infrastructure changes made outside Terraform.</li>
</ol>
<div class="blog_h3"><span class="graybg">Input variables and concurrency</span></div>
<p>Use <pre class="crayon-plain-tag">-var 'NAME=VALUE'</pre> to set input variables directly, and <pre class="crayon-plain-tag">-var-file=FILENAME</pre> to load them from a file.</p>
<p>Use <pre class="crayon-plain-tag">-parallelism=n</pre> to cap concurrency. The default is 10.</p>
<div class="blog_h3"><span class="graybg">Other options</span></div>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 20%; text-align: center;">Option</td>
<td style="text-align: center;">Meaning</td>
</tr>
</thead>
<tbody>
<tr>
<td>-refresh=false</td>
<td>Skip the pre-plan refresh step. This can reduce remote API calls, but Terraform may miss drift introduced outside Terraform.</td>
</tr>
<tr>
<td>-replace=ADDRESS</td>
<td>Force Terraform to plan a replacement for a single resource instance, such as <pre class="crayon-plain-tag">aws_instance.example[0]</pre>.</td>
</tr>
<tr>
<td>-target=ADDRESS</td>
<td>Limit planning to a specific resource and its dependencies. Useful for debugging, but easy to abuse.</td>
</tr>
<tr>
<td>-input=false</td>
<td>Disable interactive prompts for root input variables. This is standard in CI and batch execution.</td>
</tr>
</tbody>
</table>
<div class="blog_h2"><span class="graybg">apply</span></div>
<p><pre class="crayon-plain-tag">terraform apply</pre> executes the proposed changes. By default it runs an implicit plan first, though it can also execute a previously saved plan file.</p>
<p>The basic form is <pre class="crayon-plain-tag">terraform apply [options] [plan file]</pre>.</p>
<div class="blog_h3"><span class="graybg">Automatic approval</span></div>
<p>Use <pre class="crayon-plain-tag">-auto-approve</pre> to skip manual approval.</p>
<div class="blog_h3"><span class="graybg">Lock timeout</span></div>
<p>Use <pre class="crayon-plain-tag">-lock-timeout=DURATION</pre> to wait for a state lock before failing.</p>
<div class="blog_h2"><span class="graybg">destroy</span></div>
<p><pre class="crayon-plain-tag">terraform destroy</pre> removes all infrastructure objects managed by the current configuration and workspace.</p>
<div class="blog_h2"><span class="graybg">Other commands</span></div>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 25%; text-align: center;">Command</td>
<td style="text-align: center;">Meaning</td>
</tr>
</thead>
<tbody>
<tr>
<td>console</td>
<td>Evaluate Terraform expressions interactively.</td>
</tr>
<tr>
<td>fmt</td>
<td>Format configuration files.</td>
</tr>
<tr>
<td>force-unlock</td>
<td>Remove a stale state lock. Use carefully, because unlocking while another process is still running can corrupt state.</td>
</tr>
<tr>
<td>graph</td>
<td>Generate a dependency graph of the configuration.</td>
</tr>
<tr>
<td>import</td>
<td>Attach an existing infrastructure object to a resource address in configuration.</td>
</tr>
<tr>
<td>login / logout</td>
<td>Manage credentials for remote services such as Terraform Cloud or a private module registry.</td>
</tr>
<tr>
<td>output</td>
<td>Show root module outputs.</td>
</tr>
<tr>
<td>providers</td>
<td>Show provider dependencies for the current module.</td>
</tr>
<tr>
<td>refresh</td>
<td>Refresh state to match remote infrastructure.</td>
</tr>
<tr>
<td>show</td>
<td>Display a saved plan or current state in human-readable form.</td>
</tr>
<tr>
<td>workspace</td>
<td>Manage and switch workspaces.</td>
</tr>
</tbody>
</table>
<div class="blog_h2"><span class="graybg">taint and untaint</span></div>
<p><pre class="crayon-plain-tag">taint</pre> marks a resource instance as not fully functional. That flag does not immediately change infrastructure, but the next plan will propose destroying and recreating the object.</p>
<p><pre class="crayon-plain-tag">untaint</pre> clears that status.</p>
<div class="blog_h1"><span class="graybg">Terraform language basics</span></div>
<div class="blog_h2"><span class="graybg">Blocks</span></div>
<p>A Terraform configuration is built from blocks. The syntax looks like this:</p>
<pre class="crayon-plain-tag">&lt;BLOCK TYPE&gt; "&lt;BLOCK LABEL&gt;" "&lt;BLOCK LABEL&gt;" {
  &lt;IDENTIFIER&gt; = &lt;EXPRESSION&gt;
}</pre>
<p>A block is a container, and its meaning depends on the block type. In a <pre class="crayon-plain-tag">resource</pre> block, the two labels identify the resource type and local name.</p>
<p>Depending on block type, the number of labels may be zero, fixed, or variable. A block body may contain arguments or nested blocks. Top-level blocks are limited to a fixed set of Terraform language constructs.</p>
<pre class="crayon-plain-tag">resource "aws_vpc" "main" {
  cidr_block = var.base_cidr_block
}</pre>
<div class="blog_h2"><span class="graybg">Arguments and identifiers</span></div>
<p>An argument assigns a value to a name. The available arguments and their types depend on context, usually the resource type or block type.</p>
<p>Identifiers are used for argument names, block type names, and many Terraform object names. They may contain letters, digits, <pre class="crayon-plain-tag">-</pre>, and <pre class="crayon-plain-tag">_</pre>, but cannot start with a digit.</p>
<div class="blog_h2"><span class="graybg">Comments</span></div>
<p>Single-line comments can start with <pre class="crayon-plain-tag">#</pre> or <pre class="crayon-plain-tag">//</pre>. Multi-line comments use <pre class="crayon-plain-tag">/* ... */</pre>.</p>
<div class="blog_h2"><span class="graybg">Data types</span></div>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 100px; text-align: center;">Type</td>
<td style="text-align: center;">Meaning</td>
</tr>
</thead>
<tbody>
<tr>
<td class="blog_h3">string</td>
<td>Unicode text, for example <pre class="crayon-plain-tag">"hello"</pre>.</td>
</tr>
<tr>
<td class="blog_h3">number</td>
<td>Numeric value, for example <pre class="crayon-plain-tag">6.02</pre>.</td>
</tr>
<tr>
<td class="blog_h3">bool</td>
<td><pre class="crayon-plain-tag">true</pre> or <pre class="crayon-plain-tag">false</pre>.</td>
</tr>
<tr>
<td class="blog_h3">list / tuple</td>
<td>Ordered collections, for example <pre class="crayon-plain-tag">["us-west-1a", "us-west-1c"]</pre>.</td>
</tr>
<tr>
<td class="blog_h3">map / object</td>
<td>Key-value structures, for example <pre class="crayon-plain-tag">{ name = "Mabel", age = 52 }</pre>.</td>
</tr>
</tbody>
</table>
<p><pre class="crayon-plain-tag">null</pre> represents the null value.</p>
<div class="blog_h2"><span class="graybg">Strings and templates</span></div>
<div class="blog_h3"><span class="graybg">Escape sequences</span></div>
<p>Terraform strings support standard escapes such as <pre class="crayon-plain-tag">\n</pre>, <pre class="crayon-plain-tag">\r</pre>, <pre class="crayon-plain-tag">\t</pre>, <pre class="crayon-plain-tag">\"</pre>, <pre class="crayon-plain-tag">\\</pre>, <pre class="crayon-plain-tag">\uNNNN</pre>, and <pre class="crayon-plain-tag">\UNNNNNNNN</pre>.</p>
<div class="blog_h3"><span class="graybg">Heredoc</span></div>
<pre class="crayon-plain-tag">block {
  value = &lt;&lt;EOT
hello
world
EOT
}</pre>
<p>Indented heredoc is also supported:</p>
<pre class="crayon-plain-tag">block {
  value = &lt;&lt;-EOT
  hello
    world
  EOT
}</pre>
<div class="blog_h3"><span class="graybg">JSON and YAML output</span></div>
<p>Terraform can render JSON or YAML from native values with helper functions such as <pre class="crayon-plain-tag">jsonencode</pre>:</p>
<pre class="crayon-plain-tag">example = jsonencode({
  a = 1
  b = "hello"
})</pre>
<div class="blog_h3"><span class="graybg">String templates</span></div>
<p>Terraform supports interpolation with <pre class="crayon-plain-tag">${ ... }</pre> and template directives with <pre class="crayon-plain-tag">%{ ... }</pre>.</p>
<pre class="crayon-plain-tag"># expression interpolation
"Hello, ${var.name}!"

# conditional template
"Hello, %{ if var.name != "" }${var.name}%{ else }unnamed%{ endif }!"

# loop template
&lt;&lt;EOT
%{ for ip in aws_instance.example.*.private_ip }
server ${ip}
%{ endfor }
EOT</pre>
<p>Whitespace trimming uses <pre class="crayon-plain-tag">~</pre> inside template directives.</p>
<div class="blog_h2"><span class="graybg">References</span></div>
<p>Terraform expressions can reference values from several sources:</p>
<ul>
<li><pre class="crayon-plain-tag">&lt;RESOURCE TYPE&gt;.&lt;NAME&gt;</pre> for managed resources.</li>
<li><pre class="crayon-plain-tag">var.&lt;NAME&gt;</pre> for input variables.</li>
<li><pre class="crayon-plain-tag">local.&lt;NAME&gt;</pre> for locals.</li>
<li><pre class="crayon-plain-tag">module.&lt;MODULE NAME&gt;</pre> for child module outputs.</li>
<li><pre class="crayon-plain-tag">data.&lt;DATA TYPE&gt;.&lt;NAME&gt;</pre> for data resources.</li>
<li><pre class="crayon-plain-tag">path.module</pre>, <pre class="crayon-plain-tag">path.root</pre>, and <pre class="crayon-plain-tag">path.cwd</pre> for filesystem paths.</li>
<li><pre class="crayon-plain-tag">terraform.workspace</pre> for the current workspace name.</li>
</ul>
<p>Special values also appear in certain contexts, including <pre class="crayon-plain-tag">count.index</pre>, <pre class="crayon-plain-tag">each.key</pre>, <pre class="crayon-plain-tag">each.value</pre>, and <pre class="crayon-plain-tag">self</pre>.</p>
<div class="blog_h2"><span class="graybg">Operators and function calls</span></div>
<p>Terraform supports logical operators such as <pre class="crayon-plain-tag">!</pre>, <pre class="crayon-plain-tag">&amp;&amp;</pre>, and <pre class="crayon-plain-tag">||</pre>; arithmetic operators such as <pre class="crayon-plain-tag">*</pre>, <pre class="crayon-plain-tag">/</pre>, <pre class="crayon-plain-tag">%</pre>, <pre class="crayon-plain-tag">+</pre>, and <pre class="crayon-plain-tag">-</pre>; and the usual comparison operators.</p>
<pre class="crayon-plain-tag">&lt;FUNCTION NAME&gt;(&lt;ARGUMENT 1&gt;, &lt;ARGUMENT 2&gt;)

# argument expansion
min([55, 2453, 2]...)</pre>
<div class="blog_h2"><span class="graybg">Conditional expressions</span></div>
<pre class="crayon-plain-tag">condition ? true_val : false_val

var.a != "" ? var.a : "default-a"</pre>
<div class="blog_h2"><span class="graybg">for expressions</span></div>
<p>A <pre class="crayon-plain-tag">for</pre> expression transforms one complex value into another. Each input element may contribute zero or one output element.</p>
<pre class="crayon-plain-tag">[for s in var.list : upper(s)]

[for k, v in var.map : length(k) + length(v)]

{ for s in var.list : s =&gt; upper(s) }</pre>
<p>You can also filter values with an <pre class="crayon-plain-tag">if</pre> clause:</p>
<pre class="crayon-plain-tag">[for s in var.list : upper(s) if s != ""]</pre>
<p>Grouping mode is enabled by adding <pre class="crayon-plain-tag">...</pre> at the end of the value expression:</p>
<pre class="crayon-plain-tag">locals {
  users_by_role = {
    for name, user in var.users : user.role =&gt; name...
  }
}</pre>
<div class="blog_h2"><span class="graybg">dynamic blocks</span></div>
<p>Expressions can assign argument values, but they cannot directly repeat or conditionally emit nested blocks. That is where <pre class="crayon-plain-tag">dynamic</pre> blocks come in.</p>
<pre class="crayon-plain-tag">resource "aws_elastic_beanstalk_environment" "example" {
  dynamic "setting" {
    for_each = var.settings
    content {
      namespace = setting.value["namespace"]
      name      = setting.value["name"]
      value     = setting.value["value"]
    }
  }
}</pre>
<p><pre class="crayon-plain-tag">dynamic</pre> can generate nested blocks inside resources, data sources, providers, and provisioners. It cannot generate meta-argument blocks such as <pre class="crayon-plain-tag">lifecycle</pre>.</p>
<div class="blog_h2"><span class="graybg">splat expressions</span></div>
<p>Splat expressions are a concise alternative to some <pre class="crayon-plain-tag">for</pre> expressions:</p>
<pre class="crayon-plain-tag">[for o in var.list : o.id]
var.list[*].id

[for o in var.list : o.interfaces[0].name]
var.list[*].interfaces[0].name</pre>
<p>Splat syntax works with list-like collections, not maps or objects. It can also turn a single optional value into a list-like expression in some contexts:</p>
<pre class="crayon-plain-tag">for_each = var.website[*]</pre>
<div class="blog_h2"><span class="graybg">Type constraints</span></div>
<p>Module and provider authors can use type constraints to validate user input. Terraform's type system is stronger than it first appears. You can constrain not only the outer type, but also the shape and element types inside it.</p>
<div class="blog_h3"><span class="graybg">Collection and structural types</span></div>
<pre class="crayon-plain-tag">list(string)
list(number)
list(any)

object({ name = string, age = number })

tuple([string, number, bool])</pre>
<p>Terraform also performs automatic conversions between similar complex types, such as object and map, or tuple and list, when the values fit the required shape. That flexibility is convenient, but it also means module authors should think carefully about how strict they want input constraints to be.</p>
<div class="blog_h3"><span class="graybg">The special any placeholder</span></div>
<p><pre class="crayon-plain-tag">any</pre> is not really a type. It is a placeholder that Terraform resolves to a concrete type during type-checking. For example, a value such as <pre class="crayon-plain-tag">["a", "b", "c"]</pre> can satisfy <pre class="crayon-plain-tag">list(any)</pre>, and Terraform will infer a more specific list element type behind the scenes.</p>
<div class="blog_h3"><span class="graybg">Optional object attributes</span></div>
<pre class="crayon-plain-tag">variable "with_optional_attribute" {
  type = object({
    a = string
    b = optional(string)
  })
}</pre>
<div class="blog_h2"><span class="graybg">Version constraints</span></div>
<p>Version constraints appear when selecting modules, providers, or the Terraform CLI version itself:</p>
<pre class="crayon-plain-tag">version = "&gt;= 1.2.0, &lt; 2.0.0"

=
!=
&gt;  &gt;=  &lt;  &lt;=
~&gt;</pre>
<p><pre class="crayon-plain-tag">~&gt;</pre> allows changes to the rightmost specified version component.</p>
<div class="blog_h1"><span class="graybg">Resources and providers</span></div>
<div class="blog_h2"><span class="graybg">Managed resources</span></div>
<p>A <pre class="crayon-plain-tag">resource</pre> block declares the desired shape of a real infrastructure object:</p>
<pre class="crayon-plain-tag">resource "resource_type" "local_name" {
  # arguments...
}</pre>
<p>The resource type decides which arguments exist. The local name only matters inside the current module. Together, the type and local name form the module-local identity of the resource.</p>
<div class="blog_h2"><span class="graybg">Lifecycle of a managed resource</span></div>
<p>When Terraform creates a new resource, it stores the remote object's identifier in state. On later runs, Terraform compares the real object with the configuration and decides whether to update it in place, replace it, or leave it alone.</p>
<p>When a configuration is applied, Terraform generally does four things:</p>
<ol>
<li>Create resources that exist in configuration but not in state.</li>
<li>Destroy resources that exist in state but no longer exist in configuration.</li>
<li>Update resources whose arguments changed and support in-place changes.</li>
<li>Replace resources whose arguments changed but cannot be updated in place.</li>
</ol>
<p>That last case depends heavily on provider behavior and the underlying API. Terraform decides the graph; the provider decides what each API operation can actually do.</p>
<div class="blog_h2"><span class="graybg">Reading resource attributes</span></div>
<p>Within the same module, resource attributes are accessed as <pre class="crayon-plain-tag">&lt;RESOURCE TYPE&gt;.&lt;NAME&gt;.&lt;ATTRIBUTE&gt;</pre>.</p>
<p>Besides user-supplied arguments, resources also expose read-only attributes that come back from the provider API, such as generated IDs.</p>
<div class="blog_h2"><span class="graybg">Dependencies</span></div>
<p>Terraform infers most dependencies from expressions. If one resource argument references another resource, Terraform treats that as a dependency edge in the graph.</p>
<p>For dependencies that cannot be inferred from expressions, use the <pre class="crayon-plain-tag">depends_on</pre> meta-argument.</p>
<div class="blog_h2"><span class="graybg">Local-only resources</span></div>
<p>Some resource types do not represent remote infrastructure at all. They only store data in Terraform state. These local-only resources are often used for intermediate values such as generated random IDs or local key material.</p>
<div class="blog_h2"><span class="graybg">Providers</span></div>
<p>Every resource type belongs to a provider. A provider is a Terraform plugin that implements one or more resource types and data source types.</p>
<p>A module needs providers for every resource it uses, and provider configuration is usually supplied by the root module. Providers can also expose multiple configurations, often to target different regions or accounts.</p>
<pre class="crayon-plain-tag">provider "google" {
  region = "us-central1"
}

provider "google" {
  alias  = "europe"
  region = "europe-west1"
}

resource "google_compute_instance" "example" {
  provider = google.europe
}</pre>
<p>Resources implicitly depend on their selected provider configuration, so Terraform will not try to create the resource before the provider is ready.</p>
<div class="blog_h2"><span class="graybg">Resource meta-arguments</span></div>
<div class="blog_h3"><span class="graybg">depends_on</span></div>
<p><pre class="crayon-plain-tag">depends_on</pre> handles dependencies that expression analysis cannot see. It should be used sparingly.</p>
<pre class="crayon-plain-tag">resource "aws_iam_role" "example" {
  name = "example"
}

resource "aws_iam_role_policy" "example" {
  role = aws_iam_role.example.name
}

resource "aws_instance" "example" {
  iam_instance_profile = aws_iam_role.example.name

  depends_on = [
    aws_iam_role_policy.example,
  ]
}</pre>
<div class="blog_h3"><span class="graybg">count</span></div>
<p><pre class="crayon-plain-tag">count</pre> creates several similar resource instances from one block:</p>
<pre class="crayon-plain-tag">resource "aws_instance" "server" {
  count = 4

  ami           = "ami-a1b2c3d4"
  instance_type = "t2.micro"

  tags = {
    Name = "Server ${count.index}"
  }
}</pre>
<p>Instances are referenced with index syntax such as <pre class="crayon-plain-tag">aws_instance.server[0]</pre>.</p>
<div class="blog_h3"><span class="graybg">for_each</span></div>
<p><pre class="crayon-plain-tag">for_each</pre> is more flexible than <pre class="crayon-plain-tag">count</pre> when instances differ in meaningful ways. It accepts a map or a <pre class="crayon-plain-tag">set(string)</pre>.</p>
<pre class="crayon-plain-tag">resource "azurerm_resource_group" "rg" {
  for_each = {
    a_group       = "eastus"
    another_group = "westus2"
  }

  name     = each.key
  location = each.value
}</pre>
<p>Resources created by <pre class="crayon-plain-tag">for_each</pre> are referenced with key syntax such as <pre class="crayon-plain-tag">azurerm_resource_group.rg["a_group"]</pre>.</p>
<p>The keys must be known before apply, cannot come from impure functions such as <pre class="crayon-plain-tag">uuid</pre> or <pre class="crayon-plain-tag">timestamp</pre>, and cannot be sensitive values.</p>
<p>You can also chain <pre class="crayon-plain-tag">for_each</pre> from one resource to another:</p>
<pre class="crayon-plain-tag">resource "aws_vpc" "example" {
  for_each   = var.vpcs
  cidr_block = each.value.cidr_block
}

resource "aws_internet_gateway" "example" {
  for_each = aws_vpc.example
  vpc_id   = each.value.id
}</pre>
<div class="blog_h3"><span class="graybg">lifecycle</span></div>
<p>The <pre class="crayon-plain-tag">lifecycle</pre> block customizes replacement and update behavior:</p>
<pre class="crayon-plain-tag">resource "azurerm_resource_group" "example" {
  lifecycle {
    create_before_destroy = true
  }
}</pre>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 25%; text-align: center;">Argument</td>
<td style="text-align: center;">Meaning</td>
</tr>
</thead>
<tbody>
<tr>
<td>create_before_destroy</td>
<td>Create the replacement first, then delete the old object.</td>
</tr>
<tr>
<td>prevent_destroy</td>
<td>Fail if the plan would delete the resource.</td>
</tr>
<tr>
<td>ignore_changes</td>
<td>Ignore selected attribute differences when deciding whether an update is needed. The special value <pre class="crayon-plain-tag">all</pre> suppresses all updates.</td>
</tr>
</tbody>
</table>
<div class="blog_h2"><span class="graybg">timeouts</span></div>
<p>Some resource types provide a nested <pre class="crayon-plain-tag">timeouts</pre> block:</p>
<pre class="crayon-plain-tag">resource /* ... */ {
  timeouts {
    create = "60m"
    update = "30m"
    delete = "2h"
  }
}</pre>
<div class="blog_h2"><span class="graybg">Provisioners</span></div>
<p>Provisioners are the escape hatch for actions that do not fit Terraform's declarative model. Use them reluctantly. They add uncertainty and sit outside the normal planning model.</p>
<p>Terraform cannot reason very well about provisioner side effects. Provisioners also tend to need direct network access, credentials, and timing assumptions that make runs less predictable.</p>
<div class="blog_h3"><span class="graybg">self, when, and on_failure</span></div>
<p>Provisioners use <pre class="crayon-plain-tag">self</pre> to refer to the parent resource. They also support <pre class="crayon-plain-tag">when</pre> and <pre class="crayon-plain-tag">on_failure</pre>:</p>
<pre class="crayon-plain-tag">resource "aws_instance" "web" {
  provisioner "local-exec" {
    when    = destroy
    command = "echo 'Destroy-time provisioner'"
  }
}</pre>
<p>If a create-time provisioner fails, Terraform marks the resource tainted so the next <pre class="crayon-plain-tag">apply</pre> can replace it.</p>
<div class="blog_h3"><span class="graybg">connection settings</span></div>
<p>Many provisioners need SSH or WinRM. Connection details can be declared at the resource level or on a specific provisioner:</p>
<pre class="crayon-plain-tag">provisioner "file" {
  connection {
    type     = "ssh"
    user     = "root"
    password = var.root_password
    host     = var.host
  }
}

provisioner "file" {
  connection {
    type     = "winrm"
    user     = "Administrator"
    password = var.admin_password
    host     = var.host
  }
}</pre>
<div class="blog_h3"><span class="graybg">null_resource and common provisioners</span></div>
<p><pre class="crayon-plain-tag">null_resource</pre> exists for provisioner-driven workflows that are not tied to a real managed resource.</p>
<pre class="crayon-plain-tag">resource "null_resource" "cluster" {
  triggers = {
    cluster_instance_ids = join(",", aws_instance.cluster.*.id)
  }

  provisioner "remote-exec" {
    inline = [
      "bootstrap-cluster.sh ${join(" ", aws_instance.cluster.*.private_ip)}",
    ]
  }
}</pre>
<p>The common built-in provisioners are:</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 100px; text-align: center;">Provisioner</td>
<td style="text-align: center;">Meaning</td>
</tr>
</thead>
<tbody>
<tr>
<td>file</td>
<td>Copy files or directories from the machine running Terraform to the target resource.</td>
</tr>
<tr>
<td>local-exec</td>
<td>Run a local command after a resource action.</td>
</tr>
<tr>
<td>remote-exec</td>
<td>Connect to the remote resource and run commands there.</td>
</tr>
</tbody>
</table>
<div class="blog_h1"><span class="graybg">Data sources</span></div>
<p>A data source, declared with a <pre class="crayon-plain-tag">data</pre> block, reads information from an external system and exposes the result to the configuration. It is still provider-backed, but it only reads.</p>
<pre class="crayon-plain-tag">data "aws_ami" "example" {
  most_recent = true

  owners = ["self"]
  tags = {
    Name   = "app-server"
    Tested = "true"
  }
}</pre>
<p>If the query arguments are known during planning, Terraform reads the data source during refresh. If those arguments depend on values that will only exist after apply, Terraform delays the read until apply time.</p>
<p>Data sources support the same dependency patterns and most of the same meta-arguments as managed resources.</p>
<div class="blog_h1"><span class="graybg">Variables, locals, and outputs</span></div>
<p>Modules in Terraform behave a bit like functions. Input variables are the parameters, outputs are the return values, and locals are internal named expressions.</p>
<div class="blog_h2"><span class="graybg">Input variables</span></div>
<p>Input variables parameterize a module so it can be reused in different configurations. Root module variables can be set from the CLI or variable files. Child module variables must be passed through the corresponding <pre class="crayon-plain-tag">module</pre> block.</p>
<pre class="crayon-plain-tag">variable "image_id" {
  type        = string
  description = ""

  validation {
    condition     = bool-expr
    error_message = ""
  }

  sensitive = false
}

variable "availability_zone_names" {
  type    = list(string)
  default = ["us-west-1a"]
}</pre>
<p>Variable values can come from <pre class="crayon-plain-tag">-var</pre>, <pre class="crayon-plain-tag">-var-file</pre>, environment variables, or automatically loaded files such as <pre class="crayon-plain-tag">terraform.tfvars</pre>.</p>
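<p>For example, using the image_id variable declared above (the AMI ID and the tfvars file name are placeholders):</p>
<pre class="crayon-plain-tag"># Set a single variable on the command line
terraform plan -var 'image_id=ami-0123456789abcdef0'

# Load variables from a file
terraform plan -var-file=prod.tfvars

# Environment variables use the TF_VAR_ prefix
TF_VAR_image_id=ami-0123456789abcdef0 terraform plan</pre>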
<div class="blog_h2"><span class="graybg">Locals</span></div>
<p>Locals are named expressions used to simplify or normalize configuration logic:</p>
<pre class="crayon-plain-tag">locals {
  common_tags = {
    Project = "demo"
    Owner   = "infra"
  }
}</pre>
<p>Locals can reference other locals as long as there is no dependency cycle.</p>
<div class="blog_h2"><span class="graybg">Outputs</span></div>
<p>Outputs expose values from a module to its caller or to the CLI:</p>
<pre class="crayon-plain-tag">output "vpc_id" {
  value = aws_vpc.main.id
}</pre>
<div class="blog_h1"><span class="graybg">How to read Terraform</span></div>
<p>Terraform makes more sense once you treat it as a graph engine wrapped around provider APIs. Configuration declares vertices and edges. State records which remote objects correspond to which addresses. Providers translate graph operations into API calls.</p>
<p>Most Terraform work is not about memorizing syntax. It is about knowing which values are known at plan time, where dependencies come from, what the provider can update in place, and when a resource has to be replaced. Once those four things are clear, the language stops feeling mysterious.</p>
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/terraform">Terraform: a practical guide to infrastructure as code</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/terraform/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Writing a Kubernetes-Style API Server</title>
		<link>https://blog.gmem.cc/kubernetes-style-apiserver</link>
		<comments>https://blog.gmem.cc/kubernetes-style-apiserver#comments</comments>
		<pubDate>Fri, 20 Aug 2021 07:33:34 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[PaaS]]></category>
		<category><![CDATA[K8S]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=38317</guid>
		<description><![CDATA[<p>Background: a while ago I was asked to build a tool that would run inside Kubernetes. The requirement fit the controller pattern well, so development naturally started on top of kubebuilder. But the provider of the Kubernetes environment does not allow workloads to call the control-plane APIs. The quickest fix would be to run our own kube-apiserver + etcd, but that is too heavy, so I went looking for a lighter-weight option <a class="read-more" href="https://blog.gmem.cc/kubernetes-style-apiserver">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/kubernetes-style-apiserver">编写Kubernetes风格的APIServer</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><div class="blog_h1"><span class="graybg">Background</span></div>
<p>A while ago I was asked to build a tool that would run inside Kubernetes. The requirement was a natural fit for the controller pattern, so development started on top of kubebuilder. But when talking to the provider of the Kubernetes environment, we learned that they do not allow workloads to call the control-plane APIs. What to do then?</p>
<p>The quickest solution would be to run our own kube-apiserver + etcd. That is too heavy for us, though: kube-apiserver carries many features we do not need, and they consume too many resources, so I went looking for a more lightweight approach.</p>
<div class="blog_h1"><span class="graybg">The apiserver library</span></div>
<p><a href="https://github.com/kubernetes/apiserver">kubernetes/apiserver</a> is synced from the staging/src/k8s.io/apiserver directory of the main Kubernetes source tree. It provides the libraries needed to build a Kubernetes-style API server, and many projects depend on it, including kube-apiserver, kube-aggregator, and service-catalog.</p>
<p>The library is mainly intended for building the Extension API Server used in API Aggregation. Its features include:</p>
<ol>
<li>Delegating authn/authz to the main kube-apiserver</li>
<li>kubectl-compatible API discovery</li>
<li>An admission control chain</li>
<li>Versioned API types</li>
</ol>
<p>Kubernetes provides a sample, <a href="https://github.com/kubernetes/sample-apiserver">kubernetes/sample-apiserver</a>, but the sample depends on a main kube-apiserver even when authn/authz and API aggregation are not used. You have to point --kubeconfig at a main kube-apiserver, and the SharedInformer in the sample connects to that main kube-apiserver to access Kubernetes resources.</p>
<div class="blog_h2"><span class="graybg">Analyzing sample-apiserver</span></div>
<p>Clearly we cannot depend on a main kube-apiserver at all, so let's walk through the sample-apiserver code and see what needs to change.</p>
<div class="blog_h3"><span class="graybg">Entry point</span></div>
<pre class="crayon-plain-tag">func main() {
	logs.InitLogs()
	defer logs.FlushLogs()

	stopCh := genericapiserver.SetupSignalHandler()
	// Initialize the server options
	options := server.NewWardleServerOptions(os.Stdout, os.Stderr)
	// Start the server
	cmd := server.NewCommandStartWardleServer(options, stopCh)
	cmd.Flags().AddGoFlagSet(flag.CommandLine)
	if err := cmd.Execute(); err != nil {
		klog.Fatal(err)
	}
}</pre>
<div class="blog_h3"><span class="graybg">服务器选项 </span></div>
<pre class="crayon-plain-tag">type WardleServerOptions struct {
	RecommendedOptions *genericoptions.RecommendedOptions

	SharedInformerFactory informers.SharedInformerFactory
	StdOut                io.Writer
	StdErr                io.Writer
}

func NewWardleServerOptions(out, errOut io.Writer) *WardleServerOptions {
	o := &amp;WardleServerOptions{
		RecommendedOptions: genericoptions.NewRecommendedOptions(
			// Data is stored under /registry/wardle.example.com in etcd by default
			defaultEtcdPathPrefix,
			// Use the legacy codec for wardle.example.com/v1alpha1
			apiserver.Codecs.LegacyCodec(v1alpha1.SchemeGroupVersion),
			// Process information for the API server
			genericoptions.NewProcessInfo("wardle-apiserver", "wardle"),
		),

		StdOut: out,
		StdErr: errOut,
	}
	// all objects in wardle.example.com/v1alpha1 are stored to etcd
	o.RecommendedOptions.Etcd.StorageConfig.EncodeVersioner = 
		runtime.NewMultiGroupVersioner(v1alpha1.SchemeGroupVersion, 
			schema.GroupKind{Group: v1alpha1.GroupName})
	return o
}</pre>
<p>As you can see, the core of the options is genericoptions.RecommendedOptions which, as its name suggests, provides the "recommended" options needed to run an apiserver:</p>
<pre class="crayon-plain-tag">type RecommendedOptions struct {
	// etcd-related configuration
	Etcd           *EtcdOptions
	// HTTPS options, including the listen address, certificates, etc.; also responsible for creating the loopback-only rest.Config
	SecureServing  *SecureServingOptionsWithLoopback
	// authn options
	Authentication *DelegatingAuthenticationOptions
	// authz options
	Authorization  *DelegatingAuthorizationOptions
	// audit options
	Audit          *AuditOptions
	// enables profiling and contention profiling
	Features       *FeatureOptions
	// core API options: where the main kube-apiserver config file lives
	CoreAPI        *CoreAPIOptions

	// feature gates
	FeatureGate featuregate.FeatureGate
	// called after ApplyTo has run for all the options above; the returned PluginInitializers are passed to Admission.ApplyTo
	ExtraAdmissionInitializers func(c *server.RecommendedConfig) ([]admission.PluginInitializer, error)
	Admission                  *AdmissionOptions
	// provides server (process) information
	ProcessInfo *ProcessInfo
	// webhook options
	Webhook     *WebhookOptions
	// controls the server's egress traffic
	EgressSelector *EgressSelectorOptions
}</pre>
<p>The recommended option values can be produced by genericoptions.NewRecommendedOptions(). RecommendedOptions can also take its values from command-line flags:</p>
<pre class="crayon-plain-tag">func (o *RecommendedOptions) AddFlags(fs *pflag.FlagSet) {}</pre>
<div class="blog_h3"><span class="graybg">Preparing the server</span></div>
<p>The server is implemented as a cobra.Command. First, RecommendedOptions is bound to command-line flags.</p>
<pre class="crayon-plain-tag">func NewCommandStartWardleServer(defaults *WardleServerOptions, stopCh &lt;-chan struct{}) *cobra.Command {
	o := *defaults
	cmd := &amp;cobra.Command{
		Short: "Launch a wardle API server",
		Long:  "Launch a wardle API server",
		RunE: func(c *cobra.Command, args []string) error {
			// ...
		},
	}

	flags := cmd.Flags()
	// add the options as command-line flags
	o.RecommendedOptions.AddFlags(flags)
	utilfeature.DefaultMutableFeatureGate.AddFlag(flags)

	return cmd
}</pre>
<p>Then cmd.Execute() is called, which in turn invokes the RunE function above:</p>
<pre class="crayon-plain-tag">if err := o.Complete(); err != nil {
	return err
}
if err := o.Validate(args); err != nil {
	return err
}
if err := o.RunWardleServer(stopCh); err != nil {
	return err
}
return nil</pre>
<p>The Complete method simply registers an admission controller, <a href="#WardleServerOptions-Complete">see below</a>.</p>
<p>The Validate method asks RecommendedOptions to validate the options (merged with the user-supplied command-line flags):</p>
<pre class="crayon-plain-tag">func (o WardleServerOptions) Validate(args []string) error {
	errors := []error{}
	// validation returns a slice of errors
	errors = append(errors, o.RecommendedOptions.Validate()...)
	// merge them into a single error
	return utilerrors.NewAggregate(errors)
}</pre>
<div class="blog_h3"><span class="graybg"><a id="start-server"></a>Starting the server</span></div>
<p>The RunWardleServer method starts the API Server. It covers the whole flow of turning the server options (Options) into a server configuration (Config), instantiating the APIServer from that configuration, and running it:</p>
<pre class="crayon-plain-tag">func (o WardleServerOptions) RunWardleServer(stopCh &lt;-chan struct{}) error {
	// turn the options into a configuration
	config, err := o.Config()
	if err != nil {
		return err
	}
	// turn the configuration into a CompletedConfig and instantiate the APIServer
	server, err := config.Complete().New()
	if err != nil {
		return err
	}

	// register a hook that runs after the API Server has started
	server.GenericAPIServer.AddPostStartHookOrDie("start-sample-server-informers", func(context genericapiserver.PostStartHookContext) error {
		config.GenericConfig.SharedInformerFactory.Start(context.StopCh)
		o.SharedInformerFactory.Start(context.StopCh)
		return nil
	})

	//                             prepare to run, then run the APIServer
	return server.GenericAPIServer.PrepareRun().Run(stopCh)
}</pre>
<div class="blog_h3"><span class="graybg">Server configuration</span></div>
<p>The API Server cannot use the options directly; they must be converted into an apiserver.Config:</p>
<pre class="crayon-plain-tag">func (o *WardleServerOptions) Config() (*apiserver.Config, error) {
	// the options get tweaked a bit more here

	// check whether the certificates can be read; if not, try to generate self-signed ones
	if err := o.RecommendedOptions.SecureServing.MaybeDefaultWithSelfSignedCerts("localhost", nil, []net.IP{net.ParseIP("127.0.0.1")}); err != nil {
		return nil, fmt.Errorf("error creating self-signed certificates: %v", err)
	}
	// enable paging or not, depending on the feature gate
	o.RecommendedOptions.Etcd.StorageConfig.Paging = utilfeature.DefaultFeatureGate.Enabled(features.APIListChunking)
	//
	o.RecommendedOptions.ExtraAdmissionInitializers = func(c *genericapiserver.RecommendedConfig) ([]admission.PluginInitializer, error) {
		// ...
	}

	// create the recommended config
	serverConfig := genericapiserver.NewRecommendedConfig(apiserver.Codecs)
	// expose the OpenAPI endpoint
	serverConfig.OpenAPIConfig = genericapiserver.DefaultOpenAPIConfig(
		// generated code
		sampleopenapi.GetOpenAPIDefinitions, openapi.NewDefinitionNamer(apiserver.Scheme))
	serverConfig.OpenAPIConfig.Info.Title = "Wardle"
	serverConfig.OpenAPIConfig.Info.Version = "0.1"

	// apply the RecommendedOptions to the RecommendedConfig
	if err := o.RecommendedOptions.ApplyTo(serverConfig); err != nil {
		return nil, err
	}

	// the config has two parts: the GenericConfig plus your own custom config
	config := &amp;apiserver.Config{
		GenericConfig: serverConfig,
		ExtraConfig:   apiserver.ExtraConfig{},
	}
	return config, nil
}</pre>
<p>As the code above shows, RecommendedConfig is at the heart of the configuration:</p>
<pre class="crayon-plain-tag">type RecommendedConfig struct {
	// the struct used to configure the GenericAPIServer
	Config

	// SharedInformerFactory provides shared informers for K8S resources
	// it is set by RecommendedOptions.CoreAPI.ApplyTo; by default the informers use the in-cluster ClientConfig
	SharedInformerFactory informers.SharedInformerFactory

	// set by RecommendedOptions.CoreAPI.ApplyTo, used by the informers
	ClientConfig *restclient.Config
}

type Config struct {
	SecureServing *SecureServingInfo
	Authentication AuthenticationInfo
	Authorization AuthorizationInfo
	// privileged loopback ClientConfig for local use; the PostStartHooks rely on it
	LoopbackClientConfig *restclient.Config
	// ...
}</pre>
<p>Calling ApplyTo passes the values from RecommendedOptions into the RecommendedConfig:</p>
<pre class="crayon-plain-tag">func (o *RecommendedOptions) ApplyTo(config *server.RecommendedConfig) error {
	// calls config.AddHealthChecks to add an etcd health check
	// sets config.RESTOptionsGetter
	if err := o.Etcd.ApplyTo(&amp;config.Config); err != nil {
		return err
	}
	// creates config.Listener
	// initializes config.Cert, config.CipherSuites, config.SNICerts, etc.
	if err := o.SecureServing.ApplyTo(&amp;config.Config.SecureServing, &amp;config.Config.LoopbackClientConfig); err != nil {
		return err
	}
	// initializes the authentication config and obtains the corresponding K8S client interfaces
	if err := o.Authentication.ApplyTo(&amp;config.Config.Authentication, config.SecureServing, config.OpenAPIConfig); err != nil {
		return err
	}
	// initializes the authorization config and obtains the corresponding K8S client interfaces
	if err := o.Authorization.ApplyTo(&amp;config.Config.Authorization); err != nil {
		return err
	}
	if err := o.Audit.ApplyTo(&amp;config.Config, config.ClientConfig, config.SharedInformerFactory, o.ProcessInfo, o.Webhook); err != nil {
		return err
	}
	if err := o.Features.ApplyTo(&amp;config.Config); err != nil {
		return err
	}
	// loads a kubeconfig file or uses the in-cluster config; provides config.ClientConfig and config.SharedInformerFactory
	if err := o.CoreAPI.ApplyTo(config); err != nil {
		return err
	}
	// invoke the admission initializers
	if initializers, err := o.ExtraAdmissionInitializers(config); err != nil {
		return err
	// initialize the admission controllers one by one
	} else if err := o.Admission.ApplyTo(&amp;config.Config, config.SharedInformerFactory, config.ClientConfig, o.FeatureGate, initializers...); err != nil {
		return err
	}
	if err := o.EgressSelector.ApplyTo(&amp;config.Config); err != nil {
		return err
	}
	if feature.DefaultFeatureGate.Enabled(features.APIPriorityAndFairness) {
		config.FlowControl = utilflowcontrol.New(
			config.SharedInformerFactory,
			kubernetes.NewForConfigOrDie(config.ClientConfig).FlowcontrolV1alpha1(),
			config.MaxRequestsInFlight+config.MaxMutatingRequestsInFlight,
			config.RequestTimeout/4,
		)
	}
	return nil
}</pre>
<p>As you can see, generating the server configuration has a hard dependency on the main kube-apiserver; these dependencies are what keep sample-apiserver from running independently of K8S.</p>
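<p>To make that concrete, here is a minimal sketch (my own, not sample-apiserver code) of a Config() variant that applies only the options that do not reach out to the main kube-apiserver. It reuses the ApplyTo methods shown above; whether this particular subset is sufficient for a given server is an assumption that still has to be verified:</p>
<pre class="crayon-plain-tag">// Sketch: build the config by applying only Etcd and SecureServing.
// Authentication, Authorization, CoreAPI and Admission are skipped on purpose,
// since their ApplyTo methods are the ones that contact the main kube-apiserver.
func (o *WardleServerOptions) ConfigWithoutCoreAPI() (*apiserver.Config, error) {
	if err := o.RecommendedOptions.SecureServing.MaybeDefaultWithSelfSignedCerts("localhost", nil, []net.IP{net.ParseIP("127.0.0.1")}); err != nil {
		return nil, err
	}
	serverConfig := genericapiserver.NewRecommendedConfig(apiserver.Codecs)
	if err := o.RecommendedOptions.Etcd.ApplyTo(&amp;serverConfig.Config); err != nil {
		return nil, err
	}
	if err := o.RecommendedOptions.SecureServing.ApplyTo(&amp;serverConfig.Config.SecureServing, &amp;serverConfig.Config.LoopbackClientConfig); err != nil {
		return nil, err
	}
	return &amp;apiserver.Config{GenericConfig: serverConfig, ExtraConfig: apiserver.ExtraConfig{}}, nil
}</pre>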
<p>The RecommendedConfig then has to go through the Complete method to become a CompletedConfig:</p>
<pre class="crayon-plain-tag">// server, err := config.Complete().New()

func (cfg *Config) Complete() CompletedConfig {
	c := completedConfig{
		cfg.GenericConfig.Complete(),
		&amp;cfg.ExtraConfig,
	}

	c.GenericConfig.Version = &amp;version.Info{
		Major: "1",
		Minor: "0",
	}

	return CompletedConfig{&amp;c}
}

// Complete fills in missing required configuration that can be derived from what is already set
func (c *RecommendedConfig) Complete() CompletedConfig {
	return c.Config.Complete(c.SharedInformerFactory)
}</pre>
<div class="blog_h3"><span class="graybg">Instantiating the APIServer</span></div>
<p>The WardleServer is instantiated from the CompletedConfig:</p>
<pre class="crayon-plain-tag">func (c completedConfig) New() (*WardleServer, error) {
	// create the GenericAPIServer
	//                                         the name is used to tell servers apart in logs
	//                                                           the DelegationTarget is used for APIServer composition
	genericServer, err := c.GenericConfig.New("sample-apiserver", genericapiserver.NewEmptyDelegate())
	if err != nil {
		return nil, err
	}

	s := &amp;WardleServer{
		GenericAPIServer: genericServer,
	}</pre>
<p>The <pre class="crayon-plain-tag">New()</pre> above creates the core GenericAPIServer:</p>
<pre class="crayon-plain-tag">func (c completedConfig) New(name string, delegationTarget DelegationTarget) (*GenericAPIServer, error) {
	// sanity assertions
	if c.Serializer == nil {
		return nil, fmt.Errorf("Genericapiserver.New() called with config.Serializer == nil")
	}
	if c.LoopbackClientConfig == nil {
		return nil, fmt.Errorf("Genericapiserver.New() called with config.LoopbackClientConfig == nil")
	}
	if c.EquivalentResourceRegistry == nil {
		return nil, fmt.Errorf("Genericapiserver.New() called with config.EquivalentResourceRegistry == nil")
	}

	handlerChainBuilder := func(handler http.Handler) http.Handler {
		return c.BuildHandlerChainFunc(handler, c.Config)
	}
	// build the request handler
	apiServerHandler := NewAPIServerHandler(name, c.Serializer, handlerChainBuilder, delegationTarget.UnprotectedHandler())

	// create the GenericAPIServer; many fields come straight from the completedConfig
	s := &amp;GenericAPIServer{
		discoveryAddresses:         c.DiscoveryAddresses,
		LoopbackClientConfig:       c.LoopbackClientConfig,
		legacyAPIGroupPrefixes:     c.LegacyAPIGroupPrefixes,
		admissionControl:           c.AdmissionControl,
		Serializer:                 c.Serializer,
		AuditBackend:               c.AuditBackend,
		Authorizer:                 c.Authorization.Authorizer,
		delegationTarget:           delegationTarget,
		EquivalentResourceRegistry: c.EquivalentResourceRegistry,
		HandlerChainWaitGroup:      c.HandlerChainWaitGroup,

		minRequestTimeout:     time.Duration(c.MinRequestTimeout) * time.Second,
		ShutdownTimeout:       c.RequestTimeout,
		ShutdownDelayDuration: c.ShutdownDelayDuration,
		SecureServingInfo:     c.SecureServing,
		ExternalAddress:       c.ExternalAddress,

		Handler: apiServerHandler,

		listedPathProvider: apiServerHandler,

		openAPIConfig:           c.OpenAPIConfig,
		skipOpenAPIInstallation: c.SkipOpenAPIInstallation,

		postStartHooks:         map[string]postStartHookEntry{},
		preShutdownHooks:       map[string]preShutdownHookEntry{},
		disabledPostStartHooks: c.DisabledPostStartHooks,

		healthzChecks:    c.HealthzChecks,
		livezChecks:      c.LivezChecks,
		readyzChecks:     c.ReadyzChecks,
		readinessStopCh:  make(chan struct{}),
		livezGracePeriod: c.LivezGracePeriod,

		DiscoveryGroupManager: discovery.NewRootAPIsHandler(c.DiscoveryAddresses, c.Serializer),

		maxRequestBodyBytes: c.MaxRequestBodyBytes,
		livezClock:          clock.RealClock{},
	}

	// ...

	// add the delegationTarget's lifecycle hooks
	for k, v := range delegationTarget.PostStartHooks() {
		s.postStartHooks[k] = v
	}
	for k, v := range delegationTarget.PreShutdownHooks() {
		s.preShutdownHooks[k] = v
	}

	// add the pre-configured hooks
	for name, preconfiguredPostStartHook := range c.PostStartHooks {
		if err := s.AddPostStartHook(name, preconfiguredPostStartHook.hook); err != nil {
			return nil, err
		}
	}

	// if the config carries a SharedInformerFactory but no hook to start it has been registered,
	// register a PostStart hook that starts it
	genericApiServerHookName := "generic-apiserver-start-informers"
	if c.SharedInformerFactory != nil {
		if !s.isPostStartHookRegistered(genericApiServerHookName) {
			err := s.AddPostStartHook(genericApiServerHookName, func(context PostStartHookContext) error {
				c.SharedInformerFactory.Start(context.StopCh)
				return nil
			})
			if err != nil {
				return nil, err
			}
			// TODO: Once we get rid of /healthz consider changing this to post-start-hook.
			err = s.addReadyzChecks(healthz.NewInformerSyncHealthz(c.SharedInformerFactory))
			if err != nil {
				return nil, err
			}
		}
	}

	const priorityAndFairnessConfigConsumerHookName = "priority-and-fairness-config-consumer"
	if s.isPostStartHookRegistered(priorityAndFairnessConfigConsumerHookName) {
	} else if c.FlowControl != nil {
		err := s.AddPostStartHook(priorityAndFairnessConfigConsumerHookName, func(context PostStartHookContext) error {
			go c.FlowControl.Run(context.StopCh)
			return nil
		})
		if err != nil {
			return nil, err
		}
		// TODO(yue9944882): plumb pre-shutdown-hook for request-management system?
	} else {
		klog.V(3).Infof("Not requested to run hook %s", priorityAndFairnessConfigConsumerHookName)
	}

	// add the delegationTarget's health checks
	for _, delegateCheck := range delegationTarget.HealthzChecks() {
		skip := false
		for _, existingCheck := range c.HealthzChecks {
			if existingCheck.Name() == delegateCheck.Name() {
				skip = true
				break
			}
		}
		if skip {
			continue
		}
		s.AddHealthChecks(delegateCheck)
	}

	s.listedPathProvider = routes.ListedPathProviders{s.listedPathProvider, delegationTarget}

	// install profiling, metrics, and the list of paths shown under / (listedPathProvider)
	installAPI(s, c.Config)

	// use the UnprotectedHandler from the delegation target to ensure that we don't attempt to double authenticator, authorize,
	// or some other part of the filter chain in delegation cases.
	if delegationTarget.UnprotectedHandler() == nil &amp;&amp; c.EnableIndex {
		s.Handler.NonGoRestfulMux.NotFoundHandler(routes.IndexLister{
			StatusCode:   http.StatusNotFound,
			PathProvider: s.listedPathProvider,
		})
	}

	return s, nil
}</pre>
<p>GenericAPIServer.Handler is the handler for HTTP requests; we analyze it below in the <a href="#request-processing">request handling</a> section.</p>
<div class="blog_h3"><span class="graybg"><a id="InstallAPIGroup"></a>Installing the APIGroup</span></div>
<p>After the GenericAPIServer has been instantiated, the next step is installing the APIGroup:</p>
<pre class="crayon-plain-tag">// create the APIGroupInfo: everything about a group of APIs, including the registered types (Scheme),
	// how to encode/decode them (Codec), and how to parse query parameters (ParameterCodec)
	apiGroupInfo := genericapiserver.NewDefaultAPIGroupInfo(wardle.GroupName, Scheme, metav1.ParameterCodec, Codecs)

	// map from resource name to rest.Storage
	v1alpha1storage := map[string]rest.Storage{}
	v1alpha1storage["flunders"] = wardleregistry.RESTInPeace(flunderstorage.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter))
	v1alpha1storage["fischers"] = wardleregistry.RESTInPeace(fischerstorage.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter))
	// each of the two versions gets its own map
	apiGroupInfo.VersionedResourcesStorageMap["v1alpha1"] = v1alpha1storage

	v1beta1storage := map[string]rest.Storage{}
	v1beta1storage["flunders"] = wardleregistry.RESTInPeace(flunderstorage.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter))
	apiGroupInfo.VersionedResourcesStorageMap["v1beta1"] = v1beta1storage
	if err := s.GenericAPIServer.InstallAPIGroup(&amp;apiGroupInfo); err != nil {
		return nil, err
	}

	return s, nil
}</pre>
<p>The registry.REST returned by flunderstorage.NewREST and friends embeds a genericregistry.Store, which implements more than just rest.Storage:</p>
<pre class="crayon-plain-tag">type Storage interface {
	// once the request data has been decoded into the object created by this method, Create/Update can be called to persist it
	// must return a pointer type suitable for Codec.DecodeInto([]byte, runtime.Object)
	New() runtime.Object
}</pre>
<p>It also implements rest.StandardStorage:</p>
<pre class="crayon-plain-tag">type StandardStorage interface {
	Getter
	Lister
	CreaterUpdater
	GracefulDeleter
	CollectionDeleter
	Watcher
}</pre>
<p>Implementing these interfaces means registry.REST supports create, read, update, delete, and watch for API objects. We dig into the details in the <a href="#request-processing">request handling</a> section below.</p>
<div class="blog_h3"><span class="graybg">Starting the APIServer</span></div>
<p>As shown in the code in the <a href="#start-server">Starting the server</a> section, after the options have been turned into a configuration, the configuration completed, and the APIServer instantiated from it, a two-step startup sequence runs.</p>
<p>First comes PrepareRun, which performs operations that have to happen after the APIs have been installed (during instantiation):</p>
<pre class="crayon-plain-tag">func (s *GenericAPIServer) PrepareRun() preparedGenericAPIServer {
	s.delegationTarget.PrepareRun()
	// install the OpenAPI handler
	if s.openAPIConfig != nil &amp;&amp; !s.skipOpenAPIInstallation {
		s.OpenAPIVersionedService, s.StaticOpenAPISpec = routes.OpenAPI{
			Config: s.openAPIConfig,
		}.Install(s.Handler.GoRestfulContainer, s.Handler.NonGoRestfulMux)
	}
	// install the health-check handlers
	s.installHealthz()
	s.installLivez()
	err := s.addReadyzShutdownCheck(s.readinessStopCh)
	if err != nil {
		klog.Errorf("Failed to install readyz shutdown check %s", err)
	}
	s.installReadyz()

	// register a pre-shutdown hook for the audit backend
	if s.AuditBackend != nil {
		err := s.AddPreShutdownHook("audit-backend", func() error {
			s.AuditBackend.Shutdown()
			return nil
		})
		if err != nil {
			klog.Errorf("Failed to add pre-shutdown hook for audit-backend %s", err)
		}
	}

	return preparedGenericAPIServer{s}
}</pre>
<p>Then Run starts the APIServer:</p>
<pre class="crayon-plain-tag">func (s preparedGenericAPIServer) Run(stopCh &lt;-chan struct{}) error {
	delayedStopCh := make(chan struct{})
	go func() {
		defer close(delayedStopCh)
		// shutdown signal received
		&lt;-stopCh
		// once shutdown has been triggered, /readyz must immediately start returning errors
		close(s.readinessStopCh)
		// sleep for ShutdownDelayDuration before shutting down; this gives load balancers a window to notice /readyz and stop sending requests to this server
		time.Sleep(s.ShutdownDelayDuration)
	}()

	// run the server
	err := s.NonBlockingRun(delayedStopCh)
	if err != nil {
		return err
	}

	// shutdown signal received
	&lt;-stopCh

	// run the pre-shutdown hooks
	err = s.RunPreShutdownHooks()
	if err != nil {
		return err
	}

	// wait for the delayed shutdown signal
	&lt;-delayedStopCh

	// wait for in-flight requests to finish, then shut down
	s.HandlerChainWaitGroup.Wait()

	return nil
}</pre>
<p> NonBlockingRun是启动APIServer的核心代码。它会启动一个HTTPS服务器：</p>
<pre class="crayon-plain-tag">func (s preparedGenericAPIServer) NonBlockingRun(stopCh &lt;-chan struct{}) error {
	// this channel ensures the HTTP server shuts down gracefully without losing audit events
	auditStopCh := make(chan struct{})

	// start the audit backend first, while no requests can come in yet
	if s.AuditBackend != nil {
		if err := s.AuditBackend.Run(auditStopCh); err != nil {
			return fmt.Errorf("failed to run the audit backend: %v", err)
		}
	}

	// the channel below is used to clean up the listener on error
	internalStopCh := make(chan struct{})
	var stoppedCh &lt;-chan struct{}
	if s.SecureServingInfo != nil &amp;&amp; s.Handler != nil {
		var err error
		// start the HTTPS server; this only fails on certificate errors or if the internal listen call fails
		// the server loop runs in a goroutine
		// note that s.Handler is accessible to us here, so setting up a plain-HTTP (non-HTTPS) server should be easy
		stoppedCh, err = s.SecureServingInfo.Serve(s.Handler, s.ShutdownTimeout, internalStopCh)
		if err != nil {
			close(internalStopCh)
			close(auditStopCh)
			return err
		}
	}

	// cleanup
	go func() {
		&lt;-stopCh
		close(internalStopCh)
		if stoppedCh != nil {
			&lt;-stoppedCh
		}
		s.HandlerChainWaitGroup.Wait()
		close(auditStopCh)
	}()

	// post-start hooks
	s.RunPostStartHooks(stopCh)

	if _, err := systemd.SdNotify(true, "READY=1\n"); err != nil {
		klog.Errorf("Unable to send systemd daemon successful start message: %v\n", err)
	}

	return nil
}</pre>
<div class="blog_h3"><span class="graybg">Registering structs with the Scheme</span></div>
<p>The apis/wardle package and its subpackages define the APIs in the wardle.example.com group.</p>
<p>register.go in the wardle package defines the group, plus the functions that derive the GVK and GVR from the GV:</p>
<pre class="crayon-plain-tag">const GroupName = "wardle.example.com"
//                                                             this constant is used when there is no formal version
var SchemeGroupVersion = schema.GroupVersion{Group: GroupName, Version: runtime.APIVersionInternal}

func Kind(kind string) schema.GroupKind {
	return SchemeGroupVersion.WithKind(kind).GroupKind()
}

func Resource(resource string) schema.GroupResource {
	return SchemeGroupVersion.WithResource(resource).GroupResource()
}</pre>
<p>An API package usually also provides an AddToScheme variable for registering its APIs with a given scheme:</p>
<pre class="crayon-plain-tag">var (
	SchemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
	AddToScheme   = SchemeBuilder.AddToScheme
)

func addKnownTypes(scheme *runtime.Scheme) error {
	// the core of API registration: add the API types in this GV
	scheme.AddKnownTypes(SchemeGroupVersion,
		&amp;Flunder{}, // pointer types are required; the type name obtained via Go reflection becomes the Kind when encoding
		&amp;FlunderList{},
		&amp;Fischer{},
		&amp;FischerList{},
	)
	return nil
}

// a SchemeBuilder is really just a slice
type SchemeBuilder []func(*Scheme) error
// NewSchemeBuilder accepts multiple callbacks and calls SchemeBuilder.Register on them
func NewSchemeBuilder(funcs ...func(*Scheme) error) SchemeBuilder {
	var sb SchemeBuilder
	sb.Register(funcs...)
	return sb
}
// Register simply appends to the slice
func (sb *SchemeBuilder) Register(funcs ...func(*Scheme) error) {
	for _, f := range funcs {
		*sb = append(*sb, f)
	}
}

// AddToScheme just iterates over the callbacks; Go methods can be passed around as values
func (sb *SchemeBuilder) AddToScheme(s *Scheme) error {
	for _, f := range *sb {
		if err := f(s); err != nil {
			return err
		}
	}
	return nil
}</pre>
<p>types.go in the wardle package defines the Go structs backing the API types.</p>
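<p>For reference, those structs follow the usual K8S shape; a simplified sketch (trimmed, and possibly differing in detail from the sample):</p>
<pre class="crayon-plain-tag">// Simplified sketch of the internal types in apis/wardle/types.go.
type Flunder struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec   FlunderSpec
	Status FlunderStatus
}

// every API type also needs a corresponding list type
type FlunderList struct {
	metav1.TypeMeta
	metav1.ListMeta

	Items []Flunder
}</pre>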
<p>The v1alpha1 and v1beta1 subpackages define the two versions of the API. They contain the same kind of GroupName, SchemeGroupVersion, SchemeBuilder, AddToScheme… variables and functions as the wardle package, along with the corresponding API structs, plus generated functions for converting to and from the APIVersionInternal version.</p>
<p>Back in the wardle package, the Install function registers all of APIVersionInternal, v1alpha1, and v1beta1 with a scheme:</p>
<pre class="crayon-plain-tag">func Install(scheme *runtime.Scheme) {
	utilruntime.Must(wardle.AddToScheme(scheme))
	utilruntime.Must(v1beta1.AddToScheme(scheme))
	utilruntime.Must(v1alpha1.AddToScheme(scheme))
	utilruntime.Must(scheme.SetVersionPriority(v1beta1.SchemeGroupVersion, v1alpha1.SchemeGroupVersion))
}</pre>
<p>The apiserver package, in turn, calls the Install function above at init time:</p>
<pre class="crayon-plain-tag">var (
	// the API registry
	Scheme = runtime.NewScheme()
	// the codec factory
	Codecs = serializer.NewCodecFactory(Scheme)
)

func init() {
	// install the APIs provided by sample-apiserver
	install.Install(Scheme)

	// register k8s.io/apimachinery/pkg/apis/meta/v1 into the empty v1 group; why?
	// we need to add the options to empty v1
	// TODO fix the server code to avoid this
	metav1.AddToGroupVersion(Scheme, schema.GroupVersion{Version: "v1"})

	// TODO: keep the generic API server from wanting this
	unversioned := schema.GroupVersion{Group: "", Version: "v1"}
	Scheme.AddUnversionedTypes(unversioned,
		&amp;metav1.Status{},
		&amp;metav1.APIVersions{},
		&amp;metav1.APIGroupList{},
		&amp;metav1.APIGroup{},
		&amp;metav1.APIResourceList{},
	)
}</pre>
<p>Codecs is the factory for API resource codecs. RecommendedOptions needs this factory:</p>
<pre class="crayon-plain-tag">// ... 
		RecommendedOptions: genericoptions.NewRecommendedOptions(
			defaultEtcdPathPrefix,
			// obtain the LegacyCodec (runtime.Codec) for the v1alpha1 version
			// the LegacyCodec encodes to the given API version and decodes (from any supported source) into the internal form
			// this codec always encodes as JSON
			//
			// the LegacyCodec method is deprecated; clients/servers should negotiate a serializer by MIME type
			// and call CodecForVersions instead
			apiserver.Codecs.LegacyCodec(v1alpha1.SchemeGroupVersion),
			genericoptions.NewProcessInfo("wardle-apiserver", "wardle"),
		),

func NewRecommendedOptions(prefix string, codec runtime.Codec, processInfo *ProcessInfo) *RecommendedOptions {
	return &amp;RecommendedOptions{
		//  note: this runtime.Codec is used to turn API objects into JSON for storage in etcd, not for responses to clients
		Etcd:           NewEtcdOptions(storagebackend.NewDefaultConfig(prefix, codec)),
	// ...</pre>
<p>APIGroupInfo needs the factory as well:</p>
<pre class="crayon-plain-tag">apiGroupInfo := genericapiserver.NewDefaultAPIGroupInfo(wardle.GroupName, Scheme, metav1.ParameterCodec, Codecs)

func NewDefaultAPIGroupInfo(group string, scheme *runtime.Scheme, parameterCodec runtime.ParameterCodec, codecs serializer.CodecFactory) APIGroupInfo {
	return APIGroupInfo{
		// ...
		Scheme:                 scheme,
		ParameterCodec:         parameterCodec,
		NegotiatedSerializer:   codecs,
	}
}</pre>
<p>During APIServer instantiation, the Scheme is handed to the APIGroupInfo, giving us the mapping between resources and Go types.</p>
<div class="blog_h3"><span class="graybg">Admission controllers</span></div>
<p>sample-apiserver also demonstrates how to integrate your own admission controller into the API Server.</p>
<p>In the options, an admission initializer has to be registered:</p>
<pre class="crayon-plain-tag">o.RecommendedOptions.ExtraAdmissionInitializers = func(c *genericapiserver.RecommendedConfig) ([]admission.PluginInitializer, error) {
		client, err := clientset.NewForConfig(c.LoopbackClientConfig)
		if err != nil {
			return nil, err
		}
		informerFactory := informers.NewSharedInformerFactory(client, c.LoopbackClientConfig.Timeout)
		o.SharedInformerFactory = informerFactory
		//                                   the initializer function
		return []admission.PluginInitializer{wardleinitializer.New(informerFactory)}, nil
	}</pre>
<p>The initializer is a function that is executed when the options are converted into the configuration:</p>
<pre class="crayon-plain-tag">// call the function above
	if initializers, err := o.ExtraAdmissionInitializers(config); err != nil {
		return err
	// initialize the admission controllers
	} else if err := o.Admission.ApplyTo(&amp;config.Config, config.SharedInformerFactory, config.ClientConfig, o.FeatureGate, initializers...); err != nil {
		return err
	}</pre>
<p>The ApplyTo method above assembles a chain of admission-controller initializer functions and invokes them one by one to initialize every admission controller.</p>
<p>In addition, a PostStart hook starts the SharedInformerFactory that the admission controllers depend on:</p>
<pre class="crayon-plain-tag">server.GenericAPIServer.AddPostStartHookOrDie("start-sample-server-informers", func(context genericapiserver.PostStartHookContext) error {
		// the main kube-apiserver's InformerFactory; apparently not of much use here
		config.GenericConfig.SharedInformerFactory.Start(context.StopCh)
		// the secondary apiserver's InformerFactory, which the admission controllers need
		o.SharedInformerFactory.Start(context.StopCh)
		return nil
	})</pre>
<p>Let us look at the admission-related code in sample-apiserver.</p>
<p>The admission initializer lives in the pkg/admission/wardleinitializer package; it can inject an InformerFactory into any admission controller that implements WantsInternalWardleInformerFactory:</p>
<pre class="crayon-plain-tag">type pluginInitializer struct {
	informers informers.SharedInformerFactory
}

var _ admission.PluginInitializer = pluginInitializer{}

// this function is called from the ExtraAdmissionInitializers function
func New(informers informers.SharedInformerFactory) pluginInitializer {
	return pluginInitializer{
		informers: informers,
	}
}

// this function is called from o.Admission.ApplyTo
func (i pluginInitializer) Initialize(plugin admission.Interface) {
	if wants, ok := plugin.(WantsInternalWardleInformerFactory); ok {
		wants.SetInternalWardleInformerFactory(i.informers)
	}
}</pre>
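<p>The WantsInternalWardleInformerFactory interface that the initializer checks for looks roughly like this (a sketch; the exact details in the sample may differ):</p>
<pre class="crayon-plain-tag">// Any admission plugin that wants the wardle SharedInformerFactory implements this interface;
// the plugin initializer above injects the factory via SetInternalWardleInformerFactory.
type WantsInternalWardleInformerFactory interface {
	SetInternalWardleInformerFactory(informers.SharedInformerFactory)
	admission.InitializationValidator
}</pre>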
<p>The pkg/admission/plugin/banflunder package contains the sample admission controller, BanFlunder. The following function:</p>
<pre class="crayon-plain-tag">func Register(plugins *admission.Plugins) {
	plugins.Register("BanFlunder", func(config io.Reader) (admission.Interface, error) {
		return New()
	})
}</pre>
<p><a id="WardleServerOptions-Complete"></a>is called very early in the program's life to register the admission controller with the API Server:</p>
<pre class="crayon-plain-tag">func (o *WardleServerOptions) Complete() error {
	// register the plugin
	banflunder.Register(o.RecommendedOptions.Admission.Plugins)

	// configure the plugin order
	o.RecommendedOptions.Admission.RecommendedPluginOrder = append(o.RecommendedOptions.Admission.RecommendedPluginOrder, "BanFlunder")

	return nil
}</pre>
<p>The heart of an admission controller is its Admit function, which can mutate or reject an API Server request:</p>
<pre class="crayon-plain-tag">func (d *DisallowFlunder) Admit(ctx context.Context, a admission.Attributes, o admission.ObjectInterfaces) error {
	// we only care about one particular API kind
	if a.GetKind().GroupKind() != wardle.Kind("Flunder") {
		return nil
	}

	if !d.WaitForReady() {
		return admission.NewForbidden(a, fmt.Errorf("not yet ready to handle request"))
	}

	// used to access the object's metadata
	metaAccessor, err := meta.Accessor(a.GetObject())
	if err != nil {
		return err
	}
	flunderName := metaAccessor.GetName()

	fischers, err := d.lister.List(labels.Everything())
	if err != nil {
		return err
	}

	for _, fischer := range fischers {
		for _, disallowedFlunder := range fischer.DisallowedFlunders {
			if flunderName == disallowedFlunder {
				return errors.NewForbidden(
					a.GetResource().GroupResource(),
					a.GetName(),
					// reject the request
					fmt.Errorf("this name may not be used, please change the resource name"),
				)
			}
		}
	}
	return nil
}</pre>
<div class="blog_h3"><span class="graybg"><a id="request-processing"></a>Request handling</span></div>
<p>From the analysis above we know that resource create/read/update/delete is handled by registry.REST (an empty shell; genericregistry.Store does the real work). So how does an HTTP request reach it?</p>
<p>Recall the code that adds resources to the APIGroupInfo:</p>
<pre class="crayon-plain-tag">//              the resource name
v1alpha1storage["flunders"] = wardleregistry.RESTInPeace(
	// create the registry.REST
	flunderstorage.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter)
)


func NewREST(scheme *runtime.Scheme, optsGetter generic.RESTOptionsGetter) (*registry.REST, error) {
	// the strategy; see below
	strategy := NewStrategy(scheme)

	store := &amp;genericregistry.Store{
		// function that instantiates the resource
		NewFunc:                  func() runtime.Object {
			// every create/read/update/delete involves constructing this struct, so a breakpoint here intercepts all requests
			return &amp;wardle.Flunder{}
		},
		// function that instantiates the resource list
		NewListFunc:              func() runtime.Object { return &amp;wardle.FlunderList{} },
		// decides whether an object can be handled by this storage
		PredicateFunc: func(label labels.Selector, field fields.Selector) storage.SelectionPredicate {
			return storage.SelectionPredicate{
				Label: label,
				Field: field,
				GetAttrs: func(obj runtime.Object) (labels.Set, fields.Set, error) {
					apiserver, ok := obj.(*wardle.Flunder)
					if !ok {
						return nil, nil, fmt.Errorf("given object is not a Flunder")
					}
					return apiserver.ObjectMeta.Labels, SelectableFields(apiserver), nil
				},
			}
		},
		// the plural name of the resource, used when the request context lacks the necessary information
		DefaultQualifiedResource: wardle.Resource("flunders"),
		// strategies for create/update/delete
		CreateStrategy: strategy,
		UpdateStrategy: strategy,
		DeleteStrategy: strategy,
	}
	options := &amp;generic.StoreOptions{RESTOptions: optsGetter, AttrFunc: GetAttrs}
	// fill in default fields
	if err := store.CompleteWithOptions(options); err != nil {
		return nil, err
	}
	return &amp;registry.REST{store}, nil
}</pre>
<p>Creating a genericregistry.Store requires two pieces of information:</p>
<ol>
<li>the Scheme, which provides the mapping between Go types and GVKs; the Kind is obtained by reflecting on the Go struct's type name</li>
<li>a generic.RESTOptionsGetter</li>
</ol>
<p>The RESTOptionsGetter is used to obtain RESTOptions:</p>
<pre class="crayon-plain-tag">type RESTOptionsGetter interface {
	GetRESTOptions(resource schema.GroupResource) (RESTOptions, error)
}</pre>
<p>RESTOptions carries information about storage, even though that responsibility has little to do with its name:</p>
<pre class="crayon-plain-tag">type RESTOptions struct {
	// the configuration needed to create a storage backend
	StorageConfig *storagebackend.Config
	// a function that provides a storage.Interface and a factory.DestroyFunc
	Decorator     StorageDecorator

	EnableGarbageCollection bool
	DeleteCollectionWorkers int
	ResourcePrefix          string
	CountMetricPollPeriod   time.Duration
}

//
type Config struct {
	// type of the storage backend, etcd3 by default
	Type string
	// prefix passed to all storage.Interface methods; this is the etcd key prefix
	Prefix string
	// how to connect to the etcd servers
	Transport TransportConfig
	// hints whether the APIServer should support paging
	Paging bool
	// responsible for (de)serialization
	Codec runtime.Codec
	// transformation applied to the serialized object before it is persisted to etcd
	Transformer value.Transformer

	CompactionInterval time.Duration
	CountMetricPollPeriod time.Duration
	LeaseManagerConfig etcd3.LeaseManagerConfig
}


// constructs a storage.Interface using the functions passed in
type StorageDecorator func(
	config *storagebackend.Config,
	resourcePrefix string,
	keyFunc func(obj runtime.Object) (string, error),
	newFunc func() runtime.Object,
	newListFunc func() runtime.Object,
	getAttrsFunc storage.AttrFunc,
	trigger storage.IndexerFuncs,
	indexers *cache.Indexers) (storage.Interface, factory.DestroyFunc, error)
// CRUD
type Interface interface {
	Versioner() Versioner
	Create(ctx context.Context, key string, obj, out runtime.Object, ttl uint64) error
	Delete(ctx context.Context, key string, out runtime.Object, preconditions *Preconditions, validateDeletion ValidateObjectFunc) error
	Watch(ctx context.Context, key string, resourceVersion string, p SelectionPredicate) (watch.Interface, error)
	WatchList(ctx context.Context, key string, resourceVersion string, p SelectionPredicate) (watch.Interface, error)
	Get(ctx context.Context, key string, resourceVersion string, objPtr runtime.Object, ignoreNotFound bool) error
	GetToList(ctx context.Context, key string, resourceVersion string, p SelectionPredicate, listObj runtime.Object) error
	List(ctx context.Context, key string, resourceVersion string, p SelectionPredicate, listObj runtime.Object) error
	GuaranteedUpdate(
		ctx context.Context, key string, ptrToType runtime.Object, ignoreNotFound bool,
		precondtions *Preconditions, tryUpdate UpdateFunc, suggestion ...runtime.Object) error
	Count(key string) (int64, error)
}
// destroys, in one call, anything created by Create() and used by the current Storage
type DestroyFunc func()</pre>
<p>Note that storagebackend.Config is coupled to etcd here, so supporting a different storage backend means intervening at an earlier point.</p>
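<p>This is also where a lighter storage backend could be plugged in. A rough sketch (my own assumption, not code from the library or the sample): implement generic.RESTOptionsGetter yourself and return RESTOptions whose Decorator yields a custom storage.Interface. The newInMemoryStorage call below is hypothetical; writing that implementation is the real work:</p>
<pre class="crayon-plain-tag">// Sketch of a RESTOptionsGetter that bypasses the etcd-based decorator.
type inMemoryRESTOptionsGetter struct{}

func (g inMemoryRESTOptionsGetter) GetRESTOptions(resource schema.GroupResource) (generic.RESTOptions, error) {
	return generic.RESTOptions{
		ResourcePrefix: "/" + resource.Group + "/" + resource.Resource,
		// The decorator is the hook: instead of creating an etcd3 client it can
		// return any storage.Interface implementation, e.g. an in-memory one.
		Decorator: func(config *storagebackend.Config, resourcePrefix string,
			keyFunc func(obj runtime.Object) (string, error),
			newFunc func() runtime.Object, newListFunc func() runtime.Object,
			getAttrsFunc storage.AttrFunc, trigger storage.IndexerFuncs,
			indexers *cache.Indexers) (storage.Interface, factory.DestroyFunc, error) {
			// newInMemoryStorage is a hypothetical constructor for a custom storage.Interface
			return newInMemoryStorage(newFunc, newListFunc), func() {}, nil
		},
	}, nil
}</pre>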
<p>genericregistry.Store also carries three Strategy fields:</p>
<pre class="crayon-plain-tag">type Store struct {
	// ...
	CreateStrategy rest.RESTCreateStrategy
	UpdateStrategy rest.RESTUpdateStrategy
	DeleteStrategy rest.RESTDeleteStrategy
	// ...
}

type RESTCreateStrategy interface {
	runtime.ObjectTyper
	// used to generate names
	names.NameGenerator
	// whether the object must live in a namespace
	NamespaceScoped() bool
	// called before Validate and Canonicalize to normalize the object,
	// e.g. dropping fields that should not be persisted, or re-sorting order-insensitive list fields
	// it must not remove fields in a way that would make the object fail validation
	PrepareForCreate(ctx context.Context, obj runtime.Object)
	// validation, called after the object's default fields have been populated
	Validate(ctx context.Context, obj runtime.Object) field.ErrorList
	// called after validation and before persistence; canonicalizes the object, usually a type check or a no-op
	Canonicalize(obj runtime.Object)
}

type RESTUpdateStrategy interface {
	runtime.ObjectTyper
	NamespaceScoped() bool
	// whether an object may be created by a PUT request
	AllowCreateOnUpdate() bool
	// prepare the update
	PrepareForUpdate(ctx context.Context, obj, old runtime.Object)
	// validate
	ValidateUpdate(ctx context.Context, obj, old runtime.Object) field.ErrorList
	// canonicalize
	Canonicalize(obj runtime.Object)
	// whether unconditional updates are allowed when the object specifies no version, i.e. ignoring the latest resourceVersion (optimistic concurrency control disabled)
	AllowUnconditionalUpdate() bool
}

type RESTDeleteStrategy interface {
	runtime.ObjectTyper
}

type ObjectTyper interface {
	// get the possible GVKs for an object
	ObjectKinds(Object) ([]schema.GroupVersionKind, bool, error)
	// whether the Scheme recognizes the given GVK
	Recognizes(gvk schema.GroupVersionKind) bool
}</pre>
<p>As you can see, the strategies shape the behavior of create, update, and delete: they can generate object names, validate objects, and even mutate them. Here is the strategy implementation that sample-apiserver provides:</p>
<pre class="crayon-plain-tag">func NewStrategy(typer runtime.ObjectTyper) flunderStrategy {
	// simple naming strategy: returns the requested base name plus a random 5-character alphanumeric suffix
	return flunderStrategy{typer, names.SimpleNameGenerator}
}

type flunderStrategy struct {
	runtime.ObjectTyper
	names.NameGenerator
}

func (flunderStrategy) NamespaceScoped() bool {
	return true
}

func (flunderStrategy) PrepareForCreate(ctx context.Context, obj runtime.Object) {
}

func (flunderStrategy) PrepareForUpdate(ctx context.Context, obj, old runtime.Object) {
}

func (flunderStrategy) Validate(ctx context.Context, obj runtime.Object) field.ErrorList {
	flunder := obj.(*wardle.Flunder)
	return validation.ValidateFlunder(flunder)
}

func (flunderStrategy) AllowCreateOnUpdate() bool {
	return false
}

func (flunderStrategy) AllowUnconditionalUpdate() bool {
	return false
}

func (flunderStrategy) Canonicalize(obj runtime.Object) {
}

func (flunderStrategy) ValidateUpdate(ctx context.Context, obj, old runtime.Object) field.ErrorList {
	return field.ErrorList{}
}



package validation

// ValidateFlunder validates a Flunder.
func ValidateFlunder(f *wardle.Flunder) field.ErrorList {
	allErrs := field.ErrorList{}

	allErrs = append(allErrs, ValidateFlunderSpec(&amp;f.Spec, field.NewPath("spec"))...)

	return allErrs
}

// ValidateFlunderSpec validates a FlunderSpec.
func ValidateFlunderSpec(s *wardle.FlunderSpec, fldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}

	if len(s.FlunderReference) != 0 &amp;&amp; len(s.FischerReference) != 0 {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("fischerReference"), s.FischerReference, "cannot be set with flunderReference at the same time"))
	} else if len(s.FlunderReference) != 0 &amp;&amp; s.ReferenceType != wardle.FlunderReferenceType {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("flunderReference"), s.FlunderReference, "cannot be set if referenceType is not Flunder"))
	} else if len(s.FischerReference) != 0 &amp;&amp; s.ReferenceType != wardle.FischerReferenceType {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("fischerReference"), s.FischerReference, "cannot be set if referenceType is not Fischer"))
	} else if len(s.FischerReference) == 0 &amp;&amp; s.ReferenceType == wardle.FischerReferenceType {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("fischerReference"), s.FischerReference, "cannot be empty if referenceType is Fischer"))
	} else if len(s.FlunderReference) == 0 &amp;&amp; s.ReferenceType == wardle.FlunderReferenceType {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("flunderReference"), s.FlunderReference, "cannot be empty if referenceType is Flunder"))
	}

	if len(s.ReferenceType) != 0 &amp;&amp; s.ReferenceType != wardle.FischerReferenceType &amp;&amp; s.ReferenceType != wardle.FlunderReferenceType {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("referenceType"), s.ReferenceType, "must be Flunder or Fischer"))
	}

	return allErrs
}</pre>
<p>At this point we can guess how a Create request is handled by genericregistry.Store:</p>
<ol>
<li>read the request body and deserialize it, via NewFunc, into a runtime.Object</li>
<li>call PredicateFunc to decide whether the object can be handled</li>
<li>call the CreateStrategy to validate and canonicalize the object</li>
<li>use the RESTOptions to store it in etcd</li>
</ol>
<p>So how does the request get here, and what do those steps look like in detail? We have already located the key code paths above, so the full flow is easy to follow with a few breakpoints.</p>
<p>During the GenericAPIServer instantiation analyzed above, its Handler field is created like this:</p>
<pre class="crayon-plain-tag">apiServerHandler := NewAPIServerHandler(name, c.Serializer, handlerChainBuilder, delegationTarget.UnprotectedHandler())

// the builder for the handler chain; note the parameter and return types are the same, because the chain is built by wrapping handlers in layers rather than linking them like a list
type HandlerChainBuilderFn func(apiHandler http.Handler) http.Handler

func NewAPIServerHandler(name string, s runtime.NegotiatedSerializer, 
		handlerChainBuilder HandlerChainBuilderFn, notFoundHandler http.Handler) *APIServerHandler {

	// configure go-restful, a framework for building REST-style web services
	nonGoRestfulMux := mux.NewPathRecorderMux(name)
	if notFoundHandler != nil {
		nonGoRestfulMux.NotFoundHandler(notFoundHandler)
	}

	// a Container is a collection of WebServices
	gorestfulContainer := restful.NewContainer()
	// the container holds a ServeMux used for HTTP request multiplexing
	gorestfulContainer.ServeMux = http.NewServeMux()
	// the router
	gorestfulContainer.Router(restful.CurlyRouter{}) // e.g. for proxy/{kind}/{name}/{*}
	// panic handler
	gorestfulContainer.RecoverHandler(func(panicReason interface{}, httpWriter http.ResponseWriter) {
		logStackOnRecover(s, panicReason, httpWriter)
	})
	// error handler
	gorestfulContainer.ServiceErrorHandler(func(serviceErr restful.ServiceError, request *restful.Request, response *restful.Response) {
		serviceErrorHandler(s, serviceErr, request, response)
	})

	director := director{
		name:               name,
		goRestfulContainer: gorestfulContainer,
		nonGoRestfulMux:    nonGoRestfulMux,
	}

	return &amp;APIServerHandler{
		// build the handler chain,                 note the director being passed in
		FullHandlerChain:   handlerChainBuilder(director),
		GoRestfulContainer: gorestfulContainer,
		NonGoRestfulMux:    nonGoRestfulMux,
		Director:           director,
	}
}


type APIServerHandler struct {
	// the handler chain; the interface is the standard one from the http package:
	// type Handler interface {
	// 	ServeHTTP(ResponseWriter, *Request)
	// }
	// it strings together a series of filters and, once a request has passed through them, calls the Director
	FullHandlerChain http.Handler
	// all registered APIs are handled by this container
	// InstallAPIs uses this field; other servers should not access it directly
	GoRestfulContainer *restful.Container
	// the last handler in the chain; this type wraps a mux and records which URL paths have been registered
	NonGoRestfulMux *mux.PathRecorderMux

	// the Director handles fall-through and proxying
	Director http.Handler
}</pre>
<p>This Handler implements http.Handler and simply delegates requests to the FullHandlerChain:</p>
<pre class="crayon-plain-tag">func (a *APIServerHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	a.FullHandlerChain.ServeHTTP(w, r)
}</pre>
<p>The function above is the entry point for handling every HTTP request.</p>
<p>FullHandlerChain is the result of calling handlerChainBuilder():</p>
<pre class="crayon-plain-tag">var c completedConfig
handlerChainBuilder := func(handler http.Handler) http.Handler {
	return c.BuildHandlerChainFunc(handler, c.Config)
}
apiServerHandler := NewAPIServerHandler(name, c.Serializer, handlerChainBuilder, delegationTarget.UnprotectedHandler())</pre>
<p>completedConfig.BuildHandlerChainFunc comes from its embedded Config:</p>
<pre class="crayon-plain-tag">serverConfig := genericapiserver.NewRecommendedConfig(apiserver.Codecs)
func NewRecommendedConfig(codecs serializer.CodecFactory) *RecommendedConfig {
	return &amp;RecommendedConfig{
		Config: *NewConfig(codecs),
	}
}
func NewConfig(codecs serializer.CodecFactory) *Config {
	defaultHealthChecks := []healthz.HealthChecker{healthz.PingHealthz, healthz.LogHealthz}
	return &amp;Config{
		Serializer:                  codecs,
		BuildHandlerChainFunc:       DefaultBuildHandlerChain,
	// ...
}

func DefaultBuildHandlerChain(apiHandler http.Handler, c *Config) http.Handler {
	// innermost layer: the apiHandler that was passed in, i.e. the director above
	// authorization
	handler := genericapifilters.WithAuthorization(apiHandler, c.Authorization.Authorizer, c.Serializer)
	// request rate limiting
	if c.FlowControl != nil {
		handler = genericfilters.WithPriorityAndFairness(handler, c.LongRunningFunc, c.FlowControl)
	} else {
		handler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc)
	}
	// impersonation
	handler = genericapifilters.WithImpersonation(handler, c.Authorization.Authorizer, c.Serializer)
	// auditing
	handler = genericapifilters.WithAudit(handler, c.AuditBackend, c.AuditPolicyChecker, c.LongRunningFunc)
	// handle authentication failures
	failedHandler := genericapifilters.Unauthorized(c.Serializer, c.Authentication.SupportsBasicAuth)
	failedHandler = genericapifilters.WithFailedAuthenticationAudit(failedHandler, c.AuditBackend, c.AuditPolicyChecker)
	// authentication
	handler = genericapifilters.WithAuthentication(handler, c.Authentication.Authenticator, failedHandler, c.Authentication.APIAudiences)
	// handle CORS requests
	handler = genericfilters.WithCORS(handler, c.CorsAllowedOriginList, nil, nil, nil, "true")
	// timeout handling
	handler = genericfilters.WithTimeoutForNonLongRunningRequests(handler, c.LongRunningFunc, c.RequestTimeout)
	// all long-running requests are added to a wait group, used for graceful shutdown
	handler = genericfilters.WithWaitGroup(handler, c.LongRunningFunc, c.HandlerChainWaitGroup)
	// attach RequestInfo to the Context
	handler = genericapifilters.WithRequestInfo(handler, c.RequestInfoResolver)
	// HTTP GOAWAY handling
	if c.SecureServing != nil &amp;&amp; !c.SecureServing.DisableHTTP2 &amp;&amp; c.GoawayChance &gt; 0 {
		handler = genericfilters.WithProbabilisticGoaway(handler, c.GoawayChance)
	}
	// set Cache-Control to "no-cache, private", since everything the server serves is protected by authn/authz
	handler = genericapifilters.WithCacheControl(handler)
	// panic recovery
	handler = genericfilters.WithPanicRecovery(handler)
	return handler
}</pre>
<p>Here we can clearly see the many filters in the default handler chain, and that the handlers wrap one another in layers rather than forming a linked list.</p>
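<p>The wrapping pattern itself is plain net/http middleware; a tiny standalone illustration (not apiserver code):</p>
<pre class="crayon-plain-tag">// Each "With..." style filter takes the inner handler and returns a new handler that wraps it.
func withRequestLog(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("%s %s", r.Method, r.URL.Path) // the outer layer runs first
		next.ServeHTTP(w, r)                      // then hands the request inward
	})
}

// handler := withRequestLog(innerHandler); repeating this wrapping builds the full chain, outermost last.</pre>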
<p>Finally, let us trace a request from the beginning, starting with the filters in the handler chain:</p>
<pre class="crayon-plain-tag">func (a *APIServerHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	a.FullHandlerChain.ServeHTTP(w, r)
}
// FullHandlerChain is of this type
type HandlerFunc func(ResponseWriter, *Request)
// a clever design: it implements the
// type Handler interface {
//	ServeHTTP(ResponseWriter, *Request)
// }
// interface, while the implementation simply delegates to the function itself
func (f HandlerFunc) ServeHTTP(w ResponseWriter, r *Request) {
	f(w, r)
}

// entering the outermost layer of the handler chain
func withPanicRecovery(handler http.Handler, crashHandler func(http.ResponseWriter, *http.Request, interface{})) http.Handler {
	handler = httplog.WithLogging(handler, httplog.DefaultStacktracePred)
	//                      the handler function
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		defer runtime.HandleCrash(func(err interface{}) {
			crashHandler(w, req, err)
		})

		// forward to the inner handler, which is captured by the closure
		handler.ServeHTTP(w, req)
	})
}

// several intermediate handlers omitted ...

// this handler is worth noting: it stores request information in the context, so inner handlers can use it directly
// the information includes:
type RequestInfo struct {
	// whether this is a request for a resource/subresource
	IsResourceRequest bool
	// the path portion of the URL
	Path string
	// the lowercased HTTP verb
	Verb string
	// the API prefix
	APIPrefix  string
	// the API group
	APIGroup   string
	// the API version
	APIVersion string
	// the namespace
	Namespace  string
	// the resource name, usually the lowercase plural form, not the Kind
	Resource string
	// the requested subresource; a subresource is another resource scoped to the parent resource and may have a different Kind
	// e.g. /pods maps to resource "pods" with Kind "Pod"
	//      /pods/foo/status maps to resource "pods", subresource "status", Kind "Pod"
	// whereas /pods/foo/binding maps to resource "pods", subresource "binding", but Kind "Binding"
	Subresource string
	// for some resources the name is empty; if the request indicates a name (outside the body) it is recorded in this field
	// Parts are the path parts for the request, always starting with /{resource}/{name}
	Parts []string
}
func WithRequestInfo(handler http.Handler, resolver request.RequestInfoResolver) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		ctx := req.Context()
		info, err := resolver.NewRequestInfo(req)
		if err != nil {
			responsewriters.InternalError(w, req, fmt.Errorf("failed to create RequestInfo: %v", err))
			return
		}

		req = req.WithContext(request.WithRequestInfo(ctx, info))

		handler.ServeHTTP(w, req)
	})
}

// several intermediate handlers omitted ...</pre>
<p>After passing through all the filters, the request reaches the last link in the chain, the director. As the name suggests, it is responsible for dispatching the request:</p>
<pre class="crayon-plain-tag">func (d director) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	path := req.URL.Path

	// iterate over all registered WebServices to see whether any of them owns the current URL path
	for _, ws := range d.goRestfulContainer.RegisteredWebServices() {
		switch {
		case ws.RootPath() == "/apis":
			// the paths /apis and /apis/ need special handling: normally they would go to nonGoRestfulMux,
			// but when discovery is enabled they must be handled by goRestfulContainer directly
			if path == "/apis" || path == "/apis/" {
				klog.V(5).Infof("%v: %v %q satisfied by gorestful with webservice %v", d.name, req.Method, path, ws.RootPath())
				d.goRestfulContainer.Dispatch(w, req)
				return
			}
		// if the prefix matches
		case strings.HasPrefix(path, ws.RootPath()):
			if len(path) == len(ws.RootPath()) || path[len(ws.RootPath())] == '/' {
				klog.V(5).Infof("%v: %v %q satisfied by gorestful with webservice %v", d.name, req.Method, path, ws.RootPath())
				// don't use servemux here because gorestful servemuxes get messed up when removing webservices
				// TODO fix gorestful, remove TPRs, or stop using gorestful
				d.goRestfulContainer.Dispatch(w, req)
				return
			}
		}
	}

	// no match found; skip the gorestful container
	d.nonGoRestfulMux.ServeHTTP(w, req)
}</pre>
<p>Requests for non-API resources are handled by the PathRecorderMux:</p>
<pre class="crayon-plain-tag">func (m *PathRecorderMux) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	m.mux.Load().(*pathHandler).ServeHTTP(w, r)
}
func (h *pathHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// pathToHandler records the handlers for all exact-match paths
	//   0 = /healthz/etcd -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   1 = /livez/etcd -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   2 = /metrics -&gt; k8s.io/apiserver/pkg/server/routes.MetricsWithReset.Install.func1
	//   3 = /readyz/shutdown -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   4 = /readyz/ping -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   5 = /debug/pprof/profile -&gt; net/http/pprof.Profile
	//   6 = /healthz/log -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   7 = /livez/log -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   8 = /healthz/ping -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   9 = / -&gt; 
	//   10 = /healthz -&gt; k8s.io/apiserver/pkg/endpoints/metrics.InstrumentHandlerFunc.func1
	//   11 = /debug/flags -&gt; k8s.io/apiserver/pkg/server/routes.DebugFlags.Index-fm
	//   12 = /livez/ping -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   13 = /readyz/log -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   14 = /debug/pprof -&gt; k8s.io/apiserver/pkg/server/routes.redirectTo.func1
	//   15 = /debug/pprof/trace -&gt; net/http/pprof.Trace
	//   16 = /debug/flags/v -&gt; k8s.io/apiserver/pkg/server/routes.StringFlagPutHandler.func1
	//   17 = /readyz/poststarthook/start-sample-server-informers -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   18 = /livez/poststarthook/start-sample-server-informers -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   19 = /openapi/v2 -&gt; github.com/NYTimes/gziphandler.NewGzipLevelAndMinSize.func1.1
	//   20 = /healthz/poststarthook/start-sample-server-informers -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   21 = /readyz/etcd -&gt; k8s.io/apiserver/pkg/server/healthz.adaptCheckToHandler.func1
	//   22 = /index.html -&gt; 
	//   23 = /debug/pprof/symbol -&gt; net/http/pprof.Symbol
	//   24 = /livez -&gt; k8s.io/apiserver/pkg/endpoints/metrics.InstrumentHandlerFunc.func1
	//   25 = /readyz -&gt; k8s.io/apiserver/pkg/endpoints/metrics.InstrumentHandlerFunc.func1
	if exactHandler, ok := h.pathToHandler[r.URL.Path]; ok {
		klog.V(5).Infof("%v: %q satisfied by exact match", h.muxName, r.URL.Path)
		exactHandler.ServeHTTP(w, r)
		return
	}

	// handlers for prefix-matched paths; by default /debug/flags/ and /debug/pprof/
	for _, prefixHandler := range h.prefixHandlers {
		if strings.HasPrefix(r.URL.Path, prefixHandler.prefix) {
			klog.V(5).Infof("%v: %q satisfied by prefix %v", h.muxName, r.URL.Path, prefixHandler.prefix)
			prefixHandler.handler.ServeHTTP(w, r)
			return
		}
	}

	// no handler found: 404
	klog.V(5).Infof("%v: %q satisfied by NotFoundHandler", h.muxName, r.URL.Path)
	h.notFoundHandler.ServeHTTP(w, r)
}</pre>
<p>Requests for API resources, for example the path /apis/wardle.example.com/v1beta1, are handled by go-restful. We will skip go-restful's internal details and instead dig into the <a href="#InstallAPIGroup">installing the APIGroup</a> step mentioned above, since that is where the go-restful handler for each resource is registered:</p>
<pre class="crayon-plain-tag">func (s *GenericAPIServer) InstallAPIGroup(apiGroupInfo *APIGroupInfo) error {
	return s.InstallAPIGroups(apiGroupInfo)
}
func (s *GenericAPIServer) InstallAPIGroups(apiGroupInfos ...*APIGroupInfo) error {
	// iterate over the API groups
	for _, apiGroupInfo := range apiGroupInfos {
		//          install the API resources        the constant /apis
		if err := s.installAPIResources(APIGroupPrefix, apiGroupInfo, openAPIModels); err != nil {
			return fmt.Errorf("unable to install api resources: %v", err)
		}

		// setup discovery
		// Install the version handler.
		// Add a handler at /apis/&lt;groupName&gt; to enumerate all versions supported by this group.
		apiVersionsForDiscovery := []metav1.GroupVersionForDiscovery{}
		for _, groupVersion := range apiGroupInfo.PrioritizedVersions {
			// Check the config to make sure that we elide versions that don't have any resources
			if len(apiGroupInfo.VersionedResourcesStorageMap[groupVersion.Version]) == 0 {
				continue
			}
			apiVersionsForDiscovery = append(apiVersionsForDiscovery, metav1.GroupVersionForDiscovery{
				GroupVersion: groupVersion.String(),
				Version:      groupVersion.Version,
			})
		}
		preferredVersionForDiscovery := metav1.GroupVersionForDiscovery{
			GroupVersion: apiGroupInfo.PrioritizedVersions[0].String(),
			Version:      apiGroupInfo.PrioritizedVersions[0].Version,
		}
		apiGroup := metav1.APIGroup{
			Name:             apiGroupInfo.PrioritizedVersions[0].Group,
			Versions:         apiVersionsForDiscovery,
			PreferredVersion: preferredVersionForDiscovery,
		}

		s.DiscoveryGroupManager.AddGroup(apiGroup)
		s.Handler.GoRestfulContainer.Add(discovery.NewAPIGroupHandler(s.Serializer, apiGroup).WebService())
	}
	return nil
}

// install the API resources
func (s *GenericAPIServer) installAPIResources(apiPrefix string, apiGroupInfo *APIGroupInfo, openAPIModels openapiproto.Models) error {
	// iterate over the versions
	for _, groupVersion := range apiGroupInfo.PrioritizedVersions {
		// skip groups that have no resources
		if len(apiGroupInfo.VersionedResourcesStorageMap[groupVersion.Version]) == 0 {
			klog.Warningf("Skipping API %v because it has no resources.", groupVersion)
			continue
		}

		apiGroupVersion := s.getAPIGroupVersion(apiGroupInfo, groupVersion, apiPrefix)
		if apiGroupInfo.OptionsExternalVersion != nil {
			apiGroupVersion.OptionsExternalVersion = apiGroupInfo.OptionsExternalVersion
		}
		apiGroupVersion.OpenAPIModels = openAPIModels
		apiGroupVersion.MaxRequestBodyBytes = s.maxRequestBodyBytes
		// install as go-restful handlers
		if err := apiGroupVersion.InstallREST(s.Handler.GoRestfulContainer); err != nil {
			return fmt.Errorf("unable to setup API %v: %v", apiGroupInfo, err)
		}
	}

	return nil
}

// register a series of REST handlers (storage, watch, proxy, redirect) into the restful container
func (g *APIGroupVersion) InstallREST(container *restful.Container) error {
	// e.g. /apis/wardle.example.com/v1beta1
	prefix := path.Join(g.Root, g.GroupVersion.Group, g.GroupVersion.Version)
	installer := &amp;APIInstaller{
		group:             g,
		prefix:            prefix,
		minRequestTimeout: g.MinRequestTimeout,
	}
	// install the handlers for the API resources
	apiResources, ws, registrationErrors := installer.Install()
	// install the resource discovery handler
	versionDiscoveryHandler := discovery.NewAPIVersionHandler(g.Serializer, g.GroupVersion, staticLister{apiResources})
	versionDiscoveryHandler.AddToWebService(ws)
	// add the WebService to the container
	container.Add(ws)
	return utilerrors.NewAggregate(registrationErrors)
}
func (a *APIInstaller) Install() ([]metav1.APIResource, *restful.WebService, []error) {
	var apiResources []metav1.APIResource
	var errors []error
	// create a WebService for this particular GV
	ws := a.newWebService()

	// Register the paths in a deterministic (sorted) order to get a deterministic swagger spec.
	paths := make([]string, len(a.group.Storage))
	var i int = 0
	for path := range a.group.Storage {
		paths[i] = path
		i++
	}
	sort.Strings(paths)
	for _, path := range paths {
		// iterate over the resources and register handlers, e.g. for flunders
		apiResource, err := a.registerResourceHandlers(path, a.group.Storage[path], ws)
		if err != nil {
			errors = append(errors, fmt.Errorf("error in registering resource: %s, %v", path, err))
		}
		if apiResource != nil {
			apiResources = append(apiResources, *apiResource)
		}
	}
	return apiResources, ws, errors
}
func (a *APIInstaller) newWebService() *restful.WebService {
	ws := new(restful.WebService)
	// the path template for this WebService
	ws.Path(a.prefix)
	// a.prefix contains "prefix/group/version"
	ws.Doc("API at " + a.prefix)
	// for backward compatibility, requests without a MIME type are accepted
	ws.Consumes("*/*")
	// the MIME types supported for responses are determined by the codecs used by the API group
	//   0 = {string} "application/json"
	//   1 = {string} "application/yaml"
	//   2 = {string} "application/vnd.kubernetes.protobuf"
	mediaTypes, streamMediaTypes := negotiation.MediaTypesForSerializer(a.group.Serializer)
	ws.Produces(append(mediaTypes, streamMediaTypes...)...)
	// e.g. wardle.example.com/v1beta1
	ws.ApiVersion(a.group.GroupVersion.String())
	return ws
}
// registers the resource handlers; it is very long, so only fragments are shown
func (a *APIInstaller) registerResourceHandlers(path string, storage rest.Storage, ws *restful.WebService) (*metav1.APIResource, error) {
	// ...
	// the creater is where the handler's core logic lives; it comes from rest.Storage and talks to etcd
	creater, isCreater := storage.(rest.Creater)
	// ...
	actions = appendIf(actions, action{"POST", resourcePath, resourceParams, namer, false}, isCreater)
	switch action.Verb {
		case "POST": 
			handler = restfulCreateResource(creater, reqScope, admit)

			route := ws.POST(action.Path).To(handler).
				Doc(doc).
				Param(ws.QueryParameter("pretty", "If 'true', then the output is pretty printed.")).
				Operation("create"+namespaced+kind+strings.Title(subresource)+operationSuffix).
				Produces(append(storageMeta.ProducesMIMETypes(action.Verb), mediaTypes...)...).
				Returns(http.StatusOK, "OK", producedObject).
				// TODO: in some cases, the API may return a v1.Status instead of the versioned object
				// but currently go-restful can't handle multiple different objects being returned.
				Returns(http.StatusCreated, "Created", producedObject).
				Returns(http.StatusAccepted, "Accepted", producedObject).
				Reads(defaultVersionedObject).
				Writes(producedObject)
			if err := AddObjectParams(ws, route, versionedCreateOptions); err != nil {
				return nil, err
			}
			addParams(route, action.Params)
			routes = append(routes, route)
	// ...
}
func restfulCreateResource(r rest.Creater, scope handlers.RequestScope, admit admission.Interface) restful.RouteFunction {
	return func(req *restful.Request, res *restful.Response) {
		handlers.CreateResource(r, &amp;scope, admit)(res.ResponseWriter, req.Request)
	}
}
func CreateResource(r rest.Creater, scope *RequestScope, admission admission.Interface) http.HandlerFunc {
	return createHandler(&amp;namedCreaterAdapter{r}, scope, admission, false)
}
// the core of the create-resource handler
func createHandler(r rest.NamedCreater, scope *RequestScope, admit admission.Interface, includeName bool) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		// For performance tracking purposes.
		trace := utiltrace.New("Create", utiltrace.Field{Key: "url", Value: req.URL.Path}, utiltrace.Field{Key: "user-agent", Value: &amp;lazyTruncatedUserAgent{req}}, utiltrace.Field{Key: "client", Value: &amp;lazyClientIP{req}})
		defer trace.LogIfLong(500 * time.Millisecond)
		// Handle dry-run
		if isDryRun(req.URL) &amp;&amp; !utilfeature.DefaultFeatureGate.Enabled(features.DryRun) {
			scope.err(errors.NewBadRequest("the dryRun feature is disabled"), w, req)
			return
		}

		// TODO: we either want to remove timeout or document it (if we document, move timeout out of this function and declare it in api_installer)
		timeout := parseTimeout(req.URL.Query().Get("timeout"))
		// Resolve the namespace and resource name
		namespace, name, err := scope.Namer.Name(req)
		if err != nil {
			if includeName {
				// name was required, return
				scope.err(err, w, req)
				return
			}

			// otherwise attempt to look up the namespace
			namespace, err = scope.Namer.Namespace(req)
			if err != nil {
				scope.err(err, w, req)
				return
			}
		}

		ctx, cancel := context.WithTimeout(req.Context(), timeout)
		defer cancel()
		ctx = request.WithNamespace(ctx, namespace)
		// Negotiate the output MIME type
		outputMediaType, _, err := negotiation.NegotiateOutputMediaType(req, scope.Serializer, scope)
		if err != nil {
			scope.err(err, w, req)
			return
		}

		gv := scope.Kind.GroupVersion()
		// Negotiate how to deserialize the input
		s, err := negotiation.NegotiateInputSerializer(req, false, scope.Serializer)
		if err != nil {
			scope.err(err, w, req)
			return
		}
		// From the serializer, obtain a decoder that decodes the request into the hub version
		decoder := scope.Serializer.DecoderToVersion(s.Serializer, scope.HubGroupVersion)

		// Read the request body
		body, err := limitedReadBody(req, scope.MaxRequestBodyBytes)
		if err != nil {
			scope.err(err, w, req)
			return
		}

		options := &amp;metav1.CreateOptions{}
		values := req.URL.Query()
		if err := metainternalversionscheme.ParameterCodec.DecodeParameters(values, scope.MetaGroupVersion, options); err != nil {
			err = errors.NewBadRequest(err.Error())
			scope.err(err, w, req)
			return
		}
		if errs := validation.ValidateCreateOptions(options); len(errs) &gt; 0 {
			err := errors.NewInvalid(schema.GroupKind{Group: metav1.GroupName, Kind: "CreateOptions"}, "", errs)
			scope.err(err, w, req)
			return
		}
		options.TypeMeta.SetGroupVersionKind(metav1.SchemeGroupVersion.WithKind("CreateOptions"))

		defaultGVK := scope.Kind
		// Instantiate the resource's Go struct
		original := r.New()
		trace.Step("About to convert to expected version")
		// Decode the request body into the Go struct
		obj, gvk, err := decoder.Decode(body, &amp;defaultGVK, original)
		if err != nil {
			err = transformDecodeError(scope.Typer, err, original, gvk, body)
			scope.err(err, w, req)
			return
		}
		if gvk.GroupVersion() != gv {
			err = errors.NewBadRequest(fmt.Sprintf("the API version in the data (%s) does not match the expected API version (%v)", gvk.GroupVersion().String(), gv.String()))
			scope.err(err, w, req)
			return
		}
		trace.Step("Conversion done")

		// Audit and admission control
		ae := request.AuditEventFrom(ctx)
		admit = admission.WithAudit(admit, ae)
		audit.LogRequestObject(ae, obj, scope.Resource, scope.Subresource, scope.Serializer)

		userInfo, _ := request.UserFrom(ctx)

		// On create, get name from new object if unset
		if len(name) == 0 {
			_, name, _ = scope.Namer.ObjectName(obj)
		}

		trace.Step("About to store object in database")
		admissionAttributes := admission.NewAttributesRecord(obj, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Create, options, dryrun.IsDryRun(options.DryRun), userInfo)
		// Build the function that persists the object
		requestFunc := func() (runtime.Object, error) {
			// Call rest.Storage to persist the object
			return r.Create(
				ctx,
				name,
				obj,
				rest.AdmissionToValidateObjectFunc(admit, admissionAttributes, scope),
				options,
			)
		}
		// finishRequest runs the callback asynchronously and handles any error it returns
		result, err := finishRequest(timeout, func() (runtime.Object, error) {
			if scope.FieldManager != nil {
				liveObj, err := scope.Creater.New(scope.Kind)
				if err != nil {
					return nil, fmt.Errorf("failed to create new object (Create for %v): %v", scope.Kind, err)
				}
				obj = scope.FieldManager.UpdateNoErrors(liveObj, obj, managerOrUserAgent(options.FieldManager, req.UserAgent()))
			}
			if mutatingAdmission, ok := admit.(admission.MutationInterface); ok &amp;&amp; mutatingAdmission.Handles(admission.Create) {
				if err := mutatingAdmission.Admit(ctx, admissionAttributes, scope); err != nil {
					return nil, err
				}
			}
			result, err := requestFunc()
			// If the object wasn't committed to storage because it's serialized size was too large,
			// it is safe to remove managedFields (which can be large) and try again.
			if isTooLargeError(err) {
				if accessor, accessorErr := meta.Accessor(obj); accessorErr == nil {
					accessor.SetManagedFields(nil)
					result, err = requestFunc()
				}
			}
			return result, err
		})
		if err != nil {
			scope.err(err, w, req)
			return
		}
		trace.Step("Object stored in database")

		code := http.StatusCreated
		status, ok := result.(*metav1.Status)
		if ok &amp;&amp; err == nil &amp;&amp; status.Code == 0 {
			status.Code = int32(code)
		}

		transformResponseObject(ctx, scope, trace, req, w, code, outputMediaType, result)
	}
}</pre>
<p>To recap the overall request-handling flow:</p>
<ol style="list-style-type: undefined;">
<li>GenericAPIServer.Handler is just an http.Handler, so it can be registered with any HTTP server. Bypassing the HTTPS restriction should therefore be easy</li>
<li>GenericAPIServer.Handler is a chain of nested handlers: a series of filters on the outside, with the director at the core</li>
<li>The director performs the overall request dispatch:
<ol>
<li>requests for non-API resources go to nonGoRestfulMux; we can use this extension point to expose arbitrary HTTP endpoints</li>
<li>requests for API resources go to gorestfulContainer</li>
</ol>
</li>
<li>In GenericAPIServer.InstallAPIGroup, every version of every supported API resource is registered as a go-restful WebService</li>
<li>The logic of these WebServices (backed by rest.Storage) includes:
<ol>
<li>decoding the request into the resource's Go struct</li>
<li>encoding the Go struct as JSON</li>
<li>storing the JSON in Etcd</li>
</ol>
</li>
</ol>
<div class="blog_h2"><span class="graybg">sample-apiserver小结</span></div>
<p>通过对sample-apiserver的代码分析，我们理解了构建自己的API Server的各种关键要素。</p>
<p>APIServer的核心类型是<pre class="crayon-plain-tag">GenericAPIServer</pre>，它是由<pre class="crayon-plain-tag">genericapiserver.CompletedConfig</pre>的<pre class="crayon-plain-tag">New()</pre>方法生成的。后者则是<pre class="crayon-plain-tag">genericapiserver.RecommendedConfig</pre>的<pre class="crayon-plain-tag">Complete()</pre>方法生成的。而RecommendedConfig又是从<pre class="crayon-plain-tag">genericoptions.RecommendedOptions</pre>得到的。sample-apiserver对Config、Option、Server等对象都做了一层包装，我们不关注这些wrapper。</p>
<p>RecommendedOptions对应了用户提供的各类选项（外加所谓推荐选项，降低使用时的复杂度），例如Etcd地址、Etcd存储前缀、APIServer的基本信息等等。调用RecommendedOptions的<pre class="crayon-plain-tag">ApplyTo</pre>方法，会根据选项，推导出APIServer所需的，完整的配置信息。在这个方法中，甚至会进行自签名证书等重操作，而不是简单的将信息从Option复制给Config。RecommendedOptions会依次调用它的各个字段的ApplyTo方法，从而推导出RecommendedConfig的各个字段。</p>
<p>RecommendedConfig的Complete方法，再一次进行配置信息的推导，主要牵涉到OpenAPI相关的配置。</p>
<p>CompletedConfig的New方法实例化GenericAPIServer，这一步最关键的逻辑是安装API组。API组定义了如何实现GroupVersion中API的增删改查，它将GroupVersion的每种资源映射到registry.REST，后者具有处理REST风格请求的能力，并（默认）存储到Etcd。</p>
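<p>For orientation, here is a condensed, illustrative sketch of that chain without sample-apiserver's wrappers; the storage prefix and server name are placeholders, not values taken from sample-apiserver, and error handling is minimal:</p>
<pre class="crayon-plain-tag">// Options -> Config -> CompletedConfig -> GenericAPIServer
options := genericoptions.NewRecommendedOptions("/registry/example.gmem.cc", codec)
config := genericapiserver.NewRecommendedConfig(codecFactory)
if err := options.ApplyTo(config); err != nil {
	panic(err)
}
completedConfig := config.Complete()
server, err := completedConfig.New("example-apiserver", genericapiserver.NewEmptyDelegate())
if err != nil {
	panic(err)
}
_ = server // install API groups and hooks before running</pre>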
<p>GenericAPIServer provides hooks for registering and initializing Admission controllers, as well as hooks for reacting to the API Server's lifecycle events.</p>
<div class="blog_h2"><span class="graybg">sample-apiserver改造</span></div>
<div class="blog_h3"><span class="graybg">解除对kube-apiserver的依赖</span></div>
<p>想实现sample-apiserver的独立运行，RecommendedOptions有三个字段必须处理：Authentication、Authorization、CoreAPI，它们都隐含了对主kube-apiserver的依赖。</p>
<p>Authentication依赖主kube-apiserver，是因为它需要访问TokenReviewInterface，访问kube-system中的ConfigMap。Authorization依赖主kube-apiserver，是因为它需要访问SubjectAccessReviewInterface。CoreAPI则是直接为Config提供了两个字段：ClientConfig、SharedInformerFactory。</p>
<p>将这些字段置空，可以解除对主kube-apiserver的依赖。这样启动sample-apiserver时就不需要提供这三个命令行选项：</p>
<p style="padding-left: 30px;">--kubeconfig=/home/alex/.kube/config<br />--authentication-kubeconfig=/home/alex/.kube/config<br />--authorization-kubeconfig=/home/alex/.kube/config</p>
<p>但是，置空CoreAPI会导致报错：admission depends on a Kubernetes core API shared informer, it cannot be nil。这提示我们不能在不依赖主kube-apiserver的情况下使用Admission控制器这一特性，需要将Admission也置空：</p>
<pre class="crayon-plain-tag">o.RecommendedOptions.Authentication = nil
o.RecommendedOptions.Authorization = nil
o.RecommendedOptions.CoreAPI = nil
o.RecommendedOptions.Admission = nil</pre>
<p>After clearing these four fields, sample-apiserver still panics in a PostStart hook:</p>
<pre class="crayon-plain-tag">// panic，这个SharedInformerFactory是CoreAPI选项提供的
config.GenericConfig.SharedInformerFactory.Start(context.StopCh)
// 仅仅Admission控制器使用该InformerFactory
o.SharedInformerFactory.Start(context.StopCh)</pre>
<p>For the reasons given in the comments, this PostStart hook no longer serves any purpose; deleting it lets the server start normally.</p>
<div class="blog_h3"><span class="graybg">使用HTTP而非HTTPS</span></div>
<p>GenericAPIServer的<pre class="crayon-plain-tag">Run</pre>方法的默认实现，是调用<pre class="crayon-plain-tag">s.SecureServingInfo.Serve</pre>，因而强制使用HTTPS：</p>
<pre class="crayon-plain-tag">stoppedCh, err = s.SecureServingInfo.Serve(s.Handler, s.ShutdownTimeout, internalStopCh)</pre>
<p>Obviously, though, all we need to do is hand s.Handler to our own http.Server to serve plain HTTP.</p>
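<p>A minimal sketch of that idea (the port is arbitrary; the complete example later in this article does exactly this):</p>
<pre class="crayon-plain-tag">// serve the fully built handler chain over plain HTTP instead of calling Run()
preparedServer := server.PrepareRun()
if err := http.ListenAndServe(":6080", preparedServer.Handler); err != nil {
	panic(err)
}</pre>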
<div class="blog_h3"><span class="graybg">添加任意HTTP接口</span></div>
<p>我们的迁移工具还提供一些非Kubernetes风格的HTTP接口，那么如何集成到APIServer中呢？</p>
<p>在启动服务器之前，可以直接访问<pre class="crayon-plain-tag">GenericAPIServer.Handler.NonGoRestfulMux</pre>，NonGoRestfulMux实现了：</p>
<pre class="crayon-plain-tag">type mux interface {
	Handle(pattern string, handler http.Handler)
}</pre>
<p>Calling Handle registers a handler for any path.</p>
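<p>A sketch of registering such an endpoint before the server starts; the path and response below are made up for illustration:</p>
<pre class="crayon-plain-tag">server.Handler.NonGoRestfulMux.Handle("/migration/progress",
	http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte(`{"progress": 42}`))
	}))</pre>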
<div class="blog_h1"><span class="graybg">apiserver-builder-alpha</span></div>
<p>通过对sample-apiserver代码的分析，我们了解到构建自己的API Server有大量繁琐的工作需要做。幸运的是，K8S提供了<a href="https://github.com/kubernetes-sigs/apiserver-builder-alpha">apiserver-builder-alpha</a>简化这一过程。</p>
<p>apiserver-builder-alpha是一系列工具和库的集合，它能够：</p>
<ol>
<li>generate Go types, controllers (based on controller-runtime), test cases, and documentation for new API resources</li>
<li>build and run the extension control-plane component (the APIServer) standalone, in Minikube, or inside a Kubernetes cluster</li>
<li>make it easier to watch/update resources from controllers</li>
<li>make it easier to create new resources and subresources</li>
<li>provide sensible defaults for most settings</li>
</ol>
<div class="blog_h2"><span class="graybg">安装</span></div>
<p>下载<a href="https://github.com/kubernetes-sigs/apiserver-builder-alpha/releases">压缩包</a>，解压并存放到目录，然后设置环境变量：</p>
<pre class="crayon-plain-tag">export PATH=$HOME/.local/kubernetes/apiserver-builder/bin/:$PATH</pre>
<div class="blog_h2"><span class="graybg">起步</span></div>
<p>你需要在$GOPATH下创建一个项目，创建一个boilerplate.go.txt文件。然后执行：</p>
<pre class="crayon-plain-tag">apiserver-boot init repo --domain cloud.gmem.cc</pre>
<p>The command generates the following directory layout:</p>
<pre class="crayon-plain-tag">.
├── bin
├── boilerplate.go.txt
├── BUILD.bazel
├── cmd
│   ├── apiserver
│   │   └── main.go
│   └── manager
│       └── main.go
├── go.mod
├── go.sum
├── pkg
│   ├── apis
│   │   └── doc.go
│   ├── controller
│   │   └── doc.go
│   ├── doc.go
│   ├── openapi
│   │   └── doc.go
│   └── webhook
│       └── webhook.go
├── PROJECT
└── WORKSPACE</pre>
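<p>The boilerplate.go.txt mentioned above is just the comment header that generated files will carry; any Go comment block works, for example (the text is illustrative, not mandated by the tool):</p>
<pre class="crayon-plain-tag">/*
Copyright The Example Authors.
SPDX-License-Identifier: Apache-2.0
*/</pre>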
<p>cmd/apiserver/main.go is the entry point of the APIServer:</p>
<pre class="crayon-plain-tag">import "sigs.k8s.io/apiserver-builder-alpha/pkg/cmd/server"

func main() {
	version := "v0"

	err := server.StartApiServerWithOptions(&amp;server.StartOptions{
		EtcdPath:         "/registry/cloud.gmem.cc",
		//                does not build: this function does not exist
		Apis:             apis.GetAllApiBuilders(),
		Openapidefs:      openapi.GetOpenAPIDefinitions,
		Title:            "Api",
		Version:          version,

		// TweakConfigFuncs []func(apiServer *apiserver.Config) error
		// FlagConfigFuncs []func(*cobra.Command) error
	})
	if err != nil {
		panic(err)
	}
}</pre>
<p>As you can see, apiserver-builder-alpha adds a layer of wrapping of its own.</p>
<p>Run the following command to add a new API resource:</p>
<pre class="crayon-plain-tag">apiserver-boot create group version resource --group tcm --version v1 --kind Flunder</pre>
<p>Finally, this command starts the APIServer locally:</p>
<pre class="crayon-plain-tag">apiserver-boot run local</pre>
<div class="blog_h2"><span class="graybg">问题 </span></div>
<p>本文提及的工具项目，最初架构是基于CRD，使用kubebuilder进行代码生成。kubebuilder的目录结构和apiserver-builder并不兼容。</p>
<p>此外apiserver-builder项目仍然处于Alpha阶段，并且经过测试，发现生成代码无法运行。为了避免不必要的麻烦，我们不打算使用它。</p>
<div class="blog_h1"><span class="graybg">编写APIServer</span></div>
<p>由于apiserver-builder不成熟，而且我们已经基于kubebuilder完成了大部分开发工作。因此打算基于分析sample-apiserver获得的经验，手工编写一个独立运行、使用HTTP协议的APIServer。</p>
<p>kubebuilder并不会生成zz_generated.openapi.go文件，因为该文件对于CRD没有意义。但是这个文件对于独立API Server是必须的。</p>
<p>我们需要为资源类型所在包添加注解：</p>
<pre class="crayon-plain-tag">// +k8s:openapi-gen=true

package v1</pre>
<p>and invoke <a href="https://blog.gmem.cc/openapi#openapi-gen">openapi-gen</a> to generate the file:</p>
<pre class="crayon-plain-tag">openapi-gen  \
	--input-dirs "k8s.io/apimachinery/pkg/apis/meta/v1,k8s.io/apimachinery/pkg/runtime,k8s.io/apimachinery/pkg/version" \
	--input-dirs cloud.gmem.cc/teleport/api/v1    -p cloud.gmem.cc/teleport/api/v1 -O zz_generated.openapi</pre>
<p>Below is the complete quick-and-dirty code:</p>
<pre class="crayon-plain-tag">package main

import (
	v1 "cloud.gmem.cc/teleport/api/v1"
	"context"
	"fmt"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/runtime/serializer"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/apimachinery/pkg/util/validation/field"
	"k8s.io/apiserver/pkg/endpoints/openapi"
	"k8s.io/apiserver/pkg/features"
	"k8s.io/apiserver/pkg/registry/generic"
	genericregistry "k8s.io/apiserver/pkg/registry/generic/registry"
	"k8s.io/apiserver/pkg/registry/rest"
	genericapiserver "k8s.io/apiserver/pkg/server"
	genericoptions "k8s.io/apiserver/pkg/server/options"
	"k8s.io/apiserver/pkg/storage"
	"k8s.io/apiserver/pkg/storage/names"
	"k8s.io/apiserver/pkg/util/feature"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"net"
	"net/http"
	"os"
	"reflect"
	ctrl "sigs.k8s.io/controller-runtime"
)

var (
	setupLog = ctrl.Log.WithName("setup")
)

func main() {

	s := runtime.NewScheme()
	utilruntime.Must(v1.AddToScheme(s))
	gv := v1.GroupVersion
	utilruntime.Must(s.SetVersionPriority(gv))
	metav1.AddToGroupVersion(s, schema.GroupVersion{Version: "v1"})
	unversioned := schema.GroupVersion{Group: "", Version: "v1"}
	s.AddUnversionedTypes(unversioned,
		&amp;metav1.Status{},
		&amp;metav1.APIVersions{},
		&amp;metav1.APIGroupList{},
		&amp;metav1.APIGroup{},
		&amp;metav1.APIResourceList{},
	)

	//  An __internal version must be registered, otherwise this error is reported:
	//  failed to prepare current and previous objects: no kind "Flunder" is registered for the internal version of group "tcm.cloud.gmem.cc" in scheme 
	gvi := gv
	gvi.Version = runtime.APIVersionInternal
	s.AddKnownTypes(gvi, &amp;v1.Flunder{}, &amp;v1.FlunderList{})

	codecFactory := serializer.NewCodecFactory(s)
	codec := codecFactory.LegacyCodec(gv)
	options := genericoptions.NewRecommendedOptions(
		"/teleport/cloud.gmem.cc",
		codec,
	)
	options.Etcd.StorageConfig.EncodeVersioner = runtime.NewMultiGroupVersioner(gv, schema.GroupKind{Group: gv.Group})
	ips := []net.IP{net.ParseIP("127.0.0.1")}
	if err := options.SecureServing.MaybeDefaultWithSelfSignedCerts("localhost", nil, ips); err != nil {
		setupLog.Error(err, "error creating self-signed certificates")
		os.Exit(1)
	}
	options.Etcd.StorageConfig.Paging = utilfeature.DefaultFeatureGate.Enabled(features.APIListChunking)
	options.Etcd.StorageConfig.Transport.ServerList = []string{"http://etcd.gmem.cc:2379"}

	options.Authentication = nil
	options.Authorization = nil
	options.CoreAPI = nil
	options.Admission = nil
	options.SecureServing.BindPort = 6443

	config := genericapiserver.NewRecommendedConfig(codecFactory)
	config.OpenAPIConfig = genericapiserver.DefaultOpenAPIConfig(v1.GetOpenAPIDefinitions,
		openapi.NewDefinitionNamer(s))
	config.OpenAPIConfig.Info.Title = "Teleport"
	config.OpenAPIConfig.Info.Version = "1.0"

	feature.DefaultMutableFeatureGate.SetFromMap(map[string]bool{
		string(features.APIPriorityAndFairness): false,
	})

	if err := options.ApplyTo(config); err != nil {
		panic(err)
	}
	completedConfig := config.Complete()
	server, err := completedConfig.New("teleport-apiserver", genericapiserver.NewEmptyDelegate())
	if err != nil {
		panic(err)
	}

	apiGroupInfo := genericapiserver.NewDefaultAPIGroupInfo(gv.Group, s, metav1.ParameterCodec, codecFactory)
	v1storage := map[string]rest.Storage{}
	resource := v1.ResourceFlunders
	v1storage[resource] = createStore(
		s,
		gv.WithResource(resource).GroupResource(),
		func() runtime.Object { return &amp;v1.Flunder{} },
		func() runtime.Object { return &amp;v1.FlunderList{} },
		completedConfig.RESTOptionsGetter,
	)
	apiGroupInfo.VersionedResourcesStorageMap[gv.Version] = v1storage
	if err := server.InstallAPIGroups(&amp;apiGroupInfo); err != nil {
		panic(err)
	}
	server.AddPostStartHookOrDie("teleport-post-start", func(context genericapiserver.PostStartHookContext) error {
		return nil
	})
	preparedServer := server.PrepareRun()
	http.ListenAndServe(":6080", preparedServer.Handler)
}

func createStore(scheme *runtime.Scheme, gr schema.GroupResource, newFunc, newListFunc func() runtime.Object,
	optsGetter generic.RESTOptionsGetter) rest.Storage {
	attrs := func(obj runtime.Object) (labels.Set, fields.Set, error) {
		typ := reflect.TypeOf(newFunc())
		if reflect.TypeOf(obj) != typ {
			return nil, nil, fmt.Errorf("given object is not a %s", typ.Name())
		}
		oma := obj.(metav1.ObjectMetaAccessor)
		meta := oma.GetObjectMeta()
		return meta.GetLabels(), fields.Set{
			"metadata.name":      meta.GetName(),
			"metadata.namespace": meta.GetNamespace(),
		}, nil
	}
	s := strategy{
		scheme,
		names.SimpleNameGenerator,
	}
	store := &amp;genericregistry.Store{
		NewFunc:     newFunc,
		NewListFunc: newListFunc,
		PredicateFunc: func(label labels.Selector, field fields.Selector) storage.SelectionPredicate {
			return storage.SelectionPredicate{
				Label:    label,
				Field:    field,
				GetAttrs: attrs,
			}
		},
		DefaultQualifiedResource: gr,

		CreateStrategy: s,
		UpdateStrategy: s,
		DeleteStrategy: s,

		TableConvertor: rest.NewDefaultTableConvertor(gr),
	}
	options := &amp;generic.StoreOptions{RESTOptions: optsGetter, AttrFunc: attrs}
	if err := store.CompleteWithOptions(options); err != nil {
		panic(err)
	}
	return store
}

type strategy struct {
	runtime.ObjectTyper
	names.NameGenerator
}

func (strategy) NamespaceScoped() bool {
	return true
}

func (strategy) PrepareForCreate(ctx context.Context, obj runtime.Object) {
}

func (strategy) PrepareForUpdate(ctx context.Context, obj, old runtime.Object) {
}

func (strategy) Validate(ctx context.Context, obj runtime.Object) field.ErrorList {
	return field.ErrorList{}
}

func (strategy) AllowCreateOnUpdate() bool {
	return false
}

func (strategy) AllowUnconditionalUpdate() bool {
	return false
}

func (strategy) Canonicalize(obj runtime.Object) {
}

func (strategy) ValidateUpdate(ctx context.Context, obj, old runtime.Object) field.ErrorList {
	return field.ErrorList{}
}</pre>
<div class="blog_h1"><span class="graybg">定制存储后端 </span></div>
<p>在安装APIGroup的时候，我们需要为每API组的每个版本的每种资源，指定存储后端：</p>
<pre class="crayon-plain-tag">// 每个组
apiGroupInfo := genericapiserver.NewDefaultAPIGroupInfo(wardle.GroupName, Scheme, metav1.ParameterCodec, Codecs)
// 每个版本
v1alpha1storage := map[stcongring]rest.Storage{}
// 每种资源提供一个rest.Storage
v1alpha1storage["flunders"] = wardleregistry.RESTInPeace(flunderstorage.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter))

// 安装APIGroup
s.GenericAPIServer.InstallAPIGroup(&amp;apiGroupInfo)</pre>
<p>By default genericregistry.Store is used, which is backed by Etcd. To implement your own storage backend, simply implement the relevant interfaces.</p>
<p>Note: there are many details a storage backend has to take care of.</p>
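<p>For reference, these are the rest interfaces that a minimal read/write backend typically satisfies; the file-based example in the next section asserts essentially the same set at compile time:</p>
<pre class="crayon-plain-tag">var _ rest.Storage         = &amp;store{} // New()
var _ rest.Scoper          = &amp;store{} // NamespaceScoped()
var _ rest.Getter          = &amp;store{} // Get()
var _ rest.Lister          = &amp;store{} // NewList(), List()
var _ rest.CreaterUpdater  = &amp;store{} // Create(), Update()
var _ rest.GracefulDeleter = &amp;store{} // Delete()
var _ rest.Watcher         = &amp;store{} // Watch()</pre>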
<div class="blog_h2"><span class="graybg">基于文件的存储</span></div>
<p>下面贴一个在文件系统中，以YAML形式存储API资源的例子：</p>
<pre class="crayon-plain-tag">package file

import (
	"bytes"
	"context"
	"fmt"
	"io/ioutil"
	"k8s.io/apimachinery/pkg/util/uuid"
	"os"
	"path/filepath"
	"reflect"
	"strings"
	"sync"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	metainternalversion "k8s.io/apimachinery/pkg/apis/meta/internalversion"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/conversion"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	genericapirequest "k8s.io/apiserver/pkg/endpoints/request"
	"k8s.io/apiserver/pkg/registry/rest"
)

var _ rest.StandardStorage = &amp;store{}
var _ rest.Scoper = &amp;store{}
var _ rest.Storage = &amp;store{}

// NewStore instantiates a new file storage
func NewStore(groupResource schema.GroupResource, codec runtime.Codec, rootpath string, isNamespaced bool,
	newFunc func() runtime.Object, newListFunc func() runtime.Object, tc rest.TableConvertor) rest.Storage {
	objRoot := filepath.Join(rootpath, groupResource.Group, groupResource.Resource)
	if err := ensureDir(objRoot); err != nil {
		panic(fmt.Sprintf("unable to write data dir: %s", err))
	}
	rest := &amp;store{
		defaultQualifiedResource: groupResource,
		TableConvertor:           tc,
		codec:                    codec,
		objRootPath:              objRoot,
		isNamespaced:             isNamespaced,
		newFunc:                  newFunc,
		newListFunc:              newListFunc,
		watchers:                 make(map[int]*yamlWatch, 10),
	}
	return rest
}

type store struct {
	rest.TableConvertor
	codec        runtime.Codec
	objRootPath  string
	isNamespaced bool

	muWatchers sync.RWMutex
	watchers   map[int]*yamlWatch

	newFunc                  func() runtime.Object
	newListFunc              func() runtime.Object
	defaultQualifiedResource schema.GroupResource
}

func (f *store) notifyWatchers(ev watch.Event) {
	f.muWatchers.RLock()
	for _, w := range f.watchers {
		w.ch &lt;- ev
	}
	f.muWatchers.RUnlock()
}

func (f *store) New() runtime.Object {
	return f.newFunc()
}

func (f *store) NewList() runtime.Object {
	return f.newListFunc()
}

func (f *store) NamespaceScoped() bool {
	return f.isNamespaced
}

func (f *store) Get(ctx context.Context, name string, options *metav1.GetOptions) (runtime.Object, error) {
	return read(f.codec, f.objectFileName(ctx, name), f.newFunc)
}

func (f *store) List(ctx context.Context, options *metainternalversion.ListOptions) (runtime.Object, error) {
	newListObj := f.NewList()
	v, err := getListPrt(newListObj)
	if err != nil {
		return nil, err
	}

	dirname := f.objectDirName(ctx)
	if err := visitDir(dirname, f.newFunc, f.codec, func(path string, obj runtime.Object) {
		appendItem(v, obj)
	}); err != nil {
		return nil, fmt.Errorf("failed walking filepath %v", dirname)
	}
	return newListObj, nil
}

func (f *store) Create(ctx context.Context, obj runtime.Object, createValidation rest.ValidateObjectFunc,
	options *metav1.CreateOptions) (runtime.Object, error) {
	if createValidation != nil {
		if err := createValidation(ctx, obj); err != nil {
			return nil, err
		}
	}
	if f.isNamespaced {
		ns, ok := genericapirequest.NamespaceFrom(ctx)
		if !ok {
			return nil, apierrors.NewBadRequest("namespace required")
		}
		if err := ensureDir(filepath.Join(f.objRootPath, ns)); err != nil {
			return nil, err
		}
	}

	accessor, err := meta.Accessor(obj)
	if err != nil {
		return nil, err
	}
	if accessor.GetUID() == "" {
		accessor.SetUID(uuid.NewUUID())
	}

	name := accessor.GetName()
	filename := f.objectFileName(ctx, name)
	qualifiedResource := f.qualifiedResourceFromContext(ctx)
	if exists(filename) {
		return nil, apierrors.NewAlreadyExists(qualifiedResource, name)
	}

	if err := write(f.codec, filename, obj); err != nil {
		return nil, apierrors.NewInternalError(err)
	}

	f.notifyWatchers(watch.Event{
		Type:   watch.Added,
		Object: obj,
	})

	return obj, nil
}

func (f *store) Update(ctx context.Context, name string, objInfo rest.UpdatedObjectInfo,
	createValidation rest.ValidateObjectFunc, updateValidation rest.ValidateObjectUpdateFunc,
	forceAllowCreate bool, options *metav1.UpdateOptions) (runtime.Object, bool, error) {
	isCreate := false
	oldObj, err := f.Get(ctx, name, nil)
	if err != nil {
		if !forceAllowCreate {
			return nil, false, err
		}
		isCreate = true
	}

	if f.isNamespaced {
		// ensures namespace dir
		ns, ok := genericapirequest.NamespaceFrom(ctx)
		if !ok {
			return nil, false, apierrors.NewBadRequest("namespace required")
		}
		if err := ensureDir(filepath.Join(f.objRootPath, ns)); err != nil {
			return nil, false, err
		}
	}

	updatedObj, err := objInfo.UpdatedObject(ctx, oldObj)
	if err != nil {
		return nil, false, err
	}
	filename := f.objectFileName(ctx, name)

	if isCreate {
		if createValidation != nil {
			if err := createValidation(ctx, updatedObj); err != nil {
				return nil, false, err
			}
		}
		if err := write(f.codec, filename, updatedObj); err != nil {
			return nil, false, err
		}
		f.notifyWatchers(watch.Event{
			Type:   watch.Added,
			Object: updatedObj,
		})
		return updatedObj, true, nil
	}

	if updateValidation != nil {
		if err := updateValidation(ctx, updatedObj, oldObj); err != nil {
			return nil, false, err
		}
	}
	if err := write(f.codec, filename, updatedObj); err != nil {
		return nil, false, err
	}
	f.notifyWatchers(watch.Event{
		Type:   watch.Modified,
		Object: updatedObj,
	})
	return updatedObj, false, nil
}

func (f *store) Delete(ctx context.Context, name string, deleteValidation rest.ValidateObjectFunc,
	options *metav1.DeleteOptions) (runtime.Object, bool, error) {
	filename := f.objectFileName(ctx, name)
	qualifiedResource := f.qualifiedResourceFromContext(ctx)
	if !exists(filename) {
		return nil, false, apierrors.NewNotFound(qualifiedResource, name)
	}

	oldObj, err := f.Get(ctx, name, nil)
	if err != nil {
		return nil, false, err
	}
	if deleteValidation != nil {
		if err := deleteValidation(ctx, oldObj); err != nil {
			return nil, false, err
		}
	}

	if err := os.Remove(filename); err != nil {
		return nil, false, err
	}
	f.notifyWatchers(watch.Event{
		Type:   watch.Deleted,
		Object: oldObj,
	})
	return oldObj, true, nil
}

func (f *store) DeleteCollection(ctx context.Context, deleteValidation rest.ValidateObjectFunc,
	options *metav1.DeleteOptions, listOptions *metainternalversion.ListOptions) (runtime.Object, error) {
	newListObj := f.NewList()
	v, err := getListPrt(newListObj)
	if err != nil {
		return nil, err
	}
	dirname := f.objectDirName(ctx)
	if err := visitDir(dirname, f.newFunc, f.codec, func(path string, obj runtime.Object) {
		_ = os.Remove(path)
		appendItem(v, obj)
	}); err != nil {
		return nil, fmt.Errorf("failed walking filepath %v", dirname)
	}
	return newListObj, nil
}

func (f *store) objectFileName(ctx context.Context, name string) string {
	if f.isNamespaced {
		// FIXME: return error if namespace is not found
		ns, _ := genericapirequest.NamespaceFrom(ctx)
		return filepath.Join(f.objRootPath, ns, name+".yaml")
	}
	return filepath.Join(f.objRootPath, name+".yaml")
}

func (f *store) objectDirName(ctx context.Context) string {
	if f.isNamespaced {
		// FIXME: return error if namespace is not found
		ns, _ := genericapirequest.NamespaceFrom(ctx)
		return filepath.Join(f.objRootPath, ns)
	}
	return filepath.Join(f.objRootPath)
}

func write(encoder runtime.Encoder, filepath string, obj runtime.Object) error {
	buf := new(bytes.Buffer)
	if err := encoder.Encode(obj, buf); err != nil {
		return err
	}
	return ioutil.WriteFile(filepath, buf.Bytes(), 0600)
}

func read(decoder runtime.Decoder, path string, newFunc func() runtime.Object) (runtime.Object, error) {
	content, err := ioutil.ReadFile(filepath.Clean(path))
	if err != nil {
		return nil, err
	}
	newObj := newFunc()
	decodedObj, _, err := decoder.Decode(content, nil, newObj)
	if err != nil {
		return nil, err
	}
	return decodedObj, nil
}

func exists(filepath string) bool {
	_, err := os.Stat(filepath)
	return err == nil
}

func ensureDir(dirname string) error {
	if !exists(dirname) {
		return os.MkdirAll(dirname, 0700)
	}
	return nil
}

func visitDir(dirname string, newFunc func() runtime.Object, codec runtime.Decoder,
	visitFunc func(string, runtime.Object)) error {
	return filepath.Walk(dirname, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() {
			return nil
		}
		if !strings.HasSuffix(info.Name(), ".yaml") {
			return nil
		}
		newObj, err := read(codec, path, newFunc)
		if err != nil {
			return err
		}
		visitFunc(path, newObj)
		return nil
	})
}

func appendItem(v reflect.Value, obj runtime.Object) {
	v.Set(reflect.Append(v, reflect.ValueOf(obj).Elem()))
}

func getListPrt(listObj runtime.Object) (reflect.Value, error) {
	listPtr, err := meta.GetItemsPtr(listObj)
	if err != nil {
		return reflect.Value{}, err
	}
	v, err := conversion.EnforcePtr(listPtr)
	if err != nil || v.Kind() != reflect.Slice {
		return reflect.Value{}, fmt.Errorf("need ptr to slice: %v", err)
	}
	return v, nil
}

func (f *store) Watch(ctx context.Context, options *metainternalversion.ListOptions) (watch.Interface, error) {
	yw := &amp;yamlWatch{
		id: len(f.watchers),
		f:  f,
		ch: make(chan watch.Event, 10),
	}
	// On initial watch, send all the existing objects
	list, err := f.List(ctx, options)
	if err != nil {
		return nil, err
	}

	danger := reflect.ValueOf(list).Elem()
	items := danger.FieldByName("Items")

	for i := 0; i &lt; items.Len(); i++ {
		obj := items.Index(i).Addr().Interface().(runtime.Object)
		yw.ch &lt;- watch.Event{
			Type:   watch.Added,
			Object: obj,
		}
	}

	f.muWatchers.Lock()
	f.watchers[yw.id] = yw
	f.muWatchers.Unlock()

	return yw, nil
}

type yamlWatch struct {
	f  *store
	id int
	ch chan watch.Event
}

func (w *yamlWatch) Stop() {
	w.f.muWatchers.Lock()
	delete(w.f.watchers, w.id)
	w.f.muWatchers.Unlock()
}

func (w *yamlWatch) ResultChan() &lt;-chan watch.Event {
	return w.ch
}

func (f *store) ConvertToTable(ctx context.Context, object runtime.Object,
	tableOptions runtime.Object) (*metav1.Table, error) {
	return f.TableConvertor.ConvertToTable(ctx, object, tableOptions)
}
func (f *store) qualifiedResourceFromContext(ctx context.Context) schema.GroupResource {
	if info, ok := genericapirequest.RequestInfoFrom(ctx); ok {
		return schema.GroupResource{Group: info.APIGroup, Resource: info.Resource}
	}
	// some implementations access storage directly and thus the context has no RequestInfo
	return f.defaultQualifiedResource
}</pre>
<p>Calling NewStore creates a rest.Storage. As mentioned above, a storage backend has many details to handle; this example does not:</p>
<ol>
<li>detect resources that are being deleted and react appropriately during CRUD</li>
<li>validate resources (genericregistry.Store does this by delegating to its strategy)</li>
<li>auto-populate metadata fields such as creationTimestamp and selfLink (see the sketch below)</li>
</ol>
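<p>As an illustration of point 3, here is a sketch (an assumption, not part of the sample) of how this store's Create could fill in a couple of metadata defaults before writing the file, similar to what genericregistry.Store does:</p>
<pre class="crayon-plain-tag">accessor, err := meta.Accessor(obj)
if err != nil {
	return nil, err
}
// default the creation timestamp and UID if the client did not set them
if accessor.GetCreationTimestamp().IsZero() {
	accessor.SetCreationTimestamp(metav1.Now())
}
if accessor.GetUID() == "" {
	accessor.SetUID(uuid.NewUUID())
}</pre>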
<div class="blog_h1"><span class="graybg">处理子资源</span></div>
<p>假如K8S中某种资源具有状态子资源。那么当客户端更新状态子资源时，发出的HTTP请求格式为：</p>
<p style="padding-left: 30px;">PUT /apis/cloud.gmem.cc/v1/namespaces/default/flunders/sample/status</p>
<p>它会匹配路由：</p>
<p style="padding-left: 30px;">PUT /apis/cloud.gmem.cc/v1/namespaces/{namespace}/flunders/{name}/status</p>
<p>这个路由是专门为status子资源准备的，和主资源路由不同：</p>
<p style="padding-left: 30px;">PUT /apis/cloud.gmem.cc/v1/namespaces/{namespace}/flunders/{name}</p>
<p>那么，主资源、子资源的处理方式有什么不同？如何影响这种资源处理逻辑呢？</p>
<div class="blog_h2"><span class="graybg">注册子资源</span></div>
<p>InstallAPIGroups时，你只需要简单的为带有 / 的资源名字符串添加一个rest.Storage，就支持子资源了：</p>
<pre class="crayon-plain-tag">v1beta1storage["flunders/status"] = wardleregistry.RESTInPeace(flunderstorage.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter))</pre>
<p>You can even reuse the parent resource's rest.Storage directly. The consequence, however, is that a client request may update only the status or may update the whole flunder, which is usually not what you want.</p>
<p>The code above also registers a go-restful route in the APIServer similar to the one shown at the beginning of this chapter.</p>
<p>The path variables namespace and name are extracted by:</p>
<pre class="crayon-plain-tag">pathParams := pathProcessor.ExtractParameters(route, webService, httpRequest.URL.Path)</pre>
<p>These two variables identify which resource is being operated on. The request is then forwarded to the Update method of rest.Storage:</p>
<pre class="crayon-plain-tag">func (e *Store) Update(ctx context.Context, name string, objInfo rest.UpdatedObjectInfo, createValidation rest.ValidateObjectFunc, updateValidation rest.ValidateObjectUpdateFunc, forceAllowCreate bool, options *metav1.UpdateOptions) (runtime.Object, bool, error) {}</pre>
<p>The name parameter carries the resource's name. Whether the current request should update (only) the subresource is something rest.Storage has no way of knowing.</p>
<div class="blog_h2"><span class="graybg">A subresource handler</span></div>
<p>When updating the status subresource, we usually want to allow only the Status field to change. To achieve this, we register a dedicated rest.Storage for the subresource.</p>
<pre class="crayon-plain-tag">package store

import (
	"context"
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/apiserver/pkg/registry/generic/registry"
	"k8s.io/apiserver/pkg/registry/rest"
	"sigs.k8s.io/apiserver-runtime/pkg/builder/resource/util"
)

// CopyStatusFunc copies status from obj to old
type CopyStatusFunc func(src, dst runtime.Object)

// StatusStore decorates a parent storage and only updates
// status subresource when updating
func StatusStore(parentStore rest.StandardStorage, copyStatusFunc CopyStatusFunc) rest.Storage {
	switch pstor := parentStore.(type) {
	case *registry.Store:
		pstor.UpdateStrategy = &amp;statusStrategy{
			RESTUpdateStrategy: pstor.UpdateStrategy,
			copyStatusFunc:     copyStatusFunc,
		}
	}
	return &amp;statusStore{
		StandardStorage: parentStore,
	}
}

var _ rest.Getter = &amp;statusStore{}
var _ rest.Updater = &amp;statusStore{}

type statusStore struct {
	rest.StandardStorage
}

var _ rest.RESTUpdateStrategy = &amp;statusStrategy{}

// statusStrategy defines a default Strategy for the status subresource.
type statusStrategy struct {
	rest.RESTUpdateStrategy
	copyStatusFunc CopyStatusFunc
}

// PrepareForUpdate calls the PrepareForUpdate function on obj if supported, otherwise does nothing.
func (s *statusStrategy) PrepareForUpdate(ctx context.Context, new, old runtime.Object) {
	s.copyStatusFunc(new, old)
	if err := util.DeepCopy(old, new); err != nil {
		utilruntime.HandleError(err)
	}
}</pre>
<p>genericregistry.Store performs updates inside an atomic callback, and within that callback it calls the strategy's PrepareForUpdate method. The statusStore above works by overriding that method so that only the status subresource is changed.</p>
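<p>A sketch of wiring this up next to the parent store, assuming (as in the earlier main.go) that v1.Flunder is also registered as the internal version; the copy closure below is illustrative:</p>
<pre class="crayon-plain-tag">parent := createStore(s, gv.WithResource(resource).GroupResource(),
	func() runtime.Object { return &amp;v1.Flunder{} },
	func() runtime.Object { return &amp;v1.FlunderList{} },
	completedConfig.RESTOptionsGetter)
v1storage[resource] = parent
// genericregistry.Store satisfies rest.StandardStorage, so the assertion is safe here
v1storage[resource+"/status"] = store.StatusStore(parent.(rest.StandardStorage),
	func(src, dst runtime.Object) {
		dst.(*v1.Flunder).Status = src.(*v1.Flunder).Status // copy only the status
	})</pre>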
<div class="blog_h1"><span class="graybg"><a id="multipleversions"></a>多版本化</span></div>
<p>当你的API需要引入破坏性变更时，就要考虑支持多版本化。</p>
<div class="blog_h2"><span class="graybg">API文件布局</span></div>
<p>下面是一个典型的多版本API文件目录的布局：</p>
<pre class="crayon-plain-tag">api
├── doc.go
├── fullvpcmigration_types.go
├── 
├── v1
│   ├── conversion.go
│   ├── doc.go
│   ├── fullvpcmigration_types.go
│   ├── register.go
│   ├── zz_generated.conversion.go
│   ├── zz_generated.deepcopy.go
│   └── zz_generated.openapi.go
├── v2
│   ├── doc.go
│   ├── fullvpcmigration_types.go
│   ├── register.go
│   ├── zz_generated.conversion.go
│   ├── zz_generated.deepcopy.go
│   └── zz_generated.openapi.go
└── zz_generated.deepcopy.go</pre>
<p>The root directory of the API group (the sample project has only one group, so the api directory itself serves as the group root) should hold the resource struct definitions for the __internal version; it is advisable to keep its content identical to the latest version.</p>
<div class="blog_h3"><span class="graybg">doc.go</span></div>
<p>This file should carry the package-level annotations, for example:</p>
<pre class="crayon-plain-tag">// +k8s:openapi-gen=true
// +groupName=gmem.cc
// +kubebuilder:object:generate=true

package api</pre>
<div class="blog_h3"><span class="graybg">register.go</span></div>
<p>这个文件用于Scheme的注册。对于__internal版本：</p>
<pre class="crayon-plain-tag">package api

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

const (
	GroupName = "gmem.cc"
)

var (
	// GroupVersion is group version used to register these objects
	GroupVersion = schema.GroupVersion{Group: GroupName, Version: runtime.APIVersionInternal}

	// SchemeBuilder is used to add go types to the GroupVersionKind scheme
	// no &amp;scheme.Builder{} here, otherwise vk __internal/WatchEvent will double registered to k8s.io/apimachinery/pkg/apis/meta/v1.WatchEvent &amp;
	// k8s.io/apimachinery/pkg/apis/meta/v1.InternalEvent, which is illegal
	SchemeBuilder = runtime.NewSchemeBuilder()

	// AddToScheme adds the types in this group-version to the given scheme.
	AddToScheme = SchemeBuilder.AddToScheme
)

// Kind takes an unqualified kind and returns a Group qualified GroupKind
func Kind(kind string) schema.GroupKind {
	return GroupVersion.WithKind(kind).GroupKind()
}

// Resource takes an unqualified resource and returns a Group qualified GroupResource
func Resource(resource string) schema.GroupResource {
	return GroupVersion.WithResource(resource).GroupResource()
}</pre>
<p>For a regular version:</p>
<pre class="crayon-plain-tag">package v2

import (
	"cloud.tencent.com/teleport/api"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

var (
	// GroupVersion is group version used to register these objects
	GroupVersion = schema.GroupVersion{Group: api.GroupName, Version: "v2"}

	// SchemeBuilder is used to add go types to the GroupVersionKind scheme
	SchemeBuilder = runtime.NewSchemeBuilder(func(scheme *runtime.Scheme) error {
		metav1.AddToGroupVersion(scheme, GroupVersion)
		return nil
	})
	localSchemeBuilder = &amp;SchemeBuilder

	// AddToScheme adds the types in this group-version to the given scheme.
	AddToScheme = SchemeBuilder.AddToScheme
)

// Kind takes an unqualified kind and returns a Group qualified GroupKind
func Kind(kind string) schema.GroupKind {
	return GroupVersion.WithKind(kind).GroupKind()
}

// Resource takes an unqualified resource and returns a Group qualified GroupResource
func Resource(resource string) schema.GroupResource {
	return GroupVersion.WithResource(resource).GroupResource()
}</pre>
<p>As you can see, a regular version also needs to register certain metav1 structs into its own GroupVersion.</p>
<div class="blog_h3"><span class="graybg">zz_generated.openapi.go</span></div>
<p>Every regular version needs generated OpenAPI definitions. These definitions must be registered with the API Server, otherwise commands such as kubectl apply will fail with 404 errors:</p>
<pre class="crayon-plain-tag">$(OPENAPI_GEN)  \
	--input-dirs "k8s.io/apimachinery/pkg/apis/meta/v1,k8s.io/apimachinery/pkg/runtime,k8s.io/apimachinery/pkg/version" \
	--input-dirs cloud.tencent.com/teleport/api/v1 -o ./  -p api/v1 -O zz_generated.openapi

$(OPENAPI_GEN)  \
	--input-dirs cloud.tencent.com/teleport/api/v2 -o ./  -p api/v2 -O zz_generated.openapi</pre>
<div class="blog_h3"><span class="graybg">zz_generated.conversion.go</span></div>
<p>这是每个普通版本都需要生成的From/To __internal版本的类型转换函数。这些转换函数会通过上面的localSchemeBuilder注册到当前GroupVersion：</p>
<pre class="crayon-plain-tag">$(CONVERSION_GEN) -h hack/boilerplate.go.txt --input-dirs cloud.tencent.com/teleport/api/v1 -O zz_generated.conversion
$(CONVERSION_GEN) -h hack/boilerplate.go.txt --input-dirs cloud.tencent.com/teleport/api/v2 -O zz_generated.conversion</pre>
<p>The reason for bumping an API version is, of course, that the structs have changed. A struct change implies <span style="background-color: #c0c0c0;">conversion logic that is specific to the old and new versions. Such logic obviously cannot be generated automatically; the conversion code you write by hand should live in conversion.go</span>.</p>
<div class="blog_h3"><span class="graybg">zz_generated.deepcopy.go</span></div>
<p>This file holds the deep-copy functions that must be generated for the resource structs of both the __internal version and every regular version.</p>
<div class="blog_h2"><span class="graybg">关于__internal版本</span></div>
<p>如前文所述，每个API资源（的版本），都需要一个rest.Storage，这个Storage会直接负责该API资源版本的GET/CREATE/UPDATE/DELETE/WATCH等操作。</p>
<p>作为默认的，针对Etcd存储后端的rest.Storage的实现genericregistry.Store，它在内部有一个Cacher。此Cacher利用缓存来处理WATCH/LIST请求，避免对Etcd过频的访问。在此Cacher内部，会使用资源的内部版本。</p>
<p>所谓内部版本，就是注册到__internal这个特殊版本号的资源。<pre class="crayon-plain-tag">__internal</pre>这个字面值由常量<pre class="crayon-plain-tag">runtime.APIVersionInternal</pre>提供。我们通常将<span style="background-color: #c0c0c0;">组的根目录下的资源结构体，注册为__internal版本</span>。</p>
<p>有了这种内部版本机制，Cacher就不需要在内存中，存储资源的不同版本。</p>
<p>除此之外，rest.Storage或者它的Strategy所需要的一系列资源生命周期回调函数，接受的参数，都是__internal版本。这意味着：</p>
<ol>
<li>we do not have to write duplicate callbacks for every version</li>
<li>when introducing multiple versions, <span style="background-color: #c0c0c0;">the parameters of these callbacks must be switched to the __internal version</span> (see the sketch below)</li>
</ol>
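<p>A sketch of what point 2 means in practice, with the api package holding the __internal structs; the field names follow the TeamName/Team example used in the next section and are otherwise illustrative:</p>
<pre class="crayon-plain-tag">func (strategy) Validate(ctx context.Context, obj runtime.Object) field.ErrorList {
	m := obj.(*api.FullVPCMigration) // internal version, not *v1.FullVPCMigration
	var errs field.ErrorList
	if m.Spec.Team == "" {
		errs = append(errs, field.Required(field.NewPath("spec", "team"), "team is required"))
	}
	return errs
}</pre>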
<div class="blog_h2"><span class="graybg">生成和定制转换函数</span></div>
<p>之所以Cacher、生命周期回调函数，以及下文会提到的，kubectl和存储能够自由的选择自己需要的版本，是因为不同版本的API资源之间可以进行转换。</p>
<p>当你复制一份v1资源的代码为v2时，这时可以使用完全自动生成的转换函数。一旦你添加或修改了一个字段，你就需要定制转换函数了。</p>
<p>假设我们将FullVPCMigrationSpec.TeamName字段改为Team，则需要：</p>
<pre class="crayon-plain-tag">// zz_generated.conversion.go中报错的地方，就是你需要实现的函数

func Convert_v1_FullVPCMigrationSpec_To_api_FullVPCMigrationSpec(in *FullVPCMigrationSpec, out *api.FullVPCMigrationSpec, s conversion.Scope) error {
    // hand-written handling for the changed fields goes here
	out.Team = in.TeamName
    // then call the generated function; its name differs from yours only by the auto prefix
	return autoConvert_v1_FullVPCMigrationSpec_To_api_FullVPCMigrationSpec(in, out, s)
}

func Convert_api_FullVPCMigrationSpec_To_v1_FullVPCMigrationSpec(in *api.FullVPCMigrationSpec, out *FullVPCMigrationSpec, s conversion.Scope) error {
	out.TeamName = in.Team
	return autoConvert_api_FullVPCMigrationSpec_To_v1_FullVPCMigrationSpec(in, out, s)
}</pre>
<p>The two functions above are missing from the generated conversion code and cause compile errors; you have to implement them yourself.</p>
<p>The auto-prefixed variants are generated functions that already handle most of the conversion; you do the necessary manual work and then call the auto function.</p>
<p>Note that <span style="background-color: #c0c0c0;">conversion functions always convert between a specific version and the __internal version</span>. So converting v1 to v2 means converting v1 to __internal first, and then __internal to v2. The design is easy to understand: without it, the number of conversion functions would explode as versions accumulate.</p>
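<p>A sketch of that two-hop conversion using the Scheme, assuming the generated and hand-written conversion functions are registered; both hops are spelled out explicitly:</p>
<pre class="crayon-plain-tag">// v1 -> __internal -> v2
internalObj := &amp;api.FullVPCMigration{}
if err := scheme.Convert(v1Obj, internalObj, nil); err != nil {
	panic(err)
}
v2Obj := &amp;v2.FullVPCMigration{}
if err := scheme.Convert(internalObj, v2Obj, nil); err != nil {
	panic(err)
}</pre>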
<p><span style="background-color: #c0c0c0;">The conversion code must be registered with the Scheme</span>; the API Server, kubectl, and clients such as controller-runtime all rely on the Scheme.</p>
<div class="blog_h2"><span class="graybg">多版本如何存储</span></div>
<p>不管是存储（串行化），还是读取（反串行化），都依赖于Codec。所谓Codec就是Serializer：</p>
<pre class="crayon-plain-tag">package runtime

type Serializer interface {
	Encoder
	Decoder
}

// Codec is a Serializer that deals with the details of versioning objects. It offers the same
// interface as Serializer, so this is a marker to consumers that care about the version of the objects
// they receive.
type Codec Serializer</pre>
<p>Codecs are provided by a CodecFactory, which holds the Scheme:</p>
<pre class="crayon-plain-tag">serializer.NewCodecFactory(scheme)</pre>
<p>We already know that the Scheme contains this information:</p>
<ol>
<li>which Go struct each Group, Version, and Kind maps to</li>
<li>the json tags on those Go structs, which determine the serialized form</li>
<li>how to convert between different Versions of the same Group and Kind</li>
</ol>
<p>A Codec is therefore able to serialize to JSON or other formats, and to convert between Go structs of different versions.</p>
<p>For genericregistry.Store, storing means converting the resource's Go struct to JSON or Protobuf and saving it to Etcd, which obviously requires a Codec.</p>
<p>With multi-version support enabled, you pass all versions (prioritizedVersions) to the CodecFactory when creating the Codec:</p>
<pre class="crayon-plain-tag">prioritizedVersions ：= []schema.GroupVersion{
    {
        Group: "gmem.cc",
        Version: "v2",
    },
    {
        Group: "gmem.cc",
        Version: "v1",
    },
}
codec := codecFactory.LegacyCodec(prioritizedVersions...)

genericOptions.Etcd.StorageConfig.EncodeVersioner = runtime.NewMultiGroupVersioner(schema.GroupVersion{
    Group: "gmem.cc",
    Version: "v2",
} )</pre>
<p>Moreover, <span style="background-color: #c0c0c0;">prioritizedVersions determines the preferred format for storing a resource</span>. For example, fullvpcmigrations has v1 and v2, so v2 is used when storing; jointeamrequests only has v1, so it can only be stored as v1.</p>
<p>Note: if a fullvpcmigration already stored as v1 exists, <span style="background-color: #c0c0c0;">the first modification after applying the configuration above rewrites it in the v2 storage format</span>.</p>
<div class="blog_h2"><span class="graybg">多版本的OpenAPI</span></div>
<p>你需要为每个版本生成OpenAPI定义。OpenAPI的定义只是一个map，将所有版本的内容合并即可。</p>
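<p>A sketch of that merge, feeding the combined map into the OpenAPI config (common here refers to k8s.io/kube-openapi/pkg/common):</p>
<pre class="crayon-plain-tag">config.OpenAPIConfig = genericapiserver.DefaultOpenAPIConfig(
	func(ref common.ReferenceCallback) map[string]common.OpenAPIDefinition {
		defs := v1.GetOpenAPIDefinitions(ref)
		for k, d := range v2.GetOpenAPIDefinitions(ref) {
			defs[k] = d // later versions simply overwrite duplicate keys
		}
		return defs
	},
	openapi.NewDefinitionNamer(s))</pre>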
<div class="blog_h2"><span class="graybg">APIServer暴露哪个版本</span></div>
<p>APIServer会暴露所有注册的资源版本。但是，它有一个版本优先级的概念：</p>
<pre class="crayon-plain-tag">apiGroupInfo.PrioritizedVersions = prioritizedVersions</pre>
<p>This determines which version kubectl displays by default. The preferred version also shows up in the output of the api-resources subcommand:</p>
<pre class="crayon-plain-tag"># kubectl -s http://127.0.0.1:6080 api-resources  
NAME                    SHORTNAMES   APIVERSION                 NAMESPACED   KIND
fullvpcmigrations                    gmem.cc/v2   true         FullVPCMigration
jointeamrequests                     gmem.cc/v1   true         JoinTeamRequest</pre>
<p>kubectl get shows the preferred version by default, but you can also force a specific version:</p>
<pre class="crayon-plain-tag">kubectl -s http://127.0.0.1:6080 -n default get fullvpcmigration.v1.gmem.cc
# GET http://127.0.0.1:6080/apis/gmem.cc/v1/namespaces/default/fullvpcmigrations?limit=500

kubectl -s http://127.0.0.1:6080 -n default get fullvpcmigration.v2.gmem.cc
# GET http://127.0.0.1:6080/apis/gmem.cc/v2/namespaces/default/fullvpcmigrations?limit=500</pre>
<p>Either way, <span style="background-color: #c0c0c0;">fullvpcmigrations stored in any version will be returned</span>. You can think of it this way: <span style="background-color: #c0c0c0;">from the client's point of view, choosing a version merely selects a particular "view" of the resource</span>.</p>
<div class="blog_h2"><span class="graybg">控制器中的版本选择</span></div>
<p>控制器所<span style="background-color: #c0c0c0;">监听的资源版本，必须已经在控制器管理器的Scheme中注册</span>。</p>
<p>你在Reconcile代码中，可以<span style="background-color: #c0c0c0;">用任何已经注册的版本来作为Get操作的容器</span>，类型转换会自动进行。  </p>
<p>建议仅在读取、存储资源状态的时候，用普通版本，其余时候，都用__internal版本。这样你的控制器逻辑，在版本升级后，需要的变更会很少。</p>
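<p>A sketch of such a Get inside a controller-runtime reconciler: whichever registered version is used as the container, the API Server converts from the stored version transparently (r is assumed to embed client.Client, as reconcilers usually do):</p>
<pre class="crayon-plain-tag">var m v1.FullVPCMigration // a v2.FullVPCMigration container would work just as well
if err := r.Get(ctx, req.NamespacedName, &amp;m); err != nil {
	return ctrl.Result{}, client.IgnoreNotFound(err)
}</pre>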
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/kubernetes-style-apiserver">Writing a Kubernetes-style APIServer</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/kubernetes-style-apiserver/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tracking down a KeyDB slowness issue</title>
		<link>https://blog.gmem.cc/debugging-slow-keydb</link>
		<comments>https://blog.gmem.cc/debugging-slow-keydb#comments</comments>
		<pubDate>Thu, 28 Jan 2021 07:04:58 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[PaaS]]></category>
		<category><![CDATA[DNS]]></category>
		<category><![CDATA[Redis]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=35755</guid>
		<description><![CDATA[<p>Environment Runtime environment This problem appeared on a Kubernetes 1.18 cluster built on virtual machines. The cluster has three nodes: [crayon-6a038ddd8c174641975444/] KeyDB configuration KeyDB is managed by a StatefulSet, with three instances in total:  [crayon-6a038ddd8c179841853590/] These three instances: run one per node due to anti-affinity settings use Active-Active (--active-replica) multi-master (--multi-master) replication: every instance is a slave of the other two, and every instance accepts reads and writes Fault description Trigger condition The fault may appear whenever one node goes down; after some time, GET/PUT or any other request becomes slow to process Symptoms The fault has two distinctive characteristics: the time before it appears is highly random, sometimes hours of testing show no slow requests at all, and often one of the two surviving instances becomes slow quickly while the other keeps running normally for quite a while the delay varies, sometimes unnoticeable, sometimes exceeding 10 seconds, and a single slow request can be followed by a dozen requests handled at normal speed, which suggests a periodic, long-held lock; in the gaps when the lock is released, requests are handled quickly Fault analysis <a class="read-more" href="https://blog.gmem.cc/debugging-slow-keydb">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/debugging-slow-keydb">Tracking down a KeyDB slowness issue</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><div class="blog_h1"><span class="graybg">Environment</span></div>
<div class="blog_h2"><span class="graybg">Runtime environment</span></div>
<p>This problem appeared on a Kubernetes 1.18 cluster built on virtual machines. The cluster has three nodes:</p>
<pre class="crayon-plain-tag"># kubectl get node -o wide
NAME              STATUS   VERSION   INTERNAL-IP      OS-IMAGE                KERNEL-VERSION              CONTAINER-RUNTIME
192.168.104.51    Ready    v1.18.3   192.168.104.51   CentOS Linux 7 (Core)   3.10.0-862.3.2.el7.x86_64   docker://19.3.9
192.168.104.72    Ready    v1.18.3   192.168.104.72   CentOS Linux 7 (Core)   3.10.0-862.3.2.el7.x86_64   docker://19.3.9
192.168.104.108   Ready    v1.18.3   192.168.104.108  CentOS Linux 7 (Core)   3.10.0-862.3.2.el7.x86_64   docker://19.3.9</pre>
<div class="blog_h2"><span class="graybg">KeyDB配置</span></div>
<p><a href="/keydb-study-note">KeyDB</a>通过StatefulSet管理，一共有三个实例： </p>
<pre class="crayon-plain-tag"># kubectl -n default get pod -o wide -l app.kubernetes.io/name=keydb
NAME             READY   STATUS    RESTARTS     IP             NODE            
keydb-0   1/1     Running   0            172.29.2.63     192.168.104.108 
keydb-1   1/1     Running   0            172.29.1.69     192.168.104.72  
keydb-2   1/1     Running   0            172.29.1.121    192.168.104.51</pre>
<p>These three instances:</p>
<ol>
<li>run one per node, due to the anti-affinity settings</li>
<li>use Active-Active (--active-replica) multi-master (--multi-master) replication: every instance is a slave of the other two, and every instance accepts both reads and writes</li>
</ol>
<div class="blog_h1"><span class="graybg">故障描述</span></div>
<div class="blog_h2"><span class="graybg">触发条件</span></div>
<p>出现一个节点宕机的情况，就可能出现此故障。经过一段时间以后，会出现<span style="background-color: #c0c0c0;">GET/PUT或者任何其它请求处理缓慢的情况</span>。</p>
<div class="blog_h2"><span class="graybg">故障特征</span></div>
<p>此故障有两个明显的特征：</p>
<ol>
<li>故障出现前需要等待的时间，随机性很强，有时甚至测试了数小时都没有发现请求缓慢的情况。常常发生的情况是，宕机后剩下的两个实例，一个很快出现缓慢问题，另外一个却还能运行较长时间</li>
<li>请求处理延缓的时长不定，有时候没有明显延缓，有时候长达10+秒。而且一次缓慢请求后，可以跟着10多次正常速度处理的请求。这个特征提示故障和某种<span style="background-color: #c0c0c0;">周期性的、长时间占用的锁</span>有关。在锁被释放的间隙，请求可以被快速处理</li>
</ol>
<div class="blog_h1"><span class="graybg">故障分析</span></div>
<div class="blog_h2"><span class="graybg">触发故障</span></div>
<p>我们将节点192.168.104.108强制关闭，这样实例keydb-0无法访问，另外两个节点无法和它进行Replication。</p>
<p>分别登录另外两个节点，监控GET/SET操作的性能：</p>
<pre class="crayon-plain-tag">kubectl -n default exec -it keydb-1 -- bash -c  \
  'while true; do key=keydb-1-$(date +%s); keydb-cli set $key $key-val; keydb-cli get $key; done'

kubectl -n default exec -it keydb-2 -- bash -c \
 'while true; do key=keydb-2-$(date +%s); keydb-cli set $key $key-val; keydb-cli get $key; done'</pre>
<p>Monitor replication-related information:</p>
<pre class="crayon-plain-tag">watch -- kubectl -n default exec -i keydb-1 -- keydb-cli info replication

watch -- kubectl -n default exec -i keydb-2 -- keydb-cli info replication</pre>
<p>Monitor the KeyDB logs: </p>
<pre class="crayon-plain-tag">kubectl -n default logs  keydb-1 -f

kubectl -n default logs  keydb-2 -f</pre>
<p>After a while, keydb-1 starts exhibiting randomly delayed request handling:</p>
<pre class="crayon-plain-tag">127.0.0.1:6379&gt; set hello world
OK
(1.24s)
127.0.0.1:6379&gt; set hello world
OK
(8.96s)
127.0.0.1:6379&gt; get hello
"world"
(5.99s)
127.0.0.1:6379&gt; get hello
"world"
(9.44s) </pre>
<p>Meanwhile keydb-2 keeps running normally, with normal request latency.</p>
<div class="blog_h2"><span class="graybg">Slow queries</span></div>
<p>Fetching keydb-1's slow log yields nothing of value; the observed delays are not even reflected in it:</p>
<pre class="crayon-plain-tag">127.0.0.1:6379&gt; slowlog get 10                                         
1) 1) (integer) 7                                                      
   2) (integer) 1611833042                                             
   3) (integer) 14431               # the slowest query only took ~14ms
   4) 1) "set"                                                         
      2) "keydb-1-1611833042"                                          
      3) "keydb-1-1611833042-val"                                      
   5) "127.0.0.1:38488"                                                
   6) ""                                                               
2) 1) (integer) 6                                                      
   2) (integer) 1611831322                                             
   3) (integer) 14486                                                  
   4) 1) "get"                                                         
      2) "keydb-1-1611831312"                                          
   5) "127.0.0.1:51680"                                                
   6) ""  </pre>
<div class="blog_h2"><span class="graybg">日志分析</span></div>
<p>部署KeyDB已经设置<pre class="crayon-plain-tag">--loglevel debug</pre>，以获得尽可能详尽的日志。</p>
<p>由于正在运行不间断执行SET/GET操作的脚本，因此日志量很大而刷屏，但是每隔一段时间就会出现卡顿。下面是keydb-1的日志片段：</p>
<pre class="crayon-plain-tag">7:11:S 28 Jan 2021 08:57:51.233 - Client closed connection
7:11:S 28 Jan 2021 08:57:51.251 - Accepted 127.0.0.1:44224
7:11:S 28 Jan 2021 08:57:51.252 - Client closed connection
7:12:S 28 Jan 2021 08:57:51.276 - Accepted 127.0.0.1:44226
7:11:S 28 Jan 2021 08:57:51.277 - Client closed connection
# after this log line, the server stalled for 10s with no log output at all
7:11:S 28 Jan 2021 08:57:51.279 * Connecting to MASTER keydb-0.keydb:6379
7:11:S 28 Jan 2021 08:58:01.290 * Unable to connect to MASTER: Resource temporarily unavailable
7:11:S 28 Jan 2021 08:58:01.290 - Accepted 127.0.0.1:44228
7:11:S 28 Jan 2021 08:58:01.290 - Accepted 127.0.0.1:44264</pre>
<p>The log shows that, right before the stall, keydb-1 was trying to connect to keydb-0, which is already down; the connection attempt blocks for 10 seconds and then fails with <pre class="crayon-plain-tag">EAGAIN</pre>.</p>
<p>SET/GET requests are not processed while it is blocked. Possible explanations:</p>
<ol>
<li>while connecting to keydb-0, some global lock is held that SET/GET requests also need</li>
<li>connecting to keydb-0 and serving SET/GET requests are done by the same thread</li>
</ol>
<p>The second guess seems unlikely, because one of KeyDB's advertised strengths is multi-threaded request handling, and we set <pre class="crayon-plain-tag">--server-threads 2</pre>, i.e. two threads serve requests.</p>
<p>The EAGAIN error itself offers no clue either, because keydb-2, which is currently not stalling, logs exactly the same thing, just without any stall:</p>
<pre class="crayon-plain-tag">7:11:S 28 Jan 2021 08:19:22.624 * Connecting to MASTER keydb-0.keydb:6379
# the connection failure is detected after only 5ms
7:11:S 28 Jan 2021 08:19:22.629 * Unable to connect to MASTER: Resource temporarily unavailable</pre>
<div class="blog_h2"><span class="graybg">源码分析</span></div>
<div class="blog_h3"><span class="graybg">复制定时任务</span></div>
<p>我们使用的KeyDB版本是5.3.3，尝试用关键字“Connecting to MASTER”搜索，发现只有一个匹配，位于<pre class="crayon-plain-tag">replicationCron</pre>函数中。从函数名称上就可以看到，它是和复制（Replication）有关的定时任务。</p>
<p>KeyDB启动时会调用<pre class="crayon-plain-tag">initServer</pre>进行初始化，后者会在事件循环中每1ms调度一次<pre class="crayon-plain-tag">serverCron</pre>。serverCron负责后台任务的总体调度，它的一个职责就是，每1s调度一次replicationCron函数。</p>
<p>下面看一下replicationCron的源码：</p>
<pre class="crayon-plain-tag">/* Replication cron function, called 1 time per second. */
void replicationCron(void) {
    static long long replication_cron_loops = 0;
    serverAssert(GlobalLocksAcquired());
    listIter liMaster;
    listNode *lnMaster;
    listRewind(g_pserver-&gt;masters, &amp;liMaster);
    // iterate over every master of this instance
    while ((lnMaster = listNext(&amp;liMaster)))
    {
        redisMaster *mi = (redisMaster*)listNodeValue(lnMaster);
        std::unique_lock&lt;decltype(mi-&gt;master-&gt;lock)&gt; ulock;
        // acquire the lock of this master's client
        if (mi-&gt;master != nullptr)
            ulock = decltype(ulock)(mi-&gt;master-&gt;lock);

        /* Non blocking connection timeout? */
        // if the replication state is "connecting to master",
        // or it is in the handshake phase (which covers several states) and has timed out
        if (mi-&gt;masterhost &amp;&amp;
            (mi-&gt;repl_state == REPL_STATE_CONNECTING ||
            slaveIsInHandshakeState(mi)) &amp;&amp;
            (time(NULL)-mi-&gt;repl_transfer_lastio) &gt; g_pserver-&gt;repl_timeout)
        {
            // then cancel the handshake: abort the in-flight non-blocking connect, or the in-flight RDB transfer
            serverLog(LL_WARNING,"Timeout connecting to the MASTER...");
            cancelReplicationHandshake(mi);
        }

        /* Bulk transfer I/O timeout? */
        // if we are currently receiving an RDB file from the master and have timed out
        if (mi-&gt;masterhost &amp;&amp; mi-&gt;repl_state == REPL_STATE_TRANSFER &amp;&amp;
            (time(NULL)-mi-&gt;repl_transfer_lastio) &gt; g_pserver-&gt;repl_timeout)
        {
            serverLog(LL_WARNING,"Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in keydb.conf to a larger value.");
            // then cancel the handshake
            cancelReplicationHandshake(mi);
        }

        /* Timed out master when we are an already connected replica? */
        // if the replication state is "connected" but there has been no activity before the timeout (normally heartbeats keep it alive)
        if (mi-&gt;masterhost &amp;&amp; mi-&gt;master &amp;&amp; mi-&gt;repl_state == REPL_STATE_CONNECTED &amp;&amp;
            (time(NULL)-mi-&gt;master-&gt;lastinteraction) &gt; g_pserver-&gt;repl_timeout)
        {
            // then free the client resources
            serverLog(LL_WARNING,"MASTER timeout: no data nor PING received...");
            if (FCorrectThread(mi-&gt;master))
                freeClient(mi-&gt;master);
            else
                freeClientAsync(mi-&gt;master);
        }

        /* Check if we should connect to a MASTER */
        // none of the branches above match our scenario: keydb-0 is down,
        // so the state must be REPL_STATE_CONNECT
        if (mi-&gt;repl_state == REPL_STATE_CONNECT) {
            // this is the log line printed right before the stall
            serverLog(LL_NOTICE,"Connecting to MASTER %s:%d",
                mi-&gt;masterhost, mi-&gt;masterport);
            // initiate the connection
            if (connectWithMaster(mi) == C_OK) {
                serverLog(LL_NOTICE,"MASTER &lt;-&gt; REPLICA sync started");
            }
        }

        // send a heartbeat (ACK) to the master every second
        if (mi-&gt;masterhost &amp;&amp; mi-&gt;master &amp;&amp;
            !(mi-&gt;master-&gt;flags &amp; CLIENT_PRE_PSYNC))
            replicationSendAck(mi);
    }

    // the rest deals with this instance's own slaves (e.g. sending heartbeats); irrelevant to our scenario, omitted...
}</pre>
<p>Clearly the stall is caused by the call to <pre class="crayon-plain-tag">connectWithMaster</pre>. As the code comments indicate, KeyDB expects this connect to be non-blocking, yet for some reason it blocks badly in our scenario.</p>
<p>Looking further into connectWithMaster:</p>
<pre class="crayon-plain-tag">int connectWithMaster(redisMaster *mi) {
    int fd;

    fd = anetTcpNonBlockBestEffortBindConnect(NULL,
        mi-&gt;masterhost,mi-&gt;masterport,NET_FIRST_BIND_ADDR);
    if (fd == -1) {
        int sev = g_pserver-&gt;enable_multimaster ? LL_NOTICE : LL_WARNING;
        // this is the log line printed after the 10s stall, so the blocking happens inside anetTcpNonBlockBestEffortBindConnect
        serverLog(sev,"Unable to connect to MASTER: %s", strerror(errno));
        return C_ERR;
    }
    // ...
}

int anetTcpNonBlockBestEffortBindConnect(char *err, char *addr, int port,
                                         char *source_addr)
{
    return anetTcpGenericConnect(err,addr,port,source_addr,
            // non-blocking + best-effort binding
            ANET_CONNECT_NONBLOCK|ANET_CONNECT_BE_BINDING);
}


static int anetTcpGenericConnect(char *err, char *addr, int port,
                                 char *source_addr, int flags)
{
    int s = ANET_ERR, rv;
    char portstr[6];  /* strlen("65535") + 1; */
    struct addrinfo hints, *servinfo, *bservinfo, *p, *b;

    snprintf(portstr,sizeof(portstr),"%d",port);
    memset(&amp;hints,0,sizeof(hints));
    // No address family is specified, which makes getaddrinfo issue A and AAAA queries together
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    // Resolve the master's hostname into a list of address info (addrinfo) entries
    if ((rv = getaddrinfo(addr,portstr,&amp;hints,&amp;servinfo)) != 0) {
        anetSetError(err, "%s", gai_strerror(rv));
        return ANET_ERR;
    }
    // Iterate over the master's resolved addresses
    for (p = servinfo; p != NULL; p = p-&gt;ai_next) {
        // Create a socket; if socket/connect fails, try the next address
        if ((s = socket(p-&gt;ai_family,p-&gt;ai_socktype,p-&gt;ai_protocol)) == -1)
            continue;
        // Set the SO_REUSEADDR socket option
        if (anetSetReuseAddr(err,s) == ANET_ERR) 
            goto error;
        // Set the SO_REUSEPORT socket option
        if (flags &amp; ANET_CONNECT_REUSEPORT &amp;&amp; anetSetReusePort(err, s) != ANET_OK)
            goto error;
        // Set O_NONBLOCK via fcntl
        if (flags &amp; ANET_CONNECT_NONBLOCK &amp;&amp; anetNonBlock(err,s) != ANET_OK)
            goto error;
        if (source_addr) {
            int bound = 0;
            /* Using getaddrinfo saves us from self-determining IPv4 vs IPv6 */
            // Resolve the source address
            if ((rv = getaddrinfo(source_addr, NULL, &amp;hints, &amp;bservinfo)) != 0)
            {
                anetSetError(err, "%s", gai_strerror(rv));
                goto error;
            }
            for (b = bservinfo; b != NULL; b = b-&gt;ai_next) {
                // Bind to the first source address that succeeds
                if (bind(s,b-&gt;ai_addr,b-&gt;ai_addrlen) != -1) {
                    bound = 1;
                    break;
                }
            }
            freeaddrinfo(bservinfo);
            if (!bound) {
                // Binding the source address failed; fall back to best-effort binding
                anetSetError(err, "bind: %s", strerror(errno));
                goto error;
            }
        }
        // Initiate the connection
        if (connect(s,p-&gt;ai_addr,p-&gt;ai_addrlen) == -1) {
            // In our scenario the socket is non-blocking, so EINPROGRESS is returned immediately; this is expected
            if (errno == EINPROGRESS &amp;&amp; flags &amp; ANET_CONNECT_NONBLOCK)
                goto end;
            // Treat any other error as a failure and try the next master address
            close(s);
            s = ANET_ERR;
            continue;
        }

        goto end;
    }
    if (p == NULL)
        anetSetError(err, "creating socket: %s", strerror(errno));

error:
    if (s != ANET_ERR) {
        close(s);
        s = ANET_ERR;
    }

end:
    freeaddrinfo(servinfo);

    // When a source address was specified but binding failed, we end up here; retry the connect without a source address
    if (s == ANET_ERR &amp;&amp; source_addr &amp;&amp; (flags &amp; ANET_CONNECT_BE_BINDING)) {
        return anetTcpGenericConnect(err,addr,port,NULL,flags);
    } else {
        return s;
    }
}</pre>
<p>Although we can be sure that anetTcpGenericConnect, called by connectWithMaster, is where the blocking occurs, nothing in the code looks suspicious: it is just socket and bind plus a non-blocking connect.</p>
<div class="blog_h3"><span class="graybg">请求处理逻辑</span></div>
<p>We have already observed that while the replication timer is stalled, request processing cannot make progress either. Code analysis also confirms that during the stall, the replication timer holds the lock of the master's client object.</p>
<p>So, is the guess correct that request handling (threads?) contends with the replication timer on this lock?</p>
<div class="blog_h2"><span class="graybg">单步跟踪</span></div>
<p>To pinpoint the blocking code, we single-step through it with GDB:</p>
<pre class="crayon-plain-tag">#              需要特权模式，否则无法加载符号表
docker run -it --rm --name gdb --privileged --net=host --pid=host --entrypoint gdb docker.gmem.cc/debug

(gdb) attach 449
(gdb) break replication.cpp:3084
# Repeat s to step into anet.c
(gdb) s
# Repeat the n command
(gdb) n
# After the stall, inspect variables
# The address being resolved
(gdb) p addr
$2 = 0x7f2f31411281 "keydb-0.keydb"
# Return value of getaddrinfo
(gdb) p rv
$3 = -3</pre>
<p>After stepping into anetTcpGenericConnect and executing line by line, repeated tests all stall at line 291 of <pre class="crayon-plain-tag">anet.c</pre>:</p>
<pre class="crayon-plain-tag">if ((rv = getaddrinfo(addr,portstr,&amp;hints,&amp;servinfo)) != 0) {
    anetSetError(err, "%s", gai_strerror(rv));
    return ANET_ERR;
}</pre>
<p>In other words, a single call to getaddrinfo can take several seconds. This is a standard glibc function that resolves a hostname into IP addresses.</p>
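<p>To confirm this independently of KeyDB, a tiny standalone program that times one getaddrinfo call can be run inside the affected container. This is a minimal sketch under the same conditions as our scenario (AF_UNSPEC hints; the name keydb-0.keydb and port 6379 are just the values from this incident):</p>
<pre class="crayon-plain-tag">#include &lt;netdb.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;time.h&gt;

int main(int argc, char **argv) {
    const char *host = argc &gt; 1 ? argv[1] : "keydb-0.keydb";
    struct addrinfo hints, *res;
    memset(&amp;hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* same as KeyDB: triggers both A and AAAA lookups */
    hints.ai_socktype = SOCK_STREAM;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &amp;t0);
    int rv = getaddrinfo(host, "6379", &amp;hints, &amp;res);
    clock_gettime(CLOCK_MONOTONIC, &amp;t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("getaddrinfo(%s) returned %d (%s) after %.1f ms\n",
           host, rv, rv ? gai_strerror(rv) : "ok", ms);
    if (rv == 0) freeaddrinfo(res);
    return 0;
}</pre>
<p>When the stall is active, this should report roughly 5000 ms and EAI_AGAIN, matching what GDB shows inside KeyDB.</p>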
<p>During debugging we noticed that whenever the stall occurred while resolving keydb-0.keydb, getaddrinfo returned -3. <pre class="crayon-plain-tag">man getaddrinfo</pre> explains the meaning of this return value:</p>
<p style="padding-left: 30px;">EAI_AGAIN  The name server returned a temporary failure indication. Try again later.</p>
<p>At first glance this looks like a problem with the <a href="/tcp-ip-study-note#dns">DNS</a> server, i.e. the cluster's <a href="/coredns-study-note">CoreDNS</a>. But that would not explain why keydb-2.keydb is unaffected at the same time.</p>
<div class="blog_h2"><span class="graybg">检查CoreDNS</span></div>
<p>To confirm whether CoreDNS had a problem, we ran lookups on the host and inside the network namespaces of both instances:</p>
<pre class="crayon-plain-tag"># nslookup keydb-0.keydb.default.svc.cluster.local 10.96.0.10
Server:		10.96.0.10
Address:	10.96.0.10#53

** server can't find keydb-0.keydb.default.svc.cluster.local: NXDOMAIN

# nslookup keydb-0.keydb.svc.cluster.local 10.96.0.10
Server:		10.96.0.10
Address:	10.96.0.10#53

** server can't find keydb-0.keydb.svc.cluster.local: NXDOMAIN

# nslookup keydb-0.keydb.cluster.local 10.96.0.10
Server:		10.96.0.10
Address:	10.96.0.10#53

** server can't find keydb-0.keydb.cluster.local: NXDOMAIN

# nslookup keydb-0.keydb 10.96.0.10
Server:		10.96.0.10
Address:	10.96.0.10#53

** server can't find keydb-0.keydb: SERVFAIL</pre>
<p>Running these lookups repeatedly in a loop showed no slow resolution at all. The CoreDNS logs also contain queries from keydb-1.keydb and keydb-2.keydb; all of them were sent over UDP and handled in sub-millisecond time.</p>
<p>In other words, neither the network path from the KeyDB hosts/namespaces to CoreDNS nor the CoreDNS server itself has any problem.</p>
<p>That is puzzling. Is the problem inside getaddrinfo itself? Or did we misjudge during single-stepping and the problem has nothing to do with DNS? To find out, we tampered with CoreDNS and forced keydb-0.keydb to resolve to a non-existent IP address:</p>
<pre class="crayon-plain-tag">.:53 {
    # ...
    hosts {
        192.168.144.51  keydb-1.keydb
    }
    # ...
}</pre>
<p>Very quickly after that, the stall disappeared, which made us suspect getaddrinfo even more.</p>
<div class="blog_h2"><span class="graybg">调试getaddrinfo</span></div>
<p>The file /etc/lsb-release shows that the KeyDB image is built on Ubuntu 18.04.4 LTS, and the libc6 version is 2.27-3ubuntu1.</p>
<p>Its <a href="http://launchpadlibrarian.net/365856914/libc6-dbg_2.27-3ubuntu1_amd64.deb">debug symbols</a> and <a href="http://launchpadlibrarian.net/365856911/glibc-source_2.27-3ubuntu1_all.deb">source package</a> can be found on launchpad.net. After downloading the deb packages, extracting them, copying the files into the GDB container, and setting the debug file directory, we can step into the glibc code:</p>
<pre class="crayon-plain-tag">ar x libc6-dbg_2.27-3ubuntu1_amd64.deb
tar -xf data.tar.xz 

# Copy into the container that is running GDB
docker cp usr gdb:/root

# Change the debug file search directory
(gdb) set debug-file-directory /root/usr/lib/debug
# Set breakpoints; the lines below follow the slow execution path
(gdb) b anet.c:291
(gdb) b getaddrinfo.c:342
(gdb) b getaddrinfo.c:786  
# (gdb) print fct4
# $2 = (nss_gethostbyname4_r) 0x7f32f97e9a70 &lt;_nss_dns_gethostbyname4_r&gt;
(gdb) b dns-host.c:317
(gdb) b res_query.c:336
(gdb) b res_query.c:495                       # invoke __res_context_querydomain
(gdb) b res_query.c:601                       # invoke __res_context_query    
(gdb) b res_query.c:216                       # invoke __res_context_send
(gdb) b res_send.c:1066 if buflen==45         # send_dg</pre>
<p>Through debugging, we found that getaddrinfo issues DNS queries for four names in turn:</p>
<p style="padding-left: 30px;">keydb-0.keydb.default.svc.cluster.local. <br />keydb-0.keydb.svc.cluster.local. <br />keydb-0.keydb.cluster.local.<br />keydb-0.keydb.</p>
<p>The CoreDNS logs show that all of these requests were handled quickly:</p>
<pre class="crayon-plain-tag">4242 "A IN keydb-0.keydb.default.svc.cluster.local. udp 68 false 512" NXDOMAIN qr,aa,rd 161 0.000215337s
38046 "AAAA IN keydb-0.keydb.default.svc.cluster.local. udp 68 false 512" NXDOMAIN qr,aa,rd 161 0.000203934s

23194 "A IN keydb-0.keydb.svc.cluster.local. udp 63 false 512" NXDOMAIN qr,aa,rd 156 0.000301011s
23722 "AAAA IN keydb-0.keydb.svc.cluster.local. udp 63 false 512" NXDOMAIN qr,aa,rd 156 0.000125386s

36552 "A IN keydb-0.keydb.cluster.local. udp 59 false 512" NXDOMAIN qr,aa,rd 152 0.000281247s
217 "AAAA IN keydb-0.keydb.cluster.local. udp 59 false 512" NXDOMAIN qr,aa,rd 152 0.000150689s

6776 "A IN keydb-0.keydb. udp 45 false 512" NOERROR - 0 0.000196686s
6776 "A IN keydb-0.keydb. udp 45 false 512" NOERROR - 0 0.000157011s </pre>
<p>The handling of the last name, keydb-0.keydb., which is the original name passed to getaddrinfo, has several noteworthy aspects:</p>
<ol>
<li>From GDB's point of view, <span style="background-color: #c0c0c0;">the stall occurs precisely while this name is being resolved</span></li>
<li>The CoreDNS logs show no AAAA request for it. Since KeyDB <span style="background-color: #c0c0c0;">specifies AF_UNSPEC, getaddrinfo sends A and AAAA queries and waits for both answers</span>. Possibly, <span style="background-color: #c0c0c0;">for some reason the AAAA resolution of this name never completes, so getaddrinfo keeps waiting until it times out</span>. By contrast, the A/AAAA queries of keydb-2, which shows no stall, are all handled normally</li>
<li>The other names each show one A request and one AAAA request. This name shows two A requests instead, and <span style="background-color: #c0c0c0;">the second log entry appears several seconds after the first A request</span>. The second one may be a retry because getaddrinfo received no answer</li>
<li>The first three names return NXDOMAIN, but this name returns <a href="/tcp-ip-study-note#dns-rtnmsg">NOERROR</a> in the log. Querying it with nslookup/dig returns SERVFAIL instead; could the CoreDNS log be buggy? Either way, it is worth asking whether the different response code changes getaddrinfo's behavior</li>
</ol>
<div class="blog_h2"><span class="graybg">抓包分析</span></div>
<p>glibc is compiled with optimizations (<a href="https://stackoverflow.com/questions/30089652/glibc-optimizations-required">and has to be</a>), which makes GDB tracing quite time-consuming, so we decided to attack the problem from a different angle. Based on the analysis above, we believed the keydb-1.keydb instance was hitting timeouts or packet loss when sending DNS requests, which a packet capture can confirm:</p>
<pre class="crayon-plain-tag"># 进入keydb-1.keydb的网络命名空间
nsenter -t 449 --net
# Capture packets
tcpdump -i any -vv -nn udp port 53</pre>
<p>The capture shows the following:</p>
<p style="padding-left: 30px;"><span style="background-color: #c0c0c0;">对 keydb-0.keydb.default.svc.cluster.local.  的A请求</span><br /><em> 172.29.1.69.42083 &gt; 10.96.0.10.53: [bad udp cksum 0xb829 -&gt; 0x95aa!] 22719+ A? keydb-0.keydb.default.svc.cluster.local. (68)</em><br /><span style="background-color: #c0c0c0;">CoreDNS应答NXDomain</span><br /><em> 10.96.0.10.53 &gt; 172.29.1.69.42083: [bad udp cksum 0xb886 -&gt; 0x3f57!] 22719 NXDomain*- q: A? keydb-0.keydb.default.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1612073759 7200 1800 86400 30 (161)</em><br /><span style="background-color: #c0c0c0;">对keydb-0.keydb.default.svc.cluster.local.  的AAAA请求，注意，仍然使用之前的UDP套接字</span><br /> 172.29.1.69.42083 &gt; 10.96.0.10.53: [bad udp cksum 0xb829 -&gt; 0x4d76!] 41176+ AAAA? keydb-0.keydb.default.svc.cluster.local. (68)<br /><span style="background-color: #c0c0c0;">CoreDNS应答NXDomain</span><br /> <em>10.96.0.10.53 &gt; 172.29.1.69.42083: [bad udp cksum 0xb886 -&gt; 0xf722!] 41176 NXDomain*- q: AAAA? keydb-0.keydb.default.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1612073759 7200 1800 86400 30 (161)</em><br /><span style="background-color: #c0c0c0;">对 keydb-0.keydb.svc.cluster.local. 的A请求，注意，这里使用了新的UDP套接字</span><br /> <em>172.29.1.69.45508 &gt; 10.96.0.10.53: [bad udp cksum 0xb824 -&gt; 0x3b5e!] 45156+ A? keydb-0.keydb.svc.cluster.local. (63)</em><br /><span style="background-color: #c0c0c0;">CoreDNS应答NXDomain</span><br /> <em>10.96.0.10.53 &gt; 172.29.1.69.45508: [bad udp cksum 0xb881 -&gt; 0x21ce!] 45156 NXDomain*- q: A? keydb-0.keydb.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1612073759 7200 1800 86400 30 (156)</em><br /><span style="background-color: #c0c0c0;">对keydb-0.keydb.svc.cluster.local. 的AAAA请求</span><br /> 172.29.1.69.45508 &gt; 10.96.0.10.53: [bad udp cksum 0xb824 -&gt; 0x8a4e!] 18036+ AAAA? keydb-0.keydb.svc.cluster.local. (63)<br /><span style="background-color: #c0c0c0;">CoreDNS应答NXDomain</span><br /> <em>10.96.0.10.53 &gt; 172.29.1.69.45508: [bad udp cksum 0xb881 -&gt; 0x70be!] 18036 NXDomain*- q: AAAA? keydb-0.keydb.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1612073759 7200 1800 86400 30 (156)</em><br /><span style="background-color: #c0c0c0;">对 keydb-0.keydb.cluster.local. 的A请求</span><br /><em>172.29.1.69.48243 &gt; 10.96.0.10.53: [bad udp cksum 0xb820 -&gt; 0x5054!] 2718+ A? keydb-0.keydb.cluster.local. (59)</em><br /><span style="background-color: #c0c0c0;">CoreDNS应答NXDomain</span><br /> <em>10.96.0.10.53 &gt; 172.29.1.69.48243: [bad udp cksum 0xb87d -&gt; 0x36c4!] 2718 NXDomain*- q: A? keydb-0.keydb.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1612073759 7200 1800 86400 30 (152)</em><br /><span style="background-color: #c0c0c0;">对 keydb-0.keydb.cluster.local. 的AAAA请求</span><br /> <em>172.29.1.69.48243 &gt; 10.96.0.10.53: [bad udp cksum 0xb820 -&gt; 0x5147!] 61098+ AAAA? keydb-0.keydb.cluster.local. (59)</em><br /><span style="background-color: #c0c0c0;">CoreDNS应答NXDomain，这里开始，我们保留时间戳那一行日志</span><br /><em>14:42:15.028168 IP (tos 0x0, ttl 63, id 44636, offset 0, flags [DF], proto UDP (17), length 180)</em><br /><em> 10.96.0.10.53 &gt; 172.29.1.69.48243: [bad udp cksum 0xb87d -&gt; 0x37b7!] 61098 NXDomain*- q: AAAA? keydb-0.keydb.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 
1612073759 7200 1800 86400 30 (152)</em><br /><span style="background-color: #c0c0c0;">对keydb-0.keydb的A请求，这里还没有出现卡顿</span><br /><em>14:42:15.028328 IP (tos 0x0, ttl 64, id 30583, offset 0, flags [DF], proto UDP (17), length 73)</em><br /><em> 172.29.1.69.47652 &gt; 10.96.0.10.53: [bad udp cksum 0xb812 -&gt; 0x181e!] 26682+ A? keydb-0.keydb. (45)</em><br /><span style="background-color: #c0c0c0;">很快接收到CoreDNS的ServFail应答，抓包和我们nslookup/dig的错误码一致，CoreDNS日志显示的应该不正常</span><br /><span style="background-color: #c0c0c0;">猜测“有可能第二次是getaddrinfo没有收到应答而进行的重试”被排除，至少说没收到应答不是网络层面的原因</span><br /><em>14:42:15.028651 IP (tos 0x0, ttl 63, id 44637, offset 0, flags [DF], proto UDP (17), length 73)</em><br /><em> 10.96.0.10.53 &gt; 172.29.1.69.47652: [bad udp cksum 0xb812 -&gt; 0x981b!] 26682 ServFail- q: A? keydb-0.keydb. 0/0/0 (45)</em><br /><span style="background-color: #c0c0c0;">再一次对keydb-0.keydb.的A请求，<strong>注意时间戳，刚好5秒之后</strong>，这是默认DNS请求超时。<strong>还是使用之前的套接字</strong></span><br /><em>14:42:20.029271 IP (tos 0x0, ttl 64, id 33006, offset 0, flags [DF], proto UDP (17), length 73)</em><br /><em> 172.29.1.69.47652 &gt; 10.96.0.10.53: [bad udp cksum 0xb812 -&gt; 0x181e!] 26682+ A? keydb-0.keydb. (45)</em><br /><span style="background-color: #c0c0c0;">很快接收到CoreDNS的ServFail应答</span><br /><em>14:42:20.029812 IP (tos 0x0, ttl 63, id 46397, offset 0, flags [DF], proto UDP (17), length 73)</em><br /><em> 10.96.0.10.53 &gt; 172.29.1.69.47652: [bad udp cksum 0xb812 -&gt; 0x981b!] 26682 ServFail- q: A? keydb-0.keydb. 0/0/0 (45)</em></p>
<p>This analysis convinces us that DNS traffic between the keydb-1.keydb container and CoreDNS is fine. Yet getaddrinfo apparently never sees the first reply for keydb-0.keydb and retries after the 5-second timeout.</p>
<div class="blog_h2"><span class="graybg">Conntrack竞态条件</span></div>
<p>Between tcpdump and the application there is still the netfilter framework. This brought to mind an article read earlier, <a href="/dns-problems-on-k8s">DNS-related problems on Kubernetes</a>: a conntrack race condition can cause 5-second DNS timeouts. Unfortunately, the failure here is unrelated to that race:</p>
<ol>
<li>The <pre class="crayon-plain-tag">insert_failed</pre> counter reported by <pre class="crayon-plain-tag">conntrack -S</pre> is 0</li>
<li>Once the failure appears, every attempt times out for 5s; there is none of the randomness a race condition would show</li>
<li>If the conntrack race were the cause, it could not explain why the first three names resolve normally, nor why configuring a static entry in CoreDNS makes the failure disappear</li>
</ol>
<div class="blog_h1"><span class="graybg">深入理解</span></div>
<div class="blog_h2"><span class="graybg">getaddrinfo</span></div>
<p>For IPv4 we traditionally use <pre class="crayon-plain-tag">gethostbyname</pre> to resolve hostnames to addresses. <pre class="crayon-plain-tag">getaddrinfo</pre> also performs address resolution, but it is protocol independent and works for both IPv4 and IPv6. Its prototype is:</p>
<pre class="crayon-plain-tag">int getaddrinfo(const char* hostname,  // 主机名，可以使用IP地址或者DNS名称
                const char* service,   // 服务名，可以使用端口号或者/etc/services中的服务名
                const struct addrinfo* hints, // 可以NULL，或者一个addrinfo，提示调用者想得到的信息类型
                struct addrinfo** res);  // 解析得到的addrinfo，地址的链表</pre>
<p>The function returns a linked list of socket address information, stored in the addrinfo structure below. The <pre class="crayon-plain-tag">hints</pre> parameter influences getaddrinfo's behavior; the hints are also expressed as an addrinfo structure:</p>
<pre class="crayon-plain-tag">struct addrinfo
{
  // additional hint flags
  int ai_flags;	
  // which address families to query; the default AF_UNSPEC means both IPv4 and IPv6,
  // i.e. A and AAAA queries are issued together
  int ai_family;
  // preferred socket type, e.g. SOCK_STREAM|SOCK_DGRAM; by default any socket type may be returned
  int ai_socktype;
  // protocol of the returned socket addresses
  int ai_protocol;

  // socket address
  socklen_t ai_addrlen;
  struct sockaddr *ai_addr;
  // ...
  // next entry in the linked list
  struct addrinfo *ai_next;
};</pre>
<p>Much software, KeyDB included, calls getaddrinfo with AF_UNSPEC (or with no hints at all, which has the same effect). Many environments have no IPv6 support whatsoever, so this just adds pointless load on the DNS server. It is also why the CoreDNS logs in a Kubernetes cluster are always full of AAAA queries.</p>
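<p>A caller that only needs IPv4 can avoid the AAAA traffic entirely by restricting the hint to AF_INET. A minimal sketch of that pattern (this is not how KeyDB itself is written):</p>
<pre class="crayon-plain-tag">#include &lt;netdb.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;

/* Resolve IPv4 addresses only: a single A query is sent, no AAAA. */
static int resolve_ipv4_only(const char *host, const char *port, struct addrinfo **res) {
    struct addrinfo hints;
    memset(&amp;hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;       /* instead of AF_UNSPEC */
    hints.ai_socktype = SOCK_STREAM;
    return getaddrinfo(host, port, &amp;hints, res);
}</pre>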
<div class="blog_h3"><span class="graybg">解析流程概览</span></div>
<p>KeyDB 5.3.3 ships with glibc 2.27. The getaddrinfo function is far too long to paste here, so here is a rough outline:</p>
<ol>
<li>If possible, it consults the DNS cache service via /var/run/nscd/socket; we do not run this service</li>
<li>It initializes the NSS hosts database; if nothing is configured in the file, the default is <pre class="crayon-plain-tag">hosts: dns [!UNAVAIL=return] files</pre>. In our environment the configuration is <pre class="crayon-plain-tag">hosts: files dns</pre></li>
<li>通过<a href="/linux-faq#nss">NSS</a>进行名字查询，实际上是调用<pre class="crayon-plain-tag">gethostbyname4_r</pre>函数：
<ol>
<li>The files source is consulted via <pre class="crayon-plain-tag">_nss_files_gethostbyname4_r</pre>, i.e. /etc/hosts is opened and searched. In a Kubernetes container, /etc/hosts only contains entries for the current Pod, so the files source never matches</li>
<li>Then the dns source is consulted, via <pre class="crayon-plain-tag">_nss_dns_gethostbyname4_r</pre>:
<ol>
<li>It reads /etc/resolv.conf to build a <pre class="crayon-plain-tag">resolv_context</pre>. In our environment the file contains:<br />
<pre class="crayon-plain-tag">nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5</pre>
</li>
<li>
<p>It calls <pre class="crayon-plain-tag">__res_context_search</pre> to run the DNS lookup logic. This <span style="background-color: #c0c0c0;">appends the search domains above as suffixes, producing multiple candidate names which are tried one by one. Each failed name is retried, preferably against a different DNS server. A and AAAA queries may be issued in parallel</span></p>
</li>
</ol>
</li>
</ol>
</li>
</ol>
<div class="blog_h3"><span class="graybg">DNS搜索逻辑</span></div>
<p>__res_context_search() first counts the dots in the name to be looked up. If <span style="background-color: #c0c0c0;">the name ends with a dot, or the number of dots is greater than or equal to ndots</span>, it calls <pre class="crayon-plain-tag">__res_context_querydomain</pre> directly to query the DNS server; that function issues the A and AAAA queries together.</p>
<p>Otherwise, it appends each search domain from /etc/resolv.conf to the name and queries the DNS server multiple times. In our environment, with keydb-0.keydb as the name to resolve, getaddrinfo tries in order:</p>
<p style="padding-left: 30px;">keydb-0.keydb.default.svc.cluster.local. <br />keydb-0.keydb.svc.cluster.local. <br />keydb-0.keydb.cluster.local.<br />keydb-0.keydb.</p>
<p>Note that:</p>
<ol>
<li>Sending the requests to the DNS server is still handled by __res_context_querydomain()</li>
<li>As soon as one lookup succeeds, the function returns immediately and does not try further search domains</li>
<li><span style="background-color: #c0c0c0;">不加修饰的原始名字，会放在最后尝试</span></li>
</ol>
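<p>A small sketch of the difference point 3 alludes to: passing the name with a trailing dot makes getaddrinfo treat it as fully qualified and skip the search-domain expansion altogether. The hostname and port are simply the ones from our scenario:</p>
<pre class="crayon-plain-tag">#include &lt;netdb.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;

/* Watch the CoreDNS log or tcpdump while this runs to compare the number of queries. */
static void lookup(const char *name) {
    struct addrinfo hints, *res = NULL;
    memset(&amp;hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    int rv = getaddrinfo(name, "6379", &amp;hints, &amp;res);
    printf("%-20s: %s\n", name, rv ? gai_strerror(rv) : "resolved");
    if (rv == 0) freeaddrinfo(res);
}

int main(void) {
    lookup("keydb-0.keydb");    /* expanded with every search domain; the bare name is tried last */
    lookup("keydb-0.keydb.");   /* trailing dot: queried as-is, no search-domain expansion */
    return 0;
}</pre>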
<p>In Kubernetes, *.cluster.local names are normally answered by CoreDNS itself and therefore quickly. How fast keydb-0.keydb. is handled depends on the outside world if CoreDNS is configured with an upstream DNS server.</p>
<p>__res_context_querydomain merely concatenates name and domain when the domain argument is non-empty, and then calls <pre class="crayon-plain-tag">__res_context_query</pre>.</p>
<div class="blog_h3"><span class="graybg">DNS查询过程</span></div>
<p>__res_context_query handles the interaction with the DNS server and completes <span style="background-color: #c0c0c0;">the DNS lookup for a single name</span>. It calls <pre class="crayon-plain-tag">__res_context_mkquery</pre> to <span style="background-color: #c0c0c0;">build a query (a DNS message), sends it</span>, and then waits for the answer. This is a blocking process. KeyDB calls getaddrinfo on a code path it expects to be non-blocking, with no caching whatsoever, and while holding a lock; in my view that is unwise. It means slow or unavailable DNS severely degrades KeyDB's quality of service.</p>
<p>The code that actually sends DNS requests lives in <pre class="crayon-plain-tag">__res_context_send</pre>, and the <span style="background-color: #c0c0c0;">retry logic</span> is implemented there. In our environment the retry count is 2, which explains the two A queries for keydb-0.keydb.:</p>
<pre class="crayon-plain-tag">//                  重试次数，statp-&gt;retry为2
for (try = 0; try &lt; statp-&gt;retry; try++) {
    // if there are multiple DNS servers, retries rotate through them
    for (unsigned ns_shift = 0; ns_shift &lt; statp-&gt;nscount; ns_shift++)
    {
    unsigned int ns = ns_shift + ns_offset;
    if (ns &gt;= statp-&gt;nscount)
        ns -= statp-&gt;nscount;

    same_ns:
    if (__glibc_unlikely (v_circuit)) {
        // ...
    } else {
        // send the request over UDP
        n = send_dg(statp, buf, buflen, buf2, buflen2,
                &amp;ans, &amp;anssiz, &amp;terrno,
                ns, &amp;v_circuit, &amp;gotsomewhere, ansp,
                ansp2, nansp2, resplen2, ansp2_malloced);
        if (n &lt; 0)
            return (-1);
        if (n == 0 &amp;&amp; (buf2 == NULL || *resplen2 == 0))
            // with multiple DNS servers, try the next one
            goto next_ns;
        // ...
    }
    return (resplen);
next_ns: ;
   } /*foreach ns*/
} /*foreach retry*/</pre>
<p>DNS queries normally go over UDP, so <pre class="crayon-plain-tag">send_dg</pre> is called. In our scenario both attempts hit the 5-second timeout (even though the packet capture shows the replies arriving promptly), so __res_context_send sets errno to ETIMEDOUT and returns -1:</p>
<pre class="crayon-plain-tag">__res_iclose(statp, false);
if (!v_circuit) {
    if (!gotsomewhere)
        __set_errno (ECONNREFUSED);	/* no nameservers found */
    else
        __set_errno (ETIMEDOUT);	/* no answer obtained */
} else
    __set_errno (terrno);
return (-1);</pre>
<p>Its caller __res_context_query, on seeing a return value of -1, sets the error code TRY_AGAIN. That is why the KeyDB log shows the error "The name server returned a temporary failure indication":</p>
<pre class="crayon-plain-tag">if (n &lt; 0) {
    RES_SET_H_ERRNO(statp, TRY_AGAIN);
    return (n);
}</pre>
<div class="blog_h3"><span class="graybg">缓慢之源</span></div>
<p>The root of the slowness is the send_dg function, which blocks for 5 seconds. Its prototype is:</p>
<pre class="crayon-plain-tag">// 如果没有错误，返回第一个应答的字节数
// 对于可恢复错误，返回0；对于不可恢复错误，返回负数
static int send_dg(
    // 各种选项、DNS服务器列表、指向DNS服务器的套接字（文件描述符）
    res_state statp,
    // 查询请求1的缓冲区 和 长度
	const u_char *buf, int buflen, 
    // 查询请求2的缓冲区 和 长度
    const u_char *buf2, int buflen2,
    // 收到的第1个应答   和 最大长度
	u_char **ansp,      int *anssizp,
    // 出现错误时，将errno设置到此字段
	int *terrno, 
    // 使用的DNS服务器的序号
    int ns, 
    // 如果由于UDP数据报的限制而导致截断，则v_circuit设置为1，提示调用者使用TCP方式重试
    int *v_circuit, 
    // 提示访问DNS服务器时，是拒绝服务还是超时。如果是超时则设置为1
    int *gotsomewhere,
    // 提示遇到超长应答的时候，是否重新分配缓冲区
    u_char **anscp,
    // 收到的第2个应答 和 最大长度
	u_char **ansp2, int *anssizp2, 
    // 第2个应答的实际长度 是否为第2个应答重新分配了缓冲区
    int *resplen2, int *ansp2_malloced);</pre>
<p>The function sends DNS queries to the DNS server with the given index. It supports IPv4 and IPv6 lookups at the same time: you can pass two queries, in the buf and buf2 parameters. <span style="background-color: #c0c0c0;">When two queries are supplied, they are sent in parallel by default</span>. The option <pre class="crayon-plain-tag">RES_SNGLKUP</pre> forces them to be sent serially; the option <pre class="crayon-plain-tag">RES_SNGLKUPREOP</pre> <span style="background-color: #c0c0c0;">also forces serial sending and additionally closes and reopens the socket each time</span>, which allows working with certain misbehaving DNS servers.</p>
<p>Because the queries may be sent in parallel, the order in which answers arrive is not deterministic. <span style="background-color: #c0c0c0;">The answer that arrives first</span> is stored in ansp, whose maximum length is given by anssizp. The anscp parameter controls what happens when an answer is too long:</p>
<ol>
<li>If anscp is non-NULL: a new buffer is allocated automatically, and both ansp and anscp are updated to point to it</li>
<li>If anscp is NULL: the excess is truncated and the TC bit in the DNS header is set to 1</li>
</ol>
<p>The complete implementation of send_dg in glibc 2.27-3ubuntu1 is as follows:</p>
<pre class="crayon-plain-tag">static int
send_dg(res_state statp,
	const u_char *buf, int buflen, const u_char *buf2, int buflen2,
	u_char **ansp, int *anssizp,
	int *terrno, int ns, int *v_circuit, int *gotsomewhere, u_char **anscp,
	u_char **ansp2, int *anssizp2, int *resplen2, int *ansp2_malloced)
{
	const HEADER *hp = (HEADER *) buf;
	const HEADER *hp2 = (HEADER *) buf2;
	struct timespec now, timeout, finish;
	struct pollfd pfd[1];
	int ptimeout;
	struct sockaddr_in6 from;
	int resplen = 0;
	int n;

	/*
	 * Compute time for the total operation.
	 */
	int seconds = (statp-&gt;retrans &lt;&lt; ns); // 0. compute the total timeout
	if (ns &gt; 0)
		seconds /= statp-&gt;nscount;
	if (seconds &lt;= 0)
		seconds = 1;
	bool single_request_reopen = (statp-&gt;options &amp; RES_SNGLKUPREOP) != 0; // 0. determine whether requests are sent in parallel
	bool single_request = (((statp-&gt;options &amp; RES_SNGLKUP) != 0)
			       | single_request_reopen);
	int save_gotsomewhere = *gotsomewhere;

	int retval;
 retry_reopen: // tx1. create the socket if it does not exist yet: SOCK_DGRAM | SOCK_NONBLOCK | SOCK_CLOEXEC, i.e. non-blocking
	retval = reopen (statp, terrno, ns); // tx1. then connect() it, without sending any data
	if (retval &lt;= 0)
	  {
	    if (resplen2 != NULL)
	      *resplen2 = 0;
	    return retval;
	  }
 retry:
	evNowTime(&amp;now);
	evConsTime(&amp;timeout, seconds, 0);
	evAddTime(&amp;finish, &amp;now, &amp;timeout);
	int need_recompute = 0;
	int nwritten = 0;
	int recvresp1 = 0;  // marks whether the answer to query 1 has been received
	/* Skip the second response if there is no second query.
	   To do that we mark the second response as received.  */
	int recvresp2 = buf2 == NULL; // marks whether the answer to query 2 has been received; set to 1 right away if buf2 is NULL
	pfd[0].fd = EXT(statp).nssocks[ns];
	pfd[0].events = POLLOUT; // tx2. get ready to wait for writability
 wait:
	if (need_recompute) {
	recompute_resend:
		evNowTime(&amp;now);
		if (evCmpTime(finish, now) &lt;= 0) {
		poll_err_out:
			return close_and_return_error (statp, resplen2);
		}
		evSubTime(&amp;timeout, &amp;finish, &amp;now);
		need_recompute = 0;
	}
	/* Convert struct timespec in milliseconds.  */
	ptimeout = timeout.tv_sec * 1000 + timeout.tv_nsec / 1000000;

	n = 0;
	if (nwritten == 0)
	  n = __poll (pfd, 1, 0); // tx2. wait for the socket to become writable
	if (__glibc_unlikely (n == 0))       {
		n = __poll (pfd, 1, ptimeout); // rx1. wait for the socket to become readable, 5-second timeout
		need_recompute = 1;
	}
	if (n == 0) {
		if (resplen &gt; 1 &amp;&amp; (recvresp1 || (buf2 != NULL &amp;&amp; recvresp2)))
		  { // handle DNS servers that cannot cope with two outstanding (parallel) requests
		    /* There are quite a few broken name servers out
		       there which don't handle two outstanding
		       requests from the same source.  There are also
		       broken firewall settings.  If we time out after
		       having received one answer switch to the mode
		       where we send the second request only once we
		       have received the first answer.  */
		    if (!single_request)
		      {
			statp-&gt;options |= RES_SNGLKUP; // permanently switches to serial requests; statp is thread-local,
			single_request = true;         // and KeyDB's replication cron always runs on the same thread
			*gotsomewhere = save_gotsomewhere;
			goto retry;
		      }
		    else if (!single_request_reopen)
		      {
			statp-&gt;options |= RES_SNGLKUPREOP;
			single_request_reopen = true;
			*gotsomewhere = save_gotsomewhere;
			__res_iclose (statp, false);
			goto retry_reopen;
		      }

		    *resplen2 = 1;
		    return resplen;
		  }

		*gotsomewhere = 1;
		if (resplen2 != NULL)
		  *resplen2 = 0;
		return 0;
	}
	if (n &lt; 0) {
		if (errno == EINTR)
			goto recompute_resend;

		goto poll_err_out;
	}
	__set_errno (0);
	if (pfd[0].revents &amp; POLLOUT) { // tx3. 监听到可写事件
#ifndef __ASSUME_SENDMMSG
		static int have_sendmmsg;
#else
# define have_sendmmsg 1
#endif
		if (have_sendmmsg &gt;= 0 &amp;&amp; nwritten == 0 &amp;&amp; buf2 != NULL // 查询请求2不为空
		    &amp;&amp; !single_request) // 且允许并行发送
		  {
		    struct iovec iov[2];
		    struct mmsghdr reqs[2];
		    reqs[0].msg_hdr.msg_name = NULL;
		    reqs[0].msg_hdr.msg_namelen = 0;
		    reqs[0].msg_hdr.msg_iov = &amp;iov[0];
		    reqs[0].msg_hdr.msg_iovlen = 1;
		    iov[0].iov_base = (void *) buf;
		    iov[0].iov_len = buflen;
		    reqs[0].msg_hdr.msg_control = NULL;
		    reqs[0].msg_hdr.msg_controllen = 0;

		    reqs[1].msg_hdr.msg_name = NULL;
		    reqs[1].msg_hdr.msg_namelen = 0;
		    reqs[1].msg_hdr.msg_iov = &amp;iov[1];
		    reqs[1].msg_hdr.msg_iovlen = 1;
		    iov[1].iov_base = (void *) buf2;
		    iov[1].iov_len = buflen2;
		    reqs[1].msg_hdr.msg_control = NULL;
		    reqs[1].msg_hdr.msg_controllen = 0;
            // send the messages; note that both queries are sent at once, and the return value is how many were actually sent
		    int ndg = __sendmmsg (pfd[0].fd, reqs, 2, MSG_NOSIGNAL);
		    if (__glibc_likely (ndg == 2))
		      {
			if (reqs[0].msg_len != buflen
			    || reqs[1].msg_len != buflen2)
			  goto fail_sendmmsg;

			pfd[0].events = POLLIN;
			nwritten += 2;
		      }
		    else if (ndg == 1 &amp;&amp; reqs[0].msg_len == buflen)
		      goto just_one;
		    else if (ndg &lt; 0 &amp;&amp; (errno == EINTR || errno == EAGAIN))
		      goto recompute_resend;
		    else
		      {
#ifndef __ASSUME_SENDMMSG
			if (__glibc_unlikely (have_sendmmsg == 0))
			  {
			    if (ndg &lt; 0 &amp;&amp; errno == ENOSYS)
			      {
				have_sendmmsg = -1;
				goto try_send;
			      }
			    have_sendmmsg = 1;
			  }
#endif

		      fail_sendmmsg:
			return close_and_return_error (statp, resplen2);
		      }
		  }
		else
		  { // parallel sending is not possible
		    ssize_t sr;
#ifndef __ASSUME_SENDMMSG
		  try_send:
#endif
		    if (nwritten != 0)
		      sr = send (pfd[0].fd, buf2, buflen2, MSG_NOSIGNAL);
		    else
		      sr = send (pfd[0].fd, buf, buflen, MSG_NOSIGNAL); // tx4. send query 1

		    if (sr != (nwritten != 0 ? buflen2 : buflen)) { // the number of bytes sent does not match the buffer length
		      if (errno == EINTR || errno == EAGAIN) // on EINTR or EAGAIN, try resending
			goto recompute_resend;
		      return close_and_return_error (statp, resplen2);
		    }
		  just_one:
		    if (nwritten != 0 || buf2 == NULL || single_request)
		      pfd[0].events = POLLIN;  // in serial mode, only readability needs to be watched from here on
		    else
		      pfd[0].events = POLLIN | POLLOUT; // parallel mode with only 1 message actually sent jumps here; the unsent one still has to be written
		    ++nwritten;
		  }
		goto wait; // tx4. sending done, go back to the wait label above to await the answer
	} else if (pfd[0].revents &amp; POLLIN) { // rx2. the socket became readable
		int *thisanssizp; // which buffer this read goes into
		u_char **thisansp;
		int *thisresplenp;

		if ((recvresp1 | recvresp2) == 0 || buf2 == NULL) {
			/* We have not received any responses
			   yet or we only have one response to
			   receive.  */
			thisanssizp = anssizp;
			thisansp = anscp ?: ansp;
			assert (anscp != NULL || ansp2 == NULL);
			thisresplenp = &amp;resplen;
		} else {
			thisanssizp = anssizp2;
			thisansp = ansp2;
			thisresplenp = resplen2;
		}

		if (*thisanssizp &lt; MAXPACKET
		    /* If the current buffer is not the the static
		       user-supplied buffer then we can reallocate
		       it.  */
		    &amp;&amp; (thisansp != NULL &amp;&amp; thisansp != ansp)
#ifdef FIONREAD
		    /* Is the size too small?  */
		    &amp;&amp; (ioctl (pfd[0].fd, FIONREAD, thisresplenp) &lt; 0
			|| *thisanssizp &lt; *thisresplenp)
#endif
                    ) {
			/* Always allocate MAXPACKET, callers expect
			   this specific size.  */
			u_char *newp = malloc (MAXPACKET);
			if (newp != NULL) {
				*thisanssizp = MAXPACKET;
				*thisansp = newp;
				if (thisansp == ansp2)
				  *ansp2_malloced = 1;
			}
		}
		/* We could end up with truncation if anscp was NULL
		   (not allowed to change caller's buffer) and the
		   response buffer size is too small.  This isn't a
		   reliable way to detect truncation because the ioctl
		   may be an inaccurate report of the UDP message size.
		   Therefore we use this only to issue debug output.
		   To do truncation accurately with UDP we need
		   MSG_TRUNC which is only available on Linux.  We
		   can abstract out the Linux-specific feature in the
		   future to detect truncation.  */
		HEADER *anhp = (HEADER *) *thisansp;
		socklen_t fromlen = sizeof(struct sockaddr_in6);
		assert (sizeof(from) &lt;= fromlen);
		*thisresplenp = recvfrom(pfd[0].fd, (char*)*thisansp, // rx3. read the answer
					 *thisanssizp, 0,
					(struct sockaddr *)&amp;from, &amp;fromlen);
		if (__glibc_unlikely (*thisresplenp &lt;= 0))       {
			if (errno == EINTR || errno == EAGAIN) {
				need_recompute = 1;
				goto wait;  // on EINTR|EAGAIN, go back to waiting
			}
			return close_and_return_error (statp, resplen2);
		}
		*gotsomewhere = 1;
		if (__glibc_unlikely (*thisresplenp &lt; HFIXEDSZ))       { // message shorter than the header, an error
			/*
			 * Undersized message.
			 */
			*terrno = EMSGSIZE;
			return close_and_return_error (statp, resplen2);
		}
		if ((recvresp1 || hp-&gt;id != anhp-&gt;id)
		    &amp;&amp; (recvresp2 || hp2-&gt;id != anhp-&gt;id)) { // query ID mismatch; a slow server may be returning the answer to an earlier query
			/*
			 * response from old query, ignore it.
			 * XXX - potential security hazard could
			 *	 be detected here.
			 */
			goto wait;
		}
		if (!(statp-&gt;options &amp; RES_INSECURE1) &amp;&amp; // security check, type 1
		    !res_ourserver_p(statp, &amp;from)) {
			/*
			 * response from wrong server? ignore it.
			 * XXX - potential security hazard could
			 *	 be detected here.
			 */
			goto wait;
		}
		if (!(statp-&gt;options &amp; RES_INSECURE2) // security check, type 2
		    &amp;&amp; (recvresp1 || !res_queriesmatch(buf, buf + buflen,
						       *thisansp,
						       *thisansp
						       + *thisanssizp))
		    &amp;&amp; (recvresp2 || !res_queriesmatch(buf2, buf2 + buflen2,
						       *thisansp,
						       *thisansp
						       + *thisanssizp))) {
			/*
			 * response contains wrong query? ignore it.
			 * XXX - potential security hazard could
			 *	 be detected here.
			 */
			goto wait;
		}
		if (anhp-&gt;rcode == SERVFAIL ||
		    anhp-&gt;rcode == NOTIMP ||
		    anhp-&gt;rcode == REFUSED) {  //  rx4. handle the server declining to process the request
		next_ns:
			if (recvresp1 || (buf2 != NULL &amp;&amp; recvresp2)) {
			  *resplen2 = 0;
			  return resplen;
			}
			if (buf2 != NULL)
			  {
			    /* No data from the first reply.  */
			    resplen = 0;
			    /* We are waiting for a possible second reply.  */
			    if (hp-&gt;id == anhp-&gt;id)
			      recvresp1 = 1;
			    else
			      recvresp2 = 1;

			    goto wait;  // the event type is still POLLIN, which will lead to a timeout
			  }

			/* don't retry if called from dig */
			if (!statp-&gt;pfcode)
			  return close_and_return_error (statp, resplen2);
			__res_iclose(statp, false);
		}
		if (anhp-&gt;rcode == NOERROR &amp;&amp; anhp-&gt;ancount == 0 // rx4. handle the NODATA case: the name exists but the requested record type does not
		    &amp;&amp; anhp-&gt;aa == 0 &amp;&amp; anhp-&gt;ra == 0 &amp;&amp; anhp-&gt;arcount == 0) {
			goto next_ns;
		}
		if (!(statp-&gt;options &amp; RES_IGNTC) &amp;&amp; anhp-&gt;tc) { // rx4. handle a truncated answer
			/*
			 * To get the rest of answer,
			 * use TCP with same server.
			 */
			*v_circuit = 1; // tell the caller to resend the request over TCP
			__res_iclose(statp, false);
			// XXX if we have received one reply we could
			// XXX use it and not repeat it over TCP...
			if (resplen2 != NULL)
			  *resplen2 = 0;
			return (1);
		}
		/* Mark which reply we received.  */
		if (recvresp1 == 0 &amp;&amp; hp-&gt;id == anhp-&gt;id)
			recvresp1 = 1;
		else
			recvresp2 = 1;
		/* Repeat waiting if we have a second answer to arrive.  */
		if ((recvresp1 &amp; recvresp2) == 0) { // with only one query, recvresp2 starts out as 1, so this branch is never taken
			if (single_request) { // in serial mode, this is where handling of the second query starts
				pfd[0].events = POLLOUT;
				if (single_request_reopen) {  // if the socket must be closed and reopened
					__res_iclose (statp, false);
					retval = reopen (statp, terrno, ns);
					if (retval &lt;= 0)
					  {
					    if (resplen2 != NULL)
					      *resplen2 = 0;
					    return retval;
					  }
					pfd[0].fd = EXT(statp).nssocks[ns];
				}
			}
			goto wait;  // the event type has been switched to POLLOUT, so no timeout occurs
		}
		/* All is well.  We have received both responses (if
		   two responses were requested).  */
		return (resplen); // rx5. DNS query complete
	} else if (pfd[0].revents &amp; (POLLERR | POLLHUP | POLLNVAL)) // poll reported an error
	  /* Something went wrong.  We can stop trying.  */
	  return close_and_return_error (statp, resplen2);
	else {
		/* poll should not have returned &gt; 0 in this case.  */
		abort ();
	}
}</pre>
<p>The tx. comments mark the basic flow of sending the DNS queries, and the rx. comments mark the flow of receiving the answers. Debugging this function while it resolves keydb-0.keydb reveals the following:</p>
<ol>
<li>The queries are sent serially, not in parallel. The normal flow would therefore be: send the A query, receive the A answer, send the AAAA query, receive the AAAA answer</li>
<li>Only line 1225 executed, not line 1223. That is, only the A query was sent, never the AAAA query</li>
<li>Execution reached the branch at line 1241, which means <span style="background-color: #c0c0c0;">the answer to the A query was received</span>:<br />
<pre class="crayon-plain-tag">// (gdb) i r eax
// eax            0x2d     45   the A answer is 45 bytes long
		*thisresplenp = recvfrom(pfd[0].fd, (char*)*thisansp,
					 *thisanssizp, 0,
					(struct sockaddr *)&amp;from, &amp;fromlen);</pre></p>
<p>Because the answer received is SERVFAIL, this branch is taken:</p>
<pre class="crayon-plain-tag">if (anhp-&gt;rcode == SERVFAIL ||
		    anhp-&gt;rcode == NOTIMP ||
		    anhp-&gt;rcode == REFUSED) {
		next_ns:
			if (recvresp1 || (buf2 != NULL &amp;&amp; recvresp2)) {
			  *resplen2 = 0;
			  return resplen;
			}
			if (buf2 != NULL)
			  {
			    /* No data from the first reply.  */
			    resplen = 0;
			    /* We are waiting for a possible second reply.  */
			    if (hp-&gt;id == anhp-&gt;id)
			      recvresp1 = 1;  // the first answer has been received
			    else
			      recvresp2 = 1;
                // Both A and AAAA lookups are needed, but only the A answer has been received (serial mode),
			    goto wait; // so we jump back here, expecting to wait for writability in order to send the AAAA query
			  }</pre>
</li>
<li>
<p>CoreDNS answered the A query with SERVFAIL, and we jump back to the wait label:
<pre class="crayon-plain-tag">if (need_recompute) { // 等待A应答的时候，设置了超时 need_recompute，因此再次wait执行这个分支
	recompute_resend:
		evNowTime(&amp;now);
		if (evCmpTime(finish, now) &lt;= 0) {
		poll_err_out: // if we have timed out, close the socket and return an error
			return close_and_return_error (statp, resplen2);
		}
		evSubTime(&amp;timeout, &amp;finish, &amp;now);
		need_recompute = 0;
	}
	/* Convert struct timespec in milliseconds.  */
	ptimeout = timeout.tv_sec * 1000 + timeout.tv_nsec / 1000000;

	n = 0;
	if (nwritten == 0)
	  n = __poll (pfd, 1, 0);  // when sending the A query we poll here for writability; timeout 0 means return immediately
	if (__glibc_unlikely (n == 0))       {
		n = __poll (pfd, 1, ptimeout);  // when receiving the A answer we poll here for readability
		need_recompute = 1; // when the AAAA query is due to be sent, this is also where we end up waiting
	}</pre>
</li>
<li>
<p>At this point nwritten is already 1, so the poll with a timeout is taken. The 5-second timeout then occurs at line 1110, and since poll returns 0, send_dg exits. During the handling of one A query, line 1110 polls twice:
<ol>
<li>The first poll waits for the answer to the A query; before it, pollfd is {fd = 87, events = 1, revents = 4}, and afterwards {fd = 87, events = 1, revents = 1}</li>
<li>The second poll is due to the jump described above; before it, pollfd is {fd = 87, events = 1, revents = 1}, and after the timeout {fd = 87, events = 1, revents = 0}</li>
</ol>
</li>
</ol>
<p>The prototype of poll is <pre class="crayon-plain-tag">int poll(struct pollfd *fds, nfds_t nfds, int timeout);</pre>. It waits until one of a set of file descriptors becomes ready (can perform I/O). The set is given by fds, an array of pollfd structures:</p>
<pre class="crayon-plain-tag">struct pollfd {
    // descriptor of an open file
    int   fd;         
    // input: the event types the application cares about; if zero, revents can only report POLLHUP, POLLERR, POLLNVAL
    short events;
    // output: the kernel fills in the events that actually occurred
    short revents;    
};</pre>
<p>If none of the descriptors in the set has any of the requested events pending, the function blocks until the timeout expires or it is interrupted by a signal handler.</p>
<p><span style="background-color: #c0c0c0;">事件类型1表示POLLIN，即有数据可读；事件类型4表示POLLOUT</span>，即文件描述符可写。正常情况下该函数返回就绪的（revents非零）文件描述符数量，超时返回0，出现错误则返回-1</p>
<p>The second poll at line 1110 is hard to make sense of:</p>
<ol>
<li>The A answer has already been received, and since A/AAAA are being sent serially, the AAAA query has not been sent yet; we can therefore <span style="background-color: #c0c0c0;">expect no further readable events</span></li>
<li>The poll is made with events set to POLLIN (which is bound to time out); should it not be POLLOUT instead, so that the AAAA query could be sent or the A query retried?</li>
</ol>
<p>As a control, we also debugged keydb-2.keydb, which shows no slowness. Its second poll at line 1110 does not time out, with pollfd at {fd = 88, events = 1, revents = 1}. Two consecutive polls returning readable events suggest the A/AAAA queries were sent in parallel. Sure enough, single_request_reopen and single_request are both false, and the CoreDNS log shows both the A and AAAA queries.</p>
<p>The likely story is that keydb-1.keydb initially sent A/AAAA queries in parallel and at some point switched to serial sending, which then produced the 5-second-timeout slowness. The root cause must lie in glibc, because the way KeyDB calls getaddrinfo never changes.</p>
<p>Looking back at the send_dg code, <pre class="crayon-plain-tag">statp-&gt;options</pre> decides whether queries are sent serially. statp is a field of the <pre class="crayon-plain-tag">resolv_context</pre>, which is a thread-local variable. If, after a parallel send, the first answer arrives but waiting for the second one times out (line 1113), send_dg modifies statp-&gt;options to switch to serial sending. <span style="background-color: #c0c0c0;">This change has a lasting effect</span>: from then on, every getaddrinfo call made by KeyDB's replication cron (which always runs on the same thread) sends its queries serially.</p>
<p>Once in serial mode, CoreDNS answering keydb-0.keydb. with SERVFAIL causes the jump to the wait label (line 1363) and hence a poll call that is guaranteed to time out. For the other names (with search-domain suffixes appended), CoreDNS answers NXDOMAIN, which does not lead to a timing-out poll, <span style="background-color: #c0c0c0;">because line 1396 switches the event type to POLLOUT</span>.</p>
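<p>For reference, RES_SNGLKUP and RES_SNGLKUPREOP are the flags behind the resolv.conf options shown below. Note that setting them up front would not have avoided this incident, since the timeout happens precisely on the serial code path; the snippet only documents where that mode normally comes from.</p>
<pre class="crayon-plain-tag"># /etc/resolv.conf
options single-request         # sets RES_SNGLKUP: send A and AAAA one after the other
options single-request-reopen  # sets RES_SNGLKUPREOP: additionally reopen the socket in between</pre>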
<div class="blog_h1"><span class="graybg">解决方案</span></div>
<p>Triggering the glibc defect described here requires all of the following:</p>
<ol>
<li>Some KeyDB node is down and stays down. This makes the replication cron perform DNS lookups over and over, which can trigger the defect</li>
<li>The UDP answer to some DNS query is lost, switching the current thread to serial DNS requests. Given UDP's inherent unreliability, this will eventually happen as the process keeps running</li>
<li>The DNS server answers with SERVFAIL, NOTIMP, or REFUSED</li>
</ol>
<p>Conditions 1 and 2 are random and beyond our control, so condition 3 is the only one we can attack. The quickest fix is simply to configure KeyDB to specify replicaof with a fully qualified domain name.</p>
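<p>As a concrete illustration of that workaround (the default namespace and port 6379 are assumptions for this example; adjust to the actual deployment):</p>
<pre class="crayon-plain-tag"># keydb.conf: refer to the master by a fully qualified in-cluster name.
# With 5 dots the name reaches ndots, so glibc queries it directly and never
# falls back to the bare keydb-0.keydb name that CoreDNS answers with SERVFAIL.
replicaof keydb-0.keydb.default.svc.cluster.local 6379</pre>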
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/debugging-slow-keydb">Tracking down a KeyDB slowdown</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/debugging-slow-keydb/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>ClusterIP leaking host ports under IPVS mode</title>
		<link>https://blog.gmem.cc/nodeport-leak-under-ipvs-mode</link>
		<comments>https://blog.gmem.cc/nodeport-leak-under-ipvs-mode#comments</comments>
		<pubDate>Tue, 05 Jan 2021 10:50:51 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[C]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Network]]></category>
		<category><![CDATA[PaaS]]></category>
		<category><![CDATA[IPVS]]></category>
		<category><![CDATA[K8S]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=35061</guid>
		<description><![CDATA[<p>The problem: In a Kubernetes cluster with kube-proxy in IPVS mode, a Docker Registry service is running. When we tried to push a manifest with the docker manifest command (passing --insecure), we got a TLS timeout error. The Registry is exposed through a ClusterIP Service with only an HTTP/80 port. The --insecure flag of docker manifest means that plain HTTP is allowed when the Registry does not support HTTPS; judging from the error, docker manifest clearly believed the Registry supported HTTPS. On the host, telnet RegistryClusterIP 443 surprisingly connected. It turned out the only thing using port 443 on the node was the NodePort-type Service of the Ingress Controller, listening on 0.0.0.0. After deleting that NodePort Service, RegistryClusterIP:443 stopped connecting and docker manifest worked again. Definition: If kube-proxy runs in IPVS mode and the host listens on 0.0.0.0:NonServicePort, then from the host or from inside a Pod, any ClusterIP:NonServicePort reaches the host's NonServicePort. This is clearly not the expected behavior. Cause: With IPVS enabled, every ClusterIP is bound to the virtual interface kube-ipvs0. For example, for the kube-dns Service's ClusterIP 10.96.0.10 (ServicePorts TCP 53 / TCP 9153): <a class="read-more" href="https://blog.gmem.cc/nodeport-leak-under-ipvs-mode">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/nodeport-leak-under-ipvs-mode">IPVS模式下ClusterIP泄露宿主机端口的问题</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><div class="blog_h1"><span class="graybg">The problem</span></div>
<p>In a Kubernetes cluster with kube-proxy in IPVS mode, a Docker Registry service is running. When we tried to push a manifest with the docker manifest command (passing --insecure), we got a TLS timeout error.</p>
<p>The Registry is exposed through a Service of type ClusterIP, with only an HTTP/80 port configured. The --insecure flag of docker manifest means that insecure HTTP is allowed when the Registry does not support HTTPS. Judging from the error, docker manifest clearly believed the Registry supported HTTPS.</p>
<p>On the host, <pre class="crayon-plain-tag">telnet RegistryClusterIP 443</pre> surprisingly connected. It turned out that the only thing using port 443 on the node was the NodePort-type Service of the Ingress Controller, which listens on 0.0.0.0. After deleting that NodePort Service, RegistryClusterIP:443 stopped connecting and docker manifest worked again.</p>
<div class="blog_h1"><span class="graybg">定义</span></div>
<p>If kube-proxy runs in IPVS mode and the host listens on 0.0.0.0:NonServicePort, then from the host or from inside a Pod, any ClusterIP:NonServicePort reaches the host's NonServicePort.</p>
<p>This is clearly not the expected behavior. Only ports declared in the Service object should be reachable through the ClusterIP; for unknown ports on a ClusterIP, the kernel should drop the packets or return an appropriate ICMP error.</p>
<p>With kube-proxy in iptables mode, this anomaly does not occur.</p>
<div class="blog_h1"><span class="graybg">原因</span></div>
<p>With IPVS enabled, every ClusterIP is bound to the virtual network interface kube-ipvs0. For example, for the kube-dns Service's ClusterIP 10.96.0.10 (ServicePorts TCP 53 / TCP 9153):</p>
<pre class="crayon-plain-tag">5: kube-ipvs0: &lt;BROADCAST,NOARP&gt; mtu 1500 qdisc noop state DOWN group default 
    link/ether fa:d9:9e:37:12:68 brd ff:ff:ff:ff:ff:ff
    inet 10.96.0.10/32 brd 10.96.0.10 scope global kube-ipvs0
       valid_lft forever preferred_lft forever</pre>
<p>This binding is necessary because of how IPVS works: <span style="background-color: #c0c0c0;">it registers the ip_vs_in hook at the netfilter LOCAL_IN hook point and intercepts packets destined for a VIP (ClusterIP)</span>. For a packet to reach LOCAL_IN, its destination address must be a local address.</p>
<p>Whenever an IP address is added to a network interface, the kernel <span style="background-color: #c0c0c0;">automatically</span> adds an entry to the local routing table. For the 10.96.0.10 above, the entry is:</p>
<pre class="crayon-plain-tag"># 对于目的地址是10.96.0.10的封包，从kube-ipvs0发出，如果没有指定源IP，使用10.96.0.10
local 10.96.0.10 dev kube-ipvs0 proto kernel scope host src 10.96.0.10</pre>
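<p>Every ClusterIP bound to kube-ipvs0 gets such an entry; they can be inspected with the following command (a verification step added here for convenience):</p>
<pre class="crayon-plain-tag">ip route show table local | grep kube-ipvs0</pre>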
<p>A side effect of this automatically added route is that <span style="background-color: #c0c0c0;">for any port, if no IPVS rule matches ClusterIP:Port and some application on the host is listening on 0.0.0.0:Port, the packet is handed to that application</span>.</p>
<p>Running <pre class="crayon-plain-tag">telnet 10.96.0.10 22</pre> on the host triggers the following sequence of events (the ipvsadm example after this list shows how to check which ports are real ServicePorts):</p>
<ol>
<li>Egress routing: per the local-table route, the packet is emitted on the kube-ipvs0 interface</li>
<li>Because kube-ipvs0 is a dummy interface, the packet <span style="background-color: #c0c0c0;">immediately moves from its egress queue to its ingress queue</span></li>
<li>The destination address is a local address, so the packet reaches the LOCAL_IN hook point</li>
<li>Since 22 is not a ServicePort, the packet is handed to the local process listening on port 22</li>
</ol>
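<p>Which ClusterIP:Port combinations are actually backed by an IPVS virtual service, and therefore which ports fall through to a local listener, can be checked by listing the virtual server table. This assumes the ipvsadm tool is installed on the node:</p>
<pre class="crayon-plain-tag"># -L lists virtual services, -n prints numeric addresses/ports
ipvsadm -Ln
# only the entries for a particular ClusterIP
ipvsadm -Ln | grep 10.96.0.10</pre>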
<p>If we delete the route the kernel automatically added to the local table:</p>
<pre class="crayon-plain-tag">ip route del table local local 10.96.0.10 dev kube-ipvs0 proto kernel scope host src 10.96.0.10</pre>
<p>the following can be observed:</p>
<ol>
<li>10.96.0.10:22 is no longer reachable. This is what we want: the service does not expose port 22, so the port should not be reachable</li>
<li>10.96.0.10 can no longer be pinged. This is not what we want, but it is usually harmless. Under iptables mode a ClusterIP cannot be pinged anyway; being able to ping it locally under IPVS mode is merely a side effect of binding the ClusterIP to kube-ipvs0. Applications should not probe a ClusterIP with ICMP to decide whether a service is available, because that relies on a specific kube-proxy mode</li>
<li>From the host, 10.96.0.10:53 is still reachable. This is expected: the host can reach the ClusterIP</li>
<li>From a container's network namespace, 10.96.0.10:53 is unreachable. This is not what we want; it effectively means Pods cannot reach the ClusterIP</li>
</ol>
<p>Of these four observations, only number 3 is hard to understand. <span style="background-color: #c0c0c0;">Why can the host still reach ClusterIP:ServicePort with the route gone</span>? We have not yet chased this down at the source level, but it is clearly related to IPVS. IPVS hooks LOCAL_OUT, and in that hook it probably detects packets from the local host (the main network namespace) destined for ClusterIP:ServicePort (i.e. an IPVS virtual service) and performs some kind of "magic" that sidesteps the missing route.</p>
<p>Let us test this "magic" hypothesis further. Capturing traffic with <pre class="crayon-plain-tag">tcpdump -i any host 10.96.0.10</pre>, when accessing ClusterIP:ServicePort from a container namespace we see:</p>
<pre class="crayon-plain-tag">#                  容器IP
11:32:00.448470 IP 172.27.0.24.56378 &gt; 10.96.0.10.53: Flags [S], seq 2946888109, win 28200, options...</pre>
<p>But when accessing ClusterIP:ServicePort from the host, no traffic is captured at all. Still, iptables logging confirms that the kernel does emit packets <span style="background-color: #c0c0c0;">with the ClusterIP as both source and destination address</span>:</p>
<pre class="crayon-plain-tag">iptables -t mangle -I OUTPUT 1 -p tcp --dport 53 -j LOG --log-prefix 'out-d53: '

# dmesg -w
#                                      source address  destination address
# [3374381.426541] out-d53: IN= OUT=lo SRC=10.96.0.100 DST=10.96.0.100 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=18885 DF PROTO=TCP SPT=42442 DPT=53 WINDOW=86 RES=0x00 ACK URGP=0 </pre>
<p>Recall how packets are processed on egress and ingress:</p>
<ol>
<li>Egress passes through <strong><span style="background-color: #99cc00;">netfilter/iptables</span></strong> ⇨ <strong><span style="background-color: #cc99ff;">tcpdump</span></strong> ⇨ network interface ⇨ wire</li>
<li>Ingress passes through wire ⇨ network interface ⇨ tcpdump ⇨ netfilter/iptables</li>
</ol>
<p>The only way to explain seeing 10.96.0.10 in <span style="background-color: #99cc00;"><strong>iptables</strong></span> but not in the <strong><span style="background-color: #cc99ff;">tcpdump</span></strong> that immediately follows is that, while the host's packets to 10.96.0.10 are on their way out, IPVS does the following inside netfilter to packets matching a virtual service:</p>
<ol>
<li>Rewrite the destination address to the Service's Endpoint address; this is exactly what IPVS in NAT mode (i.e. the mode kube-proxy uses) is supposed to do</li>
<li>Rewrite the source address to the host's own address; otherwise the return traffic could not be routed back</li>
</ol>
<p>Also note that when accessing ClusterIP:NonServicePort from the host, tcpdump does capture traffic whose source or destination is the ClusterIP. That is because IPVS sees that the packet matches no virtual service, returns NF_ACCEPT, and the packet then follows the normal path.</p>
<div class="blog_h1"><span class="graybg">后果</span></div>
<div class="blog_h2"><span class="graybg">安全问题</span></div>
<p>If the host runs a vulnerable service listening on 0.0.0.0, a malicious workload could exploit it.</p>
<div class="blog_h2"><span class="graybg">行为异常</span></div>
<p>A small number of applications whose behavior depends on port probing, such as docker manifest, will misbehave.</p>
<div class="blog_h1"><span class="graybg">解决</span></div>
<p>Possible solutions include:</p>
<ol>
<li>Match traffic destined for ClusterIP:NonServicePort in iptables and drop or reject it</li>
<li>Use fwmark-based IPVS virtual services; this requires marking ClusterIP:ServicePort traffic with fwmarks in iptables, and every ClusterIP needs its own fwmark, which is hard to manage</li>
</ol>
<p>For solution 1, the following iptables rule can be used:</p>
<pre class="crayon-plain-tag">#                 如果目的地址是ClusterIP    但是目的端口不是ServicePort           则拒绝
iptables -A INPUT -d  10.96.0.0/12 -m set ! --match-set KUBE-CLUSTER-IP dst,dst -j REJECT</pre>
<p>This rule fixes the host-port leak for containers, but it breaks access to ClusterIPs from the host itself.</p>
<p>The reason is that when the host accesses a ClusterIP, the ClusterIP is used as both source and destination address. The return packets from the Endpoint, once un-NATed, therefore have a destination address that matches the rule above and get rejected.</p>
<p>To fix that, we can modify the route the kernel added automatically and hint that another address should be used as the source:</p>
<pre class="crayon-plain-tag"># 这条路由给出src提示，当访问10.96.0.10时，选取192.168.104.82（节点IP）作为源地址
ip route replace table local local 10.96.0.10 dev kube-ipvs0 proto kernel scope host src 192.168.104.82</pre>
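<p>If the same hint should be applied to every ClusterIP on the node rather than only 10.96.0.10, a sketch along the following lines can be used; 192.168.104.82 is the node IP from this example and would normally be discovered per node:</p>
<pre class="crayon-plain-tag">NODE_IP=192.168.104.82
for ip in $(ip -4 addr show dev kube-ipvs0 | awk '/inet /{print $2}' | cut -d/ -f1); do
    ip route replace table local local "$ip" dev kube-ipvs0 proto kernel scope host src "$NODE_IP"
done</pre>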
<div class="blog_h1"><span class="graybg">深入</span></div>
<p>Above we made a guess about some "magic" processing; here we dig into the IPVS implementation details to confirm it.</p>
<p>All kernel source in this section comes from the linux-3.10.y branch.</p>
<div class="blog_h2"><span class="graybg">Netfilter</span></div>
<p>Netfilter is a framework introduced in kernel 2.4.x for implementing firewalls, NAT, packet mangling, packet logging, userspace packet queueing, and the like.</p>
<p>Netfilter runs in the kernel and allows kernel modules to <span style="background-color: #c0c0c0;">register hooks (callback functions) at various points in the Linux network stack; these hooks are invoked as each packet traverses the stack</span>.</p>
<p><a href="/iptables">iptables</a>是经典的，基于netfilter的用户空间工具。它的继任者是nftables，它更加灵活、可扩容、性能好。</p>
<div class="blog_h3"><span class="graybg">钩子挂载点</span></div>
<p>Netfilter provides five sets of hook points:</p>
<table class="full-width fixed-word-wrap">
<thead>
<tr>
<td style="width: 180px; text-align: center;">挂载点</td>
<td style="text-align: center;">说明</td>
</tr>
</thead>
<tbody>
<tr>
<td>NF_IP_PRE_ROUTING</td>
<td>
<p>Invoked when a packet enters the network stack; its destination may be the local host, or the packet may need forwarding</p>
<p>ip_rcv / ipv6_rcv is the kernel's entry point for receiving and processing IP datagrams, and it invokes these hooks:</p>
<pre class="crayon-plain-tag">int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
{
	// ...
	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
		       ip_rcv_finish);
}</pre>
</td>
</tr>
<tr>
<td>NF_IP_LOCAL_IN</td>
<td>
<p>Invoked when routing decides the packet should be handled locally (its destination is a local address)</p>
<p>ip_local_deliver / ip6_input passes IP datagrams up the stack and invokes these hooks</p>
</td>
</tr>
<tr>
<td>NF_IP_FORWARD</td>
<td>
<p>Invoked when routing decides the packet should be forwarded to another machine (or network namespace)</p>
<p>ip_forward / ip6_forward handles packet forwarding and invokes these hooks</p>
</td>
</tr>
<tr>
<td>NF_IP_POST_ROUTING</td>
<td>
<p>Invoked just before the packet leaves the network stack (onto the wire); both forwarded and locally generated packets pass this point</p>
<p>ip_output / ip6_finish_output2 invokes these hooks</p>
</td>
</tr>
<tr>
<td>NF_IP_LOCAL_OUT</td>
<td>
<p>Invoked when a packet generated by the local host is about to be sent out</p>
<p>__ip_local_out / __ip6_local_out invokes these hooks</p>
</td>
</tr>
</tbody>
</table>
<p>These hook points correspond one-to-one with the iptables chains.</p>
<div class="blog_h3"><span class="graybg">注册钩子</span></div>
<p>To use netfilter hooks in the kernel, you call:</p>
<pre class="crayon-plain-tag">// 注册钩子
int nf_register_hook(struct nf_hook_ops *reg){}
// unregister a hook
void nf_unregister_hook(struct nf_hook_ops *reg){}</pre>
<p>The nf_hook_ops parameter is a structure:</p>
<pre class="crayon-plain-tag">struct nf_hook_ops {
	// function pointer to the hook; its signature varies across kernel versions
	nf_hookfn		*hook;
	struct net_device	*dev;
	void			*priv;
	// protocol family the hook applies to; PF_INET means IPv4
	u_int8_t		pf;
	// hook point code, see the table above
	unsigned int		hooknum;
	// multiple hooks may exist at each hook point; this number determines execution order
	int			priority;
};


// signature of a hook function
typedef unsigned int nf_hookfn(unsigned int hooknum,
			       struct sk_buff *skb, // the datagram being processed
			       const struct net_device *in, // input device
			       const struct net_device *out, // output device
			       int (*okfn)(struct sk_buff *)); // called if the packet passes the hook's checks; rarely needed</pre>
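<p>For illustration, a minimal sketch of registering a LOCAL_IN hook with the structures above, written against the same 3.10-era API shown in this article (the module and function names are made up for the example):</p>
<pre class="crayon-plain-tag">#include &lt;linux/module.h&gt;
#include &lt;linux/netfilter.h&gt;
#include &lt;linux/netfilter_ipv4.h&gt;
#include &lt;linux/skbuff.h&gt;

/* Example hook: accept every packet; the signature matches the 3.10-era nf_hookfn above */
static unsigned int demo_hook(unsigned int hooknum,
                              struct sk_buff *skb,
                              const struct net_device *in,
                              const struct net_device *out,
                              int (*okfn)(struct sk_buff *))
{
    return NF_ACCEPT;   /* let the packet continue through the stack */
}

static struct nf_hook_ops demo_ops = {
    .hook     = demo_hook,
    .owner    = THIS_MODULE,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_LOCAL_IN,
    .priority = NF_IP_PRI_FIRST,
};

static int __init demo_init(void)  { return nf_register_hook(&amp;demo_ops); }
static void __exit demo_exit(void) { nf_unregister_hook(&amp;demo_ops); }

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");</pre>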
<div class="blog_h3"><span class="graybg">钩子返回值</span></div>
<pre class="crayon-plain-tag">/* Responses from hook functions. */
// 丢弃该报文，不再继续传输或处理
#define NF_DROP 0
// 继续正常传输报文，如果后面由低优先级的钩子，仍然会调用它们
#define NF_ACCEPT 1
// 告知netfilter，报文被别人偷走处理了，不需要再对它做任何处理
// 下文的分析中，我们有个例子。一个netfilter钩子在内部触发了对netfilter钩子的调用
// 外层钩子返回的就是NF_STOLEN，相当于将封包的控制器转交给内层钩子了
#define NF_STOLEN 2
// 对该数据报进行排队，通常用于将数据报提交给用户空间进程处理
#define NF_QUEUE 3
// 再次调用该钩子函数
#define NF_REPEAT 4
// 继续正常传输报文，不会调用此挂载点的后续钩子
#define NF_STOP 5
#define NF_MAX_VERDICT NF_STOP </pre>
<div class="blog_h3"><span class="graybg">钩子优先级</span></div>
<p>Priorities are usually expressed relative to the following enum values (plus or minus offsets):</p>
<pre class="crayon-plain-tag">enum nf_ip_hook_priorities {
	// the smaller the value, the higher the priority and the earlier it runs
	NF_IP_PRI_FIRST = INT_MIN,
	NF_IP_PRI_CONNTRACK_DEFRAG = -400,
	// note the priorities of the hooks registered by the various iptables tables
	NF_IP_PRI_RAW = -300,
	NF_IP_PRI_SELINUX_FIRST = -225,
	NF_IP_PRI_CONNTRACK = -200,
	NF_IP_PRI_MANGLE = -150,
	NF_IP_PRI_NAT_DST = -100,
	NF_IP_PRI_FILTER = 0,
	NF_IP_PRI_SECURITY = 50,
	NF_IP_PRI_NAT_SRC = 100,
	NF_IP_PRI_SELINUX_LAST = 225,
	NF_IP_PRI_CONNTRACK_HELPER = 300,
	NF_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,
	NF_IP_PRI_LAST = INT_MAX,
};</pre>
<div class="blog_h2"><span class="graybg"><a id="ipvs"></a>IPVS</span></div>
<div class="blog_h3"><span class="graybg">钩子列表</span></div>
<p>When the ip_vs module initializes, ip_vs_init calls nf_register_hook to register the following netfilter hooks:</p>
<pre class="crayon-plain-tag">static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
	// registered at LOCAL_IN; these two hooks handle packets involving external clients
	// delegates to ip_vs_out: in NAT mode, handles LVS replies to external clients, e.g. rewriting IP addresses
	{
		.hook		= ip_vs_reply4,
		.owner		= THIS_MODULE,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_LOCAL_IN,
		.priority	= NF_IP_PRI_NAT_SRC - 2,
	},
	// delegates to ip_vs_in: handles request packets from external clients entering IPVS
	// if no connection exists for the request yet, the scheduler creates one, which involves the RS load-balancing algorithm
	{
		.hook		= ip_vs_remote_request4,
		.owner		= THIS_MODULE,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_LOCAL_IN,
		.priority	= NF_IP_PRI_NAT_SRC - 1,
	},

	// registered at LOCAL_OUT; these two hooks handle packets originating on the LVS host itself
	// delegates to ip_vs_out: in NAT mode, handles LVS replies to clients
	{
		.hook		= ip_vs_local_reply4,
		.owner		= THIS_MODULE,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_LOCAL_OUT,
		.priority	= NF_IP_PRI_NAT_DST + 1,
	},
	// delegates to ip_vs_in: schedules and forwards (to an RS) requests originating on this host
	{
		.hook		= ip_vs_local_request4,
		.owner		= THIS_MODULE,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_LOCAL_OUT,
		.priority	= NF_IP_PRI_NAT_DST + 2,
	},

	// these two are registered at FORWARD
	// delegates to ip_vs_in_icmp: handles ICMP packets sent by external clients to IPVS and forwards them to the RS
	{
		.hook		= ip_vs_forward_icmp,
		.owner		= THIS_MODULE,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_FORWARD,
		.priority	= 99,
	},
	// delegates to ip_vs_out: in NAT mode, rewrites the source address of RS replies to the IPVS virtual address
	{
		.hook		= ip_vs_reply4,
		.owner		= THIS_MODULE,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_FORWARD,
		.priority	= 100,
	}
};</pre>
<div class="blog_h3"><span class="graybg">ip_vs_in</span></div>
<p>From the hooks above we can see that:</p>
<ol>
<li>Requests to IPVS (SYNs destined for a VIP) are hooked in different places depending on whether they originate externally or locally:
<ol>
<li>External requests are handled at LOCAL_IN, by ip_vs_remote_request4</li>
<li>Local requests are handled at LOCAL_OUT, by ip_vs_local_request4</li>
</ol>
</li>
<li>Although they are hooked at different places, ip_vs_remote_request4 and ip_vs_local_request4 both call ip_vs_in; in fact their logic is identical:<br />
<pre class="crayon-plain-tag">/*
 *	AF_INET handler in NF_INET_LOCAL_IN chain
 *	Schedule and forward packets from remote clients
 */
static unsigned int
ip_vs_remote_request4(unsigned int hooknum, struct sk_buff *skb,
		      const struct net_device *in,
		      const struct net_device *out,
		      int (*okfn)(struct sk_buff *))
{
	return ip_vs_in(hooknum, skb, AF_INET);
}

/*
 *	AF_INET handler in NF_INET_LOCAL_OUT chain
 *	Schedule and forward packets from local clients
 */
static unsigned int
ip_vs_local_request4(unsigned int hooknum, struct sk_buff *skb,
		     const struct net_device *in, const struct net_device *out,
		     int (*okfn)(struct sk_buff *))
{
	return ip_vs_in(hooknum, skb, AF_INET);
}</pre>
</li>
</ol>
<p>Recall our questions about the "magic" processing. For requests from the host to 10.96.0.10:53, iptables logging confirmed that the source IP used is 10.96.0.10:</p>
<ol>
<li>Why does tcpdump not capture this request?</li>
<li>Why does deleting the route not affect the host's access to the ClusterIP (while it does break access from containers)?</li>
</ol>
<p>The answers to both questions most likely hide inside ip_vs_in, since it is the single entry point for datagrams entering IPVS. If it rewrites both the source and destination of the original packet, that explains question 1; if it performs routing internally, that explains question 2.</p>
<p>Let us walk through the ip_vs_in code:</p>
<pre class="crayon-plain-tag">static unsigned int
ip_vs_in(unsigned int hooknum, struct sk_buff *skb, int af)
{
	// the network namespace
	struct net *net;
	// IPVS's view of the IP header, holding the L3 header length, protocol, flags, and source/destination addresses
	struct ip_vs_iphdr iph;
	// holds the protocol (TCP/UDP/SCTP/AH/ESP) information and, more importantly, function pointers implementing the protocol-specific IPVS logic
	struct ip_vs_protocol *pp;
	// one per namespace, holding statistics counters and timeout tables
	struct ip_vs_proto_data *pd;
	// the IPVS connection this packet belongs to; its most important member is the packet_xmit function pointer, which sends the packet onward
	struct ip_vs_conn *cp;
	int ret, pkts;
	// describes the IPVS state of the current network namespace
	struct netns_ipvs *ipvs;

	// If the packet is already marked as an IPVS request/reply, leave it alone and continue normal netfilter processing.
	// Later on, ip_vs_nat_xmit makes the packet "re-enter" netfilter, by which time it carries the IPVS mark;
	// this check makes re-entering packets follow the normal netfilter path instead of looping forever.
	if (skb-&gt;ipvs_property)
		return NF_ACCEPT;


	// If the packet is not destined for the local host and we are not at LOCAL_OUT,
	// or the packet has no dst_entry, leave it alone and continue normal netfilter processing
	if (unlikely((skb-&gt;pkt_type != PACKET_HOST &amp;&amp;
		      hooknum != NF_INET_LOCAL_OUT) ||
		     !skb_dst(skb))) {
		ip_vs_fill_iph_skb(af, skb, &amp;iph);
		IP_VS_DBG_BUF(12, "packet type=%d proto=%d daddr=%s"
			      " ignored in hook %u\n",
			      skb-&gt;pkt_type, iph.protocol,
			      IP_VS_DBG_ADDR(af, &amp;iph.daddr), hooknum);
		return NF_ACCEPT;
	}
	// If this IPVS host is a backup, or IPVS is not enabled in this namespace, continue normal netfilter processing
	net = skb_net(skb);
	ipvs = net_ipvs(net);
	if (unlikely(sysctl_backup_only(ipvs) || !ipvs-&gt;enable))
		return NF_ACCEPT;

	// Fill the IPVS IP header from the packet's IP header
	ip_vs_fill_iph_skb(af, skb, &amp;iph);

	// For RAW sockets, leave the packet alone and continue normal netfilter processing
	if (unlikely(skb-&gt;sk != NULL &amp;&amp; hooknum == NF_INET_LOCAL_OUT &amp;&amp;
		     af == AF_INET)) {
		struct sock *sk = skb-&gt;sk;
		struct inet_sock *inet = inet_sk(skb-&gt;sk);

		if (inet &amp;&amp; sk-&gt;sk_family == PF_INET &amp;&amp; inet-&gt;nodefrag)
			return NF_ACCEPT;
	}

	// Handle ICMP packets; not relevant to our scenario
	if (unlikely(iph.protocol == IPPROTO_ICMP)) {
		int related;
		int verdict = ip_vs_in_icmp(skb, &amp;related, hooknum);
		if (related)
			return verdict;
	}

	// If the protocol is not supported by IPVS, leave the packet alone and continue normal netfilter processing
	pd = ip_vs_proto_data_get(net, iph.protocol);
	if (unlikely(!pd))
		return NF_ACCEPT;
	// the protocol is supported, obtain pp
	pp = pd-&gt;pp;
	// Try to find the IPVS connection this packet belongs to
	cp = pp-&gt;conn_in_get(af, skb, &amp;iph, 0);
	// If the packet belongs to an existing IPVS connection whose RS (dest) is set but whose weight is 0,
	// treat the connection as invalid and expire it
	if (unlikely(sysctl_expire_nodest_conn(ipvs)) &amp;&amp; cp &amp;&amp; cp-&gt;dest &amp;&amp;
	    unlikely(!atomic_read(&amp;cp-&gt;dest-&gt;weight)) &amp;&amp; !iph.fragoffs &amp;&amp;
	    is_new_conn(skb, &amp;iph)) {
		ip_vs_conn_expire_now(cp);
		__ip_vs_conn_put(cp);
		cp = NULL;
	}

	// Schedule a new IPVS connection; this involves the RS load-balancing algorithm
	if (unlikely(!cp) &amp;&amp; !iph.fragoffs) {
		int v;
		if (!pp-&gt;conn_schedule(af, skb, pd, &amp;v, &amp;cp, &amp;iph))
			// If it returns 0, v is usually NF_DROP, meaning scheduling failed and the packet is dropped
			return v;
	}

	if (unlikely(!cp)) {
		IP_VS_DBG_PKT(12, af, pp, skb, 0,
			      "ip_vs_in: packet continues traversal as normal");
		if (iph.fragoffs) {
			IP_VS_DBG_RL("Unhandled frag, load nf_defrag_ipv6\n");
			IP_VS_DBG_PKT(7, af, pp, skb, 0, "unhandled fragment");
		}
		return NF_ACCEPT;
	}

	// An "incoming" packet -- in our scenario, a packet from a local client that has entered the IPVS system
	// From the network stack's point of view, though, we are handling an outbound packet...
	IP_VS_DBG_PKT(11, af, pp, skb, 0, "Incoming packet");

	// The connection's RS is unavailable
	if (cp-&gt;dest &amp;&amp; !(cp-&gt;dest-&gt;flags &amp; IP_VS_DEST_F_AVAILABLE)) {
		// Expire the connection immediately
		if (sysctl_expire_nodest_conn(ipvs)) {
			ip_vs_conn_expire_now(cp);
		}
		// Drop the packet
		__ip_vs_conn_put(cp);
		return NF_DROP;
	}
	// Update counters
	ip_vs_in_stats(cp, skb);
	// Update the IPVS connection state machine, which involves:
	//   advancing the state machine based on the packet's TCP flags
	//   updating per-connection statistics (active vs. inactive connections)
	//   setting the timeout according to the connection state
	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);

	if (cp-&gt;packet_xmit)
		// Call packet_xmit to send the packet; in practice this re-enters netfilter at LOCAL_OUT. Ownership of the packet is handed off, so skb must not be touched afterwards
		ret = cp-&gt;packet_xmit(skb, cp, pp, &amp;iph);
	else {
		IP_VS_DBG_RL("warning: packet_xmit is null");
		ret = NF_ACCEPT;
	}

	if (cp-&gt;flags &amp; IP_VS_CONN_F_ONE_PACKET)
		pkts = sysctl_sync_threshold(ipvs);
	else
		pkts = atomic_add_return(1, &amp;cp-&gt;in_pkts);

	if (ipvs-&gt;sync_state &amp; IP_VS_STATE_MASTER)
		ip_vs_sync_conn(net, cp, pkts);

	// Put the connection object back and reset its timer
	ip_vs_conn_put(cp);
	return ret;
}</pre>
<p>In the code above, the "magic" most likely happens in one of two places:</p>
<ol>
<li>conn_schedule: where the IPVS connection is scheduled</li>
<li>packet_xmit: where the IPVS-processed packet is sent out</li>
</ol>
<p>Both are function pointers. For TCP, conn_schedule points to tcp_conn_schedule; in NAT mode, packet_xmit points to ip_vs_nat_xmit. The packet_xmit pointer is initialized during conn_schedule. A toy illustration of this dispatch pattern follows below.</p>
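<p>To make the control flow easier to follow, here is a minimal user-space sketch of the same pattern (not kernel code; every name in it is made up): a conn_schedule-style function picks a destination and binds a packet_xmit-style function pointer onto the connection object, which the caller then invokes.</p>
<pre class="crayon-plain-tag">#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

struct conn;                                 /* forward declaration */
typedef int (*xmit_fn)(struct conn *c);      /* plays the role of packet_xmit */

struct conn {
	const char *dest;                    /* the chosen "RS" */
	xmit_fn     packet_xmit;             /* filled in while scheduling */
};

static int nat_xmit(struct conn *c)          /* stand-in for ip_vs_nat_xmit */
{
	printf("DNAT and re-inject towards %s\n", c-&gt;dest);
	return 0;
}

/* Stand-in for pp-&gt;conn_schedule: pick a destination, bind packet_xmit. */
static struct conn *conn_schedule(const char *dest)
{
	struct conn *c = malloc(sizeof(*c));
	if (!c)
		return NULL;
	c-&gt;dest = dest;
	c-&gt;packet_xmit = nat_xmit;           /* chosen according to the "mode" */
	return c;
}

int main(void)
{
	struct conn *cp = conn_schedule("172.29.2.3:9153");
	if (cp &amp;&amp; cp-&gt;packet_xmit)
		cp-&gt;packet_xmit(cp);         /* what ip_vs_in does with cp */
	free(cp);
	return 0;
}</pre>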
<div class="blog_h3"><span class="graybg">tcp_conn_schedule</span></div>
<p>Let's look at how an IPVS connection is scheduled for TCP.</p>
<pre class="crayon-plain-tag">static int
tcp_conn_schedule(int af, struct sk_buff *skb, struct ip_vs_proto_data *pd,
		  int *verdict, struct ip_vs_conn **cpp,
		  struct ip_vs_iphdr *iph)
{
	// Network namespace
	struct net *net;
	// IPVS virtual service object
	struct ip_vs_service *svc;
	struct tcphdr _tcph, *th;

	// Parse the L4 header; on failure, tell ip_vs_in to drop the packet
	th = skb_header_pointer(skb, iph-&gt;len, sizeof(_tcph), &amp;_tcph);
	if (th == NULL) {
		*verdict = NF_DROP;
		return 0;
	}
	net = skb_net(skb);
	rcu_read_lock();
	if (th-&gt;syn &amp;&amp;
	    // Look up the virtual service that matches the packet
	    (svc = ip_vs_service_find(net, af, skb-&gt;mark, iph-&gt;protocol,
				      &amp;iph-&gt;daddr, th-&gt;dest))) {
		int ignored;
		// If the current network namespace is "overloaded", drop the packet
		if (ip_vs_todrop(net_ipvs(net))) {
			rcu_read_unlock();
			*verdict = NF_DROP;
			return 0;
		}

		// Pick an RS and create the IPVS connection
		// If no RS can be found, or a fatal error occurs, ignored is 0 or -1; in that case
		// no IPVS connection was created, and ip_vs_in is told to drop the packet, possibly with an ICMP reply
		*cpp = ip_vs_schedule(svc, skb, pd, &amp;ignored, iph);
		if (!*cpp &amp;&amp; ignored &lt;= 0) {
			if (!ignored)
				// ignored=0: no RS found
				*verdict = ip_vs_leave(svc, skb, pd, iph);
			else
				*verdict = NF_DROP;
			rcu_read_unlock();
			return 0;
		}
	}
	rcu_read_unlock();
	// Scheduling succeeded and the IPVS connection object is non-NULL; return 1
	/* NF_ACCEPT */
	return 1;
}</pre>
<p>So far we have not seen IPVS change any packet addresses; we need to read further into ip_vs_schedule.</p>
<div class="blog_h3"><span class="graybg">ip_vs_schedule</span></div>
<p>This is the core IPVS scheduling function. It supports TCP/UDP, picks an RS for the virtual service, and creates the IPVS connection object.</p>
<pre class="crayon-plain-tag">struct ip_vs_conn *
ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
	       struct ip_vs_proto_data *pd, int *ignored,
	       struct ip_vs_iphdr *iph)
{
	struct ip_vs_protocol *pp = pd-&gt;pp;
	// IPVS connection object (connection entry)
	struct ip_vs_conn *cp = NULL;
	struct ip_vs_scheduler *sched;
	struct ip_vs_dest *dest;
	__be16 _ports[2], *pptr;
	unsigned int flags;

	// ...

	*ignored = 0;

	/*
	 *    Non-persistent service
	 */
	// Scheduling is delegated to the virtual service's scheduler
	sched = rcu_dereference(svc-&gt;scheduler);
	// The scheduler's job is simply to pick an RS (ip_vs_dest)
	dest = sched-&gt;schedule(svc, skb);
	if (dest == NULL) {
		IP_VS_DBG(1, "Schedule: no dest found.\n");
		return NULL;
	}

	flags = (svc-&gt;flags &amp; IP_VS_SVC_F_ONEPACKET
		 &amp;&amp; iph-&gt;protocol == IPPROTO_UDP) ?
		IP_VS_CONN_F_ONE_PACKET : 0;

	// Initialize the IPVS connection object, ip_vs_conn
	{
		struct ip_vs_conn_param p;

		ip_vs_conn_fill_param(svc-&gt;net, svc-&gt;af, iph-&gt;protocol,
				      &amp;iph-&gt;saddr, pptr[0], &amp;iph-&gt;daddr,
				      pptr[1], &amp;p);
		// Setting up the ip_vs_conn involves:
		//   initializing the timer
		//   setting the network namespace
		//   setting addresses, fwmark, and ports
		//   choosing a packet_xmit for the connection based on the IP version and the IPVS mode (NAT/DR/TUN)
		cp = ip_vs_conn_new(&amp;p, &amp;dest-&gt;addr,
				    dest-&gt;port ? dest-&gt;port : pptr[1],
				    flags, dest, skb-&gt;mark);
		if (!cp) {
			*ignored = -1;
			return NULL;
		}
	}

	// ...
	return cp;
}</pre>
<p>At this point we can see that conn_schedule still does not modify the packet at all. The key must lie in the packet_xmit function.</p>
<div class="blog_h3"><span class="graybg">ip_vs_nat_xmit</span></div>
<pre class="crayon-plain-tag">int ip_vs_nat_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
	       struct ip_vs_protocol *pp, struct ip_vs_iphdr *ipvsh)
{
	// Routing table entry
	struct rtable *rt;		/* Route to the other host */
	// destined for this host?   did it arrive on an input route?
	int local, rc, was_input;

	EnterFunction(10);

	rcu_read_lock();
	// Has the client port not been set yet?
	if (unlikely(cp-&gt;flags &amp; IP_VS_CONN_F_NO_CPORT)) {
		__be16 _pt, *p;

		p = skb_header_pointer(skb, ipvsh-&gt;len, sizeof(_pt), &amp;_pt);
		if (p == NULL)
			goto tx_error;
		// Fill in the IPVS connection's cport
		// caddr cport: client address
		// vaddr vport: virtual service address
		// daddr dport: RS address
		ip_vs_conn_fill_cport(cp, *p);
		IP_VS_DBG(10, "filled cport=%d\n", ntohs(*p));
	}

	was_input = rt_is_input_route(skb_rtable(skb));
	// Output route lookup, based on the packet, the RS address, and several mode flags
	// The return value indicates whether the route destination is the local host
	local = __ip_vs_get_out_rt(skb, cp-&gt;dest, cp-&gt;daddr.ip,
				   IP_VS_RT_MODE_LOCAL |
				   IP_VS_RT_MODE_NON_LOCAL |
				   IP_VS_RT_MODE_RDR, NULL);
	if (local &lt; 0)
		goto tx_error;
	rt = skb_rtable(skb);

	// Destination is local, the RS address is a loopback address, and the packet arrived on an input route
	if (local &amp;&amp; ipv4_is_loopback(cp-&gt;daddr.ip) &amp;&amp; was_input) {
		IP_VS_DBG_RL_PKT(1, AF_INET, pp, skb, 0, "ip_vs_nat_xmit(): "
				 "stopping DNAT to loopback address");
		goto tx_error;
	}

	// The packet is about to be modified; perform copy-on-write
	if (!skb_make_writable(skb, sizeof(struct iphdr)))
		goto tx_error;

	if (skb_cow(skb, rt-&gt;dst.dev-&gt;hard_header_len))
		goto tx_error;

	// Modify the packet; dnat_handler points to tcp_dnat_handler
	if (pp-&gt;dnat_handler &amp;&amp; !pp-&gt;dnat_handler(skb, pp, cp, ipvsh))
		goto tx_error;
	// Rewrite the destination address
	ip_hdr(skb)-&gt;daddr = cp-&gt;daddr.ip;
	// Generate the checksum for the outbound packet
	ip_send_check(ip_hdr(skb));

	IP_VS_DBG_PKT(10, AF_INET, pp, skb, 0, "After DNAT");

	skb-&gt;local_df = 1;

	// Send the packet:
	//   if it was sent out, return NF_STOLEN
	//   if it was not sent (local=1, destination is this host), return NF_ACCEPT
	rc = ip_vs_nat_send_or_cont(NFPROTO_IPV4, skb, cp, local);
	rcu_read_unlock();

	LeaveFunction(10);
	return rc;

  tx_error:
	kfree_skb(skb);
	rcu_read_unlock();
	LeaveFunction(10);
	return NF_STOLEN;
}

static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,  struct ip_vs_conn *cp, int local)
{
	// Note what NF_STOLEN means here; see the discussion above
	int ret = NF_STOLEN;
	// Mark the packet as IPVS traffic. NF_HOOK below re-injects the packet into netfilter, and the mark makes the re-injected packet return NF_ACCEPT from IPVS immediately.
	// Re-injection gives iptables a chance to process the modified packet
	skb-&gt;ipvs_property = 1;
	if (likely(!(cp-&gt;flags &amp; IP_VS_CONN_F_NFCT)))
		ip_vs_notrack(skb);
	else
		ip_vs_update_conntrack(skb, cp, 1);
	// If the destination is not the local host
	if (!local) {
		skb_forward_csum(skb);
		// Invoke the LOCAL_OUT hook
		NF_HOOK(pf, NF_INET_LOCAL_OUT, skb, NULL, skb_dst(skb)-&gt;dev, dst_output);
	} else
		ret = NF_ACCEPT;
	return ret;
}</pre>
<p>From ip_vs_nat_xmit we learn that, for a request from the host to ClusterIP:ServicePort:</p>
<ol>
<li>The packet's destination address is rewritten to the address of an Endpoint (usually a Pod, the RS in IPVS terms)</li>
<li>The modified packet is re-injected into netfilter (an inner traversal), even though we are currently still inside netfilter (the outer traversal)
<ol>
<li>The outer hook returns NF_STOLEN: ownership of the packet passes to the inner traversal, and the remaining outer netfilter processing stops</li>
<li>The inner hook returns NF_ACCEPT: no further IPVS processing happens and normal netfilter processing continues. The LOCAL_OUT and POSTROUTING hooks before and after IPVS all run normally. In other words, the modified packet receives the full, regular netfilter treatment, as if IPVS were not there at all</li>
</ol>
</li>
</ol>
<p>At this point we have established that IPVS performs DNAT in LOCAL_OUT. But only if SNAT also takes place can the puzzle above be explained.</p>
<div class="blog_h2"><span class="graybg">SNAT</span></div>
<p>After spending quite some time digging into IPVS, we realized we had gone down the wrong path: we forgot that SNAT is something kube-proxy does. A look at the iptables rules makes it obvious:</p>
<pre class="crayon-plain-tag"># iptables -t nat -L -n -v

Chain OUTPUT (policy ACCEPT 2 packets, 150 bytes)
 pkts bytes target     prot opt in     out     source        destination         
# All outbound traffic goes through the custom KUBE-SERVICES chain
  21M 3825M KUBE-SERVICES  all  --  *  *   0.0.0.0/0         0.0.0.0/0            /* kubernetes service portals */

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source        destination         
# If the destination IP:PORT belongs to a Kubernetes Service, jump to the KUBE-MARK-MASQ chain
    0     0 KUBE-MARK-MASQ  all  --  * *  !172.27.0.0/16     0.0.0.0/0   match-set KUBE-CLUSTER-IP dst,dst

# Mark the packet with 0x4000
Chain KUBE-MARK-MASQ (5 references)
 pkts bytes target     prot opt in     out     source         destination         
   98  5880 MARK       all  --  *       *  0.0.0.0/0          0.0.0.0/0            MARK or 0x4000

 
Chain POSTROUTING (policy ACCEPT 2 packets, 150 bytes)
 pkts bytes target     prot opt in     out     source         destination         
  44M 5256M KUBE-POSTROUTING  all  --  *  *       0.0.0.0/0   0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source         destination         
# Only packets carrying the 0x4000 mark are processed further
 1781  166K RETURN     all  --  *      *  0.0.0.0/0           0.0.0.0/0            mark match ! 0x4000/0x4000
# Perform SNAT
   97  5820 MARK       all  --  *      *  0.0.0.0/0           0.0.0.0/0   MARK xor 0x4000
   97  5820 MASQUERADE  all  --  *     *  0.0.0.0/0           0.0.0.0/0   /* kubernetes service traffic requiring SNAT */</pre>
<p>Since ip_vs_local_request4 hooks into LOCAL_OUT with priority NF_IP_PRI_NAT_DST+2, it runs after the MARK rule in the nat table's OUTPUT chain shown above. In other words, by the time IPVS processes the packet, kube-proxy has already marked the original packet.</p>
<p>The re-injected, DNATed packet then passes through LOCAL_OUT and on to POSTROUTING. Because of the mark, it is SNATed by kube-proxy's rule.</p>
<p>The packet leaving POSTROUTING does pass the point where tcpdump listens, but since its source IP, destination IP, and destination port have all changed, our capture filter no longer matches it, which is why tcpdump showed no output.</p>
<div class="blog_h1"><span class="graybg">Summary</span></div>
<p>Let's recap.</p>
<p>Why can the ClusterIP be pinged in IPVS mode? Because in IPVS mode the ClusterIP is configured as an address of kube-ipvs0, a virtual NIC on the host.</p>
<p>Why does the ClusterIP leak host ports in IPVS mode? Whenever a ClusterIP is added to that interface, the kernel automatically adds a route to the local routing table. This route ensures that, absent any IPVS involvement, traffic to the ClusterIP is routed to the local host, so any process listening on 0.0.0.0 receives and handles it.</p>
<p>Why, after deleting the kernel-added route:</p>
<ol>
<li>does access from the host to ClusterIP:NonServicePort stop working? Because the route is gone</li>
<li>does access from the host to ClusterIP:ServicePort still work without the route? As analyzed above, IPVS still performs its own route lookup in ip_vs_nat_xmit</li>
<li>does access to ClusterIP:ServicePort from a container network namespace fail? IPVS takes different code paths for local and remote clients. A container network namespace is a remote client: its packets must first pass PREROUTING and then the routing decision, and only if the route destination is the local host do they reach LOCAL_IN, where IPVS gets its chance to intervene. With the route deleted, the routing step fails</li>
</ol>
<p>Why, when matching destination addresses with --match-set KUBE-CLUSTER-IP and rejecting packets whose destination port is a NonServicePort:</p>
<ol>
<li>does this approach work for container namespaces? A container's requests never use the ClusterIP as their source address, so the return packets' destination address never matches the rule and is never rejected</li>
<li>does this approach break the host's access to the ClusterIP? When the host initiates a request, its source address is the ClusterIP and its source port is random, so the return packets for such requests inevitably match the rule and are rejected</li>
</ol>
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/nodeport-leak-under-ipvs-mode">ClusterIP leaking host ports under IPVS mode</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/nodeport-leak-under-ipvs-mode/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A 63-second NodePort delay caused by a kernel bug</title>
		<link>https://blog.gmem.cc/nodeport-63s-delay-due-to-kernel-issue</link>
		<comments>https://blog.gmem.cc/nodeport-63s-delay-due-to-kernel-issue#comments</comments>
		<pubDate>Fri, 14 Aug 2020 09:05:27 +0000</pubDate>
		<dc:creator><![CDATA[Alex]]></dc:creator>
				<category><![CDATA[PaaS]]></category>

		<guid isPermaLink="false">https://blog.gmem.cc/?p=34113</guid>
		<description><![CDATA[<p>Symptoms: We have a newly created TKE 1.3.0 cluster using Galaxy + Flannel (VXLAN mode) container networking, made up of three L2-adjacent Master nodes 10.0.0.11, 10.0.0.12, and 10.0.0.13. Accessing a NodePort Service whose host port is 30153 shows an interesting pattern: on nodes 10.0.0.11 and 10.0.0.13, curl http://localhost:30153 hangs about 50% of the time; on node 10.0.0.12 it hangs 100% of the time; from inside the cluster, port 30153 on other nodes is reachable; from outside the cluster, port 30153 on every node is reachable. The three nodes themselves are identical, so the differing hang probabilities likely relate to where the Service's endpoints (Pods) are scheduled. Accessing the Pods directly works from every node, so the container network is not directly at fault. Analysis <a class="read-more" href="https://blog.gmem.cc/nodeport-63s-delay-due-to-kernel-issue">[...]</a></p>
<p>The post <a rel="nofollow" href="https://blog.gmem.cc/nodeport-63s-delay-due-to-kernel-issue">A 63-second NodePort delay caused by a kernel bug</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></description>
				<content:encoded><![CDATA[<div class="wri_content_clear_both"><div class="blog_h1"><span class="graybg">Symptoms</span></div>
<p>We have a newly created TKE 1.3.0 cluster using Galaxy + Flannel (VXLAN mode) container networking, made up of three L2-adjacent Master nodes <pre class="crayon-plain-tag">10.0.0.11</pre>, <pre class="crayon-plain-tag">10.0.0.12</pre>, and <pre class="crayon-plain-tag">10.0.0.13</pre>. When accessing a NodePort Service whose host port is <pre class="crayon-plain-tag">30153</pre>, we saw something rather interesting:</p>
<ol>
<li>On nodes <pre class="crayon-plain-tag">10.0.0.11</pre> and <pre class="crayon-plain-tag">10.0.0.13</pre>, <pre class="crayon-plain-tag">curl http://localhost:30153</pre> hangs about 50% of the time</li>
<li>On node <pre class="crayon-plain-tag">10.0.0.12</pre>, <pre class="crayon-plain-tag">curl http://localhost:30153</pre> hangs 100% of the time</li>
<li>From inside the cluster, accessing port 30153 on a different node works fine</li>
<li>From outside the cluster, accessing port 30153 on any node works fine</li>
</ol>
<p>The three nodes themselves are identical; the differing hang probabilities are probably related to how the Service's endpoints (Endpoints, i.e. Pods) are distributed.</p>
<p>The NodePort Service is defined as follows:</p>
<pre class="crayon-plain-tag">apiVersion: v1
kind: Service
metadata:
  name: kube-dns-nodeport
  namespace: kube-system
spec:
  externalTrafficPolicy: Cluster
  ports:
  - name: metrics
    nodePort: 30153
    port: 9153
    protocol: TCP
    targetPort: 9153
  selector:
    k8s-app: kube-dns
  sessionAffinity: None
  type: NodePort</pre>
<p>The Service has two endpoints:</p>
<pre class="crayon-plain-tag">kubectl -n kube-system get pod -l k8s-app=kube-dns -o wide
NAME                      READY   STATUS    RESTARTS   AGE    IP           NODE        NOMINATED NODE   READINESS GATES
coredns-bbc9b5888-r72zd   1/1     Running   0          140m   172.29.0.2   10.0.0.11   &lt;none&gt;           &lt;none&gt;
coredns-bbc9b5888-v6wx6   1/1     Running   0          10m    172.29.2.3   10.0.0.13   &lt;none&gt;           &lt;none&gt;</pre>
<p>There is one endpoint on 10.0.0.11 and one on 10.0.0.13. If the container network were broken in a way that made only local Pods reachable, that would explain the hangs: 10.0.0.12 has no endpoint, so it always hangs, while 10.0.0.11 and 10.0.0.13 each hold 50% of the endpoints, so they hang 50% of the time.</p>
<p>However, accessing the Pods directly works from any node:</p>
<pre class="crayon-plain-tag">curl http://172.29.0.2:9153
404 page not found

curl http://172.29.2.3:9153
404 page not found</pre>
<p>This shows the problem has no direct relationship with the container network.</p>
<div class="blog_h1"><span class="graybg">Analysis</span></div>
<div class="blog_h2"><span class="graybg">Inner packets</span></div>
<p>On 10.0.0.11 we sent a request to localhost:30153 and captured packets during a hang:</p>
<pre class="crayon-plain-tag"># While traversing iptables the packet is DNATed to POD_IP:9153 and SNATed to the host's eth0 address
curl http://127.0.0.1:30153

tcpdump -ttttt -nn -vvv -i any 'tcp port 9153'

# Client side
# SYN 0
 00:00:00.000000 IP (tos 0x0, ttl 64, id 42199, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x5a6c (incorrect -&gt; 0xd480), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19658463 ecr 0,nop,wscale 9], length 0
# SYN 1
 00:00:01.000549 IP (tos 0x0, ttl 64, id 42200, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x5a6c (incorrect -&gt; 0xd097), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19659464 ecr 0,nop,wscale 9], length 0
# SYN 2
 00:00:03.005510 IP (tos 0x0, ttl 64, id 42201, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x5a6c (incorrect -&gt; 0xc8c2), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19661469 ecr 0,nop,wscale 9], length 0
# SYN 3
 00:00:07.008579 IP (tos 0x0, ttl 64, id 42202, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x5a6c (incorrect -&gt; 0xb91f), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19665472 ecr 0,nop,wscale 9], length 0
# SYN 4
 00:00:15.024516 IP (tos 0x0, ttl 64, id 42203, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x5a6c (incorrect -&gt; 0x99cf), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19673488 ecr 0,nop,wscale 9], length 0
# SYN 5
 00:00:31.072562 IP (tos 0x0, ttl 64, id 42204, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x5a6c (incorrect -&gt; 0x5b1f), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19689536 ecr 0,nop,wscale 9], length 0
# SYN 6, at 63 seconds
 00:01:03.136526 IP (tos 0x0, ttl 64, id 42205, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x5a6c (incorrect -&gt; 0xddde), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19721600 ecr 0,nop,wscale 9], length 0
# SYN+ACK, the connection is established
 00:01:03.137188 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.2.3.9153 &gt; 172.29.0.0.40233: Flags [S.], cksum 0xbdbe (correct), seq 4208932479, ack 1769165321, win 27960, options [mss 1410,sackOK,TS val 19735883 ecr 19721600,nop,wscale 9], length 0


# Server side

# This datagram only arrives 63 seconds later
# SYN 6
 00:00:00.000000 IP (tos 0x0, ttl 64, id 42205, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xddde (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19721600 ecr 0,nop,wscale 9], length 0
 00:00:00.000025 IP (tos 0x0, ttl 63, id 42205, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xddde (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19721600 ecr 0,nop,wscale 9], length 0
 00:00:00.000065 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.2.3.9153 &gt; 172.29.0.0.40233: Flags [S.], cksum 0x5a6c (incorrect -&gt; 0xbdbe), seq 4208932479, ack 1769165321, win 27960, options [mss 1410,sackOK,TS val 19735883 ecr 19721600,nop,wscale 9], length 0</pre>
<p>Two things stand out:</p>
<ol>
<li>When the Service load-balances to a Pod on the local node the request succeeds; when it picks a Pod on another node it hangs. That explains the 50% hang rate</li>
<li>The request is not stuck forever: after 63 seconds the SYN succeeds</li>
</ol>
<div class="blog_h2"><span class="graybg">iptables rules</span></div>
<p>From the captures above, our initial judgment was that iptables was not involved. Problems caused by iptables typically look like hangs until timeout (silently dropped packets), ICMP errors, or TCP RSTs; they do not usually recover on their own after a while.</p>
<p>And yet this failure is peculiar: it really is triggered by iptables rules, a fact we only discovered later while researching. Let's first list the relevant rules. The PREROUTING-stage rules are as follows:</p>
<pre class="crayon-plain-tag"># iptables -L -n -v -t nat

Chain PREROUTING (policy ACCEPT 1 packets, 60 bytes)
 pkts bytes target                prot opt in  out  source     destination         
# Every packet goes through this chain
# kubernetes service portals
46185 2817K KUBE-SERVICES         all  --  *   *    0.0.0.0/0  0.0.0.0/0            

Chain KUBE-SERVICES (2 references)
 pkts bytes target                prot opt in  out  source      destination         
# These match ClusterIPs and are irrelevant to this scenario
# kube-system/kube-dns:metrics cluster IP 
# 0  0 KUBE-SVC-JD5MR3NA4I4DYORP  tcp  --  *  *     0.0.0.0/0  172.29.255.10  tcp dpt:9153
#kube-system/kube-dns-nodeport:metrics cluster IP                   
# 0  0 KUBE-SVC-CZA6AQQ7F4S64XIF  tcp  --  *  *     0.0.0.0/0  172.29.255.56  tcp dpt:9153
# default/kubernetes:https cluster IP                             
# 0  0 KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  *  *     0.0.0.0/0  172.29.255.1   tcp dpt:443
# kube-system/kube-dns:dns cluster IP                           
# 0  0 KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  *  *     0.0.0.0/0  172.29.255.10  udp dpt:53
#kube-system/kube-dns:dns-tcp cluster IP                         
# 0  0 KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  *  *     0.0.0.0/0  172.29.255.10  tcp dpt:53
# Packets that are not destined for a ClusterIP but whose destination is an address bound on this host go through this chain
# kubernetes service nodeports; NOTE: this must be the last rule in this chain 
  678 40680 KUBE-NODEPORTS        all  --  *  *     0.0.0.0/0  0.0.0.0/0  ADDRTYPE match dst-type LOCAL


Chain KUBE-NODEPORTS (1 references)
 pkts bytes target                prot opt in out   source     destination         
# Matches our scenario (destination port 30153); it only marks the packet, so traversal of the chain continues
# kube-system/kube-dns-nodeport:metrics 
    0     0 KUBE-MARK-MASQ        tcp  --  *  *     0.0.0.0/0  0.0.0.0/0  tcp dpt:30153
# Matches our scenario (destination port 30153); jumps to the dedicated chain of the NodePort's target Service
# kube-system/kube-dns-nodeport:metrics 
    0     0 KUBE-SVC-CZAXXX       tcp  --  *  *     0.0.0.0/0  0.0.0.0/0  tcp dpt:30153


Chain KUBE-MARK-MASQ (8 references)
 pkts bytes target                prot opt in out   source     destination         
# The packet gets marked with 0x4000
    0     0 MARK                  all  --  *  *     0.0.0.0/0  0.0.0.0/0  MARK or 0x4000


# Dedicated chain of the NodePort's target Service; forwards randomly to one of the Service endpoints
Chain KUBE-SVC-CZAXXX (2 references)
 pkts bytes target                prot opt in out   source     destination    
# kube-system/kube-dns-nodeport:metrics     
    0     0 KUBE-SEP-DZXXXX       all  --  *  *     0.0.0.0/0  0.0.0.0/0  statistic mode random probability 0.50000000000
# kube-system/kube-dns-nodeport:metrics 
    0     0 KUBE-SEP-COSXXX       all  --  *  *     0.0.0.0/0  0.0.0.0/0           


# Dedicated chain for one endpoint of the NodePort Service
Chain KUBE-SEP-DZXXXX (1 references)
 pkts bytes target                prot opt in out   source     destination         
# kube-system/kube-dns-nodeport:metrics
    0     0 KUBE-MARK-MASQ        all  --  *  *     172.29.2.3 0.0.0.0/0            
# Matches our scenario; DNATs the destination from a local address to the endpoint address. If the endpoint is not on this host, the packet leaves via the flannel.1 interface
# kube-system/kube-dns-nodeport:metrics 
    0     0 DNAT                  tcp  --  *  *     0.0.0.0/0  0.0.0.0/0  tcp to:172.29.2.3:9153</pre>
<p>So if the Service endpoint is not on the local host, a packet sent to localhost:30153 is first marked with 0x4000 and then DNATed to the endpoint's IP:PORT (for example 172.29.2.3:9153), which ensures the packet goes out via flannel.1.</p>
<p>The POSTROUTING-stage rules are as follows:</p>
<pre class="crayon-plain-tag">Chain POSTROUTING (policy ACCEPT 2 packets, 120 bytes)
pkts bytes target   prot opt in  out  source     destination         
# kubernetes postrouting rules
83159 5015K KUBE-POSTROUTING all  --  *   *    0.0.0.0/0  0.0.0.0/0            

Chain KUBE-POSTROUTING (1 references)
pkts bytes target prot opt in out source destination   
# kubernetes service traffic requiring SNAT      
0   0 MASQUERADE  all  --  *  *  0.0.0.0/0 0.0.0.0/0  mark match 0x4000/0x4000 random-fully</pre>
<p>Here the SNAT happens: every packet carrying the 0x4000 mark is masqueraded, ensuring the flannel.1 address is used as the source IP.</p>
<div class="blog_h2"><span class="graybg">The 63-second pattern</span></div>
<p>Repeated testing showed that whenever a request hangs, it always takes roughly 63 seconds before the response arrives.</p>
<p>The number 63 comes from TCP's default SYN retransmission behavior. If a SYN receives no ACK, the sender retransmits it with exponentially growing delays of 1, 2, 4, 8, 16, and 32 seconds, which adds up to a total delay of 63 seconds. The small sketch below spells out the arithmetic.</p>
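<p>As a sanity check, here is a minimal sketch (assuming the default <pre class="crayon-plain-tag">net.ipv4.tcp_syn_retries</pre> value of 6) that adds up the retransmission delays and shows why the last SYN goes out at the 63-second mark:</p>
<pre class="crayon-plain-tag">#include &lt;stdio.h&gt;

int main(void)
{
	int syn_retries = 6;   /* assumed default of net.ipv4.tcp_syn_retries */
	int delay = 1, total = 0;

	/* The original SYN goes out at t=0; each retry waits twice as long as the previous one. */
	for (int i = 0; i &lt; syn_retries; i++) {
		total += delay;    /* 1 + 2 + 4 + 8 + 16 + 32 */
		printf("retry %d sent after %d s, at t=%d s\n", i + 1, delay, total);
		delay *= 2;
	}
	printf("last SYN sent at t=%d s\n", total);   /* 63, matching the capture */
	return 0;
}</pre>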
<p>What is puzzling is why, after 63 seconds, the result is a successful connection rather than a timeout.</p>
<div class="blog_h2"><span class="graybg">Outer packets</span></div>
<p>From the TCP captures above, the Pod interface on the server side never received the first six SYNs; those packets must have been dropped somewhere along the path.</p>
<p>In VXLAN mode, the TCP packets captured above are encapsulated in UDP datagrams and sent out through port 8472 of the node's physical NIC. Let's analyze the outer packets:</p>
<pre class="crayon-plain-tag"># tcpdump -ttttt -n -v -i eth0 'udp port 8472'
# When the request succeeds there is no output, because traffic to a Pod on the local node does not go through VXLAN

# Packets on the requesting side while a request hangs
 00:00:00.000000 IP (tos 0x0, ttl 64, id 43516, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42199, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xd480 (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19658463 ecr 0,nop,wscale 9], length 0
 00:00:01.000542 IP (tos 0x0, ttl 64, id 44011, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42200, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xd097 (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19659464 ecr 0,nop,wscale 9], length 0
 00:00:03.005505 IP (tos 0x0, ttl 64, id 45443, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42201, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xc8c2 (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19661469 ecr 0,nop,wscale 9], length 0
 00:00:07.008579 IP (tos 0x0, ttl 64, id 46574, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42202, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xb91f (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19665472 ecr 0,nop,wscale 9], length 0
 00:00:15.024518 IP (tos 0x0, ttl 64, id 50068, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42203, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x99cf (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19673488 ecr 0,nop,wscale 9], length 0
 00:00:31.072564 IP (tos 0x0, ttl 64, id 65085, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42204, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x5b1f (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19689536 ecr 0,nop,wscale 9], length 0
 00:01:03.136538 IP (tos 0x0, ttl 64, id 19809, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.50024 &gt; 10.0.0.13.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42205, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xddde (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19721600 ecr 0,nop,wscale 9], length 0
 00:01:03.137105 IP (tos 0x0, ttl 64, id 63229, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.13.50017 &gt; 10.0.0.11.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.2.3.9153 &gt; 172.29.0.0.40233: Flags [S.], cksum 0xbdbe (correct), seq 4208932479, ack 1769165321, win 27960, options [mss 1410,sackOK,TS val 19735883 ecr 19721600,nop,wscale 9], length 0


# Packets on the server side while a request hangs
# SYN 0
 00:00:00.000000 IP (tos 0x0, ttl 64, id 43516, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42199, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xd480 (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19658463 ecr 0,nop,wscale 9], length 0
# SYN 1
 00:00:01.000543 IP (tos 0x0, ttl 64, id 44011, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42200, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xd097 (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19659464 ecr 0,nop,wscale 9], length 0
# SYN 2
 00:00:03.005514 IP (tos 0x0, ttl 64, id 45443, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42201, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xc8c2 (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19661469 ecr 0,nop,wscale 9], length 0
# SYN 3
 00:00:07.008577 IP (tos 0x0, ttl 64, id 46574, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42202, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xb91f (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19665472 ecr 0,nop,wscale 9], length 0
# SYN 4
 00:00:15.024575 IP (tos 0x0, ttl 64, id 50068, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42203, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x99cf (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19673488 ecr 0,nop,wscale 9], length 0
# SYN 5
 00:00:31.072593 IP (tos 0x0, ttl 64, id 65085, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.60142 &gt; 10.0.0.13.8472: [bad udp cksum 0xffff -&gt; 0x4b80!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42204, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0x5b1f (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19689536 ecr 0,nop,wscale 9], length 0
# SYN 6, at 63 seconds; this time the UDP datagram carries no checksum, and the server finally receives the SYN
 00:01:03.136659 IP (tos 0x0, ttl 64, id 19809, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.11.50024 &gt; 10.0.0.13.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 64, id 42205, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.0.0.40233 &gt; 172.29.2.3.9153: Flags [S], cksum 0xddde (correct), seq 1769165320, win 43690, options [mss 65495,sackOK,TS val 19721600 ecr 0,nop,wscale 9], length 0
 00:01:03.136830 IP (tos 0x0, ttl 64, id 63229, offset 0, flags [none], proto UDP (17), length 110)
    10.0.0.13.50017 &gt; 10.0.0.11.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.29.2.3.9153 &gt; 172.29.0.0.40233: Flags [S.], cksum 0xbdbe (correct), seq 4208932479, ack 1769165321, win 27960, options [mss 1410,sackOK,TS val 19735883 ecr 19721600,nop,wscale 9], length 0</pre>
<p>The UDP datagrams on the requesting side and the server side correspond one to one; at the very least, every datagram sent by the requester did reach the server.</p>
<p>However, the earlier UDP datagrams are all flagged bad udp cksum, while the final one carries no checksum at all, and only then is the connection established. There is good reason to suspect the failure is checksum-related.</p>
<p>The <a href="https://tools.ietf.org/html/rfc7348">VXLAN RFC</a>, in its VXLAN Frame Format section, says the following about the UDP checksum:</p>
<p style="padding-left: 30px;">The UDP checksum should be transmitted as zero. A receiver that gets a UDP packet with a zero checksum must accept it for decapsulation. If the sender does provide a non-zero checksum, it must be correct and computed over the entire packet, including the IP header, the UDP header, the VXLAN header, and the innermost MAC frame. A receiver may verify a non-zero checksum or skip verification; but if it verifies and the checksum is incorrect, it must drop the packet.</p>
<p>The RFC is explicit: a wrong checksum that gets verified leads to the packet being dropped. Applied to our scenario, we can infer that the server-side kernel dropped the datagrams flagged bad udp cksum, which is why the server's Pod interface never saw the SYNs.</p>
<p>So why is the checksum wrong in the first place? The root cause must lie in the kernel.</p>
<div class="blog_h2"><span class="graybg">The kernel bug</span></div>
<p>Modern operating systems support various forms of network offloading, delegating work to the NIC to relieve the CPU. Judging by how the kernel code has evolved, the variety of offloads keeps growing.</p>
<p>Checksumming is one such offload: the IP, TCP, and UDP checksums are then computed only as the packet is about to leave the network interface. Offloading only works when the kernel's TCP/IP stack, the device driver, and the hardware cooperate correctly.</p>
<p>From our research we learned that the kernel has a bug in its VXLAN handling that prevents checksum offloading from completing correctly. The bug only surfaces in a fairly marginal scenario.</p>
<p>Provided the VXLAN UDP header has been NATed (see the double SNAT issue below), if:</p>
<ol>
<li>the VXLAN device has the UDP checksum disabled (as the RFC recommends), and</li>
<li>the VXLAN device has Tx checksum offloading enabled,</li>
</ol>
<p>then an incorrect UDP checksum is generated.</p>
<div class="blog_h2"><span class="graybg">Double SNAT</span></div>
<p>We mentioned that the kernel bug only triggers when the VXLAN UDP packet is NATed. With both source and destination addresses on the host network, why would the UDP packet be NATed at all?</p>
<p>In the iptables analysis above we saw that a packet to localhost:30153 is:</p>
<ol>
<li>DNATed to the server Pod's address, which ensures the packet can go out via flannel.1</li>
<li>marked with 0x4000; this mark is used later, in POSTROUTING, to SNAT the packet so that the flannel.1 address becomes the source address</li>
</ol>
<p>The DNATed and SNATed inner TCP packet enters the flannel.1 interface, is handled by the kernel's VXLAN driver, and gets encapsulated in a UDP datagram. Note that the iptables mark was meant for the inner packet only. However, once VXLAN wraps the inner packet in an outer UDP datagram and it re-enters the network stack, the kernel automatically carries the 0x4000 mark over to the outer UDP packet, which triggers an extra SNAT:</p>
<pre class="crayon-plain-tag">iptables -t nat -I  KUBE-POSTROUTING 1 -j LOG --log-prefix "0x4000-marked: " -m mark --mark 0x4000/0x4000

dmesg -wH

# First NAT, on the inner packet: here we expect 127.0.0.1 to be SNATed to the flannel.1 address
[  +3.851027] 0x4000-marked: IN= OUT=flannel.1 SRC=127.0.0.1 DST=172.29.2.3 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=44704 DF PROTO=TCP SPT=43326 DPT=9153 WINDOW=43690 RES=0x00 SYN URGP=0 MARK=0x4000 
# Second NAT, on the outer packet: we never asked for this SNAT, since the source address is already eth0's address
[  +0.000019] 0x4000-marked: IN= OUT=eth0 SRC=10.0.0.12 DST=10.0.0.11 LEN=110 TOS=0x00 PREC=0x00 TTL=64 ID=9697 PROTO=UDP SPT=60211 DPT=8472 LEN=90 MARK=0x4000</pre>
<p>Before <a href="https://github.com/kubernetes/kubernetes/commit/d86d1defa1e619b60031d173ed401b00a2d8957f">Kubernetes 1.16.0</a>, kube-proxy did not use the <pre class="crayon-plain-tag">--random-fully</pre> option when SNATing (<pre class="crayon-plain-tag">-j MASQUERADE</pre>). That meant the second SNAT had no effect: during masquerading the kernel tries to keep the source port unchanged, and the source address was already the desired one.</p>
<p>With --random-fully, however, things change: the option forces a fully random source-port mapping, and that is what triggers the kernel bug described above.</p>
<div class="blog_h2"><span class="graybg">random-fully</span></div>
<p>This is an option of the SNAT/MASQUERADE target: it uses a pseudo-random number generator to pick the port that replaces the pre-NAT source port. According to the documentation it requires kernel 3.14 or later.</p>
<p>Yet we run CentOS 7 with kernel 3.10.0-1127.13.1.el7.x86_64, which in theory should not support this feature.</p>
<p>Dumping the rules with iptables-save on the host indeed shows no --random-fully. Dumping them from inside the kube-proxy container, however, does:</p>
<pre class="crayon-plain-tag"># iptables-save | egrep '\-A\sKUBE-POSTROUTING'
-A KUBE-POSTROUTING  -m mark --mark 0x4000/0x4000 -j MASQUERADE

# kubectl -n kube-system exec kube-proxy-7qtzm -- iptables-save | egrep '\-A\sKUBE-POSTROUTING'
-A KUBE-POSTROUTING  -m mark --mark 0x4000/0x4000 -j MASQUERADE --random-fully</pre>
<p>The likely reason is that the two iptables binaries are different versions. One thing is certain: --random-fully really does matter in our environment, because the problem disappears as soon as the option is removed.</p>
<div class="blog_h1"><span class="graybg">Fixes</span></div>
<div class="blog_h2"><span class="graybg">Workarounds</span></div>
<div class="blog_h3"><span class="graybg">Disable offloading</span></div>
<p>Since the root cause is an offloading-related kernel bug, disabling offloading is the most direct remedy:</p>
<pre class="crayon-plain-tag">ethtool --offload flannel.1 rx off tx off</pre>
<p>The timing of this command matters: after a host reboot it can only be run once Flannel has created the interface, otherwise it fails because the device does not exist.</p>
<div class="blog_h3"><span class="graybg">Prevent the double SNAT</span></div>
<p>There are two ways to keep the VXLAN UDP packets from being SNATed. The first is to drop the --random-fully option, which also confirms the earlier guess about that option:</p>
<pre class="crayon-plain-tag">iptables -t nat -R KUBE-POSTROUTING  1  -m mark --mark 0x4000/0x4000 -j MASQUERADE </pre>
<p>The second is to reset the mark on UDP packets destined for port 8472; this is the approach the Kubernetes community took:</p>
<pre class="crayon-plain-tag">iptables -A OUTPUT -p udp -m udp --dport 8472 -j MARK --set-mark 0x0  </pre>
<div class="blog_h2"><span class="graybg">Permanent fixes</span></div>
<div class="blog_h3"><span class="graybg">Upgrade Kubernetes</span></div>
<p>The Kubernetes v1.18.5 changelog shows that PR <a href="https://github.com/kubernetes/kubernetes/pull/92035">92035</a> fixes this issue. The PR clears the 0x4000 mark once it is no longer needed.</p>
<p>During kubelet initialization, chains such as KUBE-MARK-MASQ, KUBE-MARK-DROP, and KUBE-POSTROUTING are created in the nat table and populated with rules. The PR changes that logic:</p>
<pre class="crayon-plain-tag">func (kl *Kubelet) syncNetworkUtil() {
	// ...
	if _, err := kl.iptClient.EnsureRule(utiliptables.Append, utiliptables.TableNAT,
	// The previously problematic --set-xmark 0x4000/0x4000 is replaced with --or-mark here
      KubeMarkMasqChain, "-j", "MARK", "--or-mark", masqueradeMark); err != nil {
		klog.Errorf("Failed to ensure marking rule for %v: %v", KubeMarkMasqChain, err)
		return
	}
	// ...
	// The key change: add the following rule to KUBE-POSTROUTING:
	// if the packet does not carry the 0x4000 mark bit, do nothing
	// iptables -t nat -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
	if _, err := kl.iptClient.EnsureRule(utiliptables.Append, utiliptables.TableNAT, KubePostroutingChain,
		"-m", "mark", "!", "--mark", fmt.Sprintf("%s/%s", masqueradeMark, masqueradeMark),
		"-j", "RETURN"); err != nil {
		klog.Errorf("Failed to ensure filtering rule for %v: %v", KubePostroutingChain, err)
		return
	}
	// Otherwise, clear the 0x4000 mark bit so the packet is not SNATed again when it re-traverses the network stack
	// At this point we know 0x4000 is set, so XOR can safely clear just that bit without touching the others
	// iptables -t nat -A KUBE-POSTROUTING -j MARK --xor-mark 0x4000
	if _, err := kl.iptClient.EnsureRule(utiliptables.Append, utiliptables.TableNAT, KubePostroutingChain,
		"-j", "MARK", "--xor-mark", masqueradeMark); err != nil {
		klog.Errorf("Failed to ensure unmarking rule for %v: %v", KubePostroutingChain, err)
		return
	}
	// ...
}</pre>
<p>The PR also makes similar changes to kube-proxy's iptables and ipvs modules; that code is not reproduced here.</p>
<div class="blog_h3"><span class="graybg">Upgrade the kernel</span></div>
<p>Kernel versions 5.6.13, 5.4.41, 4.19.123, and 4.14.181 are known to contain the fix for the kernel bug described above; when CentOS 7 will pick it up is unknown, so you may have to patch it yourself.</p>
<div class="blog_h1"><span class="graybg">Going deeper</span></div>
<div class="blog_h2"><span class="graybg">Checksum</span></div>
<p>A checksum is a fixed-length field that network protocols use to detect certain transmission errors.</p>
<p>A checksum is usually a digest computed over certain packet fields; the algorithm determines both its reliability and its computational cost. IP uses only the packet header, while most L4 protocols use both the header and the payload.</p>
<p>In IPv4 (IPv6 has no IP checksum), the IP checksum is a 16-bit field covering all fields of the IP header. A checksum error detected at any hop causes the packet to be dropped silently, with no ICMP message; L4 protocols have to account for such silent drops, for example TCP retransmits when an ACK does not arrive in time.</p>
<p>At every hop the IP checksum has to be updated, if only because the TTL changes. Besides the TTL, the IP header may also change because of:</p>
<ol>
<li>address changes caused by NAT</li>
<li>IP option processing</li>
<li>IP fragmentation</li>
</ol>
<p>To compute the IP checksum, the data is split into 16-bit words; the words are summed and the result is complemented (ones' complement) to yield the final checksum. Linux may sum in 32-bit or even 64-bit chunks for speed, which requires an extra fold step (csum_fold) before taking the complement. A small sketch of the textbook algorithm follows.</p>
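<p>For illustration only, here is a user-space sketch of the RFC 1071 algorithm (not the kernel's optimized csum_partial/csum_fold, and the sample header bytes are just an example): sum 16-bit words into a wider accumulator, fold the carries back in, then take the ones' complement.</p>
<pre class="crayon-plain-tag">#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

/* RFC 1071-style Internet checksum: 16-bit words summed into a 32-bit
 * accumulator, carries folded back in, result complemented. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	while (len &gt; 1) {                       /* add 16-bit words */
		sum += (uint32_t)data[0] &lt;&lt; 8 | data[1];
		data += 2;
		len -= 2;
	}
	if (len == 1)                           /* odd trailing byte, padded with zero */
		sum += (uint32_t)data[0] &lt;&lt; 8;

	while (sum &gt;&gt; 16)                       /* fold the carries (the csum_fold step) */
		sum = (sum &amp; 0xffff) + (sum &gt;&gt; 16);

	return (uint16_t)~sum;                  /* ones' complement */
}

int main(void)
{
	/* A 20-byte IPv4 header (TCP, 10.0.0.11 to 172.29.2.3) with its
	 * checksum field, bytes 10-11, zeroed before computing. */
	uint8_t iphdr[20] = {
		0x45, 0x00, 0x00, 0x3c, 0xa4, 0xd7, 0x40, 0x00,
		0x40, 0x06, 0x00, 0x00, 0x0a, 0x00, 0x00, 0x0b,
		0xac, 0x1d, 0x02, 0x03,
	};
	printf("IP checksum: 0x%04x\n", inet_checksum(iphdr, sizeof(iphdr)));
	return 0;
}</pre>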
<p>Because the IP checksum covers only the header, it is cheap, and Linux always computes it on the CPU rather than offloading it to hardware.</p>
<p>An L4 checksum covers the complete segment: the L4 header, the L4 payload, and the so-called pseudo-header. The pseudo-header is simply the source and destination addresses from the IP header plus another 32 bits (a zero byte, the protocol number, and the L4 length).</p>
<p>The IP layer changes the IP header in scenarios such as NAT, which invalidates the checksum computed by the L4 protocol. If the stale checksum is not updated, none of the hops along the way will notice, because intermediate routers only verify the IP checksum; only the kernel at the destination detects the problem at L4. Note that the checksum algorithm is incremental: when NAT changes only a few fields, the checksum can be updated without recomputing it from scratch, as sketched below.</p>
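<p>A minimal sketch of that incremental update (a simplified user-space analogue of inet_proto_csum_replace2/4, following the RFC 1624 formula HC' = ~(~HC + ~m + m'); the port numbers and the old checksum value are just examples taken from the capture above). It assumes the stored checksum is valid to begin with, which is exactly the assumption the buggy path violates when the field is zero:</p>
<pre class="crayon-plain-tag">#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Fold a 32-bit accumulator back into 16 bits. */
static uint16_t fold(uint32_t sum)
{
	while (sum &gt;&gt; 16)
		sum = (sum &amp; 0xffff) + (sum &gt;&gt; 16);
	return (uint16_t)sum;
}

/* Incrementally patch a 16-bit Internet checksum after one 16-bit word of the
 * covered data changes from 'from' to 'to' (RFC 1624: HC' = ~(~HC + ~m + m')).
 * A 32-bit address change is just two such updates applied in a row. */
static uint16_t csum_update16(uint16_t check, uint16_t from, uint16_t to)
{
	uint32_t sum = (uint16_t)~check;   /* ~HC */
	sum += (uint16_t)~from;            /* ~m  */
	sum += to;                         /* m'  */
	return (uint16_t)~fold(sum);
}

int main(void)
{
	uint16_t old_check = 0xd480;       /* checksum carried before NAT (example) */
	uint16_t new_check = csum_update16(old_check, 40233, 50024);
	printf("old 0x%04x, new 0x%04x\n", old_check, new_check);
	return 0;
}</pre>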
<div class="blog_h2"><span class="graybg">Offloading</span></div>
<p>As mentioned above, computing an L4 checksum involves the complete segment and is comparatively expensive, so Linux can delegate it to the hardware; this is checksum offloading.</p>
<p>Whether a device supports checksum offloading is advertised to the kernel through the <pre class="crayon-plain-tag">net_device-&gt;features</pre> flags:</p>
<ol>
<li>NETIF_F_HW_CSUM: the driver can compute checksums over arbitrary protocol combinations and layers</li>
<li>NETIF_F_IP_CSUM: the driver can compute L4 checksums (TCP/UDP over IPv4 only)</li>
<li>NETIF_F_IPV6_CSUM: the driver can compute L4 checksums (TCP/UDP over IPv6 only)</li>
<li>NETIF_F_NO_CSUM: the device is known not to need any checksum, typically the loopback device</li>
<li>NETIF_F_RXCSUM: the driver offloads checksum verification of received packets; the flag is only used for turning the device's RX checksumming off</li>
</ol>
<p>The <pre class="crayon-plain-tag">skb-&gt;ip_summed</pre> field holds the checksum status; its meaning differs between the receive path and the transmit path.</p>
<p>On the receive path:</p>
<ol>
<li>CHECKSUM_NONE: the device did not verify the checksum, possibly because it lacks the capability</li>
<li>CHECKSUM_UNNECESSARY: the kernel no longer needs to verify the checksum</li>
<li>CHECKSUM_COMPLETE: the device has already provided the complete L4 checksum; the L4 code only needs to add the pseudo-header to verify it</li>
</ol>
<p>On the transmit path:</p>
<ol>
<li>CHECKSUM_NONE: the kernel has already taken care of the checksum; the device has nothing to do</li>
<li>CHECKSUM_UNNECESSARY: same meaning as CHECKSUM_NONE</li>
<li>CHECKSUM_PARTIAL: the kernel has computed the pseudo-header part of the checksum; the driver must checksum the data from <pre class="crayon-plain-tag">skb-&gt;csum_start</pre> to the end of the packet and store the result at <pre class="crayon-plain-tag">skb-&gt;csum_start + skb-&gt;csum_offset</pre></li>
<li>CHECKSUM_COMPLETE: not used</li>
</ol>
<p>So on the transmit path, a skb-&gt;ip_summed value of CHECKSUM_PARTIAL means the kernel is asking the driver for checksum offloading.</p>
<div class="blog_h2"><span class="graybg">The kernel bug</span></div>
<p>With this background, we can look at what the kernel bug involved here actually is:</p>
<pre class="crayon-plain-tag">// linux-3.10.y
static bool
udp_manip_pkt(struct sk_buff *skb,  // socket buffer being manipulated
	      const struct nf_nat_l3proto *l3proto,  // function pointers for the L3-specific NAT operations
	      unsigned int iphdroff, unsigned int hdroff,  // offsets of the IP header and the L4 header
	      const struct nf_conntrack_tuple *tuple,  // conntrack tuple carrying the old/new IP and port
	      enum nf_nat_manip_type maniptype)  // SNAT or DNAT
{
	struct udphdr *hdr;
	__be16 *portptr, newport;

	if (!skb_make_writable(skb, hdroff + sizeof(*hdr)))
		return false;

	// Get the UDP header
	hdr = (struct udphdr *)(skb-&gt;data + hdroff);

	if (maniptype == NF_NAT_MANIP_SRC) {
		// Source port after NAT
		newport = tuple-&gt;src.u.udp.port;
		// Source port before NAT
		portptr = &amp;hdr-&gt;source;
	} else {
		/* Get rid of dst port */
		newport = tuple-&gt;dst.u.udp.port;
		portptr = &amp;hdr-&gt;dest;
	}
	// If the checksum is non-zero, or checksum offloading is enabled, update the checksum
	if (hdr-&gt;check || skb-&gt;ip_summed == CHECKSUM_PARTIAL) {

		//       This calls nf_nat_ipv4_csum_update
		l3proto-&gt;csum_update(skb, iphdroff, &amp;hdr-&gt;check,
				     tuple, maniptype);
		inet_proto_csum_replace2(&amp;hdr-&gt;check, skb, *portptr, newport,
					 0);
		if (!hdr-&gt;check)
			hdr-&gt;check = CSUM_MANGLED_0;
	}
	*portptr = newport;
	return true;
}

static void nf_nat_ipv4_csum_update(struct sk_buff *skb,
				    unsigned int iphdroff, __sum16 *check,
				    const struct nf_conntrack_tuple *t,
				    enum nf_nat_manip_type maniptype)
{
	struct iphdr *iph = (struct iphdr *)(skb-&gt;data + iphdroff);
	__be32 oldip, newip;

	if (maniptype == NF_NAT_MANIP_SRC) {
		oldip = iph-&gt;saddr;
		newip = t-&gt;src.u3.ip;
	} else {
		oldip = iph-&gt;daddr;
		newip = t-&gt;dst.u3.ip;
	}
	// An invalid (zero) checksum is passed in here
	inet_proto_csum_replace4(check, skb, oldip, newip, 1);
}

void inet_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
			      __be32 from, __be32 to, int pseudohdr)
{
	__be32 diff[] = { ~from, to };
	if (skb-&gt;ip_summed != CHECKSUM_PARTIAL) {
		*sum = csum_fold(csum_partial(diff, sizeof(diff),
				~csum_unfold(*sum)));
		if (skb-&gt;ip_summed == CHECKSUM_COMPLETE &amp;&amp; pseudohdr)
			skb-&gt;csum = ~csum_partial(diff, sizeof(diff),
						~skb-&gt;csum);
	} else if (pseudohdr)
		// This is the branch taken; note that the update relies on the previous checksum being a correct value
		*sum = ~csum_fold(csum_partial(diff, sizeof(diff), csum_unfold(*sum)));
}</pre>
<p>This code runs when the UDP packet of a VXLAN endpoint is NATed. If the VXLAN device has the UDP checksum disabled, it sets udphdr-&gt;check to zero. If the device also has Tx checksum offloading enabled, skb-&gt;ip_summed is CHECKSUM_PARTIAL. That is exactly the configuration in our environment.</p>
<p>With the UDP checksum disabled, udphdr-&gt;check is zero and clearly contains no checksum information about the old pseudo-header; any checksum computed over the pseudo-header would be non-zero, if only because of the protocol field (UDP, 0x11).</p>
<p>Therefore, the decision on whether to update the checksum should look only at whether the VXLAN interface has the UDP checksum disabled; <a href="https://github.com/torvalds/linux/commit/ea64d8d6c675c0bb712689b13810301de9d8f77a">if it is disabled, the checksum should not be updated</a>.</p>
<div class="blog_h1"><span class="graybg">References</span></div>
<ol>
<li><a href="https://github.com/projectcalico/calico/issues/3145">TCP offloading on VXLAN.calico adaptor causing 63 second delays in VXLAN communications node-&gt;nodeport or node-&gt;clusterip:port</a></li>
<li><a href="https://github.com/torvalds/linux/commit/ea64d8d6c675c0bb712689b13810301de9d8f77a">netfilter: nat: never update the UDP checksum when it's 0 </a></li>
<li><a href="https://github.com/kubernetes/kubernetes/pull/92035">kubelet, kube-proxy: unmark packets before masquerading them</a></li>
</ol>
</div><p>The post <a rel="nofollow" href="https://blog.gmem.cc/nodeport-63s-delay-due-to-kernel-issue">A 63-second NodePort delay caused by a kernel bug</a> appeared first on <a rel="nofollow" href="https://blog.gmem.cc">绿色记忆</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://blog.gmem.cc/nodeport-63s-delay-due-to-kernel-issue/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
