DevPod on Kubernetes: turning devcontainer.json into a persistent remote workspace

10 Apr 2026

By Alex, in Cloud

DevPod is an open source workspace manager for reproducible development environments across Docker, Kubernetes, SSH hosts, and several cloud backends. This note documents a full Kubernetes-based remote development setup with DevPod, including persistent volume strategy, custom images, file sync, IDE integration, and the GPU issues that tend to burn the most time.

What DevPod is

DevPod, from Loft Labs, separates environment definition from the infrastructure that runs it. The developer describes the environment in devcontainer.json, including the base image, toolchain, ports, and lifecycle hooks. DevPod then creates and manages the matching workspace on the selected Provider.

Three terms matter more than anything else:

  • Provider: the infrastructure backend. DevPod supports Docker, Kubernetes, SSH, and several cloud platforms.
  • Workspace: an isolated development environment instance, usually backed by a container or VM on the provider.
  • devcontainer.json: a Dev Container specification file that defines the image, lifecycle hooks, port forwarding, and editor behavior.

Compared with GitHub Codespaces or Gitpod, DevPod is a client-side tool rather than a hosted platform. On a self-managed Kubernetes cluster, that means you keep control over networking, storage, security policy, and node placement.

Kubernetes provider architecture

When Kubernetes is the provider, DevPod creates a Pod to host the workspace. Most setups end up with three files:

  1. devcontainer.json, which defines the image, workspace directory, forwarded ports, and lifecycle commands.
  2. pod-manifest.yaml, which carries the Kubernetes-native parts such as security context, resource limits, and volume mounts.
  3. An orchestration script such as devpod.sh, which wraps devpod up, file sync, and environment bootstrap. That script is usually the glue that makes the workflow tolerable.

Workspace lifecycle

A typical flow looks like this:

Shell
# Create and start the workspace, which creates a Pod on Kubernetes
devpod up . --ide none --provider kubernetes

# Sync local source code to the remote workspace
rsync -az --exclude='node_modules' ./project/ my-workspace.devpod:/root/Projects/project/

# Enter the development environment
devpod ssh my-workspace

# Stop the workspace, which removes the Pod but keeps the PVC
devpod stop my-workspace

# Delete everything, including the Pod and the PVC
devpod delete my-workspace

What matters is how devpod stop behaves. It removes the Pod but keeps the PVC. The next devpod up recreates the Pod and reattaches the same volume, so the data survives Pod recreation.

Managing multiple environments

The simplest way to split environments is to keep a separate Pod manifest for each one and switch them in a wrapper script:

Shell
# Example orchestration logic: select a manifest and disk size by environment
case "$ENV" in
  prod) MANIFEST="pod-manifest.yaml";      DISK="300Gi" ;;
  dev)  MANIFEST="pod-manifest-dev.yaml";  DISK="50Gi"  ;;
  test) MANIFEST="pod-manifest-test.yaml"; DISK="500Gi" ;;
  *)    echo "Unknown environment: $ENV" >&2; exit 1 ;;
esac

devpod up . --ide none \
  --provider kubernetes \
  --provider-option DISK_SIZE="$DISK" \
  --provider-option POD_MANIFEST_TEMPLATE="$MANIFEST"

This lets each environment define its own node selectors, quotas, and security policy while still sharing one devcontainer.json and one base image.

Persistent volume strategy

Where you mount the PVC decides what survives a Pod rebuild.

Recommended: mount the PVC at $HOME

Mount the PVC at the container's $HOME, for example /root. In most setups, that is the least painful option. There are a few reasons to prefer it:

  • The IDE server side, such as VS Code Server or Cursor Server, installs itself under ~/.vscode-server or ~/.cursor-server. Those directories land on persistent storage automatically.
  • Toolchain state such as ~/.nvm and ~/.local/bin does not need extra symlink work.
  • Shell files such as ~/.bashrc also persist, so environment setup happens once instead of on every Pod restart.

If the PVC is mounted somewhere else, such as /workspace, you usually end up adding symlinks or reinstalling tooling whenever the Pod comes back.
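To confirm the layout from inside a running workspace, it can help to check that $HOME really is its own mount. The helper below is a minimal sketch written for this note, not part of DevPod. The function name is_mount_point and its optional second argument (an alternative mounts table, useful for testing) are assumptions. It relies only on the standard /proc/mounts format, where the second field of each line is the mount target.

```shell
# Return 0 if PATH is listed as a mount target in the mounts table.
# $1: path to check, for example "$HOME"
# $2: mounts table to read (hypothetical override, defaults to /proc/mounts)
is_mount_point() {
  path=$1
  mounts=${2:-/proc/mounts}
  # Field 2 of each line is the directory the filesystem is mounted on.
  awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$mounts"
}

# Usage inside the workspace:
#   is_mount_point "$HOME" && echo "HOME is on its own volume"
```

If the check fails, IDE server files and toolchain state under $HOME are probably living on the container's ephemeral filesystem.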

Example directory layout
/root/                          # PVC mount point = $HOME
├── .cursor-server/             # IDE server and extensions, persistent
│   ├── cli/                    # Server binaries, disposable
│   └── extensions/             # Installed extensions, keep these
├── .nvm/                       # Node.js version manager, persistent
├── .local/bin/                 # kubectl and other tools, persistent
├── .bashrc                     # Shell configuration, persistent
├── Projects/
│   ├── my-project/             # Project source code
│   └── shared-libs/            # Shared libraries
└── .config/                    # Tool configuration

Common commands

DevPod manages the whole workspace lifecycle through the devpod CLI. These are the commands that tend to matter in daily use.

Provider management

Add and configure the provider first:

Shell
# Add the Kubernetes provider
devpod provider add kubernetes
 
# List configured providers
devpod provider list
 
# Set provider options such as the namespace and Pod manifest template path
devpod provider set-options kubernetes \
  --option KUBERNETES_NAMESPACE=devpod \
  --option POD_MANIFEST_TEMPLATE=pod-manifest.yaml

Workspace lifecycle

Shell
# Create and start a workspace
# --ide none skips automatic IDE attach and works well in scripts
devpod up . --provider kubernetes --ide none
 
# List workspace state
devpod list
 
# SSH into the workspace
devpod ssh my-workspace
 
# Stop the workspace, which removes the Pod and keeps the PVC
devpod stop my-workspace
 
# Delete the workspace and the PVC
devpod delete my-workspace

stop only removes the Pod. Everything on the PVC, including extensions, toolchain state, and source code, stays in place. The next up recreates the Pod and reattaches the volume, so the environment comes back quickly.

Useful provider options

The Kubernetes provider accepts extra parameters through --provider-option:

Shell
devpod up . --provider kubernetes --ide none \
  --provider-option DISK_SIZE=100Gi \
  --provider-option POD_MANIFEST_TEMPLATE=pod-manifest-test.yaml \
  --provider-option KUBERNETES_NAMESPACE=devpod

| Option | Description |
|---|---|
| DISK_SIZE | PVC size, for example 50Gi or 300Gi. |
| POD_MANIFEST_TEMPLATE | Path to the custom Pod manifest. |
| KUBERNETES_NAMESPACE | Target namespace for workspace Pods. |

Status checks and debugging

Shell
# Show detailed workspace status
devpod status my-workspace
 
# Inspect the underlying Pod directly
kubectl get pod -n devpod -l app=devpod
 
# Show Pod events when startup fails
kubectl describe pod my-workspace -n devpod

Configuration details

devcontainer.json is the core Dev Container file. It defines the image, lifecycle hooks, forwarded ports, editor customization, and the rest of the workspace contract. DevPod fully supports that specification. The file usually lives at .devcontainer/devcontainer.json.

A full example for remote development on Kubernetes:

devcontainer.json
{
  "name": "my-workspace",
 
  // Use a custom image with all tools preinstalled
  "image": "registry.example.com/dev/ubuntu:22.04-tools",
 
  // Skip first-run installation work
  "onCreateCommand": "true",
 
  // Mount the PVC at $HOME so IDE state and extensions persist
  // workspaceMount is left empty on purpose. DevPod v0.6.x has a known
  // .devpodignore issue, so large monorepos can get uploaded in full.
  "workspaceFolder": "/root",
 
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance",
        "ms-python.debugpy",
        "redhat.vscode-yaml",
        "ms-kubernetes-tools.vscode-kubernetes-tools"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python",
        "editor.formatOnSave": true,
        "terminal.integrated.defaultProfile.linux": "bash"
      }
    }
  },
 
  "forwardPorts": [8000, 8080, 5432, 6379],
  "portsAttributes": {
    "8000": { "label": "API Server" },
    "8080": { "label": "Web UI" },
    "5432": { "label": "PostgreSQL", "onAutoForward": "silent" },
    "6379": { "label": "Redis", "onAutoForward": "silent" }
  },
  "otherPortsAttributes": {
    "onAutoForward": "silent"
  }
}

Images and builds

You can define the base container either by pointing directly at an image or by building one from a Dockerfile.

The image field accepts any OCI image, including Docker Hub, GHCR, or a private registry. For remote development on Kubernetes, a prebuilt image usually saves trouble. Baking the whole toolchain into the image cuts startup time from minutes to seconds.

If you need to customize the image, use build:

{
  "build": {
    "dockerfile": "Dockerfile",
    "context": "..",
    "args": {
      "PYTHON_VERSION": "3.11"
    }
  }
}

context defaults to ".", which means the directory that contains devcontainer.json. Setting it to ".." lets the Dockerfile reference files from the project root.

workspaceFolder and workspaceMount

workspaceFolder is the directory the IDE opens by default after it connects. On Kubernetes, it usually makes sense to point it at the PVC mount, for example /root, so the workspace path and the persistent path are the same thing.

workspaceMount controls how local source code gets mounted into the container. It is useful in local Docker workflows. In remote Kubernetes workflows, it is often better to leave it empty. DevPod v0.6.x has a known issue in #1885 where .devpodignore can be ignored during streaming upload, which means a large workspace, including venv and node_modules, can get pushed in full. A custom rsync step gives you much better control.

Lifecycle hooks

The Dev Container spec defines six lifecycle hooks, in this order:

initializeCommand     # runs on the host, every startup
  ↓
onCreateCommand       # runs once after first container creation
  ↓
updateContentCommand  # runs after content updates, at least once
  ↓
postCreateCommand     # runs after user assignment, user secrets available
  ↓
postStartCommand      # runs after each container start
  ↓
postAttachCommand     # runs after each IDE attach

Each hook accepts three forms:

  • String: executed through /bin/sh.
  • Array: executed directly without a shell, which is safer.
  • Object: multiple named commands executed in parallel, useful when several services need to start together.

{
  "postAttachCommand": {
    "api-server": "cd /root/api && python -m uvicorn main:app --port 8000",
    "worker": "cd /root/worker && python -m celery -A tasks worker"
  }
}

A few practical rules help here:

  • If all tools are already in the image, set onCreateCommand to "true" and skip it.
  • postStartCommand is a good place for startup checks or light warmup.
  • The waitFor field decides which phase must finish before the IDE attaches. The default is "updateContentCommand".

IDE customization

You can declare extensions and settings under customizations.vscode, and they are applied automatically when the IDE connects:

"customizations": {
  "vscode": {
    "extensions": [
      "ms-python.python",
      "ms-python.vscode-pylance",
      "ms-python.debugpy",
      "redhat.vscode-yaml",
      "ms-kubernetes-tools.vscode-kubernetes-tools"
    ],
    "settings": {
      "python.defaultInterpreterPath": "/usr/local/bin/python",
      "editor.formatOnSave": true,
      "terminal.integrated.defaultProfile.linux": "bash"
    }
  }
}

Extensions listed under extensions install automatically on first attach. With the PVC mounted at $HOME, you only pay that cost once. Settings defined here override local editor settings, which helps keep behavior consistent across a team.

Port forwarding

Ports listed in forwardPorts are forwarded automatically after the IDE connects. Once a service starts inside the container, you can usually reach it on localhost from your local machine without extra setup.

portsAttributes lets you define a label and behavior per port:

"forwardPorts": [8000, 8080, 5432, 6379],
"portsAttributes": {
  "8000": { "label": "API Server" },
  "8080": { "label": "Web UI", "onAutoForward": "openBrowser" },
  "5432": { "label": "PostgreSQL", "onAutoForward": "silent" },
  "6379": { "label": "Redis", "onAutoForward": "silent" }
},
"otherPortsAttributes": {
  "onAutoForward": "silent"
}

onAutoForward controls the first reaction when DevPod sees the port: "notify" shows a notification, "openBrowser" opens a browser, "silent" forwards quietly, and "ignore" does nothing. otherPortsAttributes sets the default for ports you did not list explicitly.

Environment variables

The Dev Container spec splits environment variables into two layers:

  • containerEnv: set on the container itself, visible to all processes, and fixed for the life of that container.
  • remoteEnv: only visible to IDE-launched processes such as terminals, tasks, and debuggers. This layer can reference ${containerEnv:VAR} and does not require a container rebuild when changed.

{
  "containerEnv": {
    "PYTHONPATH": "/root/libs/common:/root/libs/shared"
  },
  "remoteEnv": {
    "PATH": "${containerEnv:PATH}:/root/.local/bin"
  }
}

Both fields also support ${localEnv:VAR}, which reads an environment variable from the host, for example ${localEnv:HOME}.

Features

Dev Container Features are reusable, self-contained units of installation code and configuration, distributed as OCI artifacts. The features field lets you add tools without editing the base image directly:

{
  "features": {
    "ghcr.io/devcontainers/features/docker-in-docker:2": {},
    "ghcr.io/devcontainers/features/kubectl-helm-minikube:1": {
      "version": "latest"
    },
    "ghcr.io/devcontainers/features/node:1": {
      "version": "22"
    }
  }
}

You can browse the available features at containers.dev/features. For Kubernetes-based remote development, though, baking tools into the base image is usually better than paying installation time on every new workspace. Features fit local Docker prototypes better than long-lived remote workspaces.

Container behavior controls

A few fields change how the container behaves at runtime:

| Field | Default | Description |
|---|---|---|
| overrideCommand | true | Overrides the image command with an infinite loop so the container stays alive. This default usually makes sense for custom development images. |
| shutdownAction | stopContainer | What happens when the IDE closes. Options include stopContainer and none. For Kubernetes, none is often the better choice. |
| init | false | Uses tini as PID 1 to reap zombie processes. |
| privileged | false | Enables privileged mode. In Docker workflows this can be set here. In Kubernetes, it belongs in the Pod manifest. |
| containerUser | root or the Dockerfile USER | The user for all container operations. |
| remoteUser | same as containerUser | The user for IDE terminals and tasks. It can differ from containerUser. |

Predefined variables

String values in devcontainer.json can use these predefined variables:

| Variable | Meaning |
|---|---|
| ${localEnv:VAR_NAME} | Host environment variable, with optional default value syntax: ${localEnv:VAR:default} |
| ${containerEnv:VAR_NAME} | Container environment variable, available only inside remoteEnv |
| ${localWorkspaceFolder} | Workspace path on the host |
| ${containerWorkspaceFolder} | Workspace path inside the container |
| ${devcontainerId} | Stable unique identifier for the container |

Customizing the base image

You can point the Dev Container image field at any public image, but for remote development on Kubernetes it is usually worth building a dedicated base image with the toolchain, language runtimes, and system libraries locked into image layers.

That pays off in a few ways:

  • The Pod is usable as soon as it starts. You do not wait for onCreateCommand to install half the environment.
  • Environment consistency improves because everyone shares the same image instead of replaying installation steps in slightly different conditions.
  • When the Pod is rebuilt, the toolchain comes back with it. You are not depending on package manager availability at workspace creation time.

Dockerfile layering rules

Good layering makes build caching much more effective. Put low-churn tools in lower layers and faster-moving pieces in upper layers. End each RUN block with apt-get clean && rm -rf /var/lib/apt/lists/* to keep layers smaller, and use --no-install-recommends to avoid pulling in packages you do not need.

The following example builds a development image with Python 3.11, common system tools, and the NVIDIA CUDA runtime:

Dockerfile
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive
 
# Layer 1: system tools, Python 3.11, and all PPAs
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      software-properties-common gnupg2 wget curl ca-certificates && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    add-apt-repository -y ppa:graphics-drivers/ppa && \
    wget -qO /tmp/cuda-keyring.deb \
      https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
    dpkg -i /tmp/cuda-keyring.deb && rm /tmp/cuda-keyring.deb && \
    apt-get update && \
    apt-get install -y --no-install-recommends \
      python3.11 python3.11-venv python3.11-dev python3-pip \
      git make vim jq postgresql-client \
      openssh-server procps iproute2 iputils-ping \
      rsync htop telnet && \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 && \
    update-alternatives --install /usr/bin/python  python  /usr/bin/python3.11 1 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
 
# Layer 2: NVIDIA driver tools such as nvidia-smi
RUN apt-get update && \
    apt-get install -y --no-install-recommends nvidia-utils-580-server && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
 
# Layer 3: CUDA runtime libraries
RUN apt-get update && \
    apt-get install -y --no-install-recommends cuda-libraries-12-8 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

Several design choices here matter:

  • Add PPAs and GPG keys in layer 1, before update-alternatives changes the default Python. If you switch Python first, add-apt-repository can fail with No module named 'apt_pkg' because the apt_pkg binding expects the system Python.
  • Keep NVIDIA tools and CUDA libraries in separate layers. That way a driver update only rebuilds one layer.
  • Install nvidia-utils-xxx-server, not nvidia-utils-xxx. On Ubuntu, the latter can be a transitional dummy package without the actual nvidia-smi binary.
  • Pick cuda-libraries-12-8, roughly 1.2 GB, instead of the full cuda-toolkit-12-8, which is closer to 10 GB. Most development environments need the runtime more often than the full compiler and debugger stack.

How the image and devcontainer.json work together

Once the image already contains the full toolchain, devcontainer.json becomes much simpler:

{
  "image": "registry.example.com/dev/ubuntu:22.04-cuda12.8",
  "onCreateCommand": "true",
  "workspaceFolder": "/root"
}

Setting onCreateCommand to "true" means there is nothing left to install at first startup. The Pod is ready immediately after creation.

Customizing the Pod spec

The Pod manifest is the core Kubernetes-side configuration. It controls the things DevPod cannot express through devcontainer.json.

Template variables

DevPod renders the Pod manifest as a template before it creates the Pod. These placeholders are commonly used:

| Variable | Meaning |
|---|---|
| {{.WorkspaceId}} | Workspace name, often reused as the Pod name and label value. |
| {{.Image}} | Image declared in devcontainer.json. |

Security context

Remote development containers often need looser permissions than production containers. These are the settings that come up most often:

| Setting | Use | Risk |
|---|---|---|
| privileged: true | Docker-in-Docker, device access, debugging tools | Full access to host kernel capabilities |
| SYS_ADMIN | mount and cgroup operations | Medium |
| SYS_PTRACE | strace, gdb, and similar debugging | Low |
| NET_ADMIN | Network debugging and iptables work | Medium |
| hostNetwork: true | Direct use of the host network stack, which avoids CNI overhead | Port conflicts and loss of network isolation |
| hostPID: true | Inspect host processes for system-level debugging | Loss of process isolation |

Loosen permissions only where the workspace actually needs them, and keep these Pods isolated to dedicated namespaces or nodes so they do not interfere with production workloads.

Resource requests and limits

YAML
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "16"
    memory: "64Gi"

Set requests low enough to keep scheduling realistic, and limits high enough to leave room for bursts. Development environments rarely sit at peak usage all day, but builds and test runs can spike hard for a short time.

File sync strategy

Default DevPod sync vs custom rsync

DevPod includes a built-in sync path through devpod up, and it works fine for small projects. On large multi-repo workspaces, with dozens of subprojects and millions of files, two problems show up fast:

  • The first sync can take a very long time, and exclusion control is limited.
  • DevPod may try to upload the entire workspaceFolder, including directories you do not want remotely, such as node_modules and .git.

The usual way around this is to launch DevPod with --ide none, skip automatic sync, and then run your own rsync command with explicit include and exclude rules.

The stub directory trick

Even with --ide none, DevPod still tries to sync the local directory that matches workspaceFolder during devpod up. If that directory is large, the initial startup can still crawl. One workaround is to create a temporary empty directory and use that for the initial workspace creation:

Shell
STUB_DIR=$(mktemp -d)
devpod up "$STUB_DIR" --ide none --provider kubernetes ...
rm -rf "$STUB_DIR"
# Then sync the real source tree with rsync

rsync in practice

Shell
SSH_CMD="ssh my-workspace.devpod"
 
rsync -az \
  --exclude='node_modules' \
  --exclude='.git' \
  --exclude='__pycache__' \
  --exclude='venv' \
  --exclude='.venv' \
  --exclude='dist' \
  --exclude='.next' \
  --exclude='.temp' \
  --exclude='.logs' \
  --exclude='.vscode/sessions.json' \
  --copy-unsafe-links \
  ./my-project/ my-workspace.devpod:/root/Projects/my-project/

The most useful flags here are:

  • -az: archive mode plus compression. Do not add --progress when you have a large number of small files. The extra output can slow the SSH stream badly enough to trigger a broken pipe.
  • --copy-unsafe-links: dereferences symlinks that point outside the synced tree. In multi-repo setups this is useful because shared directories linked from elsewhere often do not resolve correctly on the remote side.
  • --exclude: keep anything noisy or disposable out of the remote workspace. .vscode/sessions.json changes constantly and tends to fight with remote state, so it should stay out.
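When the exclude list grows, it can be easier to keep the patterns in one place and hand them to rsync through --exclude-from, a standard rsync option that reads one pattern per line from a file. The sketch below is an illustration under assumptions, not the author's script: the names EXCLUDES and sync_project, and the RSYNC override used to preview or stub the command, are made up for this example.

```shell
# Patterns kept out of the remote workspace, one per line.
EXCLUDES='node_modules
.git
__pycache__
venv
.venv
dist'

# Sync SRC to DEST, skipping every pattern listed in EXCLUDES.
# RSYNC is a hypothetical override (for example RSYNC=echo) that lets you
# preview the command instead of running rsync.
sync_project() {
  src=$1
  dest=$2
  exclude_file=$(mktemp) || return 1
  printf '%s\n' "$EXCLUDES" > "$exclude_file"
  "${RSYNC:-rsync}" -az --copy-unsafe-links \
    --exclude-from="$exclude_file" "$src" "$dest"
  status=$?
  rm -f "$exclude_file"
  return "$status"
}

# Usage:
#   sync_project ./my-project/ my-workspace.devpod:/root/Projects/my-project/
```

Keeping the patterns in a single variable means several sync targets can share the same list.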
Remote IDE access

VS Code and Cursor run remote development by installing a server-side component inside the container. The local editor talks to that server through SSH.

How the server gets installed

The server build has to match the local IDE version, usually by commit hash. The installation flow is usually:

  1. Read the current commit hash from the local IDE.
  2. Download the matching server bundle.
  3. Transfer and unpack it to ~/.cursor-server/cli/servers/Stable-{commit}/ on the remote side.

The wrapper script should make installation idempotent:

Shell
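# get_ide_commit_hash and install_ide_server are placeholders for helpers
# defined elsewhere in the wrapper script.
COMMIT=$(get_ide_commit_hash)
# Escape $HOME so it expands on the remote host, not on the local machine
SERVER_BIN="\$HOME/.cursor-server/cli/servers/Stable-$COMMIT/server/bin/code-server"

if $SSH_CMD "test -x \"$SERVER_BIN\""; then
  echo "Server already installed"
else
  # Download and install the server
  install_ide_server "$COMMIT"
fi

Extension persistence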

Extensions live under ~/.cursor-server/extensions/ or ~/.vscode-server/extensions/. If the PVC is mounted at $HOME, those extensions persist automatically.

A common mistake is wiping the whole ~/.cursor-server directory during a server reinstall. That blows away every installed extension. The safer cleanup target is the server binary directory only:

Shell
# Wrong: removes extensions too
rm -rf ~/.cursor-server
 
# Right: remove only the server binaries
rm -rf ~/.cursor-server/cli

Bulk extension sync

When you first prepare a remote environment, it can be faster to sync already installed local extensions than to redownload everything from the marketplace:

Shell
rsync -az \
  ~/.cursor-server/extensions/ \
  my-workspace.devpod:~/.cursor-server/extensions/

After the sync, check for broken symlinks. Some extensions include links to a local Node.js path that does not exist remotely. Those need to be replaced with real files.

Shell
# Find broken symlinks on the remote side
find ~/.cursor-server/extensions/ -type l ! -exec test -e {} \; -print
 
# Replace each broken link with a real copy of the target file
# fetched from the local machine
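Before replacing anything, it helps to know where each broken link was pointing. The following sketch extends the find command above so that it also prints the missing target. The function name report_broken_links is an assumption for this example, and the sketch only reports. Copying the real files over from the local machine is left as a separate step.

```shell
# Print "LINK -> TARGET" for every symlink under DIR whose target is missing.
report_broken_links() {
  find "$1" -type l ! -exec test -e {} \; -print |
  while IFS= read -r link; do
    # readlink shows the path the link points at, even when it does not exist.
    printf '%s -> %s\n' "$link" "$(readlink "$link")"
  done
}

# Usage on the remote side:
#   report_broken_links ~/.cursor-server/extensions/
```

Each reported target is a path that has to be fetched from the machine the extensions came from. Another option that may be worth testing is rsync's --copy-links (-L), which dereferences symlinks at sync time so the links never reach the remote side.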
Keeping SSH sessions alive

DevPod SSH sessions can drop when left idle because the server's SSH daemon, a firewall, or a load balancer between the client and the pod times out the connection. The standard fix is to enable SSH keepalive on the client side:

# ~/.ssh/config
Host *.devpod
    ServerAliveInterval 60
    ServerAliveCountMax 10

ServerAliveInterval 60 sends a keepalive packet every 60 seconds. ServerAliveCountMax 10 allows up to 10 consecutive missed responses before the client closes the connection. That combination keeps the tunnel alive through typical idle timeouts and handles pauses of up to roughly 10 minutes.
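The 10-minute figure is simply the interval multiplied by the allowed number of missed replies. A small sketch of that arithmetic follows. The function name keepalive_budget is made up for this example.

```shell
# Seconds of silence tolerated before the client gives up:
# ServerAliveInterval multiplied by ServerAliveCountMax.
keepalive_budget() {
  echo $(( $1 * $2 ))
}

keepalive_budget 60 10   # prints 600 (seconds), which is 10 minutes
```

Raising either value widens the window, at the cost of noticing a genuinely dead connection later.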

For sessions opened through scripts, add the options as flags:

Shell
ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=10 my-workspace.devpod

For Cursor remote connections, the keepalive must be in ~/.ssh/config rather than a command flag, because Cursor manages the underlying SSH process itself and does not expose extra flags to the user.

Why the first connection is slow

The first IDE attach to a fresh workspace often takes anywhere from 30 seconds to several minutes because the IDE still has to:

  • Establish the SSH tunnel, which adds some overhead through DevPod's SSH layer.
  • Download and install the server if it is not already present.
  • Initialize the installed extensions.

Later connections are much faster because the server and extensions are already sitting on the PVC.

GPU access inside Kubernetes workspaces

GPU access on Kubernetes depends on several moving parts, including the host driver, the device plugin, and the container runtime hook. If any one of them is wrong, the container will come up without usable GPU devices.

How the NVIDIA device plugin works

The NVIDIA Device Plugin runs as a DaemonSet on GPU nodes and registers the extended resource nvidia.com/gpu with Kubernetes. A Pod requests GPUs by declaring the count in resources.limits:

YAML
resources:
  limits:
    nvidia.com/gpu: "4"
  requests:
    nvidia.com/gpu: "4"

The scheduler places the Pod on a node with enough GPU capacity, and the device plugin injects the actual device nodes such as /dev/nvidia0.

runtimeClassName: nvidia

Requesting GPU resources is not enough. Kubernetes also has to know which container runtime class should handle GPU device setup. That happens through the Pod's runtimeClassName field:

YAML
spec:
  runtimeClassName: nvidia
  containers:
    - name: devpod
      # ...

If you omit runtimeClassName, the Pod may still get GPU quota, but the runtime will not call NVIDIA's prestart hook. The result is simple: no /dev/nvidia* devices inside the container. This is one of the most common failure modes.

AppArmor blocking

privileged: true does not mean AppArmor is unconfined. On nodes with AppArmor enabled, a privileged container can still be blocked by the default profile, such as cri-containerd.apparmor.d, when it tries to access GPU device nodes.

The fix is to declare an unconfined AppArmor profile in the Pod annotations:

YAML
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/devpod: unconfined

Here devpod is the container name. The annotation must match it exactly.

The NVIDIA_VISIBLE_DEVICES trap

It is tempting to set NVIDIA_VISIBLE_DEVICES=all in the Pod manifest to expose every GPU. In a setup that already uses runtimeClassName: nvidia, that usually backfires. A manually set value can interfere with the device plugin's own injection logic.

The NVIDIA container runtime behaves like this:

  • If NVIDIA_VISIBLE_DEVICES comes from the device plugin, the runtime mounts exactly the devices that value names.
  • If the manifest hardcodes NVIDIA_VISIBLE_DEVICES=all, that value overrides the plugin-managed one and can break the mapping step.

The safer approach is to leave NVIDIA_VISIBLE_DEVICES alone and let the device plugin manage it. Keeping NVIDIA_DRIVER_CAPABILITIES=all is fine if the container needs full driver capability access.
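A cheap guard is to scan the Pod manifest for a hardcoded value before handing it to DevPod. The sketch below is a plain text check written for this note. The function names hardcodes_visible_devices and check_manifest are assumptions, and the pattern only matches the usual "name: NVIDIA_VISIBLE_DEVICES" env entry rather than parsing YAML properly.

```shell
# Return 0 if the manifest sets NVIDIA_VISIBLE_DEVICES as an env entry.
hardcodes_visible_devices() {
  grep -Eq 'name:[[:space:]]*"?NVIDIA_VISIBLE_DEVICES"?' "$1"
}

# Print a verdict for one manifest file; non-zero exit when the variable is set.
check_manifest() {
  if hardcodes_visible_devices "$1"; then
    echo "warning: $1 sets NVIDIA_VISIBLE_DEVICES, consider removing it"
    return 1
  fi
  echo "ok: $1 leaves NVIDIA_VISIBLE_DEVICES to the device plugin"
}

# Usage:
#   check_manifest pod-manifest.yaml
```

Running this from the wrapper script before devpod up makes the mistake visible early instead of at nvidia-smi time.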

Getting nvidia-smi inside the container

nvidia-smi is the fastest way to confirm GPU visibility. There is one common trap when you install it inside the container: on some Linux distributions, packages named nvidia-utils-xxx are only transitional dummy packages. They install successfully but do not include the real nvidia-smi binary.

On Ubuntu 22.04, the reliable path is:

  1. Add ppa:graphics-drivers/ppa.
  2. Install nvidia-utils-xxx-server, with the -server suffix.
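To tell a real package from a transitional one, you can check whether its file list actually contains the binary. The helper below is a minimal sketch. The name ships_nvidia_smi is an assumption, and it simply reads file paths on standard input, so it can be fed from dpkg -L, which lists the files a package installed.

```shell
# Read file paths on stdin; succeed if one of them is an nvidia-smi binary.
ships_nvidia_smi() {
  grep -q '/nvidia-smi$'
}

# Usage inside the image:
#   dpkg -L nvidia-utils-580-server | ships_nvidia_smi && echo "binary present"
```

Adding this check to the image build can catch a dummy package at build time rather than after the Pod is running.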

If changing the image is inconvenient, one temporary workaround is to mount host driver tools and libraries into the container with hostPath:

YAML
volumeMounts:
  - name: host-root
    mountPath: /host
    readOnly: true
volumes:
  - name: host-root
    hostPath:
      path: /

After startup, add /host/usr/lib/x86_64-linux-gnu to LD_LIBRARY_PATH and call /host/usr/bin/nvidia-smi directly. It works, but it is still a workaround. The long-term fix is to bake the required driver tools into the image.
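That workaround can be wrapped in a small function so the library path change does not leak into the rest of the shell session. This is a sketch under assumptions: the function name host_nvidia_smi and the HOST_ROOT override are made up for this example, and the paths are the ones named in the paragraph above.

```shell
# Run the host's nvidia-smi through the read-only hostPath mount.
# HOST_ROOT is a hypothetical override; it defaults to /host, the mountPath
# used in the manifest above.
host_nvidia_smi() {
  host_root=${HOST_ROOT:-/host}
  # Prepend the host library directory for this one command only.
  LD_LIBRARY_PATH="$host_root/usr/lib/x86_64-linux-gnu${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
    "$host_root/usr/bin/nvidia-smi" "$@"
}

# Usage:
#   host_nvidia_smi
#   host_nvidia_smi -L
```

Because the variable is set only for the nvidia-smi invocation, other tools in the container keep resolving their own libraries.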

How to debug NVML Unknown Error

If nvidia-smi returns Failed to initialize NVML: Unknown Error, check things in this order:

  1. AppArmor. Confirm the Pod annotation is unconfined, and inspect the actual container profile with cat /proc/1/attr/current.
  2. Device nodes. Check whether ls /dev/nvidia* returns anything. If the files exist but opening them returns EPERM, the cgroup device filter is the problem, not the driver.
  3. Runtime class. Confirm the Pod spec sets runtimeClassName: nvidia and that the cluster actually has that RuntimeClass.
  4. Environment variables. Verify that NVIDIA_VISIBLE_DEVICES was not overridden manually.
  5. Driver versions. Make sure the user-space NVIDIA libraries in the container are compatible with the host kernel driver.
  6. containerd privileged_without_host_devices. If the cluster uses nvidia-container-runtime as the default runtime and this flag is false, privileged pods that do not request nvidia.com/gpu will see device files in /dev but be blocked by the eBPF cgroup program. See the next section.

GPU access in privileged pods that bypass the device plugin

A development pod sometimes skips the nvidia.com/gpu resource request entirely, for example when the node already runs an inference service and the workspace wants to share the hardware without holding a scheduler slot. That approach works until the cluster enables GPU time-slicing.

A common part of the time-slicing setup is replacing the default containerd runc binary with nvidia-container-runtime:

TOML
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  BinaryName = "nvidia-container-runtime"

When this is in place and the containerd setting privileged_without_host_devices is false, privileged containers no longer inherit host /dev automatically. The nvidia-container-runtime attaches an eBPF cgroup program that controls device access. For pods without a device plugin allocation, that program blocks /dev/nvidiactl and friends at the open syscall level.

The device files appear in ls /dev because the runtime still creates their directory entries. But opening them returns EPERM, and NVML fails immediately:

Python
>>> import os; os.open('/dev/nvidiactl', os.O_RDWR)
PermissionError: [Errno 1] Operation not permitted: '/dev/nvidiactl'

The error is at the kernel cgroup level, not in the userspace library. The same block applies even if you run the host's own nvidia-smi binary via chroot /host nvidia-smi because the eBPF program acts on any process in that container's cgroup.
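
The one-line check generalizes to every NVIDIA device node. The sketch below opens each node and maps the errno to the likely cause. The EPERM mapping follows the explanation in this section; the other labels are general POSIX meanings, and the function names are illustrative.

```python
import errno
import glob
import os
from typing import Optional


def classify_open_error(err: Optional[int]) -> str:
    """Map the errno from opening a device node to a likely cause."""
    if err is None:
        return "ok: device opened"
    if err == errno.EPERM:
        return "EPERM: blocked by the cgroup device filter"
    if err == errno.ENOENT:
        return "ENOENT: device node missing"
    if err == errno.EACCES:
        return "EACCES: file permissions deny access"
    return f"other error: {errno.errorcode.get(err, err)}"


def probe(path: str) -> str:
    """Try to open one device node read-write and classify the result."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        return classify_open_error(exc.errno)
    os.close(fd)
    return classify_open_error(None)


def probe_nvidia_devices() -> dict:
    """Probe every /dev/nvidia* entry that is not a directory."""
    paths = [p for p in sorted(glob.glob("/dev/nvidia*")) if not os.path.isdir(p)]
    return {path: probe(path) for path in paths}
```

Running probe_nvidia_devices() inside the workspace container returns one line of diagnosis per device. An empty dict means no device nodes exist at all, which points at the runtime class or the device plugin rather than the cgroup filter.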

The fix is one line in /etc/containerd/config.toml:

TOML
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  privileged_without_host_devices = true

After updating the file, containerd must be restarted. If you cannot SSH directly to the node, one workable approach is a temporary privileged Job that mounts the host root. Use an image that is already cached on the node and avoid pulling from a registry:

YAML
apiVersion: batch/v1
kind: Job
metadata:
  name: containerd-restart
  namespace: kube-system
spec:
  ttlSecondsAfterFinished: 60
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: "your-node"
      hostPID: true
      hostNetwork: true
      containers:
        - name: restart
          image: your-cached-image
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          command:
            - /bin/bash
            - -c
            - |
              chroot /host systemctl restart containerd
              sleep 5
              chroot /host systemctl is-active containerd
          volumeMounts:
            - name: host-root
              mountPath: /host
      volumes:
        - name: host-root
          hostPath:
            path: /

Stop the workspace with devpod stop before applying the config change, and recreate the Pod after containerd restarts. The PVC is not touched by any of these steps.

Common failure modes

| Symptom | Cause | Fix |
|---|---|---|
| Pod enters Dead or Failed state | OOM, node issues, or a bad manifest | Run devpod stop, fix the manifest, then run devpod up again. The PVC stays intact. |
| SSH exits with code 255 | The Pod is not ready yet, or the SSH tunnel dropped | Check Pod state and retry after it reaches Running. If server installation was interrupted, rerun the installation step manually. |
| rsync reports Broken pipe | Progress output flooded the SSH channel | Use rsync -az without --progress or --info=progress2. |
| add-apt-repository fails with No module named 'apt_pkg' | The default Python was switched before repository setup | Add all PPAs before calling update-alternatives. |
| IDE extensions disappear after a Pod rebuild | The reinstall script removed the extensions directory | Delete only the cli/ subtree and keep extensions/. |
| nvidia-smi: command not found | A transitional dummy package was installed | Install nvidia-utils-xxx-server from ppa:graphics-drivers/ppa. |
| NVML Unknown Error | AppArmor, runtime class, device injection, environment override, or cgroup eBPF block from privileged_without_host_devices = false | Try opening /dev/nvidiactl in Python. EPERM means a cgroup block: set privileged_without_host_devices = true in the containerd config and restart containerd. Otherwise debug in order: AppArmor, device nodes, runtimeClassName, then environment variables. |
| /dev/nvidia* does not exist | Missing runtimeClassName: nvidia or a broken device plugin | Confirm the RuntimeClass exists and the device plugin DaemonSet is healthy. |