Kubernetes Migration
Migrating a Kubernetes cluster from one cloud provider to another usually breaks into three separate problems: moving Kubernetes resources, moving the data attached to workloads, and moving the container images those workloads depend on.
- Kubernetes resource migration
- Persistent volume migration
- Container image migration
Kubernetes resources and persistent volumes can be handled with Velero. Image registry migration is simpler in most cases. Common open source options include Alibaba Cloud's image-syncer and Tencent Cloud's image-transfer.
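As a sketch of how such a tool is driven, image-syncer reads a JSON file that maps source images to target repositories (the registry hosts and repository names below are hypothetical; consult the image-syncer repository for the exact schema):

```json
{
  "registry.cn-hangzhou.aliyuncs.com/demo/nginx:1.7.9": "ccr.ccs.tencentyun.com/demo/nginx"
}
```

Registry credentials are supplied in a separate auth file, so the mapping file stays safe to commit.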
Velero is an open source backup and restore system built for Kubernetes. A common cross-cloud migration pattern is to back up the source cluster and restore that backup into the target cluster.
Velero consists of two parts:
- Server-side components running inside the Kubernetes clusters being backed up or restored
- A CLI client
The server side is a collection of controllers that watch Velero custom resources for backup and restore operations. The CLI mostly saves you from writing those custom resources by hand.
Compared with the version we reviewed in the earlier note on Kubernetes failure detection and self-healing, Velero has added several capabilities that matter in real migrations:
- ReadWriteMany volumes are no longer backed up repeatedly.
- Cloud provider plugins have been split out from the core Velero repository.
- Restic-based persistent volume backups are always incremental, even when Pods move.
- Namespace cloning can automatically clone the related persistent volumes.
- CSI-backed persistent volumes are supported, including the mainstream AWS, Azure, and GCP cases.
- Backup and restore progress reporting is supported.
- Velero can back up all API versions of a resource.
- Volume backup through Restic can be enabled by default with --default-volumes-to-restic.
- restoreStatus can be used to control which resource status fields are restored.
- --existing-resource-policy can change restore behavior when a resource already exists. The default is to skip existing resources, except for ServiceAccounts. Setting it to update makes Velero update existing resources instead.
- Since 1.10, Velero supports Kopia as an alternative to Restic. Kopia often performs better on large backup sets or very large file counts.
Velero supports both on-demand and scheduled backups. In both cases it collects Kubernetes resources, applies filters if requested, packages the result, and uploads it to an object storage backend.
A typical backup flow looks like this:
- The user runs velero backup create, which creates a Backup resource.
- BackupController sees the new Backup resource and validates it.
- If validation succeeds, the controller runs the backup. By default, Velero creates snapshots for all persistent volumes. Use --snapshot-volumes=false to change that behavior.
- The controller uploads the backup data to object storage.
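As a concrete sketch of the on-demand flow above (the backup and namespace names are examples):

```shell
# Back up one namespace; by default Velero also snapshots its persistent volumes
velero backup create nginx-backup --include-namespaces nginx-example

# The same backup without volume snapshots
velero backup create nginx-backup-nosnap \
  --include-namespaces nginx-example \
  --snapshot-volumes=false

# Inspect status and progress
velero backup describe nginx-backup --details
```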
When Velero backs up resources, it stores them using the preferred API version. If the source API server exposes two versions of a group, for example teleport/v1alpha1 and teleport/v1, and v1 is the preferred version, the backup stores the resource in v1 form. The target cluster does not have to prefer that version, but it must support it. That is one reason restore can fail across clusters with different Kubernetes or CRD versions.
Backups can have a retention period through --ttl. When that retention window expires, Velero deletes the Kubernetes backup records, the backup files, the snapshots, and the related Restore objects. If garbage collection fails, Velero adds a velero.io/gc-failure=REASON label to the Backup object.
There is one important caveat for cross-cloud migration: snapshot-based volume backup is not enough. A snapshot created on cloud A is not something you can usually restore directly on cloud B.
Restore takes a previous backup, including Kubernetes resources and volume data, and replays it into the target cluster. The target cluster can be the source cluster itself, and the restore can be filtered so only part of the backup is restored.
Restored Kubernetes resources receive the label velero.io/restore-name=RESTORE_NAME. By default, the restore name is BACKUP_NAME-TIMESTAMP, where the timestamp format is YYYYMMDDhhmmss.
A typical restore flow looks like this:
- The user runs velero restore create, which creates a Restore resource.
- RestoreController sees the Restore object and validates it.
- If validation succeeds, the controller reads the backup metadata from object storage and performs prechecks, including API version checks, to see whether the resources can run on the new cluster.
- The controller restores resources one by one.
By default, Velero does not delete or overwrite existing objects in the target cluster. If a resource already exists, Velero skips it. Setting --existing-resource-policy=update tells Velero to try to update matching existing resources instead.
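A minimal restore invocation, including the update policy, might look like this (backup, restore, and namespace names are examples):

```shell
# Restore everything from the backup; existing resources are skipped by default
velero restore create --from-backup nginx-backup

# Restore one namespace and update resources that already exist
velero restore create nginx-restore \
  --from-backup nginx-backup \
  --include-namespaces nginx-example \
  --existing-resource-policy=update

# Restored objects carry the velero.io/restore-name label
kubectl get deploy -n nginx-example -l velero.io/restore-name=nginx-restore
```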
The object storage backend is Velero's single source of truth. That has two practical consequences:
- If object storage contains backup data but the Kubernetes API does not contain the matching Backup resource, Velero recreates the Backup object.
- If Kubernetes contains a Backup resource but object storage does not contain the matching backup data, Velero deletes the Backup object.
This is also why cross-cloud migration works at all. The source and target clusters do not need to talk directly to each other. Object storage becomes the only shared medium.
The CRD that defines where backup metadata is stored is BackupStorageLocation. It points to a bucket or a prefix inside a bucket. Velero stores backup metadata there, and file-system-based volume backups through Restic or Kopia also live there. Snapshot-based volume backups do not live in that bucket, because the snapshot implementation is controlled by the cloud provider.
Each Backup can use one BackupStorageLocation.
Snapshot-related information is stored in VolumeSnapshotLocation. The actual fields depend on the cloud plugin, because snapshot implementation is provider-specific.
Each Backup can use one VolumeSnapshotLocation per volume snapshot provider.
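Both CRDs can also be created declaratively. A sketch for an AWS-style provider (bucket, region, and resource names are examples):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: backups-primary
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backups
  config:
    region: us-east-1
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: snapshots-primary
  namespace: velero
spec:
  provider: aws
  config:
    region: us-east-1
```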
Velero uses a plugin model that keeps storage and cloud provider integrations outside the core project.
Velero also exposes hooks around the standard backup and restore flow.
Backup hooks run during backup. One standard use is telling a database to flush in-memory buffers before a snapshot or file backup starts.
Restore hooks run during restore. They are often used for initialization steps that need to happen before the application starts normally.
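Backup hooks are commonly declared as Pod annotations. A sketch of the database-flush case (the container name, image, and command are examples, not a recommended MySQL backup procedure):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql
  annotations:
    # Velero runs this command in the named container before backing up the Pod's volumes
    pre.hook.backup.velero.io/container: mysql
    pre.hook.backup.velero.io/command: '["/bin/sh", "-c", "mysql -e \"FLUSH TABLES WITH READ LOCK\""]'
spec:
  containers:
  - name: mysql
    image: mysql:8.0
```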

To install the Velero CLI, download the release archive, extract it, and place the velero binary on $PATH. To enable shell completion:

```shell
echo 'source <(velero completion bash)' >> ~/.bashrc
```
Client-side configuration can be adjusted like this:
```shell
# Enable client features
velero client config set features=EnableCSI

# Disable color output
velero client config set colorized=false
```
The CLI can also install the server components:
```shell
# Resource request/limit values below are illustrative; tune them for your workload
velero install \
  --namespace=teleport-system \
  --use-node-agent \
  --default-volumes-to-fs-backup \
  --features=EnableCSI,EnableAPIGroupVersions \
  --velero-pod-cpu-request=500m \
  --velero-pod-mem-request=128Mi \
  --velero-pod-cpu-limit=1000m \
  --velero-pod-mem-limit=512Mi \
  --node-agent-pod-cpu-request=500m \
  --node-agent-pod-mem-request=512Mi \
  --node-agent-pod-cpu-limit=1000m \
  --node-agent-pod-mem-limit=1Gi \
  --provider aws \
  --bucket backups \
  --secret-file ./aws-iam-creds \
  --backup-location-config region=us-east-2 \
  --snapshot-location-config region=us-east-2 \
  --no-default-backup-location \
  --dry-run -o yaml
```
Several flags in that example matter in migration scenarios:
- --use-node-agent enables file-system-based backup support.
- --default-volumes-to-fs-backup makes file-system backup the default for Pod volumes. Without it, volumes normally have to be selected through annotations.
- --features=EnableCSI,EnableAPIGroupVersions turns on feature gates that matter in newer storage and API-version scenarios.
- Resource request and limit flags often need adjustment when file-system backup is used heavily.
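Without --default-volumes-to-fs-backup, Pod volumes are opted in to file-system backup one Pod at a time through an annotation (the namespace, Pod, and volume names below are examples):

```shell
# Opt the volumes "data" and "logs" of Pod "web-0" into file-system backup
kubectl -n nginx-example annotate pod/web-0 \
  backup.velero.io/backup-volumes=data,logs
```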
After installation, you can configure default backup and snapshot locations:
```shell
velero backup-location create backups-primary \
  --provider aws \
  --bucket velero-backups \
  --config region=us-east-1 \
  --default

# Default volume snapshot locations are configured on the server
velero server --default-volume-snapshot-locations="PROVIDER-NAME:LOCATION-NAME,PROVIDER2-NAME:LOCATION2-NAME"
```
You can also add extra snapshot providers after the initial install:
```shell
velero plugin add registry/image:version

velero snapshot-location create NAME \
  --provider PROVIDER-NAME \
  [--config PROVIDER-CONFIG]
```
One practical test setup is to create one Kubernetes cluster on Alibaba Cloud as the source cluster and another on Tencent Cloud as the target cluster, then use Velero to move workloads across them.
Create the clusters through the two cloud consoles. The exact steps depend on the providers and are not the point here.
The most common Kubernetes use case is still stateless workloads. Stateful infrastructure such as databases is often delegated to cloud PaaS products instead of being hosted inside the cluster. That reality makes Kubernetes migration much easier, because volume migration often drops out of scope.
A simple test case is an Nginx Deployment plus a Service. Start by creating the resources on the source cluster:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:1.7.9
        name: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
```
For this kind of workload, the migration path is fairly direct: back up the namespace or selected resources from the source cluster, restore them into the target cluster, and then verify that the restored Deployment, Service, and related objects match expectations. The harder cases show up later, when CRDs, API version skew, storage classes, and cloud-specific integrations enter the picture.
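Under that setup, the migration reduces to a handful of commands, since both clusters point at the same object storage bucket (the backup name and namespace are examples):

```shell
# On the source cluster
velero backup create nginx-migration --include-namespaces default

# On the target cluster, once the backup appears via the shared bucket
velero backup get
velero restore create --from-backup nginx-migration

# Verify the restored workload
kubectl get deploy,svc nginx
```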