Step 2.1 Primary and Target Storage Selection
Primary Storage
- For Kubernetes platforms that offer both Container Storage Interface (CSI) and in-tree storage provisioner options, always choose the CSI provisioner. CSI drivers represent the future of storage operations in Kubernetes: the legacy in-tree provisioners have already been deprecated and will be removed in future Kubernetes releases, and Persistent Volume Claims (PVCs) provisioned by the legacy in-tree provisioners are being migrated to CSI drivers.
- Verify that any CSI storage provisioner supports VolumeSnapshots. This may be indicated in the documentation from the storage provider, but it can also be validated using K10 Tools. Kasten strongly recommends using a CSI storage provisioner that supports VolumeSnapshots as a prerequisite.
- Verify, using storage provisioner documentation, any limitations on the total number of snapshots per persistent volume claim, as this may impact policy retention settings.
- Ensure the VolumeSnapshotClass has the required K10 annotation to leverage CSI Volume Snapshots. Check K10 documentation for additional information on Storage Integration and the required configuration.
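For reference, a minimal VolumeSnapshotClass carrying the K10 annotation might look like the sketch below; the class name and driver are placeholders for your environment:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass                         # placeholder name
  annotations:
    k10.kasten.io/is-snapshot-class: "true"   # the annotation K10 looks for
driver: csi.example.com                       # placeholder CSI driver name
deletionPolicy: Retain
```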
Target/Export Storage
- When choosing a target for backup exports (also known as Location Profiles), it is advisable to prioritize object storage over NFS. Object storage solutions (S3, S3-compatible, Azure Blob, and Google Cloud Storage) provide multiple benefits over NFS: they are designed to work across multiple datacenters under a single namespace, can be made more durable than NFS by distributing copies to multiple locations, and are more scalable and easier to manage. Additionally, Azure, S3, and S3-compatible solutions can be configured to mitigate ransomware attacks by enabling object locking and versioning to store immutable copies of your backup data.
- Specific to VMware Tanzu clusters, K10 supports Changed Block Tracking (CBT) to efficiently back up Persistent Volumes. This feature significantly improves performance when backing up large PVCs on VMware. For additional details, refer to the Block Mode Export section. Note that enabling CBT in the K10 policy requires a Tanzu Advanced license. Refer to the URL for additional details.
Step 2.2 Business Continuity and DR (Disaster Recovery) Planning
Prior to deploying Kasten K10 to back up Kubernetes workloads, we need to understand our overall approach to Business Continuity Planning (BCP) and Disaster Recovery (DR). While having a robust backup tool for systems, infrastructure, and data is important, organizations also need to assess their operational readiness, approaches, responses, and logistics in the event of disasters, ransomware attacks, or other significant incidents.
Business Continuity Planning (BCP) and Disaster Recovery (DR) are closely related and often considered as complementary processes within an organization's overall resilience strategy. Although they have distinct areas of focus, they work together to ensure the organization's ability to withstand and recover from disruptive events.
While this guide is not comprehensive, it outlines various essential categories for consideration:
Business Continuity Planning
- Operational Readiness: If a datacenter, remote office, or callcenter is rendered incapacitated, organizations need an operational plan to compensate for the loss of the site and minimize the impact on the business. Some examples may include:
- Designated individuals from different departments report to a different physical site.
- Office workers have secure remote access to business systems, data, and communication infrastructure, and this access is regularly tested and verified.
- System Fallbacks and Workarounds: If a key business system (e.g., ERP, ordering and fulfilment system, Customer Relationship Management) goes offline, are secondary or fail-back processes available to compensate or “keep the lights on” for the business while systems and data are restored? For some organizations, this may be as simple as paper-based or manual systems.
- Categorization and Prioritization of Business Systems and Data: Recognizing that not all data and systems are of equal importance, cross-functional teams should collaborate to rank and prioritize systems, data, and infrastructure. This ensures that, in the event of a disaster or outage, a documented order and prioritization of recovery actions is defined.
- Paper-based and Real-world Exercises: Having lists, plans, and workarounds defined is important, but if they are not regularly tested, or fail when actually implemented, the impact on the business is effectively the same as a complete outage. Organizations should plan to regularly audit, test, and update their Business Continuity and Disaster Recovery Plans to ensure they remain effective and relevant to the enterprise. This can take the form of paper-based reviews or, ideally, a simulated disaster event conducted semi-annually or annually.
Disaster Recovery
- Definition of Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each system, application, and/or their relevant subcomponents: In conjunction with categorization and prioritization mentioned above, RTO and RPO definitions help organizations plan their infrastructure, backup policies, and overall approach to ensure the most critical systems and/or data have the lowest RTO and RPO (lower values are better, but typically more costly). While both are time measures, they address two distinct aspects:
- RPO is measured before a disaster. It represents the maximum window of time preceding the disaster during which transactions may be lost and unrecoverable.
- RTO is measured after the disaster. It represents the time it will take to make your service available for new transactions.
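To make the distinction concrete, the sketch below (with hypothetical timestamps) shows how the gap between the last successful backup and a disaster determines the data loss window that RPO must bound:

```shell
# Hypothetical epoch timestamps (seconds)
last_backup_epoch=1704096000   # last successful backup
disaster_epoch=1704108600      # moment of the disaster

# Everything written after the last backup is lost; this window is the RPO exposure
loss_min=$(( (disaster_epoch - last_backup_epoch) / 60 ))
echo "Data written in the last ${loss_min} minutes is unrecoverable"
```

A policy that backs up every 4 hours therefore implies a worst-case RPO of 4 hours; RTO, by contrast, starts counting at the disaster and ends when the restored service accepts new transactions.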
The 3-2-1 Rule
Once we have defined all the numbers required for the sizing calculations, we need to take a look at the infrastructure estate. It is a best practice to follow the 3-2-1 rule:
- Have at least three copies of your data.
- Store the copies on two different media.
- Keep one backup copy off-site.
Kasten K10 by Veeam can help you fulfil all 3-2-1 backup rule requirements:
- Have at least three copies of data:
It is recommended to always have three copies of data. These can be the original data on the Persistent Volume Claim, the local snapshot, and the snapshot exported to an external location.
- Store the copies on two different media:
Kasten K10 is storage- and cloud-agnostic, supporting multiple types of storage infrastructure, including block, file, and object storage. For example, storing your data on a PVC (Persistent Volume Claim) and on external S3 counts as two different media. Having two copies of data, one on the PVC and the other as a local snapshot on the Kubernetes cluster, does not count as two different media: in the event of a disaster that leaves the cluster unrecoverable, both copies are lost.
- Keep one backup copy off-site:
Set up backup copy jobs to transfer your backups off-site to another location (e.g., a public cloud provider or secondary storage in a separate site). Exporting the backup to a location profile using NFS file storage is not considered off-site.
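The rule can be expressed as a simple sanity check per protected application; the counts below are placeholders to fill in from your own environment:

```shell
# Placeholder counts for one protected application
copies=3    # original PVC data + local snapshot + exported backup
media=2     # e.g., block storage (PVC) + object storage (S3)
offsite=1   # e.g., exported copy in a public cloud region

# 3-2-1: at least 3 copies, on 2 different media, 1 of them off-site
if [ "$copies" -ge 3 ] && [ "$media" -ge 2 ] && [ "$offsite" -ge 1 ]; then
  echo "3-2-1 satisfied"
else
  echo "3-2-1 NOT satisfied"
fi
```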
Backup and Recovery Testing
Backups are only useful if they are recoverable. Simply targeting workloads and data for backup is not enough to ensure that organizations can withstand a disaster event, ransomware attack, or accidental deletion. Similar to the Business Continuity testing described above, organizations should regularly perform recovery tests on their backups to verify that they are intact and restorable. Such tests help organizations ensure that they meet their RTO and RPO targets.
Disaster Recovery Testing
In line with testing backup and recovery, organizations should aim to conduct a full Disaster Recovery test. Ideally, organizations should have an end-to-end documented DR plan in place, which includes people, processes, and systems documented ahead of time. This plan can serve as the "single source of truth" during both test exercises and real-world events. Organizations that have the most robust and/or mature Disaster Recovery plans and infrastructure will regularly run through partial or full DR tests in production environments (e.g., fail over from Data Center A to Data Center B, operate out of Data Center B for a set time, before falling back to Data Center A).
Object Storage Immutability
Kasten K10 allows you to prohibit the deletion of data in object storage repositories by making the data temporarily immutable. This is done to enhance security and protect your data from loss due to attacks, malware activity (e.g., ransomware), or other actions.
Step 2.3 Preparing for Air-gapped Installation of K10 (Optional)
If an air-gapped installation is required, it is possible to use your own private container registry to install K10. While this can always be done manually, the `k10offline` tool makes it easier to automate the process.
Fetching the Helm Chart for Local Use
To fetch the most recent K10 Helm chart for local use, run the following command to pull the latest K10 chart as a compressed tarball (.tgz) file into the working directory.
helm repo update && \
helm fetch kasten/k10
If you need to fetch a specific version, please run the following command:
helm repo update && \
helm fetch kasten/k10 --version=<k10-version>
Preparing K10 Container Images for Air-Gapped Use
There are multiple ways to use a private repository including setting up a caching or proxy image registry that points to the Kasten K10 image repositories using tools such as JFrog Artifactory. However, if images need to be manually uploaded or an automated upload pipeline is required to add K10 images into your private repository, the following documentation should help.
The following command lists all images used by the current K10 version. This can be helpful if there is a requirement to manually tag and push K10 images into your private repository instead of using the Kasten-provided tool documented below.
docker run --rm -it gcr.io/kasten-images/k10offline:6.5.12 list-images
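If you do take the manual route, each image reference from that list needs its registry prefix rewritten before it is tagged and pushed. A sketch using two hypothetical entries (the target path layout is an assumption):

```shell
# Two hypothetical entries from the list-images output
images="gcr.io/kasten-images/catalog:6.5.12
gcr.io/kasten-images/crypto:6.5.12"

# Rewrite the registry prefix for the private registry (target path is an assumption)
echo "$images" | sed 's|^gcr.io/kasten-images|repo.example.com/kasten|'
```

Each rewritten reference would then be used with docker tag and docker push.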
Finally, to completely automate the download and re-upload of K10 container images, the following command will pull all K10 images into your local repository, re-tag them for a repository located at repo.example.com, and push them to the specified registry.
docker run --rm -ti -v /var/run/docker.sock:/var/run/docker.sock \
-v ${HOME}/.docker:/root/.docker \
gcr.io/kasten-images/k10offline:6.5.12 pull images --newrepo repo.example.com
Note that the k10offline tool will use your local Docker config if the private registry requires authentication. If your local Docker config does not have the credentials stored, you may need to log in manually from within the k10offline container shell. You can execute the commands below to access the k10offline container shell.
Attach to the k10offline container
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
-it --entrypoint /bin/sh gcr.io/kasten-images/k10offline:6.5.12
Manually do a docker login
docker login repo.example.com
Once logged in, push the images to the external repository.
/k10offline pull images --newrepo repo.example.com
Providing Credentials if Local Container Repository is Private
If the external registry that you are using is private, credentials for that repository can be provided with the helm install command using the secrets.dockerConfig and global.imagePullSecret flags, as shown below.
--set secrets.dockerConfig=$(base64 -w 0 < ${HOME}/.docker/config.json) \
--set global.imagePullSecret="k10-ecr"
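To illustrate what the first flag carries, the sketch below builds a dummy Docker config (the registry and auth value are placeholders) and encodes it the same way:

```shell
# Dummy Docker config with placeholder credentials
cat > /tmp/docker-config.json <<'EOF'
{"auths":{"repo.example.com":{"auth":"dXNlcjpwYXNz"}}}
EOF

# This is the value that would be passed to --set secrets.dockerConfig
encoded=$(base64 -w 0 < /tmp/docker-config.json)

# Round-trip to confirm the encoding is lossless
echo "$encoded" | base64 -d
```

The decoded output matches the original file, confirming that the value passed to helm is simply the base64 encoding of your Docker config.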
If you already have a custom secret containing the Docker config used to connect to your private registry, the global.imagePullSecret flag can be used to reference the name of that secret:
--set global.imagePullSecret="<custom-docker-config-secret-name>"
Installing K10 with Local Helm Chart and Container Images
If the K10 container images were uploaded to a registry at repo.example.com, an air-gapped installation can be performed by setting `global.airgapped.repository=repo.example.com`, as shown in the command below:
helm install k10 k10-6.5.12.tgz --namespace kasten-io --create-namespace \
--set global.airgapped.repository=repo.example.com \
--set secrets.dockerConfig=$(base64 -w 0 < ${HOME}/.docker/config.json) \
--set global.imagePullSecret="k10-ecr" --set metering.mode=airgap
Step 2.4 Security Requirements
Veeam Kasten for Kubernetes requires additional privileges to efficiently back up and restore applications due to the nature of backup, recovery, and migration operations. This article describes, and provides the motivation for, all the privileges required by Veeam Kasten for Kubernetes.
Permissions Requirements
Veeam Kasten for Kubernetes requires the following capabilities for both the Veeam Kasten installation namespace (default: kasten-io) and the target application's namespace:
- DAC_OVERRIDE: Allows reading data on a volume regardless of the permissions set. Veeam Kasten needs this capability to read all the data from the volume.
- FOWNER: Allows changing permissions (chmod) of files and directories that the process does not own. This capability allows Veeam Kasten to correctly restore access permissions for files and directories following the restore process.
- CHOWN: Allows changing the owner (chown) of files and directories. This capability allows Veeam Kasten to correctly restore ownership of files and directories following the restore process.
See Linux Capabilities for a detailed description of the above capability requirements.
runAsUser, runAsGroup
Veeam Kasten runs pods with UID = 1000 and GID = 1000, which need to be permitted by the security policies.
Additionally, it might be required to allow the default Grafana and Prometheus UID/GID values.
Note: If the StorageSecurityContext is used, userId and groupId fields should be permitted to be used as values for runAsUser(userId) and runAsGroup(groupId) fields by the security policies. In addition, groupId and supplementalGroup should be permitted as fsGroup values.
fsGroup
The value 1000 for the fsGroup parameter should be allowed by security policies. During the restore phase, Veeam Kasten creates a volume for restoring data and sets fsGroup = 1000 in the internal restore-data-* pod's securityContext so that data can be written to that volume.
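Taken together, security policies must permit a pod spec along these lines; the snippet is illustrative only (the container name and image are placeholders), not a manifest to apply:

```yaml
spec:
  securityContext:            # pod-level settings Veeam Kasten uses
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - name: example           # placeholder
      image: example:latest   # placeholder
      securityContext:        # container-level capability requests
        capabilities:
          add: ["DAC_OVERRIDE", "FOWNER", "CHOWN"]
```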
NFS Location Profile
If the NFS location profile is used in rootless mode, the security policies must allow the supplementalGroup used by the profile.
See NFS Location Profile for details.
Get ready for Step 3 of your onboarding journey. At the next step, we will explore installing Veeam Kasten for Kubernetes and learn how to integrate an authentication mechanism.
If you need more help getting started, you can post your question in the comments section below or contact us at veeam.university@veeam.com any time and someone from the Customer Success team will be there to assist you.