Since we are transitioning to external cloud provider, we need a way
to use the existing cinder volume plugin (from kubelet). With external
cloud manager kubelet will be run with --cloud=provider=external and
no --cloud-config file will be in the command line. So we need a way
to load the openstack config file from somewhere.
Taking a cue from kubeadm, which currently is picking up "/etc/kubernetes/cloud-config"
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/controlplane/manifests.go#L44
let's support the scenario where we fall back to this static location if
there is no cloud provider specified in the command line.
This has been tested with local-up-cluster using the following params:
EXTERNAL_CLOUD_PROVIDER=true
CLOUD_PROVIDER=openstack
CLOUD_CONFIG=/etc/kubernetes/cloud-config
Automatic merge from submit-queue (batch tested with PRs 51235, 50819, 51274, 50972, 50504)
Clean cinder detachlogerror
**What this PR does / why we need it**:
**Which issue this PR fixes** : fixes#50441
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 45345, 49470, 49407, 49448, 49486)
Support "fstype" parameter in dynamically provisioned PVs
This PR is a replacement for https://github.com/kubernetes/kubernetes/pull/40805. I was not able to push fixes and rebases to the original branch as I don't have access to the Github organization anymore.
I assume the PR will need a new "ok to test"
**ORIGINAL PR DESCRIPTION**
**What this PR does / why we need it**: This PR allows specifying the desired FSType when dynamically provisioning volumes with storage classes. The FSType can now be set as a parameter:
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
name: test
provisioner: kubernetes.io/azure-disk
parameters:
fstype: xfs
```
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#37801
**Special notes for your reviewer**:
The PR also implicitly adds checks for unsupported parameters.
**Release note**:
```release-note
Support specifying of FSType in StorageClass
```
When volume's status is 'attaching', its attachments will be None,
controllermanager can't get device path and make some failed event.
But it is normal, let's fix it.
Automatic merge from submit-queue
Statefulsets for cinder: allow multi-AZ deployments, spread pods across zones
**What this PR does / why we need it**: Currently if we do not specify availability zone in cinder storageclass, the cinder is provisioned to zone called nova. However, like mentioned in issue, we have situation that we want spread statefulset across 3 different zones. Currently this is not possible with statefulsets and cinder storageclass. In this new solution, if we leave it empty the algorithm will choose the zone for the cinder drive similar style like in aws and gce storageclass solutions.
**Which issue this PR fixes** fixes#44735
**Special notes for your reviewer**:
example:
```
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
name: all
provisioner: kubernetes.io/cinder
---
apiVersion: v1
kind: Service
metadata:
annotations:
service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
name: galera
labels:
app: mysql
spec:
ports:
- port: 3306
name: mysql
clusterIP: None
selector:
app: mysql
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: mysql
spec:
serviceName: "galera"
replicas: 3
template:
metadata:
labels:
app: mysql
annotations:
pod.alpha.kubernetes.io/initialized: "true"
spec:
containers:
- name: mysql
image: adfinissygroup/k8s-mariadb-galera-centos:v002
imagePullPolicy: Always
ports:
- containerPort: 3306
name: mysql
- containerPort: 4444
name: sst
- containerPort: 4567
name: replication
- containerPort: 4568
name: ist
volumeMounts:
- name: storage
mountPath: /data
readinessProbe:
exec:
command:
- /usr/share/container-scripts/mysql/readiness-probe.sh
initialDelaySeconds: 15
timeoutSeconds: 5
env:
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
volumeClaimTemplates:
- metadata:
name: storage
annotations:
volume.beta.kubernetes.io/storage-class: all
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 12Gi
```
If this example is deployed it will automatically create one replica per AZ. This helps us a lot making HA databases.
Current storageclass for cinder is not perfect in case of statefulsets. Lets assume that cinder storageclass is defined to be in zone called nova, but because labels are not added to pv - pods can be started in any zone. The problem is that at least in our openstack it is not possible to use cinder drive located in zone x from zone y. However, should we have possibility to choose between cross-zone cinder mounts or not? Imo it is not good way of doing things that they mount volume from another zone where the pod is located(means more network traffic between zones)? What you think? Current new solution does not allow that anymore (should we have possibility to allow it? it means removing the labels from pv).
There might be some things that needs to be fixed still in this release and I need help for that. Some parts of the code is not perfect.
Issues what i am thinking about (I need some help for these):
1) Can everybody see in openstack what AZ their servers are? Can there be like access policy that do not show that? If AZ is not found from server specs, I have no idea how the code behaves.
2) In GetAllZones() function, is it really needed to make new serviceclient using openstack.NewComputeV2 or could I somehow use existing one
3) This fetches all servers from some openstack tenant(project). However, in some cases kubernetes is maybe deployed only to specific zone. If kube servers are located for instance in zone 1, and then there are another servers in same tenant in zone 2. There might be usecase that cinder drive is provisioned to zone-2 but it cannot start pod, because kubernetes does not have any nodes in zone-2. Could we have better way to fetch kubernetes nodes zones? Currently that information is not added to kubernetes node labels automatically in openstack (which should I think). I have added those labels manually to nodes. If that zone information is not added to nodes, the new solution does not start stateful pods at all, because it cannot target pods.
cc @rootfs @anguslees @jsafrane
```release-note
Default behaviour in cinder storageclass is changed. If availability is not specified, the zone is chosen by algorithm. It makes possible to spread stateful pods across many zones.
```
This implements Bulk volume polling using ideas presented by
justin in https://github.com/kubernetes/kubernetes/pull/39564
But it changes the implementation to use an interface
and doesn't affect other implementations.
This PR is to fix the issue in converting aws volume id from mount
paths. Currently there are three aws volume id formats supported. The
following lists example of those three formats and their corresponding
global mount paths:
1. aws:///vol-123456
(/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/vol-123456)
2. aws://us-east-1/vol-123456
(/var/lib/kubelet/plugins/kubernetes.io/mounts/aws/us-est-1/vol-123455)
3. vol-123456
(/var/lib/kubelet/plugins/kubernetes.io/mounts/aws/us-est-1/vol-123455)
For the first two cases, we need to check the mount path and convert
them back to the original format.
See issue #33128
We can't rely on the device name provided by Cinder, and thus must perform
detection based on the drive serial number (aka It's cinder ID) on the
kubelet itself.
This patch re-works the cinder volume attacher to ignore the supplied
deviceName, and instead defer to the pre-existing GetDevicePath method to
discover the device path based on it's serial number and /dev/disk/by-id
mapping.
This new behavior is controller by a config option, as falling back
to the cinder value when we can't discover a device would risk devices
not showing up, falling back to cinder's guess, and detecting the wrong
disk as attached.
At master volume reconciler, the information about which volumes are
attached to nodes is cached in actual state of world. However, this
information might be out of date in case that node is terminated (volume
is detached automatically). In this situation, reconciler assume volume
is still attached and will not issue attach operation when node comes
back. Pods created on those nodes will fail to mount.
This PR adds the logic to periodically sync up the truth for attached volumes kept in the actual state cache. If the volume is no longer attached to the node, the actual state will be updated to reflect the truth. In turn, reconciler will take actions if needed.
To avoid issuing many concurrent operations on cloud provider, this PR
tries to add batch operation to check whether a list of volumes are
attached to the node instead of one request per volume.
More details are explained in PR #33760
Gluster provisioner is interested in pvc.Namespace and I don't want to add
at as a new field in VolumeOptions - it would contain almost whole PVC.
Let's pass direct reference to PVC instead and let the provisioner to pick
information it is interested in.
Currently kubelet volume management works on the concept of desired
and actual world of states. The volume manager periodically compares the
two worlds and perform volume mount/unmount and/or attach/detach
operations. When kubelet restarts, the cache of those two worlds are
gone. Although desired world can be recovered through apiserver, actual
world can not be recovered which may cause some volumes cannot be cleaned
up if their information is deleted by apiserver. This change adds the
reconstruction of the actual world by reading the pod directories from
disk. The reconstructed volume information is added to both desired
world and actual world if it cannot be found in either world. The rest
logic would be as same as before, desired world populator may clean up
the volume entry if it is no longer in apiserver, and then volume
manager should invoke unmount to clean it up.
This commit adds a new volume manager in kubelet that synchronizes
volume mount/unmount (and attach/detach, if attach/detach controller
is not enabled).
This eliminates the race conditions between the pod creation loop
and the orphaned volumes loops. It also removes the unmount/detach
from the `syncPod()` path so volume clean up never blocks the
`syncPod` loop.
This is a first-aid bandage to let admission controller ignore persistent
volumes that are being provisioned right now and thus may not exist in
external cloud infrastructure yet.
Add a mutex to guard SetUpAt() and TearDownAt() calls - they should not
run in parallel. There is a race in these calls when there are two pods
using the same volume, one of them is dying and the other one starting.
TearDownAt() checks that a volume is not needed by any pods and detaches the
volume. It does so by counting how many times is the volume mounted
(GetMountRefs() call below).
When SetUpAt() of the starting pod already attached the volume and did not mount
it yet, TearDownAt() of the dying pod will detach it - GetMountRefs() does not
count with this volume.
These two threads run in parallel:
dying pod.TearDownAt("myVolume") starting pod.SetUpAt("myVolume")
| |
| AttachDisk("myVolume")
refs, err := mount.GetMountRefs() |
Unmount("myDir") |
if refs == 1 { |
| | Mount("myVolume", "myDir")
| | |
| DetachDisk("myVolume") |
| start containers - OOPS! The volume is detached!
|
finish the pod cleanup
Also, add some logs to cinder plugin for easier debugging in the future, add
a test and update the fake mounter to know about bind mounts.