The NFS plugin should not use IsLikelyNotMountPoint(), as that uses lstat()/stat()
to determine whether the NFS volume is still mounted - the NFS server may use
root_squash, and kubelet may not be allowed to lstat()/stat() there.
It must use the slower IsNotMountPoint() instead, including in the TearDown() function.
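A minimal Go sketch of the distinction (interface and function names here are illustrative, not the actual plugin code): IsNotMountPoint consults the mount table rather than stat()-ing the directory, so TearDown can proceed even when root_squash forbids stat() for root.

```go
package nfs

import (
	"fmt"
	"os"
)

// mounter captures just the methods relevant here; the real mount.Interface
// in pkg/util/mount has more.
type mounter interface {
	// IsNotMountPoint parses the mount table instead of calling
	// lstat()/stat() on the directory, so it works under root_squash.
	IsNotMountPoint(dir string) (bool, error)
	Unmount(dir string) error
}

// tearDown sketches the pattern the commit describes for the NFS plugin.
func tearDown(m mounter, dir string) error {
	notMnt, err := m.IsNotMountPoint(dir)
	if err != nil {
		return fmt.Errorf("checking %s: %v", dir, err)
	}
	if !notMnt {
		if err := m.Unmount(dir); err != nil {
			return err
		}
	}
	return os.Remove(dir)
}
```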
Automatic merge from submit-queue (batch tested with PRs 46341, 53629). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).
fix azure file mount limit issue on windows due to using drive letter
**What this PR does / why we need it**:
It's not necessary to use a drive letter in the azure file mount; the correct usage of New-SmbGlobalMapping is as follows:
```
New-SmbGlobalMapping -RemotePath $AzureFilePath -Credential $Credential
mklink /D $mountPath $AzureFilePath
```
I removed the `LocalPath` parameter from New-SmbGlobalMapping.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #54668
Without this PR, there is a limit (25) on the number of azure file mounts per node, because only 25 drive letters can be used on each windows node. With this PR, there is no such limit.
**Special notes for your reviewer**:
@PatrickLang
**Release note**:
```
fix azure file mount limit issue on windows due to using drive letter
```
/sig azure
/sig windows
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).
fix warning messages due to GetMountRefs func not implemented in windows
**What this PR does / why we need it**:
This PR completes the windows implementation of GetMountRefs in mount.go. On Linux, the GetMountRefs implementation reads `/proc/mounts` to find all mount points, while on Windows there is no `/proc/mounts` equivalent that lists all mount points.
There is another way on windows: **we can walk through the mount path (a symbolic link) with the `getAllParentLinks` func and collect all symlinks until we reach the final device, which is actually a drive**.
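A rough sketch of such a walk (hypothetical code, not the exact PR implementation): resolve the mount path link by link until something is no longer a symlink, collecting every intermediate target along the way.

```go
package mount

import (
	"fmt"
	"os"
)

// getAllParentLinks follows the chain of symbolic links starting at path and
// returns every intermediate target; the last entry is the final non-symlink
// target, which on Windows is the underlying drive.
func getAllParentLinks(path string) ([]string, error) {
	const maxDepth = 255 // guard against symlink cycles
	links := []string{path}
	for i := 0; i < maxDepth; i++ {
		fi, err := os.Lstat(path)
		if err != nil {
			return links, err
		}
		if fi.Mode()&os.ModeSymlink == 0 {
			return links, nil // reached the final device/drive
		}
		if path, err = os.Readlink(path); err != nil {
			return links, err
		}
		links = append(links, path)
	}
	return links, fmt.Errorf("symlink chain too deep for %s", path)
}
```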
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #54670
This PR fixes the warning issue mentioned in https://github.com/kubernetes/kubernetes/pull/51252
**Special notes for your reviewer**:
Some values in the code would look like the following:
```
GetMountRefs: mountPath ("\\var\\lib\\kubelet\\pods/4c74b128-92ca-11e7-b86b-000d3a36d70c/volumes/kubernetes.io~azure-disk/pvc-1cc91c70-92ca-11e7-b86b-000d3a36d70c")
getAllParentLinks: refs (["" "" "c:\\var\\lib\\kubelet\\plugins\\kubernetes.io\\azure-disk\\mounts\\b1246717734" "G:\\"])
basemountPath c:\var\lib\kubelet\plugins\kubernetes.io\azure-disk\mounts
got volumeID b1246717734
```
**Release note**:
```
fix warning messages due to GetMountRefs func not implemented in windows
```
1) Fix FakeMounter.IsLikelyNotMountPoint to return ErrNotExist if the
directory does not exist. Mounter.IsLikelyNotMountPoint interface
requires this, and RBD plugin depends on it.
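A minimal sketch of the fix, assuming a fake with a MountPoints slice like the test fake in pkg/util/mount (field names here are illustrative): the stat error must be surfaced so callers can check os.IsNotExist(err).

```go
package fake

import "os"

type MountPoint struct{ Device, Path string }

type FakeMounter struct{ MountPoints []MountPoint }

// IsLikelyNotMountPoint returns the stat error for a missing directory --
// the interface contract the RBD plugin depends on.
func (f *FakeMounter) IsLikelyNotMountPoint(file string) (bool, error) {
	if _, err := os.Stat(file); err != nil {
		return true, err // for a missing dir, a *PathError wrapping ErrNotExist
	}
	for _, mp := range f.MountPoints {
		if mp.Path == file {
			return false, nil
		}
	}
	return true, nil
}
```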
Automatic merge from submit-queue (batch tested with PRs 52469, 52574, 52330, 52689, 52829). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).
add feature: azurefile mount on windows node
**What this PR does / why we need it**:
feature: azurefile mount on windows node. I created this new PR and closed the original one (https://github.com/kubernetes/kubernetes/pull/50233) because a big rebase was needed.
Currently only SMB (a network file sharing protocol) is supported for windows containers in the new Windows 2016 RS3 image, and windows containers in RS3 can only use the New-SmbGlobalMapping cmdlet for volume mapping; the "net use" command does not work for windows containers.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
There is a known blocking issue in Windows Server 2016 when mounting an SMB share (Azure File on Windows uses the SMB protocol) on a container host and then binding that share to a container, so this PR still cannot mount an azure file on a current windows 2016 server node; that depends on the 2016 RS3 release. The mount will still report success (as a workaround) if a customer wants to mount an azure file on a current windows node.
Main code logic is similar to what happens on a Linux node:
1. Create the target directory on the Windows host
2. Use the New-SmbGlobalMapping powershell cmdlet to mount the SMB azure file share to a drive on the Windows host
3. Use the mklink command to link the target directory to the mounted drive
K8s then binds the target directory to the container directory.
The source in the mount function would look like:
`\\[accountname].file.core.windows.net\test`
The target in the mount function would look like:
`c:\var\lib\kubelet\pods\5f679f75-7ce3-11e7-b718-000d3a31dac4\volumes\kubernetes.io~azure-file`
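A rough Go sketch of the steps above (hypothetical helper; the account/key parameters and the exact PowerShell invocation are assumptions for illustration), using the drive-letter-free New-SmbGlobalMapping form shown in the first PR of this list:

```go
package mount

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// mountAzureFile sketches: (1) create the target dir, (2) map the SMB share
// with the New-SmbGlobalMapping cmdlet, (3) mklink the target to the share.
func mountAzureFile(source, target, account, key string) error {
	if err := os.MkdirAll(filepath.Dir(target), 0755); err != nil {
		return err
	}
	ps := fmt.Sprintf(`$cred = New-Object System.Management.Automation.PSCredential("%s", (ConvertTo-SecureString "%s" -AsPlainText -Force)); New-SmbGlobalMapping -RemotePath "%s" -Credential $cred`,
		account, key, source)
	if out, err := exec.Command("powershell", "-Command", ps).CombinedOutput(); err != nil {
		return fmt.Errorf("New-SmbGlobalMapping failed: %v, output: %s", err, out)
	}
	// mklink /D <link> <target> creates a directory symbolic link.
	if out, err := exec.Command("cmd", "/c", "mklink", "/D", target, source).CombinedOutput(); err != nil {
		return fmt.Errorf("mklink failed: %v, output: %s", err, out)
	}
	return nil
}
```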
Sample azure file config file:
```
apiVersion: v1
kind: Pod
metadata:
  name: iis
spec:
  containers:
  - image: microsoft/iis
    name: iis
    volumeMounts:
    - name: azure
      mountPath: "d:"
  nodeSelector:
    beta.kubernetes.io/os: windows
  volumes:
  - name: azure
    azureFile:
      secretName: azure-secret
      shareName: k8stest
      readOnly: false
```
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 43016, 50503, 51281, 51518, 51582). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).
double const in mount_linux.go
**What this PR does / why we need it**:
fix some typos and a duplicated const
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 43016, 50503, 51281, 51518, 51582). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).
Clean up diskLooksUnformatted literal
**What this PR does / why we need it**:
#16948 moved the `formatAndMount` function to mount_linux.go, but `diskLooksUnformatted` does not necessarily need to appear in mount_unsupported.go.
#31515 renamed `diskLooksUnformatted` to `getDiskFormat`, but did not update the comment.
This PR does that small cleanup.
**Which issue this PR fixes**
**Special notes for your reviewer**:
**Release note**:
add initial work for mount azure file on windows
fix review comments
full implementation for attach azure file on windows node
working azure file mount
remove useless functions
add a workable implementation about mounting azure file on windows node
fix review comments and make the pod creating successful even azure file mount failed
fix according to review comments
add mount_windows_test
add implementation for IsLikelyNotMountPoint func
remove mount_windows_test.go temporarily
add back unit test for mount_windows.go
add normalizeWindowsPath func
fix normalizeWindowsPath func issue
implement azure disk on windows
update bazel BUILD
revert validation.go change as it's another PR
fix merge issue and compiling issue
fix windows compiling issue
fix according to review comments
fix according to review comments
fix cross-build failure
fix according to review comments
fix test build failure temporarily
fix darwin build failure
fix azure windows test failure
add empty implementation of MakeRShared on windows
fix gofmt errors
Kubelet makes sure that /var/lib/kubelet is rshared when it starts.
If it is not, kubelet bind-mounts it with rshared propagation so that
containers that mount volumes under /var/lib/kubelet can benefit from mount propagation.
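A minimal sketch of one way to perform that check, assuming we look for a "shared:N" optional field on the /var/lib/kubelet entry in /proc/self/mountinfo (where the kernel exposes propagation peer groups):

```go
package mount

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isShared reports whether path is a mount point with shared propagation,
// by looking for a "shared:N" optional field in /proc/self/mountinfo.
func isShared(path string) (bool, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return false, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) < 7 || fields[4] != path {
			continue // field 5 (index 4) is the mount point
		}
		// optional fields sit between the mount options and the "-" separator
		for _, opt := range fields[6:] {
			if opt == "-" {
				break
			}
			if strings.HasPrefix(opt, "shared:") {
				return true, nil
			}
		}
		return false, nil
	}
	return false, fmt.Errorf("%s not found in mountinfo", path)
}
```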
The fsGroup check enforces that if a volume has already been mounted by one
pod and another pod wants to mount it with a different fsGroup value, the
second mount operation is not allowed.
Automatic merge from submit-queue (batch tested with PRs 46458, 50934, 50766, 50970, 47698)
Prepare VolumeHost for running mount tools in containers
This is the first part of the implementation of https://github.com/kubernetes/features/issues/278 - running mount utilities in containers.
It updates the `VolumeHost` interface:
* `GetMounter()` now requires the volume plugin name, as it is going to return different mounters to different volume plugins, because the mount utilities for these plugins can live in different places.
* A new `GetExec()` method that volume plugins should use to execute any utilities. This new `Exec` interface will execute them in the proper place.
* `SafeFormatAndMount` is updated to the new `Exec` interface.
This is just a preparation: `GetExec` currently leads to a simple `os.Exec`, and mount utilities are executed in the same place as before. The volume plugins will be updated in subsequent PRs (split into separate PRs, since some plugins required a lot of changes).
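A condensed sketch of the resulting interface shape (signatures approximated from this description, with placeholder types standing in for the real ones in pkg/util/mount):

```go
package volume

// MountInterface and Exec are placeholders for pkg/util/mount's Interface
// and Exec; the method sets here are illustrative only.
type MountInterface interface {
	Mount(source, target, fstype string, options []string) error
	Unmount(target string) error
}

type Exec interface {
	// Run executes a utility "in the proper place" -- plain os/exec today,
	// possibly inside a utility container later.
	Run(cmd string, args ...string) ([]byte, error)
}

type VolumeHost interface {
	// GetMounter now takes the volume plugin name, because different
	// plugins may have their mount utilities in different places.
	GetMounter(pluginName string) MountInterface
	// GetExec is what volume plugins should use to execute any utilities.
	GetExec(pluginName string) Exec
	// ... remaining methods unchanged ...
}
```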
```release-note
NONE
```
@kubernetes/sig-storage-pr-reviews
@rootfs @gnufied
Automatic merge from submit-queue (batch tested with PRs 50693, 50831, 47506, 49119, 50871)
Use reflect.DeepEqual to replace slicesEqual
**What this PR does / why we need it**:
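The gist of the change (illustrative example, not the exact diff): a hand-rolled slicesEqual helper is replaced by reflect.DeepEqual from the standard library.

```go
package main

import (
	"fmt"
	"reflect"
)

func main() {
	a := []string{"ro", "bind"}
	b := []string{"ro", "bind"}
	// previously: slicesEqual(a, b), a manual length check plus loop
	fmt.Println(reflect.DeepEqual(a, b)) // true
}
```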
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes https://github.com/kubernetes/kubernetes/issues/50952
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Run mount in its own systemd scope.
Kubelet needs to run /bin/mount in its own cgroup.
- When kubelet runs as a systemd service, "systemctl restart kubelet" may kill all processes in the same cgroup and thus terminate fuse daemons that are needed for gluster and cephfs mounts.
- When kubelet runs in a docker container, restart of the container kills all fuse daemons started in the container.
Killing fuse daemons is bad; it effectively unmounts volumes from running pods.
This patch runs mount via "systemd-run --scope /bin/mount ...", which makes sure that any fuse daemons are forked in their own systemd scope (= cgroup) and will survive a restart of kubelet's systemd service or docker container.
This helps with #34965
As a downside, each new fuse daemon will run in its own transient systemd service and systemctl output may be cluttered.
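A minimal sketch of the wrapper (hypothetical helper; the real code also detects whether the host runs systemd before taking this path):

```go
package mount

import (
	"fmt"
	"os/exec"
	"strings"
)

// mountInScope runs mount under "systemd-run --scope" so that any fuse
// daemon it forks lands in its own transient scope (cgroup) and survives
// a kubelet restart.
func mountInScope(source, target, fstype string, options []string) error {
	args := []string{"--scope", "--", "mount"}
	if fstype != "" {
		args = append(args, "-t", fstype)
	}
	if len(options) > 0 {
		args = append(args, "-o", strings.Join(options, ","))
	}
	args = append(args, source, target)
	out, err := exec.Command("systemd-run", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("systemd-run mount failed: %v, output: %s", err, out)
	}
	return nil
}
```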
@kubernetes/sig-storage-pr-reviews
@kubernetes/sig-node-pr-reviews
```release-note
fuse daemons for GlusterFS and CephFS are now run in their own systemd scope when Kubernetes runs on a system with systemd.
```
Automatic merge from submit-queue
Switch from package syscall to golang.org/x/sys/unix
**What this PR does / why we need it**:
The syscall package is locked down, and the comment in https://github.com/golang/go/blob/master/src/syscall/syscall.go#L21-L24 advises switching code to the corresponding package from golang.org/x/sys. This PR does so and replaces usage of package syscall with package golang.org/x/sys/unix where applicable. This also makes it possible to get updates and fixes without having to use a new Go version.
In order to get the latest functionality, golang.org/x/sys/ is re-vendored. This also allows using Eventfd() from this package instead of calling the eventfd() C function.
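The mechanical part of the switch is a one-for-one swap of wrappers and types (example call, not taken from the diff):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var st unix.Statfs_t
	// was: var st syscall.Statfs_t; syscall.Statfs("/", &st)
	if err := unix.Statfs("/", &st); err != nil {
		fmt.Println("statfs:", err)
		return
	}
	fmt.Println("block size:", st.Bsize)
}
```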
**Special notes for your reviewer**:
This follows previous works in other Go projects, see e.g. moby/moby#33399, cilium/cilium#588
**Release note**:
```release-note
NONE
```
The syscall package is locked down and the comment in [1] advises to
switch code to use the corresponding package from golang.org/x/sys. Do
so and replace usage of package syscall with package
golang.org/x/sys/unix where applicable.
[1] https://github.com/golang/go/blob/master/src/syscall/syscall.go#L21-L24
This will also allow getting updates and fixes for syscall wrappers
without having to use a new Go version.
Errno, Signal and SysProcAttr aren't changed as they haven't been
implemented in /x/sys/. Stat_t from syscall is used if standard library
packages (e.g. os) require it. syscall.SIGTERM is used for
cross-platform files.
NsEnterMounter should not stop parsing findmnt output at the first space but
at the last one, in case the mount point name itself contains a space.
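A sketch of the parsing rule (hypothetical helper, not the exact NsEnterMounter code), assuming findmnt prints the target first followed by one more space-separated column:

```go
package main

import (
	"fmt"
	"strings"
)

// parseFindmntTarget takes everything up to the last space, so a mount
// point whose name contains spaces is still returned in full.
func parseFindmntTarget(line string) (string, error) {
	line = strings.TrimSpace(line)
	i := strings.LastIndex(line, " ")
	if i < 0 {
		return "", fmt.Errorf("unexpected findmnt output: %q", line)
	}
	return line[:i], nil
}

func main() {
	target, _ := parseFindmntTarget("/var/lib/kubelet/mount point with space ext4")
	fmt.Println(target) // /var/lib/kubelet/mount point with space
}
```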
Added IsNotMountPoint method to mount utils (pkg/util/mount/mount.go)
Added UnmountMountPoint method to volume utils (pkg/volume/util/util.go)
Call UnmountMountPoint method from local storage (pkg/volume/local/local.go)
IsLikelyNotMountPoint behavior was not modified, so the logic/behavior of UnmountPath is unchanged.
lsblk reports the FSTYPE of devices with partition tables as the empty string "",
which is indistinguishable from empty devices. We must look for dependent
devices (i.e. partitions) to see whether the device is really empty, and report
an error otherwise.
I checked that LVM, LUKS and MD RAID have their own FSTYPE in lsblk output,
so it should only be a partition table that has an empty FSTYPE.
The main point of this patch is to run lsblk without "-n", i.e. print all
dependent devices, and check whether they are there.
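A simplified sketch of that check (not the exact getDiskFormat code): without -n, lsblk prints a header plus one line per device, so any line beyond the device's own means dependent devices exist.

```go
package mount

import (
	"fmt"
	"os/exec"
	"strings"
)

// getDiskFormat returns the filesystem type of disk, or an error if the
// disk has dependent devices (a partition table) despite an empty FSTYPE.
func getDiskFormat(disk string) (string, error) {
	// no "-n": keep all lines, including dependent devices (partitions)
	out, err := exec.Command("lsblk", "-o", "FSTYPE", disk).CombinedOutput()
	if err != nil {
		return "", err
	}
	lines := strings.Split(strings.TrimSpace(string(out)), "\n")
	// lines[0] is the FSTYPE header, lines[1] the device itself, and any
	// further lines are dependent devices such as partitions.
	if len(lines) >= 2 && strings.TrimSpace(lines[1]) != "" {
		return strings.TrimSpace(lines[1]), nil // LVM/LUKS/MD report their own FSTYPE
	}
	if len(lines) > 2 {
		return "", fmt.Errorf("%s has a partition table; refusing to treat it as empty", disk)
	}
	return "", nil // genuinely empty device
}
```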
Automatic merge from submit-queue (batch tested with PRs 35094, 42095, 42059, 42143, 41944)
Use chroot for containerized mounts
This PR is to modify the containerized mounter script to use chroot
instead of rkt fly. This will avoid the problem of possible large number
of mounts caused by rkt containers if they are not cleaned up.
Automatic merge from submit-queue (batch tested with PRs 41714, 41510, 42052, 41918, 31515)
Show specific error when a volume is formatted by unexpected filesystem.
kubelet now detects that e.g. an xfs volume is being mounted as ext3 because of
a wrong volume.Spec.
The mount error is left in the error message to help diagnose issues with mounting e.g.
an 'ext3' volume as 'ext4' - they are different filesystems, but the kernel should
mount ext3 as ext4 without errors.
Example kubectl describe pod output:
```
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
41s 3s 7 {kubelet ip-172-18-3-82.ec2.internal} Warning FailedMount MountVolume.MountDevice failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-ba79c81d" (spec.Name: "pvc-ce175cbb-6b82-11e6-9fe4-0e885cca73d3") pod "3d19cb64-6b83-11e6-9fe4-0e885cca73d3" (UID: "3d19cb64-6b83-11e6-9fe4-0e885cca73d3") with: failed to mount the volume as "ext4", it's already formatted with "xfs". Mount error: mount failed: exit status 32
Mounting arguments: /dev/xvdba /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1d/vol-ba79c81d ext4 [defaults]
Output: mount: wrong fs type, bad option, bad superblock on /dev/xvdba,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
```
OSX 10.11.x has `/var` symlinked to `/private/var`, which was tripping
up logic in `mount.GetMountRefs`
This fixes unit tests for pkg/volume/fc and pkg/volume/iscsi
Automatic merge from submit-queue
Enable lazy initialization of ext3/ext4 filesystems
**What this PR does / why we need it**: It enables lazy inode table and journal initialization in ext3 and ext4.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #30752, fixes #30240
**Release note**:
```release-note
Enable lazy inode table and journal initialization for ext3 and ext4
```
**Special notes for your reviewer**:
This PR removes the extended options to mkfs.ext3/mkfs.ext4, so that the defaults (enabled) for lazy initialization are used.
These extended options come from a script that was historically located at */usr/share/google/safe_format_and_mount* and later ported to Go so that this dependency on the script could be removed. After some searching, I found the original script here: https://github.com/GoogleCloudPlatform/compute-image-packages/blob/legacy/google-startup-scripts/usr/share/google/safe_format_and_mount
Checking the history of this script, I found the commit [Disable lazy init of inode table and journal.](4d7346f7f5), which introduces the extended flags with this description:
```
Now that discard with guaranteed zeroing is supported by PD,
initializing them is really fast and prevents perf from being affected
when the filesystem is first mounted.
```
The problem is that this is not true for all cloud providers and all disk types, e.g. Azure and AWS. I only tested with magnetic disks on Azure and AWS, so maybe it's different for SSDs on these cloud providers. The result is that this performance optimization dramatically increases the time needed to format a disk in such cases.
When mkfs.ext4 is told not to lazily initialize the inode tables and the check for guaranteed zeroing on discard fails, it falls back to a very naive implementation that simply loops and writes zeroed buffers to the disk. Performance of this highly depends on free memory, and it also uses up all this free memory for write caching, reducing the performance of everything else in the system.
As of https://github.com/kubernetes/kubernetes/issues/30752, there is also something inside kubelet that somehow degrades performance of all this. It's not exactly known what it is, but I'd assume it has something to do with cgroups throttling IO or memory.
I checked the kernel code for lazy inode table initialization. The nice thing is that the kernel also does the guaranteed-zeroing-on-discard check. If zeroing is guaranteed, the kernel uses discard for the lazy initialization, which should finish in just a few seconds. If it is not guaranteed, it falls back to using *bio*s, which does not require the use of the write cache. The result is that free memory is not required and not touched, thus performance is maxed out and the system does not suffer.
As the original reason for disabling lazy init was a performance optimization and the kernel already does this optimization by default (and in a much better way), I'd suggest completely removing these flags and relying on the kernel to do it in the best way.
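In formatAndMount-style code the change amounts to dropping the extended options (sketch, not the exact diff):

```go
package mount

import (
	"fmt"
	"os/exec"
)

// formatDisk formats source with the given fstype, relying on mkfs defaults
// (lazy inode table and journal initialization enabled).
func formatDisk(source, fstype string) error {
	args := []string{"-F", source}
	// previously: args = append([]string{"-E",
	//     "lazy_itable_init=0,lazy_journal_init=0"}, args...)
	out, err := exec.Command("mkfs."+fstype, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("mkfs.%s failed: %v, output: %s", fstype, err, out)
	}
	return nil
}
```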
This is a fix on top of #38124. In this fix, we move the logic that filters
out shared mount references into operation_executor's UnmountDevice
function, to avoid this code being used by other volume types such as
rbd, azure, etc. This filter function should only be needed during
unmount device for the GCI image.
This is a workaround for the unmount device issue caused by the gci mounter. In a GCI cluster, if the gci mounter is used for mounting, the container started by the mounter script will cause additional mounts to be created in the container. Since these mounts are irrelevant to the original mounts, they should not be considered when checking the mount references. By comparing the mount path prefix, those additional mounts can be filtered out.
The plan is to work on a better approach to solve this issue.
This change enables the containerized mounter only for nfs and
glusterfs types. For other types such as tmpfs, ext2/3/4, or an empty type,
we should still use mount from $PATH.
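A small sketch of the selection (hypothetical helper name): only the network filesystems go through the containerized mounter.

```go
package mount

// mounterPathFor returns the containerized mounter path only for volume
// types that need the mount helpers shipped with it; an empty string means
// "use mount from $PATH".
func mounterPathFor(fstype, containerizedMounterPath string) string {
	switch fstype {
	case "nfs", "glusterfs":
		return containerizedMounterPath
	default:
		return ""
	}
}
```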
This PR fixes an issue in converting aws volume IDs from mount
paths. Currently three aws volume ID formats are supported. The
following lists examples of those three formats and their corresponding
global mount paths:
1. aws:///vol-123456
(/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/vol-123456)
2. aws://us-east-1/vol-123456
(/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1/vol-123456)
3. vol-123456
(/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1/vol-123456)
For the first two cases, we need to check the mount path and convert
it back to the original format.
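A sketch of the conversion (helper name and details hypothetical), recovering the first two formats from their global mount paths:

```go
package mount

import (
	"fmt"
	"strings"
)

// getVolumeIDFromMountPath maps a global mount path back to an aws volume
// ID; basePath is .../plugins/kubernetes.io/aws-ebs/mounts. A plain
// vol-123456 (format 3) needs no conversion and is not handled here.
func getVolumeIDFromMountPath(mountPath, basePath string) (string, error) {
	rel := strings.TrimPrefix(mountPath, basePath+"/")
	parts := strings.Split(rel, "/")
	switch {
	case len(parts) == 2 && parts[0] == "aws":
		// aws/vol-123456 -> aws:///vol-123456 (format 1, no zone)
		return "aws:///" + parts[1], nil
	case len(parts) == 3 && parts[0] == "aws":
		// aws/us-east-1/vol-123456 -> aws://us-east-1/vol-123456 (format 2)
		return fmt.Sprintf("aws://%s/%s", parts[1], parts[2]), nil
	}
	return "", fmt.Errorf("unexpected mount path %q", mountPath)
}
```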
When kubelet restarts, all the information about the volumes is
gone from the actual/desired states. When updating the node status with
mounted volumes, the volume list might be empty although there are still
volumes mounted, in turn causing the master to detach those volumes since
they are not in the mounted volumes list. This fix makes sure the
mounted volumes list is only updated after the reconciler starts its sync
states process. This sync states process scans the existing volume
directories and reconstructs actual states if they are missing.
This PR also fixes a problem with orphaned pods' directories. If a pod
directory is unmounted but not yet deleted (e.g., interrupted by a
kubelet restart), the cleanup routine will delete the directory so that
the pod directory can be cleaned up (it is safe to delete the directory
since it is no longer mounted).
The third issue this PR fixes is that during volume reconstruction in the
actual state, the mounter must not be nil since it is required for creating
container.VolumeMap. If it is nil, it might cause a nil pointer exception
in kubelet.
Details are in proposal PR #33203
In order to be able to use the new mounter library, this PR adds the
mounterPath flag to kubelet, which passes the flag to the mount
interface. If the flag is empty, mount uses the default mount path.
Currently kubelet volume management works on the concept of a desired
and an actual world of states. The volume manager periodically compares the
two worlds and performs volume mount/unmount and/or attach/detach
operations. When kubelet restarts, the cache of those two worlds is
gone. Although the desired world can be recovered through the apiserver, the
actual world cannot, which may leave some volumes uncleanable
if their information is deleted by the apiserver. This change adds
reconstruction of the actual world by reading the pod directories from
disk. The reconstructed volume information is added to both the desired
world and the actual world if it cannot be found in either world. The rest
of the logic is the same as before: the desired world populator may clean up
the volume entry if it is no longer in the apiserver, and then the volume
manager should invoke unmount to clean it up.
In the nsenter_mount.go/isLikelyNotMountPoint function, the returned output
from the findmnt command misses the last letter. Modify the code to make sure
that the output has the full target path. Fixes #26421 #25056 #22911
fix https://github.com/kubernetes/kubernetes/issues/24717
If the kubelet root-dir is a symlink, the pod's cinder volume dir can't be
unmounted even after the pod is deleted.
This patch reads the target path of the symlink before comparing it with entries
in proc mounts.
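A sketch of the symlink-aware comparison (simplified; the helper name is hypothetical): resolve the path before matching it against mount-table entries.

```go
package mount

import (
	"path/filepath"
	"strings"
)

// pathWithinMount reports whether path (after resolving symlinks, e.g. a
// symlinked kubelet root-dir) equals or lies under mountPoint as recorded
// in /proc/mounts.
func pathWithinMount(path, mountPoint string) (bool, error) {
	resolved, err := filepath.EvalSymlinks(path)
	if err != nil {
		return false, err
	}
	return resolved == mountPoint ||
		strings.HasPrefix(resolved, mountPoint+string(filepath.Separator)), nil
}
```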
Kubelet was not able to mount volumes when running inside a container and
using the nsenter mounter.
NsenterMounter.IsLikelyNotMountPoint() should return ErrNotExist when the
checked directory does not exist, as the regular mounter does this and
some volume plugins depend on this behavior.