OPENSHIFT, IBM

Installing OpenShift Container Storage on IBM Cloud Bare Metal Red Hat OpenShift Kubernetes Service (ROKS)

#storage, #container

This feature set (deploying OCS on bare metal) has not gone GA yet and is still a work in progress.[1] It should be coming soon, but the main issues are that there is no documentation specific to this deployment scenario and that the kubelet home directory on IBM ROKS is not the same as on Red Hat OpenShift Container Platform.
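
A quick way to confirm the kubelet home directory on your own nodes (a sketch, assuming the root directory is set via the kubelet's --root-dir command-line flag rather than a config file) is to check the running process from a debug shell; on these ROKS nodes it should come back as /var/data/kubelet, which is why the CSI DaemonSets are edited later on:

$ oc debug node/10.194.150.196
sh-4.2# chroot /host
sh-4.2# ps -ef | grep kubelet | grep -o -- '--root-dir=[^ ]*'   # expect /var/data/kubelet on ROKS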

Install the OpenShift Container Storage Operator

  1. Do Section 1.1.1 in the official documentation.

  2. Skip Section 1.1.2 because this is not a typical install.

  3. Since these are RHEL-based nodes, do Section 1.1.3.

    In our case everything was already in place; we just needed to allow containers to use CephFS through SELinux (a sketch for applying this across all three storage nodes follows the command).

    # setsebool -P container_use_cephfs on
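
    If you would rather not open a shell on each node by hand, a minimal sketch (assuming the first three nodes listed in the next section are the storage nodes) that applies the boolean through oc debug:

    $ for n in 10.194.150.196 10.194.150.201 10.194.150.204; do
    >   oc debug node/$n -- chroot /host setsebool -P container_use_cephfs on
    > done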

Prepare the Raw Storage on Each Node

These particular nodes had an SSD attached that was automatically picked up by multipath and that already had a filesystem on it. We need to remove the multipath mapping and clear the disk using wipefs. To do this, we add the wwid of the local disk to the blacklist, per the RHEL guidelines for local SCSI storage devices, and then wipe the volume data.

References:

  • https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/dm_multipath/index#ignore_localdisk_procedure

  • https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_device_mapper_multipath/index#proc_ignoring-local-disks-for-multipathing-setting-up-dm-multipath

    1. Identify the nodes. We are working with the first three. ROKS doesn’t make it easy to see which ones we need; you may know this from the provisioning order, or you may be able to run describe on the nodes (see the sketch after the listing).

      $ oc get nodes
      NAME             STATUS   ROLES           AGE     VERSION
      10.194.150.196   Ready    master,worker   7h44m   v1.16.2
      10.194.150.201   Ready    master,worker   7h25m   v1.16.2
      10.194.150.204   Ready    master,worker   7h44m   v1.16.2
      10.194.150.205   Ready    master,worker   7d3h    v1.16.2
      10.194.150.232   Ready    master,worker   7d3h    v1.16.2
      10.194.150.244   Ready    master,worker   7d1h    v1.16.2
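
      One possible check (a sketch; the worker-pool and machine-type details that IBM Cloud exposes in the node labels may vary by cluster version):

      $ oc describe node 10.194.150.196 | grep -i -E 'worker-pool|machine-type'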
    2. Drop into a debug pod to get a shell on the node. There are several different ways to see the drive device.

      $ oc debug node/10.194.150.196
      Starting pod/10194150196-debug ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.194.150.196
      If you don't see a command prompt, try pressing enter.
      sh-4.2# chroot /host
      sh-4.2# lsblk
      NAME                                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
      sda                                      8:0    0   1.8T  0 disk
      |-sda1                                   8:1    0   256M  0 part  /boot
      |-sda2                                   8:2    0     1G  0 part
      `-sda3                                   8:3    0   1.8T  0 part  /
      sdb                                      8:16   0 893.8G  0 disk
      `-sdb1                                   8:17   0 893.8G  0 part  /var/data
      sdc                                      8:32   0   1.8T  0 disk
      `-3600605b00b5982502642ac600341c0bb    253:0    0   1.8T  0 mpath
        `-3600605b00b5982502642ac600341c0bb1 253:1    0   1.8T  0 part
      
      sh-4.2# ls /dev/mapper/
      3600605b00b5982502642ac600341c0bb  3600605b00b5982502642ac600341c0bb1  control
      
      sh-4.2# ls -lah /dev/sdc
      brw-rw----. 1 root disk 8, 32 May  4 12:24 /dev/sdc
      
      sh-4.2# ls -lah /dev/mapper/3600605b00b5982502642ac600341c0bb1
      lrwxrwxrwx. 1 root root 7 May  4 12:24 /dev/mapper/3600605b00b5982502642ac600341c0bb1 -> ../dm-1
      
      sh-4.2# dmsetup ls
      3600605b00b5982502642ac600341c0bb1	(253:1)
      3600605b00b5982502642ac600341c0bb	(253:0)
      
      sh-4.2# multipath -ll
      3600605b00b5982502642ac600341c0bb dm-0 AVAGO   ,MR9361-8i
      size=1.7T features='0' hwhandler='0' wp=rw
      `-+- policy='service-time 0' prio=1 status=active
        `- 0:2:2:0 sdc 8:32 active ready running
    3. Check the wwids and multipath.conf file.

      sh-4.2# cat /etc/multipath/wwids
      # Multipath wwids, Version : 1.0
      # NOTE: This file is automatically maintained by multipath and multipathd.
      # You should not need to edit this file in normal circumstances.
      #
      # Valid WWIDs:
      /3600605b00b5982502642ac600341c0bb/
      
      sh-4.2# cat /etc/multipath.conf
      defaults {
      user_friendly_names no
      max_fds max
      flush_on_last_del yes
      queue_without_daemon no
      dev_loss_tmo infinity
      fast_io_fail_tmo 5
      }
      # All data under blacklist must be specific to your system.
      blacklist {
      wwid "SAdaptec*"
      devnode "^hd[a-z]"
      devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
      devnode "^cciss.*"
      }
      devices {
      device {
      vendor "NETAPP"
      product "LUN"
      path_grouping_policy group_by_prio
      features "3 queue_if_no_path pg_init_retries 50"
      prio "alua"
      path_checker tur
      failback immediate
      path_selector "round-robin 0"
      hardware_handler "1 alua"
      rr_weight uniform
      rr_min_io 128
      uid_attribute ID_SERIAL
      }
      }
    4. Note that you cannot use the /dev/sdc device name yet because device-mapper/multipath has claimed the disk.

      sh-4.2# mkdir /tmp/testa
      
      sh-4.2# mount /dev/sdc /tmp/testa
      mount: /dev/sdc is already mounted or /tmp/testa busy
      
      sh-4.2# mount /dev/mapper/3600605b00b5982502642ac600341c0bb1 /tmp/testa
      
      sh-4.2# umount /tmp/testa
    5. Back up multipath.conf and edit it to add your device wwid to the blacklist.

      sh-4.2# cp /etc/multipath.conf{,.bk}
      
      sh-4.2# vi /etc/multipath.conf
      
      sh-4.2# diff /etc/multipath.conf{,.bk}
      11d10
      < wwid 3600605b00b5982502642ac600341c0bb
      
      sh-4.2# cat /etc/multipath.conf
      defaults {
      user_friendly_names no
      max_fds max
      flush_on_last_del yes
      queue_without_daemon no
      dev_loss_tmo infinity
      fast_io_fail_tmo 5
      }
      # All data under blacklist must be specific to your system.
      blacklist {
      wwid 3600605b00b5982502642ac600341c0bb
      wwid "SAdaptec*"
      devnode "^hd[a-z]"
      devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
      devnode "^cciss.*"
      }
      devices {
      device {
      vendor "NETAPP"
      product "LUN"
      path_grouping_policy group_by_prio
      features "3 queue_if_no_path pg_init_retries 50"
      prio "alua"
      path_checker tur
      failback immediate
      path_selector "round-robin 0"
      hardware_handler "1 alua"
      rr_weight uniform
      rr_min_io 128
      uid_attribute ID_SERIAL
      }
      }

      Nothing takes effect until you reload the multipathd service.

      sh-4.2# lsblk
      NAME                                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
      sda                                      8:0    0   1.8T  0 disk
      |-sda1                                   8:1    0   256M  0 part  /boot
      |-sda2                                   8:2    0     1G  0 part
      `-sda3                                   8:3    0   1.8T  0 part  /
      sdb                                      8:16   0 893.8G  0 disk
      `-sdb1                                   8:17   0 893.8G  0 part  /var/data
      sdc                                      8:32   0   1.8T  0 disk
      |-sdc1                                   8:33   0   1.8T  0 part
      `-3600605b00b5982502642ac600341c0bb    253:0    0   1.8T  0 mpath
        `-3600605b00b5982502642ac600341c0bb1 253:1    0   1.8T  0 part
      
      sh-4.2# systemctl reload multipathd.service
      
      sh-4.2# lsblk
      NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
      sda      8:0    0   1.8T  0 disk
      |-sda1   8:1    0   256M  0 part /boot
      |-sda2   8:2    0     1G  0 part
      `-sda3   8:3    0   1.8T  0 part /
      sdb      8:16   0 893.8G  0 disk
      `-sdb1   8:17   0 893.8G  0 part /var/data
      sdc      8:32   0   1.8T  0 disk
      `-sdc1   8:33   0   1.8T  0 part
      
      sh-4.2# mount /dev/sdc1 /tmp/testa
      
      sh-4.2# umount /tmp/testa
    6. Wipe the volume data (it came with a single XFS partition).

      sh-4.2# wipefs /dev/sdc
      offset               type
       ----------------------------------------------------------------
      0x1fe                dos   [partition table]
      
      
      sh-4.2# wipefs /dev/sdc --all --force
      /dev/sdc: 2 bytes were erased at offset 0x000001fe (dos): 55 aa
      /dev/sdc: calling ioctl to re-read partition table: Success
      
      sh-4.2# wipefs /dev/sdc
      
      sh-4.2# exit
      exit
    7. To repeat this on the remaining nodes, it can be scripted:

      cp /etc/multipath.conf{,.bk}
      sed -i 's/wwid \"SAdaptec\*\"/wwid \"SAdaptec\*\"\nwwid '$(multipath -ll | grep AVAGO | awk '{print $1}')'/g' /etc/multipath.conf
      systemctl reload multipathd.service
      wipefs /dev/sdc --all --force
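
      A slightly more defensive variant (a sketch, assuming exactly one AVAGO-backed multipath device per node and that the disk behind it is /dev/sdc, as on these nodes) only edits the file when the wwid is not already blacklisted:

      WWID=$(multipath -ll | awk '/AVAGO/{print $1}')        # wwid of the local RAID volume
      if [ -n "$WWID" ]; then
          cp /etc/multipath.conf /etc/multipath.conf.bk      # keep a backup
          # Append the wwid inside the blacklist section only if it is not there already.
          grep -q "wwid $WWID" /etc/multipath.conf || \
              sed -i "/^blacklist {/a wwid $WWID" /etc/multipath.conf
          systemctl reload multipathd.service                # drop the mpath mapping
          wipefs --all --force /dev/sdc                      # clear the partition table
      fi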
Note
These RHEL systems are "ephemeral": if you "reset" them using the ibmcloud command, you will lose any alteration to the underlying OS. Further work and testing is needed if the machines and their storage must be usable correctly after a "reset". It is unclear to me whether there is a "Reboot" option that does not "reset" the OS. For the purposes of this POC it does not appear to matter, as long as we do not "Reset" the machines.

Install the Local Storage Operator

  1. Do Section 1.2.2 in the official documentation.

  2. Create the LocalVolume object:

    echo 'apiVersion: local.storage.openshift.io/v1
    kind: LocalVolume
    metadata:
      name: ocs-backing
      namespace: local-storage
    spec:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - 10.194.150.196
            - 10.194.150.201
            - 10.194.150.204
      storageClassDevices:
      - devicePaths:
        - /dev/sdc
        storageClassName: local-ocs-backing
        volumeMode: Block' | oc create -f -
Note
The best practice suggested by the OCS documentation is to use the /dev/disk/by-id name of the disk. That was not done in this POC.
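
If you want to follow that practice, the stable name can be found from a debug shell on each node (a sketch; the exact symlink name will differ per disk) and then referenced as /dev/disk/by-id/<name> in devicePaths instead of /dev/sdc:

sh-4.2# ls -l /dev/disk/by-id/ | grep -w sdc

The entry pointing at ../../sdc (typically a scsi-<wwid> or wwn-<wwid> symlink) is the one to use.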

Results:

$ oc get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
local-ssd-file-platinum-kafka-local-diskmaker-77dz5     1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-diskmaker-jzm6g     1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-diskmaker-qnmb6     1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-provisioner-66n4b   1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-provisioner-gnr2x   1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-provisioner-kzbfw   1/1     Running   0          17h
local-storage-operator-5df7f84f45-kttf4                 1/1     Running   0          14h
ocs-backing-local-diskmaker-fkbqn                       1/1     Running   0          4s
ocs-backing-local-diskmaker-l82n6                       1/1     Running   0          4s
ocs-backing-local-diskmaker-zhl9d                       1/1     Running   0          4s
ocs-backing-local-provisioner-5sqq4                     1/1     Running   0          4s
ocs-backing-local-provisioner-n9mxg                     1/1     Running   0          4s
ocs-backing-local-provisioner-wz4qk                     1/1     Running   0          4s

Note that the "diskmaker" pods don’t take any action until a PVC is created to match the PV, as their logs show.

$ oc logs -f ocs-backing-local-diskmaker-fkbqn
I0505 02:28:06.808416       1 diskmaker.go:23] Go Version: go1.12.12
I0505 02:28:06.808796       1 diskmaker.go:24] Go OS/Arch: linux/amd64
^C

The "provisioner" logs show the capture of the raw block device.

$ oc logs -f ocs-backing-local-provisioner-5sqq4
I0505 02:28:06.851623       1 common.go:320] StorageClass "local-ocs-backing" configured with MountDir "/mnt/local-storage/local-ocs-backing", HostDir "/mnt/local-storage/local-ocs-backing", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
I0505 02:28:06.851784       1 main.go:63] Loaded configuration: {StorageClassConfig:map[local-ocs-backing:{HostDir:/mnt/local-storage/local-ocs-backing MountDir:/mnt/local-storage/local-ocs-backing BlockCleanerCommand:[/scripts/quick_reset.sh] VolumeMode:Block FsType:}] NodeLabelsForPV:[] UseAlphaAPI:false UseJobForCleaning:false MinResyncPeriod:{Duration:5m0s} UseNodeNameOnly:false LabelsForPV:map[storage.openshift.com/local-volume-owner-name:ocs-backing storage.openshift.com/local-volume-owner-namespace:local-storage]}
I0505 02:28:06.851817       1 main.go:64] Ready to run...
W0505 02:28:06.851829       1 main.go:73] MY_NAMESPACE environment variable not set, will be set to default.
W0505 02:28:06.851839       1 main.go:79] JOB_CONTAINER_IMAGE environment variable not set.
I0505 02:28:06.852418       1 common.go:382] Creating client using in-cluster config
I0505 02:28:06.884228       1 main.go:85] Starting controller
I0505 02:28:06.884272       1 main.go:100] Starting metrics server at :8080
I0505 02:28:06.884403       1 controller.go:45] Initializing volume cache
I0505 02:28:07.087009       1 controller.go:108] Controller started
E0505 02:28:07.087143       1 discovery.go:201] Error reading directory: open /mnt/local-storage/local-ocs-backing: no such file or directory
I0505 02:28:17.087958       1 discovery.go:304] Found new volume at host path "/mnt/local-storage/local-ocs-backing/sdc" with capacity 1919816826880, creating Local PV "local-pv-a0747d99", required volumeMode "Block"
I0505 02:28:17.102694       1 discovery.go:337] Created PV "local-pv-a0747d99" for volume at "/mnt/local-storage/local-ocs-backing/sdc"
I0505 02:28:17.102824       1 cache.go:55] Added pv "local-pv-a0747d99" to cache
I0505 02:28:17.116552       1 cache.go:64] Updated pv "local-pv-a0747d99" to cache
^C
$ oc logs -f ocs-backing-local-provisioner-n9mxg
I0505 02:28:06.628631       1 common.go:320] StorageClass "local-ocs-backing" configured with MountDir "/mnt/local-storage/local-ocs-backing", HostDir "/mnt/local-storage/local-ocs-backing", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
I0505 02:28:06.628779       1 main.go:63] Loaded configuration: {StorageClassConfig:map[local-ocs-backing:{HostDir:/mnt/local-storage/local-ocs-backing MountDir:/mnt/local-storage/local-ocs-backing BlockCleanerCommand:[/scripts/quick_reset.sh] VolumeMode:Block FsType:}] NodeLabelsForPV:[] UseAlphaAPI:false UseJobForCleaning:false MinResyncPeriod:{Duration:5m0s} UseNodeNameOnly:false LabelsForPV:map[storage.openshift.com/local-volume-owner-name:ocs-backing storage.openshift.com/local-volume-owner-namespace:local-storage]}
I0505 02:28:06.628809       1 main.go:64] Ready to run...
W0505 02:28:06.628820       1 main.go:73] MY_NAMESPACE environment variable not set, will be set to default.
W0505 02:28:06.628831       1 main.go:79] JOB_CONTAINER_IMAGE environment variable not set.
I0505 02:28:06.629393       1 common.go:382] Creating client using in-cluster config
I0505 02:28:06.658010       1 main.go:85] Starting controller
I0505 02:28:06.658072       1 main.go:100] Starting metrics server at :8080
I0505 02:28:06.658185       1 controller.go:45] Initializing volume cache
I0505 02:28:06.860545       1 controller.go:108] Controller started
E0505 02:28:06.860666       1 discovery.go:201] Error reading directory: open /mnt/local-storage/local-ocs-backing: no such file or directory
I0505 02:28:16.861298       1 discovery.go:304] Found new volume at host path "/mnt/local-storage/local-ocs-backing/sdc" with capacity 1919816826880, creating Local PV "local-pv-4db9cb47", required volumeMode "Block"
I0505 02:28:16.877592       1 discovery.go:337] Created PV "local-pv-4db9cb47" for volume at "/mnt/local-storage/local-ocs-backing/sdc"
I0505 02:28:16.877745       1 cache.go:55] Added pv "local-pv-4db9cb47" to cache
I0505 02:28:16.887963       1 cache.go:64] Updated pv "local-pv-4db9cb47" to cache
^C
$ oc logs -f ocs-backing-local-provisioner-wz4qk
I0505 02:28:06.777835       1 common.go:320] StorageClass "local-ocs-backing" configured with MountDir "/mnt/local-storage/local-ocs-backing", HostDir "/mnt/local-storage/local-ocs-backing", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
I0505 02:28:06.777974       1 main.go:63] Loaded configuration: {StorageClassConfig:map[local-ocs-backing:{HostDir:/mnt/local-storage/local-ocs-backing MountDir:/mnt/local-storage/local-ocs-backing BlockCleanerCommand:[/scripts/quick_reset.sh] VolumeMode:Block FsType:}] NodeLabelsForPV:[] UseAlphaAPI:false UseJobForCleaning:false MinResyncPeriod:{Duration:5m0s} UseNodeNameOnly:false LabelsForPV:map[storage.openshift.com/local-volume-owner-name:ocs-backing storage.openshift.com/local-volume-owner-namespace:local-storage]}
I0505 02:28:06.778007       1 main.go:64] Ready to run...
W0505 02:28:06.778018       1 main.go:73] MY_NAMESPACE environment variable not set, will be set to default.
W0505 02:28:06.778027       1 main.go:79] JOB_CONTAINER_IMAGE environment variable not set.
I0505 02:28:06.778609       1 common.go:382] Creating client using in-cluster config
I0505 02:28:06.803697       1 main.go:85] Starting controller
I0505 02:28:06.803728       1 main.go:100] Starting metrics server at :8080
I0505 02:28:06.803815       1 controller.go:45] Initializing volume cache
I0505 02:28:07.006479       1 controller.go:108] Controller started
E0505 02:28:07.006588       1 discovery.go:201] Error reading directory: open /mnt/local-storage/local-ocs-backing: no such file or directory
I0505 02:28:17.007375       1 discovery.go:304] Found new volume at host path "/mnt/local-storage/local-ocs-backing/sdc" with capacity 1919816826880, creating Local PV "local-pv-7f5385f2", required volumeMode "Block"
I0505 02:28:17.026660       1 discovery.go:337] Created PV "local-pv-7f5385f2" for volume at "/mnt/local-storage/local-ocs-backing/sdc"
I0505 02:28:17.026720       1 cache.go:55] Added pv "local-pv-7f5385f2" to cache
I0505 02:28:17.036906       1 cache.go:64] Updated pv "local-pv-7f5385f2" to cache
^C

The PersistentVolumes show up:

$ oc get pv | grep ocs-backing
local-pv-4db9cb47                          1787Gi     RWO            Delete           Available                                                                              local-ocs-backing                        22m
local-pv-7f5385f2                          1787Gi     RWO            Delete           Available                                                                              local-ocs-backing                        22m
local-pv-a0747d99                          1787Gi     RWO            Delete           Available                                                                              local-ocs-backing                        22m

Create an OCS StorageCluster

  1. Edit the system:node ClusterRole so the nodes can manage VolumeAttachments (see the patch sketch after the snippet). Alternatively, you can set enable-controller-attach-detach to true on the kubelet.[2]

    $ oc edit clusterrole system\:node
    
    - apiGroups:
      - storage.k8s.io
      resources:
      - volumeattachments
      verbs:
      - get
      - create
      - update
      - delete
      - list
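
    If you would rather script this than use oc edit, a sketch with a JSON patch that appends the same rule (verify the resulting role afterwards; oc auth can-i create volumeattachments.storage.k8s.io --as=system:node:<node-ip> is one way to check):

    $ oc patch clusterrole system:node --type=json -p '[
        {"op": "add", "path": "/rules/-",
         "value": {"apiGroups": ["storage.k8s.io"],
                   "resources": ["volumeattachments"],
                   "verbs": ["get", "create", "update", "delete", "list"]}}]'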
  2. Do Section 1.2.6 in the official documentation to create a custom cluster.

    $ echo 'apiVersion: ocs.openshift.io/v1
    kind: StorageCluster
    metadata:
      name: ocs-storagecluster
      namespace: openshift-storage
    spec:
      monDataDirHostPath: /var/lib/rook
      storageDeviceSets:
      - config: {}
        count: 1
        dataPVCTemplate:
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 1787Gi
            storageClassName: local-ocs-backing
            volumeMode: Block
        name: ocs-deviceset
        placement: {}
        replica: 3
        resources: {}
      version: 4.3.0' | oc create -f -

    Notice that our storage size matches the PV size, which is not one of the preset sizes offered in the GUI. Also, the mons are hosted on the root disk under /var/lib/rook. You could alternatively deploy another PV to back them if you have something else available on your bare metal cluster; in IBM Cloud this could be any of several options. We did it this way for simplicity, as it was not critical to the POC.
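
    Once the StorageCluster is created, the device-set PVCs should bind to the three local PVs; a quick check (names will vary):

    $ oc get pvc -n openshift-storage | grep ocs-deviceset
    $ oc get pv | grep local-ocs-backing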

  3. Edit the DaemonSets and change all references to the kubelet location from /var/lib/kubelet to /var/data/kubelet.

     $ oc edit ds csi-cephfsplugin
     $ oc edit ds csi-rbdplugin

    Automated, this would look like:

     $ oc get ds csi-rbdplugin -o yaml | sed 's/var\/lib\/kubelet/var\/data\/kubelet/g' | oc apply -f -
     $ oc get ds csi-cephfsplugin -o yaml | sed 's/var\/lib\/kubelet/var\/data\/kubelet/g' | oc apply -f -
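
    To confirm the edit took and nothing still points at the old path, a quick check (a sketch; each count is across both DaemonSets):

     $ oc get ds csi-rbdplugin csi-cephfsplugin -o yaml | grep -c '/var/lib/kubelet'    # expect 0
     $ oc get ds csi-rbdplugin csi-cephfsplugin -o yaml | grep -c '/var/data/kubelet'   # expect a non-zero count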

Debugging

Sometimes the pods that prepare the OSDs (e.g. rook-ceph-osd-prepare-ocs-deviceset-0-0-zf9xk-jckrc) may recycle after failing and crashing a few times; their logs are worth checking.

The kubelet log on each node is very helpful: /var/log/kubelet.log.
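
A sketch for pulling both the OSD-prepare pod logs and the kubelet log without an interactive session (the pod and node names are examples from this cluster; add -c if oc asks for a container name):

$ oc logs -n openshift-storage rook-ceph-osd-prepare-ocs-deviceset-0-0-zf9xk-jckrc
$ oc debug node/10.194.150.204 -- chroot /host tail -n 100 /var/log/kubelet.log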

This error is persistent but benign:

May 11 21:30:34 kube-bqjdgjof0jub1eqc7830-cluster0-storage-00000d76 kubelet.service: E0511 21:30:34.616492   15432 goroutinemap.go:150] Operation for "/var/data/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/csi.sock" failed. No retries permitted until 2020-05-11 21:32:36.616461186 -0500 CDT m=+637666.488684071 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- failed to get plugin info using RPC GetInfo at socket /var/data/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/csi.sock, err: rpc error: code = Unimplemented desc = unknown service pluginregistration.Registration"

This error indicates you did not edit the clusterrole correctly, or did not do it at all:

May 11 10:56:11 kube-bqjdgjof0jub1eqc7830-cluster0-storage-00000d76 kubelet.service: E0511 10:56:11.498116   15432 nestedpendingoperations.go:20] Operation for "\"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000002-abcba85c-9395-11ea-abcf-52f7713ae97\"" failed. No retries permitted until 2020-05-11 10:58:13.498080928 -0500 CDT m=+599603.370303765 (durationBeforeRetry 2m2s). Error: "AttachVolume.Attach failed for volume \"pvc-3797850b-b025-46fc-a177-242b86483049\" (UniqueName: \"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^001-0011-openshift-storage-0000000000000002-abcba85c-9395-11ea-abcf-52f77713ae97\") from node \"10.194.150.204\" : kubernetes.io/csi: attacher.Attach failed: volumeattachments.storage.k8s.io is forbidden: User \"system:node:10.194.150.204\" cannot create resource \"volumeattachments\" in API group \"storage.k8s.io\" at the cluster scope: can only get individual resources of this type"