OPENSHIFT, IBM

Installing OpenShift Container Storage on IBM Cloud Bare Metal Red Hat OpenShift Kubernetes Service (ROKS)

#storage, #container

This feature set (deploying OCS on bare metal) has not gone GA yet and is still a work in progress.[1] It should be coming soon, but the main issues are that there is no documentation specific to this deployment scenario and that the kubelet home directory on IBM ROKS is not the same as on Red Hat OpenShift Container Platform.
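
A quick way to confirm the kubelet home directory on your own nodes (a sketch, assuming the root directory is set via the kubelet's --root-dir command-line flag rather than a config file) is to check the running process from a debug shell; on these ROKS nodes it should come back as /var/data/kubelet, which is why the CSI DaemonSets are edited later on:

$ oc debug node/10.194.150.196
sh-4.2# chroot /host
sh-4.2# ps -ef | grep kubelet | grep -o -- '--root-dir=[^ ]*'   # expect /var/data/kubelet on ROKS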

Install the OpenShift Container Storage Operator

  1. Do Section 1.1.1 in the official documentation.

  2. Skip Section 1.1.2 because this is not a typical install.

  3. Since these are RHEL-based nodes, do Section 1.1.3.

    In our case everything was already in place; we just needed to allow containers to use CephFS through SELinux (a sketch for applying this across all three storage nodes follows the command).

    # setsebool -P container_use_cephfs on
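
    If you would rather not open a shell on each node by hand, a minimal sketch (assuming the first three nodes listed in the next section are the storage nodes) that applies the boolean through oc debug:

    $ for n in 10.194.150.196 10.194.150.201 10.194.150.204; do
    >   oc debug node/$n -- chroot /host setsebool -P container_use_cephfs on
    > done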

Prepare the Raw Storage on Each Node

These particular nodes had an SSD attached that was automatically picked up by multipath and that already had a filesystem on it. We need to remove the multipath mapping and clear the disk using wipefs. To do this, we add the wwid of the local disk to the blacklist, per the RHEL guidelines for local SCSI storage devices, and then wipe the volume data.

References:

  • https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/dm_multipath/index#ignore_localdisk_procedure

  • https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_device_mapper_multipath/index#proc_ignoring-local-disks-for-multipathing-setting-up-dm-multipath

    1. Identify the nodes. We are working with the first three. ROKS doesn’t make it easy to see which ones we need; you may know this from the provisioning order, or you may be able to run describe on the nodes (see the sketch after the listing).

      $ oc get nodes
      NAME             STATUS   ROLES           AGE     VERSION
      10.194.150.196   Ready    master,worker   7h44m   v1.16.2
      10.194.150.201   Ready    master,worker   7h25m   v1.16.2
      10.194.150.204   Ready    master,worker   7h44m   v1.16.2
      10.194.150.205   Ready    master,worker   7d3h    v1.16.2
      10.194.150.232   Ready    master,worker   7d3h    v1.16.2
      10.194.150.244   Ready    master,worker   7d1h    v1.16.2
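
      One possible check (a sketch; the worker-pool and machine-type details that IBM Cloud exposes in the node labels may vary by cluster version):

      $ oc describe node 10.194.150.196 | grep -i -E 'worker-pool|machine-type'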
    2. Drop into a debug pod to get a shell on the node. There are several different ways to see the drive device.

      $ oc debug node/10.194.150.196
      Starting pod/10194150196-debug ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.194.150.196
      If you don't see a command prompt, try pressing enter.
      sh-4.2# chroot /host
      sh-4.2# lsblk
      NAME                                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
      sda                                      8:0    0   1.8T  0 disk
      |-sda1                                   8:1    0   256M  0 part  /boot
      |-sda2                                   8:2    0     1G  0 part
      `-sda3                                   8:3    0   1.8T  0 part  /
      sdb                                      8:16   0 893.8G  0 disk
      `-sdb1                                   8:17   0 893.8G  0 part  /var/data
      sdc                                      8:32   0   1.8T  0 disk
      `-3600605b00b5982502642ac600341c0bb    253:0    0   1.8T  0 mpath
        `-3600605b00b5982502642ac600341c0bb1 253:1    0   1.8T  0 part
      
      sh-4.2# ls /dev/mapper/
      3600605b00b5982502642ac600341c0bb  3600605b00b5982502642ac600341c0bb1  control
      
      sh-4.2# ls -lah /dev/sdc
      brw-rw----. 1 root disk 8, 32 May  4 12:24 /dev/sdc
      
      sh-4.2# ls -lah /dev/mapper/3600605b00b5982502642ac600341c0bb1
      lrwxrwxrwx. 1 root root 7 May  4 12:24 /dev/mapper/3600605b00b5982502642ac600341c0bb1 -> ../dm-1
      
      sh-4.2# dmsetup ls
      3600605b00b5982502642ac600341c0bb1	(253:1)
      3600605b00b5982502642ac600341c0bb	(253:0)
      
      sh-4.2# multipath -ll
      3600605b00b5982502642ac600341c0bb dm-0 AVAGO   ,MR9361-8i
      size=1.7T features='0' hwhandler='0' wp=rw
      `-+- policy='service-time 0' prio=1 status=active
        `- 0:2:2:0 sdc 8:32 active ready running
    3. Check the wwids and multipath.conf file.

      sh-4.2# cat /etc/multipath/wwids
      # Multipath wwids, Version : 1.0
      # NOTE: This file is automatically maintained by multipath and multipathd.
      # You should not need to edit this file in normal circumstances.
      #
      # Valid WWIDs:
      /3600605b00b5982502642ac600341c0bb/
      
      sh-4.2# cat /etc/multipath.conf
      defaults {
      user_friendly_names no
      max_fds max
      flush_on_last_del yes
      queue_without_daemon no
      dev_loss_tmo infinity
      fast_io_fail_tmo 5
      }
      # All data under blacklist must be specific to your system.
      blacklist {
      wwid "SAdaptec*"
      devnode "^hd[a-z]"
      devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
      devnode "^cciss.*"
      }
      devices {
      device {
      vendor "NETAPP"
      product "LUN"
      path_grouping_policy group_by_prio
      features "3 queue_if_no_path pg_init_retries 50"
      prio "alua"
      path_checker tur
      failback immediate
      path_selector "round-robin 0"
      hardware_handler "1 alua"
      rr_weight uniform
      rr_min_io 128
      uid_attribute ID_SERIAL
      }
      }
    4. Note that you cannot use the /dev/sdc device name yet because device-mapper/multipath has claimed the disk.

      sh-4.2# mkdir /tmp/testa
      
      sh-4.2# mount /dev/sdc /tmp/testa
      mount: /dev/sdc is already mounted or /tmp/testa busy
      
      sh-4.2# mount /dev/mapper/3600605b00b5982502642ac600341c0bb1 /tmp/testa
      
      sh-4.2# umount /tmp/testa
    5. Back up multipath.conf and edit it to add your device wwid to the blacklist.

      sh-4.2# cp /etc/multipath.conf{,.bk}
      
      sh-4.2# vi /etc/multipath.conf
      
      sh-4.2# diff /etc/multipath.conf{,.bk}
      11d10
      < wwid 3600605b00b5982502642ac600341c0bb
      
      sh-4.2# cat /etc/multipath.conf
      defaults {
      user_friendly_names no
      max_fds max
      flush_on_last_del yes
      queue_without_daemon no
      dev_loss_tmo infinity
      fast_io_fail_tmo 5
      }
      # All data under blacklist must be specific to your system.
      blacklist {
      wwid 3600605b00b5982502642ac600341c0bb
      wwid "SAdaptec*"
      devnode "^hd[a-z]"
      devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
      devnode "^cciss.*"
      }
      devices {
      device {
      vendor "NETAPP"
      product "LUN"
      path_grouping_policy group_by_prio
      features "3 queue_if_no_path pg_init_retries 50"
      prio "alua"
      path_checker tur
      failback immediate
      path_selector "round-robin 0"
      hardware_handler "1 alua"
      rr_weight uniform
      rr_min_io 128
      uid_attribute ID_SERIAL
      }
      }

      Nothing takes effect until you reload the multipathd service.

      sh-4.2# lsblk
      NAME                                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
      sda                                      8:0    0   1.8T  0 disk
      |-sda1                                   8:1    0   256M  0 part  /boot
      |-sda2                                   8:2    0     1G  0 part
      `-sda3                                   8:3    0   1.8T  0 part  /
      sdb                                      8:16   0 893.8G  0 disk
      `-sdb1                                   8:17   0 893.8G  0 part  /var/data
      sdc                                      8:32   0   1.8T  0 disk
      |-sdc1                                   8:33   0   1.8T  0 part
      `-3600605b00b5982502642ac600341c0bb    253:0    0   1.8T  0 mpath
        `-3600605b00b5982502642ac600341c0bb1 253:1    0   1.8T  0 part
      
      sh-4.2# systemctl reload multipathd.service
      
      sh-4.2# lsblk
      NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
      sda      8:0    0   1.8T  0 disk
      |-sda1   8:1    0   256M  0 part /boot
      |-sda2   8:2    0     1G  0 part
      `-sda3   8:3    0   1.8T  0 part /
      sdb      8:16   0 893.8G  0 disk
      `-sdb1   8:17   0 893.8G  0 part /var/data
      sdc      8:32   0   1.8T  0 disk
      `-sdc1   8:33   0   1.8T  0 part
      
      sh-4.2# mount /dev/sdc1 /tmp/testa
      
      sh-4.2# umount /tmp/testa
    6. Wipe the volume data (it came with a single XFS partition).

      sh-4.2# wipefs /dev/sdc
      offset               type
       ----------------------------------------------------------------
      0x1fe                dos   [partition table]
      
      
      sh-4.2# wipefs /dev/sdc --all --force
      /dev/sdc: 2 bytes were erased at offset 0x000001fe (dos): 55 aa
      /dev/sdc: calling ioctl to re-read partition table: Success
      
      sh-4.2# wipefs /dev/sdc
      
      sh-4.2# exit
      exit
    7. To repeat this on the remaining nodes, it can be scripted:

      cp /etc/multipath.conf{,.bk}
      sed -i 's/wwid \"SAdaptec\*\"/wwid \"SAdaptec\*\"\nwwid '$(multipath -ll | grep AVAGO | awk '{print $1}')'/g' /etc/multipath.conf
      systemctl reload multipathd.service
      wipefs /dev/sdc --all --force
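
      A slightly more defensive variant (a sketch, assuming exactly one AVAGO-backed multipath device per node and that the disk behind it is /dev/sdc, as on these nodes) only edits the file when the wwid is not already blacklisted:

      WWID=$(multipath -ll | awk '/AVAGO/{print $1}')        # wwid of the local RAID volume
      if [ -n "$WWID" ]; then
          cp /etc/multipath.conf /etc/multipath.conf.bk      # keep a backup
          # Append the wwid inside the blacklist section only if it is not there already.
          grep -q "wwid $WWID" /etc/multipath.conf || \
              sed -i "/^blacklist {/a wwid $WWID" /etc/multipath.conf
          systemctl reload multipathd.service                # drop the mpath mapping
          wipefs --all --force /dev/sdc                      # clear the partition table
      fi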
Note
These RHEL systems are "ephemeral": if you "reset" them using the ibmcloud command, you will lose any alteration to the underlying OS. Further work and testing is needed if the machines and their storage must be usable correctly after a "reset". It is unclear to me whether there is a "Reboot" option that does not "reset" the OS. For the purposes of this POC it does not appear to matter, as long as we do not "Reset" the machines.

Install the Local Storage Operator

  1. Do Section 1.2.2 in the official documentation.

  2. Create the LocalVolume object:

    echo 'apiVersion: local.storage.openshift.io/v1
    kind: LocalVolume
    metadata:
      name: ocs-backing
      namespace: local-storage
    spec:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - 10.194.150.196
            - 10.194.150.201
            - 10.194.150.204
      storageClassDevices:
      - devicePaths:
        - /dev/sdc
        storageClassName: local-ocs-backing
        volumeMode: Block' | oc create -f -
Note
The best practice suggested by the OCS documentation is to use the /dev/disk/by-id name of the disk. That was not done in this POC.
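
If you want to follow that practice, the stable name can be found from a debug shell on each node (a sketch; the exact symlink name will differ per disk) and then referenced as /dev/disk/by-id/<name> in devicePaths instead of /dev/sdc:

sh-4.2# ls -l /dev/disk/by-id/ | grep -w sdc

The entry pointing at ../../sdc (typically a scsi-<wwid> or wwn-<wwid> symlink) is the one to use.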

Results:

$ oc get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
local-ssd-file-platinum-kafka-local-diskmaker-77dz5     1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-diskmaker-jzm6g     1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-diskmaker-qnmb6     1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-provisioner-66n4b   1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-provisioner-gnr2x   1/1     Running   0          17h
local-ssd-file-platinum-kafka-local-provisioner-kzbfw   1/1     Running   0          17h
local-storage-operator-5df7f84f45-kttf4                 1/1     Running   0          14h
ocs-backing-local-diskmaker-fkbqn                       1/1     Running   0          4s
ocs-backing-local-diskmaker-l82n6                       1/1     Running   0          4s
ocs-backing-local-diskmaker-zhl9d                       1/1     Running   0          4s
ocs-backing-local-provisioner-5sqq4                     1/1     Running   0          4s
ocs-backing-local-provisioner-n9mxg                     1/1     Running   0          4s
ocs-backing-local-provisioner-wz4qk                     1/1     Running   0          4s

Note that the "diskmaker" pods don’t take any action until a PVC is created to match the PV, as their logs show.

$ oc logs -f ocs-backing-local-diskmaker-fkbqn
I0505 02:28:06.808416       1 diskmaker.go:23] Go Version: go1.12.12
I0505 02:28:06.808796       1 diskmaker.go:24] Go OS/Arch: linux/amd64
^C

The "provisioner" logs show the capture of the raw block device.

$ oc logs -f ocs-backing-local-provisioner-5sqq4
I0505 02:28:06.851623       1 common.go:320] StorageClass "local-ocs-backing" configured with MountDir "/mnt/local-storage/local-ocs-backing", HostDir "/mnt/local-storage/local-ocs-backing", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
I0505 02:28:06.851784       1 main.go:63] Loaded configuration: {StorageClassConfig:map[local-ocs-backing:{HostDir:/mnt/local-storage/local-ocs-backing MountDir:/mnt/local-storage/local-ocs-backing BlockCleanerCommand:[/scripts/quick_reset.sh] VolumeMode:Block FsType:}] NodeLabelsForPV:[] UseAlphaAPI:false UseJobForCleaning:false MinResyncPeriod:{Duration:5m0s} UseNodeNameOnly:false LabelsForPV:map[storage.openshift.com/local-volume-owner-name:ocs-backing storage.openshift.com/local-volume-owner-namespace:local-storage]}
I0505 02:28:06.851817       1 main.go:64] Ready to run...
W0505 02:28:06.851829       1 main.go:73] MY_NAMESPACE environment variable not set, will be set to default.
W0505 02:28:06.851839       1 main.go:79] JOB_CONTAINER_IMAGE environment variable not set.
I0505 02:28:06.852418       1 common.go:382] Creating client using in-cluster config
I0505 02:28:06.884228       1 main.go:85] Starting controller
I0505 02:28:06.884272       1 main.go:100] Starting metrics server at :8080
I0505 02:28:06.884403       1 controller.go:45] Initializing volume cache
I0505 02:28:07.087009       1 controller.go:108] Controller started
E0505 02:28:07.087143       1 discovery.go:201] Error reading directory: open /mnt/local-storage/local-ocs-backing: no such file or directory
I0505 02:28:17.087958       1 discovery.go:304] Found new volume at host path "/mnt/local-storage/local-ocs-backing/sdc" with capacity 1919816826880, creating Local PV "local-pv-a0747d99", required volumeMode "Block"
I0505 02:28:17.102694       1 discovery.go:337] Created PV "local-pv-a0747d99" for volume at "/mnt/local-storage/local-ocs-backing/sdc"
I0505 02:28:17.102824       1 cache.go:55] Added pv "local-pv-a0747d99" to cache
I0505 02:28:17.116552       1 cache.go:64] Updated pv "local-pv-a0747d99" to cache
^C
$ oc logs -f ocs-backing-local-provisioner-n9mxg
I0505 02:28:06.628631       1 common.go:320] StorageClass "local-ocs-backing" configured with MountDir "/mnt/local-storage/local-ocs-backing", HostDir "/mnt/local-storage/local-ocs-backing", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
I0505 02:28:06.628779       1 main.go:63] Loaded configuration: {StorageClassConfig:map[local-ocs-backing:{HostDir:/mnt/local-storage/local-ocs-backing MountDir:/mnt/local-storage/local-ocs-backing BlockCleanerCommand:[/scripts/quick_reset.sh] VolumeMode:Block FsType:}] NodeLabelsForPV:[] UseAlphaAPI:false UseJobForCleaning:false MinResyncPeriod:{Duration:5m0s} UseNodeNameOnly:false LabelsForPV:map[storage.openshift.com/local-volume-owner-name:ocs-backing storage.openshift.com/local-volume-owner-namespace:local-storage]}
I0505 02:28:06.628809       1 main.go:64] Ready to run...
W0505 02:28:06.628820       1 main.go:73] MY_NAMESPACE environment variable not set, will be set to default.
W0505 02:28:06.628831       1 main.go:79] JOB_CONTAINER_IMAGE environment variable not set.
I0505 02:28:06.629393       1 common.go:382] Creating client using in-cluster config
I0505 02:28:06.658010       1 main.go:85] Starting controller
I0505 02:28:06.658072       1 main.go:100] Starting metrics server at :8080
I0505 02:28:06.658185       1 controller.go:45] Initializing volume cache
I0505 02:28:06.860545       1 controller.go:108] Controller started
E0505 02:28:06.860666       1 discovery.go:201] Error reading directory: open /mnt/local-storage/local-ocs-backing: no such file or directory
I0505 02:28:16.861298       1 discovery.go:304] Found new volume at host path "/mnt/local-storage/local-ocs-backing/sdc" with capacity 1919816826880, creating Local PV "local-pv-4db9cb47", required volumeMode "Block"
I0505 02:28:16.877592       1 discovery.go:337] Created PV "local-pv-4db9cb47" for volume at "/mnt/local-storage/local-ocs-backing/sdc"
I0505 02:28:16.877745       1 cache.go:55] Added pv "local-pv-4db9cb47" to cache
I0505 02:28:16.887963       1 cache.go:64] Updated pv "local-pv-4db9cb47" to cache
^C
$ oc logs -f ocs-backing-local-provisioner-wz4qk
I0505 02:28:06.777835       1 common.go:320] StorageClass "local-ocs-backing" configured with MountDir "/mnt/local-storage/local-ocs-backing", HostDir "/mnt/local-storage/local-ocs-backing", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
I0505 02:28:06.777974       1 main.go:63] Loaded configuration: {StorageClassConfig:map[local-ocs-backing:{HostDir:/mnt/local-storage/local-ocs-backing MountDir:/mnt/local-storage/local-ocs-backing BlockCleanerCommand:[/scripts/quick_reset.sh] VolumeMode:Block FsType:}] NodeLabelsForPV:[] UseAlphaAPI:false UseJobForCleaning:false MinResyncPeriod:{Duration:5m0s} UseNodeNameOnly:false LabelsForPV:map[storage.openshift.com/local-volume-owner-name:ocs-backing storage.openshift.com/local-volume-owner-namespace:local-storage]}
I0505 02:28:06.778007       1 main.go:64] Ready to run...
W0505 02:28:06.778018       1 main.go:73] MY_NAMESPACE environment variable not set, will be set to default.
W0505 02:28:06.778027       1 main.go:79] JOB_CONTAINER_IMAGE environment variable not set.
I0505 02:28:06.778609       1 common.go:382] Creating client using in-cluster config
I0505 02:28:06.803697       1 main.go:85] Starting controller
I0505 02:28:06.803728       1 main.go:100] Starting metrics server at :8080
I0505 02:28:06.803815       1 controller.go:45] Initializing volume cache
I0505 02:28:07.006479       1 controller.go:108] Controller started
E0505 02:28:07.006588       1 discovery.go:201] Error reading directory: open /mnt/local-storage/local-ocs-backing: no such file or directory
I0505 02:28:17.007375       1 discovery.go:304] Found new volume at host path "/mnt/local-storage/local-ocs-backing/sdc" with capacity 1919816826880, creating Local PV "local-pv-7f5385f2", required volumeMode "Block"
I0505 02:28:17.026660       1 discovery.go:337] Created PV "local-pv-7f5385f2" for volume at "/mnt/local-storage/local-ocs-backing/sdc"
I0505 02:28:17.026720       1 cache.go:55] Added pv "local-pv-7f5385f2" to cache
I0505 02:28:17.036906       1 cache.go:64] Updated pv "local-pv-7f5385f2" to cache
^C

The PersistentVolumes show up:

$ oc get pv | grep ocs-backing
local-pv-4db9cb47                          1787Gi     RWO            Delete           Available                                                                              local-ocs-backing                        22m
local-pv-7f5385f2                          1787Gi     RWO            Delete           Available                                                                              local-ocs-backing                        22m
local-pv-a0747d99                          1787Gi     RWO            Delete           Available                                                                              local-ocs-backing                        22m

Create an OCS StorageCluster

  1. Edit the system:node ClusterRole so the nodes can manage VolumeAttachments (see the patch sketch after the snippet). Alternatively, you can set enable-controller-attach-detach to true on the kubelet.[2]

    $ oc edit clusterrole system\:node
    
    - apiGroups:
      - storage.k8s.io
      resources:
      - volumeattachments
      verbs:
      - get
      - create
      - update
      - delete
      - list
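
    If you would rather script this than use oc edit, a sketch with a JSON patch that appends the same rule (verify the resulting role afterwards; oc auth can-i create volumeattachments.storage.k8s.io --as=system:node:<node-ip> is one way to check):

    $ oc patch clusterrole system:node --type=json -p '[
        {"op": "add", "path": "/rules/-",
         "value": {"apiGroups": ["storage.k8s.io"],
                   "resources": ["volumeattachments"],
                   "verbs": ["get", "create", "update", "delete", "list"]}}]'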
  2. Do Section 1.2.6 in the official documentation to create a custom cluster.

    $ echo 'apiVersion: ocs.openshift.io/v1
    kind: StorageCluster
    metadata:
      name: ocs-storagecluster
      namespace: openshift-storage
    spec:
      monDataDirHostPath: /var/lib/rook
      storageDeviceSets:
      - config: {}
        count: 1
        dataPVCTemplate:
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 1787Gi
            storageClassName: local-ocs-backing
            volumeMode: Block
        name: ocs-deviceset
        placement: {}
        replica: 3
        resources: {}
      version: 4.3.0' | oc create -f -

    Notice that our storage size matches the PV size, which is not one of the preset sizes offered in the GUI. Also, the mons are hosted on the root disk under /var/lib/rook. You could alternatively deploy another PV to back them if you have something else available on your bare metal cluster; in IBM Cloud this could be any of several options. We did it this way for simplicity, as it was not critical to the POC.
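
    Once the StorageCluster is created, the device-set PVCs should bind to the three local PVs; a quick check (names will vary):

    $ oc get pvc -n openshift-storage | grep ocs-deviceset
    $ oc get pv | grep local-ocs-backing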

  3. Edit the DaemonSets and change all references to the kubelet location from /var/lib/kubelet to /var/data/kubelet.

     $ oc edit ds csi-cephfsplugin
     $ oc edit ds csi-rbdplugin

    Automated, this would look like:

     $ oc get ds csi-rbdplugin -o yaml | sed 's/var\/lib\/kubelet/var\/data\/kubelet/g' | oc apply -f -
     $ oc get ds csi-cephfsplugin -o yaml | sed 's/var\/lib\/kubelet/var\/data\/kubelet/g' | oc apply -f -
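
    To confirm the edit took and nothing still points at the old path, a quick check (a sketch; each count is across both DaemonSets):

     $ oc get ds csi-rbdplugin csi-cephfsplugin -o yaml | grep -c '/var/lib/kubelet'    # expect 0
     $ oc get ds csi-rbdplugin csi-cephfsplugin -o yaml | grep -c '/var/data/kubelet'   # expect a non-zero count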

Debugging

Sometimes the pods that prepare the OSDs (e.g. rook-ceph-osd-prepare-ocs-deviceset-0-0-zf9xk-jckrc) may recycle after failing and crashing a few times; their logs are worth checking.

The kubelet log on each node is very helpful: /var/log/kubelet.log.
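
A sketch for pulling both the OSD-prepare pod logs and the kubelet log without an interactive session (the pod and node names are examples from this cluster; add -c if oc asks for a container name):

$ oc logs -n openshift-storage rook-ceph-osd-prepare-ocs-deviceset-0-0-zf9xk-jckrc
$ oc debug node/10.194.150.204 -- chroot /host tail -n 100 /var/log/kubelet.log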

This error is persistent but benign:

May 11 21:30:34 kube-bqjdgjof0jub1eqc7830-cluster0-storage-00000d76 kubelet.service: E0511 21:30:34.616492   15432 goroutinemap.go:150] Operation for "/var/data/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/csi.sock" failed. No retries permitted until 2020-05-11 21:32:36.616461186 -0500 CDT m=+637666.488684071 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- failed to get plugin info using RPC GetInfo at socket /var/data/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/csi.sock, err: rpc error: code = Unimplemented desc = unknown service pluginregistration.Registration"

This error indicates you did not edit the clusterrole correctly, or did not do it at all:

May 11 10:56:11 kube-bqjdgjof0jub1eqc7830-cluster0-storage-00000d76 kubelet.service: E0511 10:56:11.498116   15432 nestedpendingoperations.go:20] Operation for "\"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000002-abcba85c-9395-11ea-abcf-52f7713ae97\"" failed. No retries permitted until 2020-05-11 10:58:13.498080928 -0500 CDT m=+599603.370303765 (durationBeforeRetry 2m2s). Error: "AttachVolume.Attach failed for volume \"pvc-3797850b-b025-46fc-a177-242b86483049\" (UniqueName: \"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^001-0011-openshift-storage-0000000000000002-abcba85c-9395-11ea-abcf-52f77713ae97\") from node \"10.194.150.204\" : kubernetes.io/csi: attacher.Attach failed: volumeattachments.storage.k8s.io is forbidden: User \"system:node:10.194.150.204\" cannot create resource \"volumeattachments\" in API group \"storage.k8s.io\" at the cluster scope: can only get individual resources of this type"