
Installing GlusterFS on OpenShift

#rhel, #glusterfs, #zone

OpenShift 3.11.170
OpenShift Container Storage 3.11.5

This post works through some errors I hit after following the documentation for a normal OpenShift deploy with GlusterFS (OpenShift Container Storage).

On my first run, I hit this for some odd reason:

TASK [cockpit : Install cockpit-ws] **************************************************************************************************************
FAILED - RETRYING: Install cockpit-ws (3 retries left).
FAILED - RETRYING: Install cockpit-ws (2 retries left).
FAILED - RETRYING: Install cockpit-ws (1 retries left).
fatal: [master0.ocp.example.com]: FAILED! => {"attempts": 3, "changed": false, "msg": "https://cdn.redhat.com/content/dist/rhel/server/7/7Server/x86_64/os/Packages/g/gsettings-desktop-schemas-3.28.0-2.el7.x86_64.rpm: [Errno 14] HTTPS Error 403 - Forbidden\nTrying other mirror.\nTo address this issue please refer to the below knowledge base article\n\nhttps://access.redhat.com/solutions/69319\n\nIf above article doesn't help to resolve this issue please open a ticket with Red Hat Support.\n\nhttps://cdn.redhat.com/content/dist/rhel/server/7/7Server/x86_64/extras/os/Packages/c/cockpit-docker-195-1.el7.x86_64.rpm: [Errno 14] HTTPS Error 403 - Forbidden\nTrying other mirror.\nhttps://cdn.redhat.com/content/dist/rhel/server/7/7Server/x86_64/os/Packages/c/cockpit-bridge-195.1-1.el7.x86_64.rpm: [Errno 14] HTTPS Error 403 - Forbidden\nTrying other mirror.\nhttps://cdn.redhat.com/content/dist/rhel/server/7/7Server/x86_64/os/Packages/l/libmodman-2.0.1-8.el7.x86_64.rpm: [Errno 14] HTTPS Error 403 - Forbidden\nTrying other mirror.\nhttps://cdn.redhat.com/content/dist/rhel/server/7/7Server/x86_64/os/Packages/t/trousers-0.3.14-2.el7.x86_64.rpm: [Errno 14] HTTPS Error 403 - Forbidden\nTrying other mirror.\n\n\nError downloading packages:\n  cockpit-docker-195-1.el7.x86_64: [Errno 256] No more mirrors to try.\n  cockpit-bridge-195.1-1.el7.x86_64: [Errno 256] No more mirrors to try.\n  trousers-0.3.14-2.el7.x86_64: [Errno 256] No more mirrors to try.\n  gsettings-desktop-schemas-3.28.0-2.el7.x86_64: [Errno 256] No more mirrors to try.\n  libmodman-2.0.1-8.el7.x86_64: [Errno 256] No more mirrors to try.\n\n", "rc": 1, "results": ["Loaded plugins: product-id, search-disabled-repos, subscription-manager\nResolving Dependencies\n--> Running transaction check\n---> Package cockpit-bridge.x86_64 0:195.1-1.el7 will be installed\n--> Processing Dependency: glib-networking for package: cockpit-bridge-195.1-1.el7.x86_64\n--> Processing Dependency: libjson-glib-1.0.so.0()(64bit) for package: cockpit-bridge-195.1-1.el7.x86_64\n---> Package cockpit-docker.x86_64 0:195-1.el7 will be installed\n---> Package cockpit-kubernetes.x86_64 0:195-2.rhaos.el7 will be installed\n---> Package cockpit-system.noarch 0:195.1-1.el7 will be installed\n---> Package cockpit-ws.x86_64 0:195.1-1.el7 will be installed\n--> Running transaction check\n---> Package glib-networking.x86_64 0:2.56.1-1.el7 will be installed\n--> Processing Dependency: gsettings-desktop-schemas for package: glib-networking-2.56.1-1.el7.x86_64\n--> Processing Dependency: libgnutls.so.28(GNUTLS_1_4)(64bit) for package: glib-networking-2.56.1-1.el7.x86_64\n--> Processing Dependency: libgnutls.so.28(GNUTLS_2_12)(64bit) for package: glib-networking-2.56.1-1.el7.x86_64\n--> Processing Dependency: libgnutls.so.28(GNUTLS_3_0_0)(64bit) for package: glib-networking-2.56.1-1.el7.x86_64\n--> Processing Dependency: libgnutls.so.28(GNUTLS_3_1_0)(64bit) for package: glib-networking-2.56.1-1.el7.x86_64\n--> Processing Dependency: libgnutls.so.28()(64bit) for package: glib-networking-2.56.1-1.el7.x86_64\n--> Processing Dependency: libproxy.so.1()(64bit) for package: glib-networking-2.56.1-1.el7.x86_64\n---> Package json-glib.x86_64 0:1.4.2-2.el7 will be installed\n--> Running transaction check\n---> Package gnutls.x86_64 0:3.3.29-9.el7_6 will be installed\n--> Processing Dependency: trousers >= 0.3.11.2 for package: gnutls-3.3.29-9.el7_6.x86_64\n--> Processing Dependency: libhogweed.so.2()(64bit) for package: gnutls-3.3.29-9.el7_6.x86_64\n--> Processing Dependency: 
libnettle.so.4()(64bit) for package: gnutls-3.3.29-9.el7_6.x86_64\n---> Package gsettings-desktop-schemas.x86_64 0:3.28.0-2.el7 will be installed\n---> Package libproxy.x86_64 0:0.4.11-11.el7 will be installed\n--> Processing Dependency: libmodman.so.1()(64bit) for package: libproxy-0.4.11-11.el7.x86_64\n--> Running transaction check\n---> Package libmodman.x86_64 0:2.0.1-8.el7 will be installed\n---> Package nettle.x86_64 0:2.7.1-8.el7 will be installed\n---> Package trousers.x86_64 0:0.3.14-2.el7 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package               Arch   Version         Repository                   Size\n================================================================================\nInstalling:\n cockpit-bridge        x86_64 195.1-1.el7     rhel-7-server-rpms          552 k\n cockpit-docker        x86_64 195-1.el7       rhel-7-server-extras-rpms   371 k\n cockpit-kubernetes    x86_64 195-2.rhaos.el7 rhel-7-server-ose-3.11-rpms 3.7 M\n cockpit-system        noarch 195.1-1.el7     rhel-7-server-rpms          1.6 M\n cockpit-ws            x86_64 195.1-1.el7     rhel-7-server-rpms          805 k\nInstalling for dependencies:\n glib-networking       x86_64 2.56.1-1.el7    rhel-7-server-rpms          145 k\n gnutls                x86_64 3.3.29-9.el7_6  rhel-7-server-rpms          681 k\n gsettings-desktop-schemas\n                       x86_64 3.28.0-2.el7    rhel-7-server-rpms          605 k\n json-glib             x86_64 1.4.2-2.el7     rhel-7-server-rpms          134 k\n libmodman             x86_64 2.0.1-8.el7     rhel-7-server-rpms           28 k\n libproxy              x86_64 0.4.11-11.el7   rhel-7-server-rpms           65 k\n nettle                x86_64 2.7.1-8.el7     rhel-7-server-rpms          327 k\n trousers              x86_64 0.3.14-2.el7    rhel-7-server-rpms          289 k\n\nTransaction Summary\n================================================================================\nInstall  5 Packages (+8 Dependent packages)\n\nTotal size: 9.2 M\nTotal download size: 1.8 M\nInstalled size: 23 M\nDownloading packages:\nDelta RPMs disabled because /usr/bin/applydeltarpm not installed.\n"]}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster-mid.retry

This is a fairly easy one to resolve. Something awkward must have happened during the initial node subscription process; these nodes were brand new and should not have had any problem. Luckily the error above pointed me to the right knowledge base article. I simply needed to refresh the subscription.

[root@master0 ~]# subscription-manager refresh
1 local certificate has been deleted.
All local data refreshed

Then I went ahead and installed the packages from that step manually, to validate that the fix worked:

[root@master0 ~]# yum install cockpit-docker cockpit-kubernetes cockpit-bridge cockpit-ws cockpit-system

Installed:
  cockpit-bridge.x86_64 0:195.1-1.el7           cockpit-docker.x86_64 0:195-1.el7           cockpit-kubernetes.x86_64 0:195-2.rhaos.el7
  cockpit-system.noarch 0:195.1-1.el7           cockpit-ws.x86_64 0:195.1-1.el7

Dependency Installed:
  glib-networking.x86_64 0:2.56.1-1.el7          gnutls.x86_64 0:3.3.29-9.el7_6          gsettings-desktop-schemas.x86_64 0:3.28.0-2.el7
  json-glib.x86_64 0:1.4.2-2.el7                 libmodman.x86_64 0:2.0.1-8.el7          libproxy.x86_64 0:0.4.11-11.el7
  nettle.x86_64 0:2.7.1-8.el7                    trousers.x86_64 0:0.3.14-2.el7

Complete!

There was no way I wanted to restart the playbook from the beginning, so I checked and found that the failed task was the last one in the control plane phase. A little experience told me that I still needed to keep the first playbook (init/main.yml) imported so that all my variables get set correctly.

[root@bastion ~]# cat /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml
---
- import_playbook: init/main.yml
  vars:
    l_prereq_check_hosts: "oo_nodes_to_config"

- import_playbook: openshift-checks/private/install.yml

- import_playbook: openshift-node/private/bootstrap.yml

- import_playbook: common/private/control_plane.yml

- import_playbook: openshift-node/private/join.yml

- import_playbook: common/private/components.yml
[root@bastion ~]# tail -1 /usr/share/ansible/openshift-ansible/playbooks/common/private/control_plane.yml
- import_playbook: ../../openshift-master/private/additional_config.yml
[root@bastion ~]# cat /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster-mid.yml
---
- import_playbook: init/main.yml
  vars:
    l_prereq_check_hosts: "oo_nodes_to_config"

- import_playbook: ../../openshift-master/private/additional_config.yml

- import_playbook: common/private/components.yml

So I ran my custom deploy_cluster-mid.yml playbook above. The invocation was just the usual ansible-playbook call pointed at the new file, roughly:
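
[root@bastion ~]# ansible-playbook -e @passwords.yaml -e @firewall-lb_vars.yaml /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster-mid.yml

On this run I actually did hit a Gluster-related error: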

TASK [openshift_storage_glusterfs : Load heketi topology] ****************************************************************************************
fatal: [master0.ocp.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-yto9HG/admin.kubeconfig", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-4n2hk", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "adminkey", "topology", "load", "--json=/tmp/openshift-glusterfs-ansible-yto9HG/topology.json", "2>&1"], "delta": "0:00:09.280402", "end": "2020-03-14 14:53:48.828321", "failed_when_result": true, "rc": 0, "start": "2020-03-14 14:53:39.547919", "stderr": "", "stderr_lines": [], "stdout": "Creating cluster ... ID: 9ba496323150424c4a94d922b8019e9d\n\tAllowing file volumes on cluster.\n\tAllowing block volumes on cluster.\n\tCreating node infra0.ocp.example.com ... Unable to create node: validation failed: zone: cannot be blank.\n\tCreating node infra1.ocp.example.com ... ID: 6bbcaf294d1eaf1dbb9fdb44bcb13fb1\n\t\tAdding device /dev/vdd ... OK\n\tCreating node infra2.ocp.example.com ... ID: fc9ed9233248e0be1078d4f1c4037f31\n\t\tAdding device /dev/vdd ... OK\n\tCreating node infra3.ocp.example.com ... ID: 364ca51878e0d974475e495a563e4cb0\n\t\tAdding device /dev/vdd ... OK", "stdout_lines": ["Creating cluster ... ID: 9ba496323150424c4a94d922b8019e9d", "\tAllowing file volumes on cluster.", "\tAllowing block volumes on cluster.", "\tCreating node infra0.ocp.example.com ... Unable to create node: validation failed: zone: cannot be blank.", "\tCreating node infra1.ocp.example.com ... ID: 6bbcaf294d1eaf1dbb9fdb44bcb13fb1", "\t\tAdding device /dev/vdd ... OK", "\tCreating node infra2.ocp.example.com ... ID: fc9ed9233248e0be1078d4f1c4037f31", "\t\tAdding device /dev/vdd ... OK", "\tCreating node infra3.ocp.example.com ... ID: 364ca51878e0d974475e495a563e4cb0", "\t\tAdding device /dev/vdd ... OK"]}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster-mid.retry

I started my research on this part:

Creating node infra0.ocp.example.com ... Unable to create node: validation failed: zone: cannot be blank.

Nothing useful showed up in a quick internet search, so I decided to take a look at the pods involved.

[root@master0 ~]# oc get pods
NAME                            READY     STATUS    RESTARTS   AGE
deploy-heketi-storage-1-4n2hk   1/1       Running   0          3h
glusterfs-storage-22c82         1/1       Running   0          3h
glusterfs-storage-frqnr         1/1       Running   0          3h
glusterfs-storage-g9p2h         1/1       Running   0          3h
glusterfs-storage-v4lj7         1/1       Running   0          52m

Looking at the logs of the heketi deploy pod, I can see this error close to the top:

[root@master0 ~]# oc logs -f deploy-heketi-storage-1-4n2hk
Setting up heketi database
No database file found
failed to dump db: Could not construct dump from DB: Could not construct dump from DB: Unable to access list
error: Unable to export db. DB contents may not be valid
HEKETI_PRE_REQUEST_VOLUME_OPTIONS is
modified HEKETI_PRE_REQUEST_VOLUME_OPTIONS is server.tcp-user-timeout 42
Heketi 9.0.0-9.el7rhgs
[heketi] INFO 2020/03/14 18:53:17 Loaded kubernetes executor
[heketi] INFO 2020/03/14 18:53:17 Pre Request Volume Options: server.tcp-user-timeout 42
[heketi] INFO 2020/03/14 18:53:17 Volumes per cluster limit is set to default value of 1000
[heketi] INFO 2020/03/14 18:53:17 Block: Auto Create Block Hosting Volume set to true
[heketi] INFO 2020/03/14 18:53:17 Block: New Block Hosting Volume size 5 GB
[heketi] INFO 2020/03/14 18:53:17 GlusterFS Application Loaded
[heketi] INFO 2020/03/14 18:53:17 Started Node Health Cache Monitor
[heketi] INFO 2020/03/14 18:53:17 Started background pending operations cleaner
Listening on port 8080
[heketi] INFO 2020/03/14 18:53:27 Starting Node Health Status refresh
[heketi] INFO 2020/03/14 18:53:27 Cleaned 0 nodes from health cache
[negroni] 2020-03-14T18:53:32Z | 200 | 	 614.248µs | localhost:8080 | GET /clusters
[negroni] 2020-03-14T18:53:40Z | 200 | 	 212.532µs | localhost:8080 | GET /clusters
[negroni] 2020-03-14T18:53:40Z | 201 | 	 9.019659ms | localhost:8080 | POST /clusters
[negroni] 2020-03-14T18:53:40Z | 400 | 	 868.709µs | localhost:8080 | POST /nodes
[heketi] ERROR 2020/03/14 18:53:40 heketi/apps/glusterfs/app_node.go:35:glusterfs.(*App).NodeAdd: validation failed: zone: cannot be blank.
[cmdexec] INFO 2020/03/14 18:53:40 Check Glusterd service status in node infra1.ocp.example.com
[kubeexec] DEBUG 2020/03/14 18:53:40 heketi/pkg/remoteexec/log/commandlog.go:34:log.(*CommandLogger).Before: Will run command [systemctl status glusterd] on [pod:glusterfs-storage-g9p2h c:glusterfs ns:glusterfs (from host:infra1.ocp.example.com selector:glusterfs-node)]
[kubeexec] DEBUG 2020/03/14 18:53:40 heketi/pkg/remoteexec/kube/exec.go:72:kube.ExecCommands: Current kube connection count: 0
[kubeexec] DEBUG 2020/03/14 18:53:40 heketi/pkg/remoteexec/log/commandlog.go:41:log.(*CommandLogger).Success: Ran command [systemctl status glusterd] on [pod:glusterfs-storage-g9p2h c:glusterfs ns:glusterfs (from host:infra1.ocp.example.com selector:glusterfs-node)]: Stdout filtered, Stderr filtered

There were some other errors in those logs that led me astray but turned out to be benign (I think). So I took a look at the gluster peer status on the pod for infra0 and on the other pods. When I saw that no peers had joined this one, I tried probing another node from it, but that did not work.

[root@master0 ~]# oc rsh glusterfs-storage-v4lj7
sh-4.2# gluster peer status
Number of Peers: 0
sh-4.2# gluster peer probe infra1.ocp.example.com
peer probe: failed: infra1.ocp.example.com is either already part of another cluster or having volumes configured

However, I could see that the other pods/nodes were already grouped together, likely because heketi had set up the other nodes correctly but not infra0. Probing infra0 from one of those other pods did work.

[root@master0 ~]# oc rsh glusterfs-storage-frqnr
sh-4.2# gluster peer probe infra0.ocp.example.com
peer probe: success.
sh-4.2# gluster peer status
Number of Peers: 3

Hostname: infra1.ocp.example.com
Uuid: 9a3bfaf8-2f46-460b-9d13-f33787c60a51
State: Peer in Cluster (Connected)

Hostname: 192.168.1.112
Uuid: 4338f436-49cb-4694-8340-9573842c6bb0
State: Peer in Cluster (Connected)

Hostname: infra0.ocp.example.com
Uuid: e91bbb22-f11b-4298-8550-b98e9d6ecbde
State: Peer in Cluster (Connected)
sh-4.2# exit
exit

These were the other errors in the gluster logs that led me astray; they did not seem to be part of this particular problem, though I’m not sure whether they indicate some other issue.

[2020-03-14 18:50:59.198609] E [rpc-transport.c:302:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/6.0/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2020-03-14 18:50:59.198637] W [rpc-transport.c:306:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2020-03-14 18:50:59.198657] W [rpcsvc.c:1991:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2020-03-14 18:50:59.198673] E [MSGID: 106244] [glusterd.c:1789:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport

So I tried deleting the gluster pod on infra0 to see if it would come back without any errors. No luck there.
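
The delete itself was nothing fancy, roughly this (hypothetical pod name; the glusterfs-storage DaemonSet recreates the pod on infra0 on its own):

[root@master0 ~]# oc delete pod <glusterfs-storage-pod-on-infra0> --namespace glusterfs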

[root@master0 ~]# oc get pods -o wide
NAME                            READY     STATUS    RESTARTS   AGE       IP               NODE                          NOMINATED NODE
deploy-heketi-storage-1-4n2hk   1/1       Running   0          2h        10.128.0.2       master1.ocp.example.com   <none>
glusterfs-storage-22c82         1/1       Running   0          3h        192.168.1.112   infra2.ocp.example.com    <none>
glusterfs-storage-frqnr         1/1       Running   0          3h        192.168.1.113   infra3.ocp.example.com    <none>
glusterfs-storage-g9p2h         1/1       Running   0          3h        192.168.1.111   infra1.ocp.example.com    <none>
glusterfs-storage-v4lj7         1/1       Running   0          43m       192.168.1.110   infra0.ocp.example.com    <none>
[root@master0 ~]# oc exec glusterfs-storage-22c82 -- gluster peer status
Number of Peers: 3

Hostname: infra1.ocp.example.com
Uuid: 9a3bfaf8-2f46-460b-9d13-f33787c60a51
State: Peer in Cluster (Connected)

Hostname: 192.168.1.113
Uuid: 0316a317-9d88-4be4-8d2e-e3e5f853abad
State: Peer in Cluster (Connected)
Other names:
infra3.ocp.example.com

Hostname: infra0.ocp.example.com
Uuid: e91bbb22-f11b-4298-8550-b98e9d6ecbde
State: Peer in Cluster (Connected)
[root@master0 ~]# oc exec glusterfs-storage-frqnr -- gluster peer status
Number of Peers: 3

Hostname: infra1.ocp.example.com
Uuid: 9a3bfaf8-2f46-460b-9d13-f33787c60a51
State: Peer in Cluster (Connected)

Hostname: 192.168.1.112
Uuid: 4338f436-49cb-4694-8340-9573842c6bb0
State: Peer in Cluster (Connected)

Hostname: infra0.ocp.example.com
Uuid: e91bbb22-f11b-4298-8550-b98e9d6ecbde
State: Peer in Cluster (Connected)
[root@master0 ~]# oc exec glusterfs-storage-g9p2h -- gluster peer status
Number of Peers: 3

Hostname: 192.168.1.112
Uuid: 4338f436-49cb-4694-8340-9573842c6bb0
State: Peer in Cluster (Connected)

Hostname: 192.168.1.113
Uuid: 0316a317-9d88-4be4-8d2e-e3e5f853abad
State: Peer in Cluster (Connected)
Other names:
infra3.ocp.example.com

Hostname: infra0.ocp.example.com
Uuid: e91bbb22-f11b-4298-8550-b98e9d6ecbde
State: Peer in Cluster (Connected)
[root@master0 ~]# oc exec glusterfs-storage-v4lj7 -- gluster peer status
Number of Peers: 3

Hostname: 192.168.1.113
Uuid: 0316a317-9d88-4be4-8d2e-e3e5f853abad
State: Peer in Cluster (Connected)
Other names:
192.168.1.113

Hostname: infra1.ocp.example.com
Uuid: 9a3bfaf8-2f46-460b-9d13-f33787c60a51
State: Peer in Cluster (Connected)

Hostname: 192.168.1.112
Uuid: 4338f436-49cb-4694-8340-9573842c6bb0
State: Peer in Cluster (Connected)
[root@master0 ~]# exit
logout
Connection to master0.ocp.example.com closed.

Everything looked healthy, so I reran the playbook, but it stalled at exactly the same error. This time I decided to run the same heketi-cli command myself from inside the deploy pod to see if the full output gave me any clues.

[root@master0 ~]# oc rsh deploy-heketi-storage-1-4n2hk
sh-4.2$ heketi-cli -s http://localhost:8080 --user admin --secret adminkey topology load --json=/tmp/openshift-glusterfs-ansible-v54fr1/topology.json
Creating cluster ... ID: 3df301367683ce8fa3aa9d16bc1ae295
	Allowing file volumes on cluster.
	Allowing block volumes on cluster.
	Creating node infra0.ocp.example.com ... Unable to create node: validation failed: zone: cannot be blank.
	Found node infra1.ocp.example.com on cluster 9ba496323150424c4a94d922b8019e9d
		Found device /dev/vdd
	Found node infra2.ocp.example.com on cluster 9ba496323150424c4a94d922b8019e9d
		Found device /dev/vdd
	Found node infra3.ocp.example.com on cluster 9ba496323150424c4a94d922b8019e9d
		Found device /dev/vdd

So I decided to take a look at that topology.json being loaded. It does show zone: 0 for this node.

sh-4.2$ cat /tmp/openshift-glusterfs-ansible-v54fr1/topology.json
{
  "clusters": [{
      "nodes": [{
          "node": {
            "hostnames": {
              "manage": ["infra0.ocp.example.com"],
              "storage": ["192.168.1.110"]
            },
            "zone": 0
          },
          "devices": ["/dev/vdd"]
        },{
          "node": {
            "hostnames": {
              "manage": ["infra1.ocp.example.com"],
              "storage": ["192.168.1.111"]
            },
            "zone": 1
          },
          "devices": ["/dev/vdd"]
        },{
          "node": {
            "hostnames": {
              "manage": ["infra2.ocp.example.com"],
              "storage": ["192.168.1.112"]
            },
            "zone": 2
          },
          "devices": ["/dev/vdd"]
        },{
          "node": {
            "hostnames": {
              "manage": ["infra3.ocp.example.com"],
              "storage": ["192.168.1.113"]
            },
            "zone": 3
          },
          "devices": ["/dev/vdd"]
        }]
    }]
}

I saw that I could add the node manually with heketi-cli, so I tried that. The result made the problem clearer: with --zone=0 heketi complains that the zone is missing, while a non-zero zone gets further and only fails because I didn’t pass a cluster ID:

sh-4.2$ heketi-cli node add --zone=0 --management-host-name=infra0.ocp.example.com --storage-host-name=192.168.1.110
Error: Missing zone
sh-4.2$ heketi-cli node add --zone=4 --management-host-name=infra0.ocp.example.com --storage-host-name=192.168.1.110
Error: Missing cluster id

Poking around in the heketi code and in example inventory files shows that the zone has to be greater than or equal to 1.[1][2]

Sure enough, taking a look at my inventory I can see I decided to start counting at zero. Seemed logical to me at the time.

[glusterfs]
infra0.ocp.example.com glusterfs_zone=0 glusterfs_devices='["/dev/vdd"]'
infra1.ocp.example.com glusterfs_zone=1 glusterfs_devices='["/dev/vdd"]'
infra2.ocp.example.com glusterfs_zone=2 glusterfs_devices='["/dev/vdd"]'
infra3.ocp.example.com glusterfs_zone=3 glusterfs_devices='["/dev/vdd"]'

So I changed that to…

[glusterfs]
infra0.ocp.example.com glusterfs_zone=1 glusterfs_devices='["/dev/vdd"]'
infra1.ocp.example.com glusterfs_zone=2 glusterfs_devices='["/dev/vdd"]'
infra2.ocp.example.com glusterfs_zone=3 glusterfs_devices='["/dev/vdd"]'
infra3.ocp.example.com glusterfs_zone=4 glusterfs_devices='["/dev/vdd"]'

To be safe, I ran the GlusterFS uninstall playbook, because I’m pretty sure getting the zones mixed up requires a fresh start.

[root@bastion ~]# ansible-playbook -e @passwords.yaml -e @firewall-lb_vars.yaml /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/uninstall.yml

Now for the fun part. There is no way I am going to run that whole stinking playbook again from the start. Let’s take a look at that error, noting where it gets run, and start there, since I’m already past the node join phase.
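
A quick way to see where that task is defined (a sketch; the task prefix already tells us it lives in the openshift_storage_glusterfs role):

[root@bastion ~]# grep -rl "Load heketi topology" /usr/share/ansible/openshift-ansible/roles/

That role gets pulled in via common/private/components.yml, so this time my mid playbook only needs the init phase plus the components phase.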

[root@bastion ~]# cat /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml
---
- import_playbook: init/main.yml
  vars:
    l_prereq_check_hosts: "oo_nodes_to_config"

- import_playbook: openshift-checks/private/install.yml

- import_playbook: openshift-node/private/bootstrap.yml

- import_playbook: common/private/control_plane.yml

- import_playbook: openshift-node/private/join.yml

- import_playbook: common/private/components.yml
[root@bastion ~]# cat /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster-mid.yml
---
- import_playbook: init/main.yml
  vars:
    l_prereq_check_hosts: "oo_nodes_to_config"

- import_playbook: common/private/components.yml
[root@bastion ~]# ansible-playbook -e @passwords.yaml -e @firewall-lb_vars.yaml /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster-mid.yml

How lovely, I hit another error on the same task!

TASK [openshift_storage_glusterfs : Load heketi topology] ****************************************************************************************
fatal: [master0.ocp.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-zNmlaJ/admin.kubeconfig", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-c4s4s", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "adminkey", "topology", "load", "--json=/tmp/openshift-glusterfs-ansible-zNmlaJ/topology.json", "2>&1"], "delta": "0:00:10.473353", "end": "2020-03-14 18:31:14.281746", "failed_when_result": true, "rc": 0, "start": "2020-03-14 18:31:03.808393", "stderr": "", "stderr_lines": [], "stdout": "Creating cluster ... ID: 24b8307be372ef117e6b49fe888253ad\n\tAllowing file volumes on cluster.\n\tAllowing block volumes on cluster.\n\tCreating node infra0.ocp.example.com ... ID: 439b6098591eebe22fc6a0d85cb22d48\n\t\tAdding device /dev/vdd ... OK\n\tCreating node infra1.ocp.example.com ... ID: 5561846579586e375303b3354c204e78\n\t\tAdding device /dev/vdd ... Unable to add device: Initializing device /dev/vdd failed (aleady contains Physical Volume 31f24p-It7N-fSXN-KpAB-NtkH-YMru-GfN7Ob): Running command on the host: /usr/sbin/lvm\n  /run/lock/lvm: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/V_vg_c0a8b39ed56697eb9d0dd8fecde25b38: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/V_docker_vg: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  Can't initialize physical volume \"/dev/vdd\" of volume group \"vg_c0a8b39ed56697eb9d0dd8fecde25b38\" without -ff\n  /dev/vdd: physical volume not initialized.\n\tCreating node infra2.ocp.example.com ... ID: 7dc4babf6b7348400fe3fd2ff258115c\n\t\tAdding device /dev/vdd ... 
Unable to add device: Initializing device /dev/vdd failed (aleady contains Physical Volume H9E57w-uRLX-c3zl-OK73-boF3-ZQzG-x3n7JZ): Running command on the host: /usr/sbin/lvm\n  /run/lock/lvm: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/V_docker_vg: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/V_vg_2a9a40cec578321b4dcb5a2347d2711b: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  Can't initialize physical volume \"/dev/vdd\" of volume group \"vg_2a9a40cec578321b4dcb5a2347d2711b\" without -ff\n  /dev/vdd: physical volume not initialized.\n\tCreating node infra3.ocp.example.com ... ID: e12357d2988172f437ed38fbfe1e3014\n\t\tAdding device /dev/vdd ... Unable to add device: Initializing device /dev/vdd failed (aleady contains Physical Volume EOF1Oj-sfbd-0Fbt-mYAP-Z3du-eCk6-vS0u0E): Running command on the host: /usr/sbin/lvm\n  /run/lock/lvm: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/V_vg_b5c3215ceaddfd5b41fe5cc0e8762043: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/V_docker_vg: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory\n  SELinux context reset: setfscreatecon failed: No such file or directory\n  Can't initialize physical volume \"/dev/vdd\" of volume group \"vg_b5c3215ceaddfd5b41fe5cc0e8762043\" without -ff\n  /dev/vdd: physical volume not initialized.", "stdout_lines": ["Creating cluster ... ID: 24b8307be372ef117e6b49fe888253ad", "\tAllowing file volumes on cluster.", "\tAllowing block volumes on cluster.", "\tCreating node infra0.ocp.example.com ... ID: 439b6098591eebe22fc6a0d85cb22d48", "\t\tAdding device /dev/vdd ... OK", "\tCreating node infra1.ocp.example.com ... ID: 5561846579586e375303b3354c204e78", "\t\tAdding device /dev/vdd ... 
Unable to add device: Initializing device /dev/vdd failed (aleady contains Physical Volume 31f24p-It7N-fSXN-KpAB-NtkH-YMru-GfN7Ob): Running command on the host: /usr/sbin/lvm", "  /run/lock/lvm: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/V_vg_c0a8b39ed56697eb9d0dd8fecde25b38: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/V_docker_vg: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  Can't initialize physical volume \"/dev/vdd\" of volume group \"vg_c0a8b39ed56697eb9d0dd8fecde25b38\" without -ff", "  /dev/vdd: physical volume not initialized.", "\tCreating node infra2.ocp.example.com ... ID: 7dc4babf6b7348400fe3fd2ff258115c", "\t\tAdding device /dev/vdd ... Unable to add device: Initializing device /dev/vdd failed (aleady contains Physical Volume H9E57w-uRLX-c3zl-OK73-boF3-ZQzG-x3n7JZ): Running command on the host: /usr/sbin/lvm", "  /run/lock/lvm: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/V_docker_vg: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/V_vg_2a9a40cec578321b4dcb5a2347d2711b: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  Can't initialize physical volume \"/dev/vdd\" of volume group \"vg_2a9a40cec578321b4dcb5a2347d2711b\" without -ff", "  /dev/vdd: physical volume not initialized.", "\tCreating node infra3.ocp.example.com ... ID: e12357d2988172f437ed38fbfe1e3014", "\t\tAdding device /dev/vdd ... 
Unable to add device: Initializing device /dev/vdd failed (aleady contains Physical Volume EOF1Oj-sfbd-0Fbt-mYAP-Z3du-eCk6-vS0u0E): Running command on the host: /usr/sbin/lvm", "  /run/lock/lvm: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/V_vg_b5c3215ceaddfd5b41fe5cc0e8762043: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/V_docker_vg: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  /run/lock/lvm/P_orphans: setfscreatecon failed: No such file or directory", "  SELinux context reset: setfscreatecon failed: No such file or directory", "  Can't initialize physical volume \"/dev/vdd\" of volume group \"vg_b5c3215ceaddfd5b41fe5cc0e8762043\" without -ff", "  /dev/vdd: physical volume not initialized."]}

That’s really peculiar. I think I would have gotten hung up on the SELinux errors if I had not seen them before; they are not the issue here. Taking a closer look highlights this line:

Unable to add device: Initializing device /dev/vdd failed (aleady contains Physical Volume 31f24p-It7N-fSXN-KpAB-NtkH-YMru-GfN7Ob)

Searching the output for that typo ("aleady") shows the same message for every device, so apparently the GlusterFS uninstall playbook did not remove the VGs or PVs it created. Let’s clean up that mess, and file a bug too, because we really shouldn’t have to do this.

[root@bastion ~]# ssh infra0.ocp.example.com
[root@infra0 ~]# lvs
  LV             VG        Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-root-lv docker_vg -wi-ao---- <60.00g
[root@infra0 ~]# vgs
  VG                                  #PV #LV #SN Attr   VSize   VFree
  docker_vg                             1   1   0 wz--n- <60.00g     0
  vg_b053c53756aaf00baf664d11b3bd8e93   1   0   0 wz--n-  59.87g 59.87g
[root@infra0 ~]# vgremove vg_b053c53756aaf00baf664d11b3bd8e93
  Volume group "vg_b053c53756aaf00baf664d11b3bd8e93" successfully removed
[root@infra0 ~]# pvs
  PV         VG        Fmt  Attr PSize   PFree
  /dev/vdb1  docker_vg lvm2 a--  <60.00g     0
  /dev/vdd             lvm2 ---   60.00g 60.00g
[root@infra0 ~]# pvremove /dev/vdd
  Labels on physical volume "/dev/vdd" successfully wiped.
[root@infra0 ~]# exit
logout
Connection to infra0.ocp.example.com closed.
[root@bastion ~]# ssh infra1.ocp.example.com
[root@infra1 ~]# vgremove vg_c0a8b39ed56697eb9d0dd8fecde25b38
  Volume group "vg_c0a8b39ed56697eb9d0dd8fecde25b38" successfully removed
[root@infra1 ~]# pvremove /dev/vdd
  Labels on physical volume "/dev/vdd" successfully wiped.
[root@infra1 ~]# exit
logout
Connection to infra1.ocp.example.com closed.
[root@bastion ~]# ssh infra2.ocp.example.com
[root@infra2 ~]# vgremove vg_2a9a40cec578321b4dcb5a2347d2711b
  Volume group "vg_2a9a40cec578321b4dcb5a2347d2711b" successfully removed
[root@infra2 ~]# pvremove /dev/vdd
  Labels on physical volume "/dev/vdd" successfully wiped.
[root@infra2 ~]# exit
logout
Connection to infra2.ocp.example.com closed.
[root@bastion ~]# ssh infra3.ocp.example.com
[root@infra3 ~]# vgremove vg_b5c3215ceaddfd5b41fe5cc0e8762043
  Volume group "vg_b5c3215ceaddfd5b41fe5cc0e8762043" successfully removed
[root@infra3 ~]# pvremove /dev/vdd
  Labels on physical volume "/dev/vdd" successfully wiped.
[root@infra3 ~]# exit
logout
Connection to infra3.ocp.example.com closed.

I suppose I could have written an ad-hoc Ansible command, but getting it to pick the right VG would have taken some thought, and doing this manually on four nodes was faster.
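
For the record, an ad-hoc version might have looked something like this (untested; it assumes the heketi-created volume groups are the only ones named vg_<hex> and that these hosts are in the [glusterfs] inventory group):

[root@bastion ~]# ansible glusterfs -m shell -a "vgs --noheadings -o vg_name | grep -o 'vg_[0-9a-f]\+' | xargs -r vgremove -y; pvremove -y /dev/vdd"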

One more run of the installer, and thankfully it completed successfully this time.

I went on to install my application suite on top of OpenShift, but ran into further problems with gluster volumes. Here is one of the errors and the behavior I hit.

The app’s installer was deploying its database and main application pods and checking on their progress when it failed:

fatal: [127.0.0.1]: FAILED! => {
    "attempts": 50,
    "changed": false,
    "cmd": "sleep 5; oc get pods --namespace openshift-launcher | grep \"deploy\"",
    "delta": "0:00:05.375602",
    "end": "2020-03-15 16:38:43.465237",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "sleep 5; oc get pods --namespace openshift-launcher | grep \"deploy\"",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "rc": 0,
    "start": "2020-03-15 16:38:38.089635",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "launcher-sso-1-deploy              0/1       Error     0          14m\nlauncher-sso-postgresql-1-deploy   0/1       Error     0          14m",
    "stdout_lines": [
        "launcher-sso-1-deploy              0/1       Error     0          14m",
        "launcher-sso-postgresql-1-deploy   0/1       Error     0          14m"
    ]
}

Since it failed with an Error, let’s check the events.

bash-5.0$ oc get events
LAST SEEN   FIRST SEEN   COUNT     NAME                                                KIND                    SUBOBJECT                                  TYPE      REASON                  SOURCE                                  MESSAGE
21m         21m          1         launcher-sso-postgresql-1-deploy.15fc86e143dc7f08   Pod                                                                Normal    Scheduled               default-scheduler                       Successfully assigned openshift-launcher/launcher-sso-postgresql-1-deploy to compute1.ocp.home.dataxf.com
21m         21m          1         launcher-sso-postgresql.15fc86e13b735183            DeploymentConfig                                                   Normal    DeploymentCreated       deploymentconfig-controller             Created new replication controller "launcher-sso-postgresql-1" for version 1
21m         21m          1         launcher-sso.15fc86e173c8ab8b                       DeploymentConfig                                                   Normal    DeploymentCreated       deploymentconfig-controller             Created new replication controller "launcher-sso-1" for version 1
21m         21m          1         launcher-sso-1-deploy.15fc86e178cdd201              Pod                                                                Normal    Scheduled               default-scheduler                       Successfully assigned openshift-launcher/launcher-sso-1-deploy to compute0.ocp.home.dataxf.com
21m         21m          1         launcher-sso-postgresql-1-deploy.15fc86e1f347f3e4   Pod                     spec.containers{deployment}                Normal    Pulled                  kubelet, compute1.ocp.home.dataxf.com   Container image "registry.redhat.io/openshift3/ose-deployer:v3.11.170" already present on machine
21m         21m          1         launcher-sso-postgresql-1-deploy.15fc86e2655afe15   Pod                     spec.containers{deployment}                Normal    Started                 kubelet, compute1.ocp.home.dataxf.com   Started container
21m         21m          1         launcher-sso-postgresql-1-deploy.15fc86e25448fdd8   Pod                     spec.containers{deployment}                Normal    Created                 kubelet, compute1.ocp.home.dataxf.com   Created container
21m         21m          1         launcher-sso-postgresql-1.15fc86e2cf356be7          ReplicationController                                              Normal    SuccessfulCreate        replication-controller                  Created pod: launcher-sso-postgresql-1-m9gnf
21m         21m          1         launcher-sso-1-deploy.15fc86e3075ae97a              Pod                     spec.containers{deployment}                Normal    Pulling                 kubelet, compute0.ocp.home.dataxf.com   pulling image "registry.redhat.io/openshift3/ose-deployer:v3.11.170"
21m         21m          1         launcher-sso-1-deploy.15fc86e378276e33              Pod                     spec.containers{deployment}                Normal    Pulled                  kubelet, compute0.ocp.home.dataxf.com   Successfully pulled image "registry.redhat.io/openshift3/ose-deployer:v3.11.170"
21m         21m          1         launcher-sso-1-deploy.15fc86e384e6acef              Pod                     spec.containers{deployment}                Normal    Created                 kubelet, compute0.ocp.home.dataxf.com   Created container
21m         21m          1         launcher-sso-1-deploy.15fc86e3955e4723              Pod                     spec.containers{deployment}                Normal    Started                 kubelet, compute0.ocp.home.dataxf.com   Started container
21m         21m          1         launcher-sso-1.15fc86e3b35d3311                     ReplicationController                                              Normal    SuccessfulCreate        replication-controller                  Created pod: launcher-sso-1-j8r8g
21m         21m          1         launcher-sso-1-j8r8g.15fc86e3b4a87e1f               Pod                                                                Normal    Scheduled               default-scheduler                       Successfully assigned openshift-launcher/launcher-sso-1-j8r8g to compute1.ocp.home.dataxf.com
21m         21m          1         launcher-sso-postgresql-1-m9gnf.15fc86e5a53086c3    Pod                                                                Normal    Scheduled               default-scheduler                       Successfully assigned openshift-launcher/launcher-sso-postgresql-1-m9gnf to compute1.ocp.home.dataxf.com
21m         21m          1         launcher-sso-postgresql-claim.15fc86e5900f700d      PersistentVolumeClaim                                              Normal    ProvisioningSucceeded   persistentvolume-controller             Successfully provisioned volume pvc-72b78e87-66d9-11ea-8c12-566f4d920014 using kubernetes.io/glusterfs
21m         21m          21        launcher-sso-postgresql-1-m9gnf.15fc86e2cf29794c    Pod                                                                Warning   FailedScheduling        default-scheduler                       pod has unbound PersistentVolumeClaims (repeated 2 times)
18m         18m          1         launcher-sso-1-j8r8g.15fc87113872b609               Pod                                                                Normal    SandboxChanged          kubelet, compute1.ocp.home.dataxf.com   Pod sandbox changed, it will be killed and re-created.
18m         21m          2         launcher-sso-1-j8r8g.15fc86e472027237               Pod                     spec.containers{launcher-sso}              Normal    Pulling                 kubelet, compute1.ocp.home.dataxf.com   pulling image "registry.redhat.io/redhat-sso-7/sso73-openshift@sha256:89f0f6a1e6e71cda169774da5512656b79cc73f85d87bababec031676750bd54"
16m         16m          1         launcher-sso-postgresql-1-m9gnf.15fc873192f2e7d6    Pod                     spec.containers{launcher-sso-postgresql}   Normal    Started                 kubelet, compute1.ocp.home.dataxf.com   Started container
15m         15m          3         launcher-sso-postgresql-1-m9gnf.15fc8739cfde93e6    Pod                     spec.containers{launcher-sso-postgresql}   Warning   Unhealthy               kubelet, compute1.ocp.home.dataxf.com   Liveness probe failed: dial tcp 10.129.2.7:5432: connect: connection refused
14m         15m          12        launcher-sso-postgresql-1-m9gnf.15fc8732a1145d61    Pod                     spec.containers{launcher-sso-postgresql}   Warning   Unhealthy               kubelet, compute1.ocp.home.dataxf.com   Readiness probe failed: sh: cannot set terminal process group (-1): Inappropriate ioctl for device
sh: no job control in this shell
psql: could not connect to server: Connection refused
          Is the server running on host "127.0.0.1" and accepting
          TCP/IP connections on port 5432?

13m       13m       1         launcher-sso-postgresql-1-m9gnf.15fc875a7a74911e   Pod                     spec.containers{launcher-sso-postgresql}   Normal    Killing                       kubelet, compute1.ocp.home.dataxf.com   Killing container with id docker://launcher-sso-postgresql:Container failed liveness probe.. Container will be killed and recreated.
12m       18m       2         launcher-sso-1-j8r8g.15fc8710b0fb8fd7              Pod                     spec.containers{launcher-sso}              Warning   Failed                        kubelet, compute1.ocp.home.dataxf.com   Failed to pull image "registry.redhat.io/redhat-sso-7/sso73-openshift@sha256:89f0f6a1e6e71cda169774da5512656b79cc73f85d87bababec031676750bd54": rpc error: code = Canceled desc = context canceled
12m       18m       2         launcher-sso-1-j8r8g.15fc8710b0fd2624              Pod                     spec.containers{launcher-sso}              Warning   Failed                        kubelet, compute1.ocp.home.dataxf.com   Error: ErrImagePull
11m       11m       1         launcher-sso.15fc8773f1136718                      DeploymentConfig                                                   Normal    ReplicationControllerScaled   deploymentconfig-controller             Scaled replication controller "launcher-sso-1" from 1 to 0
11m       11m       1         launcher-sso-1.15fc877543a5bc68                    ReplicationController                                              Normal    SuccessfulDelete              replication-controller                  Deleted pod: launcher-sso-1-j8r8g
10m       10m       1         launcher-sso-postgresql-1.15fc877ac1758de8         ReplicationController                                              Normal    SuccessfulDelete              replication-controller                  Deleted pod: launcher-sso-postgresql-1-m9gnf
10m       21m       2         launcher-sso-postgresql-1-m9gnf.15fc86e694c5b616   Pod                     spec.containers{launcher-sso-postgresql}   Normal    Pulling                       kubelet, compute1.ocp.home.dataxf.com   pulling image "docker-registry.default.svc:5000/openshift/postgresql@sha256:0cf19c73fb1ed0784a5092edd42ce662d7328dac529f56b6bfc3d85b24552ed4"
10m       10m       1         launcher-sso-postgresql.15fc877abd754a97           DeploymentConfig                                                   Normal    ReplicationControllerScaled   deploymentconfig-controller             Scaled replication controller "launcher-sso-postgresql-1" from 1 to 0
10m       10m       1         launcher-sso-postgresql-1-m9gnf.15fc877b0adf956c   Pod                     spec.containers{launcher-sso-postgresql}   Warning   Failed                        kubelet, compute1.ocp.home.dataxf.com   Error: failed to start container "launcher-sso-postgresql": Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "process_linux.go:245: running exec setns process for init caused \"exit status 15\""
10m       16m       2         launcher-sso-postgresql-1-m9gnf.15fc872f6875de7a   Pod                     spec.containers{launcher-sso-postgresql}   Normal    Pulled                        kubelet, compute1.ocp.home.dataxf.com   Successfully pulled image "docker-registry.default.svc:5000/openshift/postgresql@sha256:0cf19c73fb1ed0784a5092edd42ce662d7328dac529f56b6bfc3d85b24552ed4"
10m       16m       2         launcher-sso-postgresql-1-m9gnf.15fc87317ce10773   Pod                     spec.containers{launcher-sso-postgresql}   Normal    Created                       kubelet, compute1.ocp.home.dataxf.com   Created container
8m        8m        1         launcher-sso-1-j8r8g.15fc879771b486f9              Pod                                                                Warning   FailedMount                   kubelet, compute1.ocp.home.dataxf.com   Unable to mount volumes for pod "launcher-sso-1-j8r8g_openshift-launcher(791ec264-66d9-11ea-8c12-566f4d920014)": timeout expired waiting for volumes to attach or mount for pod "openshift-launcher"/"launcher-sso-1-j8r8g". list of unmounted volumes=[sso-x509-https-volume sso-x509-jgroups-volume default-token-pcwhz]. list of unattached volumes=[sso-x509-https-volume sso-x509-jgroups-volume default-token-pcwhz]

There are a few different failures in those events, and events can be hard to dissect because some errors are expected while the pods are still deploying and coming up. For example, some errors appear simply because the app pod is waiting for the database pod to exist.

Let’s look at some other things.

bash-5.0$ oc get pvc
NAME                            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
launcher-sso-postgresql-claim   Bound     pvc-72b78e87-66d9-11ea-8c12-566f4d920014   1Gi        RWO            glusterfs-storage   21m


bash-5.0$ oc get all
NAME                                   READY     STATUS    RESTARTS   AGE
pod/launcher-sso-1-deploy              0/1       Error     0          23m
pod/launcher-sso-postgresql-1-deploy   0/1       Error     0          23m

NAME                                              DESIRED   CURRENT   READY     AGE
replicationcontroller/launcher-sso-1              0         0         0         23m
replicationcontroller/launcher-sso-postgresql-1   0         0         0         23m

NAME                                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/glusterfs-dynamic-72b78e87-66d9-11ea-8c12-566f4d920014   ClusterIP   172.30.97.7      <none>        1/TCP      23m
service/launcher-sso                                             ClusterIP   172.30.157.193   <none>        8443/TCP   23m
service/launcher-sso-ping                                        ClusterIP   None             <none>        8888/TCP   23m
service/launcher-sso-postgresql                                  ClusterIP   172.30.24.67     <none>        5432/TCP   23m

NAME                                                         REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfig.apps.openshift.io/launcher-sso              1          1         0         config,image(redhat-sso73-openshift:1.0)
deploymentconfig.apps.openshift.io/launcher-sso-postgresql   1          1         0         config,image(postgresql:9.5)

NAME                                    HOST/PORT                                                  PATH      SERVICES       PORT      TERMINATION   WILDCARD
route.route.openshift.io/launcher-sso   launcher-sso-openshift-launcher.apps.ocp.home.dataxf.com             launcher-sso   <all>     reencrypt     None


bash-5.0$ oc get pvc --all-namespaces
NAMESPACE                         NAME                                                                       STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
default                           registry-claim                                                             Bound     registry-volume                            20Gi       RWX            glusterfs-storage   17h
openshift-launcher                launcher-sso-postgresql-claim                                              Bound     pvc-72b78e87-66d9-11ea-8c12-566f4d920014   1Gi        RWO            glusterfs-storage   24m
openshift-middleware-monitoring   prometheus-application-monitoring-db-prometheus-application-monitoring-0   Bound     pvc-7fff0ee0-66d9-11ea-8c12-566f4d920014   10Gi       RWO            glusterfs-storage   24m

What comes to mind is tailing the logs of the deploy pod, or better yet, the pod it failed to deploy. Unfortunately I waited too long and the pod it attempted to deploy is now gone, so there is nothing left to check there.

bash-5.0$ oc logs -f pod/launcher-sso-postgresql-1-deploy
--> Scaling launcher-sso-postgresql-1 to 1
error: update acceptor rejected launcher-sso-postgresql-1: pods for rc 'openshift-launcher/launcher-sso-postgresql-1' took longer than 600 seconds to become available
bash-5.0$ oc debug launcher-sso-postgresql-1-m9gnf
Error from server (NotFound): pods "launcher-sso-postgresql-1-m9gnf" not found
bash-5.0$ oc logs -f launcher-sso-postgresql-1-m9gnf
Error from server (NotFound): pods "launcher-sso-postgresql-1-m9gnf" not found

Let’s roll out a new deployment to get another pod.

bash-5.0$ oc rollout latest dc/launcher-sso-postgresql
deploymentconfig.apps.openshift.io/launcher-sso-postgresql rolled out
bash-5.0$ oc get pods
NAME                               READY     STATUS              RESTARTS   AGE
launcher-sso-1-deploy              0/1       Error               0          27m
launcher-sso-postgresql-1-deploy   0/1       Error               0          27m
launcher-sso-postgresql-2-deploy   1/1       Running             0          13s
launcher-sso-postgresql-2-s45vx    0/1       ContainerCreating   0          9s


bash-5.0$ oc get events
LAST SEEN   FIRST SEEN   COUNT     NAME                                                KIND                    SUBOBJECT                                  TYPE      REASON                  SOURCE                                  MESSAGE
35s       35s       1         launcher-sso-postgresql-2-deploy.15fc885c85151b87   Pod                                                                Normal    Scheduled                     default-scheduler                       Successfully assigned openshift-launcher/launcher-sso-postgresql-2-deploy to compute0.ocp.home.dataxf.com
35s       35s       1         launcher-sso-postgresql.15fc885c80a512ef            DeploymentConfig                                                   Normal    DeploymentCreated             deploymentconfig-controller             Created new replication controller "launcher-sso-postgresql-2" for version 2
32s       32s       1         launcher-sso-postgresql-2-deploy.15fc885d284d4724   Pod                     spec.containers{deployment}                Normal    Created                       kubelet, compute0.ocp.home.dataxf.com   Created container
32s       32s       1         launcher-sso-postgresql-2-deploy.15fc885d170cff0b   Pod                     spec.containers{deployment}                Normal    Pulled                        kubelet, compute0.ocp.home.dataxf.com   Container image "registry.redhat.io/openshift3/ose-deployer:v3.11.170" already present on machine
32s       32s       1         launcher-sso-postgresql-2-deploy.15fc885d36524b4b   Pod                     spec.containers{deployment}                Normal    Started                       kubelet, compute0.ocp.home.dataxf.com   Started container
31s       31s       1         launcher-sso-postgresql-2.15fc885d545e47c9          ReplicationController                                              Normal    SuccessfulCreate              replication-controller                  Created pod: launcher-sso-postgresql-2-s45vx
31s       31s       1         launcher-sso-postgresql-2-s45vx.15fc885d5576bab8    Pod                                                                Normal    Scheduled                     default-scheduler                       Successfully assigned openshift-launcher/launcher-sso-postgresql-2-s45vx to compute0.ocp.home.dataxf.com
28s       28s       1         launcher-sso-postgresql-2-s45vx.15fc885dfa3d78d5    Pod                     spec.containers{launcher-sso-postgresql}   Normal    Pulling                       kubelet, compute0.ocp.home.dataxf.com   pulling image "docker-registry.default.svc:5000/openshift/postgresql@sha256:0cf19c73fb1ed0784a5092edd42ce662d7328dac529f56b6bfc3d85b24552ed4"
9s        9s        1         launcher-sso-postgresql-2-s45vx.15fc88627eabf475    Pod                     spec.containers{launcher-sso-postgresql}   Normal    Pulled                        kubelet, compute0.ocp.home.dataxf.com   Successfully pulled image "docker-registry.default.svc:5000/openshift/postgresql@sha256:0cf19c73fb1ed0784a5092edd42ce662d7328dac529f56b6bfc3d85b24552ed4"

After waiting some time, I can see the pod actually did start and then fail. Note that I am on very slow disks (7200 RPM SATA), with SDS (software-defined storage) layered on top of another layer of SDS, so this isn’t expected to be fast.
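
Rather than re-running oc get pods by hand, the standard watch flag will stream the status changes as they happen:

bash-5.0$ oc get pods -w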

bash-5.0$ oc get pods
NAME                               READY     STATUS             RESTARTS   AGE
launcher-sso-1-deploy              0/1       Error              0          31m
launcher-sso-postgresql-1-deploy   0/1       Error              0          31m
launcher-sso-postgresql-2-deploy   1/1       Running            0          4m
launcher-sso-postgresql-2-s45vx    0/1       CrashLoopBackOff   2          4m
bash-5.0$ oc logs -f launcher-sso-postgresql-2-s45vx
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
.................. done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ERROR:  role "useruqY" does not exist
bash-5.0$ oc logs -f launcher-sso-postgresql-2-s45vx
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
.................. done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ERROR:  role "useruqY" does not exist

OK, so it is peculiar that the user is missing. This generally tells me that the init script that sets up the database failed on the first deploy attempt. It also means that something else is my actual problem, and I need to start the whole project deploy over rather than just roll out this one pod. Next time I will have to monitor the deployment, watch for the failure, and capture the pod logs before the pod is replaced. Since a deploy usually takes ten minutes or more, there is plenty of time to catch it, but I can’t walk away from it.
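
A minimal sketch of what I plan to run in a second terminal during the next attempt (all standard oc commands; the dc name is the one from this project):

bash-5.0$ oc logs -f dc/launcher-sso-postgresql          # follow the current deployment's pod logs as they appear
bash-5.0$ oc get events --sort-by='.lastTimestamp'       # afterwards, review the project's events in chronological order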

bash-5.0$ oc delete all --all
replicationcontroller "launcher-sso-1" deleted
replicationcontroller "launcher-sso-postgresql-1" deleted
replicationcontroller "launcher-sso-postgresql-2" deleted
service "glusterfs-dynamic-72b78e87-66d9-11ea-8c12-566f4d920014" deleted
service "launcher-sso" deleted
service "launcher-sso-ping" deleted
service "launcher-sso-postgresql" deleted
deploymentconfig.apps.openshift.io "launcher-sso" deleted
deploymentconfig.apps.openshift.io "launcher-sso-postgresql" deleted
route.route.openshift.io "launcher-sso" deleted

On the next deploy (the first lines below are the tail of the pod log I was following as initdb failed, interleaved with my next command):

child process exited with exit code 1
initdb: removing contents of data directory "/var/lib/pgsql/data/userdata"
creating information schema ... bash-5.0$ oc get pods
NAME                               READY     STATUS         RESTARTS   AGE
launcher-sso-1-deploy              1/1       Running        0          6m
launcher-sso-1-z8wtr               0/1       ErrImagePull   0          6m
launcher-sso-postgresql-1-deploy   1/1       Running        0          6m
launcher-sso-postgresql-1-wpkfh    0/1       Error          0          6m
bash-5.0$ oc logs -f launcher-sso-postgresql-1-wpkfh
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/lib/pgsql/data/userdata/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
FATAL:  unexpected data beyond EOF in block 32 of relation base/1/2673
HINT:  This has been seen to occur with buggy kernels; consider updating your system.
STATEMENT:  /*
	 * SQL Information Schema
........(truncated)..........

Now this is revealing: FATAL: unexpected data beyond EOF in block 32 of relation base/1/2673. I’ve actually seen this before, and from experience I know that GlusterFS doesn’t handle database workloads well without some tuning. We could also switch to gluster-block, but that means hunting down every storage class reference in this project’s deploy, which I am less familiar with. I could also make gluster-block my default storage class, but I don’t need it for my other applications. This is really only a test cluster, so I’m going to take the path of least resistance and update my existing glusterfs storage class with the performance options needed.[3][4]
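
For the record, a quick way to see which PVCs reference a given storage class, had I gone the gluster-block route (custom-columns is a standard output format; the column labels are mine):

bash-5.0$ oc get pvc --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STORAGECLASS:.spec.storageClassName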

bash-5.0$ oc get sc
NAME                          PROVISIONER                          AGE
glusterfs-storage (default)   kubernetes.io/glusterfs              19h
glusterfs-storage-block       gluster.org/glusterblock-glusterfs   19h
bash-5.0$ oc get sc glusterfs-storage-block -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: 2020-03-14T23:21:56Z
  name: glusterfs-storage-block
  resourceVersion: "135134"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/glusterfs-storage-block
  uid: 9887fbd0-664a-11ea-abc3-566f4d920016
parameters:
  chapauthenabled: "true"
  hacount: "3"
  restsecretname: heketi-storage-admin-secret-block
  restsecretnamespace: glusterfs
  resturl: http://heketi-storage.glusterfs.svc:8080
  restuser: admin
provisioner: gluster.org/glusterblock-glusterfs
reclaimPolicy: Delete
volumeBindingMode: Immediate
bash-5.0$ oc get sc glusterfs-storage -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2020-03-14T23:21:17Z
  name: glusterfs-storage
  resourceVersion: "134981"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/glusterfs-storage
  uid: 8187c9d4-664a-11ea-abc3-566f4d920016
parameters:
  resturl: http://heketi-storage.glusterfs.svc:8080
  restuser: admin
  secretName: heketi-storage-admin-secret
  secretNamespace: glusterfs
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete
volumeBindingMode: Immediate

Unfortunately, editing the storage class in place doesn’t appear to work. It may have been my formatting, or it may be that StorageClass parameters can’t be changed after creation. Either way, the workaround is to export the object, add the volume options, and recreate it.

bash-5.0$ oc edit sc glusterfs-storage
error: storageclasses.storage.k8s.io "glusterfs-storage" is invalid
A copy of your changes has been stored to "/tmp/oc-edit-6fb8q.yaml"
error: Edit cancelled, no valid changes were saved.

bash-5.0$ oc get sc glusterfs-storage -o yaml > glusterfs-sc.yaml
bash-5.0$ vi glusterfs-sc.yaml
bash-5.0$ oc delete sc glusterfs-storage
storageclass.storage.k8s.io "glusterfs-storage" deleted
bash-5.0$ oc create -f glusterfs-sc.yaml
storageclass.storage.k8s.io/glusterfs-storage created
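
For reference, the only edit made in vi was adding this single line under parameters (shown unwrapped here; oc wraps it in the output below):

  volumeoptions: performance.stat-prefetch off performance.write-behind off performance.open-behind off performance.quick-read off performance.strict-o-direct on performance.read-ahead off performance.io-cache off performance.readdir-ahead off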

bash-5.0$ oc get sc glusterfs-storage -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2020-03-15T19:32:09Z
  name: glusterfs-storage
  resourceVersion: "385822"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/glusterfs-storage
  uid: a9b765a3-66f3-11ea-8c12-566f4d920014
parameters:
  resturl: http://heketi-storage.glusterfs.svc:8080
  restuser: admin
  secretName: heketi-storage-admin-secret
  secretNamespace: glusterfs
  volumeoptions: performance.stat-prefetch off performance.write-behind off performance.open-behind
    off performance.quick-read off performance.strict-o-direct on performance.read-ahead
    off performance.io-cache off performance.readdir-ahead off
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete
volumeBindingMode: Immediate
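
These volume options are applied by heketi when a new volume is provisioned, so they only take effect for PVCs created after the storage class change; existing volumes keep their old settings. To confirm the options landed on a fresh volume, one way is to look at it from inside a GlusterFS pod (the pod name is a placeholder; the glusterfs namespace comes from the storage class above):

bash-5.0$ oc get pods -n glusterfs | grep glusterfs-storage
bash-5.0$ oc rsh -n glusterfs <glusterfs-storage-pod>
sh-4.2# gluster volume info        # the new volume should list the performance.* settings under "Options Reconfigured"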

Now, on our next full deploy, we get a clean database start-up, and we can even see it initialize the user roles correctly.

bash-5.0$ oc logs -f pod/launcher-sso-postgresql-1-dg6pk
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/lib/pgsql/data/userdata/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start


WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
 done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ALTER ROLE
waiting for server to shut down.... done
server stopped
Starting server...
LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".

We can now move on to finishing our application deploy.
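
As an optional final check, connecting to the database from inside the new pod confirms the roles really exist this time (oc rsh resolves the dc to its running pod; psql works locally thanks to the trust authentication noted in the initdb output above):

bash-5.0$ oc rsh dc/launcher-sso-postgresql
sh-4.2$ psql -U postgres -c '\du'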