Troubleshooting
This document covers how to troubleshoot the deployment of a Kubernetes cluster; it does not cover debugging of workloads inside Kubernetes.
Understanding Cluster Status
Using juju status can give you some insight as to what’s happening in a cluster:
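    juju status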
From the status output we can glean some information. The Workload column will show the status of a given service. The Message section will show you the health of a given service in the cluster. During deployment and maintenance these workload statuses will update to reflect what a given node is doing. For example, the workload may say maintenance while the message describes this maintenance as Installing docker.
During normal operation the Workload should read active, the Agent column (which reflects what the Juju agent is doing) should read idle, and the messages will either say Ready or another descriptive term. juju status --color will also return all green results when a cluster’s deployment is healthy.
Status can become unwieldy for large clusters, so it is recommended to check the status of individual services, for example to check the status of the workers only:
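    juju status kubernetes-worker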
or just on the etcd cluster:
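    juju status etcd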
Errors will have an obvious message and will return a red result when used with juju status --color. Nodes that come up in this manner should be investigated.
SSHing to units
You can ssh to individual units easily with the following convention: juju ssh <servicename>/<unit#>. For example:
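    # unit numbers start at 0, so the third worker unit is unit 2
    juju ssh kubernetes-worker/2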
Will automatically ssh you to the 3rd worker unit.
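    # assuming the easyrsa application has a single unit, unit 0
    juju ssh easyrsa/0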
This will automatically ssh you to the easyrsa unit.
Collecting debug information
To collect comprehensive debug output from your Charmed Kubernetes cluster, install and run juju-crashdump on a computer that has the Juju client installed, with the current controller and model pointing at your Charmed Kubernetes deployment.
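For example, one way to install and run it is from the snap store (other installation methods may also be available):

    sudo snap install juju-crashdump --classic
    juju-crashdump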
Running the juju-crashdump script will generate a tarball of debug information that includes systemd unit status and logs, Juju logs, charm unit data, and Kubernetes cluster information. It is recommended that you include this tarball when filing a bug.
Common Problems
Charms deployed to LXD containers fail after upgrade/reboot
For deployments using Juju’s localhost
cloud, which deploys charms to LXD/LXC containers, or other
cases where applications are deployed to LXD, there is a known issue
(https://bugs.launchpad.net/juju/+bug/1904619)
with the profiles applied by Juju. The LXD profile used by Juju is named after the charm, including
the revision number. Upgrading the charm causes Juju to create a new profile for LXD which does not
necessarily contain the same settings which were originally supplied. If services based on LXD
containers fail to resume after an upgrade, this is a potential cause of that failure.
To check what the profiles should contain, the YAML output from juju status or juju machines can be used:
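    # juju status --format=yaml includes the same machine information
    juju machines --format=yaml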
This will detail the profiles in the output, e.g.:
    model:
      name: default
    machines:
      "0":
        ...
        instance-id: juju-4ac678-1
        ...
        lxd-profiles:
          juju-default-kubernetes-worker-718:
            config:
              linux.kernel_modules: ip_tables,ip6_tables,netlink_diag,nf_nat,overlay
              raw.lxc: |
                lxc.apparmor.profile=unconfined
                lxc.mount.auto=proc:rw sys:rw
                lxc.cgroup.devices.allow=a
                lxc.cap.drop=
              security.nesting: "true"
              security.privileged: "true"
            description: ""
            devices:
              aadisable:
                path: /dev/kmsg
                source: /dev/kmsg
                type: unix-char
    ...
To check that this matches the profile that was actually applied, you can query lxc (this needs to be run on the machine where the containers are running); the output should correspond to the profile shown above:
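    # list the available profiles, then show the one corresponding to the charm
    lxc profile list
    lxc profile show juju-default-kubernetes-worker-718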
If this differs from what is expected, the profile can be manually edited. E.g., for the above profile:
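    lxc profile edit juju-default-kubernetes-worker-718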
Load Balancer interfering with Helm
This section assumes you have a working deployment of Kubernetes via Juju using a Load Balancer for the API, and that you are using Helm to deploy charts.
To deploy Helm you will have run:
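With Helm 2 this is typically:

    # Helm 2: initialise the client and install/upgrade Tiller in the cluster
    helm init --upgrade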
Then when using helm you may see one of the following errors:
- Helm doesn’t get the version from the Tiller server
- Helm cannot install your chart
This is caused by the API load balancer not forwarding ports in the context of the helm client-server relationship. To deploy using helm, you will need to follow these steps:
Expose the Kubernetes Master service
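For example (this assumes the control-plane application is named kubernetes-control-plane; older deployments may name it kubernetes-master):

    juju expose kubernetes-control-plane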
Identify the public IP address of one of your masters
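    # the Public address column in the output shows each unit's public IP
    juju status kubernetes-control-plane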
In this context the public IP address is 54.210.100.102.
If you want to access this data programmatically you can use the JSON output:
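For example, a sketch using jq (adjust the application name to match your deployment):

    juju status kubernetes-control-plane --format=json \
      | jq -r '.applications."kubernetes-control-plane".units[]."public-address"'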
Update the kubeconfig file
Identify the kubeconfig file or section used for this cluster, and edit the server configuration. By default, it will look like https://54.213.123.123:443. Replace it with the Kubernetes Master endpoint https://54.210.100.102:6443 and save. Note that the default port used by Charmed Kubernetes for the Kubernetes Master API is 6443, while the port exposed by the load balancer is 443.
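One way to make this change is with kubectl; the cluster name below is a placeholder for whatever name is used in your kubeconfig:

    kubectl config set-cluster <cluster-name> --server=https://54.210.100.102:6443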
Start helm again!
    helm install <chart> --debug
    Created tunnel using local port: '36749'
    SERVER: "localhost:36749"
    CHART PATH: /home/ubuntu/.helm/<chart>
    NAME: <chart>
    ...
    ...
Logging and monitoring
By default there is no log aggregation of the Kubernetes nodes; each node logs locally. Please read over the logging page for more information.
Troubleshooting Keystone/LDAP issues
The following section offers some notes to help determine issues with using Keystone for authentication/authorisation. Testing each of the following steps in order is important for determining the cause of the problem.
Can you communicate with Keystone and get an authorization token?
First, verify that Keystone communication works from both your client and the kubernetes-worker machines. The easiest thing to do here is to copy the kube-keystone.sh script from kubernetes-control-plane to the machines of interest (juju scp kubernetes-control-plane/0:kube-keystone.sh .), then edit the script to include your credentials, source kube-keystone.sh, and run get_keystone_token.
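For example:

    juju scp kubernetes-control-plane/0:kube-keystone.sh .
    # edit kube-keystone.sh to add your credentials, then:
    source kube-keystone.sh
    get_keystone_token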
This will produce a token from the Keystone server. If that isn’t working,
check firewall settings on your Keystone server. Note that the
kube-keystone.sh script could be overwritten, so it is a best practice to make
a copy somewhere and use that.
Are the pods for Keystone authentication up and running properly?
The Keystone pods live in the kube-system namespace and read a configmap from Kubernetes for the policy. Check to make sure they are running:
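    # the authentication pod names include "keystone"
    kubectl -n kube-system get pods | grep keystone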
Check the logs of the pods for errors:
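For example, using a placeholder for a pod name reported above:

    kubectl -n kube-system logs <keystone-auth-pod-name>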
Is the configmap with the policy correct?
Check the configmap contents. The pod logs above would complain if the YAML isn’t valid, but make sure it matches what you expect.
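For example (the configmap name below is a placeholder; use the name of the policy configmap in your cluster):

    kubectl -n kube-system get configmaps
    kubectl -n kube-system get configmap <policy-configmap-name> -o yaml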
Check the service and endpoints
Verify the service exists and has endpoints
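    # both should exist, and the endpoints list should not be empty
    kubectl -n kube-system get service k8s-keystone-auth-service
    kubectl -n kube-system get endpoints k8s-keystone-auth-service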
Attempt to authenticate directly to the service
Use a token to auth with the Keystone service directly:
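A sketch of such a request, assuming the webhook listens on port 8443 at /webhook and that the token from get_keystone_token is in the TOKEN environment variable:

    curl -k -X POST https://<service-ip>:8443/webhook \
      -H "Content-Type: application/json" \
      -d "{
        \"apiVersion\": \"authentication.k8s.io/v1beta1\",
        \"kind\": \"TokenReview\",
        \"metadata\": {},
        \"spec\": {\"token\": \"${TOKEN}\"}
      }"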
Note that you need to change the IP address above to the address of your k8s-keystone-auth-service. This will talk to the webhook, verify that the token is valid, and return information about the user.
API server
Finally, verify communication between the API server and the Keystone service. The easiest thing to do here is to look at the log of the API server for interesting information, such as timeouts or errors with the webhook.
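A sketch, assuming the API server runs as the kube-apiserver snap on the control-plane units (as it does in Charmed Kubernetes):

    juju ssh kubernetes-control-plane/0
    sudo journalctl -u snap.kube-apiserver.daemon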