Saturday, December 29, 2018

Linux Troubleshooting


Diagnosing Memory Issues
------------------------

What is a memory leak?

In computer science, a memory leak is a type of resource leak that occurs when a
computer program incorrectly manages memory allocations in such a way that
memory which is no longer needed is not released. A memory leak may also happen
when an object is stored in memory but cannot be accessed by the running code.

Finding top consumers
ps --sort -rss -eo rss,pid,command | head
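
The RSS column printed by ps is in KiB, which is hard to read for large
processes. As a rough sketch, the helper below converts "rss pid command" lines
to MiB. The name rss_to_mib is made up for this note and is not a standard tool;
it assumes the three-column layout produced by the command above.

```shell
# rss_to_mib: read "rss pid command..." lines on stdin (rss in KiB)
# and print "<MiB> MiB<TAB>pid<TAB>command".
rss_to_mib() {
    awk '{
        rss = $1; pid = $2
        # Rebuild the command line from the remaining fields.
        cmd = $3
        for (i = 4; i <= NF; i++) cmd = cmd " " $i
        printf "%.1f MiB\t%s\t%s\n", rss / 1024, pid, cmd
    }'
}

# Possible use on a live system (empty "=" headers suppress the header line):
# ps -e -o rss= -o pid= -o command= --sort=-rss | head | rss_to_mib
```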

User Processes
--------------

Process vs Threads:

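A process is an executing program with its own memory space; processes do not
share memory with each other. Threads live inside a process and share that
process's memory space.

Threads are also called LWPs (light-weight processes) in Linux. Each thread gets
its ID from the same range as PIDs, so threads also count against kernel.pid_max.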

/proc/sys/kernel/pid_max
- upper limit for PID values (PIDs are allocated below this number)
- PIDs are assigned sequentially; when the limit is reached, the counter wraps back to the beginning
- if no free PID is available, no more processes or threads can be created
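
To put the limit in context, it can help to express the number of tasks in use
as a percentage of pid_max. The sketch below is only an illustration:
pid_usage_pct is a hypothetical helper name, and the commented lines assume a
Linux host with procps installed.

```shell
# pid_usage_pct USED MAX: print the integer percentage of the PID range in use.
pid_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Possible use on a live system:
# used=$(ps -eL --no-headers | wc -l)      # one line per thread (LWP)
# max=$(cat /proc/sys/kernel/pid_max)
# echo "PID usage: $(pid_usage_pct "$used" "$max")%"
```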


Determining current PID usage (to compare against pid_max)
# using sar
see "plist-sz" in the output of "sar -q"

# using ps (-L or -T lists threads, i.e. LWPs, one per line; the count includes the header line)
ps -eL | wc -l
ps -eT | wc -l

Determining number of user processes
# via ps
ps h -Led -o user | sort | uniq -c | sort -n
ps h -Lu root | wc -l


CPU Loads
---------

Checking under "sar -q"
- a high "runq-sz" means many processes are waiting in the run queue for CPU time
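
What counts as "high" depends on the machine. A common rule of thumb, which is
an assumption here and not something sar reports, is to compare runq-sz with the
number of CPUs. The helper name runq_exceeds_cpus is made up for this sketch.

```shell
# runq_exceeds_cpus RUNQ NCPU: succeed when the run queue is longer than
# the number of CPUs (a rule of thumb, not a hard limit).
runq_exceeds_cpus() {
    [ "$1" -gt "$2" ]
}

# Possible use, with a runq-sz value read off "sar -q":
# runq_exceeds_cpus 12 "$(nproc)" && echo "run queue is longer than the CPU count"
```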

Thursday, December 27, 2018

Multipath


Basics
------

- configuration is stored in /etc/multipath.conf

Things to note in modifying multipath.conf
------------------------------------------

1. Make sure letters in WWIDs are in lower case. If not, multipath will blacklist them.
2. Make sure a WWID does not show up with more than one size. If it does, delete the other device.
   multipath rejects the WWID when it sees a conflict like this:
   reject: ibmdata1 (3600507680c810200e000000000000027) undef IBM,2145
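
The lower-case rule in item 1 is easy to check before reloading multipath. The
sketch below assumes WWIDs appear on lines of the form "wwid VALUE" in
multipath.conf; find_uppercase_wwids is a hypothetical helper name.

```shell
# find_uppercase_wwids: read multipath.conf text on stdin and print
# "line number: value" for every wwid line containing an upper-case letter.
find_uppercase_wwids() {
    LC_ALL=C awk '$1 == "wwid" && $2 ~ /[A-Z]/ { print NR ": " $2 }'
}

# Possible use:
# find_uppercase_wwids < /etc/multipath.conf
```

No output means every wwid value is already lower case.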

Parts of the "multipath -ll" output
-----------------------------------

data3 (360000970000198701142533030413536) dm-4 EMC,SYMMETRIX --> data3 is the multipath device
size=512G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 2:0:1:4 sdj 8:144  active ready running --> path #1
  |- 2:0:0:4 sde 8:64   active ready running --> path #2
  |- 1:0:0:4 sdo 8:224  active ready running --> path #3
  `- 1:0:1:4 sdt 65:48  active ready running --> path #4

Multipath in Virtual Machines
-----------------------------

In virtual machines, even if direct disk addressing (in VMware terms, raw device
mapping) is used, the underlying virtualisation hypervisor handles multipathing
and masks it from the client virtual machine. Thus the virtual machine itself
doesn't need to do anything about it.

Tutorials
---------

Determining LUN ID
Inspect the "multipath -ll" output. The last number of each path address
(X:X:X:X, i.e. host:channel:target:LUN) is the LUN ID.
That LUN ID is the one you will see on the storage array.
 
U01 (3624a9370b15fcb83b6a947a00001d5e7) dm-2 PURE    ,FlashArray     
size=150G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 2:0:0:2 sdk 8:160 active ready running
  |- 2:0:1:2 sdo 8:224 active ready running
  |- 1:0:0:2 sdc 8:32  active ready running
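
Reading the LUN ID off each path can be scripted. The sketch below assumes the
path-line layout shown above (an X:X:X:X address followed by the sd device);
lun_ids is a hypothetical helper name.

```shell
# lun_ids: read "multipath -ll" output on stdin and print "<device> LUN <id>"
# for every path line.
lun_ids() {
    awk '{
        for (i = 1; i < NF; i++) {
            # A path address looks like host:channel:target:LUN.
            if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/) {
                n = split($i, hctl, ":")
                print $(i + 1), "LUN", hctl[n]
            }
        }
    }'
}

# Possible use:
# multipath -ll | lun_ids
```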

Troubleshooting Docker Issues



Corrupted DB
Issue:

The following message appears when starting docker:
updating the store state of sandbox failed: failed to update store for object type *libnetwork.sbState: json: cannot unmarshal string into Go struct field sbState.ExtDNS of type libnetwork.extDNSEntry

Solution:
 
systemctl stop docker
mv /var/lib/docker/network/files/local-kv.db /root/corrupted-local-kv.db
systemctl start docker

Can't start docker
Issue:
docker fails to start. Check "journalctl -u docker.service" to see what is happening:

Mar 16 13:46:53 gdc-co-ragent01 dockerd[1403]: can't create unix socket /var/run/docker.sock: is a directory
Mar 16 13:46:53 gdc-co-ragent01 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 16 13:46:53 gdc-co-ragent01 systemd[1]: Failed to start Docker Application Container Engine. 

Resolution:
rm -fr /var/run/docker.sock
systemctl start docker

Phantom containers prevent docker from running containers with the same name
8a81cf7cb9f3fd03e0139743d3616eb0e/kill returned error: Cannot kill container 78d3e2cc93abc053238e0edd5765f428a81cf7cb9f3fd03e0139743d3616eb0e: Container 78d3e2cc93abc053238e0edd5765f428a81cf7cb9f3fd03e0139743d3616eb0e is not running"

Resolution:
1. check for those phantom containers: docker ps -a
2. remove them: docker rm -f CONTAINER_ID

Containers cannot ping outside IPs
Things you might want to check:
1. Make sure /proc/sys/net/ipv4/ip_forward is set to 1
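
A small check for the forwarding flag might look like the sketch below. The
helper name ip_forward_enabled is made up; the optional file argument exists
only so the check can be tried against a copy of the file.

```shell
# ip_forward_enabled [FILE]: succeed if the forwarding flag file contains 1.
# FILE defaults to /proc/sys/net/ipv4/ip_forward.
ip_forward_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}")" = "1" ]
}

# Possible use (as root):
# ip_forward_enabled || sysctl -w net.ipv4.ip_forward=1
```

Note that "sysctl -w" only changes the running kernel; the setting is lost on
reboot unless it is also put in /etc/sysctl.conf or a file under /etc/sysctl.d.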

No more space left on thin pool
Resolution:
Extend the thinpool:
  lvextend -L +10G /dev/mapper/base-thinpool

Error on docker-compose
Running docker-compose as a container fails with the following error:

Couldn't find `docker` binary. You might need to install Docker

Resolution:
Create a .env file in the directory where you run docker-compose and put the following in it:
COMPOSE_INTERACTIVE_NO_CLI=1

Cannot log in to docker registry due to missing port
Issue:

The following message appears when you log in to the docker registry:

Error response from daemon: Get https://gitlab.xyz.com:4567/v2/: Get https://gitlab.xyz.com/jwt/auth?account=root&client_id=docker&offline_token=true&service=container_registry: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (Client.Timeout exceeded while awaiting headers)

Resolution:
Make sure the docker registry port (4567/tcp in the message above) and 443/tcp are allowed from source to destination

https://medium.com/@dan.lindow/docker-login-error-awaiting-headers-3fe01e2a1e2f

Resource temporarily unavailable due to TasksMax setting
- try adding "TasksMax=infinity" to the [Service] section of docker.service
- run "systemctl daemon-reload" and restart docker
- monitor for recurrence of the issue
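
Instead of editing docker.service directly, the same setting can go into a
systemd drop-in file, which survives package updates. This is a sketch under
assumptions: write_tasksmax_dropin and the file name tasksmax.conf are made up
for this note; the drop-in directory is the usual systemd location for
docker.service overrides.

```shell
# write_tasksmax_dropin DIR: create DIR/tasksmax.conf containing a
# [Service] override that sets TasksMax=infinity.
write_tasksmax_dropin() {
    mkdir -p "$1" &&
    printf '[Service]\nTasksMax=infinity\n' > "$1/tasksmax.conf"
}

# Possible use (as root on the docker host):
# write_tasksmax_dropin /etc/systemd/system/docker.service.d
# systemctl daemon-reload
# systemctl restart docker
```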

https://success.docker.com/article/how-to-reserve-resource-temporarily-unavailable-errors-due-to-tasksmax-setting
https://github.com/chef-cookbooks/docker/issues/871

Wednesday, December 26, 2018

Linux Kernel


Runlevels
---------

0 -- shutdown/halt the system
1 -- single-user mode; usually aliased as s or S
2 -- multiuser mode w/o networking
3 -- multiuser mode w/ networking
4 -- unused
5 -- multiuser mode w/ networking and X Window System
6 -- reboot the system

Kernel Headers
--------------

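  - C header files that describe the kernel's interfaces
  - not installed by default
  - needed to compile software against the kernel (on RHEL, building kernel
    modules also needs the kernel-devel package)
  - package name: kernel-headers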

Modules
-------

Check if a module is built into the kernel (replace MODULE_NAME with the module to look for)
grep MODULE_NAME /lib/modules/$(uname -r)/modules.builtin
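
A plain grep also matches partial names, so a stricter check can anchor on the
file name. The sketch below assumes modules.builtin lists one path per line
ending in NAME.ko; module_is_builtin is a hypothetical helper name, and the
optional file argument is there only for trying it on a copy of the file.

```shell
# module_is_builtin NAME [FILE]: succeed if NAME.ko is listed in FILE.
# FILE defaults to the modules.builtin of the running kernel.
module_is_builtin() {
    grep -q "/$1\.ko\$" "${2:-/lib/modules/$(uname -r)/modules.builtin}"
}

# Possible use:
# module_is_builtin ext4 && echo "ext4 is built in"
```

Module names may be written with "-" or "_" depending on the tool, so a miss is
worth re-checking with the other spelling.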

Crash Dumps
-----------

kdump uses kexec to boot a second (capture) kernel that saves the first kernel's
memory when the system crashes.

/var/crash - default location of dump files (vmcore)

Tutorials
---------

Booting a RHEL VM into rescue mode
1. power off the VM
2. edit the VM settings and set it to boot into the BIOS on next startup
3. power on the VM
4. when you see the "boot: " prompt, type "linux rescue"
5. once the rescue environment finishes booting, choose a language to use
6. choose a keyboard layout to use
7. wait for the network interfaces to be located and activate them, so that
   requested data can be transferred to another host (sometimes this doesn't
   work; if so, try restarting the NIC in vSphere)
8. the rescue environment will try to find the current Red Hat Enterprise Linux
   installation on the system; select "Continue"


RKE (Rancher Kubernetes Engine)


Introduction
------------

- short for Rancher Kubernetes Engine
- lightweight installer of K8s on bare-metal and virtual machines
- solves a common issue with K8s installation -- complexity

Commands
--------

rke -d up --ignore-docker-version --config rancher-cluster.yml
rke etcd snapshot-save --name rancher_snapshot.b --config rancher-cluster.yml --ignore-docker-version

Tutorials
---------

Spinning up a K8s cluster
CentOS 7.5
Docker 17.03
Kubernetes 1.11

1. Download RKE binary
2 ... TBD

Removing a node
- comment out the node information in cluster.yml
- execute "rke up"

Troubleshooting
---------------

ssh: rejected: administratively prohibited
- update openssh to 7.4 and docker to v1.12.6
- set "AllowTcpForwarding yes" and "PermitTunnel yes" in /etc/ssh/sshd_config,
  then restart the sshd service
- make sure the host that runs rke can ssh to all nodes without a password
- run "groupadd docker" to create the docker group if it does not exist
- run "useradd -g docker yourusername" to create the user yourusername and set
  its group to docker
- set MountFlags=shared in docker.service (vi /xxx/xxx/docker.service)
- run "su yourusername" to switch to that user, then restart the docker
  service, so that docker.sock is created at /var/run/docker.sock in that
  user's session
- in cluster.yml, set the ssh user to yourusername (in the hosts setup)

https://github.com/rancher/rke/issues/93

FATA[0088] [workerPlane] Failed to bring up Worker Plane: Can't remove Docker container [rke-log-linker] for host [sample.host]: Error response from daemon: Driver overlay failed to remove root filesystem ABCD: remove /var/lib/docker/overlay/EFGH/merged: device or resource busy
- just retry "./rke up"