Something Broke, How Do I Fix It?
Lessons learned from a Linux performance specialist
12/20/2017 12:30:55 AM
By Tanya Buchanan
Debugging Linux issues can be frustrating. Many times, it becomes a lesson in how well you can tailor your internet searches to find a resolution. I have been working on Linux performance for over a decade and have hit many of these common stumbling blocks. I would like to share some general steps that can make tackling these problems more manageable.
To level set, while these recommendations may be suited for many different platforms, my specific experience is with Linux on IBM Z. The Linux OS is brought up on an LPAR and is usually referred to as an “image.” Any reference to “image” refers to this specific instance of the Linux OS.
Here are three rules to consider when debugging:
1. Determine if a rescue image is required.
There are some debug tasks that can be completed on the image and others that will require a rescue image or a special boot mode. For example, if there is a problem with the root filesystem that requires a filesystem check, that task can’t be completed while the filesystem is mounted. It needs to be done with the image in “maintenance” or “emergency” mode without the filesystem mounted, or from another image that has access to the storage.
2. Know if any recent changes were made to the image.
Many times, Linux images used for work purposes are shared by a team, so it’s important to accurately track any significant updates. When a problem arises, it’s logical to check the last change made, but that is only possible if the change was communicated. For example, if you’re unable to log on to the image, the cause could be as simple as someone having changed the IP address.
3. Know who has access to the image.
For the same reason as stated above regarding shared images, it’s important to track who can access an image and which tasks each person can execute.
Now let’s review some specific issues.
You open your favorite tool to connect to your Linux image and you get the dreaded “Network Error: …” message. There are several reasons why this may happen. My approach is to eliminate the easy ones first, with questions like: Is the IP address correct? Am I using the right port to connect?
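As a quick way to rule out the easy causes, the address and port can be tested from another machine before digging deeper. The following is a minimal sketch, not a definitive tool; the function name `can_connect` is made up for illustration, and it only tells you whether a TCP connection can be opened, not why it failed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable hosts and name errors.
        return False
```

For example, `can_connect("192.0.2.10", 22)` would test the usual SSH port on a placeholder address; substitute the address and port you actually expect the image to use.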
If those aren’t the problem, then we move to secondary debug steps. With Linux on IBM Z, we have the option to log in to an image using the “Operating System Messages Console” from the HMC. Open this console and log in from there. The OS console has some limitations: it is line-oriented, so full-screen tools like the vi visual editor can’t be used for editing. Instead, a stream editor like sed is needed.
First, scan the OS messages or check the logs under /var/log/ to see if any network errors jump out. It’s also important to determine whether this is an availability issue or a configuration issue. You can check which devices are available using a command like “lsqeth,” which lists qeth-based network devices. If the expected device is not listed, it was not successfully brought online. Provided the LPAR has access to the device, there are steps that can be taken to bring it online manually. If the device can’t be brought online, further assistance from the network support team will be required.
If the device is available, the next step is to issue the “ifconfig” command (or similar) to display the configured network interfaces. Check the output. If, for example, you see the network interface but its IP address is blank, there is likely a problem with the configuration files. Verify that all the required files exist and that their contents are correct.
If the device is configured correctly but you are still unable to connect, it is likely a routing issue. Linux commands are available to check routing (e.g., route -n), and temporary or permanent routes can be added.
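The order of checks described above can be sketched as a small decision helper, together with a parser that pulls default routes out of `route -n` output. This is only an illustration: the function names, the sample output, the interface name and the addresses are all assumptions, and a missing default route is just one of several possible routing problems.

```python
from typing import List, Tuple

# Illustrative `route -n` output; addresses and interface name are placeholders.
SAMPLE_ROUTE_OUTPUT = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.0.2.1       0.0.0.0         UG    0      0        0 eth0
192.0.2.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
"""


def default_gateways(route_output: str) -> List[Tuple[str, str]]:
    """Return (gateway, interface) pairs for default routes in `route -n` output."""
    found = []
    for line in route_output.splitlines():
        fields = line.split()
        # Data rows have 8 columns; destination and genmask 0.0.0.0 mark a default route.
        if len(fields) == 8 and fields[0] == "0.0.0.0" and fields[2] == "0.0.0.0":
            found.append((fields[1], fields[7]))
    return found


def next_step(device_listed: bool, ip_assigned: bool) -> str:
    """Map the two checks from the text to the next debug step."""
    if not device_listed:
        return "Availability issue: bring the device online or contact network support."
    if not ip_assigned:
        return "Likely a configuration issue: verify the network configuration files."
    return "Likely a routing issue: check the routing table."
```

With the sample above, `default_gateways(SAMPLE_ROUTE_OUTPUT)` returns one entry for the placeholder gateway; an empty list would suggest that no default route is configured.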
Note that each Linux distribution may keep its network configuration files in a different location. Refer to the distribution’s documentation.
File Transfer Issues
Moving or copying files from one Linux image to another is a regular and important task for most Linux users. What happens if you try to transfer a file and you get an error? Several file transfer options are available (e.g., FTP and SCP), and the specific steps to take depend on the protocol.
This might seem obvious, but make sure the software is installed. These software packages may or may not be available depending on the type of installation performed. Some common reasons for file transfer issues are a firewall blocking the transfer, incompatible software on the two images, or user authorization issues (e.g., FTP using root is disabled by default, or the user is trying to access a directory without permission).
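Checking that the transfer software is installed can be done in a few lines. This is a minimal sketch; the helper name `missing_tools` is made up for illustration, and it only checks whether a command is on the current user's PATH, not whether a server-side service is running.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the commands from `tools` that are not found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in tools if shutil.which(tool) is None]
```

For example, `missing_tools(["scp", "sftp", "ftp"])` lists whichever of those clients would need to be installed on the image where it is run.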
The inability to successfully boot an image is a showstopper. While it’s impossible to list every issue that causes this along with its resolution, I can share a few that I’ve come across.
If a Linux image is unable to boot successfully, it usually comes up in “Emergency Mode.” This allows access to some features, but typically network connectivity and some storage are unavailable.
This is another situation where being able to access the OS Messages console is useful.
Start by scanning the logs. Many times, missing storage and filesystem errors cause an unsuccessful boot. In the case of a filesystem error, a filesystem check can be done right from the OS Messages console for non-root filesystems while in “Emergency” mode. However, if the root filesystem (the one mounted at the “/” mount point) is corrupted, a rescue image will be required to run the filesystem check. In addition, if the filesystem resides on LVM, additional steps will be required to find all the partitions and devices that belong to it and make them available on the rescue image in order to debug further.
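Because a filesystem check should not be run on a mounted filesystem, it can help to confirm the mount state first. The sketch below parses text in the /proc/mounts format (device, mount point, type, options); the function names and the sample device names are assumptions for illustration. Note that the kernel may list a device under a different name than the one you expect (for example, a device-mapper path), so an empty result is not proof that the storage is unused.

```python
from pathlib import Path
from typing import Optional

# Illustrative content in /proc/mounts format; device names are placeholders.
SAMPLE_MOUNTS = """\
/dev/dasda1 / ext4 rw,relatime 0 0
/dev/dasdb1 /data xfs rw,relatime 0 0
"""


def mount_point_of(device: str, mounts_text: str) -> Optional[str]:
    """Return where `device` is mounted according to `mounts_text`, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == device:
            return fields[1]
    return None


def read_mounts(path: str = "/proc/mounts") -> str:
    """Read the live mount table on a running Linux system."""
    return Path(path).read_text()
```

On a running image, `mount_point_of("/dev/dasdb1", read_mounts())` returning `None` suggests the device is not mounted under that name and a check can be considered.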
A Linux image will also boot in emergency mode if there are incorrect options in files accessed at boot time, such as zipl.conf, grub.conf or dasd.conf. If the problem was a typo or missing information, these files can be updated using sed and the image rebooted. However, if a change was made to the kernel (e.g., the image was upgraded from SUSE Linux Enterprise Server V12 SP2 to V12 SP3) and the correct steps were not followed to update the kernel files, a simple edit and reboot won’t suffice. Ensure the root and boot filesystems are available, repeat the missing steps, then reboot. These steps vary by distribution.
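The kind of non-interactive, line-by-line edit that a stream editor performs can also be sketched in Python, which may be convenient on a rescue image. This is an illustration only: the function name is made up, the behavior is roughly that of a global substitution applied to every line, and Python's regular expression syntax differs in places from sed's.

```python
import re
from pathlib import Path


def substitute_in_file(path: str, pattern: str, replacement: str) -> int:
    """Replace every match of `pattern` in the file; return the number of lines changed."""
    target = Path(path)
    lines = target.read_text().splitlines(keepends=True)
    new_lines = [re.sub(pattern, replacement, line) for line in lines]
    changed = sum(1 for old, new in zip(lines, new_lines) if old != new)
    if changed:
        # Only rewrite the file when something actually changed.
        target.write_text("".join(new_lines))
    return changed
```

A return value of 0 means the pattern did not match, which is worth checking before rebooting on the assumption that the typo was fixed.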
Probably the most common cause of my boot issues is a filesystem that was corrupted when an image was unknowingly booted on two different LPARs at the same time. For shared images, communication is the key to preventing this. A script that runs at boot time can also be written to guard against it, but by default the OS doesn’t verify whether a filesystem is already mounted elsewhere before booting. If this does happen, the steps described above for running the filesystem check can be followed.
Note that running a filesystem check, while useful, doesn’t always leave things in the exact state they were in before the error. I had an issue where repairing a filesystem deleted the /etc/shadow file, which holds encrypted passwords. This prevented me from logging in, which in turn required additional debugging. To resolve it, I mounted the filesystem on a rescue image and copied the shadow file from the rescue image to my image.
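One way such a boot-time guard might work is with a marker file that records which system currently owns the image. The sketch below is an assumption about how this could be done, not a description of an existing tool: the function names, the marker file and the `system_id` value (for example, the LPAR name) are all hypothetical, the check is not atomic, and a stale marker left behind by a crash would have to be cleared by hand.

```python
from pathlib import Path


def claim_image(marker: Path, system_id: str) -> bool:
    """Return True if `system_id` may boot the image, recording the claim in `marker`."""
    if marker.exists():
        owner = marker.read_text().strip()
        if owner and owner != system_id:
            # Another system already claimed this image; do not continue booting.
            return False
    marker.write_text(system_id + "\n")
    return True


def release_image(marker: Path, system_id: str) -> None:
    """Remove the marker at clean shutdown, but only if `system_id` owns it."""
    if marker.exists() and marker.read_text().strip() == system_id:
        marker.unlink()
```

In this design the boot script would call `claim_image` early and stop the boot when it returns `False`, and a shutdown script would call `release_image`. How the identifier is obtained and where the marker lives are left to the caller.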
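After a repair, it can be worth checking that key files still exist before rebooting. The following is a minimal sketch; the function name, the mount point and the list of files are assumptions, and you would substitute the files that matter on your own image.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(root: str, expected: Iterable[str]) -> List[str]:
    """Return the paths from `expected` that are not regular files under `root`."""
    root_path = Path(root)
    # Strip the leading "/" so each path is resolved relative to the mounted root.
    return [p for p in expected if not (root_path / p.lstrip("/")).is_file()]
```

For example, with the repaired filesystem mounted at a hypothetical /mnt/repair on the rescue image, `missing_files("/mnt/repair", ["/etc/passwd", "/etc/shadow", "/etc/fstab"])` would list any of those files that did not survive the repair.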
Back up Before Moving Forward
Remember these five items when debugging:
1. Who has access to the image?
2. What was last changed on the image?
3. Where on the system were changes made?
4. When were the changes made?
5. Why were the changes made?
In addition, before updating an important file, make a backup. It’s also key to know your important system files and where they live.
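Making a backup before an edit can be reduced to a single call. This is a minimal sketch with a made-up function name and naming scheme; it places a timestamped copy next to the original so that several generations can coexist.

```python
import shutil
import time
from pathlib import Path


def backup_file(path: str) -> Path:
    """Copy `path` to a timestamped sibling file and return the backup's path."""
    src = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}.{stamp}.bak")
    # copy2 also tries to preserve timestamps and permission bits.
    shutil.copy2(src, dest)
    return dest
```

Calling `backup_file` on a configuration file right before changing it leaves a copy that can be restored if the edit makes things worse.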
Probably the most important advice I can give is: Document, document, document. If it happens once, it will likely happen again. Documenting solutions helps to decrease debug time.
Tanya Buchanan is a Software Engineer, specializing in performance with 16 years of experience on IBM Z. She holds a Bachelor of Computer Science from Herbert H. Lehman College (CUNY) and a master’s degree in Information Technology from Rensselaer Polytechnic Institute. She is currently a member of the IBM Blockchain Platform Performance Team in Poughkeepsie, NY. Her hometown is Kingston, Jamaica, and her interests include travel, Zumba and obsessively browsing Amazon.com.