Petter Reinholdtsen

Entries tagged "sysadmin".

Some notes on fault tolerant storage systems
1st November 2017

If you care about how fault tolerant your storage is, you might find these articles and papers interesting. They have formed how I think of when designing a storage system.

Several of these research papers are based on data collected from hundred thousands or millions of disk, and their findings are eye opening. The short story is simply do not implicitly trust RAID or redundant storage systems. Details matter. And unfortunately there are few options on Linux addressing all the identified issues. Both ZFS and Btrfs are doing a fairly good job, but have legal and practical issues on their own. I wonder how cluster file systems like Ceph do in this regard. After all, there is an old saying, you know you have a distributed system when the crash of a computer you have never heard of stops you from getting any work done. The same holds true if fault tolerance do not work.

Just remember, in the end, it do not matter how redundant, or how fault tolerant your storage is, if you do not continuously monitor its status to detect and replace failed disks.

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english, raid, sysadmin.
Detecting NFS hangs on Linux without hanging yourself...
9th March 2017

Over the years, administrating thousand of NFS mounting linux computers at the time, I often needed a way to detect if the machine was experiencing NFS hang. If you try to use df or look at a file or directory affected by the hang, the process (and possibly the shell) will hang too. So you want to be able to detect this without risking the detection process getting stuck too. It has not been obvious how to do this. When the hang has lasted a while, it is possible to find messages like these in dmesg:

nfs: server nfsserver not responding, still trying
nfs: server nfsserver OK

It is hard to know if the hang is still going on, and it is hard to be sure looking in dmesg is going to work. If there are lots of other messages in dmesg the lines might have rotated out of site before they are noticed.

While reading through the nfs client implementation in linux kernel code, I came across some statistics that seem to give a way to detect it. The om_timeouts sunrpc value in the kernel will increase every time the above log entry is inserted into dmesg. And after digging a bit further, I discovered that this value show up in /proc/self/mountstats on Linux.

The mountstats content seem to be shared between files using the same file system context, so it is enough to check one of the mountstats files to get the state of the mount point for the machine. I assume this will not show lazy umounted NFS points, nor NFS mount points in a different process context (ie with a different filesystem view), but that does not worry me.

The content for a NFS mount point look similar to this:

[...]
device /dev/mapper/Debian-var mounted on /var with fstype ext3
device nfsserver:/mnt/nfsserver/home0 mounted on /mnt/nfsserver/home0 with fstype nfs statvers=1.1
        opts:   rw,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,soft,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=129.240.3.145,mountvers=3,mountport=4048,mountproto=udp,local_lock=all
        age:    7863311
        caps:   caps=0x3fe7,wtmult=4096,dtsize=8192,bsize=0,namlen=255
        sec:    flavor=1,pseudoflavor=1
        events: 61063112 732346265 1028140 35486205 16220064 8162542 761447191 71714012 37189 3891185 45561809 110486139 4850138 420353 15449177 296502 52736725 13523379 0 52182 9016896 1231 0 0 0 0 0 
        bytes:  166253035039 219519120027 0 0 40783504807 185466229638 11677877 45561809 
        RPC iostats version: 1.0  p/v: 100003/3 (nfs)
        xprt:   tcp 925 1 6810 0 0 111505412 111480497 109 2672418560317 0 248 53869103 22481820
        per-op statistics
                NULL: 0 0 0 0 0 0 0 0
             GETATTR: 61063106 61063108 0 9621383060 6839064400 453650 77291321 78926132
             SETATTR: 463469 463470 0 92005440 66739536 63787 603235 687943
              LOOKUP: 17021657 17021657 0 3354097764 4013442928 57216 35125459 35566511
              ACCESS: 14281703 14290009 5 2318400592 1713803640 1709282 4865144 7130140
            READLINK: 125 125 0 20472 18620 0 1112 1118
                READ: 4214236 4214237 0 715608524 41328653212 89884 22622768 22806693
               WRITE: 8479010 8494376 22 187695798568 1356087148 178264904 51506907 231671771
              CREATE: 171708 171708 0 38084748 46702272 873 1041833 1050398
               MKDIR: 3680 3680 0 773980 993920 26 23990 24245
             SYMLINK: 903 903 0 233428 245488 6 5865 5917
               MKNOD: 80 80 0 20148 21760 0 299 304
              REMOVE: 429921 429921 0 79796004 61908192 3313 2710416 2741636
               RMDIR: 3367 3367 0 645112 484848 22 5782 6002
              RENAME: 466201 466201 0 130026184 121212260 7075 5935207 5961288
                LINK: 289155 289155 0 72775556 67083960 2199 2565060 2585579
             READDIR: 2933237 2933237 0 516506204 13973833412 10385 3190199 3297917
         READDIRPLUS: 1652839 1652839 0 298640972 6895997744 84735 14307895 14448937
              FSSTAT: 6144 6144 0 1010516 1032192 51 9654 10022
              FSINFO: 2 2 0 232 328 0 1 1
            PATHCONF: 1 1 0 116 140 0 0 0
              COMMIT: 0 0 0 0 0 0 0 0

device binfmt_misc mounted on /proc/sys/fs/binfmt_misc with fstype binfmt_misc
[...]

The key number to look at is the third number in the per-op list. It is the number of NFS timeouts experiences per file system operation. Here 22 write timeouts and 5 access timeouts. If these numbers are increasing, I believe the machine is experiencing NFS hang. Unfortunately the timeout value do not start to increase right away. The NFS operations need to time out first, and this can take a while. The exact timeout value depend on the setup. For example the defaults for TCP and UDP mount points are quite different, and the timeout value is affected by the soft, hard, timeo and retrans NFS mount options.

The only way I have been able to get working on Debian and RedHat Enterprise Linux for getting the timeout count is to peek in /proc/. But according to Solaris 10 System Administration Guide: Network Services, the 'nfsstat -c' command can be used to get these timeout values. But this do not work on Linux, as far as I can tell. I asked Debian about this, but have not seen any replies yet.

Is there a better way to figure out if a Linux NFS client is experiencing NFS hangs? Is there a way to detect which processes are affected? Is there a way to get the NFS mount going quickly once the network problem causing the NFS hang has been cleared? I would very much welcome some clues, as we regularly run into NFS hangs.

Tags: debian, english, sysadmin.
Debian Jessie, PXE and automatic firmware installation
17th October 2014

When PXE installing laptops with Debian, I often run into the problem that the WiFi card require some firmware to work properly. And it has been a pain to fix this using preseeding in Debian. Normally something more is needed. But thanks to my isenkram package and its recent tasksel extension, it has now become easy to do this using simple preseeding.

The isenkram-cli package provide tasksel tasks which will install firmware for the hardware found in the machine (actually, requested by the kernel modules for the hardware). (It can also install user space programs supporting the hardware detected, but that is not the focus of this story.)

To get this working in the default installation, two preeseding values are needed. First, the isenkram-cli package must be installed into the target chroot (aka the hard drive) before tasksel is executed in the pkgsel step of the debian-installer system. This is done by preseeding the base-installer/includes debconf value to include the isenkram-cli package. The package name is next passed to debootstrap for installation. With the isenkram-cli package in place, tasksel will automatically use the isenkram tasks to detect hardware specific packages for the machine being installed and install them, because isenkram-cli contain tasksel tasks.

Second, one need to enable the non-free APT repository, because most firmware unfortunately is non-free. This is done by preseeding the apt-mirror-setup step. This is unfortunate, but for a lot of hardware it is the only option in Debian.

The end result is two lines needed in your preseeding file to get firmware installed automatically by the installer:

base-installer base-installer/includes string isenkram-cli
apt-mirror-setup apt-setup/non-free boolean true

The current version of isenkram-cli in testing/jessie will install both firmware and user space packages when using this method. It also do not work well, so use version 0.15 or later. Installing both firmware and user space packages might give you a bit more than you want, so I decided to split the tasksel task in two, one for firmware and one for user space programs. The firmware task is enabled by default, while the one for user space programs is not. This split is implemented in the package currently in unstable.

If you decide to give this a go, please let me know (via email) how this recipe work for you. :)

So, I bet you are wondering, how can this work. First and foremost, it work because tasksel is modular, and driven by whatever files it find in /usr/lib/tasksel/ and /usr/share/tasksel/. So the isenkram-cli package place two files for tasksel to find. First there is the task description file (/usr/share/tasksel/descs/isenkram.desc):

Task: isenkram-packages
Section: hardware
Description: Hardware specific packages (autodetected by isenkram)
 Based on the detected hardware various hardware specific packages are
 proposed.
Test-new-install: show show
Relevance: 8
Packages: for-current-hardware

Task: isenkram-firmware
Section: hardware
Description: Hardware specific firmware packages (autodetected by isenkram)
 Based on the detected hardware various hardware specific firmware
 packages are proposed.
Test-new-install: mark show
Relevance: 8
Packages: for-current-hardware-firmware

The key parts are Test-new-install which indicate how the task should be handled and the Packages line referencing to a script in /usr/lib/tasksel/packages/. The scripts use other scripts to get a list of packages to install. The for-current-hardware-firmware script look like this to list relevant firmware for the machine:

#!/bin/sh
#
PATH=/usr/sbin:$PATH
export PATH
isenkram-autoinstall-firmware -l

With those two pieces in place, the firmware is installed by tasksel during the normal d-i run. :)

If you want to test what tasksel will install when isenkram-cli is installed, run DEBIAN_PRIORITY=critical tasksel --test --new-install to get the list of packages that tasksel would install.

Debian Edu will be pilots in testing this feature, as isenkram is used there now to install firmware, replacing the earlier scripts.

Tags: debian, english, isenkram, sysadmin.
Scripting the Cerebrum/bofhd user administration system using XML-RPC
6th December 2012

Where I work at the University of Oslo, we use the Cerebrum user administration system to maintain users, groups, DNS, DHCP, etc. I've known since the system was written that the server is providing an XML-RPC API, but I have never spent time to try to figure out how to use it, as we always use the bofh command line client at work. Until today. I want to script the updating of DNS and DHCP to make it easier to set up virtual machines. Here are a few notes on how to use it with Python.

I started by looking at the source of the Java bofh client, to figure out how it connected to the API server. I also googled for python examples on how to use XML-RPC, and found a simple example in the XML-RPC howto.

This simple example code show how to connect, get the list of commands (as a JSON dump), and how to get the information about the user currently logged in:

#!/usr/bin/env python
import getpass
import xmlrpclib
server_url = 'https://cerebrum-uio.uio.no:8000';
username = getpass.getuser()
password = getpass.getpass()
server = xmlrpclib.Server(server_url);
#print server.get_commands(sessionid)
sessionid = server.login(username, password)
print server.run_command(sessionid, "user_info", username)
result = server.logout(sessionid)
print result

Armed with this knowledge I can now move forward and script the DNS and DHCP updates I wanted to do.

Tags: english, sysadmin.

RSS Feed

Created by Chronicle v4.6