2017-07-28
V7GaveUsEnvironmentVariables
Another thing V7 Unix gave us is environment variables
Simon Tatham recently wondered "Why is PATH called PATH?". This made me wonder about the closely related question of when environment variables appeared in Unix, and the answer is that the environment and environment variables appeared in V7 Unix, as another of the things that made it so important to Unix history (also).
Up through V6, the exec system call and family of system calls took
two arguments, the path and the argument list; we can see this in
both the V6 exec(2)
manual page and
the implementation of the system call in the kernel. As
bonus trivia, it appears that the V6 exec() limited you to 510
characters of arguments (and probably V1 through V5 had a similarly
low limit, but I haven't looked at their kernel code).
In V7, the exec(2)
manual page now
documents a possible third argument, and the kernel implementation
is much more complex, plus
there's an environ(5)
manual page about it.
Based on h/param.h, V7
also had a much higher size limit on the combined size of arguments
and environment variables, which isn't all that surprising given
the addition of the environment. Commands like login.c were
updated to put some things into the new environment; login sets a
default $PATH and a $HOME, for example, and environ(5)
documents various other uses (which I haven't checked in the source
code).
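To make the change concrete, here is a minimal sketch of what the new three-argument exec looks like, written with the modern execve() prototype rather than V7's actual code; the command and environment strings here are invented for illustration.

/* Pass an explicit environment to the new program; the strings are made up. */
#include <unistd.h>

int main(void) {
    char *argv[] = { "sh", "-c", "echo $HOME", (char *)0 };
    char *envp[] = { "PATH=/bin:/usr/bin", "HOME=/tmp", (char *)0 };
    execve("/bin/sh", argv, envp);
    return 1;   /* only reached if execve() fails */
}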
This implies that the V7 shell is where $PATH first appeared in Unix, where the manual page describes it as 'the search path for commands'. This might make you wonder how the V6 shell handled locating commands, and where it looked for them. The details are helpfully documented in the V6 shell manual page, and I'll just quote what it has to say:
If the first argument is the name of an executable file, it is invoked; otherwise the string `/bin/' is prepended to the argument. (In this way most standard commands, which reside in `/bin', are found.) If no such command is found, the string `/usr' is further prepended (to give `/usr/bin/command') and another attempt is made to execute the resulting file. (Certain lesser-used commands live in `/usr/bin'.)
('Invoked' here is carrying some extra freight, since this may not involve a direct kernel exec of the file. An executable file that the kernel didn't like would be directly run by the shell.)
I suspect that '$PATH' was given such a short name (instead of a longer, more explicit one) simply as a matter of Unix style at the time. Pretty much everything in V7 was terse and short in this style for various reasons, and verbose environment variable names would have eaten into that limited exec argument space.
2017-07-27
ArgparseVariableArgumentCount
Python argparse and the minor problem of a variable valid argument count
Argparse is the standard Python module for handling arguments to command line programs, and because Python makes using things outside the standard library quite annoying for small programs, it's the one I use in my Python-based utility programs. Recently I found myself dealing with a little problem where argparse doesn't have a good answer, partly because you can't nest argument groups.
Suppose, not hypothetically, that you have a program that can
properly take zero, two, or three command line arguments (which are
separate from options), and the command line arguments are of
different types (the first is a string and the second two are
numbers). Argparse makes it easy to handle having either two or
three arguments, no more and no less; the first two arguments have no nargs set, and then the third sets 'nargs="?"'. However, as far as I can see argparse has no direct support for handling the zero-argument case, or rather for forbidding the one-argument one.
(If the first two arguments were of the same type we could easily
gather them together into a two-element list with 'nargs=2', but
they aren't, so we'd have to tell argparse that both are strings
and then try the 'string to int' conversion of the second argument
ourselves, losing argparse's handling of it.)
If you set all three arguments to 'nargs="?"' and give them usable
default values, you can accept zero, two, or three arguments, and
things will work if you supply only one argument (because the second
argument will have a usable default). This is the solution I've
adopted for my particular program because I'm not stubborn enough
to try to roll my own validation on top of argparse, not for a
little personal tool.
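For concreteness, here is a minimal sketch of this approach; the argument names, types, and default values are invented for illustration.

import argparse

parser = argparse.ArgumentParser()
# All three positional arguments are optional and have usable defaults.
parser.add_argument("name", nargs="?", default="something")
parser.add_argument("low", nargs="?", type=int, default=0)
parser.add_argument("high", nargs="?", type=int, default=100)
args = parser.parse_args()
# Zero, two, or three arguments all behave sensibly; a single argument also
# 'works', because the second argument quietly falls back to its default.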
If argparse supported nested groups for arguments, you could potentially make a mutually exclusive argument group that contained two sub-groups, one with nothing in it and one that handled the two and three argument case. This would require argparse not only to support nested groups but to support empty nested groups (and not ignore them), which is at least a little bit tricky.
Alternately, argparse could support a global specification of what numbers of arguments are valid. Or it could support a 'validation' callback that is called with information about what argparse detected and which could signal errors back to argparse, which argparse would then handle in its standard way, giving you uniform argument validation and error text and so on.
2017-07-26
UnixHadToEvolve
Unix had good reasons to evolve since V7 (and had to)
There's a certain sort of person who feels that the platonic ideal of Unix is somewhere around Research Unix V7 and it's almost all been downhill since then (perhaps with the exception of further Research Unixes and then Plan 9, although very few people got their hands on any of them). For all that I like Unix and started using it long ago when it was simpler (although not as far back as V7), I reject this view and think it's completely mistaken.
V7 Unix was simple but it was also limited, both in its implementation (which often took shortcuts) and in its overall features (such as short filenames). Obviously V7 didn't have networking, but it also lacked things that most people now think of as perfectly reasonable and good Unix features, like kernel '#!' support for shell scripts and processes being in multiple groups at once. That V7 was a simple and limited system meant that its choices were to grow to meet people's quite reasonable needs or to fall out of use.
(Some of these needs were for features and some of them were for performance. The original V7 filesystem was quite simple but also suffered from performance issues, ones that often got worse over time.)
I'll agree that the path that the growth of Unix has taken since V7 is not necessarily ideal; we can all point to various things about modern Unixes that we don't like. The particular flaws came about partly because people don't necessarily make ideal decisions and partly because they didn't necessarily have a perfect understanding of the problems at the time they had to do something; and once they'd done something, they were constrained by backward compatibility.
(In some ways Plan 9 represents 'Unix without the constraint of backward compatibility', and while I think there are a variety of reasons that it failed to catch on in the world, that lack of compatibility is one of them. Even if you had access to Plan 9, you had to be fairly dedicated to do your work in a Plan 9 environment (and that was before the web made it worse).)
PS: It's my view that the people who are pushing various Unixes forward aren't incompetent, stupid, or foolish. They're rational and talented people who are doing their best in the circumstances that they find themselves in. If you want to throw stones, don't throw them at the people, throw them at the overall environment that constrains and shapes how everything in this world is pushed to evolve. Unix is far from the only thing shaped in potentially undesirable ways by these forces; consider, for example, C++.
(It's also clear that a lot of people involved in the historical evolution of BSD and other Unixes were really quite smart, even if you don't like, for example, the BSD sockets API.)
2017-07-25
EmacsForcingNoDeiconification
Mostly stopping GNU Emacs from de-iconifying itself when it feels like it
Over on the Fediverse I had a long standing GNU Emacs gripe:
I would rather like to make it so that GNU Emacs never un-iconifies itself when it completes (Lisp-level) actions. If I have Emacs iconified I want it to stay that way, not suddenly appear under my mouse cursor like an extremely large modal popup. (Modal popups suck, they are a relic of single-tasking windowing environments.)
For those of you who use GNU Emacs and have never been unlucky enough to experience this, if you start some long operation in GNU Emacs and then decide to iconify it to get it out of your face, a lot of the time GNU Emacs will abruptly pop itself back open when it finishes, generally with completely unpredictable timing so that it disrupts whatever else you switched to in the mean time.
(This only happens in some X environments. In others, the desktop or window manager ignores what Emacs is trying to do and leaves it minimized in your taskbar.)
To cut straight to the answer, you can avoid a lot of this with the following snippet of Emacs Lisp:
(add-to-list 'display-buffer-alist '(t nil (inhibit-switch-frame . t)))
I believe that this has some side effects, but they should generally be welcome ones: Emacs no longer yanks your mouse focus around or suddenly raises windows to be on top of everything.
GNU Emacs doesn't have a specific function that it calls to de-iconify a frame, which is what Emacs calls a top-level window. Instead, the deiconification happens in C code inside C-level functions like raise-frame and make-frame-visible, which also do other things and which are called from many places. For instance, one of make-frame-visible's jobs is actually displaying the frame's X-level window if it doesn't already exist on the screen.
(There's an iconify-or-deiconify-frame function, but if you look, that's a Lisp function that calls make-frame-visible. It's only used a little bit in the Emacs Lisp code base.)
A determined person could probably hook these C-level functions through advice-add to make them do nothing if they were called on an existing, mapped frame that was just iconified. That would be the elegant way to do what I want. The inelegant way is to discover, via use of the Emacs Lisp debugger, that everything I seem to care about is going through 'display-buffer' (eventually calling window--maybe-raise-frame), and that display-buffer's behavior can be customized to not 'switch frames', which will wind up causing things to not call window--maybe-raise-frame and not de-iconify GNU Emacs windows on me.
To understand display-buffer-alist I relied on Demystifying Emacs’s Window Manager.
My addition to display-buffer-alist has three elements:
- the t tells display-buffer to always use this alist entry.
- the nil tells display-buffer that I don't have any special action functions I want to use here and it should just use its regular ones. I think an empty list might be more proper here, but nil works.
- the '(inhibit-switch-frame . t)' sets the important customization, which will be merged with any other things set by other (matching) alist entries.
The net effect is that 'display-buffer' will see 'inhibit-switch-frame' set for every buffer it's asked to switch to, and so will not de-iconify, raise, or otherwise monkey around with frame things in the process of displaying buffers. It's possible that this will have undesirable side effects in some circumstances, but as far as I can tell things like 'speedbar' and 'C-x 5 <whatever>' still work for me afterward, so new frames are getting created when I want them to be.
(I could change the initial 't' to something more complex, for example to only apply this to MH-E buffers, which is where I mostly encounter the problem. See Demystifying Emacs’s Window Manager for a discussion of how to do this based on the major mode of the buffer.)
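As an illustration of that, here is a sketch of what a mode-restricted entry might look like, using a function as the condition; the specific MH-E mode names are my assumption about which ones would matter.

;; Only inhibit frame switching for buffers in (assumed) MH-E modes.  The
;; condition function is called with the buffer name and the display-buffer action.
(add-to-list 'display-buffer-alist
             '((lambda (buffer-name _action)
                 (with-current-buffer buffer-name
                   (derived-mode-p 'mh-folder-mode 'mh-show-mode)))
               nil
               (inhibit-switch-frame . t)))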
To see if you're affected by this, you can run the following Emacs Lisp in the scratch buffer and then immediately minimize or iconify the window.
(progn (sleep-for 5) (display-buffer "*scratch*"))
If you're affected, the Emacs window will pop back open in a few seconds (five or less, depending on how fast you minimized the window). If the Emacs window stays minimized or iconified, your desktop environment is probably overriding whatever Emacs is trying to do.
For me this generally happens any time some piece of Emacs Lisp
code is taking a long time to get a buffer ready for display and
then calls 'display-buffer' at the end to show the buffer. One
trigger for this is if the buffer to be displayed contains a bunch
of unusual Unicode characters (possibly ones that my font doesn't
have anything for). The first time the characters are used, Emacs
will apparently stall working out how to render them and then
de-iconify itself if I've iconified it out of impatience.
(It's quite possible that there's a better way to do this, and if so I'd love to know about it.)
2017-07-24
DrawingCommandsVsSendingImages
Sending drawing commands to your display server versus sending images
One of the differences between X and Wayland is that in the classical version of X you send drawing commands to the server while in Wayland you send images; this can be called server side rendering versus client side rendering. Client side rendering doesn't preclude a 'network transparent' display protocol, but it does mean that you're shipping around images instead of drawing commands. Is this less efficient? In thinking about it recently, I realized that the answer is that it depends on a number of things.
Let's start out by assuming that the display server and the display clients are equally powerful and capable as far as rendering the graphics goes, so the only question is where the rendering happens (and what makes it better to do it in one place instead of another). The factors that I can think of are:
- How many different active clients (machines) there are; if there are
enough, the active client machines have more aggregate rendering
capacity than the server does. But probably you don't usually have
all that many different clients all doing rendering at once (that
would be a very busy display).
- The number of drawing commands as compared to the size of the rendered result. In an extreme case in favor of client side rendering, a client executes a whole bunch of drawing commands in order to render a relatively small image (or window, or etc). In an extreme case the other way, a client can send only a few drawing commands to render a large image area.
- The amount of input data the drawing commands need compared to the output size of the rendered result. An extreme case in favour of client side rendering is if the client is compositing together a (large) stack of things to produce a single rendered result.
- How efficiently you can encode (and decode) the rendered result
or the drawing commands (and their inputs). There's a tradeoff of space used versus encoding and decoding time, where you may not
be able to afford aggressive encoding because it gets in the
way of fast updates.
What these add up to is the aggregate size of the drawing commands and all of the inputs that they need relative to the rendered result, possibly cleverly encoded on both sides.
- How much changes from frame to frame and how easily you can encode
that in some compact form. Encoding changes in images is a well
studied thing (we call it 'video'), but a drawing command model
might be able to send only a few commands to change a little bit
of what it sent previously for an even bigger saving.
(This is affected by how a server side rendering server holds the information from clients. Does it execute their draw commands then only retain the final result, as X does, or does it hold their draw commands and re-execute them whenever it needs to re-render things? Let's assume it holds the rendered result, so you can draw over it with new drawing commands rather than having to send a new full set of 'draw this from now onward' commands.)
A pragmatic advantage of client side rendering is that encoding image to image changes can be implemented generically after any style of rendering; all you need is to retain a copy of the previous frame (or perhaps more frames than that, depending). In a server rendering model, the client needs specific support for determining a set of drawing operations to 'patch' the previous result, and this doesn't necessarily cooperate with an immediate mode approach where the client regenerates the entire set of draw commands from scratch any time it needs to re-render a frame.
I was going to say that the network speed is important too, but while it matters, what I think it does is magnify or shrink the effect of the relative size of drawing commands compared to the final result. The faster and lower latency your network is, the less it matters if you ship more data in aggregate. On a slow network, it's much more important.
There are probably other things I'm missing, but even with just these I've wound up feeling that the tradeoffs are not as simple and obvious as I believed before I started thinking about it.
(This was sparked by an offhand Fediverse remark and joke.)
2016-07-30
CommandBotGate
A bot that lets you in (or doesn't)
Most bots sit around responding to pings or doing mod nonsense. This one guards
a room. It's the only way in. There's no form, no user list, no invite system—
just a TCP chatroom at 192.168.1.224
on port 8080, where you talk to a bot.
If it likes you, the wall drops. If not, you're ignored.
When you connect, the only thing that talks is the bot. It gives you one line:
[gate] say something useful.
From there, it’s command-based vetting. Not a quiz, more like a puzzle. The bot has a list of known commands, some documented, most not. A few will echo back info, a few do nothing, and a few trigger tests. It's not random. Each command alters an internal score. If you hit the right flow, the bot decides you're worthy and grants access. If you flail, it cuts the socket.
Examples (obviously not real):
!ping !whoami !uptime !auth %temp% !trace --mode=shadow !submitkey 91f3a2b9... !mirrorcheck
Each command is scored. Some are decoys, some penalize. You don’t know which.
Timing matters too. Run them too fast? You trip a rate-limit flag. Too slow?
The gate closes after 60s idle. Some commands require a previous state; for example, you can't run !submitkey until you've completed a challenge from !mirrorcheck.
There’s no retry if you fail. You get one session. Next time, you're treated differently—your IP gets a flag, and your available command set changes. Yes, it mutates per-user. There’s a kind of procedural access flow that unfolds as the bot tracks you across sessions.
Get the flow right and you see:
[gate] access granted.
[relay] syncing session...
[room] welcome.
The point isn’t security. It’s curation. The bot filters not just humans from scripts, but intention from noise. You can’t stumble in. You have to *want* in—and prove it with protocol.
2016-07-23
BashGoodSetEReports
Getting decent error reports in Bash when you're using 'set -e'
Suppose that you have a shell script that's not necessarily complex but is at least long. For reliability, you use 'set -e' so that the script will immediately stop on any unexpected errors from commands, and sometimes this happens. Since this isn't supposed to happen, it would be nice to print some useful information about what went wrong, such as where it happened, what the failing command's exit status was, and what the command was. The good news is that if you're willing to make your script specifically a Bash script, you can do this quite easily.
The Bash trick you need is:
trap 'echo "Exit status $? at line $LINENO from: $BASH_COMMAND"' ERR
This uses three Bash features: the special '$LINENO' and '$BASH_COMMAND' shell variables (which give you the line number and the command being executed when the trap triggers), and the special 'ERR' Bash 'trap' condition that causes your 'trap' statement to be invoked right when 'set -e' is causing your script to fail and exit.
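Put together in a deliberately artificial but self-contained script, it looks something like this:

#!/bin/bash
set -e
trap 'echo "Exit status $? at line $LINENO from: $BASH_COMMAND"' ERR

echo "doing some work"
false               # this fails, so the ERR trap reports this line and command
echo "never reached"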
Using 'ERR' instead of 'EXIT' (or '0' if you're a traditionalist
like me) is necessary in order to get the correct line number in
Bash. If you switch this to 'trap ... EXIT', the line number that
Bash will report is the line that the 'trap' was defined on, not
the line that the failing command is on (although the command being
executed remains the same). This makes a certain amount of sense
from the right angle; the shell is currently on that line as it's
exiting.
As far as I know, no other version of the Bourne shell can do all of this. The OpenBSD version of /bin/sh has a '$LINENO' variable and 'trap ... 0' preserves its value (instead of resetting it to the line of the 'trap'), but it has no access to the current command. The FreeBSD version of /bin/sh resets '$LINENO' to the line of your 'trap ... 0', so the best you can do is report the exit status. Dash, the Ubuntu 24.04 default /bin/sh, doesn't have '$LINENO', effectively putting you in the same situation as FreeBSD.
(On Fedora, /bin/sh is Bash, and the Fedora version of Bash supports all of 'trap ... ERR', $LINENO, and $BASH_COMMAND even when invoked as '#!/bin/sh' by your script. You probably shouldn't count on this; if you want Bash, use '#!/bin/bash'.)
2016-07-22
NFSv4DelegationMandatoryLock
NFS v4 delegations on a Linux NFS server can act as mandatory locks
Over on the Fediverse, I shared an unhappy learning experience:
Linux kernel NFS: we don't have mandatory locks.
Also Linux kernel NFS: if the server has delegated a file to a NFS client that's now not responding, good luck writing to the file from any other machine. Your writes will hang.
NFS v4 delegations are a feature where the NFS server, such as your Linux fileserver, hands a lot of authority over a particular file to a client that is using that file. There are various sorts of delegations, but even a basic read delegation will force the NFS server to recall the delegation if anything else wants to write to the file or to remove it. Recalling a delegation requires notifying the NFS v4 client that it has lost the delegation and then having the client accept and respond to that. NFS v4 clients have to respond to the loss of a delegation because they may be holding local state that needs to be flushed back to the NFS server before the delegation can be released.
(After all, the NFS v4 server promised the client 'this file is yours to fiddle around with, I will consult you before touching it'.)
Under some circumstances, when the NFS v4 server is unable to contact the NFS v4 client, it will simply sit there waiting and as part of that will not allow you to do things that require the delegation to be released. I don't know if there's a delegation recall timeout, although I suspect that there is, and I don't know how to find out what the timeout is, but whatever the value is, it's substantial (it may be the 90 second 'default lease time' from nfsd4_init_leases_net(), or perhaps the 'grace', also probably 90 seconds, or perhaps the two added together).
(90 seconds is not what I consider a tolerable amount of time for my editor to completely freeze when I tell it to write out a new version of the file. When NFS is involved, I will typically assume that something has gone badly wrong well before then.)
As mentioned, the NFS v4 RFC also explicitly notes that NFS v4 clients may have to flush file state in order to release their delegation, and this itself may take some time. So even without an unavailable client machine, recalling a delegation may stall for some possibly arbitrary amount of time (depending on how the NFS v4 server behaves; the RFC encourages NFS v4 servers to not be hasty if the client seems to be making a good faith effort to clear its state). Both the slow client recall and the hung client recall can happen even in the absence of any actual file locks; in my case, the now-unavailable client merely having read from the file was enough to block things.
This blocking recall is effectively a mandatory lock, and it affects both remote operations over NFS and local operations on the fileserver itself. Short of waiting out whatever timeout applies, you have two realistic choices to deal with this (the non-realistic choice is to reboot the fileserver). First, you can bring the NFS client back to life, or at least something that's at its IP address and responds to the server with NFS v4 errors. Second, I believe you can force everything from the client to expire through /proc/fs/nfsd/clients/<ID>, by writing 'expire' to the client's 'ctl' file. You can find the right client ID by grep'ing for something in all of the clients/*/info files.
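As a sketch of that second option (the client IP address here is invented, and this has to be done as root on the fileserver):

# find the client ID directory for the unresponsive client
grep -l 192.0.2.10 /proc/fs/nfsd/clients/*/info
# then expire its state, using the <ID> from the matching path
echo expire > /proc/fs/nfsd/clients/<ID>/ctl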
Discovering this makes me somewhat more inclined than before to consider entirely disabling 'leases', the underlying kernel feature that is used to implement these NFS v4 delegations (I discovered how to do this when investigating NFS v4 client locks on the server). This will also affect local processes on the fileserver, but that now feels like a feature since hung NFS v4 delegation recalls will stall or stop even local operations.
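For reference, as far as I know the knob for that is the fs.leases-enable sysctl, which disables kernel file leases system-wide (and so NFS v4 delegations along with them):

sysctl -w fs.leases-enable=0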
2016-07-21
ProjectsArePeople
Projects can't be divorced from the people involved in them
Among computer geeks, myself included, there's a long running optimistic belief that projects can be considered in isolation and 'evaluated on their own merits', divorced from the specific people or organizations that are involved with them and the culture that they have created. At best, this view imagines that we can treat everyone involved in the development of something as a reliable Vulcan, driven entirely by cold logic with no human sentiment involved. This is demonstrably false (ask anyone about the sharp edge of Linus Torvalds' emails), but convenient, at least for people with privilege.
(A related thing is considering projects in isolation from the organizations that create and run them, for example ignoring that something is from Google, of 'killed by Google' fame.)
Over time, I have come to understand and know that this is false, much like other things I used to accept. The people involved with a project bring with them attitudes and social views, and they create a culture through their actions, their expressed views, and even their presence. Their mere presence matters because it affects other people, and how other people will or won't interact with the project.
(To put it one way, the odds that I will want to be involved in a project run by someone who openly expresses their view that bicyclists are the scum of the earth and should be violently run off the road are rather low, regardless of how they behave within the confines of the project. I'm not a Vulcan myself and so I am not going to be able to divorce my interactions with this person from my knowledge that they would like to see me and my bike club friends injured or dead.)
You can't divorce a project from its culture or its people (partly because the people create and sustain that culture); the culture and the specific people are entwined into how 'the project' (which is to say, the crowd of people involved in it) behaves, and who it attracts and repels. And once established, the culture of a project, like the culture of anything, is very hard to change, partly because it acts as a filter for who becomes involved in the project. The people who create a project gather like-minded people who see nothing wrong with the culture and often act to perpetuate it, unless the project becomes so big and so important that other people force their way in (usually because a corporation is paying them to put up with the initial culture).
(There is culture everywhere. C++ has a culture (or several), for example, as does Rust. Are they good cultures? People have various opinions that I will let you read about yourself.)
2016-07-20
MachineRoomTempTwoSortsOfAlerts
Realizing we needed two sorts of alerts for our temperature monitoring
We have a long standing system to monitor the temperatures of our machine rooms and alert us if there are problems. A recent discussion about the state of the temperature in one of them made me realize that we want to monitor and alert for two different problems, and because they're different we need two different sorts of alerts in our monitoring system.
The first, obvious problem is a machine room AC failure, where the AC shuts off or becomes almost completely ineffective. In our machine rooms, an AC failure causes a rapid and sustained rise in temperature to well above its normal maximum level (which is typically reached just before the AC starts its next cooling cycle). AC failures are high priority issues that we want to alert about rapidly, because we don't have much time before machines start to cook themselves (and they probably won't shut themselves down before the damage has been done).
The second problem is an AC unit that can't keep up with the room's heat load; perhaps its filters are (too) clogged, or it's not getting enough cooling from the roof chillers, or there are various other mysterious AC reasons. The AC hasn't failed and it is still able to cool things to some degree and keep the temperature from racing up, but over time the room's temperature steadily drifts upward. Often the AC will still be cycling on and off to some degree and we'll see the room temperature vary up and down as a result; at other times the room temperature will basically reach a level and more or less stay there, presumably with the AC running continuously.
One issue we ran into is that a fast triggering alert that was implicitly written for the AC failure case can wind up flapping up and down if insufficient AC has caused the room to slowly drift close to its triggering temperature level. As the AC works (and perhaps cycles on and off), the room temperature will shift above and then back below the trigger level, and the alert flaps.
We can't detect both situations with a single alert, so we need at least two. Currently, the 'AC is not keeping up' alert looks for sustained elevated temperatures with the temperature always at or above a certain level over (much) more time than the AC should take to bring it down, even if the AC has to avoid starting for a bit of time to not cycle too fast. The 'AC may have failed' alert looks for high temperatures over a relatively short period of time, although we may want to make this an average over a short period of time.
(The advantage of an average is that if the temperature is shooting up, it may trigger faster than a 'the temperature is above X for Y minutes' alert. The drawback is that an average can flap more readily than a 'must be above X for Y time' alert.)
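To make the distinction concrete, here is how the two alerts might be sketched as Prometheus-style rules; the metric name, thresholds, and durations are all invented for illustration and aren't necessarily what we actually use.

groups:
  - name: machineroom-temperature
    rules:
      # 'AC may have failed': the temperature is well above normal over a short window.
      - alert: MachineRoomACFailure
        expr: avg_over_time(machineroom_temp_celsius[5m]) > 30
        for: 2m
      # 'AC not keeping up': the temperature never drops back down over a period
      # much longer than one AC cooling cycle should take.
      - alert: MachineRoomACNotKeepingUp
        expr: min_over_time(machineroom_temp_celsius[45m]) > 26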
2016-07-19
ChecklistsAreHardButGood
Checklists are hard (but still a good thing)
We recently had a big downtime at work where part of the work was me doing a relatively complex and touchy thing. Naturally I made a checklist, but also naturally my checklist turned out to be incomplete, with some things I'd forgotten and some steps that weren't quite right or complete. This is a good illustration that checklists are hard to create.
Checklists are hard partly because they require us to try to remember, reconstruct, and understand everything in what's often a relatively complex system that is too big for us to hold in our mind. If your understanding is incomplete you can overlook something and so leave out a step or a part of a step, and even if you write down a step you may not fully remember (and record) why the step has to be there. My view is that this is especially likely in system administration where we may have any number of things that have been quietly sitting in the corner for some time, working away without problems, and so they've slipped out of our minds.
(For example, one of the issues that we ran into in this downtime was not remembering all of the hosts that ran crontab jobs that used one particular filesystem. Of course we thought we did know, so we didn't try to systematically look for such crontab jobs.)
To get a really solid checklist you have to be able to test it, much like all documentation needs testing. Unfortunately, a lot of the checklists I write (or don't write) are for one-off things that we can't really test in advance for various reasons, for example because they involve a large scale change to our live systems (that requires a downtime). If you're lucky you'll realize that you don't know something or aren't confident in something while writing the checklist, so you can investigate it and hopefully get it right, but some of the time you'll be confident you understand the problem but you're wrong.
Despite any imperfections, checklists are still a good thing. An imperfect written-down checklist is better than relying on your memory and improvising on the fly almost all of the time (the rare exception is when you wouldn't even dare do the operation without a checklist, but an imperfect checklist tempts you into doing it and fumbling).
(You can try to improve the situation by keeping notes on what was missed in the checklist and then saving or publishing these notes somewhere. You can review these after the fact notes on what was missed in this specific checklist if you have to do the thing again, or look for specific types of things you tend to overlook and should specifically check for the next time you're making a checklist that touches on some area.)