Thursday, March 15, 2007

Lessons from Ops

I am a programmer who enjoys working "only slightly removed from pure thought-stuff." But when I left college I chose a job that was decidedly oriented toward beating this thought-stuff into the ground of reality. I wanted to learn the details of how to program large-scale, heterogeneous systems that were possible, if not easy, to maintain. Not "maintain" in the programmer sense of the word (ease of adding new features, finding bugs, etc.), but maintain as in enabling users to understand, control, build upon, and rely on the system.

Oh, That's Typical
I'm going to stay away from the common recommendations such as DRY because I'd like to give specific pointers that apply to the situations I've seen. The hope is that when I write programs in the future, the tips herein will pop back into my mind and I'll produce systems that are easy to debug, understandable by people uninvolved in the development, and predictable.

Doesn't Play with Others
A data problem was reported to our team the other day. Typically, finding the source of such a problem requires a large amount of domain knowledge: where are the files that configure the various transformations that this data goes through? Who typically publishes this data type? Where are the logs that may have information about this document? The path to resolution is like a little-known covered-wagon trail through the Sierras: you need an expert guide, and most of the time you get stuck at Donner Pass. What should be provided is a yellow brick road: it may require a little work to get there, but you always know where to take your next step.

Introspective vs. Transparent
At its root, debugging is about figuring out what happened and why. To provide an easily debuggable system, simply provide the what and why at every step where business rules are applied. Just as importantly, provide it as a service in the same way that interested parties normally interact with your system. Adding `printf`s to your code is great for programmers who usually interact with a service on the command line or `tail` files, but it is nearly useless for clients.

The Boudin bakery in San Francisco is transparent, literally. You can see them mix the dough, cut bread, and bake it. But that doesn't let me understand any of the decisions they've made; I couldn't go home and bake delicious sourdough in which to put clam chowder. In order to do that, I'd need to interact with the cooks in the normal way: ask questions, have them explain what's going on.

Except for the most basic services, data transparency is not that helpful. Think about the questions you'll get from users. When did this data change? Who changed it? Where did it come from and how did it get transformed? How would the system react in different situations? These are the questions that need to be answerable in a straightforward manner.
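
To make this concrete, here is a minimal sketch in Python of recording the what and why (plus who and when) each time a business rule touches a document, so the story can be read back by ID. Everything in it is made up for illustration: the `History` class, the `normalize_price` rule, and the `feed-importer` publisher are assumptions, not parts of any real system I've worked on, and a real version would keep the history in durable storage rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Step:
    rule: str   # which business rule ran
    what: str   # what it changed
    why: str    # why the rule applied
    who: str    # who published or triggered the change
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class History:
    """In-memory record of rule decisions, keyed by document ID (illustrative)."""

    def __init__(self):
        self._steps = {}

    def record(self, doc_id, rule, what, why, who):
        self._steps.setdefault(doc_id, []).append(Step(rule, what, why, who))

    def explain(self, doc_id):
        """Return the story of one document as lines a client can read."""
        return [
            f"{s.rule} by {s.who}: {s.what}, because {s.why}"
            for s in self._steps.get(doc_id, [])
        ]


def normalize_price(doc, history, who):
    # Hypothetical rule: prices arrive in cents, clients expect dollars.
    if doc.get("unit") == "cents":
        doc["price"] = doc["price"] / 100
        doc["unit"] = "dollars"
        history.record(
            doc["id"],
            "normalize_price",
            "converted price from cents to dollars",
            "unit was 'cents'",
            who,
        )
    return doc


history = History()
doc = normalize_price(
    {"id": "doc-17", "price": 1999, "unit": "cents"}, history, "feed-importer"
)
story = history.explain("doc-17")
print(story)
```

The point is that `explain` would be exposed the same way clients already reach the service (an endpoint, a page, a report), not buried in a log file that only the programmers know how to `tail`.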

If you're worried about performance, or about exposing too much information to customers, keep in mind that introspection is a system property. It doesn't have to live in your main service, especially when lots of historical data is desired. It would be nice if it did, since people then wouldn't have to learn a system with more components, but it's not a requirement. Unique identifiers (database keys, document IDs, URLs) should be usable across the system so a story can be put together programmatically. A coworker of mine suggested that the public password for a system we are developing should be the URL of a wiki page that users should read before using the service. I love this idea because it eliminates an often fruitless search for information and directly ties the wiki service into the overall design. Think how useful it would be to have wiki links for each error your service produces, along with the ID of the data that was involved in the problem.
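
As a rough sketch of that last idea, an error type could carry both the ID of the data involved and a link to the wiki page that explains it. The wiki base URL, the `ServiceError` class, the `missing-title` error code, and the response field names below are all assumptions chosen for the example.

```python
# Assumption: the wiki has one page per error code under this base URL.
WIKI_BASE = "https://wiki.example.com/errors"


class ServiceError(Exception):
    """Error that names the data involved and points at its wiki page."""

    def __init__(self, code, data_id, message):
        super().__init__(message)
        self.code = code
        self.data_id = data_id
        self.wiki_url = f"{WIKI_BASE}/{code}"

    def to_response(self):
        # Payload a client would see; the field names are illustrative.
        return {
            "error": self.code,
            "message": str(self),
            "data_id": self.data_id,
            "help": self.wiki_url,
        }


def publish(doc):
    # Hypothetical business rule: every published document needs a title.
    if "title" not in doc:
        raise ServiceError("missing-title", doc["id"], "document has no title")
    return doc


try:
    publish({"id": "doc-17"})
except ServiceError as err:
    response = err.to_response()
    print(response)
```

Because the same `data_id` is usable everywhere else in the system, a user who hits this error can go straight from the response to the document's history and to the page that explains the rule.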

You From Around Here?
Even programs with great documentation can't cover all of the potential questions from a new user. Often users can't express their issue in the jargon that would enable them to find and understand the solution. Providing a way to "show and tell" what's going on gives users a chance to understand from experience. Just as languages with an interactive shell and introspection let users get an intuitive feel, systems should provide users with meta-information that they can manipulate.

To Be Continued
This turned out longer than expected, so I'll continue in a later post. Leave comments so I can improve my writing and ideas.
