Our Critical Production Environments

The giant mainframes of the ’80s have been replaced by compact servers and distributed computing strategies. How does that change operations?

The first computer that I ever met was the AN/FSQ-7, built by IBM for the U.S. Air Force in the 1950s. I was introduced to the AN/FSQ-7 (the “AN” prefix marks a joint Army-Navy designation) in 1959. A firm called System Development Corporation (SDC) was hiring and training people to be “computer programmers” for a system called SAGE (Semi-Automatic Ground Environment), the first automated air defense system. I did not know what a computer was or what it meant to be a “programmer,” so this was an exciting adventure. And a truly mind-boggling part of this adventure was seeing the “machine room.” The AN/FSQ-7 occupied a building that covered about an acre, and the building was mostly machine room. The AN/FSQ-7 was a vacuum-tube computer with magnetic-core memory. It had consoles with hundreds of blinking lights that you learned to read, in binary, on the fly. The room was staffed by several computer operators on each shift and also by IBM service folks on a 24x7 schedule. I loved the 1983 movie WarGames, because the machine room reminded me a bit of the one where I was introduced to my AN/FSQ-7.

I was working on parts of the SAGE operating system software in the 1960s, and I can tell horror stories about dropping huge card decks, tangling tapes, and the like. In those days, you tested your program by sending a deck of cards off to the machine room and waiting for it to return with a core dump. At least, that’s what you did if you were developing air defense applications. If you were developing parts of the operating system, you scheduled time “on the computer,” usually in the middle of the night, and you arrived with your deck of cards to watch the blinking lights with the operator. I loved this stuff. There was some real magic for me in the machine room. And regardless of my tangled tapes and out-of-sequence cards, the computer operators ran the machine room with policies and procedures and the authority to enforce them.

And for me, there is still magic today in the machine room. In my organization, the machine room is currently called “the platform,” I think because of the raised floor. The platform no longer houses one big honker of a machine, but for us it is home to hundreds of servers that run the applications our community relies on for teaching, learning, research, and administration. Most of our machine rooms are in out-of-the-way places, often in the lower levels of buildings. People who work in them in the winter might never see the light of day. And we tend to forget about them, except when an important system crashes. Then the operators' jobs are similar to the jobs of emergency medical technicians. They are expected to know which emergency actions to take to protect our critical data resources and which experts to round up. Most do this very well. Emergency events tend to happen on weekends or nights when experts are not on-site. But my experience is that these experts can usually be found, because they never turn off their cell phones or their computers at home. They can often log in from home and begin their diagnosis. Between our system operators, our system programmers, and our application developers, we handle extraordinary events very well.

The ordinary, in my mind, is a different problem. One of the challenges of running a great production environment (and our machine room is the heart of that environment) is having policies, standards, and procedures in place for routine, day-to-day operations. In our current environments, with hundreds of servers, managing an orderly production environment is a challenge. User pressure to reduce development and implementation costs makes that challenge greater. Yet a stable production environment is essential for our institutions. We need test and turnover procedures for moving systems and applications to production. We need to insist on documentation for all systems moved to production. And we need to have standards for that documentation. With that work done, the next challenge is adoption of those policies, standards, and procedures. At my university, we have done good work in developing standards and procedures. We have not been as successful at implementing them. We need to make a real commitment to staff training in these new ways of doing our work. We need to hold our managers accountable for implementing the standards and procedures. Our staff need to understand that following these standards and procedures is part of their job. We need to routinely review and modify our standards and procedures. Then we need to retrain.

With all the great stuff in our machine rooms, it’s easy to think of them as playpens. They are not. We should build test labs as playpens and test environments for our production systems. We need to give those who run our production environments the authority they need to ensure that our environments are safe and stable.
