Our Critical Production Environments
- By Annie Stunden
- 09/01/04
The giant mainframes of the ’80s have been replaced by compact servers
and distributed computing strategies. How does that change operations?
The first computer that I ever met was the ANFS/Q7 built by IBM for the U.S.
Air Force sometime in the 1950s. I was introduced to the ANFS/Q7 (Army Navy
Federal Systems) in 1959. A firm called System Development Corporation (SDC)
was hiring and training people to be “computer programmers” for
a system called SAGE (Semi-Automatic Ground Environment), the first automated
air defense system. I did not know what a computer was or what it meant to be a
“programmer,” so this was an exciting adventure. And a real mind-boggling
part of this adventure was seeing the “machine room.” The ANFS/Q7
occupied a building that took up about an acre in space. And the building was
mostly machine room. The ANFS/Q7 was a vacuum tube, magnetic core memory computer.
It had consoles with hundreds of blinking lights that you learned to read, in
binary, on the fly. The room was staffed by several computer operators on each
shift and also by IBM service folks on a 24x7 schedule. I loved the 1983 movie
WarGames, because the machine room reminded me a bit of the one where
I was introduced to my ANFS/Q7.
I was working on parts of the SAGE operating system software in the 1960s,
and I can tell horror stories about dropping huge card decks, tangling tapes,
and the like. In those days, you tested your program by sending a deck of cards
off to the machine room and waiting for it to return with a core dump. At least,
that’s what you did if you were developing air defense applications. If
you were developing parts of the operating system, you scheduled time “on
the computer,” usually in the middle of the night, and you arrived with
your deck of cards to watch the blinking lights with the operator. I loved this
stuff. There was some real magic for me in the machine room. And regardless
of my tangled tapes and out-of-sequence cards, the computer operators ran the
machine room with policies and procedures and the authority to enforce them.
And for me, there is still magic today in the machine room. In my organization,
the machine room is currently called “the platform”—I think
because of the raised floor. The platform no longer houses one big honker of
a machine—but for us it is home to hundreds of servers that run the applications
supporting our community to do their work of teaching, learning, research, and
administration. Most of our machine rooms are in out-of-the-way places, often
in the lower levels of buildings. People who work in them in the winter might
never see the light of day. And we tend to forget about them, except when an
important system crashes. Then their jobs are similar to the jobs of emergency
medical technicians. They are expected to know what emergency actions to take
to protect our critical data resources and what experts to round up. Most do
this very well. Emergency events tend to happen on weekends or nights when experts
are not on-site. But my experience is that these experts can usually be found—because
they never turn off their cell phones or their computers at home. They can often
log in from home and begin their diagnosis. Between our system operators, our
system programmers, and our application developers, we handle extraordinary
events very well.
The ordinary, in my mind, is a different problem. One of the challenges of
running a great production environment—and our machine room is the heart
of that environment—is having policies, standards, and procedures in place
for routine, day-to-day operations. In our current environments, with hundreds
of servers, managing an orderly production environment is a challenge. User
pressure to reduce development and implementation costs makes that challenge
greater. Yet, a stable production environment is essential for our institutions.
We need test and turnover procedures for moving systems and applications to production.
We need to insist on documentation for all systems moved to production. And
we need to have standards for that documentation. With that work done, the next challenge
is adoption of those policies, standards, and procedures. At my university, we
have done good work in development of standards and procedures. We have not
been as successful at implementing the standards and procedures. We need
to make a real commitment to staff training in these new ways of doing our work.
We need to hold our managers accountable for implementing the standards and
procedures. Our staff need to understand that following these standards and
procedures is part of their job. We need to routinely review and modify our standards
and procedures. Then we need to retrain.
With all the great stuff in our machine rooms, it’s easy to think of
them as playpens. They are not. We should build test labs as playpens and
test environments for our production systems. We need to give those who run
our production environments the authority they need to ensure that our environments
are safe and stable.