A ROUTINE Friday afternoon batch job turned into disaster when
a computer meltdown brought a manufacturing system to its knees. The computer
room was humming, and all systems were go for one of Australia's largest
manufacturers.
Then Jeff Steel, project manager of Infact Consultants, reset the system
clock to January 7, 2000, and waited to see what would happen. The routine
batch job, which involved 800 custom-built Cobol and PL-1 programs in a
manufacturing mainframe environment, was expected to take six hours to
run. Close by, a terminal in the control room was set up to track the programs
as they went through the batch run. Although he anticipated some problems,
Steel was not prepared for anything coming out of left field.
His team of 12 programmers had worked methodically for nine months,
manually sifting through millions of lines of code, rectifying the double
digit issue to take account of the year 2000. Great care had been taken
to keep the crew motivated and focused on their tasks to ensure time
was spent productively and any reworking was kept to a minimum. At worst,
he expected to make some specific changes that could be easily spotted.
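
The "double digit issue" can be illustrated with a minimal sketch. It is not taken from the project's Cobol or PL-1 code; the function names and the pivot value of 50 are assumptions chosen for illustration. It shows how arithmetic on two-digit years goes wrong at the century boundary, and how one common remediation technique, date windowing, expands the year before calculating.

```python
# Hypothetical illustration of the two-digit year problem.
# PIVOT is an assumed value: 00-49 is read as 20xx, 50-99 as 19xx.
PIVOT = 50


def naive_elapsed_years(start_yy: int, end_yy: int) -> int:
    """Subtract two-digit years directly, as much legacy code did."""
    return end_yy - start_yy


def expand_year(yy: int, pivot: int = PIVOT) -> int:
    """Expand a two-digit year to four digits using a fixed window."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return 2000 + yy if yy < pivot else 1900 + yy


def windowed_elapsed_years(start_yy: int, end_yy: int) -> int:
    """Expand both years first, then subtract."""
    return expand_year(end_yy) - expand_year(start_yy)


if __name__ == "__main__":
    # A program last run in '87, checked in year '00.
    print(naive_elapsed_years(87, 0))     # negative result
    print(windowed_elapsed_years(87, 0))  # correct elapsed years
```

Windowing is only one option; widening the stored field to four digits avoids the need for a pivot but means changing record layouts.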
Operations had hardly begun before the first programs started to run
slowly. By the time the sixth program started, the system began to falter.
Then, one after another, programs fell over. By the time the 10th
program failed, Steel decided to let the job run to the end, because in
all likelihood, it would be all over in half an hour anyway.
Within minutes, 750 programs had fallen over. One of the few programs
to continue running was invoicing, but it was producing invoices for the
43rd day of the 14th month. As the job finally ground to a halt, a silence
hung over the room as everyone stared vacantly at the terminal. Steel
stood frozen to the floor in shock, as did his team, which had been engaged
on a $3 million contract.
Twelve people stared at the terminal where a complete suite of programs
had died instantly. Fortunately the meltdown had taken place in a test
environment.
The search was now on to diagnose the problem. One of the team traced
it to an obscure mainframe program. The culprit was a non-Y2K-compliant
link editor on a PL-1 program that last ran in 1987.
A link editor takes different modules of a program and puts them together
in the right place at the right time. With the problem identified and a
Y2K compliant link editor installed, the 30 programs were rerun and the
problem was solved.
Steel says the use of the test environment saved the company from bankruptcy.
"The consequences in a live environment would have been devastating," he
says. As well as bringing the business to a standstill, it would have
rendered it unable to operate for six months - and possibly taken suppliers
and customers down with it. Situations like this are typical of what's
happening and testify to the truth of rumours about large companies not
yet meeting Y2K compliance requirements, Steel says.
The post-mortem meeting found that the collective time required to
diagnose such an obscure problem in a live environment would have been about
a month, and a fix would have taken six months. "The problem was so unusual, you wouldn't
have known if it was hardware, software or system utilities," Steel says.
"The horrible thing about it was that it was such an obscure component
that nobody even thought that it could fail."
Even with hindsight, the problem could never have been spotted before
testing because it was too obscure. "In nine months of remediation, no-one
had ever got near this problem," he says.
Steel says the meltdown was so catastrophic that even a contingency
plan wouldn't have saved the day. The only way to find the Y2K bugs in
a system is to manually trawl the program code line by line to find the
date fields, some of which are very obscure, he says. One date
was embedded deep in job control language, where a 30-character sort field
turned out to contain six characters making up a date. Even though the testing is complete,
Steel cannot say definitely that the system is now 100 per cent Y2K compliant.
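
To see why a six-character date buried in a 30-character sort field matters, consider the rough sketch below. The key layout, the offset of the date within it, the sample values and the pivot year are all assumptions made for illustration; the article does not describe the actual field. Sorting on the raw key places January 2000 ahead of December 1999, while a century-aware key restores chronological order.

```python
# Hypothetical 30-character sort key with a YYMMDD date embedded in it.
# DATE_START and the 12/6/12 layout are assumed, not from the real system.
DATE_START = 12
DATE_LEN = 6
KEY_LEN = 30


def make_key(prefix: str, yymmdd: str, suffix: str) -> str:
    """Build a fixed-width key: 12-char prefix, 6-char date, 12-char suffix."""
    key = prefix.ljust(12) + yymmdd + suffix.ljust(12)
    if len(key) != KEY_LEN:
        raise ValueError("key parts do not fit the assumed layout")
    return key


def century_aware(key: str, pivot: int = 50) -> str:
    """Return a comparison key with the embedded year expanded to four digits."""
    yymmdd = key[DATE_START:DATE_START + DATE_LEN]
    century = "20" if int(yymmdd[:2]) < pivot else "19"
    return key[:DATE_START] + century + yymmdd + key[DATE_START + DATE_LEN:]


keys = [
    make_key("PLANT01", "991231", "INV"),  # 31 December 1999
    make_key("PLANT01", "000107", "INV"),  # 7 January 2000
]

raw_order = sorted(keys)                       # sorts on the two-digit year
fixed_order = sorted(keys, key=century_aware)  # sorts chronologically

if __name__ == "__main__":
    print([k[DATE_START:DATE_START + DATE_LEN] for k in raw_order])
    print([k[DATE_START:DATE_START + DATE_LEN] for k in fixed_order])
```

Nothing in such a key labels the six characters as a date, which is why a reviewer has to read the surrounding code and control statements to spot it.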
As part of the strategy to protect himself and the company from any
legal recourse, he operated with an auditor looking over his shoulder at
every stage, concurring that the approach he was taking was the best available
method.
"All I can say to the client is that I can't guarantee that there will
not be any problems after the year 2000," he says. Steel says most organisations
don't understand Y2K. "Until something like this happens, they don't understand
what Y2K can do to them," he says.