Voyer, John with Andrea Cahill, Kathryn Laustsen and Benjamin Philbrick  "Information Technology: System Dynamics Modeling of an IT Major Incident Resolution Process", 2015 July 19 - 2015 July 23

Information Technology:
System Dynamics Modeling
of an IT Major Incident Resolution Process

John Voyer, Ph.D.
Andrea Cahill, MBA
Kathryn Laustsen, MBA
Benjamin Philbrick, MBA

University of Southern Maine

Abstract

We did a system dynamics analysis of a significant IT problem—poor handling of Critical
Incidents—at a medium-sized health care organization. We used the usual system dynamics process of
describing the process verbally, identifying reference modes, and creating a dynamic hypothesis. From
that, we developed a quantitative model, which was a variant on the familiar Project Model. Using the
model, we then tested various policy options for staff allocation and their coordination. The conclusion
was that, even though the organization could improve work quality by using fewer staff members, poor
coordination—regardless of number of staff—wiped out gains in quality, productivity, and speed. We
end by recommending ways to improve coordination.

Problem Description

We did a system dynamics analysis of a significant IT problem at a medium-sized health care
organization, using the usual system dynamics process of describing the process verbally, identifying
reference modes, creating a dynamic hypothesis, developing a quantitative model, and testing various
policy options using the model.

Marshall’s Promontory Healthcare (MPHC) employs approximately 800 people from Caribou,
Maine to Syracuse, New York. The business consists of two primary models: clinics, which provide
patient care, and health insurance plans. There are nine clinics in the MPHC family and the health plan
business operates two plans.

The Information Technology Department at MPHC employs approximately 40 people. The
department receives many calls, ranging from password resets to more complex issues that affect
a larger portion of the organization. The department calls the latter Major Incidents. Two types of
Major Incidents affect the IT Department:

• High
    o Affects, or makes unavailable, a single non-core service/business process
    o The disruption affects 5-19 patients or members
    o An incident prevents 2-99 employees from working
• Critical
    o Disruption is system-wide, company-wide, or line-of-business-wide
    o Disruption affects 20 or more patients or members
    o An incident prevents more than 100 employees from working
    o There is a risk to public safety or reputation
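
The thresholds above can be read as a simple classification rule. The Python sketch below is
illustrative only and is not part of MPHC's tooling: the function name `classify_incident`, its
parameters, and the "Regular" label are hypothetical. It assumes that meeting any single criterion
is enough, and it treats exactly 100 blocked employees (a case the list leaves open) as High.

```python
def classify_incident(patients_affected: int,
                      employees_blocked: int,
                      non_core_service_down: bool = False,
                      system_wide: bool = False,
                      public_risk: bool = False) -> str:
    """Return 'Critical', 'High', or 'Regular' for a reported disruption.

    Assumption: the criteria combine with OR; the source list does not say.
    """
    # Critical: system-wide disruption, 20+ patients or members,
    # more than 100 employees, or a risk to public safety or reputation.
    if (system_wide or public_risk
            or patients_affected >= 20
            or employees_blocked > 100):
        return "Critical"
    # High: one non-core service down, 5-19 patients or members,
    # or 2-99 employees. Assumption: exactly 100 employees, which the
    # list leaves unassigned, also falls through to High here.
    if (non_core_service_down
            or patients_affected >= 5
            or employees_blocked >= 2):
        return "High"
    return "Regular"


print(classify_incident(patients_affected=12, employees_blocked=0))
print(classify_incident(patients_affected=0, employees_blocked=150))
```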

Example: For over nine months during 2014, there was a memory leak issue within Internet
Explorer. Though this does not sound like a huge problem, MPHC’s Electronic Health Record is
cloud-based. Having IE crash quite often for practice users affects workflow and patient service. In another
Major Incident example, when the credit card scanning machines do not work in the Virtual Desktop
Infrastructure (VDI) environment, checkout personnel are unable to receive payment. By not
documenting the details, users, adaptations, root cause, and resolution, MPHC’s IT department cannot
resolve similar issues quickly if they reoccur.

The problem MPHC has been having is that the quality of the department’s work starts out high,
but then drops, and the level of coordination among staff assigned to the major incidents is, after a brief
spike, quite low.

Major Incident Process Flow

MPHC has a published process flow for dealing with Major Incidents, which we show in Figure 1.
For purposes of the present paper, there are three crucial points in this process:

1. After identifying the existence of a Major Incident, the IT staff is supposed to coordinate
its efforts and develop a recovery plan.

2. From its recovery plan, the staff is supposed to create and implement a temporary
workaround plan.

3. After implementing the workaround, the staff continues working to resolve the Major
Incident, ultimately stabilizing it.

[Flowchart; box labels recovered from the original image: Service Disruption Detected/Suspected;
Major Incident? (otherwise follow the regular incident management process); Escalate to Service
Operations Manager; Internal/External Communication; IT Leadership; Coordinate Recovery Efforts;
Create Recovery Plan; Instigate Recovery Plan; Implement Workaround Solution; Stabilize Solution;
Conduct Post Mortem]

Figure 1. MPHC’s Process Flow for Major Incident Resolution


Year          Number of Major Incident Tickets
2012          22
2013          17
2014 (est.)   14

Table 1. Major Incident IT Tickets at Marshall’s Promontory Health Care

Year          Average Number of Calls/Ticket   Average Length of Ticket (days)
2012          12.18                            12.49
2013          17.59                            17.5
2014 (est.)   14.44                            68

Table 2. Characteristics of Major Incident IT Tickets at MPHC. (Data for 2014 are estimates from 9 months of
actual data.)

Ticket Data

MPHC’s IT Department uses a piece of software called TrackIT to track all incoming tickets to the IT
Service Desk. The department polled those data for Major Incident Tickets, which we show in Table 1.
Our model (discussed later) used the average number of Major Incidents for 2012 and 2013, the two
years with complete data. This number is 19.5 tickets per year, which is equal to 0.053425 Major Incidents per
day (the day is the default unit of time in the model), or roughly one every three weeks.
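
The arithmetic behind that arrival rate can be checked with a few lines of Python. This is only a
worked example of the calculation described above; it assumes a 365-day year, which is consistent
with the 0.053425 figure, and the variable names are ours.

```python
# Complete-year ticket counts from Table 1.
tickets = {2012: 22, 2013: 17}

avg_per_year = sum(tickets.values()) / len(tickets)  # mean tickets per year
rate_per_day = avg_per_year / 365                    # assumes a 365-day year
days_between = 1 / rate_per_day                      # mean gap between incidents

print(f"{avg_per_year:.1f} tickets per year")
print(f"{rate_per_day:.6f} Major Incidents per day")
print(f"one incident every {days_between:.1f} days "
      f"({days_between / 7:.1f} weeks)")
```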

Data were not available on the number of employees working on an incident, on whether the
department implemented a workaround, or on the efficiency/coordination of the employees
working on the incident. However, MPHC’s IT department provided the Number of Calls by Major
Incident and the Duration of each Incident, i.e., the time between the initial report and
the implementation of the resolution. We averaged those data, which we show in Table 2.

Reference Modes

Based on interviews and the personal experience of one member of the modeling team who is also a staff
member of the IT Department at MPHC, we can infer the following reference modes for this problem.

Quality of IT Work (Figure 2)

Quality of response to Major Incidents starts high, but as time passes, the quality goes down,
because the number of employees working on the issue decreases as other projects and priorities
come along, and because the original Major Incident remains unresolved.

[Sketch: Quality (vertical axis, topping out at 100%) against Time]

Figure 2. Quality of IT Work

Coordination of IT Work (Figure 3)

Coordination remains low, as the IT employees recreate the issue and then go off in different
directions to work in silos to find a resolution. Because of this atomized approach, they make many
changes at once, which increases the difficulty of pinpointing the actual resolution (or resolving
subsequent problems, if other things are no longer working). There is a spike in the mode to
represent staff checking in with one another, which is brief and often unlikely.

[Sketch: Coordination (vertical axis) against Time]

Figure 3. Coordination of IT Work

Major Incidents Resolved (Figure 4)

This represents the resolution of Major Incidents over time, and shows how the length of
resolution time varies for each Major Incident.

[Sketch: Major Incidents Resolved against Time]

Figure 4. Major Incidents Resolved

[Causal loop diagram; variable names recovered from the original image: Major Incident Tickets;
Target for Major Incident Tickets; Problem with Major Incident Tickets; Major Ticket Resolution;
IT Resource(s) Assigned; Coordination; Quality of Resolutions; IT Desired Quality of Resolution;
Problems with Resolutions; Revisions of Resolutions; Business Desired Quality of Resolutions;
Business Satisfaction with Resolutions; Business Problems with Resolutions; Business Testing and
Revision of Resolutions; Quality of Workarounds; IT Desired Quality of Workarounds; Business
Desired Quality of Workarounds; Business Satisfaction with Workaround; Business Problems with
Workarounds; Business Testing and Revision of Workarounds; Revisions of Workarounds]

Figure 5. Causal Loop Diagram and Dynamic Hypothesis

Dynamic Hypothesis

We present our dynamic hypothesis in Figure 5. We placed Quality of Workarounds and Quality
of Resolutions in boldface centrally in the causal loop diagram, since both affect the time it takes the IT
department to present a workaround or resolution to the business (which is the name the IT
department gives to its internal clients at MPHC). In addition, the Coordination of the Response from
the IT department (ostensibly a key early element of its process flow) has a direct correlation to Quality
of the Workaround and Quality of the Resolution, so that is also in boldface in the center of the diagram.
Lastly, without timely business involvement in both a workaround and a resolution, it is difficult for IT to
know if it has proposed an adequate workaround, or when it has resolved the Major Incident.

Loops
There are six loops in the Causal Loop Diagram, all balancing.

Loop B1 - Major Ticket Resolution, describes the process that the IT department follows, by
policy, to resolve a ticket. There is one exogenous policy input to this loop, Target for Major Incident
Tickets, which is the target for completion of any Major Incident ticket.

Loop B2 - IT Testing and Revision of Resolutions, describes the process within the IT
department relating to the quality of and the need to revise possible resolutions, ultimately to meet
quality standards the department sets.

Loop B3 — Business Testing and Revision of Resolutions, describes how the IT department relies
on the business to test each possible resolution and to sign off in agreement that the resolution is fully
functional. This loop has an exogenous input of Business Desired Quality of Resolutions, as often the
business’s perception of the quality of the resolution differs from that of the IT Department.

Loop B4 - IT Testing and Revision of Workarounds, describes the workaround process that
sometimes occurs if the IT Department cannot resolve the root cause of the Major Incident fast enough.
In this case, if there is a way for the business to continue to serve patients and members, even if the
workaround is lengthy, then the department recommends a workaround as a stopgap to avoid turning
patients and members away at the door. This loop has an exogenous input of IT Desired Quality of
Workarounds, which creates the familiar goal-gap situation between desired and actual quality of the
workaround.

Loop B5 — Business Testing and Revision of Workarounds, describes the same workaround
process just described, only this time from the perspective of the business. In this case, once the IT
Department has proposed a workaround, the business must test and sign off on whether the
workaround is sufficient for it to continue to do business. This loop has an exogenous input of Business
Desired Quality of Workarounds, as the Quality of the Workaround, the Business Desired Quality of the
Workaround and the IT Quality of the Workaround may all differ, due to expectations and/or
perceptions.

The final balancing loop, B6 — Workaround as Resolution, describes a situation where the
workaround built for the Business becomes the resolution. In this loop, the business accepts that the
workaround will suffice as the resolution, possibly because IT has been unable to determine a
resolution, or, possibly, because the workaround is an improvement to the business flow.

Note the critically central importance in our dynamic hypothesis of Coordination of the
Response from the IT department. This variable feeds all six loops, and we hypothesize that the issues
the IT Department at MPHC experiences result from poor Coordination of the Response, mostly (we
submit) caused by too many staffers thrown at the Major Incident in a “siloed,” atomized way. This will
feature prominently in our model, to which we now turn.

[Stock-and-flow diagram; variable names recovered from the original image: Incident Tickets;
Resolution Percentage; Desired Work Quality; Desired Time to Correct Errors; Number of Employees
Working on Incident; Desired Number of Employees Working on Incident; Relative Number of Employees
Working on Incident; Efficiency of Coordination of Employees Working on Incident; Desired
Efficiency of Coordination of Employees Working on Incident; Relative Efficiency of Coordination;
Effect on Work Quality of Efficiency of Coordination of Employees Working on Incident (and its
table); Effect on Work Quality of Relative Number of Employees Working on Incident (and its table);
Effect on Time to Correct Errors of Coordination of Employees Working on Incident (and its table);
Effect on Time to Correct Errors of Relative Number of Employees (and its table)]

Figure 6. Full Stock-and-Flow Model

The Model Explained

We show our full stock-and-flow model in Figure 6. We believe that this situation is a variation
of the familiar “Project Model.” In that model, there is a stock of “Work to be Done,” whose contents
move into a stock of “Completed Work.” However, a percentage of items ostensibly moved to the stock
of “Completed Work” are actually “Undiscovered Rework.” These items appear finished, but they
were done so poorly that, once their flaws are discovered, they will have to return to the stock of “Work to
be Done.” The project ends when these iterations end and all the work truly is finished.

A major difference in the situation described at MPHC, however, is that the “Work to be Done”
never actually ends. It is not a finite stock of work, as in the Project Model. It is a stock continuously
replenished as new Major Incidents crop up. However, there are some similarities. As in the Project
Model, not all the Major Incidents are resolved on their first pass through the process; the IT staff will
need to redo some of them as the flaws in their workaround or resolution emerge. In another similarity
to the Project Model, two things drive the problem under study here:

1. The quality of the work done in the first place.
2. The speed at which the IT staff discovers its errors.

We believe that at MPHC, poor coordination of staff drives both of these key variables.
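
The rework cycle just described can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not our actual model: the function name `simulate`, the
first-order flow formulations, the 15-day resolution time, and the quality and discovery-time
values in the usage lines are all hypothetical. Only the arrival rate (0.053425 Major Incidents
per day) and the 365-day horizon come from the text.

```python
def simulate(work_quality, time_to_discover, days=365, dt=0.25,
             arrival_rate=0.053425, time_to_resolve=15.0):
    """Euler-integrate a three-stock rework cycle with continuous arrivals.

    work_quality     -- fraction of resolution attempts that truly succeed
    time_to_discover -- average days before a flawed resolution is noticed
    time_to_resolve  -- hypothetical average days to attempt a resolution
    """
    unresolved = 0.0    # "Work to be Done": open Major Incident tickets
    undiscovered = 0.0  # "Undiscovered Rework": tickets that only look fixed
    resolved = 0.0      # tickets that really are fixed

    for _ in range(int(days / dt)):
        attempts = unresolved / time_to_resolve      # resolution attempts/day
        discovery = undiscovered / time_to_discover  # flaws surfacing/day

        # New incidents and discovered rework refill the work stock.
        unresolved += (arrival_rate + discovery - attempts) * dt
        # Flawed attempts sit unnoticed until they are discovered.
        undiscovered += (attempts * (1 - work_quality) - discovery) * dt
        # Only sound attempts leave the cycle for good.
        resolved += attempts * work_quality * dt

    return unresolved, undiscovered, resolved


# Hypothetical comparison: good versus poor coordination.
for label, quality, discover in [("well coordinated", 1.0, 5.0),
                                 ("poorly coordinated", 0.8, 10.0)]:
    u, r, done = simulate(quality, discover)
    print(f"{label}: unresolved={u:.2f}, undiscovered rework={r:.2f}, "
          f"resolved={done:.2f}")
```

Because the inflow of new incidents never stops, the stocks settle toward a steady level rather
than draining to zero, which is the key departure from the finite Project Model.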
For Work Quality, we chose to model this problem using two table functions:

1. The input for the first table function is the ratio of employees assigned to work on a problem
and the desired or appropriate number assigned to that work (on the theory that the siloed,
atomized response increases as management assigns a larger number of staff members to the
Major Incident). This is a downward sloping function, where quality declines as management
deploys more employees to the Critical Incident.

2. The other table function has the input of ratio of actual efficiency of coordination and desired
efficiency of coordination. Coordination is an explicit element of the organization’s resolution
process, yet we believe that the organization does not practice good coordination. This is an
upward sloping function, as quality improves as coordination improves.

For Time to Correct Errors, we also chose to model this problem using two table functions:

1. The first table function has the input of the ratio of actual efficiency of coordination and
desired efficiency of coordination. As mentioned earlier, coordination is an explicit element
of the organization’s resolution process, yet we believe that the organization does not
practice good coordination. This is a downward sloping function, where the time required
to correct errors declines as management is more efficient in coordinating the employees
assigned to resolving a Critical Incident.

2. The input for the second table function is the ratio of employees assigned to work on a
problem and the desired or appropriate number assigned to that work (on the theory that
the siloed, atomized response increases as management assigns a larger number of staff
members to the Major Incident). This is an upward sloping function, where the time
required to correct errors increases as management deploys more employees to the Critical
Incident.
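
The four table functions described above can be sketched as piecewise-linear lookups. The Python
below is a hedged illustration only: every table value is invented to show the slopes named in
the text, the multiplicative combination of effects is an assumption (a common system dynamics
convention), the desired staffing level of six is assumed, and the names `lookup`, `work_quality`,
and `time_to_correct` are ours.

```python
from bisect import bisect_right


def lookup(table, x):
    """Piecewise-linear table function; clamps outside the table's range."""
    xs = [point[0] for point in table]
    ys = [point[1] for point in table]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical tables of (ratio of actual to desired, multiplier) pairs.
QUALITY_VS_STAFF = [(0.0, 1.0), (1.0, 1.0), (1.5, 0.7), (2.0, 0.5)]  # downward
QUALITY_VS_COORD = [(0.0, 0.0), (0.8, 0.85), (1.0, 1.0)]             # upward
TIME_VS_COORD = [(0.0, 3.0), (0.8, 1.4), (1.0, 1.0)]                 # downward
TIME_VS_STAFF = [(0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]                 # upward


def work_quality(employees, coordination, desired_employees=6,
                 desired_coordination=1.0, desired_quality=1.0):
    """Desired quality scaled by both effects (assumed multiplicative)."""
    staff_ratio = employees / desired_employees
    coord_ratio = coordination / desired_coordination
    return (desired_quality
            * lookup(QUALITY_VS_STAFF, staff_ratio)
            * lookup(QUALITY_VS_COORD, coord_ratio))


def time_to_correct(employees, coordination, desired_employees=6,
                    desired_coordination=1.0, desired_time=5.0):
    """Desired correction time (hypothetical 5 days) scaled by both effects."""
    staff_ratio = employees / desired_employees
    coord_ratio = coordination / desired_coordination
    return (desired_time
            * lookup(TIME_VS_COORD, coord_ratio)
            * lookup(TIME_VS_STAFF, staff_ratio))


for employees, coordination in [(6, 1.0), (7, 1.0), (6, 0.8), (7, 0.8)]:
    print(employees, coordination,
          round(work_quality(employees, coordination), 3),
          round(time_to_correct(employees, coordination), 2))
```

Holding the staff tables flat at or below a ratio of 1.0 encodes the idea that quality cannot
exceed its desired level no matter how few people are assigned.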

Analysis of Simulation Runs

Table 3 shows the parameters of the simulation runs we did to test various scenarios.

Simulation Run        Employees   Coordination
Base                  6           100%
Additional employee   7           100%
Reduced efficiency    6           80%
Suboptimal on both    7           80%

Table 3. Parameter Values in Four Simulation Runs

Figure 7 shows the effects of the parameter settings on Work Quality. The Base scenario (curve 1)
allows Work Quality to be perfect, which is expected.

Adding an additional employee (curve 2) reduces quality, again as expected. Reducing
coordination (curve 3) is a bit more deleterious to Work Quality, and it is hardly surprising that adding
an employee and reducing coordination reduces work quality to its lowest level (curve 4).

Figure 8 shows the effects of these scenarios on Undiscovered Rework. The pattern is the same
as it is for Work Quality. Undiscovered Rework increases a little more for each of the scenarios beyond
the optimal Base—poor for the Additional Employee (curve 2), worse for Reduced Coordination (curve
3), and worst for both policies together (curve 4).

[Plots for Figures 7 through 12; each shows one curve per scenario against Time (Day), 0 to 365]

Figure 7. Results of Scenarios on Work Quality

Figure 8. Results of Scenarios on Undiscovered Rework

Figure 9. Results of Scenarios on Rework Discovery Rate

Figure 10. Results of Scenarios on Major Incident Resolution Rate

Figure 11. Results of Scenarios on Unresolved Major Incident Tickets

Figure 12. Subtracting Staff Improves Work Quality, But There Is a Limit (curves: one employee,
five employees, seven employees)

Figure 9 shows the effects of the scenarios on
Rework Discovery Rate. Each of the scenarios beyond
the Base scenario increases the Rework Discovery
Rate, but mostly because the increased level of the
Undiscovered Rework stock by definition raises the
rate of its discovery.

Figure 10 shows the effects of the scenarios
on the Major Incident Resolution Rate. The pattern is
the same, with the Additional Employee (curve 2), the
Reduced Coordination (curve 3) and the two policies
combined (curve 4) showing progressively higher
rates of Major Incident Resolution. This is probably
the result of the increased number of unresolved
tickets in those three scenarios.

We show this in Figure 11, which shows the
effects of the scenarios on the number of unresolved
tickets. Again, the Additional Employee and the
Suboptimal on Both scenarios increase the number of
Unresolved Major Incident Tickets, which leads to
their higher resolution rates.

Figure 12 shows that although reducing the number of
staff assigned to a Critical Incident improves work
quality, there is a limit: work quality will never be
better than 100%. However, even with fewer
employees, a reduction of coordination will hurt Work
Quality (Figure 13). This strongly implies that it is the
quality of coordination, not the number of
employees, that determines the ultimate
performance of this system.

[Plot: Work Quality with Reduced Coordination against Time (Day); curves: five employees, five
employees and reduced coordination, one employee, one employee and reduced coordination]

Figure 13. Reducing Coordination Reduces Work Quality More than Subtracting Staff Improves It.

Policy Analysis and Suggestions

As we showed in Figure 1, MPHC’s current process flow for how to handle a Major Incident has a
process section for “Coordinate Recovery Efforts.” However, the process flow lacks detail on how the
coordination occurs. Based on the output from our model, the department should assign only the
appropriate resources to work the issue. If those resources need assistance or have questions for others
within the IT Department (or the business), they should reach out to them during the coordination, but
management should consider those people supplementary, and not part of the Resolution/Workaround
Team. Furthermore, instead of these primary resources calling into the phone bridge that is opened (a
conference line), those people should set up in a conference room in the IT building so that they can
communicate and troubleshoot together, instead of in silos. This will reduce the time to correct the
error at hand.

This working group should document the steps taken to troubleshoot, implement a workaround
and implement the final resolution. This not only would speed the rollback process (reversing steps
taken so far), if necessary, but would help in the future for similar or repeated Major Incidents, leading
to faster resolution times.
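
One way to picture such documentation is as a structured log in which every step records how to
undo it. The Python sketch below is purely illustrative and describes nothing MPHC uses: the
class names `Step` and `IncidentLog`, their fields, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Step:
    timestamp: datetime
    author: str
    action: str  # what was changed
    undo: str    # how to reverse the change
    phase: str = "troubleshooting"  # or "workaround" / "resolution"


@dataclass
class IncidentLog:
    ticket_id: str
    steps: List[Step] = field(default_factory=list)
    root_cause: str = ""
    resolution: str = ""

    def record(self, author: str, action: str, undo: str,
               phase: str = "troubleshooting",
               when: Optional[datetime] = None) -> None:
        """Append one documented step to the log."""
        self.steps.append(
            Step(when or datetime.now(), author, action, undo, phase))

    def rollback_plan(self) -> List[str]:
        """Undo instructions, most recent change first."""
        return [step.undo for step in reversed(self.steps)]


# Hypothetical usage with made-up entries.
log = IncidentLog(ticket_id="EXAMPLE-001")
log.record("analyst A", "disabled browser add-on", "re-enable browser add-on")
log.record("analyst B", "raised VDI memory limit", "restore VDI memory limit",
           phase="workaround")
print(log.rollback_plan())
```

Keeping the undo instruction next to each action is what makes the rollback order a simple
reversal of the log.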


Collection terms of access:
https://creativecommons.org/licenses/by/4.0/
