Achieving zero % network downtime [a practical case study]

by Nathan Mielke

In the 2017-2018 school year, the Technology Services team I was part of at Hartford Union High School (HUHS) delivered zero % network downtime! It was a great achievement and was always the outcome we were striving for. In our case, it gave students and staff uninterrupted access to digital tools to aid learning and teaching.

While it was truly a ‘championship’ caliber achievement, high-performing teams focus on the process, not the championship itself. So, what is the process? What’s the hustle that helps make it all happen?

A small part of it is straight-up dumb luck! No major piece of infrastructure went down. No fiber from the internet service provider (ISP) was cut in the area. The ISP didn’t have any infrastructure failures of its own leading to significant network outages. No one decided to launch a Distributed Denial of Service (DDoS) attack against the school – with no cloud scrubbing service we were 100% susceptible to that. Good fortune needs to be put at the front of the equation.

What else is part of the secret organizational sauce that addresses the causes of network downtime and leads to success?

  • People
  • Partners
  • Processes
  • Financial Support

People

Every technology team needs good people who are pulling in the same direction. If one person is off on their own, the whole team suffers. I’ve seen this firsthand, and I have struggled to move on people who weren’t pulling in the same direction as the team. Later we’ll talk more about processes, but if you have one person who isn’t consistent with departmental procedures, it knocks support structures and systems monitoring for a loop.

If you’re not following help desk ticket guidelines, stakeholder assistance is going to be uneven. Clients will learn to circumvent processes to get an outcome in a timeline that they prefer, as opposed to the timeline set for the entire organization.

Another key factor necessary for a high functioning technology department is having people with diverse and flexible skill sets. Everyone needs to be able to do a little bit of what another person does in the department.

In most schools, there are no technology “specialists”, only technology “generalists”:

  • Helpdesk Support personnel need to be able to diagnose a system problem
  • Desktop Technicians need to be able to take apart a laptop to add RAM
  • Instructional Technology teachers need to be able to hunt for answers online when something isn’t working quite right…

Cross training within the department is an essential element to keep the ball rolling.

Partners

To harken back to the days of Who Wants to be a Millionaire, every technology team needs to have a “Phone a friend option”.  When you ring that friend, they need to know something about your systems. This can’t be a cold call – “OMG, our network is down…can you help?” If the answer on the other end of the line is “Who is this?” you’re in a lot of trouble.

Trusted partners, vendors, consultants, managed services: whatever you want to call them, you should stay in touch with them at least quarterly so that they remain familiar with your environment. You don’t need to purchase blocks of time to fulfill this part of the formula. You simply need a trusted partner (the name I prefer) with whom you have a professional relationship, and who knows what you have going on in your school and on your systems (network infrastructure, server environment, firewall/security solutions, etc.).

Here’s an example where a vendor relationship was essential: The IT team came in on what most would expect to be a quiet Friday in June, but it didn’t start out quiet. Connections across the building were spotty – the wireless displays around the building were down, but students in summer school in a lab were able to log in and get to work. The wireless controllers were up and the connection to the ISP was fine.

What gives?

It turned out something was up with the Storage Area Network (SAN) – none of the virtual servers were reachable. With one phone call, a trusted partner was on the case helping the short-handed IT team (remember it’s June, so people are using vacation as often as possible) work through the issue. By 9 AM the domain controller was back online, handing out IP addresses to every machine that asked for one, and all was well again.

If no one had been available to assist at such short notice, the IT team might have struggled with the issue for hours. Sometimes you need to be able to phone a friend.

That incident, in June of 2017, was the last time the district experienced network downtime. The failure was, unfortunately, a self-inflicted wound, and could have been avoided if we had had proper email alerts and notifications in place for low disk space on the SAN (something that we corrected immediately). Those alerts would have been part of a coordinated approach to monitoring for network downtime and outages.

In your push toward zero % network downtime, find ways to mitigate self-inflicted wounds, and you’ll be money ahead. Monitoring was taken more seriously after the SAN overload. The SAN has since been on a terrific diet, even better than a Paleo Diet, and now takes up about ⅓ of the storage it used at the time of the crisis mentioned above.
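For readers who want to see what a low-disk-space alert can look like, here is a minimal Python sketch. It is not the setup HUHS used. A real SAN would normally be watched through the vendor's own tooling or a monitoring platform. The sketch assumes the storage is visible as a mounted path, and the 15% threshold, the path, the email addresses, and the SMTP host are all made-up placeholders.

```python
import shutil
import smtplib
from email.message import EmailMessage
from typing import NamedTuple


class DiskStatus(NamedTuple):
    path: str
    percent_free: float
    alert: bool


def percent_free(total_bytes: int, free_bytes: int) -> float:
    """Return free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def check_volume(path: str, min_percent_free: float = 15.0) -> DiskStatus:
    """Check one mounted volume; flag it when free space is below the threshold."""
    usage = shutil.disk_usage(path)  # named tuple of total, used, free (bytes)
    free_pct = percent_free(usage.total, usage.free)
    return DiskStatus(path, free_pct, free_pct < min_percent_free)


def build_alert_email(status: DiskStatus, sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) the notification email for a low-space volume."""
    msg = EmailMessage()
    msg["Subject"] = f"Low disk space on {status.path}: {status.percent_free:.1f}% free"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Volume {status.path} has {status.percent_free:.1f}% free space left.\n"
        "Free up or add storage before the volume fills."
    )
    return msg


def send_alert(msg: EmailMessage, smtp_host: str) -> None:
    """Send the alert through an SMTP relay (host name is an assumption)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Hypothetical mount point and addresses; replace with your own.
    status = check_volume("/", min_percent_free=15.0)
    if status.alert:
        email = build_alert_email(status, "alerts@example.org", "it-team@example.org")
        print(email["Subject"])
        # send_alert(email, "smtp.example.org")  # uncomment once a relay exists
    else:
        print(f"{status.path}: {status.percent_free:.1f}% free, no alert")
```

Run on a schedule (cron or Task Scheduler, for example), a check like this turns a silent slide toward a full volume into an email that arrives while there is still time to act.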

Processes

Any high-functioning team has processes in place to make sure work is done right.  To use a football analogy… When a high functioning offense comes to the line of scrimmage, the players know how to check the defense for the scheme and coverage the opponent is in. The quarterback is going through a process that has become an instinct – linebackers are off, safeties are playing deep…time to run the ball! Audibles in football are no different than any problem-solving process outside of a stadium.

Questions you should be asking

  • What challenge is the problem presenting you with?
  • What options do you have?
  • What’s your audible?
  • How are you tackling the problem?
  • Are your people running around freaking out, or are they calm, looking back at past issues that are similar in the helpdesk database?
  • Are they asking peers questions?
  • Are they asking Google questions and looking for a similar situation somewhere out there on the interwebs?
  • How is your team doing their root cause analysis?

Those are a lot of questions to ask, but they’re essential to building the proper processes to attack problems like they are a hostile defensive lineman looking for a big sack.

“Mistakes are the necessary steps in the learning process; once they have served their purpose, they should be forgotten, not repeated.”

VINCE LOMBARDI

Documentation and communication

Do some internal checks to harden your defense. Review your own homegrown knowledge management. Be sure to use helpdesk tickets as an opportunity to document information, not just as an annoyance to be closed ASAP.

How robust your knowledge management is says a lot about your internal communication as a team and how well the group functions as a learning organization. If every helpdesk ticket is a nail to be hammered and closed, then there likely isn’t much learning going on.

If you have built a culture of dialogue, discovery, and documentation because you enjoy the challenge, you’re on the right path. Zero % network downtime, or the process that builds toward it, happens by improving internal processes. The easier it is to find the information, the better off everyone is.

Financial support

Zero % network downtime doesn’t happen with chicken wire, bubble gum and 10-year-old network switches holding a network together. Simply put, there needs to be recognition and a commitment from “the powers that be” that budget is made available to adequately support the health of the business systems that everything in an organization runs on.

When I arrived at HUHS, it wasn’t pretty. The internal switching was dominated by 10-year-old network switches that were severely outdated. On top of that, those switches were 10/100Mb and not capable of 1Gb speeds, let alone 10Gb.

Old AND slow.

Back then, there were a lot of blinking fault lights, and no one really knew the last time the switches even had a firmware update. Bad for productivity and worse for security. Speed was no treat either. The edge switches that provided connectivity to the desktops were bottlenecking the whole operation at 100Mb.

“The internet is slow!” is usually how the complaints start when problems persist. The internet or the servers are almost always to blame (at least according to the users – whether this is actually true or not).  I don’t know why the Ethernet cabling doesn’t get blamed as much. I always say, in an attempt to break any tension, “the internet is always working, but sometimes our connection to it isn’t.”

It’s been my experience that historical helpdesk data can assist in telling the story you need to convey to get the money necessary to make the internet work. It turns the anecdote into a hard number of support tickets that refer to the issue at hand. This information makes the qualitative, quantitative!
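As a rough illustration of turning anecdotes into numbers, the Python sketch below counts tickets per month whose summary mentions a connectivity-related keyword. The `Ticket` structure, the keyword list, and the sample tickets are invented for the example. A real helpdesk system would supply its own export format and categories.

```python
from collections import Counter
from datetime import date
from typing import Iterable, NamedTuple


class Ticket(NamedTuple):
    opened: date
    summary: str


def count_matching_by_month(tickets: Iterable[Ticket], keywords: Iterable[str]) -> Counter:
    """Count tickets per YYYY-MM whose summary contains any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    counts: Counter = Counter()
    for ticket in tickets:
        text = ticket.summary.lower()
        if any(k in text for k in lowered):
            counts[ticket.opened.strftime("%Y-%m")] += 1
    return counts


# Made-up sample data standing in for a helpdesk export.
SAMPLE_TICKETS = [
    Ticket(date(2024, 9, 6), "Internet is slow in room 112"),
    Ticket(date(2024, 9, 14), "Printer jam in the library"),
    Ticket(date(2024, 9, 20), "Wi-Fi keeps dropping on staff laptop"),
    Ticket(date(2024, 10, 3), "Slow network in the computer lab"),
]

if __name__ == "__main__":
    counts = count_matching_by_month(SAMPLE_TICKETS, ["slow", "wi-fi", "dropping"])
    for month in sorted(counts):
        print(f"{month}: {counts[month]} connectivity tickets")
```

A monthly count like this is easier to put in front of a budget committee than "people keep complaining", though plain keyword matching is crude and will miss or over-count some tickets.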

Results

In conclusion, zero downtime is a goal that may never be met. Every circumstance is different. Your network can have no network outages and still be a terrible, slow network. There’s no honor in zero % network downtime if people can’t do their work and if people can’t innovate. That is the ultimate goal, isn’t it?  That people can do their job, that they can push on the boundaries of what’s possible without concern for the limits of technology currently implemented in their school.

Additionally, they feel supported. Supported people know they can push, they can do, and they can achieve to their heart’s content. What matters is that you, as a contributor, are striving toward that. These 4 areas (People, Partners, Processes and Financial support) are the pieces that help you toil toward the goal.

Nathan Mielke

Nathan is an educational technology leader working in Milwaukee and Southeast Wisconsin. He’s passionate about building reliable, efficient systems to support student learning and school operations and writes about continuous improvement in IT services and educational technology on his personal blog ndmielke.org. Nathan has worked with Source One Technology in multiple school districts.
