1. What is the actual problem?
This should be the first question an IT professional asks when troubleshooting any IT-related issue, even if only to verify the information that has already been provided. Typically this means having a conversation with the individual or group that reported the problem in the first place. It’s certainly not unheard of for a reported problem to get muddied or distorted as it passes through multiple people or channels before you first hear of it.
People often rephrase things when dictating what someone else previously said, so it’s quite possible for the original complaint to turn into something completely different as it passes through different people:
“The Amazon website tends to lock up my web browser whenever I add items into my Cart.” - Mary, Sales Department
“Helpdesk? Mary’s internet isn’t working when she’s online shopping.”
“Please help Mary so she can browse shopping sites. I think the internet filter is probably blocking that category.” - John, creating Helpdesk ticket
We’ve all encountered these types of scenarios in the past and they can be really frustrating, even more so when the issues are much more important than whether a single employee is capable of adding items to their Amazon shopping cart.
The point here is: don’t take what you’re told for granted. Spend the time necessary to verify that what is being reported to you is actually what’s occurring and is the original reason the issue was raised in the first place. Furthermore, taking the time to speak with the source (in this case, Mary) allows you to ask important follow-up questions that can further aid in diagnosing the problem as it’s being reported.
2. Who is experiencing the problem?
Without knowing who is experiencing the problem, you cannot focus your troubleshooting efforts on a precise area, and you might wind up going off in a direction that’s unnecessary or not even remotely related to the source of the problem. So one of the questions to ask is: who exactly is experiencing the problem?
Is it (for example):
- A single user
- A group/department of users
- The entire remote branch office location
- The entire main office location and remote branch offices
Every organization is different as it relates to the “Who”, but consider the following scenario involving a company’s IP phones. The likely underlying issue differs starkly once the IT professional called in to solve the problem has a clearer understanding of “Who” is actually affected:
A single user
- Jerry’s IP phone isn’t working
- This is likely an issue with Jerry’s phone specifically
A group/dept. of users
- The entire 2nd floor is having problems with IP phones
- This might be an issue specific to a network switch/VLAN on the 2nd floor
The entire remote branch office
- All users in the remote/branch office are having problems with IP phones
- This might be an issue specific to the VPN connection between offices
Main and remote offices
- All users in the main and remote offices are having problems with IP phones
- This might be an issue specific to the core switch or IP Phone System itself
The point here is, when the IT professional starts to understand “Who” is really affected, they can eliminate having to navigate down unnecessary paths while troubleshooting and can instead work towards narrowing down their troubleshooting efforts to a more specific and concise area. In the case of the single user above, why waste time troubleshooting the VPN tunnel when only Jerry is affected by the issue? This is why knowing the “Who” is extremely important.
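The narrowing-down logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Report` fields, the `likely_scope` function, and the suggested starting points are hypothetical and simply mirror the IP phone scenario; they are not taken from any real helpdesk tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One problem report (hypothetical fields for illustration)."""
    user: str
    site: str   # e.g. "main" or "branch"
    floor: int


def likely_scope(reports):
    """Return the narrowest scope that covers every affected user."""
    if not reports:
        return "no reports"
    users = {r.user for r in reports}
    floors = {(r.site, r.floor) for r in reports}
    sites = {r.site for r in reports}
    if len(users) == 1:
        return "single user: check that user's device first"
    if len(floors) == 1:
        return "one floor: check the access switch/VLAN serving it"
    if len(sites) == 1:
        return "one site: check that site's uplink or VPN connection"
    return "multiple sites: check the core switch or the shared service itself"


print(likely_scope([Report("jerry", "main", 2)]))
```

The output is a starting point for investigation, not a diagnosis. Its real value is that it forces you to collect the "Who" before choosing a direction.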
Here’s another example of something an IT Professional or Wireless Engineer hears from time to time: “Help! Wireless is completely down in the entire building. Everyone is reporting problems.” In these situations, do yourself a favor and pay special attention to words or phrases such as “entire”, “everyone”, and “completely down” when problems are reported. These “all-inclusive” phrases tend to exaggerate what’s really happening and have the potential to lead you astray.
It’s not uncommon that, while investigating the problem, the IT Professional or Wireless Engineer quickly learns that the wireless network being “completely down” for “everyone” in the “entire” building (which, for example, in a school, might affect 3,000+ users) turns out to be a single wireless Access Point being down in one small office, affecting 5 actual users (not the 3,000+ users that “everyone” seems to imply).
Bear in mind, problems can sometimes be overblown and overstated, especially when a user, or group of users, is regularly frustrated with or intimidated by technology (any IT professional has likely experienced those high-maintenance users that cry wolf over just about anything!).
3. When did the problem start?
Knowing when the problem actually started (with attention to fine details such as the exact day and exact time) can often provide a better understanding of the problem and help trigger more definitive ideas and potential solutions relating to the underlying root cause that the IT professional is expected to solve. Imagine being brought in by a new customer to resolve critical problems with their Internet services and being told,
“The internet pipe is a problem. People are randomly seeing spotty performance and oddball issues whenever web surfing and we don’t know why.”
Now, a less-experienced IT professional might just start diving headfirst into firewall logs, bandwidth monitoring, opening up a trouble-ticket directly with the ISP and trying to figure out what is going on, but someone with more experience will first pause to ask additional questions, wanting more specifics as to “When” the problem started happening.
- Has this ALWAYS been a problem?
- WHEN were these random internet browsing issues first reported?
Knowing the “When” gives the IT professional more precise information they can use when trying to uncover what’s actually occurring. If the answer to the above questions comes back as, “The first report of the issue came 10 days ago,” that MAY provide some additional insight as to what to look into next.
Looking back through firewall logs and bandwidth utilization metrics over the last 2-week period certainly makes sense knowing the issue presented itself within the last 10 days, but it hardly warrants spending much time at all looking at logs and bandwidth utilization metrics from 3+ months ago. That being said, once again, try to VERIFY the information being told to you. Perhaps the person giving you the answer vaguely remembers that it was 10 days ago, but in truth, it’s only been 3 days!
In this particular situation where the internet is being reported as sporadic, it’s altogether possible that roughly 11 days ago, another on-site computer technician decided to enable the UTM (Unified Threat Management) functionality within their firewall to allow for additional Antivirus inspection, IDS (Intrusion Detection System), Geo-IP Filtering, and a plethora of other goodies typically included in UTM feature-sets.
Unfortunately, as a direct result, the firewall’s processors/CPUs have become overloaded and cannot move traffic through it quickly enough to keep up with the additional processing demands required when the firewall’s UTM feature-set was enabled.
4. Is the problem intermittent or constant?
Another key element of an effective problem solving process is finding out whether the reported issue is occurring constantly or only intermittently. Problems that are constant, or fixed, are generally (though not always) easier to troubleshoot, whereas problems that are intermittent and seemingly random are generally more difficult to troubleshoot.
How many times have we as IT professionals been called in to troubleshoot a problem, only to find that upon our arrival, the issue suddenly doesn’t seem to exist anymore, yet no one did anything specific to actually resolve it!? Those situations can be really frustrating, not only for the IT professional but for the end-user as well, because the likelihood of the issue reappearing is rather high (and it most likely reappears just a few short moments after the IT professional has left!).
The best thing to do in these scenarios is document WHEN the issue occurred and how LONG it lasted before it miraculously “fixed itself”, so the next time that same problem is reported, you might be able to piece together some crude and basic assumptions or theories based on WHEN it happened previously and how LONG it lasted each time.
Wireless chaos only at lunchtime?!
Even with intermittent problems, given enough time, an IT professional may be able to gather enough information to work out that the “wireless issue” the Accounting Department has reported typically occurs only around the noon hour and lasts for up to 30 minutes or so. As chance would have it, the nearest wireless AP for the Accounting Department is in the nearby break/lunch room. What else is in that break/lunch room aside from the wireless AP (and horrible coffee smells)? Ding! You guessed it! A microwave for people to heat up their lunch. Whenever it is running and heating up leftovers from last night’s spaghetti dinner, it interferes with the 2.4 GHz wireless spectrum the AP operates on, which goes absolutely haywire and ruins any possibility of stable wireless performance.
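Documenting WHEN and how LONG pays off once the entries are grouped by time of day. The snippet below is a minimal sketch of that idea; the incident list is made-up sample data, and the function names `incidents_by_hour` and `busiest_hour` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Made-up incident notes: (start time, duration in minutes)
incidents = [
    (datetime(2024, 3, 4, 12, 5), 25),
    (datetime(2024, 3, 5, 12, 2), 30),
    (datetime(2024, 3, 6, 12, 10), 20),
    (datetime(2024, 3, 7, 15, 40), 5),
]


def incidents_by_hour(log):
    """Count how many incidents started in each hour of the day."""
    return Counter(start.hour for start, _duration in log)


def busiest_hour(log):
    """Return (hour, count) for the hour in which most incidents started."""
    return incidents_by_hour(log).most_common(1)[0]


hour, count = busiest_hour(incidents)
print(f"{count} of {len(incidents)} incidents started in the {hour}:00 hour")
```

Grouping by whole hours is crude (an incident at 11:58 lands in a different bucket than one at 12:02), but even a rough tally like this can surface a lunchtime pattern that nobody noticed from individual tickets.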
5. What changed recently?
This is one question that is unfortunately not asked often enough, is just plain overlooked, or in other cases is completely disregarded (shame on you if you fall into that category!). Technology is a very touchy and hypersensitive beast, and more often than not, it doesn’t take too kindly to changes. Even changes that are supposed to solve and prevent other known problems often introduce new and unexpected ones.
It’s not unheard of that sometimes even routine maintenance on equipment can cause problems.
Take, for example, updating firmware on a network switch. This should be a relatively trouble-free routine operation, but suddenly users are reporting that they’re occasionally having problems logging into their desktops. It’s happening to more than one user. In fact, it’s being reported sporadically throughout the building in the early morning hours, when most employees arrive for the start of their shift.
“What Changed” recently? Over the weekend you decided to update the firmware on your edge switches, and now the port security that was set up on the switches using AAA authentication with RADIUS isn’t behaving as expected. Unfortunately, it looks like the new firmware update might have introduced a bug! What’s the solution? Roll your switches back to the previous firmware revision, or look for even newer firmware code that might resolve the problem.
Or take a situation where your VMware host servers have been running flawlessly for an entire year, yet suddenly you start seeing Purple Screens of Death (PSOD) on one of them every few days (or even several times a day!), which forces VMware HA (High Availability) to trigger and restart your downed VMs on another available VMware host in your cluster.
You haven’t changed anything with the VMware software itself; you’re still running the same trusted vSphere 6.0 Update 1 release that has been rock solid and problem-free in your environment. So “What Changed” recently? Wait a minute, come to think of it, the host server that is regularly crashing had an additional 64GB of memory added to it one week ago! It might be worth removing that extra 64GB of memory and seeing if the problem goes away. It certainly wouldn’t be the first time new or additional hardware was the cause of the underlying issue.
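If changes are written down anywhere at all, answering “What Changed” becomes a simple lookup. Here is a minimal sketch, assuming a change log kept as (date, description) pairs; the entries, the host name `esx02`, the `changes_before` function, and the 14-day default window are all hypothetical choices for illustration.

```python
from datetime import date, timedelta

# Hypothetical change log entries: (date, description)
changes = [
    (date(2024, 1, 10), "Replaced UPS battery in server room"),
    (date(2024, 3, 1), "Added 64 GB of memory to host esx02"),
    (date(2024, 3, 9), "Added guest SSID"),
]


def changes_before(first_report, log, window_days=14):
    """Return changes made in the window_days up to first_report, newest first."""
    earliest = first_report - timedelta(days=window_days)
    hits = [(day, text) for day, text in log if earliest <= day <= first_report]
    return sorted(hits, reverse=True)


for day, text in changes_before(date(2024, 3, 2), changes):
    print(day.isoformat(), text)
```

A change that falls inside the window is only a suspect, not a proven cause, and the result is only as good as the "when did it start" answer you feed it, which is one more reason to verify that date.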
6. Can the problem be recreated?
Another helpful step for effective problem solving is trying to recreate the actual problem. As discussed before, reported problems can either be of a constant or intermittent nature. Taking the time to re-create the problem can be beneficial and especially helpful in cases where you might need to break out tools such as Wireshark to capture packets and network traffic for future analysis and evaluation. IT professionals have to make use of such tools in more complex technical support issues especially when the flow of network traffic is in question or when there’s a need to examine whether the traffic is making it from the source to destination devices.
If possible, take advantage of any sandbox or test environments that are available. Having these environments gives you the flexibility to recreate the issue and effectively “break” things on purpose, without putting your production network or systems at risk and without interrupting services that end-users are relying on during standard business hours.
Recreating the problem is also advantageous in situations where the IT professional needs to involve 3rd party technical support from a vendor. Often, these vendors will have the means to establish remote sessions to take control of your desktop (or the machine on which you’ve successfully recreated the problem), which lets the vendor actually see the issue while it’s occurring and further helps diagnose what is happening.
7. Are benchmarks and logs available?
Having some kind of benchmarking tool available to track and record network and server performance is invaluable when an IT professional is tracking down challenging technical issues. One of the key areas worth checking when problems are reported is the actual METRICS over a historical period of time. Metrics can prove invaluable when trying to figure out whether the reported problem actually exists or is a false positive.
Maybe you’ve been in a situation where someone reports, “The file server is really slow today!”
Without historical benchmarks available, taking a look at the current server performance may not yield any fruitful results because the CPU, disk, network, and memory counters all SEEM to be operating at a reasonable level, but based on and compared to what exactly?
With historical benchmarks available, there is a foundation to actually compare today’s performance on the server as it relates to the CPU, Disk, Network, and Memory (and any other metric/counter you want) VERSUS what the server has been utilizing for the past days, weeks, or months prior.
What historical benchmarks might help you discover is that there is absolutely NO difference in the server performance today versus previous days, weeks, or months. The complaint of “The file server is really slow today” turns out to be a false positive in that case, proven by the metrics and historical benchmarks. Finding the real cause of, and resolution to, the user’s complaint is going to require you to start looking into other areas aside from the server itself. Perhaps it’s a client-side issue or a networking issue.
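The “compared to what exactly?” question can be made concrete by measuring how far today’s reading sits from the historical average. The following is a rough sketch, not a substitute for a monitoring tool: the CPU figures are made up, `deviation_from_baseline` is a hypothetical helper, and treating anything beyond roughly 2 to 3 standard deviations as unusual is a rule of thumb rather than a standard.

```python
import math
from statistics import mean, stdev


def deviation_from_baseline(history, today):
    """Return how many standard deviations today's value is from the historical mean.

    history needs at least two data points for stdev() to work.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any difference at all is notable.
        return 0.0 if today == mu else math.copysign(math.inf, today - mu)
    return (today - mu) / sigma


cpu_history = [22, 25, 24, 23, 26, 24, 25]  # made-up daily average CPU %

for reading in (25, 60):
    z = deviation_from_baseline(cpu_history, reading)
    print(f"CPU at {reading}% is {z:.1f} standard deviations from baseline")
```

A reading close to the baseline suggests the “slow server” complaint is a false positive as far as that counter is concerned, which is the cue to look at the client or the network instead.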
Having benchmarks available is crucial in taking out illogical guess-work and assumptions, and replacing them with hard evidence and facts to back up your problem solving process. There are countless software options available that will give you the data you need for metrics, though we often recommend using PRTG from Paessler, which is a wonderful utility for acquiring benchmarks on your network and servers.
Logs are another important thing to consider during the troubleshooting process. Going back into log history can give a stumped IT Professional some additional clues as to what is going on, especially in cases where the question of “When did the problem start?” remains unanswered.
Having network devices (switches, routers, firewalls, wireless, etc.) sending their log information to a dedicated syslog server (for example, Kiwi Syslog Server from SolarWinds) gives someone the opportunity to search for entries related to particular devices (by IP address) for specific warning messages or error messages.
Syslog messages and the historical information gathered here can sometimes help point the IT Professional in the right direction, not to mention, the logs themselves can be extremely valuable to the vendor of the product as well when they are involved in troubleshooting what is happening.
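Searching collected logs by device and severity might look like the sketch below. It assumes the raw lines follow the classic BSD syslog layout (RFC 3164), where the leading `<PRI>` value encodes `facility * 8 + severity` and lower severity numbers are more serious. Log files written by a particular syslog server may use a different layout, so the regular expression is an assumption to adapt; the sample lines and the `warnings_for` helper are invented for illustration.

```python
import re

# Assumed layout: "<PRI>Mmm dd hh:mm:ss host message"
LINE = re.compile(r"^<(\d{1,3})>(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")


def warnings_for(host, lines, max_severity=4):
    """Yield (timestamp, severity, message) for host at severity <= max_severity.

    Severity 0 is Emergency and 7 is Debug, so 4 means "Warning or worse".
    """
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        pri, stamp, source, message = match.groups()
        severity = int(pri) % 8
        if source == host and severity <= max_severity:
            yield stamp, severity, message


# Invented sample lines, not output from any real device
sample = [
    "<187>Oct 11 22:14:15 10.0.0.5 link down on port 1",
    "<190>Oct 11 22:14:20 10.0.0.5 logging started",
    "<188>Oct 11 22:15:01 10.0.0.9 fan speed low",
]

for entry in warnings_for("10.0.0.5", sample):
    print(entry)
```

Filtering on one device’s IP address and on severity keeps the search focused on the warning and error messages described above instead of the much larger volume of informational entries.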
8. I’m officially stuck – now what?
Alright, so you find yourself in one of those rather unpleasant circumstances where you’ve asked all the right questions, dug into your resourceful bag of tricks, and find that you’ve exhausted all your technical knowledge and ability to track down the source of the problem. What do you do now?
The first step is DON’T PANIC. More often than not, an IT professional’s problem solving becomes substantially less effective when they are stressed out and under pressure (although in some rare cases, people flourish under these “trial by fire” scenarios). Keeping panic at bay helps a person remain calm and focused, and allows them to keep walking logically through the problem solving process.
This is, however, easier said than done when there are countless emails and phone calls coming in demanding an update as to when the problem will be fixed (and let’s not forget potentially angry bosses that might be clueless as to why the problem is taking more than 10 minutes to resolve!).
An IT professional would be wise to remind themselves that they’ve been in similar situations in the past and have always been able to figure out the problem given enough time to properly diagnose what is going on. So, don’t beat yourself up too heavily if the solution seems out of reach and you need to call in the cavalry.
The second step is just that, call in the cavalry! Let’s face it, there will always be instances where even the most seasoned IT professional needs assistance from peers, vendors or other resources. None of us are capable of knowing absolutely everything. When you find yourself struggling, don’t be afraid to reach out for help! What does that mean?
- Call tech support for a particular hardware/software vendor
  - Open a case with, for example, Cisco TAC support
  - Open a case with, for example, Microsoft PSS support
- Reach out to more experienced IT professionals
  - Involve a co-worker, professional colleague, or peer
  - Partner with a local and trusted IT vendor
- Search online resources for help
  - Google can be your friend (be careful of “quick-fix” solutions you find)
  - Look into vendor-specific forums (most large vendors have them)
The problem solving process in summary
Be sure to give yourself the absolute best chance to combat those dreaded technical support issues. The next time someone contacts you and yells in a panic, “Email is broken!”, understand that you can more quickly deduce what is actually going on and minimize the amount of time necessary to resolve the problem by simply asking the right questions:
- What is the Actual Problem?
- Who is Experiencing the Problem?
- When did the Problem Start?
- Is the Problem Intermittent or Constant?
- What Changed Recently?
- Can the Problem be Recreated?
- Are Benchmarks and Logs Available?
- I’m Officially Stuck – Now What?
Keep in mind, however, that not only do you need answers to those questions, but you need answers that are accurate.
As stated earlier, this means the IT professional may need to take the necessary time to validate the answers being provided to them. Inaccurate answers and misinformed facts will send you down the wrong troubleshooting path and unnecessarily prolong the amount of time necessary to resolve complex technical support issues. So get your facts straight!
Having the answers to these questions will allow you to immediately narrow down the scope of the problem and the potential areas at fault, conduct tests, formulate conclusions, and resolve problems even faster than you may have anticipated.
2 thoughts on “An effective problem solving process for IT professionals”
Found your article very interesting. I can definitely identify with all of the points you made, especially troubleshooting. Either you can or can’t troubleshoot and think logically through an issue or problem. You are right in mentioning that it’s something you really cannot teach. One other thing that helps with logically stepping through the process is documentation. There should always be a repository where network diagrams, server builds, OS versions etc. are kept. I understand that a lot of times these documents cannot be relied upon due to being out of date, and it seems most people scoff at the idea of keeping good documentation. But I believe it to be important to help with any troubleshooting. You also mentioned the question, “Did anything change?” or “What changed?” That’s a big issue when attempting to troubleshoot. Every place I have worked at always used a change management process that documented every single change, no matter how small. Of course these places had to by law (SOX audits) because they were publicly traded companies. Just wanted to say, good article!
That is a great article with some excellent questions. Working with students and teachers, I’d throw in a few extra suggestions.
1. What is a reasonable timeline for solving the problem? Often times a lack of communication to this question leads to frustration and long term mistrust regarding the reliability of technology. Asking what needs to be done from the end user’s perspective, and knowing their timeline for completion is helpful. Giving them a reasonable amount of time in which they can expect the issue to be resolved sets everybody up for success around reasonable expectations.
2. Suggest potential work-arounds when necessary — Standing in front of a group of adults and attempting to present when the technology is not working is overwhelming and frustrating. The same tech failure when you are working with a group of students and you start to lose their attention — it’s a nightmare! Knowing what tools your district provides for staff and their general purpose may allow you to offer some potential work-around ideas until the problem is resolved. There is not a fix for everything, but when you can suggest a reasonable alternative in the moment, you offer more than just tech support — you offer customer service.