By James Fraze
There could be any of hundreds of things wrong with most network connectivity issues. This guide will attempt to help you narrow it down as fast as possible, using the least amount of effort.
Here are the categories of testing/troubleshooting:
- Common things to consider on EVERY network troubleshooting issue
- Most Common Testing: common issues that are easy to rule out
- NO Connectivity Testing: a checklist of things that might be wrong when there is no connectivity
- Intermittent connectivity: if there is connectivity only sometimes, it is likely one of these issues
These are touch points that you should be aware of for EVERY troubleshooting session. It is rare that your user will tell you exactly what you need. In their eagerness to solve the problem they will quite often mislead you, by accident of course.
Think of good questions, and control the troubleshooting session to get what you need.
Is there an outage that your NOC/SOC knows about? If you are coordinating changes, this is an easy one to get a quick answer on.
- Use the same application to test
- Test from the same server and client
- Test during the same time of day/traffic load
It is common to generate “fake” traffic using telnet on the same port as the application under test. If you are testing through an application-level firewall, though, that traffic might get blocked simply because telnet on port xxxx is not the same as the actual application on port xxxx.
Get Error Message
- Apache Error Codes
- System Logs
- Dialogue Box Popups
- Time of Message
Often the message can be searched for in a knowledge base or on Google, or it will trigger a tribal memory of how to fix it. Don’t let the user simply tell you “it doesn’t work.”
The presence of a message will also tell us several things:
- Which server responded with the message
- Routing Issues can be skipped (messages can’t exist without a route)
Asking the user intelligent questions will get them to admit why it isn’t working so you don’t have to troubleshoot it. “Well normally I connect from work, but I brought my laptop home and …..”
Don’t Trust The User
If at all possible, do a screen share or screen capture of the user starting the application, initiating the connection, etc. They might have requested access from their PC, but are actually using their PC to remote desktop to another server and making the connection from there. Watching them make the connection will quickly eliminate hours of searching through logs for traffic that never existed.
Verify Test Methods
The user provided you with logs or error messages. That’s outstanding! Verify exactly how they got them. They will often send logs that are well meaning, but from the wrong computers or devices. Verify with them each detail and don’t trust them!
Verify User Communication
Explicitly tell the user what to click, what to do, etc. Overeager users will very often chase an irrelevant rabbit down the hole. Take charge of the conversation, and tell them exactly what you want them to do.
Most Common Testing
If you can test most of the common issues with only a few minutes of work, you might close more cases more quickly, or with less effort. These are the common things to look over before you dig into a deep troubleshooting session.
A simple ping test might solve this for you. If you can ping the server from the client and the client from the server, there are no route issues. If you can’t, then ICMP might be blocked/filtered, or there might be no route.
Ask the customer for the following information from both the client and the server:
- traceroute to the other host (tracert on Windows)
- netstat -rn
The traceroute will show you if the to and from paths are the same. If they aren’t, you could have an asymmetric routing issue, and most stateful devices will block this type of traffic by default.
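The path comparison described above can be sketched in a few lines of Python. This is only an illustration: traceroute reports interface addresses, so the sketch assumes you have already mapped each hop to a device name from your network diagram. The device names below are hypothetical.

```python
from itertools import zip_longest


def first_divergence(forward_hops, return_hops):
    """Return (position, forward_device, return_device) for the first hop
    where the two paths differ, or None when the paths are symmetric."""
    # The return path is listed from the server's point of view,
    # so reverse it before comparing hop by hop.
    mirrored = list(reversed(return_hops))
    for position, (out_hop, back_hop) in enumerate(zip_longest(forward_hops, mirrored)):
        if out_hop != back_hop:
            return position, out_hop, back_hop
    return None


# Hypothetical device names, client to server and server to client.
client_to_server = ["edge-rtr", "fw-a", "core-rtr"]
server_to_client = ["core-rtr", "fw-b", "edge-rtr"]

print(first_divergence(client_to_server, server_to_client))
```

Here the return traffic passes through a different firewall than the outbound traffic, which is exactly the situation a stateful device tends to drop.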
The netstat command will show you a routing table that indicates which interface the traffic should leave from. This should give you a clue to review network diagrams and determine if that is the correct zone, or path for it to take.
Remember that each device only needs to know where to send traffic for its next hop; it doesn’t know anything about the network beyond that. Each other device is responsible for knowing where its own next hop is.
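To make the routing table lookup concrete, here is a minimal sketch of how a device picks the route, and therefore the interface and next hop, for a destination: the most specific matching network wins. The table entries are hypothetical, and a real device also weighs metrics and policy, which this sketch ignores.

```python
import ipaddress

# Hypothetical routing table: (destination network, gateway, interface).
ROUTES = [
    ("0.0.0.0/0", "192.0.2.1", "eth0"),
    ("10.20.0.0/16", "10.1.1.254", "eth1"),
    ("10.20.30.0/24", "10.2.2.254", "eth2"),
]


def select_route(destination, routes=ROUTES):
    """Return the most specific (network, gateway, interface) that contains
    the destination address, or None when no route matches."""
    address = ipaddress.ip_address(destination)
    matches = []
    for network, gateway, interface in routes:
        parsed = ipaddress.ip_network(network)
        if address in parsed:
            matches.append((parsed, gateway, interface))
    if not matches:
        return None
    # Longest prefix match: the route with the largest prefix length wins.
    best = max(matches, key=lambda route: route[0].prefixlen)
    return str(best[0]), best[1], best[2]


print(select_route("10.20.30.5"))
print(select_route("8.8.8.8"))
```

If the interface that comes back is not the one your network diagram says the traffic should leave from, you have found where to look next.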
Missed Firewall Rule/Change
- Verify in the audit logs that the change actually happened
- Verify the CORRECT change was implemented
- Review logs to see if that device is blocking traffic
It is common for a user to request the wrong information, or to implement the wrong information. Clear up the communication before a change, and when troubleshooting – verify the information. Users very often have no clue of what they need to get their applications working across a network and many engineers will blindly put in information without stopping to think if that’s really what the user needs (and will fail to ask them for verification).
Did the implementer of the change, or the requester fat-finger something? Did they create the rule wrong? Wrong zones? Wrong IP addresses? After you verify the actual change they wanted, check the rule sets to see if that was implemented as requested or if they requested the wrong change.
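A field-by-field comparison of the request against the rule that was actually installed catches most fat-finger mistakes. The sketch below assumes you can write both down as simple dictionaries; the field names and addresses are hypothetical and not tied to any firewall vendor.

```python
def rule_differences(requested, implemented):
    """Return {field: (requested_value, implemented_value)} for every
    field whose value differs between the request and the installed rule."""
    fields = sorted(set(requested) | set(implemented))
    return {
        field: (requested.get(field), implemented.get(field))
        for field in fields
        if requested.get(field) != implemented.get(field)
    }


# Hypothetical change request and the rule found on the device.
requested = {
    "source": "10.1.1.15",
    "destination": "10.9.8.20",
    "port": 1433,
    "source_zone": "inside",
    "destination_zone": "dmz",
}
implemented = dict(requested, source="10.1.1.51")  # transposed digits

print(rule_differences(requested, implemented))
```

An empty result only means the rule matches the request. It does not tell you whether the request itself was what the user really needed.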
NO Connectivity Testing
If you are satisfied it’s not a configuration or user error, or some other quick error to diagnose, but are still unsure why there is NO connectivity, you can start detailed troubleshooting.
Again, try to think of the most common and fastest things to test before you go directly to a core dump and analyzing connection tables. That stuff takes time and you might be able to solve it without deep analysis.
- Are cables good?
- Are lights on, etc
- Are they connected on the proper network
- Does any other network traffic work for them?
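- netstat -s will show per-protocol network statistics such as errors and retransmissions; netstat -i shows per-interface error counters. Link speed comes from a separate tool, such as ethtool on Linux.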
- Is the service running on the server?
- Can anyone else connect to the server from anywhere else?
- netstat -tunlp | grep XXXX to see if the port XXXX is listening (linux)
- netstat -an | findstr XXXX (windows)
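One caveat with grep or findstr on a port number is that 3306 also matches 33060. When the customer sends you saved netstat output, a small parser can match the port exactly. This is a sketch that assumes the usual column layout of Linux and Windows netstat; the sample lines are illustrative, not taken from a real system.

```python
def listening_on(netstat_output, port):
    """Return the local addresses listening on exactly this TCP port."""
    listeners = []
    for line in netstat_output.splitlines():
        columns = line.split()
        # Linux prints LISTEN, Windows prints LISTENING.
        if "LISTEN" not in columns and "LISTENING" not in columns:
            continue
        # The local address is the first column that contains a colon.
        local = next((column for column in columns if ":" in column), None)
        if local is None:
            continue
        # Split on the last colon so IPv6 addresses keep their own colons.
        local_port = local.rpartition(":")[2]
        if local_port == str(port):
            listeners.append(local)
    return listeners


SAMPLE = """\
tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN      1234/mysqld
tcp        0      0 127.0.0.1:33060         0.0.0.0:*               LISTEN      1234/mysqld
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING
  TCP    0.0.0.0:8080           0.0.0.0:0              LISTENING
"""

print(listening_on(SAMPLE, 3306))
```

A listener bound only to 127.0.0.1 is also worth noticing: the service is running, but nothing outside the server can reach it.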
IP vs Host Names
- Test with IP instead of hostname
- Verify by pinging hostname from their server/client that matching IP is returned
- Hard coded hostnames in applications/code could refer to the wrong IP
- Invalid host entries might be listed in /etc/hosts or windows hosts file
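The hostname check can be scripted so it runs the same way on the client and the server. The sketch below uses the system resolver, which on most systems consults the hosts file as well as DNS, so a stale hosts entry should show up here too. The `resolver` parameter is an assumption added only so the logic can be exercised without a network.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry ends with a socket address whose first element is the IP.
    return sorted({info[4][0] for info in infos})


def hostname_matches(hostname, expected_ip, resolver=resolve):
    """Return True when expected_ip is among the addresses hostname resolves to."""
    try:
        return expected_ip in resolver(hostname)
    except socket.gaierror:
        # The name did not resolve at all.
        return False
```

Run `hostname_matches("the-hostname", "the-expected-ip")` from both ends. If the two machines disagree, the problem is name resolution rather than the network path.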
Inspecting logs will quickly tell you if a firewall or router ACL is blocking/dropping traffic.
If you have done a fake traffic test using telnet, you will usually get one of the following two responses (example using telnet to test HTTP on tcp/80):
telnet someplace.com 80
… Connection refused
Connection refused means that something actively rejected the connection, usually with a TCP reset: either a firewall/ACL configured to reject the traffic, or the server itself because nothing is listening on that port. But remember, telnet is NOT http! So if you are using application-level firewalls, you can’t test like this. If you know you aren’t using application-level firewalls that verify application headers, this is a quick test that MIGHT show you good info.
telnet someplace.com 80
… Connection Timed Out
Connection timed out can mean any of several things. It’s not a reliable test because the firewall/ACL could be configured to silently drop the traffic instead of responding, or there could be no route back, etc.
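If telnet isn’t installed, the same fake traffic test can be sketched with a plain TCP connection in Python. Like telnet, it only completes a TCP handshake and sends no application data, so the same warning about application-level firewalls applies. The labels returned below are this sketch’s own wording, not standard error strings.

```python
import socket


def classify(error):
    """Map the outcome of a TCP connect attempt to a short label."""
    if error is None:
        return "connected"
    if isinstance(error, ConnectionRefusedError):
        # A reset came back: something answered and rejected the connection.
        return "refused"
    if isinstance(error, (socket.timeout, TimeoutError)):
        # Nothing came back before the timeout expired.
        return "timed out"
    if isinstance(error, socket.gaierror):
        return "name lookup failed"
    return "other: " + str(error)


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and return the classified result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return classify(None)
    except OSError as error:
        return classify(error)
```

For example, `probe("someplace.com", 80)` returns one of the labels above. Treat "timed out" with the same caution as the telnet result: it narrows nothing down by itself.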
Firewalls/ACLs are usually pass or fail. It should show a drop in the logs if that traffic is logged, or from the command line monitors that each device has. However, it might not show it if the connection tables are full, like in a denial of service attack, port scan, or misbehaving app that uses up ports but never lets them time out. The common test will be to look at logs, or run from the command line to verify.
Many appliances and firewalls are built on some form of Linux. As such, they have system logs in /var/log, and grepping through them might show information that isn’t normally logged in the traditional logging software. Test the easy things first though!
See spoofing for a clever way to check for all drops, even if they aren’t logged, on a CheckPoint Firewall. Each firewall will have a different way to report this though.
Most firewalls allow for zone setup. You can specify which zones can communicate with each other. When traffic comes from an unexpected zone, the firewall will drop it. The confusing part is that this traffic is often not logged, or it is allowed from client to server but not from server to client, for example. Each firewall has a way to review this. In CheckPoint Firewalls, for example, you can use the zdebug command to see ALL drops, including spoofing:
expert# fw ctl zdebug drop | grep x.x.x.x
A common mistake on all firewalls is for the operator to label the zones incorrectly in a configuration rule. Verify that the zones are correct, or create a rule that doesn’t verify zones to see if the zones are the issue.
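The zone check itself is simple to reason about: the firewall expects each source address to arrive on the interface whose zone contains it. The sketch below illustrates that idea only; the interface names and networks are hypothetical, and real firewalls define their topology in their own configuration formats.

```python
import ipaddress

# Hypothetical topology: source networks expected behind each interface.
TOPOLOGY = {
    "eth1": ["10.1.0.0/16"],   # inside
    "eth2": ["10.9.8.0/24"],   # dmz
}


def source_expected(interface, source_ip, topology=TOPOLOGY):
    """Return True when source_ip belongs to a network defined behind interface."""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(network)
        for network in topology.get(interface, [])
    )


print(source_expected("eth1", "10.1.1.15"))
print(source_expected("eth2", "10.1.1.15"))
```

When the second kind of result shows up for real traffic, either the traffic is arriving by an unexpected path or the zone definition on the firewall is wrong.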
Depending on the application, it might have built-in ACLs that prevent certain traffic. SSH is an example of an application that can deny/restrict certain traffic based on IP at the application level. Verify the application is properly set up.
The netstat -an/netstat -tunlp command will let you know what port the application is listening on. The client might think their SQL server was set up on port 3306, but maybe it’s been running on a different port all along? Verify it.
If the type of error you are experiencing is intermittent, it’s likely not a firewall, routing or ACL, though it’s possible.
Many devices will drop packets when their CPU load is too high. Often this is caused by trying to log too much at once, and the CPU simply cannot keep up. These devices will often log something about this error, but even if they don’t, looking for extremely high CPU load will reveal this type of intermittent traffic drop.
Linux uses the top and ps commands to give CPU usage information.
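On a Linux-based device where Python is available, the load average gives a quick first reading before you open top. This is a rough sketch: load average also counts processes waiting on I/O, and the threshold of 1.0 per CPU is an assumption, not a vendor limit.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Normalize a load average by the number of CPUs."""
    return load_average / max(cpu_count, 1)


def is_overloaded(load_average, cpu_count, threshold=1.0):
    """Return True when the normalized load exceeds the threshold."""
    return load_per_cpu(load_average, cpu_count) > threshold


def current_status():
    """Read the 1-minute load average on a Unix-like system and classify it."""
    one_minute = os.getloadavg()[0]
    cpus = os.cpu_count() or 1
    return one_minute, cpus, is_overloaded(one_minute, cpus)
```

Calling `current_status()` a few times during the failure window is more telling than a single reading, since the drops described above follow the load spikes.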
Bandwidth problems are usually easy to recognize because everything is slow: connections sometimes work and sometimes time out. What is happening is that packets are being dropped on the saturated link and TCP keeps retransmitting them. Thus, sometimes a connection will be established and sometimes not.
In the case of a download, like a webpage, you might get some of the header, and some of the html/xml but then a dropped connection. Suspect bandwidth issues in cases like this.
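On a Linux host, the retransmissions mentioned above can be measured from the TCP counters in /proc/net/snmp. The sketch below parses that file’s two `Tcp:` lines (names, then values) and computes the share of sent segments that were retransmissions. The sample numbers are made up for illustration.

```python
def tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines (names, then values) into a dict of integers."""
    tcp_lines = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp header line and a Tcp value line")
    names, values = tcp_lines[0], tcp_lines[1]
    return dict(zip(names, (int(value) for value in values)))


def retransmit_ratio(counters):
    """Return retransmitted segments as a fraction of all segments sent."""
    sent = counters.get("OutSegs", 0)
    return counters.get("RetransSegs", 0) / sent if sent else 0.0


# Illustrative sample in the layout of /proc/net/snmp (values are invented).
SAMPLE = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts InCsumErrors\n"
    "Tcp: 1 200 120000 -1 500 300 10 5 12 90000 100000 2500 0 40 0\n"
)

print(retransmit_ratio(tcp_counters(SAMPLE)))
```

The counters are cumulative since boot, so take two readings a minute apart during the slow period and compare the difference rather than the totals.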
Routing convergence or similar routing issues might make the traffic work sometimes, but not others.
In the case of a clustered device, clients/servers might have routes to one specific firewall/device when they should have a route to the cluster’s virtual IP. When the clustered device fails over, the traffic fails.
Each device has a limit to the number of connections it can handle. When it goes over that limit, it will simply drop additional connections. The applications will also have limits, and it is worth investigating both of these for intermittent failures.
Common causes of connection limits reached are port scanners, penetration testing, and apps that have extremely long timeouts.
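To see who is using up the connections, you can group saved netstat -an output by remote address. This is a sketch that assumes the usual Linux or Windows netstat column layout; the sample lines use documentation addresses and are purely illustrative.

```python
from collections import Counter


def connections_by_peer(netstat_output, state="ESTABLISHED"):
    """Count TCP connections in the given state, grouped by remote IP address."""
    peers = Counter()
    for line in netstat_output.splitlines():
        columns = line.split()
        if state not in columns:
            continue
        # The first two columns containing a colon are local and remote endpoints.
        endpoints = [column for column in columns if ":" in column]
        if len(endpoints) < 2:
            continue
        remote_ip = endpoints[1].rpartition(":")[0]
        peers[remote_ip] += 1
    return peers


SAMPLE = """\
tcp        0      0 10.0.0.5:443      203.0.113.7:51544     ESTABLISHED
tcp        0      0 10.0.0.5:443      203.0.113.7:51545     ESTABLISHED
tcp        0      0 10.0.0.5:443      203.0.113.7:51546     ESTABLISHED
tcp        0      0 10.0.0.5:443      198.51.100.2:40110    ESTABLISHED
tcp        0      0 10.0.0.5:443      198.51.100.2:40111    TIME_WAIT
"""

print(connections_by_peer(SAMPLE).most_common(1))
```

A single peer holding a large share of the table points toward a scanner or a misbehaving app, while an even spread points toward a limit that is simply too low for normal load.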
Future articles will cover specifics on individual technologies such as packet captures, debugging, traces and more. For now though, apply these principles to quickly find where a problem might exist, so you can invest your time wisely in troubleshooting for your clients.