The First 3 Steps To Take When Your Network Goes Down

Whether it is the middle of the day, or the middle of the night nobody who is in charge of a network wants to get “that call”. There is a major problem and the network is down. It usually starts with one or two complaints “hey, I can’t open my email” or “something is wrong with my web browser” but those few complaints suddenly turn into many and you suddenly you know there is a real problem. What you may not know, is what to do next.

In this blog post, I will examine some basic troubleshooting steps that every network manager should take when investigating an issue. Whether you have a staff of 2 or 200, these common sense steps still apply. Of course, depending on what you discover as you perform your investigation, you may need to take some additional steps to fully determine the root cause of the problem and how to fix it.

Step 1. Determine the extent of the problem.

You will need to try and pinpoint as quickly as possible the scope of the issue. Is it related to a single physical location like just one office, or is it network wide including WAN’s and remote users. This can provide valuable insight into where to go next. If the problem is contained within a single location, then you can be pretty sure that the cause of the issue is also within that location (or at the very least that location plus any uplink connections to other locations).

It may not seem intuitive but if the issue is network wide with multiple affected locations, then sometimes this can really narrow down the problem. It probably resides in the “core” of your network because this is usually the only place that can have an issue which affects such a large portion of your network. That may not make it easier to fix, but it generally does help with identification.

If you’re lucky you might even be able to narrow this issue down even further into a clear segment like “only wireless users” or “everything on VLAN 100” etc. In this case, you need to jump straight into deep dive troubleshooting on just those areas.

Step 2. Try to determine if it is server/application related or network related.

This starts with the common “ping test”. The big question you need to answer is, do my users have connectivity to the servers they are trying to access, but (for some reason) cannot access the applications (this means the problem is in the servers / apps) or do they not have any connectivity at all (which means a network issue).

This simple step can go a long way towards troubleshooting the issue. If there is no network connectivity, then the issue will reside in the infrastructure. Most commonly in L2/L3 devices and firewalls. I’ve seen many cases where the application of a single firewall rule is the cause if an entire network outage.

If there is connectivity, then you need to investigate the servers and applications themselves. Common network management platforms should be able to inform you of server availability including tests for service port availability, the status of services and processes etc. A widespread issue that happens all at once is usually indicative of a problem stemming from a patch or other update / install that was performed on multiple systems simultaneously.

Step 3. Use your network management system to pinpoint, rollback, and/or restart.

Good management systems today should be able to identify when the problem first occurred and potentially even the root cause of the issue (especially for network issues). You also should have backup / restore capabilities for all systems. That way, in a complete failure scenario, you can always fall back to a known good configuration or state. Lastly, you should be able to then restart your services or devices and restore service.

In some cases there may have been a hardware failure that needs to be addressed first before a device can come back online. Having spare parts or emergency maintenance contracts will certainly help in that case. If the issue is more complex like overloading of a circuit or system, then steps may need to be put in place to restrict usage until additional capacity can be added. With most datacenters running on virtualized platforms today, in many cases additional capacity for compute, and storage can be added in less than 60 minutes.

Network issues happen to every organization. Those that know how to effectively respond and take a step by step approach to troubleshooting will be able to restore service quickly.

I hope these three steps to take when your Network goes down was usefull, dont forget to subscribe for our weekly blogs.

The First 3 Steps To Take When Your Network Goes Down

Thanks to NMSaaS for the article.

The Top 3 Reasons Why Network Discovery is Critical to IT Success

Network discovery is the process of identifying devices attached to a network. It establishes the current state and health of your IT infrastructure.

It’s essential for every business due to the fact that without the visibility into your entire environment you can’t successfully accomplish even the basics of network management tasks.

When looking into why Network Discovery is critical to IT success there are three key factors to take into consideration.

1. Discovering the Current State & Health of the Infrastructure.

Understanding the current state and health of the network infrastructure is a fundamental requirement in any infrastructure management environment. What you cannot see you cannot manage, or even understand, so it is vital for infrastructure stability to have a tool that can constantly discover the state and health of the components in operation.

2. Manage & Control the Infrastructure Environment

  • Once you know what you have its very easy to compile an accurate inventory of the following:
  • The environment’s components provide the ability to track hardware.
  • To manage end-of-life and end‑of‑support.
  • The hardware threshold management (i.e. Swap-Out device before failure)
  • To effectively manage the estates operating systems and patch management.

3. Automate Deployment

Corporation’s today place a lot of emphasis on automation therefore, it is very important that when choosing a Network Discovery tool to operate your infrastructure environment, it can integrate seamlessly with your CRM system. Having a consistent view of the infrastructure inventory and services will allow repeatable and consistent deployment of hardware and configuration in order to automate service fulfillment and deployment.

If you’re not using network discovery tool don’t worry were offering the service for absolutely free, just click below and you will be one step closer to improving your network management system.

The Top 3 Reasons Why Network Discovery is Critical to IT Success

Thanks to NMSaaS for the article. 

Why Just Backing Up Your Router Config is the Wrong Thing To Do

One of the most fundamental best practices of any IT organization is to have a backup strategy and system in place for critical systems and devices. This is clearly needed for any disaster recovery situation and most IT departments have definitive plans and even practiced methodologies set in place for such an occurrence.

However what many IT pros don’t always consider is how useful it is to have backups for reasons other than DR and the fact that for most network devices (and especially routers), it is not just the running configuration that should be saved. In fact, there are potentially hundreds of smaller pieces of information that when properly backed up can be used for help with ongoing operational issues.

First, let’s take a look at the traditional device backup landscape, and then let’s explore how this structure should be enhanced to provide additional services and benefits.

Unlike server hard drives, network devices like routers do not usually fall within the umbrella backup systems used for mass data storage. In most cases a specialized system must be put in place for these devices. Each network vendor has special commands that must be used in order to access the device and request / download the configurations.

When looking at these systems it is important to find out where the resulting configurations will be stored. If the system is simply storing the data into an on-site appliance, then it also critical to determine if that appliance itself is being backup into an offsite / recoverable system otherwise the backup are not useful in a DR situation where the backup appliance may also be offline.

It is also important to understand how many backups your system can hold i.e. can you only store the last 10 backups, or maybe only everything in the last 30 days etc. are these configurable options that you can adjust based on your retention requirements? This can be a critical component for audit reporting, as well as when rollback is needed to a previous state (that may not just have been the last state).

Lastly, does the system offer a change report showing what differences exist between selected configurations? Can you see who made the changes and when?

In addition to the “must haves” explored above, I also think there are some advanced features that really can dramatically improve the operational value of a device / router backup system. Let’s look at these below:

  • Routers and other devices are more than just their config files. Very often they can provide output which describes additional aspects of their operation. To use the common (cisco centric) terminology, you can also get and store the output of a “show” command. This may contain critical information about the devices hardware, software, services, neighbors and more that could not be seen from just the configuration. It can be hugely beneficial to store this output as well as it can be used to help understand how the device is being used, what other devices are connected to it and more.
  • Any device in a network, especially a core component such as a router should conform to company specific policies for things like access, security etc. Both the main configuration file, as well as the output from the special “show” commands can be used to check the device against any compliance policy your organization has in place.
  • All backups need to run both on a schedule (we generally see 1x per day as the most common schedule) as well as on an ad-hoc basis when a change is made. This second option is vital to maintaining an up to date backup system. Most changes to devices happen at some point during the normal work day. It is critical that your backup system can be notified (usually via log message) that a change was made and then immediately launch a backup of the device – and potentially a policy compliance check as well.

Long gone are the days where simply logging into a router, getting the running configuration, and storing that in a text file is considered a “backup plan”. Critical network devices need to have the same attention paid them as servers and other IT systems. Now is a good time to revisit your router backup systems and strategies and determine if you are implementing a modern backup approach, as you can see its not just about backing up your router config.

b2ap3_thumbnail_6313af46-139c-423c-b3d5-01bfcaaf724b_20150730-133914_1.pngThanks to NMSaaS for the article.

NMSaaS Webinar – Stop paying for Network Inventory Software & let NMSaaS do it for FREE.

Please join NMSaaS CTO John Olson for a demonstration of our free Network Discovery, Asset & Inventory Solution.

Wed, Jul 29, 2015 1:00 PM – 1:30 PM CDT

Do any of these problems sound familiar?

  • My network is complex and I don’t really even know exactly what we have and where it all is.
  • I can’t track down interconnected problems
  • I don’t know when something new comes on the network
  • I don’t know when I need upgrades
  • I suspect we are paying too much for maintenance

NMSaaS is here to help.

Sign up for the webinar NOW > > >

In this webinar you will learn that you can receive the following:

  • Highly detailed complimentary Network Discovery, Inventory and Topology Service
  • Quarterly Reports with visibility in 100+ data points including:
    • Device Connectivity Information
    • Installed Software
    • VM’s
    • Services / Processes
    • TCP/IP Ports in use
    • More…
  • Deliverables – PDF Report & Excel Inventory List

Thanks to NMSaaS for the article.

3 Steps to Configure Your Network For Optimal Discovery

All good network monitoring / management begins the same way – with an accurate inventory of the devices you wish to monitor. These systems must be on boarded into the monitoring platform so that it can do its job of collecting KPI’s, backing up configurations and so on. This onboarding process is almost always initiated through a discovery process.

This discovery is carried out by the monitoring system and is targeted at the devices on the network. The method of targeting may vary, from a simple list of IP addresses or host names, to a full subnet discovery sweep, or even by using an exported csv file from another system. However, the primary means of discovery is usually the same for all Network devices, SNMP.

Additional means of onboarding can (and certainly do) exist, but I have yet to see any full-featured management system that does not use SNMP as one of its primary foundations.

SNMP has been around for a long time, and is well understood and (mostly) well implemented in all major networking vendors’ products. Unfortunately, I can tell you from years of experience that many networks are not optimally configured to make use of SNMP and other important configuration options which when setup correctly will optimize the network for a more efficient and ultimately more successful discovery and onboarding process.

Having said that, below are 3 simple steps that should be taken, in order to help maximize your network for optimal discovery.

1) Enable SNMP

Yes it seems obvious to say that if SNMP isn’t enabled then it will not work. But, as mentioned before it still astonishes me how many organizations I work with that still do not have SNMP enabled on all of the devices they should have. These days almost any device that can connect to a network usually has some SNMP support built in. Most networks have SNMP enabled on the “core” devices like Routers / Switches / Servers, but many IT pros many not realize that SNMP is available on non- core systems as well.

Devices like VoIP phones and video conferencing systems, IP connected security cameras, Point of Sale terminals and even mobile devices (via apps) can support SNMP. By enabling SNMP on as many possible systems in the network, the ability to extend the reach of discovery and monitoring has grown incredibly and now gives visibility into the network end-points like never before.

2) Setup SNMP correctly

Just enabling SNMP isn’t enough – the next step is to make sure it is configured correctly. That means removing / changing the default Read Only (RO) community string (which is commonly set by default to “public”) to a more secure string. It is also best practice to use as few community strings as you can. In many large organizations, there can be some “turf wars” over who gets to set these strings on systems. The Server team may have one standard string and the network team has another.

Even though most systems will allow for multiple strings, it is generally best to try to keep these as consistent as possible. This helps prevent confusion when setting up new systems and also helps eliminate unnecessary discovery overhead on the management systems (which may have to try multiple community strings for each device on an initial discovery run). As always, security is important, so you should configure the IP address of the known management server as an allowed SNMP system and block any other systems from being allowed to run an SNMP query against your systems.

3) Enable Layer 2 discovery protocols

In your network, you want much deeper insight into not only what you have, but how it is all connected. One of the best way to get this information is to enable layer 2 (link layer) discovery abilities. Depending on the vendor(s) you have in your network, this may accomplished with a proprietary protocol like the Cisco Discovery Protocol (CDP) or it may be implemented in a generic standard like the Link Layer Discovery Protocol (LLDP). In either case, by enabling these protocols, you gain valuable L2 connectivity information like connected MAC addresses, VLAN’s, and more.

By following a few simple steps, you can dramatically improve the results of your management system’s onboarding / discovery process and therefore gain deeper and more actionable information about your network.

b2ap3_thumbnail_6313af46-139c-423c-b3d5-01bfcaaf724b.png

Thanks to NMSaaS for the article.

Infosim® Global Webinar Day July 30th, 2015 – The Treasure Hunt is On!

How to visualize the state of your network and service infrastructure to uncover the hidden treasures in your data

Infosim® Global Webinar Day July 30th, 2015 - The treasure hunt is on! Join Harald Höhn, Sea Captain and Senior Developer on a perilous treasure hunt on “How to visualize the state of your network and service infrastructure to uncover the hidden treasure in your data”.

This Webinar will provide insight into:

  • How to speed up your workflows with auto-generated Weather Maps
  • How to outline complex business processes with Weather Maps
  • How to uncover the hidden treasures in your data [Live Demo]

Infosim® Global Webinar Day July 30th, 2015 - The treasure hunt is on! But wait, there is more! We are giving away three treasure maps (Amazon Gift Card, value $50) on this Global Webinar Day. In order to join the draw, simply answer the hidden treasure question that will be part of the questionnaire at the end of the Webinar. Good Luck!

Register today watch a recording

b2ap3_thumbnail_Fotolia_33050826_XS_20150804-182656_1.jpg

A recording of this Webinar will be available to all who register!
(Take a look at our previous Webinars here.)

Thanks to Infosim for the article.

Top 10 Key Metrics for NetFlow Monitoring

NetFlow is a feature that was introduced on Cisco routers that provides the ability to collect IP network traffic as it enters or exits an interface. By analyzing the data provided by NetFlow, a network administrator can determine things such as the source and destination of traffic, class of service, and the causes of congestion.

There are numerous key metrics when it comes to Netflow Monitoring:

1-Netflow Top Talkers

The flows that are generating the heaviest system traffic are known as the “top talkers.” The NetFlow Top Talkers feature allows flows to be sorted so that they can be viewed, to identify key users of the network.

2-Application Mapping

Application Mapping lets you configure the applications identified by NetFlow. You can add new applications, modify existing ones, or delete them. It’s also usually possible to associate an IP address with an application to help better track applications that are tied to specific servers.

3-Alert profiles

Alert profiles makes network monitoring using NetFlow easier. It allows for the Netflow system to be watching the traffic and alarming on threshold breaches or other traffic behaviors.

4-IP Grouping

You can create IP groups based on IP addresses and/or a combination of port and protocol. IP grouping is useful in tracking departmental bandwidth utilization, calculating bandwidth costs and ensuring appropriate usage of network bandwidth.

5-Netflow Based Security features

NetFlow provides IP flow information in the network. In the field of network security, IP flow information provided by NetFlow is used to analyze anomaly traffic. NetFlow based anomaly traffic analysis is an appropriate supplement to current signature-based NIDS.

6- Top Interfaces

Included in the Netflow Export information is the interface that the traffic passes through. This can be very useful when trying to diagnose network congestion, especially on lower bandwidth WAN interfaces as well as helping to plan capacity upgrades / downgrades for the future.

7- QoS traffic Monitoring

Most networks today enable some level of traffic prioritization. Multimedia traffic like VoIP and Video which are more susceptible to problems when there are network delays typically are tagged as higher priority than other traffic like web and email. Netflow can track which traffic is tagged with these priority levels. This enables network engineers to make sure that the traffic is being tagged appropriately.

8- AS Analysis

Most Netflow tools are able to also show the AS (Autonomous System) number and well known AS assignments for the IP traffic. This can be very useful in peer analysis as well as watching flows across the “border” of a network. For ISP’s and other large organizations this information can be helpful when performing traffic and network engineering analysis especially when the network is being redesigned or expanded.

9- Protocol analysis

One of the most basic metrics that Netflow can provide is a breakdown of TCP/IP protocols in use on the network like TCP, UDP, ICMP etc. This information is typically combined with port and IP address information to provide a complete view of the applications on the network.

10- Extensions with IPFIX

Although technically not NetFlow, IPFIX is fast becoming the preferred method of “flow-based” analysis. This is mainly due to the flexible structure of IPFIX which allows for variable length fields and proprietary vendor information. This is critical when trying to understand deeper level traffic metrics like HTTP host, URLs, messages and more.

Thanks to NMSaaS for the article. 

How To Monitor VoIP Performance In The Real World

It’s one of the most dreaded calls to get for an IT staff member – the one where a user complains about the quality of their VoIP call or video conference. The terms used to describe the problem are reminiscent of a person who brings their car in for service because of a strange sound “ I hear a crackle”, or “it sounds like the other person is in a tunnel” or “I could only hear every other word – and then the call dropped”. None of these are good, and unfortunately, they are all very hard to diagnose.

As an IT professional, we are used to solving problems. We are comfortable in a binary world, something works or it doesn’t and when it doesn’t, we fix the issue so that it does. When a server or application is unavailable, we can usually diagnose and fix the issue and then it works again. But, with VoIP and Video, the situation is not so cut and dried. It’s rare that the phone doesn’t work at all – it usually “works” i.e the phone can make and receive calls, but often the problems are more nuanced; the user is unhappy with the “experience” of the connection. It’s the difference between having a bad meal and the restaurant being closed.

In the world of VoIP, this situation has even been mathematically described (leave it to engineers to come up with a formula). It is called a Mean Opinion Score (MOS) and is used to provide a data point which represents how a user “feels” about the quality of a call. The rating system looks like this:

How To Monitor VoIP Performance In The Real World

Today, the MOS score is accepted as the main standard by which the quality of VoIP calls are measured. There are conditional factors that go into what makes an “OK” MOS score which take into account (among other things) the CODEC which is used in the call. As a rule of thumb, any MOS score below ~3.7 is considered a problem worth investigating, and anything consistently below 2.0 is real issue. *(many organizations use a different # other than 3.7, but it is usually pretty close to this). The main factors which go into generating this score come from 3 KPI’s

  1. Loss
  2. Jitter
  3. Latency / Delay

So, in order to try and bring some rigor to monitoring VoIP quality on a network (and get to the issues before the users get to you) network staff need to monitor the MOS score for VoIP calls. In the real world there are at least three (separate) ways of doing this:

1) The “ACTUAL” MOS score from live calls based on reports from the VoIP endpoints

Some VoIP phones will actually perform measurements of the critical KPI’s (Loss, Jitter, and Latency) and send reports of the call quality to a Call Manager or other server. Most commonly this information is transmitted using the Real Time Control Protocol (RTCP) and may also include RTCP XR (for eXtended Report) data which can provide additional information like Signal to Noise Ratio and Echo level. Support for RTCP / RTCP XR is highly dependent on the phone system being used and in particular the handset being used. If your VoIP system does support RTCP / RTCP XR you will still need a method of capturing and reporting / alarming on the data provided.

2) The “PREDICTED” MOS score based on network quality metrics from a synthetic test call.

Instead of waiting for the phones to tell you there is a problem, many network managers implement a testing system which makes periodic synthetic calls on the network and then gathers the KPI’s from those calls. Generally, this type of testing takes place completely outside of the VoIP phone system and uses vendor software to replicate an endpoint. The software is installed at critical ends of a test path and then the endpoints “call” each other (by sending an RTP stream from one endpoint to another). These systems should be able to exactly mimic the live VoIP system in terms of CODEC used and QoS tagging etc so that the test frames are passed through the network in exactly the same way that a “real” VoIP call would be. These systems can then “predict” the Quality of experience that the network is providing at that moment.

3) The “ACTUAL” MOS score bases on a passive analysis of the live packets on the network.

This is where a passive “probe” product is put into the network and “sniffs” the VoIP calls. It can then inspect that traffic and create a MOS score or other metrics which is useful to determine the current quality of service being experienced by users. This method removes any need for support from the VoIP system and also does not require the network to handle additional test data, but does have some drawbacks as this method can be expensive and may have trouble accurately reading any encrypted VoIP traffic.

Which is best? Well, they both all have their place, and in a perfect world an IT staff would have access to live data and test data in order to troubleshoot an issue. In an even more perfect world, they would be able to correlate that data in real time to other potentially performance impacting information like router / switch performance data and network bandwidth usage (especially on WAN circuits).

In the end, VoIP performance monitoring comes down to having access to all of the critical KPI’s that could be used to diagnose issues and (hopefully) stop users from making that dreaded service call.

How To Monitor VoIP Performance In The Real World

Thanks to NMSaaS for the article.

Infosim® Global Webinar Day June 25th, 2015 – Convince Your Boss that You Need a New iPhone: Introducing the New StableNet® Mobile App

Join Dr. David Hock, Senior Consultant R&D, and Eduardo González, Developer & Consultant, for a Webinar on “The new StableNet® Mobile App”.

This Webinar will provide insight into:

  • How will the StableNet® Mobile App make your life easier?
  • How are Apple™ Swift and the StableNet® REST API maximizing the user experience?
  • What functionality does the StableNet® Mobile App bring to your fingertips? [Live Demo]
  • When can you get the Mobile App for your StableNet® infrastructure?

A recording of this Webinar will be available to all who register!

b2ap3_thumbnail_Fotolia_33050826_XS.jpg

(Take a look at our previous Webinars here.)

Introducing the First Self-Regulating Root Cause Analysis: Dynamic Rule Generation with StableNet® 7

Infosim®, a leading manufacturer of automated Service Fulfillment and Service Assurance solutions for Telcos, ISPs, MSPs and Corporations, today announced a proprietary new technology called Dynamic Rule Generation (DRG) with StableNet® 7.

The challenge: The legacy Fault Management approach includes a built-in dilemma: Scalability vs. Aggregation. On the one hand, it is unfeasible to pre-create all possible rules while on the other hand, not having enough rules will leave NOC personnel with insufficient data to troubleshoot complex scenarios.

The solution: DRG expands and contracts rules that automatically troubleshoot networks by anticipating all possible scenarios from master rule sets. DRG is like cruise control for a network rule set. When DRG is turned on, it can robotically expand and contract rule sets to keep troubleshooting data at optimum levels constantly without human intervention. It will also allow for automatic ticket generation and report alarms raised by dynamically generated rules. DRG leads to fast notification, a swift service Impact Analysis, and results in the first self-regulating Root Cause Analysis in today’s Network Management Software market.

Start automating Fault Management and stop manually creating rules! Take your hands off the keyboard and allow the DRG cruise control to take over!

Supporting Quotes:

Dr. Stefan Köhler, CEO for Infosim® comments:

“We at Infosim® believe you should receive the best value from your network, and exchange of information should be as easy as possible. The way we want to achieve these goals, is to simplify the usage and automate the processes you use to manage your network. Rules creation and deletion has been an Achilles’ heel of legacy network management systems. With DRG (Dynamic Rule Generation), we are again delivering another new technology to our customers to achieve our goal of the dark NOC.”

Marius Heuler, CTO for Infosim® comments:

“By further enhancing the already powerful Root Cause Analysis of StableNet®, we are providing functionality to our users that will both take care of ongoing changes in their networks while automatically keeping the rules up to date.”

ABOUT INFOSIM®

Infosim® is a leading manufacturer of automated Service Fulfillment and Service Assurance solutions for Telcos, ISPs, Managed Service Providers and Corporations. Since 2003, Infosim® has been developing and providing StableNet® to Telco and Enterprise customers. Infosim® is privately held with offices in Germany (Würzburg – Headquarters), USA (Austin) and Singapore.

Infosim® develops and markets StableNet®, the leading unified software solution for Fault, Performance and Configuration Management. StableNet® is available in two versions: Telco (for Telecom Operators and ISPs) and Enterprise (for IT and Managed Service Providers). StableNet® is a single platform unified solution designed to address today’s many operational and technical challenges of managing distributed and mission-critical IT infrastructures.

Many leading organizations and Network Service Providers have selected StableNet® due to its enriched features and reduction in OPEX & CAPEX. Many of our customers are well-known global brands spanning all market sectors. References available on request.

At Infosim®, we take pride in the engineering excellence of our high quality and high performance products. All products are available for a trial period and professional services for proof of concept (POC) can be provided on request.

ABOUT STABLENET®

StableNet® is available in two versions: Telco (for Telecom Operators and ISPs) and Enterprise (for IT and Managed Service Providers).

StableNet® Telco is a comprehensive unified management solution; offerings include: Quad-play, Mobile, High-speed Internet, VoIP (IPT, IPCC), IPTV across Carrier Ethernet, Metro Ethernet, MPLS, L2/L3 VPNs, Multi Customer VRFs, Cloud and FTTx environments. IPv4 and IPv6 are fully supported.

StableNet® Enterprise is an advanced, unified and scalable network management solution for true End-to-End management of medium to large scale mission-critical IT supported networks with enriched dashboards and detailed service-views focused on both Network & Application services.

Thanks to Infosim for the article.