ScreenOS Troubleshooting


Troubleshooting

Initial Steps:

   This section is under construction.

High CPU Usage

Any packet sent to, passed through, or processed by the firewall consumes CPU. The firewall starts to experience problems as CPU utilization approaches 85%. Symptoms include:

   High CPU utilization
   Poor system or throughput performance
   OSPF adjacencies or BGP peerings are failing
   Device management is slower than normal
   Ping to the management interface times out
   Firewall is not passing traffic
   Packet drops
   The 'in overrun' counter (get counter stat) could increment.

Check the CPU Utilization

CPU utilization is calculated from two components: Flow and Task. High CPU utilization means the CPU is busy processing network traffic; it does not by itself mean that the firewall cannot keep up and will start dropping packets. CPU utilization measures only the network load on the firewall, not the throughput of the box itself. On all firewall appliance devices (NetScreen-5, 25, 50, 204, 208, and SSG Series), a single CPU is used for processing. ASIC-based hardware firewalls (NS-5000, ISG devices) have two CPUs: one dedicated to flow and the other dedicated to task.

        SSG320-> get perf cpu detail
        Average System Utilization:  2%
        Last 60 seconds:
        59:  2    58:  2    57:  2    56:  2    55:  2    54:  2    
        53:  2    52:  2    51:  2    50:  2    49:  2    48:  2    
        47:  2    46:  2    45:  2    44:  2    43:  2    42:  2    
        41:  2    40:  2    39:  2    38:  2    37:  2    36:  2    
        35:  2    34:  2    33:  2    32:  2    31:  2    30:  2    
        29:  2    28:  2    27:  2    26:  2    25:  2    24:  2    
        23:  2    22:  2    21:  2    20:  2    19:  2    18:  2    
        17:  2    16:  2    15:  2    14:  2    13:  2    12:  2    
        11:  2    10:  2     9:  2     8:  2     7:  2     6:  2    
         5:  2     4:  2     3:  2     2:  2     1:  2     0:  2    

        Last 60 minutes:
        59:  2    58:  2    57:  2    56:  2    55:  2    54:  2    
        53:  2    52:  2    51:  2    50:  2    49:  2    48:  2    
        47:  2    46:  2    45:  2    44:  2    43:  2    42:  2    
        41:  2    40:  2    39:  2    38:  2    37:  2    36:  2    
        35:  2    34:  2    33:  2    32:  2    31:  2    30:  2    
        29:  2    28:  2    27:  2    26:  2    25:  2    24:  2    
        23:  2    22:  2    21:  2    20:  2    19:  2    18:  2    
        17:  2    16:  2    15:  2    14:  2    13:  2    12:  2    
        11:  2    10:  2     9:  2     8:  2     7:  2     6:  2    
         5:  2     4:  2     3:  2     2:  2     1:  2     0:  2    

        Last 24 hours:
        23:  2    22:  2    21:  2    20:  2    19:  2    18:  2    
        17:  2    16:  2    15:  2    14:  2    13:  2    12:  2    
        11:  2    10:  2     9:  2     8:  2     7:  2     6:  2    
         5:  2     4:  2     3:  2     2:  2     1:  2     0:  2   

Average System Utilization is the average CPU utilization over up to the last 24 hours, counted in whole units of time. Examples:

  • If the system uptime is 48 hours 18 minutes, the average system utilization is the average CPU utilization over the last 24 hours, excluding those 18 minutes.
  • If the system uptime is less than 24 hours but more than 1 hour, the average covers whole hours only. For example, if the system has been up 10 hours 40 minutes, the average system utilization is the CPU utilization over 10 hours (excluding the 40 minutes).
  • If the system uptime is less than 1 hour (for example, 34 minutes 26 seconds), the average is the CPU utilization over the last 34 minutes (excluding the 26 seconds).
  • If the system uptime is less than 1 minute (for example, 48 seconds), the average is computed over those 48 seconds.
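
The rules above amount to truncating the uptime to its largest whole unit. A minimal Python sketch of that reading, assuming the uptime is known in seconds; the function name is hypothetical and is not a ScreenOS feature:

```python
def averaging_window_seconds(uptime_seconds: int) -> int:
    """Return how many seconds of history the average covers, per the rules above."""
    minute, hour, day = 60, 3600, 86400
    if uptime_seconds >= day:
        return day                                  # last 24 hours only
    if uptime_seconds >= hour:
        return (uptime_seconds // hour) * hour      # whole hours, partial hour dropped
    if uptime_seconds >= minute:
        return (uptime_seconds // minute) * minute  # whole minutes, seconds dropped
    return uptime_seconds                           # under a minute: all seconds count


# 10 hours 40 minutes of uptime is averaged over 10 whole hours.
print(averaging_window_seconds(10 * 3600 + 40 * 60) // 3600)
```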

Determine cause - Flow or Task

The command get perf cpu all detail lists the CPU utilization history broken down by Flow and Task. The first number within the parentheses is the Flow CPU, and the second number is the Task CPU.

        SSG320-> get perf cpu all detail
        Average System Utilization: 55% (61  5)
        Last 60 seconds:
        59: 86(96  2)*** 58: 85(95  0)**  57: 86(96  2)*** 56: 85(95  0)**
        55: 85(95  2)**  54: 86(96  0)*** 53: 86(96  2)*** 52: 86(96  0)***
        51: 86(96  2)*** 50: 85(95  1)**  49: 86(96  2)*** 48: 86(96  0)***
        47: 86(96  3)*** 46: 86(96  0)*** 45: 86(96  2)*** 44: 86(96  0)***
        43: 86(96  2)*** 42: 86(96  0)*** 41: 86(96  2)*** 40: 86(96  0)***
        39: 86(96  2)*** 38: 86(96  0)*** 37: 86(96  2)*** 36: 86(96  0)***
        35: 86(96  2)*** 34: 86(96  1)*** 33: 85(95  4)**  32: 85(95  0)**
        31: 86(96  2)*** 30: 86(96  0)*** 29: 86(96  2)*** 28: 86(96  1)***
        27: 86(96  3)*** 26: 86(96  0)*** 25: 86(96  2)*** 24: 86(96  0)***
        23: 86(96  2)*** 22: 86(96  0)*** 21: 86(96  2)*** 20: 86(96  0)***
        19: 86(96  2)*** 18: 86(96  1)*** 17: 86(96  2)*** 16: 86(96 36)***
        15: 86(96  2)*** 14: 86(96  0)*** 13: 85(95  2)**  12: 86(96  0)***
        11: 86(96  3)*** 10: 86(96  0)***  9: 86(96  2)***  8: 86(96  0)***
         7: 86(96  3)***  6: 86(96  0)***  5: 85(95  3)**   4: 86(96  1)***
         3: 86(96  2)***  2: 86(96  1)***  1: 86(96  2)***  0: 85(95  0)**

        Last 60 minutes:
        59: 85(95  1)**  58: 85(95 24)**  57: 84(94  1)**  56: 84(94  1)**
        55: 84(94  1)**  54: 84(94  1)**  53: 83(93  1)**  52: 83(93  1)**
        51: 82(92  1)**  50: 82(92  1)**  49: 83(93  2)**  48: 82(92  1)**
        47: 82(92  1)**  46: 81(91  1)**  45: 81(91  1)**  44: 80(90  2)**
        43: 81(91 14)**  42: 79(89 72)**  41: 57(22 66)*   40: 53(19 63)*
        39: 53( 1 63)*   38: 53(18 63)*   37: 61(57 65)*   36: 56(34 64)*
        35: 59(58 66)*   34: 32(35 11)    33: 26(33  1)    32: 70(80  0)*
        31: 66(76  0)*   30: 50(60  0)*   29: 48(58  1)    28: 26(36  4)
        27: 24(32  2)    26: 45(54  1)    25: 55(65  1)*   24: 21(30  1)
        23: 63(73  0)*   22: 33(40  0)    21: 11(13  1)    20: 53(63  0)*
        19: 78(88  1)**  18:  9(13  2)    17: 46(56  2)    16: 19(29  1)
        15: 38(48  0)    14: 35(45  1)    13: 63(73  0)*   12: 79(89  0)**
        11: 78(88  1)**  10: 36(45  0)     9: 22(27  1)     8: 31(41  0)
         7: 71(81  1)**   6:  5( 6  2)     5:  4( 5  0)     4: 31(39  1)
         3:  5( 5  1)     2: 56(66  0)*    1: 42(52  0)     0: 25(34  2)

        Last 24 hours:
        23: 44(48 10)    22: 66(74  1)*   21: N/A          20: N/A     
        19: N/A          18: N/A          17: N/A          16: N/A     
        15: N/A          14: N/A          13: N/A          12: N/A     
        11: N/A          10: N/A           9: N/A           8: N/A     
         7: N/A           6: N/A           5: N/A           4: N/A     
         3: N/A           2: N/A           1: N/A           0: N/A

  • A single asterisk * indicates that the CPU is nearing a warning threshold. It is shown when utilization is ≥ 50% and ≤ 70%.
  • Double asterisks ** indicate that the CPU is nearing a high level; the administrator should investigate the cause. They are shown when utilization is > 70% and ≤ 85%.
  • Triple asterisks *** indicate that CPU utilization is high; the administrator should investigate the cause. They are shown when utilization is > 85% (in the sample output above, 85 is marked ** and 86 is marked ***).
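
When the history is long, it can help to extract the Flow and Task numbers with a script. The following Python sketch assumes the entry layout shown in the sample output above (slot, total, then Flow and Task in parentheses); the helper names are illustrative only, and a different ScreenOS release may format the output differently:

```python
import re

# One history entry looks like "43: 81(91 14)**": slot, total, (flow task), markers.
ENTRY = re.compile(r"(\d+):\s*(\d+)\(\s*(\d+)\s+(\d+)\)")


def parse_entries(text):
    """Yield (slot, total, flow, task) tuples; 'N/A' slots do not match and are skipped."""
    for m in ENTRY.finditer(text):
        yield tuple(int(g) for g in m.groups())


def likely_cause(flow, task):
    """Name the component with the higher utilization in one sample."""
    return "flow" if flow >= task else "task"


sample = "43: 81(91 14)**  42: 79(89 72)**  41: 57(22 66)*   40: 53(19 63)*"
for slot, total, flow, task in parse_entries(sample):
    print(slot, total, likely_cause(flow, task))
```

A run of samples dominated by "task" points to the task investigation below; a run dominated by "flow" points to packet profiling.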

Investigate cause of High CPU

  • If high CPU is due to Task, determine which task is using the most CPU.
  • If high CPU is due to Flow, run Packet Profiling to determine the cause of the high Flow CPU.

Finding task using most CPU

Use the CPU alarm snapshot feature. To use this, issue the following commands:

set alarm snapshot CPU on
set alarm snapshot CPU trigger (repeat this 2-3 times in 10 second intervals)
unset alarm snapshot CPU on
get alarm snapshot CPU all

The output will indicate the amount of resources each task is using during the specified time interval.

    ================= alarm snapshot ====================
        alarm_time: 5/26/2013 10:27:34
        cpu_utilization: 39
    === snapshot of per service session creation counters ===

    protocol  	-2 	-1 	0  	1 	2
    (   0)NONE 	0 	0 	0 	1 	0
    (   1)FTP 	0 	0 	0 	0 	0
    (   2)RSH 	0 	0 	0   	0 	0
    (   4)REXEC  	0 	0 	0 	0 	0
    (some protocols removed to shorten output)
    (  28)DHCP 	0 	0 	0 	0 	0
    (  29)TFTP 	0 	0 	0 	0 	0
    (  30)IDP 	0 	0 	0 	0 	0
    (  31)BWMON 	0 	0 	0 	0 	0
    (  32)IRC 	0 	0 	0 	0 	0
    (  33)YMSG 	0 	0 	0 	0 	0
    (  34)IDENT 	0 	0 	0 	0 	0
    (  35)NNTP 	0 	0 	0 	0 	0
    (  36)AIM 	0 	0 	0 	0 	0
    (  37)RUSERS 	0 	0 	0 	0 	0
    (  38)LPR 	0 	0 	0 	0 	0
    (  39)GOPHER  	0 	0 	0 	0 	0

    =====================================================
    ========= snapshot of task run time =======================
                                          time slot
    Task Name 	-2 	-1 	0 	1 	2
    (  74)av worker  	803 	760 	803 	760 	1151
    (  66)telnet 	6 	6 	6 	6 	1
    (  62)telnet-cmd:0 	5 	6 	5 	6 	0
    (   2)1s timer   	4 	4 	4 	4 	4
    (  17)hwif count poll  	0 	0 	0 	0 	0
    (  20)ping high  	0 	0 	0 	0 	0
    (  22)tftp  	0 	0 	0 	0 	0
    (  24)pk poll mgt  	0 	0 	0 	0 	0
    (  25)asp_tcp_timer   	0 	0 	0 	0 	1
    (  26)cmd  	0 	0 	0 	0 	0
    (  27)pki  	0 	0 	0 	0 	0
    (  28)pki-db  	0 	0 	0 	0 	0
    (  29)ssl 	0 	0 	0 	0 	0
    (  30)infranet  	0 	0 	0 	0 	0
    (  31)dhcp probing  	0 	0 	0 	0 	0
    (  32)dnsa 	0 	0 	0 	0 	0


    =====================================================
    ================ snapshot of count  =====================
                                                   time slot
    counter 	-2 	-1 	0 	1 	2
    (   0)poli_deny 	0 	0 	0 	0 	0

    =====================================================

In this output, the av worker task uses by far the most run time.
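
Ranking the tasks by total run time across the time slots can be scripted. This is a minimal Python sketch, assuming each row of the "snapshot of task run time" section has the form "(id)name" followed by five integer columns, as in the output above. Feed it only that section, because the protocol and counter sections use the same row shape. The function names are hypothetical.

```python
import re

TASK_ROW = re.compile(r"^\s*\(\s*(\d+)\)\s*(.*)$")


def parse_task_rows(text, slots=5):
    """Return (task_id, name, [run time per slot]) for each task row."""
    rows = []
    for line in text.splitlines():
        m = TASK_ROW.match(line)
        if not m:
            continue
        tokens = m.group(2).split()
        if len(tokens) <= slots:
            continue
        try:
            values = [int(t) for t in tokens[-slots:]]
        except ValueError:
            continue  # not a data row
        rows.append((int(m.group(1)), " ".join(tokens[:-slots]), values))
    return rows


def rank_by_run_time(rows):
    """Sort rows so the task with the largest summed run time comes first."""
    return sorted(rows, key=lambda r: sum(r[2]), reverse=True)


sample = (
    "(  74)av worker  \t803 \t760 \t803 \t760 \t1151\n"
    "(  66)telnet \t6 \t6 \t6 \t6 \t1\n"
    "(   2)1s timer   \t4 \t4 \t4 \t4 \t4\n"
)
for task_id, name, values in rank_by_run_time(parse_task_rows(sample)):
    print(task_id, name, sum(values))
```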

Troubleshoot the related logs

Based on the resource identified, troubleshoot the related logs, counters, and resources. From the example above, the administrator would troubleshoot AV related logs, counters, and resources.

 If the resource is "NTP", consult KB8843 - High (NTP) task CPU after upgrading to ScreenOS 5.4.
 If the dlog task is high, refer to KB24402 - [ScreenOS] High Task CPU usage on the firewall due to dlog task.
 If the sendmail task is high, refer to KB25397 - [ScreenOS] The firewall's Task CPU usage is high due to the 'sendmail' task.
 If the sme_appsig task is high, refer to KB23113 - High CPU (task) load on ScreenOS ISG-2000 caused by 'sme_appsig'.
 If the resource is "session scan", investigate ARP, route, and policy additions/changes.
 If the resource is "session scan" and the ScreenOS version is 6.x, run the commands below, which use a newer debug that gives insight into why the 'session scan' task is busy. Also see How to debug "session scan" task causing High CPU.

a. Set the 'task' debug:

SSG320-> set task "session scan" debug

b. Get the output of the "session scan" task. The Subtask with the highest RunTime and RunCnt fields should be the culprit.

SSG320-> get task "session scan"
id 16, name session scan, seq 16, state IDLE
priority IDLE, previous priority NORM
stack size 12224, run time     0.579
trace: 00093b4c 003fb440 00081bb4
max scheduled interval: 290 ms

Debugged task id list:  16
Task session scan                    debug time:     0 Hour     0 Minute  7 Seconds

Subtask Name                        RunTime     RunCnt   Schedule   LockLatency
Scan session                          0.550          2          6         0.000
Route event                           0.002          1          2         0.000
ARP   event                           0.026          1          0         0.000
NDP                                   0.000          0          0         0.000
PMTU                                  0.000          0          0         0.000
-------------------------------------------------------------------------------
All subtasks  used CPU time           0.578          lock latency         0.000

Capture the output of get task <ID>, where ID is the suspect task number. In the alarm snapshot example above, the task ID for av worker is 74. Run this command several times to see whether the trace stays the same or changes each time. This trace data is useful for Technical Support.
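
To pick out the busiest subtask from the get task "session scan" output shown earlier, the subtask table can be parsed as in the following Python sketch. It assumes the five-column layout from that output (name, RunTime, RunCnt, Schedule, LockLatency); the function names are illustrative.

```python
import re

# Name, RunTime (float), RunCnt (int), Schedule (int), LockLatency (float).
SUBTASK_ROW = re.compile(r"^(.+?)\s+(\d+\.\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s*$")


def parse_subtasks(text):
    """Return (name, run_time, run_count) for each subtask row."""
    rows = []
    for line in text.splitlines():
        m = SUBTASK_ROW.match(line)
        if m:
            name = " ".join(m.group(1).split())  # collapse padding inside the name
            rows.append((name, float(m.group(2)), int(m.group(3))))
    return rows


def busiest_subtask(rows):
    """Pick the row with the highest RunTime, using RunCnt to break ties."""
    return max(rows, key=lambda r: (r[1], r[2]))


sample = """Subtask Name                        RunTime     RunCnt   Schedule   LockLatency
Scan session                          0.550          2          6         0.000
Route event                           0.002          1          2         0.000
ARP   event                           0.026          1          0         0.000
All subtasks  used CPU time           0.578          lock latency         0.000
"""
print(busiest_subtask(parse_subtasks(sample)))
```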

Investigate cause of High "FLOW" CPU

High CPU in Flow indicates that the firewall is busy processing packets. This includes functions such as:

Session creation/teardown
Traffic management features (e.g. logging, shaping)
Firewall protection features (e.g. Screen options)
ALG processing
Attacks

Run Packet Profiling on firewall with High Flow CPU to help identify the cause.

  • Session Table – Check session table information to see the total number of sustained sessions and whether there are any session allocation failures.

get session info

NS5200->  get session info
slot 1: sw alloc 0/max 1000000, alloc failed 24749314, di alloc failed 0
slot 2: hw0 alloc 0/max 1048576
slot 2: hw1 alloc 0/max 1048576

  • Attacks - Check whether the network is under any kind of attack, or whether the screen options are processing a high number of packets.

get counter screen zone
get alarm event
get log event

Note: An attack may be occurring without being reported in the output of the above commands, because the firewall reports attacks only for the screen options that are configured on it. To confirm that no attack is occurring, connect a packet capture tool to the firewall’s network segments and review the data.

  • Interface Counters - Check for errors, high policy deny values, high frag values or any other counters that are incrementing unusually.

It is best to clear the counters and then take fresh snapshots. To clear the counters, enter clear counter all. Then enter the following set of commands several times, leaving a 5-10 second interval between sets.

get clock
get counter stat

  • Fragmentation - A high volume of fragmentation can cause high CPU in flow. On firewall devices with a single CPU (i.e. NetScreen-200 models and below), fragmentation has a dramatic effect. Run the CLI command get session frag several times to check for packet fragmentation.
  • ALG - Identify any applications using the ALG function (e.g. FTP, H323). Obtain the non-truncated get session output from the firewall and run it through the Firewall Session Analyzer Tool to determine which applications most commonly run through the firewall. The ‘Rank based on source IP with protocol and destination port information’ helps identify the ‘top’ applications. Check whether those ‘top’ applications trigger an ALG.
  • Debug/Snoop - Check whether either debug or snoop is enabled. Both can cause high CPU utilization.
  • Traffic - Check whether counting is enabled on the policies. If it is, disable traffic counting on those policies.
  • ICMP - See "ICMP Type 3 Code 3 packets and high CPU activity on ScreenOS".
  • VPN - AES encryption can cause high CPU; see "HIGH CPU due to AES Encryption".
  • Packet rate - On SSG series, NS500, NS200, NS25/50, or NS5 series devices, calculate the packets per second going through the firewall. The easiest way is to capture a packet trace covering a 1-5 minute snapshot of the network; many packet capture tools can display the packets-per-second rate.

If you do not have, or cannot set up, a packet trace of the network, the next best option is to calculate the total number of packets coming into the interfaces of the firewall.

To do this, collect the output of get counter stat at two points in time. First issue get clock, so that you have a timestamp to reference. Then issue get counter stat | include packet. Total the ‘in packets’ values and divide by 2, because the output includes both the hardware and the flow counters for each interface. Then issue another get clock and another get counter stat | include packet, in quick succession so that the time measurement is accurate. The difference between the two totals, divided by the elapsed seconds, is the packets-per-second rate.
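
The arithmetic for the counter-based estimate can be sketched in Python as follows. It follows the divide-by-2 step described above and assumes you have already summed the ‘in packets’ values from each get counter stat sample by hand or by script. The function name and the sample figures are hypothetical.

```python
def packets_per_second(in_total_first, in_total_second, elapsed_seconds):
    """Estimate pps from two summed 'in packets' totals taken elapsed_seconds apart."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Halve each total, as described above, before taking the difference.
    first = in_total_first / 2
    second = in_total_second / 2
    return (second - first) / elapsed_seconds


# Illustrative numbers only: totals of 1,000,000 and 1,600,000 taken 10 seconds apart.
print(packets_per_second(1_000_000, 1_600_000, 10))
```

Compare the result with the maximum packets-per-second figure in the device datasheet.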

  • Policy Ordering - Ensure the most frequently used policies are positioned near the top of the policy list.

The following are possible ways to help determine the frequently used policies:

  1. Use NSM.
  2. If counting is enabled on the policies, analyze the data.
  3. Obtain the non-truncated get session output from the firewall, and run it through the Firewall Session Analyzer Tool to help determine the frequently used policies.

High Memory Usage

Step 1. Run ‘get sys’ and check the total memory of the device.

Step 2. Run the command “get mem” to check the allocated memory and memory left from heap.

ns5200-> get mem
Memory: allocated 1453351968, left 310348720, frag 69, fail 0

Here the allocated memory is about 1453 MB and the free memory is about 310 MB. Together they give a total usable memory of about 1763 MB, which is less than the total memory of the device. The remainder is taken when the device boots (to hold the config, firmware, licenses, etc.). The get mem output shows the kernel memory, or global heap, from which memory is allocated dynamically for the various flow and task operations.
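
The same arithmetic can be scripted when many samples are collected. A minimal Python sketch, assuming the summary line has exactly the format shown above; the function name and the returned keys are illustrative:

```python
import re

MEM_LINE = re.compile(r"allocated (\d+), left (\d+), frag (\d+), fail (\d+)")


def parse_get_mem(line):
    """Parse the 'Memory:' summary line of 'get mem' into a dict of integers."""
    m = MEM_LINE.search(line)
    if m is None:
        raise ValueError("not a 'get mem' summary line")
    allocated, left, frag, fail = (int(g) for g in m.groups())
    usable = allocated + left  # global heap size, in bytes
    return {
        "allocated": allocated,
        "left": left,
        "frag": frag,
        "fail": fail,
        "usable": usable,
        "used_percent": 100.0 * allocated / usable,
    }


info = parse_get_mem("Memory: allocated 1453351968, left 310348720, frag 69, fail 0")
print(info["usable"], round(info["used_percent"], 1))
```

Collecting this line at intervals and comparing the fail values shows whether the fail count is still increasing.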

The fail count is the number of times the device failed to allocate memory from the global heap for an operation. Check whether this count is increasing continuously.

Step 3. Run the command “get license” to check whether any UTM license is installed on the device. If there is, check whether the license is active and the feature is in use. License keys take a large amount of memory; if they are not in use, delete the keys.

If the device is an SSG5/20 (or any other low-end device) and UTM feature licenses are installed, high memory usage is expected and normal.

Step 4. Run the command “get session info” to find the number of sessions on the device. Check what the expected number of sessions is, whether the current number is normal, and whether any firmware upgrade or downgrade was performed around the time the high memory was noticed.

Check whether the number of allocated sessions is hitting the maximum and whether the alloc failed count is increasing. If so, check whether any custom service has been created on the device with a timeout of “never” or a very high value.

Another case is that the number of allocated sessions is very low (e.g. 10,000 out of a possible maximum of 200,000) and that is the expected number of sessions. In that case, limit the maximum number of sessions by running the envar command “set envar max-session=50000” and rebooting the device. After the reboot, “get session info” shows the maximum as 50000. This envar reduces the memory allocated for sessions during the boot process. (If the usable memory was 1763 MB, it will increase to approximately 1900 MB or more after this change.)

Run this envar command only after thorough analysis and after checking all of the following points.
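
To judge how full the session table is, the “get session info” lines can be parsed as in this Python sketch. It assumes the "alloc N/max M, alloc failed K" layout from the get session info output shown earlier in this article; the function names are hypothetical.

```python
import re

SESSION_LINE = re.compile(r"alloc (\d+)/max (\d+)(?:, alloc failed (\d+))?")


def parse_session_info(line):
    """Return (allocated, maximum, alloc_failed) from one 'get session info' line."""
    m = SESSION_LINE.search(line)
    if m is None:
        return None
    allocated, maximum = int(m.group(1)), int(m.group(2))
    # Hardware slots in the sample do not print an 'alloc failed' field; treat as 0.
    failed = int(m.group(3)) if m.group(3) is not None else 0
    return allocated, maximum, failed


def session_table_percent(allocated, maximum):
    """Percentage of the session table currently in use."""
    return 100.0 * allocated / maximum


line = "slot 1: sw alloc 0/max 1000000, alloc failed 24749314, di alloc failed 0"
print(parse_session_info(line))
print(session_table_percent(10_000, 200_000))
```

A table that stays at a few percent of its maximum is the situation in which lowering max-session may be worth considering.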

Step 5. Run the command “get mem pool” to check which feature/task/pool is using highest memory. Here’s an example:

Global memory pools:

NAME                         SYS_MEM   ALLOCMEM NALLOC  NFREE OVERSZ     QUOTA
==============================================================================
DM                             48756          0      0    721      0        -1
Routing                        16436        376     30   1061      0        -1
SSHv2 String Pool                  0          0      0      0      0        -1
idp                         94710208   88100512 1501527   2096      0 316067840
JPS Notify                         0          0      0      0      0        -1
JPS Context                    16420         56      2    517      0        -1
defrag pool                        0          0      0      0      0   4500000
net                            24572          0      0    714      0        -1
Auth Id Table                      0          0      0      0      0        -1
CAVIUM                       9433088    9184000  30733    330     10        -1
NET-PAK                            0          0      0      0      0 134217728
PKI-IKE                       556224     324352   3116   2355  10790        -1
sys                           583756     382752   5204   1839      0        -1

Here the ALLOCMEM counter is the amount of memory from the global heap allocated to that pool, and NALLOC is the number of cells allocated to the pool. Check whether both counters are increasing continuously for any particular pool (collect this output several times over a span of a few seconds or minutes while the issue is occurring). The pools to check are idp, PKI-IKE, NET-PAK, and sys (which corresponds to the tasks running on the device).

In this example the sys, idp, and PKI-IKE pools use about 382 KB, 88 MB, and 324 KB of the global heap respectively. If the idp pool counters are increasing continuously, a lot of traffic is using the DI UTM feature. Take the output of ‘get session’ and analyze it in the Firewall Session Analyzer to confirm this. If the ‘sys’ counter is very high, some task running on the device is using a very large amount of memory (use the command to check ‘which’ task).
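
Comparing two “get mem pool” samples by eye is error-prone. The following Python sketch lists the pools whose ALLOCMEM and NALLOC both grew between two captures. It assumes the seven-column table layout shown above; the function names and the second sample's values are made up for illustration.

```python
import re

# NAME, SYS_MEM, ALLOCMEM, NALLOC, NFREE, OVERSZ, QUOTA (QUOTA may be -1).
POOL_ROW = re.compile(r"^(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s*$")


def parse_pools(text):
    """Map pool name -> (ALLOCMEM, NALLOC) for every data row."""
    pools = {}
    for line in text.splitlines():
        m = POOL_ROW.match(line)
        if m:
            pools[m.group(1).strip()] = (int(m.group(3)), int(m.group(4)))
    return pools


def growing_pools(before, after):
    """Names of pools whose ALLOCMEM and NALLOC both increased."""
    return sorted(
        name
        for name, (mem, cells) in after.items()
        if name in before and mem > before[name][0] and cells > before[name][1]
    )


first = """idp                         94710208   88100512 1501527   2096      0 316067840
sys                           583756     382752   5204   1839      0        -1"""
# Hypothetical later sample: only the idp pool has grown.
second = """idp                         94710208   88200512 1502000   2096      0 316067840
sys                           583756     382752   5204   1839      0        -1"""
print(growing_pools(parse_pools(first), parse_pools(second)))
```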

To check for memory allocation errors in a particular pool, run, for example, “get mem pool name sys error”.

Even when no DI UTM license is installed on the device, ‘get mem pool’ may still show memory allocated to the idp pool. In that case, ‘get license’ will show that DI is enabled.

   This section is under construction.

Step 6. Run the command ‘get mem used’ to check which block of memory is allocated to which task. Here’s a sample output:

====== PUBLIC HEAP =====
+  03e00010:03e00030:03e002e0, 000f;    unknown,    688 Trace: 003ae958 003a4e1c 003a562c, Time: 0 ,(0/0)
+  155e7930:155e7950:155ec150, 000f; telnet-cmd:19,  18432 Trace: 00a38cb0 00a3ba38 008c31ac, Time: 424406978 ,(0/0)

Here:

 “+” means that this memory block is in use (“-” means free).
 “03e00010:03e00030:03e002e0, 000f” is pointer information (block start address, next block address, etc., as with pointers in C).
 “unknown” means this block is occupied by an unknown task.
 “688” is the block size.
 “Trace: 003ae958 003a4e1c 003a562c” and the Time field are the trace and the time at which this memory block was allocated.

In this output, check which task occupies the largest number of memory blocks. The output will be very long, so just estimate the number of blocks used by each task. For example, if most blocks are allocated to “unknown”, these are pre-allocated memory blocks that will be used for session table information. In that case, reducing the maximum number of sessions with the envar command frees the unused blocks.
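
Because the output is so long, counting blocks per task is easier with a script. This Python sketch assumes every block line follows the layout of the two sample lines above (flag, addresses, task name, size, Trace); the function name is hypothetical.

```python
import re
from collections import Counter

# "+  <addr>:<addr>:<addr>, <hex>; <task>, <size> Trace: ..."
BLOCK_LINE = re.compile(
    r"^\s*([+-])\s+[0-9a-f:]+,\s*[0-9a-f]+;\s*(.+?),\s*(\d+)\s+Trace:",
    re.IGNORECASE,
)


def summarize_used_blocks(text):
    """Return (blocks per task, bytes per task) for in-use ('+') blocks only."""
    blocks, sizes = Counter(), Counter()
    for line in text.splitlines():
        m = BLOCK_LINE.match(line)
        if m and m.group(1) == "+":
            task = m.group(2).strip()
            blocks[task] += 1
            sizes[task] += int(m.group(3))
    return blocks, sizes


sample = """====== PUBLIC HEAP =====
+  03e00010:03e00030:03e002e0, 000f;    unknown,    688 Trace: 003ae958 003a4e1c 003a562c, Time: 0 ,(0/0)
+  155e7930:155e7950:155ec150, 000f; telnet-cmd:19,  18432 Trace: 00a38cb0 00a3ba38 008c31ac, Time: 424406978 ,(0/0)
"""
blocks, sizes = summarize_used_blocks(sample)
print(blocks.most_common(3))
```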

If a different task appears most often in this output, run “get os task” to find its task ID. Run “get mem <task id>” to find the memory allocated to that task from the global heap. If this number is increasing continuously, use “get session” to find out how much traffic is using that task. Also run “get mem <task id> error” to see the error count. If the error count is increasing, or the memory used by the task keeps increasing without much traffic using it, there is a memory leak in that task/chunk.

Other commands such as “get mem chunk”, “get mem debug”, and “get mem kernel” are used to map pointer information, to check whether a block of memory has been freed (based on the trace and time values), and to find which chunk of memory is leaking.

High Session Usage

   This section is under construction.

Latency/Throughput Issues

   This section needs to be consolidated.

Throughput issues

Latency depends on connection speed, session speed, file size, and the number of concurrent sessions. Because the firewall performs stateful inspection of packets, it introduces some latency.

1. Check the interface configuration. Make sure that all interfaces are in full-duplex mode. Be careful when changing this: you may lose connectivity.

2. Check the speed set on the interfaces. Ideally it should be 100 Mbps.

3. Check whether there are high CPU, memory, or session-related issues.

4. Check whether the latency/slowness is periodic, i.e. whether it always happens at around a certain time (usually when traffic is high).

5. Check whether the latency is seen for traffic going through a particular interface or for particular traffic going through the firewall.

6. Check whether the latency is seen for a particular type of traffic. Latency issues are usually seen with traffic such as FTP and web traffic.

7. Try accessing web sites by their IP addresses instead of their domain names. This helps determine whether the latency is caused by the DNS server.

8. Check the default gateway of the PC being used. Make sure the IP address of the firewall's trusted interface is assigned as the gateway. If not, try using the IP address of the firewall's trusted interface. (This does not apply when an internal router is used as the gateway for the PC.)

9. Try connecting directly to the trusted side of the firewall and check the throughput. This helps isolate whether the issue is on the internal network devices or on the firewall.

10. Try connecting to the internet directly and see whether the internet connection itself is working properly.

11. Check the config to see whether any Traffic Shaping is configured on the firewall. If so, confirm that the settings match the requirements.

12. Check the policies on the firewall. Put the most used policy at the top of the list.

13. Check whether the firewall has web filtering, AV, or DI enabled. Try disabling them and check whether the speed improves.

14. Check the routing on the firewall for any routing loop. Simplify the routes as much as possible.

15. Check whether any fragmentation is taking place on the firewall. If so, set the MSS value accordingly and check whether that improves the speed.

16. For issues such as latency in VPN traffic, check the flow parameters and the traffic shaping config.

Notes:

  • Do not use ICMP for testing latency.
  • Throughput testing should be done using specialized traffic-generating tool, like Smartbits or Ixia.

Troubleshooting High Latency/Performance Issues

The first thing to check is fragmentation. There are two situations: either the firewall is receiving fragmented traffic, or the firewall itself is fragmenting the traffic.

Case I - Firewall is receiving fragmented traffic.

The following commands help analyze the level of fragmentation:

get session frag - Shows the total fragments received.
get sa stat - Shows fragments in VPN traffic; use it when there is latency in VPN traffic.
get sa [active | inactive] stat - Shows the SA statistics for the device, limited to active or inactive SAs.

These commands display the following statistics for all incoming or outgoing SA pairs:

Fragment: The total number of fragmented incoming and outgoing packets.
Auth-fail: The total number of packets for which authentication has failed.
Other: The total number of miscellaneous internal error conditions other than those listed in the auth-fail category.
Total Bytes: The amount of active incoming and outgoing traffic.

Run these commands several times to check how fast the counters are increasing, because the counters are cumulative and their absolute values may not be relevant at the moment of troubleshooting.

Since the firewall is receiving fragmented traffic, check the MTU of the upstream/downstream devices.

Reducing the MTU is normally not recommended, so try the following command:

set flow path-mtu

Hard-code the physical interface to the speed/duplex settings that match the peer devices to reduce fragmentation.

Try changing the cables (straight/crossover cable).


Case II – Firewall is fragmenting the traffic.

Reduce the MSS (Maximum Segment Size) using the following commands

set flow tcp-mss     (this command is for VPN TCP traffic)
set flow all-tcp-mss (this command is for Clear TCP Traffic)

These commands will only affect TCP traffic and not UDP/ICMP traffic.
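
A starting MSS value can be derived from the path MTU by subtracting the header overhead. The Python sketch below assumes IPv4 and TCP headers without options (20 bytes each); the tunnel overhead figure in the usage line is purely illustrative, since the real VPN overhead depends on the encapsulation and ciphers in use. The function name is hypothetical.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
TCP_HEADER = 20   # bytes, TCP header without options


def max_tcp_mss(mtu, tunnel_overhead=0):
    """Largest MSS whose segments fit in one packet of size mtu."""
    mss = mtu - IPV4_HEADER - TCP_HEADER - tunnel_overhead
    if mss <= 0:
        raise ValueError("overhead exceeds MTU")
    return mss


print(max_tcp_mss(1500))                      # clear TCP traffic
print(max_tcp_mss(1500, tunnel_overhead=60))  # VPN traffic, assumed 60-byte overhead
```

Use the result as a first value for set flow all-tcp-mss or set flow tcp-mss, then adjust it while watching the fragment counters.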

Check the “get counter statistics” output for the following errors

crc err    Number of packets with a cyclic redundancy check error
align err  Number of packets with alignment error in the bit stream
no buffer  Number of packets dropped due to unavailable buffers
misc err   Number of packets with at least one error
coll err   Number of collision packets

These errors are mostly caused by a duplex mismatch, so hard-coding the physical interfaces and changing cables should help.

General Troubleshooting (to be done once the rest of the network environment has been verified as not being the cause of the latency):

1. Check the applications in use and disable any unused ALGs. This can improve performance, but make sure you know all the applications in use before doing so.

2. Run a sniffer capture within the trusted network to verify whether there are many retransmissions or drops. If so, try to isolate the trusted network: connect a PC directly to the trust interface and observe the behavior.

3. Check whether the reassembly-for-alg option is enabled on the trust and untrust zones.

get zone [trust | untrust] | include alg

The default setting is No. If the customer has enabled this option and is using a bursty application such as FTP, it can introduce latency. The feature is, however, good for security.

4. The following command makes the firewall check the TCP SYN bit before creating a session:

set flow tcp-syn-check

This is a good security feature; it can be disabled if the customer is ready to compromise on security for faster packet processing. In high CPU cases with latency, however, it is preferable to keep this option enabled.

5. If the firewall is used in a large L2 network, make sure that all routes on the firewall have a gateway configured, so that ARP broadcasts for the next hop are avoided.

6. Use the session analyzer available at http://tools.juniper.net/fsa/ to check which hosts generate the most traffic in the network. The report includes Rank based on destination IP address, Rank based on source port, Rank based on source IP address, Rank based on protocol, Rank based on VSD (Virtual System Device), and Rank based on source IP with protocol and destination port information.

7. From a Windows command prompt, you can check fragmentation and other errors with netstat -s.

8. Use the PPS calculator to calculate the packets per second hitting the firewall, which can be a major factor in reduced throughput.


Latency/performance issues through the firewall

1. Get a network diagram to understand his setup and packet flow through the firewall.

2. Check the bandwidth provided by ISP and what is observed from test. Generally, there would be a 30% difference with the firewall. Let say you get 10mbps when directly connected to ISP link, then you can expect 7-8mbps when firewall is also included. Though this is not documented anywhere, this info is as per observations.

3. Check how speed testing is done. Generally, the speed test tools available on internet uses icmp and are not recommended as icmp traffic have lowest priority. Test speed using tcp connection(download any file from internet or you can go to a FTP Site to download any file).

4. Check since when latency is observed. Were there any recent changes in network set-up or on the firewall. Try to get as much information as possible like whether for any particular application latency is observed.

5. Check the CPU utilization on the device and no. of sessions on the device. If cpu/sessions are high on the device, it may cause certain latency.

6. Check interface duplex settings of incoming and outgoing interface. Check whether any of them is set to half-duplex. Try hard-coding it to full duplex(Be careful before hard-coding as the connected device may not support full-duplex 100mbps/1000mps in that case the access to firewall may be lost).

7. Check the interface counters and see whether any error counts are increasing continuously. Look specifically at hardware counters such as “in misc err”, “in overrun”, and “out underrun”, and at flow counters such as “tcp out of seq”. If the out-of-sequence counter is increasing continuously, disable the sequence check on the firewall.

8. Run the command “get session frag” and check whether the “total fragment received” and “no. of fragment in queue” counts are increasing. If they are, fragmentation is happening on the firewall. Try different MSS values with the commands “set flow tcp-mss” and “set flow all-tcp-mss” so that, after including the overhead added by the various headers, the total packet size does not exceed the MTU. If latency is observed for traffic over a VPN, check whether PFS is enabled in Phase 2. If it is, test with no-pfs, as PFS adds the overhead of an extra Diffie-Hellman exchange to each Phase 2 negotiation.

9. UTM is the most common cause in latency cases. Check whether web filtering or AV is enabled on the policy that allows the traffic. Test with that feature disabled, or create a new policy only for the PC from which the test is conducted.

10. Check whether the application uses an ALG. If it does, disable re-assembly for the ALG at the zone level and test the speed again.

11. Calculate the total packets per second (pps) arriving at the device. Check whether the total pps (the sum of pps on all interfaces) is greater than the maximum pps the device can handle (refer to the datasheet).

12. If none of these steps resolves the issue, take debug flow basic and snoop detail with proper filters to capture incoming and outgoing packets, and simultaneously take external sniffer captures on the connected switches (both upstream and downstream). The snoop detail taken on the firewall reports time in seconds (this is by design in ScreenOS and cannot be changed), while Wireshark captures taken on the interface report time in milliseconds, which is very useful in latency cases. We cannot determine what the normal transit time for a packet through the firewall should be; this data is presented to engineering, and only they can comment. Check the debug and snoop detail output for errors such as “retransmission” or “packet drop”. Again, debug and snoop should capture TCP/UDP data, not ICMP.
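
For the MSS tuning in step 8, the target value can be estimated before trying it on the firewall. The sketch below is a rough Python calculation, assuming IPv4 and TCP headers without options (20 bytes each). The 60-byte tunnel overhead is only an illustrative assumption; real IPsec overhead depends on the cipher, authentication algorithm, and encapsulation mode, so measure or look it up for the actual tunnel.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options


def max_tcp_mss(path_mtu, tunnel_overhead=0):
    """Largest TCP payload that fits in one unfragmented packet.

    path_mtu is the smallest MTU along the path, in bytes.
    tunnel_overhead is any extra encapsulation added in transit
    (assumed value; supply the real figure for your tunnel type).
    """
    mss = path_mtu - tunnel_overhead - IPV4_HEADER_BYTES - TCP_HEADER_BYTES
    if mss <= 0:
        raise ValueError("overhead leaves no room for TCP payload")
    return mss


print(max_tcp_mss(1500))                      # plain Ethernet: 1460
print(max_tcp_mss(1492))                      # PPPoE link (MTU 1492): 1452
print(max_tcp_mss(1500, tunnel_overhead=60))  # assumed 60-byte tunnel overhead: 1400
```

A value at or slightly below the computed figure is a reasonable starting point; confirm with “get session frag” that the fragment counters stop increasing.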

PATHPING

  • Syntax:
pathping <ip_address>   (run from the Windows command prompt, not from the firewall).
  • Pathping is a route-tracing tool that combines the features of ping and traceroute, with additional information about packet loss and round-trip time at every hop up to the destination IP address.
  • Pathping sends packets to each router on the way to the destination over a period of time, then computes results from the packets returned by each hop. It shows the degree of packet loss at every hop, which helps pinpoint the hop that is causing the network problem.
  • After the computation is complete, the output contains the following fields for each node: hop number, round-trip time (RTT), packets lost/sent and percentage from the source to this node, packets lost/sent and percentage from this node to the next node, and the address of the node at that hop.
  • Sample output is displayed below:
              Source to Here    This Node/Link
Hop   RTT     Lost/Sent = Pct   Lost/Sent = Pct   Address (Node)
0                                                 111.111.111.111
                                0/100 =  0%       |
1     30ms    0/100 =  0%       0/100 =  0%       222.222.222.222
                                0/100 =  0%       |
2     30ms    0/100 =  0%       0/100 =  0%       111.222.111.222
                                33/100 = 33%      |
3     30ms    0/100 =  0%       0/100 =  0%       222.111.222.111
                                0/100 =  0%       |
4     30ms    0/100 =  0%       0/100 =  0%       123.123.123.123
  • “Source to Here” is the first set of statistics after the hop number; it is equivalent to what you would see if you pinged that node directly.
  • “This Node/Link” is the set of statistics just before the pipe (|). This is the column to pay attention to, as it shows the packet loss on the link between hops.
  • In the above example, the link between 111.222.111.222 and 222.111.222.111 is dropping 33% of the traffic, so the router at hop 3 needs to be investigated.
  • Check the TTL and packet loss for the firewall’s incoming and outgoing interfaces to determine whether the firewall is causing the issue.
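
When the path is long, the lossy link can be picked out of saved pathping output with a short script. The following Python sketch assumes output laid out like the sample above, with each node address as the last token on its hop line and each link line ending in a pipe. Real pathping output may also include host names with the address in brackets, which this sketch does not handle; the function names are hypothetical.

```python
import re

# A link line ends with "lost/sent = pct%" followed by a pipe.
LINK_RE = re.compile(r"(\d+)\s*/\s*(\d+)\s*=\s*(\d+)%\s*\|$")

SAMPLE = """\
Hop   RTT     Lost/Sent = Pct   Lost/Sent = Pct   Address (Node)
0                                                 111.111.111.111
                                0/100 =  0%       |
1     30ms    0/100 =  0%       0/100 =  0%       222.222.222.222
                                0/100 =  0%       |
2     30ms    0/100 =  0%       0/100 =  0%       111.222.111.222
                                33/100 = 33%      |
3     30ms    0/100 =  0%       0/100 =  0%       222.111.222.111
                                0/100 =  0%       |
4     30ms    0/100 =  0%       0/100 =  0%       123.123.123.123
"""


def link_losses(text):
    """Return (from_address, to_address, loss_percent) for each link."""
    links = []
    previous_address = None
    pending_loss = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = LINK_RE.search(line)
        if match:
            # Loss on the link between the previous node and the next one.
            pending_loss = int(match.group(3))
            continue
        tokens = line.split()
        if not tokens[0].isdigit():
            continue  # header or other non-hop line
        address = tokens[-1]
        if previous_address is not None and pending_loss is not None:
            links.append((previous_address, address, pending_loss))
        previous_address = address
        pending_loss = None
    return links


def worst_link(text):
    """Return the link with the highest loss percentage, or None."""
    links = link_losses(text)
    return max(links, key=lambda link: link[2]) if links else None


print(worst_link(SAMPLE))  # ('111.222.111.222', '222.111.222.111', 33)
```

Only the “This Node/Link” figures on the pipe lines are used, since those isolate loss on a single link rather than cumulative loss from the source.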

Misc

  • Random Ping/Packet Failure reason:

This occurs when interface-based NAT is in use. A DIP Allocation Failure error appears in debug flow basic, indicating that all available pports (get pport) are in use. It commonly occurs on devices with fewer available pseudo ports, such as the SSG5 or SSG20, which have 2000 pports with the basic license.

Solution:

Upgrade the device (i.e. buy a higher model with more pports available)
Upgrade the license (i.e. upgrade from the basic to the advanced license)
Add a DIP pool for a new public IP (this adds 65535 more pports)
  • If packets arrive out of sequence due to multiple paths, loops, or load balancers, these packets may be dropped.

This can result in partial page loading. The command to disable TCP sequence checking is:

set flow no-tcp-seq-check

Use this command for testing only, as it opens a security loophole.

