How to debug connections with tcpdump.

This write-up assumes that you have two openswan systems connected. If
you have another system at one end, then it likely provides no useful
debugging. (This is particularly true of Windows and Cisco boxes.)

This write-up also documents use of tcpdump. ethereal or tethereal can
also be used, as they use libpcap, and therefore the same expressions.

In general, tcpdump needs to run as root.

Some distros have customized versions of tcpdump, including options that
mean the opposite of what they mean in the tcpdump.org distribution. If
in doubt, install tcpdump and libpcap from source.

In all cases, we assume that "eth0" is the internal interface, and
"eth1" is the external interface. If PPPoE is involved, then one should
dump on the "ppp0" interface rather than the "eth1" interface: this will
permit the selection expressions to work. Tcpdump can decode PPPoE
packets, but cannot select based upon the encapsulated UDP port numbers
in that situation.

If you have never built an openswan to openswan tunnel, you should do so
using raw RSA keys to confirm that you can configure both ends properly.

The network should look like:

    client1----sg1----internet----sg2----client2

The first thing to do is to observe the initiator's startup. If there is
a NAT involved, make "sg1" be the system behind the NAT, which must be
the initiator.

The "replace" option removes any previous state, and reloads the
definition of the conn from /etc/ipsec.conf.
This is how it should look:

    sg1-[~] root 26 #ipsec auto --replace sg1--sg2-net
    sg1-[~] root 27 #ipsec auto --up sg1--sg2-net
    104 "sg1--sg2-net" #796: STATE_MAIN_I1: initiate
    003 "sg1--sg2-net" #796: received Vendor ID payload [Dead Peer Detection]
    106 "sg1--sg2-net" #796: STATE_MAIN_I2: sent MI2, expecting MR2
    108 "sg1--sg2-net" #796: STATE_MAIN_I3: sent MI3, expecting MR3
    004 "sg1--sg2-net" #796: STATE_MAIN_I4: ISAKMP SA established
    117 "sg1--sg2-net" #797: STATE_QUICK_I1: initiate
    004 "sg1--sg2-net" #797: STATE_QUICK_I2: sent QI2, IPsec SA established {ESP=>0xaa6fa19a <0xa2f3b68d xfrm=3DES_0-HMAC_SHA1 NATD=none DPD=enabled}

There are four possible failures here:

  a) no messages are ever received (only STATE_MAIN_I1)
  b) message MI3 is repeated (MR1 and MR2 are received, MR3 is not)
  c) QUICK mode initiates, but never completes
  d) all of the above occurs, but traffic does not flow.

Situation (a) is usually due to a firewall, some other blockage, or the
responder not running.

Situation (b) is often due to a failure to authenticate, but can also be
due to a firewall on port-4500, when NAT-T is involved.

Situation (c) is often due to mis-matched policy.

Situation (d) can be due to a number of things, including: firewall'ed
ESP packets, firewall'ed port-4500 packets, Path MTU issues, and local
or remote system mis-configuration, such as a missing "ip" command,
errors from the _updown scripts, mis-compiled kernel modules, etc.

Situation A - no communication on port 500

Use tcpdump on "sg1" and on "sg2" at the same time:

    # tcpdump -i eth1 -n -p udp port 500 or udp port 4500

Explanation of options:

  -i eth1             only look on this interface. (Some distros include
                      support for dumping on all interfaces. Don't use
                      that, it is confusing.)
  -n                  do not do reverse DNS resolution
  -p                  do not use promiscuous mode. (We only want to see
                      packets for this host.)
  udp port 500        look for src/dst port == 500
  or udp port 4500    look for src/dst port == 4500

Then, do the replace and up the conn again.
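
The capture command above can be wrapped in a small shell helper, so
that exactly the same filter is used on both gateways. This is only a
sketch: the function names ike_filter and ike_capture are made up for
illustration, and saving the packets to a file with -w (to be read back
later with -r) is an assumption about how you want to compare the two
ends. The optional peer address simply narrows the filter to one host.

```shell
#!/bin/sh
# Hypothetical helper: print the IKE capture filter.
# With an argument, restrict the filter to that peer address.
ike_filter() {
    peer=${1:-}
    if [ -n "$peer" ]; then
        printf 'ip host %s and ( udp port 500 or udp port 4500 )\n' "$peer"
    else
        printf 'udp port 500 or udp port 4500\n'
    fi
}

# Hypothetical helper: save matching packets to a file (needs root).
#   $1 = interface, $2 = output file, $3 = optional peer address
ike_capture() {
    # The filter is left unquoted on purpose, so that it is split
    # into separate words before being handed to tcpdump.
    tcpdump -i "$1" -n -p -w "$2" $(ike_filter "${3:-}")
}
```

For example, "ike_capture eth1 /tmp/sg1-ike.pcap" on sg1 and the
equivalent on sg2, followed by "tcpdump -n -v -r /tmp/sg1-ike.pcap" on
each file once the conn has been brought up, gives two captures that can
be compared at leisure.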
If everything is operating correctly, then each packet that leaves sg1
should arrive on sg2.

If there are many tunnels, then it may become confusing, and it is worth
adding new expressions to the filter:

    sg1# tcpdump -i eth1 -n -p ip host 1.2.3.4 and '(' udp port 500 or udp port 4500 ')'
    sg2# tcpdump -i eth1 -n -p ip host 5.6.7.8 and '(' udp port 500 or udp port 4500 ')'

(where 1.2.3.4 is the IP of sg2 and 5.6.7.8 is the IP of sg1. Note that
on each system you filter for the IP of the opposite system.)

You may be tempted to just look at all traffic from the opposite
security gateway. This can be done, but be careful if you are SSH'ed
into one security gateway from another. You will dump port-22 traffic,
which will create more port-22 traffic, leading to an endless loop.
Exclude the port-22 traffic:

    sg1# tcpdump -i eth1 -n -p ip host 1.2.3.4 and not port 22

What one is looking for is packets that start at sg1, and do not get to
sg2.

It may be helpful to include the -v option to tcpdump, which will decode
the initial main mode proposal (the first two exchanges are not
encrypted).

There are two situations:

  a1) the packets all arrive on sg2.
  a2) the packets do not arrive on sg2.

A1 - if all the packets arrive on sg2, but the pluto running on it does
not acknowledge them (this can be seen by turning on debugging on
pluto), then the problem is likely firewalling on sg2.

A2 - if the packets do not arrive on sg2, then the odds are that there
is a firewall involved, or there is some Network Address Translation
that you are not aware of.

Situation B - failure at third exchange.

First, examine the logs on sg2. If it is complaining about being unable
to find the appropriate keys, etc., then the problem is not a
communication failure. It will log receipt of MI3, and then complain
about a failure to authenticate (in the case of PSK, this will appear as
a failure to decrypt properly).

Assuming that there are no log entries on "sg2", then the third packet
may not be received.
There are a number of possible reasons for this.

Use tcpdump on both ends again, but include -v:

    sg1# tcpdump -i eth1 -v -n -p udp port 500 or udp port 4500
    sg2# tcpdump -i eth1 -v -n -p udp port 500 or udp port 4500

Look for the third exchange; it will be marked as [E].

first exchange:
11:14:25.516187 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 320)
    205.150.200.247.500 > 205.150.200.252.500: isakmp 1.0 msgid : phase 1 I ident: [|sa]
11:14:25.537388 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 128)
    205.150.200.252.500 > 205.150.200.247.500: isakmp 1.0 msgid : phase 1 R ident: [|sa]

second exchange:
11:14:25.547023 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 272)
    205.150.200.247.500 > 205.150.200.252.500: isakmp 1.0 msgid : phase 1 I ident: [|ke]
11:14:25.772504 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 272)
    205.150.200.252.500 > 205.150.200.247.500: isakmp 1.0 msgid : phase 1 R ident: [|ke]

third exchange:
11:14:25.781501 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 232)
    205.150.200.247.500 > 205.150.200.252.500: isakmp 1.0 msgid : phase 1 I ident[E]: [encrypted id]
11:14:25.865700 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 360)
    205.150.200.252.500 > 205.150.200.247.500: isakmp 1.0 msgid : phase 1 R ident[E]: [encrypted id]

The above is what it should look like. One should see the same thing at
each end.

Variations that one might see:

  a) one might see traffic on port 4500.
  b) the IP addresses may also have changed if a NAT is involved.
  c) the packet may have been fragmented.
  d) any combination of the above.

Fragmentation problem
=====================

If certificates are involved, and they are being sent inline, that may
lead to I3/R3 packets that are larger than 1500 bytes, which requires
that the packet be fragmented. This will be indicated by having a
non-zero "id" field, and the flags will include '[+]'.
The above filter will not show the fragments. If you are seeing
fragmentation, then adjust the filter to show all packets going to the
other end. Careful, this may result in a lot of traffic:

    sg1# tcpdump -i eth1 -v -n -p ip host 1.2.3.4 and not port 22
    sg2# tcpdump -i eth1 -v -n -p ip host 5.6.7.8 and not port 22

Note whether you see fragments leaving one system and not arriving at
the other system. Note that Linux sends the fragments *BEFORE* the
initial fragment.

It is also possible that the local system is filtering the fragments
itself, in which case no packet will emerge at all. This can be due to
local firewalling, but can also be due to UDP on the 2.6 kernel having
the "Don't Fragment" bit set.

The pluto log, with "emitting" debug, will log how big a UDP packet is
being sent. This will confirm that a large packet is being sent.

The most common situation is ISPs, crappy routers or over-zealous
firewall admins who have filtered out fragments. Often they will claim
that they have not done that. Test the situation with:

    sg1# ping -s 5000 1.2.3.4

You may also want to try with "hping2":

    sg1# hping2 -2 -x -y --destport 500 1.2.3.4

A way to determine that this is in fact the problem is to omit the
certificate payload by setting "leftsendcert=never". Copy the
certificate to sg2, and point the conn at it. While you may not want to
operate like this permanently, it helps to diagnose the problem.

Port-4500 is closed
===================

If NAT was detected (this will be logged by pluto when you use --up),
and you see the I3 packet leaving sg1 to port 4500, but not arriving at
sg2, then the likely reason is that port-4500 has been blocked.

If you see the packet arrive, but pluto on sg2 never sees it, then the
problem is likely firewalling *on* sg2. Otherwise, it is your ISP or
firewall admin who has done this to you.

Be careful with rules: the port-4500 traffic, unlike the port-500
traffic, does not necessarily originate on port 4500.
That is, a rule like:

    iptables -A INPUT -s 0.0.0.0/0 -d 0.0.0.0/0 -p udp --sport 500 --dport 500 -j ACCEPT

works for plain IKE traffic, but may not work for port-4500 traffic,
which may originate on any port. Instead, you may need rules like:

    iptables -A INPUT -s 0.0.0.0/0 -d 0.0.0.0/0 -p udp --sport 4500 -j ACCEPT
    iptables -A INPUT -s 0.0.0.0/0 -d 0.0.0.0/0 -p udp --dport 4500 -j ACCEPT

(The INPUT chain is shown only as an example; use whichever chain fits
your own rule set.) This is a very much more liberal rule set. Note that
it would permit, for instance, traffic from port 4500 *to* port 138, so
you may want to make sure that these rules come after the ones that
protect your GatesOS systems.

Situation C
===========

This situation can be diagnosed from reading the logs on sg2. Tcpdump
does not help at all here.

Situation D
===========

The tunnel appears to complete, but no traffic flows.

First, invoke:

    sg1# tcpdump -i ipsec0 -n -p
    sg2# tcpdump -i ipsec0 -n -p

and send some traffic. Do you see the traffic you are sending? (If sg1
is busy, then you will want to add some additional filters.)

If you do, but the traffic does not get through to its final
destination, then you have a firewall problem on sg2, or perhaps a
routing problem. If you see no traffic on sg1, then you may have a
firewall problem there.

If you see traffic on sg1, but none on sg2, then you need to investigate
whether the ESP traffic is getting through.

If you are running NETKEY rather than KLIPS, the ipsec0 steps above will
be meaningless, since there is no ipsec0 interface.

Second, read the logs at both ends. Was the SA set up properly? Look for
errors from "ip".

Third, is there NAT involved? It could be that a network sharing
NAT-device (often incorrectly called a "broadband router") will be
"helpful". The two ends may not detect that there is a NAT involved, and
may not switch from port-500+proto-50(ESP) -> port-4500 w/UDP-encap ESP.
In versions 2.3.0+, pluto will log what it is going to do:

    004 "sg1--sg2-net" #797: STATE_QUICK_I2: sent QI2, IPsec SA established {ESP=>0xaa6fa19a <0xa2f3b68d xfrm=3DES_0-HMAC_SHA1 NATD=none DPD=enabled}

If it says "NATD=none", and you think that there is NAT, then you may
have to use "forceencaps" in the conn definition. Or, take the
NAT-device out of the loop, and smash it with a hammer.

If this is not the case, then on each end do:

    sg1# tcpdump -i eth1 -v -n -p ip host 1.2.3.4 and ip proto 50
    sg2# tcpdump -i eth1 -v -n -p ip host 5.6.7.8 and ip proto 50

You should see ESP packets with SPI 0xaa6fa19a leaving sg1 (this is the
=> in the log entry), and ESP packets with SPI 0xa2f3b68d arriving on
sg1. On sg2, you see the opposite.

Observe whether you see traffic in only one direction. Go back and
confirm whether or not, on sg2's eth0 (the internal interface), you see
traffic sent from sg1. If you also see responses, then the problem is in
the reverse flow. Follow these instructions again, but with "sg1" and
"sg2" swapped.

$Id: debugging-tcpdump.txt,v 1.1 2005/04/24 15:47:26 mcr Exp $
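
For the ESP check in Situation D, the two SPIs have to be read out of
the "IPsec SA established" line by eye. A minimal sketch of doing that
with sed follows. The function name esp_spis is made up for
illustration, and the pattern assumes the log line has exactly the
{ESP=>0x... <0x... } layout shown in this document.

```shell
#!/bin/sh
# Hypothetical helper: read a pluto "IPsec SA established" line on
# stdin and print "<outbound SPI> <inbound SPI>".
# Lines that do not contain the ESP=>... <... pair print nothing.
esp_spis() {
    sed -n 's/.*ESP=>\(0x[0-9a-fA-F]*\) <\(0x[0-9a-fA-F]*\).*/\1 \2/p'
}
```

The first value printed is the SPI that should be seen leaving the
local gateway, the second is the one that should be seen arriving.
tcpdump normally prints the SPI of each ESP packet it decodes, so the
two values can be searched for in the capture output.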