[#6463] TCP hangs when lot of traffic from 2 or more devices to one host on the same switch
Submitted By: Zoltan Ger
Open Date: 2011-01-25 05:00:22
Priority: Low
Assignee: Nobody
Status: Open
Fixed In Release: N/A
Found In Release: 2009R1.1-RC4
Release: 2.6.28.10-ADI2009R1.1-g3079765
Category: Networking
Board: Custom
Processor: BF537
Silicon Revision: Rev. 3
Is this bug repeatable?: Yes
Resolution: Out of Date
Uboot version or rev.:
Toolchain version or rev.: 4.1.2 (ADI svn)
App binary format: N/A
Summary: TCP hangs when lot of traffic from 2 or more devices to one host on the same switch
Details:
How to reconstruct the problem:
1.) Unpack the attached sources. The "linux" folder contains the host apps, the "embedded" folder contains the embedded apps.
2.) Set up a 100Mbit switch with two devices using the BF537 and a Linux host PC. Do not use a 1Gbit switch, because the problem is not reproducible if the port to the PC has a speed of >128Mbit/s.
3.) Run "embedded/testAppEmb" on the two BF537 devices and "linux/testapp ip_of_first_device_ip" in a terminal on the host PC and "linux/testapp ip_of_second_device_ip" in another terminal on the host PC.
After step 3.), the problem occurs.
What we found out:
1.) The problem does not happen if the device sends data continuously without a request from the host. To reproduce this, run "embedded/testAppEmbCont" on the two BF537 devices and "linux/testappcont ip_of_first_device_ip" in a terminal on the host PC and "linux/testappcont ip_of_second_device_ip" in another terminal on the host PC.
2.) If the socket on the host is non-blocking and the host proceeds after the error, the problem still occurs at the beginning, but the connection recovers after approx. 1 minute. To reproduce this, run "embedded/testAppEmb" on the two BF537 devices and "linux/testapptimeoutcont ip_of_first_device_ip" in a terminal on the host PC and "linux/testapptimeoutcont ip_of_second_device_ip" in another terminal on the host PC.
3.) We disabled the Nagle algorithm, decreased the packet size (MTU), and tried to decrease/increase SO_SNDBUF and SO_RCVBUF. None of these solved the problem.
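For readers who want to repeat the socket tuning from item 3.), here is a minimal sketch of how those options are usually set on a TCP socket. It is not taken from the attached test apps. The helper names `make_tuned_socket` and `report_options` and the buffer sizes are hypothetical. The MTU change is an interface setting and is not covered here.

```python
import socket


def make_tuned_socket(sndbuf=None, rcvbuf=None, nodelay=True):
    """Create a TCP socket with the options mentioned in item 3.) applied.

    sndbuf / rcvbuf are requested sizes in bytes; None leaves the default.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if nodelay:
        # TCP_NODELAY = 1 disables the Nagle algorithm for this socket.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    if sndbuf is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
    if rcvbuf is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    return sock


def report_options(sock):
    """Read back the effective values so the settings can be verified."""
    return {
        "nodelay": bool(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)),
        "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }


if __name__ == "__main__":
    s = make_tuned_socket(sndbuf=16384, rcvbuf=16384)  # example sizes only
    print(report_options(s))
    s.close()
```

Reading the values back is worthwhile because the kernel may adjust the requested buffer sizes (Linux, for example, doubles them), so the effective size can differ from the one passed to setsockopt.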
Wireshark log:
Unpack the attached Wireshark log. It shows BF537 device 1 with IP .144, BF537 device 2 with IP .146 and the host PC with IP .143. The last message from BF537 device 2 is number 34919. After that the host sends message 34920, but device 2 does not answer any more.
Our assumption:
Because the problem only exists if the host sends packets to the BF devices to request data, it could be a TX/RX DMA out-of-sync issue. The problem is not reproducible if the BF devices send data without request messages from the host PC. In that case, RX-DMA is not used.
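To make the request-driven traffic pattern concrete, here is a small loopback sketch of a host that sends one request at a time and waits for each response with a timeout. It is not the code of the attached test apps. The request token, payload size, function names and the simulated device that goes silent after a few answers are all hypothetical. It only shows how the stall looks from the host side: a request goes out and no response comes back.

```python
import socket
import threading

REQUEST = b"REQ"       # hypothetical request token
PAYLOAD_SIZE = 1024    # hypothetical response size in bytes


def recv_exact(sock, count):
    """Read exactly count bytes or raise ConnectionError if the peer closes."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def device(listener, answers):
    """Stand-in for a device: answer `answers` requests, then go silent."""
    try:
        conn, _ = listener.accept()
        with conn:
            for _ in range(answers):
                recv_exact(conn, len(REQUEST))
                conn.sendall(bytes(PAYLOAD_SIZE))
            # Simulated stall: keep reading requests but never answer them.
            while conn.recv(4096):
                pass
    except OSError:
        pass  # host closed or reset the connection


def host(address, requests, timeout_s):
    """Send requests one at a time; return how many were answered in time."""
    answered = 0
    with socket.create_connection(address) as sock:
        sock.settimeout(timeout_s)
        for _ in range(requests):
            sock.sendall(REQUEST)
            try:
                recv_exact(sock, PAYLOAD_SIZE)
            except socket.timeout:
                break  # request was sent, but no response arrived
            answered += 1
    return answered


def run_demo(answers=5, requests=8, timeout_s=0.5):
    """Run the device and the host on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    worker = threading.Thread(target=device, args=(listener, answers), daemon=True)
    worker.start()
    try:
        return host(listener.getsockname(), requests, timeout_s)
    finally:
        worker.join(timeout=2)
        listener.close()


if __name__ == "__main__":
    print("answered requests:", run_demo())
```

With a timeout on the host socket, the number of answered requests before the first stall can be logged and compared against the packet numbers in a capture.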
Thanks in advance.
Follow-ups
--- Zoltan Ger 2011-01-25 08:20:09
In addition, we checked the RX and TX DMAs. Please see dmadump.txt. RX_COMP is
set, which means that the buffer is not processed by the DMA.
--- Sonic Zhang 2011-01-26 01:11:51
Could you run ipconfig on your bf537 boards and paste here?
--- Sonic Zhang 2011-01-26 01:12:02
Could you run ifconfig on your bf537 boards and paste here?
--- Sonic Zhang 2011-01-26 01:15:01
When you say TCP hangs, do you mean the kernel hangs or just that the application
stops running? Can you type commands on the serial console after the "TCP
hangs"?
--- Zoltan Ger 2011-01-26 03:34:00
"ifconfig.txt" is attached. There are no errors or dropped packets. As
you can see in "dmadump.txt", the RX-DMA does not finish copying the
packets to memory, because the RX_COMP bit is set when the error occurs. By "TCP
hangs" we mean that the TCP application waits for a response (see the Wireshark
dump in "tcp_hang.zip"). I can type commands via serial, log in with
telnet, start applications and so on.
--- Zoltan Ger 2011-01-26 03:37:18
We tried to reproduce this issue with uClinux release 2010R1-RC5 and it turns
out it works OK. We see short stalls but the TCP stack is able to recover.
--- Sonic Zhang 2011-01-26 04:14:20
Then, please upgrade to 2010R1. We usually don't support releases other than the
latest one.
--- Zoltan Ger 2011-01-28 09:58:03
We'll upgrade, of course. But we'd like to know the reason. We searched with
keywords like tcp, emac, dma etc. to find the bug report, but without any
success. Can you send us a link to (a similar) bug report?
Files
File Name          File Type                      File Size   Posted By
dmadump.txt        text/plain                     3731        Zoltan Ger
ifconfig.txt       text/plain                     811         Zoltan Ger
TCP-TestApps.zip   application/x-zip-compressed   564787      Zoltan Ger
tcp_hang.zip       application/x-zip-compressed   1172114     Zoltan Ger