FAQ: [#6463] TCP hangs when lot of traffic from 2 or more devices to one host on the same switch (2011)

Document created by Aaronwu Employee on Sep 11, 2013


Submitted By: Zoltan Ger
Open Date: 2011-01-25 05:00:22
Priority: Low
Assignee: Nobody
Status: Open
Fixed In Release: N/A
Found In Release: 2009R1.1-RC4
Release: 2.6.28.10-ADI2009R1.1-g3079765
Category: Networking
Board: Custom
Processor: BF537
Silicon Revision: Rev. 3
Is this bug repeatable?: Yes
Resolution: Out of Date
Uboot version or rev.:
Toolchain version or rev.: 4.1.2 (ADI svn)
App binary format: N/A

Summary: TCP hangs when lot of traffic from 2 or more devices to one host on the same switch

Details:

 

How to reconstruct the problem:

1.) Unpack the attached sources. The "linux" folder contains the host apps, and the "embedded" folder contains the embedded apps.

2.) Set up a 100 Mbit switch with two devices using the BF537 and a Linux host PC. Do not use a 1 Gbit switch, because the problem is not reproducible if the port to the PC has a speed of >128 Mbit/s.

3.) Run "embedded/testAppEmb" on the two BF537 devices and "linux/testapp ip_of_first_device_ip" in a terminal on the host PC and "linux/testapp ip_of_second_device_ip" in another terminal on the host PC.

After step 3.), the problem occurs.
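
The attached test apps are not reproduced here. As a rough illustration of the request/response traffic pattern described in the steps above, the following is a minimal host-side sketch. The function names, the request bytes and the fixed reply length are assumptions made for illustration and are not taken from the attached sources.

```python
import socket


def recv_exact(sock, length):
    """Read exactly `length` bytes, or raise if the peer closes early."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def request_response(sock, request, reply_len, timeout=None):
    """Send one request and wait for a fixed-size reply.

    With timeout=None the call blocks until the full reply arrives, so a
    device that stops answering shows up on the host as a hang. With a
    timeout in seconds, socket.timeout is raised instead.
    """
    sock.settimeout(timeout)
    sock.sendall(request)
    return recv_exact(sock, reply_len)
```

Calling `request_response` in a loop from two terminals, one per device, would give roughly the traffic shape described above: each device transmits only after it has received a request from the host.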

 

What we found out:

1.) The problem does not happen if the device sends data continuously without a request from the host. To reproduce this, run "embedded/testAppEmbCont" on the two BF537 devices and "linux/testappcont ip_of_first_device_ip" in a terminal on the host PC and "linux/testappcont ip_of_second_device_ip" in another terminal on the host PC.

2.) The problem occurs at the beginning if the socket on the host is non-blocking and the host proceeds after the error, but it recovers after approx. 1 minute. To reproduce this, run "embedded/testAppEmb" on the two BF537 devices and "linux/testapptimeoutcont ip_of_first_device_ip" in a terminal on the host PC and "linux/testapptimeoutcont ip_of_second_device_ip" in another terminal on the host PC.

3.) We disabled the Nagle algorithm, decreased the packet size (MTU), and tried decreasing and increasing SO_SNDBUF and SO_RCVBUF. None of these solved the problem.
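
For reference, the socket-level settings mentioned in item 3 can be applied as in the sketch below. This is an illustration only: `tune_socket` is a hypothetical helper, not code from the attached test apps, and the buffer sizes in the usage lines are arbitrary. The MTU is an interface setting and is not covered here.

```python
import socket


def tune_socket(sock, nodelay=True, sndbuf=None, rcvbuf=None):
    """Apply the TCP options discussed above and return the effective values."""
    if nodelay:
        # TCP_NODELAY = 1 disables the Nagle algorithm on this socket.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    if sndbuf is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
    if rcvbuf is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    # Read the values back: the kernel may adjust the requested sizes.
    return {
        "nodelay": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0,
        "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(tune_socket(s, sndbuf=16384, rcvbuf=16384))
    s.close()
```

Reading the values back is worthwhile because the kernel does not necessarily use the requested size as given; Linux, for example, reports a larger value than the one set to account for bookkeeping overhead.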

 

Wireshark log:

Unpack the attached Wireshark log. It shows BF537 Device 1 with IP .144, BF537 Device 2 with IP .146, and the host PC with IP .143. The last message from BF537 Device 2 is number 34919. After that the host sends message 34920, but Device 2 does not answer any more.

 

Our assumption:

Because the problem only exists if the host sends packets to the BF devices to request data, it could be a TX/RX DMA out-of-sync issue. The problem is not reproducible if the BF devices send data without request messages from the host PC. In that case, the RX DMA is not used.

 

Thanks in advance.

 

Follow-ups

 

--- Zoltan Ger                                               2011-01-25 08:20:09

In addition, we checked the RX and TX DMAs. Please see dmadump.txt. RX_COMP is set, which means that the buffer is not processed by the DMA.

 

--- Sonic Zhang                                              2011-01-26 01:11:51

Could you run ipconfig on your bf537 boards and paste here?

 

--- Sonic Zhang                                              2011-01-26 01:12:02

Could you run ifconfig on your bf537 boards and paste here?

 

--- Sonic Zhang                                              2011-01-26 01:15:01

When you say TCP hangs, do you mean the kernel hangs or just that the application stops running? Can you type commands on the serial console after the "TCP hangs"?

 

--- Zoltan Ger                                               2011-01-26 03:34:00

"ifconfig.txt" is attached. There are no errors or dropped packets. As you can see in "dmadump.txt", the RX DMA does not finish copying the packets to memory, because the RX_COMP bit is set when the error occurs. "TCP hangs" means that the TCP application waits for a response (see the Wireshark dump in "tcp_hang.zip"). I can type commands via serial, log in with telnet, start applications and so on.

 

--- Zoltan Ger                                               2011-01-26 03:37:18

We tried to reproduce this issue with uClinux release 2010R1-RC5 and it turns out it works OK. We see short stalls but the TCP stack is able to recover.

 

--- Sonic Zhang                                              2011-01-26 04:14:20

Then, please upgrade to 2010R1. We usually don't support any release except the latest one.

 

--- Zoltan Ger                                               2011-01-28 09:58:03

We'll upgrade, of course. But we'd like to know the reason. We searched with keywords like tcp, emac, dma etc. to find the bug report, but without any success. Can you send us a link to (a similar) bug report?

 

 

 

 

File Name           File Type                       File Size    Posted By
dmadump.txt         text/plain                      3731         Zoltan Ger
ifconfig.txt        text/plain                      811          Zoltan Ger
TCP-TestApps.zip    application/x-zip-compressed    564787       Zoltan Ger
tcp_hang.zip        application/x-zip-compressed    1172114      Zoltan Ger
