2009-10-05 09:35:53     tcp drops packets do to checksum errors

Document created by Aaronwu Employee on Aug 27, 2013

2009-10-05 09:35:53     tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 80835   

 

Hello,

 

we are experiencing packet drops due to a failing checksum check in tcp_checksum_complete (in tcp_rcv_established). The checksum is not calculated by the hardware as stated in https://blackfin.uclinux.org/gf/project/uclinux-dist/forum/?_forum_action=ForumMessageBrowse&thread_id=36476&action=ForumBrowse&forum_id=39.

 

The checksum check fails rarely, but often enough to make us wonder and to cause problems. The network device receives every packet without errors, so the data or checksum corruption must occur somewhere within the kernel.

 

Does anybody know this problem? With Wireshark we see duplicate ACKs from the Blackfin each time the checksum calculation fails. It fails for only one packet at a time.

 

We are using uclinux-dist-2009R1-RC6 and bf527.

 

Kind regards,

 

Stefan


 

 

2009-10-05 11:11:59     Re: tcp drops packets do to checksum errors

Robin Getz (UNITED STATES)

Message: 80850   

 

Stefan:

 

Is it failing on one packet type more than a different one? (http? mail? ftp?) - what is the easiest way to replicate the problem you are seeing?

 

-Robin


 

 

2009-10-06 12:02:07     Re: tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 80900   

 

Hey Robin,

 

we are trying to make it reproducible, but at the moment it seems we have found a problem in bfin_mac.

 

The checksum calculated in the TCP receive path is wrong because the data in skb->data is in fact corrupted.
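
To illustrate why a handful of stray bytes in skb->data is enough to fail the check: TCP uses the 16-bit ones'-complement Internet checksum (RFC 1071). The following is a minimal user-space sketch, not the kernel's tcp_checksum_complete(). The function names are made up for illustration, and the TCP pseudo-header is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement sum of a byte buffer, folded to 16 bits (RFC 1071).
 * Hypothetical helper; intended for packet-sized buffers only. */
static uint16_t inet_sum16(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)                      /* odd trailing byte, padded with zero */
        sum += (uint32_t)buf[len - 1] << 8;
    while (sum >> 16)                 /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Value the sender stores in the checksum field (field zeroed while summing). */
static uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
    return (uint16_t)~inet_sum16(buf, len);
}

/* A buffer that contains its own checksum verifies if the sum is 0xFFFF. */
static int segment_checksum_ok(const uint8_t *buf, size_t len)
{
    return inet_sum16(buf, len) == 0xFFFF;
}
```

Turning a single 0x00 payload byte into 0x01 after the checksum has been stored already makes segment_checksum_ok() return 0, which matches the observation that a few foreign bytes in an otherwise all-zero payload get the packet dropped.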

 

In the skb->data of the failing packet we found some sequences from old packets (!) that replaced parts of the packet (for example the area from the TCP flags to the checksum). We have also seen these bytes a couple of times: EA 05 00 50 DA 47 1B D1 3A F8 F3 8D (even after resets), and also random bytes where we were expecting zeros.

 

We made a separate copy of the skb->data area in the bfin_mac_rx routine, which is called by the DMA interrupt handler, to see whether the stack is causing the corruption in the packet, but the packets are already corrupted down there.

 

So either the network device copies trash, or the DMA copy somehow makes errors, or there is a caching problem?

 

There are some notes about caching in the code already...

 

 

 

Concerning reproducibility: We have a load of about 70% on the BF527 and receive mostly Ethernet MTU-sized packets at a data rate of approx. 6.15 Mbit/s to the Blackfin, while the Blackfin sends about 12.3 Mbit/s. A failure occurs on average about every 20 seconds.
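
For a rough sense of scale, the receive rate and failure interval can be turned into an approximate per-packet failure rate. This is only a back-of-the-envelope sketch: it assumes full 1500-byte packets (the post only says "mostly" MTU sized), and packets_per_failure is a hypothetical helper, not something from the driver.

```c
#include <assert.h>

/* Approximate number of received packets between two failures.
 * rate_bps:   receive data rate in bit/s
 * pkt_bytes:  assumed packet size in bytes (an assumption, see above)
 * interval_s: average time between failures in seconds */
static double packets_per_failure(double rate_bps, double pkt_bytes,
                                  double interval_s)
{
    assert(pkt_bytes > 0.0);
    return rate_bps / (pkt_bytes * 8.0) * interval_s;
}
```

With 6.15e6 bit/s, 1500 bytes, and 20 s this gives about 10,250 packets, so roughly one received packet in ten thousand is affected under these assumptions.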

 

Kind Regards,

 

Stefan


 

 

2009-10-06 12:32:30     Re: tcp drops packets do to checksum errors

Michael Hennerich (GERMANY)

Message: 80903

A quick hint before I'm out of office for today - Blackfin doesn't have a dedicated cache invalidate instruction! It's always flush + invalidate. So in case the cache is hot, you might flush cached data into the newly DMAed data.

I took a quick look - bfin_mac_rx() flush + invalidates the packet twice...

 

This is where I would start looking...
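
The hazard described here can be shown with a toy model of one 32-byte line in a write-back cache. This is a user-space sketch under simplifying assumptions (a single line, whole-line accesses, no hardware involved). All names are hypothetical and none of this is Blackfin or driver code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32

/* One line of external memory plus the cache line that may shadow it. */
struct line_model {
    uint8_t mem[LINE_SIZE];    /* external memory, what DMA reads and writes */
    uint8_t cache[LINE_SIZE];  /* cached copy, what the CPU sees on a hit */
    int valid;
    int dirty;
};

void model_init(struct line_model *m) { memset(m, 0, sizeof *m); }

/* CPU store in write-back mode: only the cache is updated. */
void cpu_fill(struct line_model *m, uint8_t value)
{
    memset(m->cache, value, LINE_SIZE);
    m->valid = 1;
    m->dirty = 1;
}

/* DMA transfer: goes straight to memory and bypasses the cache. */
void dma_fill(struct line_model *m, uint8_t value)
{
    memset(m->mem, value, LINE_SIZE);
}

/* Flush + invalidate: a dirty line is written back before it is dropped. */
void flushinv(struct line_model *m)
{
    if (m->valid && m->dirty)
        memcpy(m->mem, m->cache, LINE_SIZE);
    m->valid = 0;
    m->dirty = 0;
}

/* CPU load: a miss refills the line from memory. */
uint8_t cpu_read(struct line_model *m, unsigned idx)
{
    assert(idx < LINE_SIZE);
    if (!m->valid) {
        memcpy(m->cache, m->mem, LINE_SIZE);
        m->valid = 1;
        m->dirty = 0;
    }
    return m->cache[idx];
}
```

In this model the order of operations decides the outcome. cpu_fill(0xAA), dma_fill(0x55), flushinv() leaves 0xAA in memory, so the DMA data is lost. cpu_fill(0xAA), flushinv(), dma_fill(0x55) leaves 0x55, which is the intended result.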

 

-Michael


 

 

2009-10-06 13:35:38     Re: tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 80908   

 

Hey Michael,

 

thanks for your hint. We haven't located the problem yet, but switching to WRITE THROUGH caching solved the checksum failure problem. We would still like to use WRITE BACK if it brings more performance in general.

 

So we are still looking for a "real" solution....

 

Stefan


 

 

2009-10-06 13:48:43     Re: tcp drops packets do to checksum errors

Robin Getz (UNITED STATES)

Message: 80910   

 

Hmmm....

 

There also might be an issue (not sure) if dev_alloc_skb(PKT_BUF_SZ + NET_IP_ALIGN); doesn't return something aligned (start and end) to a cache line.

 

When I add something like:

 

Index: linux-2.6.x/drivers/net/bfin_mac.c
===================================================================
--- linux-2.6.x/drivers/net/bfin_mac.c  (revision 7535)
+++ linux-2.6.x/drivers/net/bfin_mac.c  (working copy)
@@ -1016,6 +1016,11 @@
        /* Invidate the data cache of skb->data range when it is write back
         * cache. It will prevent overwritting the new data from DMA
         */
+       if ((unsigned long)new_skb->head & 0x1F)
+               printk(KERN_NOTICE DRV_NAME ":new_skb->head not aligned\n");
+       if (((unsigned long)new_skb->end & 0x1F) != 0x1F)
+               printk(KERN_NOTICE DRV_NAME ":new_skb->end not aligned\n");
+

 

 

I get a lot of "bfin_mac:new_skb->end not aligned" - so when you see the errors, are they in the middle of the packet (aligned to a 32-byte cache line), or at the end?
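
The alignment question can also be written as plain address arithmetic. The sketch below assumes 32-byte cache lines and treats the end address as exclusive (one past the last byte). The test in the patch above compares the low bits of end with 0x1F, which treats end as the address of the last byte, so the two conventions give different answers for the same buffer. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* cache line size in bytes, as discussed in this thread */

/* Address of the first byte of the cache line that contains addr. */
uintptr_t line_floor(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/* First cache line boundary at or above addr. */
uintptr_t line_ceil(uintptr_t addr)
{
    return line_floor(addr + CACHE_LINE - 1);
}

/* Nonzero if [start, end) starts and ends exactly on line boundaries,
 * so that no cache line is shared with neighbouring data. */
int range_owns_its_lines(uintptr_t start, uintptr_t end)
{
    assert(start <= end);
    return line_floor(start) == start && line_ceil(end) == end;
}
```

A buffer that fails this test shares its first or last cache line with other data, so a flush + invalidate of the buffer can also write back or discard bytes that belong to a neighbour.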

 

-Robin


 

 

2009-10-09 07:20:36     Re: tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 81014   

 

Hey Robin,

 

the errors are in the middle, towards the beginning. At the end we have not seen wrong data so far. I don't know how to figure out whether they are 32-byte aligned.

 

Here is a failing packet, the tail of zeros is omitted. The bytes ea 05 00 50 da 47 1b d1 3a f8 f3 8d (fifth row of the second block) are wrong, they should be zeros.

 

The first block is the ethernet and ip header, then the tcp header + data.

 

 

3a f8 f3 8d 8c fc 00 05  da 47 1b d1 08 00 45 00
04 c0 d4 10 40 00 80 06  f9 22 0a 00 0a 1a 0a 00
0a eb

08 a1 30 39 a1 5d 25 68  d7 94 3e 1a 50 18 80 00
ec e0 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
ea 05 00 50 da 47 1b d1  3a f8 f3 8d 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00

 

The green block is the PC's MAC address, the red block is part of the Blackfin's MAC address. The blue is unidentified. We captured about 10 packets with exactly the same offset (!) and content (!), even after reboots, so there must be some logic behind this...


 

 

2009-10-09 16:09:22     Re: tcp drops packets do to checksum errors

Mike Frysinger (UNITED STATES)

Message: 81019   

 

are you using the L1 option ?  look in the menuconfig at the bfin_mac driver and the sub options it has.


 

 

2009-10-12 05:19:46     Re: tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 81093   

 

Heya,

 

yes, we're using L1 cache, 10 transmit buffer packets and 20 receive (that's the default). We had also increased the values to their maxima before and had the same problems.

 

Kind Regards,

 

Stefan


 

 

2009-10-13 03:34:01     Re: tcp drops packets do to checksum errors

Sonic Zhang (CHINA)

Message: 81134   

 

I created a bug for this issue in the tracker: blackfin.uclinux.org/gf/project/uclinux-dist/tracker/?action=TrackerItemEdit&tracker_item_id=5600


 

 

2009-10-14 07:16:57     Re: tcp drops packets do to checksum errors

Sonic Zhang (CHINA)

Message: 81201   

 

Could you try the attached patch to see whether it makes any difference in WB cache mode?

 

In this patch, RX skbs are invalidated only once, before they are linked to the RX DMA ring.

 

bfin_mac_invalidate_skb_before_link_to_dma.patch


 

 

2009-10-15 08:27:53     Re: tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 81311   

 

Hey Sonic,

 

thanks for your patch. I cannot really say that it makes a difference, since the errors occur at random times.

 

They still happen. There seems to be a relation to U-Boot's network activity. When U-Boot uses the network before Linux boots, the probability of such an error rises considerably.

 

But still, if WT cache is used, no errors appeared so far (with and without your patch).

 

Speaking of uboot I have to admit that we are still using U-Boot 1.1.6 (ADI-2008R1). I will update now and see what happens then, sorry for that ...

 

Can there be an issue with uboot that causes problems when linux uses WB cache???

 

Kind regards,

 

Stefan


 

 

2009-10-20 06:56:49     Re: tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 81476   

 

Hey Sonic,

 

thanks for your patch. I cannot really say that it makes a difference, since the errors occur at random times.

 

They still happen. There seems to be a relation to U-Boot's network activity. When U-Boot uses the network before Linux boots, the probability of such an error rises considerably.

 

But still, if WT cache is used, no errors appeared so far (with and without your patch).

 

[Speaking of uboot I have to admit that we are still using U-Boot 1.1.6 (ADI-2008R1). I will update now and see what happens then, sorry for that ...] => Checked that and it still happens with u-boot-2008.10 (ADI-2009R1-rc3).

 

Can there be an issue with uboot that causes problems when linux uses WB cache???

 

Kind regards,

 

Stefan

 

---


 

 

2009-10-20 08:23:40     Re: tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 81478   

 

Hello Sonic,

 

why did you say it's fixed already? Your change didn't solve the problem yet....

 

Kind regards,

 

Stefan


 

 

2009-10-20 12:58:30     Re: tcp drops packets do to checksum errors

Mike Frysinger (UNITED STATES)

Message: 81491   

 

once the kernel boots, whatever version of u-boot you were using shouldn't matter


 

 

2009-10-20 23:13:22     Re: tcp drops packets do to checksum errors

Sonic Zhang (CHINA)

Message: 81499   

 

The bug was reopened after you said the patch doesn't help much.


 

 

2009-10-22 06:47:24     Re: tcp drops packets do to checksum errors

Sonic Zhang (CHINA)

Message: 81591   

 

With the attached patch, I transferred a 20 MB file to the board via FTP about 20 times without a checksum error. Of course, no one can say that this workaround solves the issue.

 

Please have a try with your application.

 

bfin_mac_invalidate_skb_before_link_to_dma_2.patch


 

 

2009-11-03 11:53:25     Re: tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 82039   

 

Hey Sonic,

 

unfortunately this patch also doesn't help...

 

I had to define the CONFIG_BFIN_EXTMEM_WRITEBACK macro because I have the 2.6.28 kernel btw.

 

We have constant DMA work on both SPORTs, could this be a problem? Also the kernel is configured as preemptible...

 

Kind regards,

 

Stefan


 

 

2010-05-06 10:07:19     Re: tcp drops packets do to checksum errors

Jon Kowal (GERMANY)

Message: 89182   

 

Hello,

 

I picked up the issue because the network is still not working correctly. We can reproduce the problem easily with EZ-Kits as well as our own boards. The problem causes our high-bandwidth network applications (multi-channel realtime audio) to fail because packets are being corrupted after reception and there is no time for retransmission. Switching to write-through mode is no solution because we are experiencing freezes in that mode (see other forum thread) that we are not able to solve, either.

 

Because the write-back problem is so easily reproducible we decided to focus on that again. I have spent the last three days trying to find some solution or workaround and this is how far I got:

 

    I use iperf on blackfin and PC to produce the problem. It is important to run iperf with the -d option for bidirectional test. Usually it takes just a second for the first checksum errors to occur in tcp_rcv_established(). I start iperf using the following syntax:

    Blackfin:     iperf -s

    PC:     iperf -c 192.168.1.5 -P 1 -i 1 -d -F /dev/zero -t 15

    In tcp_rcv_established() (net/ipv4/tcp_input.c) I added the following three lines after the csum_error label to make the errors visible. Sometimes I also dump the entire packet there (see below for example).

 

    csum_error:;
    static unsigned long u = 0;   /* number of checksum failures seen so far */
    u++;
    printk(KERN_ALERT "CSUM error: %lu\t %p\n", u, skb->head);

 

    TCP checksum check fails because the received data contains data that does not belong there.

    Sometimes the invalid data can be identified as header data from a packet that had been sent some time ago. I was also able to match the memory address: the sent packet had been located at the exact address where the error is now occurring in the incoming data.

    I have inserted the following memset right before the call to blackfin_dcache_invalidate_range() in bfin_mac_rx() to see if the cache is being invalidated correctly. It turns out that the corrupted packets will now contain the data I introduced using memset.

    memset( new_skb->head, 1, new_skb->end-new_skb->head );

    The following listing shows an excerpt from a corrupted packet which should be all 0 after the tcp header at the top of the data but contains 32 bytes (one cache line?) of 1's which I introduced before invalidating the cache.

 

    Packet Length=1480, Data StartAddr=00289034
    84 33 13 89 e7 42 c2 23  9a b4 5d ff 80 10 00 2e
    0a f0 00 00 01 01 08 0a  00 80 b8 2e ff fe f0 19
    00 00 00 00 00 00 00 00  00 00 00 00 01 01 01 01
    01 01 01 01 01 01 01 01  01 01 01 01 01 01 01 01
    01 01 01 01 01 01 01 01  01 01 01 01 00 00 00 00
    00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
    ...

 

    So apparently either the invalidate did not work, or somewhere up the stack some read/write operation has read the old data into the cache before the DMA updated the SDRAM.

    Attempts to invalidate or disable the entire cache in case of an error using blackfin_invalidate_entire_dcache() always crash the system.
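
The "(one cache line?)" guess can be checked against the dump with a little address arithmetic. The sketch assumes that "Data StartAddr" is the address of the first dumped byte and that cache lines are 32 bytes. locate_span is a hypothetical helper written for this check.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 32u

struct span_info {
    uint32_t first_addr;   /* address of the first wrong byte */
    uint32_t last_addr;    /* address of the last wrong byte */
    int one_whole_line;    /* nonzero if the span is exactly one cache line */
};

/* data_addr: address of dump offset 0, first_off: offset of the first
 * wrong byte in the dump, len: number of wrong bytes. */
struct span_info locate_span(uint32_t data_addr, uint32_t first_off,
                             uint32_t len)
{
    struct span_info s;

    assert(len > 0);
    s.first_addr = data_addr + first_off;
    s.last_addr = s.first_addr + len - 1;
    s.one_whole_line = (s.first_addr % CACHE_LINE == 0) && (len == CACHE_LINE);
    return s;
}
```

In the dump, the 0x01 bytes run from offset 44 to offset 75, which is 32 bytes. locate_span(0x00289034, 44, 32) gives 0x00289060 to 0x0028907f, so under these assumptions the poisoned bytes cover exactly one 32-byte-aligned cache line.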

 

I would like to check that the invalidate operation (FLUSHINV) is working correctly by using DTEST_COMMAND to check the invalid flag of the affected memory. Also, setting the invalid flag directly may be a quicker way to invalidate the cache than using FLUSHINV, which always (unnecessarily in our case) flushes the cache first.

 

Unfortunately I have no experience in writing assembler code and even after reading the programming reference a couple of times I feel unsure on how to actually proceed.

 

Can anyone help me write a function that, given a specific memory location, returns the valid flag for the associated cache line?

 

Do you have any other ideas on the cause of the problem? If you have problems reproducing the issue I am sure I can assist.

 

Thanks! Jon


 

 

2010-05-06 10:21:49     Re: tcp drops packets do to checksum errors

Jon Kowal (GERMANY)

Message: 89186   

 

Hi Robin,

 

I have checked the addresses of the skb and of the error data. skb->head is always 32-byte aligned, and the errors can often be found at an offset of 96 bytes, which is also 32-byte aligned. There are also other offsets, but it is remarkable that the offset is always identical for the same skb->head address. In other words: the errors don't go away when a new packet is received at the same memory location.

 

Also, as stated (in detail) in my reply to Sonic, there has always been an outgoing transmission using the exact same memory some time before the error occurs. Sending data only one-way does not produce the error, it is important to have bidirectional high-bandwidth data flow.

 

Thanks for your help!

 

Jon


 

 

2010-05-06 12:00:16     Re: tcp drops packets do to checksum errors

Robin Getz (UNITED STATES)

Message: 89190   

 

Jon:

 

I'm assuming that if you go to  /sys/devices/platform/bfin_mac.0/net/eth0/statistics/ - and run:

 

for i in $(ls); do echo -ne "$i\t"; cat $i; done

 

collisions              0
multicast               0
rx_bytes                1156653949
rx_compressed           0
rx_crc_errors           0
rx_dropped              0
rx_errors               0
rx_fifo_errors          0
rx_frame_errors         0
rx_length_errors        0
rx_missed_errors        0
rx_over_errors          0
rx_packets              11802151
tx_aborted_errors       0
tx_bytes                1156384913
tx_carrier_errors       0
tx_compressed           0
tx_dropped              0
tx_errors               0
tx_fifo_errors          0
tx_heartbeat_errors     0
tx_packets              11799851
tx_window_errors        0

 

 

You have a bunch of non-zero error counters?

 

-Robin


 

 

2010-05-07 05:19:13     Re: tcp drops packets do to checksum errors

Jon Kowal (GERMANY)

Message: 89214   

 

No, there were no errors and from what I understand there shouldn't be. Maybe my explanation in the previous post was too brief. The errors we see are first recognized in the TCP checksum check. The lower layers don't see any errors which is why they don't show up in the eth0 statistics. It is not TCP introducing the error, though, because we can identify the erroneous data as being data from an old tx-packet which had been located at that same address.

 

Because we have the cache enabled, we assume the data stayed in the cache even though the external memory has been updated via DMA. The question is: why is that old data being read from the cache even though the cache should get invalidated right after the memory is allocated in bfin_mac_rx()? To answer that question I want to check the invalid flag of the cache line using DTEST_COMMAND, but I am not experienced in writing assembler and would love to get some help.
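
One way such stale data could survive a correct invalidate is if something touches the line again after the invalidate but before the DMA transfer has finished. The sketch below replays that sequence for a single 32-byte line in user space. It models a hypothesis only, it does not show what the Blackfin hardware actually does, and rx_byte_seen is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32

/* Returns the byte the CPU finally reads from one RX buffer line.
 * early_read != 0 models an access that touches the line after the
 * flush + invalidate but before the DMA transfer has completed. */
uint8_t rx_byte_seen(int early_read)
{
    uint8_t mem[LINE_SIZE];    /* external memory */
    uint8_t cache[LINE_SIZE];  /* cached copy of the same line */
    int valid;

    /* 1. CPU poisons the buffer with 0x01 (like the memset test);
     *    in write-back mode the data sits dirty in the cache. */
    memset(cache, 0x01, LINE_SIZE);
    valid = 1;

    /* 2. Flush + invalidate: poison is written back, line is dropped. */
    memcpy(mem, cache, LINE_SIZE);
    valid = 0;

    /* 3. Optional early access refills the line with the poison. */
    if (early_read) {
        memcpy(cache, mem, LINE_SIZE);
        valid = 1;
    }

    /* 4. DMA writes the received packet (all zeros here) to memory only. */
    memset(mem, 0x00, LINE_SIZE);

    /* 5. The stack reads the packet; a miss refills from memory. */
    if (!valid) {
        memcpy(cache, mem, LINE_SIZE);
        valid = 1;
    }
    assert(valid);
    return cache[0];
}
```

rx_byte_seen(0) returns 0x00 (the packet data) and rx_byte_seen(1) returns 0x01 (the poison), so in this model an early access is enough to reproduce a whole cache line of 0x01 without any fault in the invalidate itself.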

 

What we need is a function, that - given a memory location - returns the flags (invalid, dirty) of the corresponding cache line. Also nice would be a function that sets the invalid flag on the cache lines for a memory range. That would spare us from using FLUSHINV which is rather expensive.

 

If you think any of my assumptions are wrong, please let me know what you are thinking.

 

Thanks.

 

Jon


 

 

2010-05-07 14:14:55     Re: tcp drops packets do to checksum errors

Robin Getz (UNITED STATES)

Message: 89237   

 

Jon:

 

Then I need a patch - not a description of changes to make - when I looked at things - net/ipv4/tcp_input.c:tcp_rcv_established()

 

csum_error:

        TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);

 

should increase the error number.


 

 

2010-05-10 06:23:15     Re: tcp drops packets do to checksum errors

Jon Kowal (GERMANY)

Message: 89281   

 

I have no idea where to find the errors posted by TCP_INC_STATS_BH() but they don't show up in the device statistics. From what I see in the code it's some SNMP logging mechanism.
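
For reference, the TCP_MIB_INERRS counter is exported as the InErrs column on the two "Tcp:" lines of /proc/net/snmp (one line with column names, one with values), not in the per-device statistics. The following user-space sketch picks one column out of such a line pair. snmp_field is a hypothetical helper, and reading the two lines from the file is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Look up one counter in a names/values line pair as found in
 * /proc/net/snmp. Returns 0 and stores the value in *out on success,
 * -1 if the field is not present. */
int snmp_field(const char *names, const char *values,
               const char *field, unsigned long *out)
{
    size_t flen;

    assert(names && values && field && out);
    flen = strlen(field);
    for (;;) {
        size_t nlen, vlen;

        names += strspn(names, " \n");      /* skip separators */
        values += strspn(values, " \n");
        nlen = strcspn(names, " \n");       /* length of current column */
        vlen = strcspn(values, " \n");
        if (nlen == 0 || vlen == 0)
            return -1;                      /* ran out of columns */
        if (nlen == flen && strncmp(names, field, flen) == 0) {
            *out = strtoul(values, NULL, 10);
            return 0;
        }
        names += nlen;
        values += vlen;
    }
}
```

Reading the two lines that start with "Tcp:" using fgets() and calling snmp_field(names, values, "InErrs", &n) before and after an iperf run would show how many segments the TCP layer rejected in between.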

 

Honestly, I find communicating with you to be rather difficult. We are separated by several hours in time, which makes it difficult to reply in time or wait for answers, because your answers usually arrive when I'm about to leave the office. Thus, every little question posted will take at least a day to be answered, and that's a long time for me when I have to solve the problem. So instead of just posting one little question a day, it would help if you also shared your thoughts, and especially if you'd help me with one of the questions I posted or - if you think I'm digging in the wrong direction - tell me I should be looking some other way. Don't get this wrong, I am grateful for any help and really appreciate you spending your precious time on this, but I believe we should optimize that communication somehow to increase the amount of information transferred per day.

 

That said - from what you know so far, do you think my assumption that cache invalidation somehow failed makes sense? Do you have any other possibility in mind why the old content of a memory location should show up in a newly received packet? I thought checking the cache using DTEST_COMMAND would help to get a close look at what's going on. Do you agree with that?

 

Here's an svn diff of the changes to display the corrupted packets. I hope it helps. As described in another post above, those will show up within a few seconds when running a bidirectional test with iperf.

 

 

 

$ svn diff net/ipv4/
Index: net/ipv4/tcp_input.c
===================================================================
--- net/ipv4/tcp_input.c        (revision 8124)
+++ net/ipv4/tcp_input.c        (working copy)
@@ -5001,7 +5001,27 @@
        tcp_ack_snd_check(sk);
        return 0;
 
-csum_error:
+csum_error:;
+/* DEBUG JON BEGIN */
+               int i;
+               static unsigned int u =0;
+               u++;
+               printk(KERN_ALERT "CSUM error: %u\t %p\n", u, skb->head );
+//         printk(KERN_ALERT "CSUM error:\n Packet Length=%d, SKB Head: %p, Data StartAddr=%p\n", skb->len, skb->head, skb->data);
+               for(i=0;i+16<=skb->len;i+=16)   /* full rows only; avoids unsigned wrap of len-16 */
+               {
+                       printk(KERN_ALERT "%02x %02x %02x %02x %02x %02x %02x %02x  %02x %02x %02x %02x %02x %02x %02x %02x\n",
+                                 skb->data[i+0],skb->data[i+1],skb->data[i+2],skb->data[i+3],skb->data[i+4],skb->data[i+5],
+                                 skb->data[i+6],skb->data[i+7],skb->data[i+8],skb->data[i+9],skb->data[i+10],skb->data[i+11],
+                                 skb->data[i+12],skb->data[i+13],skb->data[i+14],skb->data[i+15]);
+               }
+               for(;i<skb->len;i++)            /* remaining bytes of the last, partial row */
+               {
+                       printk(KERN_ALERT "%02x ", skb->data[i]);
+               }
+               printk(KERN_ALERT "\n\n");
+/* DEBUG JON END */
+
        TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 
discard:
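
The row-by-row dump in the diff depends on getting the bounds of the last, partial row right. A user-space sketch of the same formatting with explicit bounds handling is shown below. format_hex_row is a hypothetical helper, not a kernel function. If the kernel tree in use provides print_hex_dump(), that is probably the simpler choice inside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format up to 16 bytes starting at buf[off] as "xx xx ..." into out.
 * Returns the number of bytes formatted (0 when off >= len).
 * out must hold at least 48 characters: 16 * 2 hex digits,
 * 15 separators, and the terminating NUL. */
size_t format_hex_row(const uint8_t *buf, size_t len, size_t off,
                      char *out, size_t out_size)
{
    size_t n, i, pos = 0;

    assert(out != NULL && out_size >= 16 * 3);
    out[0] = '\0';
    if (off >= len)
        return 0;
    n = len - off;          /* safe: off < len, so no unsigned wrap-around */
    if (n > 16)
        n = 16;
    for (i = 0; i < n; i++)
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                i ? " %02x" : "%02x", buf[off + i]);
    return n;
}
```

Calling it in a loop with off advancing by the returned count prints full rows first and then the shorter final row, without ever reading past len.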

 


 

 

2010-05-18 06:17:16     Re: tcp drops packets do to checksum errors

Sonic Zhang (CHINA)

Message: 89508   

 

Jon, I can confirm what you discovered. With bidirectional Ethernet traffic and WB cache enabled, the input buffer sometimes contains data from an older buffer that has already been freed. I tried to watch for CPU writes to the input buffer during the RX DMA operation, but I can't find any malformed CPU write access. It looks like the cache line is not really invalidated after the invalidate operation.

 

 

 

I haven't found a workaround yet. That's why this bug is still open.


 

 

2011-05-16 13:41:34     Re: tcp drops packets do to checksum errors

Stefan Wanja (GERMANY)

Message: 100680   

 

Hello again,

 

I was trying to reproduce the problem with UDP and have so far not been successful: more than one day of iperf without any checksum errors, using a patch like the TCP one to show them. It seems this problem doesn't occur when using UDP. I thought that might possibly be interesting...
