2009-10-05 09:35:53 tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 80835
Hello,
we are experiencing packet drops due to a failing checksum check in tcp_checksum_complete (in tcp_rcv_established). The checksum is not calculated by the hardware as stated in https://blackfin.uclinux.org/gf/project/uclinux-dist/forum/?_forum_action=ForumMessageBrowse&thread_id=36476&action=ForumBrowse&forum_id=39.
The checksum check fails rarely, but often enough to make us wonder and to cause problems. The network device receives every packet without errors, so the data or checksum problem must occur somewhere within the kernel.
Does anybody know this problem? With wireshark we are seeing Duplicate Acks from the blackfin each time the checksum calculation fails. It fails only for one packet at a time.
We are using uclinux-dist-2009R1-RC6 and bf527.
Kind regards,
Stefan
2009-10-05 11:11:59 Re: tcp drops packets due to checksum errors
Robin Getz (UNITED STATES)
Message: 80850
Stefan:
Is it failing on one packet type more than a different one? (http? mail? ftp?) - what is the easiest way to replicate the problem you are seeing?
-Robin
2009-10-06 12:02:07 Re: tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 80900
Hey Robin,
we are trying to make it reproducible, but at the moment it seems we have found a problem in bfin_mac.
The checksum calculated in the tcp receive path is wrong because the data in skb->data is in fact corrupted.
In the skb->data of the failing packet we found some sequences of old packets(!) which replaced parts of the packet (for example the area from tcp flags to checksum). We have also seen these bytes a couple of times: EA 05 00 50 DA 47 1B D1 3A F8 F3 8D (even after resets), and also random bytes where we were expecting zeros.
We made a separate copy of the skb->data area in the bfin_mac_rx routine, which is called by the DMA interrupt handler, to see if the stack is causing the corruptions in the packet, but the packets are already corrupted down there.
So either the network device copies trash, or the DMA copy somehow makes errors or there is a caching problem?
There are some notes about caching in the code already...
Concerning reproducibility: We have a load of about 70% on the bf527 and receive mostly Ethernet MTU sized packets at a data rate of approx. 6.15 Mbit/s to the bf, while the bf sends about 12.3 Mbit/s. A failure occurs on average about every 20 seconds.
Kind Regards,
Stefan
2009-10-06 12:32:30 Re: tcp drops packets due to checksum errors
Michael Hennerich (GERMANY)
Message: 80903
A quick hint before I'm out of office for today - Blackfin doesn't have a dedicated cache invalidate instruction! It's always flush + invalidate.
So in case the cache is hot, you might flush cached data into the newly DMAed data.
I took a quick look - bfin_mac_rx() flush + invalidates the packet twice...
This is where I would start looking...
-Michael
2009-10-06 13:35:38 Re: tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 80908
Hey Michael,
thanks for your hint. We haven't located the problem yet, but switching to WRITE THROUGH caching solved the checksum failure problem. However, we would like to keep using WRITE BACK if it gives better performance in general.
So we are still looking for a "real" solution....
Stefan
2009-10-06 13:48:43 Re: tcp drops packets due to checksum errors
Robin Getz (UNITED STATES)
Message: 80910
Hmmm....
There also might be an issue (not sure) if dev_alloc_skb(PKT_BUF_SZ + NET_IP_ALIGN); doesn't return something aligned (start and end) to a cache line.
When I add something like:
Index: linux-2.6.x/drivers/net/bfin_mac.c
===================================================================
--- linux-2.6.x/drivers/net/bfin_mac.c (revision 7535)
+++ linux-2.6.x/drivers/net/bfin_mac.c (working copy)
@@ -1016,6 +1016,11 @@
/* Invidate the data cache of skb->data range when it is write back
* cache. It will prevent overwritting the new data from DMA
*/
+ if ((unsigned long)new_skb->head & 0x1F)
+ printk(KERN_NOTICE DRV_NAME ":new_skb->head not aligned\n");
+ if (((unsigned long)new_skb->end & 0x1F) != 0x1F)
+ printk(KERN_NOTICE DRV_NAME ":new_skb->end not aligned\n");
+
I get a lot of "bfin_mac:new_skb->end not aligned" - so when you see the errors - are they in the middle of the packet (aligned to a 32-byte cache line), or at the end?
-Robin
2009-10-09 07:20:36 Re: tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 81014
Hey Robin,
the errors are in the middle, towards the beginning. At the end we have not seen wrong data so far. I don't know how to figure out whether it is 32-byte aligned.
Here is a failing packet, the tail of zeros is omitted. The coloured bytes are wrong, they should be zeros.
The first block is the ethernet and ip header, then the tcp header + data.
3a f8 f3 8d 8c fc 00 05 da 47 1b d1 08 00 45 00
04 c0 d4 10 40 00 80 06 f9 22 0a 00 0a 1a 0a 00
0a eb
08 a1 30 39 a1 5d 25 68 d7 94 3e 1a 50 18 80 00
ec e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ea 05 00 50 da 47 1b d1 3a f8 f3 8d 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
The green block is the PC's mac addr, the red block is part of the blackfin's mac addr. The blue is unidentified. We captured about 10 packets with exactly the same offset (!) and content (!), even after reboots, so there must be some logic behind this...
2009-10-09 16:09:22 Re: tcp drops packets due to checksum errors
Mike Frysinger (UNITED STATES)
Message: 81019
are you using the L1 option? look in the menuconfig at the bfin_mac driver and the sub options it has.
2009-10-12 05:19:46 Re: tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 81093
Heya,
yes, we're using L1 Cache, 10 transmit buffer packets and 20 receive (that's the default). We had also increased the values to their maxima before and had the same problems.
Kind Regards,
Stefan
2009-10-13 03:34:01 Re: tcp drops packets due to checksum errors
Sonic Zhang (CHINA)
Message: 81134
I created a bug for this issue in the tracker: blackfin.uclinux.org/gf/project/uclinux-dist/tracker/?action=TrackerItemEdit&tracker_item_id=5600
2009-10-14 07:16:57 Re: tcp drops packets due to checksum errors
Sonic Zhang (CHINA)
Message: 81201
Could you try the attached patch to see whether it makes any difference in WB cache mode?
In this patch, RX skbs are invalidated only once, before they are linked to the RX DMA ring.
bfin_mac_invalidate_skb_before_link_to_dma.patch
2009-10-15 08:27:53 Re: tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 81311
Hey Sonic,
thanks for your patch. I cannot really say whether it makes a difference, since the errors occur at random times.
They still happen. There seems to be a relation to uboot's network activity. When uboot uses network before the linux boots, the probability of such an error is raised quite a lot.
But still, if WT cache is used, no errors appeared so far (with and without your patch).
Speaking of uboot I have to admit that we are still using U-Boot 1.1.6 (ADI-2008R1). I will update now and see what happens then, sorry for that ...
Can there be an issue with uboot that causes problems when linux uses WB cache???
Kind regards,
Stefan
2009-10-20 06:56:49 Re: tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 81476
Hey Sonic,
thanks for your patch. I cannot really say whether it makes a difference, since the errors occur at random times.
They still happen. There seems to be a relation to uboot's network activity. When uboot uses network before the linux boots, the probability of such an error is raised quite a lot.
But still, if WT cache is used, no errors appeared so far (with and without your patch).
[Speaking of uboot I have to admit that we are still using U-Boot 1.1.6 (ADI-2008R1). I will update now and see what happens then, sorry for that ...] => Checked that and it still happens with u-boot-2008.10 (ADI-2009R1-rc3).
Can there be an issue with uboot that causes problems when linux uses WB cache???
Kind regards,
Stefan
2009-10-20 08:23:40 Re: tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 81478
Hello Sonic,
why did you say it's fixed already? Your change hasn't solved the problem yet....
Kind regards,
Stefan
2009-10-20 12:58:30 Re: tcp drops packets due to checksum errors
Mike Frysinger (UNITED STATES)
Message: 81491
once the kernel boots, whatever version of u-boot you were using shouldn't matter
2009-10-20 23:13:22 Re: tcp drops packets due to checksum errors
Sonic Zhang (CHINA)
Message: 81499
This bug was reopened after you said the patch doesn't help much.
2009-10-22 06:47:24 Re: tcp drops packets due to checksum errors
Sonic Zhang (CHINA)
Message: 81591
With the attached patch, I transferred a 20 MB file to the board via FTP about 20 times without a checksum error. Of course, no one can say that this workaround solves the issue.
Please have a try with your application.
bfin_mac_invalidate_skb_before_link_to_dma_2.patch
2009-11-03 11:53:25 Re: tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 82039
Hey Sonic,
unfortunately this patch also doesn't help...
I had to define the CONFIG_BFIN_EXTMEM_WRITEBACK macro because I have the 2.6.28 kernel btw.
We have constant DMA work on both SPORTS, could this be a problem? Also, the kernel is configured as preemptible...
Kind regards,
Stefan
2010-05-06 10:07:19 Re: tcp drops packets due to checksum errors
Jon Kowal (GERMANY)
Message: 89182
Hello,
I picked up the issue because the network is still not working correctly. We can reproduce the problem easily with EZ-Kits as well as our own boards. The problem causes our high-bandwidth network applications (multi-channel realtime audio) to fail because packets are being corrupted after reception and there is no time for retransmission. Switching to write-through mode is no solution because we are experiencing freezes in that mode (see other forum thread) that we are not able to solve, either.
Because the write-back problem is so easily reproducible, we decided to focus on that again. I have spent the last three days trying to find some solution or workaround and this is how far I got:
I use iperf on blackfin and PC to produce the problem. It is important to run iperf with the -d option for bidirectional test. Usually it takes just a second for the first checksum errors to occur in tcp_rcv_established(). I start iperf using the following syntax:
Blackfin: iperf -s
PC: iperf -c 192.168.1.5 -P 1 -i 1 -d -F /dev/zero -t 15
In tcp_rcv_established() (net/ipv4/tcp_input.c) I added the following three lines after the csum_error label to make the errors visible. Sometimes I also dump the entire packet there (see below for example).
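csum_error:;	/* empty statement so that a declaration may follow the label */
	static unsigned long u = 0;	/* running count of checksum failures */
	u++;
	printk(KERN_ALERT "CSUM error: %lu\t %p\n", u, skb->head);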
TCP checksum check fails because the received data contains data that does not belong there.
Sometimes the invalid data can be identified as header data from a packet that had been sent some time ago. I was also able to match the memory address: the sent packet had been located at the exact address where the error is now occurring in the incoming data.
I have inserted the following memset right before the call to blackfin_dcache_invalidate_range() in bfin_mac_rx() to see if the cache is being invalidated correctly. It turns out that the corrupted packets will now contain the data I introduced using memset.
memset( new_skb->head, 1, new_skb->end-new_skb->head );
The following listing shows an excerpt from a corrupted packet which should be all 0 after the tcp header at the top of the data but contains 32 bytes (one cache line?) of 1's which I introduced before invalidating the cache.
Packet Length=1480, Data StartAddr=00289034
84 33 13 89 e7 42 c2 23 9a b4 5d ff 80 10 00 2e
0a f0 00 00 01 01 08 0a 00 80 b8 2e ff fe f0 19
00 00 00 00 00 00 00 00 00 00 00 00 01 01 01 01
01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
01 01 01 01 01 01 01 01 01 01 01 01 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...
So apparently either the invalidate did not work or somewhere up the stack some read/write operation has read the old data into the cache before the dma updated the sdram.
Attempts to invalidate or disable the entire cache in case of an error using blackfin_invalidate_entire_dcache() always crash the system.
I would like to check that the invalidate operation (FLUSHINV) is working correctly by using DTEST_COMMAND to check the invalid flag of the affected memory. Setting the invalid flag directly may also be a quicker way to invalidate the cache than using FLUSHINV, which always (unnecessarily in our case) flushes the cache first.
Unfortunately I have no experience in writing assembler code and even after reading the programming reference a couple of times I feel unsure on how to actually proceed.
Can anyone help me write a function that, given a specific memory location, returns the valid flag for the associated cache line?
Do you have any other ideas on the cause of the problem? If you have problems to reproduce the issue I am sure I can assist.
Thanks! Jon
2010-05-06 10:21:49 Re: tcp drops packets due to checksum errors
Jon Kowal (GERMANY)
Message: 89186
Hi Robin,
I have checked the addresses of the skb and the error data. The skb->head is always 32-bit aligned and the errors can often be found at an offset of 96 bytes, which is also 32-bit aligned. There are also other offsets, but it is remarkable that the offset is always identical for the same skb->head address. To put it in other words: the errors don't go away when a new packet is received at the same memory location.
Also, as stated (in detail) in my reply to Sonic, there has always been an outgoing transmission using the exact same memory some time before the error occurs. Sending data only one-way does not produce the error, it is important to have bidirectional high-bandwidth data flow.
Thanks for your help!
Jon
2010-05-06 12:00:16 Re: tcp drops packets due to checksum errors
Robin Getz (UNITED STATES)
Message: 89190
Jon:
I'm assuming that if you go to /sys/devices/platform/bfin_mac.0/net/eth0/statistics/ - and run:
for i in $(ls); do echo -ne "$i\t"; cat $i; done
collisions 0
multicast 0
rx_bytes 1156653949
rx_compressed 0
rx_crc_errors 0
rx_dropped 0
rx_errors 0
rx_fifo_errors 0
rx_frame_errors 0
rx_length_errors 0
rx_missed_errors 0
rx_over_errors 0
rx_packets 11802151
tx_aborted_errors 0
tx_bytes 1156384913
tx_carrier_errors 0
tx_compressed 0
tx_dropped 0
tx_errors 0
tx_fifo_errors 0
tx_heartbeat_errors 0
tx_packets 11799851
tx_window_errors 0
You have a bunch of non zero errors?
-Robin
2010-05-07 05:19:13 Re: tcp drops packets due to checksum errors
Jon Kowal (GERMANY)
Message: 89214
No, there were no errors and from what I understand there shouldn't be. Maybe my explanation in the previous post was too brief. The errors we see are first recognized in the TCP checksum check. The lower layers don't see any errors which is why they don't show up in the eth0 statistics. It is not TCP introducing the error, though, because we can identify the erroneous data as being data from an old tx-packet which had been located at that same address.
Because we have the cache enabled, we assume the data stayed in the cache even though the external memory has been updated via DMA. The question is: why is that old data being read from the cache even though the cache should get invalidated right after the memory is allocated in bfin_mac_rx()? To answer that question I want to check the invalid flag of the cache line using DTEST_COMMAND, but I am not experienced in writing assembler and would love to get some help.
What we need is a function, that - given a memory location - returns the flags (invalid, dirty) of the corresponding cache line. Also nice would be a function that sets the invalid flag on the cache lines for a memory range. That would spare us from using FLUSHINV which is rather expensive.
If you think any of my assumptions are wrong, please let me know what you are thinking.
Thanks.
Jon
2010-05-07 14:14:55 Re: tcp drops packets due to checksum errors
Robin Getz (UNITED STATES)
Message: 89237
Jon:
Then I need a patch - not a description of changes to make - when I looked at things - net/ipv4/tcp_input.c:tcp_rcv_established()
csum_error:
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
should increase the error number.
2010-05-10 06:23:15 Re: tcp drops packets due to checksum errors
Jon Kowal (GERMANY)
Message: 89281
I have no idea where to find the errors posted by TCP_INC_STATS_BH(), but they don't show up in the device statistics. From what I see in the code it's some SNMP logging mechanism.
Honestly, I find communicating with you to be rather difficult. The two of us are separated by several hours in time, which makes it difficult to reply in time or wait for answers, because your answers usually arrive when I'm about to leave the office. Thus, every little question posted will take at least a day to be answered and that's a long time for me having to solve the problem. So instead of just posting one little question a day, it would help if you also shared your thoughts, and especially if you'd help me with one of the questions I posted or - if you think I'm digging in the wrong direction - tell me I should be looking some other way. Don't get this wrong, I am grateful for any help and really appreciate you spending your precious time on this, but I believe we should optimize that communication somehow to increase the amount of information transferred per day.
That said - from what you know so far, do you think my assumption that cache invalidation somehow failed makes sense? Do you have any other possibility in mind why the old content of a memory location should show up in a newly received packet? I thought checking the cache using DTEST_COMMAND would help to get a close look at what's going on. Do you agree with that?
Here's a svn diff of the changes to display the corrupted packets. I hope it helps. As described in another post above those will show up within a few seconds when running a bidirectional test with iperf.
$ svn diff net/ipv4/
Index: net/ipv4/tcp_input.c
===================================================================
--- net/ipv4/tcp_input.c (Revision 8124)
+++ net/ipv4/tcp_input.c (Arbeitskopie)
@@ -5001,7 +5001,27 @@
tcp_ack_snd_check(sk);
return 0;
-csum_error:
+csum_error:;
+/* DEBUG JON BEGIN */
+ int i;
+ static unsigned int u =0;
+ u++;
+ printk(KERN_ALERT "CSUM error: %u\t %p\n", u, skb->head );
+// printk(KERN_ALERT "CSUM error:\n Packet Length=%d, SKB Head: %p, Data StartAddr=%p\n", skb->len, skb->head, skb->data);
+ for(i=0;i<skb->len-16;i+=16)
+ {
+ printk(KERN_ALERT "%02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x\n",
+ skb->data[i+0],skb->data[i+1],skb->data[i+2],skb->data[i+3],skb->data[i+4],skb->data[i+5],
+ skb->data[i+6],skb->data[i+7],skb->data[i+8],skb->data[i+9],skb->data[i+10],skb->data[i+11],
+ skb->data[i+12],skb->data[i+13],skb->data[i+14],skb->data[i+15]);
+ }
+ for(;i<skb->len;i++)
+ {
+ printk(KERN_ALERT "%02x", skb->data[i]);
+ }
+ printk(KERN_ALERT "\n\n");
+/* DEBUG JON END */
+
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
discard:
2010-05-18 06:17:16 Re: tcp drops packets due to checksum errors
Sonic Zhang (CHINA)
Message: 89508
Jon, I can confirm what you discovered. With bidirectional ethernet traffic and WB cache enabled, the input buffer sometimes contains data from an older buffer which has already been freed. I have tried to watch for CPU writes to the input buffer during the RX DMA operation, but I can't find any stray CPU write access. It looks like the cache line is not really invalidated after the invalidate operation.
I haven't found a workaround yet. That's why this bug is still open.
2011-05-16 13:41:34 Re: tcp drops packets due to checksum errors
Stefan Wanja (GERMANY)
Message: 100680
Hello again,
I was trying to reproduce the problem with UDP and have not been successful so far: more than a day of iperf produced no checksum errors, as reported by a patch like the TCP one. It seems this problem doesn't occur when using UDP. I thought that might possibly be interesting...