Discussion:
[casper] 10 GbE Network Slowdown with Ubuntu 18.04
Gary, Dale E.
2018-09-22 11:56:39 UTC
Hi All,

We are running a multi-core (32-core) system at Owens Valley that has a
dual-port Myricom 10 GbE NIC. We ran the system very successfully under
Ubuntu 12.04 for more than 1 year, but after upgrading to Ubuntu 18.04
(generic) we are now experiencing reliability problems, despite the tuning
parameters and smp_affinity adjustments being (as far as we can tell) the
same. The problem seems to be somehow associated with system load and
packet handling rather than receipt of the packets by the interface, since
things run fine for up to 10 minutes, then start to deteriorate. In
researching this, I see various other flavors of Ubuntu (low-latency,
realtime, rt, preempt) that make kernel adjustments that might help, but I
am not able to tell from the descriptions which if any of these might
address the problem. Has anyone had a similar experience, and/or have
advice about what options we might have? I am using the myri10ge driver
that came with Ubuntu 18.04.

One thing I might mention is that I ran this script:
https://github.com/majek/dump/blob/master/how-to-receive-a-packet/softnet.sh,
and find a certain number of "squeezed" packets, which are "# of times
ksoftirq ran out of netdev_budget or time slice with work remaining." I
don't know if this is something to worry about? The output of softnet.sh
is like this. Note we had the NIC assigned to cpus 1 and 2, but changed to
30 and 31.

***@dpp:~$ ./softnet.sh
cpu      total dropped   squeezed  collision        rps flow_limit
  0    1328082       0       3729          0          0          0
  1 1716559544       0    7208929          0          0          0
  2 1793125842       0    8158475          0          0          0
  3    1069150       0       3714          0          0          0
  4    1400569       0       5443          0          0          0
  5    6988379       0       5985          0          0          0
  6    6466640       0       5950          0          0          0
  7    1070366       0       4097          0          0          0
  8     878808       0       3906          0          0          0
  9     933541       0       4207          0          0          0
 10       1229       0          4          0          0          0
 11        848       0          0          0          0          0
 12       1310       0          5          0          0          0
 13        662       0          0          0          0          0
 14       1304       0          2          0          0          0
 15        680       0          3          0          0          0
 16       1817       0          2          0          0          0
 17        648       0          3          0          0          0
 18        742       0          2          0          0          0
 19        605       0          2          0          0          0
 20        690       0          2          0          0          0
 21        536       0          3          0          0          0
 22        860       0          0          0          0          0
 23        493       0          3          0          0          0
 24       1657       0          4          0          0          0
 25    9244642       0       1487          0          0          0
 26        912       0          2          0          0          0
 27        287       0          0          0          0          0
 28    5252171       0        877          0          0          0
 29        339       0          3          0          0          0
 30 3378532079       0   17299324          0          0          0
 31 3390959304       0   16129528          0          0          0
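
For anyone who wants to watch these counters without the script, here is a
minimal Python sketch of reading /proc/net/softnet_stat directly (the column
layout assumed here is the one softnet.sh uses: hex fields per cpu, with
field 1 = packets processed, 2 = dropped, 3 = squeezed):

    # Read /proc/net/softnet_stat (one row of hex fields per cpu) and
    # report any cpu with a nonzero "squeezed" (time_squeeze) count.
    def read_softnet():
        rows = []
        with open('/proc/net/softnet_stat') as f:
            for cpu, line in enumerate(f):
                fields = [int(x, 16) for x in line.split()]
                rows.append((cpu, fields[0], fields[1], fields[2]))
        return rows

    for cpu, total, dropped, squeezed in read_softnet():
        if squeezed:
            print('cpu %2d: total %10d  dropped %d  squeezed %d'
                  % (cpu, total, dropped, squeezed))

    # If the squeezes turn out to matter, net.core.netdev_budget is
    # tunable, e.g. sysctl -w net.core.netdev_budget=600 (default 300).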

Thanks,
Dale
Jean Borsenberger
2018-10-03 08:25:46 UTC
First, sorry for the delay; I was away for a while.

We use Debian rather than Ubuntu, but the two distributions are in fact
flavours of the same thing.

On each machine we handle a UDP download link from a ROACH2. The ROACH2
does nothing but add an 8-byte counter to each 8K data block, so we can
measure the packet loss rate precisely.

First, notice that the driver writers for 10 GbE NICs found it wise to
split the interrupts across six or seven IRQ numbers; why, I do not have
the slightest idea. Then comes the worse part. We run at 1.1 Gsamples/sec,
very close to the link capacity. With the standard setup the loss may rise
to 5%, and is typically around 1%. I suspect a cache problem: on our
8 (real) core system the IRQs can be split across every core, but each core
then has to be aware of what is currently being done by the others, and
that coherence traffic may take some cycles. Using that guess, I assigned
all IRQs of a given interface to a single core (/proc/irq/xx/smp_affinity).
Concurrently I removed everything else from that core (smp_affinity and
taskset). It worked: the loss is now around 10^-6, which we find acceptable.

One further nuisance is irqbalance, which takes control of your IRQ
affinity settings:

aptitude remove irqbalance

Removing it is harmless.


You may also wish to get rid of systemd, which takes cycles for a
questionable purpose, though that step is riskier. Anyhow, we took this
option; systemd gets worse at each OS release.


Jean Borsenberger
Gary, Dale E.
2018-10-03 21:22:44 UTC
Hi All,

I thought I would send an update on this problem, which still persists.
Jonathan's suggestion did not seem to work, since each ethernet interface
does not send packets to multiple processors: if I specify two cpus in the
smp_affinity files, the board sends to only one of them. Also, I removed
irqbalance as Jean suggested, but that had no effect.

I wrote a python script to read and plot the number of packets handled by
each interface from /proc/net/softnet_stat once per second, and then
started two packet-reading processes on different cpus. The attached file
is a good example of what I find. The packet-readers run normally until
about 335 s in, and then the number of packets on both interfaces suddenly
drops by about 30,000, and the packet readers dutifully complain that they
are getting too few packets per accumulation. At about 360 s, I killed one
of the packet-reader processes, and the number of packets on the interface
it was reading jumps immediately up to normal. It is interesting that the
*other* interface also shows more packets arriving, but not up to normal.
After killing the second process, all is well again. When this process is
repeated, the timing of the failures changes, but seems always to be longer
than 5 minutes--I'm not sure I ever saw a failure within the first 5
minutes.
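
The core of that monitoring script is essentially the following (a
simplified sketch; the plotting is omitted, and the cpu numbers are the
ones our NICs are currently pinned to):

    import time

    # Sample the per-cpu "total" counter from /proc/net/softnet_stat once
    # per second and print the packet rate on the cpus handling the NICs.
    def totals():
        with open('/proc/net/softnet_stat') as f:
            return [int(line.split()[0], 16) for line in f]

    watch = (30, 31)        # cpus the two interfaces are pinned to
    prev = totals()
    while True:
        time.sleep(1.0)
        cur = totals()
        print('  '.join('cpu%d: %d pkt/s' % (c, cur[c] - prev[c])
                        for c in watch))
        prev = cur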

This seems to confirm that

1. The interfaces are running fine, and it is the act of reading them
that somehow is associated with the problem
2. The failure is sudden, and leads to a lower but stable number of
packets being handled by the interface.
3. The failure is usually, but not always, on both interfaces at the
same time.
4. Killing the process brings the packets back, without resetting
anything else.
5. The probability of failure seems to be near 0 within the first 5
minutes, and near 1 by 10 minutes, yet the timing of the glitch is quite
random between those limits.

Note that we have plenty of resources (top shows cpu idle time on the
packet-reading processes is near 50%, and on the packet-handling processes
is 75%, with 25% si). Memory usage is also minuscule compared to the 65 GB
available.

So far, that is all I have. The myricom folks (CSPi) have opened a ticket,
but so far have not had any suggestions. They did say that they were about
to embark on some tests of Ubuntu 18.04 compatibility, so perhaps they will
find something. Meanwhile, we have no solution.

Regards,
Dale
David MacMahon
2018-10-04 01:39:56 UTC
Hi, Dale,

Is this a multi-socket system? If so, are you using "numactl" or "taskset" to bind the packet reading processes to CPU(s) on the same socket that the NIC is connected to? Are you sure you are sending the NIC interrupts to CPU(s) on the socket that the NIC is connected to?

FWIW, the Hashpipe program includes a script (hashpipe_topology.sh) that will summarize the NUMA topology of a system vis-à-vis network cards and/or GPUs.

https://github.com/david-macmahon/hashpipe/blob/master/src/hashpipe_topology.sh
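
For a quick one-off check without the script: sysfs reports which NUMA node
a PCI NIC hangs off, and the binding can be done from Python too. A sketch,
with the interface name as an example:

    import os

    iface = 'eth2'          # example interface name
    with open('/sys/class/net/%s/device/numa_node' % iface) as f:
        node = int(f.read())
    if node < 0:
        node = 0            # platform did not report locality
    with open('/sys/devices/system/node/node%d/cpulist' % node) as f:
        cpulist = f.read().strip()      # e.g. "0-7"

    # Expand "0-7,16-23"-style ranges and bind this process to those cpus
    # (same effect as taskset/numactl for the current PID).
    cpus = set()
    for part in cpulist.split(','):
        lo, _, hi = part.partition('-')
        cpus.update(range(int(lo), int(hi or lo) + 1))
    os.sched_setaffinity(0, cpus)
    print('%s is on node %d (cpus %s)' % (iface, node, cpulist))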

HTH,
Dave
Dale Gary
2018-10-04 02:12:18 UTC
Hi Dave,

When you say multi-socket, do you mean multi-processor? There are two 16-core AMD Opteron processors. We are using taskset, and have tried every permutation we could think of. I’ll check out the script.

Thanks,
Dale

Sent from my iPhone
Gary, Dale E.
2018-10-08 19:02:39 UTC
Hi Dave,

I just had some time to investigate your comment and to run the script you
linked to, and indeed there may be some problem here. The output of the
script is shown below, which seems to indicate that all of the NICs are
connected to cpus 0-7 (socket 0). We steered the interrupts (47 and 48) to
cpus 15 and 31 as a test, although we were using 30 and 31 in an earlier
test. However, I just tried assigning the NICs to cpus 14 and 15, and the
packet reading to cpus 12 and 13, and the problem is unchanged. Does this
setup satisfy your expectation as the correct one?

Thanks,
Dale

Sockets/cores to CPUs:
socket 0, core 0 -> cpu 0
socket 0, core 0 -> cpu 8
socket 0, core 1 -> cpu 1
socket 0, core 1 -> cpu 9
socket 0, core 2 -> cpu 2
socket 0, core 2 -> cpu 10
socket 0, core 3 -> cpu 3
socket 0, core 3 -> cpu 11
socket 0, core 4 -> cpu 4
socket 0, core 4 -> cpu 12
socket 0, core 5 -> cpu 5
socket 0, core 5 -> cpu 13
socket 0, core 6 -> cpu 6
socket 0, core 6 -> cpu 14
socket 0, core 7 -> cpu 7
socket 0, core 7 -> cpu 15
socket 1, core 0 -> cpu 16
socket 1, core 0 -> cpu 24
socket 1, core 1 -> cpu 17
socket 1, core 1 -> cpu 25
socket 1, core 2 -> cpu 18
socket 1, core 2 -> cpu 26
socket 1, core 3 -> cpu 19
socket 1, core 3 -> cpu 27
socket 1, core 4 -> cpu 20
socket 1, core 4 -> cpu 28
socket 1, core 5 -> cpu 21
socket 1, core 5 -> cpu 29
socket 1, core 6 -> cpu 22
socket 1, core 6 -> cpu 30
socket 1, core 7 -> cpu 23
socket 1, core 7 -> cpu 31

Ethernet interfaces to CPUs:
eth0: 0-7
eth1: 0-7
eth2: 0-7
eth3: 0-7
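
(For reference, in case anyone repeats this: the hex masks we write into
the smp_affinity files are just cpu bitmasks, computed like so.)

    # Convert a list of cpu numbers to the hex bitmask that the
    # /proc/irq/<n>/smp_affinity files expect.
    def cpus_to_mask(cpus):
        mask = 0
        for c in cpus:
            mask |= 1 << c
        return '%x' % mask

    print(cpus_to_mask([14, 15]))   # -> c000
    print(cpus_to_mask([15, 31]))   # -> 80008000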