What's the issue?
At home I have a NetGear GS108T with a NetGear R7000 serving as an Access Point. My Ubuntu 16.04 machine which functions as a hypervisor (qemu/kvm and openvswitch) has an "adapter reset"/"transmit queue timed out" error on igb and e1000e Intel NICs and causes a 30-60 second outage when my NetGear R7000 Access Point restarts. Kernel logs show the following:
Jun 29 16:31:26 kvm kernel: [ 145.888552] ------------[ cut here ]------------
Jun 29 16:31:26 kvm kernel: [ 145.888557] NETDEV WATCHDOG: p6p1 (igb): transmit queue 0 timed out
Jun 29 16:31:26 kvm kernel: [ 145.888595] WARNING: CPU: 6 PID: 0 at /build/linux-hwe-0EwvTm/linux-hwe-4.15.0/net/sched/sch_generic.c:323 dev_watchdog+0x222/0x230
Jun 29 16:31:26 kvm kernel: [ 145.888597] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio ebtable_filter ebtables xt_multiport openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 binfmt_misc ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ppdev intel_rapl nf_conntrack_ipv6 nf_defrag_ipv6 x86_pkg_temp_thermal intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipt_REJECT nf_reject_ipv4 intel_cstate usblp xt_limit xt_tcpudp ipmi_si ipmi_devintf xt_addrtype intel_rapl_perf ipmi_msghandler lpc_ich parport_pc mei_me mei pcbc shpchp ie31200_edac mac_hid aesni_intel crypto_simd glue_helper cryptd aes_x86_64 dm_crypt vhost_net vhost tap kvm_intel kvm irqbypass nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack
Jun 29 16:31:26 kvm kernel: [ 145.888689] libcrc32c iptable_filter ip_tables x_tables sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid coretemp lp parport autofs4 btrfs xor zstd_compress raid6_pq raid1 raid10 ast ttm drm_kms_helper syscopyarea sysfillrect igb sysimgblt fb_sys_fops hid_generic dca i2c_algo_bit usbhid ahci ptp libahci drm mpt3sas pps_core hid raid_class scsi_transport_sas video
Jun 29 16:31:26 kvm kernel: [ 145.888761] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.15.0-54-generic #58~16.04.1-Ubuntu
Jun 29 16:31:26 kvm kernel: [ 145.888763] Hardware name: ASUSTeK COMPUTER INC. P9D-C Series/P9D-C Series, BIOS 1301 01/26/2015
Jun 29 16:31:26 kvm kernel: [ 145.888770] RIP: 0010:dev_watchdog+0x222/0x230
Jun 29 16:31:26 kvm kernel: [ 145.888773] RSP: 0018:ffff8a72efd83e68 EFLAGS: 00010282
Jun 29 16:31:26 kvm kernel: [ 145.888777] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
Jun 29 16:31:26 kvm kernel: [ 145.888779] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff8a72efd96490
Jun 29 16:31:26 kvm kernel: [ 145.888782] RBP: ffff8a72efd83e98 R08: 0000000000000001 R09: 000000000000049c
Jun 29 16:31:26 kvm kernel: [ 145.888785] R10: 0000000000000000 R11: 000000000000049c R12: 0000000000000008
Jun 29 16:31:26 kvm kernel: [ 145.888787] R13: ffff8a72bda80000 R14: ffff8a72bda80478 R15: ffff8a72bda79940
Jun 29 16:31:26 kvm kernel: [ 145.888791] FS: 0000000000000000(0000) GS:ffff8a72efd80000(0000) knlGS:0000000000000000
Jun 29 16:31:26 kvm kernel: [ 145.888794] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 29 16:31:26 kvm kernel: [ 145.888797] CR2: 00007fa1302dbf48 CR3: 00000001f120a002 CR4: 00000000001626e0
Jun 29 16:31:26 kvm kernel: [ 145.888800] Call Trace:
Jun 29 16:31:26 kvm kernel: [ 145.888803] <IRQ>
Jun 29 16:31:26 kvm kernel: [ 145.888812] ? dev_deactivate_queue.constprop.33+0x60/0x60
Jun 29 16:31:26 kvm kernel: [ 145.888819] call_timer_fn+0x32/0x140
Jun 29 16:31:26 kvm kernel: [ 145.888824] run_timer_softirq+0x1ed/0x440
Jun 29 16:31:26 kvm kernel: [ 145.888830] ? ktime_get+0x3e/0xa0
Jun 29 16:31:26 kvm kernel: [ 145.888838] ? lapic_next_deadline+0x26/0x30
Jun 29 16:31:26 kvm kernel: [ 145.888846] __do_softirq+0xf5/0x28f
Jun 29 16:31:26 kvm kernel: [ 145.888856] irq_exit+0xb8/0xc0
Jun 29 16:31:26 kvm kernel: [ 145.888861] smp_apic_timer_interrupt+0x79/0x140
Jun 29 16:31:26 kvm kernel: [ 145.888866] apic_timer_interrupt+0x84/0x90
Jun 29 16:31:26 kvm kernel: [ 145.888868] </IRQ>
Jun 29 16:31:26 kvm kernel: [ 145.888879] RIP: 0010:cpuidle_enter_state+0xa7/0x300
Jun 29 16:31:26 kvm kernel: [ 145.888881] RSP: 0018:ffffac58031cfe60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
Jun 29 16:31:26 kvm kernel: [ 145.888886] RAX: ffff8a72efda2840 RBX: 0000000000000005 RCX: 000000000000001f
Jun 29 16:31:26 kvm kernel: [ 145.888888] RDX: 0000000000000000 RSI: 0000000025bbf79e RDI: 0000000000000000
Jun 29 16:31:26 kvm kernel: [ 145.888890] RBP: ffffac58031cfe98 R08: ffff8a72efda16a4 R09: 0000000000000018
Jun 29 16:31:26 kvm kernel: [ 145.888893] R10: ffffac58031cfe30 R11: 0000000000000391 R12: 0000000000000005
Jun 29 16:31:26 kvm kernel: [ 145.888895] R13: ffff8a72efdacf00 R14: ffffffff9f371e38 R15: 00000021f77f87f8
Jun 29 16:31:26 kvm kernel: [ 145.888905] cpuidle_enter+0x17/0x20
Jun 29 16:31:26 kvm kernel: [ 145.888914] call_cpuidle+0x23/0x40
Jun 29 16:31:26 kvm kernel: [ 145.888920] do_idle+0x197/0x200
Jun 29 16:31:26 kvm kernel: [ 145.888926] cpu_startup_entry+0x73/0x80
Jun 29 16:31:26 kvm kernel: [ 145.888931] start_secondary+0x1ab/0x200
Jun 29 16:31:26 kvm kernel: [ 145.888939] secondary_startup_64+0xa5/0xb0
Jun 29 16:31:26 kvm kernel: [ 145.888942] Code: 37 00 49 63 4e e8 eb 92 4c 89 ef c6 05 a8 63 d8 00 01 e8 52 33 fd ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 c0 1e fa 9e e8 3e 3c 80 ff <0f> 0b eb c0 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
Jun 29 16:31:26 kvm kernel: [ 145.889026] ---[ end trace 87c230fe7d27f115 ]---
Jun 29 16:31:26 kvm kernel: [ 145.889079] igb 0000:0a:00.0 p6p1: Reset adapter
Jun 29 16:31:30 kvm kernel: [ 149.681151] igb 0000:0a:00.0 p6p1: igb: p6p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jun 29 16:31:40 kvm kernel: [ 159.964671] igb 0000:0a:00.0 p6p1: Reset adapter
Jun 29 16:31:44 kvm kernel: [ 163.669265] igb 0000:0a:00.0 p6p1: igb: p6p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
This only happened on my Linux hypervisor, none of the other devices on my network exhibited this behaviour.
What troubleshooting steps did I take?
- Different network cables
- Using a newer kernel (4.15, I was on 4.4 to start)
- Building igb drivers from Intel's website via DKMS and loading them
- Using a different network card in a PCI express slot (e1000e) vs. onboard (igb)
- Trying all slots on the NIC
- Updating the BIOS on the ASUS server mainboard (P9D-C) to the latest available
- Turning off TSO/GSO offloading on the NIC via ethtool
None of these issues resolved the error I was encountering. If I restarted the R7000, it caused the adapter to time out every time.
How did I solve it?
Disabling RX/TX Flow control on the switch caused the issue to stop happening.
Since I did not want to disable RX/TX Flow control globally, I implemented the fix by disabling auto negotiate and setting RX/TX flow control off (aka. "PAUSE" Ethernet frames) using interface(8) and ethtool on Ubuntu:
pre-up /sbin/ethtool -s $IFACE autoneg off speed 1000 duplex full
pre-up /sbin/ethtool -A $IFACE autoneg off rx off
Now my R7000 can do whatever it likes, including randomly restarting and my network connectivity continues to function on my hypervisor.