2013-02-11 // Intel 82574L NICs: network hangs / ASPM Bug / e1000 driver
A few days ago, I ran into an ugly bug on different Scientific Linux 6.3 hosts (therefore this should also affect RHEL 6.3 and CentOS 6.3). The network hangs while the system itself is up, running and responsive. “Just” no network. Restarting the affected network interfaces is not enough, only a complete reboot brings the Intel 82574L-based network cards back to life (those NICs are onBoard on the Supermicro X9SCM-F and X8SIL mainboards of the affected hosts, so I can't simply change them). The logs showed entries like the following:
[...] Jan 24 09:52:35 host2 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted) Jan 24 09:52:35 host2 kernel: Hardware name: X9SCL/X9SCM Jan 24 09:52:35 host2 kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out Jan 24 09:52:35 host2 kernel: Modules linked in: fuse autofs4 sunrpc vboxpci(U) vboxnetadp(U) vboxnetflt(U) vboxdrv(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ext3 jbd uinput raid1 sg microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000e ext4 mbcache jbd2 fpu aesni_intel cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt raid10 sd_mod crc_t10dif ahci video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Jan 24 09:52:35 host2 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Jan 24 09:52:35 host2 kernel: Call Trace: Jan 24 09:52:35 host2 kernel: <IRQ> [<ffffffff8106a1e7>] ? warn_slowpath_common+0x87/0xc0 Jan 24 09:52:35 host2 kernel: [<ffffffff8101c0fa>] ? intel_pmu_enable_all+0xba/0x160 Jan 24 09:52:35 host2 kernel: [<ffffffff8106a2d6>] ? warn_slowpath_fmt+0x46/0x50 Jan 24 09:52:35 host2 kernel: [<ffffffff8144792d>] ? dev_watchdog+0x26d/0x280 Jan 24 09:52:35 host2 kernel: [<ffffffff814476c0>] ? dev_watchdog+0x0/0x280 Jan 24 09:52:35 host2 kernel: [<ffffffff8107d2c7>] ? run_timer_softirq+0x197/0x340 Jan 24 09:52:35 host2 kernel: [<ffffffff810a0910>] ? tick_sched_timer+0x0/0xc0 Jan 24 09:52:35 host2 kernel: [<ffffffff8102adad>] ? lapic_next_event+0x1d/0x30 Jan 24 09:52:35 host2 kernel: [<ffffffff81072991>] ? __do_softirq+0xc1/0x1e0 Jan 24 09:52:35 host2 kernel: [<ffffffff81095510>] ? hrtimer_interrupt+0x140/0x250 Jan 24 09:52:35 host2 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30 Jan 24 09:52:35 host2 kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0 Jan 24 09:52:35 host2 kernel: [<ffffffff81072775>] ? irq_exit+0x85/0x90 Jan 24 09:52:35 host2 kernel: [<ffffffff814f1fa0>] ? smp_apic_timer_interrupt+0x70/0x9b Jan 24 09:52:35 host2 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20 Jan 24 09:52:35 host2 kernel: <EOI> [<ffffffff812ec17e>] ? acpi_idle_enter_c1+0xa3/0xc1 Jan 24 09:52:35 host2 kernel: [<ffffffff812ec15d>] ? acpi_idle_enter_c1+0x82/0xc1 Jan 24 09:52:35 host2 kernel: [<ffffffff813f6c67>] ? cpuidle_idle_call+0xa7/0x140 Jan 24 09:52:35 host2 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110 Jan 24 09:52:35 host2 kernel: [<ffffffff814d109a>] ? rest_init+0x7a/0x80 Jan 24 09:52:35 host2 kernel: [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430 Jan 24 09:52:35 host2 kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129 Jan 24 09:52:35 host2 kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109 Jan 24 09:52:35 host2 kernel: ---[ end trace 1f3cc9d5dfc619c0 ]--- Jan 24 09:52:35 host2 kernel: e1000e 0000:02:00.0: eth1: Reset adapter [...]
After some googleing, I found a useful Bug-Report and a mailing list thread. Especially three postings are quite informative:
It seems that the ASPM of the Intel 82574L is broken. The corresponding Linux driver “e1000” therefore has this chip on its ASPM blacklists and disables it when the systems boots. However, there is some side effect which re-enabled the NIC'S ASPM state L1 after a network connection was established. This does not happen on all Linux flavors and kernel versions, but it happens at least on Scientific 6.3 with kernel 2.6.32-279.19.1.
Workaround: disable the NIC's ASPM after the system boots
A quick workaround is to manually disable the NIC'S ASPM after the system booted and the network “stabilized” (e.g. after a few minutes). The following command disables ASPM for a device:
setpci -s <ID-of-device> CAP_EXP+10.b=40
You can use lspci -vnn
to get the device ID (first number of the line, 02:00.0
in the following example output):
[root@host2 ~]# lspci -vnn | grep '82574' 02:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
Example: I used /etc/rc.local
to disable ASPM on the device with ID 02:00.0
, five minutes after the system boots by putting the following lines at the end of the file:
# workaround for Intel 82574L bug, see http://bit.ly/1565w6I for details printf '%s\n' 'setpci -s 02:00.0 CAP_EXP+10.b=40' | at now + 5min
Use lspci -vvvv -s <ID-of-device>
if you want to check if ASPM is really disabled (look for “LnkCtl: ASPM Disabled”):
[root@host2 ~]# lspci -vvvv -s 02:00.0 02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection [...] LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- [...]
I hope this helps someone else in some way.