// Intel 82574L NICs: network hangs / ASPM Bug / e1000 driver

A few days ago, I ran into an ugly bug on different Scientific Linux 6.3 hosts (therefore this should also affect RHEL 6.3 and CentOS 6.3). The network hangs while the system itself is up, running and responsive. “Just” no network. Restarting the affected network interfaces is not enough, only a complete reboot brings the Intel 82574L-based network cards back to life (those NICs are onBoard on the Supermicro X9SCM-F and X8SIL mainboards of the affected hosts, so I can't simply change them). The logs showed entries like the following:

[...]
Jan 24 09:52:35 host2 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Jan 24 09:52:35 host2 kernel: Hardware name: X9SCL/X9SCM
Jan 24 09:52:35 host2 kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
Jan 24 09:52:35 host2 kernel: Modules linked in: fuse autofs4 sunrpc vboxpci(U) vboxnetadp(U) vboxnetflt(U) vboxdrv(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ext3 jbd uinput raid1 sg microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000e ext4 mbcache jbd2 fpu aesni_intel cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt raid10 sd_mod crc_t10dif ahci video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jan 24 09:52:35 host2 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.19.1.el6.x86_64 #1
Jan 24 09:52:35 host2 kernel: Call Trace:
Jan 24 09:52:35 host2 kernel: <IRQ>  [<ffffffff8106a1e7>] ? warn_slowpath_common+0x87/0xc0
Jan 24 09:52:35 host2 kernel: [<ffffffff8101c0fa>] ? intel_pmu_enable_all+0xba/0x160
Jan 24 09:52:35 host2 kernel: [<ffffffff8106a2d6>] ? warn_slowpath_fmt+0x46/0x50
Jan 24 09:52:35 host2 kernel: [<ffffffff8144792d>] ? dev_watchdog+0x26d/0x280
Jan 24 09:52:35 host2 kernel: [<ffffffff814476c0>] ? dev_watchdog+0x0/0x280
Jan 24 09:52:35 host2 kernel: [<ffffffff8107d2c7>] ? run_timer_softirq+0x197/0x340
Jan 24 09:52:35 host2 kernel: [<ffffffff810a0910>] ? tick_sched_timer+0x0/0xc0
Jan 24 09:52:35 host2 kernel: [<ffffffff8102adad>] ? lapic_next_event+0x1d/0x30
Jan 24 09:52:35 host2 kernel: [<ffffffff81072991>] ? __do_softirq+0xc1/0x1e0
Jan 24 09:52:35 host2 kernel: [<ffffffff81095510>] ? hrtimer_interrupt+0x140/0x250
Jan 24 09:52:35 host2 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Jan 24 09:52:35 host2 kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Jan 24 09:52:35 host2 kernel: [<ffffffff81072775>] ? irq_exit+0x85/0x90
Jan 24 09:52:35 host2 kernel: [<ffffffff814f1fa0>] ? smp_apic_timer_interrupt+0x70/0x9b
Jan 24 09:52:35 host2 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Jan 24 09:52:35 host2 kernel: <EOI>  [<ffffffff812ec17e>] ? acpi_idle_enter_c1+0xa3/0xc1
Jan 24 09:52:35 host2 kernel: [<ffffffff812ec15d>] ? acpi_idle_enter_c1+0x82/0xc1
Jan 24 09:52:35 host2 kernel: [<ffffffff813f6c67>] ? cpuidle_idle_call+0xa7/0x140
Jan 24 09:52:35 host2 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Jan 24 09:52:35 host2 kernel: [<ffffffff814d109a>] ? rest_init+0x7a/0x80
Jan 24 09:52:35 host2 kernel: [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
Jan 24 09:52:35 host2 kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
Jan 24 09:52:35 host2 kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
Jan 24 09:52:35 host2 kernel: ---[ end trace 1f3cc9d5dfc619c0 ]---
Jan 24 09:52:35 host2 kernel: e1000e 0000:02:00.0: eth1: Reset adapter
[...]

After some googleing, I found a useful Bug-Report and a mailing list thread. Especially three postings are quite informative:

It seems that the ASPM of the Intel 82574L is broken. The corresponding Linux driver “e1000” therefore has this chip on its ASPM blacklists and disables it when the systems boots. However, there is some side effect which re-enabled the NIC'S ASPM state L1 after a network connection was established. This does not happen on all Linux flavors and kernel versions, but it happens at least on Scientific 6.3 with kernel 2.6.32-279.19.1.

Workaround: disable the NIC's ASPM after the system boots

A quick workaround is to manually disable the NIC'S ASPM after the system booted and the network “stabilized” (e.g. after a few minutes). The following command disables ASPM for a device:

setpci -s <ID-of-device> CAP_EXP+10.b=40

You can use lspci -vnn to get the device ID (first number of the line, 02:00.0 in the following example output):

[root@host2 ~]# lspci -vnn | grep '82574'
02:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]

Example: I used /etc/rc.local to disable ASPM on the device with ID 02:00.0, five minutes after the system boots by putting the following lines at the end of the file:

# workaround for Intel 82574L bug, see http://bit.ly/1565w6I for details
printf '%s\n' 'setpci -s 02:00.0 CAP_EXP+10.b=40' | at now + 5min

Use lspci -vvvv -s <ID-of-device> if you want to check if ASPM is really disabled (look for “LnkCtl: ASPM Disabled”):

[root@host2 ~]# lspci -vvvv -s 02:00.0
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
[...]
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
[...]

I hope this helps someone else in some way. :-)

I'm no native speaker (English)
Please let me know if you find any errors (I want to improve my English skills). Thank you!
QR Code: URL of current page
QR Code: URL of current page start (generated for current page)