Configuration
Problem and emergency recovery procedure
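With Open vSwitch running on the netdev datapath and TSO (TCP Segmentation Offload) enabled, I confirmed that putting the host under load (a bandwidth measurement) makes the interfaces flap until connectivity is lost. (The syslog is included at the end of this post.)
Reloading the kernel module clears the problem temporarily. However, the affected NIC is flapping and unusable at that point, so unless you can reach the console out-of-band (OOB), you end up rebooting the host anyway.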
systemctl stop networking.service
rmmod ixgbe
modprobe ixgbe
systemctl start networking.service
(from: ixgbe driver hang up | Detected Tx Unit Hang Tx Queue | Proxmox Support Forum)
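The steps above could be tied to the symptom with a small helper. The following is only a sketch under my own assumptions: the function names `has_tx_hang` and `reload_steps` are hypothetical, and the steps are printed rather than executed so that they can be reviewed first.

```shell
#!/bin/sh
# Return success if the log text on stdin contains the ixgbe hang message.
has_tx_hang() {
    grep -q 'Detected Tx Unit Hang'
}

# Print the reload steps from this post for a given driver module.
# They are printed, not executed, so the output can be reviewed first.
reload_steps() {
    driver=$1
    printf '%s\n' \
        'systemctl stop networking.service' \
        "rmmod $driver" \
        "modprobe $driver" \
        'systemctl start networking.service'
}

# Example with one line taken from the syslog in this post.
sample='ixgbe 0000:03:00.1 enp3s0f1: Detected Tx Unit Hang'
if printf '%s\n' "$sample" | has_tx_hang; then
    reload_steps ixgbe
fi
```

On a real host the input would presumably come from `dmesg` instead of the sample line, and the printed steps would have to be run as root from a local or OOB console, since the network is down while they run.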
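Similar reports can be found here and there on the Internet. It appears to be a problem with this NIC/driver: when offload features on the NIC are enabled, it goes down under high load.
(The sources disagree on which offload feature has to be disabled for the workaround to be effective.)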
- Proxmox Forum
- RedHat Knowledge base
  - A comment says the problem can be avoided by disabling the Scatter-Gather offload engine
  - ethtool -K <interface> sg off
  - A RHEL subscription may be required to read it
- Intel Forum
  - Issue with "Detected Tx Unit Hang" dropping network connections - Intel Community
  - Some say disabling LRO (Large Receive Offload) solves it, others say it does not
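Since the sources name different offloads, it may be convenient to build the `ethtool -K` command for several features at once. This is a minimal sketch, not a verified fix: the helper name `offload_off_cmd` is hypothetical, the interface name is taken from the syslog in this post, and the feature list (`tso`, `sg`, `lro`) is simply the set mentioned above.

```shell
#!/bin/sh
# Build the ethtool command that turns off the given offload features.
# The command is printed instead of executed so it can be reviewed first.
offload_off_cmd() {
    iface=$1
    shift
    cmd="ethtool -K $iface"
    for feature in "$@"; do
        cmd="$cmd $feature off"
    done
    printf '%s\n' "$cmd"
}

# Example: print the command for the three offloads mentioned above.
offload_off_cmd enp3s0f0 tso sg lro
```

The current state of each feature can be checked with `ethtool -k <interface>`. Settings changed with `ethtool -K` do not survive a reboot, so they would need to be reapplied from the network configuration if they turn out to help.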
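Addendum
After I swapped the card for an Intel X540-T2, the problem no longer occurred. It seems that either something in this setup is incompatible with the Intel X550-T2, or the X550-T2 hardware/firmware itself has a problem.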
syslog
Jan 30 22:37:08 olive kernel: [172925.881034] ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Up 10 Gbps, Flow Control: None
Jan 30 22:37:08 olive NetworkManager[704]: <info> [1738244228.3918] device (enp3s0f0): carrier: link connected
Jan 30 22:37:08 olive systemd-networkd[659]: enp3s0f0: Gained carrier
Jan 30 22:37:08 olive avahi-daemon[700]: Joining mDNS multicast group on interface enp3s0f0.IPv4 with address 192.168.0.52.
Jan 30 22:37:08 olive systemd-timesyncd[662]: Network configuration changed, trying to establish connection.
Jan 30 22:37:08 olive avahi-daemon[700]: New relevant interface enp3s0f0.IPv4 for mDNS.
Jan 30 22:37:08 olive avahi-daemon[700]: Registering new address record for 192.168.0.52 on enp3s0f0.IPv4.
Jan 30 22:37:08 olive kernel: [172925.965230] ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Down
Jan 30 22:37:08 olive kernel: [172925.965468] ixgbe 0000:03:00.0 enp3s0f0: initiating reset to clear Tx work after link loss
Jan 30 22:37:08 olive kernel: [172926.252743] ixgbe 0000:03:00.1 enp3s0f1: Detected Tx Unit Hang
Jan 30 22:37:08 olive kernel: [172926.252743] Tx Queue <14>
Jan 30 22:37:08 olive kernel: [172926.252743] TDH, TDT <0>, <2>
Jan 30 22:37:08 olive kernel: [172926.252743] next_to_use <2>
Jan 30 22:37:08 olive kernel: [172926.252743] next_to_clean <0>
Jan 30 22:37:08 olive kernel: [172926.252743] tx_buffer_info[next_to_clean]
Jan 30 22:37:08 olive kernel: [172926.252743] time_stamp <10a4a1d41>
Jan 30 22:37:08 olive kernel: [172926.252743] jiffies <10a4a2ac0>
Jan 30 22:37:08 olive kernel: [172926.252774] ixgbe 0000:03:00.1 enp3s0f1: tx hang 57 detected on queue 14, resetting adapter
Jan 30 22:37:08 olive kernel: [172926.252776] ixgbe 0000:03:00.1 enp3s0f1: initiating reset due to tx timeout
Jan 30 22:37:08 olive kernel: [172926.252787] ixgbe 0000:03:00.1 enp3s0f1: Detected Tx Unit Hang
Jan 30 22:37:08 olive kernel: [172926.252787] Tx Queue <5>
Jan 30 22:37:08 olive kernel: [172926.252787] TDH, TDT <0>, <1>
Jan 30 22:37:08 olive kernel: [172926.252787] next_to_use <1>
Jan 30 22:37:08 olive kernel: [172926.252787] next_to_clean <0>
Jan 30 22:37:08 olive kernel: [172926.252787] tx_buffer_info[next_to_clean]
Jan 30 22:37:08 olive kernel: [172926.252787] time_stamp <10a4a1f57>
Jan 30 22:37:08 olive kernel: [172926.252787] jiffies <10a4a2ac0>
Jan 30 22:37:08 olive kernel: [172926.252788] ixgbe 0000:03:00.1 enp3s0f1: Reset adapter
Jan 30 22:37:08 olive kernel: [172926.252789] ixgbe 0000:03:00.1 enp3s0f1: tx hang 57 detected on queue 5, resetting adapter
Jan 30 22:37:08 olive kernel: [172926.252790] ixgbe 0000:03:00.1 enp3s0f1: initiating reset due to tx timeout
Jan 30 22:37:08 olive kernel: [172926.299309] ixgbe 0000:03:00.1 enp3s0f1: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jan 30 22:37:09 olive systemd-networkd[659]: enp3s0f0: Lost carrier
Jan 30 22:37:09 olive systemd-networkd[659]: enp3s0f0: DHCPv6 lease lost
Jan 30 22:37:09 olive kernel: [172927.044510] ixgbe 0000:03:00.0 enp3s0f0: Reset adapter
Jan 30 22:37:09 olive kernel: [172927.091385] ixgbe 0000:03:00.0 enp3s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jan 30 22:37:10 olive avahi-daemon[700]: Withdrawing address record for 192.168.0.52 on enp3s0f0.
Jan 30 22:37:10 olive avahi-daemon[700]: Leaving mDNS multicast group on interface enp3s0f0.IPv4 with address 192.168.0.52.
Jan 30 22:37:10 olive systemd-networkd[659]: enp3s0f1: Lost carrier
Jan 30 22:37:10 olive systemd-timesyncd[662]: Network configuration changed, trying to establish connection.
Jan 30 22:37:10 olive avahi-daemon[700]: Interface enp3s0f0.IPv4 no longer relevant for mDNS.
Jan 30 22:37:10 olive systemd-networkd[659]: enp3s0f1: DHCPv6 lease lost
Jan 30 22:37:15 olive kernel: [172932.683954] ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Up 10 Gbps, Flow Control: None
Jan 30 22:37:15 olive NetworkManager[704]: <info> [1738244235.1950] device (enp3s0f0): carrier: link connected
Jan 30 22:37:15 olive systemd-networkd[659]: enp3s0f0: Gained carrier
Jan 30 22:37:15 olive systemd-timesyncd[662]: Network configuration changed, trying to establish connection.
Jan 30 22:37:15 olive avahi-daemon[700]: Joining mDNS multicast group on interface enp3s0f0.IPv4 with address 192.168.0.52.
Jan 30 22:37:15 olive avahi-daemon[700]: New relevant interface enp3s0f0.IPv4 for mDNS.
Jan 30 22:37:15 olive avahi-daemon[700]: Registering new address record for 192.168.0.52 on enp3s0f0.IPv4.
Jan 30 22:37:15 olive kernel: [172932.795930] ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Down
Jan 30 22:37:15 olive kernel: [172932.796167] ixgbe 0000:03:00.0 enp3s0f0: initiating reset to clear Tx work after link loss
Jan 30 22:37:16 olive systemd-networkd[659]: enp3s0f0: Lost carrier
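
To see which interfaces and Tx queues hung, a log like the one above can be reduced with a small filter. This is a sketch that assumes the message format shown in this syslog; the function name `hang_queues` is hypothetical.

```shell
#!/bin/sh
# Print "<interface> <queue>" for every
# "tx hang N detected on queue Q" line read from stdin.
hang_queues() {
    sed -n 's/.* \([^ ]*\): tx hang [0-9]* detected on queue \([0-9]*\),.*/\1 \2/p'
}

# Example with two lines copied from the syslog above.
hang_queues <<'EOF'
Jan 30 22:37:08 olive kernel: [172926.252774] ixgbe 0000:03:00.1 enp3s0f1: tx hang 57 detected on queue 14, resetting adapter
Jan 30 22:37:08 olive kernel: [172926.252789] ixgbe 0000:03:00.1 enp3s0f1: tx hang 57 detected on queue 5, resetting adapter
EOF
```

Piping the result through `sort | uniq -c` gives a per-queue count, which may help when comparing runs with different offloads disabled.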