Dataset1001
See Dumps/Dump servers for the overview of all dump server hardware.
Notes on Dataset1001 deployment:
Contents |
During rsync testing with Kevin at your.org
Processes running:
- cron job with bwlimit 10000 to get latest data written to 2012 dumps from dataset2 via rsync
- background job with bwlimit 40000 to get 2011 dumps from dataset2 via rsync
- rsync daemon with bwlimit 50000 which your.org connects to
Note that some files transfered are quite large, several GB.
Feb 7 2012
Saw stack traces in /var/log/messages using kernel 2.6.32-38-server (Ubuntu 10.04):
Feb 7 07:46:49 dataset1001 kernel: [384047.468170] swapper: page allocation failure. order:0, mode:0x4020 Feb 7 07:46:49 dataset1001 kernel: [384047.468176] Pid: 0, comm: swapper Not tainted 2.6.32-38-server #83-Ubuntu Feb 7 07:46:49 dataset1001 kernel: [384047.468179] Call Trace: Feb 7 07:46:49 dataset1001 kernel: [384047.468181] <IRQ> [<ffffffff810fca37>] __alloc_pages_slowpath+0x4a7/0x590 Feb 7 07:46:49 dataset1001 kernel: [384047.468195] [<ffffffff810fcc99>] __alloc_pages_nodemask+0x179/0x180 Feb 7 07:46:49 dataset1001 kernel: [384047.468202] [<ffffffff8112fe57>] alloc_pages_current+0x87/0xd0 Feb 7 07:46:49 dataset1001 kernel: [384047.468206] [<ffffffff81135e97>] new_slab+0x2f7/0x310 Feb 7 07:46:49 dataset1001 kernel: [384047.468210] [<ffffffff81138771>] __slab_alloc+0x201/0x2d0 Feb 7 07:46:49 dataset1001 kernel: [384047.468216] [<ffffffff8146e376>] ? __netdev_alloc_skb+0x36/0x60 Feb 7 07:46:49 dataset1001 kernel: [384047.468221] [<ffffffff8113956f>] __kmalloc_node_track_caller+0xaf/0x160 Feb 7 07:46:49 dataset1001 kernel: [384047.468225] [<ffffffff8146e376>] ? __netdev_alloc_skb+0x36/0x60 Feb 7 07:46:49 dataset1001 kernel: [384047.468229] [<ffffffff8146db90>] __alloc_skb+0x80/0x190 Feb 7 07:46:49 dataset1001 kernel: [384047.468233] [<ffffffff8146e376>] __netdev_alloc_skb+0x36/0x60 Feb 7 07:46:49 dataset1001 kernel: [384047.468243] [<ffffffffa00a6303>] bnx2_rx_skb+0x63/0x6e0 [bnx2] Feb 7 07:46:49 dataset1001 kernel: [384047.468249] [<ffffffffa00a6e1c>] bnx2_rx_int+0x49c/0x760 [bnx2] Feb 7 07:46:49 dataset1001 kernel: [384047.468255] [<ffffffff810200c6>] ? show_shared_cpu_map_func+0x66/0x80 Feb 7 07:46:49 dataset1001 kernel: [384047.468262] [<ffffffff81093de4>] ? clockevents_program_event+0x54/0xa0 Feb 7 07:46:49 dataset1001 kernel: [384047.468266] [<ffffffff81095488>] ? tick_dev_program_event+0x68/0xd0 Feb 7 07:46:49 dataset1001 kernel: [384047.468272] [<ffffffffa00a7150>] bnx2_poll_work+0x70/0xa0 [bnx2] Feb 7 07:46:49 dataset1001 kernel: [384047.468277] [<ffffffffa00a71bd>] bnx2_poll_msix+0x3d/0xc0 [bnx2] Feb 7 07:46:49 dataset1001 kernel: [384047.468284] [<ffffffff81039889>] ? default_spin_lock_flags+0x9/0x10 Feb 7 07:46:49 dataset1001 kernel: [384047.468289] [<ffffffff8147844f>] net_rx_action+0x10f/0x250 Feb 7 07:46:49 dataset1001 kernel: [384047.468294] [<ffffffff8106f287>] __do_softirq+0xb7/0x1f0 Feb 7 07:46:49 dataset1001 kernel: [384047.468299] [<ffffffff810c6610>] ? handle_IRQ_event+0x60/0x170 Feb 7 07:46:49 dataset1001 kernel: [384047.468303] [<ffffffff810142ac>] call_softirq+0x1c/0x30 Feb 7 07:46:49 dataset1001 kernel: [384047.468306] [<ffffffff81015c75>] do_softirq+0x65/0xa0 Feb 7 07:46:49 dataset1001 kernel: [384047.468309] [<ffffffff8106f085>] irq_exit+0x85/0x90 Feb 7 07:46:49 dataset1001 kernel: [384047.468314] [<ffffffff815654e5>] do_IRQ+0x75/0xf0 Feb 7 07:46:49 dataset1001 kernel: [384047.468318] [<ffffffff81013ad3>] ret_from_intr+0x0/0x11 Feb 7 07:46:49 dataset1001 kernel: [384047.468320] <EOI> [<ffffffff81312c25>] ? acpi_idle_enter_bm+0x296/0x2ca Feb 7 07:46:49 dataset1001 kernel: [384047.468329] [<ffffffff81312c1e>] ? acpi_idle_enter_bm+0x28f/0x2ca Feb 7 07:46:49 dataset1001 kernel: [384047.468335] [<ffffffff81452297>] ? cpuidle_idle_call+0xa7/0x140 Feb 7 07:46:49 dataset1001 kernel: [384047.468341] [<ffffffff81011e43>] ? cpu_idle+0xb3/0x110 Feb 7 07:46:49 dataset1001 kernel: [384047.468347] [<ffffffff81558105>] ? start_secondary+0xa8/0xaa Feb 7 07:46:49 dataset1001 kernel: [384047.468350] Mem-Info: Feb 7 07:46:49 dataset1001 kernel: [384047.468352] Node 0 DMA per-cpu: Feb 7 07:46:49 dataset1001 kernel: [384047.468355] CPU 0: hi: 0, btch: 1 usd: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468357] CPU 1: hi: 0, btch: 1 usd: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468360] CPU 2: hi: 0, btch: 1 usd: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468362] CPU 3: hi: 0, btch: 1 usd: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468364] Node 0 DMA32 per-cpu: Feb 7 07:46:49 dataset1001 kernel: [384047.468367] CPU 0: hi: 186, btch: 31 usd: 174 Feb 7 07:46:49 dataset1001 kernel: [384047.468369] CPU 1: hi: 186, btch: 31 usd: 24 Feb 7 07:46:49 dataset1001 kernel: [384047.468371] CPU 2: hi: 186, btch: 31 usd: 183 Feb 7 07:46:49 dataset1001 kernel: [384047.468374] CPU 3: hi: 186, btch: 31 usd: 183 Feb 7 07:46:49 dataset1001 kernel: [384047.468375] Node 0 Normal per-cpu: Feb 7 07:46:49 dataset1001 kernel: [384047.468378] CPU 0: hi: 186, btch: 31 usd: 179 Feb 7 07:46:49 dataset1001 kernel: [384047.468380] CPU 1: hi: 186, btch: 31 usd: 54 Feb 7 07:46:49 dataset1001 kernel: [384047.468383] CPU 2: hi: 186, btch: 31 usd: 157 Feb 7 07:46:49 dataset1001 kernel: [384047.468385] CPU 3: hi: 186, btch: 31 usd: 59 Feb 7 07:46:49 dataset1001 kernel: [384047.468390] active_anon:26517 inactive_anon:5096 isolated_anon:0 Feb 7 07:46:49 dataset1001 kernel: [384047.468392] active_file:76812 inactive_file:3796080 isolated_file:0 Feb 7 07:46:49 dataset1001 kernel: [384047.468393] unevictable:850 dirty:388919 writeback:233464 unstable:0 Feb 7 07:46:49 dataset1001 kernel: [384047.468394] free:18338 slab_reclaimable:84627 slab_unreclaimable:21641 Feb 7 07:46:49 dataset1001 kernel: [384047.468395] mapped:2677 shmem:74 pagetables:755 bounce:0 Feb 7 07:46:49 dataset1001 kernel: [384047.468398] Node 0 DMA free:15892kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15340kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_re claimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Feb 7 07:46:49 dataset1001 kernel: [384047.468409] lowmem_reserve[]: 0 3243 16121 16121 Feb 7 07:46:49 dataset1001 kernel: [384047.468413] Node 0 DMA32 free:52708kB min:3264kB low:4080kB high:4896kB active_anon:11084kB inactive_anon:5500kB active_file:46444kB inactive_file:2763984kB unevictable:568kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:568kB dirty:299064kB writeb ack:181156kB mapped:136kB shmem:0kB slab_reclaimable:160712kB slab_unreclaimable:20836kB kernel_stack:64kB pagetables:164kB unstable:0kB bounce:0kB write back_tmp:0kB pages_scanned:0 all_unreclaimable? no Feb 7 07:46:49 dataset1001 kernel: [384047.468426] lowmem_reserve[]: 0 0 12877 12877 Feb 7 07:46:49 dataset1001 kernel: [384047.468430] Node 0 Normal free:4752kB min:12972kB low:16212kB high:19456kB active_anon:94984kB inactive_anon:1488 4kB active_file:260804kB inactive_file:12420336kB unevictable:2832kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:2832kB dirty:125661 2kB writeback:752700kB mapped:10572kB shmem:296kB slab_reclaimable:177796kB slab_unreclaimable:65728kB kernel_stack:1184kB pagetables:2856kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Feb 7 07:46:49 dataset1001 kernel: [384047.468443] lowmem_reserve[]: 0 0 0 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468447] Node 0 DMA: 3*4kB 3*8kB 3*16kB 2*32kB 2*64kB 2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 3*4096kB = 158 92kB Feb 7 07:46:49 dataset1001 kernel: [384047.468458] Node 0 DMA32: 11541*4kB 176*8kB 174*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*40 96kB = 52628kB Feb 7 07:46:49 dataset1001 kernel: [384047.468469] Node 0 Normal: 994*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 4808kB Feb 7 07:46:49 dataset1001 kernel: [384047.468480] 3873689 total pagecache pages Feb 7 07:46:49 dataset1001 kernel: [384047.468482] 0 pages in swap cache Feb 7 07:46:49 dataset1001 kernel: [384047.468484] Swap cache stats: add 0, delete 0, find 0/0 Feb 7 07:46:49 dataset1001 kernel: [384047.468486] Free swap = 976888kB Feb 7 07:46:49 dataset1001 kernel: [384047.468487] Total swap = 976888kB Feb 7 07:46:49 dataset1001 kernel: [384047.529621] 4194304 pages RAM Feb 7 07:46:49 dataset1001 kernel: [384047.529624] 81220 pages reserved Feb 7 07:46:49 dataset1001 kernel: [384047.529625] 2627094 pages shared Feb 7 07:46:49 dataset1001 kernel: [384047.529627] 1474311 pages non-shared Feb 7 07:46:49 dataset1001 kernel: [384047.529631] SLUB: Unable to allocate memory on node -1 (gfp=0x20) Feb 7 07:46:49 dataset1001 kernel: [384047.529634] cache: kmalloc-4096, object size: 4096, buffer size: 4096, default order: 3, min order: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.529637] node 0: slabs: 1541, objs: 4880, free: 0
I upgraded to 2.6.38-13-generic from the natty backports ppa.
Feb 8 2012
Saw similar stack traces in /var/log/messages from the network driver (bnx2):
Feb 8 08:03:26 dataset1001 kernel: [38432.288760] swapper: page allocation failure. order:0, mode:0x4020 Feb 8 08:03:26 dataset1001 kernel: [38432.288765] Pid: 0, comm: swapper Not tainted 2.6.38-13-generic #54~lucid1-Ubuntu Feb 8 08:03:26 dataset1001 kernel: [38432.288768] Call Trace: Feb 8 08:03:26 dataset1001 kernel: [38432.288770] <IRQ> [<ffffffff81113e50>] ? __alloc_pages_slowpath+0x6d0/0x710 Feb 8 08:03:26 dataset1001 kernel: [38432.288784] [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.288788] [<ffffffff81114039>] ? __alloc_pages_nodemask+0x1a9/0x200 Feb 8 08:03:26 dataset1001 kernel: [38432.288792] [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.288796] [<ffffffff81152db5>] ? new_slab+0x295/0x2a0 Feb 8 08:03:26 dataset1001 kernel: [38432.288799] [<ffffffff811544bd>] ? __slab_alloc+0x17d/0x320 Feb 8 08:03:26 dataset1001 kernel: [38432.288806] [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50 Feb 8 08:03:26 dataset1001 kernel: [38432.288810] [<ffffffff81157134>] ? __kmalloc_node_track_caller+0xa4/0x1d0 Feb 8 08:03:26 dataset1001 kernel: [38432.288814] [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50 Feb 8 08:03:26 dataset1001 kernel: [38432.288818] [<ffffffff814c8393>] ? __alloc_skb+0x83/0x170 Feb 8 08:03:26 dataset1001 kernel: [38432.288822] [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50 Feb 8 08:03:26 dataset1001 kernel: [38432.288831] [<ffffffffa00dccf3>] ? bnx2_rx_skb+0x63/0x6f0 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.288837] [<ffffffffa00dd7f8>] ? bnx2_rx_int+0x478/0x720 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.288843] [<ffffffffa00ddb10>] ? bnx2_poll_work+0x70/0xa0 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.288848] [<ffffffffa00ddb7d>] ? bnx2_poll_msix+0x3d/0xc0 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.288853] [<ffffffff81038d09>] ? default_spin_lock_flags+0x9/0x10 Feb 8 08:03:26 dataset1001 kernel: [38432.288859] [<ffffffff814d72a8>] ? net_rx_action+0x108/0x2d0 Feb 8 08:03:26 dataset1001 kernel: [38432.288864] [<ffffffff8106c04b>] ? __do_softirq+0xab/0x200 Feb 8 08:03:26 dataset1001 kernel: [38432.288870] [<ffffffff810d2330>] ? handle_IRQ_event+0x50/0x170 Feb 8 08:03:26 dataset1001 kernel: [38432.288875] [<ffffffff8100cf9c>] ? call_softirq+0x1c/0x30 Feb 8 08:03:26 dataset1001 kernel: [38432.288878] [<ffffffff8100ea05>] ? do_softirq+0x65/0xa0 Feb 8 08:03:26 dataset1001 kernel: [38432.288882] [<ffffffff8106bf15>] ? irq_exit+0x85/0x90 Feb 8 08:03:26 dataset1001 kernel: [38432.288886] [<ffffffff815cf326>] ? do_IRQ+0x66/0xe0 Feb 8 08:03:26 dataset1001 kernel: [38432.288892] [<ffffffff815c7753>] ? ret_from_intr+0x0/0x15 Feb 8 08:03:26 dataset1001 kernel: [38432.288894] <EOI> [<ffffffff81334861>] ? intel_idle+0xc1/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.288902] [<ffffffff81334844>] ? intel_idle+0xa4/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.288907] [<ffffffff814a7182>] ? cpuidle_idle_call+0xb2/0x1b0 Feb 8 08:03:26 dataset1001 kernel: [38432.288912] [<ffffffff8100b077>] ? cpu_idle+0xb7/0x110 Feb 8 08:03:26 dataset1001 kernel: [38432.288916] [<ffffffff815ac442>] ? rest_init+0x72/0x80 Feb 8 08:03:26 dataset1001 kernel: [38432.288923] [<ffffffff81accc9a>] ? start_kernel+0x374/0x37b Feb 8 08:03:26 dataset1001 kernel: [38432.288927] [<ffffffff81acc346>] ? x86_64_start_reservations+0x131/0x135 Feb 8 08:03:26 dataset1001 kernel: [38432.288931] [<ffffffff81acc452>] ? x86_64_start_kernel+0x108/0x117 Feb 8 08:03:26 dataset1001 kernel: [38432.288933] Mem-Info: Feb 8 08:03:26 dataset1001 kernel: [38432.288935] Node 0 DMA per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.288938] CPU 0: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288941] CPU 1: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288943] CPU 2: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288945] CPU 3: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288947] Node 0 DMA32 per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.288950] CPU 0: hi: 186, btch: 31 usd: 179 Feb 8 08:03:26 dataset1001 kernel: [38432.288952] CPU 1: hi: 186, btch: 31 usd: 15 Feb 8 08:03:26 dataset1001 kernel: [38432.288954] CPU 2: hi: 186, btch: 31 usd: 156 Feb 8 08:03:26 dataset1001 kernel: [38432.288956] CPU 3: hi: 186, btch: 31 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288958] Node 0 Normal per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.288961] CPU 0: hi: 186, btch: 31 usd: 88 Feb 8 08:03:26 dataset1001 kernel: [38432.288963] CPU 1: hi: 186, btch: 31 usd: 21 Feb 8 08:03:26 dataset1001 kernel: [38432.288965] CPU 2: hi: 186, btch: 31 usd: 161 Feb 8 08:03:26 dataset1001 kernel: [38432.288968] CPU 3: hi: 186, btch: 31 usd: 28 Feb 8 08:03:26 dataset1001 kernel: [38432.288973] active_anon:31711 inactive_anon:6004 isolated_anon:0 Feb 8 08:03:26 dataset1001 kernel: [38432.288974] active_file:26426 inactive_file:3857074 isolated_file:0 Feb 8 08:03:26 dataset1001 kernel: [38432.288976] unevictable:1019 dirty:729121 writeback:11217 unstable:0 Feb 8 08:03:26 dataset1001 kernel: [38432.288977] free:18317 slab_reclaimable:79292 slab_unreclaimable:9887 Feb 8 08:03:26 dataset1001 kernel: [38432.288978] mapped:2051 shmem:71 pagetables:867 bounce:0 Feb 8 08:03:26 dataset1001 kernel: [38432.288981] Node 0 DMA free:15868kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_rec laimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Feb 8 08:03:26 dataset1001 kernel: [38432.288992] lowmem_reserve[]: 0 3243 16121 16121 Feb 8 08:03:26 dataset1001 kernel: [38432.288996] Node 0 DMA32 free:52716kB min:3264kB low:4080kB high:4896kB active_anon:18048kB inactive_anon:7992kB a ctive_file:4864kB inactive_file:3085260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:0kB dirty:578832kB writeback:74 12kB mapped:148kB shmem:0kB slab_reclaimable:117856kB slab_unreclaimable:6472kB kernel_stack:56kB pagetables:292kB unstable:0kB bounce:0kB writeback_tmp: 0kB pages_scanned:0 all_unreclaimable? no Feb 8 08:03:26 dataset1001 kernel: [38432.289008] lowmem_reserve[]: 0 0 12877 12877 Feb 8 08:03:26 dataset1001 kernel: [38432.289012] Node 0 Normal free:4684kB min:12972kB low:16212kB high:19456kB active_anon:108796kB inactive_anon:1602 4kB active_file:100840kB inactive_file:12343036kB unevictable:4076kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:4028kB dirty:233765 2kB writeback:37456kB mapped:8056kB shmem:284kB slab_reclaimable:199312kB slab_unreclaimable:33076kB kernel_stack:952kB pagetables:3176kB unstable:0kB bo unce:0kB writeback_tmp:0kB pages_scanned:76 all_unreclaimable? no Feb 8 08:03:26 dataset1001 kernel: [38432.289025] lowmem_reserve[]: 0 0 0 0 Feb 8 08:03:26 dataset1001 kernel: [38432.289028] Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 1586 8kB Feb 8 08:03:26 dataset1001 kernel: [38432.289039] Node 0 DMA32: 368*4kB 249*8kB 634*16kB 1146*32kB 4*64kB 0*128kB 3*256kB 3*512kB 0*1024kB 0*2048kB 0*40 96kB = 52840kB Feb 8 08:03:26 dataset1001 kernel: [38432.289050] Node 0 Normal: 1060*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4560kB Feb 8 08:03:26 dataset1001 kernel: [38432.289060] 3884258 total pagecache pages Feb 8 08:03:26 dataset1001 kernel: [38432.289062] 0 pages in swap cache Feb 8 08:03:26 dataset1001 kernel: [38432.289064] Swap cache stats: add 0, delete 0, find 0/0 Feb 8 08:03:26 dataset1001 kernel: [38432.289066] Free swap = 976892kB Feb 8 08:03:26 dataset1001 kernel: [38432.289068] Total swap = 976892kB Feb 8 08:03:26 dataset1001 kernel: [38432.326609] kworker/0:1: page allocation failure. order:0, mode:0x4020 Feb 8 08:03:26 dataset1001 kernel: [38432.326615] Pid: 0, comm: kworker/0:1 Not tainted 2.6.38-13-generic #54~lucid1-Ubuntu Feb 8 08:03:26 dataset1001 kernel: [38432.326618] Call Trace: Feb 8 08:03:26 dataset1001 kernel: [38432.326620] <IRQ> [<ffffffff81113e50>] ? __alloc_pages_slowpath+0x6d0/0x710 Feb 8 08:03:26 dataset1001 kernel: [38432.326633] [<ffffffff81114039>] ? __alloc_pages_nodemask+0x1a9/0x200 Feb 8 08:03:26 dataset1001 kernel: [38432.326638] [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.326642] [<ffffffff81152db5>] ? new_slab+0x295/0x2a0 Feb 8 08:03:26 dataset1001 kernel: [38432.326645] [<ffffffff811544bd>] ? __slab_alloc+0x17d/0x320 Feb 8 08:03:26 dataset1001 kernel: [38432.326650] [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130 Feb 8 08:03:26 dataset1001 kernel: [38432.326656] [<ffffffff81573fb0>] ? ip6_pol_route_input+0x0/0x20 Feb 8 08:03:26 dataset1001 kernel: [38432.326660] [<ffffffff81157134>] ? __kmalloc_node_track_caller+0xa4/0x1d0 Feb 8 08:03:26 dataset1001 kernel: [38432.326663] [<ffffffff81521a1a>] ? tcp_rcv_established+0x26a/0x6e0 Feb 8 08:03:26 dataset1001 kernel: [38432.326667] [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130 Feb 8 08:03:26 dataset1001 kernel: [38432.326672] [<ffffffff814c8393>] ? __alloc_skb+0x83/0x170 Feb 8 08:03:26 dataset1001 kernel: [38432.326676] [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130 Feb 8 08:03:26 dataset1001 kernel: [38432.326680] [<ffffffff8151bafe>] ? __tcp_ack_snd_check+0x5e/0xa0 Feb 8 08:03:26 dataset1001 kernel: [38432.326684] [<ffffffff815219f9>] ? tcp_rcv_established+0x249/0x6e0 Feb 8 08:03:26 dataset1001 kernel: [38432.326688] [<ffffffff81529dc5>] ? tcp_v4_do_rcv+0x125/0x1c0 Feb 8 08:03:26 dataset1001 kernel: [38432.326692] [<ffffffff8152b5c9>] ? tcp_v4_rcv+0x5a9/0x840 Feb 8 08:03:26 dataset1001 kernel: [38432.326697] [<ffffffff814fffe4>] ? nf_iterate+0x84/0xb0 Feb 8 08:03:26 dataset1001 kernel: [38432.326701] [<ffffffff815080fd>] ? ip_local_deliver_finish+0xdd/0x290 Feb 8 08:03:26 dataset1001 kernel: [38432.326704] [<ffffffff81508330>] ? ip_local_deliver+0x80/0x90 Feb 8 08:03:26 dataset1001 kernel: [38432.326708] [<ffffffff81507901>] ? ip_rcv_finish+0x121/0x3f0 Feb 8 08:03:26 dataset1001 kernel: [38432.326711] [<ffffffff81507f4d>] ? ip_rcv+0x23d/0x310 Feb 8 08:03:26 dataset1001 kernel: [38432.326715] [<ffffffff814d1a0a>] ? __netif_receive_skb+0x40a/0x690 Feb 8 08:03:26 dataset1001 kernel: [38432.326720] [<ffffffff814d6ac0>] ? netif_receive_skb+0x80/0x90 Feb 8 08:03:26 dataset1001 kernel: [38432.326724] [<ffffffff814d6cd5>] ? napi_gro_complete+0x95/0xe0 Feb 8 08:03:26 dataset1001 kernel: [38432.326728] [<ffffffff814d715a>] ? napi_complete+0x3a/0x80 Feb 8 08:03:26 dataset1001 kernel: [38432.326736] [<ffffffffa00ddbc5>] ? bnx2_poll_msix+0x85/0xc0 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.326741] [<ffffffff814d72a8>] ? net_rx_action+0x108/0x2d0 Feb 8 08:03:26 dataset1001 kernel: [38432.326746] [<ffffffff8108af73>] ? hrtimer_get_next_event+0xb3/0xf0 Feb 8 08:03:26 dataset1001 kernel: [38432.326751] [<ffffffff8106c04b>] ? __do_softirq+0xab/0x200 Feb 8 08:03:26 dataset1001 kernel: [38432.326757] [<ffffffff810d2330>] ? handle_IRQ_event+0x50/0x170 Feb 8 08:03:26 dataset1001 kernel: [38432.326762] [<ffffffff810132e9>] ? sched_clock+0x9/0x10 Feb 8 08:03:26 dataset1001 kernel: [38432.326766] [<ffffffff8100cf9c>] ? call_softirq+0x1c/0x30 Feb 8 08:03:26 dataset1001 kernel: [38432.326769] [<ffffffff8100ea05>] ? do_softirq+0x65/0xa0 Feb 8 08:03:26 dataset1001 kernel: [38432.326773] [<ffffffff8106bf15>] ? irq_exit+0x85/0x90 Feb 8 08:03:26 dataset1001 kernel: [38432.326777] [<ffffffff815cf326>] ? do_IRQ+0x66/0xe0 Feb 8 08:03:26 dataset1001 kernel: [38432.326782] [<ffffffff815c7753>] ? ret_from_intr+0x0/0x15 Feb 8 08:03:26 dataset1001 kernel: [38432.326784] <EOI> [<ffffffff81334861>] ? intel_idle+0xc1/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.326793] [<ffffffff81334844>] ? intel_idle+0xa4/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.326799] [<ffffffff814a7182>] ? cpuidle_idle_call+0xb2/0x1b0 Feb 8 08:03:26 dataset1001 kernel: [38432.326805] [<ffffffff8100b077>] ? cpu_idle+0xb7/0x110 Feb 8 08:03:26 dataset1001 kernel: [38432.326809] [<ffffffff815c085d>] ? start_secondary+0xc2/0xc4 Feb 8 08:03:26 dataset1001 kernel: [38432.326812] Mem-Info: Feb 8 08:03:26 dataset1001 kernel: [38432.326814] Node 0 DMA per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.326817] CPU 0: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326819] CPU 1: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326822] CPU 2: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326824] CPU 3: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326826] Node 0 DMA32 per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.326829] CPU 0: hi: 186, btch: 31 usd: 179 Feb 8 08:03:26 dataset1001 kernel: [38432.326831] CPU 1: hi: 186, btch: 31 usd: 15 Feb 8 08:03:26 dataset1001 kernel: [38432.326834] CPU 2: hi: 186, btch: 31 usd: 156 Feb 8 08:03:26 dataset1001 kernel: [38432.326836] CPU 3: hi: 186, btch: 31 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326838] Node 0 Normal per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.326841] CPU 0: hi: 186, btch: 31 usd: 88 Feb 8 08:03:26 dataset1001 kernel: [38432.326843] CPU 1: hi: 186, btch: 31 usd: 21 Feb 8 08:03:26 dataset1001 kernel: [38432.326845] CPU 2: hi: 186, btch: 31 usd: 161 Feb 8 08:03:26 dataset1001 kernel: [38432.326848] CPU 3: hi: 186, btch: 31 usd: 28 Feb 8 08:03:26 dataset1001 kernel: [38432.326853] active_anon:31711 inactive_anon:6004 isolated_anon:0 Feb 8 08:03:26 dataset1001 kernel: [38432.326854] active_file:26426 inactive_file:3857074 isolated_file:0 Feb 8 08:03:26 dataset1001 kernel: [38432.326856] unevictable:1019 dirty:729121 writeback:11217 unstable:0 Feb 8 08:03:26 dataset1001 kernel: [38432.326857] free:18317 slab_reclaimable:79292 slab_unreclaimable:9887 Feb 8 08:03:26 dataset1001 kernel: [38432.326858] mapped:2051 shmem:71 pagetables:867 bounce:0 Feb 8 08:03:26 dataset1001 kernel: [38432.326861] Node 0 DMA free:15868kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_rec laimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Feb 8 08:03:26 dataset1001 kernel: [38432.326872] lowmem_reserve[]: 0 3243 16121 16121 Feb 8 08:03:26 dataset1001 kernel: [38432.326876] Node 0 DMA32 free:52716kB min:3264kB low:4080kB high:4896kB active_anon:18048kB inactive_anon:7992kB a ctive_file:4864kB inactive_file:3085260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:0kB dirty:578832kB writeback:74 12kB mapped:148kB shmem:0kB slab_reclaimable:117856kB slab_unreclaimable:6472kB kernel_stack:56kB pagetables:292kB unstable:0kB bounce:0kB writeback_tmp: 0kB pages_scanned:0 all_unreclaimable? no Feb 8 08:03:26 dataset1001 kernel: [38432.326889] lowmem_reserve[]: 0 0 12877 12877 Feb 8 08:03:26 dataset1001 kernel: [38432.326893] Node 0 Normal free:4684kB min:12972kB low:16212kB high:19456kB active_anon:108796kB inactive_anon:1602 4kB active_file:100840kB inactive_file:12343036kB unevictable:4076kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:4028kB dirty:233765 2kB writeback:37456kB mapped:8056kB shmem:284kB slab_reclaimable:199312kB slab_unreclaimable:33076kB kernel_stack:952kB pagetables:3176kB unstable:0kB bo unce:0kB writeback_tmp:0kB pages_scanned:76 all_unreclaimable? no Feb 8 08:03:26 dataset1001 kernel: [38432.326905] lowmem_reserve[]: 0 0 0 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326909] Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 1586 8kB Feb 8 08:03:26 dataset1001 kernel: [38432.326920] Node 0 DMA32: 368*4kB 249*8kB 634*16kB 1146*32kB 4*64kB 0*128kB 3*256kB 3*512kB 0*1024kB 0*2048kB 0*40 96kB = 52840kB Feb 8 08:03:26 dataset1001 kernel: [38432.326930] Node 0 Normal: 1060*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4560kB Feb 8 08:03:26 dataset1001 kernel: [38432.326941] 3884258 total pagecache pages Feb 8 08:03:26 dataset1001 kernel: [38432.326943] 0 pages in swap cache Feb 8 08:03:26 dataset1001 kernel: [38432.326945] Swap cache stats: add 0, delete 0, find 0/0 Feb 8 08:03:26 dataset1001 kernel: [38432.326947] Free swap = 976892kB Feb 8 08:03:26 dataset1001 kernel: [38432.326948] Total swap = 976892kB Feb 8 08:03:26 dataset1001 kernel: [38432.350585] 4194288 pages RAM Feb 8 08:03:26 dataset1001 kernel: [38432.350588] 81311 pages reserved Feb 8 08:03:26 dataset1001 kernel: [38432.350589] 2544422 pages shared Feb 8 08:03:26 dataset1001 kernel: [38432.350591] 1558178 pages non-shared Feb 8 08:03:26 dataset1001 kernel: [38432.350595] SLUB: Unable to allocate memory on node -1 (gfp=0x20) Feb 8 08:03:26 dataset1001 kernel: [38432.350598] cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 3, min order: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.350602] node 0: slabs: 2939, objs: 10974, free: 0
The first is a failure of the driver to atomically allocate memory for a socket buffer associated with the incoming packet. This is at a point where it's not possible to wait for a page to be swapped in. This would probably mean a lost packet. The second stack trace is in the code path for sending a tcp ACK, which means that the remote end would have to retransmit. We saw two groups of 5-6 stack traces, one at 08:03:26 and one at 11:23:17. AFAICT these two incidents didn't actually affect network traffic except for short hiccups. More recent versions of the driver use kmalloc instead of __netdev_alloc and handle it differently; if the issue turns out to be serious we could look into backporting those changes.
We did see more of those later on Feb 8 and 9th, a total of about 5 groups of 5 stack traces within 3 hours. Not sure of the effect on the remote end.
One possibility that has been reported on various mailing lists is to increase the setting for /proc/sys/vm/min_free_kbytes but this needs more checking.
Okay, after a bunch of reading and crunching the numbers, it looks lik we need to buy more time for kswapd to reclaim memory. Increased (live hack) min_free_kbytes from 16259 to 262144. This is still well below the max recommended 6% for a host with 16GB ram.
Feb 9 2012
Saw a single stack trace from xfs:
Feb 9 02:54:20 dataset1001 kernel: [34353.016174] rsync D ffff880407bb3170 0 6830 4054 0x00000000 Feb 9 02:54:20 dataset1001 kernel: [34353.016180] ffff88003749ba28 0000000000000082 ffff88003749b9f8 ffff880000000000 Feb 9 02:54:20 dataset1001 kernel: [34353.016184] 0000000000013cc0 ffff880407bb2dc0 ffff880407bb3170 ffff88003749bfd8 Feb 9 02:54:20 dataset1001 kernel: [34353.016188] ffff880407bb3178 0000000000013cc0 ffff88003749a010 0000000000013cc0 Feb 9 02:54:20 dataset1001 kernel: [34353.016193] Call Trace: Feb 9 02:54:20 dataset1001 kernel: [34353.016203] [<ffffffff815c558d>] schedule_timeout+0x21d/0x300 Feb 9 02:54:20 dataset1001 kernel: [34353.016209] [<ffffffff812bbf84>] ? blk_unplug+0x34/0x70 Feb 9 02:54:20 dataset1001 kernel: [34353.016214] [<ffffffff8148dfec>] ? dm_table_unplug_all+0x5c/0x110 Feb 9 02:54:20 dataset1001 kernel: [34353.016218] [<ffffffff815c651e>] __down+0x7e/0xc0 Feb 9 02:54:20 dataset1001 kernel: [34353.016257] [<ffffffffa0162d55>] ? _xfs_buf_find+0xf5/0x280 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016263] [<ffffffff8108c991>] down+0x41/0x50 Feb 9 02:54:20 dataset1001 kernel: [34353.016291] [<ffffffffa0161a73>] xfs_buf_lock+0x53/0x120 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016318] [<ffffffffa0162d55>] _xfs_buf_find+0xf5/0x280 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016346] [<ffffffffa0162f40>] xfs_buf_get+0x60/0x1c0 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016374] [<ffffffffa01638dc>] xfs_buf_read+0x2c/0x110 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016402] [<ffffffffa0158ae0>] xfs_trans_read_buf+0x190/0x420 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016426] [<ffffffffa012ba09>] xfs_da_do_buf+0x289/0x790 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016431] [<ffffffff8116dde3>] ? generic_permission+0x23/0xb0 Feb 9 02:54:20 dataset1001 kernel: [34353.016436] [<ffffffff815c705e>] ? _raw_spin_lock+0xe/0x20 Feb 9 02:54:20 dataset1001 kernel: [34353.016443] [<ffffffff81181e30>] ? mntput_no_expire+0x60/0x190 Feb 9 02:54:20 dataset1001 kernel: [34353.016448] [<ffffffff8116efc4>] ? finish_open+0xe4/0x1c0 Feb 9 02:54:20 dataset1001 kernel: [34353.016471] [<ffffffffa012bf8a>] xfs_da_read_buf+0x2a/0x30 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016496] [<ffffffffa01303d8>] ? xfs_dir2_block_getdents+0x98/0x220 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016520] [<ffffffffa01303d8>] xfs_dir2_block_getdents+0x98/0x220 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016542] [<ffffffffa0118a11>] ? xfs_bmap_last_offset+0x111/0x140 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016547] [<ffffffff811753b0>] ? filldir+0x0/0xe0 Feb 9 02:54:20 dataset1001 kernel: [34353.016571] [<ffffffffa012ebf7>] xfs_readdir+0xa7/0x130 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016575] [<ffffffff811753b0>] ? filldir+0x0/0xe0 Feb 9 02:54:20 dataset1001 kernel: [34353.016580] [<ffffffff811753b0>] ? filldir+0x0/0xe0 Feb 9 02:54:20 dataset1001 kernel: [34353.016607] [<ffffffffa0164ac9>] xfs_file_readdir+0x39/0x50 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016611] [<ffffffff81175548>] vfs_readdir+0xb8/0xe0 Feb 9 02:54:20 dataset1001 kernel: [34353.016616] [<ffffffff811756d9>] sys_getdents+0x89/0xf0 Feb 9 02:54:20 dataset1001 kernel: [34353.016621] [<ffffffff8100c082>] system_call_fastpath+0x16/0x1b
Investigating.
Other issues
Initial issues included a firmware bug on the receiving host's array causing a slowdown in writes; the array was swapped out for different hardware, fixing the issue.
The same rsync via a different route (not he.net) was much faster. Somone at he.net made an unspecified configuration change which helped for a short time and then things went downhill again. Using a different route we are still doing well. Speeds bounce around some due to the way the rsync bwlimit works (burst, then throttle).