Dataset1001

From Wikitech

Revision as of 10:06, 9 February 2012

See Dumps/Dump servers for the overview of all dump server hardware.

Notes on Dataset1001 deployment:

During rsync testing with Kevin at your.org

Processes running:

  • cron job with bwlimit 10000 to get latest data written to 2012 dumps from dataset2 via rsync
  • background job with bwlimit 40000 to get 2011 dumps from dataset2 via rsync
  • rsync daemon with bwlimit 50000 which your.org connects to

Note that some of the files transferred are quite large, several GB each.
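The three transfers above could be sketched roughly as follows. The module name, paths, and exact invocations are illustrative assumptions; only the bwlimit values come from these notes (rsync's --bwlimit is in KB/s, so 10000 is roughly 10 MB/s):

```shell
# cron job: pull the latest data written to the 2012 dumps from dataset2,
# throttled to ~10 MB/s
rsync -a --bwlimit=10000 dataset2::dumps/2012/ /data/xmldatadumps/public/2012/

# background job: one-time copy of the 2011 dumps, throttled to ~40 MB/s
rsync -a --bwlimit=40000 dataset2::dumps/2011/ /data/xmldatadumps/public/2011/ &

# rsync daemon that your.org connects to, sending at most ~50 MB/s
rsync --daemon --bwlimit=50000 --config=/etc/rsyncd.conf
```

Since bwlimit throttles by bursting and then sleeping, observed speeds will oscillate around these caps rather than hold steady.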

Feb 7 2012

Saw stack traces in /var/log/messages using kernel 2.6.32-38-server (Ubuntu 10.04):

Feb  7 07:46:49 dataset1001 kernel: [384047.468170] swapper: page allocation failure. order:0, mode:0x4020
Feb  7 07:46:49 dataset1001 kernel: [384047.468176] Pid: 0, comm: swapper Not tainted 2.6.32-38-server #83-Ubuntu
Feb  7 07:46:49 dataset1001 kernel: [384047.468179] Call Trace:
Feb  7 07:46:49 dataset1001 kernel: [384047.468181]  <IRQ>  [<ffffffff810fca37>] __alloc_pages_slowpath+0x4a7/0x590
Feb  7 07:46:49 dataset1001 kernel: [384047.468195]  [<ffffffff810fcc99>] __alloc_pages_nodemask+0x179/0x180
Feb  7 07:46:49 dataset1001 kernel: [384047.468202]  [<ffffffff8112fe57>] alloc_pages_current+0x87/0xd0
Feb  7 07:46:49 dataset1001 kernel: [384047.468206]  [<ffffffff81135e97>] new_slab+0x2f7/0x310
Feb  7 07:46:49 dataset1001 kernel: [384047.468210]  [<ffffffff81138771>] __slab_alloc+0x201/0x2d0
Feb  7 07:46:49 dataset1001 kernel: [384047.468216]  [<ffffffff8146e376>] ? __netdev_alloc_skb+0x36/0x60
Feb  7 07:46:49 dataset1001 kernel: [384047.468221]  [<ffffffff8113956f>] __kmalloc_node_track_caller+0xaf/0x160
Feb  7 07:46:49 dataset1001 kernel: [384047.468225]  [<ffffffff8146e376>] ? __netdev_alloc_skb+0x36/0x60
Feb  7 07:46:49 dataset1001 kernel: [384047.468229]  [<ffffffff8146db90>] __alloc_skb+0x80/0x190
Feb  7 07:46:49 dataset1001 kernel: [384047.468233]  [<ffffffff8146e376>] __netdev_alloc_skb+0x36/0x60
Feb  7 07:46:49 dataset1001 kernel: [384047.468243]  [<ffffffffa00a6303>] bnx2_rx_skb+0x63/0x6e0 [bnx2]
Feb  7 07:46:49 dataset1001 kernel: [384047.468249]  [<ffffffffa00a6e1c>] bnx2_rx_int+0x49c/0x760 [bnx2]
Feb  7 07:46:49 dataset1001 kernel: [384047.468255]  [<ffffffff810200c6>] ? show_shared_cpu_map_func+0x66/0x80
Feb  7 07:46:49 dataset1001 kernel: [384047.468262]  [<ffffffff81093de4>] ? clockevents_program_event+0x54/0xa0
Feb  7 07:46:49 dataset1001 kernel: [384047.468266]  [<ffffffff81095488>] ? tick_dev_program_event+0x68/0xd0
Feb  7 07:46:49 dataset1001 kernel: [384047.468272]  [<ffffffffa00a7150>] bnx2_poll_work+0x70/0xa0 [bnx2]
Feb  7 07:46:49 dataset1001 kernel: [384047.468277]  [<ffffffffa00a71bd>] bnx2_poll_msix+0x3d/0xc0 [bnx2]
Feb  7 07:46:49 dataset1001 kernel: [384047.468284]  [<ffffffff81039889>] ? default_spin_lock_flags+0x9/0x10
Feb  7 07:46:49 dataset1001 kernel: [384047.468289]  [<ffffffff8147844f>] net_rx_action+0x10f/0x250
Feb  7 07:46:49 dataset1001 kernel: [384047.468294]  [<ffffffff8106f287>] __do_softirq+0xb7/0x1f0
Feb  7 07:46:49 dataset1001 kernel: [384047.468299]  [<ffffffff810c6610>] ? handle_IRQ_event+0x60/0x170
Feb  7 07:46:49 dataset1001 kernel: [384047.468303]  [<ffffffff810142ac>] call_softirq+0x1c/0x30
Feb  7 07:46:49 dataset1001 kernel: [384047.468306]  [<ffffffff81015c75>] do_softirq+0x65/0xa0
Feb  7 07:46:49 dataset1001 kernel: [384047.468309]  [<ffffffff8106f085>] irq_exit+0x85/0x90
Feb  7 07:46:49 dataset1001 kernel: [384047.468314]  [<ffffffff815654e5>] do_IRQ+0x75/0xf0
Feb  7 07:46:49 dataset1001 kernel: [384047.468318]  [<ffffffff81013ad3>] ret_from_intr+0x0/0x11
Feb  7 07:46:49 dataset1001 kernel: [384047.468320]  <EOI>  [<ffffffff81312c25>] ? acpi_idle_enter_bm+0x296/0x2ca
Feb  7 07:46:49 dataset1001 kernel: [384047.468329]  [<ffffffff81312c1e>] ? acpi_idle_enter_bm+0x28f/0x2ca
Feb  7 07:46:49 dataset1001 kernel: [384047.468335]  [<ffffffff81452297>] ? cpuidle_idle_call+0xa7/0x140
Feb  7 07:46:49 dataset1001 kernel: [384047.468341]  [<ffffffff81011e43>] ? cpu_idle+0xb3/0x110
Feb  7 07:46:49 dataset1001 kernel: [384047.468347]  [<ffffffff81558105>] ? start_secondary+0xa8/0xaa
Feb  7 07:46:49 dataset1001 kernel: [384047.468350] Mem-Info:
Feb  7 07:46:49 dataset1001 kernel: [384047.468352] Node 0 DMA per-cpu:
Feb  7 07:46:49 dataset1001 kernel: [384047.468355] CPU    0: hi:    0, btch:   1 usd:   0
Feb  7 07:46:49 dataset1001 kernel: [384047.468357] CPU    1: hi:    0, btch:   1 usd:   0
Feb  7 07:46:49 dataset1001 kernel: [384047.468360] CPU    2: hi:    0, btch:   1 usd:   0
Feb  7 07:46:49 dataset1001 kernel: [384047.468362] CPU    3: hi:    0, btch:   1 usd:   0
Feb  7 07:46:49 dataset1001 kernel: [384047.468364] Node 0 DMA32 per-cpu:
Feb  7 07:46:49 dataset1001 kernel: [384047.468367] CPU    0: hi:  186, btch:  31 usd: 174
Feb  7 07:46:49 dataset1001 kernel: [384047.468369] CPU    1: hi:  186, btch:  31 usd:  24
Feb  7 07:46:49 dataset1001 kernel: [384047.468371] CPU    2: hi:  186, btch:  31 usd: 183
Feb  7 07:46:49 dataset1001 kernel: [384047.468374] CPU    3: hi:  186, btch:  31 usd: 183
Feb  7 07:46:49 dataset1001 kernel: [384047.468375] Node 0 Normal per-cpu:
Feb  7 07:46:49 dataset1001 kernel: [384047.468378] CPU    0: hi:  186, btch:  31 usd: 179
Feb  7 07:46:49 dataset1001 kernel: [384047.468380] CPU    1: hi:  186, btch:  31 usd:  54
Feb  7 07:46:49 dataset1001 kernel: [384047.468383] CPU    2: hi:  186, btch:  31 usd: 157
Feb  7 07:46:49 dataset1001 kernel: [384047.468385] CPU    3: hi:  186, btch:  31 usd:  59
Feb  7 07:46:49 dataset1001 kernel: [384047.468390] active_anon:26517 inactive_anon:5096 isolated_anon:0
Feb  7 07:46:49 dataset1001 kernel: [384047.468392]  active_file:76812 inactive_file:3796080 isolated_file:0
Feb  7 07:46:49 dataset1001 kernel: [384047.468393]  unevictable:850 dirty:388919 writeback:233464 unstable:0
Feb  7 07:46:49 dataset1001 kernel: [384047.468394]  free:18338 slab_reclaimable:84627 slab_unreclaimable:21641
Feb  7 07:46:49 dataset1001 kernel: [384047.468395]  mapped:2677 shmem:74 pagetables:755 bounce:0
Feb  7 07:46:49 dataset1001 kernel: [384047.468398] Node 0 DMA free:15892kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB
 inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15340kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_re
claimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  7 07:46:49 dataset1001 kernel: [384047.468409] lowmem_reserve[]: 0 3243 16121 16121
Feb  7 07:46:49 dataset1001 kernel: [384047.468413] Node 0 DMA32 free:52708kB min:3264kB low:4080kB high:4896kB active_anon:11084kB inactive_anon:5500kB 
active_file:46444kB inactive_file:2763984kB unevictable:568kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:568kB dirty:299064kB writeb
ack:181156kB mapped:136kB shmem:0kB slab_reclaimable:160712kB slab_unreclaimable:20836kB kernel_stack:64kB pagetables:164kB unstable:0kB bounce:0kB write
back_tmp:0kB pages_scanned:0 all_unreclaimable? no
Feb  7 07:46:49 dataset1001 kernel: [384047.468426] lowmem_reserve[]: 0 0 12877 12877
Feb  7 07:46:49 dataset1001 kernel: [384047.468430] Node 0 Normal free:4752kB min:12972kB low:16212kB high:19456kB active_anon:94984kB inactive_anon:1488
4kB active_file:260804kB inactive_file:12420336kB unevictable:2832kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:2832kB dirty:125661
2kB writeback:752700kB mapped:10572kB shmem:296kB slab_reclaimable:177796kB slab_unreclaimable:65728kB kernel_stack:1184kB pagetables:2856kB unstable:0kB
 bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Feb  7 07:46:49 dataset1001 kernel: [384047.468443] lowmem_reserve[]: 0 0 0 0
Feb  7 07:46:49 dataset1001 kernel: [384047.468447] Node 0 DMA: 3*4kB 3*8kB 3*16kB 2*32kB 2*64kB 2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 3*4096kB = 158
92kB
Feb  7 07:46:49 dataset1001 kernel: [384047.468458] Node 0 DMA32: 11541*4kB 176*8kB 174*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*40
96kB = 52628kB
Feb  7 07:46:49 dataset1001 kernel: [384047.468469] Node 0 Normal: 994*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB 
= 4808kB
Feb  7 07:46:49 dataset1001 kernel: [384047.468480] 3873689 total pagecache pages
Feb  7 07:46:49 dataset1001 kernel: [384047.468482] 0 pages in swap cache
Feb  7 07:46:49 dataset1001 kernel: [384047.468484] Swap cache stats: add 0, delete 0, find 0/0
Feb  7 07:46:49 dataset1001 kernel: [384047.468486] Free swap  = 976888kB
Feb  7 07:46:49 dataset1001 kernel: [384047.468487] Total swap = 976888kB
Feb  7 07:46:49 dataset1001 kernel: [384047.529621] 4194304 pages RAM
Feb  7 07:46:49 dataset1001 kernel: [384047.529624] 81220 pages reserved
Feb  7 07:46:49 dataset1001 kernel: [384047.529625] 2627094 pages shared
Feb  7 07:46:49 dataset1001 kernel: [384047.529627] 1474311 pages non-shared
Feb  7 07:46:49 dataset1001 kernel: [384047.529631] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Feb  7 07:46:49 dataset1001 kernel: [384047.529634]   cache: kmalloc-4096, object size: 4096, buffer size: 4096, default order: 3, min order: 0
Feb  7 07:46:49 dataset1001 kernel: [384047.529637]   node 0: slabs: 1541, objs: 4880, free: 0

I upgraded to 2.6.38-13-generic from the natty backports ppa.
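The upgrade amounts to pulling the backported kernel package and rebooting into it; the PPA identifier below is an assumption (the notes only say "natty backports ppa"), but the package naming follows Ubuntu's usual scheme:

```shell
# add the kernel backports PPA (exact PPA name is an assumption)
sudo add-apt-repository ppa:kernel-ppa/ppa
sudo apt-get update

# install the backported 2.6.38 kernel and reboot into it
sudo apt-get install linux-image-2.6.38-13-generic
sudo reboot
```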

Feb 8 2012

Saw similar stack traces in /var/log/messages from the network driver (bnx2):

Feb  8 08:03:26 dataset1001 kernel: [38432.288760] swapper: page allocation failure. order:0, mode:0x4020
Feb  8 08:03:26 dataset1001 kernel: [38432.288765] Pid: 0, comm: swapper Not tainted 2.6.38-13-generic #54~lucid1-Ubuntu
Feb  8 08:03:26 dataset1001 kernel: [38432.288768] Call Trace:
Feb  8 08:03:26 dataset1001 kernel: [38432.288770]  <IRQ>  [<ffffffff81113e50>] ? __alloc_pages_slowpath+0x6d0/0x710
Feb  8 08:03:26 dataset1001 kernel: [38432.288784]  [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.288788]  [<ffffffff81114039>] ? __alloc_pages_nodemask+0x1a9/0x200
Feb  8 08:03:26 dataset1001 kernel: [38432.288792]  [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.288796]  [<ffffffff81152db5>] ? new_slab+0x295/0x2a0
Feb  8 08:03:26 dataset1001 kernel: [38432.288799]  [<ffffffff811544bd>] ? __slab_alloc+0x17d/0x320
Feb  8 08:03:26 dataset1001 kernel: [38432.288806]  [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50
Feb  8 08:03:26 dataset1001 kernel: [38432.288810]  [<ffffffff81157134>] ? __kmalloc_node_track_caller+0xa4/0x1d0
Feb  8 08:03:26 dataset1001 kernel: [38432.288814]  [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50
Feb  8 08:03:26 dataset1001 kernel: [38432.288818]  [<ffffffff814c8393>] ? __alloc_skb+0x83/0x170
Feb  8 08:03:26 dataset1001 kernel: [38432.288822]  [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50
Feb  8 08:03:26 dataset1001 kernel: [38432.288831]  [<ffffffffa00dccf3>] ? bnx2_rx_skb+0x63/0x6f0 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.288837]  [<ffffffffa00dd7f8>] ? bnx2_rx_int+0x478/0x720 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.288843]  [<ffffffffa00ddb10>] ? bnx2_poll_work+0x70/0xa0 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.288848]  [<ffffffffa00ddb7d>] ? bnx2_poll_msix+0x3d/0xc0 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.288853]  [<ffffffff81038d09>] ? default_spin_lock_flags+0x9/0x10
Feb  8 08:03:26 dataset1001 kernel: [38432.288859]  [<ffffffff814d72a8>] ? net_rx_action+0x108/0x2d0
Feb  8 08:03:26 dataset1001 kernel: [38432.288864]  [<ffffffff8106c04b>] ? __do_softirq+0xab/0x200
Feb  8 08:03:26 dataset1001 kernel: [38432.288870]  [<ffffffff810d2330>] ? handle_IRQ_event+0x50/0x170
Feb  8 08:03:26 dataset1001 kernel: [38432.288875]  [<ffffffff8100cf9c>] ? call_softirq+0x1c/0x30
Feb  8 08:03:26 dataset1001 kernel: [38432.288878]  [<ffffffff8100ea05>] ? do_softirq+0x65/0xa0
Feb  8 08:03:26 dataset1001 kernel: [38432.288882]  [<ffffffff8106bf15>] ? irq_exit+0x85/0x90
Feb  8 08:03:26 dataset1001 kernel: [38432.288886]  [<ffffffff815cf326>] ? do_IRQ+0x66/0xe0
Feb  8 08:03:26 dataset1001 kernel: [38432.288892]  [<ffffffff815c7753>] ? ret_from_intr+0x0/0x15
Feb  8 08:03:26 dataset1001 kernel: [38432.288894]  <EOI>  [<ffffffff81334861>] ? intel_idle+0xc1/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.288902]  [<ffffffff81334844>] ? intel_idle+0xa4/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.288907]  [<ffffffff814a7182>] ? cpuidle_idle_call+0xb2/0x1b0
Feb  8 08:03:26 dataset1001 kernel: [38432.288912]  [<ffffffff8100b077>] ? cpu_idle+0xb7/0x110
Feb  8 08:03:26 dataset1001 kernel: [38432.288916]  [<ffffffff815ac442>] ? rest_init+0x72/0x80
Feb  8 08:03:26 dataset1001 kernel: [38432.288923]  [<ffffffff81accc9a>] ? start_kernel+0x374/0x37b
Feb  8 08:03:26 dataset1001 kernel: [38432.288927]  [<ffffffff81acc346>] ? x86_64_start_reservations+0x131/0x135
Feb  8 08:03:26 dataset1001 kernel: [38432.288931]  [<ffffffff81acc452>] ? x86_64_start_kernel+0x108/0x117
Feb  8 08:03:26 dataset1001 kernel: [38432.288933] Mem-Info:
Feb  8 08:03:26 dataset1001 kernel: [38432.288935] Node 0 DMA per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.288938] CPU    0: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288941] CPU    1: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288943] CPU    2: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288945] CPU    3: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288947] Node 0 DMA32 per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.288950] CPU    0: hi:  186, btch:  31 usd: 179
Feb  8 08:03:26 dataset1001 kernel: [38432.288952] CPU    1: hi:  186, btch:  31 usd:  15
Feb  8 08:03:26 dataset1001 kernel: [38432.288954] CPU    2: hi:  186, btch:  31 usd: 156
Feb  8 08:03:26 dataset1001 kernel: [38432.288956] CPU    3: hi:  186, btch:  31 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288958] Node 0 Normal per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.288961] CPU    0: hi:  186, btch:  31 usd:  88
Feb  8 08:03:26 dataset1001 kernel: [38432.288963] CPU    1: hi:  186, btch:  31 usd:  21
Feb  8 08:03:26 dataset1001 kernel: [38432.288965] CPU    2: hi:  186, btch:  31 usd: 161
Feb  8 08:03:26 dataset1001 kernel: [38432.288968] CPU    3: hi:  186, btch:  31 usd:  28
Feb  8 08:03:26 dataset1001 kernel: [38432.288973] active_anon:31711 inactive_anon:6004 isolated_anon:0
Feb  8 08:03:26 dataset1001 kernel: [38432.288974]  active_file:26426 inactive_file:3857074 isolated_file:0
Feb  8 08:03:26 dataset1001 kernel: [38432.288976]  unevictable:1019 dirty:729121 writeback:11217 unstable:0
Feb  8 08:03:26 dataset1001 kernel: [38432.288977]  free:18317 slab_reclaimable:79292 slab_unreclaimable:9887
Feb  8 08:03:26 dataset1001 kernel: [38432.288978]  mapped:2051 shmem:71 pagetables:867 bounce:0
Feb  8 08:03:26 dataset1001 kernel: [38432.288981] Node 0 DMA free:15868kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB 
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_rec
laimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  8 08:03:26 dataset1001 kernel: [38432.288992] lowmem_reserve[]: 0 3243 16121 16121
Feb  8 08:03:26 dataset1001 kernel: [38432.288996] Node 0 DMA32 free:52716kB min:3264kB low:4080kB high:4896kB active_anon:18048kB inactive_anon:7992kB a
ctive_file:4864kB inactive_file:3085260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:0kB dirty:578832kB writeback:74
12kB mapped:148kB shmem:0kB slab_reclaimable:117856kB slab_unreclaimable:6472kB kernel_stack:56kB pagetables:292kB unstable:0kB bounce:0kB writeback_tmp:
0kB pages_scanned:0 all_unreclaimable? no
Feb  8 08:03:26 dataset1001 kernel: [38432.289008] lowmem_reserve[]: 0 0 12877 12877
Feb  8 08:03:26 dataset1001 kernel: [38432.289012] Node 0 Normal free:4684kB min:12972kB low:16212kB high:19456kB active_anon:108796kB inactive_anon:1602
4kB active_file:100840kB inactive_file:12343036kB unevictable:4076kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:4028kB dirty:233765
2kB writeback:37456kB mapped:8056kB shmem:284kB slab_reclaimable:199312kB slab_unreclaimable:33076kB kernel_stack:952kB pagetables:3176kB unstable:0kB bo
unce:0kB writeback_tmp:0kB pages_scanned:76 all_unreclaimable? no
Feb  8 08:03:26 dataset1001 kernel: [38432.289025] lowmem_reserve[]: 0 0 0 0
Feb  8 08:03:26 dataset1001 kernel: [38432.289028] Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 1586
8kB
Feb  8 08:03:26 dataset1001 kernel: [38432.289039] Node 0 DMA32: 368*4kB 249*8kB 634*16kB 1146*32kB 4*64kB 0*128kB 3*256kB 3*512kB 0*1024kB 0*2048kB 0*40
96kB = 52840kB
Feb  8 08:03:26 dataset1001 kernel: [38432.289050] Node 0 Normal: 1060*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 
= 4560kB
Feb  8 08:03:26 dataset1001 kernel: [38432.289060] 3884258 total pagecache pages
Feb  8 08:03:26 dataset1001 kernel: [38432.289062] 0 pages in swap cache
Feb  8 08:03:26 dataset1001 kernel: [38432.289064] Swap cache stats: add 0, delete 0, find 0/0
Feb  8 08:03:26 dataset1001 kernel: [38432.289066] Free swap  = 976892kB
Feb  8 08:03:26 dataset1001 kernel: [38432.289068] Total swap = 976892kB



Feb  8 08:03:26 dataset1001 kernel: [38432.326609] kworker/0:1: page allocation failure. order:0, mode:0x4020
Feb  8 08:03:26 dataset1001 kernel: [38432.326615] Pid: 0, comm: kworker/0:1 Not tainted 2.6.38-13-generic #54~lucid1-Ubuntu
Feb  8 08:03:26 dataset1001 kernel: [38432.326618] Call Trace:
Feb  8 08:03:26 dataset1001 kernel: [38432.326620]  <IRQ>  [<ffffffff81113e50>] ? __alloc_pages_slowpath+0x6d0/0x710
Feb  8 08:03:26 dataset1001 kernel: [38432.326633]  [<ffffffff81114039>] ? __alloc_pages_nodemask+0x1a9/0x200
Feb  8 08:03:26 dataset1001 kernel: [38432.326638]  [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.326642]  [<ffffffff81152db5>] ? new_slab+0x295/0x2a0
Feb  8 08:03:26 dataset1001 kernel: [38432.326645]  [<ffffffff811544bd>] ? __slab_alloc+0x17d/0x320
Feb  8 08:03:26 dataset1001 kernel: [38432.326650]  [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130
Feb  8 08:03:26 dataset1001 kernel: [38432.326656]  [<ffffffff81573fb0>] ? ip6_pol_route_input+0x0/0x20
Feb  8 08:03:26 dataset1001 kernel: [38432.326660]  [<ffffffff81157134>] ? __kmalloc_node_track_caller+0xa4/0x1d0
Feb  8 08:03:26 dataset1001 kernel: [38432.326663]  [<ffffffff81521a1a>] ? tcp_rcv_established+0x26a/0x6e0
Feb  8 08:03:26 dataset1001 kernel: [38432.326667]  [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130
Feb  8 08:03:26 dataset1001 kernel: [38432.326672]  [<ffffffff814c8393>] ? __alloc_skb+0x83/0x170
Feb  8 08:03:26 dataset1001 kernel: [38432.326676]  [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130
Feb  8 08:03:26 dataset1001 kernel: [38432.326680]  [<ffffffff8151bafe>] ? __tcp_ack_snd_check+0x5e/0xa0
Feb  8 08:03:26 dataset1001 kernel: [38432.326684]  [<ffffffff815219f9>] ? tcp_rcv_established+0x249/0x6e0
Feb  8 08:03:26 dataset1001 kernel: [38432.326688]  [<ffffffff81529dc5>] ? tcp_v4_do_rcv+0x125/0x1c0
Feb  8 08:03:26 dataset1001 kernel: [38432.326692]  [<ffffffff8152b5c9>] ? tcp_v4_rcv+0x5a9/0x840
Feb  8 08:03:26 dataset1001 kernel: [38432.326697]  [<ffffffff814fffe4>] ? nf_iterate+0x84/0xb0
Feb  8 08:03:26 dataset1001 kernel: [38432.326701]  [<ffffffff815080fd>] ? ip_local_deliver_finish+0xdd/0x290
Feb  8 08:03:26 dataset1001 kernel: [38432.326704]  [<ffffffff81508330>] ? ip_local_deliver+0x80/0x90
Feb  8 08:03:26 dataset1001 kernel: [38432.326708]  [<ffffffff81507901>] ? ip_rcv_finish+0x121/0x3f0
Feb  8 08:03:26 dataset1001 kernel: [38432.326711]  [<ffffffff81507f4d>] ? ip_rcv+0x23d/0x310
Feb  8 08:03:26 dataset1001 kernel: [38432.326715]  [<ffffffff814d1a0a>] ? __netif_receive_skb+0x40a/0x690
Feb  8 08:03:26 dataset1001 kernel: [38432.326720]  [<ffffffff814d6ac0>] ? netif_receive_skb+0x80/0x90
Feb  8 08:03:26 dataset1001 kernel: [38432.326724]  [<ffffffff814d6cd5>] ? napi_gro_complete+0x95/0xe0
Feb  8 08:03:26 dataset1001 kernel: [38432.326728]  [<ffffffff814d715a>] ? napi_complete+0x3a/0x80
Feb  8 08:03:26 dataset1001 kernel: [38432.326736]  [<ffffffffa00ddbc5>] ? bnx2_poll_msix+0x85/0xc0 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.326741]  [<ffffffff814d72a8>] ? net_rx_action+0x108/0x2d0
Feb  8 08:03:26 dataset1001 kernel: [38432.326746]  [<ffffffff8108af73>] ? hrtimer_get_next_event+0xb3/0xf0
Feb  8 08:03:26 dataset1001 kernel: [38432.326751]  [<ffffffff8106c04b>] ? __do_softirq+0xab/0x200
Feb  8 08:03:26 dataset1001 kernel: [38432.326757]  [<ffffffff810d2330>] ? handle_IRQ_event+0x50/0x170
Feb  8 08:03:26 dataset1001 kernel: [38432.326762]  [<ffffffff810132e9>] ? sched_clock+0x9/0x10
Feb  8 08:03:26 dataset1001 kernel: [38432.326766]  [<ffffffff8100cf9c>] ? call_softirq+0x1c/0x30
Feb  8 08:03:26 dataset1001 kernel: [38432.326769]  [<ffffffff8100ea05>] ? do_softirq+0x65/0xa0
Feb  8 08:03:26 dataset1001 kernel: [38432.326773]  [<ffffffff8106bf15>] ? irq_exit+0x85/0x90
Feb  8 08:03:26 dataset1001 kernel: [38432.326777]  [<ffffffff815cf326>] ? do_IRQ+0x66/0xe0
Feb  8 08:03:26 dataset1001 kernel: [38432.326782]  [<ffffffff815c7753>] ? ret_from_intr+0x0/0x15
Feb  8 08:03:26 dataset1001 kernel: [38432.326784]  <EOI>  [<ffffffff81334861>] ? intel_idle+0xc1/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.326793]  [<ffffffff81334844>] ? intel_idle+0xa4/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.326799]  [<ffffffff814a7182>] ? cpuidle_idle_call+0xb2/0x1b0
Feb  8 08:03:26 dataset1001 kernel: [38432.326805]  [<ffffffff8100b077>] ? cpu_idle+0xb7/0x110
Feb  8 08:03:26 dataset1001 kernel: [38432.326809]  [<ffffffff815c085d>] ? start_secondary+0xc2/0xc4
Feb  8 08:03:26 dataset1001 kernel: [38432.326812] Mem-Info:
Feb  8 08:03:26 dataset1001 kernel: [38432.326814] Node 0 DMA per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.326817] CPU    0: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326819] CPU    1: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326822] CPU    2: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326824] CPU    3: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326826] Node 0 DMA32 per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.326829] CPU    0: hi:  186, btch:  31 usd: 179
Feb  8 08:03:26 dataset1001 kernel: [38432.326831] CPU    1: hi:  186, btch:  31 usd:  15
Feb  8 08:03:26 dataset1001 kernel: [38432.326834] CPU    2: hi:  186, btch:  31 usd: 156
Feb  8 08:03:26 dataset1001 kernel: [38432.326836] CPU    3: hi:  186, btch:  31 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326838] Node 0 Normal per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.326841] CPU    0: hi:  186, btch:  31 usd:  88
Feb  8 08:03:26 dataset1001 kernel: [38432.326843] CPU    1: hi:  186, btch:  31 usd:  21
Feb  8 08:03:26 dataset1001 kernel: [38432.326845] CPU    2: hi:  186, btch:  31 usd: 161
Feb  8 08:03:26 dataset1001 kernel: [38432.326848] CPU    3: hi:  186, btch:  31 usd:  28
Feb  8 08:03:26 dataset1001 kernel: [38432.326853] active_anon:31711 inactive_anon:6004 isolated_anon:0
Feb  8 08:03:26 dataset1001 kernel: [38432.326854]  active_file:26426 inactive_file:3857074 isolated_file:0
Feb  8 08:03:26 dataset1001 kernel: [38432.326856]  unevictable:1019 dirty:729121 writeback:11217 unstable:0
Feb  8 08:03:26 dataset1001 kernel: [38432.326857]  free:18317 slab_reclaimable:79292 slab_unreclaimable:9887
Feb  8 08:03:26 dataset1001 kernel: [38432.326858]  mapped:2051 shmem:71 pagetables:867 bounce:0
Feb  8 08:03:26 dataset1001 kernel: [38432.326861] Node 0 DMA free:15868kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB 
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_rec
laimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  8 08:03:26 dataset1001 kernel: [38432.326872] lowmem_reserve[]: 0 3243 16121 16121
Feb  8 08:03:26 dataset1001 kernel: [38432.326876] Node 0 DMA32 free:52716kB min:3264kB low:4080kB high:4896kB active_anon:18048kB inactive_anon:7992kB a
ctive_file:4864kB inactive_file:3085260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:0kB dirty:578832kB writeback:74
12kB mapped:148kB shmem:0kB slab_reclaimable:117856kB slab_unreclaimable:6472kB kernel_stack:56kB pagetables:292kB unstable:0kB bounce:0kB writeback_tmp:
0kB pages_scanned:0 all_unreclaimable? no
Feb  8 08:03:26 dataset1001 kernel: [38432.326889] lowmem_reserve[]: 0 0 12877 12877
Feb  8 08:03:26 dataset1001 kernel: [38432.326893] Node 0 Normal free:4684kB min:12972kB low:16212kB high:19456kB active_anon:108796kB inactive_anon:1602
4kB active_file:100840kB inactive_file:12343036kB unevictable:4076kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:4028kB dirty:233765
2kB writeback:37456kB mapped:8056kB shmem:284kB slab_reclaimable:199312kB slab_unreclaimable:33076kB kernel_stack:952kB pagetables:3176kB unstable:0kB bo
unce:0kB writeback_tmp:0kB pages_scanned:76 all_unreclaimable? no
Feb  8 08:03:26 dataset1001 kernel: [38432.326905] lowmem_reserve[]: 0 0 0 0
Feb  8 08:03:26 dataset1001 kernel: [38432.326909] Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 1586
8kB
Feb  8 08:03:26 dataset1001 kernel: [38432.326920] Node 0 DMA32: 368*4kB 249*8kB 634*16kB 1146*32kB 4*64kB 0*128kB 3*256kB 3*512kB 0*1024kB 0*2048kB 0*40
96kB = 52840kB
Feb  8 08:03:26 dataset1001 kernel: [38432.326930] Node 0 Normal: 1060*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 
= 4560kB
Feb  8 08:03:26 dataset1001 kernel: [38432.326941] 3884258 total pagecache pages
Feb  8 08:03:26 dataset1001 kernel: [38432.326943] 0 pages in swap cache
Feb  8 08:03:26 dataset1001 kernel: [38432.326945] Swap cache stats: add 0, delete 0, find 0/0
Feb  8 08:03:26 dataset1001 kernel: [38432.326947] Free swap  = 976892kB
Feb  8 08:03:26 dataset1001 kernel: [38432.326948] Total swap = 976892kB
Feb  8 08:03:26 dataset1001 kernel: [38432.350585] 4194288 pages RAM
Feb  8 08:03:26 dataset1001 kernel: [38432.350588] 81311 pages reserved
Feb  8 08:03:26 dataset1001 kernel: [38432.350589] 2544422 pages shared
Feb  8 08:03:26 dataset1001 kernel: [38432.350591] 1558178 pages non-shared
Feb  8 08:03:26 dataset1001 kernel: [38432.350595] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Feb  8 08:03:26 dataset1001 kernel: [38432.350598]   cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 3, min order: 0
Feb  8 08:03:26 dataset1001 kernel: [38432.350602]   node 0: slabs: 2939, objs: 10974, free: 0

The first is a failure of the driver to atomically allocate memory for a socket buffer for an incoming packet; this happens at a point where it's not possible to wait for a page to be reclaimed, so it probably means a dropped packet. The second stack trace is in the code path for sending a TCP ACK, which means the remote end would have to retransmit. We saw two groups of 5-6 stack traces, one at 08:03:26 and one at 11:23:17. AFAICT these two incidents didn't affect network traffic beyond short hiccups. More recent versions of the driver use kmalloc instead of __netdev_alloc_skb and handle allocation failure differently; if the issue turns out to be serious we could look into backporting those changes.

We did see more of these later on Feb 8 and 9, a total of about 5 groups of 5 stack traces within 3 hours. Not sure of the effect on the remote end.

One possibility that has been reported on various mailing lists is to increase the setting for /proc/sys/vm/min_free_kbytes, but this needs more checking.
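The idea behind raising min_free_kbytes is that the kernel keeps a larger pool of free pages in reserve, which gives atomic (non-sleeping) allocations like the skb allocations in the traces above a better chance of succeeding. A minimal sketch; the value 65536 is illustrative, not a tested recommendation:

```shell
# current reserve (in kB) the kernel tries to keep free
cat /proc/sys/vm/min_free_kbytes

# to raise it at runtime (as root; 65536 kB is an illustrative value):
#   sysctl -w vm.min_free_kbytes=65536
# and to persist across reboots, add to /etc/sysctl.conf:
#   vm.min_free_kbytes = 65536
```

Setting it too high wastes memory that could be used for page cache, which matters on a box whose whole job is streaming large files, so any change should be tested under the same rsync load.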

Feb 9 2012

Saw a single stack trace from xfs:

Feb  9 02:54:20 dataset1001 kernel: [34353.016174] rsync           D ffff880407bb3170     0  6830   4054 0x00000000
Feb  9 02:54:20 dataset1001 kernel: [34353.016180]  ffff88003749ba28 0000000000000082 ffff88003749b9f8 ffff880000000000
Feb  9 02:54:20 dataset1001 kernel: [34353.016184]  0000000000013cc0 ffff880407bb2dc0 ffff880407bb3170 ffff88003749bfd8
Feb  9 02:54:20 dataset1001 kernel: [34353.016188]  ffff880407bb3178 0000000000013cc0 ffff88003749a010 0000000000013cc0
Feb  9 02:54:20 dataset1001 kernel: [34353.016193] Call Trace:
Feb  9 02:54:20 dataset1001 kernel: [34353.016203]  [<ffffffff815c558d>] schedule_timeout+0x21d/0x300
Feb  9 02:54:20 dataset1001 kernel: [34353.016209]  [<ffffffff812bbf84>] ? blk_unplug+0x34/0x70
Feb  9 02:54:20 dataset1001 kernel: [34353.016214]  [<ffffffff8148dfec>] ? dm_table_unplug_all+0x5c/0x110
Feb  9 02:54:20 dataset1001 kernel: [34353.016218]  [<ffffffff815c651e>] __down+0x7e/0xc0
Feb  9 02:54:20 dataset1001 kernel: [34353.016257]  [<ffffffffa0162d55>] ? _xfs_buf_find+0xf5/0x280 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016263]  [<ffffffff8108c991>] down+0x41/0x50
Feb  9 02:54:20 dataset1001 kernel: [34353.016291]  [<ffffffffa0161a73>] xfs_buf_lock+0x53/0x120 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016318]  [<ffffffffa0162d55>] _xfs_buf_find+0xf5/0x280 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016346]  [<ffffffffa0162f40>] xfs_buf_get+0x60/0x1c0 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016374]  [<ffffffffa01638dc>] xfs_buf_read+0x2c/0x110 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016402]  [<ffffffffa0158ae0>] xfs_trans_read_buf+0x190/0x420 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016426]  [<ffffffffa012ba09>] xfs_da_do_buf+0x289/0x790 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016431]  [<ffffffff8116dde3>] ? generic_permission+0x23/0xb0
Feb  9 02:54:20 dataset1001 kernel: [34353.016436]  [<ffffffff815c705e>] ? _raw_spin_lock+0xe/0x20
Feb  9 02:54:20 dataset1001 kernel: [34353.016443]  [<ffffffff81181e30>] ? mntput_no_expire+0x60/0x190
Feb  9 02:54:20 dataset1001 kernel: [34353.016448]  [<ffffffff8116efc4>] ? finish_open+0xe4/0x1c0
Feb  9 02:54:20 dataset1001 kernel: [34353.016471]  [<ffffffffa012bf8a>] xfs_da_read_buf+0x2a/0x30 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016496]  [<ffffffffa01303d8>] ? xfs_dir2_block_getdents+0x98/0x220 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016520]  [<ffffffffa01303d8>] xfs_dir2_block_getdents+0x98/0x220 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016542]  [<ffffffffa0118a11>] ? xfs_bmap_last_offset+0x111/0x140 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016547]  [<ffffffff811753b0>] ? filldir+0x0/0xe0
Feb  9 02:54:20 dataset1001 kernel: [34353.016571]  [<ffffffffa012ebf7>] xfs_readdir+0xa7/0x130 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016575]  [<ffffffff811753b0>] ? filldir+0x0/0xe0
Feb  9 02:54:20 dataset1001 kernel: [34353.016580]  [<ffffffff811753b0>] ? filldir+0x0/0xe0
Feb  9 02:54:20 dataset1001 kernel: [34353.016607]  [<ffffffffa0164ac9>] xfs_file_readdir+0x39/0x50 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016611]  [<ffffffff81175548>] vfs_readdir+0xb8/0xe0
Feb  9 02:54:20 dataset1001 kernel: [34353.016616]  [<ffffffff811756d9>] sys_getdents+0x89/0xf0
Feb  9 02:54:20 dataset1001 kernel: [34353.016621]  [<ffffffff8100c082>] system_call_fastpath+0x16/0x1b
Feb  9 06:42:41 dataset1001 kernel: [48018.732976] kworker/0:0: page allocation failure. order:0, mode:0x4020
Feb  9 06:42:41 dataset1001 kernel: [48018.732981] Pid: 0, comm: kworker/0:0 Not tainted 2.6.38-13-generic #54~lucid1-Ubuntu

Investigating.

Other issues

Initial issues included a firmware bug on the receiving host's array that caused slow writes; the array was swapped out for different hardware, which fixed the issue.

The same rsync via a different route (not he.net) was much faster. Someone at he.net made an unspecified configuration change, which helped for a short time, and then things went downhill again. Using a different route we are still doing well. Speeds bounce around somewhat due to the way the rsync bwlimit works (burst, then throttle).

Reading material
