ibm.com/redbooks Redpaper
Front cover
Linux Performance and Tuning Guidelines
Eduardo Ciliendo
Takechika Kunimasa
Operating system tuning methods
Performance monitoring tools
Performance analysis
http://www.redbooks.ibm.com/ 
International Technical Support Organization
Linux Performance and Tuning Guidelines
July 2007
REDP-4285-00
© Copyright International Business Machines Corporation 2007. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.
First Edition (July 2007)
This edition applies to kernel 2.6 Linux distributions.
This paper was updated on April 25, 2008.
Note: Before using this information and the product it supports, read the information in “Notices” on 
page vii.
Contents
Notices
Trademarks
Preface
How this paper is structured
The team that wrote this paper
Become a published author
Comments welcome
Chapter 1. Understanding the Linux operating system
1.1 Linux process management
1.1.1 What is a process?
1.1.2 Life cycle of a process
1.1.3 Thread
1.1.4 Process priority and nice level
1.1.5 Context switching
1.1.6 Interrupt handling
1.1.7 Process state
1.1.8 Process memory segments
1.1.9 Linux CPU scheduler
1.2 Linux memory architecture
1.2.1 Physical and virtual memory
1.2.2 Virtual memory manager
1.3 Linux file systems
1.3.1 Virtual file system
1.3.2 Journaling
1.3.3 Ext2
1.3.4 Ext3
1.3.5 ReiserFS
1.3.6 Journal File System
1.3.7 XFS
1.4 Disk I/O subsystem
1.4.1 I/O subsystem architecture
1.4.2 Cache
1.4.3 Block layer
1.4.4 I/O device driver
1.4.5 RAID and storage system
1.5 Network subsystem
1.5.1 Networking implementation
1.5.2 TCP/IP
1.5.3 Offload
1.5.4 Bonding module
1.6 Understanding Linux performance metrics
1.6.1 Processor metrics
1.6.2 Memory metrics
1.6.3 Network interface metrics
1.6.4 Block device metrics
Chapter 2. Monitoring and benchmark tools
2.1 Introduction
2.2 Overview of tool functions
2.3 Monitoring tools
2.3.1 top
2.3.2 vmstat
2.3.3 uptime
2.3.4 ps and pstree
2.3.5 free
2.3.6 iostat
2.3.7 sar
2.3.8 mpstat
2.3.9 numastat
2.3.10 pmap
2.3.11 netstat
2.3.12 iptraf
2.3.13 tcpdump / ethereal
2.3.14 nmon
2.3.15 strace
2.3.16 Proc file system
2.3.17 KDE System Guard
2.3.18 Gnome System Monitor
2.3.19 Capacity Manager
2.4 Benchmark tools
2.4.1 LMbench
2.4.2 IOzone
2.4.3 netperf
2.4.4 Other useful tools
Chapter 3. Analyzing performance bottlenecks
3.1 Identifying bottlenecks
3.1.1 Gathering information
3.1.2 Analyzing the server’s performance
3.2 CPU bottlenecks
3.2.1 Finding CPU bottlenecks
3.2.2 SMP
3.2.3 Performance tuning options
3.3 Memory bottlenecks
3.3.1 Finding memory bottlenecks
3.3.2 Performance tuning options
3.4 Disk bottlenecks
3.4.1 Finding disk bottlenecks
3.4.2 Performance tuning options
3.5 Network bottlenecks
3.5.1 Finding network bottlenecks
3.5.2 Performance tuning options
Chapter 4. Tuning the operating system
4.1 Tuning principles
4.1.1 Change management
4.2 Installation considerations
4.2.1 Installation
4.2.2 Check the current configuration
4.2.3 Minimize resource use
4.2.4 SELinux
4.2.5 Compiling the kernel
4.3 Changing kernel parameters
4.3.1 Where the parameters are stored
4.3.2 Using the sysctl command
4.4 Tuning the processor subsystem
4.4.1 Tuning process priority
4.4.2 CPU affinity for interrupt handling
4.4.3 Considerations for NUMA systems
4.5 Tuning the vm subsystem
4.5.1 Setting kernel swap and pdflush behavior
4.5.2 Swap partition
4.5.3 HugeTLBfs
4.6 Tuning the disk subsystem
4.6.1 Hardware considerations before installing Linux
4.6.2 I/O elevator tuning and selection
4.6.3 File system selection and tuning
4.7 Tuning the network subsystem
4.7.1 Considerations of traffic characteristics
4.7.2 Speed and duplexing
4.7.3 MTU size
4.7.4 Increasing network buffers
4.7.5 Additional TCP/IP tuning
4.7.6 Performance impact of Netfilter
4.7.7 Offload configuration
4.7.8 Increasing the packet queues
4.7.9 Increasing the transmit queue length
4.7.10 Decreasing interrupts
Appendix A. Testing configurations
Hardware and software configurations
Linux installed on guest IBM z/VM systems
Linux installed on IBM System x servers
Abbreviations and acronyms
Related publications
IBM Redbooks
Other publications
Online resources
How to get IBM Redbooks
Help from IBM
Index
Notices
This information was developed for products and services offered in the U.S.A. 
IBM may not offer the products, services, or features discussed in this document in other countries. Consult 
your local IBM representative for information on the products and services currently available in your area. Any 
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, 
program, or service may be used. Any functionally equivalent product, program, or service that does not 
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to 
evaluate and verify the operation of any non-IBM product, program, or service. 
IBM may have patents or pending patent applications covering subject matter described in this document. The 
furnishing of this document does not give you any license to these patents. You can send license inquiries, in 
writing, to: 
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such 
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION 
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR 
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, 
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of 
express or implied warranties in certain transactions, therefore, this statement may not apply to you. 
This information could include technical inaccuracies or typographical errors. Changes are periodically made 
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make 
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time 
without notice. 
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any 
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the 
materials for this IBM product and use of those Web sites is at your own risk. 
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring 
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published 
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the 
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the 
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them 
as completely as possible, the examples include the names of individuals, companies, brands, and products. 
All of these names are fictitious and any similarity to the names and addresses used by an actual business 
enterprise is entirely coincidental. 
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming 
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in 
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application 
programs conforming to the application programming interface for the operating platform for which the sample 
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, 
cannot guarantee or imply reliability, serviceability, or function of these programs. 
Trademarks
The following terms are trademarks of the International Business Machines Corporation in the United States, 
other countries, or both: 
Redbooks (logo) ®
eServer™
xSeries®
z/OS®
AIX®
DB2®
DS8000™
IBM®
POWER™
Redbooks®
ServeRAID™
System i™
System p™
System x™
System z™
System Storage™
TotalStorage®
The following terms are trademarks of other companies:
Java, JDBC, Solaris, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United 
States, other countries, or both.
Excel, Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United 
States, other countries, or both.
Intel, Itanium, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of 
Intel Corporation or its subsidiaries in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others. 
Preface
Linux® is an open source operating system developed by people from all over the world. The 
source code is freely available and can be used under the GNU General Public License. The 
operating system is made available to users in the form of distributions from companies such 
as Red Hat and Novell. Some desktop Linux distributions can be downloaded at no charge 
from the Web, but the server versions typically must be purchased.
Over the past few years, Linux has made its way into the data centers of many corporations 
worldwide. The Linux operating system is accepted by both the scientific and enterprise user 
population. Today, Linux is by far the most versatile operating system. You can find Linux on 
systems ranging from embedded devices such as firewalls and cell phones to mainframes. 
Naturally, performance of the Linux operating system has become a hot topic for scientific and 
enterprise users. 
However, calculating a global weather forecast and hosting a database impose different 
requirements on an operating system. Linux must accommodate all possible usage scenarios 
with optimal performance. Most Linux distributions contain general tuning parameters to 
accommodate all users.
IBM® recognizes Linux as an operating system suitable for enterprise-level applications that 
run on IBM systems. Most enterprise applications are now available on Linux, including file 
and print servers, database servers, Web servers, and collaboration and mail servers.
The use of Linux in an enterprise-class server requires monitoring performance and, when 
necessary, tuning the server to remove bottlenecks that affect users. This IBM Redpaper 
publication describes the methods you can use to tune Linux, tools that you can use to 
monitor and analyze server performance, and key tuning parameters for specific server 
applications. The purpose of this paper is to explain how to analyze and tune the Linux 
operating system to yield superior performance for any type of application you plan to run on 
these systems.
The tuning parameters, benchmark results, and monitoring tools used in our test environment 
were executed on Red Hat and Novell SUSE Linux kernel 2.6 systems running on IBM 
System x™ servers and IBM System z™ servers. However, the information in this paper 
should be helpful for all Linux hardware platforms.
How this paper is structured
To help those of you who are new to Linux or performance tuning get started quickly, we have 
structured this paper in the following way:
• Chapter 1, “Understanding the Linux operating system” on page 1
This chapter introduces the factors that influence system performance and the way the 
Linux operating system manages system resources. You are introduced to several 
important performance metrics that are needed to quantify system performance.
• Chapter 2, “Monitoring and benchmark tools” on page 39
The second chapter introduces the various utilities that are available for Linux to measure 
and analyze systems performance.
• Chapter 3, “Analyzing performance bottlenecks” on page 77
This chapter introduces the process of identifying and analyzing bottlenecks in the system.
• Chapter 4, “Tuning the operating system” on page 91
With the basic knowledge of how the operating system works and how to use performance 
measurement utilities, you are ready to explore the various performance tweaks available 
in the Linux operating system.
The team that wrote this paper
This paper was produced by a team of specialists from around the world working at the 
International Technical Support Organization, Raleigh Center.
The team: Byron, Eduardo, Takechika
Eduardo Ciliendo is an Advisory IT Specialist working as a performance specialist on 
IBM Mainframe Systems in IBM Switzerland. He has more than 10 years of experience in 
computer sciences. Eddy studied Computer and Business Sciences at the University of 
Zurich and holds a post-diploma in Japanology. Eddy is a member of the zChampion team 
and holds several IT certifications including the RHCE title. As a Systems Engineer for 
IBM System z™, he works on capacity planning and systems performance for z/OS® and 
Linux for System z. Eddy has authored several publications on systems performance and 
Linux.
Takechika Kunimasa is an Associate IT Architect in IBM Global Services in Japan. He 
studied Electrical and Electronics engineering at Chiba University. He has more than 10 years 
of experience in the IT industry. He worked as a network engineer for five years, and he has 
been working in Linux technical support. His areas of expertise include Linux on System x™, 
Linux on System p™, Linux on System z, high-availability systems, networking, and 
infrastructure architecture design. He is a Cisco Certified Network Professional and a Red 
Hat Certified Engineer.
Byron Braswell is a Networking Professional at the International Technical Support 
Organization, Raleigh Center. He received a B.S. degree in Physics and an M.S. degree in 
Computer Sciences from Texas A&M University. He writes extensively in the areas of 
networking, application integration middleware, and personal computer software. Before 
joining the ITSO, Byron worked in IBM Learning Services Development in networking 
education development.
Thanks to the following people for their contributions to this project:
Margaret Ticknor
Carolyn Briscoe
International Technical Support Organization, Raleigh Center
Roy Costa
Michael B Schwartz
Frieder Hamm
International Technical Support Organization, Poughkeepsie Center
Christian Ehrhardt
Martin Kammerer
IBM Böblingen, Germany
Erwan Auffret
IBM France
Become a published author
Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with 
specific products or solutions, while getting hands-on experience with leading-edge 
technologies. You will have the opportunity to team with IBM technical professionals, 
Business Partners, and Clients. 
Your efforts will help increase product acceptance and customer satisfaction. As a bonus, 
you'll develop a network of contacts in IBM development labs and increase your productivity 
and marketability. 
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our papers to be as helpful as possible. Send us your comments about this paper or 
other IBM Redbooks® in one of the following ways:
• Use the online Contact us review redbook form found at:
ibm.com/redbooks
• Send your comments in an e-mail to:
redbooks@us.ibm.com
• Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Chapter 1. Understanding the Linux operating system
We begin this paper with an overview of how the Linux operating system handles its tasks 
and interacts with its hardware resources. Performance tuning is a challenging task 
that requires in-depth understanding of the hardware, operating system, and application. If 
performance tuning were simple, the parameters we are about to explore would be 
hard-coded into the firmware or the operating system and you would not be reading these 
lines. However, as shown in Figure 1-1, server performance is affected by multiple factors. 
Figure 1-1 Schematic interaction of different performance components (layers shown: Applications, Libraries, Kernel, Drivers, Firmware, Hardware)
You could tune the I/O subsystem for weeks in vain if the disk subsystem for a 20,000-user 
database server consisted of a single IDE drive. Often a new driver or an update to the 
application yields impressive performance gains. As we discuss specific details, keep in mind 
the whole picture of systems performance. Understanding the way an operating system 
manages the system resources helps us understand what subsystems we need to tune in any 
application scenario.
The following sections provide a short introduction to the architecture of the Linux operating 
system. A complete analysis of the Linux kernel is beyond the scope of this paper. You can 
refer to the kernel documentation for a complete reference of the Linux kernel. 
In this chapter we cover:
• 1.1, “Linux process management” on page 2
• 1.2, “Linux memory architecture” on page 10
• 1.3, “Linux file systems” on page 15
• 1.4, “Disk I/O subsystem” on page 19
• 1.5, “Network subsystem” on page 26
• 1.6, “Understanding Linux performance metrics” on page 34
1.1 Linux process management
Process management is one of the most important roles of any operating system. Effective 
process management enables an application to operate steadily and effectively.
Linux process management implementation is similar to UNIX® implementation. It includes 
process scheduling, interrupt handling, signaling, process prioritization, process switching, 
process state, process memory, and so on.
In this section, we discuss the fundamentals of the Linux process management 
implementation. It helps you understand how the way the Linux kernel deals with processes 
affects system performance.
1.1.1 What is a process?
A process is an instance of execution that runs on a processor. The process uses any 
resources that the Linux kernel can handle to complete its task.
Every process running on the Linux operating system is managed through a task_struct 
structure, which is also called a process descriptor. A process descriptor contains all the 
information necessary for a single process to run, such as process identification, attributes of 
the process, and the resources that make up the process. If you know the structure of the 
process, you can understand what is important for process execution and performance. 
Figure 1-2 shows the outline of structures related to process information.
Note: This paper focuses on the performance of the Linux operating system.
Figure 1-2 task_struct structure
[Figure: fields of the task_struct structure: state (process state); thread_info (process 
information and kernel stack); run_list, array (for process scheduling); mm (process address 
space); pid (process ID); group_info (group management); user (user management); fs (working 
directory, root directory); files (file descriptor); signal (signal information); sighand 
(signal handler). The thread_info structure holds task, flags, status, exec_domain, and the 
kernel stack. Other structures referenced: runqueue, mm_struct, group_info, user_struct, 
fs_struct, files_struct, signal_struct, sighand_struct, and the other structures.]
1.1.2 Life cycle of a process
Every process has its own life cycle consisting of creation, execution, termination, and 
removal. These phases are repeated literally millions of times as long as the system is up and 
running. Therefore, the process life cycle is very important from the performance perspective.
Figure 1-3 shows the typical life cycle of processes.
Figure 1-3 Life cycle of typical processes 
[Figure: the parent process issues fork() to create a child process; the child process calls 
exec() and later exit(), becoming a zombie process; the parent process calls wait().]
When a process creates a new process, the creating process (parent process) issues a 
fork() system call. When a fork() system call is issued, the kernel gets a process descriptor 
for the newly created process (child process) and sets a new process ID. It copies the values 
of the parent process’ process descriptor to the child’s. At this time the entire address space 
of the parent process is not copied; both processes share the same address space.
The exec() system call copies the new program to the address space of the child process. 
Because both processes share the same address space, writing new program data causes a 
page fault exception. At this point, the kernel assigns a new physical page to the child 
process.
This deferred operation is called copy-on-write. The child process usually executes its 
own program rather than the same program as its parent. Copy-on-write therefore avoids 
unnecessary overhead, because copying an entire address space is a very slow and 
inefficient operation that uses a lot of processor time and resources. 
When program execution has completed, the child process terminates with an exit() system 
call. The exit() system call releases most of the data structures of the process and notifies 
the parent process of the termination by sending a signal. At this time, the process is called a 
zombie process (refer to “Zombie processes” on page 7).
The child process is not completely removed until the parent process learns of the 
termination of its child process through the wait() system call. As soon as the parent process 
is notified of the child process termination, it removes all the data structures of the child 
process and releases the process descriptor. 
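The life cycle described above can be sketched from user space. The following is a minimal illustration, not kernel code: it assumes a UNIX-like system with a Python interpreter, and the function name fork_exec_wait is our own choice for this example.

```python
import os
import sys


def fork_exec_wait(exit_code):
    """Walk one child through fork(), exec(), exit(), and wait().

    Returns the exit status collected by the parent, or None if the
    child did not terminate normally.
    """
    pid = os.fork()  # the parent receives the child's PID, the child receives 0
    if pid == 0:
        # Child: replace the process image with a new program (exec).
        # The new program simply exits with the requested status.
        try:
            os.execv(sys.executable,
                     [sys.executable, "-c", "raise SystemExit(%d)" % exit_code])
        finally:
            os._exit(127)  # reached only if exec itself failed
    # Parent: wait() blocks until the child terminates, then reaps it so
    # the kernel can release the child's process descriptor.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
```

Between the child's exit and the parent's waitpid() call, the child is in the zombie state described later in this chapter.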
1.1.3 Thread
A thread is an execution unit generated in a single process. It runs in parallel with other 
threads in the same process. Threads can share the same resources such as memory, address 
space, open files, and so on. They can access the same set of application data. A thread is 
also called a Light Weight Process (LWP). Because they share resources, threads should not 
change the same shared resource at the same time. The implementation of mutual exclusion, 
locking, serialization, and so on, is the user application’s responsibility.
From the performance perspective, thread creation is less expensive than process creation 
because a thread does not need to copy resources on creation. On the other hand, processes 
and threads have similar characteristics in terms of the scheduling algorithm. The kernel deals 
with both of them in a similar manner. 
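As a small illustration of shared resources and application-level locking, the sketch below lets several threads update one counter under a lock. The function name and the counter layout are assumptions made for this example only.

```python
import threading


def count_with_threads(num_threads, increments):
    """Increment one shared counter from several threads.

    All threads see the same 'counter' object because threads share the
    address space of their process; the lock serializes the updates.
    """
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:  # mutual exclusion is the application's job
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter["value"]
```

Without the lock, the read-modify-write sequence in the worker could interleave between threads and lose updates.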
Figure 1-4 process and thread
[Figure: process creation copies the resources of the process to the new process; thread 
creation lets the threads of one process share its resources.]
In current Linux implementations, a thread is supported with the Portable Operating System 
Interface for UNIX (POSIX) compliant library (pthread). Several thread implementations are 
available in the Linux operating system. The following are widely used:
• LinuxThreads
  LinuxThreads has been the default thread implementation since Linux kernel 2.0. 
  LinuxThreads does not comply with the POSIX standard in some respects. Native 
  POSIX Thread Library (NPTL) is taking the place of LinuxThreads. LinuxThreads will 
  not be supported in future releases of Enterprise Linux distributions.
• Native POSIX Thread Library (NPTL)
  NPTL was originally developed by Red Hat. NPTL is more compliant with POSIX 
  standards. By taking advantage of enhancements in kernel 2.6 such as the new clone() 
  system call, signal handling implementation, and so on, it has better performance and 
  scalability than LinuxThreads. 
  NPTL has some incompatibility with LinuxThreads. An application that depends on 
  LinuxThreads might not work with the NPTL implementation.
• Next Generation POSIX Thread (NGPT)
  NGPT is an IBM developed version of the POSIX thread library. It is currently under 
  maintenance operation and no further development is planned.
Using the LD_ASSUME_KERNEL environment variable, you can choose which threads library the 
application should use. 
1.1.4 Process priority and nice level
Process priority is a number that determines the order in which the process is handled by the 
CPU and is determined by dynamic priority and static priority. A process which has higher 
process priority has a greater chance of getting permission to run on a processor. 
The kernel dynamically adjusts dynamic priority up and down as needed using a heuristic 
algorithm based on process behaviors and characteristics. A user process can change the 
static priority indirectly through the use of the nice level of the process. A process which has 
higher static priority will have a longer time slice (how long the process can run on a processor).
Linux supports nice levels from 19 (lowest priority) to -20 (highest priority). The default value 
is 0. To change the nice level of a program to a negative number (which makes it a higher 
priority), it is necessary to log on as root or use su to become root.
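A process can inspect and raise its own nice level without special privileges. The following sketch assumes a UNIX-like system; the helper name raise_nice_level is ours and is used only for illustration.

```python
import os


def raise_nice_level(increment):
    """Raise the nice level of the calling process (lowering its priority).

    Returns (nice level before, nice level after). Only non-negative
    increments are accepted here, because lowering the nice level
    requires root privileges.
    """
    if increment < 0:
        raise ValueError("a negative increment requires root privileges")
    before = os.getpriority(os.PRIO_PROCESS, 0)  # 0 means the calling process
    after = os.nice(increment)                   # returns the new nice level
    return before, after
```

Calling raise_nice_level(0) only reads the current value; an unprivileged process that raises its nice level generally cannot lower it again.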
1.1.5 Context switching
During process execution, information on the running process is stored in registers on the 
processor and its cache. The set of data that is loaded to the register for the executing 
process is called the context. To switch processes, the context of the running process is 
stored and the context of the next running process is restored to the register. The process 
descriptor and the area called kernel mode stack are used to store the context. This switching 
process is called context switching. Having too much context switching is undesirable 
because the processor has to flush its register and cache every time to make room for the 
new process. It could cause performance problems.
Figure 1-5 illustrates how the context switching works.
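Figure 1-5 Context switching
[Figure: the CPU registers (stack pointer, EIP register, other registers, and so on) are saved 
to the task_struct and stack of the suspended process A and reloaded from the task_struct and 
stack of the resumed process B; each process has its own address space.]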
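To get a feeling for how often a process is switched out, you can read the context switch counters that the kernel keeps for each process. This sketch uses the getrusage() interface; the function names are our own, and the counters are only meaningful on systems that maintain them (Linux does).

```python
import resource
import time


def context_switch_counts():
    """Return (voluntary, involuntary) context switches of this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def voluntary_switches_during(action):
    """Count the voluntary context switches that occur while action() runs."""
    before, _ = context_switch_counts()
    action()
    after, _ = context_switch_counts()
    return after - before


def short_sleep():
    """Give up the CPU briefly; sleeping normally causes a voluntary switch."""
    time.sleep(0.01)
```

A voluntary switch happens when the process blocks (for example, while sleeping or waiting for I/O); an involuntary switch happens when the scheduler preempts it.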
1.1.6 Interrupt handling
Interrupt handling is one of the highest priority tasks. Interrupts are usually generated by I/O 
devices such as a network interface card, keyboard, disk controller, serial adapter, and so on. 
The interrupt handler notifies the Linux kernel of an event (such as keyboard input, Ethernet 
frame arrival, and so on). It tells the kernel to interrupt process execution and perform 
interrupt handling as quickly as possible because some devices require quick 
responsiveness. This is critical for system stability. When an interrupt signal arrives at the 
kernel, the kernel must switch from the currently executing process to a new one to handle the 
interrupt. This means interrupts cause context switching, and therefore a large number 
of interrupts could cause performance degradation.
In Linux implementations, there are two types of interrupts. A hard interrupt is generated for 
devices which require responsiveness (disk I/O interrupt, network adapter interrupt, keyboard 
interrupt, mouse interrupt). A soft interrupt is used for tasks whose processing can be 
deferred (TCP/IP operation, SCSI protocol operation, and so on). You can see information 
related to hard interrupts at /proc/interrupts.
In a multi-processor environment, interrupts are handled by each processor. Binding 
interrupts to a single physical processor could improve system performance. For more details, 
refer to 4.4.2, “CPU affinity for interrupt handling” on page 108.
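The per-processor counters in /proc/interrupts can be read with a few lines of code. The parser below is a sketch that assumes the usual layout (a header line naming the CPUs, then one line per interrupt); the sample text is invented for illustration and does not come from a real system.

```python
def parse_interrupts(text):
    """Parse text in /proc/interrupts layout.

    Returns {label: (per-CPU counts, description)}. Lines such as "ERR:"
    may carry fewer counters than there are CPUs.
    """
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()  # for example ['CPU0', 'CPU1']
    result = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        result[label.strip()] = (counts, description)
    return result


# Invented sample in the layout of /proc/interrupts (not real measurements).
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
  0:        150          0   IO-APIC-edge      timer
 16:       1200       3400   IO-APIC-fasteoi   eth0
ERR:          0
"""
```

On a live system you could pass the contents of /proc/interrupts to parse_interrupts() and compare the per-CPU counts to see how interrupts are distributed across processors.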
1.1.7 Process state
Every process has its own state that shows what is currently happening in the process. 
Process state changes during process execution. Some of the possible states are as follows:
• TASK_RUNNING
  In this state, a process is running on a CPU or waiting to run in the queue (run queue).
• TASK_STOPPED
  A process suspended by certain signals (for example SIGSTOP, SIGTSTP) is in this state. 
  The process is waiting to be resumed by a signal such as SIGCONT.
• TASK_INTERRUPTIBLE
  In this state, the process is suspended and waits for a certain condition to be satisfied. If a 
  process is in TASK_INTERRUPTIBLE state and it receives a signal, the process 
  state is changed and the wait is interrupted. A typical example of a 
  TASK_INTERRUPTIBLE process is a process waiting for keyboard input.
• TASK_UNINTERRUPTIBLE
  Similar to TASK_INTERRUPTIBLE. While a process in TASK_INTERRUPTIBLE state can 
  be interrupted, sending a signal does nothing to a process in 
  TASK_UNINTERRUPTIBLE state. A typical example of a TASK_UNINTERRUPTIBLE 
  process is a process waiting for a disk I/O operation.
• TASK_ZOMBIE
  After a process exits with the exit() system call, its parent should be informed of the 
  termination. In TASK_ZOMBIE state, a process is waiting for its parent to be notified so 
  that all of its data structures can be released.
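Figure 1-6 Process state
[Figure: state diagram showing fork(), TASK_RUNNING (READY), TASK_RUNNING on a processor 
(linked by scheduling and preemption), TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, 
TASK_STOPPED, and TASK_ZOMBIE (reached through exit()).]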
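On a running system, the state of a process is visible as a one-letter code in /proc/<pid>/stat. The sketch below extracts that letter; the mapping to the state names used in this section and the helper name state_from_stat are our own additions for illustration.

```python
# One-letter state codes mapped to the state names used in this section.
STATE_NAMES = {
    "R": "TASK_RUNNING",
    "S": "TASK_INTERRUPTIBLE",
    "D": "TASK_UNINTERRUPTIBLE",
    "T": "TASK_STOPPED",
    "Z": "TASK_ZOMBIE",
}


def state_from_stat(stat_line):
    """Return the state letter from one line in /proc/<pid>/stat layout.

    The command name is enclosed in parentheses and may itself contain
    spaces or parentheses, so the state is taken as the first field
    after the last closing parenthesis.
    """
    after_command = stat_line[stat_line.rindex(")") + 1:]
    return after_command.split()[0]
```

For example, reading /proc/self/stat and passing its contents to state_from_stat() normally yields "R", because the process is running while it reads the file.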
Zombie processes
A process that has terminated, but whose parent has not yet collected its exit status with 
wait(), is a zombie. Most of its resources (such as open files and memory) have already been 
released; only the process descriptor remains so that the parent can read the exit status. 
Normally a process is a zombie for only a very short time. 
If the parent process does not call wait(), the zombie remains in the process table, where it 
shows a status of Z (zombie).
It is not possible to kill such a process with the kill command, because it is already 
considered dead. If you cannot get rid of a zombie, you can kill the parent process; the zombie 
is then adopted by the init process, which reaps it, and the zombie disappears as well. 
However, if the parent process is the init process, you should not kill it. The init process is a 
very important process, so a reboot might be needed to get rid of the zombie process.
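The zombie state can be observed directly. The following sketch assumes a Linux system with /proc mounted and uses waitid() with WNOWAIT, which waits for the child to exit without reaping it; the function name observe_zombie is our own.

```python
import os


def observe_zombie():
    """Create a short-lived zombie and return its state letter.

    Returns "Z" when the zombie could be observed, or None if the
    /proc entry could not be read (for example, /proc is not mounted).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child terminates immediately
    # Wait until the child has exited, but leave it unreaped (WNOWAIT),
    # so that its process descriptor stays in the process table.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    try:
        with open("/proc/%d/stat" % pid) as stat_file:
            line = stat_file.read()
        state = line[line.rindex(")") + 1:].split()[0]
    except OSError:
        state = None
    os.waitpid(pid, 0)  # now reap the child; the zombie disappears
    return state
```

The final waitpid() call is what a well-behaved parent does: it collects the exit status and lets the kernel release the process descriptor.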
1.1.8 Process memory segments
A process uses its own memory area to perform work. The work varies depending on the 
situation and process usage. A process can have different workload characteristics and 
different data size requirements. The process has to handle a variety of data sizes. To 
satisfy this requirement, the Linux kernel uses a dynamic memory allocation mechanism for 
each process. The process memory allocation structure is shown in Figure 1-7.
Figure 1-7 Process address space
[Figure: process address space starting at address 0x0000: text segment (executable 
instructions, read-only); data segment with data (initialized data), BSS (zero-initialized 
data), and heap (dynamic memory allocation by malloc()); stack segment (local variables, 
function parameters, return address, and so on).]
The process memory area consists of these segments:
• Text segment
  The area where executable code is stored.
• Data segment
  The data segment consists of these three areas.
  – Data: The area where initialized data such as static variables is stored.
  – BSS: The area where zero-initialized data is stored.
  – Heap: The area where malloc() allocates dynamic memory based on the demand. 
    The heap grows towards higher addresses.
• Stack segment
  The area where local variables, function parameters, and the return address of a function 
  are stored. The stack grows toward lower addresses.
The memory allocation of a user process address space can be displayed with the pmap 
command. You can display the total size of the segment with the ps command. Refer to 
2.3.10, “pmap” on page 52 and 2.3.4, “ps and pstree” on page 44.
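The same information that pmap displays is available in /proc/<pid>/maps. The sketch below sums the sizes of the mapped regions by name and permissions; the sample lines are invented for illustration, and the function name region_sizes is our own.

```python
def region_sizes(maps_text):
    """Sum mapped bytes per (name, permissions) from /proc/<pid>/maps text.

    Each line has the layout:
    start-end permissions offset device inode [pathname]
    """
    sizes = {}
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(value, 16) for value in fields[0].split("-"))
        name = fields[5].strip() if len(fields) > 5 else "[anonymous]"
        key = (name, fields[1])
        sizes[key] = sizes.get(key, 0) + (end - start)
    return sizes


# Invented sample in the layout of /proc/<pid>/maps (not from a real process).
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
0060a000-0060b000 rw-p 0000a000 08:01 1234 /bin/cat
0060c000-0062d000 rw-p 00000000 00:00 0 [heap]
7ffc1000-7ffe2000 rw-p 00000000 00:00 0 [stack]
"""
```

In the sample, the read-only executable mapping of /bin/cat corresponds to the text segment, the writable mapping of the same file to the data area, and the [heap] and [stack] entries to the heap and stack.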
1.1.9 Linux CPU scheduler
The basic functionality of any computer is, quite simply, to compute. To be able to compute, 
there must be a means to manage the computing resources, or processors, and the 
computing tasks, also known as threads or processes. Thanks to the great work of Ingo 
Molnar, Linux features a kernel that uses an O(1) scheduling algorithm, as opposed to the 
O(n) algorithm of the former CPU scheduler. The term O(1) refers to a constant-time 
algorithm, meaning that the time taken to choose a process for placing into execution is 
constant, regardless of the number of processes. 
The new scheduler scales very well, regardless of process count or processor count, and 
imposes a low overhead on the system. The algorithm uses two process priority arrays:
• active
• expired
As processes are allocated a timeslice by the scheduler, based on their priority and prior 
blocking rate, they are placed in a list of processes for their priority in the active array. When 
they expire their timeslice, they are allocated a new timeslice and placed on the expired array. 
When all processes in the active array have expired their timeslice, the two arrays are 
switched, restarting the algorithm. For general interactive processes (as opposed to real-time 
processes) this results in high-priority processes, which typically have long timeslices, getting 
more compute time than low-priority processes, but not to the point where they can starve the 
low-priority processes completely. The advantage of such an algorithm is the vastly improved 
scalability of the Linux kernel for enterprise workloads that often include large numbers of 
threads or processes and also a significant number of processors. The new O(1) CPU 
scheduler was designed for kernel 2.6 but backported to the 2.4 kernel family. Figure 1-8 on 
page 9 illustrates how the Linux CPU scheduler works.
Figure 1-8 Linux kernel 2.6 O(1) scheduler
[Figure: two priority arrays, active (array[0]) and expired (array[1]), each holding one list of 
processes (P) per priority level from priority 0 to priority 139.]
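The interplay of the active and expired arrays can be sketched in a few lines. This is a simplified model, not the kernel implementation: it assumes that every selected task consumes its whole timeslice, ignores timeslice lengths and dynamic priority, and uses class names of our own choosing.

```python
from collections import deque

NUM_PRIORITIES = 140  # priority 0 (highest) to priority 139 (lowest)


class PriorityArray:
    """One run list per priority plus a bitmap of non-empty lists."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]
        self.bitmap = 0  # bit p is set while queues[p] is non-empty

    def enqueue(self, priority, task):
        self.queues[priority].append(task)
        self.bitmap |= 1 << priority

    def dequeue_highest(self):
        """Remove and return (priority, task) of the best runnable task."""
        if self.bitmap == 0:
            return None
        # The lowest set bit marks the highest-priority non-empty list.
        priority = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.queues[priority].popleft()
        if not self.queues[priority]:
            self.bitmap &= ~(1 << priority)
        return priority, task


class O1SchedulerModel:
    """Simplified model of the active/expired array algorithm."""

    def __init__(self):
        self.active = PriorityArray()
        self.expired = PriorityArray()

    def add(self, priority, task):
        self.active.enqueue(priority, task)

    def pick_next(self):
        """Select the next task and move it to the expired array."""
        if self.active.bitmap == 0:
            # Every task has used its timeslice: switch the two arrays.
            self.active, self.expired = self.expired, self.active
        picked = self.active.dequeue_highest()
        if picked is None:
            return None
        priority, task = picked
        self.expired.enqueue(priority, task)  # timeslice used up
        return task
```

Because the bitmap has a fixed width of 140 bits, finding the next task takes the same amount of work whether ten or ten thousand tasks are queued, which is the property the term O(1) refers to.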
Another significant advantage of the new scheduler is the support for Non-Uniform Memory 
Architecture (NUMA) and simultaneous multithreading processors, such as Intel® 
Hyper-Threading technology.
The improved NUMA support ensures that load balancing will not occur across NUMA nodes 
unless a node gets overburdened. This mechanism ensures that traffic over the comparatively 
slow scalability links in a NUMA system is minimized. Although load is balanced across the 
processors in a scheduler domain group with every scheduler tick, load balancing across 
scheduler domains occurs only if a node is overloaded and asks for load balancing. 
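Figure 1-9 Architecture of the O(1) CPU scheduler on an 8-way NUMA based system with 
Hyper-Threading enabled
[Figure: hardware labels: two node xSeries 445 (8 CPU), one CEC (4 CPU), one Xeon MP (HT), 
one HT CPU. Scheduler labels: parent scheduler domain, child scheduler domain, scheduler 
domain group, logical CPU. Annotations: load balancing only if a child is overburdened; load 
balancing via scheduler_tick() and time slice; load balancing via scheduler_tick().]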
1.2 Linux memory architecture
To execute a process, the Linux kernel allocates a portion of the memory area to the 
requesting process. The process uses the memory area as workspace and performs the 
required work. It is similar to you having your own desk allocated and then using the desktop 
to scatter papers, documents and memos to perform your work. The difference is that the 
kernel has to allocate space in a more dynamic manner. The number of running processes 
sometimes comes to tens of thousands and the amount of memory is usually limited. 
Therefore, the Linux kernel must handle the memory efficiently. In this section, we describe 
the Linux memory architecture, address layout, and how Linux manages memory space efficiently.
1.2.1 Physical and virtual memory
Today we are faced with the choice of 32-bit systems and 64-bit systems. One of the most 
important differences for enterprise-class clients is the possibility of virtual memory 
addressing above 4 GB. From a performance point of view, it is interesting to understand how 
the Linux kernel maps physical memory into virtual memory on both 32-bit and 64-bit 
systems. 
As you can see in Figure 1-10 on page 11, there are obvious differences in the way the Linux 
kernel has to address memory in 32-bit and 64-bit systems. Exploring the physical-to-virtual 
mapping in detail is beyond the scope of this paper, so we highlight some specifics in the 
Linux memory architecture. 
On 32-bit architectures such as the IA-32, the Linux kernel can directly address only the first 
gigabyte of physical memory (896 MB when considering the reserved range). Memory above 
the so-called ZONE_NORMAL must be mapped into the lower 1 GB. This mapping is 
completely transparent to applications, but allocating a memory page in ZONE_HIGHMEM 
causes a small performance degradation. 
On the other hand, with 64-bit architectures such as x86-64 (also x64), ZONE_NORMAL 
extends all the way to 64 GB or to 128 GB in the case of IA-64 systems. As you can see, the 
overhead of mapping memory pages from ZONE_HIGHMEM into ZONE_NORMAL can be 
eliminated by using a 64-bit architecture.
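Figure 1-10 Linux kernel memory layout for 32-bit and 64-bit systems
[Figure: 32-bit architecture: ZONE_DMA (up to 16 MB), ZONE_NORMAL (up to 896 MB), a 128 MB 
“Reserved” range (reserved for kernel data structures) up to 1 GB, and ZONE_HIGHMEM (up to 
64 GB), whose pages must be mapped into ZONE_NORMAL. 64-bit architecture: ZONE_DMA and 
ZONE_NORMAL, with marks at 1 GB and 64 GB.]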
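The zone boundaries for a 32-bit system can be expressed as a small lookup. This is only an illustration of the 16 MB and 896 MB limits named in this section; the constant and function names are our own and do not correspond to kernel symbols.

```python
MB = 1024 * 1024

# Boundaries on a 32-bit system as described in this section.
ZONE_DMA_END = 16 * MB       # ZONE_DMA covers memory below 16 MB
ZONE_NORMAL_END = 896 * MB   # ZONE_NORMAL ends at 896 MB


def zone_for_physical_address(address):
    """Return the zone name for a physical address on a 32-bit system."""
    if address < 0:
        raise ValueError("a physical address cannot be negative")
    if address < ZONE_DMA_END:
        return "ZONE_DMA"
    if address < ZONE_NORMAL_END:
        return "ZONE_NORMAL"
    return "ZONE_HIGHMEM"  # must be mapped into the lower range before use
```

On such a system, every page above 896 MB falls into ZONE_HIGHMEM, which is why systems with several gigabytes of memory pay the mapping overhead described above.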
Virtual memory addressing layout
Figure 1-11 shows the Linux virtual addressing layout for 32-bit and 64-bit architectures. 
On 32-bit architectures, the maximum address space that a single process can access is 4 GB. 
This is a restriction derived from 32-bit virtual addressing. In a standard implementation, the 
virtual address space is divided into a 3 GB user space and a 1 GB kernel space. There are 
some variants, such as the 4 G/4 G addressing layout.
On the other hand, on 64-bit architectures such as x86_64 and ia64, no such restriction exists. 
Each single process can benefit from the vast address space.
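Figure 1-11 Virtual memory addressing layout for 32-bit and 64-bit architecture
[Figure: 32-bit architecture (3 G/1 G kernel): user space from 0 GB to 3 GB, kernel space from 
3 GB to 4 GB. 64-bit architecture (x86_64): user space starting at 0 GB, followed by kernel 
space, covering 512 GB or more.]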
1.2.2 Virtual memory manager
The physical memory architecture of an operating system is usually hidden from the application 
and the user because operating systems map any memory into virtual memory. If we want to 
understand the tuning possibilities within the Linux operating system, we have to understand 
how Linux handles virtual memory. As explained in 1.2.1, “Physical and virtual memory” on 
page 10, applications do not allocate physical memory, but request a memory map of a 
certain size from the Linux kernel and in exchange receive a map in virtual memory. As you can 
see in Figure 1-12, virtual memory does not necessarily have to be mapped into physical 
memory. If your application allocates a large amount of memory, some of it might be mapped 
to the swap file on the disk subsystem.
Figure 1-12 shows that applications usually do not write directly to the disk subsystem, but 
into cache or buffers. The pdflush kernel threads then flush out data in cache/buffers to the 
disk when they have time to do so or if a file size exceeds the buffer cache. Refer to “Flushing 
a dirty buffer” on page 22.
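Figure 1-12 The Linux virtual memory manager
[Figure: components shown: user space processes (sh, httpd, mozilla), standard C library 
(glibc), kernel subsystems, VM subsystem (slab allocator, zoned buddy allocator, kswapd, 
bdflush, MMU), physical memory, disk driver, and disk.]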
Closely connected to the way the Linux kernel handles writes to the physical disk subsystem 
is the way the Linux kernel manages disk cache. While other operating systems allocate only 
a certain portion of memory as disk cache, Linux handles the memory resource far more 
efficiently. The default configuration of the virtual memory manager allocates all available free 
memory space as disk cache. Hence it is not unusual to see production Linux systems that 
boast gigabytes of memory but only have 20 MB of that memory free. 
In the same context, Linux also handles swap space very efficiently. Swap space being used 
does not indicate a memory bottleneck but proves how efficiently Linux handles system 
resources. See “Page frame reclaiming” on page 14 for more detail.
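Because free memory is used as disk cache, the amount of free memory alone says little about memory pressure. The sketch below adds the buffer and cache figures from /proc/meminfo to the free memory to obtain a rough estimate of memory that could be made available; the sample values are invented, and the function names are our own.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        if not separator:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def free_plus_cache_kb(values):
    """Rough estimate: free memory plus buffers plus page cache, in kB."""
    return (values.get("MemFree", 0)
            + values.get("Buffers", 0)
            + values.get("Cached", 0))


# Invented sample in the layout of /proc/meminfo (not real measurements).
SAMPLE_MEMINFO = """\
MemTotal:      4194304 kB
MemFree:         20480 kB
Buffers:        102400 kB
Cached:        3072000 kB
"""
```

In the sample, only 20 MB is reported as free, yet about 3 GB is held in buffers and cache that the kernel can shrink when processes need memory.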
Page frame allocation
A page is a group of contiguous linear addresses in physical memory (page frame) or virtual 
memory. The Linux kernel handles memory in units of pages. A page is usually 4 KB in 
size. When a process requests a certain number of pages, if there are available pages the 
Linux kernel can allocate them to the process immediately. Otherwise pages have to be taken 
from some other process or page cache. The kernel knows how many memory pages are 
available and where they are located.
Buddy system
The Linux kernel maintains its free pages by using a mechanism called a buddy system. The 
buddy system maintains free pages and tries to allocate pages for page allocation requests. It 
tries to keep the memory area contiguous. If small pages are scattered without consideration, 
memory becomes fragmented and it is more difficult to allocate a large number of pages 
in a contiguous area. This could lead to inefficient memory use and performance decline. 
Figure 1-13 illustrates how the buddy system allocates pages.
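Figure 1-13 Buddy System
[Figure: an 8-page chunk is split into smaller chunks (4 pages, 2 pages) to satisfy requests 
for 2 pages; when 2 pages are released, free chunks are combined into larger chunks again.]
When an attempt to allocate pages fails, page reclaiming is activated. Refer to “Page 
frame reclaiming” on page 14.
You can find information on the buddy system through /proc/buddyinfo. For details, refer to 
“Memory used in a zone” on page 47.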
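The splitting and merging behavior of a buddy system can be modeled compactly. The following is a teaching sketch, not the kernel's allocator: it manages page numbers in a single zone, and the class and method names are our own.

```python
class BuddyAllocatorModel:
    """Simplified buddy system over 2**max_order pages.

    A chunk of order n contains 2**n contiguous pages. free_lists[n]
    holds the first page number of every free chunk of order n.
    """

    def __init__(self, max_order):
        self.max_order = max_order
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)  # one large free chunk at page 0

    def alloc(self, order):
        """Allocate 2**order pages; return the first page number or None."""
        for current in range(order, self.max_order + 1):
            if self.free_lists[current]:
                break
        else:
            return None  # no free chunk is large enough
        start = min(self.free_lists[current])
        self.free_lists[current].remove(start)
        # Split the chunk in halves until it has the requested size; the
        # upper half (the "buddy") of each split goes back to a free list.
        while current > order:
            current -= 1
            self.free_lists[current].add(start + (1 << current))
        return start

    def free(self, start, order):
        """Release a chunk and merge it with its buddy while possible."""
        while order < self.max_order:
            buddy = start ^ (1 << order)  # buddy address differs in one bit
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].add(start)
```

With max_order set to 3, two requests for 2 pages split the single 8-page chunk, and releasing both allocations merges the pieces back into one 8-page chunk.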
Page frame reclaiming
If pages are not available when a process requests to map a certain number of pages, the 
Linux kernel tries to get pages for the new request by releasing certain pages (pages that were 
used before but are no longer in use, yet are still marked as active) and allocating the 
memory to the new process. This process is called page reclaiming. The kswapd kernel thread 
and the try_to_free_page() kernel function are responsible for page reclaiming.
While kswapd is usually sleeping in TASK_INTERRUPTIBLE state, it is woken by the buddy 
system when free pages in a zone fall short of a threshold. It tries to find the candidate pages 
to be taken out of active pages based on the Least Recently Used (LRU) principle. The pages 
least recently used should be released first. The active list and the inactive list are used to 
maintain the candidate pages. kswapd scans part of the active list and checks how recently the 
pages were used; the pages not used recently are put into the inactive list. You can take a look 
at how much memory is considered as active and inactive using the vmstat -a command. For 
details, refer to 2.3.2, “vmstat” on page 42.
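The roles of the active list and the inactive list can be illustrated with a toy model. This sketch is a strong simplification of what the kernel does (it ignores zones, reference bits, and dirty pages), and the class and method names are our own.

```python
from collections import OrderedDict


class PageListsModel:
    """Toy model of the active and inactive page lists (oldest first)."""

    def __init__(self):
        self.active = OrderedDict()
        self.inactive = OrderedDict()

    def touch(self, page):
        """Record a page access: the page becomes the most recently used."""
        self.inactive.pop(page, None)
        self.active.pop(page, None)
        self.active[page] = True

    def age(self, count):
        """Move up to count least recently used pages to the inactive list."""
        for _ in range(min(count, len(self.active))):
            page, _unused = self.active.popitem(last=False)
            self.inactive[page] = True

    def reclaim(self, count):
        """Free up to count pages, taking the oldest inactive pages first."""
        freed = []
        while self.inactive and len(freed) < count:
            page, _unused = self.inactive.popitem(last=False)
            freed.append(page)
        return freed
```

A page that is accessed again while on the inactive list returns to the active list and escapes reclaiming, which is the essence of the LRU principle.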
kswapd also follows another principle. Pages are used mainly for two purposes: page 
cache and process address space. The page cache consists of pages mapped to a file on disk. 
The pages that belong to a process address space (called anonymous memory because it is not 
mapped to any file and has no name) are used for heap and stack. Refer to 1.1.8, 
“Process memory segments” on page 8. When kswapd reclaims pages, it would rather shrink 
the page cache than page out (or swap out) the pages owned by processes.
The proportion of page cache to process address space that is reclaimed depends on the 
usage scenario and affects performance. You can take 
some control of this behavior by using /proc/sys/vm/swappiness. Refer to 4.5.1, “Setting 
kernel swap and pdflush behavior” on page 109 for tuning details.
Page out and swap out: The phrases “page out” and “swap out” are sometimes 
confusing. The phrase “page out” means moving some pages (a part of the entire address 
space) into swap space, while “swap out” means moving the entire address space into swap 
space. They are sometimes used interchangeably.
Swap
As we stated before, when page reclaiming occurs, the candidate pages in the inactive list 
which belong to the process address space may be paged out. Using swap is not in itself a 
problem. While swap is nothing more than a guarantee in case of over allocation 
of main memory in other operating systems, Linux uses swap space far more efficiently. As 
you can see in Figure 1-12 on page 13, virtual memory is composed of both physical memory 
and the disk subsystem or the swap partition. If the virtual memory manager in Linux realizes 
that a memory page has been allocated but not used for a significant amount of time, it moves 
this memory page to swap space. 
Often you will see daemons such as getty that are launched when the system starts up but 
are hardly ever used. It is more efficient to free the expensive main 
memory of such a page and move the memory page to swap. This is exactly how Linux 
handles swap, so there is no need to be alarmed if you find the swap partition filled to 50%. 
The fact that swap space is being used does not indicate a memory bottleneck; instead it 
proves how efficiently Linux handles system resources. 
1.3 Linux file systems
One of the great advantages of Linux as an open source operating system is that it offers 
users a variety of supported file systems. Modern Linux kernels can support nearly every file 
system ever used by a computer system, from basic FAT support to high performance file 
systems such as the journaling file system (JFS). However, because Ext2, Ext3, and 
ReiserFS are native Linux file systems supported by most Linux distributions (ReiserFS is 
commercially supported only on Novell SUSE Linux), we will focus on their characteristics 
and give only an overview of the other frequently used Linux file systems.
For more information on file systems and the disk subsystem, see 4.6, “Tuning the disk 
subsystem” on page 112.
1.3.1 Virtual file system
Virtual File System (VFS) is an abstraction interface layer that resides between the user 
process and various types of Linux file system implementations. VFS provides common 
object models (such as i-node, file object, page cache, directory entry, and so on) and 
methods to access file system objects. It hides the differences of each file system 
implementation from user processes. Thanks to VFS, user processes do not need to know 
which file system is in use, or which system call should be issued for each file system. 
Figure 1-14 on page 16 illustrates the concept of VFS.
Chapter 1. Understanding the Linux operating system 15
Figure 1-14 VFS concept
1.3.2 Journaling
In a non-journaling file system, when a write is performed to a file system the Linux kernel 
makes changes to the file system metadata first and then writes actual user data next. This 
operation sometimes causes higher chances of losing data integrity. If the system suddenly 
crashes for some reason while the write operation to file system metadata is in process, the 
file system consistency may be broken. fsck fixes the inconsistency by checking all the 
metadata and recover the consistency at the time of next reboot. But when the system has a 
large volume, it takes a lot of time to be completed. The system is not operational during this 
process.
A Journaling file system solves this problem by writing data to be changed to the area called 
the journal area before writing the data to the actual file system. The journal area can be 
placed both in the file system or out of the file system. The data written to the journal area is 
called the journal log. It includes the changes to file system metadata and the actual file data 
if supported.
Because journaling writes journal logs before writing actual user data to the file system, it can 
cause performance overhead compared tono-journaling file system. How much performance 
overhead is sacrificed to maintain higher data consistency depends on how much information 
is written to disk before writing user data. We will discuss this topic in 1.3.4, “Ext3” on 
page 18.
Figure 1-15 Journaling concept
[Figure 1-14 content: a user process (cp) issues the open(), read(), and write() system calls 
to VFS, which translates them for each file system: ext2, ext3, Reiserfs, NFS, XFS, JFS, AFS, 
VFAT, and proc.]
[Figure 1-15 content: on a write, (1) journal logs are written to the journal area, (2) changes 
are made to the actual file system, and (3) the journal logs are deleted.]
1.3.3 Ext2
The extended 2 file system is the predecessor of the extended 3 file system. A fast, simple file 
system, it features no journaling capabilities, unlike most other current file systems.
Figure 1-16 shows the Ext2 file system data structure. The file system starts with the boot 
sector and is followed by block groups. Splitting the entire file system into several small block 
groups contributes to performance gain because the i-node table and data blocks which hold 
user data can reside closer on the disk platter, so seek time can be reduced. A block group 
consists of these items:
Super block             Information on the file system is stored here. An exact copy of the 
                        super block is placed at the top of every block group.
Block group descriptor  Information on the block group is stored here.
Data block bitmaps      Used for free data block management.
i-node bitmaps          Used for free i-node management.
i-node tables           The i-node tables are stored here. Every file has a corresponding 
                        i-node, which holds the metadata of the file such as file mode, uid, 
                        gid, atime, ctime, mtime, dtime, and pointers to the data blocks.
Data blocks             Where actual user data is stored.
Figure 1-16 Ext2 file system data structure
[Figure 1-16 content: the Ext2 file system starts with the boot sector, followed by block 
groups 0 to N; each block group contains the super block, block group descriptors, data block 
bitmaps, i-node bitmaps, the i-node table, and data blocks.]
To find the data blocks that make up a file, the kernel first looks up the i-node of the file. 
When a request to open /var/log/messages comes from a process, the kernel parses the file path 
and searches the directory entry of / (the root directory), which holds information about the 
files and directories directly under it. From there the kernel finds the i-node of /var and 
reads the directory entry of /var, which in turn holds information about the files and 
directories under /var. The kernel works its way down the path in the same manner until it 
finds the i-node of the file. The Linux kernel uses file object caches such as the directory 
entry cache and the i-node cache to accelerate finding the corresponding i-node.
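The lookup of /var/log/messages described above can be sketched as follows. This is a simplified model, not kernel code: the directory tables, the i-node numbers other than the root, and the PathResolver class are hypothetical. The only value taken from Ext2 itself is the root directory i-node number, 2.

```python
ROOT_INODE = 2  # Ext2 reserves i-node 2 for the root directory

# Hypothetical directory contents: directory i-node -> {name: i-node}
DIRECTORIES = {
    2: {"var": 11},
    11: {"log": 12},
    12: {"messages": 13},
}


class PathResolver:
    """Walks a path one component at a time, with a toy directory entry cache."""

    def __init__(self, directories):
        self.directories = directories
        self.dentry_cache = {}    # full path -> i-node number
        self.directory_reads = 0  # how many directories had to be searched

    def lookup(self, path):
        if path in self.dentry_cache:
            return self.dentry_cache[path]  # cache hit: no directory walk
        inode = ROOT_INODE
        for name in (part for part in path.split("/") if part):
            self.directory_reads += 1
            entries = self.directories.get(inode)
            if entries is None or name not in entries:
                raise FileNotFoundError(path)
            inode = entries[name]
        self.dentry_cache[path] = inode
        return inode
```

Calling lookup("/var/log/messages") twice on the same PathResolver searches three directories the first time and none the second time, which is the effect the directory entry cache is meant to have.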
Once the Linux kernel knows the i-node of the file, it tries to reach the actual user data 
blocks. As described above, the i-node holds pointers to the data blocks. By following them, 
the kernel can get to the data. For large files, Ext2 implements direct and indirect 
references to the data blocks. Figure 1-17 illustrates how it works.
Figure 1-17 Ext2 file system direct / indirect reference to data block
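To make the direct and indirect references concrete, the following sketch computes which kind of reference covers a given logical block of a file. It assumes the usual Ext2 layout of 12 direct pointers followed by one indirect, one double indirect, and one trebly indirect pointer, with 4-byte block pointers; the function name and the default 4 KB block size are choices made for this example.

```python
DIRECT_POINTERS = 12  # i_blocks[0] to i_blocks[11]


def reference_level(logical_block, block_size=4096, pointer_size=4):
    """Return the kind of reference that reaches a file's logical block."""
    if logical_block < 0:
        raise ValueError("logical block number must not be negative")
    pointers_per_block = block_size // pointer_size
    if logical_block < DIRECT_POINTERS:
        return "direct"
    remaining = logical_block - DIRECT_POINTERS
    # Each extra level of indirection multiplies the reachable block count.
    for depth, name in enumerate(
        ("indirect", "double indirect", "trebly indirect"), start=1
    ):
        reachable = pointers_per_block ** depth
        if remaining < reachable:
            return name
        remaining -= reachable
    raise ValueError("block number is beyond what the i-node can address")
```

With 4 KB blocks, one indirect block holds 1024 pointers, so only files larger than 12 blocks (48 KB) need any indirect block at all, and small files are reached with a single lookup.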
The file system structure and file access operations differ from one file system to another. 
This gives each file system different characteristics. 
1.3.4 Ext3
The current Enterprise Linux distributions support the extended 3 file system. This is an 
updated version of the widely used extended 2 file system. Though the fundamental 
structures are similar to the Ext2 file system, the major difference is the support of journaling 
capability. Highlights of this file system include:
• Availability: Ext3 always writes data to the disks in a consistent way, so in case of an 
unclean shutdown (unexpected power failure or system crash), the server does not have 
to spend time checking the consistency of the data, thereby reducing system recovery 
time from hours to seconds.
• Data integrity: By specifying the journaling mode data=journal on the mount command, all 
data, both file data and metadata, is journaled.
• Speed: By specifying the journaling mode data=writeback, you can trade integrity for 
speed to meet your business requirements. The difference is most noticeable in 
environments where there are heavy synchronous writes.
• Flexibility: Upgrading from existing Ext2 file systems is simple, and no reformatting is 
necessary. By executing the tune2fs command and modifying the /etc/fstab file, you can 
easily update an Ext2 to an Ext3 file system. Also note that Ext3 file systems can be 
mounted as Ext2 with journaling disabled. Products from many third-party vendors have 
the capability of manipulating Ext3 file systems. For example, PartitionMagic can handle 
the modification of Ext3 partitions.
[Figure 1-17 content: in the ext2 disk i-node, i_blocks[0] to i_blocks[11] are direct 
references to data blocks; i_blocks[12] is an indirect, i_blocks[13] a double indirect, and 
i_blocks[14] a trebly indirect reference, which reach data blocks through one, two, and three 
levels of indirect blocks.]
Mode of journaling
Ext3 supports three journaling modes.
• journal
This journaling option provides the highest form of data consistency by causing both file 
data and metadata to be journaled. It also has the highest performance overhead.
• ordered
In this mode only metadata is journaled. However, file data is guaranteed to be written to 
disk before the associated metadata is committed. This is the default setting.
• writeback
This journaling option provides the fastest access to the data at the expense of data 
consistency. The file system metadata is guaranteed to be consistent because it is still 
being logged. However, no special handling of actual file data is done, and this may lead 
to old data appearing in files after a system crash. 
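As a small illustration of how a journaling mode is selected, the sketch below builds (but does not run) a mount command with the data= option mentioned earlier. The helper name and the device and mount point in the usage note are hypothetical; the table simply restates the three modes described above.

```python
# What each Ext3 journaling mode writes to the journal, as described above.
JOURNAL_MODES = {
    "journal": ("metadata", "file data"),
    "ordered": ("metadata",),    # file data is written before metadata commit
    "writeback": ("metadata",),  # no ordering guarantee for file data
}


def ext3_mount_command(device, mount_point, mode="ordered"):
    """Build the argument list for mounting an Ext3 file system in a mode."""
    if mode not in JOURNAL_MODES:
        raise ValueError(f"unknown journaling mode: {mode}")
    return ["mount", "-t", "ext3", "-o", f"data={mode}", device, mount_point]
```

For example, ext3_mount_command("/dev/sdb1", "/data", "journal") returns the argument list for a full data journaling mount; passing the list to subprocess.run() would require root privileges and a real Ext3 device.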
1.3.5 ReiserFS
ReiserFS is a fast journaling file system with optimized disk space utilization and quick crash 
recovery. ReiserFS has been developed to a great extent with the help of Novell. ReiserFS is 
commercially supported only on Novell SUSE Linux.
1.3.6 Journal File System
The Journal File System (JFS) is a full 64-bit file system that can support very large files and 
partitions. JFS was developed by IBM originally for AIX® and is now available under the 
general public license (GPL). JFS is an ideal file system for very large partitions and file sizes 
that are typically encountered in high performance computing (HPC) or database 
environments. If you would like to learn more about JFS, refer to:
http://jfs.sourceforge.net
1.3.7 XFS
The eXtended File System (XFS) is a high-performance journaling file system developed by 
Silicon Graphics Incorporated originally for its IRIX family of systems. It features 
characteristics similar to JFS from IBM by also supporting very large file and partition sizes. 
Therefore, usage scenarios are very similar to JFS. 
Note: In Novell SUSE Linux Enterprise Server 10, JFS is no longer supported as a new file 
system.
1.4 Disk I/O subsystem
Before a processor can decode and execute instructions, data must be retrieved all the way 
from sectors on a disk platter to the processor cache and its registers. The results of the 
execution can be written back to the disk. 
We will take a look at the Linux disk I/O subsystem to better understand the components 
that have a major effect on system performance.
1.4.1 I/O subsystem architecture
Figure 1-18 shows the basic concept of the I/O subsystem architecture.
Figure 1-18 I/O subsystem architecture
For a quick overview of overall I/O subsystem operations, we will use an example of writing 
data to a disk. The following sequence outlines the fundamental operations that occur when a 
disk-write operation is performed. Assume that the file data is stored in sectors on the disk 
platters, has already been read, and is in the page cache.
1. A process requests to write a file through the write() system call.
2. The kernel updates the page cache mapped to the file.
3. A pdflush kernel thread takes care of flushing the page cache to disk.
4. The file system layer puts the block buffers together into a bio struct (refer to 1.4.3, 
“Block layer” on page 23) and submits a write request to the block device layer.
5. The block device layer gets requests from upper layers, performs an I/O elevator 
operation, and puts the requests into the I/O request queue.
6. A device driver, such as the SCSI driver or another device-specific driver, takes care of 
the write operation.
7. The disk device firmware performs hardware operations such as head seek, rotation, and 
data transfer to the sector on the platter.
[Figure 1-18 content: a user process calls write() on a file; in the VFS / file system layer 
the page cache is flushed by pdflush and block buffers are grouped into a bio; in the block 
layer the I/O scheduler places requests on the I/O request queue; the device driver passes 
them to the disk device, which writes the sectors.]
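Steps 1 to 3 of the sequence above can be observed from user space. In the sketch below, os.write() returns once the kernel has accepted the data (normally into the page cache), while os.fsync() asks the kernel to flush that file's dirty pages to the device instead of waiting for the periodic flush. The function name write_durably is an assumption made for this example.

```python
import os


def write_durably(path, data):
    """Write bytes to a file and wait until the kernel has flushed them."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # write() may accept fewer bytes than requested, so loop.
            written += os.write(fd, data[written:])
        # Without this call the data may stay only in the page cache until
        # the kernel flushes it in the background.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written
```

Leaving out the os.fsync() call usually makes the function return sooner, because the remaining steps of the sequence then happen asynchronously.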
1.4.2 Cache
Over the last 20 years, processor performance has improved faster than that of the other 
components in a computer system, such as the processor cache, bus, RAM, and disk. Slower 
access to memory and disk restricts overall system performance, so faster processors alone 
do not make the system faster. The cache mechanism addresses this problem by keeping 
frequently used data in faster memory, which reduces the chances of having to access slower 
memory. Current computer systems use this technique in almost all I/O components, such as 
hard disk drive cache, disk controller cache, file system cache, caches handled by each 
application, and so on.
Memory hierarchy
Figure 1-19 shows the concept of memory hierarchy. Because the difference in access speed 
between the CPU register and disk is large, the CPU would spend much of its time waiting for 
data from slow disk devices, which significantly reduces the advantage of a fast CPU. The 
memory hierarchy reduces this mismatch by placing L1 cache, L2 cache, RAM, and some other 
caches between the CPU and disk. This gives a process fewer occasions to access slower 
memory and disk. The closer memory is to the processor, the faster and smaller it is. 
This technique also takes advantage of the locality of reference principle. The higher the 
cache hit rate on faster memory, the faster the access to data.
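The effect of the hit rate can be estimated with a simple average. The sketch below assumes a two-level model in which a hit costs the fast latency and a miss costs the slow latency; the function name and any latencies passed to it are illustrative, not measurements.

```python
def effective_access_time(hit_rate, fast_latency, slow_latency):
    """Average access time for a fast level backed by a slower one."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Hits are served by the fast level, misses by the slow level.
    return hit_rate * fast_latency + (1.0 - hit_rate) * slow_latency
```

With hypothetical latencies of 1 and 100 time units, raising the hit rate from 0.90 to 0.99 lowers the average from about 10.9 to about 1.99 units, which shows why even a small gain in hit rate matters when the slower level is much slower.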
Figure 1-19 Memory hierarchy
Locality of reference
As we stated previously in “Memory hierarchy”, achieving a higher cache hit rate is the key 
to performance improvement. To achieve a higher cache hit rate, a technique called “locality 
of reference” is used. This technique is based on the following principles:
• The data most recently used has a high probability of being used in the near future 
(temporal locality).
• The data that resides close to the data which has been used has a high probability of 
being used (spatial locality).
Figure 1-20 on page 22 illustrates this principle. 
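Temporal locality can be demonstrated with a small simulation. The sketch below measures the hit rate of a least recently used (LRU) cache over a sequence of accesses; LRU is used here only as a convenient example of a policy that exploits temporal locality and is not a description of how any particular Linux cache is implemented.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Return the fraction of accesses served from an LRU cache."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(accesses) if accesses else 0.0
```

An access pattern that keeps returning to the same few items gets a high hit rate, while a pattern that cycles through more items than the cache can hold gets none.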
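[Figure 1-19 content: CPU register (very fast), cache (fast), RAM (slow), and disk (very 
slow); without the intermediate levels there is a large speed mismatch between the very fast 
CPU register and the very slow disk.]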
Figure 1-20 Locality of reference
Linux uses this principle in many components such as page cache, file object cache (i-node 
cache, directory entry cache, and so on), read ahead buffer and more.
Flushing a dirty buffer
When a process reads data from disk, the data is copied to memory. The process and other 
processes can retrieve the same data from the copy cached in memory. When a process tries 
to change the data, it changes the data in memory first. At this time, the data on disk and 
the data in memory are not identical, and the data in memory is referred to as a dirty 
buffer. The dirty buffer should be synchronized to the data on disk as soon as possible, or 
the data in memory could be lost if a sudden crash occurs.
The synchronization process for a dirty buffer is called a flush. In the Linux kernel 2.6 
implementation, the pdflush kernel thread is responsible for flushing data to the disk. The 
flush occurs on a regular basis (kupdate) and when the proportion of dirty buffers in memory 
exceeds a certain threshold (bdflush). The threshold is configurable in the 
/proc/sys/vm/dirty_background_ratio file. For more information, refer to 4.5.1, “Setting 
kernel swap and pdflush behavior” on page 109. 
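The current threshold can be read like any other file. The helper below is a minimal sketch; the function name and its base parameter are assumptions made so that the example can be tried against a different directory, and the file only exists on a Linux system.

```python
from pathlib import Path


def read_vm_tunable(name, base="/proc/sys/vm"):
    """Read an integer virtual memory tunable such as dirty_background_ratio."""
    return int(Path(base, name).read_text().strip())
```

On a Linux system, read_vm_tunable("dirty_background_ratio") returns the percentage at which background flushing starts. Changing the value requires root privileges and is covered in the section referenced above.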
[Figure 1-20 content: two panels. Temporal locality: first access to the data, then a second 
access in a few seconds. Spatial locality: first access to Data1, then a second access to 
Data2 in a few seconds. Each panel shows the data moving between disk, memory, cache, and the 
CPU register.]
Figure 1-21 Flushing dirty buffers
1.4.3 Block layer
The block layer handles all the activity related to block device operation (refer to Figure 1-18 
on page 20). The key data structure in the block layer is the bio structure. The bio structure is 
an interface between the file system layer and the block layer.
When a write is performed, the file system layer tries to write to the page cache, which is 
made up of block buffers. It builds a bio structure by putting contiguous blocks together, 
then sends the bio to the block layer (refer to Figure 1-18 on page 20).
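The idea of putting contiguous blocks together can be sketched as follows. This is a simplified illustration, not the kernel's bio code: blocks are plain integers, and each resulting run is a (first block, block count) pair that stands in for one request.

```python
def group_contiguous(blocks):
    """Group block numbers into (first_block, count) runs of adjacent blocks."""
    runs = []
    for block in sorted(set(blocks)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # The block directly follows the previous run, so extend it.
            first, count = runs[-1]
            runs[-1] = (first, count + 1)
        else:
            runs.append((block, 1))
    return runs
```

Six dirty blocks that form two contiguous ranges become two requests instead of six, which reduces the number of separate operations the lower layers have to handle.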
The block layer handles the bio requests and links them into a queue called the I/O request 
queue. This linking operation is called the I/O elevator. In Linux kernel 2.6 
implementations, four types of I/O elevator algorithms are available; they are described in 
“I/O elevator” below.
Block sizes
The block size, the smallest amount of data that can be read from or written to a drive, can 
have a direct impact on a server’s performance. As a guideline, if your server is handling a 
lot of small files, then a smaller block size will be more efficient. If your server is 
dedicated to handling large files, a larger block size might improve performance. Block sizes 
cannot be changed on the fly on existing file systems. Only a reformat will modify the 
current block size.
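One reason a smaller block size suits small files is that a file always occupies whole blocks. The arithmetic can be sketched as follows; the function names are assumptions made for this example, and file system metadata is ignored.

```python
def allocated_bytes(file_size, block_size):
    """Space taken by a file's data when rounded up to whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def wasted_bytes(file_sizes, block_size):
    """Total unused space in the last block of each file."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)
```

For a 100-byte file, a 4096-byte block leaves 3996 bytes unused, while a 1024-byte block leaves 924 bytes unused. For large files the slack in the last block is negligible, and larger blocks mean fewer blocks to track per file.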
I/O elevator
The Linux kernel 2.6 employs a new I/O elevator model. While the Linux kernel 2.4 used a 
single, general-purpose I/O elevator, kernel 2.6 offers the choice of four elevators. Because 
the Linux operating system can be used for a wide range of tasks, both I/O devices and 
workload characteristics vary significantly. A notebook computer probably has different I/O 
requirements than a 10,000-user database system. To accommodate this, four I/O elevators 
are available.
[Figure 1-21 content: three stages. (1) A process reads data from disk: the data in memory 
and the data on disk are identical at this time. (2) The process writes new data: only the 
data in memory has been changed, so the data on disk and the data in memory are not identical 
(dirty buffer). (3) Flushing (pdflush, sync()) writes the data in memory to the disk: the data 
on disk is now identical to the data in memory.]
• Anticipatory
The anticipatory I/O elevator was created based on the assumption of a block device with 
only one physical seek head (for example a single SATA drive). The anticipatory elevator 
uses the deadline mechanism described in more detail below plus an anticipation 
heuristic. As the name suggests, the anticipatory I/O elevator “anticipates” I/O and 
attempts to write it in single, bigger streams to the disk instead of multiple
