ibm.com/redbooks

Redpaper

Front cover

Linux Performance and Tuning Guidelines

Eduardo Ciliendo
Takechika Kunimasa

Operating system tuning methods
Performance monitoring tools
Performance analysis

http://www.redbooks.ibm.com/

International Technical Support Organization

Linux Performance and Tuning Guidelines

July 2007

REDP-4285-00

© Copyright International Business Machines Corporation 2007. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

First Edition (July 2007)

This edition applies to kernel 2.6 Linux distributions. This paper was updated on April 25, 2008.

Note: Before using this information and the product it supports, read the information in "Notices" on page vii.

Contents

Notices
  Trademarks

Preface
  How this paper is structured
  The team that wrote this paper
  Become a published author
  Comments welcome

Chapter 1. Understanding the Linux operating system
  1.1 Linux process management
    1.1.1 What is a process?
    1.1.2 Life cycle of a process
    1.1.3 Thread
    1.1.4 Process priority and nice level
    1.1.5 Context switching
    1.1.6 Interrupt handling
    1.1.7 Process state
    1.1.8 Process memory segments
    1.1.9 Linux CPU scheduler
  1.2 Linux memory architecture
    1.2.1 Physical and virtual memory
    1.2.2 Virtual memory manager
  1.3 Linux file systems
    1.3.1 Virtual file system
    1.3.2 Journaling
    1.3.3 Ext2
    1.3.4 Ext3
    1.3.5 ReiserFS
    1.3.6 Journal File System
    1.3.7 XFS
  1.4 Disk I/O subsystem
    1.4.1 I/O subsystem architecture
    1.4.2 Cache
    1.4.3 Block layer
    1.4.4 I/O device driver
    1.4.5 RAID and storage system
  1.5 Network subsystem
    1.5.1 Networking implementation
    1.5.2 TCP/IP
    1.5.3 Offload
    1.5.4 Bonding module
  1.6 Understanding Linux performance metrics
    1.6.1 Processor metrics
    1.6.2 Memory metrics
    1.6.3 Network interface metrics
    1.6.4 Block device metrics

Chapter 2. Monitoring and benchmark tools
  2.1 Introduction
  2.2 Overview of tool functions
  2.3 Monitoring tools
    2.3.1 top
    2.3.2 vmstat
    2.3.3 uptime
    2.3.4 ps and pstree
    2.3.5 free
    2.3.6 iostat
    2.3.7 sar
    2.3.8 mpstat
    2.3.9 numastat
    2.3.10 pmap
    2.3.11 netstat
    2.3.12 iptraf
    2.3.13 tcpdump / ethereal
    2.3.14 nmon
    2.3.15 strace
    2.3.16 Proc file system
    2.3.17 KDE System Guard
    2.3.18 Gnome System Monitor
    2.3.19 Capacity Manager
  2.4 Benchmark tools
    2.4.1 LMbench
    2.4.2 IOzone
    2.4.3 netperf
    2.4.4 Other useful tools

Chapter 3. Analyzing performance bottlenecks
  3.1 Identifying bottlenecks
    3.1.1 Gathering information
    3.1.2 Analyzing the server's performance
  3.2 CPU bottlenecks
    3.2.1 Finding CPU bottlenecks
    3.2.2 SMP
    3.2.3 Performance tuning options
  3.3 Memory bottlenecks
    3.3.1 Finding memory bottlenecks
    3.3.2 Performance tuning options
  3.4 Disk bottlenecks
    3.4.1 Finding disk bottlenecks
    3.4.2 Performance tuning options
  3.5 Network bottlenecks
    3.5.1 Finding network bottlenecks
    3.5.2 Performance tuning options

Chapter 4. Tuning the operating system
  4.1 Tuning principles
    4.1.1 Change management
  4.2 Installation considerations
    4.2.1 Installation
    4.2.2 Check the current configuration
    4.2.3 Minimize resource use
    4.2.4 SELinux
    4.2.5 Compiling the kernel
  4.3 Changing kernel parameters
    4.3.1 Where the parameters are stored
    4.3.2 Using the sysctl command
  4.4 Tuning the processor subsystem
    4.4.1 Tuning process priority
    4.4.2 CPU affinity for interrupt handling
    4.4.3 Considerations for NUMA systems
  4.5 Tuning the vm subsystem
    4.5.1 Setting kernel swap and pdflush behavior
    4.5.2 Swap partition
    4.5.3 HugeTLBfs
  4.6 Tuning the disk subsystem
    4.6.1 Hardware considerations before installing Linux
    4.6.2 I/O elevator tuning and selection
    4.6.3 File system selection and tuning
  4.7 Tuning the network subsystem
    4.7.1 Considerations of traffic characteristics
    4.7.2 Speed and duplexing
    4.7.3 MTU size
    4.7.4 Increasing network buffers
    4.7.5 Additional TCP/IP tuning
    4.7.6 Performance impact of Netfilter
    4.7.7 Offload configuration
    4.7.8 Increasing the packet queues
    4.7.9 Increasing the transmit queue length
    4.7.10 Decreasing interrupts

Appendix A. Testing configurations
  Hardware and software configurations
  Linux installed on guest IBM z/VM systems
  Linux installed on IBM System x servers

Abbreviations and acronyms

Related publications
  IBM Redbooks
  Other publications
  Online resources
  How to get IBM Redbooks
  Help from IBM

Index

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: Redbooks (logo)®, eServer™, xSeries®, z/OS®, AIX®, DB2®, DS8000™, IBM®, POWER™, Redbooks®, ServeRAID™, System i™, System p™, System x™, System z™, System Storage™, TotalStorage®.

The following terms are trademarks of other companies:

Java, JDBC, Solaris, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Excel, Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Itanium, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Preface

Linux® is an open source operating system developed by people from all over the world. The source code is freely available and can be used under the GNU General Public License. The operating system is made available to users in the form of distributions from companies such as Red Hat and Novell. Some desktop Linux distributions can be downloaded at no charge from the Web, but the server versions typically must be purchased.

Over the past few years, Linux has made its way into the data centers of many corporations worldwide. The Linux operating system is accepted by both the scientific and enterprise user population. Today, Linux is by far the most versatile operating system: you can find Linux on embedded devices such as firewalls, on cell phones, and on mainframes. Naturally, performance of the Linux operating system has become a hot topic for scientific and enterprise users. However, calculating a global weather forecast and hosting a database impose different requirements on an operating system. Linux must accommodate all possible usage scenarios with optimal performance. Most Linux distributions contain general tuning parameters to accommodate all users.

IBM® recognizes Linux as an operating system suitable for enterprise-level applications that run on IBM systems. Most enterprise applications are now available on Linux, including file and print servers, database servers, Web servers, and collaboration and mail servers. The use of Linux in an enterprise-class server requires monitoring performance and, when necessary, tuning the server to remove bottlenecks that affect users.
This IBM Redpaper publication describes the methods you can use to tune Linux, tools that you can use to monitor and analyze server performance, and key tuning parameters for specific server applications. The purpose of this paper is to explain how to analyze and tune the Linux operating system to yield superior performance for any type of application you plan to run on these systems.

The tuning parameters, benchmark results, and monitoring tools used in our test environment were executed on Red Hat and Novell SUSE Linux kernel 2.6 systems running on IBM System x™ servers and IBM System z™ servers. However, the information in this paper should be helpful for all Linux hardware platforms.

How this paper is structured

To help those of you who are new to Linux or performance tuning get started quickly, we have structured this book the following way:

- Chapter 1, "Understanding the Linux operating system" on page 1
  This chapter introduces the factors that influence system performance and the way the Linux operating system manages system resources. You are introduced to several important performance metrics that are needed to quantify system performance.
- Chapter 2, "Monitoring and benchmark tools" on page 39
  The second chapter introduces the various utilities that are available for Linux to measure and analyze systems performance.
- Chapter 3, "Analyzing performance bottlenecks" on page 77
  This chapter introduces the process of identifying and analyzing bottlenecks in the system.
- Chapter 4, "Tuning the operating system" on page 91
  With the basic knowledge of how the operating system works and how to use performance measurement utilities, you are ready to explore the various performance tweaks available in the Linux operating system.

The team that wrote this paper

This paper was produced by a team of specialists from around the world working at the International Technical Support Organization, Raleigh Center.

The team: Byron, Eduardo, Takechika

Eduardo Ciliendo is an Advisory IT Specialist working as a performance specialist on IBM Mainframe Systems in IBM Switzerland. He has more than 10 years of experience in computer sciences. Eddy studied Computer and Business Sciences at the University of Zurich and holds a post-diploma in Japanology. Eddy is a member of the zChampion team and holds several IT certifications including the RHCE title. As a Systems Engineer for IBM System z™, he works on capacity planning and systems performance for z/OS® and Linux for System z. Eddy has authored several publications on systems performance and Linux.

Takechika Kunimasa is an Associate IT Architect in IBM Global Services in Japan. He studied Electrical and Electronics engineering at Chiba University. He has more than 10 years of experience in the IT industry. He worked as a network engineer for five years, and he has been working in Linux technical support. His areas of expertise include Linux on System x™, Linux on System p™, Linux on System z, high availability systems, networking, and infrastructure architecture design. He is a Cisco Certified Network Professional and a Red Hat Certified Engineer.

Byron Braswell is a Networking Professional at the International Technical Support Organization, Raleigh Center. He received a B.S. degree in Physics and an M.S. degree in Computer Sciences from Texas A&M University.
He writes extensively in the areas of networking, application integration middleware, and personal computer software. Before joining the ITSO, Byron worked in IBM Learning Services Development in networking education development.

Thanks to the following people for their contributions to this project:

Margaret Ticknor
Carolyn Briscoe
International Technical Support Organization, Raleigh Center

Roy Costa
Michael B Schwartz
Frieder Hamm
International Technical Support Organization, Poughkeepsie Center

Christian Ehrhardt
Martin Kammerer
IBM Böblingen, Germany

Erwan Auffret
IBM France

Become a published author

Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You will have the opportunity to team with IBM technical professionals, Business Partners, and Clients.

Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs and increase your productivity and marketability.

Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us! We want our papers to be as helpful as possible. Send us your comments about this paper or other IBM Redbooks® in one of the following ways:

- Use the online Contact us review redbook form found at:
  ibm.com/redbooks
- Send your comments in an e-mail to:
  redbooks@us.ibm.com
- Mail your comments to:
  IBM Corporation, International Technical Support Organization
  Dept. HYTD Mail Station P099
  2455 South Road
  Poughkeepsie, NY 12601-5400

Chapter 1. Understanding the Linux operating system

We begin this paper with an overview of how the Linux operating system handles its tasks and interacts with its hardware resources. Performance tuning is a challenging task that requires in-depth understanding of the hardware, operating system, and application. If performance tuning were simple, the parameters we are about to explore would be hard-coded into the firmware or the operating system and you would not be reading these lines. However, as shown in Figure 1-1, server performance is affected by multiple factors.

Figure 1-1 Schematic interaction of different performance components (applications sit on libraries, which in turn sit on the kernel, drivers, firmware, and hardware)

You could tune the I/O subsystem for weeks in vain if the disk subsystem for a 20,000-user database server consisted of a single IDE drive. Often a new driver or an update to the application yields impressive performance gains. As we discuss specific details, keep in mind the whole picture of systems performance. Understanding the way an operating system manages the system resources helps us understand what subsystems we need to tune in any application scenario.

The following sections provide a short introduction to the architecture of the Linux operating system. A complete analysis of the Linux kernel is beyond the scope of this paper. You can refer to the kernel documentation for a complete reference of the Linux kernel.
In this chapter we cover:

- 1.1, "Linux process management" on page 2
- 1.2, "Linux memory architecture" on page 10
- 1.3, "Linux file systems" on page 15
- 1.4, "Disk I/O subsystem" on page 19
- 1.5, "Network subsystem" on page 26
- 1.6, "Understanding Linux performance metrics" on page 34

Note: This paper focuses on the performance of the Linux operating system.

1.1 Linux process management

Process management is one of the most important roles of any operating system. Effective process management enables an application to operate steadily and effectively.

The Linux process management implementation is similar to the UNIX® implementation. It includes process scheduling, interrupt handling, signaling, process prioritization, process switching, process state, process memory, and so on.

In this section, we discuss the fundamentals of the Linux process management implementation. It helps you understand how the Linux kernel deals with processes, which has an effect on system performance.

1.1.1 What is a process?

A process is an instance of execution that runs on a processor. The process uses any resources that the Linux kernel can handle to complete its task.

All processes running on the Linux operating system are managed by the task_struct structure, which is also called a process descriptor. A process descriptor contains all the information necessary for a single process to run, such as process identification, attributes of the process, and the resources which constitute the process. If you know the structure of the process descriptor, you can understand what is important for process execution and performance. Figure 1-2 shows the outline of structures related to process information.

Figure 1-2 task_struct structure (fields include the process state, thread_info for process information and the kernel stack, run_list and array for process scheduling, mm for the process address space, pid for the process ID, fs for the working and root directories, files for file descriptors, signal and sighand for signal information and signal handlers, and user and group_info for user and group management, each pointing to further structures such as runqueue, mm_struct, fs_struct, files_struct, signal_struct, sighand_struct, user_struct, and group_info)
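To make the shape of the process descriptor concrete, the following C fragment sketches the fields that Figure 1-2 names. This is a heavily simplified illustration under our own naming (task_struct_sketch), not the real kernel definition: the actual task_struct lives in include/linux/sched.h, contains many more members, and changes between kernel versions.

   #include <stdio.h>
   #include <sys/types.h>

   /* Incomplete types stand in for the kernel structures that the
    * process descriptor points to. */
   struct thread_info;   struct mm_struct;     struct fs_struct;
   struct files_struct;  struct signal_struct; struct sighand_struct;
   struct user_struct;   struct group_info;
   struct list_head { struct list_head *next, *prev; };

   struct task_struct_sketch {
       long                   state;       /* TASK_RUNNING, TASK_STOPPED, ... */
       struct thread_info    *thread_info; /* process information and kernel stack */
       struct list_head       run_list;    /* linkage into a scheduler run queue */
       struct mm_struct      *mm;          /* process address space */
       pid_t                  pid;         /* process ID */
       struct fs_struct      *fs;          /* working directory and root directory */
       struct files_struct   *files;       /* open file descriptors */
       struct signal_struct  *signal;      /* signal information */
       struct sighand_struct *sighand;     /* signal handlers */
       struct user_struct    *user;        /* user management */
       struct group_info     *group_info;  /* group management */
   };

   int main(void)
   {
       printf("sketch only: %zu bytes\n", sizeof(struct task_struct_sketch));
       return 0;
   }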
1.1.2 Life cycle of a process

Every process has its own life cycle: creation, execution, termination, and removal. These phases will be repeated literally millions of times as long as the system is up and running. Therefore, the process life cycle is very important from the performance perspective. Figure 1-3 shows the typical life cycle of processes.

Figure 1-3 Life cycle of typical processes (the parent issues fork() and later wait(); the child issues exec() and exit(), passing through the zombie state until the parent reaps it)

When a process creates a new process, the creating process (parent process) issues a fork() system call. When a fork() system call is issued, it gets a process descriptor for the newly created process (child process) and sets a new process id. It copies the values of the parent process's process descriptor to the child's. At this time the entire address space of the parent process is not copied; both processes share the same address space.

The exec() system call copies the new program to the address space of the child process. Because both processes share the same address space, writing new program data causes a page fault exception. At this point, the kernel assigns the new physical page to the child process. This deferred operation is called Copy On Write. The child process usually executes its own program rather than the same execution as its parent does. This operation avoids unnecessary overhead, because copying an entire address space is a very slow and inefficient operation which uses a lot of processor time and resources.

When program execution has completed, the child process terminates with an exit() system call. The exit() system call releases most of the data structures of the process and notifies the parent process of the termination by sending a signal. At this time, the process is called a zombie process (refer to "Zombie processes" on page 7).

The child process will not be completely removed until the parent process learns of the termination of its child process by the wait() system call. As soon as the parent process is notified of the child process termination, it removes all the data structures of the child process and releases the process descriptor.
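A minimal C program makes this cycle visible. This is a sketch assuming a standard glibc environment; /bin/echo is just an arbitrary program for the child to execute.

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/wait.h>
   #include <unistd.h>

   int main(void)
   {
       int status;
       pid_t pid = fork();     /* parent receives the child's PID, child receives 0 */

       if (pid < 0) {
           perror("fork");
           exit(EXIT_FAILURE);
       }
       if (pid == 0) {
           /* Child: replace the shared, copy-on-write address space
            * with a new program. */
           execl("/bin/echo", "echo", "hello from the child", (char *)NULL);
           perror("execl");    /* reached only if exec() fails */
           _exit(EXIT_FAILURE);
       }

       /* Parent: wait() reaps the zombie and frees its process descriptor. */
       waitpid(pid, &status, 0);
       printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
       return 0;
   }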
1.1.3 Thread

A thread is an execution unit generated in a single process. It runs in parallel with other threads in the same process. Threads can share the same resources such as memory, address space, open files, and so on, and they can access the same set of application data. A thread is also called a Light Weight Process (LWP). Because they share resources, threads must not change their shared resources at the same time; the implementation of mutual exclusion, locking, serialization, and so on is the user application's responsibility.

From the performance perspective, thread creation is less expensive than process creation because a thread does not need to copy resources on creation. On the other hand, processes and threads have similar characteristics in terms of scheduling algorithm; the kernel deals with both of them in a similar manner.

Figure 1-4 Process and thread (process creation copies the parent's resources; thread creation shares the resources of the owning process)

In current Linux implementations, a thread is supported with the Portable Operating System Interface for UNIX (POSIX) compliant library (pthread). Several thread implementations are available in the Linux operating system. The following are the most widely used:

- LinuxThreads
  LinuxThreads has been the default thread implementation since Linux kernel 2.0. Parts of the LinuxThreads implementation do not comply with the POSIX standard. Native POSIX Thread Library (NPTL) is taking the place of LinuxThreads, and LinuxThreads will not be supported in future releases of Enterprise Linux distributions.
- Native POSIX Thread Library (NPTL)
  The NPTL was originally developed by Red Hat. NPTL is more compliant with POSIX standards. By taking advantage of enhancements in kernel 2.6, such as the new clone() system call and the signal handling implementation, it has better performance and scalability than LinuxThreads. NPTL has some incompatibilities with LinuxThreads; an application which depends on LinuxThreads might not work with the NPTL implementation.
- Next Generation POSIX Thread (NGPT)
  NGPT is an IBM-developed version of the POSIX thread library. It is currently under maintenance operation and no further development is planned.

Using the LD_ASSUME_KERNEL environment variable, you can choose which thread library the application should use.
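The following sketch uses the pthread interface described above: two threads run in the same address space and update one shared counter, so the application has to provide the locking itself, as the section notes. Build it with the -pthread compiler flag.

   #include <pthread.h>
   #include <stdio.h>

   static int counter;                    /* shared: threads use one address space */
   static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

   static void *worker(void *arg)
   {
       int i;
       (void)arg;
       for (i = 0; i < 100000; i++) {
           pthread_mutex_lock(&lock);     /* mutual exclusion is the app's job */
           counter++;
           pthread_mutex_unlock(&lock);
       }
       return NULL;
   }

   int main(void)
   {
       pthread_t t1, t2;

       pthread_create(&t1, NULL, worker, NULL);   /* cheap: no resource copy */
       pthread_create(&t2, NULL, worker, NULL);
       pthread_join(t1, NULL);
       pthread_join(t2, NULL);
       printf("counter = %d\n", counter);         /* 200000, thanks to the mutex */
       return 0;
   }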
1.1.4 Process priority and nice level

Process priority is a number that determines the order in which the process is handled by the CPU and is determined by dynamic priority and static priority. A process which has a higher process priority has a greater chance of getting permission to run on a processor. The kernel dynamically adjusts dynamic priority up and down as needed, using a heuristic algorithm based on process behaviors and characteristics.

A user process can change the static priority indirectly through the use of the nice level of the process. A process which has a higher static priority will have a longer time slice (how long the process can run on a processor). Linux supports nice levels from 19 (lowest priority) to -20 (highest priority). The default value is 0. To change the nice level of a program to a negative number (which makes it a higher priority), it is necessary to log on as root or use the su command.

1.1.5 Context switching

During process execution, information on the running process is stored in registers on the processor and in its cache. The set of data that is loaded to the registers for the executing process is called the context. To switch processes, the context of the running process is stored and the context of the next running process is restored to the registers. The process descriptor and the area called the kernel mode stack are used to store the context. This switching process is called context switching. Having too much context switching is undesirable because the processor has to flush its registers and cache every time to make room for the new process, which can cause performance problems. Figure 1-5 illustrates how context switching works.

Figure 1-5 Context switching (the stack pointer, EIP register, and other CPU registers of process A are saved to its task_struct and kernel stack on suspend; the saved context of process B is restored on resume)

1.1.6 Interrupt handling

Interrupt handling is one of the highest priority tasks. Interrupts are usually generated by I/O devices such as a network interface card, keyboard, disk controller, serial adapter, and so on. The interrupt handler notifies the Linux kernel of an event (such as keyboard input, Ethernet frame arrival, and so on). It tells the kernel to interrupt process execution and perform interrupt handling as quickly as possible because some devices require quick responsiveness. This is critical for system stability. When an interrupt signal arrives at the kernel, the kernel must switch the currently executing process to a new one to handle the interrupt. This means interrupts cause context switching, and therefore a significant number of interrupts could cause performance degradation.

In Linux implementations, there are two types of interrupts. A hard interrupt is generated for devices which require responsiveness (disk I/O interrupt, network adapter interrupt, keyboard interrupt, mouse interrupt). A soft interrupt is used for tasks whose processing can be deferred (TCP/IP operation, SCSI protocol operation, and so on). You can see information related to hard interrupts at /proc/interrupts.

In a multi-processor environment, interrupts are handled by each processor. Binding interrupts to a single physical processor could improve system performance. For more details, refer to 4.4.2, "CPU affinity for interrupt handling" on page 108.

1.1.7 Process state

Every process has its own state that shows what is currently happening in the process. Process state changes during process execution. Some of the possible states are as follows:

- TASK_RUNNING
  In this state, a process is running on a CPU or waiting to run in the queue (run queue).
- TASK_STOPPED
  A process suspended by certain signals (for example SIGINT, SIGSTOP) is in this state. The process is waiting to be resumed by a signal such as SIGCONT.
- TASK_INTERRUPTIBLE
  In this state, the process is suspended and waits for a certain condition to be satisfied. If a process is in TASK_INTERRUPTIBLE state and it receives a signal to stop, the process state is changed and the operation will be interrupted. A typical example of a TASK_INTERRUPTIBLE process is a process waiting for a keyboard interrupt.
- TASK_UNINTERRUPTIBLE
  Similar to TASK_INTERRUPTIBLE. While a process in TASK_INTERRUPTIBLE state can be interrupted, sending a signal does nothing to a process in TASK_UNINTERRUPTIBLE state. A typical example of a TASK_UNINTERRUPTIBLE process is a process waiting for a disk I/O operation.
- TASK_ZOMBIE
  After a process exits with the exit() system call, its parent should know of the termination. In TASK_ZOMBIE state, a process is waiting for its parent to be notified so that it can release all of its data structures.

Figure 1-6 Process state (fork() places a process in the TASK_RUNNING (ready) queue; scheduling and preemption move it between the ready queue and TASK_RUNNING on a processor; it can enter TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, or TASK_STOPPED, and it becomes TASK_ZOMBIE after exit())

Zombie processes

When a process has already terminated, having received a signal to do so, it normally takes some time to finish all tasks (such as closing open files) before ending itself. In that normally very short time frame, the process is a zombie.

After the process has completed all of these shutdown tasks, it reports to the parent process that it is about to terminate. Sometimes, a zombie process is unable to terminate itself, in which case it shows a status of Z (zombie).

It is not possible to kill such a process with the kill command, because it is already considered dead. If you cannot get rid of a zombie, you can kill the parent process and then the zombie disappears as well. However, if the parent process is the init process, you should not kill it. The init process is a very important process, so a reboot might be needed to get rid of the zombie process.
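A zombie is easy to produce on purpose. In this minimal sketch the child exits immediately, but it stays in state Z until the parent finally calls waitpid(); checking with ps during the 30-second window shows the zombie.

   #include <stdio.h>
   #include <sys/wait.h>
   #include <unistd.h>

   int main(void)
   {
       pid_t pid = fork();

       if (pid == 0)
           _exit(0);            /* child terminates immediately */

       /* The child is now a zombie: terminated, but its exit status is
        * kept in its process descriptor until the parent reaps it. */
       printf("run: ps -o pid,stat,comm -p %d\n", (int)pid);
       sleep(30);

       waitpid(pid, NULL, 0);   /* reap the child; the zombie disappears */
       return 0;
   }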
1.1.8 Process memory segments

A process uses its own memory area to perform work. The work varies depending on the situation and process usage. A process can have different workload characteristics and different data size requirements, so the process has to handle a variety of data sizes. To satisfy this requirement, the Linux kernel uses a dynamic memory allocation mechanism for each process. The process memory allocation structure is shown in Figure 1-7.

Figure 1-7 Process address space (from the bottom at 0x0000: the text segment; the data segment with its data, BSS, and heap areas, the heap growing toward higher addresses; and the stack segment growing toward lower addresses)

The process memory area consists of these segments:

- Text segment
  The area where executable code is stored.
- Data segment
  The data segment consists of these three areas:
  - Data: The area where initialized data such as static variables is stored.
  - BSS: The area where zero-initialized data is stored. The data is initialized to zero.
  - Heap: The area where malloc() allocates dynamic memory based on demand. The heap grows toward higher addresses.
- Stack segment
  The area where local variables, function parameters, and the return address of a function are stored. The stack grows toward lower addresses.

The memory allocation of a user process address space can be displayed with the pmap command. You can display the total size of the segments with the ps command. Refer to 2.3.10, "pmap" on page 52 and 2.3.4, "ps and pstree" on page 44.
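The layout can also be observed from inside a process. The following sketch prints one address from each segment; comparing its output with pmap for the same PID shows which mapping each variable landed in.

   #include <stdio.h>
   #include <stdlib.h>

   int bss_var;                  /* BSS: zero-initialized data */
   int data_var = 42;            /* data: initialized data */

   int main(void)                /* main() itself lives in the text segment */
   {
       int stack_var = 0;                          /* stack: local variable */
       int *heap_var = malloc(sizeof *heap_var);   /* heap: malloc() */

       printf("text  (main)      %p\n", (void *)main);
       printf("data  (data_var)  %p\n", (void *)&data_var);
       printf("bss   (bss_var)   %p\n", (void *)&bss_var);
       printf("heap  (heap_var)  %p\n", (void *)heap_var);
       printf("stack (stack_var) %p\n", (void *)&stack_var);

       free(heap_var);
       return 0;
   }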
1.1.9 Linux CPU scheduler

The basic functionality of any computer is, quite simply, to compute. To be able to compute, there must be a means to manage the computing resources, or processors, and the computing tasks, also known as threads or processes. Thanks to the great work of Ingo Molnar, Linux features a kernel using an O(1) algorithm, as opposed to the O(n) algorithm used by the former CPU scheduler. The term O(1) refers to a static algorithm, meaning that the time taken to choose a process for placing into execution is constant, regardless of the number of processes. The new scheduler scales very well, regardless of process count or processor count, and imposes a low overhead on the system. The algorithm uses two process priority arrays:

- active
- expired

As processes are allocated a timeslice by the scheduler, based on their priority and prior blocking rate, they are placed in a list of processes for their priority in the active array. When they expire their timeslice, they are allocated a new timeslice and placed on the expired array. When all processes in the active array have expired their timeslice, the two arrays are switched, restarting the algorithm. For general interactive processes (as opposed to real-time processes) this results in high-priority processes, which typically have long timeslices, getting more compute time than low-priority processes, but not to the point where they can starve the low-priority processes completely. The advantage of such an algorithm is the vastly improved scalability of the Linux kernel for enterprise workloads that often include vast numbers of threads or processes and also a significant number of processors. The new O(1) CPU scheduler was designed for kernel 2.6 but was backported to the 2.4 kernel family. Figure 1-8 on page 9 illustrates how the Linux CPU scheduler works.

Figure 1-8 Linux kernel 2.6 O(1) scheduler (two arrays, active and expired, each holding per-priority lists of processes for priorities 0 through 139)

Another significant advantage of the new scheduler is the support for Non-Uniform Memory Architecture (NUMA) and symmetric multithreading processors, such as Intel® Hyper-Threading technology.

The improved NUMA support ensures that load balancing will not occur across NUMA nodes unless a node gets overburdened. This mechanism ensures that traffic over the comparatively slow scalability links in a NUMA system is minimized. Although processors within a scheduler domain group are load balanced with every scheduler tick, load balancing across scheduler domains occurs only if a node is overloaded and asks for load balancing.

Figure 1-9 Architecture of the O(1) CPU scheduler on an 8-way NUMA based system with Hyper-Threading enabled (a two-node xSeries 445 with 8 CPUs forms the parent scheduler domain, one CEC with 4 CPUs forms a child scheduler domain, and one Hyper-Threaded Xeon MP forms a scheduler domain group of two logical CPUs; load balancing within a domain happens via scheduler_tick() and time slices, while balancing across nodes happens only if a child is overburdened)
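The array switch is what keeps scheduling O(1): nothing is ever rescanned, the two array pointers simply trade places. The following toy simulation illustrates only that idea; the real scheduler keeps 140 per-priority lists per CPU and far more state.

   #include <stdio.h>

   #define NTASKS 4

   int main(void)
   {
       int a[NTASKS] = { 1, 2, 3, 4 };   /* runnable task IDs */
       int b[NTASKS] = { 0 };
       int *active = a, *expired = b, *tmp;
       int round, i;

       for (round = 0; round < 2; round++) {
           for (i = 0; i < NTASKS; i++) {
               printf("round %d: run task %d\n", round, active[i]);
               expired[i] = active[i];   /* timeslice used up: move to expired */
               active[i] = 0;
           }
           /* Everything has expired: swap the array pointers in O(1). */
           tmp = active;
           active = expired;
           expired = tmp;
       }
       return 0;
   }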
1.2 Linux memory architecture

To execute a process, the Linux kernel allocates a portion of the memory area to the requesting process. The process uses the memory area as workspace and performs the required work. It is similar to you having your own desk allocated and then using the desktop to scatter papers, documents, and memos to perform your work. The difference is that the kernel has to allocate space in a more dynamic manner. The number of running processes sometimes comes to tens of thousands, and the amount of memory is usually limited. Therefore, the Linux kernel must handle memory efficiently. In this section, we describe the Linux memory architecture, the address layout, and how Linux manages memory space efficiently.

1.2.1 Physical and virtual memory

Today we are faced with the choice of 32-bit systems and 64-bit systems. One of the most important differences for enterprise-class clients is the possibility of virtual memory addressing above 4 GB. From a performance point of view, it is interesting to understand how the Linux kernel maps physical memory into virtual memory on both 32-bit and 64-bit systems.

As you can see in Figure 1-10 on page 11, there are obvious differences in the way the Linux kernel has to address memory in 32-bit and 64-bit systems. Exploring the physical-to-virtual mapping in detail is beyond the scope of this paper, so we highlight some specifics in the Linux memory architecture.

On 32-bit architectures such as the IA-32, the Linux kernel can directly address only the first gigabyte of physical memory (896 MB when considering the reserved range). Memory above the so-called ZONE_NORMAL must be mapped into the lower 1 GB. This mapping is completely transparent to applications, but allocating a memory page in ZONE_HIGHMEM causes a small performance degradation.

On the other hand, with 64-bit architectures such as x86-64 (also x64), ZONE_NORMAL extends all the way to 64 GB, or to 128 GB in the case of IA-64 systems. As you can see, the overhead of mapping memory pages from ZONE_HIGHMEM into ZONE_NORMAL can be eliminated by using a 64-bit architecture.

Figure 1-10 Linux kernel memory layout for 32-bit and 64-bit systems (32-bit: ZONE_DMA up to 16 MB, ZONE_NORMAL up to 896 MB with a 128 MB range reserved for kernel data structures, and ZONE_HIGHMEM up to 64 GB, whose pages must be mapped into ZONE_NORMAL; 64-bit: only ZONE_DMA and ZONE_NORMAL, up to 64 GB)

Virtual memory addressing layout

Figure 1-11 shows the Linux virtual addressing layout for 32-bit and 64-bit architectures.

On 32-bit architectures, the maximum address space that a single process can access is 4 GB. This is a restriction derived from 32-bit virtual addressing. In a standard implementation, the virtual address space is divided into a 3 GB user space and a 1 GB kernel space. There are some variants, such as the 4 G/4 G addressing layout. On the other hand, on 64-bit architectures such as x86_64 and IA64, no such restriction exists. Each single process can benefit from the vast address space.

Figure 1-11 Virtual memory addressing layout for 32-bit and 64-bit architecture (32-bit with the 3 G/1 G kernel split: 3 GB user space plus 1 GB kernel space; x86_64: 512 GB or more for each of user space and kernel space)
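A process can inspect its own virtual address layout through /proc/self/maps, which is also what the pmap command reads. A minimal dump, assuming a mounted /proc:

   #include <stdio.h>

   int main(void)
   {
       FILE *maps = fopen("/proc/self/maps", "r");
       int c;

       if (maps == NULL) {
           perror("fopen");
           return 1;
       }
       /* Each line is one mapping: address range, permissions, backing file.
        * On a 32-bit 3 G/1 G kernel, all user mappings sit below 0xc0000000. */
       while ((c = fgetc(maps)) != EOF)
           putchar(c);
       fclose(maps);
       return 0;
   }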
1.2.2 Virtual memory manager

The physical memory architecture of an operating system is usually hidden from the application and the user because operating systems map any memory into virtual memory. If we want to understand the tuning possibilities within the Linux operating system, we have to understand how Linux handles virtual memory. As explained in 1.2.1, "Physical and virtual memory" on page 10, applications do not allocate physical memory, but request a memory map of a certain size from the Linux kernel and in exchange receive a map in virtual memory. As you can see in Figure 1-12, virtual memory does not necessarily have to be mapped into physical memory. If your application allocates a large amount of memory, some of it might be mapped to the swap file on the disk subsystem.

Figure 1-12 shows that applications usually do not write directly to the disk subsystem, but into cache or buffers. The pdflush kernel threads then flush out the data in cache/buffers to disk when they have time to do so or when a file size exceeds the buffer cache. Refer to "Flushing a dirty buffer" on page 22.

Figure 1-12 The Linux virtual memory manager (user space processes such as sh, httpd, and mozilla call kernel subsystems through the standard C library, glibc; the VM subsystem with its slab allocator and zoned buddy allocator manages physical memory through the MMU, while kswapd and bdflush move data to disk through the disk driver)

Closely connected to the way the Linux kernel handles writes to the physical disk subsystem is the way the Linux kernel manages disk cache. While other operating systems allocate only a certain portion of memory as disk cache, Linux handles the memory resource far more efficiently. The default configuration of the virtual memory manager allocates all available free memory space as disk cache. Hence it is not unusual to see productive Linux systems that boast gigabytes of memory but only have 20 MB of that memory free.

In the same context, Linux also handles swap space very efficiently. Swap space being used does not indicate a memory bottleneck but proves how efficiently Linux handles system resources. See "Page frame reclaiming" on page 14 for more detail.

Page frame allocation

A page is a group of contiguous linear addresses in physical memory (page frame) or virtual memory. The Linux kernel handles memory in page units. A page is usually 4 KB in size. When a process requests a certain number of pages, if there are available pages, the Linux kernel can allocate them to the process immediately. Otherwise, pages have to be taken from some other process or from the page cache. The kernel knows how many memory pages are available and where they are located.
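The page size and a page-granular allocation can be seen with sysconf() and mmap(). A small sketch follows; note that the kernel sets up the mapping immediately but typically assigns physical page frames lazily, on the first write to each page.

   #include <stdio.h>
   #include <string.h>
   #include <sys/mman.h>
   #include <unistd.h>

   int main(void)
   {
       long page = sysconf(_SC_PAGESIZE);   /* usually 4096 bytes */
       size_t len = 4 * (size_t)page;
       char *p;

       printf("page size: %ld bytes\n", page);

       /* Request four anonymous pages of virtual memory. */
       p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       if (p == MAP_FAILED) {
           perror("mmap");
           return 1;
       }
       memset(p, 0xff, len);   /* touch the pages: frames get allocated now */
       munmap(p, len);
       return 0;
   }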
Swap

As we stated before, when page reclaiming occurs, the candidate pages in the inactive list that belong to a process address space may be paged out. Swap being used is not in itself a problematic situation. While in other operating systems swap is nothing more than a guarantee in case of over-allocation of main memory, Linux uses swap space far more efficiently. As you can see in Figure 1-12 on page 13, virtual memory is composed of both physical memory and the disk subsystem or swap partition. If the virtual memory manager in Linux realizes that a memory page has been allocated but not used for a significant amount of time, it moves this memory page to swap space.

Often you will see daemons such as getty that are launched when the system starts up but are hardly ever used afterwards. It is more efficient to free the expensive main memory occupied by such pages and move them to swap. This is exactly how Linux handles swap, so there is no need to be alarmed if you find the swap partition filled to 50%. The fact that swap space is being used does not indicate a memory bottleneck; instead it proves how efficiently Linux handles system resources.

1.3 Linux file systems

One of the great advantages of Linux as an open source operating system is that it offers users a variety of supported file systems. Modern Linux kernels can support nearly every file system ever used by a computer system, from basic FAT support to high-performance file systems such as the journaling file system (JFS). However, because Ext2, Ext3, and ReiserFS are native Linux file systems supported by most Linux distributions (ReiserFS is commercially supported only on Novell SUSE Linux), we will focus on their characteristics and give only an overview of the other frequently used Linux file systems. For more information on file systems and the disk subsystem, see 4.6, “Tuning the disk subsystem” on page 112.

1.3.1 Virtual file system

The Virtual File System (VFS) is an abstraction interface layer that resides between the user processes and the various Linux file system implementations. VFS provides common object models (such as the i-node, file object, page cache, and directory entry) and methods to access file system objects, and it hides the differences of each file system implementation from user processes. Thanks to VFS, user processes do not need to know which file system to use or which system call should be issued for each file system. Figure 1-14 on page 16 illustrates the concept of VFS.

Figure 1-14 VFS concept (user processes issue system calls such as open(), read(), and write(); VFS translates them for each file system implementation, such as ext2, ext3, ReiserFS, NFS, XFS, JFS, AFS, VFAT, and proc)
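One practical consequence of the VFS design is that every file system type the running kernel currently knows how to mount is listed in one place. For example (output abbreviated; the list depends on your kernel build and loaded modules):

$ cat /proc/filesystems
nodev   proc
nodev   tmpfs
        ext3
        ext2
        vfat

Entries flagged nodev are pseudo file systems that are not backed by a block device.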
1.3.2 Journaling

In a non-journaling file system, when a write is performed to the file system, the Linux kernel makes changes to the file system metadata first and then writes the actual user data. This ordering increases the chances of losing data integrity: if the system suddenly crashes for some reason while the write operation to the file system metadata is in process, the file system consistency may be broken. fsck fixes the inconsistency by checking all the metadata and recovering consistency at the time of the next reboot, but when the volume is large, this takes a long time to complete, and the system is not operational during the process.

A journaling file system solves this problem by writing the data to be changed to an area called the journal area before writing it to the actual file system. The journal area can be placed inside the file system or outside it. The data written to the journal area is called the journal log; it includes the changes to file system metadata and, if supported, the actual file data.

Because journaling writes journal logs before writing actual user data to the file system, it can cause performance overhead compared to a non-journaling file system. How much performance is sacrificed to maintain higher data consistency depends on how much information is written to disk before the user data. We will discuss this topic in 1.3.4, “Ext3” on page 18.

Figure 1-15 Journaling concept (1. write journal logs to the journal area; 2. make the changes to the actual file system; 3. delete the journal logs)

1.3.3 Ext2

The extended 2 file system is the predecessor of the extended 3 file system. A fast, simple file system, it features no journaling capabilities, unlike most other current file systems.

Figure 1-16 shows the Ext2 file system data structure. The file system starts with a boot sector, which is followed by block groups. Splitting the entire file system into several small block groups contributes to a performance gain, because the i-node table and the data blocks which hold user data can reside closer together on the disk platter, so seek time is reduced. A block group consists of these items:

- Super block: Information about the file system is stored here. An exact copy of the super block is placed at the top of every block group.
- Block group descriptors: Information about the block group is stored here.
- Data block bitmaps: Used for free data block management.
- i-node bitmaps: Used for free i-node management.
- i-node tables: The i-nodes are stored here. Every file has a corresponding i-node, which holds metadata of the file such as file mode, uid, gid, atime, ctime, mtime, dtime, and pointers to the data blocks.
- Data blocks: Where the actual user data is stored.

Figure 1-16 Ext2 file system data structure (a boot sector followed by block groups 0 through N, each containing a super block, block group descriptors, data block bitmaps, i-node bitmaps, i-node tables, and data blocks)

To find the data blocks which make up a file, the kernel searches for the i-node of the file first. When a request to open /var/log/messages comes from a process, the kernel parses the file path and searches the directory entry of / (the root directory), which holds information about the files and directories beneath it. Next, the kernel finds the i-node of /var and looks at the directory entry of /var, which likewise holds information about the files and directories beneath it. The kernel descends the path in the same manner until it finds the i-node of the file. The Linux kernel uses file object caches, such as the directory entry cache and the i-node cache, to accelerate finding the corresponding i-node.
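The result of this lookup is visible from user space: every file carries an i-node number that standard tools can display. A small illustrative check (the numbers shown are arbitrary examples):

$ ls -i /var/log/messages
49594 /var/log/messages
$ stat -c 'i-node: %i  size: %s bytes' /var/log/messages
i-node: 49594  size: 182034 bytes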
Once the Linux kernel knows the i-node of the file, it tries to reach the actual user data blocks. As we described, the i-node has pointers to the data blocks, and by referring to them the kernel can reach the data blocks. For large files, Ext2 implements direct and indirect references to the data blocks. Figure 1-17 illustrates how this works.

Figure 1-17 Ext2 file system direct / indirect reference to data block (i_blocks[0] through i_blocks[11] point directly to data blocks; i_blocks[12] points to an indirect block, i_blocks[13] to a double indirect block, and i_blocks[14] to a trebly indirect block)

The file system structure and the file access operations differ between file systems, which gives each file system its own characteristics.

1.3.4 Ext3

The current Enterprise Linux distributions support the extended 3 file system. This is an updated version of the widely used extended 2 file system. Though the fundamental structures are similar to those of the Ext2 file system, the major difference is the support of journaling. Highlights of this file system include:

- Availability: Ext3 always writes data to the disks in a consistent way, so in case of an unclean shutdown (unexpected power failure or system crash), the server does not have to spend time checking the consistency of the data, thereby reducing system recovery from hours to seconds.

- Data integrity: By specifying the journaling mode data=journal on the mount command, all data, both file data and metadata, is journaled.

- Speed: By specifying the journaling mode data=writeback, you can trade integrity for speed to meet the needs of your business requirements. This will be notable in environments where there are heavy synchronous writes.

- Flexibility: Upgrading from existing Ext2 file systems is simple, and no reformatting is necessary. By executing the tune2fs command and modifying the /etc/fstab file, you can easily update an Ext2 to an Ext3 file system. Also note that Ext3 file systems can be mounted as Ext2 with journaling disabled. Products from many third-party vendors have the capability of manipulating Ext3 file systems; for example, PartitionMagic can handle the modification of Ext3 partitions.

Mode of journaling

Ext3 supports three types of journaling modes; a short usage sketch follows this list.

- journal: This journaling option provides the highest form of data consistency by causing both file data and metadata to be journaled. It also has the highest performance overhead.

- ordered: In this mode only metadata is journaled; however, file data is guaranteed to be written first. This is the default setting.

- writeback: This journaling option provides the fastest access to the data at the expense of data consistency. The data is guaranteed to be consistent because the metadata is still logged. However, no special handling of actual file data is done, and this may lead to old data appearing in files after a system crash.
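A minimal sketch of converting to Ext3 and selecting a journaling mode, assuming a hypothetical Ext2 partition /dev/sdb1 and mount point /data (adapt both to your system):

tune2fs -j /dev/sdb1                                # add a journal: the file system becomes Ext3
mount -t ext3 /dev/sdb1 /data                       # default mode, data=ordered
mount -t ext3 -o data=writeback /dev/sdb1 /data     # fastest mode, weakest data consistency

To make the mode persistent across reboots, put it in /etc/fstab, for example:

/dev/sdb1   /data   ext3   defaults,data=writeback   1 2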
1.3.5 ReiserFS

ReiserFS is a fast journaling file system with optimized disk space utilization and quick crash recovery. ReiserFS has been developed to a great extent with the help of Novell. ReiserFS is commercially supported only on Novell SUSE Linux.

1.3.6 Journal File System

The Journal File System (JFS) is a full 64-bit file system that can support very large files and partitions. JFS was developed by IBM originally for AIX® and is now available under the General Public License (GPL). JFS is an ideal file system for the very large partitions and file sizes that are typically encountered in high performance computing (HPC) or database environments. If you would like to learn more about JFS, refer to:

http://jfs.sourceforge.net

Note: In Novell SUSE Linux Enterprise Server 10, JFS is no longer supported as a new file system.

1.3.7 XFS

The eXtended File System (XFS) is a high-performance journaling file system developed by Silicon Graphics Incorporated, originally for its IRIX family of systems. It features characteristics similar to JFS from IBM, also supporting very large file and partition sizes. Therefore, its usage scenarios are very similar to those of JFS.

1.4 Disk I/O subsystem

Before a processor can decode and execute instructions, data should be retrieved all the way from the sectors on a disk platter to the processor cache and its registers, and the results of the execution can be written back to the disk. We’ll take a look at the Linux disk I/O subsystem to get a better understanding of the components that have a major effect on system performance.

1.4.1 I/O subsystem architecture

Figure 1-18 shows the basic concept of the I/O subsystem architecture.

Figure 1-18 I/O subsystem architecture (a user process issues write() against a file; the request passes through the VFS / file system layer, where the page cache and pdflush come into play, then through the block layer with its bio structures, I/O scheduler, and I/O request queue, and finally through the device driver to sectors on the disk device)

For a quick overview of overall I/O subsystem operations, we will use an example of writing data to a disk. The following sequence outlines the fundamental operations that occur when a disk-write operation is performed, assuming that the file data is on sectors on the disk platters, has already been read, and is in the page cache.

1. A process requests to write a file through the write() system call.
2. The kernel updates the page cache mapped to the file.
3. A pdflush kernel thread takes care of flushing the page cache to disk.
4. The file system layer puts each block buffer together into a bio struct (refer to 1.4.3, “Block layer” on page 23) and submits a write request to the block device layer.
5. The block device layer gets the requests from the upper layers, performs an I/O elevator operation, and puts the requests into the I/O request queue.
6. A device driver, such as the SCSI driver or another device-specific driver, takes care of the write operation.
7. The disk device firmware performs hardware operations such as seeking the head, rotation, and data transfer to the sector on the platter.

1.4.2 Cache

In the last 20 years, the performance improvement of processors has outpaced that of the other components in a computer system, such as processor cache, bus, RAM, and disk. Slower access to memory and disk restricts overall system performance, so system performance is not enhanced by processor speed improvement alone. The cache mechanism resolves this problem by caching frequently used data in faster memory, which reduces the chances of having to access slower memory. Current computer systems use this technique in almost all I/O components, such as the hard disk drive cache, the disk controller cache, the file system cache, and caches handled by each application.
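You can observe the file system cache at work by timing the same read twice. The sketch below assumes a hypothetical large test file /data/bigfile and uses the /proc/sys/vm/drop_caches interface, which is available only in later 2.6 kernels (2.6.16 and newer); the effect, not the exact timing, is the point:

sync                                            # write dirty buffers out first
echo 3 > /proc/sys/vm/drop_caches               # empty the page cache (root only)
time dd if=/data/bigfile of=/dev/null bs=1M     # first read is served from disk
time dd if=/data/bigfile of=/dev/null bs=1M     # second read is served from the page cache

The second run is typically many times faster because no disk access is needed.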
Memory hierarchy

Figure 1-19 shows the concept of the memory hierarchy. Because the difference in access speed between the CPU registers and disk is very large, the CPU would spend much of its time waiting for data from the slow disk devices, which significantly reduces the advantage of a fast CPU. The memory hierarchy reduces this mismatch by placing the L1 cache, the L2 cache, RAM, and some other caches between the CPU and the disk, which gives a process less chance of having to access the slower memory and disk. The memory closer to the processor is faster and smaller. This technique also takes advantage of the locality of reference principle: the higher the cache hit rate on the faster memory, the faster the access to data.

Figure 1-19 Memory hierarchy (CPU register and CPU cache: very fast; RAM: fast; disk: very slow; without the intermediate levels there is a large speed mismatch between register and disk)

Locality of reference

As we stated previously in “Memory hierarchy”, achieving a high cache hit rate is the key to performance improvement. To achieve a high cache hit rate, a technique called locality of reference is used. This technique is based on the following principles:

- The data most recently used has a high probability of being used in the near future (temporal locality).
- The data that resides close to data which has been used has a high probability of being used (spatial locality).

Figure 1-20 on page 22 illustrates this principle.

Figure 1-20 Locality of reference (temporal locality: a second access to the same data a few seconds after the first finds it already staged in cache and registers; spatial locality: an access to neighboring data finds it already staged in the faster levels)

Linux uses this principle in many components, such as the page cache, the file object caches (i-node cache, directory entry cache, and so on), the read-ahead buffer, and more.

Flushing a dirty buffer

When a process reads data from disk, the data is copied to memory, and the process and other processes can retrieve the same data from the copy cached in memory. When a process tries to change the data, it changes the data in memory first. At this point, the data on disk and the data in memory are not identical, and the data in memory is referred to as a dirty buffer. A dirty buffer should be synchronized to the data on disk as soon as possible, because the data in memory could be lost if a sudden crash occurs.

The synchronization process for a dirty buffer is called flush. In the Linux kernel 2.6 implementation, the pdflush kernel threads are responsible for flushing data to the disk. The flush occurs on a regular basis (kupdate) and when the proportion of dirty buffers in memory exceeds a certain threshold (bdflush). The threshold is configurable in the /proc/sys/vm/dirty_background_ratio file. For more information, refer to 4.5.1, “Setting kernel swap and pdflush behavior” on page 109.

Figure 1-21 Flushing dirty buffers (after a read, the data in memory and the data on disk are identical; after a process writes new data, only the copy in memory has changed and is a dirty buffer; a flush, triggered by pdflush or sync(), writes the data in memory to disk so that the two are identical again)
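You can watch dirty buffers build up and force a flush by hand with the following minimal sketch (the Dirty value is illustrative and changes constantly; the default dirty_background_ratio is typically 10):

$ grep Dirty /proc/meminfo
Dirty:            2836 kB
$ cat /proc/sys/vm/dirty_background_ratio
10
$ sync                       # write all dirty buffers to disk now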
1.4.3 Block layer

The block layer handles all the activity related to block device operation (refer to Figure 1-18 on page 20). The key data structure in the block layer is the bio structure, which is the interface between the file system layer and the block layer. When a write is performed, the file system layer tries to write to the page cache, which is made up of block buffers. It makes up a bio structure by putting the contiguous blocks together and then sends the bio to the block layer.

The block layer handles the bio request and links these requests into a queue called the I/O request queue. This linking operation is called the I/O elevator. In the Linux kernel 2.6 implementation, four types of I/O elevator algorithms are available; they are described under “I/O elevator” below.

Block sizes

The block size, the smallest amount of data that can be read from or written to a drive, can have a direct impact on a server’s performance. As a guideline, if your server is handling a lot of small files, a smaller block size will be more efficient. If your server is dedicated to handling large files, a larger block size might improve performance. Block sizes cannot be changed on the fly on existing file systems; only a reformat will modify the current block size.

I/O elevator

The Linux kernel 2.6 employs a new I/O elevator model. While the Linux kernel 2.4 used a single, general-purpose I/O elevator, kernel 2.6 offers the choice of four elevators. Because the Linux operating system can be used for a wide range of tasks, both I/O devices and workload characteristics change significantly. A notebook computer probably has different I/O requirements than a 10,000-user database system. To accommodate this, four I/O elevators are available.

- Anticipatory: The anticipatory I/O elevator was created based on the assumption of a block device with only one physical seek head (for example, a single SATA drive). The anticipatory elevator uses the deadline mechanism described in more detail below plus an anticipation heuristic. As the name suggests, the anticipatory I/O elevator “anticipates” I/O and attempts to write it in single, bigger streams to the disk instead of multiple