What is the difference between nr and run

2023-12-30 10:28:19
Eighteen weapons: a Linux performance analysis tool

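System-level performance optimization usually has two stages: performance profiling and code optimization.

The goal of profiling is to find performance bottlenecks: the causes of performance problems and the hot code.

The goal of code optimization is to change the code or the compiler options for a specific performance problem in order to improve software performance.

The profiling stage relies on existing profiling tools such as perf. The code optimization stage often relies on developer experience: writing concise, efficient code, and sometimes choosing instructions and ordering them carefully at the assembly level.

perf is a Linux performance analysis tool. Linux performance counters are a kernel-based subsystem that provides a framework for performance analysis, covering hardware features (the CPU and its PMU, Performance Monitoring Unit) and software features (software counters, tracepoints). Through perf, applications can use the PMU, tracepoints, and kernel counters to collect performance statistics. perf can analyze a specified application (per thread), the kernel, or both at once, which gives a full picture of an application's bottlenecks.

With perf you can analyze hardware events that occur while a program runs, such as instructions retired and processor clock cycles, as well as software events such as page faults and process switches.

perf is a comprehensive tool: it covers everything from system-wide performance down to individual processes and threads, and further down to functions and assembly instructions.

perf offers a full armory (the "eighteen weapons"): you can use a broadsword for coarse cuts or pick up a scalpel for fine analysis.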

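1. Background

1.1 tracepoints

Tracepoints are hooks scattered through the kernel source. They fire when a particular piece of code is executed, and various trace/debug tools can take advantage of this.

perf records the events produced by tracepoints and generates reports. By analyzing these reports, a performance engineer can learn the details of kernel behavior while a program runs and diagnose performance symptoms accurately.

The filesystem nodes for these tracepoints are under the /sys/kernel/debug/tracing/events directory.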
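
As a rough illustration of the directory layout just mentioned, the sketch below walks an events directory and builds names in the subsystem:event form that perf uses. It assumes each subsystem is a directory containing one directory per event; the function name list_tracepoints is made up for this example, and reading the real directory normally needs root.

```python
import os


def list_tracepoints(events_dir="/sys/kernel/debug/tracing/events"):
    """Return sorted 'subsystem:event' names found under events_dir.

    Assumption: every subsystem is a directory holding one directory per
    event; plain files such as 'enable' or 'filter' are control files.
    """
    names = []
    for subsystem in sorted(os.listdir(events_dir)):
        sub_path = os.path.join(events_dir, subsystem)
        if not os.path.isdir(sub_path):
            continue  # skip top-level control files
        for event in sorted(os.listdir(sub_path)):
            if os.path.isdir(os.path.join(sub_path, event)):
                names.append(f"{subsystem}:{event}")
    return names
```

On a live system, list_tracepoints() called as root should return roughly the same names that sudo perf list tracepoint prints.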

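1.2 Hardware feature: cache

Memory reads and writes are fast, but they still cannot keep up with the rate at which the processor executes instructions. The processor has to wait to fetch instructions and data from memory, and measured in processor time that wait is very long. A cache is built from SRAM and is fast enough to match the processor. Keeping frequently used data in the cache spares the processor the wait and improves performance. Caches are generally small, so making good use of them is an important part of software tuning.

2. Main points of interest

Based on performance analysis you can optimize algorithms (trading space complexity against time complexity) and optimize code (faster execution, lower memory use).

Evaluate how the program uses hardware resources, for example accesses and misses at each cache level, pipeline stall cycles, and front-side bus accesses.

Evaluate how the program uses operating system resources: number of system calls, context switches, and task migrations.

Events fall into three kinds:

Hardware Event: produced by the PMU, which detects under specific conditions whether a performance event occurred and how many times. Example: cache hits.

Software Event: produced by the kernel and spread across its functional modules; counts performance events related to the operating system. Examples: process switches, tick counts.

Tracepoint Event: triggered by static tracepoints in the kernel, which expose the details of kernel behavior while a program runs. Example: the number of allocations made by the slab allocator.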

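3. Using perf

perf --help lists perf's subcommands.

No. | Command | Description
1 | annotate | Reads the perf.data file generated by perf record and displays annotated code.
2 | archive | Packs every sampled ELF file, based on the build-ids recorded in the data file. With this archive, the samples in the data file can be analyzed on any machine.
3 | bench | Benchmarks built into perf; currently includes two suites, for the scheduler and the memory management subsystem.
4 | buildid-cache | Manages perf's build-id cache. Every ELF file has a unique build-id, which perf uses to associate performance data with ELF files.
5 | buildid-list | Lists all build-ids recorded in the data file.
6 | diff | Compares two data files and reports the per-symbol (function) difference in the hotspot analysis.
7 | evlist | Lists all performance events in the perf.data file.
8 | inject | Reads the event stream recorded by perf record and redirects it to standard output. Other events can be injected into the stream at any point in the analyzed code.
9 | kmem | Traces and measures the kernel memory (slab) subsystem.
10 | kvm | Traces and measures a guest OS running in a KVM virtual machine.
11 | list | Lists all performance events supported by the current system: hardware events, software events, and tracepoints.
12 | lock | Analyzes kernel locks, including lock usage and wait latency.
13 | mem | Profiles memory accesses.
14 | record | Collects samples and records them in a data file, which other tools can then analyze.
15 | report | Reads the data file created by perf record and presents the hotspot analysis.
16 | sched | Analysis tool for the scheduler subsystem.
17 | script | Runs extension scripts written in perl or python, generates script skeletons, and reads the data in a data file.
18 | stat | Runs a command and collects a performance summary for the process, including CPI and cache miss rate.
19 | test | Runs sanity tests on the current hardware and software platform to check whether it supports all perf features.
20 | timechart | Visualizes system behavior during the test.
21 | top | Like the Linux top command; analyzes system performance in real time.
22 | trace | A tool for system calls.
23 | probe | Defines dynamic tracepoints.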

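System-wide overview:

perf list shows the performance events supported by the current system;
perf bench gives a baseline of system performance;
perf test runs sanity tests on the system;
perf stat collects system-wide statistics;

System-wide detail:

perf top shows, in real time, how much each process and function is using;
perf probe defines custom dynamic events;

Analysis of specific subsystems:

perf kmem analyzes the slab allocator;
perf kvm analyzes KVM virtualization;
perf lock analyzes lock performance;
perf mem analyzes memory accesses;
perf sched analyzes the kernel scheduler;
perf trace records system call traces;

The most commonly used feature is perf record. It can cover the whole system, a single process, or even a single event of a single process, so it works at both the macro and the micro level.

perf record writes data to perf.data;
perf report generates a report;
perf diff compares two recordings;
perf evlist lists the recorded performance events;
perf annotate shows the code for functions in perf.data;
perf archive packs the related symbols so the data can be analyzed on another machine;
perf script prints perf.data as readable text;

Visualization tool perf timechart:

perf timechart record records events;
perf timechart generates the output.svg file;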

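3.0 Overhead introduced by perf

Measuring with perf inevitably adds overhead. perf works in three modes:

counting: the kernel provides count summaries, mostly for Hardware Events, Software Events, and PMU counters. Related command: perf stat.

sampling: perf buffers event data and writes it asynchronously to the perf.data file, which is then analyzed offline with tools such as perf report.

bpf: added in kernel 4.4+; allows more efficient filters and output summaries.

counting adds the least overhead; sampling can add very large overhead in some cases; bpf can reduce overhead effectively.

For sampling, mounting a RAM-backed file system for the output effectively reduces the overhead caused by file I/O.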

mkdir /tmpfs
mount -t tmpfs tmpfs /tmpfs

3.1 perf list

Without root privileges perf list may not show every supported event type; use sudo perf list.

perf list can also show the events of one category: hw, cache, and pmu are hardware related; tracepoint is based on the kernel's ftrace; sw events are kernel counters.

hw/hardware shows the supported hardware events, for example:

al@al-System-Product-Name:~/perf$ sudo perf list hardware
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]

sw/software shows the supported software events:

al@al-System-Product-Name:~/perf$ sudo perf list sw
List of pre-defined events (to be used in -e):
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]

cache/hwcache shows the hardware cache events:

al@al-System-Product-Name:~/perf$ sudo perf list cache
List of pre-defined events (to be used in -e):
L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
L1-icache-prefetches [Hardware cache event]
LLC-load-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-stores [Hardware cache event]
branch-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
node-load-misses [Hardware cache event]
node-loads [Hardware cache event]

pmu shows the supported PMU events:

al@al-System-Product-Name:~/perf$ sudo perf list pmu
List of pre-defined events (to be used in -e):
branch-instructions OR cpu/branch-instructions/ [Kernel PMU event]
branch-misses OR cpu/branch-misses/ [Kernel PMU event]
cache-misses OR cpu/cache-misses/ [Kernel PMU event]
cache-references OR cpu/cache-references/ [Kernel PMU event]
cpu-cycles OR cpu/cpu-cycles/ [Kernel PMU event]
instructions OR cpu/instructions/ [Kernel PMU event]
msr/aperf/ [Kernel PMU event]
msr/mperf/ [Kernel PMU event]
msr/tsc/ [Kernel PMU event]
stalled-cycles-backend OR cpu/stalled-cycles-backend/ [Kernel PMU event]
stalled-cycles-frontend OR cpu/stalled-cycles-frontend/ [Kernel PMU event]

tracepoint shows every supported tracepoint; this list is much longer:

al@al-System-Product-Name:~/perf$ sudo perf list tracepoint
List of pre-defined events (to be used in -e):
alarmtimer:alarmtimer_cancel [Tracepoint event]
alarmtimer:alarmtimer_fired [Tracepoint event]
alarmtimer:alarmtimer_start [Tracepoint event]
alarmtimer:alarmtimer_suspend [Tracepoint event]
block:block_bio_backmerge [Tracepoint event]
block:block_bio_bounce [Tracepoint event]
block:block_bio_complete [Tracepoint event]
block:block_bio_frontmerge [Tracepoint event]
block:block_bio_queue [Tracepoint event]
…
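
When these listings need to be processed by a script, each line can be split into its event names and its bracketed category. The sketch below is only an illustration based on the line shape shown in the listings above ("name OR alias [Category]"); the helper name parse_event_line is hypothetical, and other perf versions may print additional columns.

```python
import re

# One listing line: one or more names joined by " OR ", then "[Category]".
EVENT_LINE = re.compile(r"^\s*(?P<names>.+?)\s+\[(?P<kind>[^\]]+)\]\s*$")


def parse_event_line(line):
    """Return (list_of_names, category), or None for non-event lines."""
    match = EVENT_LINE.match(line)
    if match is None:
        return None
    names = [name.strip() for name in match.group("names").split(" OR ")]
    return names, match.group("kind")
```

For example, the line "cpu-cycles OR cycles [Hardware event]" yields both aliases and the category "Hardware event", while header lines such as "List of pre-defined events (to be used in -e):" yield None.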

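3.2 perf top

By default perf top cannot display anything for an unprivileged user. Run sudo perf top, or echo -1 > /proc/sys/kernel/perf_event_paranoid (on Ubuntu 16.04 you also need echo 0 > /proc/sys/kernel/kptr_restrict).

perf top then displays normally. Its output has four columns:

Column 1: the share of the performance event attributed to the symbol; by default, the share of CPU cycles.

Column 2: the DSO (Dynamic Shared Object) containing the symbol: an application, the kernel, a shared library, or a module.

Column 3: the DSO type. [.] means the symbol belongs to a user-space ELF file (an executable or a shared library); [k] means it belongs to the kernel or a module.

Column 4: the symbol name. Some symbols cannot be resolved to a function name and are shown as addresses.

Common keys in the perf top interface:

h: show help with detailed information.
UP/DOWN/PGUP/PGDN/SPACE: move up and down and page.
a: annotate current symbol. Shows the assembly with the sample rate of each instruction.
d: filter out all symbols that do not belong to this DSO. Handy for viewing symbols of one kind.
P: save the current view to perf.hist.N.

Common perf top options:

-e <event>: the performance event to analyze.
-p <pid>: Profile events on existing Process ID (comma separated list). Only the target process and the threads it creates are analyzed.
-k <path>: Path to vmlinux. Required for annotation functionality. This is the path of the kernel image with its symbol table.
-K: hide symbols that belong to the kernel or modules.
-U: hide symbols that belong to user-space programs.
-d <n>: refresh interval of the interface. The default is 2 s, because perf top reads performance data from the mmap'd region every 2 s by default.
-g: collect the function call graph.

perf top --call-graph [fractal]: path percentages are relative and add up to 100%; the call order reads from bottom to top.

perf top --call-graph graph: path percentages are absolute and add up to the heat of that function.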

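3.3 perf stat

perf stat runs a command and reports statistics for it. perf top can also target a pid, but the application has to be running already before anything can be observed.

perf stat covers the whole lifetime of the application.

Command format:

perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]

A quick look at the output of perf stat:

al@al-System-Product-Name:~/perf$ sudo perf stat
^C
Performance counter stats for 'system wide':
40904.820871 cpu-clock (msec) # 5.000 CPUs utilized
18,132 context-switches # 0.443 K/sec
1,053 cpu-migrations # 0.026 K/sec
2,420 page-faults # 0.059 K/sec
3,958,376,712 cycles # 0.097 GHz (49.99%)
574,598,403 stalled-cycles-frontend # 14.52% frontend cycles idle (49.98%)
9,392,982,910 stalled-cycles-backend # 237.29% backend cycles idle (50.00%)
1,653,185,883 instructions # 0.42 insn per cycle
# 5.68 stalled cycles per insn (50.01%)
237,061,366 branches # 5.795 M/sec (50.02%)
18,333,168 branch-misses # 7.73% of all branches (50.00%)
8.181521203 seconds time elapsed

The fields mean the following:

cpu-clock: processor time actually consumed by the task, in ms. CPUs utilized = task-clock / time elapsed, i.e. the CPU utilization.
context-switches: number of context switches while the program ran.
cpu-migrations: number of times the program was migrated between processors. To balance load across processors, Linux moves a task from one CPU to another under certain conditions.
CPU migration versus context switch: a context switch does not necessarily involve a CPU migration, but a CPU migration always involves a context switch. A context switch may simply swap the context out of the current CPU, and the scheduler may place the process on the same CPU the next time.
page-faults: number of page faults. A page fault is raised when the page requested by the application has not been set up yet, when the page is not in memory, or when the page is in memory but the mapping between its physical and virtual addresses has not been established. A mismatch in page access permissions also raises a page fault. A TLB miss by itself normally does not; it counts only if the following page-table lookup fails as well.
cycles: processor cycles consumed. The GHz value next to it is cycles / task-clock: if the cycles used by the command (for example ls) are treated as coming from a single processor, this is that processor's effective clock frequency, for example 2.486 GHz.
stalled-cycles-frontend: cycles in which the instruction fetch or decode stage stalled instead of working in parallel as intended.
stalled-cycles-backend: cycles in which the instruction execution stage stalled.
instructions: number of instructions executed. IPC is the average number of instructions executed per CPU cycle.
branches: number of branch instructions encountered. branch-misses is the number of mispredicted branches.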
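
The derived values that perf stat prints after the # sign can be reproduced from the raw counters. The sketch below is an illustration only: the function name derived_metrics is made up, and taking the larger of the two stall counters for "stalled cycles per insn" is an assumption that happens to match every sample output in this article. The input numbers come from the perf stat ... ls run shown later in this section.

```python
def derived_metrics(task_clock_ms, elapsed_s, cycles, instructions,
                    stalled_frontend=0, stalled_backend=0):
    """Recompute the ratios perf stat prints next to the raw counters."""
    task_clock_s = task_clock_ms / 1000.0
    return {
        # CPUs utilized = task-clock / time elapsed
        "cpus_utilized": task_clock_s / elapsed_s,
        # effective frequency = cycles / task-clock, expressed in GHz
        "ghz": cycles / task_clock_s / 1e9,
        # IPC = instructions / cycles
        "ipc": instructions / cycles,
        # assumption: perf uses the larger stall counter here
        "stalled_per_insn": max(stalled_frontend, stalled_backend) / instructions,
    }


# Counters from the 'ls' example in this section.
metrics = derived_metrics(2.319422, 0.003227507, 2_142_386, 1_344_518,
                          659_800, 725_343)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded the way perf prints them, these values agree with the 0.719 CPUs utilized, 0.924 GHz, 0.63 insn per cycle, and 0.54 stalled cycles per insn reported for that run.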

Other common options

-a, --all-cpus collect statistics on all CPUs
-C, --cpu <cpu> collect statistics on the given CPU
-c, --scale scale/normalize counters
-D, --delay <n> ms to wait before starting measurement after program start
-d, --detailed detailed run - start a lot of events
-e, --event <event> event selector. use 'perf list' to list available events
-G, --cgroup <name> monitor event in cgroup name only
-g, --group put the counters into a counter group
-I, --interval-print <n> print counts at regular interval in ms (>=10)
-i, --no-inherit child tasks do not inherit counters
-n, --null null run - dont start any counters
-o, --output <file> write the statistics to a file
-p, --pid <pid> stat events on existing process id
-r, --repeat <n> repeat command and print average + stddev (max: 100, forever: 0)
-S, --sync call sync() before starting a run
-t, --tid <tid> stat events on existing thread id
...

Examples

The previous example collected statistics system wide. The next one collects statistics for a single CPU.

Run sudo perf stat -C 0 to collect statistics for CPU 0, and press Ctrl+C to stop. The counters are the same; only the target has changed.

al@al-System-Product-Name:~/perf$ sudo perf stat -C 0
^C
Performance counter stats for 'CPU(s) 0':
2517.107315 cpu-clock (msec) # 1.000 CPUs utilized
2,941 context-switches # 0.001 M/sec
109 cpu-migrations # 0.043 K/sec
38 page-faults # 0.015 K/sec
644,094,340 cycles # 0.256 GHz (49.94%)
70,425,076 stalled-cycles-frontend # 10.93% frontend cycles idle (49.94%)
965,270,543 stalled-cycles-backend # 149.86% backend cycles idle (49.94%)
623,284,864 instructions # 0.97 insn per cycle
# 1.55 stalled cycles per insn (50.06%)
65,658,190 branches # 26.085 M/sec (50.06%)
3,276,104 branch-misses # 4.99% of all branches (50.06%)
2.516996126 seconds time elapsed

To collect more counters, list them with -e, for example:

perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,dTLB-loads,dTLB-load-misses ls

The result is below; the extra events of interest are now part of the statistics. Events marked <not counted> were requested but never got scheduled on a hardware counter during this very short run.

al@al-System-Product-Name:~/perf$ sudo perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,dTLB-loads,dTLB-load-misses ls
Performance counter stats for 'ls':
2.319422 task-clock (msec) # 0.719 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
89 page-faults # 0.038 M/sec
2,142,386 cycles # 0.924 GHz
659,800 stalled-cycles-frontend # 30.80% frontend cycles idle
725,343 stalled-cycles-backend # 33.86% backend cycles idle
1,344,518 instructions # 0.63 insn per cycle
# 0.54 stalled cycles per insn
<not counted> branches
<not counted> branch-misses
<not counted> L1-dcache-loads
<not counted> L1-dcache-load-misses
<not counted> LLC-loads
<not counted> LLC-load-misses
<not counted> dTLB-loads
<not counted> dTLB-load-misses
0.003227507 seconds time elapsed

3.4 perf bench

perf bench is a general framework for benchmarks. It contains suites for subsystems such as sched, mem, numa, and futex; all selects every suite.

perf bench can be used to evaluate specific aspects of the system such as scheduling and memory.

perf bench sched: scheduler and IPC mechanisms. Contains the messaging and pipe benchmarks.
perf bench mem: memory access performance. Contains the memcpy and memset benchmarks.
perf bench numa: scheduling and memory handling on NUMA architectures. Contains the mem benchmark.
perf bench futex: futex stress tests. Contains the hash, wake, wake-parallel, requeue, and lock-pi benchmarks.
perf bench all: the collection of all bench tests.

3.4.1 perf bench sched all

Tests both messaging and pipe performance.

3.4.1.1 sched messaging: process scheduling and inter-core communication

sched messaging is ported from the classic test program hackbench and measures scheduler performance, overhead, and scalability.

The benchmark starts N reader/sender pairs of processes or threads that read and write concurrently over IPC (socket or pipe). N is usually increased step by step to measure how well the scheduler scales.

sched messaging has the same usage and purpose as hackbench. Its parameters can be changed for different test goals:

-g, --group <n> Specify number of groups
-l, --nr_loops <n> Specify the number of loops to run (default: 100)
-p, --pipe Use pipe() instead of socketpair()
-t, --thread Be multi thread instead of multi process

Test result:

al@al-System-Product-Name:~/perf$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.173 [sec]
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes
Total time: 12.233 [sec]
12.233170 usecs/op
81744 ops/sec

Effect of pipe() versus socketpair() on the test:

1. perf bench sched messaging
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.176 [sec]
2. perf bench sched messaging -p
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.093 [sec]

In this run socketpair() is clearly slower than pipe(): 0.176 s versus 0.093 s.

3.4.1.2 sched pipe: pipe performance

sched pipe is ported from Ingo Molnar's pipe-test-1m.c. Ingo's original program was written to test the performance and fairness of different schedulers.

It works in a simple way: two processes send 1000000 integers to each other through pipes as fast as they can, process A to B and at the same time B to A. Because A and B depend on each other, an unfair scheduler that favors A over B makes the total time for both longer.

al@al-System-Product-Name:~/perf$ perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 12.240 [sec]
12.240411 usecs/op
81696 ops/sec
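
The two rates at the end of the sched pipe output follow directly from the total time and the number of operations. A minimal sketch, assuming the default of 1000000 operations and assuming that the ops/sec figure is truncated to an integer (which is consistent with both runs shown in this article); the function name pipe_bench_rates is made up for illustration.

```python
def pipe_bench_rates(total_time_s, operations=1_000_000):
    """Derive usecs/op and ops/sec from a sched pipe total time."""
    usecs_per_op = total_time_s * 1e6 / operations
    # assumption: the reported value is truncated, not rounded
    ops_per_sec = int(operations / total_time_s)
    return usecs_per_op, ops_per_sec


usecs, ops = pipe_bench_rates(12.240411)
print(f"{usecs:.6f} usecs/op, {ops} ops/sec")
```

With 1000000 operations the usecs/op value is numerically the same as the total time in seconds, which is why the two lines of the report look so similar.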

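3.4.2 perf bench mem all

This test measures how long different versions of the memcpy/memset functions take to process 1 MB of data and converts the result into throughput.

al@al-System-Product-Name:~/perf$ perf bench mem all
# Running mem/memcpy benchmark...
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
1.236155 GB/sec
...

3.4.3 perf bench futex

A futex is a hybrid mechanism between user space and kernel space, so both sides have to cooperate. Linux provides the sys_futex system call to support synchronization when processes contend.

Every futex synchronization operation should start in user space. First a futex synchronization variable is created: an integer counter located in shared memory.

When a process tries to take the lock or enter the critical section, it performs a "down" operation on the futex, i.e. it atomically decrements the futex variable by 1. If the variable becomes 0, there is no contention and the process continues as normal.

If the variable becomes negative, there is contention, and the process calls the futex_wait operation of the futex system call to put itself to sleep.

When a process releases the lock or leaves the critical section, it performs an "up" operation on the futex, i.e. it atomically increments the futex variable by 1. If the variable goes from 0 to 1, there is no contention and the process continues as normal.

If the variable was negative before the increment, there is contention, and the process calls the futex_wake operation of the futex system call to wake one or more waiting processes.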
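
The down/up protocol described above can be modelled in a few lines. This is a single-threaded toy that only tracks the counter and reports which path a real implementation would take; it makes no futex system calls, its operations are not atomic, and the class name FutexModel and its return strings are invented for this sketch. It also leaves out what a real lock does with the counter after a wake-up.

```python
class FutexModel:
    """Toy model of the futex counter protocol (no real syscalls)."""

    def __init__(self):
        self.value = 1  # 1 means the lock is free

    def down(self):
        """Try to take the lock: decrement and inspect the result."""
        self.value -= 1
        if self.value == 0:
            return "acquired"      # no contention, keep running
        return "futex_wait"        # negative: would sleep in the kernel

    def up(self):
        """Release the lock: increment and inspect the previous value."""
        before = self.value
        self.value += 1
        if before == 0:
            return "released"      # 0 -> 1, nobody was waiting
        return "futex_wake"        # was negative: would wake a waiter


lock = FutexModel()
print(lock.down())  # first caller gets the lock without a syscall
print(lock.down())  # second caller finds it taken
```

The point of the design is visible even in the toy: the uncontended paths ("acquired", "released") never enter the kernel, and only the contended paths need futex_wait or futex_wake.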

al@al-System-Product-Name:~/perf$ perf bench futex all
# Running futex/hash benchmark...
Run summary [PID 3806]: 5 threads, each operating on 1024 [private] futexes for 10 secs.
[thread 0] futexes: 0x4003d20 ... 0x4004d1c [ 4635648 ops/sec ]
[thread 1] futexes: 0x4004d30 ... 0x4005d2c [ 4611072 ops/sec ]
[thread 2] futexes: 0x4005e70 ... 0x4006e6c [ 4254515 ops/sec ]
[thread 3] futexes: 0x4006fb0 ... 0x4007fac [ 4559360 ops/sec ]
[thread 4] futexes: 0x40080f0 ... 0x40090ec [ 4636262 ops/sec ]
Averaged 4539371 operations/sec (+- 1.60%), total secs=10
# Running futex/wake benchmark...
Run summary [PID 3806]: blocking on 5 threads (at [private] futex 0x96b52c), waking up 1 at a time.
[Run 1]: Wokeup 5 of 5 threads in 0.0270 ms
[Run 2]: Wokeup 5 of 5 threads in 0.0370 ms
...

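3.4 perf record

Runs a command and saves its profile to perf.data, which can then be analyzed with perf report.

perf record and perf report allow a more precise analysis of an application. perf record resolves down to the function level, and the report can show assembly interleaved with source code inside a function.

Create a file fork.c for testing:

#include <stdio.h>
#include <unistd.h>    /* fork, getpid, getppid, sleep */
#include <sys/wait.h>  /* wait */

/* Four busy loops of increasing length: 1, 2, 3 and 4 units of work. */
void test_little(void)
{
  int i,j;
  for(i=0; i < 30000000; i++)
    j=i;
}

void test_medium(void)
{
  int i,j;
  for(i=0; i < 60000000; i++)
    j=i;
}

void test_high(void)
{
  int i,j;
  for(i=0; i < 90000000; i++)
    j=i;
}

void test_hi(void)
{
  int i,j;
  for(i=0; i < 120000000; i++)
    j=i;
}

int main(void)
{
  int i, pid, result;

  /* Two rounds of fork(): 2 processes after round 0, 4 after round 1. */
  for(i=0; i<2; i++) {
    result=fork();
    if(result>0)
      printf("i=%d parent parent=%d current=%d child=%d\n", i, getppid(), getpid(), result);
    else
      printf("i=%d child parent=%d current=%d\n", i, getppid(), getpid());

    if(i==0)
    {
      test_little();
      sleep(1);
    } else {
      test_medium();
      sleep(1);
    }
  }

  pid=wait(NULL);
  test_high();
  printf("pid=%d wait=%d\n", getpid(), pid);
  sleep(1);
  pid=wait(NULL);
  test_hi();
  printf("pid=%d wait=%d\n", getpid(), pid);
  return 0;
}

Compile fork.c with gcc fork.c -o fork -g -O0. The same method can be used to compare results with and without compiler optimization. -g emits the debug information that the call-graph and annotate features rely on, and -O0 turns optimization off.

Common options

-e select the event to record
--filter event filter
-a record events on all CPUs
-p record events of the process with the given pid
-o name of the file to save the recorded data to
-g enable call-graph recording
-C record events on the given CPU

sudo perf record -a -g ./fork generates the perf.data file in the current directory.

sudo perf report --call-graph none shows the report, which is analyzed together with perf timechart later on.

The full report is cluttered. To see only what fork produced:

sudo perf report --call-graph none -c fork

Now only the symbols related to the fork program and their shares are shown.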

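3.5 perf report

Parses the data produced by perf record and presents the analysis.

Common options:

-i name of the data file to read; defaults to perf.data
-g generate the function call graph. The kernel must be built with CONFIG_KALLSYMS; user-space libraries and executables must carry symbol information (not stripped) and be compiled with -g.
--sort show aggregated statistics from a higher level, for example: pid, comm, dso, symbol, parent, cpu, socket, srcline, weight, local_weight.

Running sudo perf report -i perf.data shows the percentage taken by main, and the percentages of funcA and funcB.

While funcB was running, an APIC timer interrupt also fired and took some CPU time. Apart from that, the ratio is roughly 1:10.

The ratio between funcA and funcB is about what was expected. Next, drill into longa to analyze the hotspot.

In the mixed C and assembly view, the for loop accounts for 69.92% and the assignment j=i for 30.08%.

From the above, top is suited to monitoring the performance of the whole system, stat is better suited to analyzing a single program, and record/report are better suited to fine-grained analysis of a program.

Notes:

perf report -g may print Failed to open /lib/libpthread-0.9.33.2.so, continuing without symbols.

In that case check the file with file xxx. If the output says xxxx stripped, the file carries no symbol information, and a not stripped version of it is needed.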

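3.6 perf timechart

perf timechart turns the statistics described so far into a chart.

perf timechart record <option> <command> records the events of the whole system or of one application; options can restrict it to particular kinds of events.

perf timechart converts perf.data into an SVG file, which can be opened with Inkscape or a browser.

perf timechart record can be limited to particular kinds of events:

-P: record power-related events
-T: record task-related events
-I: record I/O-related events
-g: record call graphs

perf timechart converts the perf.data recorded by perf timechart record into output.svg.

-w adjusts the width of the output SVG so that more detail is visible.

-p limits the output to certain processes. Usage: sudo perf timechart -p test1 -p thermald

-o output file name
-i name of the file to parse
-w width of the output SVG
-P show only power-related events
-T, --tasks-only show task information only, no processor information
-p show only the given process name or PID
--symfs=<directory> path to the system symbol tables
-t, --topology sort CPUs according to topology
--highlight=<duration_nsecs|task_name> highlight tasks that run longer than the given time

When there are so many threads that the SVG becomes slow to render, use -p to analyze specific threads only. To select several threads, repeat -p xxx for each one.

sudo perf timechart record -T ./fork && sudo perf timechart -p fork

The resulting chart shows the name of each task, its start and end time, and its state at every point in time (Running/Idle/Deeper Idle/Deepest Idle/Sleeping/Waiting for Cpu/Blocked on IO).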

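3.6.1 Combining perf timechart and perf report to analyze function shares

perf report shows that test_little, test_medium, test_high, and test_hi account for 3.84%, 12.01%, 22.99%, and 30.43% respectively.

From the code, if test_little is 1 unit of work, then test_medium is 2 units, test_high is 3 units, and test_hi is 4 units.

The four functions are executed 2, 4, 4, and 4 times, so the CPU share per unit of work for each function is:

test_little - 3.84%/2/1=1.9%

test_medium - 12.01%/4/2=1.5%

test_high - 22.99%/4/3=1.9%

test_hi - 30.43%/4/4=1.9%

This is roughly what was expected.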
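
The same normalization can be written as a small script, which makes it easier to redo with other measurements. This is only a sketch of the arithmetic above; the dictionary layout and the name share_per_unit are illustrative choices, and the percentages are the ones quoted from perf report in this section.

```python
# name: (share of samples in %, number of calls, work units per call)
samples = {
    "test_little": (3.84, 2, 1),
    "test_medium": (12.01, 4, 2),
    "test_high": (22.99, 4, 3),
    "test_hi": (30.43, 4, 4),
}


def share_per_unit(share_pct, calls, units):
    """CPU share attributable to one unit of work of one call."""
    return share_pct / (calls * units)


per_unit = {name: round(share_per_unit(*row), 2)
            for name, row in samples.items()}
for name, value in per_unit.items():
    print(f"{name}: {value}%")
```

If the loops scale linearly, all four values should be close to each other; test_medium at about 1.5% is the one outlier in this data set.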

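When I/O events are recorded, the chart shows Disk/Network/Sync/Poll/Error information grouped by application, as well as the data throughput of each application.

sudo perf timechart record -I && sudo perf timechart -w 1800

When power state events are recorded, the difference is that states such as Idle are broken down further into C-states (C/C2), which shows the power state in more detail.

sudo perf timechart record -P && sudo perf timechart -w 1800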

3.7 perf script

Reads the raw trace data saved by perf record.

Usage:

perf script [<options>]
perf script [<options>] record <script> [<record-options>] <command>
perf script [<options>] report <script> [script-args]
perf script [<options>] <script> <required-script-args> [<record-options>] <command>
perf script [<options>] <top-script> [script-args]

You can also write perl or python scripts to analyze the data.

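3.8 perf lock

3.8.1 Kernel configuration for perf lock

This feature needs kernel build options: CONFIG_LOCKDEP and CONFIG_LOCK_STAT.

CONFIG_LOCKDEP defines acquired and release events.

CONFIG_LOCK_STAT defines contended and acquired lock events.

CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y

3.8.2 Using perf lock

Analyzes kernel lock statistics.

Locks are how the kernel synchronizes. Once a lock is held, every other kernel path that wants it has to wait, which reduces parallelism. Incorrect locking can also cause deadlocks.

Analyzing kernel locks is therefore an important part of tuning.

perf lock record: captures the lock events of the executed command into perf.data

perf lock report: produces a statistics report

perf lock script: shows the raw lock events

perf lock info: shows metadata such as the threads or addresses of lock instances

-k <value>: sorting key. The default is acquired; contended, wait_total, wait_max, and wait_min can also be used.

Name: name of the kernel lock.

acquired: number of times the lock was obtained directly, without waiting, because no other kernel path held it.

contended: number of times the lock was obtained after waiting, because another kernel path held it.

total wait: total time spent waiting to obtain the lock.

max wait: longest time spent waiting to obtain the lock.

min wait: shortest time spent waiting to obtain the lock.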

3.9 perf kmem

3.9.1 Introduction to perf kmem

perf kmem traces and measures events of the kernel slab allocator, such as memory allocation and release. It can be used to find where a program allocates large amounts of memory, where fragmentation arises, and similar memory management problems.

perf kmem and perf lock are in effect specializations of perf tracepoints, equivalent to perf record -e kmem:* and perf record -e lock:*.

These tools, however, aggregate and analyze the raw data internally, so their reports are easier to read.

perf kmem record: captures the kernel slab allocator events of a command

perf kmem stat: generates kernel slab allocator statistics

Options:

--caller show per-callsite statistics
--alloc show per-allocation statistics
-s <key[,key2...]>, --sort=<key[,key2...]>
Sort the output (default: frag,hit,bytes for slab and bytes,hit for page). Available sort keys are ptr, callsite, bytes, hit, pingpong, frag for slab and page, callsite, bytes, hit, order, migtype, gfp for page. This option should be preceded by one of the mode selection options - i.e. --slab, --page, --alloc and/or --caller.
-l <num> show only the given number of lines
--raw-ip
Print raw ip instead of symbol
--slab analyze slab allocator events
--page analyze page allocation events
--live
Show live page stat. The perf kmem shows total allocation stat by default, but this option shows live (currently allocated) pages instead. (This option works with --page option only)

3.9.2 Using perf kmem

sudo perf kmem record ls

sudo perf kmem stat shows only the summary:

SUMMARY (SLAB allocator)
========================
Total bytes requested: 368,589
Total bytes allocated: 369,424
Total bytes wasted on internal fragmentation: 835
Internal fragmentation: 0.226028%
Cross CPU allocations: 0/2,256

sudo perf kmem --alloc --caller --slab stat shows a more detailed breakdown:

---------------------------------------------------------------------------------------------------------
Callsite | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag
---------------------------------------------------------------------------------------------------------
proc_reg_open+32 | 64/64 | 40/40 | 1 | 0 | 37.500%
seq_open+34 | 384/192 | 272/136 | 2 | 0 | 29.167%
apparmor_file_alloc_security+5c | 608/32 | 456/24 | 19 | 1 | 25.000%
ext4_readdir+8bd | 64/64 | 48/48 | 1 | 0 | 25.000%
ext4_htree_store_dirent+3e | 896/68 | 770/59 | 13 | 0 | 14.062%
load_elf_phdrs+64 | 1024/512 | 896/448 | 2 | 0 | 12.500%
load_elf_binary+222 | 32/32 | 28/28 | 1 | 0 | 12.500%
anon_vma_prepare+11b | 1280/80 | 1152/72 | 16 | 0 | 10.000%
inotify_handle_event+75 | 73664/64 | 66758/58 | 1151 | 0 | 9.375%
do_execveat_common.isra.33+e5 | 2048/256 | 1920/240 | 8 | 1 | 6.250%
... | ... | ... | ... | ... | ...
---------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------------
Alloc Ptr | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag
---------------------------------------------------------------------------------------------------------
0xffff8800ca4d86c0 | 192/192 | 136/136 | 1 | 0 | 29.167%
0xffff8801ea05aa80 | 192/192 | 136/136 | 1 | 0 | 29.167%
0xffff8801f6ad6540 | 96/96 | 68/68 | 1 | 0 | 29.167%
0xffff8801f6ad6f00 | 96/96 | 68/68 | 1 | 0 | 29.167%
0xffff880214e65e80 | 96/32 | 72/24 | 3 | 0 | 25.000%
0xffff8801f45ddac0 | 64/64 | 48/48 | 1 | 0 | 25.000%
0xffff8800ac4093c0 | 32/32 | 24/24 | 1 | 1 | 25.000%
0xffff8800af5a4260 | 32/32 | 24/24 | 1 | 0 | 25.000%
0xffff880214e651e0 | 32/32 | 24/24 | 1 | 0 | 25.000%
0xffff880214e65220 | 32/32 | 24/24 | 1 | 0 | 25.000%
0xffff880214e654e0 | 32/32 | 24/24 | 1 | 0 | 25.000%
---------------------------------------------------------------------------------------------------------

SUMMARY (SLAB allocator)
========================
Total bytes requested: 409,260
Total bytes allocated: 417,008
Total bytes wasted on internal fragmentation: 7,748
Internal fragmentation: 1.857998%
Cross CPU allocations: 0/2,833

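The report has three parts. The first is organized by Callsite; a Callsite is a place in the kernel code that calls kmalloc and kfree.

Take the function proc_reg_open in the table above: its Hit column is 1, meaning the function called kmalloc once during the recording.

In that first row Total_alloc/Per is shown as 64/64: the first value, 64, is the total amount of memory allocated by proc_reg_open, and Per is the average per allocation.

The two more interesting columns are Ping-pong and Frag. Frag is easy to understand: it is internal fragmentation. Slab was designed precisely to reduce the internal fragmentation seen with the Buddy System, but slab still has some. For example, if a cache's object size is 1024 bytes and the data structure to allocate needs 1022 bytes, 2 bytes become fragmentation. Frag is the proportion of such fragments.

Ping-pong is a phenomenon on multi-CPU systems, where memory shared between CPUs can bounce back and forth. One CPU allocates memory, other CPUs may access the object, and yet another CPU may eventually free it. On a multi-CPU system the L1 cache is per CPU, so when CPU2 modifies the memory, the caches of the other CPUs have to be updated, which costs performance. perf kmem checks the CPU number in the kfree event; if it differs from the CPU of the kmalloc, that counts as one ping-pong. Ideally ping-pong should be as small as possible. IBM developerWorks has an article on oprofile whose discussion of cache tuning is a good reference.

Callsite: the place in the kernel code that calls kmalloc and kfree.

Total_alloc/Per: total memory allocated, and the average allocated per call.

Total_req/Per: total memory requested, and the average requested per call.

Hit: number of calls.

Ping-pong: number of times kmalloc and kfree were executed on different CPUs, which reduces cache efficiency.

Frag: percentage of fragmentation. Fragmentation = allocated memory - requested memory; this part is wasted.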
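
To make the Frag definition concrete, the percentage can be recomputed from the Total_alloc and Total_req columns. This is a minimal sketch of that formula, with the waste expressed relative to the allocated bytes; the function name frag_percent is made up, and the inputs are rows from the perf kmem stat output in this section.

```python
def frag_percent(total_alloc, total_req):
    """Internal fragmentation: wasted bytes as a share of allocated bytes."""
    wasted = total_alloc - total_req
    return wasted / total_alloc * 100.0


# (callsite, Total_alloc, Total_req) taken from the report above
rows = [
    ("proc_reg_open+32", 64, 40),
    ("seq_open+34", 384, 272),
    ("SUMMARY", 369_424, 368_589),
]
for name, alloc, req in rows:
    print(f"{name}: {frag_percent(alloc, req):.3f}%")
```

The per-callsite rows come out at 37.500% and 29.167%, matching the Frag column, and the summary row reproduces the 0.226% internal fragmentation of the first run.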

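With the --alloc option the report also shows Alloc Ptr, the address of the allocated memory.

This is the second part of the report, organized by allocated object rather than by caller.

The last part is the summary, showing total allocated memory and fragmentation. Cross CPU allocations is the total of the ping-pong counts.

To analyze --page events, add the --page option when recording: sudo perf kmem record --page ls. Viewing the result with sudo perf kmem stat --page failed in this test with:

0xee318 [0x8]: failed to process type: 68
error during process events: -22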

3.10 perf sched

perf sched is dedicated to tracing and measuring the scheduler, including latencies.

perf sched record <command>: records the scheduling events during the test

perf sched latency: reports per-thread scheduling latency and other scheduling properties

perf sched script: shows the detailed trace of the run

perf sched replay: replays the run recorded by perf sched record

perf sched map: prints context switches as a character map

After running sudo perf sched record ls, the result can be viewed in several ways.

sudo perf sched latency shows the Average delay and Maximum delay of the ls process. The columns mean:

Task: process name and pid
Runtime: actual run time
Switches: number of process switches
Average delay: average scheduling latency
Maximum delay: maximum latency

The value that deserves the most attention is Maximum delay. It usually reveals the property with the greatest effect on interactivity: scheduling latency. If scheduling latency is large, users notice stuttering video or audio.

sudo perf sched script shows more detailed scheduler information, including sched_wakeup, sched_switch, and so on. The leading columns are, in order: process name, pid, CPU ID, timestamp.

perf 7801 [002] 5398.722314: sched:sched_stat_sleep: comm=perf pid=7806 delay=110095391 [ns]
perf 7801 [002] 5398.722316: sched:sched_wakeup: comm=perf pid=7806 prio=120 target_cpu=004
swapper 0 [004] 5398.722328: sched:sched_stat_wait: comm=perf pid=7806 delay=0 [ns]
swapper 0 [004] 5398.722333: sched:sched_switch: prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=perf next_pid=7806 next_prio=120
perf 7801 [002] 5398.722363: sched:sched_stat_runtime: comm=perf pid=7801 runtime=1255788 [ns] vruntime=3027478102 [ns]
perf 7801 [002] 5398.722364: sched:sched_switch: prev_comm=perf prev_pid=7801 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
perf 7806 [004] 5398.722568: sched:sched_wakeup: comm=migration/4 pid=27 prio=0 target_cpu=004
perf 7806 [004] 5398.722571: sched:sched_stat_runtime: comm=perf pid=7806 runtime=254732 [ns] vruntime=1979611107 [ns]
perf 7806 [004] 5398.722575: sched:sched_switch: prev_comm=perf prev_pid=7806 prev_prio=120 prev_state=R+ ==> next_comm=migration/4 next_pid=27 next_prio=0
migration/4 27 [004] 5398.722582: sched:sched_stat_wait: comm=perf pid=7806 delay=13914 [ns]
migration/4 27 [004] 5398.722586: sched:sched_migrate_task: comm=perf pid=7806 prio=120 orig_cpu=4 dest_cpu=2
swapper 0 [002] 5398.722595: sched:sched_stat_wait: comm=perf pid=7806 delay=0 [ns]
swapper 0 [002] 5398.722596: sched:sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=perf next_pid=7806 next_prio=120
migration/4 27 [004] 5398.722611: sched:sched_switch: prev_comm=migration/4 prev_pid=27 prev_prio=0 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120
ls 7806 [002] 5398.723421: sched:sched_stat_sleep: comm=kworker/u12:3 pid=7064 delay=1226675 [ns]
ls 7806 [002] 5398.723423: sched:sched_wakeup: comm=kworker/u12:3 pid=7064 prio=120 target_cpu=003
swapper 0 [003] 5398.723432: sched:sched_stat_wait: comm=kworker/u12:3 pid=7064 delay=0 [ns]
swapper 0 [003] 5398.723434: sched:sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/u12:3 next_pid=7064 next_prio=120
kworker/u12:3 7064 [003] 5398.723441: sched:sched_stat_sleep: comm=/usr/bin/termin pid=2511 delay=80833386 [ns]
kworker/u12:3 7064 [003] 5398.723447: sched:sched_wakeup: comm=/usr/bin/termin pid=2511 prio=120 target_cpu=004
kworker/u12:3 7064 [003] 5398.723449: sched:sched_stat_runtime: comm=kworker/u12:3 pid=7064 runtime=29315 [ns] vruntime=846439549943 [ns]
kworker/u12:3 7064 [003] 5398.723451: sched:sched_switch: prev_comm=kworker/u12:3 prev_pid=7064 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
swapper 0 [004] 5398.723462: sched:sched_stat_wait: comm=/usr/bin/termin pid=2511 delay=0 [ns]
swapper 0 [004] 5398.723466: sched:sched_switch: prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=/usr/bin/termin next_pid=2511 next_prio=120
ls 7806 [002] 5398.723503: sched:sched_migrate_task: comm=perf pid=7801 prio=120 orig_cpu=2 dest_cpu=3
ls 7806 [002] 5398.723505: sched:sched_stat_sleep: comm=perf pid=7801 delay=1142537 [ns]
ls 7806 [002] 5398.723506: sched:sched_wakeup: comm=perf pid=7801 prio=120 target_cpu=003
ls 7806 [002] 5398.723508: sched:sched_stat_runtime: comm=ls pid=7806 runtime=920005 [ns] vruntime=3028398107 [ns]
swapper 0 [003] 5398.723508: sched:sched_stat_wait: comm=perf pid=7801 delay=0 [ns]
swapper 0 [003] 5398.723508: sched:sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=perf next_pid=7801 next_prio=120
ls 7806 [002] 5398.723510: sched:sched_switch: prev_comm=ls prev_pid=7806 prev_prio=120 prev_state=x ==> next_comm=swapper/2 next_pid=0 next_prio=120
/usr/bin/termin 2511 [004] 5398.723605: sched:sched_stat_runtime: comm=/usr/bin/termin pid=2511 runtime=162720 [ns] vruntime=191386139371 [ns]
/usr/bin/termin 2511 [004] 5398.723611: sched:sched_switch: prev_comm=/usr/bin/termin prev_pid=2511 prev_prio=120 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120
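
For post-processing, each trace line can be split into the columns just described plus the event name and its arguments. The following sketch assumes one event per line in the layout "comm pid [cpu] timestamp: event: args"; the regular expression and the name parse_sched_line are illustrative and may need adjusting for other perf versions or output options.

```python
import re

# comm  pid  [cpu]  timestamp:  subsystem:event:  key=value ...
SCHED_LINE = re.compile(
    r"^\s*(?P<comm>.+?)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<time>\d+\.\d+):\s+(?P<event>[\w:]+):\s*(?P<args>.*)$"
)


def parse_sched_line(line):
    """Split one trace line into its fields; return None if it differs."""
    match = SCHED_LINE.match(line)
    if match is None:
        return None
    return {
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
        "cpu": int(match.group("cpu")),
        "time": float(match.group("time")),
        "event": match.group("event"),
        "args": match.group("args"),
    }


sample = ("migration/4 27 [004] 5398.722586: sched:sched_migrate_task: "
          "comm=perf pid=7806 prio=120 orig_cpu=4 dest_cpu=2")
print(parse_sched_line(sample))
```

Filtering the parsed records on event == "sched:sched_migrate_task", for instance, gives a quick list of every task migration in the trace.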

The advantage of sudo perf sched map is that it gives an overall view: it condenses hundreds or thousands of scheduling events and shows how tasks are distributed across CPUs. Bad migrations, for example a task that was not moved to an idle CPU in time but was moved to a busy one instead, stand out at a glance in the map report.

An asterisk marks the CPU on which the scheduling event happened.

A dot means that CPU is idle.

*A0 5398.722333 secs A0=> perf:7806
*. A0 5398.722365 secs .=> swapper:0
. *B0 5398.722575 secs B0=> migration/4:27
*A0 B0 5398.722597 secs
A0 *. 5398.722611 secs
A0 *C0 . 5398.723434 secs C0=> kworker/u12:3:7064
A0 *. . 5398.723451 secs
A0 . *D0 5398.723467 secs D0=> /usr/bin/termin:2511
A0 *E0 D0 5398.723509 secs E0=> perf:7801
*. E0 D0 5398.723510 secs
. E0 *. 5398.723612 secs

perf sched replay is designed specifically for scheduler developers. It tries to replay the scheduling scenario recorded in perf.data. Ordinary users who notice odd scheduler behavior often cannot describe exactly the scenario in which it happened, or the test scenario is hard to reproduce, or they simply want to save effort. With perf sched replay, perf simulates the scenario in perf.data, so developers do not have to spend a lot of time recreating it. This is particularly useful during debugging, where one has to check again and again whether a new change improves the problem found in the original scenario.

run measurement overhead: 166 nsecs
sleep measurement overhead: 52177 nsecs
the run test took 999975 nsecs
the sleep test took 1064623 nsecs
nr_run_events: 11
nr_sleep_events: 581
nr_wakeup_events: 5
task 0 ( swapper: 0), nr_events: 11
task 1 ( swapper: 1), nr_events: 1
task 2 ( swapper: 2), nr_events: 1
task 3 ( kthreadd: 3), nr_events: 1
...
task 563 ( kthreadd: 7509), nr_events: 1
task 564 ( bash: 7751), nr_events: 1
task 565 ( man: 7762), nr_events: 1
task 566 ( kthreadd: 7789), nr_events: 1
task 567 ( bash: 7800), nr_events: 1
task 568 ( sudo: 7801), nr_events: 4
task 569 ( perf: 7806), nr_events: 8
------------------------------------------------------------
#1 : 25.887, ravg: 25.89, cpu: 1919.68 / 1919.68
#2 : 27.994, ravg: 26.10, cpu: 2887.76 / 2016.49
#3 : 26.403, ravg: 26.13, cpu: 2976.09 / 2112.45
#4 : 29.400, ravg: 26.46, cpu: 1015.01 / 2002.70
#5 : 26.750, ravg: 26.49, cpu: 2942.80 / 2096.71
#6 : 27.647, ravg: 26.60, cpu: 3087.37 / 2195.78
#7 : 31.405, ravg: 27.08, cpu: 2762.43 / 2252.44
#8 : 23.770, ravg: 26.75, cpu: 2172.55 / 2244.45
#9 : 26.952, ravg: 26.77, cpu: 2794.93 / 2299.50
#10 : 30.904, ravg: 27.18, cpu: 973.26 / 2166.88

3.11 perf probe

Note: perf probe needs to find a vmlinux image with debug information in order to resolve source lines.

perf probe defines new dynamic tracepoints, so you can add your own probe points.

Usage examples

(1) Display which lines in schedule() can be probed

# perf probe --line schedule

Lines shown with a line number can be probed; lines without one cannot.

(2) Add a probe on schedule() function 12th line.

# perf probe -a schedule:12

This adds a probe at line 12 of the schedule() function.

3.14 perf trace

perf trace is similar to strace, but it can also analyze other system events, such as page faults, task lifetime events and scheduling events.

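The following command lists the scripts installed on the system (in current perf releases the script commands shown here live under perf script):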

# perf trace -l
List of available trace scripts:
  syscall-counts [comm]                system-wide syscall counts
  syscall-counts-by-pid [comm]         system-wide syscall counts, by pid
  failed-syscalls-by-pid [comm]        system-wide failed syscalls, by pid

For example, running the failed-syscalls script gives:

# perf trace record failed-syscalls
^C[ perf record: Woken up 11 times to write data ]
[ perf record: Captured and wrote 1.939 MB perf.data (~84709 samples) ]

# perf trace report failed-syscalls
perf trace started with Perl script \
  /root/libexec/perf-core/scripts/perl/failed-syscalls.pl

failed syscalls, by comm:

comm                    # errors
--------------------  ----------
firefox                     1721
claws-mail                   149
konsole                       99
X                             77
emacs                         56
[...]

failed syscalls, by syscall:

syscall                           # errors
------------------------------  ----------
sys_read                              2042
sys_futex                              130
sys_mmap_pgoff                          71
sys_access                              33
sys_stat64                               5
sys_inotify_add_watch                    4
[...]

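The report shows the number of failures by process and by system call. It is simple and clear; with plain perf record plus perf report you would have to tally these numbers by hand or write your own script.

4. Extended uses of perf

4.1 Flame Graph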

FlameGraph is a set of scripts that render sampled stack traces as an SVG flame graph.

1. Capture perf data and convert it

perf record -F 99 -a -g -- sleep 60
perf script > out.perf

./stackcollapse-perf.pl out.perf > out.folded

./flamegraph.pl out.folded > kernel.svg
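To make the intermediate "folded" format concrete, here is a minimal sketch of the idea behind the collapse step. It is not the stackcollapse-perf.pl script itself: the function name fold_stacks and the sample data are illustrative assumptions, and details such as the process name prefix are left out. Identical stacks are merged into a single "root;...;leaf count" line, which is the kind of input the flame graph renderer consumes.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks into 'root;...;leaf count' lines.

    Each sample is a list of frames ordered leaf-first (the order in which
    a stack trace is usually printed); folded output is root-first.
    """
    counts = Counter(";".join(reversed(stack)) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples: two identical sleep stacks and one page fault.
samples = [
    ["schedule", "do_nanosleep", "sys_nanosleep"],
    ["schedule", "do_nanosleep", "sys_nanosleep"],
    ["page_fault", "main"],
]

if __name__ == "__main__":
    for line in fold_stacks(samples):
        print(line)
```

The width of each box in the final graph is proportional to the count at the end of the corresponding folded line.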

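What is the difference between nr and run? (2)

This article introduces the commonly used modules. Following the official classification, modules are grouped by function into cloud, commands, database, files, inventory, messaging, monitoring, network, notification, packaging, source control, system, utilities, web infrastructure and Windows modules; see the official documentation for details.

Here we pick the most frequently used modules from the official categories (the commands modules were covered in the previous article and are not repeated).

1. ping module

Tests whether a host is reachable. Usage is simple and takes no arguments: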

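[root@361way ~]# ansible 10.212.52.252 -m ping
10.212.52.252 | success >> {
    "changed": false,
    "ping": "pong"
}

2. setup module

The setup module is mainly used to collect host information; the gather_facts parameter that appears so often in playbooks relies on it. Its most commonly used parameter is filter. Examples follow (the output is long, so only the commands are listed):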

# Show the host's memory information
[root@361way ~]# ansible 10.212.52.252 -m setup -a 'filter=ansible_*_mb'
# Show information for the network interfaces eth0 to eth2
[root@361way ~]# ansible 10.212.52.252 -m setup -a 'filter=ansible_eth[0-2]'
# Write the facts of all hosts to /tmp/facts, one file per host, named after the host name in /etc/ansible/hosts
[root@361way ~]# ansible all -m setup --tree /tmp/facts

3. file module

The file module handles file operations on remote hosts. It has the following options:

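force: forces creation of a symlink in two cases: when the source file does not exist yet but will be created later, and when the destination symlink already exists and must be removed before the new one is created. Values: yes|no
group: group that owns the file/directory
mode: permissions of the file/directory
owner: user that owns the file/directory
path: required; path of the file/directory
recurse: set file attributes recursively; only applies to directories
src: path of the source file to link to; only applies when state=link
dest: path of the link; only applies when state=link
state:
  directory: create the directory if it does not exist
  file: do not create the file, even if it does not exist
  link: create a symlink
  hard: create a hard link
  touch: create an empty file if it does not exist; if the file or directory exists, update its last-modified time
  absent: delete the directory or file, or remove the link

Usage examples:

ansible test -m file -a "src=/etc/fstab dest=/tmp/fstab state=link"
ansible test -m file -a "path=/tmp/fstab state=absent"
ansible test -m file -a "path=/tmp/test state=touch"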

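4. copy module

Copies files to remote hosts. The copy module has the following options:

backup: back up the original file before overwriting it; the backup file name includes a timestamp. Values: yes|no
content: used instead of "src"; sets the contents of the target file directly
dest: required. Absolute path on the remote host to copy to; if the source is a directory, this path must be a directory too
directory_mode: set directory permissions recursively; defaults to the system default
force: if the file exists on the target host with different contents, yes overwrites it, while no copies only when the file does not yet exist at the destination. Default: yes
others: all options of the file module can be used here
src: local path of the file to copy to the remote host; absolute or relative. A directory is copied recursively. If the path ends with "/", only the contents of the directory are copied; without the trailing "/", the directory itself is copied along with its contents, as with rsync
validate: The validation command to run before copying into place. The path to the file to validate is passed in via '%s' which must be present as in the visudo example below.

Examples:

ansible test -m copy -a "src=/srv/myfiles/foo.conf dest=/etc/foo.conf owner=foo group=foo mode=0644"
ansible test -m copy -a "src=/mine/ntp.conf dest=/etc/ntp.conf owner=root group=root mode=644 backup=yes"
ansible test -m copy -a "src=/mine/sudoers dest=/etc/sudoers validate='visudo -cf %s'"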
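The trailing-slash rule for src is easy to get wrong, so here is a small sketch of it. This only models the rule as described in the option list above; the helper remote_path is a hypothetical name, not an Ansible API, and the paths are made up for illustration.

```python
import posixpath


def remote_path(src_dir, rel_file, dest_dir):
    """Return where a file under src_dir would land below dest_dir.

    Models the rsync-like rule: a trailing "/" on src_dir copies only the
    directory's contents; without it the directory itself is copied too.
    """
    if src_dir.endswith("/"):
        return posixpath.join(dest_dir, rel_file)
    return posixpath.join(dest_dir, posixpath.basename(src_dir), rel_file)


if __name__ == "__main__":
    print(remote_path("/srv/app/", "a.conf", "/etc/app"))  # contents only
    print(remote_path("/srv/app", "a.conf", "/etc/app"))   # directory included
```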

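5. service module

Manages services.

The module has the following options:

arguments: extra arguments passed on the command line

enabled: whether the service starts at boot; yes|no

name: required; name of the service

pattern: a pattern to fall back on: if the service does not respond to the status command, the process list from ps is searched for this pattern, and a match means the service is considered to be running

runlevel: runlevel

sleep: when restarted is used, the number of seconds to sleep between stop and start

state: start, stop, restart or reload the service (started, stopped, restarted, reloaded)

Usage examples: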

# Example action to reload service httpd, in all cases
- service: name=httpd state=reloaded
# Example action to enable service httpd, and not touch the running state
- service: name=httpd enabled=yes
# Example action to start service foo, based on running process /usr/bin/foo
- service: name=foo pattern=/usr/bin/foo state=started
# Example action to restart network service for interface eth0
- service: name=network state=restarted args=eth0

6. cron module

Manages scheduled tasks. It has the following options:

backup: back up the existing crontab on the remote host before modifying it
cron_file: if specified, this file under the cron.d directory on the remote host is used instead of the user's crontab
day: day of month (1-31, *, */2, ...)
hour: hour (0-23, *, */2, ...)
minute: minute (0-59, *, */2, ...)
month: month (1-12, *, */2, ...)
weekday: day of week (0-7, *, ...)
job: the command to run; needed when state=present
name: description of the task
special_time: when to run, one of reboot, yearly, annually, monthly, weekly, daily, hourly
state: whether the task should be created or removed
user: the user whose crontab is modified

Examples:

ansible test -m cron -a 'name="a job for reboot" special_time=reboot job="/some/job.sh"'
ansible test -m cron -a 'name="yum autoupdate" weekday="2" minute=0 hour=12 user="root" ...'
ansible 10.212.52.252 -m cron -a 'backup="True" name="test" minute="0" hour="2" job="ls -alh > /dev/null"'
ansible test -m cron -a 'cron_file=ansible_yum-autoupdate state=absent'
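To see how the time fields map onto a crontab entry, the sketch below renders the schedule line that a given set of field values describes. It is an illustration only: cron_line is a hypothetical helper, not Ansible code, and it ignores the marker comment carrying the task name that the module also writes.

```python
def cron_line(job, minute="*", hour="*", day="*", month="*", weekday="*",
              special_time=None):
    """Render the crontab entry described by the given field values."""
    if special_time:
        # Nicknames such as reboot or daily replace the five time fields.
        return f"@{special_time} {job}"
    return f"{minute} {hour} {day} {month} {weekday} {job}"


if __name__ == "__main__":
    print(cron_line("ls -alh > /dev/null", minute="0", hour="2"))
    print(cron_line("/some/job.sh", special_time="reboot"))
```

Fields that are not given default to "*", which matches how the examples above only set the fields they care about.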

7. yum module

Manages packages with the yum package manager. Its options are:

config_file: the yum configuration file
disable_gpg_check: disable the GPG check
disablerepo: disable a repository
enablerepo: enable a repository
name: name of the package to operate on; a URL or the path of a local rpm file is also accepted
state: desired state (present, absent, latest)

Examples:

ansible test -m yum -a 'name=httpd state=latest'
ansible test -m yum -a 'name="@Development tools" state=present'
ansible test -m yum -a 'name=http://nginx.org/packages//6/noarch/RPMS/nginx-release-centos-6-0.el6.ngx.noarch.rpm state=present'

8. user and group modules

The user module drives the useradd, userdel and usermod commands; the group module drives groupadd, groupdel and groupmod. The parameters are not covered in detail here; the examples speak for themselves.

1. user module examples:

- user: name=johnd comment="John Doe" uid=1040 group=admin
- user: name=james shell=/bin/bash groups=admins,developers append=yes
- user: name=johnd state=absent remove=yes
- user: name=james18 shell=/bin/zsh groups=developers expires=1422403387
# Generating a key only creates the public and private key files, exactly as ssh-keygen does; no authorized_keys file is created.
- user: name=test generate_ssh_key=yes ssh_key_bits=2048 ssh_key_file=.ssh/id_rsa

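Note: the password parameter must not be given a plaintext password, because the string is written as-is into /etc/shadow on the managed host. Hash the password first, then put the resulting string in password.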

[root@361way ~]# openssl passwd -1 -salt $(< /dev/urandom tr -dc '[:alnum:]' | head -c 32)
Password:
$1$YngB4z8s$atSVltYKnDxJmWZ3s.4/80

or

[root@361way ~]# echo "123456" | openssl passwd -1 -salt $(< /dev/urandom tr -dc '[:alnum:]' | head -c 32) -stdin
$1$4P4PlFuE$ur9ObJiT5iHNrb9QnjaIB0

# The string generated below was also verified to work, but its format differs from /etc/shadow, so it is not recommended
[root@361way ~]# openssl passwd -salt -1 "123456"
-1yEWqqJQLC66

# Create a user with the password hash from above
[root@361way ~]# ansible all -m user -a 'name=foo password="$1$4P4PlFuE$ur9ObJiT5iHNrb9QnjaIB0"'

Distributions may differ in their default hashing method; check /etc/login.defs to confirm. CentOS 6.5 uses SHA512, and a matching hash can be generated with the example from the official Ansible documentation:

python -c "from passlib.hash import sha512_crypt; import getpass; print(sha512_crypt.hash(getpass.getpass()))"

2. group module example:

- group: name=somegroup state=present

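9. synchronize module

Synchronizes files with rsync. Its parameters are:

archive: archive mode, equivalent to enabling recursive, links, perms, times, owner, group and -D together; on by default
checksum: decide what to skip based on checksums rather than modification time and size; off by default
compress: whether to enable compression
copy_links: copy the files that symlinks point to; default no. Note that there is also a separate links parameter
delete: delete files that do not exist in the source; default no
dest: destination path
dest_port: port on the destination host; default 22, since the transfer runs over SSH
dirs: transfer directories without recursing; default no, that is, recurse into directories
rsync_opts: additional rsync options
set_remote_user: mainly for the case where the user defined in /etc/ansible/hosts (or the default user) differs from the user rsync should use
mode: push or pull. push is typically used to upload files from the local machine to the remote host; pull fetches files from the remote host

There are further parameters that are not covered here. A few usage examples:

src=some/relative/path dest=/some/absolute/path rsync_path="sudo rsync"
src=some/relative/path dest=/some/absolute/path archive=no links=yes
src=some/relative/path dest=/some/absolute/path checksum=yes times=no
src=/tmp/helloworld dest=/var/www/helloword rsync_opts=--no-motd,--exclude=.git mode=pull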

10. mount module

Configures mount points. Options:

dump
fstype: required; filesystem type
name: required; the mount point
opts: options passed to the mount command
src: required; the device or file to mount
state: required
  present: only manage the entry in fstab
  absent: remove the mount point
  mounted: create the mount point automatically and mount it
  unmounted: unmount

Examples:

name=/mnt/dvd src=/dev/sr0 fstype=iso9660 opts=ro state=present
name=/srv/disk src='LABEL=SOME_LABEL' state=present
name=/home src='UUID=b3e48f45-f933-4c8e-a700-22a159ec9077' opts=noatime state=present

ansible test -a 'dd if=/dev/zero of=/disk.img bs=4k count=1024'
ansible test -a 'losetup /dev/loop0 /disk.img'
ansible test -m filesystem -a 'fstype=ext4 force=yes opts=-F dev=/dev/loop0'
ansible test -m mount -a 'name=/mnt src=/dev/loop0 fstype=ext4 state=mounted opts=rw'

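11. get_url module

This module downloads files from HTTP, FTP and HTTPS servers (similar to wget). Its main options are:

sha256sum: verify the SHA-256 checksum after the download completes

timeout: download timeout; default 10 s

url: the URL to download

url_password, url_username: for downloads that require a user name and password

use_proxy: whether to use a proxy; the proxy must already be defined in the environment variables

Examples:

- name: download foo.conf
  get_url: url=http://example.com/path/file.conf dest=/etc/foo.conf mode=0440
- name: download file with sha256 check
  get_url: url=http://example.com/path/file.conf dest=/etc/foo.conf sha256sum=b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

That is all for the modules in this article. Other official modules you are likely to need include the git and svn source control modules, the sysctl and authorized_key_module system modules, the apt, zypper, pip and gem packaging modules, the find and template file modules, the mysql_db and redis database modules, and the url network module.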

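What is the difference between nr and run? (3)

1. Preface

The previous article analyzed task_struct, the structure the kernel uses for both processes and threads, and walked through how processes and threads are created and forked. This article takes a detailed look at scheduling between tasks: its principles and the whole execution path. With that, the overall framework of the process and thread material is complete. The article has three parts: the common scheduling policies in the Linux kernel, the basic scheduling data structures, and the full flow of a scheduling event.

2. Scheduling policies

As a multitasking operating system, Linux divides each CPU's time into very short time slices and has the scheduler hand them out to tasks in turn, which creates the illusion that many tasks run at the same time. To keep track of CPU time, Linux raises a timer interrupt at a predefined tick rate (HZ in the kernel) and records the number of ticks since boot in the global variable jiffies, which is incremented by 1 on every timer interrupt. HZ is a kernel configuration option and can be set to 100, 250, 1000 and so on. Different systems may use different values; the configured value can be found in the kernel config file under /boot (for example /boot/config-$(uname -r)).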
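The relationship between HZ, the tick period and jiffies is simple arithmetic. The sketch below spells it out; the function names are illustrative, not kernel APIs.

```python
def tick_ms(hz):
    """Length of one scheduler tick in milliseconds for a given HZ."""
    return 1000.0 / hz


def jiffies_to_seconds(jiffies, hz):
    """Elapsed time represented by a jiffies count."""
    return jiffies / hz


if __name__ == "__main__":
    for hz in (100, 250, 1000):
        print(f"HZ={hz}: one tick every {tick_ms(hz)} ms")
    print(jiffies_to_seconds(3000, 1000), "seconds for 3000 jiffies at HZ=1000")
```

A higher HZ gives finer-grained time slices at the cost of more timer interrupts.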

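Linux scheduling policies fall into two groups: real-time tasks and normal tasks. Real-time tasks need their results as quickly as possible, while normal tasks have no such requirement. As mentioned in the previous article, the scheduling policy in task_struct is stored in policy, and the priority fields are prio, static_prio, normal_prio and rt_priority. A priority is simply a number: real-time processes use the range 0-99 and normal processes the range 100-139. The smaller the number, the higher the priority.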
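As a rough illustration of how user-visible values map onto these ranges, the sketch below converts a nice value and a real-time priority into the kernel's internal priority number. The constants mirror the ranges given above; the Python function names are assumptions made for this example.

```python
MAX_RT_PRIO = 100    # kernel priorities 0..99 are reserved for real-time tasks
DEFAULT_PRIO = 120   # priority of a normal task with nice 0


def nice_to_prio(nice):
    """Map a nice value (-20..19) to a normal-task priority (100..139)."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in -20..19")
    return DEFAULT_PRIO + nice


def rt_priority_to_prio(rt_priority):
    """Map a user-space real-time priority (1..99) to a kernel priority.

    User space treats a larger rt_priority as more urgent, while inside the
    kernel a smaller number wins, so the scale is inverted.
    """
    if not 1 <= rt_priority <= 99:
        raise ValueError("rt_priority must be in 1..99")
    return MAX_RT_PRIO - 1 - rt_priority


if __name__ == "__main__":
    print(nice_to_prio(0), nice_to_prio(-20), nice_to_prio(19))
    print(rt_priority_to_prio(99), rt_priority_to_prio(1))
```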

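2.1 Real-time scheduling policies

The real-time policies are:

SCHED_FIFO: first in, first out; among tasks of equal priority, the one that arrived first runs first
SCHED_RR: round robin; emphasizes fairness, so tasks of equal priority take turns using equal time slices
SCHED_DEADLINE: schedules by deadline; the task whose deadline is nearest has the highest priority

2.2 Normal scheduling policies

The normal policies are:

SCHED_NORMAL: ordinary tasks
SCHED_BATCH: background tasks with lower priority
SCHED_IDLE: tasks that run only when the CPU would otherwise be idle
CFS: the Completely Fair Scheduler, a somewhat special case. CFS gives every task a virtual runtime, vruntime. While a task runs, its vruntime grows with every CPU clock tick; the vruntime of tasks that are not running stays unchanged. At scheduling time, the task with the smaller vruntime has the higher priority. The actual vruntime calculation depends on the task's weight, so higher-priority tasks receive proportionally more execution time, which is what makes the scheme completely fair.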
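The weighting idea can be shown with a toy simulation. This is a minimal sketch under simplifying assumptions, not the kernel algorithm: a heap stands in for the red-black tree, every tick is charged in full to the running task, and the weights 2048 and 1024 are example values (1024 being the weight of a nice-0 task).

```python
import heapq

NICE_0_WEIGHT = 1024  # reference weight of a nice-0 task


def simulate_cfs(weights, ticks, tick_ns=1_000_000):
    """Run `ticks` scheduler ticks and return the CPU time (ns) per task.

    weights maps a task name to its load weight. On every tick the task
    with the smallest vruntime runs, and its vruntime advances by the
    elapsed time scaled by NICE_0_WEIGHT / weight.
    """
    queue = [(0, name) for name in sorted(weights)]  # (vruntime, task)
    heapq.heapify(queue)
    runtime = {name: 0 for name in weights}
    for _ in range(ticks):
        vruntime, name = heapq.heappop(queue)
        runtime[name] += tick_ns
        # Heavier tasks accumulate vruntime more slowly, so they run more.
        vruntime += tick_ns * NICE_0_WEIGHT // weights[name]
        heapq.heappush(queue, (vruntime, name))
    return runtime


if __name__ == "__main__":
    print(simulate_cfs({"heavy": 2048, "light": 1024}, ticks=300))
```

With these example weights the heavier task ends up with roughly twice the CPU time of the lighter one, while both vruntime values stay close together.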

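3. Scheduling-related data structures

First, we need a structure that carries out a scheduling policy: sched_class. It has several implementations:

stop_sched_class: used by the highest-priority tasks; it preempts every other thread and cannot be preempted by any other task
dl_sched_class: corresponds to the deadline policy above
rt_sched_class: corresponds to the RR or FIFO policy; which one is used is determined by the task's task_struct->policy
fair_sched_class: the policy for normal processes
idle_sched_class: the policy for idle processes

Second, we need a structure that gathers the information used for scheduling, the scheduling entity. There are three kinds:

struct sched_entity se: scheduling entity for normal tasks
struct sched_rt_entity rt: scheduling entity for real-time tasks
struct sched_dl_entity dl: scheduling entity for DEADLINE tasks

The source of the normal scheduling entity is shown below. It contains vruntime, the weight load_weight, and runtime statistics.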

struct sched_entity {
    /* For load-balancing: */
    struct load_weight load;
    unsigned long runnable_weight;
    struct rb_node run_node;
    struct list_head group_node;
    unsigned int on_rq;

    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;

    u64 nr_migrations;

    struct sched_statistics statistics;

#ifdef CONFIG_FAIR_GROUP_SCHED
    int depth;
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq *my_q;
#endif

#ifdef CONFIG_SMP
    /*
     * Per entity load average tracking.
     *
     * Put into separate cache line so it does not
     * collide with read-mostly values above.
     */
    struct sched_avg avg;
#endif
};

When scheduling, entities are first separated into real-time and normal tasks and then organized in a red-black tree ordered by time: the entity with the smallest vruntime sits on the left of the tree and the one with the largest on the right. Under CFS, for example, the leftmost node of the tree is chosen as the next task to get the CPU. The per-CPU structure that holds these trees is the run queue, struct rq.

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
    /* runqueue lock: */
    raw_spinlock_t lock;
    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned int nr_running;
......
    #define CPU_LOAD_IDX_MAX 5
    unsigned long cpu_load[CPU_LOAD_IDX_MAX];
......
    /* capture load from *all* tasks on this CPU: */
    struct load_weight load;
    unsigned long nr_load_updates;
    u64 nr_switches;

    struct cfs_rq cfs;
    struct rt_rq rt;
    struct dl_rq dl;
......
    /*
     * This is part of a global counter where only the total sum
     * over all CPUs matters. A task can increase this counter on
     * one CPU and if it got migrated afterwards it may decrease
     * it on another CPU. Always updated under the runqueue lock:
     */
    unsigned long nr_uninterruptible;

    struct task_struct *curr;
    struct task_struct *idle;
    struct task_struct *stop;
    unsigned long next_balance;
    struct mm_struct *prev_mm;

    unsigned int clock_update_flags;
    u64 clock;
    /* Ensure that all clocks are in the same cache line */
    u64 clock_task ____cacheline_aligned;
    u64 clock_pelt;
    unsigned long lost_idle_time;

    atomic_t nr_iowait;
......
    /* calc_load related fields */
    unsigned long calc_load_update;
    long calc_load_active;
......
};

It contains struct cfs_rq, defined below, which holds the CFS-related state: the weight-related fields, the vruntime-related fields and the red-black tree itself. The struct rb_root_cached member is the root of the red-black tree, with a cached pointer to the leftmost node.

/* CFS-related fields in a runqueue */
struct cfs_rq {
    struct load_weight load;
    unsigned long runnable_weight;
    unsigned int nr_running;
    unsigned int h_nr_running;

    u64 exec_clock;
    u64 min_vruntime;
#ifndef CONFIG_64BIT
    u64 min_vruntime_copy;
#endif

    struct rb_root_cached tasks_timeline;

    /*
     * 'curr' points to currently running entity on this cfs_rq.
     * It is set to NULL otherwise (i.e when none are currently running).
     */
    struct sched_entity *curr;
    struct sched_entity *next;
    struct sched_entity *last;
    struct sched_entity *skip;
......
};

struct dl_rq is defined along similar lines: its run queue is also a red-black tree, managed according to the deadline policy.

/* Deadline class' related fields in a runqueue */struct dl_rq {    /* runqueue is an rbtree, ordered by deadline */    struct rb_root_cachedroot;    unsigned longdl_nr_running;#ifdef CONFIG_SMP    /*     * Deadline values of the currently executing and the     * earliest ready task on this rq. Caching these facilitates     * the decision whether or not a ready but not running task     * should migrate somewhere else.     */    struct {        u64curr;        u64next;    } earliest_dl;    unsigned longdl_nr_migratory;    intoverloaded;    /*     * Tasks on this rq that can be pushed away. They are kept in     * an rb-tree, ordered by tasks' deadlines, with caching     * of the leftmost (earliest deadline) element.     */    struct rb_root_cachedpushable_dl_tasks_root;#else    struct dl_bwdl_bw;#endif    /*     * "Active utilization" for this runqueue: increased when a     * task wakes up (becomes TASK_RUNNING) and decreased when a     * task blocks     */    u64running_bw;    /*     * Utilization of the tasks "assigned" to this runqueue (including     * the tasks that are in runqueue and the tasks that executed on this     * CPU and blocked). Increased when a task moves to this runqueue, and     * decreased when the task moves away (migrates, changes scheduling     * policy, or terminates).     * This is needed to compute the "inactive utilization" for the     * runqueue (inactive utilization=this_bw - running_bw).     */    u64this_bw;    u64extra_bw;    /*     * Inverse of the fraction of CPU utilization that can be reclaimed     * by the GRUB algorithm.     */    u64bw_ratio;};

The real-time queue rt_rq is different: it is not implemented with a red-black tree.

/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
    struct rt_prio_array active;
    unsigned int rt_nr_running;
    unsigned int rr_nr_running;
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
    struct {
        int curr; /* highest queued rt task prio */
#ifdef CONFIG_SMP
        int next; /* next highest */
#endif
    } highest_prio;
#endif
#ifdef CONFIG_SMP
    unsigned long rt_nr_migratory;
    unsigned long rt_nr_total;
    int overloaded;
    struct plist_head pushable_tasks;
#endif /* CONFIG_SMP */
    int rt_queued;

    int rt_throttled;
    u64 rt_time;
    u64 rt_runtime;
    /* Nests inside the rq lock: */
    raw_spinlock_t rt_runtime_lock;

#ifdef CONFIG_RT_GROUP_SCHED
    unsigned long rt_nr_boosted;

    struct rq *rq;
    struct task_group *tg;
#endif
};

Now for the scheduling class, sched_class. It defines the queue operations as function pointers, for example:

enqueue_task: adds a task to the run queue; called when a task becomes runnable
dequeue_task: removes a task from the run queue
yield_task: gives up the CPU voluntarily
yield_to_task: gives up the CPU voluntarily in favor of the specified task_struct
check_preempt_curr: checks whether the current task can be preempted
pick_next_task: selects the next task to run
put_prev_task: called when the currently running task is replaced by another one
set_curr_task: called when a task changes its scheduling policy
task_tick: called on every periodic timer tick; may trigger a reschedule
task_dead: called when a process exits
switched_from, switched_to: called when a process changes its scheduling class
prio_changed: called when a process's priority changes

struct sched_class {
    const struct sched_class *next;

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task)   (struct rq *rq);
    bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
    void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

    /*
     * It is the responsibility of the pick_next_task() method that will
     * return the next task to call put_prev_task() on the @prev task or
     * something equivalent.
     *
     * May return RETRY_TASK when it finds a higher prio class has runnable
     * tasks.
     */
    struct task_struct * (*pick_next_task)(struct rq *rq,
                                           struct task_struct *prev,
                                           struct rq_flags *rf);
    void (*put_prev_task)(struct rq *rq, struct task_struct *p);
......
    void (*set_curr_task)(struct rq *rq);
    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork)(struct task_struct *p);
    void (*task_dead)(struct task_struct *p);

    /*
     * The switched_from() call is allowed to drop rq->lock, therefore we
     * cannot assume the switched_from/switched_to pair is serliazed by
     * rq->lock. They are however serialized by p->pi_lock.
     */
    void (*switched_from)(struct rq *this_rq, struct task_struct *task);
    void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                          int oldprio);

    unsigned int (*get_rr_interval)(struct rq *rq,
                                    struct task_struct *task);

    void (*update_curr)(struct rq *rq);

#define TASK_SET_GROUP 0
#define TASK_MOVE_GROUP 1
......
};

The scheduling classes are:

extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;

The function pointers point to the functions that actually implement each policy's queue operations. Under linux/kernel/sched/, the files fair.c, idle.c, rt.c and so on implement these functions for the different policies. For example, fair.c defines:

/*
 * All the scheduling class methods:
 */
const struct sched_class fair_sched_class = {
    .next = &idle_sched_class,
    .enqueue_task = enqueue_task_fair,
    .dequeue_task = dequeue_task_fair,
    .yield_task = yield_task_fair,
    .yield_to_task = yield_to_task_fair,
    .check_preempt_curr = check_preempt_wakeup,
    .pick_next_task = pick_next_task_fair,
    .put_prev_task = put_prev_task_fair,
......
    .set_curr_task = set_curr_task_fair,
    .task_tick = task_tick_fair,
    .task_fork = task_fork_fair,
    .prio_changed = prio_changed_fair,
    .switched_from = switched_from_fair,
    .switched_to = switched_to_fair,
    .get_rr_interval = get_rr_interval_fair,
    .update_curr = update_curr_fair,
......
};

Taking the selection of the next task as an example, CFS uses pick_next_task_fair, rt_rq uses pick_next_task_rt, and so on.

To summarize:

Every CPU has a struct rq, which contains cfs_rq, rt_rq and the other queues
The cfs_rq and dl_rq queues are organized as red-black trees whose nodes are scheduling entities (sched_entity); rt_rq uses priority arrays instead
Each scheduling entity corresponds to one task, a task_struct
The sched_class referenced by a task_struct provides the handler functions for that task's policy, and these do the actual scheduling work

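4. The scheduling flow

With the basic policies and data structures in place we have a rough skeleton. What remains is the core scheduling flow that ties them together into a working scheduler. Scheduling comes in two forms: voluntary scheduling and preemptive scheduling.

In voluntary scheduling, a task gives up the CPU of its own accord after running for a while, and the scheduling policy picks a suitable next task. In preemptive scheduling, a running task is interrupted on behalf of another task, stops executing, and the CPU switches to the next task.

4.1 Voluntary scheduling

Any discussion of scheduling has to start with the core function schedule(). It first calls sched_submit_work() to finish the current task's outstanding work, which avoids problems such as deadlocks or stalled I/O. It then disables preemption, calls __schedule() to do the actual scheduling, and re-enables preemption. If another reschedule is needed, the loop repeats; otherwise the function returns.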

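asmlinkage __visible void __sched schedule(void)
{
    struct task_struct *tsk = current;

    sched_submit_work(tsk);
    do {
        preempt_disable();
        __schedule(false);
        sched_preempt_enable_no_resched();
    } while (need_resched());
}
EXPORT_SYMBOL(schedule);

__schedule() is the real core of the scheduler. Its main jobs are selecting the next process and performing the context switch, which in turn consists of switching the user-space address space and switching the kernel-side state. The English comments in the source and the added step-by-step comments explain the details.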

/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPT=y):
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 *
 * WARNING: must be called with preemption disabled!
 */
static void __sched notrace __schedule(bool preempt)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq_flags rf;
    struct rq *rq;
    int cpu;

    // Get this CPU's run queue rq; prev is the task currently running
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    prev = rq->curr;

    // Sanity-check that the current task is allowed to schedule
    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    // Disable local interrupts, tell RCU about the context switch,
    // lock the run queue and issue a memory barrier
    local_irq_disable();
    rcu_note_context_switch(preempt);

    /*
     * Make sure that signal_pending_state()->signal_pending() below
     * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
     * done by the caller to avoid the race with signal_wake_up().
     *
     * The membarrier system call requires a full memory barrier
     * after coming from user-space, before storing to rq->curr.
     */
    rq_lock(rq, &rf);
    smp_mb__after_spinlock();

    /* Promote REQ to ACT */
    rq->clock_update_flags <<= 1;
    update_rq_clock(rq);

    switch_count = &prev->nivcsw;
    if (!preempt && prev->state) {
        // A signal is pending for the sleeping task: keep it runnable
        if (signal_pending_state(prev->state, prev)) {
            prev->state = TASK_RUNNING;
        } else {
            // Dequeue the current task from rq and clear on_rq;
            // if it is waiting for I/O, account for that
            deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
            prev->on_rq = 0;

            if (prev->in_iowait) {
                atomic_inc(&rq->nr_iowait);
                delayacct_blkio_start();
            }

            // Possibly wake up another workqueue worker
            /*
             * If a worker went to sleep, notify and ask workqueue
             * whether it wants to wake up a task to maintain
             * concurrency.
             */
            if (prev->flags & PF_WQ_WORKER) {
                struct task_struct *to_wakeup;

                to_wakeup = wq_worker_sleeping(prev);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup, &rf);
            }
        }
        switch_count = &prev->nvcsw;
    }

    // Call pick_next_task to get the next task and store it in next
    next = pick_next_task(rq, prev, &rf);
    clear_tsk_need_resched(prev);
    clear_preempt_need_resched();

    // If a different task was chosen, switch context
    if (likely(prev != next)) {
        rq->nr_switches++;
        rq->curr = next;
        /*
         * The membarrier system call requires each architecture
         * to have a full memory barrier after updating
         * rq->curr, before returning to user-space.
         *
         * Here are the schemes providing that barrier on the
         * various architectures:
         * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
         *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
         * - finish_lock_switch() for weakly-ordered
         *   architectures where spin_unlock is a full barrier,
         * - switch_to() for arm64 (weakly-ordered, spin_unlock
         *   is a RELEASE barrier),
         */
        ++*switch_count;

        trace_sched_switch(preempt, prev, next);

        /* Also unlocks the rq: */
        rq = context_switch(rq, prev, next, &rf);
    } else {
        // Clear the flag bits and re-enable interrupts
        rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
        rq_unlock_irq(rq, &rf);
    }

    // Run any queued balance callbacks for this run queue
    balance_callback(rq);
}

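The two key functions here are pick_next_task(), which selects the next task, and context_switch(), which performs the context switch. We look at each in turn, starting with pick_next_task(). It goes through the scheduling classes and calls each class's own selection function to choose the next entity. As the earlier analysis showed, for the tree-based classes this ends up picking the leftmost node of the corresponding red-black tree and returning it as the next task.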

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
    const struct sched_class *class;
    struct task_struct *p;

    // Fast path: if all runnable tasks are in the fair class, call
    // fair_sched_class.pick_next_task directly
    /*
     * Optimization: we know that if all tasks are in the fair class we can
     * call that function directly, but only if the @prev task wasn't of a
     * higher scheduling class, because otherwise those loose the
     * opportunity to pull in more work from other CPUs.
     */
    if (likely((prev->sched_class == &idle_sched_class ||
                prev->sched_class == &fair_sched_class) &&
               rq->nr_running == rq->cfs.h_nr_running)) {

        p = fair_sched_class.pick_next_task(rq, prev, rf);
        if (unlikely(p == RETRY_TASK))
            goto again;

        /* Assumes fair_sched_class->next == idle_sched_class */
        if (unlikely(!p))
            p = idle_sched_class.pick_next_task(rq, prev, rf);

        return p;
    }

again:
    // Try each class's pick function in priority order and return the
    // first task found
    for_each_class(class) {
        p = class->pick_next_task(rq, prev, rf);
        if (p) {
            if (unlikely(p == RETRY_TASK))
                goto again;
            return p;
        }
    }

    /* The idle class should always have a runnable task: */
    BUG();
}
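The control flow of pick_next_task() can be modeled in a few lines. The sketch below is a simplified illustration, not kernel code: each class's queue is a plain list that is assumed to be already ordered by that class's own policy, the check on the previous task's class is left out, and RETRY_TASK is not modeled.

```python
# Scheduling classes from highest to lowest priority.
CLASS_ORDER = ["stop", "dl", "rt", "fair", "idle"]


def pick_next_task(runqueues):
    """Pick the next task from per-class queues.

    runqueues maps a class name to a list of runnable task names.
    """
    fair = runqueues.get("fair", [])
    runnable = sum(len(q) for name, q in runqueues.items() if name != "idle")

    # Fast path: every runnable task belongs to the fair class.
    if runnable == len(fair):
        return fair[0] if fair else runqueues["idle"][0]

    # Slow path: walk the classes in priority order.
    for name in CLASS_ORDER:
        queue = runqueues.get(name, [])
        if queue:
            return queue[0]
    raise RuntimeError("the idle class should always have a runnable task")


if __name__ == "__main__":
    print(pick_next_task({"rt": ["irq_thread"], "fair": ["bash"], "idle": ["swapper"]}))
    print(pick_next_task({"fair": ["bash", "vim"], "idle": ["swapper"]}))
    print(pick_next_task({"idle": ["swapper"]}))
```

The fast path matters because on most systems nearly all runnable tasks are normal tasks, so the loop over classes is rarely needed.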

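Next comes the context switch. It does two things: it switches the task's address space, that is, its virtual memory, and it switches the registers and CPU context. The address-space switch is covered in the articles on memory and is skipped here; it is what completes the user-space side of the context switch. Below we focus on the kernel side, the switch of registers and CPU context.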

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next, struct rq_flags *rf)
{
    struct mm_struct *mm, *oldmm;

    prepare_task_switch(rq, prev, next);

    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);

    /*
     * If mm is non-NULL, we pass through switch_mm(). If mm is
     * NULL, we will pass through mmdrop() in finish_task_switch().
     * Both of these contain the full memory barrier required by
     * membarrier after storing to rq->curr, before returning to
     * user-space.
     */
    if (!mm) {
        next->active_mm = oldmm;
        mmgrab(oldmm);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm_irqs_off(oldmm, mm, next);

    if (!prev->mm) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }

    rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

    prepare_lock_switch(rq, next, rf);

    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);
    // barrier() is a compiler barrier: it keeps compile-time optimization
    // from reordering switch_to and finish_task_switch
    barrier();

    return finish_task_switch(prev);
}

switch_to() switches the registers and the stack, and it calls __switch_to_asm. This is a piece of assembly whose main job is switching stacks; the 32-bit version uses esp as the stack pointer and the 64-bit version uses rsp, and the rest of the code is the same. The assembly switches the stack pointer and then calls __switch_to to finish the switch. Note that switch_to actually takes three arguments, prev, next and last, and that in practice last is also passed as prev. The intent behind this design is best explained with an example. Suppose there are three tasks A, B and C, and the CPU switches from A to B, from B to C, and finally from C back to A. If only prev and next were kept, the flow would be:

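1. A saves its kernel stack and registers and switches to B. At this point prev=A and next=B; these values are saved on A's stack and restored the next time A runs. B then continues in finish_task_switch() and returns B's run queue rq.
2. B saves its kernel stack and registers and switches to C.
3. C saves its kernel stack and registers and switches to A. A resumes at barrier(), but the values it saved in step 1, prev=A and next=B, bypass C entirely: the information about C is lost.

This is where the last pointer becomes important. After __switch_to_asm has run, A's kernel stack and registers have restored the old prev and next, but the return value supplies the address of C. It is stored in last, and finish_task_switch() uses it to do the cleanup.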
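The A, B, C scenario can be replayed with a small model. This is a sketch of the bookkeeping only, under the assumption that each task's prev and next locals are frozen on its stack when it is switched out; run_switches and the dictionary layout are illustrative names, not kernel structures.

```python
def run_switches(switches):
    """Replay (prev, next) switches and report what each resumed task sees.

    For a task that resumes, `stale_prev` is the prev value frozen on its
    own stack when it was switched out, and `last` is the task the CPU
    actually came from, as delivered by the return value of the switch.
    """
    saved_frames = {}   # task -> locals frozen on its kernel stack
    observations = []
    for prev, nxt in switches:
        # prev's locals are frozen as it is switched out.
        saved_frames[prev] = {"prev": prev, "next": nxt}
        if nxt in saved_frames:
            # nxt resumes with its old locals; only `last` is up to date.
            frame = saved_frames.pop(nxt)
            observations.append(
                {"task": nxt, "stale_prev": frame["prev"], "last": prev}
            )
    return observations


if __name__ == "__main__":
    for obs in run_switches([("A", "B"), ("B", "C"), ("C", "A")]):
        print(obs)
```

When A resumes, its saved prev still says A, while last correctly identifies C as the task that was just switched away from, which is the task that needs cleaning up.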

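#define switch_to(prev, next, last) \
do { \
    prepare_switch_to(next); \
    \
    ((last) = __switch_to_asm((prev), (next))); \
} while (0)

/*
 * %eax: prev task
 * %edx: next task
 */
ENTRY(__switch_to_asm)
......
  /* switch stack */
  movl  %esp, TASK_threadsp(%eax)
  movl  TASK_threadsp(%edx), %esp
......
  jmp  __switch_to
END(__switch_to_asm)

Finally __switch_to() is called. It involves the TSS (Task State Segment), an x86 structure that can hold all the registers. A special register, TR (Task Register), points to the TSS. On x86, changing TR makes the hardware save all CPU registers into the current TSS and load them from the new one, which amounts to a hardware context switch. Linux does not switch tasks this way: during system initialization cpu_init() associates one TSS with each CPU and points TR at it, and from then on TR never changes and always points to that TSS. Only a change of TR would be a hardware task switch; the actual switch is done in software by __switch_to(), as the comment in the source below explains.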

/* *switch_to(x,y) should switch tasks from x to y. * * We fsave/fwait so that an exception goes off at the right time * (as a call from the fsave or fwait in effect) rather than to * the wrong process. Lazy FP saving no longer makes any sense * with modern CPU's, and this simplifies a lot of things (SMP * and UP become the same). * * NOTE! We used to use the x86 hardware context switching. The * reason for not using it any more becomes apparent when you * try to recover gracefully from saved state that is no longer * valid (stale segment register values in particular). With the * hardware task-switch, there is no way to fix up bad state in * a reasonable manner. * * The fact that Intel documents the hardware task-switching to * be slow is a fairly red herring - this code is not noticeably * faster. However, there _is_ some room for improvement here, * so the performance issues may eventually be a valid point. * More important, however, is the fact that this allows us much * more flexibility. * * The return value (in %ax) will be the "prev" task after * the task-switch, and shows up in ret_from_fork in entry.S, * for example. */__visible __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev_p, struct task_struct *next_p){    struct thread_struct *prev=&prev_p->thread,                 *next=&next_p->thread;    struct fpu *prev_fpu=&prev->fpu;    struct fpu *next_fpu=&next->fpu;    int cpu=smp_processor_id();    /* never put a printk in __switch_to... printk() calls wake_up*() indirectly */    switch_fpu_prepare(prev_fpu, cpu);    /*     * Save away %gs. No need to save %fs, as it was saved on the     * stack on entry.  No need to save %es and %ds, as those are     * always kernel segments while inside the kernel.  Doing this     * before setting the new TLS descriptors avoids the situation     * where we temporarily have non-reloadable segments in %fs     * and %gs.  
This could be an issue if the NMI handler ever     * used %fs or %gs (it does not today), or if the kernel is     * running inside of a hypervisor layer.     */    lazy_save_gs(prev->gs);    /*     * Load the per-thread Thread-Local Storage descriptor.     */    load_TLS(next, cpu);    /*     * Restore IOPL if needed.  In normal use, the flags restore     * in the switch assembly will handle this.  But if the kernel     * is running virtualized at a non-zero CPL, the popf will     * not restore flags, so it must be done in a separate step.     */    if (get_kernel_rpl() && unlikely(prev->iopl !=next->iopl))        set_iopl_mask(next->iopl);    switch_to_extra(prev_p, next_p);    /*     * Leave lazy mode, flushing any hypercalls made here.     * This must be done before restoring TLS segments so     * the GDT and LDT are properly updated, and must be     * done before fpu__restore(), so the TS bit is up     * to date.     */    arch_end_context_switch(next_p);    /*     * Reload esp0 and cpu_current_top_of_stack.  This changes     * current_thread_info().  Refresh the SYSENTER configuration in     * case prev or next is vm86.     */    update_task_stack(next_p);    refresh_sysenter_cs(next);    this_cpu_write(cpu_current_top_of_stack,               (unsigned long)task_stack_page(next_p) +               THREAD_SIZE);    /*     * Restore %gs if needed (which is common)     */    if (prev->gs | next->gs)        lazy_load_gs(next->gs);    switch_fpu_finish(next_fpu, cpu);    this_cpu_write(current_task, next_p);    /* Load the Intel cache allocation PQR MSR. */    resctrl_sched_in();    return prev_p;}


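After switch_to() has completed the kernel-mode switch, one more important function, finish_task_switch(), performs the cleanup. When introducing the three arguments of switch_to earlier, we explained why last matters. Here both prev and last are assigned to prev because the old value of prev is not needed afterwards, which saves the extra pointer that would otherwise be required to store last.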

/**
 * finish_task_switch - clean up after a task-switch
 * @prev: the thread we just switched away from.
 *
 * finish_task_switch must be called after the context switch, paired
 * with a prepare_task_switch call before the context switch.
 * finish_task_switch will reconcile locking set up by prepare_task_switch,
 * and do any other architecture-specific cleanup actions.
 *
 * Note that we may have delayed dropping an mm in context_switch(). If
 * so, we finish that here outside of the runqueue lock. (Doing it
 * with the lock held can cause deadlocks; see schedule() for
 * details.)
 *
 * The context switch have flipped the stack from under us and restored the
 * local variables which were saved when this task called schedule() in the
 * past. prev == current is still correct but we need to recalculate this_rq
 * because prev may have moved to another CPU.
 */
static struct rq *finish_task_switch(struct task_struct *prev)
    __releases(rq->lock)
{
    struct rq *rq = this_rq();
    struct mm_struct *mm = rq->prev_mm;
    long prev_state;

    /*
     * The previous task will have left us with a preempt_count of 2
     * because it left us after:
     *
     *  schedule()
     *    preempt_disable();            // 1
     *    __schedule()
     *      raw_spin_lock_irq(&rq->lock)    // 2
     *
     * Also, see FORK_PREEMPT_COUNT.
     */
    if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
              "corrupted preempt_count: %s/%d/0x%x\n",
              current->comm, current->pid, preempt_count()))
        preempt_count_set(FORK_PREEMPT_COUNT);

    rq->prev_mm = NULL;

    /*
     * A task struct has one reference for the use as "current".
     * If a task dies, then it sets TASK_DEAD in tsk->state and calls
     * schedule one last time. The schedule call will never return, and
     * the scheduled task must drop that reference.
     *
     * We must observe prev->state before clearing prev->on_cpu (in
     * finish_task), otherwise a concurrent wakeup can get prev
     * running on another CPU and we could rave with its RUNNING -> DEAD
     * transition, resulting in a double drop.
     */
    prev_state = prev->state;
    vtime_task_switch(prev);
    perf_event_task_sched_in(prev, current);
    finish_task(prev);
    finish_lock_switch(rq);
    finish_arch_post_lock_switch();
    kcov_finish_switch(current);

    fire_sched_in_preempt_notifiers(current);

    /*
     * When switching through a kernel thread, the loop in
     * membarrier_{private,global}_expedited() may have observed that
     * kernel thread and not issued an IPI. It is therefore possible to
     * schedule between user->kernel->user threads without passing though
     * switch_mm(). Membarrier requires a barrier after storing to
     * rq->curr, before returning to userspace, so provide them here:
     *
     * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
     *   provided by mmdrop(),
     * - a sync_core for SYNC_CORE.
     */
    if (mm) {
        membarrier_mm_sync_core_before_usermode(mm);
        mmdrop(mm);
    }

    if (unlikely(prev_state == TASK_DEAD)) {
        if (prev->sched_class->task_dead)
            prev->sched_class->task_dead(prev);

        /*
         * Remove function-return probe instances associated with this
         * task and put them back on the free list.
         */
        kprobe_flush_task(prev);

        /* Task is done with its stack. */
        put_task_stack(prev);

        put_task_struct(prev);
    }

    tick_nohz_task_switch();
    return rq;
}

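This completes the kernel-mode switch, and with it the whole voluntary scheduling path.

4.2 Preemptive scheduling

Preemptive scheduling typically occurs in two situations: a task has been running for too long, or a task is woken up. We look at the long-running case first.

4.2.1 Detecting task run time

This case requires measuring how long a task has been running and requesting preemption when it has run too long. The machine has a timer that raises a timer interrupt at a fixed interval, telling the operating system that another tick has passed. Each tick is an opportunity to check whether the current task is due for preemption.

The timer interrupt handler calls scheduler_tick(). This function first obtains the current CPU, and from it the corresponding run queue rq and the current task curr. It then calls the task_tick() function of the task's scheduling class (sched_class) to handle the tick.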

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;
    struct rq_flags rf;

    sched_clock_tick();

    rq_lock(rq, &rf);

    update_rq_clock(rq);
    curr->sched_class->task_tick(rq, curr, 0);
    cpu_load_update_active(rq);
    calc_global_load_tick(rq);
    psi_task_tick(rq);

    rq_unlock(rq, &rf);

    perf_event_task_tick();
    ......
}
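The call curr->sched_class->task_tick(rq, curr, 0) is an ordinary function-pointer call through a per-class operations table. The following is a minimal user-space sketch of that pattern, not kernel code: toy_sched_class, toy_task, toy_fair_tick, toy_rt_tick and toy_scheduler_tick are names invented here for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task;

/* Hypothetical stand-in for struct sched_class: one hook per class. */
struct toy_sched_class {
    void (*task_tick)(struct toy_task *curr);
};

struct toy_task {
    const struct toy_sched_class *sched_class;
    unsigned long fair_ticks; /* ticks accounted by the "fair" class */
    unsigned long rt_ticks;   /* ticks accounted by the "rt" class */
};

static void toy_fair_tick(struct toy_task *curr) { curr->fair_ticks++; }
static void toy_rt_tick(struct toy_task *curr)   { curr->rt_ticks++; }

static const struct toy_sched_class toy_fair_class = { .task_tick = toy_fair_tick };
static const struct toy_sched_class toy_rt_class   = { .task_tick = toy_rt_tick };

/*
 * Same shape as scheduler_tick(): the tick handler does not know which
 * policy the task uses, it only calls through the class table.
 */
static void toy_scheduler_tick(struct toy_task *curr)
{
    assert(curr != NULL && curr->sched_class != NULL);
    if (curr->sched_class->task_tick != NULL)
        curr->sched_class->task_tick(curr);
}
```

Pointing a task's sched_class at toy_rt_class instead of toy_fair_class changes which handler runs on the next tick, without any change to toy_scheduler_tick().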

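Taking the queue of normal tasks as an example, the scheduling class is fair_sched_class and its tick handler is task_tick_fair(). That function walks the task's scheduling entities, looks up the CFS run queue of each one, and calls entity_tick() to update the time accounting.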

/*
 * scheduler tick hitting a task of our scheduling class.
 * NOTE: This function can be called remotely by the tick offload that
 * goes along full dynticks. Therefore no local assumption can be made
 * and everything must be accessed through the @rq and @curr passed in
 * parameters.
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued);
    }

    if (static_branch_unlikely(&sched_numa_balancing))
        task_tick_numa(rq, curr);

    update_misfit_status(curr, rq);
    update_overutilized_status(task_rq(curr));
}

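entity_tick() first calls update_curr() to update the vruntime of the current task. Then, if more than one entity is runnable on the queue, it calls check_preempt_tick() to check whether preemption should be requested now.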

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

    /*
     * Ensure that runnable average is periodically updated.
     */
    update_load_avg(cfs_rq, curr, UPDATE_TG);
    update_cfs_group(curr);
    ......
    if (cfs_rq->nr_running > 1)
        check_preempt_tick(cfs_rq, curr);
}

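check_preempt_tick() first calls sched_slice() to compute ideal_runtime, the amount of real time this task should run within one scheduling period. sum_exec_runtime is the total real time the task has executed, and prev_sum_exec_runtime is the total it had accumulated when it was last scheduled in, so sum_exec_runtime - prev_sum_exec_runtime is the real time consumed in the current run. If this exceeds ideal_runtime, the task should be preempted. Apart from this condition, __pick_first_entity() fetches the entity with the smallest vruntime in the red-black tree. If the current task's vruntime exceeds that minimum by more than ideal_runtime, the task should also be preempted. This second check is skipped while the task has run for less than sysctl_sched_min_granularity.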

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;
    struct sched_entity *se;
    s64 delta;

    ideal_runtime = sched_slice(cfs_rq, curr);
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
    if (delta_exec > ideal_runtime) {
        resched_curr(rq_of(cfs_rq));
        /*
         * The current task ran long enough, ensure it doesn't get
         * re-elected due to buddy favours.
         */
        clear_buddies(cfs_rq, curr);
        return;
    }

    /*
     * Ensure that a task that missed wakeup preemption by a
     * narrow margin doesn't have to wait for a full slice.
     * This also mitigates buddy induced latencies under load.
     */
    if (delta_exec < sysctl_sched_min_granularity)
        return;

    se = __pick_first_entity(cfs_rq);
    delta = curr->vruntime - se->vruntime;

    if (delta < 0)
        return;

    if (delta > ideal_runtime)
        resched_curr(rq_of(cfs_rq));
}
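The two preemption conditions can be condensed into one pure function. The following is a minimal user-space sketch, not the kernel implementation: struct toy_tick_state and toy_should_preempt_on_tick() are names invented here, and the values the kernel derives from sched_slice() and the red-black tree are simply passed in as numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified inputs; all times are in nanoseconds. */
struct toy_tick_state {
    uint64_t sum_exec_runtime;      /* total CPU time the task has used */
    uint64_t prev_sum_exec_runtime; /* total at the moment it was last picked */
    uint64_t ideal_runtime;         /* slice granted for this period */
    uint64_t min_granularity;       /* minimum run time before the vruntime check */
    uint64_t curr_vruntime;         /* vruntime of the running task */
    uint64_t leftmost_vruntime;     /* smallest vruntime in the tree */
};

/* Returns true when the running task should be marked for rescheduling. */
static bool toy_should_preempt_on_tick(const struct toy_tick_state *s)
{
    assert(s->sum_exec_runtime >= s->prev_sum_exec_runtime);
    uint64_t delta_exec = s->sum_exec_runtime - s->prev_sum_exec_runtime;

    /* Condition 1: the task has used up its slice. */
    if (delta_exec > s->ideal_runtime)
        return true;

    /* Too little run time so far: do not preempt yet. */
    if (delta_exec < s->min_granularity)
        return false;

    /* Condition 2: the task is ahead of the leftmost entity by more than a slice. */
    int64_t delta = (int64_t)(s->curr_vruntime - s->leftmost_vruntime);
    if (delta < 0)
        return false;
    return (uint64_t)delta > s->ideal_runtime;
}
```

With a 6 ms slice, for example, a task that has run for 10 ms since it was picked satisfies the first condition. A task that has run for only 2 ms is still flagged if its vruntime is more than 6 ms ahead of the leftmost entity.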

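Once preemption is deemed necessary, resched_curr() is called. On the local CPU it calls set_tsk_need_resched() to set the TIF_NEED_RESCHED flag on the task, marking it as due for preemption. When the run queue belongs to another CPU, the flag is set through set_nr_and_not_polling() and, if required, that CPU is notified with smp_send_reschedule().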

/*
 * resched_curr - mark rq's current task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */
void resched_curr(struct rq *rq)
{
    struct task_struct *curr = rq->curr;
    int cpu;
    ......
    cpu = cpu_of(rq);

    if (cpu == smp_processor_id()) {
        set_tsk_need_resched(curr);
        set_preempt_need_resched();
        return;
    }

    if (set_nr_and_not_polling(curr))
        smp_send_reschedule(cpu);
    else
        trace_sched_wake_idle_without_ipi(cpu);
}
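The key point is that resched_curr() only sets a flag and never switches tasks itself. A minimal user-space sketch of that idea follows, under simplifying assumptions: TOY_NEED_RESCHED, toy_task, toy_rq, toy_resched_curr() and toy_need_resched() are invented names, and the kernel's polling-idle optimisation is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NEED_RESCHED (1u << 0) /* hypothetical flag bit */

struct toy_task {
    unsigned int flags;
};

struct toy_rq {
    int cpu;               /* CPU this run queue belongs to */
    struct toy_task *curr; /* task currently running there */
};

/*
 * Marks rq->curr for rescheduling. Returns true when the run queue
 * belongs to another CPU, i.e. when that CPU may have to be notified.
 */
static bool toy_resched_curr(struct toy_rq *rq, int this_cpu)
{
    assert(rq != NULL && rq->curr != NULL);
    rq->curr->flags |= TOY_NEED_RESCHED; /* only a mark, no switch happens here */
    return rq->cpu != this_cpu;
}

static bool toy_need_resched(const struct toy_task *t)
{
    return (t->flags & TOY_NEED_RESCHED) != 0;
}
```

The flag is consumed later, at one of the points where the running task passes through the scheduler.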

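4.2.2 Task wakeup

Some tasks are woken up by an interrupt. For example, when I/O arrives, the process waiting for it is usually woken. If the woken task has a higher priority than the task currently on the CPU, preemption is triggered. try_to_wake_up() calls ttwu_queue() to add the woken task to the run queue. ttwu_queue() calls ttwu_do_activate() to activate the task, and ttwu_do_activate() calls ttwu_do_wakeup(), which in turn calls check_preempt_curr() to check whether preemption should occur. Even when it should, the current task is not kicked off the CPU immediately; it is only marked as due for preemption.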

static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
         struct rq_flags *rf)
{
    check_preempt_curr(rq, p, wake_flags);
    p->state = TASK_RUNNING;
    trace_sched_wakeup(p);
    ......
}
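The wakeup path can be modelled in a few lines. This is a hedged user-space sketch, not the kernel logic: toy_task and toy_wake_up() are invented names, and a plain priority comparison stands in for the class-specific check that check_preempt_curr() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
    int prio;          /* smaller value = higher priority (assumed convention) */
    bool runnable;
    bool need_resched;
};

/*
 * Wake p on a CPU that is currently running curr. The woken task is only
 * made runnable; curr keeps running and is at most marked for preemption.
 */
static void toy_wake_up(struct toy_task *curr, struct toy_task *p)
{
    assert(curr != NULL && p != NULL);
    if (p->prio < curr->prio)
        curr->need_resched = true;
    p->runnable = true;
}
```

Waking a lower-priority task leaves curr untouched, while waking a higher-priority one sets its need_resched mark. In both cases curr is still the running task when toy_wake_up() returns.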

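4.2.3 When preemption actually happens

From the analysis above, whether the current task has run too long or a new task has been woken, the current task is only marked with _TIF_NEED_RESCHED. Actual preemption requires a specific moment at which the running process gets a chance to call __schedule() and start the real scheduling. We now look at when that happens.

__schedule() ends up being called at the following points:

Returning from a system call to user mode: taking 64-bit as an example, the call chain is do_syscall_64->syscall_return_slowpath->prepare_exit_to_usermode->exit_to_usermode_loop. exit_to_usermode_loop checks whether _TIF_NEED_RESCHED is set and, if so, calls schedule(), which in turn calls __schedule().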

static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
{
    while (true) {
        /* We have work to do. */
        local_irq_enable();

        if (cached_flags & _TIF_NEED_RESCHED)
            schedule();
        ......
    }
}

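Re-enabling preemption in kernel mode: while running in kernel mode, preemption usually happens in preempt_enable(). Some kernel operations must not be interrupted, so preempt_disable() is always called before them to turn preemption off. The moment preemption is turned back on is an opportunity for kernel code to be preempted. preempt_enable() calls preempt_count_dec_and_test(), which checks preempt_count and TIF_NEED_RESCHED to decide whether preemption is allowed. If it is, preempt_schedule->preempt_schedule_common->__schedule is called to reschedule.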

#define preempt_enable() \
do { \
    if (unlikely(preempt_count_dec_and_test())) \
        __preempt_schedule(); \
} while (0)

#define preempt_count_dec_and_test() \
    ({ preempt_count_sub(1); should_resched(0); })

static __always_inline bool should_resched(int preempt_offset)
{
    return unlikely(preempt_count() == preempt_offset &&
            tif_need_resched());
}

#define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED)

static void __sched notrace preempt_schedule_common(void)
{
    do {
        ......
        __schedule(true);
        ......
    } while (need_resched());
}
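The interplay between the nesting counter and the flag is easy to model. The following user-space sketch assumes a single CPU and uses invented names throughout (toy_preempt_count, toy_need_resched, toy_schedule_calls and the toy_* functions); it only mirrors the test made by should_resched(0), not the real scheduler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for a single-CPU model. */
static int  toy_preempt_count;  /* > 0 means preemption is disabled */
static bool toy_need_resched;   /* set by a tick or a wakeup */
static int  toy_schedule_calls; /* how many times we "rescheduled" */

static void toy_schedule(void)
{
    toy_need_resched = false;   /* the flag is consumed by the scheduler */
    toy_schedule_calls++;
}

static void toy_preempt_disable(void)
{
    toy_preempt_count++;
}

static void toy_preempt_enable(void)
{
    assert(toy_preempt_count > 0);
    toy_preempt_count--;
    /* Same test as should_resched(0): count back at zero and flag set. */
    if (toy_preempt_count == 0 && toy_need_resched)
        toy_schedule();
}
```

With two nested toy_preempt_disable() calls and the flag set in between, the first toy_preempt_enable() does nothing and only the outermost one calls toy_schedule(). This is why a pending preemption is deferred until the last critical section ends.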

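Returning from an interrupt to kernel or user mode: interrupts are handled by do_IRQ. When the handler finishes there are two cases, returning to user mode and returning to kernel mode. Returning to user mode calls prepare_exit_to_usermode(), which eventually calls exit_to_usermode_loop(). Returning to kernel mode calls preempt_schedule_irq(), which eventually calls __schedule().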

common_interrupt:
        ASM_CLAC
        addq    $-0x80, (%rsp)
        interrupt do_IRQ
ret_from_intr:
        popq    %rsp
        testb   $3, CS(%rsp)
        jz      retint_kernel
/* Interrupt came from user space */
GLOBAL(retint_user)
        mov     %rsp,%rdi
        call    prepare_exit_to_usermode
        TRACE_IRQS_IRETQ
        SWAPGS
        jmp     restore_regs_and_iret
/* Returning to kernel space */
retint_kernel:
#ifdef CONFIG_PREEMPT
        bt      $9, EFLAGS(%rsp)
        jnc     1f
0:      cmpl    $0, PER_CPU_VAR(__preempt_count)
        jnz     1f
        call    preempt_schedule_irq
        jmp     0b

asmlinkage __visible void __sched preempt_schedule_irq(void)
{
    ......
    do {
        preempt_disable();
        local_irq_enable();
        __schedule(true);
        local_irq_disable();
        sched_preempt_enable_no_resched();
    } while (need_resched());
    ......
}

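5. Summary

This article analyzed the task scheduling policies, the related data structures, and the whole scheduling flow. The part on switching the memory context has not been covered in detail and is left for the discussion of memory management.

Source references

1. Implementation of the scheduling-related structures and functions

2. The core schedule function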
