sched_stat(8)

NAME
  sched_stat - Displays CPU usage and process-scheduling statistics for
  SMP and NUMA platforms
SYNOPSIS
/usr/sbin/sched_stat [-l] [-s] [-f] [-u] [-R] [command [cmd_arg]...]
OPTIONS
  -f  Prints the count of calls that are not multiprocessor safe and
      therefore funneled to the master CPU. For example:

        Funnelling counts
          unix master calls 11174 resulting blocks 2876

      The impact of funneled calls on the master CPU needs to be taken
      into account when evaluating statistics for the master CPU.
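A ratio of the two figures in this output gives a rough sense of how often funneled calls had to block; reading it that way is an interpretive assumption, since the manual itself does not define such a metric. A minimal sketch using the sample figures above:

```python
# Figures from the sample "Funnelling counts" output above.
master_calls = 11174     # calls funneled to the master CPU
resulting_blocks = 2876  # blocks that resulted from those calls

# Hypothetical "blocking rate": fraction of funneled calls that blocked.
block_rate = 100.0 * resulting_blocks / master_calls
print(f"{block_rate:.1f}% of funneled calls blocked")  # -> 25.7%
```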
  -l  Prints scheduler load-balancing statistics. For example:

        Scheduler Load Balancing
                                               |    5-second averages
             steal           idle    desired   | current  interrupt     RT
        cpu  trys   steals   steals  load      | load     %             %
        -----+---------------------------------+----------------------------
          0 |  288      3    20609   0.000     |  0.000     0.454    0.156
          1 |  615      6    21359   0.000     |  0.000     0.002    0.203
          2 |  996      4    20135   0.000     |  0.001     0.000    0.237
          3 | 1302      4    16195   0.000     |  0.001     0.000    0.330
          6 |    5      0     3029   0.000     |  0.000     0.000    0.034
          .
          .
          .
      In the displayed table, each row contains the following per-CPU
      information:

      -  The number identifier of the CPU.
      -  The number of attempts made to steal processes/threads from
         other CPUs when the CPU was not idle (steal trys).
      -  The number of processes/threads actually stolen from other CPUs
         when the CPU was not idle (steals).
      -  The number of processes/threads stolen from other CPUs when the
         CPU was idle (idle steals).
      -  The number of time slices that should be used on this CPU for
         running timeshare threads (desired load). This value is
         calculated by comparing the current load, interrupt %, and RT %
         statistics obtained for this CPU with those obtained for other
         CPUs in the same PAG.

         When current load is less than desired load, the scheduler
         attempts to migrate timeshare threads to this CPU in order to
         better balance the timeshare workload among CPUs in the same
         PAG. See DESCRIPTION for information about PAGs.
      -  The average number of time slices used to run timeshare threads
         on this CPU over the last five seconds (current load).
      -  The average percentage of time slices that this CPU spent in
         interrupt context over the last five seconds (interrupt %).
      -  The average percentage of time slices that this CPU used to run
         threads according to FIFO or round-robin policy over the last
         five seconds (RT %).

  -R  Prints information about CPU locality in two tables:

      Radtab  Shows the order of preference (in terms of memory
              affinity) that exists between a CPU and different RADs.
              Order of preference indicates, for a given home RAD, the
              ranking of other RADs in terms of increasing physical
              distance from that home RAD. If a process or thread needs
              more memory or needs to be scheduled on a RAD other than
              its home RAD, the kernel automatically searches RADs for
              additional memory or CPU cycles in the order of preference
              shown in this table.

      Hoptab  Shows the distance (number of hops) between different RADs
              and, by association, between CPUs. The information in
              this table is coarser-grained than in the preceding Radtab
              table and more relevant to NUMA programming choices. For
              example, the expression RAD_DIST_LOCAL + 2 indicates RADs
              that are no more than two hops from a thread's home RAD.
      For example (a small, switchless mesh NUMA system):

        Radtab (rads in order of preference)
                     CPU #
        Preference   0  1  2  3
        -----------------------
            0        0  1  2  3
            1        1  0  3  2
            2        2  3  0  1
            3        3  2  1  0

        Hoptab (hops indexed by rad)
                     CPU #
        To rad #     0  1  2  3
        -----------------------
            0        0  1  1  2
            1        1  0  2  1
            2        1  2  0  1
            3        2  1  1  0
      In these tables, the CPU identifiers are listed across the top
      from left to right and the RAD identifiers are listed on the left
      from top to bottom. For example, if a process running on CPU 2
      needs additional memory, Radtab indicates that the kernel will
      search for that memory first in RAD 2, then in RAD 3, then in RAD
      0, and last in RAD 1. Hoptab shows the basis of this preference:
      RAD 2 is CPU 2's local RAD, RADs 0 and 3 are one hop away, and RAD
      1 is two hops away.
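The two tables are consistent with each other: each Radtab column lists RADs in increasing Hoptab distance from that CPU's home RAD. (The order among RADs at equal distance is platform-determined, so the sketch below groups them rather than guessing a tie-break rule.) A minimal sketch using the example's hop matrix:

```python
# Hop distances from the Hoptab example above
# (hops[a][b] = number of hops from RAD a to RAD b).
hops = [
    [0, 1, 1, 2],
    [1, 0, 2, 1],
    [1, 2, 0, 1],
    [2, 1, 1, 0],
]

def rads_by_distance(home_rad):
    """Group RADs by increasing hop distance from home_rad.
    Radtab lists these groups in order; the ordering *within* a
    group of equidistant RADs is chosen by the platform."""
    groups = {}
    for rad, dist in enumerate(hops[home_rad]):
        groups.setdefault(dist, []).append(rad)
    return [groups[d] for d in sorted(groups)]

# For CPU 2 (home RAD 2): local RAD first, then the one-hop RADs,
# then RAD 1 -- matching Radtab's CPU 2 column (2, 3, 0, 1).
print(rads_by_distance(2))  # -> [[2], [0, 3], [1]]
```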
      The -R option is useful only on NUMA platforms, such as GS1280 and
      ES80 AlphaServer systems, in which memory latency varies from one
      RAD to another. The information in these tables is less useful for
      GS80, GS160, and GS320 AlphaServer systems because both coarse and
      finer-grained memory affinity is the same from any CPU in one RAD
      to any CPU in another RAD; however, the displays can tell you
      which CPUs are in which RAD.

      Make sure that you both maximize the size of your terminal
      emulator window and minimize the font size before using the -R
      option; otherwise, line-wrapping will make the tables very
      difficult to read on systems that have many CPUs.

  -s  Prints scheduling-dispatch (processor-usage) statistics for each
      CPU. For example:
        Scheduler Dispatch Statistics

        cpu 0      local   global      idle   remote |    total  percent
        ----------------------------------------------------------------
        hot        60827    12868  19158991       0  | 19232686     91.6
        warm          78       21   1542019       0  |  1542118      7.3
        cold         315    27289    184784    7855  |   220243      1.0
        ----------------------------------------------------------------
        total      61220    40178  20885794    7855  | 20995047
        percent      0.3      0.2      99.5     0.0

        cpu 1      local   global      idle   remote |    total  percent
        ----------------------------------------------------------------
        hot        33760    11788  16412544       0  | 16458092     89.5
        warm          66       24   1707014       0  |  1707104      9.3
        cold         201    26191    203513       0  |   229905      1.2
        ----------------------------------------------------------------
        .
        .
        .
      These statistics show the count and percentage of thread context
      switches (times that the kernel switches to a new thread) for the
      following categories:

      local    Threads scheduled from the CPU's Local Run Queue
      global   Threads scheduled from the Global Run Queue of the PAG to
               which the CPU belongs
      idle     Threads scheduled from the Idle CPU Queue of the PAG to
               which the CPU belongs
      remote   Threads stolen from Global or Local Run Queues in another
               PAG

      Note that these statistics do not count CPU time slices that were
      used to re-run the same thread.
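The percent figures in the dispatch tables are plain ratios of these counts, which can be verified against the cpu 0 example above:

```python
# Context-switch counts for cpu 0, from the example above:
# rows are hot/warm/cold, columns are local/global/idle/remote.
counts = {
    "hot":  (60827, 12868, 19158991,    0),
    "warm": (   78,    21,  1542019,    0),
    "cold": (  315, 27289,   184784, 7855),
}

row_totals = {k: sum(v) for k, v in counts.items()}
grand_total = sum(row_totals.values())

# Row percent: share of all context switches that were hot/warm/cold.
for k in counts:
    print(k, row_totals[k], f"{100 * row_totals[k] / grand_total:.1f}")
# -> hot 19232686 91.6 / warm 1542118 7.3 / cold 220243 1.0

# Column percent: share scheduled from each queue type.
col_totals = [sum(col) for col in zip(*counts.values())]
print([f"{100 * c / grand_total:.1f}" for c in col_totals])
# -> ['0.3', '0.2', '99.5', '0.0']
```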
      Each SMP unit (or RAD on a NUMA system) has a Processor Affinity
      Group (PAG). Each PAG contains the following queues:

      -  A Global Run Queue, from which processes or threads are
         scheduled on the first available CPU
      -  One or more Local Run Queues, from which processes or threads
         are scheduled on a specific CPU
      -  A queue that contains idle CPUs

      A thread that is handed to an idle CPU goes directly to that CPU
      without first being placed on the other queues.

      If there is insufficient work queued locally to keep the PAG's
      CPUs busy, threads are stolen first from the Global and then the
      Local Run Queues in a remote PAG.
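The queue handling described above can be sketched as an illustrative model. This is not kernel code; every name here (Pag, dispatch, steal_from) is invented for the sketch, which only mirrors the documented order: an idle CPU receives a thread directly, otherwise the thread is queued, and a starved PAG steals first from a remote PAG's Global Run Queue and then from its Local Run Queues.

```python
from collections import deque

class Pag:
    """Toy Processor Affinity Group: idle-CPU queue, one Global Run
    Queue, and one Local Run Queue per CPU (names invented)."""
    def __init__(self, ncpus):
        self.idle_cpus = deque(range(ncpus))             # idle CPUs
        self.global_q = deque()                          # any CPU in the PAG
        self.local_qs = [deque() for _ in range(ncpus)]  # per-CPU queues

    def dispatch(self, thread):
        # A thread handed to an idle CPU goes directly to that CPU,
        # bypassing the run queues.
        if self.idle_cpus:
            return ("idle_cpu", self.idle_cpus.popleft())
        self.global_q.append(thread)
        return ("queued_global", None)

def steal_from(remote):
    """Steal remote work: Global Run Queue first, then Local Run Queues."""
    if remote.global_q:
        return remote.global_q.popleft()
    for q in remote.local_qs:
        if q:
            return q.popleft()
    return None

pag = Pag(2)
print(pag.dispatch("t1"))  # -> ('idle_cpu', 0)
print(pag.dispatch("t2"))  # -> ('idle_cpu', 1)
print(pag.dispatch("t3"))  # -> ('queued_global', None)
```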
      For each of these categories, statistics are grouped into hot,
      warm, and cold subcategories. The hot statistics show context
      switches to threads that last ran on the CPU only a very short
      time before. The warm statistics show context switches to threads
      that last ran on the CPU a somewhat longer time before. The cold
      statistics indicate context switches to threads that never ran on
      the CPU before. These statistics are a measure of how well cache
      affinity is being maintained; that is, how likely it is that the
      data used by threads when they last ran is still in the cache when
      the threads are rescheduled. You cannot evaluate this information
      without knowledge of the type of work being done on the system;
      maintenance of cache affinity can be very important on systems (or
      processor sets) that are dedicated to running certain applications
      (such as those doing high-performance technical computing) but is
      less critical for systems serving a variety of applications and
      users.

  -u  Prints processor-usage statistics for each CPU. For example:
        Processor Usage

        cpu   user  nice  system   idle  widle |  scalls      intr       csw  tbsync
        -----+--------------------------------+----------------------------------------
          0 |  0.0   0.0    0.7    99.2   0.1 | 3327337  50861486  41885424  317108
          1 |  0.0   0.0    0.4    99.5   0.1 | 3514438         0  36710149  268667
          2 |  0.0   0.0    0.4    99.5   0.1 | 3182064         0  37384120  257749
          3 |  0.0   0.0    0.4    99.5   0.1 | 3528519         0  36468319  249492
          6 |  0.0   0.0    0.1    99.9   0.0 |  668892     11664  11793053  352294
          7 |  0.0   0.0    0.1    99.9   0.0 |  772821         0   9341527  352319
          8 |  0.0   0.0    0.0   100.0   0.0 |  529050     11724   5717059  347267
          9 |  0.0   0.0    0.0   100.0   0.0 |  492386         0   6603681  351509
          .
          .
          .
      In this table:

      cpu      The number identifier of the CPU.
      user     The percentage of time slices spent running threads in
               user context.
      nice     The percentage of time slices in which lower-priority
               threads were scheduled. These are user-context threads
               whose priority was explicitly lowered by using an
               interface such as the nice command or the class-
               scheduling software.
      system   The percentage of time slices spent running threads in
               system context. This work includes servicing of
               interrupts and system calls that are made on behalf of
               user processes. An unusually high percentage in the
               system category might indicate a system bottleneck.
               Running kprofile and lockinfo provides more specific
               information about where system time is being spent. See
               uprofile(1) and lockinfo(8), respectively, for
               information about these utilities.
      idle     The percentage of time slices in which no threads were
               scheduled.
      widle    The percentage of time slices in which available threads
               were blocked by pending I/O and the CPU was idle. If
               this count is unusually high, it suggests that a
               bottleneck in an I/O channel might be causing suboptimal
               performance.
      scalls   The count of system calls that were serviced.
      intr     The count of interrupts that were serviced.
      csw      The count of thread context switches (thread-scheduling
               changes) that completed.
      tbsync   The number of times that the translation buffer was
               synchronized.
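A monitoring script might scan these per-CPU figures for the warning signs described above (high system time, high widle). The sketch below does that; the threshold values are arbitrary placeholders for illustration, not values this manual recommends:

```python
def flag(cpu, user, nice, system, idle, widle,
         system_high=30.0, widle_high=20.0):
    """Return warning strings for one Processor Usage row.
    Threshold defaults are arbitrary placeholders, not documented values."""
    warnings = []
    if system > system_high:   # possible kernel bottleneck; see kprofile/lockinfo
        warnings.append(f"cpu {cpu}: high system time ({system}%)")
    if widle > widle_high:     # possible I/O-channel bottleneck
        warnings.append(f"cpu {cpu}: high I/O wait ({widle}%)")
    return warnings

# The example rows above are nearly all idle, so nothing is flagged:
print(flag(0, 0.0, 0.0, 0.7, 99.2, 0.1))  # -> []
```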
OPERANDS
  command    The command to be executed by sched_stat.
  cmd_arg    Any arguments to the preceding command.

  The command and cmd_arg operands are used to limit the length of time
  in which sched_stat gathers statistics. Typically, sleep is specified
  for command and some number of seconds is specified for cmd_arg.

  If you do not specify a command to define a time interval for
  statistics gathering, the statistics reflect what has occurred since
  the system was last booted.
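Running sched_stat around a sleep amounts to differencing cumulative counters over an interval. A generic sketch of that sampling pattern follows; the counter source here is invented for illustration and stands in for whatever statistics sched_stat actually reads:

```python
import time

def read_counters():
    """Placeholder counter source, invented for illustration; stands in
    for the cumulative since-boot statistics that sched_stat reads."""
    return {"csw": int(time.monotonic() * 1000)}  # fake, monotonically growing

def sample_over(seconds):
    """Counter deltas over an interval, like 'sched_stat sleep N':
    snapshot, wait, snapshot again, report the difference."""
    before = read_counters()
    time.sleep(seconds)
    after = read_counters()
    return {k: after[k] - before[k] for k in before}

delta = sample_over(0.1)
print(delta["csw"] >= 0)  # deltas over the interval, not since-boot totals -> True
```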
DESCRIPTION
The sched_stat utility helps you determine how well the system load is
distributed among CPUs, what kinds of jobs are getting (or not getting)
sufficient cycles on each CPU, and how well cache affinity is being
maintained for these jobs.
  Answers to the following questions influence how a process and its
  threads are scheduled:

  Is the request to be serviced multiprocessor-safe?

      If not, the kernel funnels the request to the master CPU. The
      master CPU must reside in the default processor set (which
      contains all system CPUs if none were assigned to user-defined
      processor sets) and is typically CPU 0; however, some platforms
      permit CPUs other than CPU 0 to be the master CPU. Few requests
      generated by software distributed with the operating system need
      to be funneled to the master CPU, and most of these are associated
      with certain device drivers. However, if the system runs many
      third-party drivers, the number of requests that must be funneled
      to the master CPU might be higher.

  What is the job priority?
Job priority influences how frequently a thread is scheduled.
Realtime requests and interrupts have higher priority than time-
share jobs, which include the majority of user-mode threads. So,
if a significant number of CPU cycles are spent servicing real‐
time requests and interrupts, there are fewer cycles available
for time-share jobs.
Default priority for time-share jobs can also be changed by
using the nice command, the runclass command, or through class-
scheduling software. On a busy system, cache affinity is less
likely to be maintained for a thread from a time-share job whose
priority was lowered because more time is likely to elapse
between rescheduling operations for each thread. Conversely,
cache affinity is more likely to be maintained for threads of a
higher-priority time-share job because less time elapses between
rescheduling operations. Note that the scheduler always priori‐
tizes the need for low response latency (as demanded by inter‐
rupts and real-time requests) higher than maintenance of cache
affinity, regardless of the priority assigned to a time-share
      job.

  Are there user-defined restrictions that limit where a process may
  run?
If so, the kernel must schedule all threads of that process on
CPUs in the restricted set. In some cases, user-defined restric‐
tions are explicit RAD or CPU bindings specified either in an
application or by a command (such as runon) that was used to
launch the program or reassign one of its threads.
The set of CPUs where the kernel can schedule a thread is also
influenced by the presence of user-defined processor sets. If
the process was not explicitly started in or reassigned to a
user-defined processor set, the kernel must run it and all of
      its threads only on CPUs in the default processor set.

  Are any CPUs idle?
The scheduler is very aggressive in its attempts to steal jobs
from other CPUs to run on an idle CPU. This means that the
scheduler will migrate processes or threads across RAD bound‐
aries to give an idle CPU work to do unless one of the preceding
restrictions is in place to prevent that. For example, the
scheduler does not cross processor set boundaries when stealing
work from another CPU, even when a CPU is idle. In general,
keeping CPUs busy with work has higher priority than maintaining
memory or cache affinity during load-balancing operations.
Explicit memory-allocation advice provided in application code influ‐
ences scheduling only to the extent that the preceding factors do not
override that advice. However, explicit memory-allocation advice does
make a difference (and thereby can improve performance) when CPUs in
the processor set where the program is running are kept busy but are
not overloaded.
  To gather statistics with sched_stat, you typically follow these
  steps:

  1.  Start up a system workload and wait for it to get to a steady
      state.

  2.  Start sched_stat with sleep as the specified command and some
      number of seconds as the specified cmd_arg. This causes
      sched_stat to gather statistics for the length of time it takes
      the sleep command to execute. For example, the following command
      causes sched_stat to collect statistics for 60 seconds and then
      print a report:

        # /usr/sbin/sched_stat sleep 60
If you include options on the command line, only statistics for the
specified options are reported.
If you specify the command without any options, all options except for
-R are assumed. (See the descriptions of the -f, -l, -s, and -u options
in the OPTIONS section.)
NOTES
Running the sched_stat command has minimal impact on system perfor‐
mance.
RESTRICTIONS
The sched_stat utility is subject to change, without advance notice,
from one release to another. The utility is intended mainly for use by
other software applications included in the operating system product,
kernel developers, and software support representatives. Therefore,
sched_stat should be used only interactively; any customer scripts or
programs written to depend on its output data or display format might
be broken by changes in future versions of the utility or by patches
that might be applied to it.
EXIT STATUS
  0     Success.
  >0    An error occurred.
FILES
The pseudo driver that is opened by the sched_stat utility for RAD-
related statistics gathering.
SEE ALSO
Commands: iostat(1), netstat(1), nice(1), renice(1), runclass(1),
runon(1), uprofile(1), vmstat(1), advfsstat(8), collect(8), lock‐
info(8), nfsstat(8), sys_check(8)
Others: numa_intro(3), class_scheduling(4), processor_sets(4)