lperfex: a Hardware Performance Monitor for Linux/IA32 Systems

NOTE: Version 1.0 is now available! Versions 0.2, 0.3, and 0.5 are also available for historical purposes.

Introduction

As part of OSC's work with Linux/IA32-based clusters, I have developed a small utility, lperfex, to access the Intel P6's hardware performance counters (also known as "MSRs" or "Model Specific Registers") to measure performance characteristics of other programs. This utility is minimally intrusive and does not require recompilation of the measured program. It functions similarly to the Cray UNICOS hpm and the SGI IRIX perfex utilities. It is my hope that those interested in code performance tuning under Linux (especially for scientific and "supercomputing" type applications) will find this useful.

Usage

Using lperfex is pretty straightforward. It is run on another program, much like the time command:

lperfex -e 41 -y ./a.out

Here is an example of the output the above command would generate:

0.220000 seconds of CPU time elapsed and 0.000000 MB of memory on oscbw01.osc.edu

Event #                 Event                                                                   Events Counted
-------                 -----                                                                   --------------
   41     Floating point operations retired (counter 0 only)                                            21790463

Statistics:
-----------
MFLOPS                       99.047560
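
The MFLOPS figure in this report appears to be the event 41 count divided by the CPU time, scaled to millions. The following is a minimal sketch of that arithmetic in Python, not code taken from lperfex; the function name mflops is made up for illustration, and the formula is inferred from the numbers in the sample report.

```python
def mflops(fp_ops, cpu_seconds):
    """Millions of floating-point operations per second of CPU time.

    Assumption: this mirrors how the sample report's MFLOPS value relates
    to its event count and CPU time; it is not lifted from lperfex.c.
    """
    if cpu_seconds <= 0:
        raise ValueError("cpu_seconds must be positive")
    return fp_ops / cpu_seconds / 1e6


if __name__ == "__main__":
    # Event 41 count and CPU time copied from the sample report above.
    print(f"{mflops(21790463, 0.220000):.6f}")
```

With the figures from the sample run this comes out at about 99.0476, which agrees with the reported 99.047560 to roughly five decimal places. The small residual is presumably due to the CPU time being printed in rounded form.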

Here's another example of an lperfex run that uses two counters:

lperfex -e 41 -e 13 -y ./a.out

0.220000 seconds of CPU time elapsed and 0.000000 MB of memory on oscbw01.osc.edu

Event #                 Event                                                                   Events Counted
-------                 -----                                                                   --------------
   41     Floating point operations retired (counter 0 only)                                          21790434
   13     L2 cache lines loaded                                                                           2503

Statistics:
-----------
MFLOPS                       99.047428
main memory->L2 bandwidth             0.364073 MB/s
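
The bandwidth figure in this report is consistent with multiplying the event 13 count by a 32-byte cache line and dividing by the CPU time, with 1 MB taken as 10^6 bytes. Both the line size and the MB convention are assumptions inferred from the sample numbers rather than read from the lperfex source, and the function name below is hypothetical.

```python
P6_L2_LINE_BYTES = 32  # assumption: 32-byte cache lines on the P6 family


def l2_fill_bandwidth_mb_s(lines_loaded, cpu_seconds,
                           line_bytes=P6_L2_LINE_BYTES):
    """Main memory -> L2 bandwidth in MB/s, with 1 MB = 10**6 bytes.

    Assumption: each counted "L2 cache lines loaded" event moves one
    full cache line from main memory into L2.
    """
    if cpu_seconds <= 0:
        raise ValueError("cpu_seconds must be positive")
    return lines_loaded * line_bytes / cpu_seconds / 1e6


if __name__ == "__main__":
    # Event 13 count and CPU time copied from the sample report above.
    print(f"{l2_fill_bandwidth_mb_s(2503, 0.220000):.6f} MB/s")
```

Under those assumptions, 2503 lines in 0.22 seconds gives 0.364073 MB/s, matching the report.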

Command Line Options

Here are all of the command line options to lperfex that do something useful:

-e eventno
Count event number eventno. This can be specified up to twice, although some events must be assigned to the first counter and others must be assigned to the second.
-h
Prints a list of valid options and possible event numbers to standard output and exits.
-o outfile
Write any output to outfile rather than to standard output.
-y
Generate a report with whatever metrics can be derived from the selected counter events. This is currently pretty limited if you're interested in anything besides MFLOPS and some directional memory bandwidth statistics.
--
Ignore all following command line arguments. Use this if the program you are running lperfex to measure has command line options of its own (e.g. lperfex -e 41 -e 13 -y -- myprog -a -b -c).
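
For anyone driving lperfex from a script, the options above can be assembled mechanically. The sketch below is a hypothetical Python helper (build_lperfex_argv is not part of lperfex) that uses only the flags documented above. It always inserts -- before the measured program, on the assumption that this is harmless even when the program takes no options of its own.

```python
import shlex


def build_lperfex_argv(events, program, report=True, outfile=None):
    """Build an lperfex command line from the documented options.

    events  -- one or two event numbers (0..67), in counter order
    program -- the measured program and its own arguments, as a list
    """
    if len(events) > 2:
        raise ValueError("-e can be given at most twice")
    for event in events:
        if not 0 <= event <= 67:
            raise ValueError(f"unknown event number: {event}")
    argv = ["lperfex"]
    for event in events:
        argv += ["-e", str(event)]
    if outfile is not None:
        argv += ["-o", outfile]
    if report:
        argv.append("-y")
    # Everything after "--" belongs to the measured program.
    argv.append("--")
    argv += list(program)
    return argv


if __name__ == "__main__":
    argv = build_lperfex_argv([41, 13], ["myprog", "-a", "-b", "-c"])
    print(shlex.join(argv))
```

The resulting list can be passed directly to subprocess.run on a machine where lperfex and the kernel patch are installed.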

Countable Events

Here are all of the events that can be counted by lperfex:

Event Number Event
0 Memory references
1 L1 data cache lines loaded
2 L1 data cache lines loaded and modified
3 L1 data cache lines flushed
4 Weighted number of cycles spent waiting while an L1 data cache miss is outstanding
5 Instruction fetches
6 L1 instruction cache misses
7 ITLB misses
8 Cycles spent waiting for instruction fetches and ITLB misses
9 Cycles spent waiting on the instruction decoder
10 L2 cache instruction fetches
11 L2 cache data loads
12 L2 cache data stores
13 L2 cache lines loaded
14 L2 cache lines flushed
15 L2 cache lines loaded and modified
16 L2 cache lines modified and flushed
17 L2 cache requests
18 L2 cache address strobes
19 Cycles spent waiting on the L2 data bus
20 Cycles spent waiting on data transfer from L2 cache to processor
21 Cycles spent while DRDY is asserted
22 Cycles spent while LOCK is asserted
23 Bus requests outstanding
24 Burst read transactions
25 Read-for-ownership transactions
26 Write-back transactions
27 Instruction fetch transactions
28 Invalidate transactions
29 Partial-write transactions
30 Partial transactions
31 I/O transactions
32 Deferred transactions
33 Burst transactions
34 Total number of transactions
35 Memory transactions
36 Bus clock cycles spent while the processor is receiving data
37 Bus clock cycles spent while the processor is driving the BNR pin
38 Bus clock cycles spent while the processor is driving the HIT pin
39 Bus clock cycles spent while the processor is driving the HITM pin
40 Cycles spent while the bus is snoop-stalled
41 Floating point operations retired (counter 0 only)
42 Floating point operations executed (counter 0 only)
43 Floating point exceptions handled by microcode (counter 1 only)
44 Multiply operations (counter 1 only)
45 Divide operations (counter 1 only)
46 Cycles spent doing division (counter 0 only)
47 Store buffer blocks
48 Store buffer drain cycles
49 Misaligned memory references
50 Instructions retired
51 uOps retired
52 Instructions decoded
53 Hardware interrupts received
54 Cycles spent while interrupts are disabled
55 Cycles spent while interrupts are disabled and pending
56 Branch instructions retired
57 Mispredicted branches retired
58 Taken branches retired
59 Taken mispredicted branches retired
60 Branch instructions decoded
61 Branches which miss the BTB
62 Bogus branches
63 BACLEAR assertions
64 Cycles spent during resource related stalls
65 Cycles spent during partial stalls
66 Segment register loads
67 Cycles during which the processor is not halted
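
Several entries in the list above are tied to one counter: events 41, 42, and 46 are marked "counter 0 only", and events 43, 44, and 45 are marked "counter 1 only". The Python sketch below shows one way to check a pair of events against those restrictions and pick a workable order. The function name is hypothetical, and it is an assumption (not confirmed by this page) that lperfex assigns counters in the order the -e options are given.

```python
COUNTER0_ONLY = {41, 42, 46}  # marked "counter 0 only" in the event list
COUNTER1_ONLY = {43, 44, 45}  # marked "counter 1 only" in the event list


def order_events(a, b):
    """Return (counter0_event, counter1_event) for two event numbers.

    Raises ValueError when both events are restricted to the same counter.
    """
    # Swap if the pair is the wrong way round for its restrictions.
    if a in COUNTER1_ONLY or b in COUNTER0_ONLY:
        a, b = b, a
    # If a restriction is still violated, no ordering can work.
    if a in COUNTER1_ONLY or b in COUNTER0_ONLY:
        raise ValueError(f"events {a} and {b} need the same counter")
    return a, b


if __name__ == "__main__":
    print(order_events(44, 41))  # multiplies must follow FP ops retired
```

For example, the pair 44 and 41 is reordered so that 41 takes counter 0, while 41 and 42 together are rejected because both need counter 0.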

Reported Information

If the -y option is specified on the command line, lperfex will print a report of whatever performance metrics it knows how to derive from the events you've counted. Right now, this consists of megaflops (millions of floating-point operations per second), some numbers about L1 and L2 cache usage, and fractions of total cycles spent waiting on stalls and caches. A table of some of the possible analyses is shown below. Suggestions for improvements in this area are welcomed, especially if you can tell me how to compute the metric you want.

Performance Metric                                  Event 0     Event 1
------------------                                  -------     -------
MIPS                                                50 or 52    N/A
MFLOPS                                              41 or 42    N/A
Average instructions per FP op                      41 or 42    50 or 52
Average unhalted cycles per FP op                   41 or 42    67
Fraction of FP ops that are multiplies              41 or 42    44
Fraction of FP ops that are divides                 41 or 42    45
Average cycles per FP divide                        46          45
Memory bandwidth                                    13          14
L2 to L1 bandwidth                                  1           3
L2 cache hit rate                                   13          18
L1 data cache hit rate                              0           1
Fraction of cycles unhalted                         67          N/A
Fraction of cycles waiting on resource stalls       64          N/A
Fraction of cycles waiting on L1 data cache         4           N/A
Fraction of cycles waiting on L2 data bus           19          N/A
Fraction of cycles waiting on L2 data transfer      20          N/A
Fraction of cycles waiting on bus snoop stalls      40          N/A

Most events can be used in combination with other events, except in a few specific cases where both events must occupy the same counter "slot" (e.g. both events need to be event 0). lperfex will compute metrics for all counters and combinations of counters for which it has metrics defined; for instance, lperfex -e 41 -e 50 -y ./a.out will report MIPS, MFLOPS, and instructions per FP op.
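
The metric table can be read as a lookup from event numbers to reportable metrics. The Python sketch below encodes only the rows of that table; it is not how lperfex is implemented, and the names in it are hypothetical. It assumes that single-event metrics apply whichever counter the event is on (as the -e 41 -e 50 example suggests) and that two-event metrics need the events in the tabulated order. Since the table lists only some of the possible analyses, the result is not a complete model of the report.

```python
# Metrics needing one event, on either counter (rows with "N/A").
SINGLE = {
    "MIPS": {50, 52},
    "MFLOPS": {41, 42},
    "Fraction of cycles unhalted": {67},
    "Fraction of cycles waiting on resource stalls": {64},
    "Fraction of cycles waiting on L1 data cache": {4},
    "Fraction of cycles waiting on L2 data bus": {19},
    "Fraction of cycles waiting on L2 data transfer": {20},
    "Fraction of cycles waiting on bus snoop stalls": {40},
}

# Metrics needing two events: (allowed event 0, allowed event 1).
PAIRS = {
    "Average instructions per FP op": ({41, 42}, {50, 52}),
    "Average unhalted cycles per FP op": ({41, 42}, {67}),
    "Fraction of FP ops that are multiplies": ({41, 42}, {44}),
    "Fraction of FP ops that are divides": ({41, 42}, {45}),
    "Average cycles per FP divide": ({46}, {45}),
    "Memory bandwidth": ({13}, {14}),
    "L2 to L1 bandwidth": ({1}, {3}),
    "L2 cache hit rate": ({13}, {18}),
    "L1 data cache hit rate": ({0}, {1}),
}


def derivable_metrics(event0, event1=None):
    """List the tabulated metrics available for the chosen events."""
    events = [e for e in (event0, event1) if e is not None]
    names = [name for name, allowed in SINGLE.items()
             if any(e in allowed for e in events)]
    if event1 is not None:
        names += [name for name, (first, second) in PAIRS.items()
                  if event0 in first and event1 in second]
    return names


if __name__ == "__main__":
    for name in derivable_metrics(41, 50):
        print(name)
```

For events 41 and 50 this lists MIPS, MFLOPS, and average instructions per FP op, in line with the example in the paragraph above.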

Code

My employer, the Ohio Supercomputer Center, has graciously permitted me to release this code under the GNU General Public License. This software comes with ABSOLUTELY NO WARRANTY of any kind, so if it eats your dissertation or wrecks your marriage, it's not our problem. Copyright of this code is retained by the Ohio Supercomputer Center.

The C code which implements lperfex is available at http://www.osc.edu/~troy/lperfex/lperfex.c. It relies on a Linux kernel patch developed by Erik Hendriks of NASA Goddard Space Flight Center, which is available at http://www.beowulf.org/software/perf-0.7.tar.gz. Many thanks to Erik for developing this patch. (For those trying to apply the patch: the patch for the Linux 2.2.9 kernel should apply cleanly to later 2.2 kernels. I've used it on IA32 systems running Linux 2.2.12, .13, and .14.) The Linux/IA32 perf counter patch available from the PerfAPI project also works.

Bugs and Missing Features

Probably quite a few. Here are some of the more glaring ones I know about:

Bug Reports, Patches, Suggestions, Etc.

Please email these to troy@osc.edu.

Last updated 1 September 2000.