lperfex: a Hardware Performance Monitor for Linux/IA32 Systems

Introduction

As part of OSC's work with Linux/IA32-based clusters, I have developed a small utility, lperfex, that uses the Intel P6's hardware performance counters (accessed through the CPU's "MSRs", or "Model Specific Registers") to measure performance characteristics of other programs. This utility is minimally intrusive and does not require recompilation of the measured program. It functions similarly to the Cray UNICOS hpm and the SGI IRIX perfex utilities. It is my hope that those interested in code performance tuning under Linux (especially for scientific and "supercomputing" type applications) will find this useful.

Usage

Using lperfex is pretty straightforward. You run it with another program as its argument, much like the time command:

lperfex -e P6_FLOPS -y ls -al

 

Here is an example of the output the above command would generate:

ls -al:   0.04s CPU,  0.05s wallclock
(80.00% CPU utilization) on oscbw01 (4x Intel Pentium III 550.2 MHz)

TSC                             22797566
PMC[0] (P6_FLOPS)                      0

Statistics (averaged across threads):
---------------------------------------
MFLOPS                        0.000000
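
The header figures in this report can be cross-checked by hand. The sketch below is only an illustration in Python, not code from lperfex: the function names are made up, and I am assuming that CPU utilization is CPU time over wallclock time, that the TSC ticks at the reported clock rate, and that MFLOPS is the flop count divided by a time in seconds and by 10^6 (which time base lperfex actually uses is not stated here).

```python
def cpu_utilization_pct(cpu_s, wall_s):
    """CPU time as a percentage of wallclock time."""
    return 100.0 * cpu_s / wall_s


def tsc_seconds(tsc_ticks, clock_mhz):
    """Elapsed seconds implied by a TSC reading at the given clock rate."""
    return tsc_ticks / (clock_mhz * 1e6)


def mflops(flop_count, seconds):
    """Millions of floating-point operations per second (assumed definition)."""
    return flop_count / seconds / 1e6


# Values copied from the sample report above.
print(round(cpu_utilization_pct(0.04, 0.05), 2))  # percent
print(round(tsc_seconds(22797566, 550.2), 4))     # seconds
print(mflops(0, 0.04))                            # ls does no FP work
```

The TSC-derived time lands close to the 0.04s CPU figure, which is what you would expect for a short run that is almost entirely CPU-bound.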

 

Here's another example of an lperfex run that uses two counters:

lperfex -e P6_FLOPS -e 13 -y ls -al

ls -al:   0.04s CPU,  0.04s wallclock
(100.00% CPU utilization) on oscbw01 (4x Intel Pentium III 550.2 MHz)

TSC                             21325293
PMC[0] (P6_FLOPS)                      0
PMC[1] (P6_L2_LINES_IN)             9795

Statistics (averaged across threads):
---------------------------------------
MFLOPS                        0.000000
Main memory -> L2 cache bandwidth             7.836000 MB/s
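
The bandwidth figure above is consistent with a simple calculation: lines brought into L2, times the cache line size, divided by wallclock time. The sketch below is my reconstruction, not a copy of the lperfex source; the function name is hypothetical, and the 32-byte line size (the P6 family's cache line size) and the use of 10^6 bytes per MB are assumptions that happen to reproduce the number shown.

```python
P6_CACHE_LINE_BYTES = 32  # assumed: P6-family cache line size


def l2_fill_bandwidth_mb_s(lines_in, wall_s, line_bytes=P6_CACHE_LINE_BYTES):
    """Main memory -> L2 bandwidth in MB/s (1 MB taken as 10**6 bytes)."""
    return lines_in * line_bytes / wall_s / 1e6


# P6_L2_LINES_IN count and wallclock time from the sample report above.
print(f"{l2_fill_bandwidth_mb_s(9795, 0.04):.6f} MB/s")  # prints 7.836000 MB/s
```

This is the kind of "simple-minded" estimate that ignores anything other than whole-line fills, so treat it as a rough figure.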

 

Command Line Options

Here are all of the command line options to lperfex that do something useful:

-e event
Count event event, which can be either a numeric identifier (41) or a symbolic name (P6_FLOPS). This option can be given at most twice, although some events must be in the first counter and others must be in the second.
-h
Prints a list of valid options.
-l
Prints a list of valid event numbers and symbolic names.
-o outfile
Write any output to outfile rather than to standard output.
-y
Generate a report with whatever metrics can be derived from the selected counter events. This is currently pretty limited if you're interested in anything besides MFLOPS and some directional memory bandwidth statistics.
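
If you drive lperfex from a script, the options above can be assembled programmatically. The following Python sketch is a hypothetical helper (build_lperfex_argv is my own name, not part of lperfex); it only builds the argument list, uses nothing beyond the -e, -o and -y options described above, and places them in the same order as the usage examples. Whether lperfex cares about option order is something I have not checked.

```python
import shlex


def build_lperfex_argv(command, events=(), report=False, outfile=None):
    """Build an lperfex command line for the given target command."""
    if len(events) > 2:
        raise ValueError("-e can be given at most twice")
    argv = ["lperfex"]
    for event in events:
        argv += ["-e", str(event)]   # numeric id or symbolic name
    if outfile is not None:
        argv += ["-o", outfile]      # write output to a file
    if report:
        argv.append("-y")            # request the derived-metrics report
    return argv + list(command)


argv = build_lperfex_argv(["ls", "-al"], events=["P6_FLOPS", 13], report=True)
print(shlex.join(argv))  # prints: lperfex -e P6_FLOPS -e 13 -y ls -al
```

The resulting list can be passed directly to subprocess.run on a machine where lperfex and the perfctr kernel patch are installed.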

 

Countable Events

The events which can be counted by lperfex depend on what type of CPU it's run on; run lperfex -l to get a listing of valid events for a given machine. Note that in lperfex version 2.0 and later, you can specify either an event number or a symbolic event name.

Reported Information

If the -y option is specified on the command line, lperfex will print a report of whatever performance metrics it knows how to derive from the events you've counted. Right now, this consists of megaflops (millions of floating-point operations per second), some numbers about L1 and L2 cache usage, and fractions of total cycles spent waiting on stalls and caches. A table of some of the possible analyses on a P6-based CPU is shown below. Suggestions for improvements in this area are welcomed, especially if you can tell me how to compute the metric you want.

Performance Metric                              Event 0     Event 1
--------------------------------------------------------------------
MIPS                                            50 or 52    N/A
MFLOPS                                          41 or 42    N/A
Average instructions per FP op                  41 or 42    50 or 52
Average unhalted cycles per FP op               41 or 42    67
Fraction of FP ops that are multiplies          41 or 42    44
Fraction of FP ops that are divides             41 or 42    45
Average cycles per FP divide                    46          45
Memory bandwidth                                13          14
L2 to L1 bandwidth                              1           3
L2 cache hit rate                               13          18
L1 data cache hit rate                          0           1
Fraction of cycles unhalted                     67          N/A
Fraction of cycles waiting on resource stalls   64          N/A
Fraction of cycles waiting on L1 data cache     4           N/A
Fraction of cycles waiting on L2 data bus       19          N/A
Fraction of cycles waiting on L2 data transfer  20          N/A
Fraction of cycles waiting on bus snoop stalls  40          N/A

Most events can be used in combination with other events, except in a few specific cases where both events must occupy the same counter "slot" (e.g. both events need to be event 0). lperfex will compute metrics for all counters and combinations of counters for which it has metrics defined; for instance, lperfex -e 41 -e 50 -y ./a.out will report MIPS, MFLOPS, and instructions per FP op.
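
To see which metrics a given pair of events unlocks, the table can be treated as a lookup structure. The Python sketch below is an illustration only: it is not how lperfex stores its metrics, the names METRICS and reportable_metrics are mine, it covers only the rows of the table above, and it ignores the counter "slot" restrictions entirely.

```python
# (metric, events accepted in the first column, events accepted in the
# second column, or None when the metric needs only one event)
METRICS = [
    ("MIPS", {50, 52}, None),
    ("MFLOPS", {41, 42}, None),
    ("Average instructions per FP op", {41, 42}, {50, 52}),
    ("Average unhalted cycles per FP op", {41, 42}, {67}),
    ("Fraction of FP ops that are multiplies", {41, 42}, {44}),
    ("Fraction of FP ops that are divides", {41, 42}, {45}),
    ("Average cycles per FP divide", {46}, {45}),
    ("Memory bandwidth", {13}, {14}),
    ("L2 to L1 bandwidth", {1}, {3}),
    ("L2 cache hit rate", {13}, {18}),
    ("L1 data cache hit rate", {0}, {1}),
    ("Fraction of cycles unhalted", {67}, None),
    ("Fraction of cycles waiting on resource stalls", {64}, None),
    ("Fraction of cycles waiting on L1 data cache", {4}, None),
    ("Fraction of cycles waiting on L2 data bus", {19}, None),
    ("Fraction of cycles waiting on L2 data transfer", {20}, None),
    ("Fraction of cycles waiting on bus snoop stalls", {40}, None),
]


def reportable_metrics(selected_events):
    """Return the table rows whose required events are all selected."""
    chosen = set(selected_events)
    names = []
    for name, first, second in METRICS:
        if not chosen & first:
            continue                      # first-column event missing
        if second is not None and not chosen & second:
            continue                      # second-column event missing
        names.append(name)
    return names


print(reportable_metrics([41, 50]))
```

For events 41 and 50 this yields MIPS, MFLOPS, and average instructions per FP op, matching the example in the paragraph above.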

Code and CVS Access

My employer, the Ohio Supercomputer Center, has graciously permitted me to release this code under the GNU General Public License. This software comes with ABSOLUTELY NO WARRANTY of any kind, so if it eats your dissertation or wrecks your marriage, it's not our problem. Copyright of this code is retained by the Ohio Supercomputer Center.

The C code which implements lperfex is available at http://www.osc.edu/~troy/lperfex/lperfex.tar.gz. It is also available via OSC's public CVS server, under the project name "lperfex"; see Pete's instructions for accessing it. lperfex relies on a Linux kernel patch developed by Mikael Pettersson, which is available from http://www.csd.uu.se/~mikpe/linux/perfctr/. The perfctr counter patch available from the PerfAPI project may also work, although I have not tested this recently.

Mailing List

Go to http://email.osc.edu/mailman/listinfo/lperfex to subscribe yourself to the lperfex mailing list or browse the archives.

Bugs and Missing Features

Probably quite a few. Here are some of the more glaring ones I know about:

  • Some of the statistics are computed in rather simple-minded ways, especially the memory bandwidth statistics. I've improved this to some extent in version 0.3, but the statistics are still more limited than I'd like. I have only implemented statistics for the P6 core. Suggestions and/or code for statistics on other CPUs are welcomed.
  • Counter multiplexing (the equivalent to perfex -a on IRIX, where all possible events are counted in a time-slicing fashion and then extrapolated over the total time of the run) doesn't work. In fact, Erik tells me it's not possible with the current driver implementation. The IRIX perfex implementation does this with a special system call that multiplexes the counters inside the scheduler, according to some SGI folks I've talked to. Unfortunately, I would be very surprised if Linus would let this kind of extra complexity into the mainstream kernel.
  • Many of the other arguments accepted by IRIX perfex currently cause lperfex to exit with a "Feature not yet implemented" message. I may remove some of these, as it's not clear to me how many of them are really useful. Suggestions here are welcomed as well.
  • The program always reports that 0.0 MB of memory was used. This appears to be a Linux kernel bug on IA32; the getrusage() system call does not assign a value to the ru_maxrss field of the structure it returns. There are other programs that can correctly report memory usage (for instance, the resource limiting code in the PBS batch queuing system), but I'm not sure if they use some system call other than getrusage.
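
A quick way to check whether a given kernel fills in ru_maxrss is to run a child process and look at the value afterwards. The Python sketch below is a standalone test, not lperfex code, and child_peak_rss is a hypothetical name. On kernels with the behaviour described in the last item above it should print 0; the Linux getrusage man page lists ru_maxrss as maintained since Linux 2.6.32, so newer systems should print a non-zero value (in kilobytes on Linux).

```python
import resource
import subprocess
import sys


def child_peak_rss(argv):
    """Run argv to completion, then return ru_maxrss for waited-for children."""
    subprocess.run(argv, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Use a trivial child so the test is quick; substitute any command.
peak = child_peak_rss([sys.executable, "-c", "pass"])
print("ru_maxrss of children:", peak)
```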

 

Bug Reports, Patches, Suggestions, Etc.

Please email these to troy@osc.edu.

Last updated 23 January 2002.