As part of OSC's work with Linux/IA32-based clusters, I have developed a small utility, lperfex, to access the Intel P6's hardware performance counters (also known as "MSRs" or "Model Specific Registers") to measure performance characteristics of other programs. This utility is minimally intrusive and does not require recompilation of the measured program. It functions similarly to the Cray UNICOS hpm and the SGI IRIX perfex utilities. It is my hope that those interested in code performance tuning under Linux (especially for scientific and "supercomputing" type applications) will find this useful.
Using lperfex is straightforward. It is run on another program, much like the time command:
lperfex -e P6_FLOPS -y ls -al
Here is an example of the output the above command would generate:
ls -al: 0.04s CPU, 0.05s wallclock (80.00% CPU utilization) on oscbw01 (4x Intel Pentium III 550.2 MHz)
TSC 22797566
PMC (P6_FLOPS) 0
Statistics (averaged across threads):
---------------------------------------
MFLOPS 0.000000
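As a rough sanity check on the TSC line, the cycle count can be converted to seconds using the clock rate from the report header. This is only an illustrative sketch in Python; the function name is made up here, and it assumes the TSC ticks at the reported 550.2 MHz for the whole run.

```python
# Values copied from the sample report above.
TSC_CYCLES = 22797566   # "TSC" line
CLOCK_HZ = 550.2e6      # "550.2 MHz" from the report header


def cycles_to_seconds(cycles, clock_hz):
    """Convert a cycle count to seconds at a fixed clock rate."""
    return cycles / clock_hz


elapsed = cycles_to_seconds(TSC_CYCLES, CLOCK_HZ)
print(f"{elapsed:.4f} s")  # prints "0.0414 s"
```

The result is in the same range as the 0.04s CPU and 0.05s wallclock figures, which are only reported to two decimal places.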
Here's another example of an lperfex run that uses two counters:
lperfex -e P6_FLOPS -e 13 -y ls -al
ls -al: 0.04s CPU, 0.04s wallclock (100.00% CPU utilization) on oscbw01 (4x Intel Pentium III 550.2 MHz)
TSC 21325293
PMC (P6_FLOPS) 0
PMC (P6_L2_LINES_IN) 9795
Statistics (averaged across threads):
---------------------------------------
MFLOPS 0.000000
Main memory -> L2 cache bandwidth 7.836000 MB/s
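The bandwidth figure in this report can be reproduced by hand. The Python sketch below assumes a 32-byte L2 cache line (the P6 line size) and that MB means 10^6 bytes; neither is stated in the lperfex output, and the function name is made up for illustration. CPU and wallclock time are both 0.04s in this run, so the sample does not show which of the two lperfex divides by.

```python
# Values copied from the two-counter sample report above.
L2_LINES_IN = 9795   # "PMC (P6_L2_LINES_IN)" line
RUN_SECONDS = 0.04   # CPU and wallclock time are both 0.04s in this run
LINE_BYTES = 32      # assumption: P6 L2 cache line size in bytes


def bandwidth_mb_per_s(lines, seconds, line_bytes=LINE_BYTES):
    """Bytes brought into L2 per second, in units of 10**6 bytes."""
    return lines * line_bytes / seconds / 1e6


print(f"{bandwidth_mb_per_s(L2_LINES_IN, RUN_SECONDS):.6f} MB/s")
# prints "7.836000 MB/s", matching the report
```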
Command Line Options
Here are all of the command line options to lperfex that do something useful:
- -e event: Count event event, which can be either a numeric identifier (41) or a symbol (P6_FLOPS). This option can be specified up to twice, although some events must be assigned to the first counter and others to the second.
- Prints a list of valid options.
- -l: Prints a list of valid event numbers and symbolic names.
- Write any output to outfile rather than to standard output.
- -y: Generate a report with whatever metrics can be derived from the selected counter events. This is currently pretty limited if you're interested in anything besides MFLOPS and some directional memory bandwidth statistics.
The events which can be counted by lperfex depend on what type of CPU it's run on; run lperfex -l to get a listing of valid events for a given machine. Note that in lperfex version 2.0 and later, you can specify either an event number or a symbolic event name.
If the -y option is specified on the command line, lperfex will print a report of whatever performance metrics it knows how to derive from the events you've counted. Right now, this consists of megaflops (millions of floating-point operations per second), some numbers about L1 and L2 cache usage, and fractions of total cycles spent waiting on stalls and caches. A table of some of the possible analyses on a P6-based CPU is shown below. Suggestions for improvements in this area are welcomed, especially if you can tell me how to compute the metric you want.
| Performance Metric | Event 0 | Event 1 |
|---|---|---|
| MIPS | 50 or 52 | N/A |
| MFLOPS | 41 or 42 | N/A |
| Average instructions per FP op | 41 or 42 | 50 or 52 |
| Average unhalted cycles per FP op | 41 or 42 | 67 |
| Fraction of FP ops that are multiplies | 41 or 42 | 44 |
| Fraction of FP ops that are divides | 41 or 42 | 45 |
| Average cycles per FP divide | 46 | 45 |
| L2 to L1 bandwidth | 1 | 3 |
| L2 cache hit rate | 13 | 18 |
| L1 data cache hit rate | 0 | 1 |
| Fraction of cycles unhalted | 67 | N/A |
| Fraction of cycles waiting on resource stalls | 64 | N/A |
| Fraction of cycles waiting on L1 data cache | 4 | N/A |
| Fraction of cycles waiting on L2 data bus | 19 | N/A |
| Fraction of cycles waiting on L2 data transfer | 20 | N/A |
| Fraction of cycles waiting on bus snoop stalls | 40 | N/A |
Most events can be used in combination with other events, except in a few specific cases where both events must occupy the same counter "slot" (e.g. both events need to be event 0). lperfex will compute metrics for all counters and combinations of counters for which it has metrics defined; for instance, lperfex -e 41 -e 50 -y ./a.out will report MIPS, MFLOPS, and instructions per FP op.
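To make the combination case concrete, here is a minimal Python sketch of how the three metrics named above could be derived from a FLOP count, an instruction count, and a run time. The function name, the formulas, and the sample counts are all hypothetical illustrations rather than lperfex's actual code. It assumes event 41 counts floating-point operations and event 50 counts retired instructions.

```python
def derived_metrics(flops, instructions, seconds):
    """Derive rate and ratio metrics from two raw counter values.

    flops        -- count from the FP-op event (assumed: event 41)
    instructions -- count from the instruction event (assumed: event 50)
    seconds      -- run time used as the denominator for the rates
    """
    metrics = {
        "MFLOPS": flops / seconds / 1e6,
        "MIPS": instructions / seconds / 1e6,
    }
    # A run with no FP ops (such as the ls examples) has no meaningful
    # ratio, so the metric is simply omitted instead of dividing by zero.
    if flops > 0:
        metrics["Average instructions per FP op"] = instructions / flops
    return metrics


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
for name, value in derived_metrics(2_000_000, 10_000_000, 0.5).items():
    print(f"{name} {value:.6f}")
```

With these made-up inputs the sketch reports 4 MFLOPS, 20 MIPS, and 5 instructions per FP op.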
My employer, the Ohio Supercomputer Center, has graciously permitted me to release this code under the GNU General Public License. This software comes with ABSOLUTELY NO WARRANTY of any kind, so if it eats your dissertation or wrecks your marriage, it's not our problem. Copyright of this code is retained by the Ohio Supercomputer Center.
The C code which implements lperfex is available at http://www.osc.edu/~troy/lperfex/lperfex.tar.gz. It is also available via OSC's public CVS server, under the project name "lperfex"; see Pete's instructions for accessing it. lperfex relies on a Linux kernel patch developed by Mikael Pettersson, which is available from http://www.csd.uu.se/~mikpe/linux/perfctr/. The perfctr counter patch available from the PerfAPI project may also work, although I have not tested this recently.
Go to http://email.osc.edu/mailman/listinfo/lperfex to subscribe yourself to the lperfex mailing list or browse the archives.
Bugs and Missing Features
Probably quite a few. Here are some of the more glaring ones I know about:
- Some of the statistics are computed in rather simple-minded ways, especially the memory bandwidth statistics. I've improved this to some extent in version 0.3, but the statistics are still more limited than I'd like. I have only implemented statistics for the P6 core. Suggestions and/or code for statistics on other CPUs are welcomed.
- Counter multiplexing (the equivalent to perfex -a on IRIX, where all possible events are counted in a time-slicing fashion and then extrapolated over the total time of the run) doesn't work. In fact, Erik tells me it's not possible with the current driver implementation. The IRIX perfex implementation does this with a special system call that multiplexes the counters inside the scheduler, according to some SGI folks I've talked to. Unfortunately, I would be very surprised if Linus would let this kind of extra complexity into the mainstream kernel.
- Many of the other arguments accepted by IRIX perfex currently cause lperfex to exit with a "Feature not yet implemented" message. I may remove some of these, as it's not clear to me how many of them are really useful. Suggestions here are welcomed as well.
- The program always reports that 0.0 MB of memory was used. This appears to be a Linux kernel bug on IA32; the getrusage() system call does not assign a value to the maxrss element of the structure it returns. There are other programs that can correctly report memory usage (for instance, the resource limiting code in the PBS batch queuing system), but I'm not sure if they use some system call other than getrusage().
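The getrusage() behavior described in the last item can be checked on a given kernel with a short script. This sketch uses Python's standard resource module, which wraps getrusage(); the helper name is made up for illustration. On Linux the field is called ru_maxrss and is reported in kilobytes, and a kernel that does not fill it in will return 0.

```python
import resource
import subprocess
import sys


def child_max_rss(argv):
    """Run argv to completion, then return ru_maxrss for waited-for children."""
    subprocess.run(argv, check=False, stdout=subprocess.DEVNULL)
    # RUSAGE_CHILDREN covers child processes that have terminated and been
    # waited for, which subprocess.run() has done by the time it returns.
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # A trivial child process; substitute any command of interest.
    print(child_max_rss([sys.executable, "-c", "pass"]))
```

A nonzero result means the kernel populates the field; a result of 0 reproduces the symptom described above.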
Bug Reports, Patches, Suggestions, Etc.
Please email these to email@example.com.
Last updated 23 January 2002.