NOTE: lperfex version 2.0 is now available. This is a MAJOR update and requires a different kernel patch than previous versions. See the archives for older versions.
As part of OSC's work with Linux/IA32-based clusters, I have developed a small utility, lperfex, to access the Intel P6's hardware performance counters (also known as "MSRs" or "Model Specific Registers") to measure performance characteristics of other programs. This utility is minimally intrusive and does not require recompilation of the measured program. It functions similarly to the Cray UNICOS hpm and the SGI IRIX perfex utilities. It is my hope that those interested in code performance tuning under Linux (especially for scientific and "supercomputing" type applications) will find this useful.
Using lperfex is straightforward. It is run on another program, much like the time command:
lperfex -e P6_FLOPS -y ls -al
Here is an example of the output the above command would generate:
    ls -al: 0.04s CPU, 0.05s wallclock (80.00% CPU utilization) on oscbw01 (4x Intel Pentium III 550.2 MHz)
    TSC 22797566
    PMC[0] (P6_FLOPS) 0
    Statistics (averaged across threads):
    ---------------------------------------
    MFLOPS 0.000000
Here's another example of an lperfex run that uses two counters:
lperfex -e P6_FLOPS -e 13 -y ls -al

    ls -al: 0.04s CPU, 0.04s wallclock (100.00% CPU utilization) on oscbw01 (4x Intel Pentium III 550.2 MHz)
    TSC 21325293
    PMC[0] (P6_FLOPS) 0
    PMC[1] (P6_L2_LINES_IN) 9795
    Statistics (averaged across threads):
    ---------------------------------------
    MFLOPS 0.000000
    Main memory -> L2 cache bandwidth 7.836000 MB/s
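The bandwidth figure in this report can be reproduced by hand from the raw counter value, which is a handy sanity check on a derived metric. The sketch below is not taken from the lperfex source: it assumes a 32-byte cache line on the P6 core, 1 MB = 10^6 bytes, and the 0.04 s run time shown in the report header. The function name is made up for illustration.

```c
#include <assert.h>

/* Assumption: P6-family L2 cache lines are 32 bytes, and 1 MB = 1e6 bytes. */
#define P6_CACHE_LINE_BYTES 32.0

/*
 * Hypothetical helper (not from the lperfex source): convert a count of
 * cache lines brought into L2 into a bandwidth in MB/s.
 */
double l2_fill_bandwidth_mbs(unsigned long long lines_in, double seconds)
{
    assert(seconds >= 0.0);
    if (seconds == 0.0)
        return 0.0;   /* avoid dividing by zero on very short runs */
    return (double)lines_in * P6_CACHE_LINE_BYTES / seconds / 1.0e6;
}
```

With the values from the run above, l2_fill_bandwidth_mbs(9795, 0.04) works out to 9795 x 32 / 0.04 / 10^6, or about 7.836 MB/s, which agrees with the reported figure.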
Here are all of the command line options to lperfex that do something useful:
-e event
Count the given event (e.g. P6_FLOPS). This can be
specified up to twice, although some events need to be on the first
counter and others need to be on the second.
-h
-l
-o outfile
-y
The events which can be counted by lperfex depend on what
type of CPU it's run on; run lperfex -l to get a listing
of valid events for a given machine. Note that in lperfex
version 2.0 and later, you can specify either an event number or a
symbolic event name.
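As a rough illustration of accepting either form, the sketch below tries to read the argument as a number first and falls back to a name lookup. The function name and the one-entry table are hypothetical and not from the lperfex source. The only mapping included is P6_L2_LINES_IN = 13, taken from the sample output earlier on this page; the real list for a given CPU comes from lperfex -l.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical name table; only this one pair is taken from the sample
 * output on this page. The full list is whatever "lperfex -l" prints. */
struct event_name {
    const char *name;
    int number;
};

static const struct event_name known_events[] = {
    { "P6_L2_LINES_IN", 13 },
};

/* Returns the event number for a numeric or symbolic spec, or -1 if the
 * spec is neither a non-negative integer nor a known name. */
int parse_event_spec(const char *spec)
{
    char *end;
    long value = strtol(spec, &end, 10);
    size_t i;

    /* Entire string parsed as a non-negative number: use it directly. */
    if (end != spec && *end == '\0' && value >= 0)
        return (int)value;

    /* Otherwise treat the spec as a symbolic event name. */
    for (i = 0; i < sizeof(known_events) / sizeof(known_events[0]); i++) {
        if (strcmp(spec, known_events[i].name) == 0)
            return known_events[i].number;
    }
    return -1;
}
```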
If the -y option is specified on the command line,
lperfex will print a report of whatever performance metrics
it knows how to derive from the events you've counted. Right now,
this consists of megaflops (millions of floating-point operations per
second), some numbers about L1 and L2 cache usage, and fractions of
total cycles spent waiting on stalls and caches. A table of some of
the possible analyses on a P6-based CPU is shown below. Suggestions for improvements in this
area are welcomed, especially if you can tell me how to compute the
metric you want.
| Performance Metric | Event 0 | Event 1 |
|---|---|---|
| MIPS | 50 or 52 | N/A |
| MFLOPS | 41 or 42 | N/A |
| Average instructions per FP op | 41 or 42 | 50 or 52 |
| Average unhalted cycles per FP op | 41 or 42 | 67 |
| Fraction of FP ops that are multiplies | 41 or 42 | 44 |
| Fraction of FP ops that are divides | 41 or 42 | 45 |
| Average cycles per FP divide | 46 | 45 |
| Memory bandwidth | 13 | 14 |
| L2 to L1 bandwidth | 1 | 3 |
| L2 cache hit rate | 13 | 18 |
| L1 data cache hit rate | 0 | 1 |
| Fraction of cycles unhalted | 67 | N/A |
| Fraction of cycles waiting on resource stalls | 64 | N/A |
| Fraction of cycles waiting on L1 data cache | 4 | N/A |
| Fraction of cycles waiting on L2 data bus | 19 | N/A |
| Fraction of cycles waiting on L2 data transfer | 20 | N/A |
| Fraction of cycles waiting on bus snoop stalls | 40 | N/A |
Most events can be used in combination with other events, except in a few specific cases where both events need to occupy the same counter "slot" (e.g. both events need to be event 0). lperfex will compute metrics for all counters and combinations of counters for which it has metrics defined; for instance, lperfex -e 41 -e 50 -y ./a.out will report MIPS, MFLOPS, and instructions per FP op.
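For readers who want to see how such metrics fall out of the raw counts, here is a minimal sketch of the arithmetic for the -e 41 -e 50 case. It assumes, as the table suggests, that event 41 counts floating-point operations and event 50 counts instructions, and that rates are taken over the measured run time in seconds. The struct and function names are illustrative only and are not from the lperfex source.

```c
/* Hypothetical container for the three metrics mentioned above. */
struct derived_metrics {
    double mips;            /* millions of instructions per second */
    double mflops;          /* millions of FP operations per second */
    double instr_per_flop;  /* 0.0 when no FP operations were counted */
};

/*
 * Sketch only: fp_ops is assumed to be the count for event 41 and
 * instructions the count for event 50; seconds is the run time.
 */
struct derived_metrics derive_metrics(unsigned long long fp_ops,
                                      unsigned long long instructions,
                                      double seconds)
{
    struct derived_metrics m = { 0.0, 0.0, 0.0 };

    if (seconds > 0.0) {
        m.mips = (double)instructions / seconds / 1.0e6;
        m.mflops = (double)fp_ops / seconds / 1.0e6;
    }
    if (fp_ops > 0)
        m.instr_per_flop = (double)instructions / (double)fp_ops;

    return m;
}
```

A run such as ls -al, which performs no floating-point work, gives an MFLOPS value of 0 under this arithmetic, consistent with the sample reports earlier on this page.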
My employer, the Ohio Supercomputer Center, has graciously permitted me to release this code under the GNU General Public License. This software comes with ABSOLUTELY NO WARRANTY of any kind, so if it eats your dissertation or wrecks your marriage, it's not our problem. Copyright of this code is retained by the Ohio Supercomputer Center.
The C code which implements lperfex is available at http://www.osc.edu/~troy/lperfex/lperfex.tar.gz. It is also available via OSC's public CVS server, under the project name "lperfex"; see Pete's instructions for accessing it. lperfex relies on a Linux kernel patch developed by Mikael Pettersson, which is available from http://www.csd.uu.se/~mikpe/linux/perfctr/. The perfctr counter patch available from the PerfAPI project may also work, although I have not tested this recently.
Go to http://email.osc.edu/mailman/listinfo/lperfex to subscribe yourself to the lperfex mailing list or browse the archives.
Probably quite a few. Here are some of the more glaring ones I know about:
The getrusage()
system call does not assign a value to the maxrss
element of the structure it returns. There are other programs that
can correctly report memory usage (for instance, the resource
limiting code in the PBS batch queuing system), but I'm not sure if
they use some system call other than getrusage.
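For anyone who wants to check this on their own kernel, the sketch below shows one way a wrapper could ask for a child's peak resident set size through getrusage(). It is not the lperfex code, and the function name is invented for the example. Whether ru_maxrss is actually filled in depends on the kernel; where it is not, the field simply reads as 0.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with the given arguments, wait for it,
 * and return the ru_maxrss value reported for waited-for children
 * (kilobytes on Linux kernels that fill it in). Returns -1 on failure.
 */
long child_maxrss_kb(char *const argv[])
{
    struct rusage ru;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);   /* only reached if exec failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;

    /* RUSAGE_CHILDREN covers all children this process has waited for. */
    if (getrusage(RUSAGE_CHILDREN, &ru) < 0)
        return -1;

    return ru.ru_maxrss;
}
```

A result of 0 for a non-trivial program would suggest the kernel in use leaves the field unset, which is the behaviour described above.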
Please email these to troy@osc.edu.
Last updated 23 January 2002.