NOTE: Version 1.0 is now available! Versions 0.2, 0.3, and 0.5 are also available for historical purposes.
As part of OSC's work with Linux/IA32-based clusters, I have developed a small utility, lperfex, to access the Intel P6's hardware performance counters (also known as "MSRs" or "Model Specific Registers") to measure performance characteristics of other programs. This utility is minimally intrusive and does not require recompilation of the measured program. It functions similarly to the Cray UNICOS hpm and the SGI IRIX perfex utilities. It is my hope that those interested in code performance tuning under Linux (especially for scientific and "supercomputing" type applications) will find this useful.
Using lperfex is pretty straightforward. It is run on another program, much like the time command:
lperfex -e 41 -y ./a.out
Here is an example of the output the above command would generate:

    0.220000 seconds of CPU time elapsed and 0.000000 MB of memory on oscbw01.osc.edu

    Event #  Event                                                Events Counted
    -------  -----                                                --------------
    41       Floating point operations retired (counter 0 only)  21790463

    Statistics:
    -----------
    MFLOPS                          99.047560

Here's another example of an lperfex run that uses two counters:
lperfex -e 41 -e 13 -y ./a.out

    0.220000 seconds of CPU time elapsed and 0.000000 MB of memory on oscbw01.osc.edu

    Event #  Event                                                Events Counted
    -------  -----                                                --------------
    41       Floating point operations retired (counter 0 only)  21790434
    13       L2 cache lines loaded                                2503

    Statistics:
    -----------
    MFLOPS                          99.047428
    main memory->L2 bandwidth       0.364073 MB/s

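The statistics in these reports can be checked by hand from the raw counts and the CPU time. The following is a minimal sketch in Python, not code taken from lperfex.c. It assumes a 32-byte cache line (the line size on the P6 family) and 1 MB = 10^6 bytes; neither is stated in the report, but together they reproduce the printed bandwidth. The function names are made up for illustration.

```python
# Reproduce the derived statistics from the sample reports above.
# Assumptions (not taken from lperfex.c): a 32-byte cache line, as on the
# Intel P6 family, and 1 MB = 10**6 bytes.

CACHE_LINE_BYTES = 32  # assumed P6 cache line size


def mflops(fp_ops: int, cpu_seconds: float) -> float:
    """Millions of floating-point operations per second of CPU time."""
    return fp_ops / cpu_seconds / 1e6


def l2_load_bandwidth_mb_s(lines_loaded: int, cpu_seconds: float) -> float:
    """Main memory -> L2 traffic in MB/s, from event 13 (L2 lines loaded)."""
    return lines_loaded * CACHE_LINE_BYTES / cpu_seconds / 1e6


if __name__ == "__main__":
    cpu_seconds = 0.22  # CPU time as printed in the report
    print(f"MFLOPS   {mflops(21790434, cpu_seconds):.4f}")
    print(f"L2 MB/s  {l2_load_bandwidth_mb_s(2503, cpu_seconds):.4f}")
```

Both values agree with the two-counter report to within rounding in the last printed digit, which suggests (but does not prove) that lperfex computes them this way.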
Here are all of the command line options to lperfex that do something useful:
-e eventno
-h
-o outfile
-y
--
Here are all of the events that can be counted by lperfex:
| Event Number | Event |
|---|---|
| 0 | Memory references |
| 1 | L1 data cache lines loaded |
| 2 | L1 data cache lines loaded and modified |
| 3 | L1 data cache lines flushed |
| 4 | Weighted number of cycles spent waiting while an L1 data cache miss is outstanding |
| 5 | Instruction fetches |
| 6 | L1 instruction cache misses |
| 7 | ITLB misses |
| 8 | Cycles spent waiting for instruction fetches and ITLB misses |
| 9 | Cycles spent waiting on the instruction decoder |
| 10 | L2 cache instruction fetches |
| 11 | L2 cache data loads |
| 12 | L2 cache data stores |
| 13 | L2 cache lines loaded |
| 14 | L2 cache lines flushed |
| 15 | L2 cache lines loaded and modified |
| 16 | L2 cache lines modified and flushed |
| 17 | L2 cache requests |
| 18 | L2 cache address strobes |
| 19 | Cycles spent waiting on the L2 data bus |
| 20 | Cycles spent waiting on data transfer from L2 cache to processor |
| 21 | Cycles spent while DRDY is asserted |
| 22 | Cycles spent while LOCK is asserted |
| 23 | Bus requests outstanding |
| 24 | Burst read transactions |
| 25 | Read-for-ownership transactions |
| 26 | Write-back transactions |
| 27 | Instruction fetch transactions |
| 28 | Invalidate transactions |
| 29 | Partial-write transactions |
| 30 | Partial transactions |
| 31 | I/O transactions |
| 32 | Deferred transactions |
| 33 | Burst transactions |
| 34 | Total number of transactions |
| 35 | Memory transactions |
| 36 | Bus clock cycles spent while the processor is receiving data |
| 37 | Bus clock cycles spent while the processor is driving the BNR pin |
| 38 | Bus clock cycles spent while the processor is driving the HIT pin |
| 39 | Bus clock cycles spent while the processor is driving the HITM pin |
| 40 | Cycles spent while the bus is snoop-stalled |
| 41 | Floating point operations retired (counter 0 only) |
| 42 | Floating point operations executed (counter 0 only) |
| 43 | Floating point exceptions handled by microcode (counter 1 only) |
| 44 | Multiply operations (counter 1 only) |
| 45 | Divide operations (counter 1 only) |
| 46 | Cycles spent doing division (counter 0 only) |
| 47 | Store buffer blocks |
| 48 | Store buffer drain cycles |
| 49 | Misaligned memory references |
| 50 | Instructions retired |
| 51 | uOps retired |
| 52 | Instructions decoded |
| 53 | Hardware interrupts received |
| 54 | Cycles spent while interrupts are disabled |
| 55 | Cycles spent while interrupts are disabled and pending |
| 56 | Branch instructions retired |
| 57 | Mispredicted branches retired |
| 58 | Taken branches retired |
| 59 | Taken mispredicted branches retired |
| 60 | Branch instructions decoded |
| 61 | Branches which miss the BTB |
| 62 | Bogus branches |
| 63 | BACLEAR assertions |
| 64 | Cycles spent during resource related stalls |
| 65 | Cycles spent during partial stalls |
| 66 | Segment register loads |
| 67 | Cycles during which the processor is not halted |
If the -y option is specified on the command line,
lperfex will print a report of whatever performance metrics
it knows how to derive from the events you've counted. Right now,
this consists of megaflops (millions of floating-point operations per
second), some numbers about L1 and L2 cache usage, and fractions of
total cycles spent waiting on stalls and caches. A table of some of
the possible analyses is shown below. Suggestions for improvements in this
area are welcomed, especially if you can tell me how to compute the
metric you want.
| Performance Metric | Event 0 | Event 1 |
|---|---|---|
| MIPS | 50 or 52 | N/A |
| MFLOPS | 41 or 42 | N/A |
| Average instructions per FP op | 41 or 42 | 50 or 52 |
| Average unhalted cycles per FP op | 41 or 42 | 67 |
| Fraction of FP ops that are multiplies | 41 or 42 | 44 |
| Fraction of FP ops that are divides | 41 or 42 | 45 |
| Average cycles per FP divide | 46 | 45 |
| Memory bandwidth | 13 | 14 |
| L2 to L1 bandwidth | 1 | 3 |
| L2 cache hit rate | 13 | 18 |
| L1 data cache hit rate | 0 | 1 |
| Fraction of cycles unhalted | 67 | N/A |
| Fraction of cycles waiting on resource stalls | 64 | N/A |
| Fraction of cycles waiting on L1 data cache | 4 | N/A |
| Fraction of cycles waiting on L2 data bus | 19 | N/A |
| Fraction of cycles waiting on L2 data transfer | 20 | N/A |
| Fraction of cycles waiting on bus snoop stalls | 40 | N/A |
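The table above can be read as a lookup from counted events to available metrics. Here is a rough sketch of that lookup in Python. It transcribes only the first few rows of the table, and the data layout and function name are illustrative rather than taken from lperfex.c; lperfex itself may report more than these rows imply.

```python
# Which derived metrics are available for a given set of counted events?
# The rules below transcribe part of the table above; the data layout and
# function name are illustrative and not taken from lperfex.c.

FP_OPS = {41, 42}        # FP operations retired / executed
INSTRUCTIONS = {50, 52}  # instructions retired / decoded

# (metric name, alternatives for "Event 0", alternatives for "Event 1" or None)
METRICS = [
    ("MIPS", INSTRUCTIONS, None),
    ("MFLOPS", FP_OPS, None),
    ("Average instructions per FP op", FP_OPS, INSTRUCTIONS),
    ("Average unhalted cycles per FP op", FP_OPS, {67}),
    ("Fraction of FP ops that are multiplies", FP_OPS, {44}),
    ("Fraction of FP ops that are divides", FP_OPS, {45}),
    ("Average cycles per FP divide", {46}, {45}),
]


def available_metrics(events):
    """Return the names of the listed metrics derivable from the events."""
    counted = set(events)
    names = []
    for name, first, second in METRICS:
        if counted & first and (second is None or counted & second):
            names.append(name)
    return names


if __name__ == "__main__":
    print(available_metrics([41, 50]))
```

For events 41 and 50 this yields MIPS, MFLOPS, and average instructions per FP op.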
Most events can be used in combination with other events, except in a few specific cases where both events must occupy the same counter "slot" (e.g. both events can only be counted in counter 0). lperfex will compute metrics for all counters and combinations of counters for which it has metrics defined; for instance, lperfex -e 41 -e 50 -y ./a.out will report MIPS, MFLOPS, and instructions per FP op.
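The slot restriction can be illustrated with a small sketch in Python. It assumes the only constraints are the "(counter 0 only)" and "(counter 1 only)" notes in the event table, and that the P6 offers two counters. The names are hypothetical, and lperfex may well assign events in command-line order rather than searching for a valid placement as this does.

```python
# Assign up to two events to the P6's two counters, honouring the
# "(counter N only)" restrictions listed in the event table.
# Function and variable names are hypothetical, not from lperfex.c.

COUNTER0_ONLY = {41, 42, 46}
COUNTER1_ONLY = {43, 44, 45}


def allowed_counters(event):
    """Return the set of counter slots an event may occupy."""
    if event in COUNTER0_ONLY:
        return {0}
    if event in COUNTER1_ONLY:
        return {1}
    return {0, 1}


def assign_counters(first, second=None):
    """Map counter slot -> event, or raise ValueError on a slot conflict."""
    if second is None:
        return {min(allowed_counters(first)): first}
    for slot in sorted(allowed_counters(first)):
        other = 1 - slot
        if other in allowed_counters(second):
            return {slot: first, other: second}
    raise ValueError(f"events {first} and {second} need the same counter")


if __name__ == "__main__":
    print(assign_counters(41, 50))   # an unrestricted event pairs with 41
    try:
        assign_counters(41, 42)      # both restricted to counter 0
    except ValueError as err:
        print("conflict:", err)
```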
My employer, the Ohio Supercomputer Center, has graciously permitted me to release this code under the GNU General Public License. This software comes with ABSOLUTELY NO WARRANTY of any kind, so if it eats your dissertation or wrecks your marriage, it's not our problem. Copyright of this code is retained by the Ohio Supercomputer Center.
The C code which implements lperfex is available at http://www.osc.edu/~troy/lperfex/lperfex.c. It relies on a Linux kernel patch developed by Erik Hendriks of NASA Goddard Space Flight Center, which is available at http://www.beowulf.org/software/perf-0.7.tar.gz. Many thanks to Erik for developing this patch. (For those trying to apply the patch: the patch for the Linux 2.2.9 kernel should apply cleanly to later 2.2 kernels. I've used it on IA32 systems running Linux 2.2.12, .13, and .14.) The Linux/IA32 perf counter patch available from the PerfAPI project also works.
lperfex probably has quite a few bugs. Here are some of the more glaring ones I know about:
Memory usage is not reported correctly. The Linux getrusage() system call does not assign a value to the maxrss element of the structure it returns. There are other programs that can correctly report memory usage (for instance, the resource limiting code in the PBS batch queuing system), but I'm not sure if they use some system call other than getrusage().
Events are not counted across the fork() and clone() system calls. I had hoped that the PAPI kernel patch would fix this, and in fact with the PAPI patch counter configurations are propagated to child processes. However, the counter results do not appear to be aggregated when they are returned by perf_wait(), which makes the propagation somewhat less useful for my purposes; basically you end up with a count for one thread instead of the entire thread group.
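For anyone who wants to check what getrusage() returns for maxrss on their own system, here is a small sketch in Python using the standard resource module (this is not how lperfex, which is written in C, does it). It runs a command as a child process and reads ru_maxrss for waited-for children. On the 2.2 kernels discussed here the field is left at zero, as noted above; Linux kernels since 2.6.32 fill it in, in kilobytes. The function name is made up for illustration.

```python
# Run a command as a child process and read the peak resident set size
# that getrusage() reports for waited-for children.
import resource
import subprocess
import sys


def child_max_rss(argv):
    """Run argv to completion; return ru_maxrss for terminated children.

    The value is whatever the kernel fills in: kilobytes on current Linux
    kernels, and zero on kernels that leave the field unset.
    RUSAGE_CHILDREN covers every child waited for so far, so this is the
    largest peak among them, not necessarily that of argv alone.
    """
    subprocess.run(argv, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Default to a trivial child if no command is given on the command line.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    print("ru_maxrss:", child_max_rss(command))
```

A result of zero here would indicate the same kernel limitation described above.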
msr interface (part of
devfs) has been included for Linux 2.4 instead.
Another possibility would be to rewrite lperfex to use
the lower level API from the PerfAPI project,
which would limit the number of countable events substantially but
add the possibility of making lperfex available on other
platforms such as the Cray T3E and the IBM SP series. This will
unfortunately cause all the event numbers to change. :(
Please email bug reports and suggestions to troy@osc.edu.
Last updated 1 September 2000.