In principle there are several types of optimization that may be considered:
- CPU optimization - maximizing the speed and efficiency of the operations being performed by the CPU
- I/O optimization - maximizing the efficiency of input/output operations
- Memory optimization - Maximizing the efficiency of a code's memory management
A code whose overall performance is limited by one of these factors is said to be "CPU bound", "I/O bound" or "memory bound," respectively. The main issue for OSC users will likely be on CPU optimization.
The principal focus of optimization efforts may vary considerably depending on the specific platform used. In general a couple of cases may be distinguished:
- Vector Processors
On vector machines, the hardware has been specifically designed to perform numerical computations on "vectors" - collections of numbers in arrays - as rapidly as possible. Codes in which large arrays are manipulated in loops are good candidates for running on a vector system.
Note, however, that only the innermost loop in a set of nested loops will "vectorize". The main key to achieving high performance on such systems lies in insuring that as much work as possible is done in the innermost loops, and that these loops vectorize successfully.
- Parallel Processors
For parallel computers there is a different set of keys to high performance. First, there are issues specific to the parallel architecture, for example maximizing the efficiency of inter-processor communication and maximizing the overlap of communication and calculation. Second, there is the issue of maximizing the performance of each processor. Generally these are commodity microprocessors, rather than the custom-built CPUs found in vector systems, and the main key to achieving high performance on these chips lies in understanding their memory hierarchy.
In addition to the registers, which hold data which is going into or coming out of the functional units which actually do the work (adding, multiplying, etc.), such CPUs typically have two levels of "cache," and a main (local) memory. As one heads "up" the hierarchy from primary to secondary data cache to main memory, the memory segments become slower but larger. That is, the primary cache is a small but fast (and expensive!) piece of memory, while the main memory is large but slow. As a general rule of thumb, each step up in the hierarchy costs about a factor of ten in overall memory latency.
The idea behind this sort of architecture is to balance cost and performance, and the key to the performance lies in efficient cache use. That is, we want to have data that is needed by the CPU in the cache as often as possible, and to limit the need for accessing data in the main memory. Getting data into the cache, and using it once it is there, is the principal key to single-processor performance on these platforms.
Specific strategies for optimizing code on the various platforms may be found in various sets of OSC workshop materials. In particular, see:
Some general themes include:
- Get the Right Answers
If you don't have to get the right answer, you can make a program run arbitrarily fast, so the first rule of optimization is that the program must always generate the correct results. Although this may seem obvious, it is easy to forget to check the answers along the way as you make performance improvements. A lot of frustration can be avoided by making only one change at a time and verifying that the program produces correct results after each change.
- Use Existing Tuned Code
The quickest and easiest way to improve a program's performance is to link it with libraries already tuned for the target hardware. The standard libraries on all OSC systems are all highly optimized for the platform in question. In addition, there are often optional libraries that can provide substantial performance benefits.
- Find Out Where to Tune
When confronted with a program composed of hundreds of modules and thousands of lines of code, it would require a heroic (and inefficient) effort to tune the entire program. Optimization should be concentrated on those parts of the code where the work will pay off with the biggest gains in performance. These sections of code may be identified with the help of a profiler.
- Let the Compiler Do the Work
The most important single tool for optimization is the compiler. Modern optimizing compilers, such as those found on all OSC platforms, are highly sophisticated and capable of generating extremely efficient machine code.
The main mechanism for controlling compiler optimizations are options passed to the compiler and linker. Typically there are many such options (over 100 of them for the Origin 2000 Fortran 90 and C compilers!); consult the man pages for complete details. To achieve a high level of general optimization, begin by experimenting with the following. Note, however, that aggressive optimization can change the results produced by a program, particularly if there is numerical sensitivity to the order in which calculations are performed. Be sure to verify that the optimized code produces correct results.
System Fortran C OSC IBM 1350 Linix Cluster ifort:
-O3 -ipo -ftz -unroll
-O3 -ipo -ftz -unroll
Note that as the level of compiler optimization is raised, both the compilation time and the size of the resulting executable will increase, in some cases rather significantly. This type of CPU optimization thus leads to increased usage of other resources (specifically, memory).
Note also that the compiler is not always able to produce the most efficient machine code automatically. This is because information that would allow the compiler to make optimal decisions is often lacking at compile time.
As an example, consider a program that obtains its input data from a file. The amount of data is not known until run time. For those data sets that are too big to fit in the cache, the best performance is achieved if the compiler performs cache optimizations on the code. But for those data sets that do fit in cache, these optimizations add overhead and make the program run slower. So what should the compiler do?
In situations like this, the compiler will simply make a default choice based on, among other things, the general optimization level specified. If you understand what your program is doing, however, you may be able improve its performance by selectively enabling and disabling the compiler's cache optimizations. This same basic theme applies to the other optimizations as well - giving the compiler as much information as possible about the structure of the code allows it to do its work most efficiently.
In practice this information is transmitted by fine-tuning the compiler options or by adding explicit compiler directives in the source code. The latter approach gives the programmer a great deal of control over how the compiler treats individual code segments. For additional details, see the appropriate workshop materials as well as the compiler man pages.
- Modify the Source Code
Finally, the programmer can modify the source code itself. This may include, for example, adding compiler directives for finer control over compiler optimizations, restructuring the code to eliminate cache use inefficiencies, eliminating vectorization-inhibiting conditions, and reducing the amount of program branching. Most of these work simply by allowing the compiler to do its job more efficiently.
Basic discussions of many of these techniques may be found in the OSC workshop materials, in particular those courses listed above.