This section first presents the general aspects of the roofline performance model. Two variants of the model are described: the classical and the cache-aware rooflines. These models are then applied to PICSAR to study the effects of different optimizations.

#### Presentation of the roofline performance model

The roofline performance model provides a visual analysis of the resources constraining computation on any system, from single-core to many-core architectures. It consists of a 2D graph combining information on floating-point performance, operational intensity (also referred to as arithmetic intensity), and memory performance. It provides a synthesized understanding of how efficiently a kernel uses system resources, how far the kernel is from reaching the machine peak performance, and what the limitations and trade-offs are. Rooflines can be computed using the RAM or cache-level bandwidth and usage (and, more recently, the high-bandwidth memory that equips KNLs).

In a roofline graph (schematically shown in Fig. 1), the ordinate represents the floating-point operations per second (in Gflop/s), whereas the abscissa represents the arithmetic intensity (AI), which is the ratio of total floating-point operations to total data movement (bytes). Roofline graphs are composed of two parts:

- The left roofline is a rising slope whose steepness is given by the system peak memory bandwidth, which can be determined from the system itself or from the vendor specifications.
- The right, flat roofline is given by the system peak floating-point performance; it bounds codes whose arithmetic intensity is high enough that memory bandwidth is no longer the constraint.
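These two roofs combine into a single bound: for a given arithmetic intensity, the attainable performance is the minimum of the bandwidth roof and the compute roof. A minimal sketch, with hypothetical peak and bandwidth figures (not measured values from this study):

```python
def attainable_gflops(ai, peak_gflops, bw_gb_s):
    """Attainable performance (Gflop/s) at arithmetic intensity ai (flop/byte)."""
    return min(peak_gflops, bw_gb_s * ai)

# Hypothetical KNL-like figures: 2200 Gflop/s peak, 400 GB/s memory bandwidth.
peak, bw = 2200.0, 400.0
ridge = peak / bw  # AI above which the code can become compute bound (5.5)
print(attainable_gflops(0.125, peak, bw))  # low AI: bandwidth bound -> 50.0
print(attainable_gflops(10.0, peak, bw))   # high AI: compute bound -> 2200.0
```

The ridge point, where the two roofs intersect, marks the arithmetic intensity above which the machine can in principle become compute bound.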

In this study, two different roofline models will be applied to PICSAR:

**The classical roofline model [1,2] (Fig. 2):** Originally, the arithmetic intensity is determined using bytes measured out of DRAM. The classical roofline model can be generalized to any given memory or cache level, provided the traffic can be measured.

**The Cache-Aware Roofline Model (CARM) [3] (Fig. 3):** Operational intensity is determined from the total number of bytes transferred from all levels of the memory hierarchy to the CPU, i.e. the total amount of bytes transferred to the processing units. From another point of view, it basically corresponds to the L1 traffic (L1 operational intensity).

The main difference between the two models is that the classical roofline measures the traffic out of DRAM whereas the cache-aware roofline measures the traffic into the CPU, in other words, out of L1. Nonetheless, the roofline model can be generalized: it measures the memory traffic out of any given level of the memory hierarchy (MCDRAM, LLC/L3, L2) when this traffic can be measured. In that sense, these two models are just particular cases of the generalized one. However, optimization effects and interpretations of the results differ between them.
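In practice, the only difference between the two models is which byte counter feeds the denominator of the arithmetic intensity. A small sketch with hypothetical counter readings:

```python
def arithmetic_intensity(total_flops, bytes_moved):
    """AI in flop/byte, for traffic measured at any level of the hierarchy."""
    return total_flops / bytes_moved

# Hypothetical counter readings for one kernel run.
flops = 1.0e12
dram_bytes = 8.0e12  # traffic out of DRAM  -> classical roofline AI
l1_bytes = 2.0e13    # traffic into the CPU -> cache-aware (L1) AI
print(arithmetic_intensity(flops, dram_bytes))  # 0.125
print(arithmetic_intensity(flops, l1_bytes))    # 0.05
```

Since every byte that reaches the CPU flows through L1 but only misses reach DRAM, the L1 traffic is larger, so the cache-aware intensity is lower than (or equal to) the classical one.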

With the classical roofline model, the arithmetic intensity depends on cache reuse, varies with the number of iterations, and is platform dependent. This model separates memory optimizations from computational optimizations but can be misleading when determining the performance bound.

With the cache-aware roofline model, the arithmetic intensity is constant for a given algorithm, constant with the number of iterations, and not platform dependent. All optimizations appear as improvements in flop/s. Performance is always bounded by the nearest bandwidth limit.

Depending on where the code marker is located, general information about what slows down the code can be deduced. The interpretation is basically the same for both models and is schematically given in Fig. 4. A marker located below the DRAM roofline (very low arithmetic intensity) indicates that the code is essentially memory bound (bandwidth bound): the code efficiency is limited by the memory bandwidth and the available computational power is under-used. Far from the roofline, the code (or the considered piece of code) is probably loading more data than it uses, without saturating the bandwidth. A code marker (a loop or a function in the case of the cache-aware model) close to the memory roofline (rising line) means that the bandwidth is saturated. In both cases, the code probably suffers from many cache misses.

A marker located below the computation roofline (flat roofline, at high arithmetic intensity) means that the code is essentially compute bound: its efficiency is limited by the available computational power rather than by memory. There are different computational rooflines depending on the degree of optimization: scalar, FMA, vector (SIMD). Computational rooflines also depend on the data type: double, float, integer. The float or integer roofline is higher than the double one because more elements fit in a SIMD register. If the code is not vectorized, does not use FMA, or uses complex instructions, it will be located below the scalar roofline. In most cases, codes or pieces of code are not strictly memory bound or compute bound but have issues on both sides.

To show how roofline performance models behave and how to read them, we consider two examples. The first is the case of a memory-bound code, schematically described in Fig. 5. The original marker is the red circle. Since the code is mainly DRAM bound, better DRAM memory management is required. A first optimization could consist, for instance, of an L2 cache optimization such as cache blocking if many cache misses are observed. On the classical roofline, this optimization moves the marker up (lower computation time) and to the right (fewer accesses to DRAM, better cache reuse). This corresponds to the green circle marker. Sometimes, better cache reuse also means optimizing computation with fewer redundancies; in this case, the number of flops can change as well. In the cache-aware roofline model, the arithmetic intensity is not modified by any memory optimization since L1 traffic is measured. L2 cache blocking therefore decreases computational time and increases performance. Note that the performance is the same in both roofline models (the markers have the same ordinate in Fig. 5).

If the L2 cache optimization is good enough, the code is no longer limited by DRAM but by the L2 bandwidth. An L1 cache optimization such as L1 cache blocking can then be applied. However, such an optimization cannot be seen easily on the DRAM classical roofline; one has to compute the L2 arithmetic intensity, shown via square markers in Fig. 5. L1 cache blocking, as for L2, moves the marker up and to the right, so that the last marker becomes the purple one. In the cache-aware roofline, the L1 optimization again moves the circle marker up, so that the code may become L1 bound.

In the second example, we consider a compute-bound code located below the scalar roofline. The cause can be:

- The code is not vectorized.
- It does not use Fused Multiply-Add (FMA) operations.
- Vectorization is highly inefficient (gather/scatter).

Vectorization or FMA implementation basically does not change the arithmetic intensity, unless the data structure is changed so that more or fewer operations are performed. As a consequence, in both models, the markers mainly rise vertically because of the lower computational time, as shown in Fig. 6.

#### Application of the Classical Roofline Model to PICSAR

To compute the classical roofline model, the method presented in our tutorials is used. Arithmetic intensities (flop/byte) and performances (Gflop/s) are calculated using Intel VTune and Intel SDE.
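As a rough sketch of this bookkeeping (the counter values below are hypothetical, chosen only to reproduce the first PICSAR data point reported below; the flop count would come from SDE and the DRAM traffic and elapsed time from VTune):

```python
def roofline_point(total_flops, dram_bytes, elapsed_s):
    """Return (arithmetic intensity in flop/byte, performance in Gflop/s)."""
    ai = total_flops / dram_bytes
    gflops = total_flops / elapsed_s / 1.0e9
    return ai, gflops

# Hypothetical measurements (not the actual published counters).
ai, perf = roofline_point(total_flops=5.2e12, dram_bytes=4.0e13, elapsed_s=200.0)
print(ai, perf)  # 0.13 26.0
```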

The test case used for this study is a homogeneous thermalized plasma. The domain has a discretization of 100x100x100 cells with dx=dy=dz=2 µm. The plasma is a hydrogen plasma, therefore composed of two species (electrons and protons), with 40 particles per cell for each. The particles are randomly distributed. Interpolation processes are performed at order 1.

Simulations are run on a single KNL node configured in SNC4. The following results were obtained with an old version of PICSAR (November 2016). Three levels of optimization are considered. In the first one, PICSAR is run with 64 MPI ranks and no tiling. This configuration is close to what the code would have been before any optimization (WARP subroutines). The corresponding marker is in blue in Fig. 7. The arithmetic intensity is 0.13 flops/byte for a performance of 26 Gflops/s, 1.2% of the peak performance.

The next optimization uses the tiling with the non-optimized (non-vectorized) PIC subroutines. The tiling is an L2 cache-blocking technique for the field arrays. During interpolation processes such as the current/charge deposition or the field gathering, the global current/charge or field grids are divided into tiles (blocks) small enough to be entirely stored in L2. In the present case, since particles are randomly distributed, two particles close in memory can deposit to or gather information from very distant nodes of the grids. If these steps are performed on the global grids, this leads to many cache misses and poor cache reuse. With the tiling, all particles of a tile use the same grids in cache. OpenMP is used with the tiling; this simulation is therefore performed with 4 MPI ranks and 32 OpenMP threads per rank (hyper-threading). By improving cache reuse, the arithmetic intensity is enhanced (fewer bytes loaded from the MCDRAM) and is four times higher than in the totally non-optimized case. The corresponding marker is the green triangle shown in Fig. 7.
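The tiling idea can be illustrated with a deliberately simplified 1D deposition sketch (illustrative Python, not the actual PICSAR Fortran): particles are binned per tile, each tile deposits onto a small local block that fits in cache, and the blocks are then reduced into the global grid.

```python
import numpy as np

def deposit_tiled(cells, tile, positions, charge=1.0):
    """Order-0 charge deposition on a 1D grid, one cache-resident block per tile."""
    grid = np.zeros(cells)
    ntiles = cells // tile
    # Bin particle indices by tile so a whole tile is processed at once.
    tile_of = (positions // tile).astype(int)
    for t in range(ntiles):
        local = np.zeros(tile)                  # small block, stays cache resident
        for x in positions[tile_of == t]:
            local[int(x) - t * tile] += charge  # deposit into the local block
        grid[t * tile:(t + 1) * tile] += local  # reduce block into the global grid
    return grid

rng = np.random.default_rng(0)
pos = rng.uniform(0, 64, size=1000)             # random particles, as in the test case
print(deposit_tiled(64, 8, pos).sum())          # total deposited charge: 1000.0
```

In the real code the grids are 3D, the deposition is at order 1 or higher, and tiles map to OpenMP work units, but the cache-reuse argument is the same: each tile's block is touched repeatedly while resident in L2 instead of scattering accesses across the whole grid.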

The final case is the fully optimized code with vectorized subroutines and tiling. The vectorized subroutines also include an internal cache-blocking technique. Full optimization increases both the arithmetic intensity and the performance. The final arithmetic intensity is 2.1 flops/byte, 16 times that of the non-optimized code. Performance is 60 Gflops/s. This represents 2.7% of the peak performance of the considered KNL node, evaluated at 2.2 Tflops/s (vector+FMA on double-precision floats).
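The quoted peak fractions follow directly from the 2.2 Tflop/s figure:

```python
# Fractions of the 2.2 Tflop/s (2200 Gflop/s) KNL peak quoted above.
peak_gflops = 2200.0
for gflops in (26.0, 60.0):
    print(f"{gflops} Gflop/s -> {100.0 * gflops / peak_gflops:.1f}% of peak")
# 26.0 Gflop/s -> 1.2% of peak
# 60.0 Gflop/s -> 2.7% of peak
```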

The classical roofline performance model reveals that the optimization efforts have significantly improved both code performance and arithmetic intensity. The most optimized version of the code sits at the scalar roofline, still far from the vector+FMA roofline.

#### Application of the Cache-Aware Roofline Model to PICSAR

The Cache-Aware Roofline Model can be easily computed using Intel Advisor; see our detailed tutorial on how to compute the roofline performance model and how to use Intel Advisor for more information. For the roofline, we focus on the current deposition and the field gathering, the two main interpolation processes. The test case is the same as for the classical roofline. However, a KNL configured in quadrant cache mode is considered here, with 1 MPI rank and 64 OpenMP threads. The roofline model is shown in Fig. 8. Loops associated with the current deposition are in orange and green ones correspond to the field gathering. The size of the markers is proportional to the computational time.

Three levels of optimization are considered. At first, the profiling is performed with the non-vectorized PICSAR subroutines coming from WARP. Since we are using OpenMP, no tiling means that there is just 1 tile per OpenMP thread. Consequently, tile size is larger than the available space for each core in L2 cache. A significant rate of L2 cache misses is expected. This first level corresponds to the circle markers. These markers are located below the scalar and the DRAM roofline.

In the second level, the profiling is performed with the non-vectorized PICSAR routines using the tiling, i.e. the tile size is adjusted to fit in the L2 cache. The effect of the tiling can be seen in Fig. 8 and corresponds to the transition from circle markers to triangle markers. With tiling, the scalar current deposition reaches the DRAM roofline. This is not yet the case for the scalar field gathering, despite better performance.

In the third level of optimization, the profiling is performed with the vectorized subroutines and the tiling. This corresponds to the square markers in Fig. 8. The vectorized current deposition is divided into three loops due to the change of data structure, as explained in our page on the Morse-Nielson current deposition optimization. The top-right orange marker is the first loop of the vectorized algorithm, which performs the deposition on the new data structure. This step is the one that gains the most from vectorization. The second step is the smaller orange square marker in the middle, which computes node indices and weights. This marker is just above the DRAM roofline but still below the scalar one. This step is vectorized but contains type conversions and mixed data types. The last step of the current deposition is the reduction into the original data structure (current grids). It is the tiny orange square marker on the lower left, below the DRAM roofline, with an arithmetic intensity of 0.03 flop/byte. This step could be vectorized but is still scalar because vectorization is not efficient there (4 scatters per grid node). However, it represents a tiny fraction of the time spent in the current deposition. The field gathering has been put in a single loop for this last level. This process is done by block, and the green square marker corresponds to the innermost, vectorized loop. Thanks to the vectorization optimization, the field gathering achieves better performance but is still below the scalar roofline.

#### Conclusion

The roofline model is a powerful tool that summarizes in a single plot where the code is located relative to the system peak performance. It can be used to see the effects of different optimizations in terms of arithmetic intensity and performance rather than just computational time.

#### References

[1] – Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. *Communications of the ACM,* 52(4):65-76, 2009.

[2] – Roofline performance model. http://crd.lbl.gov/departments/computerscience/PAR/research/roofline/. Accessed: 2016-07-22.

[3] – Aleksandar Ilic, Frederico Pratas, and Leonel Sousa. Cache-aware roofline model: Upgrading the loft. *IEEE Computer Architecture Letters*, 13(1):21–24, 2014.

[4] – Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincenti. Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor, https://crd.lbl.gov/assets/Uploads/ixpug16-roofline.pdf.

*Mathieu Lobet, Last update: April 23, 2017*

Mathieu, in the paragraph beginning with “Depending on where the code marker is located…,” you state that “For a given arithmetic intensity, a code close to the memory roofline is saturating the bandwidth and probably doing few computations whereas far from it, the same code is saturating the bandwidth and also doing more computations.” It seems like as the code marker gets closer to any roofline, the number of computations is increasing as indicated on the y-axis, whereas the same code marker farther from the roofline would indicate fewer, as opposed to more computations. Is this what you meant to say, or am I misunderstanding your point? Thank you, Bill


Hi Bill,

Thanks a lot for your feedback. This sentence was wrong, and I must say that I do not remember the point. The text is now corrected to be clearer. Indeed, for a given AI, moving vertically means that the number of flops performed per second changes. If the AI stays constant, the number of loaded bytes changes accordingly. When the roofline is reached, the bandwidth is saturated. I really appreciate your comments; do not hesitate to suggest new corrections.
