Towards effective assessment of steady state performance in Java software: are we there yet?
Status:: 🟩
Links:: Warmup of Java applications, Java Microbenchmark Harness
Metadata
Authors:: Traini, Luca; Cortellessa, Vittorio; Di Pompeo, Daniele; Tucci, Michele
Title:: Towards effective assessment of steady state performance in Java software: are we there yet?
Publication Title:: Empirical Software Engineering
Date:: 2022
URL:: https://doi.org/10.1007/s10664-022-10247-x
DOI:: 10.1007/s10664-022-10247-x
Traini, L., Cortellessa, V., Di Pompeo, D., & Tucci, M. (2022). Towards effective assessment of steady state performance in Java software: Are we there yet? Empirical Software Engineering, 28(1), 13. https://doi.org/10.1007/s10664-022-10247-x
Microbenchmarking is a widely used form of performance testing in Java software. A microbenchmark repeatedly executes a small chunk of code while collecting measurements related to its performance. Due to Java Virtual Machine optimizations, microbenchmarks are usually subject to severe performance fluctuations in the first phase of their execution (also known as warmup). For this reason, software developers typically discard measurements of this phase and focus their analysis when benchmarks reach a steady state of performance. Developers estimate the end of the warmup phase based on their expertise, and configure their benchmarks accordingly. Unfortunately, this approach is based on two strong assumptions: (i) benchmarks always reach a steady state of performance and (ii) developers accurately estimate warmup. In this paper, we show that Java microbenchmarks do not always reach a steady state, and often developers fail to accurately estimate the end of the warmup phase. We found that a considerable portion of studied benchmarks do not hit the steady state, and warmup estimates provided by software developers are often inaccurate (with a large error). This has significant implications both in terms of results quality and time-effort. Furthermore, we found that dynamic reconfiguration significantly improves warmup estimation accuracy, but still it induces suboptimal warmup estimates and relevant side-effects. We envision this paper as a starting point for supporting the introduction of more sophisticated automated techniques that can ensure results quality in a timely fashion.
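The abstract describes the usual JMH workflow: a benchmark method is invoked repeatedly, and the early (warmup) iterations are discarded before the measured ones are analyzed. A minimal sketch of such a microbenchmark is shown below; the class name, the benchmark body, and the warmup/measurement values are illustrative assumptions and are not taken from the paper's benchmark suites.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
// Warmup iterations are executed but their measurements are discarded;
// this is where the developer's (static) warmup estimate is encoded.
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
public class SumBenchmark {

    private int[] data;

    @Setup
    public void setup() {
        data = new int[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
    }

    // The "small chunk of code" executed repeatedly by JMH. Its first
    // executions run while the JVM is still optimizing (JIT compilation,
    // class loading, etc.), which is exactly the warmup phase the paper studies.
    @Benchmark
    public long sum() {
        long total = 0;
        for (int value : data) {
            total += value;
        }
        return total;
    }
}
```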
Notes & Annotations
Color-coded highlighting system used for annotations
📑 Annotations (imported on 2025-04-26#21:22:41)
On the basis of our results, our practical suggestion is to never execute a benchmark for less than 5 s (and less than 50 invocations) before starting to collect measurements. When time does not represent a major concern, warmup should last for at least 30 s of continuous benchmark execution, and no less than 300 invocations.
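One hedged way to apply this suggestion with JMH's programmatic runner is sketched below. The class and benchmark names are hypothetical; warmup is configured by duration because JMH exposes warmup as iterations of a given length rather than as a raw invocation count, so the 50/300-invocation floors still need to be checked against the benchmark's per-invocation cost.

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

public class SteadyStateRunner {

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include("SumBenchmark")   // hypothetical benchmark class, see sketch above
                .forks(3)
                // Conservative setting from the paper's suggestion:
                // 6 warmup iterations of 5 s each, i.e. 30 s of continuous
                // execution discarded before measurements are collected.
                .warmupIterations(6)
                .warmupTime(TimeValue.seconds(5))
                .measurementIterations(10)
                .measurementTime(TimeValue.seconds(5))
                .build();

        new Runner(options).run();
    }
}
```

Splitting the 30 s of warmup into several shorter iterations (rather than one 30 s iteration) keeps JMH's per-iteration output, which makes it easier to eyeball whether the scores have actually stabilized before measurement starts.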
The results for RQ3 show that developer static configurations fail to accurately estimate the end of the warmup phase, often with a non-trivial estimation error (median: 28 s). Developers tend to overestimate warmup time more frequently than underestimating it (48% vs 32%). Nonetheless, both of these kinds of estimation errors produce relevant (though diverse) side effects. For example, we showed that overestimation produces severe time wastes (median: 33 s), thereby hampering the adoption of benchmarks for continuous performance assessment. On the other hand, underestimation often leads to performance measurements that significantly deviate from those collected in the steady state (median: 7%), thus leading to poor results quality and potentially wrong judgements.