== Performance ==
{{More citations needed section|date=September 2010}}

In addition to running a compiled Java program, computers running Java applications generally must also run the [[Java virtual machine]] (JVM), while compiled C++ programs can be run without external applications. Early versions of Java were significantly outperformed by statically compiled languages such as C++. This is because a statement in these two closely related languages may compile to a few machine instructions in C++, while compiling to several bytecodes, each involving several machine instructions, when interpreted by a JVM. For example:

{| class="wikitable"
! Java/C++ statement
! C++ generated code (x86)
! Java generated bytecode
|-
| {{code|a[i]++;}}
| {{sxhl|2=nasm|mov edx,[ebp+4h]
mov eax,[ebp+1Ch]
inc dword ptr [edx+eax*4]}}
| {{pre|aload_1
iload_2
dup2
iaload
iconst_1
iadd
iastore}}
|}

Since performance optimization is a very complex issue, it is very difficult to quantify the performance difference between C++ and Java in general terms, and most benchmarks are unreliable and biased. Given the very different natures of the languages, definitive qualitative differences are also difficult to draw. In a nutshell, there are inherent inefficiencies and hard limits on optimization in Java, given that it relies heavily on flexible high-level abstractions; however, the use of a powerful JIT compiler (as in modern JVM implementations) can mitigate some issues. In any case, if the inefficiencies of Java are too great, compiled C or C++ code can be called from Java via the [[Java Native Interface|JNI]].

Some inefficiencies that are inherent to the Java language include, mainly:
* All objects are allocated on the heap. Although allocation is extremely fast in modern JVMs using "bump allocation", which performs similarly to stack allocation, performance can still be negatively impacted because the garbage collector must eventually be invoked. Since Oracle JDK 6, modern JIT compilers have mitigated this problem to some extent by using escape analysis to allocate some objects on the stack (see the sketch after this list).
* Performance-critical projects like efficient database systems and messaging libraries have had to use internal, unofficial APIs like <code>sun.misc.Unsafe</code> to gain access to manual memory management and off-heap allocation, effectively manipulating pseudo-pointers.
* The large amount of run-time casting required even when using standard containers induces a performance penalty. However, most of these casts are statically eliminated by the JIT compiler.
* Safety guarantees come at a run-time cost. For example, the compiler is required to put appropriate range checks in the code. Guarding each array access with a range check is not efficient, so most JIT compilers try to eliminate them statically or hoist them out of inner loops (although most native compilers for C++ will do the same when range checks are optionally used).
* Lack of access to low-level details prevents the developer from improving the program where the compiler is unable to do so.<ref>{{cite journal |first=Nathan |last=Clark |author2=Amir Hormati |author3=Sami Yehia |author4=Scott Mahlke |title=Liquid SIMD: Abstracting SIMD hardware using lightweight dynamic mapping |journal=Hpca'07 |pages=216–227 |year=2007}}</ref>
* The mandatory use of reference semantics for all user-defined types in Java can introduce large numbers of superfluous memory indirections (unless elided by the JIT compiler), which can lead to frequent cache misses (a.k.a. [[Thrashing (computer science)|cache thrashing]]). Furthermore, cache optimization, usually via cache-aware or [[Cache-oblivious algorithm|cache-oblivious]] data structures and algorithms, can often lead to orders-of-magnitude improvements in performance as well as avoiding the time-complexity degeneracy characteristic of many cache-pessimizing algorithms, and is therefore one of the most important forms of optimization; reference semantics, as mandated in Java, makes such optimizations impossible to realize in practice (by either the programmer or the JIT compiler).
* [[garbage collection (computer science)|Garbage collection]],<ref name="hundt2011">{{cite web |last=Hundt |first=Robert |title=Loop Recognition in C++/Java/Go/Scala |publisher=[[Scala Days]] 2011 |date=2011-04-27 |location=Stanford, California |access-date=2012-11-17 |url=https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf |archive-url=https://ghostarchive.org/archive/20221009/https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf |archive-date=2022-10-09 |url-status=live |quote=Java shows a large GC component, but a good code performance. [...] We find that in regards to performance, C++ wins out by a large margin. [...] The Java version was probably the simplest to implement, but the hardest to analyze for performance. Specifically the effects around garbage collection were complicated and very hard to tune.}}</ref> as this form of automatic memory management introduces memory overhead.<ref name="HertzBerger2005">{{cite web |url=http://people.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf |title=Quantifying the Performance of Garbage Collection vs. Explicit Memory Management |author=Matthew Hertz, Emery D. Berger |publisher=OOPSLA 2005 |date=2005 |access-date=2015-03-15 |quote=In particular, when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management. However, garbage collection's performance degrades substantially when it must use smaller heaps. With three times as much memory, it runs 17% slower on average, and with twice as much memory, it runs 70% slower. |archive-url=https://web.archive.org/web/20170706100244/https://people.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf |archive-date=6 July 2017 |url-status=dead}}</ref>
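As a minimal illustration of the escape-analysis point above (the class and method names here are merely illustrative, not from any cited source), the short-lived objects created inside the loop below never escape the method, so a JIT compiler performing escape analysis may allocate them on the stack or eliminate them entirely through scalar replacement, rather than placing each one on the heap:

<syntaxhighlight lang="java">
// Illustrative sketch: the Vec2 instances below never escape run(),
// so a JIT compiler performing escape analysis may allocate them on
// the stack or remove them entirely (scalar replacement) instead of
// allocating every instance on the heap.
final class Vec2 {
    final double x, y;
    Vec2(double x, double y) { this.x = x; this.y = y; }
    double dot(Vec2 other) { return x * other.x + y * other.y; }
}

public class EscapeAnalysisSketch {
    static double run(int n) {
        double sum = 0;
        for (int i = 0; i < n; i++) {
            Vec2 v = new Vec2(i, i + 1);   // candidate for scalar replacement
            sum += v.dot(v);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(run(1_000_000));
    }
}
</syntaxhighlight>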
However, there are a number of benefits to Java's design, some realized, some only theorized:
* Java [[garbage collection (computer science)|garbage collection]] may have better cache coherence than the usual use of ''[[malloc]]''/''[[new (C++)|new]]'' for memory allocation. Nevertheless, arguments exist{{Weasel inline|date=March 2012}} that both allocators equally fragment the heap and neither exhibits better cache locality. However, in C++, allocation of single objects on the heap is rare, and large quantities of single objects are usually allocated in blocks via an STL container and/or with a small-object allocator.<ref>{{cite book |first=Andrei |last=Alexandrescu |title=Modern C++ Design: Generic Programming and Design Patterns Applied. Chapter 4 |editor=Addison-Wesley |pages=77–96 |year=2001 |isbn=978-0-201-70431-0}}</ref><ref>{{cite web |url=http://www.boost.org/doc/libs/release/libs/pool/ |title=Boost Pool library |publisher=Boost |access-date=19 April 2013}}</ref>
* Run-time compilation can use information about the platform on which the code is executing to improve code more effectively. However, most state-of-the-art native (C, C++, etc.) compilers can generate multiple code paths to employ the full computational abilities of the given system.<ref>[https://web.archive.org/web/20200929080449/https://www.slac.stanford.edu/comp/unix/.../icc/.../optaps_dsp_qax.htm Targeting IA-32 Architecture Processors for Run-time Performance Checking]</ref> Also, the inverse argument can be made that native compilers can better exploit architecture-specific optimizations and instruction sets than multi-platform JVM distributions.
* Run-time compilation allows more aggressive virtual function inlining than is possible for a static compiler, because the JIT compiler has more information about all possible targets of virtual calls, even if they are in different dynamically loaded modules. Currently available JVM implementations have no problem inlining most monomorphic, mostly monomorphic, and dimorphic calls, and research is in progress on inlining megamorphic calls as well, thanks to the <code>invokedynamic</code> enhancements added in Java 7.<ref>{{Cite web |url=http://www.azulsystems.com/blog/cliff/2011-04-04-fixing-the-inlining-problem |title=Fixing The Inlining "Problem" by Dr. Cliff Click {{!}}Azul Systems: Blogs |access-date=23 September 2011 |archive-date=7 September 2011 |archive-url=https://web.archive.org/web/20110907092432/http://www.azulsystems.com/blog/cliff/2011-04-04-fixing-the-inlining-problem |url-status=dead}}</ref> Inlining can allow further optimizations like loop vectorization or [[loop unwinding|loop unrolling]], resulting in a large overall performance increase.
* In Java, thread synchronization is built into the language,{{sfn|Bloch|2018|loc=Chapter 11, Item 78: Synchronize access to shared mutable data|pp=126–129}} so the JIT compiler can potentially, via escape analysis, elide locks<ref>[http://java.sun.com/performance/reference/whitepapers/6_performance.html#2.1.2 Oracle Technology Network for Java Developers]</ref> and significantly improve the performance of naive multi-threaded code (see the sketch after this list).
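As a minimal illustration of the last point (the class and method names are merely illustrative), the <code>StringBuffer</code> below has synchronized methods, but because the instance never escapes the method, a JIT compiler performing escape analysis may elide its lock operations entirely:

<syntaxhighlight lang="java">
// Illustrative sketch of lock elision: StringBuffer's methods are
// synchronized, but this instance never escapes build(), so a JIT
// compiler using escape analysis may remove the monitor enter/exit
// operations, making the loop about as cheap as an unsynchronized
// StringBuilder.
public class LockElisionSketch {
    static String build(int n) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < n; i++) {
            sb.append(i).append(' ');   // each call locks sb unless elided
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(10));
    }
}
</syntaxhighlight>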
Also, some performance problems occur in C++:
* Allowing pointers to point to any address can make optimization difficult due to the possibility of [[pointer aliasing]].
* Since the code generated from different instantiations of the same class template in C++ is not shared (as it is with type-erased generics in Java), excessive use of templates may lead to a significant increase in executable code size ([[code bloat]]). However, because function templates are aggressively inlined, they can sometimes reduce code size and, more importantly, allow more aggressive static analysis and code optimization by the compiler, often making them more efficient than non-templated code. In contrast, Java generics are necessarily less efficient than non-genericized code (a short illustration of type erasure follows this list).
* Because in a traditional C++ compiler dynamic linking is performed after code generation and optimization, function calls spanning different dynamic modules cannot be inlined. However, modern C++ compilers like MSVC and Clang/LLVM offer link-time code generation options that allow modules to be compiled to intermediate formats, enabling inlining at the final link stage.
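For illustration of the type-erasure contrast mentioned above (the class name is merely illustrative), the following Java sketch shows that every instantiation of a generic class shares a single compiled class at run time, so generics cause no template-style code bloat; instead, casts are inserted where elements are read back:

<syntaxhighlight lang="java">
// Illustrative sketch of type erasure: List<String> and List<Integer>
// share one runtime class, so generics produce no template-style code
// bloat; the compiler instead inserts checkcast instructions where
// elements are retrieved.
import java.util.ArrayList;
import java.util.List;

public class TypeErasureSketch {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();

        // Both lists have the same runtime class: one compiled version
        // of ArrayList serves every generic instantiation.
        System.out.println(strings.getClass() == numbers.getClass());  // true

        strings.add("hello");
        String s = strings.get(0);   // an implicit cast (checkcast) happens here
        System.out.println(s);
    }
}
</syntaxhighlight>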