==Software and system management==

===Operating systems===
{{Main|Supercomputer operating systems}}

Since the end of the 20th century, [[supercomputer operating systems]] have undergone major transformations, driven by changes in [[supercomputer architecture]].<ref name=Padua426>''Encyclopedia of Parallel Computing'' by David Padua 2011 {{ISBN|0-387-09765-1}} pages 426–429</ref> While early operating systems were custom tailored to each supercomputer to gain speed, the trend has been to move away from in-house operating systems toward the adaptation of generic software such as [[Linux]].<ref name=MacKenzie>''Knowing Machines: Essays on Technical Change'' by Donald MacKenzie 1998 {{ISBN|0-262-63188-1}} pages 149–151</ref>

Since modern [[massively parallel]] supercomputers typically separate computations from other services by using multiple types of [[Locale (computer hardware)|nodes]], they usually run different operating systems on different nodes, e.g. using a small and efficient [[lightweight kernel operating system|lightweight kernel]] such as [[CNK operating system|CNK]] or [[Compute Node Linux|CNL]] on compute nodes, but a larger system such as a full [[Linux distribution]] on server and [[I/O]] nodes.<ref name=EuroPar2004>''Euro-Par 2004 Parallel Processing: 10th International Euro-Par Conference'' 2004, by Marco Danelutto, Marco Vanneschi and Domenico Laforenza, {{ISBN|3-540-22924-8}}, page 835</ref><ref name=EuroPar2006>''Euro-Par 2006 Parallel Processing: 12th International Euro-Par Conference'', 2006, by Wolfgang E. Nagel, Wolfgang V. Walter and Wolfgang Lehner {{ISBN|3-540-37783-2}} page</ref><ref name=Alam>''[https://web.archive.org/web/20190801201606/https://pdfs.semanticscholar.org/2aeb/c9b51047d5b79462f47d89f30f0f90389280.pdf An Evaluation of the Oak Ridge National Laboratory Cray XT3]'' by Sadaf R. Alam et al. ''International Journal of High Performance Computing Applications'' February 2008 vol. 22 no. 1 pages 52–80</ref>

While in a traditional multi-user computer system [[job scheduling]] is, in effect, a [[task scheduling|tasking]] problem for processing and peripheral resources, in a massively parallel system the job management system must manage the allocation of both computational and communication resources, as well as gracefully deal with the hardware failures that are inevitable when tens of thousands of processors are present.<ref name=Yariv>Open Job Management Architecture for the Blue Gene/L Supercomputer by Yariv Aridor et al. in ''Job Scheduling Strategies for Parallel Processing'' by Dror G. Feitelson 2005 {{ISBN|978-3-540-31024-2}} pages 95–101</ref>

Although most modern supercomputers use [[Linux]]-based operating systems, each manufacturer has its own specific Linux distribution, and no industry standard exists, partly because differences in hardware architectures require the operating system to be optimized for each hardware design.<ref name=Padua426 /><ref>{{cite web |url=http://www.top500.org/overtime/list/32/os |title=Top500 OS chart |publisher=Top500.org |access-date=31 October 2010 |url-status=dead |archive-url=https://web.archive.org/web/20120305234455/http://www.top500.org/overtime/list/32/os |archive-date=5 March 2012 }}</ref>

===Software tools and message passing===
{{Main|Message passing in computer clusters}}
{{See also|Parallel computing|Parallel programming model}}
[[File:Wide-angle view of the ALMA correlator.jpg|thumb|Wide-angle view of the [[Atacama Large Millimeter Array|ALMA]] correlator<ref>{{cite news|title=Wide-angle view of the ALMA correlator|url=http://www.eso.org/public/images/eso1253a/|access-date=13 February 2013|newspaper=ESO Press Release}}</ref>]]

The parallel architectures of supercomputers often dictate the use of special programming techniques to exploit their speed. Software tools for distributed processing include standard [[Application programming interface|APIs]] such as [[Message Passing Interface|MPI]]<ref>{{cite book |first=Frank |last=Nielsen |title=Introduction to HPC with MPI for Data Science |year=2016 |publisher=Springer |isbn=978-3-319-21903-5 |pages=185–221}}</ref> and [[Parallel Virtual Machine|PVM]], [[Virtual tape library|VTL]], and [[Open-source software|open source]] software such as [[Beowulf (computing)|Beowulf]]. Most commonly, environments such as [[Parallel Virtual Machine|PVM]] and [[Message Passing Interface|MPI]] are used for loosely connected clusters, while [[OpenMP]] is used for tightly coordinated shared-memory machines. Significant effort is required to optimize an algorithm for the interconnect characteristics of the machine it will run on; the aim is to prevent any of the CPUs from wasting time waiting on data from other nodes. [[GPGPU]]s have hundreds of processor cores and are programmed using programming models such as [[CUDA]] or [[OpenCL]]. Parallel programs are also difficult to debug and test, so [[Testing high-performance computing applications|special techniques]] are needed for testing and debugging such applications.
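As an illustrative sketch of the message-passing style described above (not drawn from any particular supercomputer's codebase), a minimal C program using MPI point-to-point communication might look like the following, where each worker process sends its rank to a root process that collects the results:

<syntaxhighlight lang="c">
/* Minimal MPI example: every non-root rank sends its rank number to
   rank 0, which receives and prints them. Typically compiled with an
   MPI wrapper compiler, e.g. `mpicc hello.c`, and launched with
   something like `mpirun -np 4 ./a.out` (launcher names vary by
   MPI implementation). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* start the MPI runtime   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks   */

    if (rank != 0) {
        /* workers: send own rank to the root process */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        /* root: receive one message from every other rank */
        for (int i = 1; i < size; i++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("root received rank %d\n", value);
        }
    }
    MPI_Finalize();                        /* shut down cleanly       */
    return 0;
}
</syntaxhighlight>

The same send/receive pattern underlies far larger codes; on a real machine the communication schedule would be tuned to the interconnect topology so that no rank sits idle waiting on data, as noted above.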