== Implementation considerations ==
{{more citations needed section|date=January 2015}}
The description above states that the matrices are square, that the size is a power of two, and that padding should be used if needed. This restriction allows the matrices to be split in half, recursively, until the limit of scalar multiplication is reached. The restriction simplifies the explanation and the analysis of complexity, but is not actually necessary;<ref>{{cite journal |last=Higham |first=Nicholas J. |title=Exploiting fast matrix multiplication within the level 3 BLAS |journal=ACM Transactions on Mathematical Software |volume=16 |issue=4 |year=1990 |pages=352–368 |url=http://www.maths.manchester.ac.uk/~higham/papers/high90s.pdf |doi=10.1145/98267.98290 |hdl=1813/6900 |s2cid=5715053 |hdl-access=free}}</ref> and in fact, padding the matrix as described will increase the computation time and can easily eliminate the fairly narrow time savings obtained by using the method in the first place.

A good implementation will observe the following:
* It is not necessary or desirable to use the Strassen algorithm down to the limit of scalars. Compared to conventional matrix multiplication, the algorithm adds a considerable <math>O(n^{2})</math> workload in additions and subtractions, so below a certain size it is better to use conventional multiplication. Thus, for instance, a <math>1600 \times 1600</math> matrix does not need to be padded to <math>2048 \times 2048</math>, since it can be subdivided down to <math>25 \times 25</math> matrices, and conventional multiplication can then be used at that level.
* The method can indeed be applied to square matrices of any dimension.<ref name="dalberto">{{cite conference |first1=Paolo |last1=D'Alberto |first2=Alexandru |last2=Nicolau |title=Using Recursion to Boost ATLAS's Performance |year=2005 |conference=Sixth Int'l Symp. on High Performance Computing |url=https://www.ics.uci.edu/~paolo/Reference/paoloA.ishp-vi.pdf}}</ref> If the dimension is even, they are split in half as described. If the dimension is odd, zero padding by one row and one column is applied first. Such padding can be applied on the fly and lazily, and the extra rows and columns discarded as the result is formed. For instance, suppose the matrices are <math>199 \times 199</math>. They can be split so that the upper-left portion is <math>100 \times 100</math> and the lower-right is <math>99 \times 99</math>. Wherever the operations require it, dimensions of <math>99</math> are zero padded to <math>100</math> first. Note, for instance, that the product <math>M_2</math> is only used in the lower row of the output, so it only needs to be <math>99</math> rows high; the left factor <math>A_{21} + A_{22}</math> used to generate it therefore need only be <math>99</math> rows high, and there is no need to pad that sum to <math>100</math> rows; it is only necessary to pad <math>A_{22}</math> to <math>100</math> columns to match <math>A_{21}</math>. Both the crossover to conventional multiplication and the padding of odd dimensions are illustrated in the sketch below.

Furthermore, there is no need for the matrices to be square. Non-square matrices can be split in half using the same methods, yielding smaller non-square matrices.
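To make the first two points concrete, the following is a minimal sketch in Python with NumPy; the function name <code>strassen</code> and the crossover value of 64 are illustrative choices, not part of the algorithm. For simplicity it assumes square inputs and pads odd dimensions eagerly rather than on the fly and lazily as described above:

<syntaxhighlight lang="python">
import numpy as np

def strassen(A, B, threshold=64):
    """Multiply square matrices A and B, falling back to conventional
    multiplication below the threshold size (illustrative value)."""
    n = A.shape[0]
    if n <= threshold:
        return A @ B  # conventional multiplication at small sizes
    if n % 2 == 1:
        # Odd dimension: zero-pad by one row and one column, then
        # discard the extra row and column of the result.
        A = np.pad(A, ((0, 1), (0, 1)))
        B = np.pad(B, ((0, 1), (0, 1)))
        return strassen(A, B, threshold)[:n, :n]
    m = n // 2
    A11, A12 = A[:m, :m], A[:m, m:]
    A21, A22 = A[m:, :m], A[m:, m:]
    B11, B12 = B[:m, :m], B[:m, m:]
    B21, B22 = B[m:, :m], B[m:, m:]
    # The seven Strassen products.
    M1 = strassen(A11 + A22, B11 + B22, threshold)
    M2 = strassen(A21 + A22, B11, threshold)
    M3 = strassen(A11, B12 - B22, threshold)
    M4 = strassen(A22, B21 - B11, threshold)
    M5 = strassen(A11 + A12, B22, threshold)
    M6 = strassen(A21 - A11, B11 + B12, threshold)
    M7 = strassen(A12 - A22, B21 + B22, threshold)
    # Reassemble the four quadrants of the result.
    C = np.empty((n, n), dtype=M1.dtype)
    C[:m, :m] = M1 + M4 - M5 + M7
    C[:m, m:] = M3 + M5
    C[m:, :m] = M2 + M4
    C[m:, m:] = M1 - M2 + M3 + M6
    return C
</syntaxhighlight>

A tuned implementation would choose the threshold empirically for the target hardware and apply the lazier padding discussed above.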
If the matrices are sufficiently non-square, it is worthwhile to reduce the initial operation to more nearly square products, using simple methods which are essentially <math>O(n^{2})</math>; for instance (both reductions are sketched at the end of this section):
* A product of size <math>[2N \times N] \ast [N \times 10N]</math> can be done as 20 separate <math>[N \times N] \ast [N \times N]</math> operations, arranged to form the result;
* A product of size <math>[N \times 10N] \ast [10N \times N]</math> can be done as 10 separate <math>[N \times N] \ast [N \times N]</math> operations, summed to form the result.
These techniques make the implementation more complicated than simply padding to a power-of-two square; however, it is a reasonable assumption that anyone undertaking an implementation of Strassen, rather than conventional multiplication, will place a higher priority on computational efficiency than on simplicity of the implementation.

In practice, Strassen's algorithm can be implemented to attain better performance than conventional multiplication even for matrices as small as <math>500 \times 500</math>, for matrices that are not at all square, and without requiring workspace beyond buffers that are already needed for a high-performance conventional multiplication.<ref name="huang et al.">{{cite conference |last1=Huang |first1=Jianyu |last2=Smith |first2=Tyler M. |last3=Henry |first3=Greg M. |last4=van de Geijn |first4=Robert A. |date=13 Nov 2016 |title=Strassen's Algorithm Reloaded |url=https://www.researchgate.net/publication/315365781 |conference=SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis |publisher=IEEE Press |doi=10.1109/SC.2016.58 |isbn=9781467388153 |pages=690–701 |access-date=1 Nov 2022 |conference-url=https://ieeexplore.ieee.org/xpl/conhome/7875333/proceeding}}</ref>
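For illustration, the two reductions above might be sketched as follows (Python with NumPy again; <code>square_multiply</code> stands for any square-matrix routine, such as the sketch earlier in this section, and the dimensions are assumed to divide evenly):

<syntaxhighlight lang="python">
import numpy as np

def blocked_product(A, B, square_multiply):
    """Multiply A (pN x N) by B (N x qN) as p*q separate N x N
    products, arranged in blocks to form the pN x qN result."""
    N = A.shape[1]
    p, q = A.shape[0] // N, B.shape[1] // N
    C = np.empty((p * N, q * N), dtype=A.dtype)
    for i in range(p):
        for j in range(q):
            C[i*N:(i+1)*N, j*N:(j+1)*N] = square_multiply(
                A[i*N:(i+1)*N, :], B[:, j*N:(j+1)*N])
    return C

def summed_product(A, B, square_multiply):
    """Multiply A (N x kN) by B (kN x N) as k separate N x N
    products, summed to form the N x N result."""
    N = A.shape[0]
    k = A.shape[1] // N
    C = np.zeros((N, N), dtype=A.dtype)
    for i in range(k):
        C += square_multiply(A[:, i*N:(i+1)*N], B[i*N:(i+1)*N, :])
    return C
</syntaxhighlight>

With <math>p = 2</math> and <math>q = 10</math>, <code>blocked_product</code> performs the 20 operations of the first example above; <code>summed_product</code> with <math>k = 10</math> performs the 10 summed operations of the second.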