Downsampling an image reduces the number of samples available to represent the signal. In the frequency domain, downsampling folds the high-frequency portion of the signal onto the low-frequency portion, which is known as aliasing. In image processing, the desired outcome is to preserve only the low-frequency portion. To achieve this, the original image must first be low-pass filtered (anti-alias filtered) to remove the high-frequency portion, so that aliasing does not occur.
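To make the aliasing concrete, here is a minimal one-dimensional sketch in plain Python. The helper name `decimate` and the chosen frequencies are illustrative assumptions, not part of any library. A cosine at 0.4 cycles/sample is valid at the original rate, but after keeping every second sample without filtering it becomes indistinguishable from a cosine at 0.2 cycles/sample at the new rate.

```python
import math

def decimate(signal, factor):
    """Keep every `factor`-th sample with no prefiltering (naive downsampling)."""
    return signal[::factor]

# A cosine at 0.4 cycles/sample: below the original Nyquist limit (0.5),
# but above the limit that applies after 2x decimation (0.25 in original units).
n_samples = 40
high = [math.cos(2 * math.pi * 0.4 * n) for n in range(n_samples)]
decimated = decimate(high, 2)

# At the new rate the tone sits at 0.8 cycles/sample, which yields exactly
# the same samples as a 0.2 cycles/sample cosine: the high frequency has aliased.
alias = [math.cos(2 * math.pi * 0.2 * m) for m in range(len(decimated))]
max_diff = max(abs(a - b) for a, b in zip(decimated, alias))
print(max_diff)  # tiny: floating-point rounding only
```

Once the samples have been dropped, nothing in the decimated data distinguishes the two tones, which is why the filtering has to happen before downsampling.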
The ideal digital filter for removing the high-frequency portion (the one with the sharpest cutoff) is the sinc function. The reason is that the frequency-domain representation of the sinc function is a rectangle: a constant 1 over the entire low-frequency region and a constant 0 over the entire high-frequency region.
$$\text{sinc}(x)=\frac{\sin(\pi x)}{\pi x}$$
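The impulse response of the sinc filter is infinitely long, so it cannot be applied directly. The Lanczos filter is a modified sinc filter: it attenuates the sinc coefficients with a window (the central lobe of a wider sinc) and truncates them after a fixed number of lobes, beyond which the values are insignificant.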
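As a rough sketch of the Lanczos kernel, the following plain-Python code evaluates $L(x) = \text{sinc}(x)\,\text{sinc}(x/a)$ for $|x| < a$ and 0 elsewhere. The function names, the default `a=3`, and the tap layout in `lanczos_taps` are assumptions chosen for illustration; real resamplers differ in how they place and normalize taps.

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def lanczos(x, a=3):
    """Lanczos kernel: sinc windowed by the central lobe of a wider sinc."""
    if abs(x) >= a:
        return 0.0  # truncated outside the window
    return sinc(x) * sinc(x / a)

def lanczos_taps(factor, a=3):
    """Taps of a low-pass filter for downsampling by an integer `factor`.

    The kernel is stretched by `factor` so its cutoff matches the new
    Nyquist limit, then normalized so the taps sum to 1.
    """
    radius = a * factor
    taps = [lanczos(k / factor, a) for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]

taps = lanczos_taps(2)
print(len(taps), sum(taps))
```

Normalizing the taps to sum to 1 keeps a uniform image uniform after filtering, at least away from the image borders.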
However, being optimal in the frequency domain does not imply being optimal to the human eye. There are nonlinear upsampling and downsampling methods that produce better-looking results than linear ones.
With regard to the statement about $n \times n$, it is important to keep in mind that during image sampling, the coordinate correspondence between the high-resolution signal and the low-resolution signal is not arbitrary. Nor is it sufficient to align both signals to the same origin (0) on the real or discrete number line.
The minimum requirements for the coordinate correspondence are:
- Upsampling an image containing arbitrary random values by an integer factor, then downsampling by the same integer factor, should reproduce the original image with minimal numerical change.
- Upsampling/downsampling an image consisting of just one uniform value, followed by the opposite operation, should result in an image consisting of the same value uniformly, with minimal numerical deviations.
- Repeatedly applying pairs of upsampling/downsampling should shift the image content as little as possible.
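
One common convention that satisfies these requirements, assumed here purely for illustration, treats each sample as the center of a pixel, so that output index $i$ maps to input coordinate $(i + 0.5) \cdot s - 0.5$ with $s$ the ratio of input size to output size. The sketch below is a one-dimensional check using the simplest possible resamplers (nearest-neighbor upsampling and block averaging); the function names are hypothetical, and this is not meant as a production-quality resampler.

```python
import math

def src_coord(dst_index, scale):
    """Map an output index to a continuous input coordinate.

    Pixel-center convention; scale = input_size / output_size.
    """
    return (dst_index + 0.5) * scale - 0.5

def upsample_nearest(row, factor):
    """Integer-factor upsampling: take the input sample nearest to each output center."""
    out = []
    for i in range(len(row) * factor):
        x = src_coord(i, 1.0 / factor)
        j = min(max(math.floor(x + 0.5), 0), len(row) - 1)  # round, then clamp
        out.append(row[j])
    return out

def downsample_box(row, factor):
    """Integer-factor downsampling by block averaging.

    Output j covers inputs j*factor .. (j+1)*factor - 1, whose midpoint
    is exactly src_coord(j, factor).
    """
    return [sum(row[j * factor:(j + 1) * factor]) / factor
            for j in range(len(row) // factor)]

row = [3.0, 7.0, 1.0, 9.0]
round_trip = downsample_box(upsample_nearest(row, 2), 2)
print(round_trip)

uniform = downsample_box(upsample_nearest([5.0] * 4, 3), 3)
print(uniform)
```

With this mapping the round trip returns the original values and a uniform row stays uniform. An origin-aligned mapping (output index $i$ to input coordinate $i \cdot s$) would instead place each low-resolution sample at the left edge of its block, which introduces a sub-pixel shift that accumulates over repeated resampling.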