First, allow me to address the subsample shift property in relation to non-linear signal mappings. It is fairly straightforward to see that time shifting is not as simple a property for non-linear mappings as it is for linear systems. Consider a discrete time signal given by the sequence $$\dots ,1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, \dots$$
This sequence is invariant under memoryless nonlinearities of the form $x\mapsto x^n$ for natural $n$. If the above sequence is shifted by a fraction of the sampling interval, the sample values will not be $0$ and $1$ anymore, and the resulting sequence will lose its invariance under the nonlinear map given.
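Here is a quick numerical sketch of this point (my own illustration, not part of the argument itself), using an FFT-based fractional delay, which performs an exact band-limited circular shift for a periodic sequence:

```python
import numpy as np

# The period-3 sequence ...,1,0,0,... is unchanged by x -> x^2 on its samples,
# but after a half-sample shift the samples are no longer just 0 and 1, and
# squaring no longer commutes with the shift.
def frac_shift(x, delta):
    """Band-limited circular shift of a periodic sequence by `delta` samples."""
    k = np.fft.fftfreq(len(x))                  # frequency in cycles/sample
    return np.real(np.fft.ifft(np.fft.fft(x) * np.exp(-2j * np.pi * k * delta)))

x = np.tile([1.0, 0.0, 0.0], 32)
print(np.allclose(x**2, x))                     # True: invariant under squaring

delta = 0.5
shift_then_square = frac_shift(x, delta)**2
square_then_shift = frac_shift(x**2, delta)
print(np.max(np.abs(shift_then_square - square_then_shift)))  # clearly nonzero
```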
Enforcing this invariance and allowing only nonlinear maps that preserve it is equivalent to removing aliasing, as I will be showing here when I find a little time to expand upon my answer.
Edit: Some more details.
For simplicity, we will look at a discrete time signal centred around $t=0$ with a sampling interval of $T=1$. The discrete signal can be expanded using a power series $$x(t) = \sum_{n=0}^{2N} D_{N}^{(n)} \frac{t^n}{n!}$$
which describes $x[t]$ in the interval $t\in [-N,N]$. The coefficients $D_{N}^{(n)}$ are chosen so that the polynomial is the minimal interpolating polynomial on this interval. They coincide with the $n$-th discrete derivatives of $x[t]$ at $t=0$, computed from the samples in this interval. The interpolating nature of this expansion makes it a natural linear homomorphism to the space of continuous time signals. It is worth noting that in the limiting case $N\to\infty$, the interpolation approaches the bandlimited interpolation of the $\mathrm{sinc}$ kernel.
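To make the expansion concrete, here is a small sketch (again only an illustration) that obtains the coefficients $D_{N}^{(n)}$ by solving the interpolation conditions directly; for $N=1$ it reproduces the discrete derivatives used in the example further down:

```python
import numpy as np
from math import factorial

def interp_coeffs(samples):
    """samples = [x[-N], ..., x[N]]  ->  [D_N^(0), ..., D_N^(2N)]."""
    N = (len(samples) - 1) // 2
    t = np.arange(-N, N + 1, dtype=float)
    n = np.arange(2 * N + 1)
    fact = np.array([factorial(k) for k in n], dtype=float)
    # Interpolation conditions: sum_n D^(n) t^n / n! = x[t] for t = -N, ..., N
    A = t[:, None] ** n[None, :] / fact[None, :]
    return np.linalg.solve(A, np.asarray(samples, dtype=float))

xm1, x0, xp1 = 0.2, 1.0, 0.5                    # x[-1], x[0], x[1]
D = interp_coeffs([xm1, x0, xp1])
print(D)
print(np.allclose(D, [x0, (xp1 - xm1) / 2, xp1 - 2 * x0 + xm1]))   # True
```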
Applying a differential time shift $dt$ to a smooth continuous time signal $x(t)$ can be written, to first order in $dt$, as $$x(t-dt)=x(t) - dt \frac{\partial}{\partial t}x(t)=\left(1-dt \frac{\partial}{\partial t}\right)x(t)$$
For finite shifts $\delta t$, we can concatenate many smaller shifts and take the limit $$x(t-\delta t)=\lim_{n\to\infty}\left(1-\frac{\delta t}{n}\frac{\partial}{\partial t}\right)^n \,x(t)=\exp\left(-\delta t \frac{\partial}{\partial t}\right)\,x(t)$$
Therefore the linear operator $S(\delta t)=\exp\left(-\delta t\frac{\partial}{\partial t}\right)$ shifts smooth continuous time signals.
The interpolation polynomial above is also an expansion of the $\exp$ function into a power series, with the discrete derivative taking the place of the partial derivative of the shift operator. For finite orders of the interpolation polynomial, the continuous time shift operator therefore approximates shifts of the interpolating function. This approximation becomes exact in the limit of infinite order.
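For a polynomial of degree $2N$, the series for the shift operator terminates after $2N+1$ terms and the shift is exact. A short symbolic sanity check (just a sketch, using the $D$ coefficients of the expansion above for $N=1$):

```python
import sympy as sp

# Truncated exp(-dt * d/dt) applied to a degree-2 polynomial equals x(t - dt).
t, dt, D0, D1, D2 = sp.symbols('t dt D0 D1 D2')
x = D0 + D1*t + D2*t**2/2

shift_series = sum((-dt)**k / sp.factorial(k) * sp.diff(x, t, k) for k in range(3))
print(sp.simplify(shift_series - x.subs(t, t - dt)))    # 0
```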
With this understanding, we can calculate an (approximately) alias-free memoryless nonlinearity $f$ acting on $x[t]$. We only need to evaluate the result at $t=0$; all other times follow by integer sample shifts.
With the interpolated continuous-time signal $x(t)$ approximating a band-limited reconstruction of $x[t]$, we can apply the nonlinearity in continuous time and then band-limit the result using an approximation of a band-limiting kernel $b(t)$ to avoid aliasing. A sufficient condition for a feasible symbolic evaluation of the band-limited result is that $f$ and $b$ be polynomials. Then $f(x(t))$ is a polynomial and we can directly calculate
$$y(0) = \int_{-N}^{N} f\left(x(t)\right) b(t) dt$$
which, in the limit of large order $N$, achieves both full translation invariance and perfect alias rejection. This is not a proof that both are equivalent, but a good starting point to understand how these two properties are linked.
Suitable choices for $b$ include polynomial expansions of the $\mathrm{sinc}$ function. For example
$$b_N(t)=\frac{(N^2-t^2) \prod_{n=1}^{2N}(t^2-n^2) }{N^2 \prod_{n=1}^{2N}n^2}$$
which gives an approximately band-limiting kernel on the interval $t \in [-N,N]$.
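A quick way to see that $b_N$ is sinc-like (this is only a numerical check of the formula above): it equals $1$ at $t=0$ and $0$ at every other integer with $|t| \le 2N$:

```python
import numpy as np

def b(N, t):
    """The polynomial kernel b_N(t) defined above."""
    t = np.asarray(t, dtype=float)
    num = (N**2 - t**2) * np.prod([t**2 - n**2 for n in range(1, 2*N + 1)], axis=0)
    den = N**2 * np.prod([n**2 for n in range(1, 2*N + 1)])
    return num / den

for N in (1, 2, 3):
    print(N, np.round(b(N, np.arange(-2*N, 2*N + 1)), 12))
    # 1 at t = 0, zeros at the other integers (double zeros at t = +-N)
```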
This much must suffice as theoretical motivation for now.
Example
The simplest non-trivial example uses the smallest possible neighbourhood and a crude approximation to a band-limiting kernel. Do not expect good anti-aliasing properties from it; it is only here to demonstrate the general procedure for creating an anti-aliased memoryless nonlinearity.
We use the lowest possible order, $N=1$, and arrive at the expansion
$$x(t) = D_1^{(0)} t^0 + D_1^{(1)} t^1 + D_1^{(2)} \frac{t^2}{2}$$
where
$$\begin{aligned}
D_1^{(0)} &= x[0]\\
D_1^{(1)} &= \frac{1}{2}(x[1] - x[-1])\\
D_1^{(2)} &= x[1]-2x[0]+x[-1]
\end{aligned}$$
and as a single expression
$$x(t)= x[0] + \frac{1}{2}(x[1]-x[-1])\, t + \frac{1}{2}(x[1]-2x[0]+x[-1])\,t^2$$
The nonlinearity is assumed to be a monomial $x\mapsto x^k$; taking $b_1$ from above, we get
$$y_k[0] = \int_{-1}^{+1} \left(x[0] + \frac{1}{2}(x[1]-x[-1])\, t + \frac{1}{2}(x[1]-2x[0]+x[-1])\,t^2\right)^k \, \cdot \frac{1}{4}(4-t^2)(t^2-1)^2 dt$$
We can evaluate this expression for $k=2$ and also remove the $t=0$ simplification made earlier (restoring the general sample index $n$) to arrive at the nonlinear filter
$$y_2[n] = \frac{4}{3465}(688 x[n]^2 + 40 x[n-1]^2 + 40 x[n+1]^2 - 41 x[n-1] x[n+1] + 82 x[n](x[n+1]+x[n-1]))$$
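If you want to reproduce this without doing the integral by hand, a computer algebra system gets there directly (a sketch using SymPy):

```python
import sympy as sp

# Plug the quadratic interpolation x(t) and the kernel b_1(t) into the
# integral for y_2[0] and compare with the filter coefficients above.
t, xm1, x0, xp1 = sp.symbols('t xm1 x0 xp1')     # x[n-1], x[n], x[n+1]

xt = x0 + sp.Rational(1, 2)*(xp1 - xm1)*t + sp.Rational(1, 2)*(xp1 - 2*x0 + xm1)*t**2
b1 = sp.Rational(1, 4)*(4 - t**2)*(t**2 - 1)**2

y2 = sp.integrate(xt**2 * b1, (t, -1, 1))
claimed = sp.Rational(4, 3465)*(688*x0**2 + 40*xm1**2 + 40*xp1**2
                                - 41*xm1*xp1 + 82*x0*(xp1 + xm1))
print(sp.simplify(y2 - claimed))                 # 0
```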
Generalising for non-smooth nonlinearities
The argument so far has required that the non-linearity can be well approximated by a polynomial. If that is not the case, the integral becomes harder to evaluate, and the existence of a closed-form solution is not even guaranteed. This is where the equivalence between alias rejection and commutativity of the filter with sub-sample shifts comes in.
The most general form of a nonlinear filter on a neighborhood $[-N,N]$ around $t=0$ is that of a nonlinear map
$$ y[0] = F( x[-N],x[-N+1],\dots,x[N-1],x[N] )$$
For a memoryless non-linearity, we want to map constant input signals to constant output signals, according to the non-linear transfer function $f$. If the constant input is $X$, then we have the condition
$$F(X,X,\dots,X) = f(X)$$
and from the shift-invariance we have the condition
$$ F( S(\delta t) x[-N], S(\delta t) x[-N+1],\dots, S(\delta t)x[N-1],S(\delta t) x[N]) = S(\delta t)F( x[-N],x[-N+1],\dots,x[N-1],x[N] )$$ for all $\delta t \in \mathbb{R}$, where $S(\delta t)$ is understood to act on the underlying (interpolated) continuous-time signals.
In general, there is no symbolic solution to this problem. It is, however, approachable with numerical optimization methods. In many cases, the number of free parameters can be reduced further by restricting the form of $F$.
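To sketch what such a numerical fit might look like (this is only one possible setup of my own, with a quadratic ansatz for $F$ on a $[-1,1]$ neighbourhood and $f(x)=x^2$): if the test signals are bandlimited to below half the Nyquist rate, then $f(x(t))$ is still bandlimited, so the ideal alias-free output is known in closed form, and $F$ can be fitted by least squares over random sub-sample positions, which enforces both conditions at once.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(xm1, x0, xp1):
    """All monomials of total degree 2 in the neighbourhood samples (ansatz for F)."""
    return np.array([xm1**2, x0**2, xp1**2, xm1*x0, x0*xp1, xm1*xp1])

rows, targets = [], []
for _ in range(2000):
    # Random signal bandlimited below a quarter of the sample rate, so that
    # x(t)^2 is still below Nyquist and the alias-free target is just x(t)^2.
    freqs = rng.uniform(0.0, 0.25, size=4)
    amps = rng.normal(size=4)
    phases = rng.uniform(0.0, 2*np.pi, size=4)
    x = lambda t: np.sum(amps * np.cos(2*np.pi*freqs*t + phases))

    t0 = rng.uniform(-10.0, 10.0)               # arbitrary sub-sample position
    rows.append(features(x(t0 - 1), x(t0), x(t0 + 1)))
    targets.append(x(t0)**2)

# Also pin down constant inputs: F(X, X, X) = X^2.
for X in np.linspace(-2.0, 2.0, 21):
    rows.append(features(X, X, X))
    targets.append(X**2)

coeffs, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
print(dict(zip(['x[n-1]^2', 'x[n]^2', 'x[n+1]^2',
                'x[n-1]x[n]', 'x[n]x[n+1]', 'x[n-1]x[n+1]'], np.round(coeffs, 4))))
```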
I believe I have answered all your questions with the exception of the request for literature pointers. I am not aware of any; I don't know whether this theory has ever been presented before. But the idea is not too difficult to come up with, so it has probably been done before. If you find something, please do let me know.