Yes, if you fix the total photon count, then you also fix the signal-to-noise ratio, which will be invariant under area changes. However, in many cases you don't fix total photon count; you fix photon density, photons per unit area.
An example: you have a 400mm f/5.6 lens and are shooting on full frame. You observe two features of the lens:
- It collects enough light during daylight hours, so you can use fast shutter speeds, don't need image stabilization, and never get excessive noise
- The lens is just too short for your uses
...therefore, because of the second point, you are considering switching to a 1.6x crop camera. The lens would then effectively be a 640mm lens. But can you still rely on the first point with a crop camera?
The answer is: no, you can't. The total photon count is lower on the smaller sensor, because the smaller sensor can't see all the light the full frame sensor would see. The full frame sensor therefore collects 1.6^2 = 2.56 times as much light as the crop sensor. So on the crop camera the lens is effectively not a 640mm f/5.6 lens but a 640mm f/8.96 lens, where 8.96 = 1.6 * 5.6. (And before someone complains about equal ISOs and equal exposure: on full frame you can use a 2.56x higher ISO and still get the same noise level, because the full frame sensor collects more light.)
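To make that arithmetic explicit, here is a minimal sketch in Python (the numbers come straight from the example; the variable names are mine):

```python
# Crop-factor arithmetic for the 400mm f/5.6 example above.
crop = 1.6
focal_length_mm = 400.0
f_number = 5.6

# The crop body's field of view matches a longer lens on full frame:
equivalent_focal_length = focal_length_mm * crop   # 640.0 mm

# The crop sensor sees only 1/crop^2 of the light hitting the FF sensor:
light_ratio = crop ** 2                            # 2.56

# Equivalent f-number in terms of total light gathered:
equivalent_f_number = f_number * crop              # 8.96

print(f"{equivalent_focal_length:.0f}mm f/{equivalent_f_number:.2f}; "
      f"full frame collects {light_ratio:.2f}x the light")
```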
Another example: you are choosing between a 20 megapixel and an 80 megapixel camera with equal sensor sizes, and you shoot in low light. Which camera has lower noise?
With no post-processing, the 20 megapixel camera's pixels have four times the area. Each pixel therefore collects four times as much light, since photon density is constant, so the 20 megapixel camera has lower per-pixel noise.
However, in this case the difference isn't so clear-cut, because the total amount of light collected across all pixels is the same. With suitable post-processing, using a noise removal algorithm that also considers neighboring pixels, it may be possible to combine information from neighboring pixels so that the 80 megapixel camera behaves like a 20 megapixel camera. With such an algorithm you could use the 80 megapixel camera in low light and get equal results, while in good light you would still enjoy the real benefits of the 80 megapixel camera.
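As a sanity check on that claim, here is a small simulation sketch. It assumes idealized sensors limited only by photon shot noise (no read noise), and it stands in for "suitable post-processing" with the simplest neighbor-aware operation there is, a 2x2 binning sum; the patch size and photon levels are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same sensor area, same photon density: a 20MP pixel has 4x the area of
# an 80MP pixel, so on average it catches 4x the photons.
mean_small = 100.0           # photons per 80MP pixel (assumed value)
mean_large = 4 * mean_small  # photons per 20MP pixel

h, w = 2000, 2000            # a small patch of the sensor, for speed
img80 = rng.poisson(mean_small, size=(h, w)).astype(float)
img20 = rng.poisson(mean_large, size=(h // 2, w // 2)).astype(float)

# Bin the 80MP patch 2x2, giving it the 20MP patch's pixel count:
binned = img80.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))

snr = lambda a: a.mean() / a.std()
print(f"80MP per-pixel SNR: {snr(img80):5.1f}")   # ~sqrt(100) = 10
print(f"20MP per-pixel SNR: {snr(img20):5.1f}")   # ~sqrt(400) = 20
print(f"80MP binned SNR:    {snr(binned):5.1f}")  # ~sqrt(400) = 20
```

Summing four Poisson counts of mean N gives a Poisson count of mean 4N, which is why the binned 80 megapixel patch matches the 20 megapixel patch exactly in this idealized setup.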
As for your second question, it's exactly the same as my second example, which is a borderline case. In raw images, the 100x diluted sensor has more noise. However, suitable algorithms can make the 100x bigger sensor work like the small sensor, recovering the information from the noise; but if you use such algorithms, you can't enjoy the full resolution of the big sensor with 100x the pixel count.
The per-pixel SNR in an unprocessed image is indeed:
photons_per_pixel / sqrt(photons_per_pixel) = sqrt(photons_per_pixel)
but by reducing the effective pixel count in the final post-processed image, you can make the SNR behave more like:
total_photons / sqrt(total_photons) = sqrt(total_photons)
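With made-up numbers: at 400 photons per pixel, the per-pixel SNR is 400 / sqrt(400) = 20. Combining 100 such pixels into one effective pixel gives 40,000 photons and an SNR of sqrt(40000) = 200, a 10x gain, which matches the sqrt(100) factor you would expect when trading away 100x of the pixel count.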
As for 12mm f/something on MFT and 12mm f/something on full frame, we can observe:
- The effective focal length is different, so the field of view will be different; the images won't be the same (the different aspect ratios alone already prevent that)
- The physical aperture diameter (12mm divided by the same f-number) is the same, so both lenses collect the same amount of light at the lens level
- The MFT sensor is smaller, so less of that light ends up on the sensor
- Both MFT and FF shooters need to use the same ISO to get the same exposure
- ...but FF sensors show less noise at high ISO than small sensors (unless you compare a sensor from 25 years ago to today's state of the art; let's make this fair and compare two sensors of a similar technology level), so the FF shooter will have less noise whenever lack of light creates noise in the final image
If you made this 12mm f/something vs 24mm f/(2*something), the images would be equivalent: the field of view would be the same, and the FF sensor would collect the same total amount of light, so the noise would be the same. With 12mm f/something and 24mm f/something, the field of view would be the same, but the FF sensor would collect more light; there would also be depth of field and background blur differences.
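Here is a sketch of that equivalence arithmetic, using an arbitrary stand-in value of 2.8 for "something". At a fixed field of view, total light on the sensor scales with the square of the entrance pupil diameter (focal length divided by f-number):

```python
f = 2.8  # arbitrary stand-in for "something"

cases = [
    ("12mm f/2.8 on MFT", 12.0, f),      # the baseline
    ("24mm f/5.6 on FF",  24.0, 2 * f),  # equivalent: same FOV, same light
    ("24mm f/2.8 on FF",  24.0, f),      # same FOV, 4x the light
]

baseline_pupil = 12.0 / f
for name, focal, fnum in cases:
    pupil = focal / fnum  # entrance pupil diameter in mm
    print(f"{name}: pupil {pupil:.2f}mm, "
          f"relative light {(pupil / baseline_pupil) ** 2:.1f}x")
```

The first two rows gather the same total light, which is why they are equivalent; the third gathers four times as much, and its larger pupil is also what produces the shallower depth of field.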
Note that in the MFT vs FF example, the light was NOT diluted: light per unit area was the same. In your second example, however, the light was diluted, so it's a different case. In the MFT vs FF example you don't even need different pixel counts; the FF sensor can very well have the same pixel count as the MFT sensor and still have lower noise.
So the MFT vs FF case is clear: FF is better. However, with a sensor of 100x the size and 100x the pixel count, and with diluted light (meaning you use a lens with a higher f-number and a larger image circle to get the diluted light), the big sensor is worse ... until you post-process the images to take information from neighboring pixels into account, in which case the 100x sensor would no longer effectively use its 100x pixels as individual pixels, and would be equally good.