The video and audio were not packaged in the same file: InSight did not produce a video file with audio tracks. As you can see in the screenshot, the 'audio' was recorded by the seismometer instrument (SEIS) as vibration data.
That data can be converted to audio (after all, sound is also vibration), but the conversion was done after reception. NASA did some processing to get usable audio:
Far below the human range of hearing, this sonification from SEIS had to be sped up and slightly processed to be audible through headphones.
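As a rough illustration of what "sped up" means here, the sketch below writes slowly sampled vibration data into a WAV file with a much higher playback rate, which shifts every frequency up by the same factor. This is not NASA's actual processing: the sampling rate, speed-up factor, and test signal are hypothetical values chosen only to make the arithmetic easy to follow, and `sonify` is a made-up helper name.

```python
import io
import math
import struct
import wave


def sonify(samples, playback_rate_hz):
    """Pack vibration samples into 16-bit mono WAV bytes at the given playback rate."""
    # Normalize to the loudest sample so the signal fills the 16-bit range.
    peak = max(abs(s) for s in samples) or 1.0
    frames = b"".join(
        struct.pack("<h", int(round(s / peak * 32767))) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)                  # mono
        wav.setsampwidth(2)                  # 16-bit samples
        wav.setframerate(playback_rate_hz)   # the speed-up happens here
        wav.writeframes(frames)
    return buf.getvalue()


# Hypothetical values, not the real SEIS or NASA parameters.
SENSOR_RATE_HZ = 20   # assumed sampling rate of the vibration data
SPEEDUP = 100         # assumed speed-up factor
SIGNAL_HZ = 2.0       # a vibration well below the ~20 Hz floor of human hearing
DURATION_S = 600      # ten minutes of recorded data

# Synthetic stand-in for seismometer data: a pure 2 Hz vibration.
samples = [
    math.sin(2 * math.pi * SIGNAL_HZ * n / SENSOR_RATE_HZ)
    for n in range(SENSOR_RATE_HZ * DURATION_S)
]

# Playing the same samples back 100x faster raises the pitch 100x
# and shortens the clip by the same factor.
wav_bytes = sonify(samples, SENSOR_RATE_HZ * SPEEDUP)
audible_hz = SIGNAL_HZ * SPEEDUP
clip_seconds = DURATION_S / SPEEDUP
```

With these assumed numbers, an inaudible 2 Hz vibration recorded over ten minutes becomes a 200 Hz tone lasting six seconds. The samples themselves are unchanged; only the rate at which they are played back differs.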
The video and audio may have been sent in the same batch (InSight stores its data and usually uploads it to an orbiting satellite once or twice a day), so they may have been received 'simultaneously'.