Broadly speaking, you don't absolutely need camera orientation or GPS data. The algorithms (structure-from-motion) can in principle fit the best imputed UAV location in 3D space plus the camera orientation for each image while fitting the best possible 3D model of the surface, generating orthophotos and a DEM as a byproduct. This is similar to the way panorama stitchers (including Microsoft ICE, mentioned in another answer) glue 2D images together without a priori information about their relative placement, just in 3D.
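To make that concrete, here's a minimal sketch (assuming Python with OpenCV; the file names are placeholders) of the pairwise feature matching that sits at the bottom of such a pipeline, recovering correspondences between overlapping frames with no pose or GPS metadata at all:

```python
# Sketch of metadata-free feature matching between two overlapping frames,
# assuming Python + OpenCV; file names are placeholders.
import cv2

img_a = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("frame_002.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=4000)            # detector + descriptor
kp_a, des_a = orb.detectAndCompute(img_a, None)
kp_b, des_b = orb.detectAndCompute(img_b, None)

# Brute-force Hamming matching with Lowe's ratio test to drop ambiguous hits.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = matcher.knnMatch(des_a, des_b, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "tentative correspondences, found with no metadata at all")
```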
UAV GPS coordinates, alone or together with camera orientation info, can provide valuable input to the 3D stitching process. Some implementations actively use all of that information; others use it primarily to georeference the resulting output. I would expect any such additional information to increase the speed and accuracy of the 3D model creation, and in particular the amount of noise/distortion/motion/lighting change the feature-fitting process can handle without giving up, though I haven't seen an empirical discussion of that.
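As a toy illustration of both roles (this is not a real bundle adjustment, and all the numbers are invented): image matching gives precise *relative* camera offsets, GPS gives sloppy *absolute* fixes, and a least-squares solver can blend the two.

```python
# Toy least-squares fit (not a real bundle adjustment; numbers invented).
import numpy as np
from scipy.optimize import least_squares

true_pos = np.array([0.0, 10.0, 20.0, 30.0])            # metres along track
rng = np.random.default_rng(0)
rel_obs = np.diff(true_pos) + rng.normal(0, 0.05, 3)    # from image matching
gps_obs = true_pos + rng.normal(0, 3.0, 4)              # non-RTK GPS fixes

def residuals(pos):
    r_rel = (np.diff(pos) - rel_obs) / 0.05   # weight by expected noise
    r_gps = (pos - gps_obs) / 3.0
    return np.concatenate([r_rel, r_gps])

est = least_squares(residuals, x0=gps_obs).x
print("GPS alone:", gps_obs.round(2))
print("blended:  ", est.round(2))
# Drop r_gps and the relative offsets pin down only the *shape* of the
# trajectory, not where it sits on the Earth: that is the georeferencing role.
```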
There is significant imprecision (and inaccuracy) in non-RTK/PPK GPS data, plus lens distortion in individual images, which together make it impossible to rely on the recorded GPS and camera orientation data for each image alone and get a "perfect match". A degree of image-data fitting (finding overlapping features) is always needed as a correction, so it's a question of degree rather than of absolutes.
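For a feel for the lens side of that, here's a sketch of the common Brown radial distortion model; the coefficients are illustrative, not measured from any particular lens.

```python
# Brown radial distortion sketch; k1, k2 are illustrative, not from any lens.
import numpy as np

def distort(xy, k1=-0.12, k2=0.03):
    """Apply radial distortion to normalized (unit-focal-length) coords."""
    r2 = np.sum(xy**2, axis=-1, keepdims=True)
    return xy * (1 + k1 * r2 + k2 * r2**2)

pt = np.array([0.8, 0.6])                  # a feature near the frame corner
shift = distort(pt) - pt
print("shift in pixels at f = 3600 px:", (shift * 3600).round(1))
# Tens to hundreds of pixels of displacement is why metadata alone can't
# place features exactly; the feature-fitting step absorbs the residual.
```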
For an example of how this can be approached, see the workflows offered by MapsMadeEasy. Their DJI workflow uses DJI-specific camera orientation EXIF tags as well as UAV GPS location tags. Their "classic" workflow does not require or use the camera orientation tags. And their "flat map" workflow uses the GPS and camera orientation tags but does no feature matching/image-data fitting at all; it simply trusts the tags.
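If you want to see what the DJI workflow has to work with, the gimbal angles live in an XMP block near the start of each JPEG. This sketch pulls them out with a regex; the tag names come from DJI's `drone-dji` XMP namespace as I've seen it on DJI images, but verify them against your own files, since they vary by model and firmware.

```python
# Assumed tag names from DJI's drone-dji XMP namespace; check against your
# own files, as they vary with model and firmware.
import re

def dji_gimbal_angles(path):
    with open(path, "rb") as f:
        head = f.read(200_000)          # the XMP block sits near the start
    angles = {}
    for tag in ("GimbalRollDegree", "GimbalPitchDegree", "GimbalYawDegree"):
        m = re.search(rb'drone-dji:%s="([-+0-9.]+)"' % tag.encode(), head)
        if m:
            angles[tag] = float(m.group(1).decode())
    return angles

print(dji_gimbal_angles("DJI_0001.JPG"))  # placeholder file name
```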
If you have access to a DJI drone in particular, you could do a test run over the kind of environment you want your homebrew system to handle with each of these three workflows, plus the Agisoft workflow you found, and compare the results.
Finally, to your question of "what's the use of ... in photogrammetry": setting aside the issue of stitching multiple photos together, my question and answer here deal with the math of how those specific variables translate into georeferencing a single image.
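As a compact version of that math, here's a sketch that casts a ray from the camera centre through a pixel and intersects it with a flat ground plane. The axis and angle conventions (x east, y north, z up; compass yaw clockwise from north; pitch 0 = horizontal, -90 = nadir) are my assumptions, so adapt them to your rig.

```python
# Single-image georeferencing sketch: pixel -> ray -> flat-ground intersection.
# Conventions assumed: world x east, y north, z up; compass yaw; pitch
# 0 = horizontal, -90 = nadir. All example numbers are invented.
import numpy as np

def pixel_to_ground(px, py, w, h, f_px, cam_xyz, yaw_deg, pitch_deg):
    # Ray in camera frame: +x right, +y down, +z along the optical axis.
    r = np.array([px - w / 2.0, py - h / 2.0, f_px])
    p, yw = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0],                       # pitch about camera x
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p), np.cos(p)]])
    R0 = np.array([[1, 0, 0],                       # camera->world when level
                   [0, 0, 1],                       # and looking north
                   [0, -1, 0]])
    Rz = np.array([[np.cos(yw), np.sin(yw), 0],     # compass yaw (clockwise)
                   [-np.sin(yw), np.cos(yw), 0],
                   [0, 0, 1]])
    ray_w = Rz @ R0 @ Rx @ r         # ray direction in world coordinates
    t = -cam_xyz[2] / ray_w[2]       # hit z = 0 plane (ray must point down)
    return cam_xyz + t * ray_w

# Nadir shot from 100 m AGL: the image centre maps to the point straight below.
cam = np.array([500.0, 800.0, 100.0])    # local metres; z is height above ground
print(pixel_to_ground(2000, 1500, 4000, 3000, 3600, cam,
                      yaw_deg=45, pitch_deg=-90))   # -> [500. 800. 0.]
```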