5

I'm starting the process of importing hundreds of thousands of photos into a new DAM system that my company purchased. There are 46,000 RAW files (mostly .cr2).

We don't need the RAW files anymore. But we don't want to delete them if there isn't a corresponding .jpg file.

Is there some way (application, script, etc.) to identify all of the RAW files that have a corresponding .jpg and then delete the RAW files?

That would save probably hundreds of hours of work and free up massive amounts of storage space.

Alaska Man
  • 3,608
  • 10
  • 17
  • 7
  • 8
    "We don't need the RAW files anymore." – Famous last words. – xiota Dec 14 '20 at 22:54
  • 1
    External drives are so cheap nowadays... move them, do not delete them. – Rafael Dec 14 '20 at 23:16
  • A way to accomplish your objective is to retain all the files and not delete anything. This also mitigates a likely class of problems in the future. 46,000 RAW files is only a handful of terabytes and will fit on a couple of hundred dollars' worth of hard disk storage. A likely reason there are no good tools dedicated to your goal is that experience suggests it is a bad idea in the long term. Internet browser history aside, there are very few circumstances where “I wish I had deleted this file” occurs in the future. Good luck. – Bob Macaroni McStevens Dec 15 '20 at 20:07
  • With my D7Mk2 I get for that file number about 900GB of storage used... doesn't strike me as a good idea deleting the original images, if you think the jpg are worth keeping. – planetmaker Dec 15 '20 at 21:46
  • 3
    Are you SURE that a file 1234.CR2 is the corresponding raw image to 1234.jpg? My Canon D7Mk2 counts images up to 9999 and then starts over, so the filenames are not unique! For 43k raw images taken by one person with one camera, the filenames WILL repeat without the files being identical (unless the files themselves are also named by date or another scheme). – planetmaker Dec 15 '20 at 21:56
  • Disagreeing with some comments above. Deleting duplicates is useful: it avoids repeating tasks and creating discrepancies that raise questions later (why is the CR2 rated 4 stars when the JPEG only has 2?). If you don't delete, then you set aside, but you still need some way to identify the duplicates. – xenoid Dec 17 '20 at 14:44
  • External drives are so cheap nowadays... Remember, one person's "cheap" is another person's budget for food for the week, month, or even longer. – End Anti-Semitic Hate Dec 22 '20 at 04:29

4 Answers

2

This is a mistake. Disk space is cheap. A raw image is only about twice the size of a good jpeg image. For my Nikon D7100 the comparison is about 30MB vs 15 MB.

Here's what is going to happen: Marketing is going to get a JPEG. They are going to say, "That sky isn't blue enough. Let's saturate it more." And they edit it. And lo and behold, because you're mapping 8 bits of information into 8 bits of information, there are rounding steps, and the sky is banded or becomes mottled. And that expensive model's flawless skin is now pixelated and looks like it's made from coarse sandpaper.

Back into Photoshop. Mask the sky. Introduce noise into the saturation channel. Now increase saturation. OK, it worked this time. But it took 15 minutes of an expensive person's time. (Good Photoshop techs don't come cheap.) Or worse, they just blur the sky. No bands, but it loses something. Cloud edges don't pop anymore.

Never throw information away.

46,000 images at 30 MB each would be 1.38 TB. Buy a pair of enterprise quality 2 TB drives, and mirror them. You're set up for a few years.

A larger problem is keeping the versioning in sync. The JPEG image should show up in your system as being derived from the raw master, and keywords applied to the master should propagate to the JPEG. Whether you can do this depends on the DAM software you got.

Tips: You need unique IDs for images in the system.

Look at using exiftool and using metadata to rename images. I would suggest naming them

OriginalCreatedDateTime.hundredths_CameraMake-SerialNumber

So 2020-01-11_10:25:15.72_Canon-1127341.cr2

This guarantees you a unique number even if you are a local newspaper with 11 Canon cameras on staff. Note: Use a naming scheme that does not include spaces or characters that have special meaning to various operating systems. Avoid /@3&<>!?* at least.
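
As a rough illustration of the naming scheme above, here is a minimal bash sketch that assembles such a name from metadata values that have already been read from the file (for example with exiftool). The function name `build_image_name` and its argument order are made up for this example. Note one deliberate difference from the sample name above: the sketch writes the time with hyphens instead of colons, on the assumption that colons count among the characters with special meaning on some operating systems.

```shell
#!/bin/bash
# Hypothetical helper: build a DAM-friendly file name from metadata values
# that were already extracted from the image (e.g. with exiftool).
# Arguments: "YYYY:MM:DD HH:MM:SS" hundredths make serial extension
build_image_name() {
    local datetime=$1 subsec=$2 make=$3 serial=$4 ext=$5
    local date_part=${datetime%% *}   # e.g. "2020:01:11"
    local time_part=${datetime#* }    # e.g. "10:25:15"
    date_part=${date_part//:/-}       # "2020-01-11"
    time_part=${time_part//:/-}       # "10-25-15" (assumption: avoid colons)
    # Drop anything that is not a letter or digit from make and serial.
    make=${make//[^[:alnum:]]/}
    serial=${serial//[^[:alnum:]]/}
    # Lower-case the extension with tr so this also works in old bash versions.
    ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
    printf '%s_%s.%s_%s-%s.%s\n' \
        "$date_part" "$time_part" "$subsec" "$make" "$serial" "$ext"
}

build_image_name "2020:01:11 10:25:15" "72" "Canon" "1127341" "CR2"
```

How the values are read from each file, and the actual renaming, are left out here; test any renaming on a copy of a few files first.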

Note that this fails big time with scanned images. Scanned images in DAMs are a difficult proposition. You need to run a salvage operation for metadata.

Your DAM should be set to write this ID into any image on export as a keyword. That way, two years from now, when the Marketing department says, "We need a 3000 pixel version of this image for a billboard instead of the 256 pixel version used on our mobile website," you can actually find it. (Yes, this happens. I'm doing it now for my website. For 2000 images.)

Sherwood Botsford
  • 1,768
  • 14
  • 23
0

Some clarification needed

First, three questions so that I can give a more complete answer:

  1. What operating system (Windows, OS X, Linux) will you run these scripts on?
  2. Are the JPG files using the same name as the RAW files, with just a different suffix? If not, is there another way to link a RAW file to a JPEG file?
  3. Is there a fixed folder structure/hierarchy?

Example Bash Script

You can create a simple script that will iterate over all RAW files and check if there is a JPEG variant and delete if so. Depending on the above answers I can provide you a script for that.

If those files are within the same folder and easily matched it's a very quick and easy script that will execute in minutes. If those files are organized in folders it will require some more extensive find commands that will take a bit more time for the script to execute.

In bash, for example, this will work if:

  • all files are in one directory, put script in that directory
  • .CR2 and .JPEG extension are the only differences between the filesets
    for f in *.CR2 ; do [ -e "${f%.CR2}.JPEG" ] && rm "$f"; done

Use with caution, because the rm command at the end will delete files!
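
A more cautious variant of the same idea is to list the candidates first and delete nothing. The sketch below is only an illustration under the same assumptions (all files in one directory, same base name); the function name `list_raw_with_jpeg` is made up, and it also accepts the common spellings of the JPEG extension.

```shell
#!/bin/bash
# Dry run: print every .CR2/.cr2 file in a directory that has a same-named
# JPEG next to it. Nothing is deleted.
list_raw_with_jpeg() {
    local dir=$1 raw base ext
    for raw in "$dir"/*.CR2 "$dir"/*.cr2; do
        [ -e "$raw" ] || continue      # skip glob patterns that matched nothing
        base=${raw%.*}                 # path without the raw extension
        for ext in jpg JPG jpeg JPEG; do
            if [ -e "$base.$ext" ]; then
                printf '%s\n' "$raw"
                break
            fi
        done
    done
}

# Example (hypothetical path): review the list before deleting anything.
# list_raw_with_jpeg /path/to/photos > candidates.txt
```

Once the list has been checked by hand, the files in it can be moved aside or removed in a separate step.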

Hans Cappelle
  • 595
  • 2
  • 9
  • I'm using a Mac (the newest operating system). The JPGs and RAW files have the same names, although they might be in different folders. – user1074239 Dec 15 '20 at 20:10
0

If you are confident that each raw file has a JPEG equivalent, you could use a file explorer, sort the files by file name, and then delete all the images with the raw extension.

ChrisA
  • 1
-1

As far as I can tell, there are no good answers in the question linked above:

  • The accepted answer only checks raw files in the same directory.
  • Those that search the raw file elsewhere will possibly erase the wrong files, because they make the wrong assumption that image file names are unique(*).

To be really safe:

  • If the file time stamps have been preserved, you can check them, allowing at least a 2-second fuzz margin, since the CR2 and JPEG have different stamps and the FAT filesystem on the camera cards only keeps time to 2 seconds of accuracy.
  • Otherwise you would have to check the EXIF data of the like-named JPEG and RAW files and see if they match.
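
The time stamp comparison with a fuzz margin can be sketched in a few lines of bash. This is only an illustration: the function name `within_fuzz` is made up, and it expects the two modification times as epoch seconds. How you obtain those depends on the platform (GNU `stat -c %Y file` on Linux, BSD `stat -f %m file` on macOS).

```shell
#!/bin/bash
# Hypothetical helper: succeed if two epoch timestamps differ by at most
# "fuzz" seconds (default 2, the FAT timestamp resolution).
within_fuzz() {
    local a=$1 b=$2 fuzz=${3:-2}
    local diff=$(( a - b ))
    if (( diff < 0 )); then
        diff=$(( -diff ))              # absolute difference
    fi
    (( diff <= fuzz ))                 # exit status is the answer
}

# Example with two made-up timestamps one second apart.
if within_fuzz 1607986495 1607986496; then
    echo "timestamps match within the margin"
fi
```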

I do have a similar script based on file time stamps (it reconciles a CR2 with its JPEG counterpart elsewhere in the directory tree)

Quickly whipped up script using file timestamps, seems to work on my files, use at your own risk:

    #! /bin/bash
    # Requires bash 4+ (globstar) and GNU stat, date, and find

    # Change these to your liking or set them from parameters
    jpegDir=/path/to/jpegDir   # Top directory for JPEG
    rawDir=/path/to/rawDir     # Top directory for Raw
    timeFuzzSeconds=10         # Max time difference between JPEG and raw

    shopt -s extglob
    shopt -s globstar

    # For all JPEG, find if a like-named raw file exists with a similar timestamp, and delete it
    for jpg in "$jpegDir"/**/*.@(jpg|JPG|jpeg|JPEG)
    do
        jpgBase=${jpg##*/}
        rawRootName=${jpgBase%.*}
        jpgTime=$(stat -c "%Y" "$jpg")
        rawMinTime=$((jpgTime-timeFuzzSeconds))
        rawMaxTime=$((jpgTime+timeFuzzSeconds))

        printf "Searching %s between %s and %s\n" "$rawRootName" "$(date -d @$rawMinTime '+%F %T')" "$(date -d @$rawMaxTime '+%F %T')"

        # Replace "-print" at the end by "-delete" when you are confident that it works as expected
        # "-print" will only show you the files without touching them
        find "$rawDir" -newermt @$rawMinTime ! -newermt @$rawMaxTime -regextype egrep -regex '.*/'"$rawRootName"'\.(CR2|NEF|DNG)' -print
    done

(*) The counter used for names in cameras rolls over at 9999, so the assumption doesn't hold for a large collection from a single camera, and even less so if there are several cameras.
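
To see whether the rollover actually affects a given collection, one option is to list the raw file names that occur more than once in the directory tree. The sketch below is an illustration only: the function name `find_repeated_raw_names` is made up, it looks at .cr2 files only, and it assumes file names do not contain newlines.

```shell
#!/bin/bash
# Print raw file names that occur more than once anywhere under a directory.
# Any name listed here cannot be matched to a JPEG by name alone.
find_repeated_raw_names() {
    local top=$1
    find "$top" -type f -iname '*.cr2' \
        | sed 's|.*/||' \
        | sort \
        | uniq -d
}

# Example (hypothetical path):
# find_repeated_raw_names /path/to/rawDir
```

An empty result suggests that name-based matching is at least unambiguous on the raw side; the same check can be repeated for the JPEG tree.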

xenoid
  • 21,297
  • 1
  • 28
  • 62