I have some failed experiment result files, and their contents are exactly a single \n (newline).
I would like to list them all (perhaps with something like find or grep), to know what are the files and later delete them.
Create a reference file outside of the search path (it will be . in the example):
echo >/tmp/reference
Now we have a known file identical to what you're looking for. Then compare all regular files under the search path (. here) to the reference file:
find . -type f -size 1c -exec cmp -s -- /tmp/reference {} \; -print
-size 1c is not necessary and can be omitted; it's only there to improve performance. It's a quick preliminary test that rejects files of the wrong size without spawning additional processes. Relatively costly cmp … processes will be created only for files of the right size.
-s makes cmp itself silent. We don't need its output, just the exit status.
-- is explained in "What does "--" (double-dash) mean?". It's not strictly needed in our example case, where the reference file is specified as /tmp/reference and the search path is a plain dot (.). I used -- in case someone carelessly chooses path(s) that would otherwise make cmp misbehave or fail; with -- it should just work.
-exec is used as a test, it will succeed if and only if cmp returns exit status zero; and for a tested file this will happen if the file is identical to /tmp/reference. This way, find will give you the pathnames of files that are identical to the reference file.
The method can be used to find files with any fixed content; you just need a reference file with the exact content (and don't forget to adjust -size … if you use it; -size "$(</tmp/reference wc -c)c" will be handy). In our specific case a simple echo was used to create the file because it prints one newline character, which is exactly the content you want to find.
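For example, to instead find files whose entire content is the line "hello", the same recipe works (a sketch; "hello" is just illustrative content, and the -size value is derived with the wc trick above, assuming a wc that prints a bare number on stdin, as GNU wc does):
# create a reference file with the exact content to match
printf 'hello\n' >/tmp/reference
find . -type f -size "$(</tmp/reference wc -c)c" -exec cmp -s -- /tmp/reference {} \; -print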
To make find attempt to delete each matching file, use -delete (or -exec rm -- {} +, but not both) after -print.
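Putting the pieces together for the single-newline case, a complete sketch that prints each match and then deletes it:
echo >/tmp/reference
find . -type f -size 1c -exec cmp -s -- /tmp/reference {} \; -print -delete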
Search for files that are exactly one byte, compare them to the known value, and print and/or delete them if they match:
find /path/to/files -type f -size 1c -exec sh -c 'printf "\n" | cmp -s -- - "$1"' _ {} \; -print
Optionally append -delete to delete, and remove -print if you want a silent run.
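For instance, a silent delete-only run would look like this (a sketch; /path/to/files is a placeholder):
find /path/to/files -type f -size 1c -exec sh -c 'printf "\n" | cmp -s -- - "$1"' _ {} \; -delete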
With GNU grep, you can use -z to treat the entire file as a single line (-z makes grep use NUL as the line terminator, so as long as your files don't actually contain NUL, \0, it has the effect of treating the whole file as a single line). If we combine that with -l to just print the file name and -P for PCREs to use \n, we can search for "lines" that only have a single \n and nothing else:
grep -lPz '^\n$' *
For example, given these three files:
printf 'foo\n' > good_file_1
printf '\n\n\n\n' > good_file_2
printf '\n' > bad_file
Running the grep above gives:
$ grep -lPz '^\n$' *
bad_file
You can also make it recursive, using the bash globstar option (from man bash):
globstar
If set, the pattern ** used in a pathname expansion context will match all files and zero or more directories and subdirectories. If the pattern is followed by a /, only directories and subdirectories match.
So, for example, in this situation:
$ mkdir -p ./some/long/path/here/
$ cp bad_file some/long/path/here/bad_file_2
$ tree
.
├── bad_file
├── good_file_1
├── good_file_2
└── some
└── long
└── path
└── here
└── bad_file_2
5 directories, 4 files
Enabling globstar and running grep on **/* will find both bad files (I am redirecting standard error because grep complains about being given directories to search instead of files; such errors are expected and can safely be ignored):
$ grep -lPz '^\n$' **/* 2>/dev/null
bad_file
some/long/path/here/bad_file_2
Alternatively, use find to only search files:
$ find . -type f -exec grep -lPz '^\n$' {} +
./some/long/path/here/bad_file_2
./bad_file
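If you want find to delete the matches as well, grep -q works as a test; here is a sketch that also borrows the -size 1c prefilter from the cmp-based answer:
find . -type f -size 1c -exec grep -qPz '^\n$' {} \; -print -delete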
With zsh:
zmodload zsh/mapfile
print -rC1 -- **/*(ND.L1e[$' [[ $mapfile[$REPLY] = "\n" ]] '])
print -rC1: prints raw, on 1 Column.
N: nullglob: don't complain if there's no match, and pass an empty list to print instead.
D: dotglob: don't skip hidden files.
.: regular files only (like -type f in find or file/f in rawhide).
L1: of Length 1.
e[code]: runs the code on the file to further determine whether it's a match.
$mapfile[$REPLY]: expands to the contents of the file (whose path is in $REPLY).
POSIXly, and avoiding spawning one or more processes per file (assuming a sh implementation where read, [ and printf are builtin, which is usually the case):
find . -type f -size 1c -exec sh -c '
for file do
IFS= read -r line < "$file" && [ -z "$line" ] && printf "%s\n" "$file"
done' sh {} +
(Note that, unlike with the zsh approach above, the list is not sorted.)
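To delete instead of list, the same loop can call rm on each match (a sketch, under the same assumptions about sh builtins):
find . -type f -size 1c -exec sh -c '
  for file do
    IFS= read -r line < "$file" && [ -z "$line" ] && rm -- "$file"
  done' sh {} +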
With rawhide (list not sorted either):
rh -e 'file && size == 1 && "
".body' .
With grep implementations that can cope with non-text files (NUL bytes and non-delimited lines at least) such as GNU grep in the C locale, you can also do:
LC_ALL=C find . -type f -size 1c -exec grep -l '^$' {} +
find . -size 1c -exec sh -c '[ -z "$(< $1)" ]' sh '{}' ';' -print
This looks for files of size exactly one byte where the result of reading the file in a shell is empty; sh strips trailing newlines from command substitutions, so a file containing only a newline reads as an empty string.
$(<...) is a ksh operator, not a sh operator. In ksh88, $1 should be quoted.
– Stéphane Chazelas
Jan 05 '24 at 09:27
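Taking the comment into account, a corrected sketch that quotes "$1" and explicitly invokes a shell that supports the $(<file) operator (bash here, assuming it's installed; ksh would work too):
find . -size 1c -exec bash -c '[ -z "$(<"$1")" ]' bash {} \; -print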
Just to present a novel alternative, in FreeBSD, this could be done as:
find . -maxdepth 1 -size 1c \
-exec md5 -q '--check=68b329da9893e34099c7d8ad5cb9c940 {} >/dev/null' \; -print
However, an md5 hash, even of a small file, is likely somewhat more expensive than a simple cmp.
I tried to find a way to phrase the cmp method using bash's process substitution (and BSD find), but it's a bit clunky:
find . -maxdepth 1 -size 1c -exec bash -c 'cmp -s "{}" <(echo)' \; -print
Again, likely slightly more expensive to create the newline file multiple times than Kamil's method of creating the reference file once, and comparing against it repeatedly.
md5 makes fewer system calls, it could be faster. Especially if it can check multiple files per invocation with find -exec md5 {} +. (BTW, GNU Coreutils md5sum doesn't have an option to supply a hash on the command line to check against. But you could get it to print the hashes for multiple files and grep that.) Hrm, a duplicate-file finder could probably be best, if there's one that lets you look for duplicates only between two sets, not within, and one set can be the reference file alone. Or perl could be fast at this, with good binary file support and no fork/exec.
– Peter Cordes
Jan 06 '24 at 06:16
Beware that embedding {} in the shell code as above is a command injection vulnerability. Consider a file created with echo > '$(reboot)', for instance.
– Stéphane Chazelas
Jan 06 '24 at 14:11
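A safer phrasing of the bash variant passes the pathname as a positional parameter instead of embedding {} in the shell code, avoiding the injection the comment above illustrates (a sketch):
find . -maxdepth 1 -size 1c -exec bash -c 'cmp -s -- "$1" <(echo)' bash {} \; -print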
-exec md5 -q '--check=68b329da9893e34099c7d8ad5cb9c940 {} >/dev/null' \; doesn't make sense. You'd want -exec md5 -q --check=68b329da9893e34099c7d8ad5cb9c940 {} \;. You'd want to discard md5's output, but to do that without discarding that of find, you'd need to invoke a shell like: -exec sh -c 'exec md5sum --check=... "$1" > /dev/null' sh {} \;
– Stéphane Chazelas
Jan 06 '24 at 14:22
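Applying that correction, the FreeBSD example would become something like this (a sketch; it assumes this md5 supports --check and returns a nonzero exit status on mismatch):
find . -maxdepth 1 -size 1c -exec sh -c 'exec md5 -q --check=68b329da9893e34099c7d8ad5cb9c940 "$1" >/dev/null' sh {} \; -print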
-size 1c before the exec would probably make the whole thing faster, as it would not need to spawn cmp for every file (though of course it would need to be adapted if the target file has a different size).
– jcaron
Jan 05 '24 at 18:09
You could use stat or a size=$(find "$temp" -printf ...) as an arg for -size. Or if you take the contents as a string arg, then size="${#1}c" or something, and printf "%s" "$1" > "$tmp" (with "$tmp" from mktemp or something).
– Peter Cordes
Jan 06 '24 at 06:08