
With the file command I need to verify whether many files are ASCII or some other format.

Sometimes the file command gives me:

  file1: ASCII English text

And sometimes it gives a different answer:

  file2: Non-ISO extended-ASCII English text, with very long lines

I am really not sure whether there are other answers with a different syntax.

My question is:

I wrote the following ksh line to verify whether a file is ASCII, but I am not sure whether it is the optimal way to verify ASCII format:

   [[ $(file "$some_file" | grep -c ASCII) = 1 ]] && print "you have ascii file for sure"

If anyone has another suggestion for verifying ASCII format reliably, I will be very glad to see it.

jennifer
  • ASCII? In the days of internet and Unicode? You must be joking. – u1686_grawity Oct 26 '10 at 22:23
  • You do realize that file is a heuristic guess and not a guarantee, right? yes | head -c $((2**20)) > blah; dd if=/dev/urandom bs=1 count=1024 >> blah; file blah says blah: ASCII text even though it's not. – ephemient Oct 27 '10 at 19:07
  • Yes, I understand, but what should I do if I want to select files by type? What is the best thing to do? Any ideas? – jennifer Oct 27 '10 at 20:21

3 Answers

if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
    echo "file contains non-ascii characters"
else
    echo "file contains ascii characters only"
fi
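
The check above can be wrapped in a small function that also tells "could not read the file" apart from "contains non-ASCII bytes". With the plain if/else, an unreadable file is reported as ASCII-only, because grep exits with status 2 on errors and `if` treats that the same as "no match". This is only a sketch: the name is_ascii and the exit-code convention (0 = ASCII, 1 = not ASCII, 2 = error) are assumptions made for illustration.

```shell
#!/bin/sh
# is_ascii FILE (hypothetical helper, not from the original answer)
# Exit status: 0 = only printable ASCII and whitespace bytes,
#              1 = at least one other byte found,
#              2 = file missing/unreadable or grep failed.
is_ascii() {
    [ -f "$1" ] && [ -r "$1" ] || return 2
    # LC_ALL=C makes [:print:] and [:space:] mean the ASCII classes only.
    LC_ALL=C grep -q '[^[:print:][:space:]]' -- "$1"
    case $? in
        0) return 1 ;;   # grep matched a byte outside the allowed set
        1) return 0 ;;   # no such byte anywhere in the file
        *) return 2 ;;   # grep reported an error
    esac
}

# Usage: report on every file named on the command line.
for f in "$@"; do
    if is_ascii "$f"; then
        echo "$f: ASCII only"
    else
        echo "$f: not ASCII (or unreadable)"
    fi
done
```

Like the original command, this only says whether the bytes are consistent with ASCII; it says nothing about whether the file is safe to edit.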
ephemient
  • Hi ephemient, please explain why LC_ALL=C comes before the grep command. – jennifer Oct 26 '10 at 22:43
  • LC_ALL=C forces grep to treat [[:print:]] as the "printable ASCII" character class. Otherwise it means "printable in the current locale", which may be non-ASCII. For example, most Linux boxes are set up with UTF-8 locales, in which case [[:print:]] would match non-ASCII character sequences that are valid UTF-8 printable characters. – ephemient Oct 26 '10 at 22:49
  • @jennifer: name=value command is the syntax for temporarily setting an environment variable, in this case LC_ALL, for a single command. Setting locale to C makes sure [[:print:]] only matches ASCII characters (and not accented characters from your language). – u1686_grawity Oct 26 '10 at 22:50
  • Why do I get "file contains non-ascii characters" for /etc/hosts? As you know, the hosts file is an ASCII file. – jennifer Oct 26 '10 at 22:51
  • @jennifer: Fixed. Probably included a tab or something like that; I forgot [[:print:]] is [[:graph:] ] not [[:graph:][:space:]]. – ephemient Oct 26 '10 at 22:55
  • @ephemient Hi, I checked some files and found that your code returns "file contains non-ascii characters" for files where the file command reports: Non-ISO extended-ASCII English text, with very long lines. How can this be supported? – jennifer Oct 27 '10 at 08:41
  • @ephemient Hi again. Following up on my previous remark: from my point of view, a file reported as "Non-ISO extended-ASCII English text" is also an ASCII file. Please help me update your code to support this. – jennifer Oct 27 '10 at 10:03
  • @jennifer: "Non-ISO extended-ASCII" is not any specific encoding at all. I don't understand what you want to happen – it's clearly not ASCII. Note that there are many different ISO-8859-* and non-ISO variants of extended ASCII character sets that file does not differentiate between, and any attempt to determine the character set is (at best) a guess. – ephemient Oct 27 '10 at 13:19
  • @ephemient Hi, but in vi it looks like a simple, ordinary text file with text and remarks. What's the difference? – jennifer Oct 27 '10 at 13:24
  • I really don't understand :-( Why do I get Non-ISO extended-ASCII for an ordinary text file with simple text lines? Maybe a bug in the file command? – jennifer Oct 27 '10 at 13:26
  • @jennifer: What does perl -ne 'END {print join($", sort {$a <=> $b} keys %c), $/} undef @c{map ord, split //}' say for this file? Any values (other than 9, 10, or 13) below 32 or above 126? What is the output of locale? – ephemient Oct 27 '10 at 13:31
  • hi @ephemient the output:9 10 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 91 93 94 95 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 124 126 147 148 – jennifer Oct 27 '10 at 13:41
  • perl -ne 'END {print join($", sort {$a <=> $b} keys %c), $/} undef @c{map ord, split //}' file_test 9 10 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 91 93 94 95 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 124 126 147 148 – jennifer Oct 27 '10 at 13:42
  • @ephemient Hi, do you have any conclusions about the output I sent? – jennifer Oct 27 '10 at 13:59
  • @jennifer: Yes, the 147 and 148 indicate that it's NOT ASCII. What is your locale? – ephemient Oct 27 '10 at 14:14
  • Very strange. This is a configuration file with parameters and values. My goal is to find all configuration files and, if they are ASCII files, update them with sed. – jennifer Oct 27 '10 at 14:20
  • So maybe I need to treat Non-ISO extended-ASCII configuration files as ASCII as well. – jennifer Oct 27 '10 at 14:21
  • @ephemient What do you think about the following: maybe I use the simple file command to check for ASCII or Non-ISO extended-ASCII and then edit those files?

    [[ $(file "$some_file" | grep -c ASCII) = 1 ]] || [[ $(file "$some_file" | grep -c "Non-ISO extended-ASCII") = 1 ]] && print "you have ascii file for sure"

    – jennifer Oct 27 '10 at 14:32
  • @ephemient Hi, do you agree with my solution? – jennifer Oct 27 '10 at 15:00
  • @jennifer: You haven't answered. What is your locale? In any case, if you are trying to detect whether you should sed a file or not by a skim of its contents, I believe your approach is fundamentally flawed. No, I do not agree at all. – ephemient Oct 27 '10 at 19:00
  • Sorry, but I don't understand the "locale" question. Did you mean whether the machine is Linux or Solaris? If so, my machine is a Linux machine. – jennifer Oct 27 '10 at 20:07
  • About what you said, that my approach is not very good: OK, but what is the other option? Do you have another idea? – jennifer Oct 27 '10 at 20:09
  • As I said, my goal is to edit files. Some files are binary, some are configuration files, and some are applications. I only need to edit the configuration files (ASCII). Do you have another suggestion? – jennifer Oct 27 '10 at 20:11
  • What is the output of the locale command? § If the list of files were purely advisory, then a heuristic seems okay, but if you're actually going to be mangling them, it would be better to keep a registry of which files are to be affected. Linux package managers like dpkg and rpm keep track of configuration files; you can tie into their system, or build your own. – ephemient Oct 27 '10 at 20:13
  • the output: LANG=en_US LC_CTYPE="en_US" LC_NUMERIC="en_US" LC_TIME="en_US" LC_COLLATE="en_US" LC_MONETARY="en_US" LC_MESSAGES="en_US" LC_PAPER="en_US" LC_NAME="en_US" LC_ADDRESS="en_US" LC_TELEPHONE="en_US" LC_MEASUREMENT="en_US" LC_IDENTIFICATION="en_US" LC_ALL= – jennifer Oct 27 '10 at 20:26
  • and on my solaris machine: locale LANG=C LC_CTYPE="C" LC_NUMERIC="C" LC_TIME="C" LC_COLLATE="C" LC_MONETARY="C" LC_MESSAGES="C" LC_ALL= – jennifer Oct 27 '10 at 20:27
  • @jennifer: Take the LC_ALL=C part out and the command I gave will recognize those files on your Linux machine — but keep in mind that there will be unavoidable false positives, e.g. you can't tell the difference between ISO-8859-1 and ISO-8859-15 encodings, nevermind foreign encodings like SJIS or GBK. – ephemient Oct 27 '10 at 20:38
  • So the final syntax to verify that the file is ASCII is: if grep -q '[^[:print:][:space:]]' file; then... (Am I right?) – jennifer Oct 27 '10 at 20:41
  • @ephemient Please give your opinion on my last remark :-) – jennifer Oct 27 '10 at 21:24
  • To summarize this issue: you are saying that the (grep -q '[^[:print:][:space:]]' file) syntax is safer than using the file command and matching the ASCII string. Am I right? – jennifer Oct 27 '10 at 21:26
  • @jennifer: file doesn't actually look at the whole file; it looks at the beginning, maybe looks at the end, and makes a guess. The grep command actually looks at the whole file, so I believe that this method is safer. However, checking whether the contents of a file are consistent with a particular encoding (ASCII or otherwise) is still pretty meaningless on its own. – ephemient Oct 27 '10 at 21:28
  • OK, I will use your syntax (grep -q '[^[:print:][:space:]]' file) in my code. I hope everything will be OK. – jennifer Oct 27 '10 at 21:31

How about...

if file -ib "$file" | grep -Eqs '^text/plain(;|$)'; then
    echo "It's text/plain."
fi

I don't know how common --mime-type is; if it's standard, use

if file -b --mime-type "$file" | grep -qs '^text/plain$'; then

Alternatively grep -qs '^text/' for any text type.
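
A minimal sketch of how the MIME check might be used to decide which files to touch. It assumes a file(1) that supports --mime-type (not guaranteed on every system, as noted above), and the function name is_text_file is made up for illustration. Since file only guesses from the contents, a match is a hint, not proof.

```shell
#!/bin/sh
# is_text_file FILE (hypothetical helper): succeed when file(1) reports
# a MIME type beginning with "text/". Assumes file supports --mime-type.
is_text_file() {
    mime=$(file -b --mime-type -- "$1") || return 2
    case $mime in
        text/*) return 0 ;;   # text/plain, text/x-shellscript, ...
        *)      return 1 ;;   # anything else, including error messages
    esac
}

# Usage: classify every file named on the command line.
for f in "$@"; do
    if is_text_file "$f"; then
        echo "$f: text"
    else
        echo "$f: skipped"
    fi
done
```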

u1686_grawity

Since you're parsing the output with code, I'd suggest using the -i option on file so it outputs MIME types instead of human-friendly strings. The MIME type output is more regular, and that makes it a little easier to deal with in code.

As for the output types, a look at man file says:

/usr/share/file/magic
    Default list of magic numbers

/usr/share/file/magic.mime
    Default list of magic numbers, used to output  mime types
    when the -i option is specified.

Take a look at those files for all the MIME types it can report to determine which types you'll care about when parsing the output from file. I suspect all you'll care about is that the MIME type starts with text/.
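
Another way to see which MIME types actually occur in a directory tree, before deciding what to match, is to tally them. The sketch below assumes a file(1) that accepts --mime-type; the function name mime_summary and the /etc example path are only illustrative.

```shell
#!/bin/sh
# mime_summary DIR (hypothetical helper): count how often each MIME type
# reported by file(1) occurs among the regular files under DIR,
# most frequent type first.
mime_summary() {
    find "$1" -type f -exec file -b --mime-type -- {} + |
        sort | uniq -c | sort -rn
}

# Example: mime_summary /etc
```

Each output line is a count followed by a MIME type, which makes it easy to spot the text/ types worth handling.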

Ian C.