
I just created a new web server on a Linux box with nginx. It seems to be working correctly for most content.

I just converted a site which used to be on a Windows Server box, and for the most part the content and functionality (HTML and JavaScript) transferred OK. However, I have discovered that nginx seems to have problems with image names (perhaps other files also??) which contain accented characters, for example é, è, ô, etc.

If there were only a handful I could just manually rename the files, but there are hundreds (maybe thousands), which makes a manual process unworkable. Can someone offer a way to easily rename files with these accented characters?

Thanks..RDK

edit: The original conversion of the old site had lots of pages using "windows-1252" and a few using "UTF-8" for the "charset". The latter had issues displaying special characters in page content, usually as the "black diamond ?" symbol. The other seemed to have issues with JavaScript and also some display issues. After a web search on that issue I changed all "charset=" values to "iso-8859-1", which is the default for many browsers, and that corrected all of those issues. But now I have the special characters in file names problem...


2 Answers


One way is to strip the diacritics with Perl's rename, if you need to rename the files to their ASCII equivalents (the expression uses the Text::Undiacritic module from CPAN):

rename -u utf8 '
    BEGIN{use Text::Undiacritic qw(undiacritic)}
    s/.*/undiacritic($&)/e
' éééé.txt 
rename(éééé.txt, eeee.txt)
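
Since there are hundreds of files, it may be safer to dry-run the same rule over a whole image directory first. This is a sketch only: the path and extensions below are examples, and the -n (print, don't rename) switch should be checked against rename --help on your system:

cd /var/www/html/images        # example path -- adjust to wherever the images live
rename -n -u utf8 '
    BEGIN{use Text::Undiacritic qw(undiacritic)}
    s/.*/undiacritic($&)/e
' *.jpg *.png *.gif
# if the proposed names look right, drop -n to rename for real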

Another way is to use the detox utility, which is available as a package on Debian/Ubuntu and other distros.
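
For example (a sketch only; the web-root path is an assumption, and the exact option names are worth checking against man detox on your distro):

detox -v -r --dry-run /var/www/html    # list the renames detox would perform, recursively
detox -v -r /var/www/html              # then actually rename the files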


A last option is this script, based on convmv(1) and translated into English from a French project (forum.ubuntu-fr.org). It is intended to convert file names from a wrong charset to UTF-8. It is not a script of mine (it is by Lapogne71), but it could solve the issue:

#!/bin/bash

VERSION="v0.04"

#---------------------------------------------------------------------------------------
# This script loops the "convmv" utility, which converts file names encoded in
# something other than UTF-8 to UTF-8.
# Restart the script with the ALLCODES argument if no result has been found.
#---------------------------------------------------------------------------------------

# Colors of the text displayed in the shell
RED="\033[1;31m"
NORMAL="\033[0;39m"
BLUE="\033[1;36m"
GREEN="\033[1;32m"

echo
echo -e "$GREEN $0 $NORMAL $VERSION"
echo

echo "----------------------------------------------------------
This script loops the 'convmv' utility, which converts file names
encoded in something other than UTF-8 to UTF-8.
Restart the script with the ALLCODES argument if no result has been found.
----------------------------------------------------------"

# The main loop launches convmv tests to "visually" detect the original encoding.
# We only loop over the iso-8859* and cp* code families, as they are the most
# likely ones (EBCDIC codes have also been removed from the list).
CODES_LIST=" iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 cp437 cp737 cp775 cp850 cp852 cp855 cp856 cp857 cp860 cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp932 cp936 cp949 cp950 cp1250 cp1251 cp1252 cp1253 cp1254 cp1255 cp1256 cp1257 cp1258 "

# Check that the convmv utility is installed
path=$(which convmv 2> /dev/null)
if [ -z "$path" ]; then
    echo -e "$RED ERROR: convmv is not installed, please install it by typing:"
    echo
    echo -e "$BLUE sudo apt-get install convmv "
    echo
    echo -e "$RED ==> program exit"
    echo
    echo -e "$NORMAL"
    exit 1
fi

# To loop over all the codepages supported by convmv, pass the ALLCODES argument
if [ "$1" = "ALLCODES" ]; then
    CODES_LIST=$(convmv --list)
    echo
    echo -e "$RED Check which original encoding seems correct (press 'y' and validate if waiting for display)$NORMAL"
    echo
fi

# Main loop of the program
for CODAGE in $CODES_LIST; do
    echo -e "$BLUE--- Encoding hypothesis: $RED $CODAGE $BLUE---$NORMAL"
    echo
    # echo -e "$RED Press 'y' and validate if no list is displayed $NORMAL"
    convmv -f $CODAGE -t utf-8 -r * 2>&1 | grep -v Perl | grep -v Starting | grep -v notest | grep -v Skipping > /tmp/affichage_convmv.txt
    NOMBRE_FICHIERS=$(cat /tmp/affichage_convmv.txt | wc -l)
    if [ $NOMBRE_FICHIERS -eq 0 ]; then
        echo
        echo -e "$RED No filename to convert " $NORMAL
        echo
        echo -e "$BLUE Exiting program ... $NORMAL"
        echo
        rm /tmp/affichage_convmv.txt 2>/dev/null
        exit 0
    fi

    # sed 's ..  ' source.txt   ==> this removes the first 2 characters from a string
    echo -e $GREEN "Original filenames coded in $CODAGE: " $NORMAL
    # ALTERNATIVE cat /tmp/affichage_convmv.txt | cut -f 2 -d '"' | sed 's ..  '
    cat /tmp/affichage_convmv.txt | cut -f 2 -d '"'
    echo
    echo -e $GREEN "Filenames converted to UTF-8: " $NORMAL
    # ALTERNATIVE cat /tmp/affichage_convmv.txt | cut -f 4 -d '"' | sed 's ..  '
    cat /tmp/affichage_convmv.txt | cut -f 4 -d '"'
    echo

    echo -n -e $GREEN "Found encoding? $RED [N]$NORMAL""o /$RED y$NORMAL""es /$RED q$NORMAL""uit: "
    read confirm
    echo

    # request for file conversion using convmv
    if [ "$confirm" = Y ] || [ "$confirm" = y ]; then
        echo -e "$BLUE Convert filenames now from encoding $CODAGE? $NORMAL"
        echo -e "$BLUE   ==> convmv -f $CODAGE -t utf-8 * --notest $NORMAL"
        echo -n -e $GREEN "Confirm conversion $RED [N]$NORMAL""o /$RED y$NORMAL""es /$RED r$NORMAL""ecursive: "
        read confirm
        echo

        case $confirm in
            Y|y)    convmv -f $CODAGE -t utf-8 * --notest 2>/dev/null
                    echo
                    echo -e "$BLUE File name conversion done... $NORMAL" ;;
            R|r)    convmv -f $CODAGE -t utf-8 * -r --notest 2>/dev/null
                    echo
                    echo -e "$BLUE Recursive file name conversion done... $NORMAL" ;;
            *)      echo -e "$BLUE Exiting program... $NORMAL" ;;
        esac

        echo
        rm /tmp/affichage_convmv.txt 2>/dev/null
        exit 0

    # request for program exit
    elif [ "$confirm" = Q ] || [ "$confirm" = q ]; then
        echo -e "$BLUE Exiting program... $NORMAL"
        echo
        rm /tmp/affichage_convmv.txt 2>/dev/null
        exit 0
    fi
    clear
done
rm /tmp/affichage_convmv.txt 2>/dev/null

  • Hmmm, the original conversion had lots of pages using "windows-1252" and a few using "UTF-8" for the "charset". The latter had issues displaying special characters in page content, usually as the "black diamond ?" symbol. The other seemed to have issues with JavaScript and also some display issues. After a web search on that issue I changed all "charset=" values to "iso-8859-1", which corrected all of those issues. But now I have the special characters in file names problem... I'll update my question with this information. – RDK Feb 19 '23 at 19:48
  • The "black diamond ?" you called is accurately Unicode replacement character – n0099 Feb 20 '23 at 06:12

You need to open each file with Notepad++, Sublime Text, VS Code, or some other text editor that supports switching the character encoding, switch it to UTF-8, and then save the file. If you are editing the files directly on Linux, you might consider using iconv to convert each file to UTF-8 encoding.
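
For example, something along these lines converts the text files in place; it assumes the originals really are windows-1252 and that the site lives under /var/www/html (both assumptions, so test it on a copy first):

find /var/www/html -type f \( -name '*.html' -o -name '*.htm' -o -name '*.js' \) -print0 |
while IFS= read -r -d '' f; do
    # convert the contents from windows-1252 to UTF-8, then replace the original file
    iconv -f WINDOWS-1252 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done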

Then, after you have converted all your text-based files to UTF-8, test nginx again and the characters should display. If not, you can also try adding this line (in nginx.conf or whichever .conf file holds your server configuration):

charset UTF-8;


The files and the web server must use the same charset, so converting everything to UTF-8 is the simplest way to avoid these issues in the future.
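
As a quick sanity check once everything is converted (the URL and file name below are only examples), you can ask nginx what it is actually serving:

curl -sI http://localhost/ | grep -i content-type    # should now report charset=utf-8
curl -I 'http://localhost/images/caf%C3%A9.jpg'      # an accented name, percent-encoded as UTF-8, should return 200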