mass&recursive iconv [convert all to UTF-8]

ntg_net · 20 September 2017

Is it possible to convert all the files in the folder ~/Movies/ to UTF-8, in bulk and recursively?

Whether the encoding is Windows-1253, ISO-8859-7 or Windows-1252, everything should end up as UTF-8.

That is, without me having to pick the source encoding by hand each time, as in:

iconv -f ISO-8859-7 -t UTF-8 sub1.srt > sub1.srt.tmp && mv sub1.srt.tmp sub1.srt

And finally, it should convert all of them to Unix/Linux line endings.

Edit1

I also found this script, but it is still not exactly what I want:

#!/bin/bash
# Enter the input encoding here
FROM_ENCODING="value_here"
# Output encoding (UTF-8)
TO_ENCODING="UTF-8"
# Loop to convert multiple files
for file in *.txt; do
    iconv -f "$FROM_ENCODING" -t "$TO_ENCODING" "$file" -o "${file%.txt}.utf8.converted"
done
exit 0

Edit2

I also found this one, and it will probably do the job:

#!/bin/bash

# Created by LEXO, http://www.lexo.ch
# Version 1.0
#
# This bash script converts all files from within a given directory from any charset to UTF-8 recursively
# It keeps track of those files that cannot be converted automatically. Usually this happens when the original charset
# cannot be recognized. In that case you should load the corresponding file into a development editor like Netbeans
# or Komodo and apply the UTF-8 charset manually.
#
# This is free software. Use and distribute but do it at your own risk.
# We will not take any responsibilities for failures and do not provide any support.

# Checking parameters
if [ -z "$1" ] ; then
    echo "You did not supply any directory at the command line."
    echo "You need to provide the path to the directory that contains the files which you want to be converted"
    echo ""
    echo "Example: $0 /path/to/directory"
    echo ""
    echo "Important hint: You should not run this script from within the same directory where the files are stored"
    echo "that you want to convert right now."
    # Exit with a non-zero status because nothing was converted
    exit 1
fi

# This array contains file extensions that need to be checked no matter if the filetype is binary or not.
# Reason: Sometimes it happens that .htm(l), .php, .tpl files etc. have a binary charset type. This script
# does not convert binary file types into utf-8 because it might destroy your data. So we need to include these file types
# into the conversion system manually to tell the conversion that binary files with these special extensions may be converted anyway.
filestoconvert=(htm html php txt tpl asp css js)

# define colors
# default color
reset="\033[0;00m"
# Successful conversion (green)
success="\033[1;32m"
# No conversion needed (blue)
noconversion="\033[1;34m"
# file skipped because it's not mentioned in the filestoconvert array (white)
fileskipped="\033[1;37m"
# files that could not be converted aka error (red)
fileconverterror="\033[1;31m"

## Function to convert all files in a directory recursively
function convert {
    # Clear the screen first
    clear

    dir=$1
    fileerrors=""

    # Loop counter
    i=0

    # The loop reads from a process substitution instead of a pipe, so that
    # variables set inside it (fileerrors, i) are still visible after the loop
    while IFS= read -r inputfile
    do
        if [ -f "$inputfile" ] ; then
            charset="$(file -bi "$inputfile" | awk -F "=" '{print $2}')"
            if [ "$charset" != "utf-8" ]; then
                # Only files whose extension is listed in the filestoconvert array are converted; all others are skipped
                filename=$(basename "$inputfile")
                extension="${filename##*.}"
                if in_array "$extension" "${filestoconvert[@]}" ; then
                    # Create a temp file and remember the current owner, group and permissions
                    # so that they can be reapplied to the converted file
                    tmp=$(mktemp)
                    owner=$(stat --format=%U "$inputfile")
                    group=$(stat --format=%G "$inputfile")
                    octalpermission=$(stat --format=%a "$inputfile")
                    if iconv -f "$charset" -t UTF-8 "$inputfile" -o "$tmp" 2> /dev/null ; then
                        echo -e "$success $inputfile\t$charset\t->\tUTF-8 $reset"
                        mv "$tmp" "$inputfile"
                        # Re-apply previous file permissions as well as user and group settings
                        chown "$owner:$group" "$inputfile"
                        chmod "$octalpermission" "$inputfile"
                    else
                        # Conversion failed: leave the original file untouched and
                        # report it to the user at the end of the conversion process
                        rm -f "$tmp"
                        fileerrors="$fileerrors\n$inputfile"
                    fi
                else
                    echo -e "$fileskipped $inputfile\t$charset\t->\tSkipped because its extension (.$extension) is not listed in the 'filestoconvert' array. $reset"
                fi
            else
                echo -e "$noconversion $inputfile\t$charset\t->\tNo conversion needed (file is already UTF-8) $reset"
            fi
        fi
        (( ++i ))
    done < <(find "$dir" -type f)

    echo -e "$success Done! $reset"
    echo -e ""
    echo -e ""
    if [ -n "$fileerrors" ]; then
        echo -e "The following files had errors (origin charset not recognized) and need to be converted manually (e.g. by opening the file in an editor (IDE) like Komodo or Netbeans):"
        echo -e "$fileconverterror$fileerrors$reset"
    fi
    exit 0
} #end function convert()

# Check if a value exists in an array
# @param $1 mixed Needle
# @param $2 array Haystack
# @return Success (0) if value exists, Failure (1) otherwise
# Usage: in_array "$needle" "${haystack[@]}"
in_array() {
    local needle=$1
    local hay
    shift
    for hay; do
        [[ $hay == "$needle" ]] && return 0
    done
    return 1
} #end function in_array

# Start conversion
convert "$1"

becoming_I · 21 September 2017

I think what you want can be done in one line, as long as you are in the same folder. Give this a try:

for s in *.srt; do iconv -f ISO-8859-7 -t UTF-8 "$s" > "$s.tmp" && mv "$s.tmp" "$s"; done
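The loop above only covers the current folder. A minimal sketch of a recursive variant follows, assuming every subtitle really is in the single encoding you pass in. The function name convert_tree is illustrative and not from the thread. It writes through a temporary file, because redirecting iconv's output onto its own input file would truncate that file before it is read.

```shell
# convert_tree DIR FROM_ENCODING  (hypothetical helper, not from the thread)
# Converts every *.srt file below DIR from FROM_ENCODING to UTF-8, in place.
convert_tree() {
    find "$1" -type f -iname '*.srt' -exec sh -c '
        from=$1; shift
        for f in "$@"; do
            tmp=$(mktemp) || exit 1
            # Convert into the temp file first, then copy the result back
            if iconv -f "$from" -t UTF-8 "$f" > "$tmp"; then
                cat "$tmp" > "$f"   # overwrite contents, keep owner and permissions
            else
                echo "skipped (conversion failed): $f" >&2
            fi
            rm -f "$tmp"
        done
    ' sh "$2" {} +
}
```

It would be called as `convert_tree ~/Movies ISO-8859-7`. Files that iconv cannot convert are left untouched and reported on stderr.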

ntg_net · 21 September 2017

If the file is Windows-1253 rather than ISO-8859-7, do you think it will still convert it correctly?
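On that question: the two encodings agree on the ordinary Greek letters but differ in a few positions. For example, byte 0xA2 is the accented capital Ά in Windows-1253 (CP1253) but a right single quotation mark in ISO-8859-7. A small sketch to see the difference with iconv (the helper name decode_as is illustrative):

```shell
# decode_as ENCODING: read bytes on stdin, interpret them as ENCODING, print UTF-8
# (hypothetical helper, not from the thread)
decode_as() {
    iconv -f "$1" -t UTF-8
}

# The same two bytes, 0xA2 0xE1 (octal 242 341), decoded both ways
printf '\242\341\n' | decode_as CP1253       # Άα
printf '\242\341\n' | decode_as ISO-8859-7   # ’α
```

So a file that is really Windows-1253 but is converted with -f ISO-8859-7 keeps most of its text and only gets these few characters wrong.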

Oxygene · 21 September 2017

1. Install uchardet and dos2unix.

2. Run the code below in the folder you want (I borrowed pieces of code from subedit to write it):

#!/bin/bash

# Capture the output of find in an array, one file name per line
unset filesarray
mapfile -t filesarray < <(find ./ -type f -iname "*.srt" 2> /dev/null | sort)

for inputfilename in "${filesarray[@]}"; do

    encoding=$(file -b --mime-encoding "$inputfilename")
    if [[ $encoding == "binary" ]] || ! [[ $(file -b "$inputfilename") == *"text"* ]] && ! [[ $(file -b "$inputfilename") == *"Bio-Rad .PIC Image File"* ]]; then
        echo "Error: \"$inputfilename\" is not a text file or it is UTF-16 without BOM" >&2
        # Do not try to convert a file that was not recognized as text
        continue
    fi

    # Detect encoding of input file
    # We use both file and uchardet (if it is available) because no one is perfect.
    # file doesn't get right some non-unicode encodings, but for UTF-16 is preferable
    # than uchardet because the latter doesn't specify if it is UTF-16BE or UTF-16LE.
    if [[ $encoding == "utf-16"* ]] || [[ $encoding == *"ascii"* ]]; then
        :
    elif command -v uchardet > /dev/null; then
        encoding=$(uchardet "$inputfilename")
    fi

    # If the encoding is detected as ISO-8859-7, make it CP1253
    if [[ ${encoding^^} == "ISO-8859-7" ]]; then
        encoding="CP1253"
    fi

    # Convert subtitle to UTF-8 (this also covers UTF-16 input).
    # iconv writes to a temp file first, because it cannot safely overwrite its own input
    if [[ ${encoding^^} != "UTF-8" ]]; then
        tmpfile=$(mktemp)
        if iconv -f "$encoding" -t UTF-8 "$inputfilename" > "$tmpfile" 2> /dev/null; then
            cat "$tmpfile" > "$inputfilename"
        else
            echo "Error: could not convert \"$inputfilename\" from $encoding" >&2
        fi
        rm -f "$tmpfile"
    fi

    # Make all line endings unix-like
    dos2unix -q "$inputfilename" 2> /dev/null

done
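After a bulk conversion like this it can help to confirm that every file now decodes as UTF-8. A minimal sketch, assuming only iconv and find are available (both function names are illustrative, not from the thread): converting a file from UTF-8 to UTF-8 fails on the first invalid byte sequence, so iconv's exit status can serve as a validity check.

```shell
# is_valid_utf8 FILE: succeeds only if FILE decodes cleanly as UTF-8
# (hypothetical helper, not from the thread)
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# report_non_utf8 DIR: list every *.srt file below DIR that is not valid UTF-8
report_non_utf8() {
    find "$1" -type f -iname '*.srt' | while IFS= read -r f; do
        is_valid_utf8 "$f" || echo "not UTF-8: $f"
    done
}
```

Running `report_non_utf8 .` in the converted folder should print nothing if every subtitle was converted. Note that plain ASCII files also pass, since ASCII is a subset of UTF-8.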

Edited 21 September 2017 by Oxygene

becoming_I · 21 September 2017

Actually (good morning, everyone) I had not noticed the "recursive" part; I thought all the files were in the same folder. I think Oxygene has set you up nicely!!

Oxygene · 21 September 2017

I had not noticed it either, but I have modified the script so that it works recursively.

georgegreece · 19 February 2021

Good morning,

on a related topic, converting from UTF-8 to UTF-16 (on a Windows 10 machine, with the iconv, gettext etc. libraries):

does anyone know why the script below does not work? As far as I can tell, it has a problem with the BOM. Plain utf16 does not specify a BOM. The output comes out half Chinese characters, half OK. In Notepad++ the files show up as UCS-2 BE BOM.

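@ECHO OFF
REM Inside a batch file the loop variable is written %%f (a single %f only works at the interactive prompt)
cd C:\Users\George\Desktop\testscript
for %%f in (*.txt) do "C:\Program Files (x86)\UNIXUTIL\ICONV\bin\iconv.exe" -f utf-8 -t utf-16 "%%f" > "C:\Users\George\Desktop\testscript\resultfolder\%%f"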

I also tried utf-16BE and utf-16LE. In Notepad++ those show up as ANSI, with NUL characters and so on.
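On the BOM question: when an explicit byte order such as UTF-16LE or UTF-16BE is requested, iconv writes no BOM, which would be consistent with Notepad++ then guessing ANSI and showing NUL bytes. One way to get a predictable result is to write the BOM yourself and follow it with the explicitly ordered output. The sketch below is a Unix shell function rather than a Windows batch file, and the function name is illustrative, so treat it as an outline of the idea and not as a drop-in fix for the script above.

```shell
# to_utf16le_bom INFILE OUTFILE: UTF-8 -> UTF-16LE with a leading BOM
# (hypothetical helper, not from the thread)
to_utf16le_bom() {
    {
        # BOM for little-endian UTF-16: bytes FF FE (octal 377 376)
        printf '\377\376'
        # Explicit byte order, so iconv adds no BOM of its own
        iconv -f UTF-8 -t UTF-16LE "$1"
    } > "$2"
}
```

For a file containing just `A` and a newline, the result is the six bytes FF FE 41 00 0A 00, which editors that rely on the BOM should detect as little-endian UTF-16.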
