Discussion:
UTF-8 encoding problems under Apache 2 with mod_perl 2.
Tamer Embaby
2007-04-04 08:59:37 UTC
All,

I have a character encoding problem in my environment:

$ uname -a
SunOS vulcano 5.10 Generic_118844-26 i86pc i386 i86pc

Server: Apache/2.0.58 (Unix) mod_perl/2.0.3 Perl/v5.8.4

I'm hosting a commercial application using mod_perl. The site we are
dealing with has Arabic characters, so I changed the following in Apache
to add support for the UTF-8 charset:

AddDefaultCharset UTF-8

The application itself doesn't handle character set encoding. I checked
with the vendor: they say they don't do anything with character
encoding, and they verified that their application works fine with the
same settings, so the problem must be in my environment.

Somehow, something is transforming characters with values above 0x7f
into HTML character entities (&#XX;), so documents with Arabic letters
arrive at the browser corrupted.
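
One way to narrow this down is to look at the numbers inside the
entities. The following is only a sketch, not what the application
actually does: to_entities is a hypothetical helper, and the Arabic
letter SEEN (U+0633) is just a sample character. If the entities carry
values below 256, the conversion probably ran on undecoded UTF-8 bytes;
if they carry the real code points, it ran on decoded characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace every character above 0x7F with a
# decimal numeric entity.
sub to_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# ARABIC LETTER SEEN (U+0633) as it appears in a UTF-8 file: two bytes.
my $bytes = "\xD8\xB3";

# Converted while still raw bytes: one entity per byte.
my $from_bytes = to_entities($bytes);

# Decoded to a character first: one entity holding the code point.
my $from_chars = to_entities( decode( 'UTF-8', $bytes ) );

print "$from_bytes\n$from_chars\n";
```

The first line printed is the byte-wise form (&#216;&#179;) and the
second is the code point form (&#1587;); comparing them with the page
source may show at which layer the conversion happens.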

I suspect that either Apache or mod_perl is doing this. Apache itself
serves static files with UTF-8 encoding correctly (without transforming
UTF-8 characters into HTML character entities).

Below is additional info about my server.

Does anyone have an idea what might be causing this, and how to
correct it?

I have a hunch that it has something to do with the locale passed to
mod_perl, and that I should be using "PerlPassEnv LANG" or something.

Any pointers are appreciated.

Thanks,
Tamer

----- INFO BEGIN -----
$ ../../bin/apachectl -l
Compiled in modules:
core.c
mod_access.c
mod_auth.c
mod_include.c
mod_log_config.c
mod_env.c
mod_setenvif.c
prefork.c
http_core.c
mod_mime.c
mod_status.c
mod_autoindex.c
mod_asis.c
mod_cgi.c
mod_negotiation.c
mod_dir.c
mod_imap.c
mod_actions.c
mod_userdir.c
mod_alias.c
mod_so.c

$ locale -a
C
POSIX
de
es
fi
fr
iso_8859_1
nl
ru
sl

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 4) configuration:
Platform:
osname=solaris, osvers=2.10, archname=i86pc-solaris-64int
uname='sunos localhost 5.10 i86pc i386 i86pc'
config_args=''
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef
usemultiplicity=undef
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=define use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='gcc', ccflags ='-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
-D_TS_ERRNO',
optimize='-O2 -fno-strict-aliasing',
cppflags=''
ccversion='GNU gcc', gccversion='', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long long', ivsize=8, nvtype='double', nvsize=8,
Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='gcc', ldflags =''
libpth=/lib /usr/lib /usr/ccs/lib
libs=-lsocket -lnsl -ldl -lm -lc
perllibs=-lsocket -lnsl -ldl -lm -lc
libc=/lib/libc.so, so=so, useshrplib=true, libperl=libperl.so
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-R
/usr/perl5/5.8.4/lib/i86pc-solaris-64int/CORE'
cccdlflags='-fPIC', lddlflags='-G'


Characteristics of this binary (from libperl):
Compile-time options: USE_64_BIT_INT USE_LARGE_FILES
Locally applied patches:
22667 The optree builder was looping when constructing the ops
...
22715 Upgrade to FileCache 1.04
22733 Missing copyright in the README.
22746 fix a coredump caused by rv2gv not fully converting a PV
...
22755 Fix 29149 - another UTF8 cache bug hit by substr.
22774 [perl #28938] split could leave an array without ...
22775 [perl #29127] scalar delete of empty slice returned
garbage
22776 [perl #28986] perl -e "open m" crashes Perl
22777 add test for change #22776 ("open m" crashes Perl)
22778 add test for change #22746 ([perl #29102] Crash on assign
...
22781 [perl #29340] Bizarre copy of ARRAY make sure a pad op's
...
22796 [perl #29346] Double warning for int(undef) and abs(undef)
...
22818 BOM-marked and (BOMless) UTF-16 scripts not working
22823 [perl #29581] glob() misses a lot of matches
22827 Smoke [5.9.2] 22818 FAIL(F) MSWin32 WinXP/.Net SP1 (x86/1
cpu)
22830 [perl #29637] Thread creation time is hypersensitive
22831 improve hashing algorithm for ptr tables in perl_clone:
...
22839 [perl #29790] Optimization busted: '@a = "b", sort @a' ...
22850 [PATCH] 'perl -v' fails if local_patches contains code
snippets
22852 TEST needs to ignore SCM files
22886 Pod::Find should ignore SCM files and dirs
22888 Remove redundant %SIG assignments from FileCache
23006 [perl #30509] use encoding and "eq" cause memory leak
23074 Segfault using HTML::Entities
23106 Numeric comparison operators mustn't compare addresses of
...
23320 [perl #30066] Memory leak in nested shared data structures
...
23321 [perl #31459] Bug in read()
Built under solaris
Compiled at Jan 21 2005 15:48:11
@INC:
/usr/perl5/5.8.4/lib/i86pc-solaris-64int
/usr/perl5/5.8.4/lib
/usr/perl5/site_perl/5.8.4/i86pc-solaris-64int
/usr/perl5/site_perl/5.8.4
/usr/perl5/site_perl
/usr/perl5/vendor_perl/5.8.4/i86pc-solaris-64int
/usr/perl5/vendor_perl/5.8.4
/usr/perl5/vendor_perl

----- INFO END -----

--

Tamer Embaby <***@itworx.com>

" f u cn rd ths, u cn gt a gd jb n cmptr prgrmmng. "
Perrin Harkins
2007-04-04 14:12:44 UTC
Post by Tamer Embaby
Server: Apache/2.0.58 (Unix) mod_perl/2.0.3 Perl/v5.8.4
You probably want a later Perl than that. Many unicode bugs have been
fixed since then.
Post by Tamer Embaby
I have a hunch that it's something to do with the Locale passed to the
mod_perl that I should be using "PerlPassEnv LANG" or something.
I don't know that much about Unicode, but I do remember that Perl does
some automatic encoding in certain situations. There was that problem
in Red Hat a few years ago when they set LANG to UTF-8 and it broke
all kinds of CPAN module tests when Perl tried to read all files as
UTF-8. Why don't you try setting LANG to UTF-8 and see if it helps
your situation?
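
Before changing anything, it may help to see which locale variables the
Perl process actually receives, since the environment Apache runs with
can differ from a login shell's. A minimal sketch, where locale_report
is a hypothetical helper and not part of mod_perl:

```perl
use strict;
use warnings;

# Hypothetical helper: list the locale-related variables in the order
# of decreasing precedence (LC_ALL, LC_CTYPE, LANG).
sub locale_report {
    my (%env) = @_;
    return map { "$_=" . ( defined $env{$_} ? $env{$_} : '(unset)' ) }
        qw(LC_ALL LC_CTYPE LANG);
}

# Run from the shell this shows the shell's view; the same call made
# from code running under mod_perl shows what Apache passed along.
print "$_\n" for locale_report(%ENV);
```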

- Perrin
Iosif Fettich
2007-04-04 14:30:22 UTC
Hello,

quick question:

is there currently a way to get the ServerAliases used for a virtual host
in the config?

My Apache2::ServerRec man page offers the "names" method for that, but it
seems to be just a placeholder for code to come, as the function would
return an APR::ArrayHeader object and this falls in the

---
since: 2.0.00

META: we don't have "APR::ArrayHeader" yet

---

category. Well, if that's not yet working, is there another way to reach
these values?

Many thanks,

Iosif Fettich
Jeff Nokes
2007-04-05 02:15:02 UTC
We also do everything (not source code, which is in ISO-8859-1, only content) in UTF-8 where I work, and we support many different languages. We never use any Apache configuration for this, or make any explicit reference to the OS locale being used. As of Perl 5.8.x, internally Perl assumes UTF-8 for all strings/character data, unless you say otherwise. I do know that there have been issues with the core Encode module (which handles much of the character encoding control) and UTF-8 character data in versions 5.8.2-5.8.6 (there are bugs filed against it), but I think most have been fixed as of 5.8.7. Anyway, you are probably safe with 5.8.8, but we still use 5.8.1.

Our stack is:
RedHat Linux 7.2
Apache 1.3x
mod_perl 1.29
HTML::Mason + HTML::Template
Perl 5.8.1

... and we have had no issues with UTF-8 corruption of our content.

I'm not sure whether this helps you or not.
- Jeff


Jeff Pang
2007-04-05 02:20:01 UTC
Post by Jeff Nokes
We also do everything (not source code, which is in ISO-8859-1, only content) in UTF-8 where I work, and we support many different languages.
Jeff, how do you use UTF-8 for everything? Can you give a rough description? Thanks.



--
mailto: ***@earthlink.net
http://home.arcor.de/jeffpang/
Jeff Nokes
2007-04-05 03:33:38 UTC
Well,
We completely separate out our content, from our presentation templates, from our source. We use HTML::Mason mostly as a layer of abstraction to mod_perl's raw API, and then use HTML::Template to munge our content with our templates in pre-release batch mode and/or dynamically.

We keep all of our strings in versioned content files, in XML format, something like the following:

<str id="landHelp.004">
<content>Here's how to do it:</content>
</str>
...
<str id="commChat.217">
<content>How to participate</content>
</str>

This is an example of a US English XML string file. All of the different locales we support have their own string files, with the base being the US English one, meaning we always translate enUS -> other locale. So, for the traditional Chinese XML file, we would have the equivalent strings for those example stringIDs in-between the <content></content> tags, but in Chinese, the same for Polish, etc.

Then, a template might look like:

<option value="-1">------------------</option>
<option value="-2"><!-- TMPL_VAR NAME=landHelp.004 --></option>

... etc. This is just HTML::Template syntax inside of a standard HTML template. The post-munging phase would be:

<option value="-1">------------------</option>
<option value="-2">Here's how to do it:</option>


So, whenever we read in our string files for munging with templates, we tell Perl that the file is UTF-8 formatted, by creating the file handle as such, and that's it really; internally Perl automatically treats that string content as UTF-8 unless we state otherwise explicitly. We use the Encode module all the time to convert between UTF-8 and Big-5, or ISO-8859-2, or whatever for email templates with the same content. Of course in your web page templates, you have to have your character encoding set properly as well as in your emails for this all to work with respective clients. Apache really doesn't care, it just sees 8-bit data and serves it to the client.
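
As a rough illustration of that last step (a sketch only, not our
production code): the temporary file and the Polish sample text below
are made up, and File::Temp is used just to keep the example
self-contained. The file handle is created with a UTF-8 layer, and
Encode converts the decoded string to another charset on the way out.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Write a sample string file as UTF-8 (stand-in for a real content file).
my ( $fh, $path ) = tempfile( UNLINK => 1 );
binmode( $fh, ':encoding(UTF-8)' );
print {$fh} "Jak si\x{0119} przy\x{0142}\x{0105}czy\x{0107}\n";
close $fh or die "can't close $path: $!";

# Read it back, telling Perl the file is UTF-8 when the handle is created.
open( my $in, '<:encoding(UTF-8)', $path ) or die "can't open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# Convert explicitly for a client that expects another charset.
my $latin2 = encode( 'iso-8859-2', $line );
my $utf8   = encode( 'UTF-8',      $line );

printf "characters: %d, ISO-8859-2 bytes: %d, UTF-8 bytes: %d\n",
    length($line), length($latin2), length($utf8);
```

The character count and the ISO-8859-2 byte count match, while the
UTF-8 form is longer because each accented letter takes two bytes.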


I also have the following set in the environment of the user running Apache. I've tried commenting it out completely and I see no difference in behavior, but I keep it in there for posterity I guess ... :-)

# Perl Unicode Support
# This ENV will force the entire Perl interpreter in Apache to have the
# following IO layers/streams forced to use UTF-8 as the desired charset.
# See `perldoc perlrun` and `perldoc perluniintro` for more details.
# I 1 STDIN is assumed to be in UTF-8
# O 2 STDOUT will be in UTF-8
# E 4 STDERR will be in UTF-8
# S 7 I + O + E
# i 8 UTF-8 is the default PerlIO layer for input streams
# o 16 UTF-8 is the default PerlIO layer for output streams
# D 24 i + o
# A 32 the @ARGV elements are expected to be strings encoded in UTF-8
# L 64 normally the "IOEioA" are unconditional,
# the L makes them conditional on the locale environment
# variables (the LC_ALL, LC_CTYPE, and LANG, in the order
# of decreasing precedence) -- if the variables indicate
# UTF-8, then the selected "IOEioA" are in effect
PERL_UNICODE=SDA
export PERL_UNICODE
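
The letters are just names for bit values that get combined. A small
sketch of the arithmetic, using only the values from the table above
(unicode_flags is a hypothetical helper, not something Perl provides):

```perl
use strict;
use warnings;

# Values copied from the table above (perlrun, -C switch).
my %flag_value = (
    I => 1,  O => 2,  E => 4, S => 7,
    i => 8,  o => 16, D => 24,
    A => 32, L => 64,
);

# Hypothetical helper: OR together the value of each letter in a setting.
sub unicode_flags {
    my ($letters) = @_;
    my $bits = 0;
    for my $letter ( split //, $letters ) {
        die "unknown flag '$letter'\n" unless exists $flag_value{$letter};
        $bits |= $flag_value{$letter};
    }
    return $bits;
}

printf "PERL_UNICODE=SDA corresponds to %d\n", unicode_flags('SDA');
```

The result can be compared with the read-only ${^UNICODE} variable
inside the running interpreter to check whether the setting took effect.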


It's important to understand how Perl deals with character data internally, and how it uses the UTF-8 flag it sets, etc. You should probably read up on it if you haven't at the following links:

http://search.cpan.org/~jhi/perl-5.8.0/pod/perluniintro.pod
http://search.cpan.org/~jhi/perl-5.8.0/pod/perlunicode.pod
http://search.cpan.org/~dankogai/Encode-2.18/Encode.pm
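
A minimal sketch of the difference between undecoded bytes and decoded
characters, assuming a single sample character (ARABIC LETTER SEEN,
U+0633) and using only the core Encode module and utf8::is_utf8:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two UTF-8 encoded bytes for U+0633, as they would come from a socket
# or from a file opened without an encoding layer.
my $bytes = "\xD8\xB3";

# After decoding, Perl holds one character and sets its internal UTF-8 flag.
my $chars = decode( 'UTF-8', $bytes );

printf "bytes: length %d, flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length %d, flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';

# Encoding again before output gives back the original two bytes.
my $out = encode( 'UTF-8', $chars );
print "round trip ", ( $out eq $bytes ? "ok" : "differs" ), "\n";
```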


Hope this helps you out.
- Jeff



Clinton Gormley
2007-04-05 07:59:22 UTC
I realise that you're probably not using Template Toolkit, but on a
separate but related note:

For those on the list using Template Toolkit, if your templates contain
UTF8, you need to prefix them with a UTF8 BOM for them to be recognised
as UTF8, otherwise TT gets really confused.

See here for more details:
http://template-toolkit.org/pipermail/templates/2004-June/006270.html
http://template-toolkit.org/pipermail/templates/2005-July/007532.html

In other words, the first three bytes of the file need to be:
"\x{EF}\x{BB}\x{BF}"

I have a small script which I run on my templates which checks for the
BOM and adds one as necessary:

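#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';

our $root = '/PATH/TO/TEMPLATES';
our $bom  = "\x{EF}\x{BB}\x{BF}";    # UTF-8 byte order mark

process_dir($root);

# Recurse through a directory, processing every *.tt file and
# skipping Subversion metadata directories.
sub process_dir {
    my $dir   = shift;
    my @files = glob( $dir . "/*" );
    foreach my $file (@files) {
        if ( -f $file && $file =~ /\.tt$/ ) {
            process_file($file);
        }
        elsif ( -d $file && $file !~ m|/\.svn| ) {
            process_dir($file);
        }
    }
}

# Make sure a template starts with a single BOM, rewriting the file
# only when its contents change.
sub process_file {
    my $name = my $file = shift;
    $name =~ s/^\Q$root\E//;    # quote the path so it is matched literally
    printf( "Processing : %-50s", $name );

    local $/;    # slurp mode: read the whole file at once
    open( my $in, '<:bytes', $file )
        or die "can't open $file: $!";
    my $content = <$in>;
    close $in;

    my $original = $content;
    $content =~ s/\Q$bom\E//g;    # remove every existing BOM sequence
    $content = $bom . $content;
    if ( $content ne $original ) {
        open( my $out, '>:bytes', $file )
            or die "can't write to $file : $!";
        print {$out} $content;
        close $out
            or die "can't close file $file";
        print " ...Updated\n";
    }
    else {
        print "\n";
    }
}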
