i18n: add infrastructure for translating Git with gettext

Change the skeleton implementation of i18n in Git to one that can show
localized strings to users for our C, Shell and Perl programs using
either GNU libintl or the Solaris gettext implementation.

This new internationalization support is enabled by default. If
gettext isn't available, or if Git is compiled with
NO_GETTEXT=YesPlease, Git falls back on its current behavior of
showing interface messages in English. When using the autoconf script
we'll auto-detect if the gettext libraries are installed and act
appropriately.

This change is somewhat large because as well as adding a C, Shell and
Perl i18n interface we're adding a lot of tests for them, and for
those tests to work we need a skeleton PO file to actually test
translations. A minimal Icelandic translation is included for this
purpose. Icelandic includes multi-byte characters which makes it easy
to test various edge cases, and it's a language I happen to
understand.

The rest of the commit message goes into detail about various
sub-parts of this commit.

= Installation

Gettext .mo files will be installed and looked for in the standard
$(prefix)/share/locale path. GIT_TEXTDOMAINDIR can also be set to
override that, but that's only intended to be used to test Git itself.
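
As a rough sketch of that lookup order (not the exact code in this
patch): EXAMPLE_LOCALE_PATH is a hypothetical stand-in for the
build-time $(prefix)/share/locale, and pick_locale_dir() and
example_locale_dir() are illustrative names only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the compiled-in $(prefix)/share/locale. */
#define EXAMPLE_LOCALE_PATH "/usr/local/share/locale"

/*
 * Choose the directory to search for .mo files: the override if one
 * was given, otherwise the fallback.
 */
const char *pick_locale_dir(const char *override, const char *fallback)
{
	assert(fallback != NULL);
	return override ? override : fallback;
}

/* Consult GIT_TEXTDOMAINDIR first, then the build-time default. */
const char *example_locale_dir(void)
{
	return pick_locale_dir(getenv("GIT_TEXTDOMAINDIR"),
			       EXAMPLE_LOCALE_PATH);
}
```

The returned directory is what would then be handed to
bindtextdomain() for the "git" domain.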

= Perl

Perl code that's to be localized should use the new Git::I18n
module. It imports a __ function into the caller's package by default.

Instead of using the high level Locale::TextDomain interface I've
opted to use the low-level (equivalent to the C interface)
Locale::Messages module, which Locale::TextDomain itself uses.

Locale::TextDomain does a lot of redundant work we don't need, and
some of it would potentially introduce bugs. It tries to set the
$TEXTDOMAIN based on the package of the caller, and has its own
hardcoded paths where it'll search for messages.

I found it easier just to completely avoid it rather than try to
circumvent its behavior. In any case, this is an issue wholly
internal to Git::I18n. Its guts can be changed later if that's deemed
necessary.

See <AANLkTilYD_NyIZMyj9dHtVk-ylVBfvyxpCC7982LWnVd@mail.gmail.com> for
a further elaboration on this topic.

= Shell

Shell code that's to be localized should use the git-sh-i18n
library. It's basically just a wrapper for the system's gettext.sh.

If gettext.sh isn't available we'll fall back on gettext(1) if it's
available. The latter is available without the former on Solaris,
which has its own non-GNU gettext implementation. We also need to
emulate eval_gettext() there.

If neither is present we'll use a dumb printf(1) fall-through
wrapper.

= About libcharset.h and langinfo.h

We use libcharset to query the character set of the current locale if
it's available. I.e. we'll use it instead of nl_langinfo if
HAVE_LIBCHARSET_H is set.

The GNU gettext manual recommends using langinfo.h's
nl_langinfo(CODESET) to acquire the current character set, but on
systems that have libcharset.h's locale_charset(), using the latter is
either saner or the only option.

GNU and Solaris have a nl_langinfo(CODESET), FreeBSD can use either,
but MinGW and some others need to use libcharset.h's locale_charset()
instead.
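
A minimal sketch of that selection, assuming HAVE_LIBCHARSET_H is
supplied by the build system as in this patch. The helper name
example_current_charset() is hypothetical. It copies the result into
a caller-supplied buffer because the string returned by nl_langinfo()
may be overwritten by later setlocale() calls.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

#ifdef HAVE_LIBCHARSET_H
#	include <libcharset.h>
#else
#	include <langinfo.h>
#	define locale_charset() nl_langinfo(CODESET)
#endif

/*
 * Copy the name of the current locale's character set into buf.
 * Returns 0 on success, -1 if buf is too small to hold it.
 */
int example_current_charset(char *buf, size_t len)
{
	const char *name;
	size_t n;

	assert(buf != NULL);
	/* Take LC_CTYPE from the environment so the answer reflects it. */
	setlocale(LC_CTYPE, "");
	name = locale_charset();
	n = strlen(name);
	if (n + 1 > len)
		return -1;
	memcpy(buf, name, n + 1);
	return 0;
}
```

Unlike gettext.c in this patch, the sketch leaves LC_CTYPE set from
the environment. The real code resets it to "C" right after binding
the codeset.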

= Credits

This patch is based on work by Jeff Epler <jepler@unpythonic.net> who
did the initial Makefile / C work, and a lot of comments from the Git
mailing list, including Jonathan Nieder, Jakub Narebski, Johannes
Sixt, Erik Faye-Lund, Peter Krefting, Junio C Hamano, Thomas Rast and
others.

[jc: squashed a small Makefile fix from Ramsay]

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
diff --git a/gettext.c b/gettext.c
index ae5394a..f75bca7 100644
--- a/gettext.c
+++ b/gettext.c
@@ -5,6 +5,18 @@
 #include "git-compat-util.h"
 #include "gettext.h"
 
+#ifndef NO_GETTEXT
+#	include <locale.h>
+#	include <libintl.h>
+#	ifdef HAVE_LIBCHARSET_H
+#		include <libcharset.h>
+#	else
+#		include <langinfo.h>
+#		define locale_charset() nl_langinfo(CODESET)
+#	endif
+#endif
+
+#ifdef GETTEXT_POISON
 int use_gettext_poison(void)
 {
 	static int poison_requested = -1;
@@ -12,3 +24,108 @@
 		poison_requested = getenv("GIT_GETTEXT_POISON") ? 1 : 0;
 	return poison_requested;
 }
+#endif
+
+#ifndef NO_GETTEXT
+static void init_gettext_charset(const char *domain)
+{
+	const char *charset;
+
+	/*
+	   This trick arranges for messages to be emitted in the user's
+	   requested encoding, but avoids setting LC_CTYPE from the
+	   environment for the whole program.
+
+	   This is primarily done to avoid a bug in vsnprintf in the GNU C
+	   Library [1], which triggered a "your vsnprintf is broken" error
+	   on Git's own repository when inspecting v0.99.6~1 under a UTF-8
+	   locale.
+
+	   That commit contains an ISO-8859-1 encoded author name, which
+	   the locale aware vsnprintf(3) won't interpolate in the format
+	   argument, due to mismatch between the data encoding and the
+	   locale.
+
+	   Even if it wasn't for that bug we wouldn't want to use LC_CTYPE at
+	   this point, because it'd require auditing all the code that uses C
+	   functions whose semantics are modified by LC_CTYPE.
+
+	   But only setting LC_MESSAGES as we do creates a problem: since
+	   we declare the encoding of our PO files[2], the gettext
+	   implementation will try to recode it to the user's locale, but
+	   without LC_CTYPE it'll emit something like this on 'git init'
+	   under the Icelandic locale:
+
+	       Bj? til t?ma Git lind ? /hlagh/.git/
+
+	   Gettext knows about the encoding of our PO file, but we haven't
+	   told it about the user's encoding, so all the non-US-ASCII
+	   characters get encoded to question marks.
+
+	   But we're in luck! We can set LC_CTYPE from the environment
+	   only while we call nl_langinfo and
+	   bind_textdomain_codeset. That suffices to tell gettext what
+	   encoding it should emit in, so it'll now say:
+
+	       Bjó til tóma Git lind í /hlagh/.git/
+
+	   And the equivalent ISO-8859-1 string will be emitted under an
+	   ISO-8859-1 locale.
+
+	   This way we get the advantages of setting LC_CTYPE
+	   (talk to the user in his language/encoding), without the major
+	   drawbacks (changed semantics for C functions we rely on).
+
+	   However foreign functions using other message catalogs that
+	   aren't using our neat trick will still have a problem, e.g. if
+	   we have to call perror(3):
+
+	   #include <stdio.h>
+	   #include <locale.h>
+	   #include <errno.h>
+
+	   int main(void)
+	   {
+		   setlocale(LC_MESSAGES, "");
+		   setlocale(LC_CTYPE, "C");
+		   errno = ENODEV;
+		   perror("test");
+		   return 0;
+	   }
+
+	   Running that will give you a message with question marks:
+
+	   $ LANGUAGE= LANG=de_DE.utf8 ./test
+	   test: Kein passendes Ger?t gefunden
+
+	   In the long term we should probably see about getting that
+	   vsnprintf bug in glibc fixed, and audit our code so it won't
+	   fall apart under a non-C locale.
+
+	   Then we could simply set LC_CTYPE from the environment, which would
+	   make things like the external perror(3) messages work.
+
+	   See t/t0203-gettext-setlocale-sanity.sh's "gettext.c" tests for
+	   regression tests.
+
+	   1. http://sourceware.org/bugzilla/show_bug.cgi?id=6530
+	   2. E.g. "Content-Type: text/plain; charset=UTF-8\n" in po/is.po
+	*/
+	setlocale(LC_CTYPE, "");
+	charset = locale_charset();
+	bind_textdomain_codeset(domain, charset);
+	setlocale(LC_CTYPE, "C");
+}
+
+void git_setup_gettext(void)
+{
+	const char *podir = getenv("GIT_TEXTDOMAINDIR");
+
+	if (!podir)
+		podir = GIT_LOCALE_PATH;
+	bindtextdomain("git", podir);
+	setlocale(LC_MESSAGES, "");
+	init_gettext_charset("git");
+	textdomain("git");
+}
+#endif