
[Done]: [Suggestion for code review: select_lang.inc.php] automatic language detection


ripat:
First, I would like to say that I am impressed by the quality of Coppermine and by the amount of work it represents.

Living in a country where 3 different languages are spoken, I paid special attention to the automatic language detection based on the Accept-Language and User-Agent HTTP headers.

GENERAL REMARK

* Accept-Language string

  The $available_languages array is based on RFC 1766, which is now obsolete.
  The current RFC for language tags is RFC 4646 which, in combination with RFC 4647,
  replaces RFC 3066, which replaced RFC 1766.
   --> http://www.w3.org/International/questions/qa-lang-priorities
   --> http://www.faqs.org/rfcs/rfc4646.html
   --> http://www.w3.org/International/articles/bcp47/
 
  An underscore as sub-tag separator has never been allowed by any of the above RFCs. So,
   'en' => array('en([-_][[:alpha:]]{2})?|english', 'english', 'en')
  could be changed to the faster:
   'en' => array('en(-[[:alpha:]]{2})?|english', 'english', 'en')
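
A quick check of what the hyphen-only pattern accepts, as a rough sketch. The sample tags are made up for illustration, and the pattern is wrapped (start anchor, case-insensitive flag) the same way the suggested function further down wraps it.

```php
<?php
// Hyphen-only pattern from the $available_languages entry above.
$pattern = 'en(-[[:alpha:]]{2})?|english';

// Hypothetical sample tags, not taken from real browser traffic.
$samples = array('en', 'en-GB', 'EN-us', 'en_GB', 'fr-BE');

$results = array();
foreach ($samples as $tag) {
    // Anchored at the start only, case-insensitive.
    $results[$tag] = (bool) preg_match('#^(?:' . $pattern . ')#i', $tag);
}
var_export($results);
```

Because the match is anchored at the start only, a tag such as en_GB still matches on its primary 'en' sub-tag, so dropping the underscore from the pattern does not reject that kind of input.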

* User-Agent string
 
  It is chaos here. Localisation information sent in the User-Agent string
  is almost never compliant with RFC 2068 and RFC 1945. Even Mozilla sends a non-standard string.
     --> http://www.mozilla.org/build/user-agent-strings.html
   
  More info:
   --> http://www.faqs.org/rfcs/rfc2068.html
   
  I won't go further into this for now. Accurate language tag matching is not easy.

MY SUGGESTION
The code below is faster and has more features. It is faster because it uses the PCRE regex functions, which are *much* faster than the POSIX ones. In a small benchmark (100 loops) the new code is 3 times faster when there is an Accept-Language string and up to 5 times faster on the User-Agent string.

As for the new feature, the definition of the HTTP Accept-Language header published by the W3C says:
Each language-range MAY be given an associated quality value which represents an estimate of the user's preference for the languages specified by that range. The quality value defaults to "q=1".
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4

My code below takes the user's preferences into account by sorting the language tokens by their weight (q=0.x).

For example: if the Accept-Language string looks like ww,ww-zz,de=0.2;q=0.1,it;q=0.5,en;q=0.3, the code will disregard the non-existent ww and ww-zz tags and pick the available language tag that has the highest q factor, 'it' in this case.


--- Code: ---function lang_detect_q($available_languages) {
    if (!empty($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
        $language_tokens = explode(',', $_SERVER['HTTP_ACCEPT_LANGUAGE']);
        // loop through each Accept-Language token and find its quality value (e.g. q=0.8)
        $lang_tag = $quality_tag = array();
        foreach ($language_tokens as $language_token ) {
            // split the token on ";q="
            $q_explode = explode(';q=', $language_token);
            // if the token carries no q factor, the default q value is 1
            $q = isset($q_explode[1]) ? $q_explode[1] : 1;
            // add language_tag and quality_tag to their arrays
            // (trim: some clients put a space after each comma)
            $lang_tag[]    = trim($q_explode[0]);
            $quality_tag[] = $q;
        }
        // sort by quality value in descending order, keeping the keys
        // so that they still index $lang_tag (array_multisort was too slow)
        arsort($quality_tag);
        // loop through the quality_tag array
        foreach ($quality_tag as $q_key => $q_val) {
            // loop through each available_languages
            foreach ($available_languages as $key => $language) {
                if (preg_match('#^(?:'. $language[0] .')#i', $lang_tag[$q_key])){
                    // exit function on first match.
                    return $available_languages[$key][1];
                }
            }
        }

    // if Accept-Language not present in the client's http header, we try the User-Agent string
    } elseif (!empty($_SERVER['HTTP_USER_AGENT'])) {     
        // once again, loop through each available_languages
        foreach ($available_languages as $key => $language) {
            if (preg_match('#[(,; [](?:'. $language[0] .')[]),;]#i', $_SERVER['HTTP_USER_AGENT'])) {
                // exit function on first match.
                return $available_languages[$key][1];
            }
        }
    }
    // if nothing found --> exit function with false (or default language value if necessary)
    return false;
}

$lang = lang_detect_q($available_languages);
// If we caught a valid language, configure it
if ($lang) {
    $USER['lang'] = $lang;
}

--- End code ---
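
To see the ordering step on its own, here is a minimal sketch that turns the example header above into a tag => weight list. parse_accept_language is a hypothetical helper name, not part of Coppermine, and it is a simplified variant (tags as array keys, q cast to float) rather than the function posted above.

```php
<?php
// Hypothetical helper for illustration only; not part of Coppermine.
function parse_accept_language($header) {
    $weights = array();
    foreach (explode(',', $header) as $token) {
        $parts = explode(';q=', trim($token));
        // No q factor means the default weight of 1.
        $weights[$parts[0]] = isset($parts[1]) ? (float) $parts[1] : 1.0;
    }
    // Highest weight first; the keys (the tags) are preserved.
    arsort($weights);
    return $weights;
}

$weights = parse_accept_language('ww,ww-zz,de=0.2;q=0.1,it;q=0.5,en;q=0.3');
print_r($weights);
```

The two ww tags sort first with the default weight of 1 but match no available language, so 'it' (0.5) is the first tag that can actually be served.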

As for the $available_languages array, the PCRE functions run slightly faster when the grouping parentheses (option1|option2) are made non-capturing, as in (?:option1|option2). So, for example:
'fr' => array('fr(?:-[[:alpha:]]{2})?|french', 'french', 'fr'),
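
The difference between the two group styles can be seen in what preg_match stores, as a small sketch. The tag fr-BE is just a sample value, and the delimiters and anchor are assumptions chosen to mirror the function above.

```php
<?php
$capturing     = '#^(fr(-[[:alpha:]]{2})?|french)#i';
$non_capturing = '#^(?:fr(?:-[[:alpha:]]{2})?|french)#i';

preg_match($capturing, 'fr-BE', $m1);
preg_match($non_capturing, 'fr-BE', $m2);

// Capturing groups fill extra slots in the match array;
// non-capturing groups leave only the full match.
var_export($m1);
var_export($m2);
```

Both patterns accept the same input; the non-capturing form simply spares the engine the bookkeeping for sub-matches that the detection code never reads.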

Let me know if something needs to be changed.

Nibbler:
Good work, is this tested on the main web browsers?

ripat:
Yes, I tested it with:

IE 5.5
IE 6.0
IE 7.0
FF 2.0 (Linux)
FF 2.0 (OS-X)
Opera (Linux)
Opera (Windows)
Safari 9.2 (OS-X)

And even CURL and wget :=)

They are all OK, which is to be expected as they all send fairly standard Accept-Language strings. If that header is not present, as with CURL and wget, the fallback on the User-Agent string is far less reliable, because User-Agent strings are far from standard and don't always contain a localisation tag.
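
A small sketch of that fallback, using the same User-Agent regex as the function above. The two User-Agent strings are illustrative only; exact values vary by client and version.

```php
<?php
$pattern = 'fr(?:-[[:alpha:]]{2})?|french';
// The language tag must sit between typical User-Agent delimiters.
$regex   = '#[(,; [](?:' . $pattern . ')[]),;]#i';

// Illustrative User-Agent strings, assumed for this example.
$agents = array(
    'firefox' => 'Mozilla/5.0 (X11; U; Linux i686; fr; rv:1.8.1) Gecko/20061010 Firefox/2.0',
    'wget'    => 'Wget/1.10.2',
);

$found = array();
foreach ($agents as $name => $ua) {
    $found[$name] = (bool) preg_match($regex, $ua);
}
var_export($found);
```

The Firefox-style string carries a locale token between delimiters and is detected; the wget-style string carries none, so the function would fall through and return false.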

What I mean is that the language detection relies on strings sent by the browser in the HTTP header. Pretty straightforward. Not like that html/css stuff, where the client receives the html page and must parse it correctly!

Jean-Luc.

Nibbler:
Committed to 1.5.

Nibbler:
Would be nice to hook the language detection into the language manager.
