Docker UTF-8 python [SOLVED]

EDIT: The correct solution was provided by Ulrich Eckhardt. The HTML report did not declare a charset in a meta tag, so the browser was interpreting it with a different encoding. Putting this snippet into the HTML report’s head solved the issue.

<head>
  <meta charset="UTF-8">
</head>
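For anyone generating the report from Python, here is a minimal sketch of how the tag could be added at write time. The helper names `wrap_report` and `write_report` are hypothetical and not part of the library used in the question; the sketch only assumes the report body is available as a unicode string.

```python
import io


def wrap_report(body_html):
    """Wrap a unicode HTML fragment in a page that declares UTF-8 (hypothetical helper)."""
    return (
        u"<!DOCTYPE html>\n"
        u"<html>\n<head>\n"
        u'  <meta charset="UTF-8">\n'
        u"</head>\n<body>\n" + body_html + u"\n</body>\n</html>\n"
    )


def write_report(path, body_html):
    # io.open works the same way on Python 2.7 and 3: it accepts unicode text
    # and encodes it to UTF-8, matching what the meta tag declares.
    with io.open(path, "w", encoding="utf-8") as fh:
        fh.write(wrap_report(body_html))
```

The point of the design is that the bytes on disk and the declared charset agree, so the browser does not have to guess.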

I’ve run into an issue where طريق دخان appears as Ø·Ø±ÙŠÙ‚ Ø¯Ø®Ø§Ù† in an HTML report generated with Python 2.7 running in a CentOS 7 Docker container. (Other non-ASCII letters are garbled the same way.)
The same script on my local machine displays the characters correctly, so the problem is probably some environment setting that I didn’t add in the Dockerfile.

I’d like to either know what Docker setting I’m missing, or what encoding issue causes طريق دخان to turn into Ø·Ø±ÙŠÙ‚ Ø¯Ø®Ø§Ù†.
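This kind of garbling can be reproduced in a few lines, which helps narrow down where it happens. The sketch below assumes the viewer falls back to windows-1252 (`cp1252`) when no charset is declared; that fallback is an assumption about the browser, not something the script controls.

```python
# The street name, written with escapes so the file itself stays ASCII.
original = u"\u0637\u0631\u064a\u0642 \u062f\u062e\u0627\u0646"

# What ends up in the file: each Arabic letter becomes two UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# What a viewer shows if it reads those bytes as windows-1252 instead:
# every byte turns into its own Latin character.
garbled = utf8_bytes.decode("cp1252")

# Undoing the wrong decode recovers the original, which shows the bytes
# were valid UTF-8 all along and only the interpretation was off.
restored = garbled.encode("cp1252").decode("utf-8")
```

If `restored` equals `original`, the data was never corrupted and the fix belongs on the reading side.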

Quick overview of the script:

  • The script downloads a JSON file that contains street names (such as
    طريق دخان)
  • On the raw JSON file, that name would appear like this:
    \u0637\u0631\u064a\u0642 \u062f\u062e\u0627\u0646
  • The JSON is fetched using requests.get(), which should auto-convert to unicode.
  • The script would output the unicode strings into an HTML report
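As a quick check of the JSON step, the escaped form decodes to the Arabic text with the standard `json` module alone. The key `"name"` and the one-field payload are made up for illustration; the real file layout is not shown in the question. Calling `response.json()` on a requests response should do equivalent parsing.

```python
import json

# A hypothetical one-field payload using the escaped form from the raw file.
raw = '{"name": "\\u0637\\u0631\\u064a\\u0642 \\u062f\\u062e\\u0627\\u0646"}'

data = json.loads(raw)

# On Python 2.7 this is a unicode object, on Python 3 a str; either way
# it holds 9 code points (8 letters and a space), not bytes.
name = data["name"]
```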

I used this library to generate the HTML reports

I’ve modified this library’s code slightly to work with unicode. (Otherwise it would fail with the error: ‘ascii’ codec can’t encode character: ordinal not in range(128).) Now it encodes the cell to UTF-8 before converting it into a string.

        # Python 2: encode unicode cells to UTF-8 bytes; str() everything else.
        if isinstance(self.text, unicode):
            text = self.text.encode('utf-8')
        else:
            text = str(self.text)
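The same idea as a standalone function, in case it is useful to test outside the library. This is only a sketch: `cell_to_utf8` is a hypothetical name, and it assumes any byte string passed in is already UTF-8.

```python
import sys

# "unicode" only exists on Python 2; on Python 3 the text type is str.
text_type = unicode if sys.version_info[0] == 2 else str


def cell_to_utf8(value):
    """Return UTF-8 bytes for any cell value (hypothetical helper)."""
    if isinstance(value, bytes):
        return value  # already bytes; assumed to be UTF-8
    if not isinstance(value, text_type):
        value = text_type(value)  # numbers and other non-text values
    return value.encode("utf-8")
```

For example, `cell_to_utf8(u"\u0637")` gives the two bytes `\xd8\xb7`, and `cell_to_utf8(42)` gives `42` as a byte string.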

On my local machine, the HTML report would have cells that correctly display the non-ascii letters on google chrome.

When the same script is run on Docker, the HTML report has output that looks like this in Google Chrome: Ø·Ø±ÙŠÙ‚ Ø¯Ø®Ø§Ù†

I wish I could run this on python 3, but I’m stuck with 2.7 :[

I’ve tried to add these things to the dockerfile without success:

  • ENV PYTHONIOENCODING=utf-8
  • RUN yum -y -q reinstall glibc-common
  • RUN locale-gen en_US.UTF-8
  • ENV LANG en_US.UTF-8
  • ENV LANGUAGE en_US:en
  • ENV LC_ALL en_US.UTF-8
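To compare the container with the local machine, a small diagnostic like the one below could be run in both places. It is only a sketch using standard-library calls, and `encoding_report` is a hypothetical name. Note that these settings affect how Python reads and writes text; they do not change how a browser decodes a file that declares no charset.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encoding-related settings Python sees in this environment."""
    return {
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


if __name__ == "__main__":
    for key, value in sorted(encoding_report().items()):
        print("{0}: {1}".format(key, value))
```

If both environments print the same values and the report still differs, the cause is more likely in how the file is viewed than in the Docker setup.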

Source: StackOverflow