UTF-8 Codec Error When Assigning Uncompressed GZip File from URL to String Variable

Question

I am downloading a gzip log from a URL and then saving it to a variable. I then want to later iterate over that string variable line by line. If I just save the file and open it in Notepad++, I can see that the saved log file is in UTF-8 encoding.

I wanted to skip saving the file and reopening it for parsing, so I assign the file contents to a variable and then use `io.StringIO` to iterate over each line in that variable. This works most of the time, but occasionally the script fails with the following error when it reaches the line `return str(file_content, 'utf-8')`.

Exception Raised in connect function: 'utf-8' codec can't decode byte 0xe0 in position 138037: invalid continuation byte

Here is the section of code that makes the request and then assigns the decompressed content to a string variable.

        # Make a GET request with basic authentication
        request = urllib.request.Request(url)
        base64string = base64.b64encode(bytes('%s:%s' % ('xxxxx', 'xxxxx'),'ascii'))
        request.add_header("Authorization", "Basic %s" % base64string.decode('utf-8'))
        
        # open request and then use gzip to read the shoutcast log that is in gzip format, then save uncompressed version
        with urllib.request.urlopen(request) as response:
            with gzip.GzipFile(fileobj=response) as uncompressed:
                file_content = uncompressed.read()
                return str(file_content, 'utf-8')

You can pass `errors='ignore'` (or any other error handler) to your `str` call. If you don't want to lose anything, you may want to register your own error handler and check what is causing the issue. See https://docs.python.org/3/library/codecs.html#error-handlers for more information. - Maciej Wrobel, Feb 01 '22 at 18:41
You commented at about the same time I found [https://stackoverflow.com/questions/606191/convert-bytes-to-a-string](https://stackoverflow.com/questions/606191/convert-bytes-to-a-string), which pointed me in a similar direction. In my case, `.decode("utf-8", "surrogateescape")` solved my problem. - user5919866, Feb 02 '22 at 17:14
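
To make the error handlers mentioned in the comments above more concrete, here is a minimal sketch. It is not the asker's actual code: the helper names `describe_decode_error` and `decode_lenient` and the sample bytes are made up for illustration, and a small in-memory gzip payload stands in for the downloaded log. It first locates the byte that strict UTF-8 decoding rejects, then decodes with `surrogateescape` so that no data is dropped.

```python
import gzip
import io


def describe_decode_error(raw: bytes, context: int = 10):
    """Return (position, nearby bytes) for the first invalid UTF-8 byte, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        return exc.start, raw[lo:exc.end + context]
    return None


def decode_lenient(raw: bytes, errors: str = "surrogateescape") -> str:
    """Decode as UTF-8, delegating undecodable bytes to the given error handler."""
    return raw.decode("utf-8", errors=errors)


# Hypothetical stand-in for the log: the second line holds a lone 0xe0 byte,
# which is not valid UTF-8 when followed by a space.
payload = gzip.compress(b"ok line\ncaf\xe0 latin-1 line\nlast line\n")
raw = gzip.decompress(payload)

# Inspect what is actually at the failing position before choosing a handler.
print(describe_decode_error(raw))

# 'surrogateescape' keeps the bad byte as a lone surrogate (here '\udce0'),
# 'replace' substitutes U+FFFD, and 'ignore' silently drops the byte.
text = decode_lenient(raw)

# Iterate line by line; ascii() avoids encoding errors when printing surrogates.
for line in io.StringIO(text):
    print(ascii(line))

# The original bytes can be recovered by encoding with the same handler.
assert text.encode("utf-8", "surrogateescape") == raw
```

One caveat with `surrogateescape`: the resulting string contains lone surrogates, so writing it out with a strict UTF-8 encoder raises `UnicodeEncodeError` unless the same handler is used when encoding. A stray 0xe0 byte may also be a hint that some lines in the log were written in a single-byte encoding such as Latin-1, where 0xe0 is "à", so looking at the bytes around the reported position is worth doing before settling on a handler.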
