[Xmltv-devel] [ xmltv-Bugs-3324199 ] tv_imdb: character encoding of xml input not followed
SourceForge.net
2011-07-14 06:40:40 UTC
Bugs item #3324199, was opened at 2011-06-21 21:39
Message generated for change (Comment added) made by dekarl
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=424135&aid=3324199&group_id=39046

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: tv_imdb
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: user1024 (user1024)
Assigned to: Jerry Veldhuis (jveldhuis)
Summary: tv_imdb: character encoding of xml input not followed
Initial Comment:
I have an xmltv file with utf-8 encoding. When I applied tv_imdb to this file, an actor tag was added, but the actor's name was not encoded in utf-8. Firefox gives an "XML Parsing Error: not well-formed" page when I try to open the output xmltv file.

Attached is a patch with a workaround; with it applied, tv_imdb generates an output file which opens without error in Firefox. This patch is intended to illustrate the problem, not to provide a general solution. I expect that the same issue applies to other tags, not just the actor tag.

I am unsure at which stage the issue should be addressed. I imagine that the encoding could be changed when the imdb database files are created, when text is read from the database file, or when the text is inserted into the output structure. Or is it a problem with the imdb database files themselves?
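
To make the failure mode concrete, here is a minimal Python sketch of the "decode when text is read from the database file" option. It is not taken from tv_imdb (which is not written in Python). The names `ASSUMED_DB_ENCODING`, `db_bytes_to_text` and `actor_element` are hypothetical, and ISO-8859-1 as the encoding of the database text is an assumption that would need checking against the real files.

```python
from xml.sax.saxutils import escape

# Assumption: text in the database file is ISO-8859-1 (Latin-1).
ASSUMED_DB_ENCODING = "iso-8859-1"


def db_bytes_to_text(raw: bytes, encoding: str = ASSUMED_DB_ENCODING) -> str:
    """Decode bytes read from the database file into a text string."""
    return raw.decode(encoding)


def actor_element(name: str) -> bytes:
    """Build an <actor> element from text and serialise it as UTF-8 bytes."""
    return ("<actor>" + escape(name) + "</actor>").encode("utf-8")


# 'ü' stored as the single Latin-1 byte 0xFC.
raw_name = b"Curd J\xfcrgens"

# Copying the bytes through unchanged puts a non-UTF-8 byte into a
# document that declares itself as UTF-8.
broken = b"<actor>" + raw_name + b"</actor>"

# Decoding on read and encoding on write gives a valid UTF-8 element.
fixed = actor_element(db_bytes_to_text(raw_name))
```

In this sketch `broken` cannot be decoded as UTF-8, which is the kind of input that makes an XML parser report "not well-formed", while `fixed` contains the two-byte UTF-8 sequence for 'ü'.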

----------------------------------------------------------------------
Comment By: Karl Dietz (dekarl)
Date: 2011-07-14 08:40

Message:
Nick, it seems this has nothing to do with your grabber (and the nightly
tester now checks for quite a few garbled utf-8 characters).
User1024, it seems tv_imdb could use some love with regard to characters
outside the ASCII set. Would you mind giving it a try? tv_validate_file
should find all instances of strange stuff in place of proper utf-8.
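
As a rough illustration of what "strange stuff instead of proper utf-8" means at the byte level, the Python sketch below lists every byte sequence in a buffer that does not decode as UTF-8. It is not how tv_validate_file works; `find_invalid_utf8` is a hypothetical helper, and the sample bytes are made up for the example.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for each undecodable sequence in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Succeeds only if everything from pos onwards is valid UTF-8.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning after the bad bytes
    return problems


# First line carries a Latin-1 'ü' (0xFC), second line a UTF-8 'ü' (0xC3 0xBC).
sample = (
    b"<actor>Curd J\xfcrgens</actor>\n"
    b"<actor>Curd J\xc3\xbcrgens</actor>\n"
)
problems = find_invalid_utf8(sample)
```

To check a real file, its contents would be read in binary mode (`open(path, "rb")`) and passed to the same function. Re-decoding the remaining slice after each error is simple but slow on large files with many errors.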

----------------------------------------------------------------------

Comment By: user1024 (user1024)
Date: 2011-06-22 22:24

Message:
Thanks for looking into this.

Here are the details of the grabber used.
~$ tv_grab_uk_rt --version
XMLTV module version 0.5.59
This is tv_grab_uk_rt version 1.331, 2010/11/19 11:31:00

The file output from the grabber appears to have the correct utf-8
encoding and will open without error in Firefox.

When I pipe the output from the grabber into tv_imdb the new file
generated has additional tags added, as expected. However this new file
generated from tv_imdb does not open successfully in Firefox. One of the
additional tags contains an actor's name which appears to be incorrectly
encoded. The error in Firefox points to a character in the actor's name.

A snippet of the xmltv file output from tv_imdb (with the patch applied)
is given below. The error in Firefox points to the "Curd Jürgens" actor
tag.

<programme start="20110529110000 +0100" stop="20110529125500 +0100"
channel="filmfour.channel4.com">
<title>The Enemy Below</title>
<desc lang="en">This is one of the best submarine movies...</desc>
<credits>
<director>Dick Powell</director>
<actor>Robert Mitchum</actor>
<actor>Curd Jürgens</actor>
<actor>David Hedison</actor>
...

----------------------------------------------------------------------

Comment By: Nick Morrott (knowledgejunkie)
Date: 2011-06-22 15:42

Message:
Can you provide details on the grabber (name and version) you used to
produce the XMLTV data you are feeding to tv_imdb?

If the XMLTV data is not being created as valid UTF-8 by the grabber, it
should ideally be fixed there first.

----------------------------------------------------------------------
