[Xmltv-devel] Are the new encoding tests complaining too much? Or is it not enough?
Karl Dietz
2010-10-22 07:23:09 UTC
Morning Grabber Maintainers,

the results from the nightly run with the latest addition of encoding
tests are in. I would like to know if you care about such errors and
want more of them, or if you consider these tests too much.


This is going to be detailed, so grab a drink first.


tv_grab_no_gfeed: look here "ng="nb">You DonÂ’t Mess With the"
That's [C2][92], the UTF-8 encoding of character 0x92, which is
a private-use control character. But if we retrace the path of the
data from iso-8859-x to UTF-8 and replace iso-8859-x with
windows-1252, we end up with Unicode character 0x2019, the
"right single quotation mark", which looks quite like the apostrophe
that should have been there in the first place. (Turn off the stupid
wizard that does such things to your office documents now; it does more
harm than good. :)
Solution: As this site runs NonameTV, a fix is already available
upstream for them to pull; it fixes this issue in their import from this
TV channel's press site. (After the update it will still be the
wrong character, but it will be correctly encoded.)
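
The byte trail described above can be reproduced in a few lines. This is
only a minimal sketch in Python (chosen for illustration, not what the
grabbers use), assuming the source text was windows-1252 that got
treated as iso-8859-1 before being converted to UTF-8:

```python
# Bytes as they show up in the grabber output: [C2][92] is UTF-8 for U+0092.
bad_bytes = b"Don\xc2\x92t"
text = bad_bytes.decode("utf-8")               # contains the C1 control U+0092

# Undo the wrong iso-8859-1 step, then decode the byte as windows-1252.
original_bytes = text.encode("iso-8859-1")     # back to the single byte 0x92
fixed = original_bytes.decode("windows-1252")  # 0x92 maps to U+2019

print(repr(text), "->", repr(fixed))
```

The round trip through iso-8859-1 only works while every character in
the string is below U+0100, so a real fix would have to handle mixed
text more carefully.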

tv_grab_huro: look here "- Margitsziget HERE lombok nélkül "
HERE is a correctly encoded soft hyphen, which is used to help the
computer break long words across multiple lines. Depending on how
you look at the logfile, you'll see a space or a hyphen (minus).
Notice that the text says "Some" space soft-hyphen space "thing" when
it should either let the display choose between "Something" and "Some-"
newline "thing", or it's supposed to be a normal hyphen, as in
"Some - Thing".
I have no idea what would be right, but I bet the programs used to
display it don't know either ;) (I have yet to see an intentionally and
correctly used non-breaking space or soft hyphen in guide data, so I
added them as errors to get some more samples.)
But Google suggests that these are two words that don't go together, so
it should have been a normal hyphen.
Solution: Extend _huro to either change soft hyphens into normal ones or
simply drop them (and collapse the two remaining spaces into one).
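
Both options from the solution above are simple string operations. A
rough sketch in Python (illustration only; the function names are made
up here and are not part of _huro):

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

def drop_soft_hyphens(text):
    """Option 1: remove soft hyphens, then collapse runs of spaces."""
    return re.sub(r" {2,}", " ", text.replace(SOFT_HYPHEN, ""))

def soft_to_normal_hyphens(text):
    """Option 2: turn each soft hyphen into a normal hyphen-minus."""
    return text.replace(SOFT_HYPHEN, "-")

# Sample from the log, with the soft hyphen written out as an escape.
title = "- Margitsziget \u00ad lombok nélkül "
print(repr(drop_soft_hyphens(title)))
print(repr(soft_to_normal_hyphens(title)))
```

Dropping is only safe when the soft hyphen sits between spaces or inside
a single word; if it was meant as a real hyphen, as seems to be the case
here, option 2 keeps the intended meaning.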

tv_grab_fr: look here "e rire de bon cÂœur en regardant"
It's the same problem as with _no_gfeed above. Notice that many guides
write it as "coeur" but some can handle "cœur", and so can we if we want.
Currently _fr downconverts "œ" into "oe" in other places.
I'd prefer the real thing here, as we are writing a Unicode-encoded
file (and it's 2010 after all ;)
Solution: Extend _fr to handle misencoded windows-1252 characters like
the others. Possibly change the conversion to return the right character
if that's what French users prefer.
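
One way to handle such characters after the text has already been
decoded is to map each C1 control character (U+0080 to U+009F) to what
the same byte means in windows-1252. A minimal sketch in Python, for
illustration only; the function name is hypothetical and this is not how
_fr is implemented:

```python
def fix_c1_as_cp1252(text):
    """Reinterpret C1 control characters as windows-1252 bytes."""
    out = []
    for ch in text:
        if "\x80" <= ch <= "\x9f":
            try:
                ch = bytes([ord(ch)]).decode("windows-1252")
            except UnicodeDecodeError:
                pass  # byte has no windows-1252 meaning in Python's codec; keep as-is
        out.append(ch)
    return "".join(out)

fixed = fix_c1_as_cp1252("de bon c\x9cur")   # U+009C becomes U+0153 (œ)
print(fixed)

# Optional downconversion, if "oe" is preferred over the real character.
print(fixed.replace("\u0153", "oe"))
```

The same function also turns the U+0092 from the _no_gfeed sample into
U+2019, so one helper could cover both grabbers.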

Regards,
Karl
Nick Morrott
2010-10-22 19:37:52 UTC
Post by Karl Dietz
Morning Grabber Maintainers,
the results from the nightly run with the latest addition of encoding
tests are in. I would like to know if you care about such errors and
want more of them, or if you consider these tests too much.
I think they are a very welcome addition, and would like to thank you
for developing the test_grabbers script to include these tests.

I try to catch as much of the bad UTF-8 in tv_grab_uk_rt as I can but
there's always a chance something may still get through, so these
tests offer a second line of defense. The Radio Times changed their
listings provider a few months ago, and I'm not sure whether the bad
UTF-8 (and possibly Windows-1252) that users were seeing before the
change came from the old provider's data or was introduced during
processing at the RT.

Enjoy the weekend,
Nick
--
Nick Morrott

MythTV Official wiki: http://mythtv.org/wiki/
MythTV users list archive: http://www.gossamer-threads.com/lists/mythtv/users

"An investment in knowledge always pays the best interest." - Benjamin Franklin