[Xmltv-devel] Are the new encoding tests complaining too much? Or is it not enough?

Karl Dietz

2010-10-22 07:23:09 UTC

Morning Grabber Maintainers,

The results from the nightly run with the latest addition of encoding
tests are in. I would like to know whether you care about such errors
and want more of them, or whether you consider these tests too much.

This is going to be detailed, so grab a drink first.

tv_grab_no_gfeed: look here "ng="nb">You DonÂ’t Mess With the"
That's [C2][92], the UTF-8 encoding of character 0x92, which is
a private-use control character. But if we follow the path of the
data from iso-8859-x to utf-8 and replace iso-8859-x with
windows-1252, we end up with Unicode character U+2019, the
"right single quotation mark", which looks quite like the apostrophe
that should have been there in the first place. (Turn off the stupid
wizard that does such things to your office documents now, it does more
harm than good. :)
Solution: As this site runs NonameTV, a fix is already available
upstream for them to pull; it corrects this issue at the import from
this TV channel's press site. (After the update it's still going to be
the wrong character, but it's going to be correctly encoded.)
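
For illustration, the windows-1252 reinterpretation described above can
be sketched in a few lines of Python. This is only a sketch, not code
from NonameTV or any grabber; the function name repair_c1_controls is
made up, and it assumes the text has already been decoded from UTF-8.

```python
def repair_c1_controls(text: str) -> str:
    """Reinterpret C1 control characters (U+0080..U+009F) as windows-1252.

    Hypothetical helper: assumes the controls came from windows-1252 data
    that was mislabelled as iso-8859-x before conversion to UTF-8.
    """
    repaired = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                # Treat the code point as a single windows-1252 byte.
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes (e.g. 0x81) are unassigned in windows-1252;
                # leave those characters untouched.
                pass
        repaired.append(ch)
    return "".join(repaired)
```

With this, "Don\u0092t" comes out as "Don\u2019t", i.e. with the right
single quotation mark in place of the control character.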

tv_grab_huro: look here "- Margitsziget HERE lombok nélkül "
HERE is a correctly encoded soft hyphen, which is used to help the
computer break long words across multiple lines. Depending on how
you look at the logfile you'll see a space or a hyphen (minus).
Notice that the text says "Some" space soft-hyphen space "thing" when
it actually wants to say either "Something", or "Some-" newline
"thing", or it's supposed to be a normal hyphen and it's
"Some - Thing".
I have no idea what would be right, but I bet the programs used to
display it don't know either ;) (I have yet to see an intentionally and
correctly used non-breaking space or soft hyphen in guide data, so I
added them as errors to get some more samples.)
But Google suggests that these are two words that don't go together, so
it should have been a normal hyphen.
Solution: Extend _huro to either change soft hyphens into normal ones or
simply drop them (and collapse the two remaining spaces into one).
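
Both options can be sketched like this in Python. Again only an
illustration under assumptions: the function names are made up and
nothing here is taken from _huro itself.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def drop_soft_hyphens(text: str) -> str:
    """Remove soft hyphens, then collapse runs of spaces left behind."""
    without = text.replace(SOFT_HYPHEN, "")
    return re.sub(r" {2,}", " ", without)


def harden_soft_hyphens(text: str) -> str:
    """Turn every soft hyphen into a normal hyphen-minus."""
    return text.replace(SOFT_HYPHEN, "-")
```

For "Some" space soft-hyphen space "thing", the first variant gives
"Some thing" and the second gives "Some - thing".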

tv_grab_fr: look here "e rire de bon cÂœur en regardant"
It's the same problem as with _no_gfeed above. Notice that many guides
write it as "coeur" but some can handle "cœur", and so can we if we want.
Currently _fr downconverts "&oelig;" into "oe" in other places.
I'd prefer the real thing here, as we are writing a Unicode-encoded
file (and it's 2010 after all ;)
Solution: Extend _fr to handle misencoded windows-1252 characters like
the others. Possibly change the conversion to return the right character
if that's what French users prefer.
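
Returning the real character instead of "oe" could look roughly like
this in Python. It is a sketch only: decode_entities is a made-up name,
and it assumes the entity arrives as a plain HTML named reference.

```python
from html import unescape


def decode_entities(text: str) -> str:
    """Replace HTML character references with the characters they name."""
    # html.unescape knows the standard named references, including
    # "&oelig;" (U+0153 LATIN SMALL LIGATURE OE).
    return unescape(text)
```

So "c&oelig;ur" becomes "cœur" rather than "coeur".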

Regards,
Karl