[Xmltv-devel] New grabber (tv_grab_fi_sv)

Per Lundberg

2010-10-16 20:42:08 UTC

Hi there,

I've spent a couple of hours spread out over the last month or so
writing a little grabber for the Finnish TV channels - in Swedish.
(There is already an existing grabber for Finnish channels but with the
program listings in Finnish, and also using a different source for the
information.)

My grabber uses HTML screen-scraping from a pretty simple set of web
pages on the YLE web site, which works pretty OK I think.

When running this through the validation (tv_validate_grabber) I get a
couple of errors:

1) --version doesn't work. I don't seem to be able to get it to work
either; it seems to *expect* a CVS $Id$? This makes things a bit hard
for me. :-) I actually have the code checked in to my own Mercurial
repository, but it doesn't really help in this case. I guess this will
automatically resolve itself if/when this grabber makes it into the
xmltv CVS repository.

2) tv_sort failed on /tmp/Hec3_fJmB1/t_1_2.xml, probably because of
strange start or stop times. See /tmp/Hec3_fJmB1/t_1_2.sort.log

3) The data is not additive. See /tmp/Hec3_fJmB1/t__1_2.diff

Strangely enough, these temporary folders don't seem to be available
when the validate program is completed...

Anyway, how serious are these errors? For me personally, it doesn't
really make much of a difference - the grabber works OK for me
(feeding MythTV) and that's really nice. However, getting this into
the xmltv upstream would also be nice of course.

I've attached the code to this email. If anyone of you feels like
helping me sorting out these issues, feel very free to do so. Thanks
in advance!

--
Best regards,
Per Lundberg

Robert Eden

2010-10-17 01:28:19 UTC

None of those are very serious... as you said, the grabber is still useful.

Does the target site have anything to prohibit a scraper?

I see you're using Date::Calc for only one function. Can you switch to
Date::Manip so it matches the other grabbers? Looks like a simple
enough change to avoid another dependency.

I assume you're willing to be the maintainer..

Robert


Per Lundberg

2010-10-21 19:06:02 UTC

On Sun, Oct 17, 2010 at 4:28 AM, Robert Eden <***@gmail.com> wrote:

Hi Robert, thanks for your answer.

Post by Robert Eden
None of those are very serious... as you said, the grabber is still useful.
Does the target site have anything to prohibit a scraper?

Not that I could find. YLE is the "public service" company of Finland,
so I doubt they would have a big problem with it.

Post by Robert Eden
I see you're using Date::Calc for only one function. Can you switch to
Date::Manip so it matches the other grabbers? Looks like a simple enough
change to avoid another dependency.

Alright, I've made that change. See the attached file. There is a
quite strange bug/problem in the program though; it uses the
utc_offset function (which is a part of XMLTV I believe) and I am
forced to enter an "incorrect" time zone to make it work (i.e. to
produce correct tz offsets - it *should* be EET, but I get +0400 if I
do it like that, which should be +0300 while in DST which we're still
in). There is a FIXME about this in the code. Do you have any ideas?

Post by Robert Eden
I assume you're willing to be the maintainer..

Yes. My Sourceforge username is perlundberg.

--
Best regards,
Per Lundberg

Karl Dietz

2010-10-21 21:02:48 UTC

Hi Per,

Post by Robert Eden
I see you're using Date::Calc for only one function. Can you switch to
Date::Manip so it matches the other grabbers? Looks like a simple enough
change to avoid another dependency.

I have taken a look at your grabber and noticed some small issues:

- "use IO::Scalar" seems to be missing
- the stop time gets written over the start time
- you can copy any other $Id...$ to make --version work
- programs that start/stop past midnight are not handled
- many lines end in \r\n but most end in \n

Just out of curiosity and before adding lots of code to handle all the
corner cases, what's wrong with the guide over here?
http://svenska.yle.fi/programguide/index.php?g=3&d=20101020&lang=se
It has the same data nicely grouped into programs starting between
0 and 24 o'clock, comes with starting and stop times for every program,
has categories. (Notice that all popups and category metadata are hidden
in one big HTML file.)

But if you consider rewriting the data collection you could ask YLE for
permission and add the channels to the swedb site instead. (They have been
looking for help from someone with a bit of Perl knowledge for some
time now, see http://blog.xmltv.se/2009/12/tvswedbse-behover-hjalp.html )
I can help you get up to speed over there or here, no matter which
way you prefer ;)

Regards,
Karl

Per Lundberg

2010-10-23 18:35:39 UTC

On Fri, Oct 22, 2010 at 12:02 AM, Karl Dietz
<***@spaetfruehstuecken.org> wrote:

Hi Karl, thanks for your nice reply and ideas.

Post by Karl Dietz
I have taken a look at your grabber and noticed some small issues.
"use IO::Scalar" seems to be missing

Thanks - I've added it now.

Post by Karl Dietz
the stop time gets written over the start time

Hehe, interesting. :-) You're right here too; this was a bug in my
code, also fixed now.

Post by Karl Dietz
you can copy any other $Id...$ to make --version work

Done.

Post by Karl Dietz
programs that start/stop past midnight are not handled

True... you're right, this is a bug which should be fixed.

Post by Karl Dietz
many lines end in \r\n but most end in \n

This might be a slight glitch, it's something that I haven't noticed before.

Post by Karl Dietz
Just out of curiosity and before adding lots of code to handle all the
corner cases, what's wrong with the guide over here?
http://svenska.yle.fi/programguide/index.php?g=3&d=20101020&lang=se

Good question! :-) I chose the version at
http://www.yle.fi/ohjelmaopas/svindex.html (which is what my grabber
is using at the moment) because it "looked simple". And it actually
"feels" pretty simple. The format is pretty straightforward, easy to
grab the data from. That's why.

I also like the idea of using an "old", quite boring web page. Yes, on
one hand, the chance for YLE removing it is bigger - but then again,
the chance that they would make breaking changes to the page is
slightly less than for a more "modern" page. :-)

Having said that, after looking at the page you suggested - it's
actually better than the one I'm using right now. It has more channels
- I actually didn't even know that channels like Subtv had their
guides available in Swedish. Man, this country is more
Swedish-speaking than you might first think when you get here! (I'm an
immigrant from Sweden, actually...) I mean, we're talking about
purely Finnish-lingual channels per se, but it's quite nice that
they're publishing their program guides in Swedish anyway.

</end-of-side-note>

Anyway, I will consider moving over to this page instead, if the data
is reasonably simple to parse.

Post by Karl Dietz
It has the same data nicely grouped into programs starting between
0 and 24 o'clock, comes with starting and stop times for every program,
has categories. (notice that all popups and category metadata is hidden
in one big html file)

The point about stop times for every program is actually quite a big
plus, since that makes the XMLTV data a bit nicer... As you might have
noticed, the xmltv data I generate right now doesn't have stop times
for all (most) programs.

Post by Karl Dietz
But if you consider rewriting the data collection you could ask YLE for
permission and add the channels to the swedb site instead. (which is
looking for help from someone with a bit of Perl knowledge for some
time now, see http://blog.xmltv.se/2009/12/tvswedbse-behover-hjalp.html
) I can help you get up to speed over there or here, no matter which
way you prefer ;)

Thanks... I actually did ask YLE whether they had an xmltv format or
similar. I did get a reply saying that they were working on something
like it, but it wasn't ready at the moment.

Still, of course their data is available *somewhere*, it's just a
matter of getting it in file format (FTP or similar)... Their channels
are available at the Swedish web page http://www.kolla.tv, so it must
be simply a matter of politics.

If you would be willing to take care of the political aspect of this,
I /might/ be willing to make it available in swedb.se, but one
problem with this is that this would only be the YLE channels I guess;
I doubt YLE can give us access to their data for Subtv, MTV3 etc...
:-)

So, rewriting my scraper to use the other web page might be the way to
go. Do you have the energy to do so? I could give you access to my
private Mercurial repository in that case, so we can coordinate the
work. I looked at the HTML from that other site a little now; my
spontaneous feeling is that "I would like to write this parser using
LINQ to XML instead"... :-) But of course, it can be done in Perl
also, it's just a bit inferior. :P

--
Best regards,
Per Lundberg

Karl Dietz

2010-10-25 21:36:14 UTC

Hi Per,

I'll try to make this one a bit shorter ;)

I do appreciate your grabber, but I think that using the older YLE site
is making it harder than necessary.
As the guide you're after is in Swedish and there happens to be an
established site, looking for some help, I thought it might be worth
seeing if you can join forces or not.

Of course it's quite some work to talk to all stations, collect their
data and release it as one guide listing. If YLE are preparing their
own guide API at the moment it might be simpler to ask if they are ok
with personal use of their web site until the API is ready.

Regards,
Karl

Per Lundberg

2010-10-28 18:58:01 UTC

Hi,

Post by Karl Dietz
I'll try to make this one a bit shorter ;)

Hehe, no probs.

Post by Karl Dietz
I do appreciate your grabber, but I think that using the older YLE site
is making it harder than necessary.

Well, actually not. :-) The actual *parsing* of that page was actually
pretty simple. It's just that it didn't contain good stop times (for
example) that caused it to be a bit technically worse than it could
be.

I've now rewritten the grabber to use the other page you suggested;
the file is attached. I did this slightly differently than in the
previous grabber; instead of just relying on regexps I used the
XML::LibXML parser for creating a structure out of the X(HT)ML data
from the web page.

So, improvements from the previous grabber are these:

- More channels are supported (Subtv, JIM and Nelonen Sport, in
addition to the channels previously supported).
- Proper stop times.
- Proper handling of date rollovers.

It also handles channel configurations (yes/no) like the previous
grabber. This was harder to implement here, since the data doesn't
come separately for each channel but in "groups" of channels.

What it does *not* handle correctly, which I noticed just now, is that
links (<a href>) can sometimes occur inside the actual program
descriptions. These are currently ignored; the XMLTV data will still
be valid, which is most important, but it would be nice to have the
link URL (or link text) presented verbatim, instead of the text
just being "chopped". :-)
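
One way to avoid chopping a description at a link is to collect the text of the element together with the text of everything nested inside it. The following is only a minimal sketch in Python (the grabber itself is Perl), and the <p class="desc"> markup is a made-up stand-in for whatever the YLE page really uses:

```python
import xml.etree.ElementTree as ET


def flatten_text(element):
    """Return the element's text including text nested in child tags.

    Runs of whitespace are collapsed to single spaces.
    """
    return " ".join("".join(element.itertext()).split())


# Hypothetical description markup with an embedded link.
snippet = (
    '<p class="desc">Se mer på '
    '<a href="http://svenska.yle.fi/">svenska.yle.fi</a> i kväll.</p>'
)
desc = ET.fromstring(snippet)
print(flatten_text(desc))
```

With XML::LibXML, which the grabber uses, calling textContent on the description node should give a comparable result, since it returns the text of all descendant text nodes.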

Oh, and one terribly important point: the grabber requires Perl 5.10,
because I use the ~~ operator. Not a problem for me, but it might be an
issue if/when we want to include this in the upstream xmltv distribution.

Post by Karl Dietz
As the guide you're after is in Swedish and there happens to be an
established site, looking for some help, I thought it might be worth
to see if you can join forces or not.

I think you must make a distinction between Swedish as a language and
Swedish as a nationality. :-) The Swedb site *could* actually be a
suitable place for these listings, yes, but the thing is that the
number of Swedish people (from Sweden) who would be interested in the
data is probably very low. I would say, only people living in
"Tornedalen" (Haparanda etc) would be candidates for using this;
people living close enough to the border to be able to receive the
Finnish channels.

And then again, if they would watch Finnish channels, they would
probably know Finnish well enough to read the Finnish program guides
instead... :) Of course, purely hypothetical as it is.

Post by Karl Dietz
Of course it's quite some work to talk to all stations, collect their
data and release it as one guide listing. If YLE are preparing their
own guide API at the moment it might be simpler to ask if they are ok
with personal use of their web site until the API is ready.

Yes. But really, do we even need such permission? I mean, what is
the difference between me reading their web page and typing all the data
manually into a MySQL database, and me writing a program to do the
same? Really? :-)

It's just a whole lot more efficient this way, of course... and yes,
sure, there could be opinions about it from the suppliers of the
listing data. I could ask them just to be sure.

--
Best regards,
Per Lundberg

Karl Dietz

2010-10-28 22:14:28 UTC

Hi Per,

I just tried out the grabber on a FreeBSD box which made me aware of
the current minimum system requirements for xmltv. According to
Makefile.PL xmltv will work with the following versions:
perl 5.6.1 or better
Date::Manip 5.42
LWP 5.65
Storable 2.04
XML::Parser 2.34
XML::Twig 3.10
XML::Writer 0.600

Your grabber requires at least:
perl 5.10 (for the ~~ operator and Date::Manip 6)
Date::Manip 6

While Debian lenny comes with perl 5.10 it does not come with
Date::Manip 6. (FreeBSD doesn't either)

@core devs, the readme of Date::Manip talks about extremely limited
time zone support in version 5, some supported systems don't provide
version 6 and some grabbers already use DateTime. What should we suggest
to use for new grabbers?

Regards,
Karl

Robert Eden

2010-10-29 04:44:33 UTC

Post by Karl Dietz
perl 5.10 (for the ~~ operator and Date::Manip 6)
Date::Manip 6
While Debian lenny comes with perl 5.10 it does not come with
Date::Manip 6. (FreeBSD doesn't either)
@core devs, the readme of Date::Manip talks about extremely limited
time zone support in version 5, some supported systems don't provide
version 6 and some grabbers already use DateTime. What should we suggest
to use for new grabbers?

I think it is too early to require 5.10 and Date::Manip 6 for all of XMLTV.

Many of our users use older systems that may not be upgraded except for
XMLTV. (to fix a scraper problem). Personally, I'm still on Perl 5.8.7
because it works fine for me. (and the exe)

Requiring perl 5.10 and Date::Manip 6 is a significant enough change that I
can see people needing to upgrade their OS. I don't think we should be
the trigger for that.

That said, there's no reason XMLTV's Makefile.PL can't just skip installing
the _fi_sv grabber, which is what it will do if dependencies are noted
in Makefile.PL.

I can see this causing major grief for the packagers.... if they want to
include _fi_sv in their package they need to add the Perl 5.10/Date::Manip 6
dependencies. How do the packagers deal with _il and _it_dvb? Packagers
could be put in the position of maintaining one package for newer distros
and another for older ones (without _fi_sv)

Robert

Chris Butler

2010-10-29 08:36:10 UTC

Post by Robert Eden
I can see this causing major grief for the packagers.... if they want to
include _fi_sv in their package they need to add the Perl 5.10/Date::Manip 6
dependencies. How do the packagers deal with _il and _it_dvb? Packagers
could be put in the position of maintaining one package for newer distros
and another for older ones (without _fi_sv)

It's fairly straightforward for me to simply disable grabbers with missing
dependencies and not install them (this is what I did with _it_dvb until I'd
packaged Linux::DVB).

--
Chris Butler <***@debian.org>
GnuPG Key ID: 4096R/49E3ACD3

Per Lundberg

2010-10-29 19:03:01 UTC

On Fri, Oct 29, 2010 at 11:36 AM, Chris Butler <***@debian.org> wrote:

Hi Chris, always nice to email a DD (I'm a former DD myself). :-)

Post by Chris Butler
It's fairly straightforward for me to simply disable grabbers with missing
dependencies and not install them (this is what I did with _it_dvb until I'd
packaged Linux::DVB).

For the record, I've developed the grabber on a Debian machine
(running squeeze):

***@terah:~/Projects/mythtv$ dpkg -l libxmltv-perl perl libdate-manip-perl
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                 Version              Description
+++-====================-====================-========================================================
ii  libdate-manip-perl   6.11-1               module for manipulating dates
ii  libxmltv-perl        0.5.57-3             Perl libraries related to the XMLTV file format for TV l
ii  perl                 5.10.1-14            Larry Wall's Practical Extraction and Report Language

Also, Karl, regarding what you asked me about: I have now emailed YLE
and asked them whether it's okay to use this grabber with their web
site. They saw no problem with this, but asked me (=us) to not fetch
the data more often than once an hour. Fair enough. :-) I think my
mythtv is set to fetch it once a day or so. (the default setting)
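
The "not more often than once an hour" request could be honoured in the grabber itself by caching downloaded pages and only re-fetching when the cached copy is old enough. A rough Python sketch of that check, assuming a cache file per page (the cache layout and the function names are invented for illustration and are not part of the actual grabber):

```python
import os
import time

# YLE's request: at most one fetch per hour.
MIN_INTERVAL = 60 * 60


def cache_is_fresh(fetched_at, now, min_interval=MIN_INTERVAL):
    """True if the page was fetched less than min_interval seconds ago."""
    return (now - fetched_at) < min_interval


def should_download(cache_path, now=None):
    """Decide whether a page may be downloaded again.

    cache_path is a hypothetical file holding the last downloaded copy;
    its modification time is used as the time of the last fetch.
    """
    now = time.time() if now is None else now
    if not os.path.exists(cache_path):
        return True
    return not cache_is_fresh(os.path.getmtime(cache_path), now)
```

A once-a-day mythfilldatabase run stays well within the limit anyway; a check like this mostly protects against repeated manual runs while testing.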

Now, there's just this one bug that we would need to resolve, since it
bugs me a bit: the utc_offset() method behaves oddly. This function is
provided by one of the XMLTV perl modules. I use it like this:

'start' => utc_offset(xmltv_time($date, $start_time), $TIMEZONE),
'stop'  => utc_offset(xmltv_end_time($date, $start_time, $end_time), $TIMEZONE)

(xmltv_time() and xmltv_end_time() produce xmltv-style datetime strings.)
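
The code of xmltv_end_time() is not shown in this thread, so the following is only a guess at the kind of date rollover it has to do, sketched in Python rather than Perl: if the stop time is not later than the start time, the programme must end on the following day. The function name and the HH:MM input format are assumptions.

```python
from datetime import datetime, timedelta


def programme_times(date, start_hhmm, end_hhmm):
    """Return (start, stop) as XMLTV-style YYYYMMDDhhmmss strings.

    date is the listing date as YYYYMMDD; the times are HH:MM strings.
    A stop time at or before the start time is taken to be on the next day.
    """
    start = datetime.strptime(f"{date} {start_hhmm}", "%Y%m%d %H:%M")
    stop = datetime.strptime(f"{date} {end_hhmm}", "%Y%m%d %H:%M")
    if stop <= start:
        stop += timedelta(days=1)
    return start.strftime("%Y%m%d%H%M%S"), stop.strftime("%Y%m%d%H%M%S")


print(programme_times("20101029", "23:30", "00:45"))
```

This covers only programmes that run past midnight; it does not add the UTC offset, which is a separate step.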

The "funny" thing here is that $TIMEZONE is set to CET. However,
that's weird: my timezone is set to Europe/Helsinki - which should be
equal to EET. The "date" command gives me EEST as the time zone, which
should be correct since we are still in DST here (until Sunday this
weekend).

However, the funny thing is that only $TIMEZONE=CET gives me the
correct UTC offset (which should be +0300 now and +0200 after the
weekend). If I change $TIMEZONE to the proper value (EET), it gives me
+0400. What on earth could be causing this?!?
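
As a cross-check of which offsets are expected (this does not explain what utc_offset() is doing), here is a small Python sketch that applies the EU daylight-saving rule to a Helsinki local time by hand: summer time runs from the last Sunday of March, 03:00 local, to the last Sunday of October, 04:00 local. The function names are made up, and the ambiguous hour at the autumn switch is simply treated as summer time.

```python
from datetime import datetime, timedelta


def last_sunday(year, month):
    """Midnight on the last Sunday of a 31-day month (March or October)."""
    day = datetime(year, month, 31)
    # weekday(): Monday is 0, Sunday is 6.
    return day - timedelta(days=(day.weekday() + 1) % 7)


def helsinki_offset(local):
    """UTC offset string for a naive Europe/Helsinki local time."""
    dst_start = last_sunday(local.year, 3).replace(hour=3)
    dst_end = last_sunday(local.year, 10).replace(hour=4)
    return "+0300" if dst_start <= local < dst_end else "+0200"


def with_offset(local):
    """Format a local time as an XMLTV timestamp with its UTC offset."""
    return local.strftime("%Y%m%d%H%M%S") + " " + helsinki_offset(local)


print(with_offset(datetime(2010, 10, 29, 21, 0)))  # before the switch
print(with_offset(datetime(2010, 11, 1, 21, 0)))   # after the switch
```

For 2010 the switch back to winter time falls on Sunday 31 October, which matches the "+0300 now and +0200 after the weekend" expectation above.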

Any ideas are welcome.

--
Best regards,
Per Lundberg

Karl Dietz

2010-11-04 10:08:38 UTC

Hi Per,

meanwhile I got access to a Debian machine again and could test the
latest version of your grabber.

Post by Per Lundberg
Also, Karl, regarding what you asked me about: I have now emailed YLE
and asked them whether it's okay to use this grabber with their web
site. They saw no problem with this, but asked me (=us) to not fetch
the data more often than once an hour. Fair enough. :-) I think my
mythtv is set to fetch it once a day or so. (the default setting)

Great, I think we should commit the grabber then and work out the last
details in CVS. (the validator pointed out some minor issues)

Robert, I can add the grabber if it's ok with you. I just need someone
to fixup Makefile.PL details afterwards.

Post by Per Lundberg
Now, there's just this one bug that we would need to resolve, since it
bugs me a bit:...

I have no idea on the TZ issue. But we can always convert the grabber
over to use DateTime later :)

Regards,
Karl

Robert Eden

2010-11-04 15:58:23 UTC

Post by Karl Dietz
Robert, I can add the grabber if it's ok with you. I just need someone
to fixup Makefile.PL details afterwards.

Go for it.. I'll also add Per to SF as a developer for future commits.

Karl, try and update the Makefile.PL, I'm sure you can figure it out.

Robert.

Karl Dietz

2010-11-05 08:06:34 UTC

Post by Robert Eden

Post by Karl Dietz
Robert, I can add the grabber if it's ok with you. I just need someone
to fixup Makefile.PL details afterwards.

Go for it.. I'll also add Per to SF as a developer for future commits.
Karl, try and update the Makefile.PL, I'm sure you can figure it out.

done, and fixed a small hiccup. Tonight's run is almost clean:

Testing tv_grab_fi_sv

file contains unexpected control characters
look here "L: SKA Pietari - Metallurg Novo"
^
Errors found in t_fi_sv_1_2.xml
tv_grab_fi_sv did not validate ok. See t_fi_sv_commands.log <http://debian.crustynet.org.uk/%7Exmltv-tester/sid/nightly/0/t_fi_sv_commands.log> for a list of the commands that were used
tv_grab_fi_sv has errors: badiso8859

That's a \x96, aka a windows-1252 encoded en dash (it should be –),
which got handled as if it were a "Start of Protected Area" control code.
Just inserting a decode('windows-1252') or s|\x96|–|g or
s|\x96|-|g at the right place should fix that.
Or you can change the encoding in the header.
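
To illustrate the difference the decode step makes, here is a short Python sketch (the grabber would do the equivalent in Perl). The byte string is a made-up sample modelled on the validator's excerpt, with the raw \x96 byte put back where the dash is:

```python
# Sample bytes as they might arrive from the web page (assumed).
raw = b"L: SKA Pietari \x96 Metallurg Novo"

# Read as ISO-8859-1, byte 0x96 becomes the C1 control character U+0096
# ("Start of Protected Area"), which is what the validator complains about.
as_latin1 = raw.decode("latin-1")

# Read as windows-1252, the same byte becomes U+2013, an en dash.
as_cp1252 = raw.decode("cp1252")

# Once decoded correctly, the text can be written out as UTF-8.
utf8_bytes = as_cp1252.encode("utf-8")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Decoding as windows-1252 and then writing UTF-8 avoids having to declare windows-1252 in the XML header at all.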

I have added Per to all trackers with auto assignment of any new ticket with
category tv_grab_fi_sv to him.

Per, I see you're using mythtv. Here's some notes on mapping of
categories and program types
http://repo.or.cz/w/nonametv.git/blob/HEAD:/examples/Categories

Regards,
Karl

Per Lundberg

2010-11-05 11:49:29 UTC

Hi Karl et al,

Thanks for committing the grabber and adding me to the CVS repository
(I noticed that I have been added now, as mentioned)

On Fri, Nov 5, 2010 at 10:06 AM, Karl Dietz

Post by Karl Dietz
done, and fixed a small hiccup. Tonight's run is almost clean:
[...]
That's a \x96, aka a windows-1252 encoded en dash (it should be –),
which got handled as if it were a "Start of Protected Area" control code.
Just inserting a decode('windows-1252') or s|\x96|–|g or s|\x96|-|g
at the right place should fix that.
Or you can change the encoding in the header.

Changing the encoding seems like the easiest way. Is this an OK
solution, is windows-1252 an acceptable charset for XMLTV data? How
would this work with e.g. the tv_grab_combiner - would it be able to
cope with some of the listings being in iso-8859-1, some in
Windows-1252 and some in UTF-8...?

Just doing a "decode" like mentioned above seems the cleanest and most
"compatible" solution. I'll look into it when I have some time.

Post by Karl Dietz
I have added Per to all trackers with auto assignment of any new ticket with
category tv_grab_fi_sv to him.

Kudos!

Post by Karl Dietz
Per, I see you're using mythtv. Here's some notes on mapping of categories
and program types
http://repo.or.cz/w/nonametv.git/blob/HEAD:/examples/Categories

Thanks. I remember you mentioning that the "new" data source I've
moved over to now supports program categories (which it supposedly
does). However, there is a slight problem with this: the categories
are not visible anywhere on the site. :-) I guess I could ask my
existing YLE contacts about this, to see if they would have a list...
so we don't have to guess.

--
Best regards,
Per Lundberg

Nick Morrott

2010-11-05 12:03:57 UTC

Post by Per Lundberg
Hi Karl et al,
Thanks for committing the grabber and adding me to the CVS repository
(I noticed that I have been added now, as mentioned)
On Fri, Nov 5, 2010 at 10:06 AM, Karl Dietz

Post by Karl Dietz
That's a \x96, aka a windows-1252 encoded en dash (it should be –),
which got handled as if it were a "Start of Protected Area" control code.
Just inserting a decode('windows-1252') or s|\x96|–|g or s|\x96|-|g
at the right place should fix that.
Or you can change the encoding in the header.

Changing the encoding seems like the easiest way. Is this an OK
solution, is windows-1252 an acceptable charset for XMLTV data? How
would this work with e.g. the tv_grab_combiner - would it be able to
cope with some of the listings being in iso-8859-1, some in
Windows-1252 and some in UTF-8...?

I think at a minimum, grabbers should support at least UTF-8 (after
all, it is 2010). As long as the source data can be successfully
decoded in Perl, outputting as UTF-8 should not be a problem.

Are there any circumstances where it would be preferable for a grabber
to _only_ output in Windows-1252 or ISO-8859-1, rather than UTF-8?

Cheers,
Nick

--
Nick Morrott

MythTV Official wiki: http://mythtv.org/wiki/
MythTV users list archive: http://www.gossamer-threads.com/lists/mythtv/users

"An investment in knowledge always pays the best interest." - Benjamin Franklin

Per Lundberg

2010-11-07 18:22:56 UTC

Hi Nick,

Post by Nick Morrott
I think at a minimum, grabbers should support at least UTF-8 (after
all, it is 2010). As long as the source data can be successfully
decoded in Perl, outputting as UTF-8 should not be a problem.

Hehe, you've got a point there. :-)

Post by Nick Morrott
Are there any circumstances where it would be preferable for a grabber
to _only_ output in Windows-1252 or ISO-8859-1, rather than UTF-8?

AFAIK, there isn't any "standard" way (I mean, supported by the
XMLTV::Options module) to handle different character encodings. And
really, I don't know if it would be worth it to implement this. I
mean, XML has a standard way of defining the character encoding, so
outputting anything supported by XML should be "OK" (= acceptable), as
far as validity of the XML data goes.

Then again, within the XMLTV project, we can of course be more strict
than this in our recommendations; that's a slightly different matter.
I will consider adapting the grabber to output pure UTF-8, it's the
most forward-looking solution to this problem.

--
Best regards,
Per Lundberg

Karl Dietz

2010-11-05 23:27:26 UTC

Hi,

Post by Per Lundberg
Thanks. I remember you mentioning that the "new" data source I've
moved over to now supports program categories (which it supposedly
does). However, there is a slight problem with this: the categories
are not visible anywhere on the site. :-) I guess I could ask my
existing YLE contacts about this, to see if they would have a list...
so we don't have to guess.

Hmm, I saw this and thought it might be easy to map it to program type,
but didn't really try to map it all (or check how good the data is)

Visa: <select id="highlight_dropdown" name="h" tabindex="5">
  <option value="0" selected="selected"></option>
  <option value="1">Nyheter</option>
  <option value="2">Aktualiteter</option>
  <option value="3">Fakta</option>
  <option value="4">Sport</option>
  <option value="6">Utländska serier</option>
  <option value="7"></option>
  <option value="8">Barnprogram</option>
  <option value="9">Undervisning</option>
  <option value="10">Underhållning</option>
</select>

$('#highlight_dropdown').change(function() {
  $.cookie('highlight', this.value);
  $('.programme.highlight').removeClass('highlight');
  $('.programme.category'+this.value).addClass('highlight');
});

<div class="programme clear category6" id="mtv1140" style="">
... details for one program ...
</div>

Regards,
Karl

Per Lundberg

2010-11-07 18:33:53 UTC

On Sat, Nov 6, 2010 at 1:27 AM, Karl Dietz

Hmm, I saw this and thought it might be easy to map it to program type, but
didn't really try to map it all (or check how good the data is)

Duh! You've obviously looked more carefully into this than me, I
didn't see that... :-) Do you know Swedish btw?

Visa: <select id="highlight_dropdown" name="h" tabindex="5">
  <option value="0" selected="selected"></option>
  <option value="1">Nyheter</option>
  <option value="2">Aktualiteter</option>
  <option value="3">Fakta</option>
  <option value="4">Sport</option>
  <option value="6">Utländska serier</option>
  <option value="7"></option>
  <option value="8">Barnprogram</option>
  <option value="9">Undervisning</option>
  <option value="10">Underhållning</option>
</select>

Very quickly translating this into English, this would be something like:

1 = News
2 = Actualities
3 = Documentaries
4 = Sport
6 = "Foreign TV shows". This is quite a weird category. :-) I guess
the point of it is to be able to deliberately exclude/include programs
which are subtitled in Finnish or Swedish. It doesn't really say
anything about what *kind* of TV shows we are talking about, only that the
audio stream is likely to be in a "foreign" language...
8 = Children's programs
9 = "Education" - slightly similar to documentaries, but this seems to
include a set of programs aimed at education in school.
10 = Entertainment

Adding these mappings to the grabber should be pretty simple, since we
have the category as a CSS class selector. When outputting the
categories, is it enough to give the YLE-provided names in Swedish, or
should I provide them in English as well? What's the reasonable
"standard" way of doing it? Anyone using the grabber is reasonably
fluent in Swedish (or shall we put it like this; they are probably
more fluent in Swedish than in Finnish...) but of course, adding the
English categories could be done as well, to be a bit nice with
immigrants/foreigners poking around. :-)
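
Since the category number is part of the class attribute (for example "programme clear category6"), the mapping can be a simple lookup table. A minimal Python sketch of the idea, using the Swedish names from the dropdown and the quick English translations above; the function name is invented, and values 0 and 7 are left out because their labels are empty on the site:

```python
import re

# Category number -> (Swedish name from the YLE page, rough English name).
YLE_CATEGORIES = {
    1: ("Nyheter", "News"),
    2: ("Aktualiteter", "Actualities"),
    3: ("Fakta", "Documentaries"),
    4: ("Sport", "Sport"),
    6: ("Utländska serier", "Foreign TV shows"),
    8: ("Barnprogram", "Children programs"),
    9: ("Undervisning", "Education"),
    10: ("Underhållning", "Entertainment"),
}


def categories_from_class(class_attr):
    """Look up the category names for a programme's class attribute.

    Returns a (swedish, english) tuple, or None if the attribute carries
    no known category number.
    """
    match = re.search(r"\bcategory(\d+)\b", class_attr)
    if match is None:
        return None
    return YLE_CATEGORIES.get(int(match.group(1)))


print(categories_from_class("programme clear category6"))
```

Keeping both names in the table leaves the question of which language to emit open; the grabber could write one <category> element per language with the matching lang attribute.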

--
Best regards,
Per Lundberg

Karl Dietz

2010-11-08 08:35:23 UTC

Hi,

Post by Per Lundberg

Hmm, I saw this and thought it might be easy to map it to program type, but
didn't really try to map it all (or check how good the data is)

Duh! You've obviously looked more carefully into this than me, I
didn't see that... :-) Do you know Swedish btw?

I just took a peek at the site with Google's translator; that gave it
away :-)
MythTV, Windows 7 MCE and probably many others have program types like:
tvshows, sports, series, movies

If you want to use the canned searches in MythTV, like "upcoming movies with
good rating" you'll have to use the special "categories" as seen here:
http://svn.mythtv.org/trac/browser/trunk/mythtv/libs/libmythtv/mpeg/dvbdescriptors.cpp#L205

{ "", "movie", "series", "sports", "tvshow", };

mapped with an additional one-off ("film") for _uk_rt here (not case
sensitive):
http://svn.mythtv.org/trac/browser/trunk/mythtv/programs/mythfilldatabase/xmltvparser.cpp#L344

Post by Per Lundberg

Visa:<select id="highlight_dropdown"name="h"tabindex="5">
<option value="0"selected="selected"></option>
<option value="1">Nyheter</option>
<option value="2">Aktualiteter</option>
<option value="3">Fakta</option>
<option value="4">Sport</option>
<option value="6">Utländska serier</option>
<option value="7"></option>
<option value="8">Barnprogram</option>
<option value="9">Undervisning</option>
<option value="10">Underhållning</option>
</select>

Adding these mappings to the grabber should be pretty simple, since we
have the category as a CSS class selector. When outputting the
categories, is it enough to give the YLE-provided names in Swedish, or
should I provide them in English as well? What's the reasonable
"standard" way of doing it? Anyone using the grabber is reasonably
fluent in Swedish (or shall we put it like this; they are probably
more fluent in Swedish than in Finnish...) but of course, adding the
English categories could be done as well, to be a bit nice with
immigrants/foreigners poking around. :-)

That's a very good question! As the grabbers are used by programs
mostly I'd not expect them to speak any language fluently :-)

And as there are lots of nice standards, but no standard way in xmltv land
I tend to go along these lines (with the help of some online translator):
http://repo.or.cz/w/nonametv.git/blob/HEAD:/examples/Categories

Over at NonameTV we write two values, one for the program type and one
for the genre. For my German guide I'm staying with the categories listed
above, e.g.:
<title lang="de">Beckmann</title>
<desc lang="de">...</desc>
<category lang="en">tvshow</category>
<category lang="en">Talk</category>
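
The two-category convention (one element for the program type, one for the genre) is easy to produce with any XML writer. A small Python sketch of it, not taken from NonameTV or the grabber; the function name is made up, and a real <programme> element would also need its start and channel attributes:

```python
import xml.etree.ElementTree as ET


def build_programme(title, title_lang, program_type, genre):
    """Build a <programme> with a title and up to two English categories.

    program_type is one of the MythTV-style types (movie, series, sports,
    tvshow); genre is free-form. Empty values are skipped.
    """
    programme = ET.Element("programme")
    title_el = ET.SubElement(programme, "title", lang=title_lang)
    title_el.text = title
    for value in (program_type, genre):
        if value:
            category = ET.SubElement(programme, "category", lang="en")
            category.text = value
    return programme


xml_text = ET.tostring(
    build_programme("Beckmann", "de", "tvshow", "Talk"), encoding="unicode"
)
print(xml_text)
```

Writing the program type first keeps it easy to find for consumers that only look at the first category.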

I wouldn't be surprised if the list came from the guide source that
mythtv used first...

Regards,
Karl

PS: I have no idea if any other program parses the xmltv categories
for program types, but it would be nice if they could.

Ben Bucksch

2010-11-08 11:10:16 UTC

Post by Karl Dietz
PS: I have no idea if any other program parses the xmltv categories
for program types, but it would be nice if they could.

Yes, zeipis (my mythtv rewrite) also uses "movie", "series" etc., in the
same way (and with the same names) as mythtv, and in more ways: e.g. when
you schedule a movie for recording, it by default uses "find one airing";
for series it uses "record each episode once" by default.
