[Xmltv-devel] New grabber - tv_grab_pl

Michał Cichowicz

2010-04-21 00:04:33 UTC

I have started writing a grabber for Poland that fetches data from
http://tv.wp.pl/. There was a service, http://tv-guide.ubuntu.pl/, doing
something quite similar but publishing a prepared XMLTV file. Unfortunately
there was a copyright issue and the service had to be shut down.
Grabbing the data yourself shouldn't raise any legal problems,
I hope.

Best regards,
Michal Cichowicz

Michał Cichowicz

2010-04-23 02:19:55 UTC

Post by Michał Cichowicz
I have started writing a grabber for Poland that fetches data from
http://tv.wp.pl/. There was a service, http://tv-guide.ubuntu.pl/, doing
something quite similar but publishing a prepared XMLTV file. Unfortunately
there was a copyright issue and the service had to be shut down.
Grabbing the data yourself shouldn't raise any legal problems,
I hope.

Is there any restriction on the programming language for a grabber to be
included in the package?
All the grabbers are written in Perl. Unfortunately I'm not very
familiar with scripting languages in general and prefer native ones.
Can it be written in C++?

Best Regards,
Michal Cichowicz

Robert Eden

2010-04-23 04:07:33 UTC

Post by Michał Cichowicz
Is there any restriction on the programming language for a grabber to be
included in the package?
All the grabbers are written in Perl. Unfortunately I'm not very
familiar with scripting languages in general and prefer native ones.
Can it be written in C++?

To be part of the main XMLTV distribution, it probably should be Perl.
I can see lots of Makefile.PL issues if we mix languages.

Now, the XMLTV SourceForge project could host another "product" for a
different grabber. We would simply create a different top-level folder
(there's already one for a "libxmltv" thing that didn't catch on). The
new "Files" download category also makes this sort of thing pretty
trivial. I could make sure someone on the sub-project has "release
admin" rights to put stuff in "Files" and trust them not to screw up the
main product.

The XMLTV SF project would basically be providing the hosting and making
it easier to find it. If you extend it out to say 20 different
SourceForge projects, it does probably make things more efficient to
keep it together.

What do the package maintainers think? Would you rather support a
separate package? Would you rather build it from another SF project (or
another host), or from somewhere off the XMLTV SF project?

No matter where it's hosted, it should be mentioned in the XMLTV.ORG
wiki, that's for sure.

Robert

Francois Gouget

2010-04-23 07:24:35 UTC

On Fri, 23 Apr 2010, Michał Cichowicz wrote:
[...]

Independently of the project/packaging aspects, I'll just throw in a
word of caution concerning the choice of language.

Grabbers generally need to parse web pages that have relatively little
structure. So in addition to the HTML/DOM searches they usually have to
do a lot of string matching, substring extraction, etc. That's the kind
of task which can get quite tiresome if you have to worry about
allocating buffers big enough and then about freeing them, but for which
languages like perl and python are pretty well suited. In particular
perl's regular expression support is quite handy.
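
To make the string-matching point concrete, here is a minimal sketch in Python of pulling start times and titles out of loosely structured markup with a regular expression. The HTML fragment, the pattern and the function name are invented for illustration; they are assumptions, not the markup of any real listings site.

```python
import re

# Hypothetical listing fragment; a real page will look different.
SAMPLE_HTML = """
<div class="prog"><span>20:05</span> <b>Wiadomości</b></div>
<div class="prog"><span>20:30</span><b>Film: Rejs </b></div>
"""

# Start time inside <span>, title inside <b>, tolerating stray whitespace.
PROGRAMME_RE = re.compile(
    r"<span>\s*(\d{1,2}):(\d{2})\s*</span>\s*<b>\s*(.*?)\s*</b>",
    re.S,
)


def extract_programmes(html):
    """Return a list of (hour, minute, title) tuples found in html."""
    return [(int(h), int(m), title) for h, m, title in PROGRAMME_RE.findall(html)]


programmes = extract_programmes(SAMPLE_HTML)
```

No buffers are sized or freed by hand here, which is the convenience being described; the same pattern could be used from other regex libraries with more bookkeeping.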

Also perl grabbers can rely on a library to handle the details of
validating and generating the xmltv file, to retrieve the web pages
slowly and to perform date/time arithmetic. If these have no suitable
equivalent in your language of choice that may increase the scope of
your project (but once you've done the hard work writing other grabbers
in that language would become easier).

So writing a grabber in another language might be harder than you think,
possibly harder than just learning perl. So there you have it: the word
of caution. But ultimately the choice is yours.

--
Francois Gouget <***@free.fr> http://fgouget.free.fr/
Cahn's Axiom: When all else fails, read the instructions.

Michał Cichowicz

2010-04-23 13:18:41 UTC

Post by Francois Gouget
So writing a grabber in another language might be harder than you think,
possibly harder than just learning perl. So there you have it: the word
of caution. But ultimately the choice is yours.

Actually the grabber is almost ready, written in maybe 12 hours. Right
now I'm only making it configurable (baseline and manualconfig
compliant), because the full channel list with listings for every
available date, without programme descriptions, is about 17 MiB in size.
The standard C regex utility and libxml are also very handy. ;)

Best Regards,
Michal Cichowicz

Robert Eden

2010-04-23 14:37:48 UTC

Post by Francois Gouget
Grabbers generally need to parse web pages that have relatively little
structure.

Minor terminology quibble..

grabber eq something that downloads data
scraper eq something that downloads data by scraping a web site.

Not all grabbers are scrapers... (but in this case it is, of course)
(_na_dd, eu_epgdata, uk_rt all come to mind)

Robert

Michał Cichowicz

2010-11-28 01:22:32 UTC

A long time ago (about 7 months) I mentioned that I was writing a
grabber for Poland. I had some personal issues and more important things
to do, so although it was actually working, I put the work on it
aside.

Now, I have some problems:

1. There is such a huge amount of data (278 channels) on the website, and
the description of every single programme is on a separate page, that it
takes quite a long while to grab all the channels, even when using several
connections at once (consider 90 minutes to output 23 MB of XML data).
In fact, grabbing all the channels causes the validator to terminate
the grabber because of a time-out. Grabbing 12 channels for 8 days takes
over 4 minutes (over 2 sec per channel-day). If I disabled grabbing
descriptions, it would be a lot faster. Can I validate the grabber
on only some of the channels? On 12 channels it passes validation (version
0.5.56 from Ubuntu Lucid). Should I add an option to disable grabbing
descriptions?

2. Sometimes the service supplies bad (non-additive) data and I'm
not sure whether the grabber should peek at the following day to verify
correctness.

3. I discovered today that for many foreign channels the service
reports the wrong language. Those channels are listed in the
Polish-language category but their titles and descriptions are in their
native languages (e.g. German, French, Italian or Czech). My grabber
supplies the language information.

4. I have no idea what it is like in other countries, but in Poland it
is something of a tradition that the TV-listing date changes at about 5
AM. My grabber uses the website calendar as its reference, so grabbing
"today" at 1 AM actually grabs "yesterday". I believe it
makes more sense to do so.

5. It's still written in C++ and I have better things to do than
learn Perl. I would like to know if there is any interest in the
grabber at all. I have an account on SF and I'm ready to send the source
code. It requires libxml-2.0 and libcurl to compile.
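
As a rough cross-check of the timings in point 1, the per-channel-day cost can be scaled up to the full channel list. This is only a back-of-the-envelope sketch in Python: the function names are made up, "over 4 minutes" is taken as exactly 240 seconds, and the cost is assumed to grow linearly with the number of channel-days.

```python
def seconds_per_channel_day(total_seconds, channels, days):
    """Average cost of grabbing one channel for one day."""
    return total_seconds / (channels * days)


def estimated_minutes(channels, days, per_channel_day_seconds):
    """Linear extrapolation of the total run time, in minutes."""
    return channels * days * per_channel_day_seconds / 60


# 12 channels for 8 days in (at least) 4 minutes.
rate = seconds_per_channel_day(4 * 60, channels=12, days=8)

# All 278 channels for the same 8 days at that rate.
full_run = estimated_minutes(278, 8, rate)

print(f"{rate:.1f} s per channel-day, about {full_run:.0f} minutes for 278 channels")
```

Under these assumptions the rate is 2.5 seconds per channel-day and the full run comes to roughly 93 minutes, which agrees with the 90 minutes quoted above.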

Best Regards,
Michał Cichowicz
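
The 5 AM rollover in point 4 can be sketched as a small date calculation. The grabber described above takes the date from the website calendar instead, so this Python version is only an illustration of the convention; the 5-hour cutoff, the constant and the function name are assumptions, and time zones and daylight saving are ignored.

```python
from datetime import date, datetime, timedelta

# Assumed cutoff: the listing day starts at 05:00 local time.
ROLLOVER_HOUR = 5


def listing_date(now, rollover_hour=ROLLOVER_HOUR):
    """Return the listing day that a naive local timestamp belongs to."""
    # Shifting the clock back by the cutoff maps 00:00-04:59 onto the
    # previous calendar day and leaves later times on the same day.
    return (now - timedelta(hours=rollover_hour)).date()


# 1 AM on 28 November still belongs to the listing day of 27 November.
late_night = listing_date(datetime(2010, 11, 28, 1, 0))
morning = listing_date(datetime(2010, 11, 28, 5, 0))
```

With this convention a request for "today" made at 1 AM resolves to the previous calendar date, matching the behaviour described in the post.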
