Post by Michał Cichowicz
I have started writing a grabber for Poland fetching data from
http://tv.wp.pl/. There used to be a service, http://tv-guide.ubuntu.pl/,
that did something quite similar but published a ready-made XMLTV file.
Unfortunately there was a copyright issue and the service had to be shut
down. Grabbing the data for personal use shouldn't raise any legal
problems, I hope.
Is there any restriction on the programming language for a grabber to be
included in the pack?
All the grabbers are written in Perl. Unfortunately I'm not very
familiar with scripting languages in general and prefer natively
compiled ones. Can it be written in C++?
A long time ago (about 7 months) I mentioned that I was writing a
grabber for Poland. I had some personal issues and more important things
to do, so I put the work on hold, even though the grabber was actually
working.
Now I have some problems:
1. There is such a huge amount of data (278 channels) on the website, and
the description of every single programme is on a separate page, that it
takes quite a long while to grab all the channels, even when using
several connections at once (about 90 minutes, outputting 23 MB of XML
data). In fact, grabbing all the channels causes the validator to
terminate the grabber because of a time-out. Grabbing 12 channels for 8
days takes over 4 minutes (over 2 sec per channel-day). If I disabled
grabbing descriptions, it would be a lot faster. Can I validate the
grabber on only some of the channels? On 12 channels it passes
validation (version 0.5.56 from Ubuntu Lucid). Should I add an option to
disable grabbing descriptions?
2. Sometimes the service supplies faulty (non-additive) data and I'm
not sure whether the grabber should peek at the following day to verify
correctness.
3. I discovered today that for many foreign channels the service gives
the wrong language. Those channels are listed in the Polish-language
category, but their titles and descriptions are in the channels' native
languages (e.g. German, French, Italian or Czech). My grabber supplies
the language information.
4. I have no idea how it works in other countries, but in Poland it is
a kind of tradition that the TV-listing day changes at about 5 AM. My
grabber uses the website's calendar as its reference, so grabbing for
"today" at 1 AM actually grabs the listings for "yesterday". I believe
it makes more sense to do so.
5. It's still written in C++ and I have better things to do than learn
Perl. I would like to know if there is any interest in the grabber at
all. I have an account on SF and I'm ready to send the source code. It
requires libxml-2.0 and libcurl to compile.
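To put the numbers from point 1 side by side, here is a rough back-of-the-envelope sketch. It assumes the run time scales linearly with the number of channel-days and takes the 12-channel run as exactly 4 minutes (it was actually a bit more); the function names are only illustrative and are not taken from the grabber.

```cpp
#include <cassert>

// Illustrative helpers only; not part of the actual grabber.

// Average cost of one channel-day, derived from a measured run.
double secondsPerChannelDay(double measuredSeconds, int channels, int days) {
    return measuredSeconds / (channels * days);
}

// Projected run time, assuming the cost scales linearly with channel-days.
double estimateSeconds(int channels, int days, double perChannelDay) {
    return channels * days * perChannelDay;
}
```

With 240 s for 12 channels over 8 days this gives 2.5 s per channel-day, and 278 channels over 8 days come out at 5560 s, i.e. roughly 93 minutes, which matches the 90 minutes or so I see for a full grab.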
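The day-change rule from point 4 could be sketched roughly like this. It is a minimal illustration that assumes a fixed 5 AM cutoff (the website's own calendar may switch at a slightly different time); Date, listingDate and rolloverHour are hypothetical names, not the grabber's real code.

```cpp
#include <cassert>

// Hypothetical illustration of the listing-day rule; names are assumptions.
struct Date { int year, month, day; };

bool isLeap(int y) { return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0; }

int daysInMonth(int y, int m) {
    static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    return (m == 2 && isLeap(y)) ? 29 : days[m - 1];
}

// Calendar date immediately before d.
Date previousDay(Date d) {
    if (d.day > 1) { --d.day; return d; }
    if (d.month > 1) { --d.month; } else { d.month = 12; --d.year; }
    d.day = daysInMonth(d.year, d.month);
    return d;
}

// Before the cutoff hour, "today" in the listings is still the previous
// calendar day; from the cutoff onwards it is the calendar day itself.
Date listingDate(Date calendarDate, int hour, int rolloverHour = 5) {
    return hour < rolloverHour ? previousDay(calendarDate) : calendarDate;
}
```

For example, listingDate({2011, 3, 1}, 1) yields 28 February 2011, while the same call with hour 5 yields 1 March 2011.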
Best Regards,
Michał Cichowicz