Discussion:
[Xmltv-devel] XMLTV Cache Database
Mariano Cosentino
2010-06-02 13:41:35 UTC
Hi everyone,
the subject of this email might not be the correct one, but
I find it hard to label it correctly.
In Argentina, outside prime time, every workday has
almost exactly the same programming (soap operas, talk shows, old
reruns, etc.), and most of the time the description will just be the
generic one for the show and not specific to the episode. Also, the
movie channels constantly repeat the same movies several times a
week (even several times a day).
As I look at the way TV_GRAB_AR works, I cannot help but
notice that it retrieves the same information over and over again,
placing a useless load on the providers, and I really want to
avoid bothering them too much.
To the point: I recently rewrote TV_GRAB_AR and
included a really simple cache that checks the provider's programID
and skips downloading program descriptions that have already been
downloaded in that run. This has greatly reduced the number of program
descriptions I need to download. But I feel we should go a
step further and use a persistent cache, one that we can check
in subsequent runs so we never have to re-download the same data
again.
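
Just to illustrate the idea, something like this is what I have in
mind (only a rough sketch; the cache file location and the
fetch_description_from_provider() helper are placeholders, not real
code from the grabber):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Storable qw(retrieve store);

  # Hypothetical cache file location.
  my $cache_file = "$ENV{HOME}/.xmltv/tv_grab_ar.cache";

  # Load the cache left by the previous run, or start empty.
  my $cache = -e $cache_file ? retrieve($cache_file) : {};

  sub get_description {
      my ($prog_id) = @_;

      # Reuse whatever we already fetched for this provider programID.
      return $cache->{$prog_id}{desc} if exists $cache->{$prog_id};

      # Placeholder for the real download-and-parse step.
      my $desc = fetch_description_from_provider($prog_id);
      $cache->{$prog_id} = { desc => $desc, fetched_at => time() };
      return $desc;
  }

  # ... grab the schedule, calling get_description() as needed ...

  # Persist the cache for the next run.
  store($cache, $cache_file);

A DBM file or a small SQLite database would of course work just as
well as Storable here.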
Besides reducing the workload, this should also help with data
enrichment, as this cache database can be matched against the rest of
the available sources to get better information.
Now, before starting to work on it, I wanted to know what
you think about it, whether anyone has tried it and/or sees any issues
with it; even more interesting would be to learn that someone has
already done it (so I can be really lazy and just adapt it to
our needs).
Best Regards, Marianok
Robert Eden
2010-06-02 16:51:28 UTC
Post by Mariano Cosentino
To the point: I recently rewrote TV_GRAB_AR and
included a really simple cache that checks the provider's programID
and skips downloading program descriptions that have already been
downloaded in that run. This has greatly reduced the number of program
descriptions I need to download. But I feel we should go a
step further and use a persistent cache, one that we can check
in subsequent runs so we never have to re-download the same data
again.
Caching is a great idea... but if you cache between runs, how will you
know whether the details have been updated for an existing episode-id?

Robert
Tom Furie
2010-06-02 17:38:12 UTC
Post by Robert Eden
Caching is a great idea... but if you cache between runs, how will you
know whether the details have been updated for an existing episode-id?
Download and compare with the locally cached copy of course! :-)
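
Something along these lines, for instance (a rough, untested sketch;
it assumes each cache entry keeps a digest of the raw page bytes,
keyed by programID):

  use Digest::MD5 qw(md5_hex);

  # Re-download the page and only treat the entry as updated when the
  # content actually differs from what we cached last time.
  sub details_changed {
      my ($cache, $prog_id, $fresh_html) = @_;
      my $new_digest = md5_hex($fresh_html);
      my $old_digest = $cache->{$prog_id}{digest} // '';
      return 0 if $new_digest eq $old_digest;    # unchanged, keep cached details
      $cache->{$prog_id}{digest} = $new_digest;  # remember the new version
      return 1;
  }

You still hit the provider, but you skip re-parsing and re-writing
entries that have not changed.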

Cheers,
Tom
--
Etiquette is for those with no breeding; fashion for those with no taste.
Mattias Holmlund
2010-06-02 18:35:36 UTC
Post by Mariano Cosentino
To the point: I recently rewrote TV_GRAB_AR and
included a really simple cache that checks the provider's programID
and skips downloading program descriptions that have already been
downloaded in that run. This has greatly reduced the number of program
descriptions I need to download. But I feel we should go a
step further and use a persistent cache, one that we can check
in subsequent runs so we never have to re-download the same data
again.
Perhaps you could use HTTP::Cache::Transparent. It is a persistent HTTP
cache and it is already used by tv_grab_se_swedb and tv_grab_uk_rt.
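
Roughly, a grabber just calls init() once before its first request and
then uses LWP::UserAgent as usual. Something like this (the cache
directory, the timing and the URL below are only examples):

  use strict;
  use warnings;
  use HTTP::Cache::Transparent;
  use LWP::UserAgent;

  # Initialize the cache before the first request; after that, every
  # LWP::UserAgent GET goes through the on-disk cache transparently.
  HTTP::Cache::Transparent::init( {
      BasePath => "$ENV{HOME}/.xmltv/cache",  # example cache directory
      NoUpdate => 60 * 60,                    # example: trust entries younger than 1 hour
      Verbose  => 1,
  } );

  my $ua  = LWP::UserAgent->new;
  my $res = $ua->get('http://www.example.com/grilla');  # placeholder URL
  print $res->decoded_content if $res->is_success;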

/Mattias
Mariano Cosentino
2010-06-04 18:25:20 UTC
Date: Wed, 02 Jun 2010 11:51:28 -0500
Caching is a great idea... but if you cache between runs, how will you
know whether the details have been updated for an existing episode-id?
Well, I know that the source we use for TV_Grab_AR uses a different ID
for each episode (in the very few cases where they bother to detail
episodes) if the description is different, so the only thing I have to
worry about is the possibility of an episode (or a generic program
description) changing from time to time (like a change in the cast).
Most of the time, the "episode" information will be right in the title,
"LOST: SEASON 1 EPISODE 1: PILOT", but the description will be the
same for all episodes, "Some people crash on an island and try to
survive unknown dangers", and the program ID on their web server is
also the same for all episodes.

I would probably have to run a sporadic "update job", which could run
every x hits (e.g. every 500 instances) or after a certain period of
time (if the cached description is older than x days). Even that would
be much better than downloading "Maria is a housewife who discovers
that her husband is cheating on her, but Maria will take revenge" 5
times a week, 240 times a year.

Also, as I mentioned in my previous message, I think we can use this
database not just as a cache but also as an additional data source and
a repository, one that can be enriched with information from
other places so we have a good source of data even if what the
provider supplies is not so good. Instead of having a generic
description for LOST, we could have the real information for the exact
episode. Kind of a local IMDB, but with better search and editing
capabilities.
Perhaps you could use HTTP::Cache::Transparent. It is a persistent HTTP
cache and it is already used by tv_grab_se_swedb and tv_grab_uk_rt.
/Mattias
Thanks Mattias, I'll try that to reduce the load on the provider.

Regards, Marianok
