[Xmltv-devel] imdb.pm issues with episodes of a series

Discussion:

h***@gmail.com

2014-05-01 07:54:51 UTC

While working on tv_imdb I've come across an issue with the way it currently handles episodes of a series. Basically it just ignores episodes and lumps everything together under the series title, so you get the entire series all rolled up.

I believe this is wrong. E.g. try listing the actors for:

grep '"Doctor Who" (1963)' stage3.data

or

grep '"Coronation Street" (1960)' stage3.data

I think it's no wonder it needs so much memory to run ;-)

However the final stage (creating the actual database) then ignores these records since they don't match stage1.data which *does* still contain the episode in the title!

grep '"Doctor Who" (1963)' stage1.data

So the net result is that "Doctor Who" (1963) has no data at all (except ratings) in the final database:

doctor%20who%20%281963%29 "Doctor Who" (1963) 1963 tv_series 0216306
0216306:<> <> <> 0..001221. 42 7.3 <> <>

I propose changing this so IMDB.pm becomes 'episode-aware' (will require changes to the file layout).

I can then subsequently amend the searching lookup (tv_imdb) so it properly looks for series title + episode title.
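
As a rough illustration of what 'episode-aware' could mean, here is a minimal Python sketch (not the Perl in IMDB.pm) that splits a title into series, year, episode title and season/episode numbers. It assumes titles follow the layout "Series" (YYYY) {Episode Title (#S.E)}; the names ParsedTitle and parse_title are hypothetical and not part of XMLTV.

```python
import re
from typing import NamedTuple, Optional


class ParsedTitle(NamedTuple):
    series: str
    year: Optional[int]
    episode_title: Optional[str]
    season: Optional[int]
    episode: Optional[int]


# Assumed layout: "Series" (YYYY) {Episode Title (#S.E)}
# The {...} part is optional; the year may be ???? or carry a /I style suffix.
_TITLE_RE = re.compile(
    r'^"(?P<series>.+?)"\s+\((?P<year>\d{4}|\?{4})(?:/[IVXLC]+)?\)'
    r'(?:\s+\{(?P<ep>.*)\})?\s*$'
)
# Trailing "(#season.episode)" marker inside the braces.
_EPNUM_RE = re.compile(r'\s*\(#(?P<s>\d+)\.(?P<e>\d+)\)$')


def parse_title(raw: str) -> Optional[ParsedTitle]:
    """Split a quoted series title into its parts; None if it is not a series."""
    m = _TITLE_RE.match(raw)
    if m is None:
        return None
    year = int(m.group("year")) if m.group("year").isdigit() else None
    ep = m.group("ep")
    season = episode = None
    if ep is not None:
        n = _EPNUM_RE.search(ep)
        if n:
            season, episode = int(n.group("s")), int(n.group("e"))
            ep = ep[:n.start()]  # drop the (#S.E) marker from the episode title
        ep = ep.strip() or None
    return ParsedTitle(m.group("series"), year, ep, season, episode)
```

With something like this, the series-level record and each episode record can be kept apart instead of every episode's data being filed under the bare series title.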

Your thoughts please?

Rgds,
Geoff

Karl Dietz

2014-05-01 18:26:52 UTC

Post by h***@gmail.com
I propose changing this so IMDB.pm becomes 'episode-aware' (will require changes to the file layout).
I can then subsequently amend the searching lookup (tv_imdb) so it properly looks for series title + episode title.
Your thoughts please?

I like it.
While I've not yet done much with unit test in Perl it might be good to
start with adding tests for the IMDB functions so you can refactor with
a safety net (aka noticing breakage early).

Regards,
Karl

h***@gmail.com

2014-05-01 20:16:49 UTC

Post by Karl Dietz
While I've not yet done much with unit test in Perl it might be good to
start with adding tests for the IMDB functions so you can refactor with
a safety net (aka noticing breakage early).

Good idea, I'll look into it.

What do you think about switching to a SQL engine, something like DBD::CSV ( http://search.cpan.org/~hmbrand/DBD-CSV-0.41/lib/DBD/CSV.pm )? Or should we stick with plain old Search::Dict?

Rgds,
Geoff

Robert Eden

2014-05-02 02:21:17 UTC

Post by h***@gmail.com

Good idea, I'll look into it.
What do you think about switching to a SQL engine, something like DBD::CSV ( http://search.cpan.org/~hmbrand/DBD-CSV-0.41/lib/DBD/CSV.pm )? Or should we stick with plain old Search::Dict?

Wow... DBD::CSV allows a SQL interface to a CSV backend with just
Perl? I never knew such a thing existed!

Ever use it? How's the performance? Does it memory map things for
faster performance?

Robert

h***@gmail.com

2014-05-02 09:05:33 UTC

Post by Robert Eden
Wow... DBD::CSV allows a SQL interface to a CSV backend with just
Perl? I never knew such a thing existed!
Ever use it? How's the performance? Does it memory map things for
faster performance?

I've not used this particular one, but I expect there's a slight performance overhead in using a SQL engine, and also in using CSV as the datafile. The benefits in terms of searching power are great, though: e.g. it would be much easier to implement 'close' matching, and it would handle multiple files better. (After adding keywords and plot summaries to the datafile I think we need to split the file into several parts, e.g. one each for actors, keywords and plots, and one for everything else.)
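
To make the 'searching power' point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a Perl DBI driver. The table and column names, the helper functions and the sample rows are all invented for illustration; none of this is an agreed XMLTV schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one table per kind of data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE titles   (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, rectype TEXT);
        CREATE TABLE actors   (title_id INTEGER REFERENCES titles(id), name TEXT);
        CREATE TABLE keywords (title_id INTEGER REFERENCES titles(id), keyword TEXT);
        CREATE TABLE plots    (title_id INTEGER REFERENCES titles(id), plot TEXT);
        CREATE INDEX titles_by_title ON titles (title);
    """)
    # Invented sample rows, not real IMDb data.
    conn.executemany("INSERT INTO titles VALUES (?, ?, ?, ?)", [
        (1, "Example Film", 2001, "movie"),
        (2, "Example Film", 1975, "movie"),
        (3, "Another Example", 2001, "tv_series"),
    ])
    conn.executemany("INSERT INTO actors VALUES (?, ?)", [
        (1, "Actor One"), (1, "Actor Two"), (2, "Actor Three"),
    ])
    return conn


def close_matches(conn, title, year, tolerance=1):
    """Return (id, year) rows for the title within +/- tolerance years, nearest first."""
    rows = conn.execute(
        "SELECT id, year FROM titles "
        "WHERE title = ? COLLATE NOCASE AND year BETWEEN ? AND ? "
        "ORDER BY ABS(year - ?), id",
        (title, year - tolerance, year + tolerance, year),
    )
    return rows.fetchall()


def actors_for(conn, title_id):
    """Fetch actor names from the separate actors table."""
    rows = conn.execute(
        "SELECT name FROM actors WHERE title_id = ? ORDER BY name", (title_id,)
    )
    return [name for (name,) in rows]
```

The off-by-one-year lookup becomes a single WHERE clause, and the actors, keywords and plots live in their own tables rather than in separate hand-indexed files.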

Which is why most people prefer something like SQLite to CSV. Perhaps the subtext to my question was, "do we need to stick to CSV file format?". Does anyone use the moviedb.dat file for another purpose?

Rgds,
Geoff

Jerry Veldhuis

2014-05-01 20:27:22 UTC

Geoff,
Can you send me a short sample of the XML you're using for input and
what you'd expect as output?
I'll take a closer look.

thx
jerry

On Thu, May 1, 2014 at 12:26 PM, Karl Dietz

Post by h***@gmail.com
I propose changing this so IMDB.pm becomes 'episode-aware' (will require changes to the file layout).
I can then subsequently amend the searching lookup (tv_imdb) so it properly looks for series title + episode title.
Your thoughts please?

I like it.
While I've not yet done much with unit test in Perl it might be good to
start with adding tests for the IMDB functions so you can refactor with
a safety net (aka noticing breakage early).
Regards,
Karl
------------------------------------------------------------------------------
"Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
Instantly run your Selenium tests across 300+ browser/OS combos. Get
unparalleled scalability from the best Selenium testing platform available.
Simple to use. Nothing to install. Get started now for free."
http://p.sf.net/sfu/SauceLabs
_______________________________________________
xmltv-devel mailing list
https://lists.sourceforge.net/lists/listinfo/xmltv-devel

h***@gmail.com

2014-05-02 09:05:47 UTC

Post by Jerry Veldhuis
Geoff,
Can you send me a short sample of the XML you're using for input and
what you'd expect as output?
I'll take a closer look.
thx
jerry

Hi Jerry,

For which? Do you mean for sub-title matching or for unit testing?

Cheers,
Geoff

Jerry Veldhuis

2014-05-02 14:47:15 UTC

I was going to look into the sub-title match.

BTW, I have experience with DBD::CSV, the performance would be a major
issue.
I've suggested moving to DBD::SQLite before, but there was opposition to
adding the perl dependencies.
I doubt anyone is using the .dat files directly.

jerry

h***@gmail.com

2014-05-02 16:30:05 UTC

Post by Jerry Veldhuis
I was going to look into the sub-title match.

Cool.

I think we could use the opportunity to also improve some of the matching (e.g. for off-by-one-year matches)

Here's the scratchings I made on the back of an old envelope: (in no particular order of importance or scale or completeness)

IDX file
- remove year "(xxxx)" from search key and title field
- add subtitle (minus the SnnEnn) as a separate field
- add series/ep as a separate field
- change 'tv_movie' etc to single char to save space
- remove double quotes from search key (better to use a combo of Title+Rectype than %27title%27 )
- remove punctuation, definite articles and spaces from the search key
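
A minimal Python sketch of the kind of search key these points describe. The function name search_key is hypothetical, only the English article "the" is handled, and the exact rules are assumptions for illustration rather than what IMDB.pm does today.

```python
import re

# "(1963)" or "(1963/II)" style year suffix.
_YEAR_RE = re.compile(r"\s*\(\d{4}(?:/[IVXLC]+)?\)")
# Only the English definite article is handled in this sketch.
_ARTICLE_RE = re.compile(r"\bthe\b")
# Anything that is not a letter or digit (punctuation, quotes, spaces).
_NON_WORD_RE = re.compile(r"[\W_]+")


def search_key(title: str) -> str:
    """Reduce a title to a compact lookup key."""
    key = _YEAR_RE.sub("", title)      # drop the year
    key = key.lower()
    key = _ARTICLE_RE.sub("", key)     # drop definite articles
    key = _NON_WORD_RE.sub("", key)    # drop punctuation, quotes and spaces
    return key
```

The same function would be applied both when building the IDX file and to the incoming title at search time, so that differences in punctuation or a leading "The" no longer cause a miss.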

DAT file
- split into multiple files (e.g. (i) actors (ii) keywords (iii) plots (iv) rest )
- calc idx key in stage 1 and store in stage1.data
- all subsequent stages can then create their files as key + data on-the-fly rather than title + data (reduces performance load in final stage, and final stage then not necessary at all for actors/keywords/plots files, if separate files - see above)

Searching
- remove punctuation, definite articles and spaces from the search-for title
- *expect* multiple hits for a search
- filter the hits based on:
    - director name (incoming <credits><director>)
    - Snn/Enn (incoming <episode-num>)
    - year (incoming <date>)
    - title (incoming <title>)
    - sub-title (incoming <sub-title>)
  all the above except <title> are optional and may not be present in incoming data!
- if no exact match then widen the filter (using same results list):
    - year +/- 2 (when director input)
    - year +/- 1 (when no director input)
    - ?? scoring system for words corresponding between incoming title and IDX title
    - ?? more?

The basic principle is to search once and then filter the results to find the 'best' match. (rather than multiple searches for 'exact', 'tv_series' and then 'close')

This is based on experience I have of matching films against IMDb for a different project. I matched records using an element scoring system and found the best items were Title > Director's name > Year. I.e. when possible, match on Title+Director+Year(+/- 2). If no director name available in input data then use Title+Year(+/- 1).
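
The search-once-then-filter idea can be sketched roughly as follows, in Python for brevity. Candidate and best_matches are hypothetical names, the candidates are assumed to be the hits already returned by one key lookup, and the word-scoring step is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    title: str
    year: int
    director: Optional[str] = None


def best_matches(
    candidates: Iterable[Candidate],
    year: Optional[int] = None,
    director: Optional[str] = None,
) -> List[Candidate]:
    """Filter one list of hits, widening the year window step by step."""
    hits = list(candidates)
    if director is not None:
        wanted = director.casefold()
        hits = [c for c in hits
                if c.director is not None and c.director.casefold() == wanted]
    if year is None:
        return hits
    # Allow +/- 2 years when a director confirms the match, else +/- 1.
    max_slack = 2 if director is not None else 1
    for slack in range(max_slack + 1):
        matched = [c for c in hits if abs(c.year - year) <= slack]
        if matched:
            return matched  # stop at the tightest window that has a hit
    return []
```

The exact match is tried first (slack 0) and the window is only widened when nothing is found, always against the same results list rather than by running a second search.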

I'm not suggesting you do all (or any ;) ) of this - I'm happy to give it a go when I get time, but any help will be gratefully received :-)

Cheers,
Geoff

p.s. pls ensure you pick up the latest IMDB.pm version 1.62 I uploaded to CVS this morning - it has the changes in for keywords and plots.

h***@gmail.com

2014-05-02 17:32:49 UTC

Post by Jerry Veldhuis
BTW, I have experience with DBD::CSV, the performance would be a major
issue.
I've suggested moving to DBD::SQLite before, but there was opposition to
adding the perl dependencies.

Do you know what deps it needs? I didn't think it needed any:

++++++++++++++++++
DBD::SQLite is a Perl DBI driver for SQLite, that includes the entire thing in the distribution. So in order to get a fast transaction capable RDBMS working for your perl project you simply have to install this module, and *nothing* else.
++++++++++++++++++

Robert Eden

2014-05-02 19:11:26 UTC

Post by h***@gmail.com

++++++++++++++++++
DBD::SQLite is a Perl DBI driver for SQLite, that includes the entire thing in the distribution. So in order to get a fast transaction capable RDBMS working for your perl project you simply have to install this module, and *nothing* else.
++++++++++++++++++

I just tried installing DBD::SQLite on my current desktop and it
built and installed pretty easily (Windows Strawberry Perl).

I'll try building it on my xmltv.exe system and see how it goes.. that's
my biggest concern. It's an old version of ActivePerl.

(It's activeperl because that's what was used to build xmltv.exe... it's
not a free product, but I can probably get Schedules Direct to pay for
it if we need to upgrade). Looking into modernizing xmltv.exe is on
my todo list.

Robert

Robert Eden

2014-05-04 01:25:04 UTC

Post by Robert Eden
I just tried installing DBD::SQLite on my current desktop and it
built and installed pretty easily (Windows Strawberry Perl).
I'll try building it on my xmltv.exe system and see how it goes..
that's my biggest concern. It's an old version of ActivePerl.

I just built DBD::SQLite on my xmltv.exe system and it passed tests just
fine! The DLL is about 700k, but IMHO increasing XMLTV that much is
worth the capability of using SQLite! Give it a shot!

In other news, I just built XMLTV for the first time on my current
windows machine using Strawberry Perl. It was easy. I'm going to do it
again, and write up the procedure and add it to the Wiki. Almost anyone
should be able to run from source.. it was that easy!

Lastly, I came across Indigostar's perl2exe which seems to work similarly
to the ActiveState Perl Resource Kit for less money. I'm going to play
with that too, and that would allow xmltv.exe to use a modern perl! (I
also looked at PAR and pp modules, but it looks like making them work
will be difficult)

Robert

h***@gmail.com

2014-05-12 09:42:09 UTC

Post by Robert Eden
I just built DBD::SQLite on my xmltv.exe system and it passed tests just
fine! The DLL is about 700k, but IMHO increasing XMLTV that much is
worth the capability of using SQLite! Give it a shot!

[...]

Post by Robert Eden
Lastly, I came across Indigostar's perl2exe which seems to work similarly
to the ActiveState Perl Resource Kit for less money. I'm going to play
with that too, and that would allow xmltv.exe to use a modern perl! (I
also looked at PAR and pp modules, but it looks like making them work
will be difficult)

Do we think using SQLite is a flyer then? If so I'll start work on d/b design and look at modifying the Crunch package accordingly.

Rgds,
Geoff

p.s. it would be good to get the Windoze version onto at least 5.8.1 (for a number of reasons; such as working unicode handling, etc).

Ben Bucksch

2014-05-12 10:19:28 UTC

There are a couple of requests on the Feature Request board to add a
programme's CRID (unique programme identifier) to the DTD.
c.f.
FR #71 Add support for representing CRIDs
http://sourceforge.net/p/xmltv/feature-requests/71/
FR #105 XMLTV DTD: Add attributes for DVB ONID and SID
http://sourceforge.net/p/xmltv/feature-requests/105/
There probably aren't many sources where a CRID is available but,
where they are available, then it opens new possibilities for
downstream data handling.
If we were to do this, I think a "system" attribute might also be
useful (as per the episode-num element). This wouldn't be predefined
but just some value agreed between grabber and application.

I concur.
Please read http://article.gmane.org/gmane.comp.tv.xmltv.devel/8844/
(Subject: "Need role name and source id", Poster: me, Date: 2009-01-31).
I made a request for such a field as well, because my source
(tvmovie.de) provided consistent IDs for each "programme" (movie, series
episode, newscast, sports broadcast etc.).

And I added these IDs as separate element anyway, because indeed, they
give important possibilities in downstream applications: Using it as
database record ID for a proper normalized database (which my TV
recording app does), allowing to build a database of all showings of a
film over years, or at least using it for duplicate detection - again,
even over years, i.e. knowing that we don't need to record this film,
because it was already recorded 4 years ago and is still sitting as
video on the disk.

To allow all this, 2 properties are critically important for IDs:
* Unique - There must never ever be 2 different programs with the same
ID. That must be true globally for all programs.
* Stable - There must never be the same programs with 2 different IDs.
I.e. if the same program is aired again 3 years later, it must have the
same ID as 3 years before.

If these 2 criteria are met, then the ID can be used as database key,
duplication detection etc. If they are not met, then the downstream
application will go wrong badly: Recording the wrong shows (e.g.
recording the documentary about "Titanic" instead of the movie), not
recording shows that should be recorded, massive duplication etc..

You are right that we need a "system" attribute. It doesn't matter who
is the source of the ID, as long as the system is in itself consistent
and meets above criteria over a long time. However, we obviously must
not mix systems, so the system attribute is important.
In my TV recording app, I generate an internal ID by prepending the
system attribute to the ID.
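
A minimal sketch of that prefixing scheme, in Python. The ":" separator and the names internal_id and already_recorded are assumptions made for illustration; the post only says the system attribute is prepended to the ID.

```python
from typing import Set


def internal_id(system: str, programme_id: str) -> str:
    """Combine the system name and the source's ID into one key."""
    if not system or not programme_id:
        raise ValueError("both system and programme_id are required")
    # Prefixing with the system keeps IDs from different sources apart.
    return f"{system}:{programme_id}"


def already_recorded(recorded: Set[str], system: str, programme_id: str) -> bool:
    """Duplicate detection: has this programme been recorded before?"""
    return internal_id(system, programme_id) in recorded
```

Because the system name is part of the key, the same numeric ID coming from two different sources never collides, which is what makes the combined value usable as a database key.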

What's special about a CRID that it shouldn't go into episode-num?
Many apps already look to episode-num for unique identifiers, so they
won't have to do much other than accept crid as a "trusted unique provider".

Several reasons:

* Episodes are inherently a concept of series, but I need it mostly
for movies. It makes no sense to call a movie ID an "episode-num". (*)
* Episode-nums are following a certain schema that is commonly used in
the USA: Season " / " episode within season. Downstream apps parse
this.
* Episode-nums are inherently not unique, but only unique within the
series. The IDs we need must be globally unique, across all programs.

(*) E.g. there was recently a patch to eu_epgdata posted that treats
type=movie with episode as type=series. While not immediately
conflicting with this here, it shows the danger of blurring concepts.
