Discussion:
[Xmltv-devel] Proposal for new DTD element "crid"
h***@gmail.com
2014-05-11 16:14:36 UTC
Permalink
There are a couple of requests on the Feature Request board to add a programme's CRID (unique programme identifier) to the DTD.

c.f.
FR #71 Add support for representing CRIDs
http://sourceforge.net/p/xmltv/feature-requests/71/

FR #105 XMLTV DTD: Add attributes for DVB ONID and SID
http://sourceforge.net/p/xmltv/feature-requests/105/


There probably aren't many sources where a CRID is available, but where one is, it opens up new possibilities for downstream data handling.

If we were to do this, I think a "system" attribute might also be useful (as per the episode-num element). This wouldn't be predefined but just some value agreed between grabber and application.


E.g. _uk_atlas has crids available and up to now I've encoded them as a second episode-num element, e.g.

<episode-num system="xmltv_ns"> 6.16/70. </episode-num>
<episode-num system="brand.series.episode"> mr2.ryp52.r8cnm </episode-num>

This could morph into:
<crid system="brand.series.episode">mr2.ryp52.r8cnm</crid>


FR#71 could simply be:
<crid system="item">123456</crid>

FR#105 could be
<crid system="onid.sid">12.34</crid>


So, should we add a new element to the DTD for <crid> ?

Or do you think that the episode-num element adequately handles CRIDs ?
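
To make the two options concrete from a consumer's point of view, here is a rough Python sketch that collects identifiers from either encoding. It is only an illustration: the `extract_ids` helper and the `CRID_LIKE_SYSTEMS` set are hypothetical names, and which episode-num systems an application treats as CRID-like would have to be agreed between grabber and application.

```python
import xml.etree.ElementTree as ET

# Assumption: the episode-num systems this consumer treats as carrying CRIDs.
CRID_LIKE_SYSTEMS = {"brand.series.episode"}

def extract_ids(programme):
    """Return (system, value) pairs from <crid> and CRID-like <episode-num> children."""
    ids = []
    # Proposed encoding: dedicated <crid> elements.
    for el in programme.findall("crid"):
        ids.append((el.get("system"), (el.text or "").strip()))
    # Current workaround: a second <episode-num> with an agreed system value.
    for el in programme.findall("episode-num"):
        if el.get("system") in CRID_LIKE_SYSTEMS:
            ids.append((el.get("system"), (el.text or "").strip()))
    return ids

SAMPLE = """<programme>
  <title>Example</title>
  <episode-num system="xmltv_ns"> 6.16/70. </episode-num>
  <episode-num system="brand.series.episode"> mr2.ryp52.r8cnm </episode-num>
  <crid system="item">123456</crid>
</programme>"""

ids = extract_ids(ET.fromstring(SAMPLE))
```

With a dedicated element the second loop (and the agreed list of systems) would no longer be needed.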

Rgds,
Geoff
Robert Eden
2014-05-12 04:02:08 UTC
Permalink
Post by h***@gmail.com
There are a couple of requests on the Feature Request board to add a programme's CRID (unique programme identifier) to the DTD.
c.f.
FR #71 Add support for representing CRIDs
http://sourceforge.net/p/xmltv/feature-requests/71/
FR #105 XMLTV DTD: Add attributes for DVB ONID and SID
http://sourceforge.net/p/xmltv/feature-requests/105/
There probably aren't many sources where a CRID is available but, where they are available, then it opens new possibilities for downstream data handling.
If we were to do this, I think a "system" attribute might also be useful (as per the episode-num element). This wouldn't be predefined but just some value agreed between grabber and application.
E.g. _uk_atlas has crids available and up to now I've encoded them as a second episode-num element, e.g.
<episode-num system="xmltv_ns"> 6.16/70. </episode-num>
<episode-num system="brand.series.episode"> mr2.ryp52.r8cnm </episode-num>
<crid system="brand.series.episode">mr2.ryp52.r8cnm</crid>
<crid system="item">123456</crid>
FR#105 could be
<crid system="onid.sid">12.34</crid>
So, should we add a new element to the DTD for <crid> ?
Or do you think that the episode-num entity adequately handles CRIDs ?
What's special about a CRID that it shouldn't go into episode-num? Many apps already look to episode-num for unique identifiers, so they won't have to
do much other than accept crid as a "trusted unique provider".

Robert
h***@gmail.com
2014-05-12 06:58:22 UTC
Permalink
Post by Robert Eden
What's special about a CRID that it shouldn't go into episode-num? Many apps already look to episode-num for unique identifiers, so they won't have to
do much other than accept crid as a "trusted unique provider".
Maybe a perception / terminology thing perhaps: e.g. a film will have a CRID but it seems slightly odd to look in something labelled "episode-num" for something that isn't a series/ episode? Or does it? :-)
Karl Dietz
2014-05-21 04:51:19 UTC
Permalink
Post by h***@gmail.com
Post by Robert Eden
What's special about a CRID that it shouldn't go into episode-num? Many apps already look to episode-num for unique identifiers, so they won't have to
do much other than accept crid as a "trusted unique provider".
Maybe a perception / terminology thing perhaps: e.g. a film will have a CRID but it seems slightly odd to look in something labelled "episode-num" for something that isn't a series/ episode? Or does it? :-)
As Nick noted, the CRID is a broadcaster's ID in a special system.
One programme may have a bag of related CRIDs, think series link and
episode id. Multiple grabbers should return the same id for the same
broadcaster's content. But one grabber will likely return different ids
for the same content transmitted by different broadcasters.

What I'm missing is a way to signal "this is an episode of series X but
we don't know which one". Similar to having a series link id but no
programme id.

Also the CRIDs should have a way to signal additional information that
is encoded in the content_type bits of the content_identifier descriptor
in DVB-SI.

Regards,
Karl

PS: I'd love to see an XMLTV-TNG schema that uses modern XML concepts
and data types in addition to allowing more complex content, like
separate description/titles/ids for series and episodes. Also having
more examples would be nice. E.g. how to model the 8 o'clock news in a
way that allows it to be repeated across multiple channels with one of
the later repeats carrying deaf-signage etc.
Ben Bucksch
2014-05-21 10:39:56 UTC
Permalink
Post by Karl Dietz
Post by h***@gmail.com
Post by Robert Eden
What's special about a CRID that it shouldn't go into episode-num? Many apps already look to episode-num for unique identifiers, so they won't have to
do much other than accept crid as a "trusted unique provider".
Maybe a perception / terminology thing perhaps: e.g. a film will have a CRID but it seems slightly odd to look in something labelled "episode-num" for something that isn't a series/ episode? Or does it? :-)
Like Nick noted the CRID is a broadcasters ID in a special system.
1. Agreed, there seems to be a specific definition for the format of
CRIDs. We should use another tag name, e.g. <stableid>. What we want here
is just an opaque string that meets the 2 requirements mentioned: stable
over years, and unique within the system.

2. Could you please add these requirements, including definitions of it
that I sent in my last post, to the DTD text?
h***@gmail.com
2014-05-21 12:54:53 UTC
Permalink
Post by Ben Bucksch
1. Agreed, there seems to be a specific definition for the format of
CRIDs. We should use another tag name, e.g. <stableid>
What is wrong with the tag <crid> ? The proposed use here is totally in line with the Wikipedia entry.

Don't get hung up on the "format" section in that wiki entry. However if you really want to follow it then why not simply specify the crid as a locator, e.g.:

<title>Scary Movie</title>
<crid>crid://atlas.metabroadcast.com/episode/8hmr</crid>

or

<title>EastEnders</title>
<crid>crid://atlas.metabroadcast.com/brand/cf2</crid>
<crid>crid://atlas.metabroadcast.com/episode/cwhz8v</crid>
<episode-num system="xmltv_ns">.4859.</episode-num>
Post by Ben Bucksch
2. Could you please add these requirements, including definitions of it
that I sent in my last post, to the DTD text?
Feel free to edit the proposed text and post your version.
Ben Bucksch
2014-05-22 13:02:37 UTC
Permalink
[http://en.wikipedia.org/wiki/CRID]
Don't get hung up on the "format" section in that wiki entry.
Well, it says: "In fact, a CRID is a so-called URI
<http://en.wikipedia.org/wiki/Uniform_resource_identifier>." That's
quite affirmative, and we (or at least my intended usage of the <crid>
tag) would violate it, because I'd treat it as opaque string, not URI.
If Wikipedia indeed has the correct definition of "CRID", I wouldn't
want to hijack and abuse it in a non-conformant way.

Is there a more authoritative definition of what a "CRID" is?
Post by Ben Bucksch
2. Could you please add these requirements, including definitions of it
that I sent in my last post, to the DTD text?
Feel free to edit the proposed text and post your version.
Will do in my next post.
h***@gmail.com
2014-05-22 14:33:31 UTC
Permalink
Post by Ben Bucksch
Is there a more authoritative definition of what a "CRID" is?
See the references (RFC4078, etc) at the bottom of the WP article.
h***@gmail.com
2014-05-21 15:37:10 UTC
Permalink
Post by Karl Dietz
One programme may have a bag of related CRIDs, think series link and
episode id. Multiple grabbers should return the same id for the same
broadcasters content. But one grabber will likely return different ids
for the same content transmitted by different broadcasters.
I agree.
Post by Karl Dietz
What I'm missing is a way to signal "this is an episode of series X but
we don't know which one". Similar to having a series link id but no
programme id.
Isn't that simply a CRID for the series? e.g. an episode might have one crid for the series and another for the episode.

Generic episode:
<crid system="series">xxxx</crid>

Specific episode:
<crid system="series">xxxx</crid>
<crid system="episode">xxxx</crid>
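
A rough sketch of how a downstream application might read that convention, assuming the data source reliably supplies an episode CRID whenever it knows the episode (the `episode_kind` helper and its return labels are made up for illustration):

```python
import xml.etree.ElementTree as ET

def episode_kind(programme):
    """Classify a <programme> by which <crid> systems are present."""
    systems = {el.get("system") for el in programme.findall("crid")}
    if "episode" in systems:
        return "specific"   # an episode CRID is present
    if "series" in systems:
        return "generic"    # series CRID only: inferred from absence
    return "unknown"        # no CRIDs at all, nothing can be inferred

generic = ET.fromstring(
    '<programme><crid system="series">xxxx</crid></programme>')
specific = ET.fromstring(
    '<programme><crid system="series">xxxx</crid>'
    '<crid system="episode">xxxx</crid></programme>')
```

Note that the "generic" answer is inferred from a missing element, so it is only as good as the data source behind it.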
Post by Karl Dietz
Also the CRIDs should have a way to signal additional information that
is encoded in the content_type bits of the content_identifier descriptor
in DVB-SI.
Isn't that part of the transport stream - is a grabber going to have access to that information?

Rgds,
Geoff
Karl Dietz
2014-05-21 15:56:12 UTC
Permalink
Post by h***@gmail.com
Post by Karl Dietz
What I'm missing is a way to signal "this is an episode of series X but
we don't know which one". Similar to having a series link id but no
programme id.
Isn't that simply a CRID for the series? e.g. an episode might have one crid for the series and another for the episode.
<crid system="series">xxxx</crid>
<crid system="series">xxxx</crid>
<crid system="episode">xxxx</crid>
The idea behind the XMLTV grabbers is to outsource the things you have
to know about the data source to make sense of it into the grabbers.
Encoding information in the absence of other information appears to be a
step backwards.

Let's say you have an episode for which you know the series CRID but not the
episode CRID. But you do know which episode it is (say via the episode
title). Now you must either make up a fake episode CRID or wrongly
signal an unknown episode.

With your example as input a grabber should turn the implicit "has a
series CRID but no episode CRID, so it must be generic due to our
knowledge of the data source" into an explicit "this is a generic episode".

My local cable feed transmits CRIDs for the events in a way that does
not let you see whether it's a generic episode or a specific one. Both may or
may not appear next to a series CRID. So your implicit signaling would
fail for that data source. (Also their source is so bad, I've regularly
seen three different CRIDs for the same episode :( )
Post by h***@gmail.com
Post by Karl Dietz
Also the CRIDs should have a way to signal additional information that
is encoded in the content_type bits of the content_identifier descriptor
in DVB-SI.
Isn't that part of the transport stream - is a grabber going to have access to that information?
I'm not sure I understand. Yes, the type is next to the CRID in the
transport stream. But it should be next to the CRID in other data
sources, too. How else do you know if you should treat it as a seriesid
or a programid?

Regards,
Karl
h***@gmail.com
2014-05-21 17:05:40 UTC
Permalink
Post by Karl Dietz
With your example as input a grabber should turn the implicit "has a
series CRID but no episode CRID, so it must be generic due to our
knowledge of the data source" into an explicit "this is a generic episode".
Yes I take your point. I guess we are spoilt with the Atlas data source since it has definitive info for series and episode; so the grabber *knows* the lack of an episode id means it's a generic. i.e. the "has a series CRID but no episode CRID" *is* explicit.

Conversely if Atlas has no series id then we *know* it's a one-off (e.g. film).

I don't have any experience of another source which provides CRIDs so I bow to your experience.
Post by Karl Dietz
My local cable feed transmits CRIDs for the events in a way that does
not allow to see if its a generic episode or a specific one. Both may or
may not appear next to a series CRID. So your implicit signaling would
fail for that data source. (Also their source is so bad, I've regularly
seen three different CRIDs for the same episode :(
True, but I think that's all you can do - i.e. if you have no CRID specifying an episode then you can only assume it's a generic. Far from ideal, but that's what my PVR does.

It's the age-old story; there's only so much one can do to cater for bad incoming data :-(

A question might be: how does one differentiate between 'explicit' signaling (e.g. Atlas) and 'implicit' signaling?
Post by Karl Dietz
I'm not sure I understand. Yes, the type is next to the CRID in the
transport stream. But it should be next to the CRID in other data
sources, too. How else do you know if you should treat it as a seriesid
or a programid?
Ok can I go back a step then: what 'additional information' would you like to add?

Rgds,
Geoff
Nick Morrott
2014-05-12 10:00:46 UTC
Permalink
Post by h***@gmail.com
There are a couple of requests on the Feature Request board to add a programme's CRID (unique programme identifier) to the DTD.
c.f.
FR #71 Add support for representing CRIDs
http://sourceforge.net/p/xmltv/feature-requests/71/
FR #105 XMLTV DTD: Add attributes for DVB ONID and SID
http://sourceforge.net/p/xmltv/feature-requests/105/
<snip>
Post by h***@gmail.com
So, should we add a new element to the DTD for <crid> ?
I would like to see a new and separate element to allow the inclusion
of crid* to <programme> elements.

Semantically the purpose of a CRID is very different to the (original)
purpose of the <episode-num> element, and I'd prefer to not shoehorn a
universal resource identifier for content into an element whose
defined purpose is to give season/episode/part numbering for episodic
broadcasts (but which are, e.g., irrelevant for movies and TV
specials).

The Wikipedia article on CRIDs (http://en.wikipedia.org/wiki/Crid)
includes a lot of useful information stemming from previous TV-Anytime
work and links to ETSI papers.

I think the potential for a usable CRID implementation in the DTD
necessitates using a new element that sits directly under the
<programme> element. In addition to clearly separating the
implementation of CRIDs from episode numbering, it also makes it clear
to current and future XMLTV data consumers that this element has a
clearly defined purpose.

Cheers,
Nick
h***@gmail.com
2014-05-16 16:48:36 UTC
Permalink
So as a straw-man proposal then:

++++++++++++++++++++++++++++++++++++++++++++++++

<!ELEMENT programme (title+, sub-title*, desc*, credits?, date?,
category*, keyword*, language?, orig-language?,
length?, icon*, url*, country*, episode-num*,
video?, audio?, previously-shown?, premiere?,
last-chance?, new?, subtitles*, rating*,
star-rating*, review*, crid*)>


<!-- CRID : Content Reference Identifier

Not the episode number or series number.

This is an identifier which uniquely identifies some 'content'
within all the programmes for this grabber. A CRID may refer to a
series (a 'group' CRID), or an individual programme.

There are several ways of defining a CRID, so the 'system'
attribute lets you specify which you mean.

By definition, a CRID must *uniquely* identify some content within
the context of a specific grabber. When using CRIDs in downstream
applications they should construct a URI consisting of grabber name
+ CRID. Where this is not unique then the 'system' attribute should
also be included. This is to ensure a reference to a CRID is unique
and does not overlap between grabbers. This is to allow for XML
data from multiple grabbers to be combined without their CRIDs
conflicting.

-->
<!ELEMENT crid (#PCDATA)>
<!ATTLIST crid system CDATA #IMPLIED>

++++++++++++++++++++++++++++++++++++++++++++++++
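
As a sketch of the "grabber name + CRID" rule in the comment above: the straw-man does not say what the combined reference should look like, so the "/"-joined layout, the `crid_key` name and the grabber names below are all assumptions for illustration only.

```python
def crid_key(grabber, crid, system=None):
    """Build a reference that stays unique when data from several grabbers is merged.

    The system is included only when given, i.e. when grabber + CRID
    alone would be ambiguous.
    """
    parts = [grabber]
    if system:
        parts.append(system)
    parts.append(crid)
    return "/".join(parts)

# Two hypothetical grabbers that happen to emit the same CRID value.
key_a = crid_key("grabber_a", "123456", system="item")
key_b = crid_key("grabber_b", "123456", system="item")
```

The two keys differ, so merged XML from both grabbers would not produce conflicting references.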


Please feel free to fix/amend as necessary.

Geoff
Ben Bucksch
2014-05-22 13:03:20 UTC
Permalink
Suggestion for definition:


<!ELEMENT programme (title+, sub-title*, desc*, credits?, date?,
category*, keyword*, language?, orig-language?, length?, icon*, url*,
country*, episode-num*, video?, audio?, previously-shown?, premiere?,
last-chance?, new?, subtitles*, rating*, star-rating*, review*, cid*)>

<!ELEMENT cid (#PCDATA)>
<!ATTLIST cid system CDATA #REQUIRED>


<!-- CID : Content Identifier

This is an identifier which uniquely identifies some 'content' within
all the programmes for this grabber. An ID may refer to a film, episode
of a series, or e.g. a news or sports broadcast.

If the video content (film, episode etc.) is the same, the ID should be
the same, even if broadcast at a different time. If the video content
is different, the ID must be different. Concrete criteria:

* Unique - There must never ever be 2 different videos with the same
ID. That must be true globally for all programs.
* Stable - There must never be the same videos with 2 different IDs.
I.e. if the same video is broadcast within days or weeks, the ID
MUST be the same. If the same video is broadcast again 3 years
later, it SHOULD have the same ID as 3 years before.

These IDs can be used as database key, duplication detection etc.

If any of the above criteria are not met by your IDs, you MUST NOT use
the <cid> tag. If the above criteria are not met, then the downstream
application will run into serious problems:

* Showing the wrong title and description for a programme
* recording the wrong shows, e.g. recording the documentary called
"Titanic" instead of the movie
* massive duplication and waste of disk space by re-recording shows
* not recording shows that should be recorded


The IDs themselves are opaque strings, and valid within a "system" that
you need to specify. IDs guarantee their properties only within the
system. A system could be the IDs of the data source, or a third-party
database.

-->
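
One way a grabber author might spot-check the two criteria against a sample of captured data is sketched below. This is only a heuristic: it uses title plus sub-title as a stand-in for "same video", which is an assumption (two different programmes can share a title), so anything it reports is a candidate for manual review, not proof of a violation. The dictionary keys and the `spot_check` name are hypothetical.

```python
from collections import defaultdict

def spot_check(airings):
    """Return (reused_ids, split_ids) as candidates for manual review.

    reused_ids: IDs attached to more than one title/sub-title (possible
                'Unique' violation).
    split_ids:  title/sub-title pairs carrying more than one ID (possible
                'Stable' violation, or simply different content sharing a title).
    """
    contents_by_id = defaultdict(set)
    ids_by_content = defaultdict(set)
    for a in airings:
        key = (a["system"], a["cid"])
        content = (a["title"], a.get("sub_title"))
        contents_by_id[key].add(content)
        ids_by_content[content].add(key)
    reused_ids = {k: v for k, v in contents_by_id.items() if len(v) > 1}
    split_ids = {k: v for k, v in ids_by_content.items() if len(v) > 1}
    return reused_ids, split_ids

sample = [
    {"system": "src", "cid": "1", "title": "Titanic"},  # e.g. the movie
    {"system": "src", "cid": "2", "title": "Titanic"},  # e.g. the documentary
    {"system": "src", "cid": "3", "title": "News"},
    {"system": "src", "cid": "3", "title": "Weather"},  # same ID, other content
]
reused_ids, split_ids = spot_check(sample)
```

In the sample, ID 3 is flagged as reused, and "Titanic" is flagged as having two IDs, which a human would then recognise as legitimately different content.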
h***@gmail.com
2014-05-22 14:28:43 UTC
Permalink
Post by Ben Bucksch
* Unique - There must never ever be 2 different videos with the same
ID. That must be true globally for all programs.
* Stable - There must never be the same videos with 2 different IDs.
In Utopia yes I'd agree with you; unfortunately that will never be possible. Unless you maintain some sort of central database to which all grabbers refer then you are never going to get a unique id across all grabbers.

Even more so where the id is provided by the broadcaster/data source (rather than being invented by the grabber) - as the WP article freely acknowledges.

Likewise you will never guarantee that the same content has the same ID without trying to match every incoming programme against some global database of all the programmes ever found by any grabber at any time ever.


Is there any particular reason you want the id as a plain string rather than formatted as a locator? (Can't your downstream application parse out the string from the locator?)
Ben Bucksch
2014-05-22 14:42:17 UTC
Permalink
Post by h***@gmail.com
Unless you maintain some sort of central database to which all grabbers refer then you are never going to get a unique id across all grabbers.
The spec proposal doesn't ask for an ID across all grabbers - that's
what the "system" is for. The criteria (unique and stable) must be true
only within the system. I assume that each grabber which uses <cid> will
use its own "system" for the source, e.g. <cid
system="tvmovie.de">7645487</cid>. Alternatively, the "tvmoviedb" or
"IMDB" could be "system"s.

What's important is that the grabber author verified that the IDs from
the source fulfill these criteria.
Post by h***@gmail.com
Is there any particular reason you want the id as a plain string rather than formatted as a locator?
For me, IDs are always just plain strings. I don't see a reason to make
it more complicated and make a URI out of it.

If you want to have a CRID, you can always do "crid://" + system + "/" +
id . But that would be longer, and presumably make e.g. DB index
searches slower, so I wouldn't do that in my app.
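
For what it's worth, the conversion in both directions is small. A minimal sketch, assuming the system name contains no "/" and taking the first "/" after the prefix as the separator (the function names are made up, and this does not attempt full RFC 4078 validation):

```python
CRID_PREFIX = "crid://"

def to_crid(system, ident):
    """Plain (system, id) pair -> locator form."""
    return CRID_PREFIX + system + "/" + ident

def from_crid(uri):
    """Locator form -> (system, id); raises ValueError if it does not fit."""
    # Compare the scheme case-insensitively.
    if uri[:len(CRID_PREFIX)].lower() != CRID_PREFIX:
        raise ValueError("not a crid:// locator: %r" % uri)
    system, sep, ident = uri[len(CRID_PREFIX):].partition("/")
    if not sep or not system or not ident:
        raise ValueError("missing system or id: %r" % uri)
    return system, ident

locator = to_crid("tvmovie.de", "7645487")
```

An application could therefore store the short pair internally and only build the locator when writing XML.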

Ben
h***@gmail.com
2014-05-22 15:31:04 UTC
Permalink
Post by Ben Bucksch
The spec proposal doesn't ask for an ID across all grabbers - that's
what the "system" is for.
Ah right, my bad. I read your "That must be true globally for all programs" and interpreted "program" as grabber script, rather than "programme" (which the Americans misspell as "program") ;-)
Post by Ben Bucksch
What's important is that the grabber author verified that the IDs from
the source fulfill these criteria.
So you're still going to need to maintain a database in the grabber to be able to do this.

And obviously there's no way a grabber could verify, for example, an IMDb id was unique in the IMDb database.

I think some things you have to take on trust; if the quality of the data source looks reliable then you have to assume the id they pass to the grabber is correct. Else don't use it.

Rgds,
Geoff
Ben Bucksch
2014-05-22 15:43:07 UTC
Permalink
Post by h***@gmail.com
Post by Ben Bucksch
What's important is that the grabber author verified that the IDs from
the source fulfill these criteria.
So you're still going to need to maintain a database in the grabber to be able to do this.
The grabber source often has such a database, not the grabber. E.g.
tvmovie.de has (had?) such IDs that come as part of their XML.
Post by h***@gmail.com
And obviously there's no way a grabber could verify, for example, an IMDb id was unique in the IMDb database.
I think some things you have to take on trust; if the quality of the data source looks reliable then you have to assume the id they pass to the grabber is correct. Else don't use it.
Exactly. I verify this with manual spot checks and then use the grabber
over time and watch whether problems appear.
h***@gmail.com
2014-05-23 08:05:25 UTC
Permalink
Post by Ben Bucksch
Post by h***@gmail.com
I think some things you have to take on trust; if the quality of the data source looks reliable then you have to assume the id they pass to the grabber is correct.
Exactly. I verify this with manual spot checks and then use the grabber
over time and watch whether problems appear.
* Stable - There must never be the same videos with 2 different IDs.
I.e. if the same video is broadcasted within days or weeks, the ID
MUST be the same.
How do you propose the grabber proves this?

(I don't think you can say "MUST" here.)
Ben Bucksch
2014-05-23 11:29:27 UTC
Permalink
Post by h***@gmail.com
Post by Ben Bucksch
* Stable - There must never be the same videos with 2 different IDs.
I.e. if the same video is broadcasted within days or weeks, the ID
MUST be the same.
How do you propose the grabber proves this?
(I don't think you can say "MUST" here.)
Same way, with spot checks and watching how it works over time.

Why is this important? Applications like MythTV and Zeipis (mine) record
airings based on abstract schedules like "All Star Trek". At least here,
it is very common to air the new episode in the afternoon, and repeat it
during the night or in the next morning. Zeipis and co need a way to
detect this and avoid re-recording the show. We can use the
title/subtitle, and that works most of the time, but not always. Better
would be a unique ID to know that this is the same content.

In such a case, it is trivial to verify that the IDs are stable: If the
re-airing has the same ID, the IDs are stable. If it doesn't, the ID
(from the source) is useless for us and should be ignored. If a grabber
were to add such non-stable IDs, and Zeipis would rely on them, Zeipis
would re-record the same show over and over again. Zeipis would need to
add hacks to avoid such broken and useless IDs, and they'd cause a big
problem for users and developers. Therefore, it is critical that this
criterion is met by the grabber and that's why it's a MUST.

Ensuring (by the source) and verifying (by the grabber author) whether
the ID is stable over years is a lot lot harder. It would still be very
useful, though, so that a re-airing of an older movie that I have
already recorded and stored is not recorded again. I really want that
stability over years, but it's very hard to guarantee. This is why it's
a SHOULD.
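
Roughly, the duplicate check on the application side could look like the sketch below. This is not Zeipis or MythTV code, just an illustration; the dictionary layout, the `plan_recordings` name and the second ID value are assumptions.

```python
def plan_recordings(airings, already_recorded=()):
    """Return the airings worth recording, skipping known (system, cid) pairs."""
    seen = set(already_recorded)
    to_record = []
    for airing in airings:
        key = (airing["system"], airing["cid"])
        if key in seen:
            continue  # same content already stored or already scheduled
        seen.add(key)
        to_record.append(airing)
    return to_record

airings = [
    {"start": "2014-05-23 15:00", "system": "tvmovie.de", "cid": "7645487"},
    {"start": "2014-05-24 02:00", "system": "tvmovie.de", "cid": "7645487"},  # night repeat
    {"start": "2014-05-24 15:00", "system": "tvmovie.de", "cid": "7645488"},
]
planned = plan_recordings(airings)
```

If the source handed out a fresh ID for the night repeat, the check would silently stop working and the repeat would be recorded again, which is the failure described above.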

Ben
h***@gmail.com
2014-05-23 13:16:08 UTC
Permalink
Post by Ben Bucksch
Same way, with spot checks and watching how it works over time.
I understand why it's important; I've been running PVRs using EPG data for over 12 years. Consequently I am well aware of the many limitations and data quality issues with source data providers.

The problem is your use of the word "MUST". To say that occasional spot checks will detect programmes which fail this test, breaks your requirement for "MUST". At the time the grabber is writing out the record it cannot guarantee the id is unique and/or stable and therefore it fails the "MUST" test and so cannot write it.

No ids will ever pass your requirement for "MUST".
Ben Bucksch
2014-05-23 13:39:55 UTC
Permalink
Post by h***@gmail.com
Post by Ben Bucksch
Same way, with spot checks and watching how it works over time.
I understand why it's important; I've been running PVRs using EPG data for over 12 years. Consequently I am well aware of the many limitations and data quality issues with source data providers.
The problem is your use of the word "MUST". To say that occasional spot checks will detect programmes which fail this test, breaks your requirement for "MUST". At the time the grabber is writing out the record it cannot guarantee the id is unique and/or stable and therefore it fails the "MUST" test and so cannot write it.
No ids will ever pass your requirement for "MUST".
I guess we have a different idea of "MUST". It doesn't mean the grabber
author must guarantee against all odds, bugs, and future changes. It
just says he needs to verify that it's true - as far as he reasonably
can - and he MUST act when he comes to know that it's not true anymore.

A SHOULD allows a knowing violation, with reason. A MUST does not allow
knowing violations. No MUST will guarantee against bugs or future changes.

I *do* want to make sure that no grabber author comes with a different
idea of what an ID is and says "but the spec allows it". For example,
IDs for whole series (not episodes) should not be added as cid.
Likewise, IDs that are generated purely based on title (e.g. "Titanic"
as doc and movie) should not be added. The grabber source (not the
grabber itself) must have some sort of database on their end, and the ID
provides that critical link between my database and theirs.

Ben
h***@gmail.com
2014-05-23 13:59:32 UTC
Permalink
Post by Ben Bucksch
I guess we have a different idea of "MUST". It doesn't mean the grabber
author must guarantee against all odds, bugs, and future changes. It
just says he needs to verify that it's true - as far as he reasonably
can - and he MUST act when he comes to know that it's not true anymore.
See RFC2119. MUST is absolute; it doesn't mean PROBABLY ;-)

http://tools.ietf.org/html/rfc2119
Ben Bucksch
2014-05-23 14:16:40 UTC
Permalink
See RFC2119. MUST is absolute; it doesn't mean PROBABLY;-)
http://tools.ietf.org/html/rfc2119
I know very well what MUST means, I've implemented IMAP/POP/SMTP/MIME
protocol parts in Thunderbird.

The 2 criteria I wrote for IDs *are* an "absolute requirement of the
specification" (that's all RFC 2119 says about "MUST"). I do mean it in
this way. If you violate this, you're not XMLTV.
But no implementation of anything (not even Thunderbird's MIME handling)
can guarantee absence of bugs or know about future changes that might
break things.

MUST means: 'If you violate this rule, you are violating this
specification, and you must fix it. There's no debate about it, no
excuses, you cannot intentionally deviate from it for any reason.'. This
is how I mean it: If you violate these criteria, you have to fix it
immediately or you're out.

SHOULD means: 'You really need to do this. But there may' (quote) "exist
valid reasons in particular circumstances to ignore a particular item,
but the full implications must be understood and carefully weighed
before choosing a different course."

I do *not* want every new grabber author coming and arguing that his IDs
don't match these criteria, but they are useful anyway for this or that
reason, and that's why he added them here, and if I don't like it, I can
ignore them in my app, and whatever. I want to make sure right here what
we mean with "ID", and that it will have serious consequences when these
assumptions are not met, so that we don't repeat this argument every
year with new devs every time.

Ben
Ben Bucksch
2014-05-23 13:57:01 UTC
Permalink
Post by Ben Bucksch
The grabber source often has such a database, not the grabber. E.g.
tvmovie.de has (had?) such IDs that come as part of their XML.
To expand on this:
Every editorial office needs this, in their own interest, so that they
don't have to re-write the description for every airing.
Similarly, they need to verify the match, so that the TV magazine
doesn't write "best-selling movie of all time" for a documentary
titled "Titanic".

If they are so nice to give us their DB ID, that's very valuable,
because it allows me to make this link between their DB and mine, even
if the description differs slightly.
Some sources do give these IDs in the data. I'd like to capture these,
and use them.

To have a proper abstraction of grabbers (international apps using them
need it, that's why we have XMLTV), we need to nail down what exactly we
mean with "ID", so that we don't have conflicting interpretations and
assumptions, with resulting problems downstream.

Ben
h***@gmail.com
2014-05-24 10:56:56 UTC
Permalink
Revised proposals for discussion.


PROPOSAL A:

++++++++++++++++++++++++++++++++++++++++++++++++

<!ELEMENT programme (title+, sub-title*, desc*, credits?, date?,
category*, keyword*, language?, orig-language?,
length?, icon*, url*, country*, episode-num*,
video?, audio?, previously-shown?, premiere?,
last-chance?, new?, subtitles*, rating*,
star-rating*, review*, crid*)>


<!-- CRID : Content Reference Identifier

Not the episode number or series number.

This is an identifier which uniquely identifies some 'content'
within all the programmes for this grabber. A CRID may refer to a
series (a 'group' CRID), or an individual programme.

If there are several ways of defining a CRID, the 'system'
attribute lets you specify which you mean.

The element content must follow the syntax defined in RFC4078.
E.g.
<title>Scary Movie</title>
<crid>crid://atlas.metabroadcast.com/film/8hmr</crid>

<title>Doctor Who</title>
<crid>crid://atlas.metabroadcast.com/series/cqk6</crid>
<crid>crid://atlas.metabroadcast.com/episode/c7sny</crid>

By definition, a CRID must *uniquely* identify some content within
the context of a specific grabber. It is unlikely a grabber will have
sufficient information available to it to be able to create its own
CRID; it is expected therefore that only CRIDs from the data
source will be used.

For example an IMDb id will be ok, but a programme id used in
a web link on the source website is likely to be ephemeral and
so is NOT acceptable.

A CRID must also be consistent across multiple grabbers using
the same data source. When adding a new grabber, you must
check if a <crid> system has already been defined for your data
source in another grabber and adopt its syntax. You must not
use a CRID which has the same value as that from another
grabber unless they refer to exactly the same content.

Certain CRID sources are predefined and any grabber adding a
<crid> for one of these data sources must follow the following
syntax:

IMDb: crid://www.imdb.com/title/tt0175142
Rotten Tomatoes: crid://www.rottentomatoes.com/m/godzilla_2014/
TVRage: crid://www.tvrage.com/shows/id-3203/episodes/40999
EPGuides: crid://epguides.com/DoctorWho/guide.shtml#ep002

(Broadly this is the URL to the content but with the protocol
changed)

-->
<!ELEMENT crid (#PCDATA)>
<!ATTLIST crid system CDATA #IMPLIED>

++++++++++++++++++++++++++++++++++++++++++++++++
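
A minimal sketch of the "URL with the protocol changed" rule from Proposal A. The function name `url_to_crid` is hypothetical, and it is an assumption here that only http/https source URLs need handling; nothing below checks the result against the full RFC 4078 syntax.

```python
def url_to_crid(url: str) -> str:
    """Replace the http/https scheme of a content URL with 'crid'.

    Hypothetical helper: it only swaps the scheme, as the proposal's
    examples do, and performs no further RFC 4078 validation.
    """
    scheme, sep, rest = url.partition("://")
    if not sep or scheme.lower() not in ("http", "https") or not rest:
        raise ValueError(f"not an http(s) URL: {url!r}")
    return "crid://" + rest


if __name__ == "__main__":
    print(url_to_crid("http://www.imdb.com/title/tt0175142"))
```

Keeping the rest of the URL untouched means two grabbers that start from the same source URL end up with the same CRID, which is what the consistency rule in the proposal asks for.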




PROPOSAL B:

Similar to above except the element name is not <crid> but <cid> and its content is a simple string not a locator. It also refers only to individual programmes and not to any groupings (e.g. series).


++++++++++++++++++++++++++++++++++++++++++++++++

<!ELEMENT programme (title+, sub-title*, desc*, credits?, date?,
category*, keyword*, language?, orig-language?,
length?, icon*, url*, country*, episode-num*,
video?, audio?, previously-shown?, premiere?,
last-chance?, new?, subtitles*, rating*,
star-rating*, review*, cid*)>


<!-- CID : Content Identifier

Not the episode number or series number.

This is an identifier which uniquely identifies some 'content' within
all the programmes for this grabber. A CID may refer to a film,
episode of a series, or e.g. a news or sports broadcast.

A CID can only ever refer to an individual programme; never a series
or other grouping.

The 'system' attribute must define the source of the CID. This is to
enable the downstream application to uniquely identify a programme
across multiple grabbers which might otherwise use the same CID.
It is suggested you make the 'system' the internet domain name of
the data source.

The element content is a character string.
E.g.
<title>Scary Movie</title>
<cid system="atlas.metabroadcast.com">8hmr</cid>

<title>Doctor Who</title>
<cid system="atlas.metabroadcast.com">c7sny</cid>

By definition, a CID must *uniquely* identify some content within
the context of a specific grabber. It is unlikely a grabber will have
sufficient information available to it to be able to create its own
CID; it is expected therefore that only CIDs from the data source
will be used.

For example an IMDb id will be ok, but a programme id used in
a web link on the source website is likely to be ephemeral and
so is NOT acceptable.

A CID must also be consistent across multiple grabbers using
the same data source. When adding a new grabber, you must
check if a <cid> 'system' has already been defined for your data
source in another grabber and adopt its syntax. Conversely, you
must not use a 'system' which has the same value as that from
another grabber unless they refer to exactly the same content, in
which case they must use exactly the same CID value for
corresponding programmes.

The combination of CID 'system' + CID value must be globally
unique and refer to one, and only one, programme.

For the avoidance of doubt: the same CID value IS permitted to
refer to differing content (either globally or within any given
grabber) provided they have different 'system' attributes.

Certain 'systems' are predefined and any grabber adding a
<cid> for one of these data sources must follow the following
syntax:

IMDb: system="IMDb" value= tt0175142
e.g. <cid system="IMDb">tt0175142</cid>

-->
<!ELEMENT cid (#PCDATA)>
<!ATTLIST cid system CDATA #REQUIRED>

++++++++++++++++++++++++++++++++++++++++++++++++
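
A rough sketch of how a downstream application might read Proposal B's <cid> elements as (system, value) pairs. The function name `cid_keys` is hypothetical, and treating a missing 'system' as an error is an assumption based on the proposal text saying the attribute "must define the source".

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def cid_keys(programme_xml: str) -> List[Tuple[str, str]]:
    """Return the (system, value) key for every <cid> in one <programme>."""
    programme = ET.fromstring(programme_xml)
    keys = []
    for cid in programme.findall("cid"):
        system = cid.get("system")
        value = (cid.text or "").strip()
        # Assumption: a <cid> without a system or a value is unusable.
        if not system or not value:
            raise ValueError("<cid> needs a 'system' attribute and a value")
        keys.append((system, value))
    return keys


if __name__ == "__main__":
    sample = (
        "<programme><title>Scary Movie</title>"
        '<cid system="atlas.metabroadcast.com">8hmr</cid></programme>'
    )
    print(cid_keys(sample))
```

Because the key is the pair and not the bare value, the same value under two different systems stays distinct, as the "avoidance of doubt" paragraph requires.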
Robert Eden
2014-05-25 04:33:18 UTC
Permalink
Hi Guys... sorry I haven't been posting, I've been swamped at work and
doing a lot of travel... this discussion was going to need more brain
cycles than I could give, but I do like the way it's turning out.

One thing I'm still not clear about, is why CRID needs to be its own
field and can't be a system (or systems) on episode-num.

I guess I'm a pragmatist, but CRID will be useless if you can't get apps
to use it. Those apps are currently using episode-num for that
purpose. Some grabbers (tv_grab_na_dd for example) get a unique ID
from the data source and have put it in there (unique when it begins
with EP anyway). Apps are already looking for it there... if it's in
episode-num I think it would gain more use.

Here's an entry from tv_grab_na_dd

<programme start="20140524010000 -0600" stop="20140524013000 -0600"
channel="I10149.labs.zap2it.com">
<title lang="en">South Park</title>
<sub-title lang="en">Die Hippie, Die</sub-title>
<desc lang="en">The citizens of South Park seek Cartman's help to
battle hippies.</desc>
<credits>
<actor>Trey Parker</actor>
<actor>Matt Stone</actor>
</credits>
<date>20050316</date>
<category lang="en">Sitcom</category>
<category lang="en">Animated</category>
<category lang="en">Series</category>
<episode-num system="dd_progid">EP00229827.0142</episode-num>
<episode-num system="onscreen">902</episode-num>
<previously-shown start="20050316000000" />
<subtitles type="teletext" />
<rating system="VCHIP">
<value>TV-MA</value>
</rating>
<rating system="advisory">
<value>Graphic Language</value>
</rating>
</programme>
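
For illustration, a sketch of how an app might pull that ID out of episode-num today. The helper name `dd_progid` is hypothetical, and the "EP" prefix test simply encodes the remark above that the value is only unique when it begins with EP.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def dd_progid(programme: ET.Element) -> Optional[str]:
    """Return the dd_progid value if it looks like a unique episode ID."""
    for ep in programme.findall("episode-num"):
        if ep.get("system") == "dd_progid":
            value = (ep.text or "").strip()
            # Assumption from the post: only IDs starting with "EP" are unique.
            if value.startswith("EP"):
                return value
    return None


if __name__ == "__main__":
    fragment = (
        '<programme start="20140524010000 -0600" channel="I10149.labs.zap2it.com">'
        '<title lang="en">South Park</title>'
        '<episode-num system="dd_progid">EP00229827.0142</episode-num>'
        '<episode-num system="onscreen">902</episode-num>'
        "</programme>"
    )
    print(dd_progid(ET.fromstring(fragment)))
```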


BTW, as far as Geoff's two proposals, I like B.

Robert
Ben Bucksch
2014-05-25 04:51:10 UTC
Permalink
Post by Robert Eden
One thing I'm still not clear about, is why CRID needs to be its own
field and can't be a system (or systems) on episode-num.
I guess I'm a pragmatist, but CRID will be useless if you can't get apps
to use it.
That's right. But for apps to use it, it must meet the requirements that
I defined. If only one single grabber fails the requirement, I cannot
use the IDs anymore, or all kinds of bugs will break loose. I am not
going to hardcode a long list of "systems" of random grabbers that meet
or fail the requirements.

The episode-num, as it's traditionally defined, is sure to FAIL these
requirements. It's not useful to collapse these 2 concepts, because they
are totally different.

Please read my posts, I explain the difference there and why it matters
for apps.

@Geoff, I would really like the 2 requirements to be listed very clearly
as bulleted list in the definition, exactly as I wrote it, not as prose
text.
h***@gmail.com
2014-05-25 08:38:37 UTC
Permalink
Post by Robert Eden
One thing I'm still not clear about, is why CRID needs to be its own
field and can't be a system (or systems) on episode-num.
[...]
Post by Robert Eden
Apps are already looking for it there... if it's in
episode-num I think it would gain more use.
I tend to agree but the issue is one of semantics... "episode-num". This doesn't cater for things which aren't "episodes" - e.g. films, other one-offs, and series.

Perhaps it's ok for the CID proposal, which is to all intents and purposes simply a "programme id", but it is, I feel, the wrong label for a group CRID (e.g. "series id").

A CRID (but not a CID) can be a group item and refer to "series" - so you'd end up with things like

<episode-num system="xx_series_id">12345</episode-num>

which just looks wrong :-(
Post by Robert Eden
BTW, as far as [the] two proposals, I like B.
My main issue with B is that you can't have group ids, so you are effectively missing out on some very useful information.

For example, with A:

title = Brazil with Michael Palin
sub-title = Into Amazonia
id episode = srxyt
id series = sp2sf
id brand = sp2sg <--a group of series


If all your app cares about is whether this programme has already been seen, then B is fine, but if you want to search - e.g. "show me all episodes of series Brazil with Michael Palin" - then B is useless.
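
A small sketch of that kind of search, assuming each programme carries the group ids from the example above. The dict layout, the function name `episodes_in_series` and the second episode id `zzzz1` are hypothetical and used only for illustration.

```python
from typing import Dict, List


def episodes_in_series(programmes: List[Dict[str, str]], series_id: str) -> List[str]:
    """Return the episode ids of all programmes that belong to one series."""
    return [p["episode"] for p in programmes if p.get("series") == series_id]


if __name__ == "__main__":
    guide = [
        # ids taken from the example above
        {"title": "Brazil with Michael Palin", "episode": "srxyt",
         "series": "sp2sf", "brand": "sp2sg"},
        # hypothetical second episode of the same series
        {"title": "Brazil with Michael Palin", "episode": "zzzz1",
         "series": "sp2sf", "brand": "sp2sg"},
        # a film with no series id at all
        {"title": "Scary Movie", "episode": "8hmr"},
    ]
    print(episodes_in_series(guide, "sp2sf"))
```

With only per-programme ids (option B) the `series` key would not exist, so the filter would have nothing to match on.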

(@Ben)
It's also worth noting that generic episodes of a series usually have a unique id (which is the same for all showings of that generic). If your app is going to use this id to NOT record the programme then how do you know this is the right thing to do? That is, how will you distinguish a generic entry that really means "no episode info is available" from a true generic? I can see you will miss a lot of episodes if your incoming guide data are bad (as they often are with many of the smaller satellite channels).

The general rule for a PVR is to record any specific episode which it hasn't recorded before PLUS ALL the generic episodes. Only this way can you be sure you won't miss an episode of a series.

E.g.
id episode
1001 Ep 1
1000 Unknown episode
1003 Ep 3
1000 Unknown episode
1000 Unknown episode
1006 Ep 6
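
The rule above can be sketched as follows. It is an assumption that the app somehow knows which ids are generic (here `1000`); the function name `plan_recordings` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def plan_recordings(schedule: Iterable[str], generic_ids: Set[str]) -> List[Tuple[str, bool]]:
    """Decide, airing by airing, whether a PVR should record.

    Record every generic airing, plus any specific episode that has
    not been recorded earlier in the schedule.
    """
    recorded: Set[str] = set()
    decisions = []
    for episode_id in schedule:
        is_generic = episode_id in generic_ids
        record = is_generic or episode_id not in recorded
        decisions.append((episode_id, record))
        # Only specific episodes are remembered; generics are always recorded.
        if record and not is_generic:
            recorded.add(episode_id)
    return decisions


if __name__ == "__main__":
    # The table above, followed by a repeat of Ep 1.
    airings = ["1001", "1000", "1003", "1000", "1000", "1006", "1001"]
    for episode_id, record in plan_recordings(airings, generic_ids={"1000"}):
        print(episode_id, "record" if record else "skip")
```

Only the final repeat of 1001 is skipped; all three airings of the generic 1000 are recorded, since any of them might be Ep 2, 4 or 5.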


For me, option B is too specific and obviously geared towards one particular use of the id. It would be a wasted opportunity.

See also FR#71 & FR#105 which started this off - option B does nothing to address these.

Geoff
Ben Bucksch
2014-05-25 06:31:59 UTC
Permalink
Post by h***@gmail.com
Revised proposals for discussion.
Thanks, Geoff, for the revision.

For my taste, it's using too much space for the explanation of "system",
and what it's not, but doesn't sufficiently nail down what we *do* expect.
Most importantly, the concept of stability is lacking or not
sufficiently clear, but it's central to the idea of a CID.
I like that you're predefining some well-known sources.
I incorporated some of your proposal B into my earlier definition, so
here's proposal C:




<!ELEMENT programme (title+, sub-title*, desc*, credits?, date?,
category*, keyword*, language?, orig-language?, length?, icon*, url*,
country*, episode-num*, video?, audio?, previously-shown?, premiere?,
last-chance?, new?, subtitles*, rating*, star-rating*, review*, cid*)>

<!ELEMENT cid (#PCDATA)>
<!ATTLIST cid system CDATA #REQUIRED>


<!-- CID : Content Identifier

This is an identifier which uniquely identifies some 'content' within
all the programmes for this grabber. A CID may refer to a film, episode
of a series, or e.g. a news or sports broadcast.

The CID is an opaque string, and only valid within a "system" that you
specify. The system is generally your grabber source, so the CIDs only
need to be unique for your grabber. See below for further explanation.

If the video content is the same, the CID should be the same, even if
broadcast at a different time. If the video content is different, the
CID must be different.

1. Unique

There must never ever be 2 different contents with the same CID.

This means that a CID can only refer to an episode, not a whole series
or other grouping, because every episode would have the same ID, but
different content, and that's not unique anymore. It also means that an
episode number like "209" cannot be a CID, because there are several
series that have an episode called "209", but the content is different,
so again the number is not unique.

'Content' refers to what the end user actually wants to see, e.g. the
movie or news. Non-primary content like advertising should be ignored
and is not considered different content. OTOH, each 8PM evening news
must get a different CID, because it's new for the user.

2. Stable

There must never be the same content with 2 different IDs.
If the video content is the same, the ID should be the same, even if
broadcast at a different time.
I.e. if the same video is broadcast again in the night, the next day
or in the coming weeks, the ID MUST be the same. If the same video is
broadcast again 3 years later, it SHOULD have the same ID as 3 years
before.

It is unlikely a grabber will have sufficient information available to
it to be able to create its own CID; it is expected therefore that only
CIDs from the data source will be used.

For example an IMDb ID will be a good and probably very stable CID. A
programme ID used in a web link on the source website is probably
different for every airing, so it's not 'stable', and it's NOT
acceptable as CID. If your source provides an ID for the movie, and it
stays the same for all airings of that movie, it can be used as CID.

If any of the above criteria are not met by your IDs, you MUST NOT use
the <cid> tag.

Use cases

CIDs can be used for e.g.:
* database key, to create a normalized database where each film has one
entry, and the user can see all airings of that film
* duplication detection ("I already have recorded/seen this") while
creating recording schedules for PVRs and similar means.

If the above criteria of uniqueness and stability are not met, then the
downstream application will run into serious problems:

* Showing the wrong title and description
* recording the wrong shows, e.g. recording the documentary called
"Titanic" instead of the movie
* massive duplication and waste of disk space by re-recording shows
* not recording shows that should be recorded

system attribute

The "system" attribute defines the source of the CID. A "system" could
be the IDs of the data source, or a third-party database. This is to
enable the downstream application to uniquely identify content across
multiple grabbers which might otherwise use the same CID.

The combination of CID "system" + CID value MUST be globally unique and
refer to one, and only one, content. But the same CID value is permitted
to refer to differing content provided that the "system" attribute is
different.

CIDs guarantee their properties only within the "system", so that
grabbers only need to ensure that the CID is unique and stable within
their system.

It is suggested you make the "system" the internet domain name of the
data source, e.g. system="tvdata.com". If you use a third party
database, check the predefined systems below.

A CID SHOULD be consistent across multiple grabbers using the same data
source. When adding a new grabber, you MUST check if a <cid> "system"
has already been defined for your data source in another grabber and
SHOULD adopt its syntax. Conversely, you MUST NOT use a "system" which
has the same value as that from another grabber unless they refer to
exactly the same content, in which case they must use exactly the same
CID value for the same content.
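
As an illustration of the database-key use case, a minimal sketch that groups airings by the ("system", CID) pair. The tuple layout, the function name `group_airings`, and the channel names and start times in the sample are hypothetical; only the IMDb and themoviedb values come from this proposal.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Airing = Tuple[str, str, str, str]  # (system, cid, channel, start)


def group_airings(airings: Iterable[Airing]) -> Dict[Tuple[str, str], List[Tuple[str, str]]]:
    """Collect all airings of the same content under one (system, cid) key."""
    by_content = defaultdict(list)
    for system, cid, channel, start in airings:
        by_content[(system, cid)].append((channel, start))
    return dict(by_content)


if __name__ == "__main__":
    sample = [
        ("IMDb", "tt0246460", "channel-a", "20140601201500"),
        ("IMDb", "tt0246460", "channel-b", "20140603230000"),
        ("themoviedb", "1571", "channel-a", "20140602201500"),
    ]
    for key, shown in group_airings(sample).items():
        print(key, len(shown), "airing(s)")
```

This only works if the IDs are unique and stable as defined above: a non-unique ID would merge different films into one entry, and an unstable one would split a single film across several.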


Common systems

Certain "systems" are predefined and any grabber adding a <cid> for one
of these particular data sources MUST follow the following syntax:
IMDb
system: IMDb value: tt0246460
Please note upper/lowercase
e.g. <cid system="IMDb">tt0246460</cid>
<title>James Bond 007 - Die Another Day</title>

themoviedb
system: themoviedb value: 1571
e.g. <cid system="themoviedb">1571</cid>
<title>Die Hard</title>

thetvdb
system: thetvdb value: seriesid + "/" + id
TODO seriesid necessary or id alone sufficient?
e.g. <cid system="thetvdb">71470/46568</cid>
<title>Star Trek: TNG</title><sub-title>Deja Q</sub-title>

Atlas
http://atlas.metabroadcast.com/
system: Atlas value: 8hmr
e.g. <cid system="Atlas">b00779vr</cid>
<title>Star Trek</title><sub-title>The Immunity Syndrome</sub-title>
h***@gmail.com
2014-05-25 08:39:06 UTC
Permalink
Post by Ben Bucksch
I incorporated some of your proposal B into my earlier definition, so
If it needs that many words (852) to explain it then it's getting too complicated.
Ben Bucksch
2014-05-25 10:22:20 UTC
Permalink
Post by h***@gmail.com
If it needs that many words (852) to explain it then it's getting too complicated.
It doesn't. I think my first proposal was sufficient. I merely tried to
incorporate your text, as a compromise.
