I posted earlier about various problems and ambiguities in the RSS 2.0 specification. The reaction to that post, and email discussion with some other developers, confirm a conclusion I reached months ago: no one is in a position to provide definitive answers. The RSS Board is either unwilling or unable to produce a clear and unambiguous spec and, though there is no shortage of opinions on how various things might be resolved, there is no clear winner among them.
RSS feed developers are left blundering around in the dark. Each developer makes slightly different decisions about the meaning, purpose and context of each element, and news aggregator developers are left to deal with the mess.
So, in a (perhaps futile) attempt to resolve a few of the problems, here’s a draft interpretation of the RSS 2.0 spec, as I plan to implement it in Textpattern. It’s tentative and probably incomplete. This is not intended to duplicate or contradict the spec; any feed implementation that follows these proposals should be entirely valid according to the written specification.
I’m making this public in the hope of reaching a consensus with some other CMS and weblog application developers (cough) on a common interpretation of the RSS 2.0 spec. (I don’t consider this to be the best or only interpretation, just a starting point for negotiation.)
I’ve only included the elements that I think require clarification. Everything else is as per the RSS 2.0 spec as published at January 30, 2005. Other than those required by the spec, only the elements listed here will be included in a Textpattern RSS 2.0 feed.
Dates – all dates will be expressed with 4-digit years, and in GMT. The time zone string will be “GMT”, not “+0000”.
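As a rough illustration of the date rule, here is a minimal Python sketch. It relies on the standard library's `email.utils.formatdate`; the helper name `rss_date` is my own invention for this example, not anything from Textpattern, and it assumes a timezone-aware datetime is passed in.

```python
from datetime import datetime, timezone
from email.utils import formatdate


def rss_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 822 date in GMT."""
    # usegmt=True makes formatdate emit the literal "GMT" rather than "-0000".
    # The year is always written with four digits.
    return formatdate(moment.timestamp(), usegmt=True)


print(rss_date(datetime(2006, 4, 21, 8, 41, tzinfo=timezone.utc)))
# prints: Fri, 21 Apr 2006 08:41:00 GMT
```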
Character encoding – The encoding will always be specified:
<?xml version="1.0" encoding="utf-8"?>
UTF-8 is preferred but not mandatory.
Title – encoded as per the item title (see below). The name of the feed. For the site’s main feed, this will be the name of the site. Feeds for a specific section or category will also include the section or category name.
Description – encoded as per the item title (see below). Contains the site’s slogan.
Generator – a valid URL that indicates the application’s main web site and version number:
<generator>http://textpattern.com/?v=4.0.7</generator>
pubDate – the time and date that the contents of the feed were last updated. This is not necessarily the time of the most recent item in the feed, since the item pubDate permits a future date.
Title – does not include HTML markup, but HTML special characters will be double-encoded, as for the description and content:encoded (i.e. after XML decoding the title should be suitable for handling as an HTML fragment, with no raw ampersands or < characters):
Plain Text with bold
<title>Plain text with bold</title>
A & B, 1 < 2
<title>A &amp;amp; B, 1 &amp;lt; 2</title>
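The double encoding for titles can be sketched in Python as two escaping passes: one that turns plain text into an HTML fragment, and one that makes that fragment safe inside XML. The function name `title_element` is hypothetical; this is an illustration of the rule, not Textpattern's actual code.

```python
from html import escape as html_escape
from xml.sax.saxutils import escape as xml_escape


def title_element(text: str) -> str:
    """Return a <title> element whose content is double-encoded plain text."""
    # First pass: plain text -> HTML fragment ("&" becomes "&amp;").
    as_html = html_escape(text, quote=False)
    # Second pass: HTML fragment -> XML character data ("&amp;" becomes "&amp;amp;").
    return "<title>" + xml_escape(as_html) + "</title>"


print(title_element("A & B, 1 < 2"))
# prints: <title>A &amp;amp; B, 1 &amp;lt; 2</title>
```

An aggregator that XML-decodes this once gets `A &amp; B, 1 &lt; 2`, which is safe to drop into an HTML page as-is.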
Comments – included if comments are enabled or available for the item. This URL might be identical to the link URL. If no <comments> element is provided, this means there are no comments for the item, and it is not possible to submit a comment.
pubDate – the time and date at which the item is deemed to be published. This will usually (but not always) be the time and date at which the item first became visible to the public. The pubDate will not change if an item is edited, unless the user explicitly requests it.
It is possible for the user to set the publication date to a time in the past or future.
Description – consider this the equivalent of the Atom “summary” element. It will contain a user-supplied excerpt describing the article, if one has been entered. If no excerpt has been supplied by the user, the “description” element will include a short string containing the first few words of the article body.
The description is encoded as CDATA. HTML special characters should be entity-encoded within the CDATA (i.e. after XML decoding the description should be suitable for handling as HTML, no raw ampersands or < characters other than those used for markup):
Here’s a quick summary.
<description><![CDATA[<p>Here's a quick <b>summary</b>.</p>]]></description>
A & B, 1 < 2
<description><![CDATA[<p>A &amp; B, 1 &lt; 2</p>]]></description>
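A minimal sketch of the description rules, in Python. The names `cdata` and `description_element`, the 30-word cut-off and the trailing "..." are all assumptions made for illustration; the proposal only says "the first few words". The sketch also assumes the excerpt is already HTML and the article body is available as plain text.

```python
from html import escape as html_escape


def cdata(html: str) -> str:
    """Wrap an HTML string in a CDATA section."""
    # "]]>" cannot appear inside CDATA, so split it across two sections.
    return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def description_element(excerpt_html: str, body_text: str, max_words: int = 30) -> str:
    """Use the user's excerpt if present, else the first few words of the body."""
    if excerpt_html.strip():
        summary = excerpt_html
    else:
        words = body_text.split()
        # Entity-encode the plain text so no raw "&" or "<" ends up in the HTML.
        text = html_escape(" ".join(words[:max_words]), quote=False)
        if len(words) > max_words:
            text += "..."
        summary = "<p>" + text + "</p>"
    return "<description>" + cdata(summary) + "</description>"


print(description_element("", "A & B, 1 < 2"))
# prints: <description><![CDATA[<p>A &amp; B, 1 &lt; 2</p>]]></description>
```

The same `cdata` wrapper would apply unchanged to content:encoded.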
content:encoded – the full article body with HTML markup. If the user has elected to produce summary-only feeds, no “content:encoded” element will be included.
Encoded as CDATA. HTML special characters will be entity-encoded within the CDATA (i.e. after XML decoding the content should be suitable for handling as HTML, with no raw ampersands or < characters other than those used for markup):
An article with markup
<content:encoded><![CDATA[<p>An <i>article</i> with <b>markup</b></p>]]></content:encoded>
A & B, 1 < 2
<content:encoded><![CDATA[<p>A &amp; B, 1 &lt; 2</p>]]></content:encoded>
dc:creator – Used instead of <author>, since author requires an email address. This will contain the author’s “real name” as recorded in the user account; it might be a full name, or a first name or nickname only.
guid – always isPermaLink="false". This is always set, and is always globally unique. It won’t change once an item has been published, even if the article is updated or moved to a new location within the site.
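One way to get a guid that stays stable when an article moves is to build it from values fixed at first publication rather than from the article's URL. The Python sketch below uses a tag: URI for this. That scheme, the `article/` path and the function name `guid_element` are assumptions chosen for illustration; the proposal itself does not say how the identifier is generated.

```python
import xml.etree.ElementTree as ET
from datetime import date


def guid_element(site_domain: str, first_published: date, article_id: int) -> ET.Element:
    """Build a non-permalink guid from values that never change after publication."""
    guid = ET.Element("guid", {"isPermaLink": "false"})
    # Hypothetical scheme: domain + first publication date + internal article id.
    guid.text = f"tag:{site_domain},{first_published.isoformat()}:article/{article_id}"
    return guid


element = guid_element("example.com", date(2006, 4, 21), 42)
print(ET.tostring(element, encoding="unicode"))
# prints: <guid isPermaLink="false">tag:example.com,2006-04-21:article/42</guid>
```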
category – Only included if the article belongs to one or more categories. More than one category may be included. Hierarchical categories are indicated with forward-slashes, as described in the spec:
<category>parentcat/childcat</category>
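As a small sketch of the category rule in Python: each category is represented here as a list of names from parent to child, which is an assumption about how a CMS might store them. The function name `category_elements` is hypothetical.

```python
import xml.etree.ElementTree as ET


def category_elements(category_paths: list[list[str]]) -> list[ET.Element]:
    """Return one <category> element per category, joining hierarchy levels with "/"."""
    elements = []
    for path in category_paths:
        element = ET.Element("category")
        element.text = "/".join(path)
        elements.append(element)
    return elements


for element in category_elements([["parentcat", "childcat"], ["news"]]):
    print(ET.tostring(element, encoding="unicode"))
# prints: <category>parentcat/childcat</category>
# prints: <category>news</category>
```

An article with no categories yields an empty list, so no <category> element is written.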
enclosure – May be included multiple times, if the item has multiple associated files. The first enclosure should be considered the most important; news aggregators that are unable to handle multiple enclosures would be advised to settle on the first one, as a compromise.
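On the aggregator side, the "settle on the first one" compromise might look like the following Python sketch. The sample item, its URLs and the function name `primary_enclosure` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical item with two enclosures; the first is the most important.
ITEM_XML = """<item>
  <title>Episode 1</title>
  <enclosure url="http://example.com/ep1.mp3" length="1234" type="audio/mpeg"/>
  <enclosure url="http://example.com/ep1.ogg" length="1200" type="audio/ogg"/>
</item>"""


def primary_enclosure(item: ET.Element):
    """Return the first <enclosure> child in document order, or None if there is none."""
    return item.find("enclosure")


item = ET.fromstring(ITEM_XML)
print(primary_enclosure(item).get("url"))
# prints: http://example.com/ep1.mp3
```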
PS: I should acknowledge the RSS Profile, a draft set of recommendations by the RSS Board. Its purpose is quite different from that of this proposal, so I think there’s a need for the two to coexist. In particular, the RSS Profile seeks to clarify the entire RSS 2.0 spec; this proposal essentially describes a subset implementation of RSS 2.0.
21 April 2006, 08:41 by Alex
Alex is a software developer from Melbourne, Australia. Threshold State is his consulting business.
Looking good.
— Matt Apr 29, 04:18 pm #
Why the timezone as text, out of interest?
— Jeff Waugh Apr 29, 11:09 pm #
There is, of course, an easier solution to this problem.
— Simon Jessey Apr 30, 12:36 am #
Bring your thoughts to the RSS Advisory Board.
http://www.rssboard.org/
We’re looking for good men.
— Randy Charles Morin Apr 30, 06:34 am #
Simon: Textpattern has supported Atom 1.0 from day 1. That doesn’t change the fact that we still need to support RSS.
Jeff: "GMT" seems more common than "+0000". It really doesn't matter which we use; the point is to reduce the number of variables by making a few simple decisions and sticking to them.
— Alex Apr 30, 09:10 am #
Thanks Alex – In that case, I’d suggest ‘standardising’ on the simpler-to-parse variant (+0000). Not a big deal though.
— Jeff Waugh Apr 30, 04:23 pm #
Randy: I suppose you are right. It just annoys me a little bit that it has been left up to application developers to fix a problem that should’ve been taken care of by the creator(s) of the original specification. “Ass-backwards” is the phrase that springs to mind.
— Simon Jessey May 1, 10:30 pm #
Sorry, but I meant to address my last comment to Alex. The styling of the comments makes it a little unclear as to who is saying what.
— Simon Jessey May 1, 10:32 pm #
Why create yet another specification (even if you don’t call it that, it amounts to the same thing)? What is wrong with just using RSS 1.0/1.1, which is based on the very nicely structured and standard W3C RDF specifications? It was specifically created to remove ambiguities by using an established standard as a base.
— WD Milner Aug 4, 11:17 am #