RSS feeds: valid, useful, or accurate?

Matt Mullenweg points out that the RSS feed validator is moving the goal posts, and not for the first time:

They’ve also started marking all uses of content:encoded as potentially causing problems, which is funny because it actually avoids a ton of problems and (again) people have been using it in RSS 2.0 feeds for 3+ years now, and I even asked Dave Winer about it in the past and he said that was fine.

I’ve been struggling with similar problems for some time, trying to improve Textpattern’s RSS support. Textpattern still only supports RSS 0.92 (see note 1). I decided some time ago to update to RSS 2.0 as soon as I was sure I could produce a feed that was not just valid (most of the time, at least), but also accurate and useful. That is to say, the contents of the feed should reflect the best possible representation of the article contents. It’s the same kind of principle as used in modern, semantic, standards-compliant web design.

I could easily produce a perfectly valid RSS 2.0 feed the old-fashioned way: strip all tags and entities from the title and description, and just include the first 200 characters of plain ASCII text. That would be valid, but neither accurate nor useful.
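To make the old-fashioned approach concrete, here is a minimal Python sketch of it. The names plain_summary and _TextOnly are mine, purely for illustration, and the whitespace collapsing is an extra assumption on top of what I described.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards every tag (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_summary(html_text, limit=200):
    """Strip tags, decode entities, drop non-ASCII, keep the first `limit` characters."""
    parser = _TextOnly()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    text = " ".join("".join(parser.parts).split())  # collapse runs of whitespace
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    # The result is plain text; it still needs XML escaping when written to a feed.
    return ascii_only[:limit]


print(plain_summary("<p>Caf&eacute; <em>menus</em> &amp; more</p>"))
# prints: Caf menus & more
```

The emphasis is gone and so is the accented character, which is the whole problem: nothing here can upset a validator, and nothing here faithfully represents the article.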

The root of the problem is that the RSS 2.0 spec is far too vague. What should I do if the article title contains HTML code or entities? The spec doesn’t say. Should the description include HTML? The spec doesn’t say. Should I use <description>, <content:encoded>, or both? The spec doesn’t say. If I use both, should the description and content:encoded include identical content, or different? The spec doesn’t say. What do I do with an article that contains both a short excerpt and a long body? The spec doesn’t say. Can I safely include UTF-8 characters, or should I entity-encode them? The spec doesn’t say. Should I stick with core elements for things like dates, use namespace elements instead, or include both? The spec doesn’t say.
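For illustration only, here is one of the many possible combinations, sketched in Python: the excerpt goes in description, the full body in content:encoded, and UTF-8 characters are left unencoded. I am not claiming this is the right answer; the build_item helper and the example.com URL are hypothetical. The namespace URI is the one used by the RSS 1.0 content module.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("content", CONTENT_NS)


def build_item(title, link, excerpt_html, body_html):
    """Build an <item> with the excerpt in description and the body in content:encoded."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # ElementTree applies the XML escaping layer when the element is serialized.
    ET.SubElement(item, "description").text = excerpt_html
    ET.SubElement(item, "{%s}encoded" % CONTENT_NS).text = body_html
    return item


item = build_item(
    "Café society",
    "http://example.com/articles/1",
    "<p>A short excerpt.</p>",
    "<p>A short excerpt.</p><p>The full body follows.</p>",
)
item_xml = ET.tostring(item, encoding="unicode")
print(item_xml)
```

Every choice in that sketch (which element gets the excerpt, whether the two overlap, whether the title may contain markup) is exactly the kind of decision the spec leaves open.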

There is no shortage of conflicting opinions on these questions. The problem isn’t coming up with answers, it’s choosing the right ones.

Even those responsible for the spec seem to have given up hope of resolving these issues. An entire industry has sprung up around the service of simply interpreting and fixing all these little semantic differences between feeds.

So to Matt, and any other RSS application developers in earshot, I’d like to propose a modest solution: let’s agree on a simple set of answers. I’m not talking about rewriting the spec, just a brief statement of interpretations and best practices for RSS 2.0 feed producers. Things like:

  • How to deal with entities and HTML in each of the title (see note 2), description and content:encoded elements
  • Which extensions to use, and for what
  • When to duplicate content in both core elements and extensions
  • How to deal with multi-part articles
  • Multiple enclosures per item: yes or no

Never mind the spec lawyers or the feed validator guys. I’ll listen to what they have to say when they can agree on some basic answers to fundamental questions, instead of making endless arbitrary, conflicting decisions. Until then, let’s see if the application developers can show them how it’s done.

As I said, most of the answers are already available; it’s simply a matter of agreeing on which ones to choose.

1 I notice Matt seems to have reverted to RSS 0.92, at least for now. A quiet protest?

2 The title element is the most problematic, in my experience. For example, imagine an article with a title like this:

HTML Tutorial: the <br /> tag

How should I encode it? Single entity encoding works in some readers:

<title>HTML Tutorial: the &lt;br /&gt; tag</title>

But others think I’m trying to insert HTML markup, and strip the tag out, so it looks like this:

HTML Tutorial: the tag

For those readers, double encoding might work:

<title>HTML Tutorial: the &amp;lt;br /&amp;gt; tag</title>

But readers that expect single encoding will probably display this incorrectly:

HTML Tutorial: the &lt;br /&gt; tag
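The mismatch is easy to reproduce. The Python sketch below builds both encodings and runs them through two hypothetical reader models: one that shows the parsed title as plain text, and one that treats it as HTML. Real readers are more complicated; plain_text_reader and html_reader are simplified stand-ins of my own, not the behaviour of any particular product.

```python
import html
import re
import xml.etree.ElementTree as ET

TITLE = "HTML Tutorial: the <br /> tag"

single = html.escape(TITLE, quote=False)   # one escaping layer
double = html.escape(single, quote=False)  # the same layer applied twice


def parsed_title(encoded):
    """What an XML parser hands to the reader after undoing XML escaping."""
    return ET.fromstring("<title>%s</title>" % encoded).text


def plain_text_reader(encoded):
    """Hypothetical reader that displays the parsed title as literal text."""
    return parsed_title(encoded)


def html_reader(encoded):
    """Hypothetical reader that treats the parsed title as HTML:
    it drops anything that looks like a tag, then decodes entities."""
    without_tags = re.sub(r"<[^>]*>", "", parsed_title(encoded))
    return html.unescape(without_tags)


for label, encoded in (("single", single), ("double", double)):
    print(label, "| plain-text reader:", repr(plain_text_reader(encoded)))
    print(label, "| HTML reader:      ", repr(html_reader(encoded)))
```

Under these assumptions each encoding is displayed correctly by exactly one of the two models and mangled by the other, so the producer cannot win by picking an encoding alone.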

Here’s a sample feed including both test cases, plus a third case with numeric single-encoding. Google Reader strips the tag from the single and numeric encoded cases, but displays the double-encoded title correctly.

Bloglines also appears to interpret the double-encoded example the way we want, but interprets the single and numeric cases as actual HTML, inserting a line break in the title.

Ironically, the Feed Validator issues a warning about the double-encoded case—the only one that worked correctly in both readers. The Validator’s recommendation is to use numeric encoding instead, which didn’t work well in either reader. Yet the validator is the only source I’ve found that addresses the issue.

All three cases are considered valid, incidentally. Which proves the point that there’s a big difference between valid, useful and accurate.
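It is no surprise that the single and numeric cases behave alike. Assuming the numeric case uses decimal character references such as &#60; and &#62; (an assumption made for this sketch), an XML parser resolves them to the same characters as the named entities, so the reader receives identical text either way. A short Python check:

```python
import xml.etree.ElementTree as ET

# Named-entity and numeric-reference versions of the same title.
named = "<title>HTML Tutorial: the &lt;br /&gt; tag</title>"
numeric = "<title>HTML Tutorial: the &#60;br /&#62; tag</title>"

named_text = ET.fromstring(named).text
numeric_text = ET.fromstring(numeric).text

# Both parse to the literal string 'HTML Tutorial: the <br /> tag'.
print(named_text == numeric_text)
```

Whatever a reader does with one after parsing, it will do with the other.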

Two is way too few data points. Here’s some related reading:

Microsoft and titles
Titles in existing feeds
non-ASCII characters

Sam Ruby    Apr 17, 07:44 am    #

Far be it from me to volunteer anyone for assistance here, but I suspect Nick Bradbury could provide some invaluable help with this effort. Maybe you can persuade him to help a bit!

Steve Pilgrim    Apr 17, 08:39 am    #

What about using CDATA tags to wrap your HTML? This works, is efficient, and is valid.

Mark Kaplan    Apr 19, 11:07 am    #

Mark,

That doesn’t solve the problem. The issue is confused by the fact that there are two separate encodings involved: XML encoding, and HTML encoding. Any field in an XML document (and hence an RSS feed) must have special XML characters escaped, either using entities or CDATA.

The problem is the interpretation of the content after it has been un-escaped by the XML parser. Later revisions of the RSS spec say that the contents of the description element should be interpreted as HTML data, meaning special HTML characters must be entity-encoded after the data has been decoded by the XML parser. The fact that XML and HTML (can) use the same entity encoding methods makes this all the more confusing.

The specific issue I’ve written about above is with the title element, not description. The spec makes no statement about whether the title should be interpreted as HTML (and thus have an additional entity encoding layer, over and above the XML entity or CDATA encoding); or whether it should be interpreted as plain text. Some readers interpret it one way, some the other.
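A small Python sketch of the two layers, using made-up sample content: the same HTML fragment is written once with XML entities and once inside CDATA, and the XML parser returns identical text for both. The HTML-level &amp; survives in each case, and it is that second layer the reader still has to interpret.

```python
import xml.etree.ElementTree as ET

# Same payload, two different XML escaping methods.
escaped = "<description>&lt;p&gt;Fish &amp;amp; chips&lt;/p&gt;</description>"
cdata = "<description><![CDATA[<p>Fish &amp; chips</p>]]></description>"

escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text

print(escaped_text == cdata_text)
print(cdata_text)  # the HTML layer, still entity-encoded
```

So CDATA only changes how the XML layer is written. It says nothing about whether the text that comes out should be treated as HTML or as plain text.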

Alex    Apr 19, 11:52 am    #

Sam,

You’re absolutely right: two is not enough. But how many cases is a developer expected to test, in order to discover something that should be in the spec in the first place? The evidence I’ve found so far is sufficient to demonstrate that the answer, in this case, is that there is no solution.

The links you provide seem to support this: nobody is willing to take a position one way or the other; and, in the case of the title element, there is no possible encoding method that will display correctly (or degrade gracefully) in all (or even most) RSS readers.

Alex    Apr 19, 12:50 pm    #

I wonder… why make so much effort to adapt something broken? Why not accept the flaws of 2.0 and, with others, write a new draft for a 2.1 version that closes the holes?

Jérémie    Apr 19, 10:49 pm    #

“I wonder… why make so much effort to adapt something broken? Why not accept the flaws of the 2.0, and with others write a new draft for a 2.1 version that does close the holes?”

Or one could use something based on existing XML and RDF standards, such as RSS 1.0/1.1?

WD Milner    May 21, 03:56 am    #
