Martin @ Blog

software development and life.


HTML Purifier

Kore Nordmann explains why, in his opinion, one shouldn’t use BBCode for comments and forums. I think he has a point, but it only holds when the BBCode is processed with regular expressions, as he explains in another article. Strictly speaking, you’re not parsing the BBCode at all when using regular expressions; you’re only matching patterns. He also explains why it makes no difference in that respect to use HTML syntax instead of BBCode syntax. He has a very good point there: BBCode syntax is not well defined, while HTML syntax, especially the subset that is normally allowed in blog comments or on forums, is well defined and known by many people.

An interesting observation is that, despite the good explanation of the problem with BBCode (a false sense of security when parsing it with regexps), people demonstrate in the comments that they don’t really understand it. For example, one comment states that it is almost impossible to block all disallowed HTML using blacklists. Obviously, one shouldn’t use blacklists, but whitelists. By default, every < and > should be replaced by &lt; and &gt;.
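To make the blacklist problem concrete, here is a minimal Python sketch. The function names and the sample payload are my own, purely for illustration; they are not taken from Kore’s articles. It contrasts a naive regexp blacklist with the escape-everything default described above.

```python
import html
import re


def blacklist_strip(text):
    """Naive blacklist (hypothetical): remove <script>...</script> blocks only."""
    return re.sub(r"<script.*?>.*?</script>", "", text,
                  flags=re.IGNORECASE | re.DOTALL)


def escape_all(text):
    """Safe default: replace every &, < and > by its entity."""
    return html.escape(text, quote=False)


# An illustrative payload that contains no <script> tag at all.
payload = '<img src=x onerror="alert(1)">'

# The blacklist does not recognise it, so it passes through untouched.
print(blacklist_strip(payload))

# Escaping by default turns the whole thing into harmless text.
print(escape_all(payload))
```

The blacklist only removes what its author thought of in advance; the escaping default needs no knowledge of any attack, and a whitelist can then selectively re-allow known-good markup.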

HTML Purifier is a library that parses HTML and uses a whitelist to allow only certain HTML tags and attributes. Why should one develop something like this from scratch when there is already a library available?
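To illustrate the general idea of parsing plus whitelisting, here is a rough Python sketch built on the standard library’s html.parser. This is not HTML Purifier’s API or implementation (HTML Purifier is a PHP library and does far more); the ALLOWED table, the SAFE_SCHEMES tuple, the WhitelistSanitizer class, and the sanitize function are all names I made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> set of allowed attribute names.
ALLOWED = {"b": set(), "i": set(), "em": set(), "strong": set(), "a": {"href"}}
# Hypothetical rule: links may only use these URL prefixes.
SAFE_SCHEMES = ("http://", "https://")


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # sanitized output fragments
        self.open_tags = []  # stack of whitelisted tags still open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text still reaches handle_data
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # attribute is not whitelisted for this tag
            if name == "href" and not value.lower().startswith(SAFE_SCHEMES):
                continue  # e.g. javascript: URLs are dropped
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # ignore stray or non-whitelisted closing tags
        # Close everything opened after the matching tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # All text is escaped, so it can never introduce new markup.
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    while parser.open_tags:  # close tags the author left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(sanitize('<b onclick="x()">hi</b> <a href="javascript:alert(1)">link</a>'))
```

Everything that is not explicitly listed is dropped or escaped, which is the opposite of the blacklist approach. A real library also has to deal with character encodings, CSS, nesting rules, and browser quirks, which is exactly why I would rather use an existing one than maintain something like this myself.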

2 Responses to “HTML Purifier”

  1. November 12th, 2007 at 10:38

    Felix says:

    Ah, that HTML Purifier is quite handy. I built it straight into my CMS, so I can finally throw out that ugly BB script, which had been irritating me for so long 🙂 I also added TinyMCE right away; I don’t know whether there is a “better” one, but this one seems to work quite nicely…

  2. November 24th, 2007 at 8:42

    p3t0r says:

    I’ve used a similar library in Java for a couple of sites I’ve worked on: http://people.apache.org/~andyc/neko/doc/html/
    It’s quite good!