Contents:

1. Introduction

It is often desirable, for certain kinds of web sites (wikis, forums, blogs, and so on), to allow advanced users to customize any text input using HTML tags; in some cases, where the site allows users to publish new pages or edit existing content, allowing users to edit the HTML directly is even more desirable. However, if you can't trust the users, allowing HTML means opening up to a number of vulnerabilities (especially cross-site scripting, phishing, etc.); the solution to the problem is to "sanitize" any HTML text coming from an untrusted source before embedding it in the final web page. If all potentially harmful content is removed, it is possible to safely embed the HTML into any page.

2. Overview

This program defines the filter-html function, that can be used to sanitize any HTML text. The string passed as argument is fully parsed and completely regenerated as HTML 4.01 Transitional (or XHTML 1.0 Strict if the source document was XHTML), which is returned as a string.

If the /all refinement is used, a complete HTML document is returned; otherwise, the text returned contains only the "body" part of the document and can be embedded directly inside another HTML document.

More options can be specified with the /with refinement, followed by a block containing set-word!'s and values. See below for the list of available options you can set.

Note:

The source html string is assumed to use the UTF-8 encoding. If your text uses any other encoding, you must convert it to UTF-8 before calling this function.

The output string will always be ASCII (all non-ASCII characters will be converted to a character entity), unless the emit-utf8 option is set to true, in which case the output string will be UTF-8 encoded.

Overview

Filtering rules
Support functions

filter-html: func [
 "Sanitize HTML text"
 html [string!]
 /all {Return a complete HTML document (including <html> tag, <head> etc.)}
 /with options [block!] "Specify options for the filter"

 /local filter-html's locals
] [
 Parse and regenerate the html text
]

2.1 Filtering options

The /with refinement of filter-html allows passing a block of set-word!'s to set specific filtering options. The available options are:

all
equivalent to the /all refinement (logic!).
id-prefix
allows specifying a prefix to be prepended to all tag IDs (or anchor names, and so on) that are specified in the source document (thus preventing clashes with IDs in the host document). If you are embedding the filtered HTML in a document that makes use of IDs, anchors etc., then you should specify an id-prefix that ensures no clashes. (Note: currently the CSS is not being parsed, so references to IDs in the CSS will break when this is used.) If specified, it must be a string! value, and be valid as an HTML tag ID encoded in UTF-8.
emit-utf8
can be used to disable the conversion of all non-ASCII characters to HTML entities (e.g. if your page is in Japanese then you don't want every character to be encoded as an HTML entity, as that would take much more space than UTF-8). Should be a logic! value.
filter-uris
allows specifying a handler function for all URIs that appear in the document. See URI filtering for more details.
pretty-print
set to true to add indentation and newlines to the output.

Filtering options

default-options: context [
 all: no
 id-prefix: none
 emit-utf8: no
 filter-uris: none
 pretty-print: no
]

3. URI filtering

You can provide a filter function for all URIs in the HTML document via the filter-uris option. The option must be set to a function taking one object as argument and returning none, a string, or an object. The object passed as argument is what is returned by parse-uri, with the addition of two fields: tag-name and attribute-name, set to the name of the tag (a string, that may end in "/" for empty tags) and the name of the attribute (as a word!) the URI is in.

If your function returns none, the attribute is discarded. If a string! value is returned, that is used as the value for the attribute (should be a correctly formed URI); otherwise, you can return an object that is passed to form-uri (for eg. you can just modify the object passed as argument and return it).

URI filtering

uri-filter: func [uri] [
 case [
  all [uri/tag-name = 'img uri/attribute-name = 'src] [
   ; disable images
   none
  ]
  uri/host = "www2.site.com" [
   ; replace domain
   uri/host: "www.site.com"
   uri
  ]
  uri/scheme = "javascript" [
   ; disable javascript: urls
   none
  ]
  ; and so on...
 ]
]
filter-html/with "..." [
 filter-uris: :uri-filter
]

4. Parse and regenerate the html text

The HTML text is parsed into a tree using load-html, and a set of custom rules for the niwashi. The rules do most of the job of filtering. We then normalize the tree to an XHTML 1.0 one, and form it back to HTML text.

Parse and regenerate the html text

options: make default-options any [options [ ]]
if all [options/all: yes]
current-options: options
filter-uris?: any-function? get in options 'filter-uris
tree: load-html/with html filter-rules
either options/all [
 result: make-node 'root
 doctype: get-node tree/childs/declaration
 doctype: case [
  none? doctype ['none]
  xhtml = get-node doctype/prop/value [
   node: make-node 'xml-proc
   set-node node/prop/value: <?xml version="1.0" encoding="UTF-8"?>
   set-node node/parent: result
   node: make-node 'declaration
   set-node node/prop/value: xhtml
   set-node node/parent: result
   'xhtml
  ]
  html4 = get-node doctype/prop/value [
   node: make-node 'declaration
   set-node node/prop/value: html4
   set-node node/parent: result
   'html4
  ]
  'else ['none]
 ]
 html: make-node 'html
 if doctype = 'xhtml [
  set-node html/prop/xmlns: http://www.w3.org/1999/xhtml
 ]
 set-node html/parent: result
 if lang: get-node tree/childs/html/prop/lang [
  set-node html/properties: [lang: lang xml/lang: lang]
 ]
 node: get-node tree/childs/html/childs/head
 set-node node/parent: html
 node: get-node tree/childs/html/childs/body
 set-node node/parent: html
] [
 result: get-node tree/childs/html/childs/body
]
form-html/with result [
 utf8?: options/emit-utf8
 pretty?: options/pretty-print
]

4.1 filter-html's locals

filter-html's locals

tree result node lang doctype

5. Filtering rules

Filtering rules

filter-rules: [
 always [
  ignore [object param applet comment meta]
  on script remove-script
  except [text whitespace xml-proc declaration] check-attributes
  inside style [
   on text sanitize-style
  ]
  on [i u b em strong span strike] unwrap-empty-fmt
  on [i u b em strong span strike] merge-with-previous
  on a remove-empty-a
  on span unwrap-noattrs
 ]
]

6. Support functions

Support functions

xhtml: {<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">}
html4: {<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">}
Filtering options
Allowed tag attributes
!set-assoc: macro [assoc word value] [
 (:either) _pos: (:find) assoc word [
  (:poke) _pos 2 value/only
 ] [
  insert/only (:insert) (:tail) assoc word value/only
 ]
]
check-attributes: func [node /local attributes valid-attributes value uris name node-properties pos] expand-macros [
 name: !get-node-type node
 attributes: !get-node-properties node
 node-properties: reduce [
  'lang any [
   process-attr select attributes 'xml/lang 'name* none
   process-attr select attributes 'lang 'name* none
  ]
 ]
 style: copy ""
 uris: clear [ ]
 valid-attributes: any [
  select attributes-map name
  [ ]
 ]
 foreach [attr-name type defvalue] union/skip valid-attributes global-attrs 3 [
  insert/only insert tail node-properties attr-name
   value: process-attr select attributes attr-name type defvalue
  if all [type = 'uri value] [
   append uris attr-name
  ]
 ]
 foreach uri uris [
  pos: find node-properties uri
  either filter-uris? [
   value: current-options/filter-uris make pick pos 2 [
    tag-name: name
    attribute-name: uri
    target: get-node node/prop/target
   ]
   either object? value [
    poke pos 2 form-uri value
    !set-assoc node-properties 'target value/target
   ] [
    poke pos 2 value
   ]
  ] [
   value: pick pos 2
   poke pos 2 either all [
    name <> 'iframe
    find [
     #[none] "http" "https" "ftp" "telnet" "news" "mailto" "nntp" "gopher"
    ] value/scheme
   ] [
    form-uri value
   ] [
    none
   ]
  ]
 ]
 if value: select attributes 'style [
  append style value
 ]
 if not empty? style [
  !set-assoc node-properties 'style (sanitize-css style)
 ]
 fix-attrs node-properties
 !set-node-properties node node-properties
]
fix-attrs: func [attributes /local lang id val pos] [
 if lang: select attributes 'lang [
  insert insert/only tail attributes 'xml/lang lang
 ]
 if all [pos: find attributes 'id current-options/id-prefix] [
  poke pos 2 join current-options/id-prefix pick pos 2
 ]
 forskip attributes 2 [
  if all [string? val: pick attributes 2 empty? val] [poke attributes 2 #[none]]
 ]
 attributes
]
; IE sucks
non-paren: complement charset "()"
parens: [
 #"(" any non-paren any [parens any non-paren] #")"
]
sanitize-css: func [css-text [string! none!] /local mk1 mk2] [
 if string? css-text [
  ; safe but slow
  ;css-text: copy css-text
  ; remove comments (IE sucks)
  parse/all css-text [
   any [
    to "/*" mk1: thru "*/" mk2: (mk1: remove/part mk1 mk2) :mk1
   ]
  ]
  ; remove "expression( ... )" (IE sucks)
  parse/all css-text [
   any [
    to "expression" mk1: "expression" [
     any space-char parens mk2: (mk1: remove/part mk1 mk2) :mk1
     |
     none
    ]
   ]
  ]
  css-text
 ]
]
process-attr: func [value type defvalue /local opts] [
 switch defvalue [none [defvalue: none]]
 unless value [return defvalue]
 if block? type [opts: next type type: first type]
 type: get in attr-types type
 value: any [type value opts defvalue]
]
attr-types: context [
 Attribute types
]
global-attrs: [
 dir [enum "LTR" "RTL"] none
 id name* none
 class name-list none
 title cdata none
 bgcolor [style* color "background-color"] none
 ; "middle" is wrong, but should be treated like "center"
 ; we'll need to convert it eventually...
 align [enum "left" "center" "middle" "right" "justify"] none
]
sanitize-style: func [node] expand-macros [
 sanitize-css !get-node-property node 'value
]
unwrap-empty-fmt: func [node /local childs] expand-macros [
 if !get-node-parent node [
  childs: !get-node-childs node
  if any [
   empty? childs
   br-only? childs
  ] [
   unwrap-node node
  ]
 ]
]
br-only?: func [childs /local node] expand-macros [
 all [
  1 = length? childs
  node: first childs
  'br = !get-node-type node
 ]
]
remove-empty-a: func [node] expand-macros [
 if all [
  empty? !get-node-childs node
  not !get-node-property node 'name
 ] [
  !remove-node-quick node
 ]
]
unwrap-noattrs: func [node] expand-macros [
 if !get-node-parent node [
  unless foreach [name value] !get-node-properties node [
   unless none? value [break/return true]
  ] [
   unwrap-node node
  ]
 ]
]
merge-with-previous: func [node /local prev] expand-macros [
 if all [
  prev: !get-node-previous node
  equal? !get-node-type node !get-node-type prev
  equal? !get-node-properties node !get-node-properties prev
 ] [
  set-node node/parent: prev
  unwrap-node node
 ]
]
remove-script: func [node] expand-macros [
 !remove-node-quick node
]

6.1 Allowed tag attributes

Allowed tag attributes

attributes-map: [
 body [
  background [style* uri "background-image"] none
  text [style* color "color"] none
  link color none
  vlink color none
  alink color none
 ]
 title [ ]
 base [
  href uri none
  target [enum "_blank"] none
 ]
 style [
  type force "text/css"
  media media-desc none
 ]
 link [
  href uri none
  hreflang name* none
  type cdata none
  rel name-list none
  rev name-list none
  charset cdata none
  target [enum "_blank"] none
  media media-desc none
 ]
 ul [
  type [style* [enum "disc" "circle" "square"] "list-style-type"] none
 ]
 ol [
  type [style* list-style "list-style-type"] none
  start number none
 ]
 blockquote [cite uri none]
 form [
  action force "#" ; disable forms!
  method force "GET"
  name name* none
 ]
 table [
  summary cdata none
  align [enum "left" "center" "right"] none
  width [style* lengthpx "width"] none
  frame [
   enum "void" "above" "below" "hsides"
    "lhs" "rhs" "vsides" "box" "border"
  ] none
  rules [
   enum "none" "groups" "rows" "cols" "all"
  ] none
  border [style* pixels "border-width"] none
  cellspacing length none
  cellpadding length none
 ]
 ins [
  cite uri none
  datetime cdata none
 ]
 del [
  cite uri none
  datetime cdata none
 ]
 hr [
  ; this is quite hard to convert to CSS automatically
  ; so will keep the deprecated attributes
  align [enum "left" "right" "center"] none
  noshade [bool "noshade"] none
  size number none
  width length none
 ]
 br [clear brclear none]
 a [
  ;name name* none
  ; in the real world name may be an arbitrary string
  name cdata none
  href uri none
  hreflang name* none
  type cdata none
  rel name-list none
  rev name-list none
  charset cdata none
  target [enum "_blank"] none
  shape [enum "default" "rect" "circle" "poly"] none
  coords cdata none
 ]
 map [name name* none]
 area [
  shape [enum "default" "rect" "circle" "poly"] none
  coords cdata none
  nohref [bool "nohref"] none
  alt cdata none
  href uri none
  target [enum "_blank"] none
 ]
 select [
  name name* none
  size number none
  multiple [bool "multiple"] none
  disabled force "disabled"
 ]
 textarea [
  name name* none
  rows number none
  cols number none
  readonly [bool "readonly"] none
  disabled force "disabled"
 ]
 button [
  name name* none
  value cdata none
  type [enum "submit" "button" "reset"] none
  disabled force "disabled"
 ]
 label [for name* none]
 input [
  type [
   enum "text" "password" "checkbox" "radio"
    "submit" "reset" "file" "hidden"
    "image" "button"
  ] none
  name name* none
  value cdata none
  size number none
  maxlength number none
  checked [bool "checked"] none
  src uri none
  alt cdata none
  accept cdata none
  readonly [bool "readonly"] none
  disabled force "disabled"
  usemap uri none
  ismap [bool "ismap"] none
 ]
 q [cite uri none]
 img [
  src uri none
  alt cdata none
  longdesc uri none
  name name* none
  usemap uri none
  ismap [bool "ismap"] none
  width length none
  height length none
  hspace [imgmargin "left" "right"] none
  vspace [imgmargin "top" "bottom"] none
  border [style* pixels "border-width"] none
  align imgalign none
 ]
 li [
  type [style* list-style "list-style-type"] none
  value number none
 ]
 colgroup [
  span number none
  width multi-length none
  valign [style* [enum "baseline" "top" "bottom" "middle"] "vertical-align"] none
 ]
 col [
  span number none
  width multi-length none
  valign [style* [enum "baseline" "top" "bottom" "middle"] "vertical-align"] none
 ]
 thead [
  valign [style* [enum "baseline" "top" "bottom" "middle"] "vertical-align"] none
 ]
 tfoot [
  valign [style* [enum "baseline" "top" "bottom" "middle"] "vertical-align"] none
 ]
 tbody [
  valign [style* [enum "baseline" "top" "bottom" "middle"] "vertical-align"] none
 ]
 tr [
  valign [style* [enum "baseline" "top" "bottom" "middle"] "vertical-align"] none
 ]
 td [
  headers name-list none
  scope [enum "row" "col" "rowgroup" "colgroup"] none
  abbr cdata none
  axis cdata none
  rowspan number none
  colspan number none
  nowrap [style* [bool "nowrap"] "white-space"] none
  width [style* lengthpx "width"] none
  height [style* lengthpx "height"] none
  ; char alignment not supported (browsers do not support it anyway)
  valign [style* [enum "baseline" "top" "bottom" "middle"] "vertical-align"] none
  ; non-standard
  background [style* uri "background-image"] none
 ]
 th [
  headers name-list none
  scope [enum "row" "col" "rowgroup" "colgroup"] none
  abbr cdata none
  axis cdata none
  rowspan number none
  colspan number none
  nowrap [style* [bool "nowrap"] "white-space"] none
  width [style* lengthpx "width"] none
  height [style* lengthpx "height"] none
  ; char alignment not supported (browsers do not support it anyway)
  valign [style* [enum "baseline" "top" "bottom" "middle"] "vertical-align"] none
  ; non-standard
  background [style* uri "background-image"] none
 ]
 option [
  selected [bool "selected"] none
  value cdata none
  label cdata none
  disabled force "disabled"
 ]
 optgroup [
  label cdata none
  disabled force "disabled"
 ]
 ; gets converted to a span
 font [
  size font-size none
  color [style* color "color"] none
  face [style* cdata "font-family"] none
 ]
 iframe [
  src uri none
  name name* none
  width [style* lengthpx "width"] none
  height [style* lengthpx "height"] none
  frameborder [style* pixels "border-width"] none
  ; youtube/vimeo iframe tags
  allowfullscreen [bool "allowfullscreen"] none
  mozallowfullscreen [bool "mozallowfullscreen"] none
  webkitallowfullscreen [bool "webkitallowfullscreen"] none
 ]
 video [
  autoplay [bool "autoplay"] none
  preload [enum "none" "metadata" "auto"] none
  controls [bool "controls"] none
  loop [bool "loop"] none
  poster uri none
  width length none
  height length none
  mediagroup cdata none
  muted [bool "muted"] none
  src uri none
 ]
 source [
  src uri none
  type cdata none
  media media-desc none
 ]
]

6.2 Attribute types

These are the validation/handling functions for various attribute types.

Attribute types

enum: func [value opts] [
 if value: find opts value [value/1]
]
name*: func [value opts] [
 if parse/all trim/lines value [name] [value]
]
name-list: func [value opts] [
 if parse/all trim/lines value [name any [" " name]] [
  value
 ]
]
force: func [value opts] [none] ; force default
style*: func [value opts] [
 if value: process-attr value opts/1 none [
  repend style either opts/1 = 'uri [
   [opts/2 ": url('" form-uri value "');"]
  ] [
   [opts/2 ": " value ";"]
  ]
 ]
 none
]
color: func [value opts] [
 if parse/all trim/lines value [
  "#" 3 6 hexdigit
  |
  "aliceblue" | "antiquewhite" | "aqua" | "aquamarine" | "azure" |
  "beige" | "bisque" | "black" | "blanchedalmond" | "blue" |
  "blueviolet" | "brown" | "burlywood" | "cadetblue" | "chartreuse" |
  "chocolate" | "coral" | "cornflowerblue" | "cornsilk" | "crimson" |
  "cyan" | "darkblue" | "darkcyan" | "darkgoldenrod" | "darkgray" |
  "darkgreen" | "darkkhaki" | "darkmagenta" | "darkolivegreen" |
  "darkorange" | "darkorchid" | "darkred" | "darksalmon" |
  "darkseagreen" | "darkslateblue" | "darkslategray" | "darkturquoise" |
  "darkviolet" | "deeppink" | "deepskyblue" | "dimgray" |
  "dodgerblue" | "feldspar" | "firebrick" | "floralwhite" |
  "forestgreen" | "fuchsia" | "gainsboro" | "ghostwhite" | "gold" |
  "goldenrod" | "gray" | "green" | "greenyellow" | "honeydew" |
  "hotpink" | "indianred" | "indigo" | "ivory" | "khaki" | "lavender" |
  "lavenderblush" | "lawngreen" | "lemonchiffon" | "lightblue" |
  "lightcoral" | "lightcyan" | "lightgoldenrodyellow" | "lightgreen" |
  "lightgrey" | "lightpink" | "lightsalmon" | "lightseagreen" |
  "lightskyblue" | "lightslateblue" | "lightslategray" |
  "lightsteelblue" | "lightyellow" | "lime" | "limegreen" | "linen" |
  "magenta" | "maroon" | "mediumaquamarine" | "mediumblue" |
  "mediumorchid" | "mediumpurple" | "mediumseagreen" |
  "mediumslateblue" | "mediumspringgreen" | "mediumturquoise" |
  "mediumvioletred" | "midnightblue" | "mintcream" | "mistyrose" |
  "moccasin" | "navajowhite" | "navy" | "oldlace" | "olive" |
  "olivedrab" | "orange" | "orangered" | "orchid" | "palegoldenrod" |
  "palegreen" | "paleturquoise" | "palevioletred" | "papayawhip" |
  "peachpuff" | "peru" | "pink" | "plum" | "powderblue" | "purple" |
  "red" | "rosybrown" | "royalblue" | "saddlebrown" | "salmon" |
  "sandybrown" | "seagreen" | "seashell" | "sienna" | "silver" |
  "skyblue" | "slateblue" | "slategray" | "snow" | "springgreen" |
  "steelblue" | "tan" | "teal" | "thistle" | "tomato" | "turquoise" |
  "violet" | "violetred" | "wheat" | "white" | "whitesmoke" | "yellow" |
  "yellowgreen" | "transparent"
 ] [
  value
 ]
]
uri: func [value opts] [
 make parse-uri/relative value [
  source-uri: value
 ]
]
cdata: func [value opts] [if value [trim/lines value]]
media-desc: func [value opts /local names nm] [
 names: clear [ ]
 if all [value parse/all trim/lines value [
  copy nm name (append names nm)
  any [
   thru "," opt " " copy nm name (append names nm)
  ]
  to end
 ]] [
  names: intersect names [
   "screen" "tty" "tv" "projection" "handheld"
   "print" "braille" "aural" "all"
  ]
  if empty? names [return none]
  value: copy first names
  foreach name next names [
   repend value [", " name]
  ]
  value
 ]
]
list-style: func [value opts] [
 if find ["disc" "circle" "square"] value [return value]
 select/case [
  "1" "decimal" #[none]
  "a" "lower-alpha" #[none]
  "A" "upper-alpha" #[none]
  "i" "lower-roman" #[none]
  "I" "upper-roman" #[none]
 ] value
]
number: func [value opts] [
 if parse/all trim/lines value [some digit] [value]
]
pixels: func [value opts] [
 if parse/all trim/lines value [some digit] [append value "px"]
]
bool: func [value opts] [
 opts/1
]
length: func [value opts] [
 if parse/all trim/lines value [some digit opt "%"] [value]
]
lengthpx: func [value opts] [
 if parse/all trim/lines value [some digit ["%" | (append value "px") to end]] [value]
]
multi-length: func [value opts] [
 if parse/all trim/lines value [some digit ["%" | "*" | none]] [value]
]
font-size: func [value opts] [
 trim/lines value
 any [
  if find [
   "1" "2" "3" "4" "5" "6" "7" "+1" "+2" "+3" "+4" "+5" "+6" "+7"
   "-1" "-2" "-3" "-4" "-5" "-6" "-7"
  ] value [value]
  length value none
 ]
]
imgmargin: func [value opts] [
 if value: number value none [
  repend style [
   "margin-" opts/1 ": " value "px;"
   "margin-" opts/2 ": " value "px;"
  ]
 ]
 none
]
imgalign: func [value opts] [
 append style any [select [
  "bottom" "vertical-align: bottom;" #[none]
  "middle" "vertical-align: middle;" #[none]
  "top" "vertical-align: top;" #[none]
  "left" "float: left;" #[none]
  "right" "float: right;" #[none]
 ] value ""]
 none
]
brclear: func [value opts] [
 append style any [select [
  "none" "clear: none;" #[none]
  "left" "clear: left;" #[none]
  "right" "clear: right;" #[none]
  "all" "clear: both;" #[none]
 ] value ""]
 none
]