[X][HT]ML Parser

1. Introduction

Parsing XML, XHTML and HTML (as well as other variants) has increasingly become one of the most common tasks in programming. We present here a simple but robust "ML" parser; since this is fairly low level and rather permissive with the input, it can be used in a very wide range of situations.

2. Overview

The parse-ml function is rather low-level and will incrementally parse a ML string and call the provided callback function with the data as it gets parsed. |callback| **must* take two arguments, a "command" and its data. (This is designed to fit well with the FSM dialect.) The command can be a tag!, or the word!'s comment, text, whitespace, declaration or xml-proc. In the tag! case, it can be in the form <name>, <name/> or </name>, respectively indicating a start tag, an empty tag, or an end tag. The "data" is the text string in the word case, while it is a block of attributes (as name/value pairs - names can be word! or path! values, depending on the presence of a namespace specification) in the tag case. (It is just none for end tags.)

See 〈Use parse-ml to convert html to a block〉 for an example usage of parse-ml. It is basically a replacement for load/markup. It will return a block with all the tags parsed and converted to block! or tag!. For an example, see 〈Example usage of load-markup〉.

Note:

The source html string is assumed to use the UTF-8 encoding. If your text uses any other encoding, you must convert it to UTF-8 before calling this function.

〈Overview〉 ≡

〈Parsing rules〉

parse-ml: func [
"Parse *ML text"
html [string!]
callback [any-function!]
] [
〈Parse the html text〉
]

load-markup: func [
"LOAD/MARKUP replacement that parses tags and more"
html [string!]

/local 〈load-markup's locals〉
] [
〈Use parse-ml to convert html to a block〉
]

3. Parse the html text

We use parse on the source text, parsing tags, text and character entities, and send the parsed elements to the callback function given as argument.

〈Parse the html text〉 ≡

cb: :callback
parse/all html html-rule

4. Use parse-ml to convert html to a block

We just supply a callback to parse-ml that collects everything into a block. Tags with attributes are appended as a block, and text strings are joined together as much as possible.

〈Use parse-ml to convert html to a block〉 ≡

result: copy [ ]
parse-ml html func [cmd data] [
switch/default cmd [
  text whitespace [
   either all [not empty? result string? last result] [
    append last result data
   ] [
    append result data
   ]
  ]
  comment declaration xml-proc [
   ; append them as tags
   if tag? data: attempt [load data] [append result data]
  ]
] [
  ; cmd is tag!
  either all [block? data not empty? data] [
   insert data cmd
   append/only result data
  ] [
   append result cmd
  ]
]
]
result

4.1 load-markup's locals

〈load-markup's locals〉 ≡

result

5. Example usage of load-markup

〈Example usage of load-markup〉 ≡

>> load-markup {Some <b>html</b>. See <a href="http://www.example.com">link</a>.}
== ["Some " [<b>] "html" </b> ". See " [<a> href "http://www.example.com"] "link" </a> "."]

6. Parsing rules

We're using a simple parser. An HTML document is a sequence of comments, document type declarations, CDATA sections, XML processing instructions, script or style elements, start tags, empty tags, end tags, and text. (The script and the style elements are parsed separately because their contents is actually CDATA.)

〈Parsing rules〉 ≡

Our rule for comments is a bit strict and may miss some comments that some browsers would still consider as such. We will improve it if we find problems in the future; it should be ok for XHTML documents.

Doctype declarations are parsed with a simplified rule that could possibly break in some documents; we don't expect it to break with the usual declarations found in HTML documents though. Doctype declarations are simply ignored.

CDATA sections are like normal text sections except that the "<" and "&" characters lose any special meaning; this means that text is passed on verbatim.

〈Parsing rules〉 +≡

comment: [
copy txt [
  ; more MS useless bullshit
  "<![" thru "]>"
  |
  ""
] (cb 'comment txt)
]
declaration: [copy txt ["<!doctype" space-char thru #">"] (cb 'declaration txt)] ; simplified - may break
cdata: ["<![CDATA[" copy txt to "]]>" 3 skip (cb 'text any [txt copy ""])]

where cb is set to the callback function given by the user (see 〈Parse the html text〉).

The following definitions are used in the rules for tags:

〈Parsing rules〉 +≡

value-chars: union letter+ charset "/:@%#?,+&=;" ; very relaxed
broken-value-chars: union letter+ charset "/:@%#?,+&; "
garbage: union value-chars charset {"'}
text-char: complement charset "< ^/^-"

XML processing instructions are simply passed on literally (this also catches the XML declaration).

〈Parsing rules〉 +≡

proc: [
; MS useless bullshit (which of course we throw away)
"<?XML:NAMESPACE PREFIX = O />"
|
copy txt ["<?" name thru "?>"] (cb 'xml-proc txt)
]

Start tags and empty tags are processed with the same rule. Start tags will send the <tagname> command to the callback function, while empty tags will send <tagname/>. Attributes parsing is somewhat relaxed, and should work with the commonly used markup. They are sent as a block of name/value pairs to the callback function, as the command's data. The decode-entities function is used to convert entities in the attribute values. Note we are also parsing the namespace for attribute names and creating a path! instead of a word! in that case.

End tags are sent to the second stage as </tagname>.

〈Parsing rules〉 +≡

start-empty-tag: [
#"<"
copy nm [name opt [#":" name]] any space-char
(attributes: make block! 16) any [attribute | some garbage] ; ignore any garbage
[
  "/>" (cb head insert insert make tag! 3 + length? nm nm #"/" attributes)
  |
  #">" (cb to tag! nm attributes)
  |
  ; someone made a typo... and we have to make it work anyway!
  pos: #"<" :pos (cb to tag! nm attributes)
]
|
#"<" (cb 'text copy "<")
]
attribute: [
[
  copy attnmns name #":" copy attnmtxt name (
   attnm: make path! reduce [to word! attnmns to word! attnmtxt]
  )
  |
  copy attnmtxt name (attnm: to word! attnmtxt)
] any space-char [
  #"=" any space-char attr-value any space-char (
   insert insert/only tail attributes attnm either attval [
    decode-entities attval
   ] [
    copy ""
   ]
  )
  |
  none (insert insert/only tail attributes attnm attnmtxt)
]
]
attr-value: [
#"^"" copy attval to #"^"" skip
|
#"'" copy attval to #"'" skip
|
; handle the fact that the world is full of idiots
copy attval some broken-value-chars pos: #">" :pos
|
copy attval any value-chars
]
end-tag: ["</" copy nm [name opt [#":" name]] any space-char #">" (cb append copy </> nm none)]

Script and style elements need to be parsed separately, because they're special in HTML 3.2 and 4.0/4.1, and the CDATA syntax of XHTML is not well supported by current browsers. We are trying to parse three cases here: the style or script text is escaped with "/* <![CDATA[ */" and "/* ]]> */" (XHTML way with comments to make it work with current browsers), it is hidden into a comment (not valid in XHTML but common practice in HTML), or it is just left unescaped (invalid in XHTML but valid in HTML). In the first case we try to remove the extra "/*" and "*/" from the text; in the second case we try to remove the common "//" comment before the closing "-->"; in the third case we just take everything up to </script> or </style>.

〈Parsing rules〉 +≡

script-style: [
#"<" copy nm ["script" | "style"] any space-char
(attributes: make block! 16) any attribute
#">" (cb to tag! nm attributes nm: append copy </> nm)
[
  any space-char "/*" any space-char "<![CDATA[" any space-char "*/" any space-char
  copy txt to "]]>" 3 skip any space-char "*/" any space-char
  nm (
   txt: any [txt copy ""]
   trim/tail txt
   if "/*" = skip tail txt -2 [
    clear skip tail txt -2
    trim/tail txt
   ]
   cb 'text txt
   cb nm none
  )
  |
  any space-char "" 3 skip any space-char
  nm (
   txt: any [txt copy ""]
   trim/tail txt
   if "//" = skip tail txt -2 [
    clear skip tail txt -2
    trim/tail txt
   ]
   cb 'text txt
   cb nm none
  )
  |
  copy txt to nm nm (cb 'text any [txt copy ""] cb nm none)
]
]

Text parsing handles character and named entities, as well as normal text. Text is sent to the callback as data for the text command. Whitespace is sent separately with the whitespace command, since for e.g. whitespace alone does not start a paragraph.

〈Parsing rules〉 +≡

text: [
some [
  copy txt some space-char (cb 'whitespace txt)
  |
  copy txt some text-char (cb 'text decode-text txt 'html)
]
]

We also need to make the words we are using as variables local to the context:

〈Parsing rules〉 +≡

nm: none
attributes: [ ]
attnm: attnmtxt: attnmns: none
attval: none
txt: none

6.1 Entity conversion functions

The decode-entities function is used to replace entities with UTF-8 sequences in attribute values.

〈Parsing rules〉 +≡

decode-entities: func [attribute] [
decode-text attribute 'html
]

[X][HT]ML Parser

Gabriele Santilli

1.1.4

Contents:

1. Introduction

2. Overview

Note:

3. Parse the html text

4. Use parse-ml to convert html to a block

4.1 load-markup's locals

5. Example usage of load-markup

6. Parsing rules

6.1 Entity conversion functions