A standards-compliant URI parser

1. Introduction

URL parsing code in REBOL 2, despite being quite good for the job it needs to do, has a number of problems and limitations:

it does not handle percent-encoding (since this is incorrectly handled by the url! datatype parser - see note below);
it only works on URLs, as they were intended to be used with REBOL ports: it does not handle the more generic URI syntax;
lacks support for IPv6 (since REBOL 2 does not support IPv6 anyway), which I hope can be added to REBOL 3.

Also, REBOL 2 does not provide any tools for handling URIs, such as URI normalization, comparison, relative URI resolution and so on. I propose to improve the current code for use in REBOL 3.

Design flaw in url! load'ing

REBOL 2 has a design flaw in the way url! values are load'ed and mold'ed. load will decode all percent-encoded characters, while mold re-encodes only characters with an ASCII code less than 33. This is incorrect, and it makes it impossible to use URLs in some cases (the common example being a user name which contains a @ character). The standard states that "the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded", with the exception of "percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time".

I propose that REBOL 3 should follow the rules in the standard specification.

2. Overview

The parse-uri function takes a URI as input (relative URIs are accepted if the /relative refinement is used) and returns an object containing the scheme, userinfo, host, port, path, query and fragment elements of the URI. Note that only path is guaranteed to never be none, and is always a block of path segments; if it is an absolute path, then the first element of the block is the word 'root; if a host is present, then the path is always absolute, so the 'root word is omitted.

No syntax is assumed for the URI components. The decode-uri-fields function can be used to further decode the URI components according to the following syntax: userinfo is subdelimited by ":" and gets parsed to a block of segments (usually, username and password), and query is subdelimited by "&" and "=" and gets parsed to a block of blocks (in the form [["name1" "value1"] ["name2" "value2"] ...]); the fragment is also checked for the common Web 2.0 usage of "?", "&" and "=" and parsed to a block if that's the case, with the format ["fragment" ["name1" "value1"] ["name2" "value2"] ...]; all segments are then percent-decoded.

〈Overview〉 ≡

〈Parse rules〉

parse-uri: func [
"Parse a URI into its components"
uri [any-string!]
/relative "Allow relative URIs"

/local 〈parse-uri's locals〉
] [
〈parse-uri's body〉
]
decode-uri-fields: func [
"Decode the fields of a URI object (see docs)"
obj [object!]

/local 〈decode-uri-fields' locals〉
] [
〈Decode obj's fields into subfields〉
]

The form-uri function is the inverse of parse-uri. It takes an object like the one resulting from parse-uri as input and returns a string! (we don't use url! for the problem cited above, to avoid percent-encoding confusions). (You can also supply a block! like in [scheme: "http" host: "www.rebol.com" ...].)

All the components are assumed to be already percent-encoded etc., otherwise you must use the encode-uri-fields function to properly encode them: in that case, userinfo (if not none) should be a block and its elements will be joined and separated by a ":", query (if not none) should be a block of name/value pairs like the one returned by decode-uri-fields and it will be formed into a query using "&" and "=" with each name and value percent-encoded, fragment can also be a block as described above, and the other components will be percent-encoded.

Please note that scheme (if not none) must be a valid URI scheme name and that port (if not none) must only contain digits.

Note that in the current implementation of encode-uri-fields is not used the components are percent-encoded in place, thus the passed object is modified.

〈Overview〉 +≡

encode-uri-fields: func [
"Encode the fields of a URI object (see docs)"
obj [object! block!]

/local 〈encode-uri-fields' locals〉
] [
〈Encode the URI's fields〉
]
form-uri: func [
"Form an URI string from its components"
obj [object! block!] {URI components (scheme, userinfo, host, port, path, query, fragment)}

/local 〈form-uri's locals〉
] [
〈Form an URI string from its components〉
]

3. Parse rules

These rules are taken almost literally from the specification. I just adapted them to parse and relaxed them a bit to allow non-ascii chars that are not percent-encoded (so we can parse technically invalid but commonly used URLs and normalize them as valid URLs). Please see RFC3986 (http://www.gbiv.com/protocols/uri/rfc/rfc3986.html) for more details and an explanation of the rules.

〈Parse rules〉 ≡

uri-rule: [copy seg scheme-rule #":" (scheme: seg) hier-part opt [#"?" query-rule] opt [#"#" fragment-rule]]

hier-part: ["//" authority path-abempty | path-absolute | path-rootless | none]

uri-reference: [uri-rule | relative-ref]

absolute-uri: [copy seg scheme-rule #":" (scheme: seg) hier-part opt [#"?" query-rule]]

relative-ref: [relative-part opt [#"?" query-rule] opt [#"#" fragment-rule]]

relative-part: ["//" authority path-abempty | path-absolute | path-noscheme | none]

scheme-rule: [alpha-char any [alpha-char | digit | #"+" | #"-" | #"."]]

authority: [opt [copy seg userinfo-rule #"@" (userinfo: seg)] host-rule opt [#":" port-rule]]
userinfo-rule: [any [relaxed | pct-encoded | sub-delims | #":"]]
host-rule: [copy host [ip-literal | ipv4address | reg-name]]
port-rule: [copy port any digit]

ip-literal: [#"[" [ipvfuture | ipv6address] #"]"]

ipvfuture: [#"v" some hexdigit #"." some [unreserved | sub-delims | #":"]]

ipv6address: [
6 [h16 #":"] ls32
| "::" 5 [h16 #":"] ls32
| opt h16 "::" 4 [h16 #":"] ls32
| opt [h16 opt [#":" h16]] "::" 3 [h16 #":"] ls32
| opt [h16 1 2 [#":" h16]] "::" 2 [h16 #":"] ls32
| opt [h16 1 3 [#":" h16]] "::" h16 #":" ls32
| opt [h16 1 4 [#":" h16]] "::" ls32
| opt [h16 1 5 [#":" h16]] "::" h16
| opt [h16 1 6 [#":" h16]] "::"
]

h16: [1 4 hexdigit]
ls32: [h16 #":" h16 | ipv4address]

ipv4address: [dec-octet #"." dec-octet #"." dec-octet #"." dec-octet]

dec-octet: ["25" digit0-5 | #"2" digit0-4 digit | #"1" 2 digit | digit1-9 digit | digit]

reg-name: [any [unreserved | pct-encoded | sub-delims]]

path-abempty: [any [#"/" segment (append path seg)]]
path-absolute: [#"/" (append path 'root) [segment-nz (append path seg) any [#"/" segment (append path seg)] | (append path "")]]
path-noscheme: [segment-nz-nc (append path seg) any [#"/" segment (append path seg)]]
path-rootless: [segment-nz (append path seg) any [#"/" segment (append path seg)]]

segment: [copy seg any pchar (seg: any [seg ""])]
segment-nz: [copy seg some pchar]
segment-nz-nc: [copy seg some [relaxed | pct-encoded | sub-delims | #"@"]]

pchar: [relaxed | pct-encoded | sub-delims | #":" | #"@"]

query-rule: [copy query any [pchar | #"/" | #"?"]]
fragment-rule: [copy fragment any [pchar | #"/" | #"?"]]

pct-encoded: [#"%" 2 hexdigit]

digit0-5: charset "012345"
digit0-4: charset "01234"
digit1-9: charset "123456789"
unreserved: union alpha-char union digit charset "-._~"
gen-delims: charset ":/?#[]@"
sub-delims: charset "!$&'()*+,;="
reserved: union gen-delims sub-delims
relaxed: exclude complement reserved charset "%"

3.1 Words used by the rules

The words used by the rules need to be made local to the uri-parser object.

〈Parse rules〉 +≡

scheme: userinfo: host: port: seg: query: fragment: none
path: [ ]
vars: [scheme userinfo host port path query fragment]

3.2 parse-uri's body

We initialize the words in vars to none, and path to an empty block. (A path is always present in an URI, but can be empty.)

Depending on the use of the /relative refinement, we use the uri-reference or the uri-rule rule to parse the URI; if parsing is successfull, we return an object containing the words in vars.

〈parse-uri's body〉 ≡

set vars none
path: make block! 8
if parse/all uri either relative [uri-reference] [uri-rule] [
set obj: context [
scheme: userinfo: host: port: path: query: fragment: none
] reduce vars
obj
]

3.2.1 parse-uri's locals

〈parse-uri's locals〉 ≡

obj

3.3 Decode obj's fields into subfields

Decoding the URI fields in the most common case means:

splitting userinfo into parts separated by ":", and percent-decode them;
decode the host;
decode each segment in the path (of course skipping 'root if present);
split the query into [name value] pairs using "&" and "=" as delimiters, and decode them;
decode the fragment, and split it if necessary (see above).

〈Decode obj's fields into subfields〉 ≡

if obj/userinfo [
parse/all obj/userinfo [
  (obj/userinfo: make block! 3)
  any [
   copy name to #":" skip (append obj/userinfo percent-decode any [name ""])
  ]
  copy name to end (append obj/userinfo percent-decode any [name ""])
]
]
if obj/host [percent-decode obj/host]
foreach segment obj/path [if string? segment [percent-decode segment]]
if obj/query [
parse/all obj/query [
  (obj/query: make block! 16)
  some [
   copy name some query-chars [
    #"=" copy val some query-chars
    |
    #"=" (val: copy "")
    |
    (val: copy "")
   ]
   (append/only obj/query reduce [percent-decode name percent-decode val])
   [#"&" | end]
   |
   some [#"&" | #"="]
  ]
]
]
if obj/fragment [
either parse/all obj/fragment [
  (fragments: make block! 5)
  copy name to #"?" skip (append fragments percent-decode name)
  some [
   copy name some query-chars [
    #"=" copy val some query-chars
    |
    #"=" (val: copy "")
    |
    (val: copy "")
   ]
   (append/only fragments reduce [percent-decode name percent-decode val])
   [#"&" | end]
   |
   some [#"&" | #"="]
  ]
] [
  obj/fragment: fragments
] [
  percent-decode obj/fragment
]
]
obj

We need a charset in the rule for decoding the query:

〈Parse rules〉 +≡

query-chars: complement charset "&="

3.3.1 decode-uri-fields' locals

〈decode-uri-fields' locals〉 ≡

name val fragments

4. The percent-decode and percent-encode functions

The functions are optimized for the common case of few percent-encoded characters per string (most usually zero), so they work in place.

〈Parse rules〉 +≡

percent-decode: func [
string [any-string!]
] [
decode-text/to string 'url string
]
percent-encode: func [
string [any-string!]
] [
encode-text/to string 'url string
]

5. Form an URI string from its components

The code follows the pseudo-code in the specifications.

〈Form an URI string from its components〉 ≡

if block? obj [
obj: make context [
  scheme: userinfo: host: port: path: query: fragment: none
] obj
]

result: make string! 256

if obj/scheme [
insert insert result obj/scheme #":"
]

if any [obj/userinfo obj/host obj/port] [
append result "//"
if obj/userinfo [
  insert insert tail result obj/userinfo #"@"
]
if obj/host [
  append result obj/host
]
if obj/port [
  insert insert tail result #":" obj/port
]
]

if not empty? obj/path [
path: obj/path
if not any [
  obj/userinfo obj/host obj/port
] [
  if string? path/1 [append result path/1]
  path: next path
]
foreach segment path [
  if string? segment [insert insert tail result #"/" segment]
]
]

if obj/query [
insert insert tail result #"?" obj/query
]

if obj/fragment [
insert insert tail result #"#" obj/fragment
]

result

5.1 form-uri's locals

〈form-uri's locals〉 ≡

result path

5.2 Encode the URI's fields

〈Encode the URI's fields〉 ≡

if block? obj [
obj: make context [
  scheme: userinfo: host: port: path: query: fragment: none
] obj
]
if all [block? obj/userinfo not empty? obj/userinfo] [
result: percent-encode copy obj/userinfo/1
foreach value next obj/userinfo [
  insert insert tail result #":" percent-encode value
]
obj/userinfo: result
]
if obj/host [
; ip-literal must not be encoded
unless parse/all obj/host ip-literal [
  percent-encode obj/host
]
]
foreach segment obj/path [
if string? segment [percent-encode segment]
]
if all [block? obj/query not empty? obj/query] [
result: percent-encode copy obj/query/1/1
insert insert tail result #"=" percent-encode obj/query/1/2
foreach pair next obj/query [
  repend result [
   #"&" percent-encode pair/1 #"=" percent-encode pair/2
  ]
]
obj/query: result
]
case [
all [block? obj/fragment 1 < length? obj/fragment] [
  result: percent-encode copy obj/fragment/1
  repend result [
   #"?" percent-encode obj/fragment/2/1 #"=" percent-encode obj/fragment/2/2
  ]
  foreach pair next next obj/fragment [
   repend result [
    #"&" percent-encode pair/1 #"=" percent-encode pair/2
   ]
  ]
  obj/fragment: result
]
string? obj/fragment [percent-encode obj/fragment]
]
obj

5.2.1 encode-uri-fields' locals

〈encode-uri-fields' locals〉 ≡

result

A standards-compliant URI parser

Gabriele Santilli

2.0.1

Contents:

1. Introduction

Design flaw in url! load'ing

2. Overview

3. Parse rules

3.1 Words used by the rules

3.2 parse-uri's body

3.2.1 parse-uri's locals

3.3 Decode obj's fields into subfields

3.3.1 decode-uri-fields' locals

4. The percent-decode and percent-encode functions

5. Form an URI string from its components

5.1 form-uri's locals

5.2 Encode the URI's fields

5.2.1 encode-uri-fields' locals