Contents:

1. Introduction

URL parsing code in REBOL 2, despite being quite good for the job it needs to do, has a number of problems and limitations:

Also, REBOL 2 does not provide any tools for handling URIs, such as URI normalization, comparison, relative URI resolution and so on. I propose to improve the current code for use in REBOL 3.

Design flaw in url! load'ing

REBOL 2 has a design flaw in the way url! values are load'ed and mold'ed. load will decode all percent-encoded characters, while mold re-encodes only characters with an ASCII code less than 33. This is incorrect, and it makes it impossible to use URLs in some cases (the common example being a user name which contains a @ character). The standard states that "the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded", with the exception of "percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time".

I propose that REBOL 3 should follow the rules in the standard specification.

2. Overview

The parse-uri function takes a URI as input (relative URIs are accepted if the /relative refinement is used) and returns an object containing the scheme, userinfo, host, port, path, query and fragment elements of the URI. Note that only path is guaranteed to never be none, and is always a block of path segments; if it is an absolute path, then the first element of the block is the word 'root; if a host is present, then the path is always absolute, so the 'root word is omitted.

No syntax is assumed for the URI components. The decode-uri-fields function can be used to further decode the URI components according to the following syntax: userinfo is subdelimited by ":" and gets parsed to a block of segments (usually, username and password), and query is subdelimited by "&" and "=" and gets parsed to a block of blocks (in the form [["name1" "value1"] ["name2" "value2"] ...]); the fragment is also checked for the common Web 2.0 usage of "?", "&" and "=" and parsed to a block if that's the case, with the format ["fragment" ["name1" "value1"] ["name2" "value2"] ...]; all segments are then percent-decoded.

Overview

Parse rules

parse-uri: func [
 "Parse a URI into its components"
 uri [any-string!]
 /relative "Allow relative URIs"

 /local parse-uri's locals
] [
 parse-uri's body
]
decode-uri-fields: func [
 "Decode the fields of a URI object (see docs)"
 obj [object!]

 /local decode-uri-fields' locals
] [
 Decode obj's fields into subfields
]

The form-uri function is the inverse of parse-uri. It takes an object like the one resulting from parse-uri as input and returns a string! (we don't use url! for the problem cited above, to avoid percent-encoding confusions). (You can also supply a block! like in [scheme: "http" host: "www.rebol.com" ...].)

All the components are assumed to be already percent-encoded etc., otherwise you must use the encode-uri-fields function to properly encode them: in that case, userinfo (if not none) should be a block and its elements will be joined and separated by a ":", query (if not none) should be a block of name/value pairs like the one returned by decode-uri-fields and it will be formed into a query using "&" and "=" with each name and value percent-encoded, fragment can also be a block as described above, and the other components will be percent-encoded.

Please note that scheme (if not none) must be a valid URI scheme name and that port (if not none) must only contain digits.

Note that in the current implementation of encode-uri-fields is not used the components are percent-encoded in place, thus the passed object is modified.

Overview +≡

encode-uri-fields: func [
 "Encode the fields of a URI object (see docs)"
 obj [object! block!]

 /local encode-uri-fields' locals
] [
 Encode the URI's fields
]
form-uri: func [
 "Form an URI string from its components"
 obj [object! block!] {URI components (scheme, userinfo, host, port, path, query, fragment)}

 /local form-uri's locals
] [
 Form an URI string from its components
]

3. Parse rules

These rules are taken almost literally from the specification. I just adapted them to parse and relaxed them a bit to allow non-ascii chars that are not percent-encoded (so we can parse technically invalid but commonly used URLs and normalize them as valid URLs). Please see RFC3986 (http://www.gbiv.com/protocols/uri/rfc/rfc3986.html) for more details and an explanation of the rules.

Parse rules

uri-rule: [copy seg scheme-rule #":" (scheme: seg) hier-part opt [#"?" query-rule] opt [#"#" fragment-rule]]

hier-part: ["//" authority path-abempty | path-absolute | path-rootless | none]

uri-reference: [uri-rule | relative-ref]

absolute-uri: [copy seg scheme-rule #":" (scheme: seg) hier-part opt [#"?" query-rule]]

relative-ref: [relative-part opt [#"?" query-rule] opt [#"#" fragment-rule]]

relative-part: ["//" authority path-abempty | path-absolute | path-noscheme | none]

scheme-rule: [alpha-char any [alpha-char | digit | #"+" | #"-" | #"."]]

authority: [opt [copy seg userinfo-rule #"@" (userinfo: seg)] host-rule opt [#":" port-rule]]
userinfo-rule: [any [relaxed | pct-encoded | sub-delims | #":"]]
host-rule: [copy host [ip-literal | ipv4address | reg-name]]
port-rule: [copy port any digit]

ip-literal: [#"[" [ipvfuture | ipv6address] #"]"]

ipvfuture: [#"v" some hexdigit #"." some [unreserved | sub-delims | #":"]]

ipv6address: [
         6 [h16 #":"] ls32
 | "::" 5 [h16 #":"] ls32
 | opt h16 "::" 4 [h16 #":"] ls32
 | opt [h16 opt [#":" h16]] "::" 3 [h16 #":"] ls32
 | opt [h16 1 2 [#":" h16]] "::" 2 [h16 #":"] ls32
 | opt [h16 1 3 [#":" h16]] "::" h16 #":" ls32
 | opt [h16 1 4 [#":" h16]] "::" ls32
 | opt [h16 1 5 [#":" h16]] "::" h16
 | opt [h16 1 6 [#":" h16]] "::"
]

h16: [1 4 hexdigit]
ls32: [h16 #":" h16 | ipv4address]

ipv4address: [dec-octet #"." dec-octet #"." dec-octet #"." dec-octet]

dec-octet: ["25" digit0-5 | #"2" digit0-4 digit | #"1" 2 digit | digit1-9 digit | digit]

reg-name: [any [unreserved | pct-encoded | sub-delims]]

path-abempty: [any [#"/" segment (append path seg)]]
path-absolute: [#"/" (append path 'root) [segment-nz (append path seg) any [#"/" segment (append path seg)] | (append path "")]]
path-noscheme: [segment-nz-nc (append path seg) any [#"/" segment (append path seg)]]
path-rootless: [segment-nz (append path seg) any [#"/" segment (append path seg)]]

segment: [copy seg any pchar (seg: any [seg ""])]
segment-nz: [copy seg some pchar]
segment-nz-nc: [copy seg some [relaxed | pct-encoded | sub-delims | #"@"]]

pchar: [relaxed | pct-encoded | sub-delims | #":" | #"@"]

query-rule: [copy query any [pchar | #"/" | #"?"]]
fragment-rule: [copy fragment any [pchar | #"/" | #"?"]]

pct-encoded: [#"%" 2 hexdigit]

digit0-5: charset "012345"
digit0-4: charset "01234"
digit1-9: charset "123456789"
unreserved: union alpha-char union digit charset "-._~"
gen-delims: charset ":/?#[]@"
sub-delims: charset "!$&'()*+,;="
reserved: union gen-delims sub-delims
relaxed: exclude complement reserved charset "%"

3.1 Words used by the rules

The words used by the rules need to be made local to the uri-parser object.

Parse rules +≡

scheme: userinfo: host: port: seg: query: fragment: none
path: [ ]
vars: [scheme userinfo host port path query fragment]

3.2 parse-uri's body

We initialize the words in vars to none, and path to an empty block. (A path is always present in an URI, but can be empty.)

Depending on the use of the /relative refinement, we use the uri-reference or the uri-rule rule to parse the URI; if parsing is successfull, we return an object containing the words in vars.

parse-uri's body

set vars none
path: make block! 8
if parse/all uri either relative [uri-reference] [uri-rule] [
 set obj: context [
  scheme: userinfo: host: port: path: query: fragment: none
 ] reduce vars
 obj
]

3.2.1 parse-uri's locals

parse-uri's locals

obj

3.3 Decode obj's fields into subfields

Decoding the URI fields in the most common case means:

Decode obj's fields into subfields

if obj/userinfo [
 parse/all obj/userinfo [
  (obj/userinfo: make block! 3)
  any [
   copy name to #":" skip (append obj/userinfo percent-decode any [name ""])
  ]
  copy name to end (append obj/userinfo percent-decode any [name ""])
 ]
]
if obj/host [percent-decode obj/host]
foreach segment obj/path [if string? segment [percent-decode segment]]
if obj/query [
 parse/all obj/query [
  (obj/query: make block! 16)
  some [
   copy name some query-chars [
    #"=" copy val some query-chars
    |
    #"=" (val: copy "")
    |
    (val: copy "")
   ]
   (append/only obj/query reduce [percent-decode name percent-decode val])
   [#"&" | end]
   |
   some [#"&" | #"="]
  ]
 ]
]
if obj/fragment [
 either parse/all obj/fragment [
  (fragments: make block! 5)
  copy name to #"?" skip (append fragments percent-decode name)
  some [
   copy name some query-chars [
    #"=" copy val some query-chars
    |
    #"=" (val: copy "")
    |
    (val: copy "")
   ]
   (append/only fragments reduce [percent-decode name percent-decode val])
   [#"&" | end]
   |
   some [#"&" | #"="]
  ]
 ] [
  obj/fragment: fragments
 ] [
  percent-decode obj/fragment
 ]
]
obj

We need a charset in the rule for decoding the query:

Parse rules +≡

query-chars: complement charset "&="

3.3.1 decode-uri-fields' locals

decode-uri-fields' locals

name val fragments

4. The percent-decode and percent-encode functions

The functions are optimized for the common case of few percent-encoded characters per string (most usually zero), so they work in place.

Parse rules +≡

percent-decode: func [
 string [any-string!]
] [
 decode-text/to string 'url string
]
percent-encode: func [
 string [any-string!]
] [
 encode-text/to string 'url string
]

5. Form an URI string from its components

The code follows the pseudo-code in the specifications.

Form an URI string from its components

if block? obj [
 obj: make context [
  scheme: userinfo: host: port: path: query: fragment: none
 ] obj
]

result: make string! 256

if obj/scheme [
 insert insert result obj/scheme #":"
]

if any [obj/userinfo obj/host obj/port] [
 append result "//"
 if obj/userinfo [
  insert insert tail result obj/userinfo #"@"
 ]
 if obj/host [
  append result obj/host
 ]
 if obj/port [
  insert insert tail result #":" obj/port
 ]
]

if not empty? obj/path [
 path: obj/path
 if not any [
  obj/userinfo obj/host obj/port
 ] [
  if string? path/1 [append result path/1]
  path: next path
 ]
 foreach segment path [
  if string? segment [insert insert tail result #"/" segment]
 ]
]

if obj/query [
 insert insert tail result #"?" obj/query
]

if obj/fragment [
 insert insert tail result #"#" obj/fragment
]

result

5.1 form-uri's locals

form-uri's locals

result path

5.2 Encode the URI's fields

Encode the URI's fields

if block? obj [
 obj: make context [
  scheme: userinfo: host: port: path: query: fragment: none
 ] obj
]
if all [block? obj/userinfo not empty? obj/userinfo] [
 result: percent-encode copy obj/userinfo/1
 foreach value next obj/userinfo [
  insert insert tail result #":" percent-encode value
 ]
 obj/userinfo: result
]
if obj/host [
 ; ip-literal must not be encoded
 unless parse/all obj/host ip-literal [
  percent-encode obj/host
 ]
]
foreach segment obj/path [
 if string? segment [percent-encode segment]
]
if all [block? obj/query not empty? obj/query] [
 result: percent-encode copy obj/query/1/1
 insert insert tail result #"=" percent-encode obj/query/1/2
 foreach pair next obj/query [
  repend result [
   #"&" percent-encode pair/1 #"=" percent-encode pair/2
  ]
 ]
 obj/query: result
]
case [
 all [block? obj/fragment 1 < length? obj/fragment] [
  result: percent-encode copy obj/fragment/1
  repend result [
   #"?" percent-encode obj/fragment/2/1 #"=" percent-encode obj/fragment/2/2
  ]
  foreach pair next next obj/fragment [
   repend result [
    #"&" percent-encode pair/1 #"=" percent-encode pair/2
   ]
  ]
  obj/fragment: result
 ]
 string? obj/fragment [percent-encode obj/fragment]
]
obj

5.2.1 encode-uri-fields' locals

encode-uri-fields' locals

result