Description
Describe the inspiration for your proposal
Not long ago I read, several times over, the RFCs behind the idn-hostname parser I wrote, 5890/5891, which led me through 5892/5893 and finally to UTS#46. I realized then that 5890/5891 are really just a framework that prepares for what follows. I then started reading RFCs 3986/3987 for the identifier-js uri/iri validator and saw exactly the same thing: they are not the rigid texts I had in mind, but frameworks that distribute and delegate responsibilities to various other specifications. While building the validator I kept comparing it to whatwg-url, which covers both uri and iri in a single specification, and I was struck by the fact that it treats urls differently depending on the scheme. I looked for where those rules come from and found that they go back to exactly the 3986/3987 RFCs. I have often made the mistake of jumping straight to the relevant normative text when reading RFCs, but here I realized that something was missing, and I tried to understand more precisely what the authors had in mind. Given the complexity of URIs/IRIs, they could not put every possible variation into a single text. So they established a general framework for any valid URI/IRI and, from the very beginning, delegated the scheme-specific details to each scheme's own RFC.
Abstract
A Uniform Resource Identifier (URI) is a compact sequence of
characters that identifies an abstract or physical resource. This
specification defines the generic URI syntax and a process for
resolving URI references that might be in relative form, along with
guidelines and security considerations for the use of URIs on the
Internet. The URI syntax defines a grammar that is a superset of all
valid URIs, allowing an implementation to parse the common components
of a URI reference without knowing the scheme-specific requirements
of every possible identifier. This specification does not define a
generative grammar for URIs; that task is performed by the individual
specifications of each URI scheme.
So, which are those specific schemes?
| Scheme | Specification |
|---|---|
| http, https | RFC7230 |
| ws, wss | RFC6455 |
| file | RFC8089 |
| mailto | RFC6068 |
| data | RFC2397 |

There are a few more well-known schemes, like the obsoleted ftp or the whatwg-specific blob.
I researched the specifics of each of these schemes (not 100% sure about accuracy yet):
| Scheme | Userinfo | Host (reg_name) | Port | Path | Query | Fragment |
|---|---|---|---|---|---|---|
| http | optional | required | optional | required | optional | optional |
| https | optional | required | optional | required | optional | optional |
| ws | useless | required | optional | required | optional | useless |
| wss | useless | required | optional | required | optional | useless |
| file | optional | optional | forbidden | required | optional | optional |
| ftp | optional | required | optional | required | useless | useless |
| mailto | useless | useless | useless | required | optional | useless |
| data | useless | useless | useless | required | useless | useless |
| blob | useless | useless | useless | required | useless | useless |
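
One way to make the table above checkable is to store it as data and compare a parsed URI against its row. The following is a minimal Python sketch, not a normative validator. The names `RULES` and `violations` are hypothetical, the cell values are copied from the table as-is (so they carry the same uncertainty), and "useless" is read as "carries no meaning for this scheme", which the sketch does not flag. The path column is not checked, on the assumption that a URI always has a path, even if it is empty.

```python
from urllib.parse import urlsplit

COLUMNS = ("userinfo", "host", "port", "path", "query", "fragment")

# Rows copied verbatim from the table; accuracy is as uncertain as the table itself.
TABLE = {
    "http":   "optional required optional required optional optional",
    "https":  "optional required optional required optional optional",
    "ws":     "useless required optional required optional useless",
    "wss":    "useless required optional required optional useless",
    "file":   "optional optional forbidden required optional optional",
    "ftp":    "optional required optional required useless useless",
    "mailto": "useless useless useless required optional useless",
    "data":   "useless useless useless required useless useless",
    "blob":   "useless useless useless required useless useless",
}
RULES = {scheme: dict(zip(COLUMNS, row.split())) for scheme, row in TABLE.items()}


def violations(uri):
    """Return the table rules that the given URI breaks (hypothetical helper)."""
    parts = urlsplit(uri)
    rules = RULES.get(parts.scheme)
    if rules is None:
        return ["unknown scheme: " + parts.scheme]
    # Which components are actually present in this URI.
    # "path" is deliberately left out (assumption: it always exists, possibly empty).
    present = {
        "userinfo": parts.username is not None,
        "host": bool(parts.hostname),
        "port": parts.port is not None,
        "query": bool(parts.query),
        "fragment": bool(parts.fragment),
    }
    problems = []
    for name, is_present in present.items():
        if rules[name] == "required" and not is_present:
            problems.append(name + " is required")
        elif rules[name] == "forbidden" and is_present:
            problems.append(name + " is forbidden")
    return problems
```

For example, `violations("http:///a")` reports the missing host, while `violations("mailto:john@example.com")` reports nothing.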
When we read the related specs, we find that the schemes for network resources define their URIs as http-URI, https-URI, ws-URI, wss-URI and file-URI, all of which are subsets of the generic URI and all of which point back to RFC3986 as their base. As for the other URI parts, this is, for example, how http defines them:
URI-reference = <URI-reference, see [RFC3986], Section 4.1>
absolute-URI = <absolute-URI, see [RFC3986], Section 4.3>
relative-part = <relative-part, see [RFC3986], Section 4.2>
scheme = <scheme, see [RFC3986], Section 3.1>
authority = <authority, see [RFC3986], Section 3.2>
uri-host = <host, see [RFC3986], Section 3.2.2>
port = <port, see [RFC3986], Section 3.2.3>
path-abempty = <path-abempty, see [RFC3986], Section 3.3>
segment = <segment, see [RFC3986], Section 3.3>
query = <query, see [RFC3986], Section 3.4>
fragment = <fragment, see [RFC3986], Section 3.5>
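
Because every scheme reuses these generic components, they can be split without knowing the scheme at all. As a small illustration, the Python sketch below uses the regular expression given in RFC 3986, Appendix B. The function name `split_uri` is hypothetical, and the sketch only separates the components; it does not validate them.

```python
import re

# Regular expression from RFC 3986, Appendix B (component splitting only).
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(reference):
    """Split a URI reference into its five generic components."""
    match = URI_RE.match(reference)  # every part is optional, so this always matches
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }
```

A component that is absent comes back as `None`, which keeps "undefined" distinct from "empty" (for instance, `http:///a` has an authority that is defined but empty).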
I think I'm not wrong in saying that in IT, when we say hostname, we think of url.hostname. Maybe we assumed that the URI RFC is too rigid (looking only at its abnf) to cover this case. But in light of the above we can see that it is not. RFC3986 has its generic syntax for reg_name as a fallback for non-specific cases, as it states:
A URI resolution implementation might use DNS,
host tables, yellow pages, NetInfo, WINS, or any other system for
lookup of registered names.
but for the specific case where resources are located on the network, it says:
A registered name intended for lookup in the DNS uses the syntax
defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
and:
If the URI scheme defines a default for host, then that default
applies when the host subcomponent is undefined or when the
registered name is empty (zero length). For example, the "file" URI
scheme is defined so that no authority, an empty host, and
"localhost" all mean the end-user's machine, whereas the "http"
scheme considers a missing authority or empty host invalid.
Note that the treatment of reg_name is differentiated here by scheme.
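
The scheme-dependent default from the quoted paragraph can be sketched in a few lines of Python. This is only an illustration under assumptions: `effective_host` is a hypothetical name, and https is treated like http, which the quoted text does not state.

```python
from urllib.parse import urlsplit


def effective_host(uri):
    """Resolve the host of a URI, applying the scheme's default if it has one."""
    parts = urlsplit(uri)
    host = parts.hostname or ""  # missing authority and empty host both become ""
    if parts.scheme == "file":
        # RFC 3986, Section 3.2.2: no authority, an empty host and "localhost"
        # all mean the end-user's machine.
        return "localhost" if host in ("", "localhost") else host
    if parts.scheme in ("http", "https"):  # https handled like http (assumption)
        if not host:
            raise ValueError("missing authority or empty host is invalid for http")
        return host
    # No scheme-specific default known to this sketch.
    return host
```

With this, `file:///etc/hosts`, `file:/etc/hosts` and `file://localhost/etc/hosts` all resolve to `localhost`, while `http:///index.html` is rejected.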
and:
The reg-name syntax allows percent-encoded octets in order to
represent non-ASCII registered names in a uniform way that is
independent of the underlying name resolution technology. Non-ASCII
characters must first be encoded according to UTF-8 [STD63], and then
each octet of the corresponding UTF-8 sequence must be percent-
encoded to be represented as URI characters. URI producing
applications must not use percent-encoding in host unless it is used
to represent a UTF-8 character sequence. When a non-ASCII registered
name represents an internationalized domain name intended for
resolution via the DNS, the name must be transformed to the IDNA
encoding [RFC3490] prior to name lookup. URI producers should
provide these registered names in the IDNA encoding, rather than a
percent-encoding, if they wish to maximize interoperability with
legacy URI resolvers.
Here RFC3490 must in fact be read as RFC5890/5891, which obsolete it. I read this as a back-delegation that allows an IRI to be stored as a URI, making conversion back and forth between the two easier.
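
The two steps in the quoted paragraph (percent-encoded UTF-8 octets first, IDNA encoding before lookup) can be sketched with the Python standard library. Note the assumption: Python's built-in `idna` codec implements the older RFC3490 rules, not RFC5890/5891 or UTS#46, so this only illustrates the shape of the conversion. The function name `reg_name_to_idna` is hypothetical.

```python
from urllib.parse import unquote


def reg_name_to_idna(reg_name):
    """Turn a percent-encoded UTF-8 reg-name into its ASCII (IDNA) form."""
    # Step 1: undo the percent-encoding; the octets must form valid UTF-8.
    unicode_name = unquote(reg_name, encoding="utf-8", errors="strict")
    # Step 2: convert each label to its ASCII-compatible encoding.
    # The stdlib codec follows RFC 3490 (IDNA 2003), used here as a stand-in.
    return unicode_name.encode("idna").decode("ascii")
```

For example, `reg_name_to_idna("b%C3%BCcher.example")` gives `xn--bcher-kva.example`, and a plain ASCII name such as `example.com` passes through unchanged. A validator that targets RFC5890/5891 would need a library that implements those rules in place of the built-in codec.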
In conclusion, what I understand is this:
yellow.pages://john%20doe/phone is an allowed use of % in reg_name, but if the scheme is http, this can only be a transitory state: the input has to be transformed into its unicode version as delegated to RFC5890/5891 (and their delegations to RFC5892, RFC5893 and UTS#46) in order to give back the URI version of it (as we know, the DNS is ASCII-only and that will not change in our lifetime). Also, for schemes that involve the network, the other DNS rules are mandatory as well (hyphen rules, non-empty labels, and so on).
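
The difference between the generic fallback and the network case can be shown with a rough Python sketch. Everything named here (`is_reg_name`, `is_dns_name`, `host_ok`, `NETWORK_SCHEMES`) is hypothetical. The first pattern follows the reg-name rule of RFC 3986, Section 3.2.2. The DNS check covers only the letter-digit-hyphen label rules, assumes the commonly used 253-character limit, and does not handle empty hosts or scheme defaults.

```python
import re

# reg-name = *( unreserved / pct-encoded / sub-delims )   (RFC 3986, Section 3.2.2)
REG_NAME_RE = re.compile(r"(?:[A-Za-z0-9._~!$&'()*+,;=-]|%[0-9A-Fa-f]{2})*")

# One DNS label: letters, digits, hyphens; 1 to 63 characters;
# no leading or trailing hyphen (RFC 1034 Section 3.5, RFC 1123 Section 2.1).
DNS_LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

# Scheme list taken from the proposal; the constant itself is an assumption.
NETWORK_SCHEMES = {"http", "https", "ws", "wss", "file"}


def is_reg_name(host):
    """Generic check: does the host match the reg-name ABNF?"""
    return REG_NAME_RE.fullmatch(host) is not None


def is_dns_name(host):
    """Stricter check: every dot-separated label follows the DNS label rules."""
    if len(host) > 253:  # commonly used textual limit (assumption)
        return False
    return all(DNS_LABEL_RE.fullmatch(label) for label in host.split("."))


def host_ok(scheme, host):
    """Pick the rule set by scheme: DNS rules for network schemes, ABNF otherwise."""
    if scheme.lower() in NETWORK_SCHEMES:
        return is_dns_name(host)
    return is_reg_name(host)
```

Here `host_ok("yellow.pages", "john%20doe")` is true, because the generic reg-name rule allows percent-encoding, while `host_ok("http", "john%20doe")` is false, because `%` is not allowed in a DNS label.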
Describe the proposal
- change the hostname format requirement to the RFC 3986 reg_name abnf when the scheme implies a network resource (http, https, ws, wss, file, or any similar scheme)
- change the idn-hostname format requirement to the RFC 3987 ireg_name abnf when the scheme implies a network resource (http, https, ws, wss, file, or any similar scheme)
By implication, clarify how scheme-specific rules apply to the uri, iri, uri-reference and iri-reference formats.
Describe alternatives you've considered
No response
Additional context
No response