✨ Proposal: redefine hostname and idn-hostname formats #1678

@SorinGFS

Describe the inspiration for your proposal

Not long ago I repeatedly read the RFCs behind the idn-hostname parser I wrote, RFC 5890/5891, which took me through RFC 5892/5893 and finally to UTS#46. I realized then that 5890/5891 are really just a framework that prepares the ground for what follows. Later I started reading RFC 3986/3987 for the identifier-js uri/iri validator and saw exactly the same thing: they are not the rigid texts I had in mind, but frameworks that distribute and delegate responsibilities to various other specifications. While building the validator I kept comparing it to whatwg-url, which covers both uri and iri in a single specification, and I was struck by the fact that it treats urls differently depending on the scheme. I looked for where those rules come from and found that they go back exactly to RFC 3986/3987.

I have often made the mistake of jumping directly to the relevant normative text when reading RFCs, but here I realized that something was missing, and I tried to understand more precisely what the authors had in mind. In a field as complex as URI/IRI, they could not put all the possible variations into a single text. So they established a general framework for any valid URI/IRI and, from the very beginning, delegated the scheme-specific documentation to the RFC of each scheme.

Abstract
A Uniform Resource Identifier (URI) is a compact sequence of
characters that identifies an abstract or physical resource. This
specification defines the generic URI syntax and a process for
resolving URI references that might be in relative form, along with
guidelines and security considerations for the use of URIs on the
Internet. The URI syntax defines a grammar that is a superset of all
valid URIs, allowing an implementation to parse the common components
of a URI reference without knowing the scheme-specific requirements
of every possible identifier. This specification does not define a
generative grammar for URIs; that task is performed by the individual
specifications of each URI scheme.

So, what are those specific schemes?

I researched what each of these schemes requires (I am not yet 100% sure about the accuracy):

Scheme   Userinfo   Host (reg_name)   Port        Path       Query      Fragment
http     optional   required          optional    required   optional   optional
https    optional   required          optional    required   optional   optional
ws       useless    required          optional    required   optional   useless
wss      useless    required          optional    required   optional   useless
file     optional   optional          forbidden   required   optional   optional
ftp      optional   required          optional    required   useless    useless
mailto   useless    useless           useless     required   optional   useless
data     useless    useless           useless     required   useless    useless
blob     useless    useless           useless     required   useless    useless
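
To show how such a table could drive validation, here is a minimal Python sketch. The names `HOST_RULE` and `host_is_acceptable` are hypothetical, the values are copied from the Host column above (so they carry the same accuracy caveat), and only the "required" case is enforced.

```python
from urllib.parse import urlsplit

# Host column of the table above; hypothetical name, unverified values.
HOST_RULE = {
    "http": "required", "https": "required",
    "ws": "required", "wss": "required",
    "file": "optional", "ftp": "required",
    "mailto": "useless", "data": "useless", "blob": "useless",
}

def host_is_acceptable(uri: str) -> bool:
    """Return False only when the scheme requires a host and none is present."""
    parts = urlsplit(uri)               # urlsplit() lowercases the scheme
    rule = HOST_RULE.get(parts.scheme)  # None for schemes not in the table
    if rule == "required":
        return bool(parts.hostname)     # hostname is None for an empty authority
    return True                         # optional, useless, or unknown scheme

print(host_is_acceptable("https://example.com/"))  # host present
print(host_is_acceptable("http:///index.html"))    # empty host under http
print(host_is_acceptable("file:///etc/hosts"))     # empty host is fine for file
```

The point of the sketch is only that the decision depends on the scheme, not on the generic URI grammar.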

When we read the related specs we find that the schemes for network resources define their URIs as http-URI, https-URI, ws-URI, wss-URI and file-URI, each being a subset of the generic URI and each pointing back to RFC 3986 as its base. As for the other URI parts, this is how they are defined for http:

URI-reference = <URI-reference, see [RFC3986], Section 4.1>
absolute-URI  = <absolute-URI, see [RFC3986], Section 4.3>
relative-part = <relative-part, see [RFC3986], Section 4.2>
scheme        = <scheme, see [RFC3986], Section 3.1>
authority     = <authority, see [RFC3986], Section 3.2>
uri-host      = <host, see [RFC3986], Section 3.2.2>
port          = <port, see [RFC3986], Section 3.2.3>
path-abempty  = <path-abempty, see [RFC3986], Section 3.3>
segment       = <segment, see [RFC3986], Section 3.3>
query         = <query, see [RFC3986], Section 3.4>
fragment      = <fragment, see [RFC3986], Section 3.5>

I think I'm not wrong in saying that in IT, when we say hostname, we think of url.hostname. Maybe we assumed that the URI RFC, judged by its ABNF alone, is too rigid to cover this option. But in light of the above we can see that this is not the case. RFC 3986 keeps the generic reg_name syntax as a fallback for non-specific cases, because it allows for this:

A URI resolution implementation might use DNS,
host tables, yellow pages, NetInfo, WINS, or any other system for
lookup of registered names.

but for the specific case where resources are located on the network, it says:

A registered name intended for lookup in the DNS uses the syntax
defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].

and:

If the URI scheme defines a default for host, then that default
applies when the host subcomponent is undefined or when the
registered name is empty (zero length). For example, the "file" URI
scheme is defined so that no authority, an empty host, and
"localhost" all mean the end-user's machine, whereas the "http"
scheme considers a missing authority or empty host invalid.

Note that the reg_name definition is differentiated here based on the scheme.

and:

The reg-name syntax allows percent-encoded octets in order to
represent non-ASCII registered names in a uniform way that is
independent of the underlying name resolution technology. Non-ASCII
characters must first be encoded according to UTF-8 [STD63], and then
each octet of the corresponding UTF-8 sequence must be percent-
encoded to be represented as URI characters. URI producing
applications must not use percent-encoding in host unless it is used
to represent a UTF-8 character sequence. When a non-ASCII registered
name represents an internationalized domain name intended for
resolution via the DNS, the name must be transformed to the IDNA
encoding [RFC3490] prior to name lookup. URI producers should
provide these registered names in the IDNA encoding, rather than a
percent-encoding, if they wish to maximize interoperability with
legacy URI resolvers.

Here RFC 3490 must in fact be read as RFC 5890/5891, which obsolete it. I read this as a back-delegation that allows an IRI to be stored as a URI, facilitating conversion back and forth between the two.
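
The two steps in the quoted paragraph (percent-decode the octets as UTF-8, then apply the IDNA encoding) can be sketched in Python as follows. The function name `reg_name_to_idna` is hypothetical. Note the important assumption: Python's built-in `idna` codec implements the older IDNA 2003 rules (RFC 3490), not RFC 5890/5891 or UTS#46, so it only stands in for the step here.

```python
from urllib.parse import unquote

def reg_name_to_idna(reg_name: str) -> str:
    """Percent-decode a reg-name as UTF-8, then convert it to its ASCII IDNA form."""
    # Step 1: the percent-encoded octets must form valid UTF-8, hence errors="strict".
    unicode_name = unquote(reg_name, encoding="utf-8", errors="strict")
    # Step 2: stdlib codec, IDNA 2003 only (assumption: stand-in for RFC 5890/5891).
    return unicode_name.encode("idna").decode("ascii")

print(reg_name_to_idna("b%C3%BCcher.example"))  # xn--bcher-kva.example
print(reg_name_to_idna("example.com"))          # example.com
```

A real implementation of the proposal would replace step 2 with an RFC 5891 or UTS#46 conformant conversion.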

In conclusion, what I understand is this:

yellow.pages//john%20doe/phone is an allowed use of % in reg_name, but if the scheme is http, the percent-encoded form can only be a transitory state: the input must be transformed into its Unicode version, handled per RFC 5890/5891 (and, through them, RFC 5892, RFC 5893 and UTS#46), in order to give back its URI version (as we know, the DNS is ASCII-only and that will not change in our lifetime). Also, for schemes that involve the network, the other DNS rules are mandatory as well (hyphen rules, non-empty labels, and so on).
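
The DNS rules mentioned above (hyphen rules, non-empty labels) can be sketched as a small Python check based on RFC 1034 Section 3.5 as relaxed by RFC 1123 Section 2.1. The names `LABEL` and `is_dns_hostname` are hypothetical, and the sketch assumes the commonly used limits of 63 characters per label and 253 characters in total. It deliberately rejects a trailing dot.

```python
import re

# One label: letters, digits and hyphens, 1 to 63 characters,
# never starting or ending with a hyphen (a leading digit is allowed by RFC 1123).
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_dns_hostname(name: str) -> bool:
    """Check an ASCII name against the label rules; a trailing dot is rejected."""
    if not name or len(name) > 253:
        return False
    # split(".") yields an empty string for "a..b", which fails the label pattern
    return all(LABEL.fullmatch(label) for label in name.split("."))

for candidate in ("example.com", "xn--bcher-kva.example", "-bad.example", "a..b", "john%20doe"):
    print(candidate, is_dns_hostname(candidate))
```

Only the first two candidates pass; the percent-encoded name fails because "%" is not a valid label character.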

Describe the proposal

  • change the hostname format requirement to the RFC 3986 reg_name ABNF when the scheme implies a network resource (http, https, ws, wss, file, or any similar scheme)
  • change the idn-hostname format requirement to the RFC 3987 ireg_name ABNF when the scheme implies a network resource (http, https, ws, wss, file, or any similar scheme)

Implicitly, clarify the specs of the uri, iri, uri-reference and iri-reference formats with regard to scheme-specific rules.
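
For comparison, the generic RFC 3986 grammar, reg-name = *( unreserved / pct-encoded / sub-delims ), can be written as a single regular expression. This is a minimal Python sketch with hypothetical names (`REG_NAME`, `is_generic_reg_name`). It illustrates how permissive the generic grammar is on its own, which is why the scheme-specific rules matter.

```python
import re

# unreserved:  ALPHA / DIGIT / "-" / "." / "_" / "~"
# sub-delims:  "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
# pct-encoded: "%" HEXDIG HEXDIG
REG_NAME = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*")

def is_generic_reg_name(value: str) -> bool:
    """Match the generic reg-name production only; no DNS or IDNA rules are applied."""
    return REG_NAME.fullmatch(value) is not None

# The generic grammar accepts names that DNS rules would reject.
for candidate in ("john%20doe", "-bad-.example", "a..b", "", "exa mple", "100%"):
    print(repr(candidate), is_generic_reg_name(candidate))
```

The first four candidates match, including the empty string and names with leading hyphens or empty labels; only the raw space and the incomplete percent escape are rejected.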

Describe alternatives you've considered

No response

Additional context

No response
