Re: Non-English Domain Names Likely Delayed
Forwarded Message from Neil Harris <neil@tonal.clara.co.uk> --- Fergie (Paul Ferguson) wrote:
...sez Vint...due to the prevalence of phishing:
http://www.msnbc.msn.com/id/8586332/
- ferg
Paul,

I'm not registered as a poster on the Nanog list, so I thought I'd let you know that this problem is already well under control.

After extensive analysis and discussion, the Mozilla community and Opera have already produced a fix for this, based on only displaying Unicode IDN labels where the registry publishes and enforces well-defined anti-homograph policies, and displaying the Punycode equivalent otherwise. All that is needed is a couple of lines of code in the Punycode -> Unicode translation code in the application, and a whitelist of TLDs. See http://www.mozilla.org/projects/security/tld-idn-policy-list.html for more details.

This delegates the responsibility of catching homographs to the registries, rather than trying to catch them using ad-hoc heuristics at the browser end. In many cases, this can be as simple as restricting labels within a TLD to use a small set of non-confusable characters. In others, with wider character sets, techniques such as bundling and blocking sets of confusable labels using homograph tables can be used. RFC 3743 is a case in point.

For an excellent summary of the technical details, which is intended to help anyone attempting to eliminate homographs from a naming system, see the latest, much-expanded, version of Unicode TR #36, which also links to machine-readable confusables tables: http://www.unicode.org/reports/tr36/

Already, some 21 TLDs are whitelisted, including .cn, .tw, a number of European ccTLDs, .museum, and .info. Any other registrars who want to be supported can simply E-mail Gerv at the Mozilla Foundation, or his Opera counterpart, and give them a pointer to their anti-spoofing rules.

You might want to summarize to the list.

-- Neil
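The display rule described in the message above can be sketched roughly as follows. This is a minimal illustration in Python, not the Mozilla or Opera code: the names `IDN_WHITELIST`, `decode_label` and `display_hostname` are made up for the example, the whitelist holds only the four TLDs named in the message, and the sketch uses Python's raw `punycode` codec, so it skips the nameprep and validity checks a real browser would apply.

```python
# Hypothetical whitelist: only the TLDs named in the message above.
# The real list is maintained at the Mozilla URL cited there.
IDN_WHITELIST = frozenset({"cn", "tw", "museum", "info"})


def decode_label(label: str) -> str:
    """Return the Unicode form of one ACE ("xn--") label, or the label unchanged."""
    if not label.startswith("xn--"):
        return label
    try:
        return label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        # Malformed Punycode: fall back to showing the raw label.
        return label


def display_hostname(hostname: str, whitelist=IDN_WHITELIST) -> str:
    """Show Unicode only when the TLD is whitelisted; otherwise keep Punycode."""
    labels = hostname.lower().rstrip(".").split(".")
    if labels[-1] not in whitelist:
        return ".".join(labels)
    return ".".join(decode_label(label) for label in labels)
```

With this sketch, `display_hostname("xn--bcher-kva.info")` is shown in Unicode, while the same label under a TLD that is not on the list stays in its `xn--` form.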
After extensive analysis and discussion, the Mozilla community and Opera have already produced a fix for this, based on only displaying Unicode IDN labels where the registry publishes and enforces well-defined anti-homograph policies, and displaying the Punycode equivalent
1. It's strange that so many months of discussion and debate about this elsewhere missed such an obvious and complete solution.
2. Who is the authority that decides whether a TLD uses an acceptable policy?
3. How does this apply to subordinate domains that might or might not enforce "acceptable" policies, given that not all policy-making is at the TLD level?

d/
Dave Crocker
Brandenburg InternetWorking
+1.408.246.8253
dcrocker a t ...
WE'VE MOVED to: www.bbiw.net
On Sun, Jul 17, 2005 at 09:49:32PM -0700, Dave Crocker <dhc2@dcrocker.net> wrote a message of 25 lines which said:
2. Who is the authority that decides whether a TLD uses an acceptable policy?
That's the big problem with this so-called "solution".
Dave Crocker wrote:
After extensive analysis and discussion, the Mozilla community and Opera have already produced a fix for this, based on only displaying Unicode
IDN labels where the registry publishes and enforces well-defined anti-homograph policies, and displaying the Punycode equivalent
...snip...
3. How does this apply to subordinate domains that might or might not enforce "acceptable" policies, given that not all policy-making is at the TLD level?
It assumes that the TLD registry enforces organization-level delegation of names in every domain under which it issues names. The assumption is that operators and users of websites and other services have to place their trust in the chain of organizations delegating the DNS for their domain, and in particular in the one that registered the domain with the TLD registry. This reflects common practice: most services involving any significant value or risk are operated from their own domains, in order to keep the number of third parties to be trusted as small as possible. -- Neil
On Sun, Jul 17, 2005 at 04:29:52PM +0000, Fergie (Paul Ferguson) <fergdawg@netzero.net> wrote a message of 49 lines which said:
Forwarded Message from Neil Harris <neil@tonal.clara.co.uk> --- ... After extensive analysis and discussion, the Mozilla community and Opera have already produced a fix for this,
Which is highly questionable, and which is rejected by most European ccTLDs.
Already, some 21 TLDs are whitelisted, including .cn, .tw, a number of European ccTLDs, .museum, and .info. Any other registrars who want to be supported can simply E-mail Gerv at the Mozilla Foundation, or his Opera counterpart, and give them a pointer to their anti-spoofing rules.
The Polish registry already refused to comply, saying that the Mozilla foundation has no legitimacy deciding the registration rules in ".pl".
Stephane Bortzmeyer <bortzmeyer@nic.fr> writes:
Already, some 21 TLDs are whitelisted, including .cn, .tw, a number of European ccTLDs, .museum, and .info. Any other registrars who want to be supported can simply E-mail Gerv at the Mozilla Foundation, or his Opera counterpart, and give them a pointer to their anti-spoofing rules.
The Polish registry already refused to comply, saying that the Mozilla foundation has no legitimacy deciding the registration rules in ".pl".
And it's completely their right to do this. However, if they are at all subject to pressure from their constituency, this policy will probably change over time if this scheme becomes a de facto standard (say, for instance, M$ and Apple decide to run the same whitelist; at that point the discussion is effectively over). What's the drawback again to letting commercial forces help shape the discussion here? I forget... ---Rob
Stephane Bortzmeyer wrote:
Forwarded Message from Neil Harris <neil@tonal.clara.co.uk> ---
...
After extensive analysis and discussion, the Mozilla community and Opera have already produced a fix for this,
Which is highly questionable, and which is rejected by most European ccTLDs.
Already, some 21 TLDs are whitelisted, including .cn, .tw, a number of European ccTLDs, .museum, and .info. Any other registrars who want to be supported can simply E-mail Gerv at the Mozilla Foundation, or his Opera counterpart, and give them a pointer to their anti-spoofing rules.
The Polish registry already refused to comply, saying that the Mozilla foundation has no legitimacy deciding the registration rules in ".pl".
Stephane, can I ask you what your detailed objections are to the Moz/Opera mechanism, and could you let me know your proposal for an alternative mechanism for preventing IDN spoofing?

I completely understand the need for registries to define and control their own rules, since every registry has different needs. Thus, I agree with you that the Mozilla foundation does not have, and should not have, any right whatsoever to decide registries' registration rules.

However, by the same principle, Mozilla, Opera and other software vendors also have the right to choose their policy for how they display domain names in their products' GUI. Ultimately, the decision of what policy is used devolves to the user, who decides what software they want to install on their machine.

The Moz/Opera anti-spoofing mechanism is the result of widespread public analysis and discussion, and has the following advantages:

* it deals with the actual problem: the visual representation of characters to the user -- the problem is, quite literally, in the eye of the beholder
* it is simple to code and deploy: about ten lines of code for the Mozilla implementation.
* it is based on simple and non-political principles
* it requires only a minimal amount of data to be distributed with the software
* it is the sole survivor of a large number of alternative proposals that were considered and rejected. Unlike most of the other rejected proposals, it does not need any modifications to the DNS protocol, or distribution of "language" codes for labels, nor does it require multiple DNS lookups, large character tables in the browser, or real-time access to WHOIS information. (I can tell you in great detail about some of the flawed alternative proposals, if you like).
* it is based on a much more thorough analysis of the problem than the earlier ICANN proposals, and builds on the experience of the Unicode community, and the earlier analysis of the spoofing problem for the CJK languages performed for RFC 3743. For example, simple script restrictıons alone, as per ICANN, do not solve the problem -- there are plenty of subtle homographs in the Latin alphabet, such as the one embedded in this sentence.
* it does not treat IDNs as second-class citizens
* it is language- and script-agnostic
* it is scalable on a per-registry basis, so there's no need for a "flag day", and requires no action on behalf of the registry beyond that which might be expected as a service to their customers, who have a reasonable expectation that their domains not be easily spoofed.
* and, most of all, it uses human, and not technical, means to provide a chain of trust from the registry to the application to the user

I must say that, from a user's perspective, I find it hard to understand why any registry would not want to put their anti-spoofing policy -- assuming they have one -- on public display, thus encouraging software vendors to regard their IDN labels as safe to display within their software.

In the long run, of course, it makes sense for best common registry anti-spoofing practices to be codified, probably in an RFC, or through the Unicode consortium. However, until then, the maintenance of an ad-hoc list by software vendors seems to be a powerful incentive in the short term for registries to implement and publish anti-spoofing policies which encourage trust.

There are a vast number of possible policies which registries could introduce, any of which might serve this purpose. For example, for .fr, it could be as simple as saying something like "labels in .fr must consist only of characters from the set -, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, à, â, æ, ç, è, é, ê, ë, î, ï, ô, ù, û, ü, ÿ, œ", putting that statement on their website, and letting the software makers know about it.

For .pl, which appears to want to support multiple character sets including the Cyrillic alphabet, it could be to say "we implement the character set restrictions of draft-bartosiewicz-idn-pltld-06.txt, together with blocking bundling using the confusables.txt table as per UTR #36-3".

In my opinion, either of these statements would persuade me that the registry was applying due diligence in avoiding homograph spoofs, and I would imagine that browser vendors would take the same view.

Again, if this is unworkable, please let me know a better alternative.

-- Neil
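As an illustration of how little is needed for a rule like the .fr example in the message above, here is a minimal Python sketch of a registry-side check. The character set is copied from that example; the names `FR_ALLOWED` and `is_allowed_label` are hypothetical, and the NFC normalization step is an assumption of this sketch rather than something the message specifies.

```python
import unicodedata

# Character set quoted in the .fr example above (illustrative, not an
# actual registry policy).
FR_ALLOWED = frozenset(
    "-0123456789abcdefghijklmnopqrstuvwxyz"
    "àâæçèéêëîïôùûüÿœ"
)


def is_allowed_label(label: str, allowed=FR_ALLOWED) -> bool:
    """Accept a label only if every character is in the published set."""
    # Assumption of this sketch: compare in NFC, so that "e" followed by a
    # combining acute accent is treated as the precomposed "é".
    normalized = unicodedata.normalize("NFC", label)
    return len(normalized) > 0 and all(ch in allowed for ch in normalized)
```

Under this check a label containing the dotless ı (U+0131) or a Cyrillic а (U+0430) is rejected, because neither character is in the published set.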
Stephane, can I ask you what your detailed objections are to the Moz/Opera mechanism, and could you let me know your proposal for an alternative mechanism for preventing IDN spoofing?
I would suggest that an alternative mechanism should include a set of code points to be used for the on-the-wire DNS protocol and the registry databases. This set of codepoints will greatly restrict the possibility of ambiguity. Right now it is utterly impossible to represent the ambiguity of IBM, ibm, IBM or IbM in the DNS because the set of codepoints only allows for one code to be shared by I and i. This principle could be extended to other scripts so that, for instance, codes for the 2nd and 4th letters of the Cyrillic alphabet could be added while not adding codes for the 1st and 3rd letters because A and B are already there.

Two additional items needed are translation tables. One translation table would be the PREFERRED mapping from the DNS codepoints to Unicode. I say "preferred" because while some people will be happy to see the "b" as in "ibm", others may prefer to see it as "B" especially Cyrillic users who use "B" for a completely different letter most of the time. Also, Arabs may prefer to map first and last letters of a domain to the initial and final forms of the letter and use medials for the rest because it looks better most of the time. This does not create exploitable ambiguity.

The second item is a comprehensive mapping for all of UNICODE that maps each code point into one of the DNS code points. This should be defined as an algorithm because that allows for a combination of mapping tables and more efficient ways of defining and executing the mapping.

It may be painful to upgrade the DNS, but if we are going to do so, we need to try to make it a solution that will work for a long time, not just quick fix patches. I have nothing against the Mozilla solution as a quick fix but I hope that it is used to demonstrate the need for upgrading DNS and fixing the problem at its root.
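To make the two tables in this proposal concrete, here is a rough Python sketch. Everything in it is hypothetical: the tiny `FOLD` table stands in for the "comprehensive mapping for all of UNICODE", `PREFERRED` stands in for the per-locale display mapping, and the locale code "ru" and the function names are invented for the example. No such tables are defined in the proposal itself.

```python
# Hypothetical many-to-one table: Unicode code point -> DNS code point.
FOLD = {
    "\u0410": "a",  # CYRILLIC CAPITAL LETTER A, shares a code with Latin a
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0412": "b",  # CYRILLIC CAPITAL LETTER VE, drawn like Latin B
}

# Hypothetical PREFERRED tables: DNS code point -> glyph, per locale.
PREFERRED = {
    "ru": {"a": "\u0410", "b": "\u0412"},
}


def to_dns(name: str) -> str:
    """Map user input to the single on-the-wire form."""
    # Characters not in the table fall back to ordinary case folding.
    return "".join(FOLD.get(ch, ch.lower()) for ch in name)


def to_display(dns_name: str, locale: str) -> str:
    """Render the on-the-wire form using the locale's preferred glyphs."""
    table = PREFERRED.get(locale, {})
    return "".join(table.get(ch, ch) for ch in dns_name)
```

In this sketch "IBM", "ibm" and "IbM" all fold to the same on-the-wire name, and so does a spelling that uses the Cyrillic capital Ve in place of B; only the display differs by locale.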
For example, simple script restrictıons alone, as per ICANN, do not solve the problem -- there are plenty of subtle homographs in the Latin alphabet, such as the one embedded in this sentence.
Personally, I consider that to be the Turkish alphabet, not the Latin one. Turkic speakers who use Cyrillic also have a habit of adopting munged up characters in their alphabets. I think this is solved by defining the PREFERRED mapping as described above. Turkey would implement it keeping the distinction between the i with and without the dot. Many other countries would opt for sticking in some code like "?" to indicate that there is a weird character there. If I localize my computer to allow Turkish text entry and Turkish fonts, no doubt I would also get the Turkish domain name mapping preferences. And no doubt, Central Asian countries speaking Turkic languages but using the Cyrillic alphabet would map all the codes into their familiar Cyrillic forms. This is possible because the reverse mapping allows one to type in many different possible UNICODE character forms of a domain name in order to get the same single unambiguous registered domain name.
* it is scalable on a per-registry basis, so there's no need for a "flag
day", and requires no action on behalf of the registry beyond that which
might be expected as a service to their customers, who have a reasonable
expectation that their domains not be easily spoofed.
I think if we are going to upgrade the DNS, then registries will have to adapt in the same way as everybody else. And if that includes a flag day, then so be it. I suspect, however, that we will find some less disruptive way to transition, perhaps with two flag days to indicate the beginning and the end of a transition period.
For example, for .fr, it could be as simple as saying something like "labels in .fr must consist only of characters from the set -, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, à, â, æ, ç, è, é, ê, ë, î, ï, ô, ù, û, ü, ÿ, œ", putting that statement on their website, and letting the software makers know about it.
And if a Turkish cultural centre in Paris wants to register a domain name with the undotted i, then what? National boundaries have no relationship to cultural boundaries.

Admittedly, in my solution suggested above, if such a Turkish domain name did exist, anyone who did not have a localized system supporting entry of the undotted i would not be able to enter the name of the domain. They could still access the website by leveraging a website that allowed them to access it by clicking a link, in the same way that http://www.translit.ru provides a Cyrillic keyboard for computers without Cyrillic localization installed.

--Michael Dillon
Michael, your idea of mapping confusable characters to a single "master" character was one of the options which was considered, but rejected.

To see why, consider the Turkish dotless-i in your second example. Now, to most non-Turkish readers, dotless-i is a homograph of the more common dotted-i character. If we map both to ASCII code 105, we've eliminated the homograph for non-Turkish users, but we then deny Turkish users the useful distinction between the two letters. Adding epicycles to this scheme, with character-set tags or filter rules based on the locale setting on the client, unfortunately makes things worse, not better.

This example actually illustrates rather nicely why it is so important that different TLDs, particularly ccTLDs, should be able to have different rules. For example, it's possible (I don't know Turkish) that there may be some pair of names in Turkish which may be distinguished entirely by the difference between dotted and dotless-i.

Any procedure for preventing spoofing must bear in mind the fact that registries process vast numbers of registrations daily, and human oversight is not possible in the general case. Bundling using confusables-tables, with appropriate considerations for cultural variations in what is confusable, is a much more effective approach, and allows subtle distinctions to be retained for those labels for which they are useful.

For example, the case of registering a name containing a dotless-i in .fr could be easily dealt with by bundling, even if for French purposes dotted and dotless-i were normalized to the same equivalence set of confusable characters, provided that no potentially confusable French name had been registered first.

-- Neil
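The bundling idea in the message above can be sketched in a few lines of Python. This is an illustration only: the four-entry `CONFUSABLES` table is a hypothetical stand-in for the machine-readable confusables data linked from UTR #36, and the `Registry` class and its first-come, same-owner rule are assumptions of the sketch, not any registry's actual policy.

```python
# Hypothetical, tiny confusables table: character -> representative of its
# equivalence set. A real registry would load a full table instead.
CONFUSABLES = {
    "\u0131": "i",  # LATIN SMALL LETTER DOTLESS I
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}


def skeleton(label: str) -> str:
    """Key shared by every label in the same confusable bundle."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.lower())


class Registry:
    """Grants a label only if its bundle is free or held by the same owner."""

    def __init__(self) -> None:
        self._owner_by_skeleton = {}

    def register(self, label: str, owner: str) -> bool:
        key = skeleton(label)
        holder = self._owner_by_skeleton.get(key)
        if holder is not None and holder != owner:
            return False  # a confusable label already belongs to someone else
        self._owner_by_skeleton[key] = owner
        return True
```

The label is stored as registered, dotless-i included, so the distinction survives for the registrant; only a later look-alike from a different owner is blocked.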
participants (6)
- Dave Crocker
- Fergie (Paul Ferguson)
- Michael.Dillon@btradianz.com
- Neil Harris
- Robert E. Seastrom
- Stephane Bortzmeyer