12 Point Design  


URL Canonicalization

What's in a name?

Shakespeare proposed, by way of Juliet: "That which we call a rose by any other name would smell as sweet."

Canonical = Authoritative

Canonicalization is a loaded word, and most people can't get past the pronunciation, much less the meat of the subject. In this context you can safely treat it as a synonym for "authoritative". The object of canonicalization is to determine the "most perfect" address for a resource.

True canonicalization relies heavily on simplification, Occam's razor and all that.

To put it simply is the point.

The statement implies that an object could be described in any of infinite ways and still have the same desirable traits. While this may be true, if half the people in the world were to call a rose a "rose" and the other half called it a "gul", would either name be more valuable? You could expect every other person you meet to know it as one or the other, but not both. And those who knew both names would have to guess which would be the most appropriate to use in any specific conversation. Now, what happens when the rose goes by three names? Ten? Five hundred?

Internet addressing is much the same. While it's wonderful to have innumerable paths to a single location, only one can be "most" accurate. Let's call that the rose.

In Internet terminology the "most perfect" address for a resource is known as its "Canonical" address. Linguistics aside, that just means that it's the best or most accurate for that specific content.

Wwwith and Wwwithout

There are many ways that an address can be duplicated.

The most common, and probably the only commonly understood, is the "wwwith and wwwithout" issue of domains. Typically, your webhost will enable access to your site as both http://www.example.com/ and http://example.com/. This is for both your and your visitors' convenience. It also immediately and effectively gives every file on your site two separate addresses. Your host will often provide a way to correct this behavior by adding a 301 (permanent) redirection from one to the other. This effectively tells visitors (and spiders) that the resource should only be known as one or the other.
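A minimal sketch of that host canonicalization, in Python purely for illustration (the domain `example.com` and the choice of the non-www form as canonical are assumptions; in practice your host applies this same rule as a 301 redirect):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_host(url, preferred="example.com"):
    """Rewrite www.example.com addresses to the bare domain.

    The non-www preference is an illustrative choice; either form
    works, as long as you pick one and 301 the other to it.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == "www." + preferred:
        host = preferred
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(canonical_host("http://www.example.com/about.html"))
# http://example.com/about.html
```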

Home again, home again, jiggety-jig

The next most frequent error has to do with default files. A site usually has a single file as the "index" or "default" file. That file can be addressed as the root or slash, or as the file itself, and if multiple instances of this default file exist, then each of them could be used as an address.
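For illustration (the domain and filenames here are hypothetical; the actual default filename depends on your server configuration), a single home page might answer at all of these addresses:

```text
http://example.com/
http://example.com/index.html
http://example.com/default.asp
```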

Depending on how many duplicates of this file you use, you split the value of your address into that many pieces. The splits will not be created equally, of course, but they can detrimentally affect the distribution of the resource's actual value.

Two-faced, too?

But what if you have multiple domains? Every duplicate address in the sample above is multiplied again, once per domain.
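With a second, aliased domain (hypothetically `example.net`), the list of duplicate addresses for that one home page doubles again:

```text
http://example.com/
http://example.com/index.html
http://www.example.com/
http://www.example.com/index.html
http://example.net/
http://example.net/index.html
http://www.example.net/
http://www.example.net/index.html
```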

If you're not using a forced (301) redirect, that is, if a different URL appears in the address bar of your browser for the same content under your aliased domains, you're diluting your content value.

This is compounded significantly if your webhost is duplicating or consuming your content through their own domains and addresses, either through a "stage" address or as a part of their content management system or shopping cart system (like Blogger and Citymax). This specific problem is addressed in greater detail within our Domain Selection Advice.

Query me this, Batman!

Another common, but less understood, problem is querystring data. This surfaces in several ways. One example of this damage is querystring parameter sequencing.
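To make the sequencing issue concrete (the parameter names here are invented), consider these two addresses:

```text
http://example.com/page.asp?category=5&item=12
http://example.com/page.asp?item=12&category=5
```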

On almost every site, these two addresses will each result in the exact same resource. And that's with only two parameters producing two different addresses. Add a third parameter and you get six addresses for the same resource.
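The arithmetic is just permutations: n parameters can be ordered n! ways. A quick sketch in Python (the parameter names are made up):

```python
from itertools import permutations

params = [("category", "5"), ("item", "12"), ("color", "red")]

# Every ordering of the same three parameters is a distinct URL string,
# yet the server returns the same resource for all of them.
urls = {
    "http://example.com/page.asp?" + "&".join(f"{k}={v}" for k, v in order)
    for order in permutations(params)
}
print(len(urls))  # 3! = 6 distinct addresses for one resource
```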

And if the address is also an index file, you get at least twice that number.

entia non sunt multiplicanda praeter necessitatem

entities should not be multiplied beyond necessity

Another problem with querystring data is when it is used to pass a source flag - like an affiliate ID. Since the actual resource is still the same thing, it becomes diluted across the various addresses. This, too, can be resolved by 301 redirecting requests that have a querystring value to the more appropriate URL. To maintain the tracking information, assign a cookie with the affiliate code (or source flag), if it cannot be drawn directly from logfiles later.
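One way to sketch that cleanup, in Python for illustration (the tracking parameter names `affid` and `source` are assumptions; the actual keys depend on your affiliate system): pull the tracking value out of the querystring so it can be stored in a cookie or log entry, then 301 to the sanitized URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"affid", "source"}  # hypothetical tracking keys

def strip_tracking(url):
    """Return (canonical_url, tracking), where tracking holds any
    affiliate/source values to preserve in a cookie or log entry
    before issuing the 301 redirect to canonical_url."""
    parts = urlsplit(url)
    kept, tracking = [], {}
    for key, value in parse_qsl(parts.query):
        if key in TRACKING_PARAMS:
            tracking[key] = value
        else:
            kept.append((key, value))
    canonical = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )
    return canonical, tracking

print(strip_tracking("http://example.com/article.asp?id=7&affid=1234"))
# ('http://example.com/article.asp?id=7', {'affid': '1234'})
```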

Where querystring data is really offensive, however, is when session keys or other state information is passed near-randomly to each visitor. While affiliate codes and misordered parameters can result in dozens or hundreds of address variations, session-stuffed addresses are each unique. All of them. None has any more value than any other, and all are essentially useless.

Storing the visitor's tracking information within the URL, for example the previous three pages they visited (as keys), is equally irresponsible. The purpose of tracking that state information is to record their activity; but if they bookmark or share the address (and/or a search engine picks up on it), none of the visitors after the first will ever have experienced the same click pattern, so the historical keys you collect are invalid in every way.

Will the real article please stand up?


Imagine yourself in the place of Detective Spooner in I, Robot, in hot pursuit of a robot you believe to hold the key to a murder. You follow it directly into a warehouse where you know it is located. Upon entry, you are welcomed by not one NS-5 robot, but five thousand and one. Only one of these holds the actual information you're after. The rest are in the way. They're preventing you from finding the original. Unlike Will Smith's character, however, you don't have a clear-cut rule for cleanly separating out the original: the one you're looking for isn't different enough to ascertain its identity.

Now imagine your potential visitor sitting at Google, looking through the 50-million-plus results for an article they're after. The dilution of your article amongst the copies means that your own preferred site is unlikely to appear in the first few (thousand) results pages. You may as well not appear in the results at all. And maybe you don't.

Eliminating unnecessary duplication enables you to appear distinctively for your content. Only you can prevent this duplication. And it's much easier if you don't stick your article into a copy machine, effectively recreating the warehouse scenario above.

Hammer Time

Are you using the right tools? And more importantly, are you using the right tools correctly? Are you storing history data in the address instead of the log files? Are you using thousands of affiliate addresses for your content, without 301'ing them to a sanitized - and thus distinct - URL?

There are ways to deal with each of these problems programmatically using any server-side language and sometimes through .htaccess files. But they must be dealt with.
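As a sketch of what dealing with it programmatically can look like (Python here purely for illustration; the preferred host and the default filenames are assumptions), a single normalization pass can fold the www alias, the default file, and parameter ordering into one canonical form:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

PREFERRED_HOST = "example.com"  # assumption: non-www is the canonical form
DEFAULT_FILES = ("index.html", "default.asp")  # hypothetical default filenames

def canonicalize(url):
    """Collapse known duplicate forms into one canonical address.
    A real server would compare the incoming request to this result
    and issue a 301 redirect whenever the two differ."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == "www." + PREFERRED_HOST:       # wwwith -> wwwithout
        host = PREFERRED_HOST
    path = parts.path or "/"
    for name in DEFAULT_FILES:                # /index.html -> /
        if path.endswith("/" + name):
            path = path[: -len(name)]
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable parameter order
    return urlunsplit((parts.scheme, host, path, query, ""))

print(canonicalize("http://www.example.com/index.html?b=2&a=1"))
# http://example.com/?a=1&b=2
```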

But why?!

What I have yet to cover is why. Why is this important? Why should you care? There can only be one "right" absolutely accurate source of any single piece of content. It can only exist as truly authoritative in one location. Everything else is a knock-off, a copy, a facsimile...a clone. It can only serve to draw attention away from the original. While duplication can be a useful method offline, it dilutes the authority of the original online. The result will always be misdirection of authority to less valuable resources, diminishing the potential for any single instance to be treated as the final word.

For best visibility you need to ensure that you only use one address for any specific content. Write as much content as you like, but only publish any piece - any article - under a single address. The web really is a popularity contest, at least as far as search engines are concerned.

Validation, Too

It sure is nice when Google validates an article I've written by writing their own which, coincidentally, not only duplicates my advice, but follows my own content structure and even my sample URL formats. Things that make you go "hmmm". I guess when you're doing something right, it's only natural that it'll be duplicated elsewhere.

The more copies of a piece of content that exist, the less likely any one of them is to be viewed. The value for all of the address variations is shared across the whole, instead of concentrating on the one truly authoritative address. Is this a penalty? A punishment? Not exactly, but it is most definitely an effect. While it may not "harm" your article to have 50 or 100 different popular addresses for it, it will absolutely affect the visibility of any one of the instances. And in some search engines (Google is one), when the same content appears to exist at more than one address, the entry is more likely to be dropped from the primary results and collected into the supplemental results. The supplemental results are "secondary", and are far less likely to appear during a search, even if the query is very appropriate for the content.


Shawn K. Hall