Change 100M URLs and Live to Tell

Updated: Dec 4, 2018



You think changing your website’s URL is quite simple. Use “Redirect” and you’re done, right? Actually no, it’s not that simple — especially if you use Ajax crawling and your website is a Single Page App. And you want Facebook to keep your “likes”. And Twitter. And all your social media footprint. And you have 100 Million URLs, not just one. How do you do that?

OVERVIEW

A long long time ago, back in 2008, Wix launched its first Flash editor and website viewer. HTML5 hadn’t been born yet, and we had to implement page navigation while keeping user experience as intuitive as possible. Taking into account then-current browser technology restrictions, we decided to go with Hashbangs in the URLs. Eight years later, with new features and advanced web technology, it was time to change the page navigation and migrate all Wix’s 85M users.

Old URL format

The goals in mind when choosing to go with Hashbang were:

  • Keeping Indexability to allow page bookmarking, and for SEO purposes

  • Giving users the freedom to choose their own URL

  • Maintaining a native navigation experience (showing URL change) without a page refresh

  • Following what was, at the time, the industry standard set by Twitter and recommended by Google

To achieve these, we implemented a Single Page Interface (SPI), a.k.a. the Anchor System, represented by a hash sign (#) in the URL. Also, since we used JavaScript and Ajax calls to render the page content, we had to use Ajax Crawling (which sends search engine bots to the “escaped fragment” version of the website), enabled by adding the exclamation mark (!) after the hash.

The page representation in the URL is composed of 2 parts, which allowed us to enjoy the best of both worlds (allegedly): a page-url title chosen by the user and a randomly generated unique Page ID, for internal purposes.

Combine all the above and you get our old page URL format:
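To make the two parts concrete, here is a minimal Python sketch that splits a hashbang URL into its page title and Page ID. The URL shape is inferred from the escaped-fragment example that appears later in this post, and the names `OldPageRef` and `parse_old_page_url` are hypothetical, not Wix code.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class OldPageRef(NamedTuple):
    page_title: str  # page-url title chosen by the user
    page_id: str     # randomly generated unique Page ID


def parse_old_page_url(url: str) -> Optional[OldPageRef]:
    """Return the title and Page ID from a hashbang URL, or None if absent."""
    fragment = urlsplit(url).fragment
    if not fragment.startswith("!"):
        return None  # no hashbang, so the URL points at the site's main page
    # Assumption: the Page ID is the last "/"-separated segment of the fragment.
    title, sep, page_id = fragment[1:].rpartition("/")
    if not sep or not title or not page_id:
        return None
    return OldPageRef(title, page_id)


ref = parse_old_page_url("http://tzufit.wix.com/change-my-url#!my-page-title/wzupj")
print(ref.page_title, ref.page_id)  # my-page-title wzupj
```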

New URL format motivation

Time went by, and in 2014 HTML5 was finalized as a standard, bringing with it a cool feature: the History API. It enables changing the URL in the browser without a page refresh, which made the anchor system redundant for us and let us eliminate some of its problems.

For example: by design, browsers never send the URL fragment (everything after the hash) to the server, making it impossible to pre-render page-specific content such as meta tags. The server therefore always rendered the main page's tags before the client side corrected them for the requested page, which caused flickering when loading sites. Once the # was removed, this was no longer a limitation. Another benefit is that we could now return 404 NOT FOUND for non-existing pages, because the page identifier appears in the URL path that the server receives.
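The effect is easy to see with Python's standard `urllib.parse`. This small sketch computes what ends up in the HTTP request line for an old-style and a new-style URL; `request_target` is a hypothetical helper name, and the old-style URL is inferred from the example used later in this post.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path and query a client puts in the HTTP request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target  # parts.fragment is never part of the request


old = "http://tzufit.wix.com/change-my-url#!my-page-title/wzupj"
new = "http://tzufit.wixsite.com/change-my-url/my-page-title"
print(request_target(old))  # /change-my-url
print(request_target(new))  # /change-my-url/my-page-title
```

With the old format the server only ever sees the site path, so it cannot tell which page was requested. With the new format the page title is part of the path.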

Moreover, users felt that the unique Page ID in the URL was difficult to memorise, making the URL useless for printed media and marketing campaigns. On top of that, in October 2015 Google announced the deprecation of Ajax Crawling. For us this was the final nail in the hashbang coffin, so we decided to reformat all our millions of URLs!

Considerations

To avoid driving search bots crazy, we had to roll all our goals into a single change: getting rid of the hashbang (#!) and using a slash instead, while removing the Page ID, for cleaner and simpler URLs.

Additionally, since 2008 the number of websites hosted on wix.com had grown enormously. This was our chance to apply a separation of concerns: migrating all websites to wixsite.com and leaving Wix services on the wix.com domain.

Combine all the above and you get our new page URL format:
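As an illustration, the mapping from the old format to the new one can be written as a pure function. This is a sketch based only on the example URLs in this post. `to_new_format` is a hypothetical name, and it does not handle duplicate page titles.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_SUFFIX = ".wix.com"      # domain that hosted user sites before the change
NEW_SUFFIX = ".wixsite.com"  # domain that hosts user sites after the change


def to_new_format(old_url: str) -> str:
    """Convert an old hashbang URL to the new slash-based format."""
    parts = urlsplit(old_url)
    host = parts.netloc
    if host.endswith(OLD_SUFFIX):
        host = host[: -len(OLD_SUFFIX)] + NEW_SUFFIX
    path = parts.path.rstrip("/")
    if parts.fragment.startswith("!"):
        # Keep the user-chosen title and drop the trailing Page ID.
        title, sep, _page_id = parts.fragment[1:].rpartition("/")
        if sep and title:
            path = f"{path}/{title}"
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


print(to_new_format("http://tzufit.wix.com/change-my-url#!my-page-title/wzupj"))
# http://tzufit.wixsite.com/change-my-url/my-page-title
```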

Much better now, isn’t it?

Implementation

There were several concerns we had to address in order to complete this project successfully.

Backwards compatibility

To keep supporting the old links (indexed, bookmarked, etc.) we had to implement various solutions both on the client side, and on the server side. We also had to keep in mind search bots, and social media footprints (likes and shares in Facebook, tweets on Twitter, etc.).

  • Client Side - We wanted to show the new format while still rendering the old links. Because hash fragments are not sent to the server, we could not use the common HTTP 301 redirect. Instead, we used the HTML5 History API method pushState to change the URL shown in the browser.

  • Server backward compatibility - When a site visitor requests a site on wix.com, the server redirects the request to wixsite.com using an HTTP 301 redirect.

  • SEO backward compatibility - When a bot requests the escaped fragment in the old URL format, it needs to be redirected to the new URL containing both new domain and new page format. For example, a request to: http://tzufit.wix.com/change-my-url?_escaped_fragment_=my-page-title%2Fwzupj would result in a 301 redirect response to http://tzufit.wixsite.com/change-my-url/my-page-title

  • Resolve duplicate URLs in the same site - Since the page-uri in the new format is set by the user, we had to ensure that a new page cannot have the same page-uri as an existing page, and to have deterministic logic, shared by client and server, that resolves a duplicate page-uri between two existing pages.

  • Preserve social media footprints - Social media platforms, like Facebook and Twitter, usually use the URL as the key for counting likes, shares, tweets, etc. In order to preserve our users’ historical, much-invested social media footprints, we keep rendering the old URL format for social media HTML components.

  • The Facebook bot does not follow redirects. This means that in order for Facebook to render the correct SEO content, a request from the Facebook bot for an old-format URL must not be answered with a 301 redirect. Instead we respond with 200 OK and render the content as if we had never changed our URLs.
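The server-side rules in the list above can be sketched as one decision function. This is an illustrative Python sketch, not the actual Wix implementation. It assumes the Facebook bot is detected by the `facebookexternalhit` marker in its User-Agent header, it drops any other query parameters for brevity, and `respond` is a hypothetical name.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit

OLD_SUFFIX = ".wix.com"
NEW_SUFFIX = ".wixsite.com"
FACEBOOK_UA_MARKER = "facebookexternalhit"  # assumption: detect the bot by User-Agent


def respond(url: str, user_agent: str) -> Tuple[int, Optional[str]]:
    """Return (status, Location header) for a request to a user site."""
    parts = urlsplit(url)
    if not parts.netloc.endswith(OLD_SUFFIX):
        return 200, None  # already on the new domain, nothing to redirect
    if FACEBOOK_UA_MARKER in user_agent.lower():
        return 200, None  # serve the content as if the URL had never changed
    host = parts.netloc[: -len(OLD_SUFFIX)] + NEW_SUFFIX
    path = parts.path.rstrip("/")
    # parse_qs decodes %2F, so the value looks like "my-page-title/wzupj".
    escaped = parse_qs(parts.query).get("_escaped_fragment_", [""])[0]
    page_title = escaped.rpartition("/")[0]  # drop the trailing Page ID
    if page_title:
        path = f"{path}/{page_title}"
    return 301, f"{parts.scheme}://{host}{path}"


bot_url = "http://tzufit.wix.com/change-my-url?_escaped_fragment_=my-page-title%2Fwzupj"
print(respond(bot_url, "Googlebot/2.1"))
# (301, 'http://tzufit.wixsite.com/change-my-url/my-page-title')
print(respond(bot_url, "facebookexternalhit/1.1"))
# (200, None)
```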

Have a single referral point for a migration status of a website

Since the URL migration process is gradual, we had to indicate for each website whether it had been migrated or not. This migration status is stored in a MySQL DB table. Every component in Wix’s architecture that has to know which URL format to generate queries this table (for example: generating links to a page, pushState of a URL, parsing a URL from HTTP requests, etc.).
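A minimal sketch of such a single point of reference follows. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL so that it runs anywhere. The table name, column names and function names are all hypothetical.

```python
import sqlite3


def open_status_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the migration status table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE url_migration_status ("
        " site_id TEXT PRIMARY KEY,"
        " migrated INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def is_migrated(conn: sqlite3.Connection, site_id: str) -> bool:
    """Unknown sites are treated as not migrated."""
    row = conn.execute(
        "SELECT migrated FROM url_migration_status WHERE site_id = ?", (site_id,)
    ).fetchone()
    return bool(row and row[0])


def page_link(conn, site_id, user, site, page_title, page_id) -> str:
    """Generate a page link in whichever format the site currently uses."""
    if is_migrated(conn, site_id):
        return f"http://{user}.wixsite.com/{site}/{page_title}"
    return f"http://{user}.wix.com/{site}#!{page_title}/{page_id}"


conn = open_status_store()
conn.execute("INSERT INTO url_migration_status (site_id, migrated) VALUES ('site-a', 1)")
print(page_link(conn, "site-a", "tzufit", "change-my-url", "my-page-title", "wzupj"))
# http://tzufit.wixsite.com/change-my-url/my-page-title
```

Because every caller goes through the same lookup, flipping one row is enough to move a site between formats, which also makes a rollback cheap.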

Monitoring

Such a dramatic change needs close monitoring, in order to identify problems as soon as possible and roll back if needed. To achieve that, we created a list with an example site for every use case we changed. Using a simple bash script, we “curl-ed” each site and compared the response to the expected result.
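The original check was a bash script around curl. The same idea is sketched here in Python to stay consistent with the other examples in this post. All names are hypothetical, and the usage at the bottom runs against a stubbed response rather than a live site.

```python
import http.client
from typing import Callable, Iterable, List, NamedTuple, Optional, Tuple
from urllib.parse import urlsplit


class Case(NamedTuple):
    url: str
    expected_status: int
    expected_location: Optional[str] = None  # only checked when set


Fetch = Callable[[str], Tuple[int, Optional[str]]]


def fetch_once(url: str) -> Tuple[int, Optional[str]]:
    """Issue one GET without following redirects (http.client never follows them)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        target = (parts.path or "/") + (f"?{parts.query}" if parts.query else "")
        conn.request("GET", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def failures(cases: Iterable[Case], fetch: Fetch = fetch_once) -> List[str]:
    """Compare each response to its expectation and describe every mismatch."""
    problems = []
    for case in cases:
        status, location = fetch(case.url)
        if status != case.expected_status:
            problems.append(f"{case.url}: expected {case.expected_status}, got {status}")
        elif case.expected_location and location != case.expected_location:
            problems.append(f"{case.url}: expected Location {case.expected_location}, got {location}")
    return problems


# Stubbed response for illustration only; pass no `fetch` to hit real sites.
stub = {"http://tzufit.wix.com/change-my-url": (301, "http://tzufit.wixsite.com/change-my-url")}
cases = [Case("http://tzufit.wix.com/change-my-url", 301, "http://tzufit.wixsite.com/change-my-url")]
print(failures(cases, fetch=lambda url: stub[url]))  # []
```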


Once we gained confidence in the implemented solution - we could use Pingdom (www.pingdom.com) as a website monitoring tool to achieve the same result - automatically. Whenever a site’s response is not as expected (whether this is 301 REDIRECT, 404 NOT FOUND or 200 OK) - Pingdom sends an alert. Thus the feedback loop when a bug is introduced is very short.

Rollout

OMG! How do we start rolling out such a huge infrastructural change without breaking anything? We have 85M users, some with more than a few websites - where do we start?

Naturally, we did it gradually!

The first step was testing a collection of test sites with as many features as possible for two weeks, making sure they were indexed properly by different search engines. After we gained confidence in the new URL format, we asked a focus group of 50 super-users (users with 10 or more sites) to migrate their live sites to the new format. This of course surfaced a few bugs we had to fix before moving on to the next phase: migrating 10,000 free sites.

When that went smoothly, we started a whole URL migration campaign: we emailed all our premium users, presenting them with our shiny new URLs and marketing our new URL structure. By that time, we had also enforced the new URL format for any newly created site. Finally, by the end of August 2016, we migrated all the sites :)

Et voilà - we have changed 100M URLs and lived to tell the tale!

SUMMARY

There are a few key takeaways from this URL migration project:

  1. A journey of a thousand miles begins with a single step - Approaching such a huge project might seem intimidating at the beginning, maybe even impossible. But if you break it down into small steps, however many that may be, it will suddenly look doable.

  2. Research research research - Consult every colleague who might have something to contribute to your knowledge. Know your domain, or involve people who know the domain - whether those are client developers, server developers or SEO professionals.

  3. Gradual rollout - Expose new features gradually and have automatic tools that verify the correctness of the new outputs.

I am just one of many people involved in this project: Moran Frumer, Omer Ganim, Laurent Gaertner, Ofer Judovits, Ohad Laufer, Shai Lachmanovich, Boris Rasin, Adam Fainaru. This wouldn’t have happened without each and every one of them.



This post was written by Tzufit Barzilay

