Wettone.com

How to use mod_rewrite to create clean URLs

Learn how to remove cruft from your URLs and give yourself greater flexibility while making things much easier for visitors and search engines.

Cruft and permalinks

How many times have you seen URLs like this?

http://example.com/weblog/index.php?y=2000&m=11&d=23&id=5678
http://example.com/weblog/archive/00005678.html

Horrible, aren't they? They look ugly, contain meaningless information and are hard to decipher. Wouldn't it be nice if you could have URLs like this?

http://example.com/weblog/2000/11/23/example

That is surely much more meaningful. The hierarchy of the directory structure makes it easy to infer what you're likely to see on the page: a weblog entry from 23rd November 2000 called ‘Example’.

Another problem is link rot. What happens when your URLs are several years old? Will they still work? What if you ditch PHP and rewrite your site in Python, for example? All those URLs ending in .php will suddenly break, and visitors who've found your site in a search engine won't get what they bargained for.

Yet another problem is that search engines often avoid URLs which contain the ? character, because they are likely to be just one node in a large database which could leave the spider going round in circles for a long time. If your URLs take that form then this may be why you haven't found yourself in Google.

Using mod_rewrite

This is how I've solved the problem. The Apache web server has a module named mod_rewrite, which allows you to create rules for rewriting requested URLs on the fly. This means you can announce simple, clean URLs that your web server automatically converts into an underlying format that only you need to know. Everyone looking at your site will just get the tidy version.

You'll need to edit a file called .htaccess at the top level of your web folder. This is where you can specify settings that control how Apache serves the items in this folder and below.

First things first. Let's turn on mod_rewrite:

RewriteEngine On

This simply lets Apache know that you are going to want it to rewrite some URLs. So let's do just that, and create our first rule.

RewriteRule ^([a-z]+)/([a-z\-]+)$ /$1/$2.php [L]

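This is a bit more complicated. It specifies a regular expression and then a substitution. Apache tests the path of each incoming request against the expression and, when it matches, rewrites the request using the substitution. In an .htaccess file the path is tested without its leading slash, which is why the expression starts with ^([a-z]+) rather than ^/. The [L] flag tells Apache that if this rule matches, it should treat it as the last one and stop processing the remaining rules.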

The rule matches any path made up of lower case letters, followed by a /, then more lower case letters and/or hyphens. Anything wrapped in parentheses () is captured and can be referred to in the substitution as $1 and $2, i.e. the first and second captured groups. The substitution puts the two parts back together and appends .php to the end. So if someone visits these URLs:

http://example.com/weblog/archive
http://example.com/etc/colophon

Apache will handle them internally as if the requests had been for the following, while the visitor's address bar still shows the clean version:

http://example.com/weblog/archive.php
http://example.com/etc/colophon.php
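
The rule above rewrites every matching path, whether or not a script of that name exists. A slightly more defensive variant, sketched here on the assumption that your scripts sit directly under the document root, adds a RewriteCond so the rewrite only happens when the target file is really there:

```apache
RewriteEngine On

# Rewrite only when the target script exists on disk.
# $1 and $2 refer to the groups captured by the RewriteRule pattern below.
# Assumption: the .php files live directly under the document root.
RewriteCond %{DOCUMENT_ROOT}/$1/$2.php -f
RewriteRule ^([a-z]+)/([a-z\-]+)$ /$1/$2.php [L]
```

With this in place, a request for a path with no matching script is left alone and falls through to Apache's normal handling rather than being pointed at a file that does not exist.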

It's as simple as that. Nobody needs to know that we are using PHP to run the site. So if everything changes to Python at some point, we could change the rule to:

RewriteRule ^([a-z]+)/([a-z\-]+)$ /$1/$2.py [L]

replacing .php with .py, so the URLs would then be converted to:

http://example.com/weblog/archive.py
http://example.com/etc/colophon.py

It's limited only by your imagination. Here are some other rules I use to control access to my weblog:

RewriteRule ^weblog/([0-9]{4})/([0-9]{2})/([0-9]{2})/([a-z0-9\-]+)$ /weblog/index.php?y=$1&m=$2&d=$3&n=$4 [L]
RewriteRule ^weblog/([0-9]{4})/([0-9]{2})/([0-9]{2})$ /weblog/index.php?y=$1&m=$2&d=$3 [L]
RewriteRule ^weblog/([0-9]{4})/([0-9]{2})$ /weblog/index.php?y=$1&m=$2 [L]
RewriteRule ^weblog/([0-9]{4})$ /weblog/index.php?y=$1 [L]

These mean that URLs like this:

http://wettone.com/weblog/2000/01/01/example
http://wettone.com/weblog/2000/01/01
http://wettone.com/weblog/2000/01
http://wettone.com/weblog/2000

are automatically converted to:

http://wettone.com/weblog/index.php?y=2000&m=01&d=01&n=example
http://wettone.com/weblog/index.php?y=2000&m=01&d=01
http://wettone.com/weblog/index.php?y=2000&m=01
http://wettone.com/weblog/index.php?y=2000
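
To make the mapping explicit, here is the first of those rules again with each captured group annotated. The QSA flag is an optional addition, not part of the rules above: without it, any query string on the clean URL is discarded because the substitution supplies its own.

```apache
# Request path as seen in .htaccess (no leading slash):
#   weblog/2000/01/01/example
#   $1 = 2000 (year), $2 = 01 (month), $3 = 01 (day), $4 = example (entry name)
#
# QSA (optional) appends any query string from the original request,
# so /weblog/2000/01/01/example?print=1 would also pass print=1 to index.php.
# The print parameter is purely illustrative.
RewriteRule ^weblog/([0-9]{4})/([0-9]{2})/([0-9]{2})/([a-z0-9\-]+)$ /weblog/index.php?y=$1&m=$2&d=$3&n=$4 [L,QSA]
```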

This means I could drastically alter the way my weblog works — even making it load a static HTML page for each day, month or year — and nobody would be any the wiser. All the messy URLs stay behind the scenes and the visitors only see the tidy URLs.
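
As a rough illustration of that flexibility, the monthly rule could be pointed at pre-generated pages instead of the script. The /weblog/static/ directory and the file naming here are hypothetical; the clean URL that visitors see stays exactly the same:

```apache
# Hypothetical layout: one static page per month,
# e.g. /weblog/static/2000-01.html
RewriteRule ^weblog/([0-9]{4})/([0-9]{2})$ /weblog/static/$1-$2.html [L]
```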

No URL left behind

What if you have a web site which is already indexed in search engines? You can't just change your URLs overnight because the old ones won't exist any more. Or will they? There is a way round this with the RedirectMatch directive, which is provided by Apache's mod_alias module.

Older versions of my site had weblog URLs like these:

http://wettone.com/weblog/2000_01.html
http://wettone.com/weblog/index.php/2000/01

They have cropped up in search engines, so I want to make sure that visitors to my site get what they expected when they visit those URLs. I use these rules:

RedirectMatch permanent ^/weblog/([0-9]{4})_([0-9]{2})\.html$ /weblog/$1/$2
RedirectMatch permanent ^/weblog/index\.php/([0-9]{4})/([0-9]{2})$ /weblog/$1/$2

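These tell the browser to redirect to the new-style URL. Unlike the rewrite rules, the expressions here are matched against the full path, including the leading slash. The permanent keyword makes Apache send an HTTP 301 (Moved Permanently) status, which lets browsers and search engines know that the change is permanent and that they don't need to try the old URL next time.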

You can use this technique to make sure that nobody visiting your site sees a 404 Not Found error. Over time, search engines will revisit your site to find these redirections, and their indexes will be updated accordingly.
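
If you would rather keep all your rules in mod_rewrite, the same redirects can be sketched with RewriteRule. This assumes the rules live in the same top-level .htaccess as the earlier ones and are placed before the internal rewrite rules; the R=301 flag is what turns a rewrite into an external, permanent redirect:

```apache
RewriteEngine On

# Old-style monthly archive URLs, redirected to the clean form.
# Note: no leading slash in the pattern, as these run in .htaccess.
RewriteRule ^weblog/([0-9]{4})_([0-9]{2})\.html$ /weblog/$1/$2 [R=301,L]
RewriteRule ^weblog/index\.php/([0-9]{4})/([0-9]{2})$ /weblog/$1/$2 [R=301,L]
```

Either approach should give the same result for visitors, so it is mostly a matter of which module you prefer to maintain your rules in.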

For more information

More in-depth documentation on these Apache modules is available on the Apache HTTP Server site.

mod_rewrite

mod_alias

URL Rewriting Guide

I hope you found this article useful. There are more articles like this in the code section of my site.