Thomas Dowling helps distill the meaning from reams of web use statistics.
By Thomas Dowling, April 15, 2001
Libraries tally reference questions, circulation figures, and even the number of people who come in the front door. Whatever we do, we count how often we do it. Our understanding of how our services are used, and perhaps even our funding for those services, may depend on how well we quantify their use, so it's no surprise that we want to quantify the use of our web sites as well.
The raw data for measuring this use are the log file, or files, kept by our web servers. Every request that a browser sends to a web server--for an HTML document, an inline image, a style sheet, a form submission, anything--is typically recorded by the server. Since the NCSA web server appeared in 1993, most servers have recorded the same information in the same format. This 'Common Logfile Format' is a plain text file with one line for each request, each showing seven pieces of information (see the print edition of netconnect for Spring 2001 for an explanation of a sample line from a combined logfile). The NCSA server kept two additional logs: one recording the names that browsers send to identify themselves, and one recording the URLs of referring pages that link to the URL being requested. Many servers now use a 'Combined Logfile Format' that appends both these pieces of information to the data in the Common format.
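To make the two formats concrete, here is a minimal sketch in Python of how one log line might be split into its fields. The field names (`host`, `ident`, `user`, and so on) and the sample line are assumptions made up for illustration; they are not taken from any particular server's documentation or from a real log.

```python
import re

# The seven Common fields, optionally followed by the two extra
# Combined fields (referring page and browser name).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical sample line, invented for illustration.
SAMPLE = (
    '192.0.2.10 - - [15/Apr/2001:10:32:07 -0400] '
    '"GET /reference/index.html HTTP/1.0" 200 5120 '
    '"http://www.example.org/links.html" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"'
)

fields = parse_line(SAMPLE)
print(fields["host"], fields["status"], fields["agent"])
```

A line in the plain Common format parses with the same pattern; its `referrer` and `agent` entries simply come back as `None`.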
This raw data can be imported into any good spreadsheet or database program; Excel is a reasonably good tool for studying log files. However, many web managers prefer to use applications specifically designed to analyze web logs. These analyzers usually create reports such as total numbers of hits, the most frequently requested documents, the total number of unique addresses making requests, or peak usage. Some may report the number of 'user sessions,' where a session is defined as a series of hits from a specific address with no gaps of more than, say, half an hour between them.
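The session definition above can be sketched in a few lines of Python. This is only one plausible reading of the rule, assuming the log has already been reduced to (address, timestamp) pairs; the function name `count_sessions` and the sample hits are hypothetical, and real analyzers may define sessions differently.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # the "say, half an hour" threshold

def count_sessions(hits, gap=SESSION_GAP):
    """Count sessions in an iterable of (address, timestamp) pairs."""
    by_address = defaultdict(list)
    for address, when in hits:
        by_address[address].append(when)

    sessions = 0
    for times in by_address.values():
        times.sort()
        sessions += 1  # the first hit from an address always opens a session
        for earlier, later in zip(times, times[1:]):
            if later - earlier > gap:
                sessions += 1  # a long silence starts a new session
    return sessions

# Hypothetical hits, invented for illustration.
hits = [
    ("192.0.2.10", datetime(2001, 4, 15, 10, 0)),
    ("192.0.2.10", datetime(2001, 4, 15, 10, 20)),
    ("192.0.2.10", datetime(2001, 4, 15, 11, 30)),
    ("192.0.2.77", datetime(2001, 4, 15, 10, 5)),
]
print(count_sessions(hits))  # 3: two sessions for the first address, one for the second
```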
Log analyzers are very good at what they do, but they leave too many web managers unaware of the shortcomings in their reports. These aren't problems with the analysis tools but with the data they're given to analyze. For better or for worse, almost everything in your log file is fuzzier than it appears, starting with what a hit really is and where it really comes from.
What web logs can't tell you
View counts are the most useful and perhaps the most deceptive piece of information. A hit occurs when a web server sends a document; a view occurs when a human being sees one. What we really want to measure is the number of times our documents are viewed, while what we really can measure is the number of times they're sent out. So we maintain a polite fiction and call the two numbers the same. A page that was hit 100 times must have been viewed 100 times, right? Wrong.
When you put a new page on your web site, and I visit it for the first time, that's definitely one hit and one view. After that, though, all bets are off. Depending on my browser's configuration, it might store a copy of that document for a day, or a month, or until too many other stored documents take over its space. These local copies are called the browser cache, and if a browser sees a recent copy of a page in its cache, it will use that instead of getting it from your server again. To compound this problem, the browser might connect to the Internet through a departmental, corporate, or campus firewall that maintains its own cache for all its users to share, cutting down on the amount of traffic going through its bottleneck. A growing number of Internet service providers (ISPs), most notably America Online, also maintain shared caches for their users.
Caches usually improve performance for users by decreasing the number of comparatively slow connections to your site and increasing the number of faster connections to their own hard drives or local network. Caches are good things, but they mean that the initial one hit on your server might translate into any number of views, by any number of people. In other words, you'll never know the exact number of times someone views your pages.
User counts are another basic yet misunderstood statistic. You'll never know the exact number of people looking at your pages. Proxy servers, including most firewalls, again play a role. A user with a direct connection to the Internet will register in your log files under his or her computer's own name. Users coming through a proxy, however, will all show the name and IP address of the proxy instead of their workstations'.
ISPs have always connected users to the Internet with a pool of IP addresses that change from one dial-up session to another. Some, again including AOL, have begun maximizing the use of their connections by reassigning them on the fly. That is, a dial-up user requests a page on your server from one IP address; while he or she is reading it, the ISP gives that address to another customer; when the first user requests another page from your server, he or she gets a different IP address, all without disconnecting. This means that any measure of unique addresses in your logs will undercount users coming through proxies but overcount users whose ISPs change their addresses on the fly. Both factors also affect any method for counting user sessions.
What web logs can tell you
Relative use over time can be a useful measure. If the absolute numbers of views and users on a site are affected by an unknowable fudge factor, we can at least assume that this uncertainty remains relatively stable and compare measurements over time. If the logfile says my site got 10,000 hits in March 2000 and 20,000 hits in March 2001, I know that neither figure represents the real number of times my pages were viewed. But I can say with some confidence that the use of my site has doubled in a year. I can also make comparisons over time between different sections of my site. If most of my web services are getting twice the number of hits as a year ago, but my list of Internet reference tools is seeing about steady use, I can assume that a growing number of my users are either not finding that list useful or not finding it at all.
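The comparison described here is simple arithmetic, and a short Python sketch shows one way to automate it. The section names, the hit counts for the reference-tools list, and the 0.75 cutoff for calling a section "lagging" are all assumptions chosen for illustration; only the 10,000 and 20,000 figures come from the example above.

```python
LAG_THRESHOLD = 0.75  # assumed cutoff: flag sections growing under 75% of the site's rate

def growth_ratio(hits_before, hits_after):
    """Return hits_after / hits_before, or None when there is no baseline."""
    if hits_before == 0:
        return None
    return hits_after / hits_before

# Hypothetical monthly hit counts per section: (March 2000, March 2001).
sections = {
    "whole site": (10000, 20000),
    "/reference-tools/": (1200, 1250),
}

ratios = {name: growth_ratio(old, new) for name, (old, new) in sections.items()}
site_ratio = ratios["whole site"]

for name, ratio in ratios.items():
    if ratio is None:
        continue  # no baseline month to compare against
    note = "  (lagging the site)" if ratio < LAG_THRESHOLD * site_ratio else ""
    print(f"{name}: {ratio:.2f}x{note}")
```

The ratios, not the raw counts, are the meaningful output: they stay comparable as long as the caching and proxy effects behave roughly the same way from one period to the next.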
One of the least used items recorded by both the Common and Combined Logfile Formats is the HTTP status of the browser's request. This three-digit number is sufficiently cryptic that many people never learn what it means. Simply put, all servers that handle HTTP requests (web servers, but also proxies and firewalls, for example) categorize how they handle each request and record that category with a number. Most of the status numbers in your log should be 200--the 'OK' status, meaning that the document was found at the requested location and was sent normally. Some lines will show a status of 301 or 302--permanent or temporary redirections to a new location. If you have documents restricted by password or the user's address, you may see some 403 lines--the server's configuration forbids sending the document to that user.
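A quick tally of status codes is an easy first look at a log. The Python sketch below assumes each line follows the Common or Combined layout, in which the status is the first item after the quoted request; the function name `tally_statuses` and the sample lines are hypothetical.

```python
from collections import Counter

# Short labels for the statuses discussed in the text.
STATUS_MEANINGS = {
    "200": "OK",
    "301": "permanent redirection",
    "302": "temporary redirection",
    "403": "forbidden",
    "404": "not found",
}

def tally_statuses(lines):
    """Count how often each HTTP status appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        parts = line.split('"')  # parts[1] is the request, parts[2] starts with the status
        if len(parts) < 3:
            continue  # not a recognizable log line
        after_request = parts[2].split()
        if after_request and after_request[0].isdigit():
            counts[after_request[0]] += 1
    return counts

# Hypothetical log lines, invented for illustration.
lines = [
    '192.0.2.10 - - [15/Apr/2001:10:32:07 -0400] "GET /index.html HTTP/1.0" 200 5120',
    '192.0.2.10 - - [15/Apr/2001:10:32:09 -0400] "GET /old.html HTTP/1.0" 301 312',
    '192.0.2.77 - - [15/Apr/2001:10:40:00 -0400] "GET /missing.html HTTP/1.0" 404 280',
    '192.0.2.77 - - [15/Apr/2001:10:41:00 -0400] "GET /index.html HTTP/1.0" 200 5120',
]

counts = tally_statuses(lines)
for status, n in sorted(counts.items()):
    print(status, STATUS_MEANINGS.get(status, "other"), n)
```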
Inevitably, however, you will find lines in your log with a status of '404: Not Found.' In terms of HTTP, this is a request for a URL that does not exist on your server. In terms of user service, this represents a possibly confused, possibly frustrated user, someone who is very probably about to leave your site.
You can use the 404 lines in your log file in several ways. You can determine if there are frequently requested URLs that are not found on your server. These may be pages that were moved or deleted. If your server's operating system is case sensitive and you capitalize some file names, you may find 404s for the lower-case version of those files. In either case, you can use this information to leave 'This page has moved' pointers at the problem URL. An even better solution is to configure your server to send a permanent redirection response for those URLs. This sends browsers to the new URL automatically and has the added benefit of signaling web crawlers to drop the old URL and index the new one.
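As a rough illustration of this kind of 404 report, the Python sketch below ranks missing URLs by how often they were requested and flags those that differ from an existing file only by capitalization. It assumes the log has been reduced to (path, status) pairs and that you can supply a list of paths that really exist; the function name `missing_urls` and all sample paths are hypothetical.

```python
from collections import Counter

def missing_urls(requests, existing_paths):
    """Rank 404 paths by frequency and suggest a target for case-only mismatches."""
    not_found = Counter(path for path, status in requests if status == 404)
    by_lowercase = {path.lower(): path for path in existing_paths}
    report = []
    for path, count in not_found.most_common():
        # None means no existing file matches even when case is ignored.
        suggestion = by_lowercase.get(path.lower())
        report.append((path, count, suggestion))
    return report

# Hypothetical data, invented for illustration.
existing_paths = ["/index.html", "/Guides/Hours.html"]
requests = [
    ("/index.html", 200),
    ("/guides/hours.html", 404),
    ("/old/news.html", 404),
    ("/guides/hours.html", 404),
]

for path, count, suggestion in missing_urls(requests, existing_paths):
    print(path, count, suggestion or "(moved or deleted?)")
```

Paths with a suggestion are candidates for a permanent redirection; those without one need a human to decide where, if anywhere, the page went.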
If your logs use the Combined format, you will also see referring pages for your unfound URLs. These may turn up pages on your own web site whose links need to be updated. For referring pages on other sites, you can visit the page to see if it lists a contact address for corrections. If all else fails, you can send a message to the 'postmaster' address at the other site and alert them to the bad link.
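Sorting the referring pages into "ours" and "theirs" can be sketched as follows in Python. The sketch assumes the log has been reduced to (path, status, referrer) triples and that a missing referrer is logged as "-" or an empty string; the function name `broken_link_sources`, the host names, and the sample records are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def broken_link_sources(records, own_host):
    """Map each missing path to the internal and external pages that link to it."""
    sources = defaultdict(lambda: {"internal": set(), "external": set()})
    for path, status, referrer in records:
        if status != 404 or referrer in ("", "-"):
            continue  # only 404s that arrived by following a link
        host = urlsplit(referrer).hostname
        kind = "internal" if host == own_host else "external"
        sources[path][kind].add(referrer)
    return sources

# Hypothetical records, invented for illustration.
records = [
    ("/old/news.html", 404, "http://www.example.org/about.html"),
    ("/old/news.html", 404, "http://links.example.net/libraries.html"),
    ("/old/news.html", 404, "-"),
    ("/index.html", 200, "http://links.example.net/libraries.html"),
]

sources = broken_link_sources(records, own_host="www.example.org")
for path, pages in sources.items():
    print(path)
    print("  fix on our site:", sorted(pages["internal"]))
    print("  report to others:", sorted(pages["external"]))
```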
Browser trends can also be tracked by a bit of log analysis. How many aspects of your web site's development hinge on the question, 'Do our users have browsers that support Feature X?' From use of Cascading Style Sheets to PNG-format images to specific JavaScript functions, the percentage of users who will benefit from these features is tied to the percentage who have upgraded to browsers that support them. While the responsibilities of good web management include support for older browsers (and newer browsers with some functions disabled), you can make informed choices about your site's design features if your log files tell you the percentage of users with, for example, Internet Explorer 5, Netscape 4, IE 4, and so on. Since this is recorded in the Combined format, you can get a general sense of these percentages not only for your site as a whole but also for specific pages or directories.
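A browser breakdown might be approximated along the following lines in Python. The classification rules here are a deliberately crude heuristic and an assumption of this sketch: browser names in logs are free-form text, many browsers imitate one another's names, and a serious report would need a much longer list of rules. The function names and the sample strings are hypothetical.

```python
import re
from collections import Counter

def classify_agent(agent):
    """Very rough browser label from a browser-name string (heuristic only)."""
    msie = re.search(r"MSIE (\d+)", agent)
    if msie:
        return f"IE {msie.group(1)}"
    netscape = re.match(r"Mozilla/(\d+)", agent)
    if netscape and "compatible" not in agent:
        return f"Netscape {netscape.group(1)}"
    return "other"

def browser_shares(agents):
    """Return each browser label's percentage of the given hits."""
    counts = Counter(classify_agent(agent) for agent in agents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}

# Hypothetical browser-name strings, invented for illustration.
agents = [
    "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)",
    "Mozilla/4.76 [en] (Win98; U)",
    "Lynx/2.8.3rel.1 libwww-FM/2.14",
]

for label, share in sorted(browser_shares(agents).items()):
    print(f"{label}: {share:.0f}%")
```

Passing in only the hits for one page or directory gives the per-section breakdown. Keep in mind that these are shares of hits, not of people, so the caching and proxy caveats discussed earlier still apply.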
Share your logs
Knowing what is and isn't in your log files doesn't help if you don't have access to them. Too many library web sites are subject to systems administrators who deny access to the raw logs and make available only certain canned reports. This may be because other sites on the same server consider their logs confidential, or it may just be a failure to understand how valuable log access is to site developers. In either case, you're entitled to see the raw data on how your own site is used, and you should make sure your administrators know you want it. Articulate exactly what you want to see and what information you hope to get from it. Contrary to widespread belief, many systems administrators are reasonable people and will help you if they just know what you want.
Author Information
Thomas Dowling (tdowling@ohiolink.edu) is a board member of LITA and a member of the Web4Lib editorial board. He currently works as Assistant Director for Library Services and a web developer at OhioLINK.







