Web statistics can seem very alien to those not familiar with them. They are a very important aspect of a web presence because they let us know if our website is being accessed. Here is a little information about what some of the terms mean.
Welcome to the wonderful world of web server usage analysis! This information is intended for the users of the Webalizer, but can be applied to most any analysis package out there. If you are new to web server analysis, or just want to find out how things work, then this guide is for you.
OK, so you have a web site and you want to know if anyone is looking at it and, if so, just what they are looking at and how many times. Almost every web server (including those here at CharlesWorks) keeps a log of when websites are accessed, so you can just go look and see. The server logs are plain ASCII text files, so any text editor or viewer will work just fine to view them (note, though, that they are not stored where the public can readily get at them). Each time someone using a web browser asks for one of your web pages, or for any other component of your website (each is identified by a URL, or Uniform Resource Locator), the web server writes a line to the end of the traffic log representing that request. Unfortunately, the raw logs are rather cryptic for most of us to read. While you might be able to tell whether anybody was looking at your web site, any other information takes some sort of processing to work out. A typical log entry might look something like the following:
192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117
This represents a request from a computer with the IP address 192.168.45.13 for the URL /mypage.html on the web server. It also gives the time and date the request was made, the type of request, the result code for that request and how many bytes were sent to the remote browser. There will be a line similar to this one for each and every request made to the web server over the period covered by the log. A ‘Hit’ is another way to say ‘request made to the server’, so as you may have noticed, each line in the log represents a ‘Hit’. If you want to know how many Hits your server received, just count the number of lines in the log. And since each log line represents a request for a specific URL from a specific IP address, you can easily figure out how many hits you got for each of your web pages, or how many hits you received from a particular IP address, by just counting the lines in the log that contain them. Yes, it really is that simple. And while you could do this manually with a text editor or other simple text processing tools, it is much more practical to use a program specifically designed to analyze the logs for you, such as the Webalizer. Such programs take the work out of it for you, provide totals for many other aspects of your server, and let you visualize the data in a way that is not possible by just looking at the raw logs.
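To make the fields described above a little more concrete, here is a minimal Python sketch that splits one log line into its parts. It assumes the line uses plain ASCII hyphens and straight double quotes, as real server logs do. The names CLF_PATTERN and parse_line are made up for this illustration and are not part of the Webalizer.

```python
import re

# One log line: host, identity, user, [timestamp], "request", status, bytes.
# CLF_PATTERN and parse_line are hypothetical names used only in this sketch.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The request field looks like 'GET /mypage.html HTTP/1.1'.
    parts = fields["request"].split()
    fields["method"] = parts[0] if len(parts) > 0 else ""
    fields["url"] = parts[1] if len(parts) > 1 else ""
    # A '-' in the size column means no bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    fields["status"] = int(fields["status"])
    return fields

sample = '192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117'
entry = parse_line(sample)
print(entry["host"], entry["url"], entry["status"], entry["size"])
# prints: 192.168.45.13 /mypage.html 200 117
```

A real analysis program does far more than this, but the idea is the same: every line is one hit, and every hit carries these few fields.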
How does it all work? Well, to understand what you can analyze, you really should know what information is provided by your web server and how it gets there. At the very least, you should know how HTTP (HyperText Transfer Protocol) works, and its strengths and weaknesses. At its simplest, a web server just sits there listening on the network for a web browser to make a request. Once a request is received, the server processes it and then sends something back to the requesting browser (and, as explained above, the request is written to a log file). These requests are typically for some URL, although there are other types of information a browser can request, such as server type, HTTP protocol versions supported, modification dates, etc., but those types are not as common. To visualize the interaction between server, browser and web pages, let’s use an example to illustrate the information flow. Imagine a simple web page, ‘mypage.html’, which is an HTML web page that contains two graphic images, ‘myimage1.jpg’ and ‘myimage2.jpg’. A typical server/browser interaction might go something like this:
The web browser asks for the URL mypage.html.
The server sees the request and sends back the HTML page.
The web browser notices that there are two inline graphic links in the HTML page, so it asks for the first one, myimage1.jpg.
The server sees the request and sends back the graphic image.
The web browser then asks for the second image, myimage2.jpg.
The server sees the request and sends back the graphic image.
The browser displays the web page and graphics for the user.
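The steps above can be sketched from the browser's point of view. This small Python example reads a page, finds the inline images, and lists the URLs it would ask the server for, in order. The page contents in PAGE_HTML are invented for the illustration (the real source of mypage.html is not shown), and the sketch assumes the images live in the top-level directory of the site.

```python
from html.parser import HTMLParser

class InlineImageFinder(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.images.append(value)

# Hypothetical contents of mypage.html, made up for this sketch.
PAGE_HTML = """
<html><body>
<h1>My page</h1>
<img src="myimage1.jpg">
<img src="myimage2.jpg">
</body></html>
"""

def requests_for_page(page_url, html):
    """Return the URLs a browser would request: the page first, then each image."""
    finder = InlineImageFinder()
    finder.feed(html)
    # Assumption: image paths are relative to the root of the site.
    return [page_url] + ["/" + src for src in finder.images]

print(requests_for_page("/mypage.html", PAGE_HTML))
# prints: ['/mypage.html', '/myimage1.jpg', '/myimage2.jpg']
```

One page view by one visitor therefore turns into three separate requests to the server.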
In the web server log, the following lines would be added:
192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117
192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231
192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432
So what can we gather from this exchange? Well, based on what we learned above, we can count the number of lines in the log file and determine that the server received 3 hits during the period that this log file covers. We can also calculate the number of hits each URL received (in this case, 1 hit each). Along the same lines, we can see that the server received 3 hits from the IP address 192.168.45.13, and when those requests were received. The two numbers at the end of each line represent the response code and the number of bytes sent back to the requestor. The response code is how the web server indicates how it handled the request, and the codes are defined as part of the HTTP protocol. In this example, they are all 200, which means everything went OK. One response code you may be very familiar with is the all too common ‘404 – Not Found’, which means that the requested URL could not be found on the server. There are several other response codes defined; however, these two are the most common.
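If you would like to see that counting done by a program instead of by eye, here is a rough Python sketch that tallies the three example lines. It assumes plain ASCII hyphens and straight quotes in the log, and the function name tally is just a label chosen for this illustration.

```python
from collections import Counter

# The three example log lines, written with plain ASCII punctuation.
LOG_LINES = [
    '192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117',
    '192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231',
    '192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432',
]

def tally(lines):
    """Count total hits, hits per URL and hits per IP address."""
    total_hits = 0
    hits_per_url = Counter()
    hits_per_host = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        total_hits += 1
        # The IP address is the first space-separated field.
        host = line.split()[0]
        # The request is the text between the first pair of double quotes.
        request = line.split('"')[1]
        url = request.split()[1]
        hits_per_url[url] += 1
        hits_per_host[host] += 1
    return total_hits, hits_per_url, hits_per_host

total_hits, hits_per_url, hits_per_host = tally(LOG_LINES)
print(total_hits)            # prints: 3
print(dict(hits_per_url))    # prints: {'/mypage.html': 1, '/myimage1.jpg': 1, '/myimage2.jpg': 1}
print(dict(hits_per_host))   # prints: {'192.168.45.13': 3}
```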
And that, in a nutshell, is about all you can accurately determine from the logs. “But wait!” you might be screaming, “most analysis programs have lots of other numbers displayed!”, and you would be right. Some more obscure numbers can be calculated, like the number of hits for each response code, the number of hits within a given time period, the total number of bytes sent to remote browsers, etc. Other numbers can be implied based on certain assumptions; however, those cannot be considered entirely accurate, and some can even be wildly inaccurate. A web server might also use other log formats, which provide additional information beyond what the Common Log Format (CLF) shown above does, and those will be discussed shortly. For now, just realize that the only thing you can really, accurately determine is what IP address requested which URL, and when it requested that URL, as shown in the example above.
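As a rough illustration of those calculated numbers, the Python sketch below works out hits per response code, hits per hour and total bytes sent for the three example lines. It assumes the log uses plain ASCII punctuation, and the function name summarize is made up for this example. It does not show how any particular analysis package does it.

```python
from collections import Counter

# The three example log lines, written with plain ASCII punctuation.
LOG_LINES = [
    '192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117',
    '192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231',
    '192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432',
]

def summarize(lines):
    """Return hits per response code, hits per hour, and total bytes sent."""
    status_counts = Counter()
    hits_per_hour = Counter()
    total_bytes = 0
    for line in lines:
        # The timestamp sits between '[' and ']', e.g. 24/May/2005:11:20:39 -0400.
        timestamp = line[line.index("[") + 1 : line.index("]")]
        day, hour = timestamp.split(":")[:2]
        hits_per_hour[f"{day} {hour}:00"] += 1
        # The response code and byte count are the last two fields on the line.
        status, size = line.rsplit(None, 2)[1:]
        status_counts[status] += 1
        if size != "-":  # '-' means no bytes were sent
            total_bytes += int(size)
    return status_counts, hits_per_hour, total_bytes

status_counts, hits_per_hour, total_bytes = summarize(LOG_LINES)
print(dict(status_counts))   # prints: {'200': 3}
print(dict(hits_per_hour))   # prints: {'24/May/2005 11:00': 3}
print(total_bytes)           # prints: 780
```

Notice that every one of these figures comes straight from counting or adding up fields that are actually in the log, which is why they can be trusted.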