So now you have a good grasp of how your web server works and what information can be obtained from its logs, like the number of hits (to the server and to individual URLs), the number of IP addresses making the requests (and how many hits each IP address made), and when those requests were made. Given just that information, you can answer questions such as “What is the most popular URL on my site?”, “What was the next most popular URL?”, “What IP address made the most requests to my server?”, and “How busy was my server during this time period?”. Most analysis programs will also make it easy to answer questions such as “What time of day is my web server the most active?” or “What day of the week is the busiest?”. They can give you an insight into usage patterns that may not be apparent from just looking at the raw logs. All of these questions can be answered completely accurately, based on nothing more than a simple analysis of your web server logs. That’s the good news!
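As a rough sketch of how such counts are produced, the following Python fragment tallies hits per URL and per IP address from CLF-style log lines. The regex and the sample log lines are illustrative only, not taken from any particular server:

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "METHOD URL PROTO" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

def tally(lines):
    """Count hits per URL and per IP address from CLF log lines."""
    urls, ips = Counter(), Counter()
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue  # skip malformed lines
        ip, _date, _method, url, _status, _bytes = m.groups()
        urls[url] += 1
        ips[ip] += 1
    return urls, ips

sample = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 1234',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 -0500] "GET /index.html HTTP/1.0" 200 1234',
    '10.0.0.1 - - [01/Jan/2024:10:02:00 -0500] "GET /about.html HTTP/1.0" 200 567',
]
urls, ips = tally(sample)
print(urls.most_common(1))  # most popular URL
print(ips.most_common(1))   # busiest IP address
```

These tallies are the “completely accurate” kind of number: every request really is a line in the log, so counting them involves no guesswork.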
The bad news, however, is that for all the things you can determine by looking at your logs, there are a lot of things you can’t accurately calculate. Unfortunately, some analysis programs (particularly commercial packages) lead you to believe otherwise, and fail to mention that these numbers are little more than assumptions and cannot be considered at all accurate. Like what, you ask? Well, how about those things that some programs call ‘user trails’ or ‘paths’, which are supposed to tell you what pages, and in what order, a user travelled through your site. Or how about the length of time a user spends on your site? Another less-than-accurate metric is that of ‘visits’, or how many users ‘visited’ your site during a given time period. None of these can be accurately calculated, for a couple of different reasons. Let’s look at some of them.
The HTTP protocol is stateless. In a typical computer program that you run on your own machine, you can always determine what the user is doing. They log in, do some stuff, and when finished, they log out. The HTTP protocol, however, is different. Your web server only sees requests from some remote IP address. The remote address connects, sends a request, receives a response and then disconnects. The web server has no idea what the remote side is doing between these requests, or even what it did with the response sent to it. This makes it impossible to determine things like how long a user spends on your site. For example, if an IP address makes a request to your server for your home page, then 15 minutes later makes a request for some other page on your site, can you determine how long the user had been at your site? The answer is, of course, no! Just because 15 minutes elapsed between requests doesn’t mean the user was on your site that whole time; you have no idea what the remote address was doing between those two requests. They could have hit your site, then immediately gone somewhere else on the web, only to come back 15 minutes later to request another page. Some analysis packages will say that the user stayed on your site for at least 15 minutes, plus some ‘fudge’ time for viewing the last page requested (like 5 minutes or so). This is just a guess, and nothing more.
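That “fudge” guess can be sketched in a few lines of Python. The 5-minute value below is exactly as arbitrary as it is in the packages that use it:

```python
from datetime import datetime, timedelta

# The "time on site" guess some packages make: the span between the first and
# last request from an IP, plus an arbitrary 'fudge' for viewing the last page.
# The fudge value (here 5 minutes) is pure invention -- the log says nothing
# about what happened after the last request was answered.
FUDGE = timedelta(minutes=5)

def guessed_duration(timestamps):
    """Return the guessed time-on-site for one IP's request times."""
    times = sorted(timestamps)
    return (times[-1] - times[0]) + FUDGE

t0 = datetime(2024, 1, 1, 10, 0)
t1 = datetime(2024, 1, 1, 10, 15)
print(guessed_duration([t0, t1]))  # reports 20 minutes, but the user may have
                                   # spent almost none of that on your site
```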
You cannot determine individual users. Web servers see requests and send results to IP addresses only. There is no way to determine what is at that address, only that some request came from it. It could be a real person, it could be some program running on a machine, or it could be lots of people all using the same IP address (more on that below). Some of you will note that the HTTP protocol does provide a mechanism for user authentication, where a username and password are required to gain access to a web site or individual pages. And while that is true, it isn’t something that a normal, public web site uses (otherwise it wouldn’t be public!). As an example, say that one IP address makes a request to your server, and then a minute later, some other IP address makes a request. Can you say how many people visited your site? Again, the answer is no! One of those requests may have come from a search engine ‘spider’, a program designed to scour the web looking for links and such. Both requests could have been from the same user, but at different addresses. Some analysis programs will try to determine the number of users based on things like the IP address plus the browser type, but even so, these are nothing more than guesses built on some rather shaky assumptions.
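A sketch of that IP-plus-browser heuristic, using made-up addresses, shows how easily it miscounts in both directions:

```python
# The shaky "unique users" heuristic: treat each distinct (IP, user-agent)
# pair as one user. A firewall full of identical browsers collapses into one
# "user", while one person whose dial-up address changes gets counted twice.
def guessed_users(requests):
    """requests: iterable of (ip, user_agent) tuples."""
    return len({(ip, ua) for ip, ua in requests})

reqs = [
    ("203.0.113.5", "Mozilla/4.0"),   # two office workers behind one
    ("203.0.113.5", "Mozilla/4.0"),   # firewall with identical browsers
    ("198.51.100.7", "Mozilla/4.0"),  # the same person after reconnecting
]
print(guessed_users(reqs))  # 2 -- when it could really be anywhere from 1 to 3
```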
Network topology makes even IP addresses problematic. In the good old days, every machine that wanted to talk on the internet had its own unique IP address. However, as the internet grew, so did the demand for addresses. As a result, several methods of connecting to the internet were developed to ease the addressing problem. Take, for example, a normal dial-up user sitting at home. They call their service provider, the machines negotiate the connection, and an IP address is assigned from a re-usable ‘pool’ of IP addresses that has been allocated to the provider. Once the user disconnects, that IP address is made available to other users dialing in. The home user will typically get a different IP address each time they connect, meaning that if for some reason they are disconnected, they will re-connect and get a new IP address. Given this situation, a single user can appear to be at many different IP addresses over a given time. Another typical situation is a corporate environment, where all the PCs in the organization use private IP addresses to talk on the network, and they connect to the internet through a gateway or firewall machine that translates their private addresses to the public one the gateway/firewall uses. This can make all the users within the organization appear as if they were all using the same IP address. Proxy servers are similar: there can be thousands of users, all appearing to come from the same address. Then there are reverse-proxy servers, typical of many large providers such as AOL, that can make a single machine appear to use many different IP addresses while it is connected (the reverse-proxy keeps track of the addresses and translates them back to the user). Given this situation, can you say how many users visited your site if your logs show 10 requests from the same IP address over an hour? Again, the answer is no! It could have been the same user, or it could have been multiple users sitting behind a firewall.
Or how about if your logs show 10 requests from 10 different IP addresses? Think they came from 10 different users? Not necessarily. It could have been 10 different users, could have been a couple of users sitting behind a reverse-proxy, could have been one or more users along with a search engine ‘spider’, or it could be any combination of them all.
Ok, so what have we learned here? Well, in short, you don’t know who or what is making requests to your server, and you can’t assume that a single IP address is really a single user. Sure, you can make all kinds of assumptions and guesses, but that is all they really are, and you should not consider them at all accurate. Take the following example: IP address A makes a request to your server, 1 minute later, IP address B makes a request, and then 10 minutes later, address A makes another request. What can we determine from that sequence? Well, we can assume that two users visited. But what if address A was that of a firewall? Those two requests from address A could have been two different users. What if the user at address A got disconnected and dialed back in, getting a different address (address B), and someone else dialed in at the same time and got the now-free address A? Or maybe the user was sitting behind a reverse-proxy, and all three requests were really from the same user. And can we tell what ‘path’ or ‘trail’ these users took while at the web site, or how long they remained? Hopefully, you now see that the answer to all of these is a big resounding No, we can’t! Without being able to identify individual unique users, there is no way to tell what an individual unique user does. All is not lost, however. Over time, people have come up with ways to get around these limitations. Systems have been written to work around the stateless nature of the HTTP protocol. Cookies and other unique identifiers have been used to track individuals, as have various dynamic pages with back-end databases. However, these things are all, for the most part, external to the protocol, are not logged in a standard web server log, and require specialized tools to analyze. In all other cases, any program that claims to analyze these types of metrics should be considered to be making guesses based on certain assumptions. One such example can be found within the Webalizer itself.
The concept of a ‘visit’ is a metric that cannot be accurately reported, yet it is one of the things that the Webalizer does show. It was added because of the huge number of requests for it received from people using the program. It is based on the assumption that a single IP address represents a single user. You have already seen how this assumption falls flat in the real world, and if you read through the documentation provided with the program, you will see that it clearly says the ‘visit’ numbers (along with ‘entry’ and ‘exit’ pages) are not to be considered accurate, but more of a rough guess. We haven’t touched on entry and exit pages yet, but they are based on the concept of a ‘visit’, which we have already seen isn’t accurate. These are supposed to be the first and last pages a user sees while at the web site. If a request comes in that is considered a new ‘visit’, then the URL of that request would be, in theory, the ‘entry’ page to the site. Likewise, the last URL requested in a visit would be the ‘exit’ page. Like user ‘paths’ or ‘trails’, and being based on the ‘visit’ concept, they are to be treated with the same caution. One of the funniest metrics I have seen in one particular analysis program was supposed to tell you where the user was located geographically, based on where the domain name of the requesting remote address was registered. Clever idea, but completely worthless. Take for example AOL, which is registered in Virginia. The program considered all AOL users as living in Virginia, which we know is not the case for a provider with access points all over the globe.
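The visit/entry/exit logic can be sketched roughly like this. The 30-minute idle timeout is an assumption (analysis programs generally let you configure it), and the times and URLs are made up:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # assumed idle gap that starts a new 'visit'

def visits(requests):
    """Split one IP's (time, url) requests into guessed visits.
    Returns a list of (entry_page, exit_page, request_count) tuples."""
    out, current = [], []
    last = None
    for t, url in sorted(requests):
        if last is not None and t - last > TIMEOUT:
            out.append((current[0], current[-1], len(current)))
            current = []
        current.append(url)
        last = t
    if current:
        out.append((current[0], current[-1], len(current)))
    return out

d = datetime(2024, 1, 1, 9, 0)
reqs = [(d, "/"), (d + timedelta(minutes=5), "/faq.html"),
        (d + timedelta(hours=2), "/")]   # long gap => counted as a new visit
print(visits(reqs))
# Two "visits" -- yet it could be one user who wandered off, or several
# different people behind the same firewall address.
```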
Other metrics you CAN determine. Now that you have seen what is possible, you may be thinking about some of the other things these programs display, and wondering how accurate they might be. Hopefully, based on what you have seen thus far, you should be able to figure them out on your own. One such metric is that of a ‘page’ or ‘page view’. As we already know, a web page is made up of an HTML text document and usually other elements such as graphic images, audio or other multimedia objects, style sheets, etc. One request for a web page might generate dozens of requests for these other elements, but a lot of people just want to know how many web pages were requested without counting all the stuff that makes them up. You can get this number, if you know what type of files you consider a ‘page’. On a normal server, these would be just the URLs that end with a .htm or .html extension. Perhaps you have a dynamic site, and your web pages use an .asp, .pl or .php extension instead. You obviously would not want to count .gif or .jpg images as pages, nor would you want to count style sheets, Flash graphics and other elements. You could go through the logs and just count up the requests for whatever URLs meet your criteria for a ‘page’, but most analysis programs (including the Webalizer) allow you to specify what you consider a page and will count them up for you.
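A rough sketch of that page counting. The extension list here is just an example of the kind of configuration you would supply; get it wrong, and the number is meaningless:

```python
# What counts as a 'page' is configuration, not fact. These extensions are an
# illustrative choice -- a site built entirely of .php pages would need .php
# listed here, or its page count would come out as zero.
PAGE_EXTS = {".html", ".htm", ".php", ".asp", ".pl"}

def page_views(urls):
    """Count requests whose URL matches our criteria for a 'page'."""
    count = 0
    for url in urls:
        path = url.split("?", 1)[0]           # drop any query string
        dot = path.rfind(".")
        ext = path[dot:] if dot != -1 else ""
        if ext in PAGE_EXTS or path.endswith("/"):
            count += 1                        # a bare directory request
    return count                              # usually serves an index page

urls = ["/index.html", "/img/logo.gif", "/style.css", "/cgi/form.pl", "/"]
print(page_views(urls))  # 3 -- the .gif and .css requests are not pages
```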
Other information. Up to now, we have discussed only the CLF (Common Log Format). There are others. The most common is called ‘combined’, which takes the basic CLF format and adds two new pieces of information, tacked on the end: the ‘referrer’ and the ‘user agent’. A user agent is just the name of the browser or program being used to generate the request to the web server. The ‘referrer’ is supposed to be the page that referred the user to your web server. Unfortunately, both of these can be completely misleading. The user agent string can be set to anything in some modern browsers. One common trick for Opera users is to set their user agent string to that of MS Internet Explorer so they can view sites that only allow MSIE visitors. And the referrer string, according to the standards document (RFC) for the HTTP protocol, may or may not be sent at the browser’s choosing, and if sent, does not have to be accurate or even informative. The Apache web server (the most common on the internet) allows other things to be logged, such as cookie information, the length of time taken to handle the request, and lots of other stuff. Unfortunately, the inclusion and placement of this information in the server logs are not standard. Another format, developed by the W3C (World Wide Web Consortium), allows log records to be made up of many different pieces of information whose location can be anywhere in the log entry, with a header record needed to map them. Some analysis programs handle these and other formats better than others.
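Pulling those two extra fields out of a combined-format line can be sketched as follows; the regex and the sample line are illustrative:

```python
import re

# Combined format = CLF plus two quoted fields tacked on the end:
# ... "referrer" "user-agent". Neither is trustworthy: both are supplied by
# the client and can be blank, forged, or anything at all.
COMBINED = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "([^"]*)" "([^"]*)"')

line = ('10.0.0.1 - - [01/Jan/2024:10:00:00 -0500] '
        '"GET /page.html HTTP/1.0" 200 2326 '
        '"http://example.com/links.html" "Mozilla/4.0 (compatible; MSIE 6.0)"')

m = COMBINED.match(line)
ip, referrer, agent = m.groups()
print(referrer)  # may be "-", empty, or faked, per the HTTP RFC
print(agent)     # e.g. Opera masquerading as MSIE looks identical to MSIE
```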
Conclusion. It should now be obvious that there are only certain things you can determine from a web server log. There are some completely accurate numbers you can generate without question. And then there are some wildly inaccurate and misleading numbers you can garner, depending on what assumptions you make. Want to know how many requests generated a 404 (not found) result? Go right ahead and count them up, and be completely confident in the number you get. Want to know how many ‘users’ visited your web site? Good luck with that one. Unless you go ‘outside the logs’, it will be a hit-or-miss stab in the dark. But now you should have a good idea of what is and isn’t possible, so when you look at your usage report, you will be able to determine what the numbers mean and how much to trust them. You should also now see that a lot can depend on how the program is configured, and that the wrong configuration can lead to wrong results. Take the example of ‘pages’: if your analysis software thinks that only URLs with a .htm or .html extension are pages, and all you have are .php pages on your site, that number will be completely wrong. Not because the program is wrong, but because someone gave it the wrong information to base its calculations on. Remember, knowledge is power, so now you have the power to ask the proper questions and get the proper results. The next time you look at a server analysis report, hopefully you will see it in a different light, given your newfound knowledge.