Log analysis: What the web server log reveals about your visitors

For a computer to function properly, the operating system generally has to run various processes at the same time. Most of these tasks take place in the background, without the user even being aware of them. The actions of executed programs are nevertheless registered, since applications automatically keep a record of their activity in special files, referred to as log files. These list certain activities and events of each application – such as error messages, setting changes, program accesses, or crash reports. Authorized users can consult these files and search them for errors. By assessing the log files of a web server, you can also gain information about the general behavior of site users. Reviewing a log in this way is referred to as log file analysis.

What is a log file analysis?

Log file analysis, or log analysis for short, refers to the process of targeted inspection and analysis of log data. This method allows for database and e-mail transmission errors to be retraced or firewall activity to be reviewed. Most commonly, it’s used for search engine optimization. If you take the trouble to analyze your server log files, you can get useful information about your visitors and their activities. A look at the web server log can reveal the following user data:

  • IP address and host name
  • Access time
  • Browser used by the visitor
  • Operating system used by the visitor
  • Origin link or URL
  • Search engine used, including the keywords entered
  • Length of access
  • Number of pages accessed
  • Last page opened before leaving the site

Since the manual analysis of log files is practically impossible due to their sheer number, various analysis tools exist that present relevant information from the logs in visual form. The analyst’s only job is then to draw the correct conclusions from the visualized data. In addition to reviewing website visitors, the web server log file is also used to determine technical errors (in terms of the network, applications, or individual components) and security issues, as well as automated access by bots.

Server log analysis: Typical problems and solutions

The biggest problem with web server log analysis is that HTTP is a stateless protocol: The transmission protocol responsible for communication between server and client handles each query separately. As a result, the web server logs two page views by the same client as two unrelated requests – something rather impractical for the analysis of a user’s overall behavior. There are several ways to solve this problem:

  1. Assigning a session ID: The session ID is generated by the server and stored in the user’s browser. If a user is given such an identification number, all queries they submit to the web server are processed under the same number, so all of their actions are combined into a single session. There are two options for assigning an ID: One is to use cookies, which are stored upon the client’s first request and then transmitted with each further contact with the server. Cookies are not visible in the log file, though, and so require special programming if you want to be able to analyze them later. The other option is to transmit the session ID as a URL parameter. These user-specific links, however, require additional programming effort, and from an SEO point of view the individual URLs can cause problems: The same content is accessible at a different URL every time the crawler visits the page, so it could easily be misconstrued as duplicate content.
  2. User identification via IP address: As long as all of a user’s actions are attributed to the same IP address, they can be uniquely identified using this method. The prerequisite (in the opinion of many data protection experts) is that the user has agreed in advance to the collection of their IP address for analysis purposes. Problems arise when visitors are assigned their IP dynamically and so are counted multiple times, or if several users are using the same IP – for example, via a proxy server.
  3. Using server-independent measures: So-called tracking pixels – basic building blocks of page tagging – which are invisibly embedded in the website, provide advanced information such as the screen resolution and which browser plugins a user has installed. In combination with the IP address and the information on the browser and operating system, users can be distinguished to a certain degree, although a 100% separation of users is not possible with this method. With the help of tracking pixels or AJAX elements, you can also record how interactive page elements are used – information that a simple log file analysis does not provide.
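The session reconstruction described in the list above can be sketched in a few lines. The following is a minimal, illustrative example (the 30-minute timeout and identification by IP address alone are assumptions, not part of any standard):

```python
from datetime import datetime, timedelta

# Hypothetical sketch: group (ip, timestamp) hits into sessions using a
# 30-minute inactivity timeout, identifying users by IP address alone.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(hits):
    """hits: iterable of (ip, datetime) pairs, assumed sorted by time."""
    sessions = {}   # ip -> list of sessions, each a list of timestamps
    last_seen = {}  # ip -> timestamp of the most recent hit
    for ip, ts in hits:
        if ip in last_seen and ts - last_seen[ip] <= SESSION_TIMEOUT:
            sessions[ip][-1].append(ts)               # same session continues
        else:
            sessions.setdefault(ip, []).append([ts])  # new session starts
        last_seen[ip] = ts
    return sessions

hits = [
    ("203.0.113.195", datetime(2016, 10, 7, 10, 43)),
    ("203.0.113.195", datetime(2016, 10, 7, 10, 45)),
    ("203.0.113.195", datetime(2016, 10, 7, 14, 0)),  # > 30 min gap: new session
]
print(len(sessionize(hits)["203.0.113.195"]))  # → 2
```

As the list above notes, identification by IP alone is unreliable with dynamic or shared addresses, which is exactly why session IDs or page tagging are often preferred.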

Another problem with log file analysis is the caching function of the web browser or an interposed proxy server. While caching is crucial for the rapid delivery of requested data, it also means that users don’t always come into contact with the web server. Inquiries that concern cached content only appear in a limited way in the server log files (status code 304 “Not Modified”). It should also be mentioned that additional resources are needed for the permanent logging, storage, and evaluation of server accesses – especially for web projects with a high traffic volume. In addition, log file analysis doesn’t include important indicators such as the bounce rate or the length of visits, which is why it should only be used as a supplement to other test instruments and not as the sole method for visitor analysis.
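To get a feeling for how much of your traffic is answered from caches, you can simply tally the status codes in the log. A small illustrative sketch (the sample log lines are invented for demonstration):

```python
import re
from collections import Counter

# Minimal sketch: count HTTP status codes in access-log lines to see how
# many requests were answered from a cache (304 "Not Modified").
STATUS_RE = re.compile(r'" (\d{3}) ')  # status code follows the quoted request

sample_log = '''203.0.113.195 - - [07/Oct/2016:10:43:00 +0200] "GET /index.html HTTP/2.0" 200 2326
203.0.113.195 - - [07/Oct/2016:10:44:10 +0200] "GET /style.css HTTP/2.0" 304 -
198.51.100.7 - - [07/Oct/2016:10:45:00 +0200] "GET /index.html HTTP/2.0" 304 -'''

codes = Counter(m.group(1) for line in sample_log.splitlines()
                for m in [STATUS_RE.search(line)] if m)
print(codes["304"])  # → 2
```

A high share of 304 responses is a sign that a substantial portion of user activity never produces a full entry in the log.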

Log data analysis – How to

To get an impression of what a log analysis can achieve, you should first take a look at the content and structure of the log file. For example, the log of an Apache web server, named access.log, is automatically created in the Apache log directory.

Information provided by the Apache log

The created entries are saved in the common log format (also called the NCSA common log format) which has a predetermined syntax:

%h %l %u %t "%r" %>s %b

The individual components stand for the following information:

%h Client IP address
%l Identity of the client, which isn’t determined by default, so you often find a minus sign (-) at this point, indicating a missing entry in the log file
%u User ID of the client, assigned for example with HTTP authentication in directory protection; normally not set
%t Timestamp of the access time
%r Information about the HTTP request (method, requested resource, and protocol version)
%>s Status code with which the server responded to the request
%b Data transferred (in bytes)

A complete entry in the access.log looks like this:

203.0.113.195 - user [07/Oct/2016:10:43:00 +0200] "GET /index.html HTTP/2.0" 200 2326

In this case, a client with the IP address 203.0.113.195 and the user ID user accessed the index.html at 10:43 on October 7, 2016 using HTTP/2.0. The server answered with the status code 200 (Request successful, result transferred) and transmitted 2326 bytes.
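An entry in this format can be broken into its named fields with a short regular expression. The following sketch uses the example entry from above (the pattern is one common way to parse the format, not the only one):

```python
import re

# Sketch: split one common-log-format entry into its named fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

entry = '203.0.113.195 - user [07/Oct/2016:10:43:00 +0200] "GET /index.html HTTP/2.0" 200 2326'
fields = LOG_PATTERN.match(entry).groupdict()
print(fields["host"])    # → 203.0.113.195
print(fields["status"])  # → 200
```

Each named group corresponds to one of the format components listed above (%h, %l, %u, %t, %r, %>s, %b).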

In the httpd.conf file, you can also extend the log file entries by adding two additional components to the log format: "%{Referer}i" and "%{User-agent}i". In this way, you can find out which link or which page visitors have used to reach your website, as well as which client program and which operating system were used. Thanks to the latter details, the combined log format also makes it possible to identify external applications that retrieve data from your web server, such as email programs loading embedded graphics, but also search engine crawlers and spam bots. For more information about the properties of the log files, please see our introductory article on the topic.
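In an Apache configuration, the combined format described above is typically declared with a LogFormat directive along these lines (the log file path is an example and varies by installation):

```apache
# Combined log format: common log format plus referrer and user agent.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
CustomLog "logs/access_log" combined
```

The quotation marks around %r, %{Referer}i, and %{User-agent}i are escaped with backslashes because the whole format string is itself enclosed in quotes.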

Tools that enable analysis of the log entries

Provided you have the necessary rights to access the log file of your web server, it’s theoretically possible to check individual client accesses to your project manually without the use of additional tools. Assume, however, that your site has around 1,000 visitors a day who visit an average of ten pages – 10,000 new log file entries would be created every day, and that count doesn’t include the embedded content. Such a large quantity would be impossible to evaluate manually. For log file analysis, tools with which you can export and segment data are needed.

If the volume of log files is manageable, then conventional data processing tools, such as Microsoft Excel, can be used to convert the log file to the CSV format and import it – as described in the following Microsoft instructions. In Excel, you can organize information about collected queries and sort them, for example, by IP address, status code, or referrer. But because there are limitations on the size of an imported log, the results of an Excel log file analysis can only ever provide a snapshot.
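The conversion to CSV can also be scripted instead of done by hand. A hedged sketch, reusing the parsing pattern from above (the column names are our own choice):

```python
import csv
import io
import re

# Sketch: turn common-log-format lines into CSV rows that a spreadsheet
# such as Excel can import and sort by any column.
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)

def log_to_csv(log_lines, out):
    writer = csv.writer(out)
    writer.writerow(["ip", "ident", "user", "time", "request", "status", "bytes"])
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m:
            writer.writerow(m.groups())

sample = ['203.0.113.195 - user [07/Oct/2016:10:43:00 +0200] "GET /index.html HTTP/2.0" 200 2326']
buf = io.StringIO()
log_to_csv(sample, buf)
print(buf.getvalue().splitlines()[1].split(",")[0])  # → 203.0.113.195
```

To process a real log, open the access.log and an output file instead of the in-memory buffer, e.g. `log_to_csv(open("access.log"), open("access.csv", "w", newline=""))`.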

For longer-term investigations of your traffic based on log files, the use of a log file analyzer is recommended. Unlike spreadsheet programs, these tools were specially developed for the graphical display and evaluation of log files. In dashboards, which can be accessed via an ordinary browser, the previously mentioned fields taken from the log file are visually processed – partly almost in real time, for example with the open-source tool GoAccess, which we have discussed elsewhere.

Log file analysis and the issue of data security

A log file analysis that involves information about website visitors always touches on data protection, and two aspects in particular matter in this context. The first is the possibility to store all collected data on your own server. If you host your web server – and its associated log files – with an external provider, then you should know exactly what measures the provider offers for secure data protection. With tracking tools like Google Analytics, the server location is a constant concern for visitors and site owners outside of the US, as all user information is stored on Google’s servers, which are mostly located in the USA.

The second important point is the question of personalization with regard to the collected log file analysis data. While information such as access time, visited pages, or the browser used hardly provides specific personal information about the user, the situation concerning recorded IP addresses is a different story. Especially with statically assigned IP addresses, data protection experts criticize the possibility of establishing a direct personal reference. Location determination based on the IP address is illegal in some countries, and webmasters are generally advised to store the addresses in anonymized form and delete them after no longer than one week, unless there is some sort of obligation to preserve them as a form of public record.

If you want to examine log files in depth, though, full anonymization is not an option, as you would no longer be able to combine the different actions of a single user for analysis. To be safe, website operators should inform users about the recording of IP addresses for analysis purposes; the US currently has no general data retention laws, but EU courts have confirmed that IP addresses can in fact constitute personal data in some cases.
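A common compromise between analysis and anonymization is to truncate the address rather than discard it. A minimal sketch of this idea (zeroing the last IPv4 octet; the IPv6 handling below is a crude placeholder, not a standard scheme):

```python
# Sketch: anonymize an IP address by zeroing its final segment, so the
# address no longer identifies a single connection but still allows
# coarse grouping of accesses.
def anonymize_ip(ip):
    parts = ip.split(".")
    if len(parts) == 4:                 # IPv4: drop the last octet
        parts[-1] = "0"
        return ".".join(parts)
    return ip.rsplit(":", 1)[0] + ":0"  # crude IPv6 fallback (assumption)

print(anonymize_ip("203.0.113.195"))  # → 203.0.113.0
```

Note that truncated addresses can no longer distinguish users behind the same subnet, which is exactly the trade-off described above.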

Server log file analysis: A solid basis for web analysis

The recording of user data is one of the most important means of measuring the success of a web project. Only by observing the development of traffic and the regular behavior of visitors can you match your offers and content to your intended audience. Unlike tracking with tools like Piwik and Google Analytics, log file analysis is based on standard data received by the server without the help of JavaScript, so the data can be recorded even if the user blocks the scripting language. But this advantage has its own limitations: While the assignment of individual accesses to specific user sessions is possible, it is much more complex than with cookie tracking, and the log analysis will lack some indicators, such as the bounce rate or the exact visit length of users.

Caching by client programs also complicates log analysis: It enables user requests without server contact, which appear only in restricted form in the web server log. The same goes for dynamic IP addresses, which prevent a concrete assignment of the various accesses to individual users. The results of a log analysis should therefore primarily be understood as a snapshot and, depending on the size of your web project, should be used in addition to other analysis methods. If you take care of the hosting and evaluation of log files on your own and inform your visitors about the recording of IP addresses and other data, though, you have a solid and data-compliant basis for analyzing user behavior. Log analysis is particularly useful for examining the difference between mobile and desktop traffic, as well as for detecting bots like the Google crawler.
