Analysing IIS Logfiles with Decision Cube
Is there anybody out there? (visiting our websites)
I’ve been a webmaster for a long time (more than a decade actually) and I know that there is nothing more gratifying than knowing that a lot of people are watching and reading your work on the web.
For the UK Developers Group website, my webmastering tasks mainly consist of uploading PDF files, downloadable files, and keeping links up-to-date.
Examples of PDF files include the DG Magazine, as well as a set of PDF files we’ve been working on: the O’Reilly newsletters.
Putting stuff on the web is nice, but we were really wondering if anyone was actually visiting the website and reading what we put on there. Fortunately, being hosted by TDMWeb (excellent service by the way), our website directory on the server contains a Logfiles directory that holds the IIS logfiles – saved on a per-day basis. So, what better way to find out which parts of the website are visited (and which are not) than to examine these logfiles!
Of course, I didn’t use Notepad for this, but decided to write a nice little Delphi application, which is described in this article.
The first thing I had to do was to download the IIS logfiles from the TDMWeb server (one for each day), and look inside them for interesting details. For a first test, this resulted in a few dozen logfiles which all start with the following four lines:
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2006-04-01 00:01:00
#Fields: date time cs-method cs-uri-stem cs-uri-query cs-username c-ip cs-version cs(User-Agent) cs(Referer) sc-status sc-bytes

This is very helpful information, especially the fourth line, since it defines the fields that were logged. Not everything might be interesting to analyse, but since I want to build a semi-automatic logfile analysis tool, knowing the names of the fields is a real time saver.
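To make the field layout concrete, here is a quick illustrative sketch (in Python, not part of the Delphi project itself) that pairs the names from the #Fields header with the whitespace-separated values of a log line. The sample log line below is invented for illustration:

```python
# Parse a W3C extended log "#Fields:" header and pair its names
# with the whitespace-separated values of a log line.
# The sample log line is invented for illustration only.
fields_header = ("#Fields: date time cs-method cs-uri-stem cs-uri-query "
                 "cs-username c-ip cs-version cs(User-Agent) cs(Referer) "
                 "sc-status sc-bytes")
log_line = ("2006-04-01 00:01:23 GET /oreilly/news.pdf - - 10.0.0.1 "
            "HTTP/1.1 Mozilla - 200 1024")

names = fields_header.split(":", 1)[1].split()
values = log_line.split()
record = dict(zip(names, values))

print(record["cs-uri-stem"])  # the URL we want to analyse
print(record["c-ip"])         # the visitor's IP address
```

This is exactly the knowledge the Delphi parser below relies on: the URL is the cs-uri-stem field, and the visitor's IP address is the c-ip field.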
To count the visits per URL, I use a small TCounter class, which holds the number of visits as well as a list of the visitors’ IP-addresses:

type
  TCounter = class
    visits: Integer;
    IPS: TStringList;
    constructor Create(const IP: String);
  end;

constructor TCounter.Create(const IP: String);
begin
  visits := 1;
  IPS := TStringList.Create;
  IPS.Sorted := True;
  IPS.Duplicates := dupIgnore;
  IPS.Add(IP)
end;

Note that I place the IP address in a TStringList of its own, setting Sorted to True (which allows for quick insertion), and the Duplicates property to dupIgnore, so each IP-address is included only once. That way, we can simply ask the IPS field how many unique visitors (i.e. visitors with unique IP-addresses) have visited this specific URL.
The next step is the AnalyseLogFile method, which takes a filename as argument and builds the list of URLs. This list is stored in the URL field of type TStringList. In AnalyseLogFile, we can skip the first four lines of each logfile (as mentioned before), and for each subsequent line skip the first 24 characters that hold the date, time and method. The URL is the next field, followed by a space (or in fact by two empty fields for cs-uri-query and cs-username), so we can search for ‘ - - ‘ to determine the end of the URL.
Once we have the full URL string, we need to filter out the garbage. By garbage, I mean anything that comes after a ? or & character, and anything from an accidental < or > character onwards. Also, when the URL contains .com, .co.uk or .edu instead of a local path, we should ignore it as well. Finally, a URL with no dot, or with a @ character inside, is invalid as well. These checks were included to ensure that 99% of the invalid URLs are ignored.
The URL is then used to extract the folder as well as the extension, since I want to maintain and analyse all three. This enables us later to filter out the GIF or JPG files, for example, when all I want to look at are content pages with the HTM, ASP or even the PDF extension.
When the URL is cleaned, it can be added to the list. If it already exists, we only need to add the IP address (which will be ignored if it was already present). The IP address can be found as the seventh field in each line, usually right after the ‘ - - ‘ that follows the URL itself, so it’s easy to find.
procedure TMainForm.AnalyseLogFile(const FileName: String);
var
  F: TextFile;
  Str,Str2,IP: String;
  i: Integer;
begin
  AssignFile(F, FileName);
  Reset(F);
  readln(F);
  readln(F);
  readln(F);
  readln(F,Str); // line with field definitions
  while not eof(F) do
  begin
    readln(F,Str);
    Delete(Str,1,24); // skip date, time and method fields
    i := Pos(' - - ',Str);
    if i > 0 then
    begin
      IP := Copy(Str,i+5,16);
      Delete(IP,Pos(#32,IP),16)
    end;
    Delete(Str,Pos(#32,Str),Length(Str)); // remove other fields
    // filter URL field
    if Pos('?',Str) > 0 then Delete(Str,Pos('?',Str),Length(Str));
    if Pos('&',Str) > 0 then Delete(Str,Pos('&',Str),Length(Str));
    if Pos('<',Str) > 0 then Delete(Str,Pos('<',Str),Length(Str));
    if Pos('>',Str) > 0 then Delete(Str,Pos('>',Str),Length(Str));
    if (Length(Str) > 1) and (Str[1] <> '/') then
      Str := ''; // not a local path
    if (Pos('.com',Str) > 1) or (Pos('.co.uk',Str) > 1) or
       (Pos('.edu',Str) > 1) then Str := '';
    if Pos('.',Str) = 0 then Str := '';
    if Pos('@',Str) > 0 then Str := '';
    if (Length(Str) > 1) and (Str[Length(Str)] = '.') then
      Delete(Str,Length(Str),1);
    // still anything left?
    if Str <> '' then
    begin
      i := URL.IndexOf(Str);
      if i >= 0 then // already found
      begin
        with URL.Objects[i] as TCounter do
        begin
          Inc(visits);
          IPS.Add(IP)
        end
      end
      else
      begin
        Str2 := Str;
        while Pos('/',Str2) > 0 do Str2[Pos('/',Str2)] := '\';
        if FileExists('..\html' + Str2) then
          URL.AddObject(Str,TCounter.Create(IP))
      end
    end
  end;
  CloseFile(F)
end;
Once all logfiles are parsed and we have a full URL TStringList, we can traverse it once more to determine the maximum length of the directory, full URL and extension fields. With that information, we can finally create a new logfile table and dump the contents of the URL TStringList in it, as shown in the CreateLogTable method.
I’m starting with a local Paradox table, so I can use the TTable component to set the TableType and TableName, and then add FieldDefs information for the URL, Dir, Ext, hits and IPs fields. The URL field will be unique, but there will be multiple URLs sharing the same Dir, and the same applies to the value of the Ext column. The hits and IPs columns are simply integer fields that contain the actual information we’re looking for. Note that I optionally add an index for the Dir and Ext fields, something that might come in handy later if we want to search or sort by the Dir or Ext fields.
procedure TMainForm.CreateLogTable;
var
  i,j: Integer;
  UrlLen, DirLen, ExtLen: Integer;
begin
  UrlLen := 0;
  DirLen := 0;
  ExtLen := 0;
  for i:=0 to URL.Count-1 do
  begin
    if Length(URL[i]) > UrlLen then UrlLen := Length(URL[i]);
    j := PosEx('/',URL[i],2); // PosEx can be found in the StrUtils unit
    if j > DirLen then DirLen := j;
    j := Length(URL[i]);
    while (j > 0) and (URL[i][j] <> '.') do Dec(j);
    if j > 0 then
      if (Length(URL[i]) - j) > ExtLen then ExtLen := Length(URL[i]) - j;
  end;
  // create logfile table
  tblLogFile.TableType := ttParadox;
  tblLogFile.TableName := 'Logfiles.db';
  tblLogFile.FieldDefs.Add('URL', ftString, UrlLen, False);
  tblLogFile.FieldDefs.Add('Dir', ftString, DirLen, False);
  tblLogFile.FieldDefs.Add('Ext', ftString, ExtLen, False);
  tblLogFile.FieldDefs.Add('hits', ftInteger, 0, False);
  tblLogFile.FieldDefs.Add('IPs', ftInteger, 0, False);
  tblLogFile.IndexDefs.Add('', 'URL', [ixPrimary,ixUnique]);
  tblLogFile.CreateTable;
  //tblLogFile.AddIndex('Dir','Dir',[]);
  //tblLogFile.AddIndex('Ext','Ext',[]);
  // add logfile info to the table
  tblLogFile.Active := True;
  for i:=0 to URL.Count-1 do
  begin
    tblLogFile.Append;
    try
      tblLogFile.FieldByName('URL').AsString := URL[i];
      j := PosEx('/',URL[i],2);
      if j > 0 then
        tblLogFile.FieldByName('Dir').AsString :=
          LowerCase(Copy(URL[i],2,j-2));
      if tblLogFile.FieldByName('Dir').AsString = '' then
        tblLogFile.FieldByName('Dir').AsString := '/';
      j := Length(URL[i]);
      while (j > 0) and (URL[i][j] <> '.') do Dec(j);
      if j > 0 then
        tblLogFile.FieldByName('Ext').AsString :=
          Copy(URL[i],j+1,Length(URL[i]));
      tblLogFile.FieldByName('hits').AsInteger :=
        (URL.Objects[i] as TCounter).visits;
      tblLogFile.FieldByName('IPs').AsInteger :=
        (URL.Objects[i] as TCounter).IPS.Count;
      tblLogFile.Post;
    except
      tblLogFile.Cancel
    end
  end
end;

Finally, once the table is created, it’s a simple matter of traversing the URL TStringList again and, for each item in the TStringList, appending a new record to the table with the values for the five fields.
With the data in the Logfiles table, we can produce a summary using a simple SQL query:

SELECT Ext, Dir, Sum(hits), Sum(IPs)
FROM Logfiles
GROUP BY Ext, Dir

The output of such a SQL query can be presented in a grid, but it would be more helpful if we could use a slightly more flexible component that allows us to expand dimensions (the Ext and Dir fields) and swap summaries (the Sum(hits) and Sum(IPs) values). This is exactly what the TDecisionCube – or actually the TDecisionGrid – component is good for.
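For readers without a SQL engine at hand, the same aggregation can be sketched with a plain dictionary; here in Python for illustration (the row values are invented samples, not actual Developers Group statistics):

```python
# Group (Ext, Dir) rows and sum their hits and IPs counts,
# mimicking the GROUP BY query above. Rows are invented samples.
rows = [
    # (Ext,  Dir,        hits, IPs)
    ("pdf", "oreilly",   10,   7),
    ("pdf", "oreilly",    5,   3),
    ("htm", "meetings",   8,   6),
]

totals = {}
for ext, dir_, hits, ips in rows:
    h, i = totals.get((ext, dir_), (0, 0))
    totals[(ext, dir_)] = (h + hits, i + ips)

print(totals[("pdf", "oreilly")])  # (15, 10)
```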
Today, I’m using Delphi 2006, and while previous versions of Delphi Enterprise contained a nice set of components under the label Decision Cube, these are not installed by default in the Delphi 2006 tool palette. They are still included, however; all we have to do is install the correct design-time package dcldss100.bpl, which can be found in the BDS\4.0\Bin directory:
With this package installed, the Decision Cube components are all available.
The SQL query I defined earlier will be feeding the TDecisionCube component, which in turn feeds the specific Decision Cube visual controls like the DecisionGrid, DecisionPivot and DecisionChart.
We won’t be using the last of these, but the other two will come in quite handy.
To connect the DecisionCube with the visual controls, we need a DecisionSource component in between.
Just like working with regular data-aware controls, you only need to set a few properties to tie everything together.
The DecisionPivot is used to define the dimensions and summaries that are displayed in the DecisionGrid. There are typically two sets of dimensions: one at the left side (the y-axis) and one at the top side (the x-axis). Each axis can have zero or more dimensions open at the same time, with the summaries shown as data cells. For our example SQL statement, we have two dimensions: Ext and Dir. We could have added the URL itself as a dimension as well, but that would lead to an explosion of data (each additional dimension multiplies the amount of required memory by the number of possible values in that dimension, so it’s not advisable to use dimensions with hundreds of possible values).
Having multiple summaries allows you to switch from one set of values to another. In our case, Sum(hits) and Sum(IPs) are our summaries, so we can toggle between viewing all visits (the hits) or just the unique visitors (the IPs). The next screenshot shows the hits in a DecisionGrid of Dir vs. Ext. Specifically, I’ve focussed on the PDF files in the oreilly directory, to see if these O’Reilly PDF newsletters are being read at all on the Developers Group website.
Well, from the looks of it, there are some people reading them. Some directories are visited more often, like the meetings and magazine directories (it would be a shame if they weren’t).
Now that we’ve seen that the entire directory is indeed visited, I would like to see some more details, like which individual URLs (the actual newsletters) have been read, and how many times. For this, we simply need to open the table in a “normal” grid, but as a little help I’m adding a drop-down combobox on top of the TDBGrid to allow the user to select a directory. For this, we must ensure that all available directories are listed in the drop-down combobox, which is done in the FormCreate as follows:
procedure TVisitForm.FormCreate(Sender: TObject);
begin
  cdsLogCube.Active := True;
  cdsLogFile.Active := True;
  cbDir.Sorted := True;
  cdsLogFile.First;
  while not cdsLogFile.Eof do
  begin
    if Length(cdsLogFile.FieldByName('Dir').AsString) > 1 then
      if cbDir.Items.IndexOf(cdsLogFile.FieldByName('Dir').AsString) < 0 then
        cbDir.Items.Add(cdsLogFile.FieldByName('Dir').AsString);
    cdsLogFile.Next
  end;
  cdsLogFile.First
end;

Selecting a specific directory from the drop-down combobox will assign this directory as a filter to the table, as shown in the cbDirChange event handler:
procedure TVisitForm.cbDirChange(Sender: TObject);
begin
  cdsLogFile.Filtered := False;
  if cbDir.ItemIndex > 0 then
    cdsLogFile.Filter := 'Dir = ''' + cbDir.Items[cbDir.ItemIndex] + ''''
  else
    cdsLogFile.Filter := '';
  cdsLogFile.Filtered := cdsLogFile.Filter <> ''
end;

As an example, let’s select the meetings directory and see which meeting PDF files have been read in the previous period. This can be seen below:
As you can see, the number of hits and the number of unique IPs are not equal.
So sometimes a user (or users sharing the same IP-address) has visited or downloaded the same PDF file more than once.
Which might happen if you first see the agenda for a meeting, and then later decide to go but want to check the schedule again.
To identify the ratio between the number of hits and the number of unique IPs, I’ve defined a new calculated field, visible in the normal grid with the “Average Number of Hits Per IP Visit” caption. This calculated field is of course implemented as follows:
procedure TVisitForm.cdsLogFileCalcFields(DataSet: TDataSet);
begin
  DataSet.FieldByName('HitsPerIP').AsFloat :=
    DataSet.FieldByName('Hits').AsInteger /
    DataSet.FieldByName('IPs').AsInteger;
end;
What you may also have noticed from the previous screenshot is that the first caption has a slightly different colour. This indicates by which column I’ve sorted the resulting dataset. By default you may want to have it sorted on the URL, but sometimes you may want to sort it by the number of hits, the unique IPs, or just by Dir or Ext. Note that you cannot sort it by the last field (the Average Number of Hits Per IP Visit), since that’s a calculated field.
To sort on a different column, just click on the column header in the TDBGrid. This behaviour can be implemented by writing a little code in the OnTitleClick event handler, setting the IndexFieldName of the dataset. Note that this won’t work with all kinds of DataSets, but it works just fine with a TClientDataSet. And in fact, I’ve loaded the contents of the logfile table inside an in-memory TClientDataSet to speed up the application, and make it independent of the BDE (or any other “real” database for that matter).
procedure TVisitForm.DBGrid1TitleClick(Column: TColumn);
var
  i: Integer;
begin
  for i:=0 to DBGrid1.Columns.Count-2 do
    DBGrid1.Columns[i].Title.Color := clBtnFace;
  if Column.FieldName <> 'HitsPerIP' then
  begin
    cdsLogFile.IndexFieldNames := Column.FieldName;
    Column.Title.Color := clSkyBlue
  end
  else cdsLogFile.IndexFieldNames := ''
end;

Even the result of the SQL query shown earlier, which feeds the DecisionCube, has been saved to disk and loaded into a TClientDataSet component. As long as you keep the TClientDataSet components “active” at design-time (with the contents in memory, stored in the DFM file), without a value assigned to their FileName, you can actually produce a single stand-alone executable that contains the data itself as well. Make sure to add the MidasLib unit to the uses clause of your project, so you do not even have to deploy MIDAS.DLL with the application. The only “disadvantage” is that you have to recompile and re-deploy the stand-alone executable once the data changes (i.e. when you’ve imported and analysed a new set of logfiles). But for clients and distribution purposes, this is really easy.
The final screenshot shows another example, of the magazine directory, sorted by extension.
Note that this list does not contain the regular password-protected magazine issues, but rather the free “old” 2004 issues which can be accessed freely by anyone.
I’m actually amazed by the fact that we’ve had several thousand downloads or views (even if the unique IPs indeed mean “only” a few hundred visits in the short timespan I’ve been analysing the logfiles).
Shameless plug: if a non-member one day reads this article, please consider joining – this is a nice group, and you won’t be sorry (you won't even have to live in the UK)...
Is this the end of it? Almost.
There are some ideas and new wishes.
It would be nice to have an idea of the period in which URLs were viewed.
That can be combined with the DecisionGrid as well, with either the month or the week number as an additional dimension to work with.
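The month is already part of the logged date field, and the week number can be derived from it as well; a quick sketch in Python for illustration (using the date from the sample header shown earlier):

```python
from datetime import date

# Derive month and ISO week number from a logged date field
# such as "2006-04-01", to use as an extra dimension.
log_date = "2006-04-01"
y, m, d = (int(part) for part in log_date.split("-"))
week = date(y, m, d).isocalendar()[1]

print(m)     # month: 4
print(week)  # ISO week number
```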
Finally, just as I finished writing this article, I discovered an even better way to filter out invalid URLs from the logfiles: the sc-status field contains the 200 OK code, or an error code in case the page wasn’t retrieved correctly. If we simply ignore all non-200 status lines, this will simplify the filtering of invalid URLs significantly. That’s left as an exercise for the reader. If you want the source code for the project, take a look at our downloads page (it has been visited a few times before, and we can even publish a top 10 of downloaded files if people are interested).
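For those who want a head start on that exercise, here is a minimal sketch of the status check (in Python for illustration, not the Delphi implementation; the sample lines are invented):

```python
# Keep only log lines whose sc-status field is 200 (OK).
# In the field layout shown earlier, sc-status is the
# second-to-last whitespace-separated value on each line.
lines = [
    "2006-04-01 00:01:23 GET /oreilly/news.pdf - - 10.0.0.1 HTTP/1.1 Moz - 200 1024",
    "2006-04-01 00:02:11 GET /missing.htm - - 10.0.0.2 HTTP/1.1 Moz - 404 512",
]

ok_lines = [ln for ln in lines
            if not ln.startswith("#") and ln.split()[-2] == "200"]

print(len(ok_lines))  # only the 200 OK line survives
```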
In this article, I’ve described a two-pass way to analyse IIS logfiles, putting an extract of their information first in a TStringList and then in a Paradox table (and later in an in-memory TClientDataSet). Delphi-specific techniques include using the object associated with a string in a TStringList, using the DecisionCube and related components, and filtering and sorting the data in a TDBGrid.