 Dr.Bob Examines... #103

This article was first published in The Delphi Magazine, issue #139 (March 2007).

Broken Link Detection and Web Spidering
This article was my last Under Construction column, published in The Delphi Magazine issue #139. Without TDM to write for, I have used my website(s) more often to publish articles. My main website at www.drbob42.com already contains almost 1000 HTML files (not even counting the images and other related files), with thousands and thousands of links. Keeping all of that up to date is a hard job, and that's before considering the links that become invalid, get redirected, or should otherwise be fixed or removed. Time for a little help from my favourite development tool: Delphi.
The main topic of this month is a broken link detector, used for some web spidering along the way. Way back in issue #27 (November 1997), I wrote a first TBrokenLink component using Delphi 3. At the time, my internet connection was still a 14K4 modem, and I was using a Pentium 100MHz with 16MB of RAM (later upgraded to 64MB). As a result, that TBrokenLink component, and the InterBob program that was written with it, was seldom used to actually verify links on the web - it was just too slow and would take too long to retrieve all pages (or wait for the error messages). Less than a decade later, I have an always-on multi-MB cable connection as well as an 8Mb ADSL (fallback) internet connection, using development machines that run between 2 and 3 GHz with somewhere between 1 and 2 GB of RAM. Time to revisit the wheel and see if we can design a new TSiteCheck component to check and spider an entire website. And not just for Win32: let's also do a .NET version this time (just to compare the different techniques used to connect to the web).
When I examined the code from issue #27 and opened the project in Delphi 2006, I got an error message telling me that TBrokenLink was not found (it was not installed in the Tool Palette). Back in those days, I used components for everything, even this non-visual engine stuff. While that may have seemed a good idea at the time, I've since learned that it's easier to encapsulate this kind of code in a class, and not necessarily a component. If you really need to adjust and configure the component at design time instead of runtime (a database connection component comes to mind), then a component is a good idea; otherwise a class is just as easy (like TStringList or TIniFile, neither of which is a component but merely a class).

URL Checking
The act of checking an entire site consists of a number of recursive steps: we start with a URL (like http://www.drbob42.com) and then check if the URL is a valid one (I'll get back to that later). When found to be valid, the contents of the URL is downloaded and the HTML inside is parsed, looking for hyperlinks of the format "<a href" or "<frame src". Any links found will be checked and, if valid, downloaded, ad infinitum. While easy to code as a recursive process, it could become a never-ending story if we don't add the necessary termination checks - such as whether a URL has been checked before and found valid or invalid (it's no use checking the same URL twice in the same validation run). Another criterion that helps to end the validation process is to limit the actual downloading of URL contents to URLs that share the domain with the starting URL. So if we check the site on the domain www.drbob42.com, then it makes little sense to download the contents of links on the www.codegear.com website. Doing so might be interesting, but the result is a web checker and not a site checker. And where a site with 1000 or more pages may be validated in under an hour, checking the entire web will likely take long enough to invalidate the results before the process is finished (due to the ever changing nature of the web itself). This can be used as the starting point for a spider, however, where you could define not one but a number of domains considered valid enough to pursue in the downloading and checking (or archiving) process.
Apart from the list of broken links - which is nice to know, but not really that helpful - I would also like to know in which pages the broken links actually occur. That was one of the "missing features" of the solution in issue #27, and something we'll add to the design right away this time.
And while we're at it: I found that some URLs are not broken, while they're not exactly correct either. I'm talking about URLs that are redirected, like URLs that start with the http://bdn.borland.com domain, which are redirected to the http://dn.codegear.com domain. Since the redirect happens automatically when you enter or click on the URL, the original URL is not really broken. That is, until the redirection is removed (in this example, sometime after the break between Borland and CodeGear is finalised and no more links between them exist). I found a lot more URLs that are currently redirected to other domains, so we should maintain a list of all URLs that are (automatically) redirected, allowing us to fix these links before they break.
The process of actually validating a URL is something that can be done using WinINet, Indy, .NET or whatever you want to use today. In fact, there are so many choices that the first version of the TSiteCheck class will only have a virtual abstract CheckURL method. Derived classes can override this method and implement CheckURL using any protocol (as I've done in this article).
The starting point is a class TSiteCheck (see the code listing below) with an FDomain string to hold the local domain to check (if you turn this into a TStringList then you can validate multiple interconnecting sites all at once, but I leave that as an exercise for the reader for now).
Four TStringLists are used to store the Links (URLs) which are in the (sorted) list to be checked, the URLs that have been checked and found to be valid, the URLs that have been checked but were redirected (and may require some work in the near future to fix), and finally the URLs that were not valid and are considered broken (and require immediate work to fix).

  type
    TCheckResult = (crOK, crRedirected, crBroken);

    TSiteCheck = class
    private
      FDomain: String;
    private
      FLinks: TStringList; // URLs to check
      FChecked: TStringList; // Checked URLs
      FRedirected: TStringList; // Redirected Links
      FBroken: TStringList; // Broken links
    public
      constructor Create(const NewURL: String);
      destructor Destroy; override;
      procedure Check;
      function CheckURL(const LinkURL: String;
        const OutputFileName: String = ''): TCheckResult; virtual; abstract;
    protected
      class function ConfiguredStringList: TStringList;
      function URL2FileName(const URL: String): String;
      procedure ParseHTML(const URL, OutputFileName: String);
    public
      property Domain: String read FDomain;
      property Checked: TStringList read FChecked;
      property Redirected: TStringList read FRedirected;
      property Broken: TStringList read FBroken;
    end;

In order to make sure no URL is checked twice, each TStringList must be preconfigured to be sorted, to ignore duplicates, and to not be case sensitive (note that the latter is only valid for websites on Windows servers; Linux and UNIX servers host websites with case-sensitive URLs, so you may want to modify this if your site is hosted on, or points to, a case-sensitive web server).
In order to ensure that all TStringLists are created with the correct properties, I've added a class method to the class which returns a fully configured TStringList. The code for this class method is shown below. I could have used a regular method or an external routine, but a class method at least makes it clear where it belongs.

  class function TSiteCheck.ConfiguredStringList: TStringList;
  begin
    Result := TStringList.Create;
    Result.Sorted := True;
    Result.Duplicates := dupIgnore;
    Result.CaseSensitive := False // only on Windows
  end;

As you may know, Delphi TStringLists are not limited to containing strings. In fact, each string can have an object associated with it (at the same index location). So apart from StringList.Strings[i], or StringList[i] for short, we can also use StringList.Objects[i]. And anything can be put in the Object, including another TStringList. The reason I plan to do this is to ensure that each URL has its referrer URLs associated with it. And since any URL can be referred to from many other URLs, I'm using a TStringList to store the referring URLs (a configured TStringList, since I want to ignore duplicate references in case a certain URL happens to link twice to the same other URL), as sketched below.
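A minimal sketch of this bookkeeping, as it will be used inside the TSiteCheck methods later on (i is an Integer index; NewURL and ReferringURL are just placeholder names for this illustration):

  i := FLinks.IndexOf(NewURL);
  if i < 0 then // not seen before: add it with its own (empty) referrer list
    i := FLinks.AddObject(NewURL, ConfiguredStringList);
  (FLinks.Objects[i] as TStringList).Add(ReferringURL); // remember which page linked here
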
The constructor is called with one argument: the URL to check. From this URL we can extract the domain (stripped of trailing slashes), which is used as an indicator of whether or not to download the contents of new URLs that are found while parsing downloaded HTML. The FLinks, FChecked, FRedirected, and FBroken TStringList fields are created using the class method ConfiguredStringList.

  constructor TSiteCheck.Create(const NewURL: String);
  begin
    inherited Create;
    if Pos('http://',NewURL) = 0 then FDomain := NewURL
    else // PosEx comes from the StrUtils unit
      FDomain := Copy(NewURL,1,PosEx('/', NewURL, 8)-1);
    while (Length(FDomain) > 0) and (FDomain[Length(FDomain)] = '/') do
      Delete(FDomain,Length(FDomain),1); // strip trailing /
    FLinks := ConfiguredStringList;
    FLinks.AddObject(NewURL, ConfiguredStringList); // start URL with an empty referrer list
    FChecked := ConfiguredStringList;
    FRedirected := ConfiguredStringList;
    FBroken := ConfiguredStringList
  end;

  destructor TSiteCheck.Destroy;
  var
    i: Integer;
  begin
    for i:=FLinks.Count-1 downto 0 do
      (FLinks.Objects[i] as TStringList).Free;
    FLinks.Free;

    for i:=FChecked.Count-1 downto 0 do
      (FChecked.Objects[i] as TStringList).Free;
    FChecked.Free;

    for i:=FRedirected.Count-1 downto 0 do
      (FRedirected.Objects[i] as TStringList).Free;
    FRedirected.Free;

    for i:=FBroken.Count-1 downto 0 do
      (FBroken.Objects[i] as TStringList).Free;
    FBroken.Free;
    inherited
  end;

Note that the destructor must not only free the four TStringLists, but also the TStringLists associated with the objects of each of their items. Objects associated with TStringList items are not managed (or owned) by the TStringList itself, which is why we need the for-loops to call Free on each of them. Obviously, this is only needed in the Win32 edition of the TSiteCheck class; in .NET the garbage collector will jump in as soon as the references are removed (which happens when we Free the original TStringList holding the string items).

Checking
The checking process is implemented in the Check method. Here we enter a loop which will not end until the FLinks TStringList is empty. If you want to ensure that your application is able to end (without brute force), you may want to add an "and not Application.Terminated" condition to the while loop. This ensures that the loop is terminated when the application has received a close or quit event. As long as you report the broken links found at that point, no work will be lost (I'll demonstrate this in the actual visual application later in this article).
For each URL in the FLinks collection, we call the CheckURL method. This method can return one of the TCheckResult values: crOK, crRedirected or crBroken. Depending on the result, a new entry is added to the FChecked, FRedirected or FBroken TStringList, with the current URL as string contents. However, we should not only add the URL, but also all pages that contained the URL (so far). These can be found in the associated Object at the index of the URL in the original FLinks TStringList.
Once the URL has been processed, it can be removed from the FLinks TStringList again.

  procedure TSiteCheck.Check;
  var
    URL,FileName: String;
    i: Integer;
  begin
    while (FLinks.Count > 0)
     {$IFDEF WIN32}and not Application.Terminated{$ENDIF} do
    begin
      URL := FLinks.Strings[0];
      FileName := URL2FileName(URL);
      case CheckURL(URL, FileName) of
        crOK:
          begin
            FChecked.AddObject(URL, FLinks.Objects[0]);
            if FileName <> '' then
              ParseHTML(URL, FileName) // modify FLinks and/or FBroken (404)
          end;
        crRedirected:
          begin
            FRedirected.AddObject(URL, FLinks.Objects[0])
          end;
        crBroken:
          begin
            FBroken.AddObject(URL, FLinks.Objects[0])
          end;
      end;
      FLinks.Delete(FLinks.IndexOf(URL))
    end
  end;

I only want to download the contents and parse the HTML of URLs that are neither broken nor redirected, and that are part of the local domain. In order to determine whether a URL is part of the domain and should be downloaded, I've written a little URL2FileName function. When implemented effectively, this function can also help in downloading the entire website by translating URL paths to local paths. As a starting point, I'm taking the domain (like www.drbob42.com), without the http:// prefix, adding a local disk prefix of C:\Inetpub\webroot (not using the wwwroot directory), and translating all / characters into \. This translates, for example, the URL http://www.drbob42.com/home.htm to the local disk file c:\inetpub\webroot\www.drbob42.com\home.htm, and so on.
The implementation of this URL2FileName function is as follows:

  function TSiteCheck.URL2FileName(const URL: String): String;
  var
    i: Integer;
  begin
    if (Length(URL) > Length(FDomain)) and (Pos(FDomain, URL) = 1) then
    begin
      Result := URL;
      Delete(Result,1,7); // http://
      for i:=1 to Length(Result) do
        if Result[i] = '/' then Result[i] := '\';
      Result := LocalRoot + Result
    end
    else Result := ''
  end;
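
The LocalRoot constant used on the last line is not shown in the listing; a plausible declaration - an assumption on my part, matching the path mentioned above - would be:

  const
    LocalRoot = 'C:\Inetpub\webroot\'; // local mirror root for downloaded pages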

As you can see, any URL outside the local domain is translated to an empty string, which is used as the signal not to download the contents of the URL in the CheckURL method.
Note that the CheckURL method of the TSiteCheck class is declared as virtual abstract, so we cannot create an instance of TSiteCheck and use it directly. Instead, we must derive a new class and override and implement the CheckURL method. We'll do that shortly, for both a Win32 and a .NET solution.
Until then, let's examine the ParseHTML method first, which searches the saved contents on disk for new hyperlinks.

Parsing HTML
The saved HTML file is parsed line by line, in search of tags of the format "<a href=" or "<frame src=". The search is case-insensitive (note the calls to UpperCase), but fails on links that span multiple lines (I know for a fact that I don't allow HREF tags to span multiple lines, but you may have to adjust this code if you want to check sites that use some kind of formatting tool that doesn't enforce this).
Once the start of a new hyperlink is found, we need everything up to the closing double quote. If a # character is found inside the URL, then anything from that character onwards (including the # itself) is ignored, since such fragments only reference content within the page that the base URL points to. If the URL starts with a /, then we only need to prefix the domain. Otherwise, if the URL does not already contain the :// substring (from http://), we need to prefix the relative path of the original URL (like http://www.drbob42.com/delphi/). After that, we remove all ../ and ./ parts from the URL, ending up with a well-formed URL that can be checked against the Checked, Broken, Redirected and Links lists. If the URL is already found in any of these lists, then the originating URL is added to its list of referrers (so we'll know which pages contain a broken, checked or redirected URL to modify later).

  procedure TSiteCheck.ParseHTML(const URL, OutputFileName: String);
  const
    AHREF = '<A HREF="';
    FRAME = '<FRAME SRC="';
  var
    f: TextFile;
    Str,URLRoot,NewURL: String;
    i,j,line: Integer;
  begin
    {$I-} // use IOResult checking instead of EInOutError exceptions
    AssignFile(f,OutputFileName);
    Reset(f);
    {$I+}
    if IOResult = 0 then
    begin
      URLRoot := URL;
      while (Length(URLRoot) > 0) and (URLRoot[Length(URLRoot)] <> '/') do
        Delete(URLRoot,Length(URLRoot),1);

      line := 0;
      while not eof(f) do
      begin
        Inc(line);
        readln(f,Str);

        i := Pos(AHREF,UpperCase(Str));
        j := Pos(FRAME,UpperCase(Str));
        while (i > 0) or (j > 0) do
        begin
          if (j = 0) or ((i < j) and (i > 0)) then Delete(Str,1,i+Length(AHREF)-1)
          else // (j < i) and (j > 0)
            Delete(Str,1,j+Length(FRAME)-1);

          while (Length(Str) > 0) and (Str[1] = #32) do Delete(Str,1,1);
          if (Pos('#',Str) <> 1) and
             (Pos('mailto:',Str) <> 1) and
             (Pos('ms-help:',Str) <> 1) and
             (Pos('news:',Str) <> 1) and
             (Pos('mms:',Str) <> 1) and
             (Pos('ftp:',Str) <> 1) then // skip mailto/ms-help/news/mms/ftp
          begin
            // terminate # in URL
            if Pos('#',Str) in [1..Pos('"',Str)] then Str[Pos('#',Str)] := '"';
            // add domain in front
            if (Str[1] = '/') then Str := FDomain + Str;
            // add path in front
            if not (Pos('://',Str) in [1..Pos('"',Str)]) then Str := URLRoot + Str;
            // change \ to /
            while Pos('\',Str) in [1..Pos('"',Str)] do Str[Pos('\',Str)] := '/';

            // remove XXX/../
            while Pos('../',Str) in [1..Pos('"',Str)] do
            begin
              i := Pos('../',Str);
              Delete(Str,i,3);
              repeat
                Dec(i);
                Delete(Str,i,1)
              until Str[i-1] = '/'
            end;
            // remove ./
            while Pos('./',Str) in [1..Pos('"',Str)] do
            begin
              i := Pos('./',Str);
              Delete(Str,i,2);
            end;

            // We have the new URL
            NewURL := Copy(Str,1,Pos('"',Str)-1);
            if (Length(NewURL) > 0) and (NewURL[Length(NewURL)] = '/') then
              Delete(NewURL,Length(NewURL),1);

            if (Length(NewURL) > 0) and (FChecked.IndexOf(NewURL) < 0) then
            begin // not in checked links
              i := FBroken.IndexOf(NewURL);
              if (i >= 0) then // already in broken links?
                (FBroken.Objects[i] as TStringList).
                  Add(URL + Format(' at line %d',[Line]))
              else // not yet in broken link
              begin
                i := FRedirected.IndexOf(NewURL);
                if (i >= 0) then // already in redirected links?
                  (FRedirected.Objects[i] as TStringList).
                    Add(URL + Format(' at line %d',[Line]))
                else // not yet in broken link
                begin
                  i := FLinks.IndexOf(NewURL);
                  if (i < 0) then // not in list to check
                    i := FLinks.AddObject(NewURL, ConfiguredStringList);
                  (FLinks.Objects[i] as TStringList).
                    Add(URL + Format(' at line %d',[Line]));
                end
              end
            end
          end;
          i := Pos(AHREF,UpperCase(Str));
          j := Pos(FRAME,UpperCase(Str))
        end // while
      end;
      CloseFile(f);
      if IOResult <> 0 then { skip }
    end
  end;

The ParseHTML routine adds newly found URLs to the Links list (and referrer information to URLs already present in the Broken or Redirected lists), growing the number of further steps that have to be performed to complete the check.
Since the process only stops when the Links list is empty, we should now focus on the CheckURL method (currently virtual abstract in the base TSiteCheck class), without which the Check loop cannot do its work.

CheckURLs
I want to show and demonstrate two different implementations of CheckURL: one for Win32 using WinINet, and one for .NET using the System.Net namespace. Since each can only be compiled for its respective target, we need two IFDEFs as well, as can be seen below:

  {$IFDEF WIN32}
    TWinINetSiteCheck = class(TSiteCheck)
    private
      hSession: HINTERNET; // shared WinINet session handle
    public
      destructor Destroy; override; // closes hSession
      function CheckURL(const LinkURL: string;
        const OutputFileName: string = ''): TCheckResult; override;
    end;
  {$ENDIF}

  {$IFDEF CLR}
    TDotNetSiteCheck = class(TSiteCheck)
      function CheckURL(const LinkURL: string;
        const OutputFileName: string = ''): TCheckResult; override;
    end;
  {$ENDIF}

WinINet CheckURL
Let's start with the Win32 implementation, using WinINet (available in every version of Windows, and only needing an internet connection). In order to make an internet connection, we need to call the InternetOpen API. The result of this function is a handle which can be used for any WinINet operation, until we explicitly close it using InternetCloseHandle. Calling InternetCloseHandle is done in the destructor, to increase efficiency by not having to create a new connection when verifying each new URL. We only need to check if hSession is already assigned, and if not, call InternetOpen (the first time only) to assign it.
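That destructor is not printed here; a minimal sketch, assuming hSession is the HINTERNET field of TWinINetSiteCheck shown in the class declaration above, could look like this:

  destructor TWinINetSiteCheck.Destroy;
  begin
    if Assigned(hSession) then
      InternetCloseHandle(hSession); // close the shared WinINet session handle
    inherited
  end;
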
Checking a URL starts by calling the InternetOpenURL function, passing - among others - both the hSession internet handle and the URL we need to verify. If this call doesn't return a valid handle, then we could not connect to the URL for some reason. This typically means the name could not be resolved to an IP address, or there was no response from that IP address.
If we get a valid handle from the InternetOpenURL call, we should first retrieve the actual status information of the request. This can be done using the HttpQueryInfo function, requesting HTTP_QUERY_STATUS_CODE. If this function returns True, then we can examine the resulting status code and see if it's the expected value '200' (for '200 OK'). If not, then the status code indicates a possible HTTP error explaining why we could not access the page. There are a number of HTTP error codes, the most common being 404 (page not found) and 500 (internal server error), but for the sake of simplicity I accept only '200' as a valid response, and assume a broken link for all other status codes.
However, even a status code of 200 may not always mean a valid page has been found. There are a number of examples where websites show a page that says "the page cannot be found" or "article not found" without returning the 404 status code, thereby preventing the HttpQueryInfo check from detecting the broken URL.
As a result, I always follow up with at least one call to InternetReadFile, reading a buffer of 8192 bytes (or less if there is less data available), and scanning the resulting buffer for a number of known strings that I've found to indicate a broken link. A page with a Title tag that says "The page cannot be found" or "404 Not Found" is a clear indicator. The new CodeGear Developer Network returns a nice page with the message "Article not found." in <h2> tags when an incorrect article ID has been passed. There are more examples, and I've hardcoded five of them in the listing below to ensure that any page that returns with a status code of 200 but is still clearly invalid is marked as broken. There's a small chance that a normal page will contain the strings I'm checking for, but I'm willing to take that risk (you can always verify the suspect URL before taking action to remove links to it from your web pages).

  function TWinINetSiteCheck.CheckURL(const LinkURL,
    OutputFileName: string): TCheckResult;
  const
    BufferSize = 8192;
  var
    hURL: HInternet;
    Buffer: Array[0..Pred(BufferSize)] of Char;
    BufferLength,Dummy: DWORD;
    f: File;
  begin
    Result := crBroken;
    if not Assigned(hSession) then
      hSession := InternetOpen('DrBob',INTERNET_OPEN_TYPE_PRECONFIG,nil,nil,0);

    if Assigned(hSession) then
    begin
      RedirectedURL := LinkURL;
      // last parameter (context) must be non-zero for the status callback to fire
      hURL := InternetOpenURL(hSession, PChar(LinkURL), nil,0,INTERNET_FLAG_RAW_DATA,1);

      if Assigned(hURL) then
      try
        Dummy := 0;
        BufferLength := BufferSize;
        if HttpQueryInfo(hURL, HTTP_QUERY_STATUS_CODE, @Buffer, BufferLength, Dummy) then
        begin
          if Buffer = '200' then
          begin
            if (OutputFileName <> '') then
            begin
              ForceDirectories(ExtractFilePath(OutputFileName));
              Assign(f, OutputFileName);
              Rewrite(f,1)
            end;

            repeat
              InternetReadFile(hURL, @Buffer, BufferSize, BufferLength);
              if OutputFileName <> '' then BlockWrite(f, Buffer, BufferLength)
            until (BufferLength = 0) or (OutputFileName = '');

            if (OutputFileName <> '') then Close(f);

            if ((Pos('<TITLE>The page cannot be found</TITLE>', Buffer) > 0) and
                (Pos('<h2>HTTP Error 404', Buffer) > 0)) or // IIS
              (Pos('<TITLE>404 Not Found</TITLE>', Buffer) > 0) or // Apache
              (Pos('<title>Cannot find server</title>', Buffer) > 0) or // IE6
              (Pos('<h2>Article not found.</h2>', Buffer) > 0) then
            begin
              // contents indicates invalid page!
            end
            else
            begin
              if RedirectedURL <> LinkURL then // redirected!
              begin
                Result := crRedirected
              end
              else // 200 OK
              begin
                Result := crOK;
              end
            end
          end
        end
      finally
        InternetCloseHandle(hURL)
      end
    end
  end;

Note that I'm using a (global) variable called RedirectedURL and comparing it to the passed value of LinkURL, to see if the URL has changed since we made the request. Unfortunately, this is not something that is very easy to do using the WinINet API; it actually meant I had to install a callback function.

WinINET Callback
WinINet supports the option to install an internet status callback function that will be called when the status of a request changes. We install it using the InternetSetStatusCallback function, and it has to be of the form shown below:

  var
    RedirectedURL: String = '';

  procedure InternetStatusCallback(
      hInet: HINTERNET;
      var dwContext: DWORD;
      dwInternetStatus: DWORD;
      lpvStatusInformation: Pointer;
      dwStatusInformationLength: DWORD); far; stdcall;
  begin
    if dwInternetStatus = INTERNET_STATUS_REDIRECT then
      if Assigned(lpvStatusInformation) then
        RedirectedURL := PChar(lpvStatusInformation)
  end;
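
The call that installs this callback is not shown in the article; a minimal sketch, assuming it is done right after hSession has been opened with InternetOpen, would be:

  if Assigned(hSession) then
    InternetSetStatusCallback(hSession, @InternetStatusCallback); // register interest in status changes such as redirects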

Inside the callback function, we have to be careful about touching global variables, since we may be operating in a different thread. Fortunately, as long as the main thread is simply waiting for the synchronous WinINet call to finish, we can safely write to a global variable (called RedirectedURL, as shown in the listing above).
One note on using the InternetStatusCallback function: we must pass a last value to InternetOpenURL (the context parameter) which is not equal to zero, otherwise the callback function will not be called. Also, make sure not to spend too much time in the callback function, since the main thread will be blocked (unless you plan to use asynchronous WinINet calls, which I've not done in the given examples).
In theory, the InternetStatusCallback function could also be used to give some feedback to the user, for example by updating a progress indicator. This will be done in another, much clearer, way after we've seen the .NET implementation of CheckURL.

.NET CheckURL
As a native .NET solution, we can use the HttpWebRequest class and call CreateDefault to create an instance, passing the URL we need to verify. Before we call the GetResponse method, there are a number of request properties that you may want to set. In order to bring the wait time down a bit, I've specified a Timeout value of 10000 milliseconds (10 seconds) - if no response has arrived after that time, then I don't expect any further answer anyway. Another important property to set is AllowAutoRedirect. Especially with URLs from the Borland Developer Network, for example, which are now automatically redirected to http://dn.codegear.com, it would be a bad idea to mark them as invalid (which is what would happen if redirection were not allowed).
Once we call the GetResponse method and cast the result to an HttpWebResponse instance, we can immediately check the response StatusCode to see if it's OK. Note that we do not have to check against the magic number 200, but can use the HttpStatusCode.OK enumerated value (which of course is equivalent to 200).
Also, in order to detect a redirection, we should now compare the request's RequestUri string with the response's ResponseUri string. When they are not equal, a redirection has been found (note that sometimes a "redirection" merely consists of adding a / to the URL, so you may not really want to mark that as a redirection; I leave it as an exercise for the reader to add special treatment for those situations, although a small sketch follows after the listing below).
In case the OutputFileName is not empty, we need to save the contents to a local file, and for that we can use the GetResponseStream method and a StreamReader plus StreamWriter to create the contents on disk. In short, the .NET way feels a bit more straightforward than calling the WinINet APIs (with their handles and context values), but that may be a matter of taste.

  function TDotNetSiteCheck.CheckURL(const LinkURL,
    OutputFileName: string): TCheckResult;
  var
    Req: HttpWebRequest;
    Res: HttpWebResponse;
    ResStream: Stream;
    Reader: StreamReader;
    Writer: StreamWriter;
    U: URI;
  begin
    U := URI.Create(LinkURL);
    Req := HttpWebRequest.CreateDefault(U) as HttpWebRequest;

    Req.AllowAutoRedirect := True;
    Req.Timeout := 10000; // 10 second timeout
    Req.Method := 'GET';
    Req.KeepAlive := False;
    try
      Res := Req.GetResponse as HttpWebResponse;
      if Res.StatusCode = HttpStatusCode.OK then
      begin
        if Req.RequestUri.ToString <> Res.ResponseUri.ToString then
        begin
          Result := crRedirected;
          RedirectedURL := Res.ResponseUri.ToString;
        end
        else Result := crOK;

        if OutputFileName <> '' then
        begin
          ResStream := Res.GetResponseStream;
          Reader := StreamReader.Create(ResStream);
          try
            Writer := StreamWriter.Create(OutputFileName);
            try
              Writer.Write(Reader.ReadToEnd)
            finally
              Writer.Close
            end
          finally
            Reader.Close
          end
        end
      end
      else
        Result := crBroken
    except // timeout
      Result := crBroken
    end
  end;
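
For the trailing-slash case mentioned above, one possible refinement - my own assumption, not part of the original code - would be to treat such a redirect as a plain crOK inside the redirect branch:

  // inside the redirect branch of TDotNetSiteCheck.CheckURL (sketch):
  if Res.ResponseUri.ToString = Req.RequestUri.ToString + '/' then
    Result := crOK // only a trailing slash was added; not worth reporting
  else
  begin
    Result := crRedirected;
    RedirectedURL := Res.ResponseUri.ToString
  end;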

Before trying these two implementations, I wanted to add a progress indicator for use in the Win32 and .NET applications.

Progress
Instead of using a callback function like the one the WinINet API offers, I decided to add an OnCheckProgress event property to the TSiteCheck base class. The event type itself is a procedure "of object", declared as follows:

  type
    TOnCheckProgress = procedure(LinkCount,CheckedCount,BrokenCount: Integer;
      NextURL: String) of object;
  

We need to add a private field and a published property of type TOnCheckProgress to the TSiteCheck class, as shown below:

    private
      FOnCheckProgress: TOnCheckProgress;
    published
      property OnCheckProgress: TOnCheckProgress read FOnCheckProgress
        write FOnCheckProgress;
    end;
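
The Check method listed earlier does not include the call that fires this event; a minimal sketch, assuming it is placed at the top of each loop iteration (right after the URL is taken from FLinks), would be:

    // inside TSiteCheck.Check, right after URL := FLinks.Strings[0]:
    if Assigned(FOnCheckProgress) then
      FOnCheckProgress(FLinks.Count, FChecked.Count, FBroken.Count, URL);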

There are some notable differences in using this event in the Win32 and the .NET world, as I will demonstrate now.

Console Win32 Example
Let's start with a Win32 console example, using the TWinINetSiteCheck class and assigning a value to the OnCheckProgress event handler, so we can see how far the site validation has progressed.
An event handler "of object" needs a class to hold the event handler method, which is just fine for a VCL forms application, where we can add the event handler to the Form (or rather, Delphi will do that for us), but for a Win32 console application it's a different story. To solve the issue, I usually declare a class derived from the type I want to use anyway (in this case a class TWinINetSiteCheckPlusEvent derived from TWinINetSiteCheck), and add the CheckProgress event handler to it manually. Don't forget to assign the CheckProgress event handler to the OnCheckProgress event property.

  program CheckSite;
  {$APPTYPE CONSOLE}
  uses
    Classes,
    SiteCheck in 'SiteCheck.pas';

  type
    TWinINetSiteCheckPlusEvent = class(TWinINetSiteCheck)
      procedure CheckProgress(LinkCount,CheckedCount,BrokenCount: Integer;
        NextURL: String);
    end;

  procedure TWinINetSiteCheckPlusEvent.
    CheckProgress(LinkCount,CheckedCount,BrokenCount: Integer; NextURL: String);
  begin
    writeln(CheckedCount,#32,LinkCount,#32,BrokenCount,#32);
    writeln(NextURL)
  end;

  var
    Check: TWinINetSiteCheckPlusEvent;
    i: Integer;
    f: Text;
  begin
    Check := TWinINetSiteCheckPlusEvent.Create('http://www.drbob42.com/index.htm');
    try
      Check.OnCheckProgress := Check.CheckProgress;
      Check.Check;

      Assign(f,'Win32Broken.txt');
      Rewrite(f);
      for i:=0 to Check.Broken.Count-1 do
      begin
        writeln(f,'Broken URL: ' + Check.Broken[i]);
        writeln(f,(Check.Broken.Objects[i] as TStringList).Text)
      end;
      Close(f);

      AssignFile(f,'Win32Redirected.txt');
      Rewrite(f);
      for i:=0 to Check.Redirected.Count-1 do
      begin
        writeln(f,'Redirected URL: ' + Check.Redirected[i]);
        writeln(f,(Check.Redirected.Objects[i] as TStringList).Text)
      end;
      CloseFile(f)
    finally
      Check.Free
    end
  end.

After calling the Check method - which now reports its progress - we can save the contents of the Broken and Redirected TStringList properties. Since they are sorted, we simply have to walk through them and write each string item, followed by the text of the stringlist stored in the associated object for that item. The latter stringlist contains all referring URLs whose contents include the broken or redirected link. These reports greatly simplify my work as webmaster, since I can first verify whether the link is indeed broken, and if so edit all referring URLs (the line number is reported as well, as you may remember) to remove the broken link, or modify the redirected link to avoid a broken link in the future.
For a website consisting of a few hundred pages with a few thousand links (and a few dozen broken links), the entire verification process takes more than an hour. Most of the time is spent waiting for a response (the actual processor time can be ignored, so computer speed and even internet connection speed is not really an issue). As a side effect of checking a website, the pages that belong to the specified domain are downloaded to the local disk (without the images, but these can be added if we also add some code to parse for "<img src=" tags in the HTML pages).
Let's see if the .NET solution can improve on this!

Console .NET Example
There are a few differences between the .NET console application and the Win32 console application. First of all, although the OnCheckProgress event type was declared as a method "of object", the "of object" part has no relevance in the .NET world. So instead of a method pointer, we can just assign a standalone procedure to the Check.OnCheckProgress property. This again greatly simplifies the code we have to write.

  program CheckNet;
  {$APPTYPE CONSOLE}
  uses
    Classes,
    SiteCheck in 'SiteCheck.pas';

  procedure CheckProgress(LinkCount,CheckedCount,BrokenCount: Integer;
    NextURL: String);
  begin
    writeln(#32,CheckedCount,#32,LinkCount,#32,BrokenCount,#32);
    writeln(NextURL)
  end;

  var
    TaskList: TStringList;
    Check: TSiteCheck;
    i,j: Integer;
    f: Text;
  begin
    Check := TDotNetSiteCheck.Create('http://www.drbob42.com/index.htm');
    try
      Check.OnCheckProgress := CheckProgress;
      Check.Check;

      TaskList := TStringList.Create;
      try
        TaskList.Sorted := True;
        for i:=0 to Check.Broken.Count-1 do
          for j:=0 to (Check.Broken.Objects[i] as TStringList).Count-1 do
            TaskList.Add((Check.Broken.Objects[i] as TStringList)[j] +
              ' -> ' + Check.Broken[i]);
        TaskList.SaveToFile('dotNet-BrokenTasklist.txt');

        TaskList.Clear;
        for i:=0 to Check.Redirected.Count-1 do
          for j:=0 to (Check.Redirected.Objects[i] as TStringList).Count-1 do
            TaskList.Add((Check.Redirected.Objects[i] as TStringList)[j] +
              ' -> ' + Check.Redirected[i]);
        TaskList.SaveToFile('dotNet-RedirectedTasklist.txt')
      finally
        TaskList.Free
      end
    finally
      Check.Free
    end
  end.

Again, after the call to the Check method, we can print the contents of the Broken and Redirected stringlists. However, instead of sorting them by the broken or redirected URL, it is also helpful to produce a task list, sorted by the containing URL, listing - for each web page that is part of my website - the broken and/or redirected links that I need to edit. With this task list, I can visit each page once and perform all required modifications.
Unfortunately, when running the .NET console application, it turned out to take almost twice as long as the WinINet solution. Moreover, the list of broken links was also longer than what the WinINet application produced. Double-checking showed that a number of the reported broken links were not broken at all.
Additional code showed that for some URLs, the call to "Req.GetResponse as HttpWebResponse" results in a timeout. Increasing the timeout value from 10 to 30 seconds or more didn't help. One thing most of these failing URLs had in common was that they were redirecting URLs (for example from bdn.borland.com to dn.codegear.com), but not all redirecting URLs failed, so this is not the explanation I'm looking for. At the time of writing, I still have no explanation, but I will report on my weblog once/if I find the reason and a solution, so stay tuned.

VCL for Win32 Client
The final client - the code for which can be found in the code archive of this final Under Construction article - is a Win32 VCL client, using a TProgressBar to show the progress of the site validation process (see the next listing for the OnCheckProgress event handler implementation).
Note that the value of LinkCount + CheckedCount determines the maximum position of the progress bar, so the position on the bar may appear to move to the left (when the maximum grows after parsing an HTML file) or to the right (when the position itself advances after a URL has been checked). While this still shows that something is going on, it's hard to get a clear idea of how far along the total process is, especially when newly parsed pages contain lots of links. But at least it shows what the application is currently doing.

  procedure TFrmBroken.CheckProgress(LinkCount,CheckedCount,BrokenCount: Integer;
    NextURL: String);
  begin
    ProgressBar1.Max := LinkCount + CheckedCount;
    ProgressBar1.Min := 0;
    ProgressBar1.Step := 1;
    ProgressBar1.Position := CheckedCount;
    Caption := Format('InterBob - website checker: %d broken links detected',
      [BrokenCount]);
    StatusBar1.SimpleText := Format('%d/%d - %s',
      [CheckedCount,LinkCount+CheckedCount,NextURL]);
    Memo1.Lines.Add(NextURL);
    Application.ProcessMessages
  end;

Web Spidering
A nice side effect of the broken link detection process is that it downloads the HTML files it parses and analyses. And with the URL2FileName function mapping each URL to a similar local filename, it is now easy to download entire websites (as long as pages keep linking to each other in plain HTML, and don't use JavaScript or some other technique). The only missing items are the images, but if you add checks for "<img src=" tags to the ParseHTML method, then these can be downloaded and archived too - for example with a little helper like the sketch below.
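A minimal sketch of such a check (ExtractImgSrc is a hypothetical helper of my own, not part of the original unit), which ParseHTML could call for every line it reads:

  // Returns the first <img src="..."> target found in a line of HTML, or '' if none.
  function ExtractImgSrc(const Str: String): String;
  const
    IMAGE = '<IMG SRC="';
  var
    i: Integer;
  begin
    Result := '';
    i := Pos(IMAGE, UpperCase(Str)); // case-insensitive search, like AHREF and FRAME
    if i > 0 then
    begin
      Result := Copy(Str, i + Length(IMAGE), MaxInt); // everything after <img src="
      if Pos('"', Result) > 0 then
        Result := Copy(Result, 1, Pos('"', Result) - 1) // up to the closing quote
      else
        Result := ''
    end
  end;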

Next Steps...
I've been using the applications described in this article for a few years now. Neither the Win32 nor the .NET solution is absolutely perfect, since they often contain "false positives": reporting links as broken when in fact they are not. This only happens with potentially broken links (I never get a false message about a redirection), so I do need a manual check before I remove links. And even then, a website may just be "out of order" for a little while, so there's always some careful consideration involved.
The next step in automating the process would be to load the local HTM files and be able to jump from one potentially broken or redirected link to the next in an editor, so I can modify them even more quickly.
Another extension I've been working on is the parsing of meta tags: the KEYWORDS and DESCRIPTION tags, as well as the regular TITLE tag of the HTML files. Using this information, plus the URL itself, you could easily build your own search engine (a topic I first covered in issues #29 and #30), which might be a good idea, especially for a website that is expected to keep growing with new articles. A sketch of extracting such tag contents follows below.
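As with the image check above, a hypothetical helper (my own sketch, not from the original articles) could extract the CONTENT value of a named META tag, such as 'KEYWORDS' or 'DESCRIPTION', from one line of HTML (PosEx comes from the StrUtils unit):

  function ExtractMetaContent(const Str, Name: String): String;
  var
    S: String;
    i: Integer;
  begin
    Result := '';
    S := UpperCase(Str);
    i := Pos('<META NAME="' + UpperCase(Name) + '"', S);
    if i > 0 then // meta tag found; now locate its CONTENT attribute
    begin
      i := PosEx('CONTENT="', S, i);
      if i > 0 then
      begin
        Result := Copy(Str, i + Length('CONTENT="'), MaxInt);
        if Pos('"', Result) > 0 then
          Result := Copy(Result, 1, Pos('"', Result) - 1) // up to the closing quote
        else
          Result := ''
      end
    end
  end;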

Final Words
Although this is not my final article for The Delphi Magazine (some more are scheduled to be published on the website after March 1st), it is my last Under Construction article in these pages. For more than a decade it has been more than fun, and I want to thank all readers for the feedback and comments I've received over the years. Delphi was and still is (and for a very long time will be) my main "construction" tool, and whenever I need a job done, it's still construction time.
Until we meet again, I wish you all a safe journey. Stay in touch. And don't forget to bring your towel.

