Source Code for Broken Links Traversal

Page history last edited by Tim 15 years, 12 months ago

This is a C# .NET Console Application for generating the Most Wanted and Broken Links pages. If you do not have access to .NET, see Tim's C++ implementation at auTIMator, Broken Link Generator. The C# application is a child of Tim's project and wouldn't exist without that inspiration.

The C# implementation differs from the C++ one in that it can read directly from the .zip file, so there's no need to extract it first. I have to mention how cool the SharpZipLib project is; I can't imagine ever taking the time to figure out how to access ZIP files myself. It's a great project, and it is itself written in C#.
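To illustrate what reading directly from the ZIP means, here is a minimal sketch. It is not the code used by traverse: it swaps SharpZipLib for the System.IO.Compression classes built into current .NET, so that it has no external dependency, and the names ZipTextSketch and ReadLines are made up for this example.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class ZipTextSketch
{
    // Returns the lines of a text entry inside a ZIP stream, or an empty
    // array when the entry does not exist. Nothing is extracted to disk.
    public static string[] ReadLines(Stream zipStream, string entryName)
    {
        // leaveOpen = true so the caller keeps ownership of the stream.
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryName);
            if (entry == null)
            {
                return new string[0];
            }
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                // Normalise Windows line endings before splitting.
                string text = reader.ReadToEnd().Replace("\r", "");
                return text.Split('\n');
            }
        }
    }
}
```

The shape mirrors the ZipReader class in the listing further down: look up an entry by name, open its stream, and split the text into lines.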

Build Notes

If you have access to a .net compiler, here's what to do. As usual with these things, if something breaks and sets your computer on fire or anything like that, I take no responsibility:

  1. Download and extract SharpZipLib - The precompiled DLL version is fine right now.
  2. Create a new C# Console Application
  3. Add a reference to System.Windows.Forms - This is needed even in a Console Application to access the System.Windows.Forms.Application.StartupPath property.
  4. Add a reference to the ICSharpCode.SharpZipLib.dll
  5. Rename the Class1.cs file to traverse.cs
  6. Copy/Paste the below source code, overwriting everything in the traverse.cs file.

The project should then compile. ICSharpCode.SharpZipLib.dll needs to be in the same folder as traverse.exe for it to run.

Generating the Pages

The simplest way to generate the files is:

  1. Download a Backup
  2. Put the ZIP file in the same folder as traverse and run the exe.

With no parameters specified, the application uses its StartupPath for disk I/O and looks for the ZIP file by listing *.zip in StartupPath and picking the newest file in the list, as decided by file creation date.
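The "pick the newest ZIP" rule can be sketched as follows. This is a standalone illustration, not an excerpt from traverse; the class and method names are hypothetical.

```csharp
using System;
using System.IO;

static class NewestZipSketch
{
    // Returns the file name (without directory) of the most recently created
    // *.zip file in the folder, or an empty string when there is none.
    public static string Find(string folder)
    {
        string newestName = "";
        DateTime newestTime = DateTime.MinValue;
        foreach (string path in Directory.GetFiles(folder, "*.zip"))
        {
            DateTime created = File.GetCreationTime(path);
            if (created > newestTime)
            {
                // Track both the time and the name of the newest file so far.
                newestTime = created;
                newestName = Path.GetFileName(path);
            }
        }
        return newestName;
    }
}
```

The important detail is that the running "newest so far" timestamp is updated in the same branch that records the file name; otherwise every file compares as newer than the initial value and the last file listed wins.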

There are several command line parameters that you can use to change this default behavior:

  • -a or /a - Treat all links as broken (spits out an index of how the entire site is crosslinked)
  • -i or /i - Specify a different input path
  • -o or /o - Specify a different output path
  • -p or /p - Specify both paths pointing to the same location
  • -z or /z (or any "non-param", see note below) - Specify ZIP file name
  • -m or /m - Specify Rank interval, e.g. -m 6 will list pages with 6 or more references when building missing_ranked.dat. The default is 3.
  • -d or /d - View debug output (the application pauses to let you view certain info; hit Enter when prompted)
  • -w or /w - Specify wiki name. Required if using this on a wiki other than elothtes.

Note: If any unrecognised parameter is specified, it is assumed to be the ZIP filename, e.g. "traverse -m 5 foo.zip" and "traverse -m 5 -z foo.zip" do exactly the same thing and are both valid syntax; hence "traverse -z foo.zip" and "traverse foo.zip" are valid parameter syntax too.
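The parameter rules above can be sketched like this. It is a simplified illustration of the documented behaviour rather than the parsing code in the listing below; TraverseOptions and ArgSketch are hypothetical names, and a switch whose value is missing is simply ignored, which is an assumption of this sketch.

```csharp
using System;

class TraverseOptions
{
    public string InputPath = "";
    public string OutputPath = "";
    public string ZipFileName = "";
    public string WikiName = "elothtes";
    public int RankCutoff = 3;
    public bool AllLinks = false;
    public bool Debug = false;
}

static class ArgSketch
{
    public static TraverseOptions Parse(string[] args)
    {
        var opts = new TraverseOptions();
        const string valueSwitches = "iopzmw";
        for (int i = 0; i < args.Length; i++)
        {
            string arg = args[i];
            bool isSwitch = arg.Length == 2 && (arg[0] == '-' || arg[0] == '/');
            if (!isSwitch)
            {
                // Anything that is not a switch is taken as the ZIP filename.
                opts.ZipFileName = arg;
                continue;
            }
            char letter = arg[1];
            if (letter == 'a') { opts.AllLinks = true; continue; }
            if (letter == 'd') { opts.Debug = true; continue; }
            if (valueSwitches.IndexOf(letter) < 0)
            {
                // Unrecognised switch: also treated as the ZIP filename.
                opts.ZipFileName = arg;
                continue;
            }
            if (i + 1 >= args.Length)
            {
                continue; // value missing, ignore the switch (assumption)
            }
            string value = args[++i];
            switch (letter)
            {
                case 'i': opts.InputPath = value.TrimEnd('\\'); break;
                case 'o': opts.OutputPath = value.TrimEnd('\\'); break;
                case 'p':
                    opts.InputPath = value.TrimEnd('\\');
                    opts.OutputPath = opts.InputPath;
                    break;
                case 'z': opts.ZipFileName = value; break;
                case 'm':
                    int n;
                    if (int.TryParse(value, out n)) { opts.RankCutoff = n; }
                    break;
                case 'w': opts.WikiName = value; break;
            }
        }
        return opts;
    }
}
```

With this sketch, "-m 5 foo.zip" and "-m 5 -z foo.zip" produce the same options, matching the note above.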

If input path is specified but ZIP filename is not, the application will list all ZIP files in the input path and pick the newest one, decided by file creation date.
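The rank interval set with -m (see the parameter list above) decides which missing pages make it into missing_ranked.dat. A rough sketch of that filtering, using hypothetical names and a plain list of missing link names as input, might look like this:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RankSketch
{
    // Counts how often each missing page is referenced and returns the
    // pages with at least `cutoff` references, most-referenced first.
    public static List<KeyValuePair<string, int>> MostWanted(
        IEnumerable<string> missingLinks, int cutoff)
    {
        var counts = new Dictionary<string, int>();
        foreach (string link in missingLinks)
        {
            int seen;
            counts.TryGetValue(link, out seen); // seen stays 0 for new keys
            counts[link] = seen + 1;
        }
        return counts
            .Where(pair => pair.Value >= cutoff)
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.Ordinal)
            .ToList();
    }
}
```

For example, with the default cutoff of 3, a page referenced four times and a page referenced three times are both listed, in that order, while a page referenced once is left out.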

Excluded Files

If you think a particular page on the Wiki should be excluded from the traversal process, please add it to the list maintained on the ExcludedPages page.
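As a rough idea of how such a list can be applied, here is a small sketch modelled on the ExcludedList class in the listing below: entries are trimmed, lower-cased and stripped of "~", lines starting with "//" are treated as comments, and a page is excluded when its name contains any entry. The class and method names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;

class ExclusionSketch
{
    private readonly List<string> patterns = new List<string>();

    // Loads entries from the lines of the exclusion page.
    public void Load(IEnumerable<string> lines)
    {
        foreach (string line in lines)
        {
            string entry = line.Trim();
            if (entry.Length == 0 || entry.StartsWith("//", StringComparison.Ordinal))
            {
                continue; // skip blank lines and comment lines
            }
            patterns.Add(entry.ToLowerInvariant().Replace("~", ""));
        }
    }

    // True when the page name contains any entry (case-insensitive).
    public bool IsExcluded(string pageName)
    {
        string candidate = pageName.ToLowerInvariant();
        foreach (string pattern in patterns)
        {
            if (candidate.Contains(pattern))
            {
                return true;
            }
        }
        return false;
    }
}
```

Because the match is a substring test, a short entry also excludes every page whose name merely contains it, so entries are best kept specific.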

The Source Code

Here it is:

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections;
using System.Windows.Forms;
using ICSharpCode.SharpZipLib.Zip;

/*
 * traverse   C# .net   Console Application
 * 
 * Requires a Reference to System.Windows.Forms         (To get StartUpPath)
 * Requires a Reference to ICSharpCode.SharpZipLib.Zip  (#ZipLib DLL)
 * 
 * #ZipLib download:
 * http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx
 * 
 */

namespace traverse
{
        class Strings
        {
                public static string strWikiName = "elothtes";
                public static string strFilenamePrefix = "/pages/";
                public const string strFilenamePostfix = "/current";
        }

        class ZipReader
        {
                ZipFile zpFile;

                public ZipReader(string strZipFilename)
                {
                        this.zpFile = new ZipFile(strZipFilename);
                }

                public bool FileExists(string strFindFilename)
                {
                        if (this.zpFile.FindEntry(strFindFilename, true) >= 0)
                        {
                                return true;
                        }
                        else
                        {
                                return false;
                        }
                }

                public string[] GetTextFile(string strTextFilename)
                {
                        try
                        {
                                ZipEntry zpeExPages = this.zpFile.GetEntry(strTextFilename);
                                BinaryReader brGetData = new BinaryReader(this.zpFile.GetInputStream(zpeExPages));
                                int intSize = 2048;
                                byte[] abData = new byte[2048];
                                string strData = "";
                                while((intSize = brGetData.Read(abData, 0, abData.Length)) > 0)
                                {
                                        strData += Encoding.Default.GetString(abData, 0, intSize);
                                }
                                //Normalise line endings, drop backslashes, then split into lines
                                strData = strData.Replace("\r","");
                                strData = strData.Replace(@"\","");
                                return strData.Split('\n');
                        }
                        catch
                        {
                                return new string[0];
                        }
                }

                public ZipFile ZipFile
                {
                        get
                        {
                                return this.zpFile;
                        }
                }
        }

        class ExcludedList
        {
                ArrayList alExcludedPages = new ArrayList();

                public ExcludedList(string strFn)
                {
                        this.alExcludedPages.Add(strFn.Trim().ToLower());
                }

                public void Add(string strEntry)
                {
                        strEntry = strEntry.Replace(Strings.strFilenamePrefix,"");
                        strEntry = strEntry.Replace(Strings.strFilenamePostfix,"");
                        if ((strEntry != null) && (strEntry != ""))
                        {
                                this.alExcludedPages.Add(strEntry.Trim().ToLower().Replace("~",""));
                        }
                }

                public bool IsInList(string strIs)
                {
                        strIs = strIs.Replace(Strings.strFilenamePrefix,"");
                        strIs = strIs.Replace(Strings.strFilenamePostfix,"");
                        if (this.alExcludedPages.Count > 0)
                        {
                                foreach(string strMatch in this.alExcludedPages)
                                {
                                        if (strIs.ToLower().IndexOf(strMatch) >= 0) 
                                        { 
                                                return true;
                                        }
                                }
                                return false;
                        }
                        else
                        {
                                return false;
                        }
                }
        }

        class traverse
        {
                public const bool bShowOutput = true;
                public static bool bDebugMode = false;
                public static bool bIncludeDupLinks = false;
                public static bool bIncludeAllLinks = false;
                public static int intMRCutoff = 3;
                public static ExcludedList exlPages = new ExcludedList(".dat");
                public static ZipReader zrWiki;
                public static bool bLastFileParseDeadend = false;
                public static bool bSummaryOnly = false;
                public static string szEmail;
                public static string szName;
                public static string szDate;
                public static string szPage;
                public static string szOgDate;

                [STAThread]
                static void Main(string[] args)
                {
                        //Parameter Variables
                        string strBothPath = "";     //-p or /p
                        string strInputPath = "";    //-i or /i
                        string strOutputPath = "";   //-o or /o
                        string strZipFileName = "";  //-z or /z or <filename>
                        //Read Any Command-Line Parameters
                        int intXN = 0;
                        for(int intX=0;intX<args.Length;intX++)
                        {
                                Char firstChar = '0';                           
                                Char paramLetter = '0';
                                if(args[intX].Length==2)
                                {
                                        firstChar = args[intX][0];
                                        paramLetter = args[intX][1];
                                }

                                intXN = intX+1;

                                if(firstChar!='-' && firstChar!='/')
                                        continue;

                                if (intX+1<args.Length) // parameters that require an additional following parameter
                                {
                                        switch(paramLetter)
                                        {
                                                case 'p':
                                                        strBothPath = args[intXN].TrimEnd('\\');
                                                        args[intXN] = "";
                                                        continue;
                                                case 'i':
                                                        strInputPath = args[intXN].TrimEnd('\\');
                                                        args[intXN] = "";
                                                        continue;
                                                case 'o':
                                                {
                                                        strOutputPath = args[intXN].TrimEnd('\\');
                                                        args[intXN] = "";
                                                        continue;
                                                }
                                                case 'z':
                                                {
                                                        strZipFileName = args[intXN];
                                                        args[intXN] = "";
                                                        continue;
                                                }
                                                case 'm':
                                                {
                                                        try
                                                        {
                                                                intMRCutoff = Int32.Parse(args[intXN]);
                                                        }
                                                        catch {}
                                                        args[intXN] = "";
                                                        continue;
                                                }
                                                case 'w': // wiki name
                                                {
                                                        Strings.strWikiName = args[intXN];
                                                        args[intXN] = "";
                                                        continue;
                                                }
                                        }
                                }
                                
                                if (paramLetter=='a')
                                {
                                        bIncludeAllLinks=true;
                                }
                                else if (paramLetter=='s')
                                {
                                        bSummaryOnly = true;
                                }
                                else if (paramLetter == 'd')
                                {
                                        bDebugMode = true;
                                }
                                else if (paramLetter == 'i')
                                {
                                        bIncludeDupLinks = true;
                                }
                                else
                                {
                                        Console.WriteLine("Ignored argument: "+paramLetter);
                                        //if ((args[intX].Trim() != "")) { strZipFileName = args[intX]; }
                                }
                        }

                        Strings.strFilenamePrefix = Strings.strWikiName + Strings.strFilenamePrefix;

                        //Use Application Startup Path if no params supplied
                        //Or if one or the other is supplied
                        if ((strInputPath == "") && (strOutputPath == ""))
                        {
                                strBothPath = Application.StartupPath;
                        }
                        else if ((strInputPath != "") && (strOutputPath == ""))
                        {
                                strOutputPath = Application.StartupPath;
                        }
                        else if ((strInputPath == "") && (strOutputPath != ""))
                        {
                                strInputPath = Application.StartupPath;
                        }

                        //Yes, -p is supposed to overwrite -i/-o
                        if (strBothPath != "")
                        {
                                strInputPath = strBothPath;
                                strOutputPath = strBothPath;
                        }

                        //Durrr no ZIP file specified...
                        if (strZipFileName == "")
                        {
                                if (Directory.Exists(strInputPath))
                                {
                                        //...so try and pick the newest ZIP file in the folder!
                                        DateTime dtLast = DateTime.MinValue;
                                        foreach(string strZipFile in Directory.GetFiles(strInputPath, "*.zip"))
                                        {
                                                DateTime dtCreated = File.GetCreationTime(strZipFile);
                                                if (dtCreated > dtLast)
                                                {
                                                        //Remember the newest file seen so far
                                                        dtLast = dtCreated;
                                                        strZipFileName = Path.GetFileName(strZipFile);
                                                }
                                        }

                                }
                        }

                        Hashtable htLinks = new Hashtable();
                        Hashtable htMostWanted = new Hashtable();
                        ArrayList alLinks = new ArrayList();
                        string strLink = "";
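                        //Path.Combine supplies the separator trimmed off the paths above
                        string strDDead = Path.Combine(strOutputPath, "deadends.dat");
                        string strDOut = Path.Combine(strOutputPath, "out.dat");
                        string strDMRank = Path.Combine(strOutputPath, "missing_ranked.dat");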
                        string strDExPages = Strings.strFilenamePrefix + @"ExcludedPages" + Strings.strFilenamePostfix;
                        string strZipFilePath = strInputPath + @"\" + strZipFileName;

                        StreamWriter srDead = new StreamWriter(strDDead, false);

                        if ((bDebugMode) && (bShowOutput))
                        {
                                Console.WriteLine("Input Path  : {0}", strInputPath);
                                Console.WriteLine("Output Path : {0}", strOutputPath);
                                Console.WriteLine("ZIP Filename: {0}", strZipFileName);
                                Console.WriteLine("Press **Enter** to Continue...");
                                Console.ReadLine();
                        }

                        if (File.Exists(strZipFilePath))
                        {
                                zrWiki = new ZipReader(strZipFilePath);
                                if (zrWiki.FileExists(strDExPages))
                                {
                                        exlPages.Add(strDExPages);
                                        exlPages.Add("elothtes/pages/Dooblegnards/current");
                                        foreach(string strLine in zrWiki.GetTextFile(strDExPages))
                                        {
                                                if (!strLine.Trim().StartsWith("//")) { exlPages.Add(Strings.strFilenamePrefix + strLine + Strings.strFilenamePostfix); }
                                        }
                                }

                                StreamWriter srSummary = new StreamWriter(Path.Combine(strOutputPath, "summary.dat"), false);
                                srSummary.WriteLine("|**Date Code**| **User** | **Page** |");
                                srSummary.Close();


                                string strFn = "";
                                string directoryName = "";
                                foreach(ZipEntry zpeFileEntry in zrWiki.ZipFile)
                                {
                                        strFn = zpeFileEntry.Name.ToString().Trim();
                                        if(strFn.Length!=0 && strFn[strFn.Length-1]=='/')
                                                continue; // empty directory in zip file

                                        if (!exlPages.IsInList(strFn))
                                        {
                                                if (bShowOutput) { Console.WriteLine("Reading: " + strFn); }
                                                alLinks = ParsePage(strFn);

                                                strFn = strFn.Replace(Strings.strFilenamePrefix,"");
                                                if (bSummaryOnly)
                                                        directoryName = (strFn.Length!=0)?strFn.Substring(0,strFn.IndexOf('/')):"";
                                                strFn = strFn.Replace(Strings.strFilenamePostfix,"");

                                                if(bLastFileParseDeadend)
                                                        srDead.WriteLine("* [" + strFn + "]");
                                        
                                                srSummary = new StreamWriter(Path.Combine(strOutputPath, "summary.dat"), true);
                                                srSummary.WriteLine("<!--" + szOgDate +"-->\t" + szDate + "\t|" +  szName + "|\t[" + szPage + "]" + (bSummaryOnly?"\t" + directoryName:""));
                                                srSummary.Close();

                                                for(int intX=0;intX<alLinks.Count;intX++)
                                                {
                                                        strLink = alLinks[intX].ToString().Trim();
                                                        if (htLinks.ContainsKey(strLink))
                                                        {
                                                                htLinks[strLink] = "[" + strFn + "]. " + htLinks[strLink];
                                                                htMostWanted[strLink] = (int)htMostWanted[strLink] + 1;
                                                        }
                                                        else
                                                        {
                                                                htLinks[strLink] = "[" + strFn + "]";
                                                                htMostWanted[strLink] = 1;
                                                        }
                                                }
                                        }
                                }

                                int intTBLinks = htLinks.Keys.Count;
                                for(int intX=65;intX<91;intX++) { if (!htLinks.ContainsKey(((char)intX).ToString())) htLinks.Add(((char)intX).ToString(), "!!!"); }
                                Array arLinksKeys = Array.CreateInstance(typeof(string), htLinks.Keys.Count);
                                htLinks.Keys.CopyTo(arLinksKeys, 0);
                                Array.Sort(arLinksKeys);

                                Array arMWKeys = Array.CreateInstance(typeof(string), htMostWanted.Keys.Count);
                                Array arMWValues = Array.CreateInstance(typeof(int), htMostWanted.Values.Count);
                                htMostWanted.Keys.CopyTo(arMWKeys, 0);
                                htMostWanted.Values.CopyTo(arMWValues, 0);
                                Array.Sort(arMWValues, arMWKeys);
                                Array.Reverse(arMWKeys);
                                Array.Reverse(arMWValues);

                                string strKeyFormatted = "";
                                string strValueFormatted = "";
                                Regex rxRTilde = new Regex(@"((?:[A-Z][a-z0-9]+[A-Z][a-z0-9]+)+)", RegexOptions.None);
                                StreamWriter srDOut = new StreamWriter(strDOut, false);
                                srDOut.WriteLine("''There are currently **{0}** Broken Links listed!''", intTBLinks.ToString());
                                foreach(string strKey in arLinksKeys)
                                {
                                        strKeyFormatted = strKey.Trim();
                                        strKeyFormatted = rxRTilde.Replace(strKeyFormatted, "~$1", -1);
                                        strValueFormatted = htLinks[strKey].ToString().Trim();
                                        strValueFormatted = rxRTilde.Replace(strValueFormatted, "~$1", -1);
                                        if (strValueFormatted == "!!!")
                                        {
                                                srDOut.WriteLine("!!!" + strKeyFormatted);
                                        }
                                        else
                                        {
                                                if (bShowOutput)
                                                {
                                                        Console.WriteLine("Page: " + strKey);
                                                        Console.WriteLine("Link Tree: " + strValueFormatted);
                                                }
                                                srDOut.WriteLine("* [" + strKeyFormatted + "]: " + strValueFormatted);
                                        }
                                }
                                srDOut.Close();

                                StreamWriter srDMRank = new StreamWriter(strDMRank, false);
                                int lastNum = 0;
                                for(int intX=0;intX<arMWKeys.Length;intX++)
                                {
                                        if (Int32.Parse(arMWValues.GetValue(intX).ToString().Trim()) >= intMRCutoff)
                                        {
                                                strKeyFormatted = arMWValues.GetValue(intX).ToString().Trim();
                                                strKeyFormatted = rxRTilde.Replace(strKeyFormatted, "~$1", -1);
                                                strValueFormatted = htLinks[arMWKeys.GetValue(intX).ToString()].ToString().Trim();
                                                strValueFormatted = rxRTilde.Replace(strValueFormatted, "~$1", -1);
                                                if (bShowOutput)
                                                {
                                                        Console.WriteLine("Page: " + arMWKeys.GetValue(intX).ToString());
                                                        Console.WriteLine("RefCount: " + strKeyFormatted);
                                                }
                                                if(lastNum!= Convert.ToInt32(strKeyFormatted))
                                                {
                                                        lastNum = Convert.ToInt32(strKeyFormatted);
                                                        srDMRank.WriteLine("!! {0}",strKeyFormatted);
                                                }
                                                srDMRank.WriteLine("** [{0}]\n*** {1}\n", arMWKeys.GetValue(intX).ToString(), strValueFormatted);
                                        }
                                        else
                                        {
                                                break;
                                        }
                                }
                                srDead.Close();

                                srDMRank.Close();
                        }
                        else
                        {
                                if (bShowOutput) { Console.WriteLine("Fatal Error: Zip Data File Not Found!"); }
                        }
                }

                static ArrayList ParsePage(string strPFn)
                {
                        string strWLink = "";
                        //Matches "[" followed by the shortest text up to "|", "#" or "]"
                        Regex rxParseLinks = new Regex(@"\[(.+?)[|#\]]", RegexOptions.IgnoreCase);
                        ArrayList alPLinks = new ArrayList();
                        bool bExistsWithTilde = false;                  

                        bLastFileParseDeadend = true;
                        szEmail = "";
                        szName = "";
                        szDate = "";
                        szOgDate = "";
                        szPage = "";

                        foreach(string strPLine in zrWiki.GetTextFile(strPFn))
                        {
                                if (strPLine.StartsWith("name:"))
                                {
                                        szName = strPLine.Substring(5);
                                }
                                else if (strPLine.StartsWith("time:"))
                                {
                                        szOgDate = strPLine.Substring(5);
                                        int i =  Convert.ToInt32(szOgDate);
                                        DateTime tmpBob = new DateTime(1970, 1, 1,0,0,0,0);
                                        tmpBob = tmpBob.AddSeconds(i);
                                        szDate = tmpBob.ToShortDateString() +" "+ tmpBob.ToShortTimeString();
                                }
                                else if (strPLine.StartsWith("email:"))
                                {
                                        szEmail = strPLine.Substring(6);
                                }
                                else if (strPLine.StartsWith("page:"))
                                {
                                        szPage = strPLine.Substring(5);
                                }
                                else if (bSummaryOnly)
                                        continue;
                                else foreach(Match mWLink in rxParseLinks.Matches(strPLine))
                                {
                                        for(int intX=1;intX<mWLink.Groups.Count;intX++)
                                        {
                                                for(int intY=0;intY<mWLink.Groups[intX].Captures.Count;intY++)
                                                {
                                                        bLastFileParseDeadend = false;

                                                        strWLink = mWLink.Groups[intX].Captures[intY].Value.ToString().Trim();
                                                        strWLink = strWLink.Replace("[", "");
                                                        strWLink = strWLink.Replace("]", "");
                                                        if (!bIncludeAllLinks && zrWiki.FileExists(Strings.strFilenamePrefix + strWLink + Strings.strFilenamePostfix))
                                                        {
                                                                bExistsWithTilde = true;
                                                        }
                                                        else
                                                        {
                                                                bExistsWithTilde = false;
                                                        }
                                                        strWLink = strWLink.Replace("~", "");
                                                        if (!strWLink.StartsWith("http://") && !strWLink.StartsWith("https://") && (strWLink.IndexOf("@") < 0))
                                                        {
                                                                if (bIncludeDupLinks || !alPLinks.Contains(strWLink))
                                                                {
                                                                        if (bIncludeAllLinks || ((bExistsWithTilde == false) && !zrWiki.FileExists(Strings.strFilenamePrefix + strWLink + Strings.strFilenamePostfix)))
                                                                        {
                                                                                if (!exlPages.IsInList(strWLink))
                                                                                {
                                                                                        alPLinks.Add(strWLink);
                                                                                        if (bShowOutput) { Console.WriteLine("Missing: " + strWLink); }
                                                                                }
                                                                        }
                                                                }
                                                        }
                                                }
                                        }
                                }
                        }
                        return(alPLinks);
                }
        }
}
