Spidering This Site

Sometimes people try running a spider program to download this entire wiki for offline browsing. Most of the time they get a few pages followed by nothing but 403 or 503 errors, and then they complain to me about it, especially if they persist and the site locks them out. This happens because MoinMoin has built-in surge protection that helps keep the server from becoming overloaded or falling into a denial-of-service state. One of the things that triggers this protection is a single IP address fetching pages faster than a configured threshold. Currently that threshold is 30 pages per 60 seconds, so if you want to spider this site you'll need spidering software that can be configured to fetch pages more slowly than that limit.
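To make the arithmetic concrete: 30 pages per 60 seconds works out to one page every 2 seconds at most, so a spider needs to pause at least that long between requests. The following is a minimal Python sketch of such a pause, not a tested recipe for this site. The names `min_delay` and `Throttle` and the 1.5x safety margin are assumptions made up for illustration; only the 30-per-60-seconds figure comes from this page.

```python
import time

# Threshold stated on this page: 30 pages per 60 seconds from one IP address.
MAX_PAGES = 30
WINDOW_SECONDS = 60.0


def min_delay(max_pages=MAX_PAGES, window=WINDOW_SECONDS, safety=1.5):
    """Return the number of seconds to wait between requests.

    `safety` is an assumed margin (not part of the wiki's configuration)
    that keeps the request rate comfortably below the threshold.
    """
    return (window / max_pages) * safety


class Throttle:
    """Make sure successive calls to wait() are at least `delay` seconds apart."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock    # injectable so the class can be tested without sleeping
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A spider would create one `Throttle(min_delay())` and call its `wait()` method immediately before every page fetch. Many off-the-shelf spidering tools have an equivalent "wait between requests" setting, which is the simpler route if yours has one.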

Although that limit might seem a bit severe, the real problem is that a wiki offers lots of ways to get to the same data, so your spider will end up fetching the same page, or metadata about the same page, multiple times, and you'll download a lot more data than you need. This is because these tools are pretty dumb by default and will download everything they can reach via links from the starting point you specify. For a MoinMoin wiki that means fetching all of the previous revisions of a page, the raw text of the page, the tools to edit and print the page, and so on. So another feature you will want in your spidering software is the ability to filter out URLs based on patterns. Once all of these unnecessary URLs are filtered out, the pages-per-minute threshold shouldn't be too bad, because you'll be fetching only a fraction of the entire site. Here is a list of strings that you should configure to be filtered out if the string is found in the URL:

NOTE: I haven't yet tried this myself; I'm just basing this advice on what I've read online and in the MoinMoin source code. If you do try this, please feel free to post updates above, or just add some comments about your experience.
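As an illustration of the URL filtering idea, here is a minimal Python sketch that drops any URL containing one of a set of substrings. The two substrings used, `action=` and `rev=`, are assumptions based on the way MoinMoin typically exposes raw, print, edit, diff, info and old-revision views through query parameters; they are not a verified list for this site, and the `wiki.example.org` URLs and function names are hypothetical.

```python
# Assumed substrings marking non-content or duplicate views in a MoinMoin
# wiki (e.g. ?action=raw, ?action=print, ?action=edit, old revisions).
# Check them against the URLs your spider actually encounters.
SKIP_SUBSTRINGS = ("action=", "rev=")


def should_fetch(url, skip=SKIP_SUBSTRINGS):
    """Return True if none of the skip substrings is found in the URL."""
    return not any(s in url for s in skip)


def filter_urls(urls, skip=SKIP_SUBSTRINGS):
    """Keep only URLs worth fetching, dropping duplicates but preserving order."""
    seen = set()
    kept = []
    for url in urls:
        if url in seen or not should_fetch(url, skip):
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

Plain substring matching is used because that is how the advice above is phrased ("if the string is found in the URL"). It is deliberately crude: it would also skip a page whose name happened to contain one of the substrings, which seems an acceptable trade for a personal offline copy.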


SpideringThisSite (last edited 2008-03-11 10:50:34 by localhost)
