
In this article we'll cover two features you can use to improve the efficiency of your spiders by restricting which pages are crawled and preventing duplicate pages from being scraped via different URLs.

Link Following Rules

By default, spiders created with Portia will follow all in-domain links. It’s not uncommon for the majority of pages on the target website to be irrelevant, and ignoring those pages can have a dramatic effect on crawling efficiency.

To control which links are followed and which are ignored, you can set following patterns. Following patterns are configured in the Crawling section of your spider:

Link following patterns are defined with regular expressions. Regular expressions define a search pattern to match against text, and allow you to define very specific patterns which wouldn’t be possible with wildcard matching.

Here are some examples which should give you a good idea of how you can define your own patterns:

The following:

/product/\d+

Would match against:

/product/123
/product/123-product-name

Would not match against:

/product/example-123
/product/some-product

The following:

/product/\d+$

Would match against:

/product/123

Would not match against:

/product/123-product-name

The following:

/(category|product)/\d+

Would match against:

/category/123
/product/123-product-name

Would not match against:

/category/example-123
/product/product-name

You may have noticed that \d+ denotes a number. More specifically, it denotes one or more digits: \d matches a single digit, and + is a quantifier meaning one or more of the preceding token. The pipe symbol (‘|’) means ‘or’, so (category|product) matches the text 'category' or 'product'. The dollar symbol ($) anchors the match to the end of the line.
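Portia applies these patterns for you, but a quick way to experiment with them locally is Python's re module, which uses a compatible regular expression syntax. This sketch checks the example patterns above against the example paths:

```python
import re

urls = [
    "/product/123",
    "/product/123-product-name",
    "/product/example-123",
    "/category/123",
]

# One or more digits after /category/ or /product/
pattern = re.compile(r"/(category|product)/\d+")
matches = [u for u in urls if pattern.search(u)]
print(matches)  # ['/product/123', '/product/123-product-name', '/category/123']

# Adding $ anchors the digits to the end of the path
strict = re.compile(r"/product/\d+$")
print(bool(strict.search("/product/123")))               # True
print(bool(strict.search("/product/123-product-name")))  # False
```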

These are just some basic examples; you can find a more in-depth tutorial on regular expressions here.

Query Cleaner Addon

The query cleaner addon works similarly to Portia’s link following rules, only rather than affecting which links to follow, it instead allows you to strip parameters from URLs to avoid duplicates using the following settings:

QUERYCLEANER_REMOVE
QUERYCLEANER_KEEP

The QUERYCLEANER_REMOVE setting lets you specify which fields should be removed, and QUERYCLEANER_KEEP lets you specify those which should be kept. If QUERYCLEANER_KEEP is set, only parameters listed within its value will be kept; any others will be removed. You can specify the fields using a regex OR (‘|’); for example, ‘search|login|postid’ will match the search, login and postid fields.

Take the following example:
http://www.example.com/product.php?pid=135&cid=12&ttda=12
Suppose we only care about the pid field. There are two ways we could keep it while excluding the others: with QUERYCLEANER_REMOVE we could specify ‘cid|ttda’, or with QUERYCLEANER_KEEP we could specify ‘pid’. In this case the latter is the better option, as it’s shorter and there may be other unwanted parameters on other pages which we also want to strip.
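The addon does this filtering for you, but to make the effect concrete, here is a small illustrative Python function (a hypothetical helper, not part of the addon, and the addon's exact matching semantics may differ) that mimics keep/remove filtering using only the standard library:

```python
import re
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def clean_query(url, keep=None, remove=None):
    """Strip query parameters, keeping or removing those whose
    name fully matches the given regex (e.g. 'cid|ttda')."""
    parts = urlparse(url)
    params = parse_qsl(parts.query)
    if keep is not None:
        keep_re = re.compile(keep)
        params = [(k, v) for k, v in params if keep_re.fullmatch(k)]
    elif remove is not None:
        remove_re = re.compile(remove)
        params = [(k, v) for k, v in params if not remove_re.fullmatch(k)]
    return urlunparse(parts._replace(query=urlencode(params)))

url = "http://www.example.com/product.php?pid=135&cid=12&ttda=12"
print(clean_query(url, keep="pid"))         # keep only pid
print(clean_query(url, remove="cid|ttda"))  # remove cid and ttda
# Both print: http://www.example.com/product.php?pid=135
```

Both calls produce the same cleaned URL here, which is exactly why duplicate pages collapse into one once the noise parameters are gone.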

We hope these two useful tools help you improve your Portia projects.
