In this tutorial, we are going to set up Portia and develop a project to scrape items from books.toscrape.com. Then we will export these items in CSV format.
Once you've created an account in Scrapinghub, go to the dashboard to start a new project.
First, click on Create Project.
A pop-up window will appear where we will set the name of our project. You can also choose between a Portia and a Scrapy project. In this case, we are interested in creating a Portia project.
Select it and then click Create.
Now you are in the Portia environment. At the top, you'll be able to find the Portia 2.0 Documentation to learn about all the features of this new release.
Note: If you are still on Version 1.0, you can change to Portia-beta.
Let’s introduce our site to scrape: books.toscrape.com, and then hit Enter or Return.
A new window with a browser is shown. If the site isn't correct, just type the new address in the box.
Click on NEW SPIDER.
Go to the starting page by selecting one of the products we want to scrape, in this case a book. Click on the link to go to the product page.
Now you can navigate through the book page and check if the information you need is correct.
Suppose we are interested in extracting the following fields: title of the book, description, price, and availability (stock status).
Click on NEW SAMPLE.
Here, you can select and edit the items that you want to scrape. Every time you pass the mouse over a field, it will be highlighted.
Choose, for example, the title of the book. A new field will be created in the left menu, where you can edit its name and type (in this case, text).
Continue by selecting other fields (and types) such as price (price), stock (text) and description (text).
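Behind the scenes, each sample boils down to pulling those four fields out of the product page's HTML. Here is a minimal sketch of that idea in Python, run against a simplified, hypothetical fragment modeled on a books.toscrape.com product page (the real markup and class names may differ; Portia handles this for you):

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed fragment standing in for a product page.
PRODUCT_HTML = """
<div class="product_main">
  <h1>A Light in the Attic</h1>
  <p class="price_color">£51.77</p>
  <p class="availability">In stock (22 available)</p>
  <p class="description">A classic collection of poems.</p>
</div>
"""

root = ET.fromstring(PRODUCT_HTML)

def text_of(cls):
    # Return the text of the first element with the given class attribute.
    for el in root.iter():
        if el.get("class") == cls:
            return el.text.strip()
    return None

# The same four fields we defined in the Portia sample.
item = {
    "title": root.find("h1").text,
    "price": text_of("price_color"),
    "stock": text_of("availability"),
    "description": text_of("description"),
}
print(item)
```

Each scraped page yields one such item, which is what you will later see in the job's item list.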
Let's check the Required box for each of these fields to avoid missing data in the results.
Finally, you can click on CLOSE SAMPLE.
Then, click on the Publish button (the small green cloud). You are ready to deploy your project!
A message will be displayed telling you that you can now run the spider from the Scrapy Cloud dashboard. If you want to change or add something, you can do it here and then publish the project again.
Now, you are ready to run your spider! Click on the RUN button.
A last pop-up window will be displayed where you can select the spider and set tags or arguments. As you can see, you can create several spiders in the same project. For now, we just want to put this spider to work! Click on RUN again.
Our spider starts running, crawling the site and extracting items. As you can see in this example, a single spider extracted 352 items in 40 seconds. By the end of the job, 1,000 items had been extracted in about 1.4 minutes.
That is outstanding speed! Can you imagine selecting and extracting about 12 items per second? Of course, you can improve your project's performance further, for example by using more spiders or the Crawlera add-on.
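The throughput quoted above is easy to verify with a back-of-the-envelope calculation:

```python
# 1000 items in about 1.4 minutes works out to roughly 12 items per second.
items = 1000
minutes = 1.4
rate = items / (minutes * 60)
print(f"{rate:.1f} items/second")  # → 11.9 items/second
```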
Once the job is finished, click on the items and a new menu will be displayed. You can navigate through the different items, and through the tags, to analyze the spider's performance.
Now, you are ready to download the data to your computer! Click on EXPORT and select Get as CSV.
After downloading the file, you can open it with your favorite spreadsheet application. In this example, we are using the open-source LibreOffice Calc.
Make sure you select the UTF-8 character set; otherwise some characters won't be displayed properly. Then click OK.
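If you'd rather process the export in code than in a spreadsheet, the same encoding rule applies there. A small sketch with Python's csv module, using made-up sample bytes in place of the downloaded file (the real export's column headers may differ):

```python
import csv
import io

# The exported file is UTF-8 bytes; decoding with the wrong charset
# would mangle non-ASCII characters such as the "£" in the prices.
raw = "title,price\r\nA Light in the Attic,£51.77\r\n".encode("utf-8")

reader = csv.DictReader(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8"))
rows = list(reader)
print(rows[0]["price"])  # → £51.77
```

With a real download, you would simply use `open(path, encoding="utf-8")` instead of the in-memory bytes.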
You can arrange and edit the columns as you prefer in order to store and use your data efficiently.
As you can see, you can extract a huge volume of items and turn them into useful data.