Sometimes, you need to add certain fields to your scraped data that can be derived from the context. For example, you may need a timestamp for when an item was scraped, or you need to extract an identifier from a URL. This is where the Magic Fields addon comes in.
You can enable the addon by going to Settings -> Addons and clicking Add on the Magic Fields addon.
Navigate to the settings of the spider you want to modify. Let’s use the $time magic variable as an example. Add { "timestamp": "$time" } to the MAGIC_FIELDS setting. This will add a timestamp field containing the time at which the item was scraped.
The following magic variables are available (prefix each with $ when using it in MAGIC_FIELDS, as in $time):
- time: The UTC timestamp at which the item was scraped, in the format ‘%Y-%m-%d %H:%M:%S’.
- unixtime: The Unix time at which the item was scraped.
- isotime: The UTC timestamp at which the item was scraped, in the format ‘%Y-%m-%dT%H:%M:%S’.
- spider:<attribute>: The value of the specified spider attribute.
- env:<variable>: The value of the specified variable. Note: the name of the variable will be omitted.
- jobid: The job ID. Shortcut for $env:SCRAPY_JOB.
- jobtime: The UTC timestamp at which the job started, in the format ‘%Y-%m-%d %H:%M:%S’.
- setting:<name>: The value of the specified setting.
- field:<name>: The value of the specified existing field.
- response:<property>: The value of the specified property of the response.
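To make the three time formats concrete, here is a minimal Python sketch that renders one fixed instant in each of them. It uses only the standard library and does not call the addon. The instant (Unix time 1,000,000,000) and the variable names are illustrative assumptions, and whether the addon reports unixtime as an integer or with a fractional part is not stated here.

```python
from datetime import datetime, timezone

# Illustrative scrape instant (an assumption): Unix time 1,000,000,000, in UTC.
scraped_at = datetime.fromtimestamp(1_000_000_000, tz=timezone.utc)

# Same instant rendered in the formats documented for time and isotime.
time_value = scraped_at.strftime("%Y-%m-%d %H:%M:%S")
isotime_value = scraped_at.strftime("%Y-%m-%dT%H:%M:%S")

# Seconds since the Unix epoch, as Python reports them (a float).
unixtime_value = scraped_at.timestamp()

print(time_value)      # 2001-09-09 01:46:40
print(isotime_value)   # 2001-09-09T01:46:40
print(unixtime_value)  # 1000000000.0
```

The only difference between the time and isotime formats is the separator between date and time: a space versus the letter T.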
You can also use regular expressions to extract a portion of the variable.
For example, let’s say you need to extract a parameter from a URL like this: http://www.example.com/product.html?item_no=345. The normal syntax, { "sku": "$field:url" }, will store the full URL in the sku field. If we want to extract only the item_no value, we can use a regex like this:
{ "sku": "$field:url,r'item_no=(\d+)'" }
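You can check a pattern like this locally before putting it in the setting. The following Python sketch applies the same regex to the example URL with the standard re module. The helper name extract_first_group is hypothetical, and using re.search and returning None when nothing matches are assumptions made for illustration, not a description of how the addon behaves internally.

```python
import re

# Example URL and pattern taken from the text above.
url = "http://www.example.com/product.html?item_no=345"
pattern = r"item_no=(\d+)"


def extract_first_group(value, pattern):
    """Return the first capture group of pattern found in value, or None."""
    match = re.search(pattern, value)
    return match.group(1) if match else None


sku = extract_first_group(url, pattern)
print(sku)  # 345
```

The parentheses in the pattern mark the part to keep, so only the digits after item_no= end up in the result.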