Scrapy Tutorial
http://doc.scrapy.org/en/latest/intro/tutorial.html
In this tutorial, we'll assume that Scrapy is already installed on your system. If that's not the case, see the Installation guide.
We are going to use the Open Directory Project (dmoz) as our example domain to scrape.
This tutorial will walk you through these tasks:
- Creating a new Scrapy project
- Defining the Items you will extract
- Writing a spider to crawl a site and extract Items
- Writing an Item Pipeline to store the extracted Items
Scrapy is written in Python. If you're new to the language you might want to start by getting an idea of what the language is like, to get the most out of Scrapy. If you're already familiar with other languages and want to learn Python quickly, we recommend Dive Into Python. If you're new to programming and want to start with Python, take a look at this list of Python resources for non-programmers.
Creating a project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you'd like to store your code and then run:
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These are basically:
- scrapy.cfg: the project configuration file
- tutorial/: the project's Python module, you'll later import your code from here.
- tutorial/items.py: the project’s items file.
- tutorial/pipelines.py: the project’s pipelines file.
- tutorial/settings.py: the project’s settings file.
- tutorial/spiders/: a directory where you’ll later put your spiders.
Defining our Item
Items are containers that will be loaded with the scraped data; they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.
They are declared by creating a scrapy.item.Item class and defining its attributes as scrapy.item.Field objects, like you would in an ORM (don't worry if you're not familiar with ORMs, you will see that this is an easy task).
We begin by modeling the item that we will use to hold the sites' data obtained from dmoz.org. As we want to capture the title, URL and description of the sites, we define fields for each of these three attributes. To do that, we edit items.py, found in the tutorial directory. Our Item class looks like this:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
This may seem complicated at first, but defining the item allows you to use other handy components of Scrapy that need to know what your item looks like.
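To see the protection mentioned above in action, you can try the item in a Python console. The session below is only illustrative and is not part of the original tutorial; the exact KeyError message may vary between Scrapy versions:

>>> from tutorial.items import DmozItem
>>> item = DmozItem(title='Example title')
>>> item['title']
'Example title'
>>> item['publisher'] = 'Example publisher'   # 'publisher' was never declared as a Field
Traceback (most recent call last):
    ...
KeyError: 'DmozItem does not support field: publisher'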
Our first Spider
Spiders are user-written classes used to scrape information from a domain (or group of domains).
They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.
To create a Spider, you must subclass scrapy.spider.BaseSpider and define the three main, mandatory attributes:
- name: identifies the Spider. It must be unique, that is, you can't set the same name for different Spiders.
- start_urls: is a list of URLs where the Spider will begin to crawl from. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.
- parse(): a method of the spider which will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument. This method is responsible for parsing the response data and extracting scraped data (as scraped items) and more URLs to follow.
The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).
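Before looking at the tutorial's first spider (which simply saves each downloaded page to a file), here is a small, hypothetical sketch of a callback that returns both kinds of objects. ExampleSpider, the placeholder title value and the hard-coded next_page URL are assumptions made only for illustration; real extraction is covered later in this tutorial:

from scrapy.http import Request
from scrapy.spider import BaseSpider

from tutorial.items import DmozItem

class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        item = DmozItem()
        item['title'] = ['placeholder title']   # real extraction is shown later in the tutorial
        next_page = "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        # A callback may return a mix of Items and Requests; each Request is
        # scheduled, downloaded, and its callback called with the Response.
        return [item, Request(next_page, callback=self.parse)]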
This is the code for our first Spider; save it in a file named dmoz_spider.py under the tutorial/spiders directory:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
Crawling
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl dmoz
The crawl dmoz command runs the spider for the dmoz.org domain. You will get an output similar to this:
2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)
Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).
More interestingly, as our parse method instructs, two files have been created: Books and Resources, with the content of both URLs.
What just happened under the hood?
Scrapy creates scrapy.http.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.
These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
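In other words, the default behaviour is roughly equivalent to defining an explicit start_requests() method on our spider, as in the sketch below. This is only an approximation for illustration, not Scrapy's actual internal code:

from scrapy.http import Request
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def start_requests(self):
        # One Request per start URL, with parse() as the callback that
        # will receive each downloaded Response.
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)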
Extracting Items
Introduction to Selectors
There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath expressions called XPath selectors. For more information about selectors and other extraction mechanisms see the XPath selectors documentation.
Here are some examples of XPath expressions and their meanings:
- /html/head/title: selects the <title> element, inside the <head> element of an HTML document
- /html/head/title/text(): selects the text inside the aforementioned <title> element.
- //td: selects all the <td> elements
- //div[@class="mine"]: selects all div elements which contain an attribute class="mine"
These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.
For working with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours: HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). In order to use them you must instantiate the desired class with a Response object.
You can think of selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.
Selectors have three methods (see the selectors API documentation for complete details; a short standalone example follows this list):
- select(): returns a list of selectors, each of them representing the nodes selected by the XPath expression given as argument.
- extract(): returns a unicode string with the data selected by the XPath selector.
- re(): returns a list of unicode strings extracted by applying the regular expression given as argument.
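As promised above, here is a short standalone example of the three methods. It is not part of the original tutorial; it builds a fake response from a hard-coded HTML string using scrapy.http.HtmlResponse, whereas in a real spider you would use the Response object passed to your callback:

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

body = '<html><head><title>Python: Books</title></head><body></body></html>'
response = HtmlResponse(url='http://example.com', body=body)

hxs = HtmlXPathSelector(response)
print hxs.select('//title/text()')               # a list of selectors
print hxs.select('//title/text()').extract()     # [u'Python: Books']
print hxs.select('//title/text()').re('(\w+):')  # [u'Python']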
Trying Selectors in the Shell
To illustrate the use of Selectors we're going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) to be installed on your system.
To start a shell, you must go to the project’s top level directory and run:
scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
This is what the shell looks like:
[ ... Scrapy log here ... ]
[s] Available Scrapy objects:
[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s] hxs <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] item Item()
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] spider <BaseSpider 'default' at 0x1b6c2d0>
[s] xxs <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] Useful shortcuts:
[s] shelp() Print this help
[s] fetch(req_or_url) Fetch a new request or URL and update shell objects
[s] view(response) View response in a browser
In [1]:
After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.
The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable) with this response. So let's try them:
In [1]: hxs.select('//title')
Out[1]: [<HtmlXPathSelector (title) xpath=//title>]
In [2]: hxs.select('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']
In [3]: hxs.select('//title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]
In [4]: hxs.select('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [5]: hxs.select('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
Extracting the data
Now, let’s try to extract some real information from those pages.
You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this easier, you can use some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.
After inspecting the page source, you'll find that the web sites' information is inside a <ul> element, in fact the second <ul> element.
So we can select each <li> element belonging to the sites list with this code:
hxs.select('//ul/li')
And from them, the sites' descriptions:
hxs.select('//ul/li/text()').extract()
The sites' titles:
hxs.select('//ul/li/a/text()').extract()
And the sites' links:
hxs.select('//ul/li/a/@href').extract()
As we said before, each select() call returns a list of selectors, so we can concatenate further select() calls to dig deeper into a node. We are going to use that property here, so:
sites = hxs.select('//ul/li')
for site in sites:
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc
Note
For a more detailed description of using nested selectors, see Nesting selectors and Working with relative XPaths in the XPath Selectors documentation.
Let’s add this code to our spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc
Now try crawling the dmoz.org domain again and you'll see the sites being printed in your output. Run:
scrapy crawl dmoz
Using our item
Item objects are custom Python dicts; you can access the values of their fields (the attributes of the class we defined earlier) using the standard dict syntax, like:
>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
Spiders are expected to return their scraped data inside Item objects. So, in order to return the data we've scraped so far, the final code for our Spider would be like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
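As a side note (this is not shown in the original tutorial), the parse() method can also be written as a generator, yielding each item as soon as it is built instead of collecting them all in a list; Scrapy accepts both styles. The following is a drop-in replacement for the parse() method above:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//ul/li'):
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            yield item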
Note
You can find a fully-functional variant of this spider in the dirbot project available at https://github.com/scrapy/dirbot
Now doing a crawl on the dmoz.org domain yields DmozItem objects:
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],
'link': [u'http://gnosis.cx/TPiP/'],
'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
'title': [u'XML Processing with Python']}
Storing the scraped data
The simplest way to store the scraped data is by using the Feed exports, with the following command:
scrapy crawl dmoz -o items.json -t json
That will generate an items.json file containing all scraped items, serialized in JSON.
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. Though you don't need to implement any item pipeline if you just want to store the scraped items.
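For reference, a pipeline is just a class with a process_item() method. The sketch below is only illustrative and is not required for this tutorial; the DmozPipeline name and the "drop items without a title" rule are arbitrary assumptions:

# tutorial/pipelines.py -- a minimal, optional sketch
from scrapy.exceptions import DropItem

class DmozPipeline(object):

    def process_item(self, item, spider):
        # Called once for every item scraped by every spider.
        if not item.get('title'):
            raise DropItem("Missing title in %s" % item)
        return item

To activate it you would also list the pipeline class in the ITEM_PIPELINES setting in tutorial/settings.py (depending on your Scrapy version, that setting is either a list or a dict of pipeline class paths).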
Next steps
This tutorial covers only the basics of Scrapy, but there are a lot of other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.
Then, we recommend you continue by playing with an example project (see Examples), and then continue with the section Basic concepts.