python - tutorial - web applications with django




¿Cuál es una forma simple de extraer la lista de URL en una página web usando Python? (2)

Esto es realmente muy simple con BeautifulSoup .

from BeautifulSoup import BeautifulSoup

[element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)]

# [u'http://example.com/', u'/example', ...]

Una última cosa: puede usar urlparse.urljoin para hacer que todas las URL sean absolutas. Si necesita el texto del enlace, puede usar algo como element.contents[0] .

Y así es como puedes atarlo todo:

import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def get_all_link_targets(url):
    return [urlparse.urljoin(url, tag['href']) for tag in
            BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)]

Quiero crear un rastreador web simple para divertirme. Necesito el rastreador web para obtener una lista de todos los enlaces en una página. ¿Tiene la biblioteca de Python alguna función incorporada que haría esto más fácil? Gracias cualquier conocimiento apreciado.


Hay un artículo sobre el uso de HTMLParser para obtener las URL de las etiquetas <a> en una página web.

El código es este:

de HTMLParser import HTMLParser de urllib2 import urlopen

class Spider(HTMLParser):

    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and attrs:
            print "Found link => %s" % attrs[0][1]

Spider('http://www.python.org')

Si ejecutara ese script, obtendría resultados como este:

[email protected]:~> python crawler.py
Found link => /
Found link => #left-hand-navigation
Found link => #content-body
Found link => /search
Found link => /about/
Found link => /news/
Found link => /doc/
Found link => /download/
Found link => /community/
Found link => /psf/
Found link => /dev/
Found link => /about/help/
Found link => http://pypi.python.org/pypi
Found link => /download/releases/2.7/
Found link => http://docs.python.org/
Found link => /ftp/python/2.7/python-2.7.msi
Found link => /ftp/python/2.7/Python-2.7.tar.bz2
Found link => /download/releases/3.1.2/
Found link => http://docs.python.org/3.1/
Found link => /ftp/python/3.1.2/python-3.1.2.msi
Found link => /ftp/python/3.1.2/Python-3.1.2.tar.bz2
Found link => /community/jobs/
Found link => /community/merchandise/
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => color:#D58228; margin-top:1.5em
Found link => /psf/donations/
Found link => http://wiki.python.org/moin/Languages
Found link => http://wiki.python.org/moin/Languages
Found link => http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
Found link => http://wiki.python.org/moin/Python2orPython3
Found link => http://pypi.python.org/pypi
Found link => /3kpoll
Found link => /about/success/usa/
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => /about/quotes
Found link => http://wiki.python.org/moin/WebProgramming
Found link => http://wiki.python.org/moin/CgiScripts
Found link => http://www.zope.org/
Found link => http://www.djangoproject.com/
Found link => http://www.turbogears.org/
Found link => http://wiki.python.org/moin/PythonXml
Found link => http://wiki.python.org/moin/DatabaseProgramming/
Found link => http://www.egenix.com/files/python/mxODBC.html
Found link => http://sourceforge.net/projects/mysql-python
Found link => http://wiki.python.org/moin/GuiProgramming
Found link => http://wiki.python.org/moin/WxPython
Found link => http://wiki.python.org/moin/TkInter
Found link => http://wiki.python.org/moin/PyGtk
Found link => http://wiki.python.org/moin/PyQt
Found link => http://wiki.python.org/moin/NumericAndScientific
Found link => http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
Found link => http://www.pentangle.net/python/handbook/
Found link => /community/sigs/current/edu-sig
Found link => http://www.openbookproject.net/pybiblio/
Found link => http://osl.iu.edu/~lums/swc/
Found link => /about/apps
Found link => http://docs.python.org/howto/sockets.html
Found link => http://twistedmatrix.com/trac/
Found link => /about/apps
Found link => http://buildbot.net/trac
Found link => http://www.edgewall.com/trac/
Found link => http://roundup.sourceforge.net/
Found link => http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
Found link => /about/apps
Found link => http://www.pygame.org/news.html
Found link => http://www.alobbs.com/pykyra
Found link => http://www.vrplumber.com/py3d.py
Found link => /about/apps
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => /channews.rdf
Found link => /about/website
Found link => http://www.xs4all.com/
Found link => http://www.timparkin.co.uk/
Found link => /psf/
Found link => /about/legal

Puede usar regex para distinguir entre URL absolutas y relativas.







web-applications