Content Gardening

18 Sep 2017

Need to automate HTTP requests using Python? One popular way, and also my preferred one, is to use the requests module. Let's have a quick look.

First, you need to add the module to Python

Install it with pip, using the pip that belongs to your Python installation:

pip install requests

On Windows:

pip.exe install requests

Interact with websites

Let's interact with a hypothetical website, examplesite.com. To get started, you could write code similar to the following to extract the content of a given page from its URL.

import requests

url = 'http://examplesite.com/examplepage/'

with requests.get(url) as r:
    if r.status_code == 200:
        print(r.text)

What is being done here?
When you pass a variable url containing the URL of a web page to the requests.get() function, you get the "HTTP response" as a result. In Python jargon, this is an object, and here we give it the name r.
You can then get the returned HTML text through the right attribute (or property) of that object: the .text part of r.text.

You can print it in the Python console, as we do here, or do something else with it, such as storing it on disk or in a database that you access from Python.
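Storing the text on disk needs nothing beyond the standard library. Here is a minimal sketch, assuming a hypothetical helper named save_page() and a file path of your choosing; the text argument stands in for r.text:

```python
from pathlib import Path


def save_page(text, path):
    """Write the page text to `path` as UTF-8; return the number of characters written."""
    target = Path(path)
    # Create the parent directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.write_text(text, encoding="utf-8")
```

In the example above, you could call save_page(r.text, 'pages/examplepage.html') instead of print(r.text).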

As you may have noticed, we use the with statement, which relies on Python's "context manager" feature: the response object returned by requests.get() is a context manager. This lets the system resources involved, here the underlying network connection, be released when the block ends.
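To make the role of the with statement concrete, the following sketch does the same job by hand with try/finally. The function name fetch_text and the 10-second timeout are assumptions added for illustration; they are not part of the example above.

```python
import requests


def fetch_text(url):
    """Return the page text for `url`, or None if the status is not 200."""
    # The timeout (in seconds) is an assumption: without one, requests
    # can wait indefinitely for a server that never answers.
    r = requests.get(url, timeout=10)
    try:
        if r.status_code == 200:
            return r.text
        return None
    finally:
        # This is what leaving the `with` block does for us:
        # it releases the connection held by the response.
        r.close()
```

The with form is shorter and harder to get wrong, which is why it is usually the better choice.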

There is an alternative worth knowing if you want to access several pages on the same domain. The code for this technique, optimized for such cases, would look like:

import requests

BASE_URL = 'http://examplesite.com'
PATHS = [
    '/example-page/',
    '/another-example-page/',
    '/a-third-example-page/',
]

with requests.Session() as s:
    for path in PATHS:
        r = s.get(BASE_URL + path)
        if r.status_code == 200:
            print(r.text)

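Here we use a session object, returned by the requests.Session() call. We can then make all our requests through it, instead of making several separate requests.get(url) calls. The session reuses the underlying connection to the server, which is what makes it faster when you request several pages from the same host.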
See http://docs.python-requests.org/en/master/user/advanced/#session-objects for more about this feature.
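A session also carries settings that apply to every request made through it, such as default headers. The sketch below sets a User-Agent once and checks, without sending anything over the network, that a prepared request picks it up. The helper name make_session and the header value are hypothetical.

```python
import requests


def make_session(user_agent):
    """Create a session whose requests all carry the given User-Agent."""
    s = requests.Session()
    # Headers set on the session are merged into every request it sends.
    s.headers.update({"User-Agent": user_agent})
    return s


with make_session("my-crawler/0.1") as s:
    # prepare_request() builds the request as the session would send it,
    # but does not contact the server.
    req = s.prepare_request(
        requests.Request("GET", "http://examplesite.com/example-page/")
    )
    print(req.headers["User-Agent"])
```

The same session could then be used with s.get(...) in a loop, as in the example above, and each request would carry that header.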

Parsing the HTML

Let's go a bit further and see how you would parse the HTML that is fetched, to extract the text you need, so that the examples are more useful.

There are several options in terms of libraries to help parse the HTML, depending on your precise need. Three of them are:

BeautifulSoup
lxml
HTML2Text

For example, you can use HTML2Text to extract all the text from the HTML content by updating the code as follows.

import requests
import html2text

# Setup of the html2text handler
h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True

def parseHTML(htmltext):
    """ Parsing / Extraction of plain text from HTML string """
    result = h.handle(htmltext)
    return result

BASE_URL = 'http://examplesite.com'
PATHS = [
    '/example-page/',
    '/another-example-page/',
    '/a-third-example-page/',
]

with requests.Session() as s:
    for path in PATHS:
        url = BASE_URL + path
        r = s.get(url)
        if r.status_code == 200:
            res = parseHTML(r.text)
            print(res)

That's it!
As a bonus on the Python side, we put the "HTML to text" transformation in a function, which keeps the code well organized and easy to read.

Note that you need to install this additional module in your Python environment for this to work:

pip install html2text
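When you need specific elements rather than all of the text, BeautifulSoup, the first library in the list above, is a common choice. Here is a minimal sketch, assuming the package is installed (pip install beautifulsoup4) and using a hard-coded HTML string in place of r.text; the function name extract_links is hypothetical:

```python
from bs4 import BeautifulSoup


def extract_links(htmltext):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    # "html.parser" is the parser bundled with Python, so no extra
    # parser library is needed.
    soup = BeautifulSoup(htmltext, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


sample = '<p>See <a href="/example-page/">the example</a> and <a>no link</a>.</p>'
print(extract_links(sample))
```

The second <a> tag has no href attribute, so it is left out of the result.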

Many more possibilities

This was a quick introduction. A lot more is possible, such as interacting with an API-style server using POST and PUT requests, downloading files in "streaming mode", and using Python's multithreading or asynchronous techniques to perform the operations.
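Two of these ideas fit in a short sketch: downloading a file in streaming mode, and fetching several pages in parallel with a thread pool. The function names, the chunk size, the timeout and the number of workers are all assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def download_file(url, dest, chunk_size=8192):
    """Stream the body of `url` into the file `dest`, chunk by chunk."""
    # stream=True defers downloading the body until we iterate over it,
    # so a large file never has to fit in memory at once.
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return dest


def fetch_status(url):
    """Return (url, status code) for a single GET request."""
    with requests.get(url, timeout=10) as r:
        return url, r.status_code


def fetch_all(urls, max_workers=4):
    """Fetch all URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_status, urls))
```

For example, fetch_all(['http://examplesite.com/example-page/', 'http://examplesite.com/another-example-page/']) would return one (url, status code) pair per page.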

These are the kinds of things I am currently working on, related to the topic of "Automation". You can see a longer version of this article as a bonus to a first report I recently made available: Leverage Automation For Your Projects.
More practical code examples are coming.

Automating Requests To Websites Using Python
