Need to automate HTTP requests using Python? One popular way, and also my preferred one, is to use the requests module. Let's have a quick look.
First, you need to add the module to your Python installation. Use pip:
pip install requests
On Windows:
pip.exe install requests
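You can then check that the module is available; for example:

python -c "import requests; print(requests.__version__)"

This should print the installed version number.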
Interact with websites
Let's interact with a hypothetical website, examplesite.com. To get started, you could write code similar to the following to extract the content of a given page, using its URL.
import requests

url = 'http://examplesite.com/examplepage/'

with requests.get(url) as r:
    if r.status_code == 200:
        print(r.text)
What is being done here?
When you pass a variable url containing the URL of a webpage to the requests.get() function, you get the "HTTP response" as a result. In Python jargon, this is an object, and here we give it the name r.
You can then get the returned HTML text by reading the right object attribute (or property), the .text part called on the r object.
And you can print it in the Python console, as we do here, or do something else, such as storing it on disk or in a database that you access from Python.
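For instance, here is a minimal sketch that saves the fetched HTML to a local file (the page.html filename is just an example):

import requests

url = 'http://examplesite.com/examplepage/'

with requests.get(url) as r:
    if r.status_code == 200:
        # Save the page's HTML to disk for later processing
        with open('page.html', 'w', encoding='utf-8') as f:
            f.write(r.text)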
As you've noticed, we use the with statement, which comes from Python's "context managers" feature; the HTTP response object is a context manager. This allows the system resources involved to be freed at the end of the block.
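To make this concrete, here is a rough equivalent of the first example without the with statement; closing the response is what the context manager handles for us automatically:

import requests

url = 'http://examplesite.com/examplepage/'

r = requests.get(url)
try:
    if r.status_code == 200:
        print(r.text)
finally:
    # This is what the with block guarantees: the response is closed
    # and its network connection is released, even if an error occurs
    r.close()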
There is an alternative way, worth knowing, if you want to access several pages on the same domain. The code for this technique, optimized for such cases, would look like this:
import requests

BASE_URL = 'http://examplesite.com'
PATHS = ['/example-page/',
         '/another-example-page/',
         '/a-third-example-page/',
         ]

with requests.Session() as s:
    for path in PATHS:
        r = s.get(BASE_URL + path)
        if r.status_code == 200:
            print(r.text)
Here we use a session object, returned by the requests.Session() call. Then we can make all our requests through it, instead of making several separate requests.get(url) calls. The session reuses the same underlying connection for requests to the same host, which makes a series of requests faster.
See http://docs.python-requests.org/en/master/user/advanced/#session-objects for more about this feature.
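A session can also carry settings that apply to every request made through it. As a small sketch (the User-Agent value here is just an example), you can set default headers once on the session:

import requests

with requests.Session() as s:
    # Headers set on the session are sent with every request made through it
    s.headers.update({'User-Agent': 'my-example-bot/1.0'})
    r = s.get('http://examplesite.com/example-page/')
    if r.status_code == 200:
        print(r.text)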
Parsing the HTML
Let's go a bit further and see how you would parse the fetched HTML to extract the text you need, making the examples more useful.
There are several options in terms of libraries to help parse the HTML, depending on your precise needs. Three of them are:
- BeautifulSoup
- lxml
- HTML2Text
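To give a quick taste of the first option, here is a minimal sketch with BeautifulSoup; you would install it first with pip install beautifulsoup4, and the URL is hypothetical:

import requests
from bs4 import BeautifulSoup

url = 'http://examplesite.com/example-page/'

with requests.get(url) as r:
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        # get_text() drops the tags and keeps only the text content
        print(soup.get_text())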
For our full example, let's use HTML2Text to extract all the text from the HTML content, updating the previous code as follows.
import requests
import html2text

# Setup of the html2text handler
h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True

def parseHTML(htmltext):
    """ Parsing / Extraction of plain text from HTML string """
    result = h.handle(htmltext)
    return result

BASE_URL = 'http://examplesite.com'
PATHS = ['/example-page/',
         '/another-example-page/',
         '/a-third-example-page/',
         ]

with requests.Session() as s:
    for path in PATHS:
        url = BASE_URL + path
        r = s.get(url)
        if r.status_code == 200:
            res = parseHTML(r.text)
            print(res)
That's it!
As another bonus in terms of Python language features, we use a function for the "HTML to text" transformation part, so that our code is well organized and easy to read.
Note that you need to install this additional module, in your Python, for the example to work:
pip install html2text
Many more possibilities
This was a quick introduction. A lot more is possible, such as interacting with an API-style server through POST and PUT requests, downloading files in "streaming mode", and using Python's multithreading or asynchronous techniques to perform the operations.
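As one illustration, here is a minimal sketch of a download in streaming mode, which avoids loading a large file entirely into memory (the URL and filename are placeholders):

import requests

url = 'http://examplesite.com/files/example.zip'

with requests.get(url, stream=True) as r:
    if r.status_code == 200:
        with open('example.zip', 'wb') as f:
            # Read the body in chunks instead of all at once
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)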
This is the kind of thing I am currently working on, related to the topic of "Automation". You can see a longer version of this article as a bonus to a first report I recently made available: Leverage Automation For Your Projects. More practical code examples are coming.