The following explains how to specify the URL of an image, ZIP, PDF, or other file on the Web in Python, download it, and save it as a local file.
- Download images by specifying the URL
    - Code example
    - urllib.request.urlopen(): Open URL
    - open(): Write to a file in binary mode
    - A simpler code example
- Download ZIP files, PDF files, etc.
- Extract image URLs from the web page
    - If the number is sequential
    - Extract with Beautiful Soup
- Batch download multiple images from a list of URLs
Download images by specifying the URL.
Individual files can be downloaded by specifying their URLs using only the standard library; no additional installation is required.
Code example
The following is an example of a function that downloads and saves a file given a URL and a destination path, together with its usage. The code is somewhat verbose for the sake of explanation; a simpler version is shown later.
```python
import os
import pprint
import time
import urllib.error
import urllib.request


def download_file(url, dst_path):
    try:
        with urllib.request.urlopen(url) as web_file:
            data = web_file.read()
            with open(dst_path, mode='wb') as local_file:
                local_file.write(data)
    except urllib.error.URLError as e:
        print(e)


url = 'https://www.python.org/static/img/python-logo.png'
dst_path = 'data/temp/py-logo.png'
download_file(url, dst_path)
```
To specify only the destination directory and save the file under the file name taken from the URL, do the following.
```python
def download_file_to_dir(url, dst_dir):
    download_file(url, os.path.join(dst_dir, os.path.basename(url)))


dst_dir = 'data/temp'
download_file_to_dir(url, dst_dir)
```
The file name is extracted from the URL with os.path.basename() and joined with the specified directory using os.path.join() to generate the destination path.
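For illustration, here is roughly what those two calls return for the example URL (the joined path shown in the comment is what the standard library produces on a POSIX-style system):

```python
import os

url = 'https://www.python.org/static/img/python-logo.png'

print(os.path.basename(url))
# python-logo.png

print(os.path.join('data/temp', os.path.basename(url)))
# data/temp/python-logo.png
```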
The following sections describe the data-acquisition part and the part that saves the data as a file.
urllib.request.urlopen(): Open URL
Use urllib.request.urlopen() to open the URL and retrieve the data. Note that urllib.urlopen() was deprecated in Python 2.6 and removed in Python 3. urllib.request.urlretrieve() is still available, but it is documented as a legacy interface that may become deprecated in the future.
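For reference, a minimal sketch of the one-call alternative with urllib.request.urlretrieve(), reusing the URL and destination path from the earlier example:

```python
import urllib.request

# Downloads the file and saves it to the given path in one call.
# Note: this is the legacy interface mentioned above.
urllib.request.urlretrieve('https://www.python.org/static/img/python-logo.png',
                           'data/temp/py-logo.png')
```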
To avoid stopping when an exception occurs, catch the error with try and except.
In the example, urllib.error is imported and only urllib.error.URLError is caught explicitly. An error message is displayed when the file's URL does not exist.
```python
url_error = 'https://www.python.org/static/img/python-logo_xxx.png'
download_file_to_dir(url_error, dst_dir)
# HTTP Error 404: Not Found
```
If you also want to catch exceptions raised while saving the file locally (FileNotFoundError, etc.), specify the exceptions as a tuple in the except clause: except (urllib.error.URLError, FileNotFoundError).
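As a sketch, the function shown earlier could be modified like this; the tuple in the except clause catches either exception in the same handler:

```python
def download_file(url, dst_path):
    try:
        with urllib.request.urlopen(url) as web_file:
            data = web_file.read()
            with open(dst_path, mode='wb') as local_file:
                local_file.write(data)
    except (urllib.error.URLError, FileNotFoundError) as e:  # also catch save errors
        print(e)
```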
It is also possible to use the third-party library Requests instead of the standard library urllib to open the URL and get the data.
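A rough equivalent with Requests might look like the following (the function name download_file_requests is hypothetical; Requests must be installed separately, e.g. with pip install requests):

```python
import requests


def download_file_requests(url, dst_path):
    # Hypothetical helper: fetch the URL and save the response body as a file.
    response = requests.get(url)
    response.raise_for_status()  # raise an exception for 4xx/5xx status codes
    with open(dst_path, mode='wb') as local_file:
        local_file.write(response.content)  # response.content is bytes
```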
open(): Write to a file in binary mode
The data that can be obtained with urllib.request.urlopen() is a byte string (bytes type).
Passing mode='wb' as the second argument of open() writes the data in binary mode: w means write and b means binary.
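A minimal sketch of writing bytes in binary mode (the file name here is just an example):

```python
data = b'example bytes'

with open('data/temp/sample.bin', mode='wb') as f:
    f.write(data)  # bytes are written as-is, with no text encoding
```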
A simpler code example
Nested with statements can be written as a single statement, with the context managers separated by commas.
Using this, we can write the following.
```python
def download_file(url, dst_path):
    try:
        with urllib.request.urlopen(url) as web_file, open(dst_path, 'wb') as local_file:
            local_file.write(web_file.read())
    except urllib.error.URLError as e:
        print(e)
```
Download ZIP files, PDF files, etc.
The examples so far are for downloading and saving image files, but since we are simply opening a file on the web and saving it as a local file, the same functions can be used for other types of files.
You can download and save files by specifying the URL.
```python
url_zip = 'https://from-locals.com/sample_header.csv.zip'
download_file_to_dir(url_zip, dst_dir)

url_xlsx = 'https://from-locals.com/sample.xlsx'
download_file_to_dir(url_xlsx, dst_dir)

url_pdf = 'https://from-locals.com/sample1.pdf'
download_file_to_dir(url_pdf, dst_dir)
```
Note that the URL specified in this function must be a link to the file itself.
For example, in the case of a file in a GitHub repository, the following URL has a .pdf extension but is actually an HTML page. If this URL is passed to the function above, the HTML source will be downloaded.
- https://github.com/from-locals/python-snippets/blob/master/notebook/data/src/pdf/sample1.pdf
The link to the actual file is the following URL, which is what you need to specify if you want to download and save the file.
- https://github.com/from-locals/python-snippets/raw/master/notebook/data/src/pdf/sample1.pdf
There are also cases where access is restricted by user agent, referrer, etc., making it impossible to download. We do not guarantee that all files will be downloaded.
It is easy to use Requests to change or add request headers such as user agent.
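A rough sketch with Requests might look like this (the header values and URL are placeholders, not requirements of any particular site):

```python
import requests

# Placeholder header values; adjust to whatever the target server expects.
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://example.com/',
}

response = requests.get('https://example.com/sample.pdf', headers=headers)
```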
Extract image URLs from the web page
To download all the images in a page at once, first extract the URLs of the images and create a list.
If the number is sequential
If the URLs of the images you want to download are simple sequential numbers, creating the list is easy. If the URLs are not sequential numbers but still follow some regular pattern, it is also easier to build the list of URLs according to that rule than to scrape with Beautiful Soup (see below).
Use list comprehension notation.
- Related article: Using list comprehensions in Python
```python
url_list = ['https://example.com/basedir/base_{:03}.jpg'.format(i) for i in range(5)]
pprint.pprint(url_list)
# ['https://example.com/basedir/base_000.jpg',
#  'https://example.com/basedir/base_001.jpg',
#  'https://example.com/basedir/base_002.jpg',
#  'https://example.com/basedir/base_003.jpg',
#  'https://example.com/basedir/base_004.jpg']
```
In the above example, {:03} produces a 3-digit zero-padded sequential number; use {} when zero-padding is not needed, and {:05} for 5 digits instead of 3. For more about the format() method of str, see the following article.
- Related article: Format conversion in Python with format() (zero-padding, exponential notation, hexadecimal, etc.)
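For a quick illustration of those format specifications (including the equivalent f-string form available in Python 3.6+):

```python
print('{}'.format(3))     # 3
print('{:03}'.format(3))  # 003
print('{:05}'.format(3))  # 00003

# Equivalent with an f-string (Python 3.6+)
print(f'{3:03}')          # 003
```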
Also, here we are using pprint to make the output easier to read.
Extract with Beautiful Soup
To extract image URLs from web pages in bulk, use Beautiful Soup.
```python
import os
import time
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

url = 'https://from-locals.com/'
ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '\
     'AppleWebKit/537.36 (KHTML, like Gecko) '\
     'Chrome/55.0.2883.95 Safari/537.36'

req = urllib.request.Request(url, headers={'User-Agent': ua})
html = urllib.request.urlopen(req)

soup = BeautifulSoup(html, 'html.parser')

url_list = [img.get('data-src') for img in soup.find(class_='list').find_all('img')]
```
In the example, the URLs of the thumbnail images on this website are extracted.
The structure varies depending on the web page, but basically it is obtained as follows.
- Get a list of <img> tag objects by specifying the class, id, etc. of the block that contains the images you want to download.

```python
soup.find(class_='list').find_all('img')
```

- Get the image URL from the src attribute or data-src attribute of each <img> tag.

```python
img.get('data-src')
```
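Note that img.get() returns None for <img> tags that lack the requested attribute. If that can happen on the page you are scraping (a general sketch, not specific to this site), a condition in the list comprehension filters those entries out:

```python
url_list = [img.get('data-src')
            for img in soup.find(class_='list').find_all('img')
            if img.get('data-src') is not None]
```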
The above sample code is just an example and not guaranteed to work.
Batch download multiple images from a list of URLs
If you have a list of URLs, you can simply loop over it with a for loop and call the function shown earlier to download and save each file. Because the URL list here is a dummy, the call to download_file_to_dir() is commented out.
```python
download_dir = 'data/temp'
sleep_time_sec = 1

for url in url_list:
    print(url)
    # download_file_to_dir(url, download_dir)
    time.sleep(sleep_time_sec)
# https://example.com/basedir/base_000.jpg
# https://example.com/basedir/base_001.jpg
# https://example.com/basedir/base_002.jpg
# https://example.com/basedir/base_003.jpg
# https://example.com/basedir/base_004.jpg
```
To avoid overloading the server, time.sleep() is used to insert a wait between each image download. The argument is in seconds; the time module imported in the example above is used for this.
The example is for image files, but other types of files can be downloaded in the same way, as long as their URLs are in the list.