Python web crawler to download images from web page

Created: Tue 04 Apr 2023 Updated: 10 months ago


In this article, I'll explain how a Python web crawler works whose purpose is to extract all the images from the given URLs. I created this crawler to grab images of all formats: .img .jpg .jpeg .png .webp .bmp. It can also grab GIFs from web pages, and it is not limited to a single web page or URL: you can crawl/scrape multiple web pages or URLs in one run.

Scraping a webpage without permission may violate the website's terms of service. Make sure the webpage you are trying to crawl either does not restrict scraping or that you have appropriate permission to scrape it.

Creating Python environment and installing required packages

First, you need to set up a virtual environment to isolate the current project. Check the How to create python virtual environment article. Most of the libraries used in this code are Python built-in libraries. These are the external packages that you have to install manually:

$ pip install beautifulsoup4 tqdm
  • In the script, Beautiful Soup 4 is used to extract data from HTML webpages. It handles the parsing, navigation, searching, and modification of the document tree.
  • The other library is tqdm, whose name derives from the Arabic word taqaddum, meaning progress. As the name suggests, it is used to display the progress of the operations run by the script.
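To make the roles of the two packages concrete, here is a minimal sketch of how they might be combined. It is not the actual script: the function name extract_image_sources and the sample HTML are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup
from tqdm import tqdm


def extract_image_sources(html):
    """Return the src attribute of every <img> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True keeps only <img> tags that actually carry a src attribute.
    return [img["src"] for img in soup.find_all("img", src=True)]


# Hypothetical sample page, used only for this illustration.
sample_html = """
<html><body>
  <img src="/static/logo.png" alt="logo">
  <img src="https://example.com/banner.jpg">
  <img alt="no source attribute">
</body></html>
"""

sources = extract_image_sources(sample_html)

# tqdm wraps any iterable and draws a progress bar as it is consumed.
for src in tqdm(sources, desc="Images", unit="img"):
    pass  # a real crawler would download src here
```

The third img tag is skipped because it has no src attribute, so only two entries end up in sources.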

Functioning of crawler

After running the script with the python getimages.py command, the user just has to input the URLs. If you have more than one URL, paste them one by one into the input field and make sure you separate them with commas, otherwise it won't work. Check the image below for reference:

[Image: url-format]
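The comma-separated input described above could be split into individual URLs roughly like this. This is a sketch under the assumption that the script reads one line of text. The helper name parse_url_input is hypothetical and does not come from the script.

```python
def parse_url_input(raw):
    """Split a comma-separated string into a list of clean URLs."""
    # strip() removes stray spaces around each URL, and the filter
    # drops empty pieces left by a trailing or doubled comma.
    return [part.strip() for part in raw.split(",") if part.strip()]


raw = "https://github.com/, https://dribbble.com/ ,"
urls = parse_url_input(raw)
print(urls)  # ['https://github.com/', 'https://dribbble.com/']
```

In an interactive script, raw would typically come from a call such as input("Enter URLs: ").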

The beauty of this crawler is that it works for both relative and absolute links. Nowadays, many websites use relative links in their image tags. This is not an issue, because the crawler automatically converts each relative link into an absolute link, which is then used to download the image.
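One common way to turn a relative link into an absolute one is urljoin from the standard library. The sketch below assumes this approach. The script itself may do the conversion differently, and the helper name to_absolute and the example.com URLs are made up for illustration.

```python
from urllib.parse import urljoin


def to_absolute(page_url, src):
    """Resolve an image src against the URL of the page it was found on."""
    return urljoin(page_url, src)


page_url = "https://example.com/blog/post.html"

# Relative to the page's directory.
print(to_absolute(page_url, "images/cat.png"))
# https://example.com/blog/images/cat.png

# Relative to the site root.
print(to_absolute(page_url, "/static/logo.png"))
# https://example.com/static/logo.png

# An absolute link is returned unchanged.
print(to_absolute(page_url, "https://other.example/b.webp"))
# https://other.example/b.webp
```

Because urljoin leaves absolute links untouched, the same call can be applied to every src value without checking its type first.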

Scraped images will be saved in the current working directory under a new directory named python-image-crawler. This directory is created automatically when the script starts.
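Creating the output directory and choosing a file name for each image might look like the following sketch. Only the directory name python-image-crawler comes from the article. The helper names prepare_output_dir and target_path, and the fallback file name "image", are assumptions.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def prepare_output_dir(base_dir=None, name="python-image-crawler"):
    """Create the output directory (if missing) and return its path."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    out = base / name
    # exist_ok=True makes repeated runs safe.
    out.mkdir(parents=True, exist_ok=True)
    return out


def target_path(out_dir, image_url):
    """Build the local file path for an image URL, ignoring any query string."""
    filename = Path(urlparse(image_url).path).name or "image"
    return out_dir / filename


# Demonstrated in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = prepare_output_dir(tmp)
    print(out_dir.name)  # python-image-crawler
    print(target_path(out_dir, "https://example.com/static/logo.png?v=2").name)  # logo.png
```

Calling prepare_output_dir() with no arguments would create the directory under the current working directory, as the article describes.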

Download Python crawler script

The script is available in my GitHub repository, which you can clone to get it.

There is a high probability that you will get a 403 Forbidden error on many websites, because they use Cloudflare, which automatically blocks scrapers. You need to make sure that you have appropriate permission to crawl the website. Below are some links that won't restrict scraping. You can use them to check the script.

Sample URLs:
https://github.com/,https://dribbble.com/,https://www.amazon.in/gp/bestsellers/?ref_=nav_cs_bestsellers
Pass these sample URLs as input after executing the script. All the images present on the input URLs will be downloaded and saved in the python-image-crawler directory inside the current working directory.
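The 403 Forbidden error mentioned above can be detected and skipped gracefully rather than crashing the run. The sketch below uses the standard library's urllib for this. Whether the script uses urllib is an assumption, and fetch_bytes and its opener parameter are hypothetical names introduced here so that the function can be tried without network access.

```python
import urllib.request
from urllib.error import HTTPError, URLError


def fetch_bytes(url, opener=urllib.request.urlopen, timeout=10):
    """Return the response body, or None if the request is refused or fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # HTTPError must be caught before URLError, its parent class.
        if err.code == 403:
            print(f"403 Forbidden, skipping: {url}")
        else:
            print(f"HTTP {err.code}, skipping: {url}")
    except URLError as err:
        print(f"Could not reach {url}: {err.reason}")
    return None
```

A caller would check the result before writing a file, for example: data = fetch_bytes(image_url), followed by saving only if data is not None. A 403 response means the site is refusing the request, so the right reaction is to skip that URL, not to retry it.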

Author: Harpreet Singh
Server Administrator
