Tuesday, June 28, 2011

Scrape Multiple Images And Other Files Off the Web with WGET

This guide to downloading multiple web files is written primarily for those who are using a Linux operating system. For those on Windows, perform a Google search for "Wget for Windows".

Tool for Grabbing Multiple Files from the Web

There may be a situation where you have to grab multiple images or other files off of a particular website.  Maybe you don't have access to the actual web documents, or perhaps you just want to download all the content (MP3s, PDFs, etc.) from a given site.


Rather than right-clicking and choosing Save Image As over and over, you can create a text file, fill it with links to web content, and execute one command to grab it all and download it to your local machine.

A handy tool is the wget command, available via a Linux terminal (you might have to install the software package depending on your flavor of Linux). GNU Wget is a free software package for retrieving files using HTTP, HTTPS, and FTP, the most widely used Internet protocols.

How to Harvest Files from Web Pages with WGET

Step 1. Create a simple document with your favorite text editor (gedit, Notepad, etc.).  Let's call it grabimages.txt.

Step 2. Fill your text document with URLs that point to the files you want to download.

Example:

http://digital.library.schreiner.edu/hcc00022a_m.jpg
http://digital.library.schreiner.edu/hcc00022b_m.jpg
...and so on.
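If you would rather stay in the terminal, the same list can be built from the shell. A small sketch using the example URLs above (substitute your own):

```shell
# Write the URL list with a heredoc instead of a text editor.
# These are the same example URLs shown above -- replace them
# with the files you actually want to download.
cat > grabimages.txt <<'EOF'
http://digital.library.schreiner.edu/hcc00022a_m.jpg
http://digital.library.schreiner.edu/hcc00022b_m.jpg
EOF

# Confirm the list looks right before handing it to wget.
cat grabimages.txt
```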

Step 3. Save the text file (in this case grabimages.txt) and remember where it is located; to keep it simple, just save it to your home directory.

Step 4. Open the terminal window on your Ubuntu, Mint, or other Linux machine.

Step 5. Make sure you are in your home directory by typing ls (the contents of your home directory should be listed, including your new text file grabimages.txt).

Step 6. Next, type wget -c -i grabimages.txt and press Enter.  Depending on the speed of your connection and the number of files in grabimages.txt, you should see your system connect to the website(s) specified in your text file and download those files to your system.
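If you want to try the command safely before pointing it at a real site, here is a self-contained dry run of Step 6, assuming python3 and wget are installed. Python's built-in web server stands in for the remote site; the file name and port are made up for the demo.

```shell
# Create a scratch directory and a stand-in file that plays the
# role of a remote image, MP3, or PDF.
workdir=$(mktemp -d)
cd "$workdir"
echo "hello" > hcc_demo.txt

# Serve the current directory on port 8123 in the background.
python3 -m http.server 8123 >/dev/null 2>&1 &
server_pid=$!
sleep 1

# Build the URL list, pointing at the local server for this demo.
mkdir downloads
cd downloads
printf 'http://127.0.0.1:8123/hcc_demo.txt\n' > grabimages.txt

# -c resumes partially downloaded files; -i reads URLs from the file.
wget -q -c -i grabimages.txt

kill "$server_pid"
```

The -c flag is what lets you rerun the same command after an interrupted transfer and pick up where it left off.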

Step 7. Navigate to your home directory through the GUI, or list it by typing ls.  You should now see the actual files (MP3s, PDFs, or images) residing in your home folder.


Congrats, now you have the power to download multiple files from the web with a single command and a text file from the Linux command line interface.  You can adjust wget's options to perform all sorts of web harvesting, web scraping, and general grabbing of multiple files from web pages on the interwebs.
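As one pointer for further exploration, wget can also crawl a page itself and keep only the file types you want, without a hand-built URL list. A sketch, again demonstrated against a throwaway local server standing in for a real site (the file names and port are invented for the demo):

```shell
# Set up a fake "website" directory with one image and one text file.
site=$(mktemp -d)
cd "$site"
echo "jpg bytes" > photo1.jpg
echo "text bytes" > notes.txt
python3 -m http.server 8124 >/dev/null 2>&1 &
server_pid=$!
sleep 1

# -r  follows links on the page (recursive retrieval)
# -np never ascends above the starting URL
# -nd skips recreating the host/directory tree locally
# -A jpg  accepts (keeps) only files ending in .jpg
out=$(mktemp -d)
cd "$out"
wget -q -r -np -nd -A jpg http://127.0.0.1:8124/
kill "$server_pid"
```

Against a real site you would replace the localhost URL with the page you are harvesting; only the .jpg files linked from it end up on disk.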

*Information contained within these pages do not necessarily reflect the opinions or views of Schreiner University.
