- Using wget with a proxy
- Introduction
- Prerequisites & Installation
- Download wget on Mac
- Download wget on Windows
- What is wget?
- wget Commands
- Download a single file 📃
- Download a File to a Specific Directory 📁
- Rename a Downloaded File 📝
- Define Yourself as User-Agent 🧑💻
- Limit Speed ⏩
- Extract as Google bot 🤖
- Convert Links on a Page 🖇️
- Mirroring Single Webpages 📑
- Extract Multiple URLs 🗂️
- How to Configure a Proxy with wget
- Conclusion
- Resources
- How to Use Wget With Proxy
- What is Wget
- How to install Wget
- Running Wget
- Downloading a single file
- Changing the User-Agent
- Downloading multiple files
- Extracting links from a webpage
- Using proxies with Wget
- cURL vs Wget
- Conclusion
Using wget with a proxy
Introduction
In this article, you will learn how to use wget commands to retrieve or transfer data through a proxy server. Proxy servers are often described as the gateway between you and the world wide web, and they can make accessing data more secure. Feel free to learn more about proxies here, but let’s get started!
Prerequisites & Installation
This article is for a wide range of developers, ✨including you juniors!✨ But to get the most out of the material, it is advised to:
✅ Be familiar with Linux and Unix commands and arguments.
✅ Have wget installed.
Check if wget is installed by opening the terminal and typing:
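wget --version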
If it is present, the command will return the version. If not, follow the steps below to download wget on Mac or Windows.
Download wget on Mac
The recommended method to install wget on a Mac is to use a package manager such as Homebrew.
You can install wget with Homebrew by running:
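brew install wget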
You can check for a successful installation by running the version check from earlier (wget --version) again to view the installed version.
Download wget on Windows
To install and configure wget for Windows:
- Download wget for Windows and install the package.
- Copy the wget.exe file into your C:\Windows\System32 folder.
- Open the command prompt (cmd.exe) and run wget to check that it was installed successfully.
Still having trouble? Here is an additional video that shows how to install wget on Windows 10.
What is wget?
wget is a GNU command-line utility tool primarily used to download content from the internet. It supports HTTP, HTTPS, and FTP protocols.
wget is designed to be reliable over slow or unstable network connections. If a download stops before completion due to a network error, wget automatically resumes the same download from where it left off and repeats this process until the whole file is successfully retrieved.
The tool also works as a web crawler by scraping linked resources from HTML pages and downloading them in sequence. It repeats this process until all content has been downloaded or the recursion depth specified by the user is reached. The retrieved data is then saved in a directory structure mirroring the remote server, effectively creating a clone of the webpages over HTTP.
wget is versatile, which is another reason it became so popular: it works in scripts, terminals, and cron jobs. The tool is also non-interactive and runs independently in the background, so it does not matter whether you are actively logged on while downloads occur.
Speaking of downloads, wget even supports downloads through HTTP proxies. A proxy server is any machine that translates traffic between networks or protocols. Proxies are intermediary servers separating end-user clients from the destinations they browse. Proxy servers may exist in schools, workplaces, and other institutions where users need authentication to access the internet, and in some cases they restrict users from accessing certain websites.
When you use a proxy server, traffic flows through it to the requested address. The response then usually returns through that same server (although this is not always the case), which forwards the data received from the requested webpage back to you.
Thanks to proxies, you can download content from the world wide web more securely. In this post, you will learn how to do just that by using wget behind a proxy server.
wget Commands
If you are not familiar with wget, the tool uses a pretty repetitive syntax. It has two arguments: [OPTION] and [URL].
OPTION: decides what to do with the given argument. To view all the wget options, run wget -h.
URL: the address of the file or directory you want to download or synchronize. You can pass multiple OPTIONs or URLs at once.
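The general form looks like this:

wget [OPTION]... [URL]...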
Now that you have learned wget’s long and tedious syntax, it’s time to learn some commands! 📣
Download a single file 📃
To download a regular file, run:
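wget https://example.com/file.txt
# example.com/file.txt is a placeholder; substitute the URL of the file you want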
You can even set wget to retrieve the data only if the version on the server is newer than your local copy. Instead of running the previous command, first fetch the file with the -S option to view the server response headers, which include the file’s timestamp:
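wget -S https://example.com/file.txt
# -S (--server-response) prints the server's headers; the URL is a placeholder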
Next, to check whether the file has changed and download it again only if it has, run:
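wget -N https://example.com/file.txt
# -N (--timestamping) downloads the file only if the remote copy is newer than the local one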
If you would like to download content and save it under the title of the HTML page, run this:
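One way to do this (a sketch rather than the article's exact command; the URL is a placeholder) is to pull the page title out of the HTML and pass it to -O:

url="https://example.com"
wget -O "$(wget -qO- "$url" | sed -n 's:.*<title>\(.*\)</title>.*:\1:p').html" "$url"
# the inner wget fetches the page, sed extracts the <title> text, and -O saves the page under that name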
Note that any / characters in the title will be treated as directory separators, so adjust the name as needed.
Download a File to a Specific Directory 📁
Just replace PATH with the directory where you want the file to be saved.
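For example (both PATH and the URL are placeholders):

wget -P PATH https://example.com/file.txt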
Rename a Downloaded File 📝
To rename a file, replace FILENAME with your desired name and run:
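wget -O FILENAME https://example.com/file.txt
# -O saves the download under FILENAME; the URL is a placeholder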
Define Yourself as User-Agent 🧑💻
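You can identify yourself to the server by setting a custom User-Agent string. A minimal sketch (the User-Agent string and URL are placeholders):

wget --user-agent="MyCrawler/1.0 (+https://example.com/bot-info)" https://example.com
# --user-agent (or -U) replaces wget's default User-Agent header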
Limit Speed ⏩
Part of scraping etiquette is not crawling too fast. Thankfully, wget can help with that through the --wait and --limit-rate options.
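For instance (the values and URL are placeholders):

wget --wait=2 --limit-rate=200k https://example.com/file.zip
# --wait pauses 2 seconds between retrievals; --limit-rate caps the download speed at 200 KB/s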
Extract as Google bot 🤖
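To fetch a page while identifying as Google's crawler, set the User-Agent accordingly (the URL is a placeholder; the string below is the commonly published Googlebot User-Agent):

wget --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com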
Convert Links on a Page 🖇️
Convert the links in the downloaded HTML so they still work in your local copy (e.g. example.com/path -> localhost:8000/path).
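A sketch (the URL is a placeholder):

wget --convert-links https://example.com
# --convert-links (-k) rewrites links in the downloaded HTML so they are suitable for local viewing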
Mirroring Single Webpages 📑
You can run this command to mirror a single web page to view it on your local device.
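A common combination for this (the URL is a placeholder):

wget -E -k -p --no-parent https://example.com/page.html
# -p fetches the page requisites (images, CSS), -k converts links for local viewing,
# -E adds .html extensions where needed, and --no-parent keeps wget on that page's path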
Extract Multiple URLs 🗂️
First, create a urls.txt file and add all the desired URLs to it.
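For example, urls.txt could contain one URL per line (these are placeholders):

https://example.com/page1.html
https://example.com/page2.html
https://example.com/page3.html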
Next, run the following command to extract all the URLs.
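wget -i urls.txt
# -i (--input-file) reads the list of URLs to download from urls.txt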
That covers the most commonly used wget commands, but feel free to check out more!
How to Configure a Proxy with wget
First locate the wget initialization file inside /usr/local/etc/wgetrc (global, for all users) or $HOME/.wgetrc (for a single user). You can also view the documentation here to see a sample wget initialization file .wgetrc.
- Inside the initialization file add the lines:
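For example (proxy.example.com:8080 is a placeholder for your proxy's host and port):

# point these at your actual proxy address and port
use_proxy = on
http_proxy = http://proxy.example.com:8080/
https_proxy = http://proxy.example.com:8080/
ftp_proxy = http://proxy.example.com:8080/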
wget recognizes the following environment variables to specify proxy location:
http_proxy/https_proxy: should contain the URLs of the proxies for HTTP and HTTPS connections respectively.
ftp_proxy: should contain the URL of the proxy for FTP connections. It is common that http_proxy and ftp_proxy are set to the same URL.
no_proxy: should contain a comma-separated list of domain extensions for which the proxy should not be used.
In addition to the environment variables, the proxy location and settings may be specified from within wget itself, using the --no-proxy command-line option and the ‘proxy = on/off’ setting. Note that these may suppress the use of a proxy even when the correct environment variables are in place.
In the shell you can set the variables by running:
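export http_proxy=http://proxy.example.com:8080/
export https_proxy=http://proxy.example.com:8080/
export ftp_proxy=http://proxy.example.com:8080/
# the proxy address above is a placeholder; use your own proxy's host and port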
- Lastly, add the following line(s) to either your ~/.bash_profile or /etc/profile:
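These are the same export lines as above, for example:

export http_proxy=http://proxy.example.com:8080/  # placeholder proxy address; substitute your own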
Some proxy servers require authorization to enable use, usually consisting of a username and password, which wget must send. As with HTTP authorization, several authentication schemes exist, but only the Basic authentication scheme is currently implemented for proxies.
You may enter your username and password through the proxy URL or via the command-line options. In the not uncommon case that a company’s proxy is located at proxy.company.com on port 8001, a proxy URL containing authorization details might look like this:
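http_proxy=http://username:password@proxy.company.com:8001/
# username and password are placeholders for your proxy credentials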
Alternatively, you can use the --proxy-user and --proxy-password options, or the equivalent .wgetrc settings proxy_user and proxy_password, to set the proxy username and password.
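For example (the credentials and URL are placeholders):

wget --proxy-user=username --proxy-password=password https://example.com/file.txt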
- You did it! Now wget your data using your proxy. 🎉
Conclusion
Now that you are a wget proxy pro, you have free rein to extract almost whatever you want from websites. wget is a free and user-friendly tool that does not look like it is going anywhere anytime soon, so go ahead and get familiar with it. Hopefully, this article has helped you start your journey, so you can wget all the data you need!
As always, there are alternatives to wget, such as aria2 and cURL, and each comes with its own benefits. cURL also supports proxy use, and you can see how to do that in the article How to set up a proxy with cURL?.
If you have enjoyed this article on setting up a proxy with wget, give ScrapingBee a try, and get the first 1000 requests free. Check out the getting started guide here!🐝
Scraping the web is challenging, given that anti-scraping mechanisms are growing by the day, so getting it done right can be quite a tedious task. ScrapingBee allows you to skip the noise and focus only on what matters the most: data.
Resources
How to Use Wget With Proxy
Wget is a popular command-line utility that can download files from the web. It’s part of the GNU Project and, as a result, commonly bundled with numerous Linux distributions.
This article will walk you through the step-by-step process of installing and downloading files using Wget with or without proxies, covering multiple scenarios and showcasing practical examples.
What is Wget
Wget is a free software package that can retrieve files via HTTP(S) and FTP(S) internet protocols. The utility is part of the GNU Project. Thus, the full name is GNU Wget. The capitalization is optional (Wget or wget).
How to install Wget
Wget can be downloaded from the official GNU channel and installed manually. However, we recommend using package managers. Package managers facilitate the installation and make future upgrades more convenient. Also, most Linux distributions are bundled with Wget.
To install Wget on Ubuntu/Debian, open the terminal and run the following command:
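sudo apt-get install wget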
To install Wget on CentOS/RHEL, open the terminal and run the following command:
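sudo yum install wget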
If you’re using macOS, we highly recommend using the Homebrew package manager. Open the terminal and run the following command:
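brew install wget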
If you’re using Windows, Chocolatey package manager is a good choice. When using Chocolatey, run the following command from the command line or PowerShell:
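choco install wget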
Lastly, to verify the installation of Wget, run the following command:
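wget --version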
This will print the installed version of Wget along with other related information.
Running Wget
The Wget command can be run from any command-line interface. In this tutorial, we’ll be using the terminal. To run the Wget command, open the terminal and enter the following:
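wget --help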
This will list all the options that can be used with the Wget command grouped in categories, such as Startup, Logging, Download, etc.
Downloading a single file
To download a single file, run Wget and type in the complete URL of the file. For example, the Wget source archive is located at https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz. To download this file, enter the following in the terminal:
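wget https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz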
Wget shows the progress of downloads
Wget shows detailed information about the file being downloaded: the download completion bar, progress of each step, total file size and its mime type, etc.
Changing the User-Agent
Every program, including web browsers, sends certain headers when connecting to a web service. In this case, the User-Agent header is the most important as it contains a string that identifies the program.
To see how the User-Agent varies across applications, open this URL in the different browsers you have installed.
To identify the User-Agent used by Wget, request this URL:
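For example, an echo service such as httpbin returns the User-Agent it received (the exact URL used here is an assumption):

wget https://httpbin.org/user-agent
# httpbin.org/user-agent is an assumed stand-in for the URL referenced above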
This command will download a file named user-agent without any extension. To view the contents of this file, use the cat command on macOS and Linux. On Windows, you can use the type command.
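cat user-agent
# on Windows: type user-agent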
The default User-Agent can be modified using the --header option. The syntax is as follows:
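wget --header "User-Agent: DESIRED_USER_AGENT" URL
# DESIRED_USER_AGENT and URL are placeholders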
The following example should clarify it further:
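# the User-Agent string and target URL below are illustrative placeholders
wget --header "User-Agent: MyBrowser/1.0" https://httpbin.org/user-agent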
As is evident here, the User-Agent has changed. If you wish to send any other header, you can add more --header options, each followed by a header in "HeaderName: HeaderValue" format.
Downloading multiple files
There are two methods for downloading multiple files using Wget. The first method is to send all the URLs to Wget separated with a space. For example, the following command will download files from all three URLs:
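wget URL1 URL2 URL3
# URL1, URL2, and URL3 are placeholders for the actual file addresses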
If you wish to try a real example, use the following command:
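One plausible pair of files (the exact URLs are an assumption, taken from the GNU wget release directory used earlier):

wget https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz.sig
# downloads the release archive and its signature, one after the other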
The command will download both files one at a time.
This method works well when the number of files is limited. It can become difficult to manage as the number of files grows, making the second method more useful.
The second method is to write all the URLs in a file and use the -i or --input-file option. For example, to read the URLs from the urls.txt file, run either of the following commands:
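wget -i urls.txt
wget --input-file=urls.txt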
The best part of this option is that if any of the URLs don’t work, Wget will continue and download the rest of the functional URLs.
Extracting links from a webpage
The --input-file option of the Wget command can also be used to extract links from a webpage.
In its simplest form, you can supply a URL that contains the links to the files. For example, this page contains links to downloadable content of Wget. To download all files from this URL, run the following:
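Assuming the page in question is the GNU wget release directory (an assumption made for illustration; substitute whichever page you are targeting):

wget --input-file=https://ftp.gnu.org/gnu/wget/
# when --input-file points at an HTML page, Wget extracts the links it contains and downloads them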
However, this command won’t be particularly useful without any further customization. There are multiple reasons for that.
By default, Wget does not overwrite existing files. If a download would overwrite an existing file, Wget creates a new file by appending a numerical suffix. This means that for every repeated download of a compressed.gif file, it creates new files with names such as compressed.gif, compressed.gif.1, compressed.gif.2, and so on.
This behavior can be modified by specifying the --no-clobber switch to skip duplicate files.
Next, you may want to download the files recursively by specifying the --recursive switch.
Finally, you may want to skip downloading certain files by passing their extensions as a comma-separated list to the --reject switch.
Similarly, you may want to download only certain files while ignoring everything else by using the --accept switch. This also takes a comma-separated list of extensions.
Some other useful switches are --no-directories and --no-parent. These two ensure that no directories are created and that the Wget command doesn’t traverse to a parent directory.
For example, to download all files with the .sig extension, use the following command:
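A sketch combining the switches described above (the directory URL is the same assumed GNU release page):

wget --recursive --level=1 --no-parent --no-directories --no-clobber --accept=sig https://ftp.gnu.org/gnu/wget/
# recurse one level from the index page, keep a flat directory, skip duplicates, and keep only .sig files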
Using proxies with Wget
There are two methods for Wget proxy integration. The first method uses command line switches to specify the proxy server and authentication details.
The easiest way to verify that a proxy is working is to check your IP address before and after specifying the proxy server. To check your current IP address, run the following commands:
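A sketch using a public IP-echo service (the exact service URL is an assumption; any endpoint that returns your IP address will do):

wget https://api.ipify.org/
cat index.html
# the first command saves the response as index.html; cat (or type on Windows) prints its contents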
The first command simply receives the index.html file containing the IP address. The cat command (or type command for Windows) prints the file contents.
The same result can be achieved by running Wget in quiet mode and redirecting the output to the terminal instead of downloading the file:
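wget --quiet --output-document=- https://api.ipify.org/
# --output-document=- writes the response to standard output instead of a file; --quiet hides the progress output
# (the URL is the same assumed IP-echo service as above)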
The shorter version of the same command is as follows:
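wget -qO- https://api.ipify.org/
# the same command with the short options combined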
To use a proxy that doesn’t require authentication, use two -e (--execute) switches. The first enables the proxy, and the second specifies the proxy server’s URL.
The following command enables the proxy and specifies the proxy server’s IP 12.13.14.15 and port 1234:
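wget -e use_proxy=on -e http_proxy=http://12.13.14.15:1234 https://api.ipify.org/
# the target URL is the assumed IP-echo service from above, so the response should now show the proxy's IP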
In the example above, the proxy doesn’t require authentication. If the proxy server requires user authentication, set the proxy username with the --proxy-user switch and the proxy password with the --proxy-password switch:
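wget -e use_proxy=on -e http_proxy=http://12.13.14.15:1234 --proxy-user=USERNAME --proxy-password=PASSWORD https://api.ipify.org/
# USERNAME and PASSWORD are placeholders for the proxy credentials; the target URL is again the assumed IP-echo service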
As evident here, the command is quite long. However, it’s useful when you don’t want to use a proxy all the time.
The second method is to use the .wgetrc configuration file. This file can store proxy configuration, which Wget then reads.
The configuration file is located in the user’s home directory and is named .wgetrc. Alternatively, you can use any file as the configuration file by using the --config switch.
To set the proxy in the ~/.wgetrc file, enter the following lines:
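use_proxy = on
http_proxy = http://12.13.14.15:1234/
# reuses the example proxy IP and port from above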
If you also need to set user authentication for the proxy, modify the file as follows:
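use_proxy = on
http_proxy = http://12.13.14.15:1234/
proxy_user = USERNAME
proxy_password = PASSWORD
# USERNAME and PASSWORD are placeholders for the proxy credentials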
As of now, every time Wget runs, it’ll use the specified proxy.
The proxies can also be set with environment variables such as http_proxy. However, this approach isn’t specific to Wget and applies to other programs that read these variables as well, making it less suitable for the task at hand.
cURL vs Wget
cURL or Curl is another open-source command-line tool for downloading files and is available for free.
cURL and Wget share many similarities, but there are important distinctions differentiating the tools for specific individual purposes.
First, let’s take a quick look at the similarities. Both options:
Are open-source, command-line tools for downloading content over HTTP(S) and FTP(S)
Can send HTTP GET and POST requests
Are designed to run in the background
The following features are only available in cURL:
Available as a library
Support for more protocols beyond HTTP and FTP
Better SSL support
More HTTP authentication methods
Support for SOCKS proxies
Better support for HTTP POST
Nonetheless, Wget has its advantages as well:
Supports recursive downloads. This is the most prominent advantage, allowing you to download files recursively using the --mirror switch and create a local copy of a website.
Can resume interrupted downloads
This article expands more on what cURL is and how to use it. If you want to read about the differences in detail, see the cURL comparison table.
The differences listed above should help you figure out the more suitable tool for a particular scenario. For example, if you want recursive downloads, choose Wget. If you require SOCKS proxy support, pick cURL.
Neither tool is decisively better than the other. Select the one that is suitable for your specific scenario at a given moment.
Conclusion
This article detailed how to use Wget, from installation and downloading single or multiple files to the methods of using proxies. Lastly, the comparison between cURL and Wget reviewed their differences in functionality and individual use cases.
If you want to find out more about how proxies and advanced public data acquisition tools work or about specific web scraping use cases, such as web scraping job postings or building a Python web scraper, check out our blog.