How to Extract URLs from Webpages
If you are wondering how to extract URLs from any webpage, for free and fast, you are in the right place!
I have selected the three best methods:
- Browser
- Addon
- Terminal
Let’s jump right into it.
1. Extract URLs from webpages with your browser:
You will need:
- a browser
- a short piece of JavaScript
Navigate to the page from which you’d like to extract links.
Right-click the page and select “Inspect”.
This opens the developer tools. Switch to the Console tab, then type or paste this code:
var links = document.querySelectorAll("a");
var myarray = [];
for (var i = 0; i < links.length; i++) {
  // Collapse whitespace in the link text so the table stays readable.
  var cleantext = links[i].textContent.replace(/\s+/g, " ").trim();
  myarray.push([cleantext, links[i].href]);
}

function make_table() {
  var table = "<table><thead><tr><th>Name</th><th>Links</th></tr></thead><tbody>";
  for (var i = 0; i < myarray.length; i++) {
    table += "<tr><td>" + myarray[i][0] + "</td><td>" + myarray[i][1] + "</td></tr>";
  }
  table += "</tbody></table>";
  // Open a blank tab and write the table into it.
  var w = window.open("");
  w.document.write(table);
}
make_table();
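If you would rather get a CSV you can paste into a spreadsheet than an HTML table, a small variant of the same idea works too. The helper below is a sketch of my own (`toCsv` is a hypothetical name, not part of any library); it quotes each cell so commas and quotes in link text do not break the format.

```javascript
// Hypothetical helper: turn an array of [text, href] pairs into CSV text.
function toCsv(rows) {
  return rows
    .map(function (row) {
      return row
        .map(function (cell) {
          // Quote every cell and escape embedded double quotes.
          return '"' + String(cell).replace(/"/g, '""') + '"';
        })
        .join(",");
    })
    .join("\n");
}

// In the browser console you would build the rows from the page's links, e.g.:
// var rows = [...document.querySelectorAll("a")].map(a => [a.textContent.trim(), a.href]);
console.log(toCsv([["Example", "https://example.com/"]]));
```

Copy the console output into a `.csv` file and you have the same data as the table, in a spreadsheet-friendly form.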
2. Extract URLs from webpages with addons:
I can recommend a really powerful Chrome extension called Link Klipper.
This extension allows you to:
- Extract all the links on the webpage
- Store all the extracted links as a CSV file
- Custom drag a selectable area on the webpage from which all the links will be extracted
3. Extract URLs from webpages with the terminal:
Here is what you will need to do the magic:
- a Mac
- Terminal
- wget installed
First of all, check whether wget is already installed by running the following command:
$ wget -V
If it is not installed yet, install Homebrew first:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Then you need to install wget:
$ brew install wget
Now you have everything you need to extract URLs from a webpage, for free, with one single command:
$ wget --mirror --delete-after --no-directories https://www.the-page-you-wanna-crawl.com 2>&1 | grep '^--' | awk '{print $3}' | sort >extracted-URLs.txt
Here is how it works: wget crawls the site and logs each URL it fetches on a line starting with "--"; grep keeps those lines, awk prints the third field (the URL itself), and sort orders the result into a text file called extracted-URLs.txt (which you can rename if you like).
This method was shared on Twitter by John Mueller.
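If wget is not an option, the same idea can be approximated in Node.js on HTML you have already downloaded. This is only a rough sketch under my own assumptions (`extractHrefs` is a hypothetical name, and a regex is not a real HTML parser, so it will miss unquoted or single-quoted attributes):

```javascript
// Rough sketch: pull href values out of an HTML string with a regex,
// de-duplicate them, and return them sorted (like the wget pipeline above).
function extractHrefs(html) {
  const matches = [...html.matchAll(/href="([^"]+)"/g)];
  return [...new Set(matches.map(function (m) { return m[1]; }))].sort();
}

const sample = '<a href="https://b.example/">B</a> <a href="https://a.example/">A</a>';
console.log(extractHrefs(sample));
```

For anything beyond a quick one-off, a proper HTML parser would be the safer choice.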
Have fun!