Google Search Console can help you determine which of your website pages are indexed, but what about identifying the ones that aren't? Columnist Paul Shapiro has a Python script that does just that.

There are three main components to organic search: crawling, indexing and ranking. When a search engine like Google arrives at your website, it crawls all of the links it finds. Information about what it finds is then entered into the search engine's index, where different factors are used to determine which pages to return, and in what order, for a particular search query.

As SEOs, we tend to focus our efforts on the ranking component, but if a search engine isn't able to crawl and index the pages on your site, you're not going to receive any traffic from Google. Clearly, ensuring your site is properly crawled and indexed by search engines is an important part of SEO.

But how can you tell if your site is indexed properly?

If you have access to Google Search Console, it tells you how many pages are contained in your XML sitemap and how many of them are indexed. Unfortunately, it doesn't go as far as to tell you which pages aren't indexed.

This can leave you with a lot of guesswork or manual checking. It's like looking for a needle in a haystack. No good! Let's solve this problem with a little technical ingenuity and another free SEO tool of mine.

Determining if a single URL has been indexed by Google
To determine whether an individual URL has been indexed by Google, we can use the info: search operator.

If the URL is indexed, a result will show for that URL.

However, if the URL is not indexed, Google will return an error saying there is no information available for that URL.
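The same check can be issued programmatically by URL-encoding the info: query into an ordinary Google search request. A minimal sketch (the URL below is a made-up example):

```python
from urllib.parse import quote_plus

url = "https://example.com/some-page"  # hypothetical URL to check
search_url = "https://www.google.com/search?q=" + quote_plus("info:" + url)
# search_url now carries the same info: query you would type manually
```

Fetching search_url and inspecting the response is the core of the bulk checker below.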

Using Python to bulk-check index status of URLs
Now that we know how to check if a single URL has been indexed, you might be wondering how you can do this en masse. You could have 1,000 little workers check each one or, if you prefer, you could use my Python solution:
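Paul's original script isn't reproduced in this excerpt, so the sketch below is a rough approximation of the approach rather than his actual code. It assumes Polipo is listening on localhost:8123 (as configured later in this article), and it assumes Google's "no information is available" message is the marker for a non-indexed URL; both details may need adjusting.

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

# Assumption: Polipo is bridging HTTP requests into Tor on this port.
PROXIES = {
    'http': 'http://localhost:8123',
    'https': 'http://localhost:8123',
}

# Assumption: this phrase appears on Google's "not indexed" error page.
NOT_INDEXED_MARKER = "no information is available"


def is_indexed(html):
    """Return True if the info: result page looks like an indexed URL."""
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    return NOT_INDEXED_MARKER not in text


def check_urls(url_file, out_file, delay):
    """Check every URL listed in url_file; write url,indexed rows to out_file."""
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(out_file, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["url", "indexed"])
        for url in urls:
            resp = requests.get(
                "https://www.google.com/search",
                params={"q": "info:" + url},
                proxies=PROXIES,
                headers={"User-Agent": "Mozilla/5.0"},  # look like a browser
            )
            writer.writerow([url, str(is_indexed(resp.text)).upper()])
            time.sleep(delay)  # pause between checks to avoid blocks
```

The delay, the input filename and the output filename correspond to the three prompts described below.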

To use the Python script above, make sure you have Python 3 installed. You will also have to install the BeautifulSoup library. To do this, open up a terminal or command prompt and execute:

pip install beautifulsoup4

You can then download the script to your computer. In the same folder as the script, create a text file with a list of URLs, listing each URL on a separate line.

Now that your script is ready, we need to set up Tor to run as our free proxy. On Windows, download the Tor Expert Bundle. Extract the ZIP archive to a local directory and run tor.exe. Feel free to minimize the window.

Next, we have to install Polipo, which lets us use Tor as an HTTP proxy. Download the latest Windows binary and unzip it to a folder.

In your Polipo folder, create a text file (ex: config.txt) with the following contents:

  socksParentProxy = "localhost:9050"
  socksProxyType = socks5
  diskCacheRoot = ""
  disableLocalInterface = true

Open a command prompt and navigate to your Polipo directory.

Run the following command:

  polipo.exe -c config.txt

At this point, we're ready to run our actual Python script:

  python

The script will prompt you to specify the number of seconds to wait between checking each URL.

It will also prompt you to enter a filename (without the file extension) to output the results to a CSV.

Finally, it will ask for the filename of the text file that contains the list of URLs to check.

Enter this information and let the script run.

The end result will be a CSV file, which can easily be opened in Excel, specifying TRUE if a page is indexed or FALSE if it isn't.
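The exact column layout depends on the script, but a TRUE/FALSE CSV in that shape is easy to filter with Python's csv module when you only want the pages that still need attention. The sample rows below are made up:

```python
import csv
import io

# Made-up sample in a url,indexed layout
sample = "url,indexed\nhttps://example.com/a,TRUE\nhttps://example.com/b,FALSE\n"

rows = list(csv.DictReader(io.StringIO(sample)))
not_indexed = [row["url"] for row in rows if row["indexed"] == "FALSE"]
# not_indexed → ['https://example.com/b']
```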

In the event that the script doesn't seem to be working, Google has probably blocked Tor. Feel free to use your own proxy service in this case, by modifying the following lines of the script:

  proxies = {
      'http': 'http://localhost:8123',
      'https': 'http://localhost:8123'
  }

Knowing which pages are indexed by Google is critical to SEO success. You can't get traffic from Google if your web pages aren't in Google's database!

Unfortunately, Google doesn't make it easy to determine which URLs on a website are indexed. But with a little elbow grease and the above Python script, we are able to solve this problem.
