Web Scraping with Scrapy: How to Find Pages with Specific Keywords Using Python

Author: Blessing I. Paul
Last Update On: 16-Aug-2024 02:01:43am
Category: Technology
Topic: Software and app tutorial

This Python script is useful in digital marketing and in SEO at large: it finds the pages on a list of websites whose titles mention a given keyword.

How the Code Works:

The provided code is a Python script that uses the Scrapy framework to search for a specific keyword in page titles across multiple websites (URLs). It runs several crawls in parallel, using multiprocessing to speed up the scraping process.

The script starts by importing the necessary modules: scrapy, CrawlerProcess, LinkExtractor, and multiprocessing.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
import multiprocessing

The code also defines a custom Spider class called MySpider, which inherits from Scrapy's Spider class. This Spider is responsible for crawling websites and searching for the specified keyword.

from urllib.parse import urlparse  # used below to compare link hosts

class MySpider(scrapy.Spider):
    name = 'keyword_spider'

    def __init__(self, url, keyword, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before storing our arguments
        super().__init__(*args, **kwargs)
        self.start_urls = [url]
        self.keyword = keyword

    def parse(self, response):
        # Check if the keyword is present in the page title
        # (default='' avoids an AttributeError on pages without a <title>)
        title = response.css('title::text').get(default='')
        if self.keyword.lower() in title.lower():
            # If the keyword is found, save the URL to a text file
            with open('matching_urls.txt', 'a') as f:
                f.write(response.url + '\n')

        # Extract all links from the page
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)

        # Keep only internal links, i.e. links on the same host as the current page
        site = urlparse(response.url).netloc
        internal_links = [link for link in links if urlparse(link.url).netloc == site]

        # Follow each internal link recursively
        for link in internal_links:
            yield response.follow(link.url, callback=self.parse)

The MySpider class takes a URL and a keyword as input during initialization. It visits the given URL, looks for the keyword in the page title, and if found, saves the URL to a text file called matching_urls.txt.
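The title check described above can be tried on its own, without running a crawl. The following is a minimal sketch, not part of the original script: the helper name title_matches is hypothetical, and treating a missing title or an empty keyword as "no match" is an assumption made here for safety.

```python
def title_matches(title, keyword):
    """Return True if keyword occurs in title, ignoring case.

    Hypothetical helper: a missing title or an empty keyword is treated
    as "no match" (an assumption, not behavior taken from the script).
    """
    if not title or not keyword:
        return False
    return keyword.lower() in title.lower()


print(title_matches("Best SEO Tools for 2024", "seo"))
print(title_matches(None, "seo"))
```

Guarding against an empty keyword matters because an empty string is contained in every title, so an empty keywords.txt would otherwise mark every page as a match.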

The parse method of MySpider extracts all the links present on the current page and keeps only the internal links within the same website. It follows each internal link recursively by scheduling a new request for it with parse as the callback.
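One way to decide whether a link is internal is to compare its host with the host of the page it was found on. The sketch below uses only the standard library; the function name is_internal and the example URLs are assumptions for illustration, and note that under this rule www.example.com and example.com count as different hosts.

```python
from urllib.parse import urljoin, urlparse


def is_internal(page_url, link_url):
    """Return True if link_url points to the same host as page_url.

    Hypothetical helper: relative links are resolved against page_url first.
    """
    absolute = urljoin(page_url, link_url)
    return urlparse(absolute).netloc.lower() == urlparse(page_url).netloc.lower()


page = "https://example.com/blog/post"
print(is_internal(page, "https://example.com/about"))
print(is_internal(page, "/contact"))
print(is_internal(page, "https://other.example.org/"))
```

Comparing hosts rather than URL prefixes means a link from a deep page back to the home page still counts as internal.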

The run_spider function is defined to run the Scrapy Spider with a specific URL and keyword. It sets the user-agent to mimic a web browser.

def run_spider(args):
    url, keyword = args
    process = CrawlerProcess(settings={
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    })
    process.crawl(MySpider, url=url, keyword=keyword)
    process.start()

if __name__ == '__main__':
    # Read the list of URLs from a text file
    print("Link Keyword Finder. Coded by Blessing \n")
    with open('urls.txt', 'r') as file:
        # Skip blank lines so that no crawl is started with an empty URL
        urls = [line.strip() for line in file if line.strip()]

    # Read the keyword from a text file
    with open('keywords.txt', 'r') as file:
        keyword = file.read().strip()

    # Define the number of processes to use
    num_processes = 2

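    # Create a pool of worker processes and distribute the URLs.
    # maxtasksperchild=1 and chunksize=1 give every crawl a fresh process,
    # because Scrapy's Twisted reactor cannot be started twice in one process.
    with multiprocessing.Pool(processes=num_processes, maxtasksperchild=1) as pool:
        pool.map(run_spider, [(url, keyword) for url in urls], chunksize=1)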

In the main block, the script reads the list of URLs from a file named urls.txt and the keyword to search for from a file called keywords.txt.

It defines the number of processes to use for parallel execution (num_processes) and creates a pool of worker processes using Python's multiprocessing.Pool().

The pool distributes the URLs among the worker processes, and each worker runs the run_spider function with one URL and the shared keyword.
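The way the pool pairs each URL with the shared keyword can be sketched without Scrapy. In the example below, describe_task is a hypothetical stand-in for run_spider that only formats a string, and build_tasks and run_all are illustrative names that do not appear in the original script.

```python
import multiprocessing


def build_tasks(urls, keyword):
    """Pair every URL with the shared keyword, one tuple per crawl."""
    return [(url, keyword) for url in urls]


def describe_task(args):
    """Stand-in for run_spider: unpack the tuple and report what would be crawled."""
    url, keyword = args
    return f"{keyword} @ {url}"


def run_all(urls, keyword, num_processes=2):
    """Hand the tasks to a pool; map() returns results in input order."""
    tasks = build_tasks(urls, keyword)
    with multiprocessing.Pool(processes=num_processes) as pool:
        return pool.map(describe_task, tasks)


print(build_tasks(["https://example.com", "https://example.org"], "seo"))
```

To try the pool itself, call run_all from under an if __name__ == '__main__': guard, as the script does. The worker takes a single tuple because Pool.map passes exactly one argument to the function it calls.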


Requirements and Modules Needed:

Before running the code, the following requirements and modules need to be in place:

Scrapy: Install the Scrapy library using pip install scrapy. Scrapy is a powerful web scraping framework that the script relies on.

Python: Ensure you have Python installed on your system, preferably Python 3.x.

URL and Keyword Files: Create the urls.txt file containing a list of URLs to be scraped, with one URL per line. Similarly, create the keywords.txt file containing the specific keyword to search for.

Text File: Ensure you have write permission in the directory where the script will be run, so that it can create and append data to the matching_urls.txt file.

Multiprocessing: Python's multiprocessing module is required for parallel processing. It is part of the standard library, so no additional installation is necessary.
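Before starting a long crawl, it can help to check that the two input files parse the way you expect. The sketch below is an illustration under assumptions: the function names read_urls and read_keyword are hypothetical, the files are assumed to be UTF-8, and blank lines in urls.txt are skipped.

```python
from pathlib import Path


def read_urls(path="urls.txt"):
    """Return the non-empty lines of the URL file, one URL per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def read_keyword(path="keywords.txt"):
    """Return the whole keyword file as one keyword, without surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()
```

For example, a urls.txt with https://example.com on the first line, an empty second line, and https://example.org on the third yields a list of two URLs.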

Once these requirements are met, run the script with Python from the command prompt (CMD) or any other terminal: python web_scraping_keyword_search.py. The script starts the web scraping process, and whenever the keyword is found in the title of a crawled page, that page's URL is saved to the matching_urls.txt file.
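Because the script opens matching_urls.txt in append mode, running it more than once can leave the same URL in the file several times. A small post-processing step can collapse the duplicates. This is a sketch with a hypothetical function name, unique_matches, and it assumes the file is UTF-8 with one URL per line.

```python
from pathlib import Path


def unique_matches(path="matching_urls.txt"):
    """Return the URLs from the results file without duplicates, in first-seen order."""
    results = Path(path)
    if not results.exists():
        # No match was written yet, so there is nothing to report
        return []
    lines = results.read_text(encoding="utf-8").splitlines()
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(line.strip() for line in lines if line.strip()))
```

Calling unique_matches() after a run gives a clean list that can be printed or written to a new file.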

You can download the full source code as a free tool for your digital marketing and SEO journey: Web Scraping Keyword Search free download.
