Author: Blessing I. Paul
Last Update On: 16-Aug-2024 02:01:43am
Category: Technology
Topic: Software and app tutorial
Read on to see why this Python script is useful for digital marketing and SEO work.
The code below is a Python script that uses the Scrapy framework to search for a specific keyword across multiple websites. It takes a concurrent approach, using Python's multiprocessing module to speed up the scraping process.
The script starts by importing the necessary modules: scrapy, CrawlerProcess, LinkExtractor, multiprocessing, and urlparse (used below to identify internal links).
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
import multiprocessing
from urllib.parse import urlparse

class MySpider(scrapy.Spider):
    name = 'keyword_spider'

    def __init__(self, url, keyword, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [url]
        self.keyword = keyword

    def parse(self, response):
        # Check if the keyword is present in the page title
        # (guard against pages that have no <title> tag)
        title = response.css('title::text').get()
        if title and self.keyword.lower() in title.lower():
            # If the keyword is found, append the URL to a text file
            with open('matching_urls.txt', 'a') as f:
                f.write(response.url + '\n')
        # Extract all links from the page
        links = LinkExtractor().extract_links(response)
        # Keep only internal links, i.e. links whose domain matches the current page
        current_domain = urlparse(response.url).netloc
        internal_links = [link for link in links if urlparse(link.url).netloc == current_domain]
        # Follow each internal link recursively
        for link in internal_links:
            yield response.follow(link.url, callback=self.parse)
The MySpider class takes a URL and a keyword as input during initialization. It visits the given URL, looks for the keyword in the page title, and if found, saves the URL to a text file called matching_urls.txt.
The parse method of MySpider extracts all the links on the current page and keeps only the internal links, i.e. those on the same domain as the page being crawled. It then follows each internal link and calls parse on the response, so the crawl proceeds recursively through the site.
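As an aside, Scrapy has built-in support for this kind of same-site restriction, which you could use instead of filtering by hand. The following is a minimal sketch, not part of the original script; the domain example.com is a placeholder:

import scrapy
from scrapy.linkextractors import LinkExtractor

class DomainBoundSpider(scrapy.Spider):
    name = 'domain_bound_spider'
    # Scrapy's offsite filtering drops requests to any other domain
    allowed_domains = ['example.com']  # placeholder domain
    start_urls = ['https://example.com/']

    def parse(self, response):
        # allow_domains makes the extractor skip external links up front
        for link in LinkExtractor(allow_domains=self.allowed_domains).extract_links(response):
            yield response.follow(link.url, callback=self.parse)

Declaring allowed_domains is the idiomatic Scrapy approach, but the manual filter in MySpider keeps the script self-contained when the domains are only known at runtime.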
The run_spider function runs the spider for a specific URL and keyword. It sets the USER_AGENT setting to mimic a regular web browser.
def run_spider(args):
    url, keyword = args
    process = CrawlerProcess(settings={
        # Pretend to be a regular desktop browser
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    })
    process.crawl(MySpider, url=url, keyword=keyword)
    process.start()  # blocks until the crawl finishes
if __name__ == '__main__':
    print("Link Keyword Finder. Coded by Blessing \n")
    # Read the list of URLs from a text file
    with open('urls.txt', 'r') as file:
        urls = [line.strip() for line in file if line.strip()]
    # Read the keyword from a text file
    with open('keywords.txt', 'r') as file:
        keyword = file.read().strip()
    # Define the number of processes to use
    num_processes = 2
    # Create a pool of worker processes and distribute the URLs.
    # maxtasksperchild=1 gives every URL a fresh process, because the
    # Twisted reactor that CrawlerProcess runs on cannot be restarted
    # once it has stopped.
    with multiprocessing.Pool(processes=num_processes, maxtasksperchild=1) as pool:
        pool.map(run_spider, [(url, keyword) for url in urls])
In the main block, the script reads the list of URLs from a file named urls.txt and the keyword to search for from a file called keywords.txt.
It defines the number of processes to use for parallel execution (num_processes) and creates a pool of worker processes using Python's multiprocessing.Pool().
pool.map distributes the URLs among the worker processes, and each worker runs the run_spider function with one URL and the shared keyword. Note the maxtasksperchild=1 argument: the Twisted reactor that CrawlerProcess runs on cannot be restarted once it stops, so each URL must be handled by a fresh worker process.
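If you would rather avoid multiprocessing entirely, Scrapy can also schedule several crawls inside a single CrawlerProcess and run them concurrently on its Twisted reactor. Here is a minimal sketch of that alternative, assuming the MySpider class and input files from the script above:

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    with open('urls.txt') as f:
        urls = [line.strip() for line in f if line.strip()]
    with open('keywords.txt') as f:
        keyword = f.read().strip()
    process = CrawlerProcess(settings={'USER_AGENT': 'Mozilla/5.0'})
    # Schedule one crawl per URL, then start the reactor exactly once
    for url in urls:
        process.crawl(MySpider, url=url, keyword=keyword)
    process.start()  # blocks until all crawls have finished

This sidesteps the reactor-restart issue because process.start() is only called once, though the crawls then share one process rather than running on separate CPU cores.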
Before running the code, the following requirements and modules need to be in place:
Scrapy: Install the Scrapy library using pip install scrapy. Scrapy is a powerful web scraping framework that the script relies on.
Python: Ensure Python 3 is installed on your system; current versions of Scrapy no longer support Python 2.
URL and Keyword Files: Create the urls.txt file containing the list of URLs to be scraped, one URL per line. Similarly, create the keywords.txt file containing the keyword to search for; despite the plural name, the script reads this file as a single keyword. Example files are shown after this list.
Text File: Ensure you have write permission in the directory where the script will be run, so it can create and append data to the matching_urls.txt file.
Multiprocessing: Python's multiprocessing module is required for parallel processing. It is a standard library, so no additional installation is necessary.
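For reference, both input files are plain text. A hypothetical urls.txt (the domains below are placeholders, not recommendations) could look like this:

https://example.com
https://example.org
https://example.net

and keywords.txt would hold the single search term, for example:

python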
Once these requirements are met, you can execute the script by running it with Python: python web_scraping_keyword_search.py from the command prompt (CMD) or any other terminal. The script will initiate the web scraping process, and if the keyword is found on any of the web pages, their URLs will be saved in the matching_urls.txt file.
You can download the full source code for free and use it as a tool on your digital marketing and SEO journey: Web Scraping Keyword Search free download.
Don’t forget to subscribe and share this post to help us improve.
Blessing Ikechukwu Paul is the CEO/Manager of Blomset Drive Technologies and the founder of this website (www.tech-hint.net).
He is a full stack web developer, digital marketing consultant and SEO analyst, and computer security specialist with more than seven years of experience. He is available for hire, so feel free to contact him, check out more of his blog posts, and follow him on LinkedIn, Twitter and Facebook.