I already shared some clustering approaches using TF-IDF Vectorizer for grouping keywords together in Python. That works well for keywords that share similar text strings; however, it cannot group keywords by meaning and semantic relationships.
One way to capture semantics is to build, for example, word2vec models and cluster the keywords with Word Mover's Distance. The downside: you have to spend some effort building such models. For this reason, I want to show you a more accessible solution you can download and run right away.
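To give an idea of what that route involves, here is a minimal sketch, assuming you already have pretrained word2vec vectors on disk (the file name "vectors.bin", the example keywords, and the distance threshold are placeholders): it connects keywords whose Word Mover's Distance is small and treats each connected group as a cluster. This is not the approach used in the rest of this article.
# Sketch of the "build your own model" route (NOT the approach used in this article).
# Assumes a pretrained word2vec file; "vectors.bin" is a placeholder name.
# Word Mover's Distance in gensim needs the pyemd / POT package installed.
from gensim.models import KeyedVectors
import networkx as nx

keywords = ["buy running shoes", "best jogging sneakers", "marathon training plan"]
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# connect keywords whose Word Mover's Distance is below a hand-picked threshold
G = nx.Graph()
G.add_nodes_from(keywords)
for i, kw1 in enumerate(keywords):
    for kw2 in keywords[i + 1:]:
        if wv.wmdistance(kw1.split(), kw2.split()) < 1.0:
            G.add_edge(kw1, kw2)

# every connected component is treated as one semantic cluster
print(list(nx.connected_components(G)))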
Use the Google SERP results to discover semantic relationships
Google uses NLP models to offer the best search results for the user. Yes, it's a black box – but we can use it to our advantage. Instead of building our own models, we use this black box to cluster keywords by their semantics in Python. Here is how the Python program logic works:
- The starting point is a list of keywords for a topic
- For each keyword, the SERP results are fetched
- A graph is built from the relationship between keywords and ranking pages: if identical pages rank for different keywords, those keywords appear to be related. This is the principle we use to create the semantic keyword clusters (a short illustration follows this list).
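Before putting everything together, here is a small toy sketch of that principle with made-up SERP data: keywords become graph nodes, a shared ranking URL creates an edge, and Louvain community detection (the same community module used in the full script) returns the cluster assignment.
# Toy illustration of the clustering principle with invented SERP data.
import networkx as nx
import community  # python-louvain

# keyword -> URLs that rank for it (made-up example data)
serps = {
    "running shoes": ["site-a.com/shoes", "site-b.com/run"],
    "jogging sneakers": ["site-a.com/shoes", "site-c.com/sneakers"],
    "marathon plan": ["site-d.com/training"],
}

G = nx.Graph()
G.add_nodes_from(serps)
# connect two keywords if at least one identical page ranks for both
for kw1 in serps:
    for kw2 in serps:
        if kw1 != kw2 and set(serps[kw1]) & set(serps[kw2]):
            G.add_edge(kw1, kw2)

partition = community.best_partition(G)  # {keyword: cluster id}
print(partition)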
Let’s put everything together in Python
The Python script covers these functionalities:
- Download the SERPs for a given list of keywords using Google's Custom Search Engine. The results are saved to an SQLite database. You need to set up a Custom Search API here. After that, you can use the free quota of 100 requests per day – the paid plan costs $5 per 1,000 requests if you have larger keyword sets and need results right away. If you have time to go with the SQLite solution, the SERP results are appended to the table on every run (take a new set of 100 keywords on the next day when the free quota is available again). In the Python script, you have to set up these variables:
- CSV_FILE = "keywords.csv"  => store your keywords here
- LANGUAGE = "en"
- COUNTRY = "en"
- API_KEY = "xxxxxxx"
- CSE_ID = "xxxxxxx"
Running getSearchResult(CSV_FILE, LANGUAGE, COUNTRY, API_KEY, CSE_ID, DATABASE, SERP_TABLE) will write the SERP results to the database.
- The clustering is done using networkx and the community detection module. The data is fetched from the SQLite database – the clustering is called with getCluster(DATABASE, SERP_TABLE, CLUSTER_TABLE, TIMESTAMP)
- The clustering results can be found in the SQLite table – if you do not change the name, it is "keyword_clusters" by default.
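Put together, a typical run could look like this minimal sketch. It assumes the variable values from the list above and the getSearchResult / getCluster functions defined in the full script; the DATABASE, SERP_TABLE and CLUSTER_TABLE names are just examples.
# Example run – assumes the functions from the full script below are available.
CSV_FILE = "keywords.csv"            # your keyword list
LANGUAGE = "en"
COUNTRY = "en"
API_KEY = "xxxxxxx"                  # your Custom Search API key
CSE_ID = "xxxxxxx"                   # your Custom Search Engine id
DATABASE = "keywords.db"             # example file name
SERP_TABLE = "serp_results"          # example table name
CLUSTER_TABLE = "keyword_clusters"   # default name mentioned above

# 1. fetch the SERPs and store them in SQLite
getSearchResult(CSV_FILE, LANGUAGE, COUNTRY, API_KEY, CSE_ID, DATABASE, SERP_TABLE)

# 2. build the keyword graph, run community detection, write the clusters
getCluster(DATABASE, SERP_TABLE, CLUSTER_TABLE, TIMESTAMP="max")

# 3. inspect the resulting clusters
import sqlite3
import pandas as pd
with sqlite3.connect(DATABASE) as connection:
    print(pd.read_sql(f"select * from {CLUSTER_TABLE}", connection))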
You can get the full code below:
# Semantic Keyword Clustering by Pemavor.com
# Author: Stefan Neefischer (stefan.neefischer@gmail.com)
from googleapiclient.discovery import build
import pandas as pd
import Levenshtein
from datetime import datetime
from fuzzywuzzy import fuzz
from urllib.parse import urlparse
from tld import get_tld
import langid
import json
import numpy as np
import networkx as nx
import community
import sqlite3
import math
import io
from collections import defaultdict
def cluster_return(searchTerm,partition):
    # look up the cluster id assigned to a search term by the community partition
    return partition[searchTerm]

def language_detection(str_lan):
    # detect the language of a string with langid
    lan=langid.classify(str_lan)
    return lan[0]

def extract_domain(url, remove_http=True):
    # return the domain part of a URL, optionally keeping the scheme
    uri = urlparse(url)
    if remove_http:
        domain_name = f"{uri.netloc}"
    else:
        domain_name = f"{uri.scheme}://{uri.netloc}"
    return domain_name

def extract_mainDomain(url):
    # return the registered main domain (first level domain) of a URL
    res = get_tld(url, as_object=True)
    return res.fld

def fuzzy_ratio(str1,str2):
    # fuzzywuzzy simple similarity score between two strings
    return fuzz.ratio(str1,str2)

def fuzzy_token_set_ratio(str1,str2):
    # fuzzywuzzy token set similarity score between two strings
    return fuzz.token_set_ratio(str1,str2)
def google_search(search_term, api_key, cse_id, hl, gl, **kwargs):
    try:
        service = build("customsearch", "v1", developerKey=api_key, cache_discovery=False)
        res = service.cse().list(q=search_term, hl=hl, gl=gl,
                                 fields='queries(request(totalResults,searchTerms,hl,gl)),items(title,displayLink,link,snippet)',
                                 num=10, cx=cse_id, **kwargs).execute()
        return res
    except Exception as e:
        print(e)
        return e

def google_search_default_language(search_term, api_key, cse_id, gl, **kwargs):
    try:
        service = build("customsearch", "v1", developerKey=api_key, cache_discovery=False)
        res = service.cse().list(q=search_term, gl=gl,
                                 fields='queries(request(totalResults,searchTerms,hl,gl)),items(title,displayLink,link,snippet)',
                                 num=10, cx=cse_id, **kwargs).execute()
        return res
    except Exception as e:
        print(e)
        return e
def getCluster(DATABASE, SERP_TABLE, CLUSTER_TABLE, TIMESTAMP="max"):
    dateTimeObj = datetime.now()
    connection = sqlite3.connect(DATABASE)
    if TIMESTAMP == "max":
        df = pd.read_sql(f'select * from {SERP_TABLE} where requestTimestamp=(select max(requestTimestamp) from {SERP_TABLE})', connection)
    else:
        df = pd.read_sql(f'select * from {SERP_TABLE} where requestTimestamp="{TIMESTAMP}"', connection)
    G = nx.Graph()
    # add graph nodes from the dataframe column
    G.add_nodes_from(df['searchTerms'])
    # add edges between graph nodes: keywords sharing a ranking URL get connected
    for index, row in df.iterrows():
        df_link = df[df["link"] == row["link"]]
        for index1, row1 in df_link.iterrows():
            G.add_edge(row["searchTerms"], row1['searchTerms'])
    # compute the best partition for the community (clusters)
    partition = community.best_partition(G)