Duplicate Search on Quora Dataset

A few weeks back we published a post about Universal Sentence Encoders. We discussed how to use the encoders and their application in Semantic Similarity Analysis.

In this post, we will use the Universal Sentence Encoder to find duplicate questions in the First Quora dataset.

1. What is the First Quora dataset?

In Jan 2017, Quora announced that it was planning to release a series of public NLP datasets. As mentioned in its post:

Today, we are excited to announce the first in what we plan to be a series of public dataset releases. Our dataset releases will be oriented around various problems of relevance to Quora and will give researchers in diverse areas such as machine learning, natural language processing, network science, etc. the opportunity to try their hand at some of the challenges that arise in building a scalable online knowledge-sharing platform. Our first dataset is related to the problem of identifying duplicate questions.

The Quora dataset consists of a large number of question pairs, each with a label indicating whether the two questions are logical duplicates. For example, the two questions below carry the same intent.

“What is the most populous state in the USA?”
“Which state in the United States has the most people?”

Ideally, only one of the two should be present on Quora.

The dataset can be downloaded from Kaggle.

2. Pre-processing Dataset

Before carrying out the similarity analysis, it’s very important to pre-process the data. The basic outline of pre-processing is as follows.

Read the csv file
Ignore header row
For each row, extract the IDs of Question 1 and Question 2 (columns 2 and 3) and the question texts (columns 4 and 5).

Now, let’s have a look at the code for pre-processing.

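import csv
from itertools import islice

csv_fname = "q_quora.csv"

# Map question ID -> question text for each side of the pair
question1 = {}
question2 = {}

print("Loading data from {}".format(csv_fname))

# Enter -1 to read the whole file
numLines = int(input("Enter number of lines to read: "))

with open(csv_fname, 'r', newline='', encoding='utf-8') as f:
    # csv.reader keeps quoted questions intact even when they contain commas
    reader = csv.reader(f)
    next(reader, None)  # ignore the header row
    rows = reader if numLines == -1 else islice(reader, numLines)
    for row in rows:
        if len(row) < 5:
            continue  # skip malformed rows
        qid1, qid2, q1, q2 = row[1:5]
        question1[qid1] = q1
        question2[qid2] = q2
print("Data loaded successfully")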
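Questions often contain commas, and in a CSV file such fields are wrapped in quotes. The sketch below shows, on a single made-up row, how splitting on commas cuts a question apart while Python's csv module returns it whole. The header names and the sample row are assumptions for illustration, laid out in the column order described above.

```python
import csv
import io

# Hypothetical header and row; only the column order follows the post
header_line = "id,qid1,qid2,question1,question2,is_duplicate"
data_line = '0,1,2,"If I move abroad, can I still vote?","Can I vote, if I live abroad?",1'

# Naive splitting cuts each question at its internal comma
naive = data_line.split(",")[1:5]

# csv.reader respects the quotes and returns whole questions
reader = csv.reader(io.StringIO(header_line + "\n" + data_line + "\n"))
next(reader)  # skip the header row
qid1, qid2, q1, q2 = next(reader)[1:5]

print(naive)
print([qid1, qid2, q1, q2])
```

With naive splitting, the third and fourth items are two halves of the first question, so the second question never reaches the right variable.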

3. Load Google’s Universal Sentence Encoder

Next, let’s load the model.

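import tensorflow as tf
import tensorflow_hub as hub

# hub.Module is the TF1-style (graph mode) TensorFlow Hub API
module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"

print("Loading model from {}".format(module_url))
embed = hub.Module(module_url)
print("Model loaded successfully")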

4. Displaying similarity using heatmap

A heatmap is a 2-dimensional representation of data in which the individual values of a matrix are represented as colors. The intent behind using a heatmap is not to convey exact values but to show how the values vary across the matrix, for example where the highest and lowest values lie. In our case, the darker the color, the higher the similarity between the corresponding question pair.

We will use the concepts we learned in our last post on Universal Sentence Encoders, along with a heatmap, to assess the semantic similarity between two questions.

Now that we have loaded the model in the previous step, let’s write some functions to plot the heatmap. We will also use a threshold of 0.8 on the similarity score to decide whether two questions are duplicates.

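import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def plot_similarity(labels1, labels2, features1, features2, rotation):
    # Note: the rotation argument is accepted but not used below
    # corr[i][j] is the inner product of question i (set 1) and question j (set 2)
    corr = np.inner(features1, features2)
    # Binary copy: 1 where the score reaches the 0.8 threshold, 0 otherwise
    corr2 = corr.copy()
    corr2[corr2 < 0.8] = 0
    corr2[corr2 >= 0.8] = 1
    sns.set(font_scale=0.6)
    g = sns.heatmap(corr,
        vmin=0,
        vmax=1,
        cmap="YlOrRd")
    g.set_title("Semantic Textual Similarity")
    plt.tight_layout()
    plt.savefig("Quora.png")
    plt.show()
    # Map each set-1 ID to a matching set-2 ID; if several match, the last one wins
    similar_qid = {}
    for i in range(len(labels1)):
        for j in range(len(labels2)):
            if corr2[i][j] == 1:
                similar_qid[labels1[i]] = labels2[j]
    return similar_qid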

def run_and_plot(session_, input_tensor_, messages1_, messages2_, labels1_, labels2_, encoding_tensor):
    # Encode each question set, then compare every pair across the two sets
    print("Embedding questions 1")
    message_embeddings1_ = session_.run(encoding_tensor, feed_dict={input_tensor_: messages1_})
    print("Embedding questions 2")
    message_embeddings2_ = session_.run(encoding_tensor, feed_dict={input_tensor_: messages2_})
    similar_qid = plot_similarity(labels1_, labels2_,
            message_embeddings1_,
            message_embeddings2_, 90)
    return similar_qid
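Note that plot_similarity stores a single match per Question 1 ID, so when several questions from set 2 cross the threshold, only the last one survives. As a minimal sketch of an alternative, the function below keeps every match together with its score. The name all_matches and the toy 2-D vectors are hypothetical stand-ins for real embeddings, and plain Python lists are used instead of NumPy to keep the example self-contained.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def all_matches(labels1, labels2, features1, features2, threshold=0.8):
    """Return {label1: [(label2, score), ...]} for every pair scoring >= threshold."""
    matches = {}
    for l1, f1 in zip(labels1, features1):
        for l2, f2 in zip(labels2, features2):
            score = inner(f1, f2)
            if score >= threshold:
                matches.setdefault(l1, []).append((l2, score))
    return matches

# Toy unit-length vectors standing in for sentence embeddings
features1 = [[1.0, 0.0], [0.0, 1.0]]
features2 = [[0.96, 0.28], [0.28, 0.96], [1.0, 0.0]]
result = all_matches(["a", "b"], ["x", "y", "z"], features1, features2)
print(result)
```

Here question "a" matches both "x" and "z", which a one-match-per-key dictionary would reduce to "z" alone.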

5. Semantic Similarity Analysis of First Quora Dataset

Let’s run a new TensorFlow session and pass the data. We will also write the question pairs judged similar to a new file.

# Placeholder that receives a batch of question strings
similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
with tf.Session() as session:
    # Initialize the module's variables and lookup tables before encoding
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    similar_qid = run_and_plot(session, similarity_input_placeholder,
            list(question1.values()),
            list(question2.values()),
            list(question1.keys()),
            list(question2.keys()),
            similarity_message_encodings)

# Write the text of each matched pair, followed by a divider line
with open("similarity-results.txt", 'w') as f:
    for i in similar_qid:
        f.write("{},{}\n=======================\n".format(question1[i], question2[similar_qid[i]]))

6. Results: Duplicate Question Detection

Let’s look at the matrix we obtain for 2000 lines of data. We use the model to do similarity analysis for all possible pairs between Question 1 set and Question 2 set.

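Each row of the matrix (y-axis) corresponds to a question from set Q1 and each column (x-axis) to a question from set Q2, since the matrix is computed as the inner product of the Q1 embeddings with the Q2 embeddings. A deeper red color indicates that the two questions are similar, and a lighter color indicates that they are not.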

Let’s first look at some question pairs with a high similarity score. We can call these duplicates.

1. What is the quickest way to increase Instagram followers?
2. How can we increase our number of Instagram followers?
Similarity score : 0.9328753

1. How do I make friends.
2. How to make friends ?
Similarity score : 0.8874366

1. Is Career Launcher good for RBI Grade B preparation?
2. How is career launcher online program for RBI Grade B?
Similarity score : 0.86849654

1. What are some good rap songs to dance to?
2. What are some of the best rap songs?
Similarity score : 0.85388064

Let’s now look at question pairs that were judged least similar by the algorithm.

1. What is the quickest way to increase Instagram followers?
2. How to train my dog?
Similarity score : 0.23722759

1. How do I make friends.
2. How to eat a potato?
Similarity score : 0.25956485

1. Is Career Launcher good for RBI Grade B preparation?
2. Is Elon Musk crazy?
Similarity score : 0.2054378

1. What are some good rap songs to dance to?
2. What is the answer to life, universe, and everything?
Similarity score : 0.12436064
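One way to produce listings like those above is to sort all question pairs by their score and read off both ends. The sketch below is an assumption about how this could be done, not the method used for this post. The function name rank_pairs, the placeholder questions, and the scores are all made up for illustration.

```python
def rank_pairs(questions1, questions2, scores):
    """Return (score, q1, q2) tuples sorted from most to least similar."""
    pairs = [
        (scores[i][j], q1, q2)
        for i, q1 in enumerate(questions1)
        for j, q2 in enumerate(questions2)
    ]
    return sorted(pairs, key=lambda p: p[0], reverse=True)

# Placeholder questions and toy scores (not model output)
questions1 = ["q1-a", "q1-b"]
questions2 = ["q2-a", "q2-b"]
scores = [[0.9, 0.3],
          [0.2, 0.1]]

ranked = rank_pairs(questions1, questions2, scores)
print("Most similar:", ranked[0])
print("Least similar:", ranked[-1])
```

With real data, scores would be the matrix returned by np.inner inside plot_similarity, and the two question lists would be the values of question1 and question2.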

I hope you enjoyed the post. Make sure you try it out! In case of any query or suggestion, feel free to drop a comment in the comment section below and we will get back to you soon.