Duplicate Search on Quora Dataset

Vishwesh Shrimali
December 12, 2018

A few weeks back we published a post about Universal Sentence Encoders. We discussed how to use the encoders and their application in Semantic Similarity Analysis.

In this post, we will use the Universal Sentence Encoder to find duplicate questions in the First Quora dataset.

1. What is the First Quora dataset?

In January 2017, Quora announced that it was planning to release a series of public NLP datasets. As mentioned in its post:

Today, we are excited to announce the first in what we plan to be a series of public dataset releases. Our dataset releases will be oriented around various problems of relevance to Quora and will give researchers in diverse areas such as machine learning, natural language processing, network science, etc. the opportunity to try their hand at some of the challenges that arise in building a scalable online knowledge-sharing platform. Our first dataset is related to the problem of identifying duplicate questions.

The Quora dataset consists of a large number of question pairs, each with a label that states whether the two questions are logical duplicates. For example, the two questions below carry the same intent.

  1. “What is the most populous state in the USA?”
  2. “Which state in the United States has the most people?”

Ideally, only one of the two should be present on Quora.

The dataset can be downloaded from Kaggle.

2. Pre-processing Dataset

Before carrying out the similarity analysis, it’s very important to pre-process the data. The basic outline of pre-processing is as follows.

  1. Read the CSV file.
  2. Ignore the header row.
  3. For each row, extract the IDs of Question 1 and Question 2 (columns 2 and 3) and the question texts (columns 4 and 5).

Now, let’s have a look at the code for pre-processing.

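import csv

csv_fname = "q_quora.csv"

# Map each question ID to its text, one dictionary per question column
question1 = {}
question2 = {}

print("Loading data from {}".format(csv_fname))

# Number of lines to read, header included; enter -1 to read the whole file
numLines = int(input("Enter number of lines to read: "))

with open(csv_fname, 'r', newline='', encoding='utf-8') as f:
    # csv.reader keeps quoted questions that contain commas in a single field
    reader = csv.reader(f)
    next(reader, None)  # ignore the header row
    for i, row in enumerate(reader, start=1):
        if numLines != -1 and i >= numLines:
            break
        if len(row) < 5:
            continue  # skip malformed rows
        qid1, qid2, q1, q2 = row[1:5]
        question1[qid1] = q1
        question2[qid2] = q2
print("Data loaded successfully")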
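Splitting a line on commas only works while no question contains a comma itself. The sketch below illustrates the difference on a made-up row (the sample text is not taken from the dataset, and the column layout id, qid1, qid2, question1, question2, is_duplicate is assumed from the outline above).

```python
import csv
import io

# Made-up sample in the assumed layout:
# id, qid1, qid2, question1, question2, is_duplicate
sample = io.StringIO(
    'id,qid1,qid2,question1,question2,is_duplicate\n'
    '0,1,2,"If I move abroad, how do I vote?",How can I vote from abroad?,1\n'
)

def naive_parse(line):
    # Plain split: the comma inside the quoted question shifts the fields
    return line.strip().split(',')[1:5]

def csv_parse(line):
    # csv.reader honours the quotes and keeps the question in one field
    return next(csv.reader([line]))[1:5]

header, row = sample.read().splitlines()
naive = naive_parse(row)
parsed = csv_parse(row)

print(naive)   # question 1 is cut in two, question 2 is lost
print(parsed)  # both questions intact
```

With the plain split, the first question is cut at its comma and the second question never reaches the dictionaries, without any exception being raised.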

3. Load Google’s Universal Sentence Encoder

Next, let’s load the model.

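import tensorflow as tf
import tensorflow_hub as hub

# hub.Module is the TensorFlow 1.x Hub API, matching the tf.Session code used later
module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"

print("Loading model from {}".format(module_url))
embed = hub.Module(module_url)
print("Model loaded successfully")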

4. Displaying similarity using heatmap

A heatmap is a two-dimensional representation of data in which the individual values of a matrix are represented as colors. The intent behind using a heatmap is not to convey exact values but to visualize how the values vary across the matrix, for example where they are high, low, or close to the average. In our case, the darker the color, the higher the similarity between the corresponding question pair.

We will use the concepts we learned in our last post on Universal Sentence Encoders along with heatmap to assess the semantic similarity between two questions.
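As a compact statement of the score used in this post: the code takes the inner product of the two sentence embeddings. Writing $u_i$ for the embedding of the $i$-th question in the first set and $v_j$ for the $j$-th question in the second set, and assuming (this is an assumption, not something the post states) that the encoder's outputs are approximately unit length, the score can be read as a cosine similarity:

```latex
s_{ij} \;=\; \langle u_i, v_j \rangle \;=\; \sum_{k} u_{ik}\, v_{jk},
\qquad
\|u_i\| \approx \|v_j\| \approx 1
\;\Longrightarrow\;
s_{ij} \approx \cos\theta_{ij}
```

Here $\theta_{ij}$ is the angle between the two embeddings, so a score close to 1 means the vectors point in nearly the same direction.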

Now that we have loaded the model, let’s write some functions to plot the heatmap. We will also use a threshold of 0.8 to decide whether two questions are duplicates.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def plot_similarity(labels1, labels2, features1, features2, rotation):
    # rotation is kept in the signature for run_and_plot but is not used here
    # corr[i][j] is the inner product of question i (set 1) and question j (set 2)
    corr = np.inner(features1, features2)
    sns.set(font_scale=0.6)
    g = sns.heatmap(corr, vmin=0, vmax=1, cmap="YlOrRd")
    g.set_title("Semantic Textual Similarity")
    plt.tight_layout()
    plt.savefig("Quora.png")
    plt.show()
    # Pairs scoring at least 0.8 are treated as duplicates;
    # if a question has several matches, the last one found is kept
    similar_qid = {}
    for i, j in np.argwhere(corr >= 0.8):
        similar_qid[labels1[i]] = labels2[j]
    return similar_qid

def run_and_plot(session_, input_tensor_, messages1_, messages2_, labels1_, labels2_, encoding_tensor):
    # Embed both question sets, then plot and threshold the similarity matrix
    print("Embedding questions 1")
    message_embeddings1_ = session_.run(encoding_tensor, feed_dict={input_tensor_: messages1_})
    print("Embedding questions 2")
    message_embeddings2_ = session_.run(encoding_tensor, feed_dict={input_tensor_: messages2_})
    similar_qid = plot_similarity(labels1_, labels2_,
                                  message_embeddings1_,
                                  message_embeddings2_, 90)
    return similar_qid
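The thresholding step can be tried without loading the model. The following is a minimal sketch on toy two-dimensional vectors standing in for sentence embeddings; the helper names `inner` and `similar_pairs` are hypothetical and not part of the original code. Unlike the dictionary returned by `plot_similarity`, which keeps one match per question from the first set, this sketch returns every pair at or above the threshold together with its score.

```python
def inner(u, v):
    # Inner product of two equal-length vectors
    return sum(a * b for a, b in zip(u, v))

def similar_pairs(labels1, labels2, features1, features2, threshold=0.8):
    # Collect every (label1, label2, score) whose score reaches the threshold
    pairs = []
    for l1, f1 in zip(labels1, features1):
        for l2, f2 in zip(labels2, features2):
            score = inner(f1, f2)
            if score >= threshold:
                pairs.append((l1, l2, score))
    return pairs

# Toy unit-length vectors, made up for illustration only
features1 = [[1.0, 0.0], [0.0, 1.0]]
features2 = [[0.28, 0.96], [1.0, 0.0]]

pairs = similar_pairs(["a", "b"], ["x", "y"], features1, features2)
for l1, l2, score in pairs:
    print(l1, l2, score)
```

Keeping the scores alongside the IDs makes it easier to inspect borderline cases when experimenting with a threshold other than 0.8.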

5. Semantic Similarity Analysis of First Quora Dataset

Let’s run a new TensorFlow session and pass the data. We will also write the question pairs judged similar to a new file.

# Placeholder that receives a batch of question strings
similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    similar_qid = run_and_plot(session, similarity_input_placeholder,
                               list(question1.values()),
                               list(question2.values()),
                               list(question1.keys()),
                               list(question2.keys()),
                               similarity_message_encodings)

# Write each question pair judged similar to the results file
with open("similarity-results.txt", 'w') as f:
    for qid1, qid2 in similar_qid.items():
        f.write("{},{}\n=======================\n".format(question1[qid1], question2[qid2]))

6. Results: Duplicate Question Detection

Let’s look at the matrix we obtain for 2000 lines of data. We use the model to compute the similarity for all possible pairs between the Question 1 set and the Question 2 set.

Each row (y-axis) of the heatmap corresponds to a question from set Q1 and each column (x-axis) to a question from set Q2. A deeper red color indicates that the two questions are similar, and a lighter color indicates that they are not.

[Figure: Quora Similarity Analysis Result]

Let’s first look at some question pairs with a high similarity score. We can call these duplicates.

1. What is the quickest way to increase Instagram followers?
2. How can we increase our number of Instagram followers?
Similarity score : 0.9328753


1. How do I make friends.
2. How to make friends ?
Similarity score : 0.8874366


1. Is Career Launcher good for RBI Grade B preparation?
2. How is career launcher online program for RBI Grade B?
Similarity score : 0.86849654


1. What are some good rap songs to dance to?
2. What are some of the best rap songs?
Similarity score : 0.85388064

Let’s now look at question pairs that were judged least similar by the algorithm.

1. What is the quickest way to increase Instagram followers?
2. How to train my dog?
Similarity score : 0.23722759


1. How do I make friends.
2. How to eat a potato?
Similarity score : 0.25956485


1. Is Career Launcher good for RBI Grade B preparation?
2. Is Elon Musk crazy?
Similarity score : 0.2054378


1. What are some good rap songs to dance to?
2. What is the answer to life, universe, and everything?
Similarity score : 0.12436064
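Listings like the ones above can be produced by scanning each row of the similarity matrix for its largest and smallest entry. This is a minimal sketch of one way to do that, not necessarily how the pairs in this post were selected; the function name `best_and_worst`, the placeholder question labels, and the scores are all made up for illustration.

```python
def best_and_worst(questions1, questions2, scores):
    """For each question in set 1, find its highest- and lowest-scoring match in set 2."""
    report = []
    for q1, row in zip(questions1, scores):
        best = max(range(len(row)), key=lambda j: row[j])
        worst = min(range(len(row)), key=lambda j: row[j])
        report.append({
            "question": q1,
            "best": (questions2[best], row[best]),
            "worst": (questions2[worst], row[worst]),
        })
    return report

# Placeholder questions and made-up scores, for illustration only
questions1 = ["Q1-a", "Q1-b"]
questions2 = ["Q2-a", "Q2-b", "Q2-c"]
scores = [
    [0.91, 0.22, 0.35],
    [0.18, 0.40, 0.87],
]

report = best_and_worst(questions1, questions2, scores)
for entry in report:
    print(entry["question"], "most similar:", entry["best"], "least similar:", entry["worst"])
```

In the tutorial's setting, `scores` would be the matrix returned by `np.inner` and the two question lists would come from the dictionaries built during pre-processing.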

I hope you enjoyed the post. Make sure you try it out! In case of any query or suggestion, feel free to drop a comment in the comment section below and we will get back to you soon.

References

  1. Featured image from Wikipedia
  2. Quora questions dataset from Kaggle
  3. Tensorflow Hub