What is Difflib?

So let's get started with this amazing Python module, difflib

Ajinkya Mishrikotkar
3 min read · Jun 14, 2021

Introduction:

Let's say you have a column of keywords and, for every keyword, you want to find the other keywords that are similar to it. So how can we do that? One option is to compute an embedding for every keyword, calculate the cosine similarity between each pair of keywords, and then map each keyword to the ones with the highest cosine similarity score. But calculating embeddings and then cosine similarity is a computationally heavy task, and if the list is large it will take a lot of time as well. This is where an amazing Python module comes to our rescue. difflib is a standard-library module that provides classes and functions for comparing sequences. It can be used to compare strings and get additional information about how they differ.
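As a rough preview of that use case, here is a minimal sketch. The `keywords` list and the `similar_keywords` helper are made up for illustration, and the sketch assumes the keywords are unique; `get_close_matches` itself is explained in the Functions section.

```python
import difflib

# Hypothetical keyword column; in practice this might come from a table column.
keywords = ['ape', 'apple', 'peach', 'puppy', 'appel']

def similar_keywords(words, n=3, cutoff=0.6):
    """Map each word to its closest matches among the other words."""
    result = {}
    for i, word in enumerate(words):
        # Leave out the word itself (by position) so it is not its own best match.
        others = words[:i] + words[i + 1:]
        result[word] = difflib.get_close_matches(word, others, n=n, cutoff=cutoff)
    return result

similar = similar_keywords(keywords)
print(similar['appel'])  # ['apple', 'ape']
```

Note that this still compares every keyword against every other keyword, so it scores character-level similarity only and does not need any embeddings.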

Functions:

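1. difflib.SequenceMatcher : SequenceMatcher is a class in the difflib module. It is used to compare two sequences, such as strings, and get a similarity score between them. It works by finding the longest contiguous matching subsequence that contains no "junk" elements (elements you have marked as uninteresting, such as spaces or blank lines), and then repeating the same search on the pieces to the left and to the right of that match.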

Example :

import difflib
a = 'Medium'
b = 'Median'
seq = difflib.SequenceMatcher(None,a,b)
d = seq.ratio()*100
print(d)
66.66666666666666

We can see from the above block of code that when we compare ‘Medium’ and ‘Median’, we get about 66.7% similarity.

a = 'Medium'
b = 'Mediun'
seq = difflib.SequenceMatcher(None,a,b)
d = seq.ratio()*100
print(d)
83.33333333333334

And when we change string b to ‘Mediun’, the similarity ratio goes up to 83.3%. This is because in the first example two characters are different, whereas in the second example only one character is different.
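To see where these numbers come from: ratio() is defined as 2.0 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below recomputes the score by hand from get_matching_blocks(); the `similarity_percent` helper is a name made up for this illustration.

```python
import difflib

def similarity_percent(a, b):
    """Recompute SequenceMatcher.ratio() * 100 from the matching blocks."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # Each block is (a_start, b_start, size); the last one is a size-0 sentinel.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # ratio() returns 1.0 for two empty sequences, so mirror that here.
    return 200.0 * matched / total if total else 100.0

print(similarity_percent('Medium', 'Median'))  # 'Medi' matches: 2 * 4 / 12
print(similarity_percent('Medium', 'Mediun'))  # 'Mediu' matches: 2 * 5 / 12
```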

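The SequenceMatcher constructor takes 4 parameters, all of them optional: isjunk (a function that returns True for elements that should be treated as junk, or None to treat nothing as junk), a and b (the two sequences to compare), and autojunk (True by default, which enables a heuristic that automatically treats very frequent elements as junk in long sequences).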
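As a small illustration of these parameters, the snippet below passes all four by keyword and treats a single space as junk. The two strings are the ones used in the official difflib documentation; the variable names are just for this sketch.

```python
import difflib

a = "private Thread currentThread;"
b = "private volatile Thread currentThread;"

# isjunk: a space never starts a match (it can still extend one).
# autojunk is left at its default of True; it only matters for long sequences.
matcher = difflib.SequenceMatcher(
    isjunk=lambda ch: ch == " ", a=a, b=b, autojunk=True
)
print(round(matcher.ratio(), 3))  # 0.866
```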

2. difflib.get_close_matches : get_close_matches is a function that returns a list of the best matches for a specific keyword. When we pass an input string and a list of candidate strings to get_close_matches, it returns the candidates that are similar enough to the input string, sorted from most similar to least similar.

It also has the optional parameters n and cutoff. n is the maximum number of close matches to return (default 3), and cutoff is a float between 0 and 1 (default 0.6): candidates whose similarity score is below the cutoff are ignored.

Example:

difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

Here for the input word ‘appel’ we get ‘apple’ and ‘ape’ as the most similar words. Let's check another example:

import difflib
difflib.get_close_matches('when', ['what', 'whene', 'where', 'why'], n=2, cutoff=0.8)
['whene']

In this example we have added the parameters n and cutoff. Because the cutoff is 0.8, we do not get ‘where’ as a close match (its score is about 0.67), but if we lower the cutoff we will get ‘where’ as well.
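Here is the same call with two different cutoffs, to show the effect of lowering it. The variable names `candidates`, `strict` and `loose` are only for this sketch.

```python
import difflib

candidates = ['what', 'whene', 'where', 'why']

# High cutoff: only very close candidates survive.
strict = difflib.get_close_matches('when', candidates, n=2, cutoff=0.8)
# Lower cutoff: 'where' now clears the bar too.
loose = difflib.get_close_matches('when', candidates, n=2, cutoff=0.6)

print(strict)  # ['whene']
print(loose)   # ['whene', 'where']
```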

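There are other classes and functions in difflib as well, such as difflib.Differ, difflib.HtmlDiff, difflib.context_diff, difflib.ndiff, difflib.restore and difflib.unified_diff. These work on sequences of lines rather than single strings and produce human-readable diffs in different formats, so which one you pick depends on the use case.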
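To give a feel for the line-based helpers, here is a minimal sketch using ndiff, restore and unified_diff on two tiny made-up texts. The file names old.txt and new.txt are just labels for the diff header, not real files.

```python
import difflib

# Two small made-up texts, split into lines (keeping the line endings).
old = "one\ntwo\nthree\n".splitlines(keepends=True)
new = "one\ntoo\nthree\nfour\n".splitlines(keepends=True)

# ndiff marks each line with '- ', '+ ', '  ' or a '? ' hint line.
delta = list(difflib.ndiff(old, new))
print("".join(delta), end="")

# restore rebuilds either original (1) or modified (2) from an ndiff delta.
restored_old = list(difflib.restore(delta, 1))
restored_new = list(difflib.restore(delta, 2))

# unified_diff produces the compact format used by many patch tools.
unified = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))
print("".join(unified), end="")
```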

For more detailed information on any of these functions, do check out the official documentation of the difflib module : https://docs.python.org/3/library/difflib.html

Thank You!
