You have 2 free stories left this month.

3 Quick Ways To Compare Data with Python

Costas Andreou
Jun 5, 2019 · 5 min read

For anyone working in an analytical role, receiving requests to compare data will be all too familiar. Whether that is to prove the integrity of the data, the successful delivery of data or merely attempting to understand the difference between two files or data sets, the process will almost always be the same.

Sometimes, when the data is sufficiently small, putting each dataset in a worksheet and then doing a comparison between cells will just have to suffice.

If something more powerful is required, then there are many premium data comparison software out in the market to get the job done. My personal favourite? Hands down, Beyond Compare!

In this article, we will explore something in between the two solutions we just looked at. Maybe because you want to do a very quick diff, or perhaps because the datasets are simply too large for Excel to handle or for you to set up a diff. Let us explore three quick alternatives.

Photo by Taylor Vick on Unsplash

1. Check the integrity of the data

In this instance, by checking the integrity of the data, I mean that we will be reviewing the data set in its entirety, and we will either pass the test or fail it. We will not be able to determine why the check has failed, but this is a super quick way of doing it. Let us explore two alternative algorithms.

An MD5 Checksum is essentially an algorithm that will return a hexadecimal number for the contents of a file.

We can use the following code to leverage it:

import hashlib, sysfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedef md5(fname):
md5hash = hashlib.md5()
with open(fname) as handle: #opening the file one line at a time for memory considerations
for line in handle:
md5hash.update(line.encode('utf-8'))
return(md5hash.hexdigest())
print('Comparing Files:',files[0],'and',files[1])if md5(files[0]) == md5(files[1]):
print('Matched')
else:
print('Not Matched')

Returning:

The SHA1 algorithm is yet another hexadecimal algorithm that will convert our file contents into a string. It is the same algorithm that bitcoin uses for its blockchain mining process.

Using the below:

import hashlib, sysfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedef sha1(fname):
sha1hash = hashlib.sha1()
with open(fname) as handle: #opening the file one line at a time for memory considerations
for line in handle:
sha1hash.update(line.encode('utf-8'))
return(sha1hash.hexdigest())
print('Comparing Files:',files[0],'and',files[1])if sha1(files[0]) == sha1(files[1]):
print('Matched')
else:
print('Not Matched')

Giving:

2. Check the contents of the data with SQL

Photo by Joshua Sortino on Unsplash

Luckily for us, using a couple of Python libraries, we can import our files into an SQL database and use the Except Operator to highlight any differences. The only thing to note is that Except expects the data to be ordered; otherwise, it will highlight everything as a difference.

We can quickly use this method this way:

import sys, sqlite3, pandas as pdfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takeconn = sqlite3.connect(':memory:') #we are spinning an SQL db in memory
cur = conn.cursor()
chunksize = 10000
i=0
for file in files:
i = i+1
for chunk in pd.read_csv(file, chunksize=chunksize): #load the file in chunks in case its too big
chunk.columns = chunk.columns.str.replace(' ', '_') #replacing spaces with underscores for column names
chunk.to_sql(name='file' + str(i), con=conn, if_exists='append')
print('Comparing', files[0], 'to', files[1]) #Compare if all data from File[0] are present in File[1]
cur.execute( '''SELECT * FROM File1
EXCEPT
SELECT * FROM File2''')
i=0
for row in cur:
print(row)
i=i+1
if i==0: print('No Differences')
print('Comparing', files[1], 'to', files[0]) #Compare if all data from File[1] are present in File[0]
cur.execute( '''SELECT * FROM File2
EXCEPT
SELECT * FROM File1''')
i=0
for row in cur:
print(row)
i=i+1
if i==0: print('No Differences')
cur.close()

Returning:

3. Check the data using Pandas

Photo by Franki Chamaki on Unsplash

Using the famous Pandas package, we can quickly check our files once we have them in a dataframe. Let us examine a few different properties of the pandas library that will allow us to achieve this.

In the same set up as our previous examples; code:

import sys, sqlite3, pandas as pdfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedf1 = pd.read_csv(files[0])
df2 = pd.read_csv(files[1])
df3 = df1.equals(df2)print('Matches:', df3)

Outcome:

We can quickly do that in the following way:

import sys, sqlite3, pandas as pdfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedf1 = pd.read_csv(files[0])
df2 = pd.read_csv(files[1])
df3 = (df1 != df2).any(axis=None)
print('Differences in file:', df3)
df3 = (df1 != df2).any(1)
ne_stacked = (df1 != df2).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']
print('Differences In:')
print(changed)

which returns:

Execute the following script:

import sys, sqlite3, pandas as pd, numpy as npfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedf1 = pd.read_csv(files[0])
df2 = pd.read_csv(files[1])
df3 = df1.eq(df2)print(df3.all())#print(df3.all(axis=1))df4 = df3.all(axis=1)df4 = pd.DataFrame(df4, columns=['Columns'])
print(df4[df4['Columns']==False])

Resulting in:

Do you use any other ways to compare your data? Let me know!


FinanceExplained

A publication dedicated in explaining all things finance…

Costas Andreou

Written by

FinanceExplained

A publication dedicated in explaining all things finance, banking and financial technology simply.

Costas Andreou

Written by

FinanceExplained

A publication dedicated in explaining all things finance, banking and financial technology simply.

More From Medium

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store