You have 2 free stories left this month.
3 Quick Ways To Compare Data with Python
For anyone working in an analytical role, receiving requests to compare data will be all too familiar. Whether that is to prove the integrity of the data, the successful delivery of data or merely attempting to understand the difference between two files or data sets, the process will almost always be the same.
Sometimes, when the data is sufficiently small, putting each dataset in a worksheet and then doing a comparison between cells will just have to suffice.


If something more powerful is required, then there are many premium data comparison software out in the market to get the job done. My personal favourite? Hands down, Beyond Compare!
In this article, we will explore something in between the two solutions we just looked at. Maybe because you want to do a very quick diff, or perhaps because the datasets are simply too large for Excel to handle or for you to set up a diff. Let us explore three quick alternatives.
1. Check the integrity of the data
In this instance, by checking the integrity of the data, I mean that we will be reviewing the data set in its entirety, and we will either pass the test or fail it. We will not be able to determine why the check has failed, but this is a super quick way of doing it. Let us explore two alternative algorithms.
Check with an MD5 Checksum
An MD5 Checksum is essentially an algorithm that will return a hexadecimal number for the contents of a file.
We can use the following code to leverage it:
import hashlib, sysfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedef md5(fname):
md5hash = hashlib.md5()
with open(fname) as handle: #opening the file one line at a time for memory considerations
for line in handle:
md5hash.update(line.encode('utf-8'))
return(md5hash.hexdigest())print('Comparing Files:',files[0],'and',files[1])if md5(files[0]) == md5(files[1]):
print('Matched')
else:
print('Not Matched')
Returning:

Check with the SHA1 algorithm
The SHA1 algorithm is yet another hexadecimal algorithm that will convert our file contents into a string. It is the same algorithm that bitcoin uses for its blockchain mining process.
Using the below:
import hashlib, sysfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedef sha1(fname):
sha1hash = hashlib.sha1()
with open(fname) as handle: #opening the file one line at a time for memory considerations
for line in handle:
sha1hash.update(line.encode('utf-8'))
return(sha1hash.hexdigest())print('Comparing Files:',files[0],'and',files[1])if sha1(files[0]) == sha1(files[1]):
print('Matched')
else:
print('Not Matched')
Giving:

2. Check the contents of the data with SQL
Luckily for us, using a couple of Python libraries, we can import our files into an SQL database and use the Except Operator to highlight any differences. The only thing to note is that Except expects the data to be ordered; otherwise, it will highlight everything as a difference.
We can quickly use this method this way:
import sys, sqlite3, pandas as pdfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takeconn = sqlite3.connect(':memory:') #we are spinning an SQL db in memory
cur = conn.cursor()
chunksize = 10000i=0
for file in files:
i = i+1
for chunk in pd.read_csv(file, chunksize=chunksize): #load the file in chunks in case its too big
chunk.columns = chunk.columns.str.replace(' ', '_') #replacing spaces with underscores for column names
chunk.to_sql(name='file' + str(i), con=conn, if_exists='append')print('Comparing', files[0], 'to', files[1]) #Compare if all data from File[0] are present in File[1]
cur.execute( '''SELECT * FROM File1
EXCEPT
SELECT * FROM File2''')i=0
for row in cur:
print(row)
i=i+1
if i==0: print('No Differences')print('Comparing', files[1], 'to', files[0]) #Compare if all data from File[1] are present in File[0]
cur.execute( '''SELECT * FROM File2
EXCEPT
SELECT * FROM File1''')i=0
for row in cur:
print(row)
i=i+1
if i==0: print('No Differences')cur.close()
Returning:

3. Check the data using Pandas
Using the famous Pandas package, we can quickly check our files once we have them in a dataframe. Let us examine a few different properties of the pandas library that will allow us to achieve this.
Using the .equals() parameter
In the same set up as our previous examples; code:
import sys, sqlite3, pandas as pdfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedf1 = pd.read_csv(files[0])
df2 = pd.read_csv(files[1])df3 = df1.equals(df2)print('Matches:', df3)
Outcome:

Using the .any() parameter
We can quickly do that in the following way:
import sys, sqlite3, pandas as pdfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedf1 = pd.read_csv(files[0])
df2 = pd.read_csv(files[1])df3 = (df1 != df2).any(axis=None)
print('Differences in file:', df3)df3 = (df1 != df2).any(1)
ne_stacked = (df1 != df2).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']print('Differences In:')
print(changed)
which returns:

Using the .Eq() parameter
Execute the following script:
import sys, sqlite3, pandas as pd, numpy as npfiles = [sys.argv[1], sys.argv[2]] #these are the arguments we takedf1 = pd.read_csv(files[0])
df2 = pd.read_csv(files[1])df3 = df1.eq(df2)print(df3.all())#print(df3.all(axis=1))df4 = df3.all(axis=1)df4 = pd.DataFrame(df4, columns=['Columns'])
print(df4[df4['Columns']==False])
Resulting in:

Do you use any other ways to compare your data? Let me know!