Python Data Transformation

Updated at December 22, 2020

Data.gov - A directory of government data downloads
/r/datasets - A subreddit that has hundreds of interesting data sets
Awesome datasets - A list of data sets hosted on GitHub
rs.io - A great blog post with hundreds of interesting data sets

String basics	Syntax
Replace substring within a string	<string>.replace(substring, string)
Convert to title cases (capitalize every letter after every dot)	<string>.title()
Check a string for the existence of a substring	if <substring> in <string>
Split a string into a list of strings	<string>.split(separator)
Slice characters from a string by position	<string>[:5]

String functions: capitalize, count, startswith, endswith, find, format, lower, upper, lstrip, rstrip, strip, replace, split, swapcase, title, zfill;

String interpolation	Syntax
Insert values into a string in order	“{} {}".format(value, value)
Insert values into a string by position	“{0} {1}".format(value, value)
Insert values into a string by name	“{name}".format(name=“value”)
Format specification for precision of two decimal places	“{:.2f}".format(float)
Order for format specification when using precision and comma separator	“{:,.2f}".format(float)
Python 3.6 String Interpolation	f"Hello {variable}”

Dates and Times Basics	Syntax
Import module	import datetime as dt
Instantiating dt.datetime	dt.datetime(year, month, day)
Creating dt.datetime from a string	dt.datetime.strptime(“day/month/year”, “%d/%m/%Y”)
Converting dt.datetime to a string	dt_object.strftime("%d/%m/%Y”)
Instantiating a dt.time	dt.time(hour=int, minute=int, second=int, microsecond=int)
Retrieving a part of a date	dt_object.day
Retrieving a date	dt_object.date()
Instantiating a dt.timedelta	dt.timedelta(weeks=3)

Dates and Times Math	Type
datetime - datetime	timedelta
datetime - timedelta	datetime
datetime + timedelta	datetime
timedelta + timedelta	timedelta
timedelta - timedelta	timedelta

Format	Description
%d	Day of the month as a zero-padded number
%A	Day of the week as a word
%m	Month as a zero-padded number
%Y	Year as four-digit number
%y	Year as two-digit number with zero-padding
%B	Month as a word
%H	Hour in 24 hour time as zero-padded number
%p	a.m. or p.m.
%I	Hour in 12 hour time as zero-padded number
%M	Minute as a zero-padded number

JSON Basics	Syntax
Import module	import json
JSON string to Object	json.loads(‘json’)
JSON file to Object	json.load(open(‘path’))
Object to JSON string	json.dumps(obj, sort_keys=True, indent=4)
Dictionary keys	obj.keys()
Delete key	del obj[key]

List Comprehansions and Lambdas	Syntax
Ranges (integers only)	range(min, max, interval)
List comprehension	[i * 10 for i in [0,1,2,3,4,5] if i > 0]
Functions on Objects	min\|max\|sorted(obj, key=function, reverse=True) \| one argument function extracts scalar value
Lambda function	f = lambda x, y: x * y
Ternary operator	return <val> if <expression> else None

NumPy Selecting	Syntax
Import module	import numpy as np
Convert a list of lists into a ndarray	np.array(list(csv.reader(open(file, “r”))))
Selecting a row from an ndarray	ndarr[1]
Selecting multiple rows from an ndarray	ndarr[1:]
Selecting a specific item from an ndarray	ndarr[1,1]
Selecting multiple columns	ndarr[:,1:3] \| ndarr[:, [1,2]]
Selecting a 2D slice	ndarr[1:4,:3]

NumPy Boolean Indexing	Syntax
Reading in a CSV file	np.genfromtxt(’.csv’, delimiter=’,’, skip_header=1)
Creating a Boolean array from filtering criteria	np.array([2,4,6,8]) < 5
Boolean filtering for 1D ndarray	a = np.array([2,4,6,8]) \| a[a < 5]
Boolean filtering for 2D ndarray	ndarr[ndarr[:,12] > 50]
Assigning values in a 2D ndarray using indices	ndarr[1,1] = 1 \| ndarr[:,0] = 1 \| ndarr[:,7] = ndarr[:,7].mean()
Assigning values using Boolean arrays	ndarr[ndarr[:,5] == 2, 15] = 1

NumPy 1D Statistics	Syntax
Vectorized math	+ - * /
Functions	ndarray .min() .max() .mean() .sum()

NumPy Utils	Syntax
Advanced ranges	np.arange(min, max, interval)

Tidy data: each variable is a column, each observation is a row, and each type of observational unit is a table.
Imputation: The technical name for filling in a missing value with a replacement value.

Pandas Information Basics	Syntax
Import module	import pandas as pd
Reading a file into a dataframe	pd.read_csv(’.csv’, index_col=0, parse_dates=[‘col’], encoding=’’)
Reading a JSON into a dataframe	pd.read_json()
Exporting data	df.to_csv(’.csv’, index=False)
Dataframe object info	df.info(memory_usage=‘deep’)
Describing a dataframe/series object	df.describe(include=‘all’) \| s.describe()
Returning a dataframe/series data types	df.dtypes \| s.dtype()
Returning or setting column names	df.columns
Returning the dimensions of a dataframe	dt.shape
Create dataframe/series	pd.DataFrame({‘col’: []}, columns=cols) \| pd.Series([])
Series to dataframe/list	s.to_frame(‘col’) \| s.tolist()

Pandas Select Operations	Syntax
Selecting the first n rows	df.head(5)
Selecting random n rows	df.sample(5, random_state=1)
Selecting a single column	df[‘col’]
Selecting multiple columns	df[[‘col’, ‘col2’]]
Shorthand Convention for columns	df[‘col’] \| df[[‘col’, ‘col2’]]
Shorthand Convention for rows	df[‘row’:‘row3’]
Selecting rows by label	df.loc[<row_labels>, [column_labels]]
Selecting rows by index	df.iloc[<row_index>, [column_index]]

Pandas Missing Values Handling	Syntax
Unique value counts for a dataframe/series	s.unique() \| s.value_counts(dropna=False, bins=3, normalize=True)
Selecting null and non-null values	s.isnull() \| s.notnull()
Selecting null rows	df.isnull().any(axis=1)
Renaming an existing column	df.rename(columns={‘src_name’: ‘dest_name’}, inplace=True)
Dropping an existing column	df.drop(labels=[‘row’], columns=[‘col’], inplace=True)
Dropping missing values	df.dropna(axis=0, thresh=number_of_records, inplace=True)
Show duplicated rows	df.duplicated(cols)
Drop duplicated rows	df.drop_duplicates(cols)
Fill missing values	df.fillna(value) \| s.fillna(value)
Reset index column	df.reset_index(drop=True, inplace=True)
Renaming an index	df.rename_axis(None, axis=0)

Pandas Boolean Masks Operations	Syntax
In operator	df[‘col’].isin([‘val1’, ‘val2’])
Between method	df[‘col’].between(val1, val2)
Updating values using Boolean filtering	s.loc[s[‘col’] == 0, ‘col’] = np.nan
Updating values using a Mapping dict	s.map({ ‘src_name’: ‘dest_name’ })
Updating values using mask method	s.mask(bool_mask, new_values)

Pandas Sort and Convert Basics	Syntax
Sorting by index column	df.sort_index(ascending=False)
Sorting by column values	df.sort_values(ascending=False)
Converting column to datetime	pd.to_datetime(series, errors=“coerce”)
Converting column to numeric	pd.to_numeric(series, errors=“coerce”)
Converting column to float/int	s.astype(float/int)
Stack multiple columns into one	df.stack(dropna=False)

Pandas Vectorized Accessors	Syntax
Multi-dimensional numpy array	df.values
Access datetime values in series	s.dt \| s.dt.year
Replace substring	s.str.replace(’”’, ‘’)
Extracting values from strings (first word)	s.str.split().str[0]

Pandas Aggregation Methods	Syntax
Grouping	df.groupby(‘col’)
Group indexing	gr[‘col’]
Select group data	gr.get_group(‘value’)
Groups and indexes	gr.groups
Aggregations	gr.size() \| mean, sum, count, min, max
Multiple aggregations	gr.agg(functions_list)
Aggregate	df.pivot_table(index=gr_cols, columns=gr_cols, values=val_cols, aggfunc=functions, margins=True)

Pandas Transforming Data	Syntax
Apply function for each row in Series	s.map(func/dict)
Apply function - Series: for each row, DataFrame: for each column	s.apply(func, args) \| df.apply(func, args, axis=0)
Apply function to every cell in the DataFrame	df.applymap(func)
Unpivot	df.melt(id_vars=cols, value_vars=cols) \| pd.melt()
Pivot	df.pivot(index=cols, columns=cols, values=cols)
List-like to a row (Pandas 0.25)	df.explot(column, ignore_index=True)

-Pandas Combining DataFrames*

Union
- pd.concat(df_list, axis, ignore_index=True)
- df.append() # shortcut
Join
- pd.merge(left=df1, right=df2, how=‘inner’, on=‘col’, on_left=‘col1’, on_right=‘col2’, left_index = True, right_index = True, suffixes=(’_x’, ‘_y’))
- df.merge() # shortcut
- df.join() # using indexes