Python Data Transformation

Updated at
  • Data.gov - A directory of government data downloads
  • /r/datasets - A subreddit that has hundreds of interesting data sets
  • Awesome datasets - A list of data sets hosted on GitHub
  • rs.io - A great blog post with hundreds of interesting data sets
String basicsSyntax
Replace substring within a string<string>.replace(substring, string)
Convert to title cases (capitalize every letter after every dot)<string>.title()
Check a string for the existence of a substringif <substring> in <string>
Split a string into a list of strings<string>.split(separator)
Slice characters from a string by position<string>[:5]
  • String functions: capitalize, count, startswith, endswith, find, format, lower, upper, lstrip, rstrip, strip, replace, split, swapcase, title, zfill;
String interpolationSyntax
Insert values into a string in order“{} {}".format(value, value)
Insert values into a string by position“{0} {1}".format(value, value)
Insert values into a string by name“{name}".format(name=“value”)
Format specification for precision of two decimal places“{:.2f}".format(float)
Order for format specification when using precision and comma separator“{:,.2f}".format(float)
Python 3.6 String Interpolationf"Hello {variable}”
Dates and Times BasicsSyntax
Import moduleimport datetime as dt
Instantiating dt.datetimedt.datetime(year, month, day)
Creating dt.datetime from a stringdt.datetime.strptime(“day/month/year”, “%d/%m/%Y”)
Converting dt.datetime to a stringdt_object.strftime("%d/%m/%Y”)
Instantiating a dt.timedt.time(hour=int, minute=int, second=int, microsecond=int)
Retrieving a part of a datedt_object.day
Retrieving a datedt_object.date()
Instantiating a dt.timedeltadt.timedelta(weeks=3)
Dates and Times MathType
datetime - datetimetimedelta
datetime - timedeltadatetime
datetime + timedeltadatetime
timedelta + timedeltatimedelta
timedelta - timedeltatimedelta
FormatDescription
%dDay of the month as a zero-padded number
%ADay of the week as a word
%mMonth as a zero-padded number
%YYear as four-digit number
%yYear as two-digit number with zero-padding
%BMonth as a word
%HHour in 24 hour time as zero-padded number
%pa.m. or p.m.
%IHour in 12 hour time as zero-padded number
%MMinute as a zero-padded number
JSON BasicsSyntax
Import moduleimport json
JSON string to Objectjson.loads(‘json’)
JSON file to Objectjson.load(open(‘path’))
Object to JSON stringjson.dumps(obj, sort_keys=True, indent=4)
Dictionary keysobj.keys()
Delete keydel obj[key]
List Comprehansions and LambdasSyntax
Ranges (integers only)range(min, max, interval)
List comprehension[i * 10 for i in [0,1,2,3,4,5] if i > 0]
Functions on Objectsmin|max|sorted(obj, key=function, reverse=True) | one argument function extracts scalar value
Lambda functionf = lambda x, y: x * y
Ternary operatorreturn <val> if <expression> else None
NumPy SelectingSyntax
Import moduleimport numpy as np
Convert a list of lists into a ndarraynp.array(list(csv.reader(open(file, “r”))))
Selecting a row from an ndarrayndarr[1]
Selecting multiple rows from an ndarrayndarr[1:]
Selecting a specific item from an ndarrayndarr[1,1]
Selecting multiple columnsndarr[:,1:3] | ndarr[:, [1,2]]
Selecting a 2D slicendarr[1:4,:3]
NumPy Boolean IndexingSyntax
Reading in a CSV filenp.genfromtxt(’.csv’, delimiter=’,’, skip_header=1)
Creating a Boolean array from filtering criterianp.array([2,4,6,8]) < 5
Boolean filtering for 1D ndarraya = np.array([2,4,6,8]) | a[a < 5]
Boolean filtering for 2D ndarrayndarr[ndarr[:,12] > 50]
Assigning values in a 2D ndarray using indicesndarr[1,1] = 1 | ndarr[:,0] = 1 | ndarr[:,7] = ndarr[:,7].mean()
Assigning values using Boolean arraysndarr[ndarr[:,5] == 2, 15] = 1
NumPy 1D StatisticsSyntax
Vectorized math+ - * /
Functionsndarray .min() .max() .mean() .sum()
NumPy UtilsSyntax
Advanced rangesnp.arange(min, max, interval)
  • Tidy data: each variable is a column, each observation is a row, and each type of observational unit is a table.
  • Imputation: The technical name for filling in a missing value with a replacement value.
Pandas Information BasicsSyntax
Import moduleimport pandas as pd
Reading a file into a dataframepd.read_csv(’.csv’, index_col=0, parse_dates=[‘col’], encoding=’’)
Reading a JSON into a dataframepd.read_json()
Exporting datadf.to_csv(’.csv’, index=False)
Dataframe object infodf.info(memory_usage=‘deep’)
Describing a dataframe/series objectdf.describe(include=‘all’) | s.describe()
Returning a dataframe/series data typesdf.dtypes | s.dtype()
Returning or setting column namesdf.columns
Returning the dimensions of a dataframedt.shape
Create dataframe/seriespd.DataFrame({‘col’: []}, columns=cols) | pd.Series([])
Series to dataframe/lists.to_frame(‘col’) | s.tolist()
Pandas Select OperationsSyntax
Selecting the first n rowsdf.head(5)
Selecting random n rowsdf.sample(5, random_state=1)
Selecting a single columndf[‘col’]
Selecting multiple columnsdf[[‘col’, ‘col2’]]
Shorthand Convention for columnsdf[‘col’] | df[[‘col’, ‘col2’]]
Shorthand Convention for rowsdf[‘row’:‘row3’]
Selecting rows by labeldf.loc[<row_labels>, [column_labels]]
Selecting rows by indexdf.iloc[<row_index>, [column_index]]
Pandas Missing Values HandlingSyntax
Unique value counts for a dataframe/seriess.unique() | s.value_counts(dropna=False, bins=3, normalize=True)
Selecting null and non-null valuess.isnull() | s.notnull()
Selecting null rowsdf.isnull().any(axis=1)
Renaming an existing columndf.rename(columns={‘src_name’: ‘dest_name’}, inplace=True)
Dropping an existing columndf.drop(labels=[‘row’], columns=[‘col’], inplace=True)
Dropping missing valuesdf.dropna(axis=0, thresh=number_of_records, inplace=True)
Show duplicated rowsdf.duplicated(cols)
Drop duplicated rowsdf.drop_duplicates(cols)
Fill missing valuesdf.fillna(value) | s.fillna(value)
Reset index columndf.reset_index(drop=True, inplace=True)
Renaming an indexdf.rename_axis(None, axis=0)
Pandas Boolean Masks OperationsSyntax
In operatordf[‘col’].isin([‘val1’, ‘val2’])
Between methoddf[‘col’].between(val1, val2)
Updating values using Boolean filterings.loc[s[‘col’] == 0, ‘col’] = np.nan
Updating values using a Mapping dicts.map({ ‘src_name’: ‘dest_name’ })
Updating values using mask methods.mask(bool_mask, new_values)
Pandas Sort and Convert BasicsSyntax
Sorting by index columndf.sort_index(ascending=False)
Sorting by column valuesdf.sort_values(ascending=False)
Converting column to datetimepd.to_datetime(series, errors=“coerce”)
Converting column to numericpd.to_numeric(series, errors=“coerce”)
Converting column to float/ints.astype(float/int)
Stack multiple columns into onedf.stack(dropna=False)
Pandas Vectorized AccessorsSyntax
Multi-dimensional numpy arraydf.values
Access datetime values in seriess.dt | s.dt.year
Replace substrings.str.replace(’”’, ‘’)
Extracting values from strings (first word)s.str.split().str[0]
Pandas Aggregation MethodsSyntax
Groupingdf.groupby(‘col’)
Group indexinggr[‘col’]
Select group datagr.get_group(‘value’)
Groups and indexesgr.groups
Aggregationsgr.size() | mean, sum, count, min, max
Multiple aggregationsgr.agg(functions_list)
Aggregatedf.pivot_table(index=gr_cols, columns=gr_cols, values=val_cols, aggfunc=functions, margins=True)
Pandas Transforming DataSyntax
Apply function for each row in Seriess.map(func/dict)
Apply function - Series: for each row, DataFrame: for each columns.apply(func, args) | df.apply(func, args, axis=0)
Apply function to every cell in the DataFramedf.applymap(func)
Unpivotdf.melt(id_vars=cols, value_vars=cols) | pd.melt()
Pivotdf.pivot(index=cols, columns=cols, values=cols)
List-like to a row (Pandas 0.25)df.explot(column, ignore_index=True)

-Pandas Combining DataFrames*

  • Union

    • pd.concat(df_list, axis, ignore_index=True)
    • df.append() # shortcut
  • Join

    • pd.merge(left=df1, right=df2, how=‘inner’, on=‘col’, on_left=‘col1’, on_right=‘col2’, left_index = True, right_index = True, suffixes=(’_x’, ‘_y’))
    • df.merge() # shortcut
    • df.join() # using indexes