Python Data Transformation
Updated at
- Data.gov - A directory of government data downloads
- /r/datasets - A subreddit that has hundreds of interesting data sets
- Awesome datasets - A list of data sets hosted on GitHub
- rs.io - A great blog post with hundreds of interesting data sets
| String basics | Syntax |
|---|---|
| Replace substring within a string | <string>.replace(substring, string) |
| Convert to title cases (capitalize every letter after every dot) | <string>.title() |
| Check a string for the existence of a substring | if <substring> in <string> |
| Split a string into a list of strings | <string>.split(separator) |
| Slice characters from a string by position | <string>[:5] |
- String functions:
capitalize,count,startswith,endswith,find,format,lower,upper,lstrip,rstrip,strip,replace,split,swapcase,title,zfill;
| String interpolation | Syntax |
|---|---|
| Insert values into a string in order | “{} {}".format(value, value) |
| Insert values into a string by position | “{0} {1}".format(value, value) |
| Insert values into a string by name | “{name}".format(name=“value”) |
| Format specification for precision of two decimal places | “{:.2f}".format(float) |
| Order for format specification when using precision and comma separator | “{:,.2f}".format(float) |
| Python 3.6 String Interpolation | f"Hello {variable}” |
| Dates and Times Basics | Syntax |
|---|---|
| Import module | import datetime as dt |
| Instantiating dt.datetime | dt.datetime(year, month, day) |
| Creating dt.datetime from a string | dt.datetime.strptime(“day/month/year”, “%d/%m/%Y”) |
| Converting dt.datetime to a string | dt_object.strftime("%d/%m/%Y”) |
| Instantiating a dt.time | dt.time(hour=int, minute=int, second=int, microsecond=int) |
| Retrieving a part of a date | dt_object.day |
| Retrieving a date | dt_object.date() |
| Instantiating a dt.timedelta | dt.timedelta(weeks=3) |
| Dates and Times Math | Type |
|---|---|
| datetime - datetime | timedelta |
| datetime - timedelta | datetime |
| datetime + timedelta | datetime |
| timedelta + timedelta | timedelta |
| timedelta - timedelta | timedelta |
| Format | Description |
|---|---|
| %d | Day of the month as a zero-padded number |
| %A | Day of the week as a word |
| %m | Month as a zero-padded number |
| %Y | Year as four-digit number |
| %y | Year as two-digit number with zero-padding |
| %B | Month as a word |
| %H | Hour in 24 hour time as zero-padded number |
| %p | a.m. or p.m. |
| %I | Hour in 12 hour time as zero-padded number |
| %M | Minute as a zero-padded number |
| JSON Basics | Syntax |
|---|---|
| Import module | import json |
| JSON string to Object | json.loads(‘json’) |
| JSON file to Object | json.load(open(‘path’)) |
| Object to JSON string | json.dumps(obj, sort_keys=True, indent=4) |
| Dictionary keys | obj.keys() |
| Delete key | del obj[key] |
| List Comprehansions and Lambdas | Syntax |
|---|---|
| Ranges (integers only) | range(min, max, interval) |
| List comprehension | [i * 10 for i in [0,1,2,3,4,5] if i > 0] |
| Functions on Objects | min|max|sorted(obj, key=function, reverse=True) | one argument function extracts scalar value |
| Lambda function | f = lambda x, y: x * y |
| Ternary operator | return <val> if <expression> else None |
| NumPy Selecting | Syntax |
|---|---|
| Import module | import numpy as np |
| Convert a list of lists into a ndarray | np.array(list(csv.reader(open(file, “r”)))) |
| Selecting a row from an ndarray | ndarr[1] |
| Selecting multiple rows from an ndarray | ndarr[1:] |
| Selecting a specific item from an ndarray | ndarr[1,1] |
| Selecting multiple columns | ndarr[:,1:3] | ndarr[:, [1,2]] |
| Selecting a 2D slice | ndarr[1:4,:3] |
| NumPy Boolean Indexing | Syntax |
|---|---|
| Reading in a CSV file | np.genfromtxt(’.csv’, delimiter=’,’, skip_header=1) |
| Creating a Boolean array from filtering criteria | np.array([2,4,6,8]) < 5 |
| Boolean filtering for 1D ndarray | a = np.array([2,4,6,8]) | a[a < 5] |
| Boolean filtering for 2D ndarray | ndarr[ndarr[:,12] > 50] |
| Assigning values in a 2D ndarray using indices | ndarr[1,1] = 1 | ndarr[:,0] = 1 | ndarr[:,7] = ndarr[:,7].mean() |
| Assigning values using Boolean arrays | ndarr[ndarr[:,5] == 2, 15] = 1 |
| NumPy 1D Statistics | Syntax |
|---|---|
| Vectorized math | + - * / |
| Functions | ndarray .min() .max() .mean() .sum() |
| NumPy Utils | Syntax |
|---|---|
| Advanced ranges | np.arange(min, max, interval) |
- Tidy data: each variable is a column, each observation is a row, and each type of observational unit is a table.
- Imputation: The technical name for filling in a missing value with a replacement value.
| Pandas Information Basics | Syntax |
|---|---|
| Import module | import pandas as pd |
| Reading a file into a dataframe | pd.read_csv(’.csv’, index_col=0, parse_dates=[‘col’], encoding=’’) |
| Reading a JSON into a dataframe | pd.read_json() |
| Exporting data | df.to_csv(’.csv’, index=False) |
| Dataframe object info | df.info(memory_usage=‘deep’) |
| Describing a dataframe/series object | df.describe(include=‘all’) | s.describe() |
| Returning a dataframe/series data types | df.dtypes | s.dtype() |
| Returning or setting column names | df.columns |
| Returning the dimensions of a dataframe | dt.shape |
| Create dataframe/series | pd.DataFrame({‘col’: []}, columns=cols) | pd.Series([]) |
| Series to dataframe/list | s.to_frame(‘col’) | s.tolist() |
| Pandas Select Operations | Syntax |
|---|---|
| Selecting the first n rows | df.head(5) |
| Selecting random n rows | df.sample(5, random_state=1) |
| Selecting a single column | df[‘col’] |
| Selecting multiple columns | df[[‘col’, ‘col2’]] |
| Shorthand Convention for columns | df[‘col’] | df[[‘col’, ‘col2’]] |
| Shorthand Convention for rows | df[‘row’:‘row3’] |
| Selecting rows by label | df.loc[<row_labels>, [column_labels]] |
| Selecting rows by index | df.iloc[<row_index>, [column_index]] |
| Pandas Missing Values Handling | Syntax |
|---|---|
| Unique value counts for a dataframe/series | s.unique() | s.value_counts(dropna=False, bins=3, normalize=True) |
| Selecting null and non-null values | s.isnull() | s.notnull() |
| Selecting null rows | df.isnull().any(axis=1) |
| Renaming an existing column | df.rename(columns={‘src_name’: ‘dest_name’}, inplace=True) |
| Dropping an existing column | df.drop(labels=[‘row’], columns=[‘col’], inplace=True) |
| Dropping missing values | df.dropna(axis=0, thresh=number_of_records, inplace=True) |
| Show duplicated rows | df.duplicated(cols) |
| Drop duplicated rows | df.drop_duplicates(cols) |
| Fill missing values | df.fillna(value) | s.fillna(value) |
| Reset index column | df.reset_index(drop=True, inplace=True) |
| Renaming an index | df.rename_axis(None, axis=0) |
| Pandas Boolean Masks Operations | Syntax |
|---|---|
| In operator | df[‘col’].isin([‘val1’, ‘val2’]) |
| Between method | df[‘col’].between(val1, val2) |
| Updating values using Boolean filtering | s.loc[s[‘col’] == 0, ‘col’] = np.nan |
| Updating values using a Mapping dict | s.map({ ‘src_name’: ‘dest_name’ }) |
| Updating values using mask method | s.mask(bool_mask, new_values) |
| Pandas Sort and Convert Basics | Syntax |
|---|---|
| Sorting by index column | df.sort_index(ascending=False) |
| Sorting by column values | df.sort_values(ascending=False) |
| Converting column to datetime | pd.to_datetime(series, errors=“coerce”) |
| Converting column to numeric | pd.to_numeric(series, errors=“coerce”) |
| Converting column to float/int | s.astype(float/int) |
| Stack multiple columns into one | df.stack(dropna=False) |
| Pandas Vectorized Accessors | Syntax |
|---|---|
| Multi-dimensional numpy array | df.values |
| Access datetime values in series | s.dt | s.dt.year |
| Replace substring | s.str.replace(’”’, ‘’) |
| Extracting values from strings (first word) | s.str.split().str[0] |
| Pandas Aggregation Methods | Syntax |
|---|---|
| Grouping | df.groupby(‘col’) |
| Group indexing | gr[‘col’] |
| Select group data | gr.get_group(‘value’) |
| Groups and indexes | gr.groups |
| Aggregations | gr.size() | mean, sum, count, min, max |
| Multiple aggregations | gr.agg(functions_list) |
| Aggregate | df.pivot_table(index=gr_cols, columns=gr_cols, values=val_cols, aggfunc=functions, margins=True) |
| Pandas Transforming Data | Syntax |
|---|---|
| Apply function for each row in Series | s.map(func/dict) |
| Apply function - Series: for each row, DataFrame: for each column | s.apply(func, args) | df.apply(func, args, axis=0) |
| Apply function to every cell in the DataFrame | df.applymap(func) |
| Unpivot | df.melt(id_vars=cols, value_vars=cols) | pd.melt() |
| Pivot | df.pivot(index=cols, columns=cols, values=cols) |
| List-like to a row (Pandas 0.25) | df.explot(column, ignore_index=True) |
-Pandas Combining DataFrames*
Union
- pd.concat(df_list, axis, ignore_index=True)
- df.append() # shortcut
Join
- pd.merge(left=df1, right=df2, how=‘inner’, on=‘col’, on_left=‘col1’, on_right=‘col2’, left_index = True, right_index = True, suffixes=(’_x’, ‘_y’))
- df.merge() # shortcut
- df.join() # using indexes