Beginner Tips for Exploratory Data Analysis with Pandas

This article is written as part of Gaogao’s Advent Calendar series. This is my first year participating, and I wanted to write about an area of tech that I’m interested in and currently working on. Here I present eight tips for common issues that beginners should find useful.

Data can come from everywhere – web scraping, online surveys, customer records, data analytics platforms – and messy data is inevitable. As a beginner starting your first few projects, messy data is a scary prospect. Hopefully the tips here can help you resolve common issues while you overcome the initial panic of seeing over a million rows of haphazard, missing, strange data. I assume you are using a notebook of some kind, like Jupyter Notebook, Google Colab, or Kaggle, and that you are coding in Python with the Pandas library.

Optional Parameters

A lot of built-in methods come with useful options that users, both experienced and beginner, sometimes forget about. Sometimes the documentation doesn’t explain them in the best way, so you may have to Google some examples or experiment yourself to understand what each parameter does. Here are just a few examples.

#1 Tip: For reading and exporting,
i) remember to use the 'encoding' parameter if non-Latin characters are present, e.g. Japanese/Chinese;
ii) if you do not want the index written out when exporting, use 'index=False'.
Note that read_csv/read_excel are functions on pandas itself (pd.read_csv), not DataFrame methods, and they have no 'index' parameter – 'index=False' belongs to to_csv/to_excel.
Ex – pd.read_csv(<filepath>, encoding="utf-8") (similarly pd.read_excel)
Ex – df.to_csv("CleanData.csv", encoding="utf-8", index=False) (similarly df.to_excel)
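A minimal round-trip sketch of the two calls above (the data and the file name "CleanData.csv" are made up for illustration):

```python
import pandas as pd

# Illustrative data with non-Latin (Japanese) characters.
df = pd.DataFrame({"name": ["田中", "佐藤"], "score": [90, 85]})

# Export: index=False skips the index column; utf-8 preserves the Japanese text.
df.to_csv("CleanData.csv", encoding="utf-8", index=False)

# Read: read_csv also takes 'encoding', but has no 'index' parameter.
# Without index=False on export, a stray 'Unnamed: 0' column would appear here.
df2 = pd.read_csv("CleanData.csv", encoding="utf-8")
```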

#2 Tip: Check the axis. Some methods require you to specify 'axis=1' to operate on columns (axis='columns' is the same but longer to type). Otherwise they will operate on rows!
Ex – df.sort_values(by=1, axis=1)
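A small sketch of the axis difference, using made-up data. Note that sort_values with axis=1 reorders the columns by the values in the row labelled 1, matching the example above:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 5], "b": [2, 4]})

# axis=0 (the default) targets row labels; axis=1 / axis='columns' targets columns.
no_row = df.drop(0, axis=0)     # drops the row labelled 0
no_col = df.drop("b", axis=1)   # drops the column 'b'

# Reorder the columns by the values in the row labelled 1 (a=5, b=4 -> b first):
by_row1 = df.sort_values(by=1, axis=1)
```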

#3 Tip: Remember to use 'inplace=True' to shorten your code (note the capital T – 'true' is not valid Python). Be aware that recent pandas releases discourage inplace, so reassigning the result also works.
Ex – df.drop(<col_name>, axis=1, inplace=True)
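The two equivalent styles side by side, with made-up column names; note that an inplace call returns None, so never assign its result:

```python
import pandas as pd

# Without inplace you must reassign the returned copy:
df = pd.DataFrame({"keep": [1, 2], "drop_me": [3, 4]})
df = df.drop("drop_me", axis=1)

# With inplace=True the DataFrame is modified directly and the call returns None:
df2 = pd.DataFrame({"keep": [1, 2], "drop_me": [3, 4]})
result = df2.drop("drop_me", axis=1, inplace=True)
```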

Column Operations

If you can find a ready-made method like .groupby() or .str.split(), use it. Ready-made methods are faster and cleaner. Otherwise, the next best option I would suggest is .apply() with a lambda function. This is highly flexible and quite readable. I use this A LOT.

#4 Tip: If you want to test a function before applying it, have it return the original value unchanged. This ensures the data is not accidentally modified while you inspect the output.

(Screenshot: return the original in all cases when checking output)
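A sketch of this testing pattern, with an illustrative column of price strings: the helper prints each cell for inspection but always returns it unchanged, so the preview is identical to the original column.

```python
import pandas as pd

df = pd.DataFrame({"price": ["100円", "250円"]})  # made-up data

def check(x):
    print(repr(x))  # inspect what each cell actually looks like
    return x        # return the original in all cases

# The result is byte-for-byte the same column, so nothing can be corrupted.
preview = df["price"].apply(check)
```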

#5 Tip: To keep your code readable, I suggest defining and applying the function in the same code cell where possible.

(Screenshot: lambda functions applied to DataFrame columns in the same code block)
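A sketch of the pattern with a made-up column: the helper is defined and applied in one cell, and the equivalent lambda one-liner is shown alongside it.

```python
import pandas as pd

df = pd.DataFrame({"price": ["100円", "250円"]})  # illustrative data

# Define the helper and apply it in the same cell, so readers see both together.
def strip_yen(value):
    return int(value.replace("円", ""))

df["price_int"] = df["price"].apply(strip_yen)

# The same transformation as a one-line lambda:
df["price_lambda"] = df["price"].apply(lambda v: int(v.replace("円", "")))
```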

Python Types

After using TypeScript and other languages that support type checking, Python can feel like a step backwards. The numbers you assumed were 'int's (integers) turn out to be 'str's (strings), or there is a 'NaN' in one of the rows that makes a method throw an error… It’s frustrating. Unfortunately, there is no easy way around this problem other than to check and convert all the data points in your columns to the type you want.

#6 Tip: Use .dtypes (an attribute, not a method – no parentheses) to confirm all types are consistent.

(Screenshot: an ‘object’ dtype could be all strings or a combination of types)
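A sketch showing how an 'object' dtype can hide mixed types (the column is made up). Mapping each cell through type() reveals what is really stored:

```python
import pandas as pd

# A column whose dtype reads 'object' but actually mixes str and int.
df = pd.DataFrame({"amount": ["10", 20, "30"]})

print(df.dtypes)  # 'amount' shows as object

# Count the underlying Python types hidden behind the 'object' dtype:
print(df["amount"].map(type).value_counts())
```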

#7 Tip: Get rid of 'NaN's in your columns. Use .isna().sum() to count them per column, then use either .fillna() or .dropna().
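The three calls in one short sketch, with made-up data containing a single NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [1.0, np.nan, 3.0]})

print(df.isna().sum())   # NaN count per column

filled = df.fillna(0)    # replace NaNs with a sensible default...
dropped = df.dropna()    # ...or drop the rows that contain them
```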

#8 Tip: Convert all of a column’s data to a single type.
Ex – df.loc[:, 'col_name'] = df.loc[:, 'col_name'].astype(float)
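A sketch of the conversion with made-up data. One caveat worth knowing: .astype(float) raises if any value cannot be parsed, so pd.to_numeric with errors="coerce" is a forgiving alternative that turns bad values into NaN instead:

```python
import pandas as pd

df = pd.DataFrame({"col_name": ["1.5", "2", "3.25"]})  # numbers stored as strings

# Convert the whole column to a single type:
df["col_name"] = df["col_name"].astype(float)

# If some values might not parse, coerce them to NaN rather than raising:
messy = pd.Series(["1", "oops", "3"])
numeric = pd.to_numeric(messy, errors="coerce")
```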

Thank you for reading, and I hope you traverse your messy data with confidence!