Deal with Rare Categories Using Pandas

We illustrate how to deal with rare categories in pandas: build a boolean mask that marks the rare values, then replace them with a single placeholder category.

import pandas as pd

#############
# Create fake names
frequent_names = list('ABC')
rare_names = list('DEF')

dataset = sum(
    [[i]*10 for i in frequent_names] + [[i]*2 for i in rare_names],
    []
)

# Create a series based on the names
series = pd.Series(dataset)

print(series)

# Find the counts of the names in the series
series_counts = series.value_counts()
print(series_counts)


# Find names that occur fewer than 10 times
# and create a boolean mask that marks them
mask = series.isin(series_counts.loc[series_counts<10].index)
print(mask)


# Set these rare names to X
series[mask] = 'X'

# Check the new series
print(series.value_counts())

The original series has the following value counts:

C    10
A    10
B    10
F     2
D     2
E     2

After the replacement, the series has the following value counts:

C    10
A    10
B    10
X     6
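
The assignment above modifies the series in place. As a minimal sketch of a non-mutating variant, the same idea can be written with `Series.map` and `Series.mask`, which return a new series and leave the original untouched. The helper name `lump_rare` and its parameters `min_count` and `other` are made up for this illustration and are not part of pandas.

```python
import pandas as pd


def lump_rare(series, min_count=10, other="X"):
    """Return a copy of `series` in which values that occur fewer than
    `min_count` times are replaced by `other`.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Map every element to the number of times its value occurs.
    counts = series.map(series.value_counts())
    # Series.mask replaces the entries where the condition is True.
    return series.mask(counts < min_count, other)


# Same fake data as above: A, B, C appear 10 times; D, E, F appear twice.
names = pd.Series(list("ABC") * 10 + list("DEF") * 2)

lumped = lump_rare(names)
print(lumped.value_counts())
```

Because the threshold and the placeholder are parameters, the same function can be reused on other columns, for example `lump_rare(df["city"], min_count=50, other="Other")` for a hypothetical DataFrame `df`.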


LM (2021). 'Deal with Rare Categories Using Pandas', Datumorphism, 03 April. Available at: https://datumorphism.leima.is/til/data/deal-with-rare-categories-using-pandas/.