R: Data Cleaning & Preprocessing

Preparing Your Data for Analysis in R

Before you fit a statistical model or build a visualization, your data must be clean, consistent, and well-structured. Raw data is usually messy: it contains missing values, wrong data types, duplicate rows, and inconsistent formats.

In this tutorial, you will learn how to:

• Inspect raw data
• Handle missing values (NA)
• Convert data types
• Rename columns
• Remove duplicates
• Create new variables
• Filter and select data
• Standardize numeric variables
• Save the cleaned data

1. Inspect Your Data

Always start by understanding your dataset:

str(data)
summary(data)
head(data)

These functions show:
• Structure and data types
• Summary statistics
• First rows of the data
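
As a minimal sketch, the toy data frame below shows what these checks can reveal. The column names and values are invented for illustration and do not come from a real dataset; dim() and sapply() are two further base R checks that are often useful alongside the three above.

```r
# Hypothetical toy data; names and values are invented for illustration.
data <- data.frame(
  id     = c(1, 2, 3, 4),
  age    = c(25, NA, 41, 33),
  income = c("52000", "61000", "n/a", "47000"),
  stringsAsFactors = FALSE
)

str(data)      # income is listed as chr, a hint that it needs conversion
summary(data)  # the age column reports one NA
head(data, 2)  # first two rows only

n_dims      <- dim(data)            # number of rows and columns
col_classes <- sapply(data, class)  # one class per column
```

Here the character type of income is the signal to look for: a column that should be numeric but is not usually contains stray text such as "n/a".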

2. Handling Missing Values (NA)

Count missing values per column:

colSums(is.na(data))

Remove every row that contains at least one NA (this can discard a lot of data when many columns have gaps):

data_clean <- na.omit(data)

Replace NA in a numeric column with the column mean:

data$age[is.na(data$age)] <- mean(data$age, na.rm = TRUE)
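
The mean is pulled toward extreme values, so the median is sometimes a safer fill-in. The sketch below is one possible variation under that assumption, using an invented toy data frame; the age_was_missing flag is a hypothetical helper column, not part of the original workflow.

```r
# Hypothetical toy data with one missing age and one missing city.
data <- data.frame(
  age  = c(22, 35, NA, 90),
  city = c("Jakarta", NA, "Bandung", "Jakarta"),
  stringsAsFactors = FALSE
)

# Share of missing values per column.
na_share <- colMeans(is.na(data))

# Keep a flag so imputed values can be identified later.
data$age_was_missing <- is.na(data$age)

# Median imputation: less sensitive to extreme values than the mean.
data$age[is.na(data$age)] <- median(data$age, na.rm = TRUE)

# Drop rows only where one specific column is missing, not any column.
data_city_known <- data[!is.na(data$city), ]
```

Dropping rows by a single column keeps more data than na.omit(), which removes a row as soon as any column is missing.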

3. Convert Data Types

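Sometimes numbers are read as characters and dates as plain text. as.numeric() turns values it cannot parse into NA (with a warning), and as.Date() expects ISO-style dates such as 2024-01-31 unless you pass a format argument.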

data$income <- as.numeric(data$income)
data$date   <- as.Date(data$date)

Check:

class(data$income)
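
Real exports often contain thousands separators, stray spaces, or placeholder text. The following sketch shows one way to clean such values before converting; the raw strings and the day/month/year date format are assumptions chosen for illustration.

```r
# Hypothetical raw values as they might arrive from a CSV export.
income_raw <- c("52,000", " 61000", "n/a", "47000.5")
date_raw   <- c("31/01/2024", "15/02/2024")

# Remove thousands separators and surrounding spaces before converting.
income_txt <- trimws(gsub(",", "", income_raw))

# Entries that are still not numeric become NA (warning silenced here).
income <- suppressWarnings(as.numeric(income_txt))

# Count values that failed to convert so they are not lost silently.
n_failed <- sum(is.na(income) & !is.na(income_raw))

# Non-ISO dates need an explicit format string.
dates <- as.Date(date_raw, format = "%d/%m/%Y")
```

If a column was read as a factor, as.numeric() returns the internal level codes rather than the printed values, so convert with as.numeric(as.character(x)) in that case.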

4. Rename Columns

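List the current column names, then rename by position:

names(data)
names(data)[1] <- "id"
names(data)[2] <- "age"

Or using dplyr, where the syntax is new_name = old_name (this assumes a column called id_user exists):

library(dplyr)
data <- rename(data, user_id = id_user)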
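
Renaming by position breaks if the column order changes. A base R sketch that renames by matching the old name instead is shown here; the Age.Years column is a hypothetical example added to illustrate cleaning all names in one pass.

```r
# Hypothetical data frame; Age.Years is invented for illustration.
data <- data.frame(id_user = 1:3, Age.Years = c(25, 31, 47))

# Rename by matching the old name, so column order does not matter.
names(data)[names(data) == "id_user"] <- "user_id"

# Tidy all names at once: lower case, dots replaced by underscores.
names(data) <- gsub(".", "_", tolower(names(data)), fixed = TRUE)
```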

5. Remove Duplicates

distinct() from dplyr keeps one copy of each fully identical row:

data_unique <- distinct(data)
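
It can help to count duplicates before removing them, and to decide whether "duplicate" means an identical row or a repeated key. The sketch below uses base R's duplicated() on an invented toy data frame to show both cases.

```r
# Hypothetical toy data: row 3 repeats row 2, and id 3 appears twice
# with different cities.
data <- data.frame(
  id   = c(1, 2, 2, 3, 3),
  city = c("Jakarta", "Bandung", "Bandung", "Medan", "Surabaya"),
  stringsAsFactors = FALSE
)

# Count fully identical rows before removing them.
n_dup_rows <- sum(duplicated(data))

# Base R equivalent of removing exact duplicate rows.
data_unique <- data[!duplicated(data), ]

# Keep only the first row per id, even if other columns differ.
data_one_per_id <- data[!duplicated(data$id), ]
```

With dplyr, the per-key version can be written as distinct(data, id, .keep_all = TRUE).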

6. Create New Variables

Example: create a two-level age group:

data$age_group <- ifelse(data$age < 30, "Young", "Adult")

Or use dplyr's mutate() to add several variables at once (log() requires income to be positive):

data <- mutate(data,
               income_k = income / 1000,
               log_income = log(income))
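
For more than two groups, cut() is usually easier to read than nested ifelse() calls. The sketch below is one way to do it; the cutoffs 30 and 65 and the "Senior" label are assumptions for illustration, and log1p() is swapped in for log() so that an income of zero stays finite.

```r
# Hypothetical toy data, including a missing age and a zero income.
data <- data.frame(age    = c(17, 29, 30, 64, 65, NA),
                   income = c(0, 1500, 32000, 54000, 21000, 40000))

# Three age bands; right = FALSE makes each interval [lower, upper).
data$age_group <- cut(data$age,
                      breaks = c(0, 30, 65, Inf),
                      labels = c("Young", "Adult", "Senior"),
                      right  = FALSE)

# log(0) is -Inf; log1p(x) computes log(1 + x) and is 0 at x = 0.
data$log_income <- log1p(data$income)
```

A missing age produces a missing age group, which matches the behavior of ifelse().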

7. Filter & Select Data

library(dplyr)

# Select columns
data2 <- select(data, age, income, city)

# Filter rows
data3 <- filter(data, age > 30 & city == "Jakarta")
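
Missing values deserve attention when filtering: a comparison with NA is neither TRUE nor FALSE. The base R sketch below, on an invented toy data frame, shows two ways to get the same rows as the dplyr call without NA rows slipping in.

```r
# Hypothetical toy data with one missing age.
data <- data.frame(
  age    = c(25, 34, 45, NA),
  income = c(30000, 52000, 61000, 48000),
  city   = c("Jakarta", "Jakarta", "Bandung", "Jakarta"),
  stringsAsFactors = FALSE
)

# subset() drops rows where the condition is NA, like dplyr::filter().
data3 <- subset(data, age > 30 & city == "Jakarta", select = c(age, income))

# A logical index containing NA would return an all-NA row;
# which() keeps only the positions that are TRUE.
keep      <- which(data$age > 30 & data$city == "Jakarta")
data3_alt <- data[keep, c("age", "income")]
```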

8. Standardize / Normalize Data

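scale() centers a numeric variable to mean 0 and rescales it to standard deviation 1 (a z-score). It returns a one-column matrix, so convert the result to a plain vector:

data$income_scaled <- as.numeric(scale(data$income))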
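
To make the rescaling concrete, this sketch computes the z-score by hand and compares it with scale(), then shows min-max normalization as an alternative that maps values to the range 0 to 1. The income values are hypothetical.

```r
income <- c(30000, 52000, 61000, 48000)  # hypothetical values

# Z-score by hand: subtract the mean, divide by the sample standard deviation.
z_manual <- (income - mean(income)) / sd(income)

# scale() returns a one-column matrix; as.numeric() gives a plain vector.
z_scale <- as.numeric(scale(income))

# Min-max normalization rescales values to the range [0, 1].
income_minmax <- (income - min(income)) / (max(income) - min(income))
```

Z-scores suit methods that assume roughly centered inputs, while min-max scaling is a common choice when a bounded range is needed.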

9. Save Clean Data

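Save the data frame that holds your final cleaned result. If you kept working on data after creating data_clean in step 2, pass data instead:

write.csv(data_clean, "data_clean.csv", row.names = FALSE)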
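
A quick round-trip check can confirm the file was written as expected. The sketch below writes an invented toy data frame to a temporary folder (an assumption made so the example does not touch your working directory) and reads it back.

```r
data_clean <- data.frame(id = 1:3, age = c(25, NA, 41))  # hypothetical data

out_file <- file.path(tempdir(), "data_clean.csv")  # temp path for the demo

# na = "" writes missing values as empty cells instead of the text "NA".
write.csv(data_clean, out_file, row.names = FALSE, na = "")

# Read the file back and confirm that shape and missing values survived.
check      <- read.csv(out_file)
same_shape <- identical(dim(check), dim(data_clean))

# saveRDS() keeps R-specific types (factors, Dates) that CSV stores as text.
rds_file <- file.path(tempdir(), "data_clean.rds")
saveRDS(data_clean, rds_file)
```

CSV is the better choice for sharing with other tools; an .rds file is convenient when the data will only be read back into R with readRDS().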

Conclusion

Data cleaning is one of the most important steps in analysis, because every later result depends on the quality of the data it starts from.

Clean data → Better models → Correct insights.