Preparing Your Data for Analysis in R
Before you run any statistical model or visualization, your data must be clean, consistent, and well-structured. Raw data is usually messy: it contains missing values, wrong data types, duplicates, and inconsistent formats.
In this tutorial, you will learn how to:
- Inspect raw data
- Handle missing values (NA)
- Convert data types
- Rename columns
- Remove duplicates
- Create new variables
- Filter and select data
- Standardize numeric columns
- Save the cleaned data
1. Inspect Your Data
Always start by understanding your dataset:
str(data)
summary(data)
head(data)
These functions show:
- str(): structure and data types
- summary(): summary statistics for each column
- head(): the first rows of the data (six by default)
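If you want to try these commands without your own file, a small made-up data frame can stand in for data. The sketch below is only an illustration: every value and column name in it is invented, and it is deliberately messy (a missing age, income stored as text, one repeated row).

```r
# Hypothetical toy data set; every value is invented for illustration.
data <- data.frame(
  id_user = c(1, 2, 2, 3, 4),
  age     = c(25, 41, 41, NA, 33),
  income  = c("52000", "61000", "61000", "48000", NA),
  city    = c("Jakarta", "Bandung", "Bandung", "Jakarta", "Surabaya"),
  date    = c("2024-01-15", "2024-02-01", "2024-02-01", "2024-03-10", "2024-03-22"),
  stringsAsFactors = FALSE
)

str(data)      # column types: income and date are still character
summary(data)  # per-column summaries; numeric columns also report NA counts
head(data, 3)  # first three rows only
```

Here str() already reveals two problems to fix later: income and date are character columns rather than numeric and Date.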
2. Handling Missing Values (NA)
Check missing values:
colSums(is.na(data))
Remove rows with NA:
data_clean <- na.omit(data)
Replace NA with mean (numeric):
data$age[is.na(data$age)] <- mean(data$age, na.rm = TRUE)
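The two strategies above behave quite differently, which is easier to see on a tiny example. This is a minimal sketch using an invented data frame named df: na.omit() drops every row that has an NA in any column, while mean imputation keeps all rows and fills only the chosen column.

```r
# Invented example data: one NA in each column.
df <- data.frame(
  age    = c(25, NA, 41, 33),
  income = c(52, 61, NA, 48)
)

# Count missing values per column.
na_counts <- colSums(is.na(df))

# Strategy 1: drop any row that contains at least one NA.
dropped <- na.omit(df)

# Strategy 2: keep all rows and fill missing ages with the mean age.
imputed <- df
imputed$age[is.na(imputed$age)] <- mean(imputed$age, na.rm = TRUE)

nrow(dropped)  # 2 rows survive
nrow(imputed)  # all 4 rows are kept; income still has its NA
```

Dropping rows is simple but can discard a lot of data when NAs are spread across many columns, so it is worth checking nrow() before and after.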
3. Convert Data Types
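Sometimes numbers are read as characters, and dates arrive as plain text.
data$income <- as.numeric(data$income)  # text that is not a valid number becomes NA, with a warning
data$date <- as.Date(data$date)         # assumes ISO dates such as "2024-01-31"; pass format = otherwise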
Check:
class(data$income)
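Real columns often need a little cleanup before conversion works. The sketch below assumes two situations that the text above does not cover: income values with thousands separators, and dates written as day/month/year. The data frame raw and its contents are invented.

```r
# Invented raw values: a comma separator, a non-numeric entry, and d/m/Y dates.
raw <- data.frame(
  income = c("52000", "61,000", "n/a"),
  date   = c("15/01/2024", "01/02/2024", "10/03/2024"),
  stringsAsFactors = FALSE
)

# Remove commas first; anything still non-numeric ("n/a") becomes NA.
# suppressWarnings() hides the coercion warning that as.numeric() would raise.
income_num <- suppressWarnings(as.numeric(gsub(",", "", raw$income)))

# Non-ISO dates need an explicit format string.
date_parsed <- as.Date(raw$date, format = "%d/%m/%Y")

class(income_num)   # "numeric"
class(date_parsed)  # "Date"
```

After converting, it is worth re-running colSums(is.na(...)) to see how many values failed to parse.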
4. Rename Columns
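Check the current names, then rename by position:
names(data)               # list the current column names
names(data)[1] <- "id"    # rename the first column
names(data)[2] <- "age"   # rename the second column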
Or using dplyr:
library(dplyr)
data <- rename(data, user_id = id_user)  # new_name = old_name
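Renaming by position breaks silently if the column order ever changes. As a minimal base-R sketch, you can match on the old name instead; the data frame df and the column name years are invented for this example.

```r
# Invented example data.
df <- data.frame(id_user = 1:3, years = c(25, 41, 33))

# Rename by matching the old name, so column order does not matter.
names(df)[names(df) == "id_user"] <- "user_id"
names(df)[names(df) == "years"]   <- "age"

names(df)  # "user_id" "age"
```

If the old name is not present, nothing is renamed and no error is raised, so a quick look at names(df) afterwards is a sensible check.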
5. Remove Duplicates
distinct() from dplyr keeps one copy of each fully identical row:
data_unique <- distinct(data)
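Before removing duplicates it helps to know how many there are. The following base-R sketch uses duplicated() on an invented data frame df; it also shows de-duplicating on a single key column, which is an assumption about what you might need rather than something the step above requires.

```r
# Invented example data: row 3 repeats row 2 exactly; id 3 appears twice
# with different cities.
df <- data.frame(
  id   = c(1, 2, 2, 3, 3),
  city = c("Jakarta", "Bandung", "Bandung", "Jakarta", "Surabaya"),
  stringsAsFactors = FALSE
)

# Count rows that are exact copies of an earlier row.
n_dup <- sum(duplicated(df))

# Keep the first occurrence of each fully identical row.
df_unique <- df[!duplicated(df), ]

# Keep the first row for each id, ignoring the other columns.
df_by_id <- df[!duplicated(df$id), ]

n_dup            # 1
nrow(df_unique)  # 4
nrow(df_by_id)   # 3
```

De-duplicating on a key silently discards the later rows for that key, so inspect them first if they might hold the more recent record.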
6. Create New Variables
Example: create age group
data$age_group <- ifelse(data$age < 30, "Young", "Adult")
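You can also add several columns at once with dplyr's mutate():
data <- mutate(data,
  income_k = income / 1000,
  log_income = log(income))  # log() gives -Inf for 0 and NaN (with a warning) for negative values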
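ifelse() works well for two groups. For three or more, one option is base R's cut(). The sketch below is only an illustration: the third group "Senior" and the cutoff at 60 are invented, while the cutoff at 30 follows the example above.

```r
# Invented ages, including one missing value.
age <- c(22, 30, 45, 67, NA)

# right = FALSE makes intervals closed on the left: [0, 30), [30, 60), [60, Inf),
# so age < 30 is "Young", matching the ifelse() rule above.
age_group <- cut(
  age,
  breaks = c(0, 30, 60, Inf),
  labels = c("Young", "Adult", "Senior"),
  right  = FALSE
)

as.character(age_group)  # "Young" "Adult" "Adult" "Senior" NA
```

cut() returns a factor and leaves missing ages as NA, which is usually what you want for grouped summaries.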
7. Filter & Select Data
library(dplyr)
# Select columns
data2 <- select(data, age, income, city)
# Filter rows
data3 <- filter(data, age > 30 & city == "Jakarta")
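If dplyr is not available, base R's subset() can do both steps in one call. This is a sketch on an invented data frame df, not a replacement for the dplyr code above.

```r
# Invented example data; row 3 has a missing age.
df <- data.frame(
  id     = 1:4,
  age    = c(25, 41, NA, 52),
  income = c(52000, 61000, 48000, 75000),
  city   = c("Jakarta", "Jakarta", "Jakarta", "Bandung"),
  stringsAsFactors = FALSE
)

# Filter rows and select columns together.
# Rows where the condition evaluates to NA (row 3) are dropped.
result <- subset(
  df,
  age > 30 & city == "Jakarta",
  select = c(age, income, city)
)

result  # one row: age 41, income 61000, city "Jakarta"
```

Note that rows with a missing age are excluded from the result, so handle NAs first if those rows matter.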
8. Standardize / Normalize Data
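scale() centers a column to mean 0 and rescales it to standard deviation 1 (z-scores). It returns a one-column matrix, so wrap it in as.numeric() to store a plain numeric vector:
data$income_scaled <- as.numeric(scale(data$income))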
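To see what scale() computes, you can reproduce it by hand. The sketch below uses an invented income vector and checks that the manual z-score matches the result of scale().

```r
# Invented income values.
income <- c(48000, 52000, 61000, 75000)

# Manual z-score: subtract the mean, divide by the sample standard deviation.
z_manual <- (income - mean(income)) / sd(income)

# Same calculation with scale(); as.numeric() drops the matrix wrapper.
z_scale <- as.numeric(scale(income))

all.equal(z_manual, z_scale)  # TRUE
mean(z_scale)                 # approximately 0
sd(z_scale)                   # 1
```

If the column contains NAs, scale() ignores them when computing the mean and standard deviation and leaves them as NA in the output.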
9. Save Clean Data
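Write the cleaned data frame to a CSV file. Here that is data_clean from step 2; if you applied the later steps to data, save that object instead.
write.csv(data_clean, "data_clean.csv", row.names = FALSE)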
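A quick way to confirm the file was written correctly is to read it back and compare. This sketch writes an invented data frame df to a temporary file, so it does not touch your working directory.

```r
# Invented example data.
df <- data.frame(
  id   = 1:3,
  city = c("Jakarta", "Bandung", "Surabaya"),
  stringsAsFactors = FALSE
)

# Write to a temporary file, then read it back.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)

identical(names(back), names(df))  # TRUE: no extra row-name column
nrow(back) == nrow(df)             # TRUE
```

Without row.names = FALSE, the file would gain an extra first column of row numbers, which then shows up as an unwanted column when the file is read again.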
Conclusion
Data cleaning is one of the most important steps in analysis, because every later result depends on it.
Clean data → Better models → Correct insights.
