Data Structure and Exploratory Data Analysis (EDA) in R
Understanding data structures is akin to understanding the layers of the soil. Each structure serves a purpose and provides a foundation for advanced analysis. Paired with exploratory data analysis, these structures let soil scientists begin to unearth the stories hidden within their data. This post introduces the fundamental data structures in R and the basics of EDA.
R's Fundamental Data Structures
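As a quick orientation, here is a minimal sketch of the structures you will meet most often in base R: vectors, factors, matrices, lists, and data frames. The object names and values (ph_values, texture, site, "Plot A") are made up purely for illustration.

```r
# Atomic vector: an ordered set of values of a single type
ph_values <- c(6.5, 6.8, 7.2)

# Factor: categorical data with a fixed set of levels (hypothetical textures)
texture <- factor(c("loam", "clay", "loam"))

# Matrix: two-dimensional, all elements share one type
m <- matrix(1:6, nrow = 2)

# List: elements may differ in type and length (hypothetical site record)
site <- list(name = "Plot A", depths = c(10, 20, 30), irrigated = TRUE)

# Data frame: a list of equal-length columns, one row per observation
samples <- data.frame(pH = ph_values, Texture = texture)

class(ph_values)  # "numeric"
levels(texture)   # "clay" "loam" (levels are sorted alphabetically)
dim(m)            # 2 3
str(site)
str(samples)
```

Most of the EDA steps below work on a data frame, which is why it is worth knowing that each of its columns is itself a vector or a factor.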
Dipping Toes into Exploratory Data Analysis (EDA)
EDA is the initial step in your data analysis process. Here, we'll look at the soil data to understand its structure, extract summary statistics, and visualize patterns.
A. Understanding Data Structure: Use str() to get a quick overview of your data's structure. For a data frame, head() and tail() show the top and bottom parts of the data, respectively. Let's assume you have a dataset of soil samples taken at various depths, with measurements for pH, organic matter, and moisture content (hypothetical data).
# Hypothetical data
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)
# str, head and tail
str(soil_data)
head(soil_data)
tail(soil_data)
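Alongside str(), head(), and tail(), a few more base R calls can round out this first look. The sketch below rebuilds the same hypothetical data frame so it runs on its own, then checks its dimensions, column types, and missing values.

```r
# Same hypothetical soil data as above, repeated so this snippet runs alone
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

dim(soil_data)             # 5 rows, 4 columns
sapply(soil_data, class)   # storage class of each column
colSums(is.na(soil_data))  # number of missing values per column
```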
B. Summary Statistics: summary() provides basic statistics for each column in your data frame, such as mean, median, and quartiles. For a deeper dive, packages like Hmisc offer the describe() function for extended summaries.
summary(soil_data)
# Extended summary using the Hmisc package (installed only if it is missing)
if (!requireNamespace("Hmisc", quietly = TRUE)) install.packages("Hmisc")
library(Hmisc)
describe(soil_data)
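summary() does not report spread measures such as the standard deviation. If you want those without an extra package, one possible approach is to apply a small function to every column with sapply(). The choice of statistics here (mean, standard deviation, coefficient of variation) is just an example, and col_stats is an illustrative name.

```r
# Same hypothetical soil data as above, repeated so this snippet runs alone
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

# One column of results per variable, one row per statistic
col_stats <- sapply(soil_data, function(x) {
  c(mean = mean(x), sd = sd(x), cv_percent = 100 * sd(x) / mean(x))
})

round(col_stats, 2)
```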
C. Visualization: Begin with basic visualizations to understand data distributions:
Histogram: hist(soil_data$column_name) for the frequency distribution of a variable.
Scatter Plot: plot(x, y) to explore the relationship between two variables.
For more advanced visuals, the ggplot2 package is a powerful tool (I will make a separate post for this). For instance, you can visualize soil parameters across different depths or locations.
#histogram
hist(soil_data$pH, main="Histogram for Soil pH", xlab="pH", col="green", border="black")
#scatterplot
plot(soil_data$OrganicMatter, soil_data$MoistureContent, main="Organic Matter vs. Moisture Content", xlab="Organic Matter (%)", ylab="Moisture Content (%)", pch=19, col="blue")
#boxplot
boxplot(soil_data$OrganicMatter, main="Boxplot for Organic Matter", ylab="Organic Matter (%)", col="yellow")
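To look at a soil parameter across depths without leaving base R, one option is a simple depth profile: the measured value on the x-axis and depth on a reversed y-axis, so the shallowest sample sits at the top. This is only a sketch using the hypothetical data; depth_axis is an illustrative name, and the depth unit is left off the label because the example data does not specify one.

```r
# Same hypothetical soil data as above, repeated so this snippet runs alone
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

# Reversed limits put the shallowest depth at the top of the plot
depth_axis <- rev(range(soil_data$Depth))

# Depth profile of pH: points joined by lines (type = "b")
plot(soil_data$pH, soil_data$Depth,
     type = "b", pch = 19, col = "brown",
     ylim = depth_axis,
     main = "Soil pH by Depth", xlab = "pH", ylab = "Depth")
```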
D. Looking for Outliers: Boxplots, available through boxplot(data$column_name), visually represent the spread of the data and can help spot outliers. Statistical methods, like the IQR (interquartile range) method, identify outliers numerically by flagging values that lie more than 1.5 times the IQR below the first quartile or above the third quartile.
# First and third quartiles of organic matter
Q1 <- quantile(soil_data$OrganicMatter, 0.25)
Q3 <- quantile(soil_data$OrganicMatter, 0.75)
# Named iqr_value so it does not mask R's built-in IQR() function
iqr_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * iqr_value
upper_bound <- Q3 + 1.5 * iqr_value
# Keep only the values that fall outside the two bounds
om <- soil_data$OrganicMatter
outliers <- om[om < lower_bound | om > upper_bound]
print(outliers)
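If you want to run the same check on several columns, the IQR rule can be wrapped in a small helper. The function below, iqr_outliers(), is a hypothetical name and not part of base R; it is a sketch that assumes the usual 1.5 multiplier and ignores missing values.

```r
# Same hypothetical soil data as above, repeated so this snippet runs alone
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

# Hypothetical helper: returns the values of x outside the IQR fences
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  lower <- q[1] - k * spread
  upper <- q[2] + k * spread
  x[!is.na(x) & (x < lower | x > upper)]
}

# Apply the check to every column of the data frame
lapply(soil_data, iqr_outliers)

# Adding one made-up extreme value shows the helper flagging it
iqr_outliers(c(soil_data$OrganicMatter, 12))
```

With only five samples per column, treat any flagged value as a prompt to recheck the measurement rather than as grounds for dropping it.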
E. Correlation Analysis: Check how two variables in your soil data relate using the cor() function, which returns Pearson's correlation coefficient by default (a value between -1 and 1). This can help in understanding how, say, soil moisture might relate to soil compaction or organic matter content. You can find more details on correlation in this post: https://www.garudax.id/pulse/correlation-plots-inr-dr-saurav-das/?trackingId=kMhYlXx5RsaJZzrh7V9%2Bbw%3D%3D
cor(soil_data$pH, soil_data$OrganicMatter)
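cor() also accepts a whole data frame of numeric columns and returns a correlation matrix, and cor.test() adds a significance test for a single pair. The sketch below uses the same hypothetical data; cor_matrix is an illustrative name, and with only five samples the results are for demonstration only.

```r
# Same hypothetical soil data as above, repeated so this snippet runs alone
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

# Pairwise Pearson correlations between all numeric columns
cor_matrix <- cor(soil_data)
round(cor_matrix, 2)

# Rank-based alternative that is less sensitive to extreme values
cor(soil_data$pH, soil_data$OrganicMatter, method = "spearman")

# Pearson correlation with a p-value and confidence interval
cor.test(soil_data$pH, soil_data$OrganicMatter)
```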
Conclusion
Much like understanding the intricacies of the soil, delving deep into R's data structures and performing EDA sets the stage for more advanced analyses.