Data Structure and Exploratory Data Analysis (EDA) in R

Dr. Saurav Das

Published Sep 18, 2023

Understanding data structures is akin to understanding the layers of the soil. Each structure serves a purpose and provides a foundation for advanced analysis. Paired with exploratory data analysis, soil scientists can begin to unearth the stories hidden within their data. This post introduces the fundamental data structures in R and the basics of EDA.

R's Fundamental Data Structures

Vectors: The simplest data structure in R. A one-dimensional array that holds elements of the same type (numeric, character, or logical). Created using the c() function, e.g., soil_depth <- c(10, 20, 30, 40).
Matrices: A two-dimensional array with rows and columns, holding elements of the same type. Created using the matrix() function. Ideal for datasets where all elements are of the same type.
Data Frames: A table-like structure where columns can contain different types of variables (numeric, character, etc.). Created using the data.frame() function. The most common structure for storing and analyzing datasets in R, such as soil sample data.
Lists: A collection of objects, which can be of different types or even other lists. Created using the list() function. Useful for storing related sets of data of varying structures.
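
As a minimal sketch of the four structures described above, the snippet below builds one of each. The Texture labels and the object names samples, site, and mixed are hypothetical and only serve the illustration.

```r
# A numeric vector: every element has the same type
soil_depth <- c(10, 20, 30, 40)

# Mixing types in c() coerces everything to one type (here, character)
mixed <- c(10, "twenty", TRUE)

# A 2 x 2 numeric matrix, filled column by column (the default)
depth_matrix <- matrix(soil_depth, nrow = 2, ncol = 2)

# A data frame: columns may differ in type
samples <- data.frame(
  Depth = soil_depth,
  Texture = c("loam", "loam", "clay", "clay"),  # hypothetical labels
  stringsAsFactors = FALSE
)

# A list: components may be of any type, including other structures
site <- list(name = "Plot A", depths = soil_depth, samples = samples)

class(soil_depth)       # "numeric"
class(mixed)            # "character"
dim(depth_matrix)       # 2 2
sapply(samples, class)  # class of each column
names(site)             # names of the list components
```

Individual list components can be reached with $ or [[ ]], for example site$depths or site[["samples"]].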

Dipping Toes into Exploratory Data Analysis (EDA)

EDA is the initial step in your data analysis process. Here, we'll look at the soil data to understand its structure, extract summary statistics, and visualize patterns.

A. Understanding Data Structure: Use str() to get a quick overview of your data's structure. For a data frame, head() and tail() show the top and bottom parts of the data, respectively. Let's assume you have a dataset of soil samples taken at various depths, with measurements for pH, organic matter, and moisture content (hypothetical data).

#hypothetical data
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

#str, head and tail
str(soil_data)
head(soil_data)
tail(soil_data)
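
Alongside str(), head(), and tail(), a few other base R calls can help confirm the shape of a data frame. The sketch below repeats the hypothetical soil_data so it runs on its own; the check for missing values is an extra step not covered above.

```r
# Hypothetical data, repeated so this snippet is self-contained
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

dim(soil_data)              # rows and columns: 5 4
names(soil_data)            # column names
sapply(soil_data, class)    # type of each column
head(soil_data, n = 3)      # only the first three rows
colSums(is.na(soil_data))   # count of missing values per column
```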

B. Summary Statistics: summary() provides basic statistics for each column in your data frame, such as mean, median, and quartiles. For a deeper dive, packages like Hmisc offer the describe() function for extended summaries.

summary(soil_data)

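#extended summary using the Hmisc package
#install Hmisc only if it is not already available
if (!requireNamespace("Hmisc", quietly = TRUE)) install.packages("Hmisc")
library(Hmisc)
describe(soil_data)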
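
If a compact table is preferred over the printed output of summary(), one possible approach is to assemble the statistics with sapply() from base R. This is a sketch only; stats_table is a hypothetical name, and the standard deviation is an addition that summary() does not report.

```r
# Hypothetical data, repeated so this snippet is self-contained
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

# One row per variable, one column per statistic
stats_table <- data.frame(
  mean   = sapply(soil_data, mean),
  median = sapply(soil_data, median),
  sd     = sapply(soil_data, sd),
  min    = sapply(soil_data, min),
  max    = sapply(soil_data, max)
)

round(stats_table, 2)
```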

C. Visualization: Begin with basic visualizations to understand data distributions:
Histogram: hist(soil_data$column_name) for the frequency distribution of a variable.
Scatter Plot: plot(x, y) to explore the relationship between two variables.
For more advanced visuals, the ggplot2 package is a powerful tool (I will make a separate post for this). For instance, you can visualize soil parameters across different depths or locations.

#histogram
hist(soil_data$pH, main="Histogram for Soil pH", xlab="pH", col="green", border="black")

#scatterplot
plot(soil_data$OrganicMatter, soil_data$MoistureContent, main="Organic Matter vs. Moisture Content", xlab="Organic Matter (%)", ylab="Moisture Content (%)", pch=19, col="blue")

#boxplot
boxplot(soil_data$OrganicMatter, main="Boxplot for Organic Matter", ylab="Organic Matter (%)", col="yellow")
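
To view a soil parameter across depths with base R alone, one option is a profile-style plot with depth on a reversed y-axis so that the shallowest sample sits at the top. This is a sketch under the same hypothetical data; depth_limits is an illustrative name, and the depth axis is left without units because the example data does not state them.

```r
# Hypothetical data, repeated so this snippet is self-contained
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

# Reversed limits put the shallowest depth at the top of the plot
depth_limits <- rev(range(soil_data$Depth))

# Points joined by lines (type = "b") to show the trend down the profile
plot(soil_data$OrganicMatter, soil_data$Depth,
     type = "b", pch = 19, col = "brown",
     ylim = depth_limits,
     main = "Organic Matter by Depth",
     xlab = "Organic Matter (%)", ylab = "Depth")
```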

D. Looking for Outliers: Boxplots, available through boxplot(data$column_name), visually represent the spread of the data and can help spot outliers. Statistical methods, like the IQR (Interquartile Range) method, can help identify outliers numerically: a value is flagged if it lies more than 1.5 times the IQR below the first quartile (Q1) or above the third quartile (Q3).

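#quartiles and interquartile range
Q1 <- quantile(soil_data$OrganicMatter, 0.25)
Q3 <- quantile(soil_data$OrganicMatter, 0.75)
iqr_value <- Q3 - Q1  #named iqr_value so it does not mask the built-in IQR() function
lower_bound <- Q1 - 1.5 * iqr_value
upper_bound <- Q3 + 1.5 * iqr_value

#values outside the bounds are flagged as outliers
outliers <- subset(soil_data$OrganicMatter, soil_data$OrganicMatter < lower_bound | soil_data$OrganicMatter > upper_bound)
print(outliers)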
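
With the hypothetical data above, no value falls outside the bounds, so the result is empty. To see the method flag something, the sketch below wraps the same rule in a small helper and applies it to a copy of the organic matter values with one deliberately extreme reading appended. The helper name iqr_outliers and the added value 12.0 are assumptions made for illustration only.

```r
# Hypothetical helper: return the values of x outside the 1.5 * IQR fences
iqr_outliers <- function(x) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  spread <- IQR(x)                 # built-in interquartile range, q3 - q1
  lower <- q1 - 1.5 * spread
  upper <- q3 + 1.5 * spread
  x[x < lower | x > upper]
}

organic_matter <- c(5.2, 4.8, 4.0, 3.5, 3.2)

# Original values: nothing is flagged
iqr_outliers(organic_matter)

# Same values plus one made-up extreme reading: only that reading is flagged
with_extreme <- c(organic_matter, 12.0)
iqr_outliers(with_extreme)
```

boxplot.stats(with_extreme)$out is a related base R check. It uses hinges rather than quantile() quartiles, so its fences can differ slightly on small samples.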

E. Correlation Analysis: Check how two variables in your soil data relate using the cor() function, which by default returns the Pearson correlation coefficient, a value between -1 and 1. This can help in understanding how, say, soil moisture might relate to soil compaction or organic matter content. You can find more details on correlation in this post: https://www.garudax.id/pulse/correlation-plots-inr-dr-saurav-das/?trackingId=kMhYlXx5RsaJZzrh7V9%2Bbw%3D%3D

cor(soil_data$pH, soil_data$OrganicMatter)
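
Beyond a single pair, cor() also accepts a whole numeric data frame and returns a correlation matrix, and cor.test() adds a significance test. The sketch below uses the same hypothetical data; with only five samples, any p-value here is illustrative rather than meaningful.

```r
# Hypothetical data, repeated so this snippet is self-contained
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- cor(soil_data)
round(cor_matrix, 2)

# Rank-based (Spearman) correlation, less sensitive to outliers
spearman_ph_om <- cor(soil_data$pH, soil_data$OrganicMatter, method = "spearman")
spearman_ph_om

# Pearson correlation with a significance test
test_result <- cor.test(soil_data$Depth, soil_data$OrganicMatter)
test_result$estimate  # correlation coefficient
test_result$p.value   # p-value for the null hypothesis of zero correlation
```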

Conclusion

Much like understanding the intricacies of the soil, delving deep into R's data structures and performing EDA sets the stage for more advanced analyses.

R for Soil Science
