Data Structure and Exploratory Data Analysis (EDA) in R

Understanding data structures is akin to understanding the layers of the soil. Each structure serves a purpose and provides a foundation for advanced analysis. Paired with exploratory data analysis, these structures let soil scientists begin to unearth the stories hidden within their data. This post introduces the fundamental data structures in R and the basics of EDA.

R's Fundamental Data Structures

  1. Vectors: The simplest data structure in R. A one-dimensional array that holds elements of the same type (numeric, character, or logical). Created using the c() function, e.g., soil_depth <- c(10, 20, 30, 40).
  2. Matrices: A two-dimensional array with rows and columns, holding elements of the same type. Created using the matrix() function. Ideal for datasets where all elements are of the same type.
  3. Data Frames: A table-like structure where columns can contain different types of variables (numeric, character, etc.). Created using the data.frame() function. The most common structure for storing and analyzing datasets in R, such as soil sample data.
  4. Lists: A collection of objects, which can be of different types or even other lists. Created using the list() function. Useful for storing related sets of data of varying structures.
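
The four structures above can be sketched side by side. This is a minimal illustration, not part of the original dataset: the names ph_matrix, soil_samples, soil_study, and mixed, and the Texture values, are hypothetical.

```r
# Vector: one-dimensional, all elements share one type
soil_depth <- c(10, 20, 30, 40)

# Matrix: two-dimensional, still a single type; filled column by column by default
ph_matrix <- matrix(c(6.5, 6.8, 7.2, 7.0), nrow = 2, ncol = 2)

# Data frame: columns may differ in type (hypothetical values)
soil_samples <- data.frame(
  Depth = soil_depth,
  Texture = c("loam", "loam", "clay", "clay"),
  pH = c(6.5, 6.8, 7.2, 7.0)
)

# List: can hold objects of different structures, including the ones above
soil_study <- list(depths = soil_depth, ph = ph_matrix, samples = soil_samples)

class(soil_depth)    # "numeric"
dim(ph_matrix)       # 2 2
str(soil_samples)
names(soil_study)    # "depths" "ph" "samples"

# Mixing types in a vector silently coerces every element to one type
mixed <- c(10, "loam")
class(mixed)         # "character"
```

The last two lines show why a data frame or list is the better choice when types differ: a vector (or matrix) converts everything to a single type.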

Dipping Toes into Exploratory Data Analysis (EDA)

EDA is the initial step in your data analysis process. Here, we'll look at the soil data to understand its structure, extract summary statistics, and visualize patterns.

A. Understanding Data Structure: Use str() to get a quick overview of your data's structure: the number of observations, the variables, and the type of each variable. For a data frame, head() and tail() show the first and last rows, respectively (six rows each by default). Let's assume you have a dataset of soil samples taken at various depths, with measurements for pH, organic matter, and moisture content (hypothetical data).

#hypothetical data
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

#str, head and tail
str(soil_data)
head(soil_data)
tail(soil_data)
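
Alongside str(), a few other base R calls give a quick picture of a data frame's shape and completeness. The sketch below repeats the hypothetical soil_data so it runs on its own; checking for missing values is an addition here, not a step from the text above.

```r
# Same hypothetical data as above, repeated so this block is self-contained
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

dim(soil_data)                          # number of rows, number of columns
names(soil_data)                        # column names
column_types <- sapply(soil_data, class)    # type of each column
missing_counts <- colSums(is.na(soil_data)) # missing values per column

column_types
missing_counts
```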
        

B. Summary Statistics: summary() provides basic statistics for each column in your data frame, such as mean, median, and quartiles. For a deeper dive, packages like Hmisc offer the describe() function for extended summaries.

summary(soil_data)

#extended summary using the Hmisc package

# install.packages("Hmisc")  # run once if the package is not installed yet
library(Hmisc)
describe(soil_data)
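
If you would rather not install an extra package, individual statistics can also be computed per column with base R. A minimal sketch, again on the hypothetical soil_data; the names col_means, col_sds, and stats_table are illustrative.

```r
# Same hypothetical data as above, repeated so this block is self-contained
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

# Apply a statistic to every column; sapply() returns a named numeric vector
col_means <- sapply(soil_data, mean)
col_sds <- sapply(soil_data, sd)

# Stack the results into one small table, rounded for readability
stats_table <- round(rbind(mean = col_means, sd = col_sds), 2)
stats_table
```

Note that sd() is not part of the summary() output, so this table complements it rather than repeating it.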

C. Visualization: Begin with basic visualizations to understand data distributions:

  • Histogram: hist(soil_data$column_name) for the frequency distribution of a variable.
  • Scatter Plot: plot(x, y) to explore the relationship between two variables.

For more advanced visuals, the ggplot2 package is a powerful tool (I will make a separate post for this). For instance, you can visualize soil parameters across different depths or locations.

#histogram
hist(soil_data$pH, main="Histogram for Soil pH", xlab="pH", col="green", border="black")

#scatterplot
plot(soil_data$OrganicMatter, soil_data$MoistureContent, main="Organic Matter vs. Moisture Content", xlab="Organic Matter (%)", ylab="Moisture Content (%)", pch=19, col="blue")

#boxplot
boxplot(soil_data$OrganicMatter, main="Boxplot for Organic Matter", ylab="Organic Matter (%)", col="yellow")        
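
Visualizing parameters across depths can also be done in base R before moving to ggplot2. The following is one possible sketch on the hypothetical soil_data: it draws one panel per measured variable against Depth. The three-panel layout and the loop variable names are choices made for this illustration.

```r
# Same hypothetical data as above, repeated so this block is self-contained
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

measured <- c("pH", "OrganicMatter", "MoistureContent")

# Arrange three plots side by side and remember the previous settings
old_par <- par(mfrow = c(1, 3))

# One panel per variable: points joined by lines, plotted against depth
for (variable in measured) {
  plot(soil_data$Depth, soil_data[[variable]], type = "b", pch = 19,
       main = variable, xlab = "Depth", ylab = variable)
}

# Restore the previous plotting layout
par(old_par)
```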

D. Looking for Outliers: Boxplots, available through boxplot(data$column_name), visually represent the spread of the data and can help spot outliers. Statistical methods, like the IQR (Interquartile Range) method, can help identify outliers numerically.

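Q1 <- quantile(soil_data$OrganicMatter, 0.25)
Q3 <- quantile(soil_data$OrganicMatter, 0.75)
# named iqr_value so it does not mask R's built-in IQR() function
iqr_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * iqr_value
upper_bound <- Q3 + 1.5 * iqr_value

# values outside the bounds are flagged; an empty result means no outliers
outliers <- subset(soil_data$OrganicMatter, soil_data$OrganicMatter < lower_bound | soil_data$OrganicMatter > upper_bound)
print(outliers)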
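
The same IQR rule can be wrapped in a small helper so it is easy to reuse on other variables. This is a sketch under stated assumptions: find_iqr_outliers and organic_matter_check are hypothetical names, and the value 12.0 is an invented reading added only so that the rule has something to flag.

```r
# Return the values of x that fall outside the k * IQR fences
find_iqr_outliers <- function(x, k = 1.5) {
  quartiles <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- quartiles[2] - quartiles[1]
  lower <- quartiles[1] - k * spread
  upper <- quartiles[2] + k * spread
  # Drop missing values, then keep only points beyond either fence
  x[!is.na(x) & (x < lower | x > upper)]
}

# Hypothetical organic matter readings with one implausibly high value
organic_matter_check <- c(3.2, 3.5, 4.0, 4.8, 5.2, 12.0)
flagged <- find_iqr_outliers(organic_matter_check)
flagged   # 12
```

A flagged value is a prompt to look at the sample more closely, not automatic grounds for deleting it.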

E. Correlation Analysis: Check how two variables in your soil data relate using the cor() function, which returns the Pearson correlation coefficient by default (a value between -1 and 1). This can help in understanding how, say, soil moisture might relate to soil compaction or organic matter content. You can find more details on correlation in this post: https://www.garudax.id/pulse/correlation-plots-inr-dr-saurav-das/?trackingId=kMhYlXx5RsaJZzrh7V9%2Bbw%3D%3D

cor(soil_data$pH, soil_data$OrganicMatter)        
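
When every column is numeric, cor() also accepts the whole data frame and returns a correlation matrix for all pairs at once. A short sketch on the hypothetical soil_data; the Spearman call is shown as an optional rank-based alternative, and the names cor_matrix and spearman_ph_om are illustrative. With only five samples, treat any coefficient as a rough indication.

```r
# Same hypothetical data as above, repeated so this block is self-contained
soil_data <- data.frame(
  Depth = c(10, 20, 30, 40, 50),
  pH = c(6.5, 6.8, 7.2, 7.0, 6.9),
  OrganicMatter = c(5.2, 4.8, 4.0, 3.5, 3.2),
  MoistureContent = c(25, 27, 28, 26, 24)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- cor(soil_data)
round(cor_matrix, 2)

# Rank-based (Spearman) correlation for a single pair of variables
spearman_ph_om <- cor(soil_data$pH, soil_data$OrganicMatter, method = "spearman")
spearman_ph_om
```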

Conclusion

Much like understanding the intricacies of the soil, delving deep into R's data structures and performing EDA sets the stage for more advanced analyses.


