Avoid being misled by the data visualization method
[scRNA-seq analysis, R] When I plotted a histogram of one gene's read counts across cells, the zero-expression cells dominated the whole graph, like this:
Naturally, I wanted to see the distribution of the cells that actually express the gene, so I simply added + ylim() to cap the y-axis:
At first glance this looked nice: two peaks showed up that had been hidden beneath the big zero-expression bar. But wait! Where are the zero-expression cells? Shouldn't the bar just be chopped off, leaving a stump? After re-examining the output, I found that ggplot had issued this warning:
"Warning message: Removed 4 rows containing missing values (geom_bar). "
Fishy! Then I found a better option: adding + coord_cartesian(ylim = c(, )) instead of + ylim(). I replotted the same data:
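Yes! The zero-expression cells are back and the gap between 0.4 and 0.5 is filled. Both the missing bar and the gap seem to come from the "Removed 4 rows" when using + ylim(): ylim() sets the limits of the y scale, so any bar taller than the upper limit is treated as missing and dropped, whereas coord_cartesian() only zooms the visible window and leaves the data untouched.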
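To check this without relying on the eye, here is a minimal sketch on simulated data (not my original counts; the expr data frame, the limit of 30 and the mixture proportions are all made up for illustration). It builds both versions of the plot and uses layer_data() to look at the bar heights ggplot2 is about to draw:

```r
library(ggplot2)

set.seed(1)  # simulated data, NOT the original scRNA-seq counts
# Hypothetical zero-inflated expression: most cells are 0, a minority express the gene
expr <- data.frame(value = c(rep(0, 900), rnorm(100, mean = 1, sd = 0.2)))

base <- ggplot(expr, aes(value)) + geom_histogram(bins = 30)

p_ylim  <- base + ylim(0, 30)                       # changes the y scale limits
p_coord <- base + coord_cartesian(ylim = c(0, 30))  # changes only the visible window

# layer_data() returns the per-bar values of the first layer
bars_ylim  <- layer_data(p_ylim)
bars_coord <- layer_data(p_coord)

# Bars whose height fell outside the scale limits show up as NA
sum(is.na(bars_ylim$ymax))   # at least the tall zero-expression bar
sum(is.na(bars_coord$ymax))  # nothing is censored here
```

If the first number is greater than zero, the ylim() version has bars that will be removed at drawing time, which is what the warning was trying to say.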
Lesson learned: when zooming in on a ggplot, watch out for data being dropped automatically (or just remember to use + coord_cartesian(), which avoids being misled most of the time).
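One more option, which I did not use above and offer only as a sketch: if you do want to keep limits on the scale itself, the oob argument of scale_y_continuous() controls what happens to out-of-range values. Passing scales::squish clamps them to the nearest limit instead of turning them into missing values, so tall bars are drawn as stumps at the upper limit. Note that the stumps no longer show the real counts, so this is a display trick, not a replacement for coord_cartesian().

```r
library(ggplot2)

# Same zoom range as ylim(0, 5), but out-of-range bars are clamped, not dropped
p_squish <- ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(bins = 30) +
  scale_y_continuous(limits = c(0, 5), oob = scales::squish) +
  ggtitle("+ scale_y_continuous(oob = squish)")

bars_squish <- layer_data(p_squish)

sum(is.na(bars_squish$ymax))  # no bars are turned into missing values
max(bars_squish$ymax)         # tallest bars sit at the upper limit
```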
Lastly, here is a toy example for a side-by-side comparison (zooming in to counts of 0-5 in two ways):
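library(ggplot2)
# Three views of the same histogram: unzoomed, zoomed with ylim(), zoomed with coord_cartesian()
gridExtra::grid.arrange(grobs = list(
  ggplot(iris, aes(Sepal.Length)) + geom_histogram() + ggtitle("original data"),
  # ylim() sets the scale limits: bars taller than 5 are dropped with a warning
  ggplot(iris, aes(Sepal.Length)) + geom_histogram() + ylim(0, 5) + ggtitle("+ ylim"),
  # coord_cartesian() only zooms the view: tall bars are clipped, not removed
  ggplot(iris, aes(Sepal.Length)) + geom_histogram() + coord_cartesian(ylim = c(0, 5)) + ggtitle("+ coord_cartesian")), ncol = 3)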
References: