Hidden in Plain Sight
While building the data product at mySidewalk, we’ve spent a lot of time thinking about how best to make communities easier to understand. Our approach has been to build a product that saves you time when analyzing a project area, reveals insights that may not have been apparent, and helps you communicate better with all relevant stakeholders.
As my colleague Brian Parr discussed in his excellent blog post a few weeks ago:
A common task while working with spatial data is to summarize data for custom boundaries at different levels of geography.
For urban planners, architects, and city managers, the ability to look at data reported and summarized for custom boundaries offers unprecedented access to information. It enables them to make better decisions about some of the most significant challenges facing urban areas, such as traffic, infrastructure, and pollution.
One of the ways to reveal insight about your community is data disaggregation, which gives you the ability to gain more meaningful information from data you already have.
Disaggregation is the breakdown of observations, usually within a common branch of a hierarchy, to a more detailed level or subgroup.
Often, statistics are aggregated and reported only as descriptive summaries, presenting just a high-level description.
For example, knowing that 28% of the trash in the city of Stamford, CT is recycled is potentially useful information. However, also knowing that in certain neighborhoods within Stamford, upwards of 60% of the trash is recycled, while in others it’s 5–10%, is probably a lot more beneficial. A breakdown of trash recycling at the neighborhood level is much more useful for finding the areas of the city where efforts to encourage recycling should be targeted. This breakdown, or disaggregation, of existing data reveals significantly more valuable insights into your community.
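To make the recycling example concrete, here is a minimal sketch in Python. The neighborhood names and tonnage figures are invented purely for illustration (they are not real Stamford data); they were chosen so that the citywide rate works out to 28% while the neighborhood rates range from 8% to 60%.

```python
# Hypothetical tonnage figures, invented for illustration only.
collections = {
    "Neighborhood A": {"collected_tons": 1000, "recycled_tons": 600},
    "Neighborhood B": {"collected_tons": 2000, "recycled_tons": 160},
    "Neighborhood C": {"collected_tons": 2000, "recycled_tons": 640},
}


def citywide_rate(data):
    """Aggregate view: total recycled divided by total collected."""
    total_collected = sum(row["collected_tons"] for row in data.values())
    total_recycled = sum(row["recycled_tons"] for row in data.values())
    return total_recycled / total_collected


def rates_by_neighborhood(data):
    """Disaggregated view: one recycling rate per neighborhood."""
    return {
        name: row["recycled_tons"] / row["collected_tons"]
        for name, row in data.items()
    }


if __name__ == "__main__":
    print(f"Citywide: {citywide_rate(collections):.0%}")
    for name, rate in rates_by_neighborhood(collections).items():
        print(f"{name}: {rate:.0%}")
```

The single citywide number hides the fact that one neighborhood recycles most of its trash while another recycles almost none, which is exactly the information you would need to target an outreach campaign.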
Hidden in plain sight
In many cases, data disaggregation is not only a problem-solving process, but also a problem-finding process. It can help you ask better questions and confirm or deny biases you may hold. Especially when it comes to data about people, it’s always more powerful to look at how specific subgroups vary against the overall aggregate statistics, and it can help you target your efforts better. For example, if you’re looking to measure the effectiveness of a particular policy in a community, breaking down the outcomes by demographics might help you get a better sense of what your policy’s effectiveness is across the board.
In the world of geography and geostatistics, this is sometimes achieved through apportionment.
Apportionment is the process and method of allocating data and statistics from one geographic unit to another in situations where the boundaries of the source data and destination data do not align.
Apportionment allows us to tabulate statistical profiles for custom or irregular geographies that do not neatly conform to the boundaries for which the statistics were originally reported.
It allows you to view relevant statistics summarized to your interest area. And depending on how the original statistics were reported, it can be done with a relatively low margin of error and reasonable accuracy and precision. Often, statistical and machine-learning-based spatial interpolation methods are used to estimate values from existing information when the data is not available at the finest granularity.
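One simple way to apportion a count is areal weighting: each source unit contributes to the custom boundary in proportion to the share of its area that falls inside that boundary. The sketch below illustrates the idea and is not a description of how mySidewalk does it. It assumes the count is spread evenly within each source unit, and it uses axis-aligned rectangles in place of real polygons to keep the geometry trivial. The `Rect` type, the tract coordinates, and the counts are all hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle standing in for a real polygon (assumption)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles, or 0.0 if they are disjoint."""
    width = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    height = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def apportion(source_units, target: Rect) -> float:
    """Estimate the count inside `target` by areal weighting.

    `source_units` is a list of (Rect, count) pairs with non-zero area.
    Assumes the count is uniformly distributed within each source unit.
    """
    total = 0.0
    for rect, count in source_units:
        share = overlap_area(rect, target) / rect.area()
        total += count * share
    return total


# Hypothetical source units (think census tracts) with made-up populations.
tracts = [
    (Rect(0, 0, 2, 2), 400),   # one quarter of this unit lies in the study area
    (Rect(2, 0, 4, 2), 1000),  # half of this unit lies in the study area
]
study_area = Rect(1, 1, 4, 2)

if __name__ == "__main__":
    print(apportion(tracts, study_area))
```

The uniform-distribution assumption is the main source of error here. The smaller the source units are relative to your study area, the less that assumption matters, which is why the granularity of the original reporting affects the accuracy of the result.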
A statistician confidently tried to cross a river that was 1 meter deep on average. He drowned.
Just a joke. No one really drowned.
Now that you know why disaggregating data is useful and how it can reveal more insight from the data you already have, it’s important to note that the aggregated, descriptive statistics and the disaggregated measures together provide a more complete story. Yet often there is no central source that puts the individual numbers from cities, counties, or states together in a way that would provide an aggregate benchmark to measure against.
If you have both aggregate as well as granular data, you can build models to compare your community against others.
The ability to compare two or more communities with each other can be a really powerful tool in planning practice. For example, you can cluster communities to find ones similar to yours, then run a pre- and post-treatment analysis to see how a similar community was transformed by a particular policy. This can give you insight into how a similar policy might affect your community. You can even run a nationwide search for communities similar to your interest area across a variety of variables, revealing additional communities that might be useful for you to study.
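As a rough illustration of the "find similar communities" idea, the sketch below standardizes each variable and ranks communities by Euclidean distance to a target. This is a nearest-neighbor similarity search, a simpler stand-in for a full clustering method, and it is only one of many ways to do this. The community names, the three variables (median income, population density, share of renters), and every value are hypothetical.

```python
import math
from statistics import mean, pstdev

# Hypothetical profiles: [median income, people per sq. mile, share of renters].
communities = {
    "Community A": [52000, 3100, 0.42],
    "Community B": [54000, 2900, 0.40],
    "Community C": [98000, 800, 0.15],
    "Community D": [31000, 7200, 0.68],
}


def standardize(profiles):
    """Convert each variable to z-scores so no single unit of measure dominates."""
    names = list(profiles)
    columns = list(zip(*(profiles[name] for name in names)))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 when a variable is constant, to avoid dividing by zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return {
        name: [(v - m) / s for v, m, s in zip(profiles[name], means, stdevs)]
        for name in names
    }


def most_similar(profiles, target, k=2):
    """Return the k communities closest to `target` in standardized space."""
    z = standardize(profiles)
    distances = [
        (math.dist(z[target], z[name]), name) for name in z if name != target
    ]
    return [name for _, name in sorted(distances)[:k]]


if __name__ == "__main__":
    print(most_similar(communities, "Community A", k=2))
```

Standardizing first matters: without it, income (measured in tens of thousands) would swamp the renter share (measured between 0 and 1), and the "similar" communities would really just be the ones with similar incomes.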
The average statistician is just plain mean.
You with me?
As more cities and public agencies have gotten on the open data bandwagon and started publishing data and statistics on the internet, they have tended to report data at the aggregate level, either of cohort or geography. However, it would be a lot more useful, both for the agencies and for the analysts or programs consuming that data, to have it reported at a less aggregate level as well.
Often, agencies choose to only publish aggregate statistics to balance privacy interests with transparency. And I’m not arguing that this isn’t a valid reason to only report aggregate statistics. However, agencies should try to see if they can include additional statistics without compromising privacy. For example, while reporting income for a particular area, instead of only reporting the median, perhaps report the average, minimum, maximum, and standard deviation too. It still maintains the individual’s privacy, while making the information that can be mined from the data a lot more meaningful in helping understand the community.
The encouraging trend is that a growing number of agencies are choosing to report disaggregated data. The US Department of Education, for example, is starting to publish more of its data in a disaggregated fashion. I sure hope a lot of other agencies get on that bandwagon too.
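Here is a small sketch of why those extra statistics matter. The two lists of household incomes are invented for illustration; they share the same median but describe very different areas. The `summarize` helper is hypothetical and uses the population standard deviation from Python's standard library.

```python
from statistics import mean, median, pstdev

# Hypothetical household incomes for two areas with the same median.
area_one = [40000, 45000, 50000, 55000, 60000]
area_two = [12000, 20000, 50000, 90000, 250000]


def summarize(incomes):
    """Report the fuller set of statistics suggested above, not just the median."""
    return {
        "median": median(incomes),
        "mean": mean(incomes),
        "min": min(incomes),
        "max": max(incomes),
        "stdev": pstdev(incomes),  # population standard deviation
    }


if __name__ == "__main__":
    for label, incomes in (("Area one", area_one), ("Area two", area_two)):
        print(label, summarize(incomes))
```

If only the median were published, these two areas would look identical. The mean, range, and standard deviation show that the second area has far more spread between its lowest and highest incomes.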
As George Burns once said,
“If you live to be one hundred, you’ve got it made. Very few people die past that age.”
Okay — all done.
To learn more about how you can discover data hidden in plain sight, visit our website.