Topological Data Analysis

Fei SUN

Published Sep 28, 2020

Let's talk about topological data analysis!

In data analysis, we often encounter the following problem: say if we want to categorize or grouping some data, a natual method is to endow the data sample space some sort of metric (i.e, distance) and group those data points that are "close" to each other together. The idea seems natural and simple enough, but what does "close" mean? In reality, we may use some algorithm like KNN, and assign a parameter r so that a cluster of data points will be grouped together if element-wise distance in cluster is smaller than r. Then we vary the value of r, and find a state which we could claim we have a classification! But there is a problem, the data groups may have some non-standard distribution, e.g, one group forms a circle and the other group is enclosed by that circle, or more generally in topological language, groups are only path-connected but not simply connected. In this case, the above algo won't work. Since I have used topological language to describe the difficulty, you may have already guessed that I'm going to propose some topological method.

Homology is a common tool in topology, loosely speaking, it somehow differentiate the shapes of spaces. So, if our data points are scattered into/onto different surfaces, homology should tell (hopefully) the two groups are different. One common topological method is called Cech homology, and this is relted to our previous algo in the following way: the method is based on how close the data points are. Still we use a parameter r, and consider standard balls with radius r. We use these balls to cover the data space, and the covering gives rise to Cech homology. We may then vary the value of r, get different covering, which yield different homologies. If, say, r+dr, gives the same result, we may stop the procedure. In this method, we use a parameter to control our homology results, so called persistent homology. There is a wel-known persistent homology algorithm, let's say we have a Cech complex set up and a triangulation induced by it as well. The the homology algo will iteratively find the "youngest" unpaired simplex to pair or itself is a generator. After go through all the vertices/edges/faces, we will get the homology basis.

This is a fantastic algorithm, but not good enough! Homology is not the strongest classifer of spaces, a version that is more powerful is called homotopy. In dimension one, it's called fundamental group, and it's the non-commutative version of homology. Why homotopy is stronger than homology? Say our data point generate a 3D space carved out a knot, e.g, the complement of a trefoil knot. Then homology will not give you much info since homology of the complementary of knot is always the integer ring. However, homotopy can! Because homotopy is not commutative, it actually tell you how many rounds do we circle around something. The attached paper is an algorithm I developed for persistent homotopy, which is essentially the non-commutative version of the persistent homology algo.

Alexander R. Korbonits 5y

I’m a tad rusty on homotopy and homology but really exited to “Čech” this out!

2 Reactions

Chris Wang 5y

Nice! I'll start classify Pokemon through homotopy.

Tan L, PhD. 5y

brave heart...not sure if the info extracted from any (presumably low-dim) persistent non-commutative generators is worth the effort, assuming super-computing power is not a problem (which could well be as very few TDA algorithms are parallel-computing-friendly)... by the way, there is a very fine difference between cech complex and vietoris-rips complex which again translates to drastic difference in computational cost, and that's why in practice, PH is mostly computed from VR complex as opposed to cech's (e.g. the pic used in your article is a VR construction). all in all, keep it coming~

See more comments

Topological Data Analysis

Fei SUN

Others also viewed

Understanding Data Types and Measurement Scales: A Fundamental Guide

From Data to Decisions: The Role of EDA in Business Strategy

Avoiding mistakes with Data requirements in Projects

Dealing with Erratic Data in Time Series Forecasting: Strategies and Algorithms

How to detect outliers in your dataset?

Thought Leader Thursday: What’s in my Designs’ Future? Seven Ways to Read your Data to Learn It

Day 18: Handling Missing Data in Your Analysis: Bridging the Gaps

Segmenting with Mixed Type Data - A Case Study Using K-Medoids on Subscription Data

Data, big data and data driven decision making strategy: Part 2

Understanding Different Types of Data: A Comprehensive Overview with Examples

Explore content categories