Preventing the Curse of Dimensionality with PCA

PCA (Principal Component Analysis): when too many features start becoming a problem…

While working on datasets with a large number of features, I realized something important:

More features ≠ better model

In fact, too many features can lead to a problem called the curse of dimensionality:
- Models become slow
- Computation increases
- Noise increases
- Visualization becomes difficult

Solution → Dimensionality Reduction

PCA is an unsupervised learning technique: it uses only the input features and needs no target/output. It is a feature extraction technique that transforms high-dimensional data into lower dimensions while preserving most of the important information (the variance in the data).

In simple words: "It keeps the essence of the data but reduces complexity."

Using PCA helps:
- Reduce the number of features
- Improve model performance in some cases, by dropping noisy or redundant directions
- Reduce computation cost
- Speed up training
- Make data easier to visualize

How PCA Works (Steps I Practiced)

Step 1️⃣: Standardize the data
Because PCA is scale-sensitive: features with larger ranges would otherwise dominate the components.

Step 2️⃣: Compute the covariance matrix
To understand relationships between features

Step 3️⃣: Find eigenvalues & eigenvectors
import numpy as np
# cov_matrix is the covariance matrix from Step 2
# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)

Step 4️⃣: Select principal components
Choose the top components, i.e. the eigenvectors with the largest eigenvalues (highest variance), and project the data onto them.

PCA is not just reducing columns… It's about keeping the most important information while removing redundancy.

#Datascience #Dataanalyst #Machinelearning #curseofdimensionality #featureextraction #python #numpy
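
For anyone who wants to try the four steps end to end, here is a minimal NumPy sketch. It is only an illustration under some assumptions: the helper name `pca_from_scratch` and the synthetic dataset are made up for this example, every feature is assumed to have non-zero variance (otherwise standardization divides by zero), and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Hypothetical helper: project X (n_samples x n_features) onto its top components."""
    # Step 1: standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 2: covariance matrix of the features (columns are features)
    cov_matrix = np.cov(X_std, rowvar=False)

    # Step 3: eigen-decomposition; eigh returns eigenvalues in ascending order
    eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)

    # Step 4: sort by descending eigenvalue and keep the top n_components
    order = np.argsort(eigen_values)[::-1]
    eigen_values = eigen_values[order]
    eigen_vectors = eigen_vectors[:, order]
    components = eigen_vectors[:, :n_components]

    # Share of the total variance captured by each kept component
    explained_ratio = eigen_values[:n_components] / eigen_values.sum()

    # Project the standardized data onto the selected components
    return X_std @ components, explained_ratio

# Synthetic data (an assumption for this demo): 5 features built from
# 2 underlying signals plus a little noise, so 2 components should suffice.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + base[:, 1],
    2 * base[:, 0] - base[:, 1],
    base[:, 0] - 3 * base[:, 1],
]) + 0.01 * rng.normal(size=(200, 5))

X_reduced, explained_ratio = pca_from_scratch(X, n_components=2)
print(X_reduced.shape)        # (200, 2)
print(explained_ratio.sum())  # share of variance kept by the 2 components
```

In practice a library implementation such as scikit-learn's `PCA` is usually the more convenient choice; the sketch is only meant to make each step visible.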
