Anuj Katiyal
Jersey City, New Jersey, United States
3K followers
500+ connections
About
I am a Senior Data Scientist with a breadth of experience and expertise ranging from…
Activity
Experience
Education
-
Columbia University in the City of New York
Master of Science in Data Science
Activities and Societies: Treasurer and Member at the Data Science Institute Student Council (DSISC)
Licenses & Certifications
Volunteer Experience
-
Teacher
Ashakiran at IIIT Hyderabad
- 2 years 1 month
Education
Taught mathematics and English to underprivileged children
-
Treasurer
Data Science Institute Student Council, Columbia University
- 1 year 3 months
Education
As Treasurer and a member of the Data Science Institute Student Council, I manage the funds for student activities, collaborate with administrators to resolve student issues, and facilitate events that increase networking opportunities for students.
Publications
-
An Unmixing Framework to improve class accuracies using detected High Importance Local Regions
IEEE International Geoscience and Remote Sensing Symposium
Image classification techniques aim to improve class accuracies, which are affected by the occurrence of mixed pixels in remotely sensed data. Improving the labeling accuracy for mixed-pixel regions increases the global class accuracies. Spectral unmixing has been used to decompose mixed-pixel regions into their constituent endmembers, with a corresponding fractional abundance for each endmember. These unmixing approaches are approximated from spectral behavior but ignore the spatial neighborhood. The data values at a pixel, along with its spatial neighborhood, are good indicators of image characteristics, including atmospheric conditions, and need to be considered. In this research, we propose a spatio-spectral framework that improves classification accuracy and demonstrate its utility by improving the labels of detected mixed regions in MODIS data, validated against the AWiFS-derived (APLULC 2005) land cover dataset.
-
Improving Utility of Low-Resolution data using Statistical approaches in Remote Sensing
XVI Brazilian Remote Sensing Symposium
With the increase in multi-resolution data available from various satellite sensors, there is a growing need for analysis techniques that handle and exploit the information extractable from lower resolution (LR) data before acquiring higher resolution (HR) data. This paper presents a methodology that uses statistical approaches to sub-group LR classified data, after filtering, into high importance local regions (HILRs) and low importance local regions (LILRs) for every class. The HILRs were shown to contain a larger number of near-pure pixels than the complete class regions, as verified using the classified HR APLULC data, when an LR pixel was matched to the HR matrix using the Near Purity Measure (of 80). The HILRs were further shown to have higher stability, exhibiting reduced NDVI variation compared to the complete class regions, using the HR AWiFS data. The proposed method works best for LR classes with limited intra-class heterogeneity and good inter-class separability. The approach can reduce the processing done on HR data based on the corresponding LR HILRs obtained for each class region, and can further help in applications like pure-pixel matching, building HR-LR classification models, and isolating pure pixels from mixed/impure pixels in class regions.
-
Spatio-Spectral method for Estimating classified regions with high confidence using Modis Data
35th International Symposium on Remote Sensing of Environment
In studies like change analysis, the availability of very high resolution (VHR)/high resolution (HR) imagery for a particular period and region is a challenge due to sensor revisit times and the high cost of acquisition. Therefore, most studies prefer lower resolution (LR) sensor imagery with frequent revisit times, in addition to its cost and computational advantages. Further, classification techniques provide only a global estimate of class accuracy, which limits their utility if the accuracy is low. In this work, we focus on the sub-classification problem for LR images and estimate regions within each classified region with higher confidence than the global classification accuracy. The spectrally classified data was mined into spatially clustered regions, then refined and processed using statistical measures to arrive at local high confidence regions (LHCRs) for every class. Rabi season MODIS data from January 2006 and 2007 was used for this study, and the LHCRs were evaluated using the APLULC 2005 classified data. For Jan-2007, the global class accuracies for water bodies (WB), forested regions (FR), and Kharif crops & barren lands (KB) were 89%, 71.7% and 71.23% respectively, while the respective LHCRs had accuracies of 96.67%, 89.4% and 80.9%, covering 46%, 29% and 14.5% of the initially classified areas. Though the areas are reduced, LHCRs with higher accuracies help in extracting more representative class regions. Identifying such regions can reduce classification time and processing for HR images when combined with more frequently acquired LR imagery, help isolate pure vs. mixed/impure pixels, and provide training sample locations for HR imagery.
Courses
-
Algorithms
CS3110
-
Algorithms for Data Science
CSOR 4246
-
Applied Machine Learning
COMS 4995
-
Artificial Intelligence
CS3500
-
Computer Systems for Data Science
COMS 4121
-
Computer Vision
CS5765
-
Data Science Capstone
ENGI 4800
-
Data Warehousing and Data Mining
CS5405
-
Database Management Systems
CS3400
-
Digital Image Processing
CS4750
-
Exploratory Data Analysis and Visualization
STAT 5702
-
Fieldwork : Data Science Internship (Twitter)
COMS 6910
-
Linear Algebra
MA3100
-
Machine Learning
CS5770
-
Machine Learning for Data Science
COMS 4721
-
Numerical Analysis
MA6401
-
Pattern Recognition
CS4770
-
Probability Theory
STAT 4203
-
Statistical Inference and Modeling
STAT 5703
-
Storytelling with Data
JOUR 4001
Projects
-
Deep Learning using Keras
(Skills: Python, Keras, Scikit-Learn, Pandas, Numpy, Matplotlib, Seaborn)
Datasets used: MNIST, http://ufldl.stanford.edu/housenumbers/
1. Trained a multilayer perceptron (feed-forward neural network) with two hidden layers and rectified linear nonlinearities on the MNIST dataset using the Keras Sequential interface. Compared the baseline model with a model using dropout, which improved accuracy.
2. Trained a convolutional neural network on the SVHN dataset. Achieved an accuracy of 92.7% on the test set with a base model. Also built a model using batch normalization and dropout, which led to an increased accuracy of 95.3%.
3. Imported the weights of a pre-trained convolutional neural network, VGG, and used it as a feature extraction method to train a multi-layer perceptron on the pets dataset, achieving an accuracy of 73.2% on the 37-class classification task.
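The baseline-vs-dropout comparison in item 1 can be sketched with the Keras Sequential API; the layer sizes and dropout rate below are illustrative assumptions, not the project's actual hyperparameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(use_dropout: bool = False) -> keras.Model:
    """Two hidden ReLU layers for flattened 28x28 MNIST digits, optional dropout."""
    model = keras.Sequential()
    model.add(layers.Input(shape=(784,)))
    model.add(layers.Dense(256, activation="relu"))
    if use_dropout:
        model.add(layers.Dropout(0.5))  # randomly zero half the activations while training
    model.add(layers.Dense(128, activation="relu"))
    if use_dropout:
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))  # 10 digit classes
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

baseline = build_mlp(use_dropout=False)
regularized = build_mlp(use_dropout=True)
```

Training both models on MNIST and comparing held-out accuracy would reproduce the comparison; dropout typically narrows the gap between training and validation accuracy.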
In-Class Kaggle : Bank's Marketing Campaign to Analyze Subscription Status (Classification Analysis)
(Skills Used: Python, Numpy, Scikit-Learn, Pandas, Matplotlib, Seaborn, Git, Travis)
Dataset used: https://archive.ics.uci.edu/ml/datasets/bank+marketing
The task was to predict whether a customer would subscribe to a term deposit, based on a direct phone-call marketing campaign run by a banking institution.
Finished 2nd among 100 teams in the in-class Kaggle classification competition. The best ROC-AUC score of 0.798 on the test set was obtained using a poor man's stacking ensemble with Logistic Regression, SVM and Random Forest. The major modules in the project included:
1. Data Cleaning and Pre-processing to exclude outliers, remove redundant independent variables and impute missing values.
2. Created a pipeline to perform feature engineering, feature selection and model validation on the dataset.
3. Applied classification models such as Logistic Regression and SVM, followed by tree-based models such as Random Forest, gradient-boosted trees and XGBoost. Tuned the hyper-parameters to select the best model for each algorithm.
4. Applied various ensemble methods, including voting classifiers, poor man's stacking and weighted ensembles; the stacking ensemble of Logistic Regression, SVM and Random Forest gave the best ROC-AUC of 0.798.
5. Applied resampling techniques like RandomUnderSampler and RandomOverSampler with various classification algorithms, followed by ensembles built on them. Resampling was also applied using Edited Nearest Neighbors, SMOTE, SMOTEENN and SMOTETomek.
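A stacking ensemble of this shape can be sketched with scikit-learn's StackingClassifier; the synthetic data below stands in for the bank-marketing set, and the hyper-parameters are placeholders rather than the tuned values from the project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the UCI bank-marketing data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base learners' out-of-fold predictions feed a logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(probability=True, random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
```

The "poor man's" variant described above trains the meta-model on base predictions directly rather than out-of-fold ones; StackingClassifier's cross-validated scheme is the safer default.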
Predicting Market Rate for Apartments based on New York City Housing and Vacancy Survey (NYCHVS) : Regression Analysis
(Skills Used: Python, Scikit-learn, Matplotlib, Travis, Git, Numpy)
Datasets available at: https://www.census.gov/housing/nychvs/data/2014/userinfo2.html
(Data can be accessed at https://www.census.gov/housing/nychvs/data/2014/uf_14_occ_web_b.txt)
We used linear models (regression approaches) to predict monthly apartment rents in NYC from the 2014 Census data. Feature engineering was done to obtain features that directly affect the pricing of an apartment, such as the number of rooms, the presence of an elevator, and the floor number. The model was evaluated on a separate test set extracted from the complete dataset, using metrics like the R^2 score (obtained as 0.62).
The project involved the following major steps:
1. Loading and initial analysis of the data. This step involved cleaning the data to drop rows with missing rent, dropping columns with leaked information, and additional pre-processing like one-hot encoding and feature selection/engineering.
2. We obtained an R^2 value of around 0.6, on par with most analyses of this dataset using linear models such as Linear Regression, Ridge Regression, Lasso Regression and Polynomial Regression.
3. Important observations included dropping rows with top-coded rent and rows that are rent-controlled, which increased the R^2 score from 0.45 to 0.62.
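A one-hot-encoding plus linear-model pipeline of the kind described in step 1 can be sketched as follows; the column names and toy data are invented for illustration, not taken from the NYCHVS files:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical apartment features and rents (placeholders, not survey data).
df = pd.DataFrame({
    "rooms":    [1, 2, 3, 2, 4, 3, 1, 5],
    "floor":    [1, 5, 2, 8, 3, 6, 2, 10],
    "elevator": ["yes", "yes", "no", "yes", "no", "yes", "no", "yes"],
    "rent":     [1500, 2400, 2100, 2900, 2600, 2700, 1400, 3800],
})

# One-hot encode the categorical column, pass numeric columns through unchanged.
pre = ColumnTransformer([("cat", OneHotEncoder(), ["elevator"])],
                        remainder="passthrough")
model = Pipeline([("pre", pre), ("ridge", Ridge(alpha=1.0))])

features = df[["rooms", "floor", "elevator"]]
model.fit(features, df["rent"])
r2 = model.score(features, df["rent"])  # R^2 on the (toy) training data
```

In the real project, evaluation used a held-out test split rather than the training data, and feature selection further pruned the encoded columns.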
Complaints about Traffic conditions in the city of Boston : Text Classification and Clustering Analysis
-
(Skills: Python, Numpy, Scikit-learn, Matplotlib, Seaborn, Pandas, Nltk, Spacy, Gensim)
Major modules:
1. Data Cleaning - Involved the initial loading of the data along with visualization of the class distribution. Duplicated labels were consolidated into one, and duplicated data points were removed.
2. Baseline Multi-Class Classification - Baseline models using a bag-of-words approach were applied to the dataset, namely Logistic Regression, SGDClassifier, Multinomial Naive Bayes, etc. Logistic Regression with a Tf-Idf vectorizer gave the best macro F1 score of 0.536.
3. Improved the macro F1 score of the Tf-Idf Logistic Regression model to 0.557 using richer text features, including word n-grams, character n-grams, and domain-specific features like complaint length and the number of punctuation marks and uppercase letters, while capping the maximum number of features and ignoring stop words and infrequent words.
4. The results of the tuned model were analyzed using confusion matrices, inspection of important features and their coefficient weights, and printing sample mistakes made by the model.
5. Clustering analysis was done on the dataset using LDA, NMF and K-Means. The best topics obtained with LDA and NMF were visualized with the most important features in every topic. The clustering methodologies were compared on ARI score, and NMF had the highest ARI.
6. Improved the classification models using the classes obtained from the clustering analysis with NMF, LDA and K-Means. Reassigning classes based on the clustering resulted in improved F1 scores: 0.79 for NMF, 0.64 for LDA and 0.84 for K-Means. Further, word2vec embeddings were used to improve the classification models with features extracted from text.
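The baseline in module 2 (Tf-Idf features feeding Logistic Regression) can be sketched as below; the toy complaints and labels are invented placeholders, not the Boston data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented complaint snippets with two illustrative classes.
texts = ["pothole on main street", "broken traffic light downtown",
         "pothole near the school", "traffic light stuck on red",
         "large pothole damaged my tire", "signal light not working"]
labels = ["pothole", "signal", "pothole", "signal", "pothole", "signal"]

# Tf-Idf with unigrams + bigrams and English stop-word removal, as in module 3.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
pred = clf.predict(["deep pothole on elm street"])[0]
```

The real task was multi-class; macro F1 (as reported above) averages per-class F1 so that rare complaint categories count equally.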
Quantified Self : Analyzing Personal Garmin Data (2012-2017)
-
(Skills Used: Python, R, D3.js, Matplotlib, ggplot2)
I have been an avid runner and fitness enthusiast for the past 5 years, and a fun part of that has been how much we can learn about ourselves by exploring the data we capture.
I started capturing data about myself around 2012, when wearable devices first became widespread, and have always wanted to carry out an analysis to gain insights about my running patterns, my strength-training schedules, my diet logs, and any other metadata I can collect about myself, like heart rate, cadence for runs, and step counts.
-
Machine Learning for Data Science (COMS 4721) : Projects
-
(Skills: Numpy, Scipy, Python, Matplotlib)
Project 1: Linear and Ridge Regression
Implemented Ridge Regression to predict miles per gallon for various car models in an automobile dataset. Also visualized how the importance of various features varies with the regularization parameter. Further, implemented pth-order polynomial regression to observe how the RMSE varies as a function of the regularization parameter for orders p = 1, 2, 3.
Project 2: Naive Bayes, K-Nearest Neighbors and Logistic Regression Algorithms
Implemented Naive Bayes algorithm to classify a dataset of emails into spam/ham. Secondly, implemented K-Nearest Neighbors algorithm using the same dataset and observed the changes in accuracy with varying values of K. Finally, used the dataset and implemented Logistic Regression to observe the changes in the objective function with the increase in the iterations.
Project 3: Gaussian Process and Ada-boost Algorithm
Implemented Gaussian Process and calculated RMSE by varying the values of the parameters of the model to observe the variation. Also, implemented boosting for the Least Squares Classifier on the dataset found at https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Project 4: K-Means Algorithm and Probabilistic Matrix Factorization
Implemented the K-means algorithm to cluster data generated from a weighted mixture of Gaussian
distributions. Also, predicted movie ratings and found ‘similar’ movies in the MovieLens dataset by implementing Maximum A Posteriori (MAP) inference Probabilistic Matrix Factorization collaborative filtering approach.
Project 5: First Order Markov Chain Model and Non-Negative Matrix Factorization
Implemented a first-order Markov chain model to rank 760 college football teams using game scores from the 2016 college football season. Also implemented the Non-Negative Matrix Factorization algorithm for topic modeling on a New York Times dataset.
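Project 1's ridge regression has a closed-form solution that can be sketched in a few lines of NumPy; the synthetic data below stands in for the automobile dataset, and the regularization strength is an arbitrary example value:

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    # Solve the regularized normal equations instead of inverting explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic regression problem with known weights and small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)

w = ridge_fit(X, y, lam=0.1)  # recovers roughly w_true; lam shrinks w toward 0
```

Increasing `lam` trades variance for bias: the fitted weights shrink toward zero, which is what the RMSE-vs-regularization plots in the project visualize.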
Data Visualization Project : Marvel vs DC Superheroes compared using D3.js
-
(Skills Used: HTML, Javascript, D3.js, Jquery)
Since the 1930s, Marvel and DC Comics have had a friendly rivalry, stemming from the comic book pages and now transcending onto the big screen. Many of their characters have deep, well-established backgrounds that go beyond pages and screens, but where do they come from in the first place? How are they different? Are Marvel characters stronger than their DC counterparts?
We use visualizations created with D3.js to compare their origins, superpowers, and other attribute differences!
Analyzing GitHub Experts
-
The work involved analyzing the code of several Java experts to discern the skill sets, tools and techniques they use most often. Further, we looked into ways to tag and classify these skill sets into specific subfields for ease of visualization.
(Tools Used: Numpy, Pandas, Scikit-Learn, NetworkX, D3.js, Flask, AWS Hosting)
Golconda on the Web (Honors Project)
-
The project was done as an honors project for the Bachelor of Technology degree. We created a visual experience for important historical monuments like the Golconda Fort in Hyderabad, by stitching together many captured images into 3D image synths embedded within video captured at places of high interest. The work was later extended to automatic annotation of images taken at Golconda Fort via a mobile device.
Tools Used: C, Bundler, Flash, SQL, Python, Symbian
Honors & Awards
-
Best Young Scientist Presentation Award
35th International Symposium on Remote Sensing of Environment
Awarded for the oral presentation on "Spatio-Spectral method for Estimating classified regions with high confidence using Modis Data" at ISRSE35, held in Beijing, China.
-
Dean's List
International Institute of Information Technology
Awarded for 3 out of 8 semesters for being among the top 10% of students in a batch of 186.
Test Scores
-
GRE
Score: 322/340
-
TOEFL
Score: 112/120
Languages
-
English
Full professional proficiency
-
Hindi
Full professional proficiency
-
Punjabi
Limited working proficiency
Organizations
-
IEEE
Student Member
Awarded a scholarship for the oral presentation at IGARSS 2013 on "An Unmixing Framework to improve class accuracies using detected High Importance Local Regions", held in Melbourne, Australia.