Anuj Katiyal

Jersey City, New Jersey, United States
3K followers 500+ connections

About

I am a Senior Data Scientist with a breadth of experience and expertise ranging from…

Experience

  • Condé Nast

    Greater New York City Area

  • -

    Greater New York City Area

  • -

    Greater New York City Area

  • -

    Pune Area, India

  • -

    Hyderabad Area, India

  • -

    Hyderabad Area, India

Volunteer Experience

  • Teacher

    Ashakiran at IIIT Hyderabad

    - 2 years 1 month

    Education

    Taught mathematics and English to underprivileged children.

  • Treasurer

    Data Science Institute Student Council, Columbia University

    - 1 year 3 months

    Education

    As treasurer and member of the Data Science Institute Student Council, I manage the funds for student activities. The council works to resolve student issues, collaborates with administrators to address them, and organizes events that increase networking opportunities for students.

Publications

  • An Unmixing Framework to improve class accuracies using detected High Importance Local Regions

    IEEE International Geoscience and Remote Sensing Symposium

    Image classification techniques aim to improve class accuracies, which are degraded by mixed pixels in remotely sensed data. Improving the labeling accuracy for mixed-pixel regions increases the global class accuracies. Spectral unmixing has been used to decompose mixed-pixel regions into their constituent endmembers and a corresponding fractional abundance for each endmember. The unmixing approaches are approximated based on spectral behavior but ignore the spatial neighborhood. The data values at a pixel, along with its spatial neighborhood, are good indicators of the image characteristics, including atmospheric conditions, and need to be considered. In the current research, we propose a spatio-spectral framework that improves the classification accuracy and demonstrate its utility by improving the labels of the detected mixed regions for MODIS data, validated against the AWIFS-derived (APLULC 2005) land cover dataset.

    Other authors
    • K S Rajan
  • Improving Utility of Low-Resolution data using Statistical approaches in Remote Sensing

    XVI Brazilian Remote Sensing Symposium

    With the increase in multi-resolution data available from various satellite sensors, there is a growing need for analysis techniques that exploit the information in lower resolution (LR) data before higher resolution (HR) data is acquired. This paper presents a methodology that uses statistical approaches to sub-group the LR classified data, after filtering, into high importance local regions (HILRs) and low importance local regions (LILRs) for every class. The HILRs were shown to contain a larger number of near-pure pixels than the complete class regions, as verified using the classified HR APLULC data, when an LR pixel was matched to the HR matrix using the Near Purity Measure (of 80). The HILRs were further shown to be more stable, with reduced NDVI variation compared to the complete class regions, using the HR AWIFS data. The proposed method works better for LR classes with limited intra-class heterogeneity and good inter-class separability. The approach can help reduce the processing done on HR data based on the corresponding LR HILRs obtained for every class region, and can further help in applications like pure-pixel matching, building HR-LR classification models, and isolating pure pixels from the mixed/impure pixels in class regions.

    Other authors
    • K S Rajan
  • Spatio-Spectral method for Estimating classified regions with high confidence using Modis Data

    35th International Symposium on Remote Sensing of Environment

    In studies like change analysis, the availability of very high resolution (VHR)/high resolution (HR) imagery for a particular period and region is a challenge due to sensor revisit times and the high cost of acquisition. Therefore, most studies prefer lower resolution (LR) sensor imagery with frequent revisit times, in addition to its cost and computational advantages. Further, classification techniques provide a global estimate of the class accuracy, which limits their utility if the accuracy is low. In this work, we focus on the sub-classification problem for LR images and estimate regions of higher confidence than the global classification accuracy within each classified region. The spectrally classified data was mined into spatially clustered regions and further refined and processed using statistical measures to arrive at local high confidence regions (LHCRs) for every class. Rabi season MODIS data from January 2006 and 2007 was used for this study, and the LHCRs were evaluated using the APLULC 2005 classified data. For Jan-2007, the global class accuracies for water bodies (WB), forested regions (FR) and Kharif crops & barren lands (KB) were 89%, 71.7% and 71.23% respectively, while the respective LHCRs had accuracies of 96.67%, 89.4% and 80.9%, covering 46%, 29% and 14.5% of the initially classified areas. Though the areas are reduced, LHCRs with higher accuracies help in extracting more representative class regions. Identifying such regions can help reduce the classification time and processing for HR images when combined with the more frequently acquired LR imagery, isolate pure vs. mixed/impure pixels, and serve as training sample locations for HR imagery.

    Other authors
    • K S Rajan

Courses

  • Algorithms

    CS3110

  • Algorithms for Data Science

    CSOR 4246

  • Applied Machine Learning

    COMS 4995

  • Artificial Intelligence

    CS3500

  • Computer Systems for Data Science

    COMS 4121

  • Computer Vision

    CS5765

  • Data Science Capstone

    ENGI 4800

  • Data Warehousing and Data Mining

    CS5405

  • Database Management Systems

    CS3400

  • Digital Image Processing

    CS4750

  • Exploratory Data Analysis and Visualization

    STAT 5702

  • Fieldwork : Data Science Internship (Twitter)

    COMS 6910

  • Linear Algebra

    MA3100

  • Machine Learning

    CS5770

  • Machine Learning for Data Science

    COMS 4721

  • Numerical Analysis

    MA6401

  • Pattern Recognition

    CS4770

  • Probability Theory

    STAT 4203

  • Statistical Inference and Modeling

    STAT 5703

  • Storytelling with Data

    JOUR 4001

Projects

  • Deep Learning using Keras

    (Skills: Python, Keras, Scikit-Learn, Pandas, Numpy, Matplotlib, Seaborn)

    Datasets used: MNIST, SVHN (http://ufldl.stanford.edu/housenumbers/)

    1. Trained a multilayer perceptron (feed-forward neural network) with two hidden layers and rectified linear nonlinearities on the MNIST dataset using the Keras Sequential interface. Compared the baseline model with a model using dropout, which improved accuracy.

    2. Trained a convolutional neural network on the SVHN dataset. Achieved an accuracy of 92.7% on the test set with a base model. Also built a model using batch normalization and dropout, which increased accuracy to 95.3%.

    3. Imported the weights of a pre-trained convolutional neural network, VGG, and used it as a feature extractor to train a multilayer perceptron on the pets dataset. We achieved an accuracy of 73.2% on the 37-class classification task.

  • In-Class Kaggle : Bank's Marketing Campaign to Analyze Subscription Status (Classification Analysis)

    (Skills Used: Python, Numpy, Scikit-Learn, Pandas, Matplotlib, Seaborn, Git, Travis)
    Dataset Used: https://archive.ics.uci.edu/ml/datasets/bank+marketing
    The task was to predict whether a customer will subscribe to a term deposit, based on a phone-based direct marketing campaign run by a banking institution.

    Placed 2nd among 100 teams participating in the In-Class Kaggle classification problem. The best ROC-AUC score of 0.798 on the test set was obtained using a poor man's stacking ensemble of Logistic Regression, SVM and Random Forest. The major modules in the project included:

    1. Data cleaning and pre-processing to exclude outliers, remove redundant independent variables and impute missing values.
    2. Created a pipeline to perform feature engineering, feature selection and model validation on the dataset.
    3. Applied classification models such as Logistic Regression and SVM, followed by tree-based models such as Random Forest, gradient boosted trees, XGBoost etc. Tuned the hyper-parameters to select the best model for each algorithm.
    4. Applied various ensemble methods, including voting classifiers, poor man's stacking and weighted ensembles. The poor man's stacking ensemble gave the best test-set score.
    5. Applied resampling techniques like RandomUnderSampler and RandomOverSampler with various classification algorithms, followed by ensembles built on them. Resampling was also applied using techniques like Edited Nearest Neighbors, SMOTE, SMOTEENN and SMOTETomek.
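
    For readers unfamiliar with the ROC-AUC metric quoted above, here is a minimal illustrative sketch. It is not the project's code: the function name `roc_auc` and the toy labels and scores are hypothetical. It computes ROC-AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair. This pairwise form is
    quadratic in the number of examples, which is fine for a small sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: 1 = subscribed, 0 = did not subscribe.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = roc_auc(labels, scores)
```

    Because the metric depends only on how examples are ranked, it is less sensitive to class imbalance than plain accuracy, which is one common reason to prefer it on imbalanced classification tasks.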

  • Predicting Market Rate for Apartments based on New York City Housing and Vacancy Survey (NYCHVS) : Regression Analysis

    (Skills Used: Python, Scikit-learn, Matplotlib, Travis, Git, Numpy)

    Datasets used are available at: https://www.census.gov/housing/nychvs/data/2014/userinfo2.html
    (Data can be accessed at https://www.census.gov/housing/nychvs/data/2014/uf_14_occ_web_b.txt)

    We used linear models (regression approaches) to predict the monthly rent of an apartment in NYC based on the collected 2014 Census data. Feature engineering was done to obtain features that directly affect the price of an apartment, like the number of rooms, presence of an elevator, floor number etc. The model was evaluated on a separate test set held out from the complete dataset, using metrics like the R^2 score (obtained as 0.62).

    The project involved the following major steps:
    1. Loading and initial analysis of data. This step involved cleaning the data to drop rows with missing rent, dropping columns with leaked information, and additional pre-processing like one-hot encoding and feature selection/engineering.
    2. We obtained an R^2 value of around 0.6, which is on par with most of the analysis work done on this dataset using linear models like Linear Regression, Ridge Regression, Lasso Regression and Polynomial Regression.
    3. Important observations from the dataset included dropping rows with top-coded rent and dropping rows for rent-controlled units, which led to an increase in the R^2 score from 0.45 to 0.62.
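
    As a reminder of what the R^2 scores above measure, here is a minimal sketch. It is illustrative only: the function name `r_squared` and the rent values are hypothetical and do not come from the NYCHVS data. R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical monthly rents (USD) and model predictions.
actual = [1000.0, 1500.0, 2000.0, 2500.0]
predicted = [1100.0, 1400.0, 2100.0, 2400.0]
score = r_squared(actual, predicted)
```

    A model that always predicts the mean rent scores 0, so a value of 0.62 means the model explains roughly 62% of the variance in rent on the held-out data.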

  • Image Matching Tool

    The project was done as part of the Computer Vision course. The aim was to retrieve images similar to those in the Oxford Buildings Dataset, using SIFT and SURF features ranked by tf-idf score.
    Tools Used: C, SQL

  • Complaints about Traffic conditions in the city of Boston : Text Classification and Clustering Analysis

    (Skills: Python, Numpy, Scikit-learn, Matplotlib, Seaborn, Pandas, Nltk, Spacy, Gensim)

    Major modules:

    1. Data Cleaning - Loaded the data and visualized the class distribution. Duplicated labels were consolidated into one and duplicated data points were removed.

    2. Baseline Multi-Class Classification Algorithms - Baseline multi-class models using a bag-of-words approach were applied to the dataset, namely Logistic Regression, SGDClassifier, Multinomial Naive Bayes etc. Logistic Regression with a Tf-Idf Vectorizer gave the best Macro F1 score of 0.536.

    3. Improved the Macro F1 score obtained using Logistic Regression with the Tf-Idf Vectorizer to 0.557 using richer text features, including n-grams, character n-grams and domain-specific features like the length of the complaint and the number of punctuation marks and uppercase letters in it, while limiting the maximum number of features and ignoring stop words and infrequent words.

    4. The results of the tuned model were analyzed using confusion matrices, the most important features and their coefficient weights, and a sample of the model's mistakes.

    5. Clustering analysis was done on the dataset using LDA, NMF and K-Means. The best topics obtained using LDA and NMF were visualized with the most important features belonging to each topic. The clustering methods were compared on their ARI scores, and NMF had the highest ARI score.

    6. Improved the classification models using the classes obtained from the clustering analysis in the previous module (NMF, LDA and K-Means). We reassigned the classes based on each clustering, which resulted in improved F1 scores: 0.79 for NMF, 0.64 for LDA and 0.84 for K-Means. Further, word2vec embeddings were used to improve the classification models with better features extracted from the text.
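
    The Macro F1 score used throughout the modules above can be sketched as follows. This is an illustration rather than the project's code: the function name `macro_f1` and the toy labels are hypothetical. Macro F1 computes an F1 score for each class separately and averages them with equal weight, so rare complaint categories count as much as common ones.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because
        # every class in `classes` occurs in y_true or y_pred at least once.
        f1_scores.append(2.0 * tp / (2.0 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for three complaint categories.
truth = ["signal", "signal", "parking", "parking", "pothole"]
preds = ["signal", "parking", "parking", "parking", "pothole"]
result = macro_f1(truth, preds)
```

    Here the per-class F1 scores are 2/3 for "signal", 0.8 for "parking" and 1.0 for "pothole", and the macro average is their plain mean.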

  • Quantified Self : Analyzing Personal Garmin Data (2012-2017)

    (Skills Used: Python, R, D3.js, Matplotlib, ggplot2)

    I have been an avid runner and fitness enthusiast for the past 5 years, and a fun part of it all has been how much we can learn about ourselves from exploring the data we capture.

    I started capturing data about myself around the year 2012, which was when we saw the advent of wearable devices, and have always wanted to carry out an analysis to gain insights about my running patterns, my strength training schedules, my diet logs and any other metadata which I can collect about myself like the Heart Rate, Cadence for runs and also the step counts.

  • Machine Learning for Data Science (COMS 4721) : Projects

    (Skills: Numpy, Scipy, Python, Matplotlib)

    Project 1: Linear and Ridge Regression
    Implemented ridge regression to predict miles per gallon for various car models in an automobile dataset, and visualized how the importance of each feature varies with the regularization parameter. Also implemented pth-order polynomial regression to observe how the RMSE changes as a function of the regularization parameter for p = 1, 2, 3.
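
    A minimal sketch of ridge regression in closed form, w = (X^T X + lambda * I)^(-1) X^T y, is shown below. It is illustrative only and not the course submission: the helper names `solve` and `ridge_fit` and the toy data are hypothetical, there is no intercept term, and the linear system is solved in pure Python rather than with Numpy to keep the example dependency-free.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_fit(X, y, lam):
    """Ridge weights from the normal equations (X^T X + lam * I) w = X^T y."""
    d = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(xtx, xty)


# Hypothetical toy data with two features and no intercept.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
w_unregularized = ridge_fit(X, y, lam=0.0)  # ordinary least squares
w_ridge = ridge_fit(X, y, lam=1.0)          # weights shrink toward zero
```

    Refitting over a range of `lam` values and plotting each weight against it gives the kind of feature-importance view described above: larger regularization shrinks the weights toward zero.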

    Project 2: Naive Bayes, K-Nearest Neighbors and Logistic Regression Algorithms
    Implemented the Naive Bayes algorithm to classify a dataset of emails as spam or ham. Then implemented the K-Nearest Neighbors algorithm on the same dataset and observed how accuracy changes with K. Finally, implemented Logistic Regression on the same dataset to observe how the objective function changes as the number of iterations increases.

    Project 3: Gaussian Process and AdaBoost Algorithm
    Implemented a Gaussian Process and calculated the RMSE while varying the model parameters to observe the variation. Also implemented boosting for the Least Squares Classifier on the dataset found at https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+

    Project 4: K-Means Algorithm and Probabilistic Matrix Factorization
    Implemented the K-means algorithm to cluster data generated from a weighted mixture of Gaussian distributions. Also predicted movie ratings and found ‘similar’ movies in the MovieLens dataset by implementing collaborative filtering with Probabilistic Matrix Factorization, using Maximum A Posteriori (MAP) inference.

    Project 5: First Order Markov Chain Model and Non-Negative Matrix Factorization
    Implemented a First Order Markov Chain Model to rank 760 college football teams using game scores from the 2016 college football season. Also, implemented Non-Negative Matrix Factorization algorithm to do topic modeling on a New York Times data set.
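
    One way to rank teams with a first-order Markov chain is sketched below. This is an assumption about the general approach, not the project's actual code: the function name `rank_teams`, the specific update rule (each game adds weight to transitions toward the winner and toward the team with the larger share of points) and the toy scores are all hypothetical. Teams are then ranked by the chain's stationary distribution, approximated here by repeated multiplication.

```python
def rank_teams(games, n_teams, n_iter=1000):
    """Approximate the stationary distribution of a game-based Markov chain.

    `games` holds (team_a, points_a, team_b, points_b) tuples with
    integer team indices; each game is assumed to have a positive total score.
    """
    m = [[0.0] * n_teams for _ in range(n_teams)]
    for a, pts_a, b, pts_b in games:
        total = pts_a + pts_b
        share_a, share_b = pts_a / total, pts_b / total
        a_won = 1.0 if pts_a > pts_b else 0.0
        b_won = 1.0 if pts_b > pts_a else 0.0
        # Transitions into a team grow with its wins and its share of points.
        m[a][a] += a_won + share_a
        m[b][b] += b_won + share_b
        m[a][b] += b_won + share_b
        m[b][a] += a_won + share_a

    # Normalize each row so it is a probability distribution.
    for i, row in enumerate(m):
        s = sum(row)
        if s > 0:
            m[i] = [v / s for v in row]
        else:
            m[i][i] = 1.0  # a team with no games stays where it is

    # Power iteration: w_{t+1} = w_t M, starting from the uniform distribution.
    w = [1.0 / n_teams] * n_teams
    for _ in range(n_iter):
        w = [sum(w[i] * m[i][j] for i in range(n_teams)) for j in range(n_teams)]
    return w


# Hypothetical scores: team 0 beats teams 1 and 2, team 1 beats team 2.
games = [(0, 30, 1, 10), (0, 20, 2, 10), (1, 14, 2, 7)]
weights = rank_teams(games, n_teams=3)
ranking = sorted(range(3), key=lambda t: weights[t], reverse=True)
```

    Sorting teams by their stationary weight gives the ranking; in this toy example the undefeated team ends up with the largest weight.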

  • Data Visualization Project : Marvel vs DC Superheroes compared using D3.js

    (Skills Used: HTML, Javascript, D3.js, Jquery)

    Since the 1930s, Marvel and DC Comics have had a friendly rivalry, starting on the comic book pages and now extending to the big screen. Many of their characters have deep and well-established backgrounds that go beyond just pages and screens, but where do they come from in the first place? How are they different? Are Marvel characters stronger than their DC counterparts?

    We use visualizations created using D3.js to compare their origins, superpowers, and other attribute differences between them!

  • Analyzing GitHub Experts

    The work involved analyzing the code of several Java experts to discern their skill sets and the tools and techniques they use most often. Further, we looked into ways to tag and classify these skill sets into specific subfields for ease of visualization.
    (Tools Used: Numpy, Pandas, Scikit-Learn, NetworkX, D3.js, Flask, AWS Hosting)

  • Golconda on the Web (Honors Project)

    The project was done as an honors project for the Bachelor of Technology degree. We created a visual experience for important historical monuments like the Golconda Fort in Hyderabad. The visual experience was created by stitching together many captured images into 3D image synths embedded within video captured at places of high interest. The work was later extended to automatically annotate images taken at Golconda Fort with a mobile device.

    Tools Used: C, Bundler, Flash, SQL, Python, Symbian

Honors & Awards

  • Best Young Scientist Presentation Award

    35th International Symposium on Remote Sensing of Environment

    Awarded for the oral presentation on "Spatio-Spectral method for Estimating classified regions with high confidence using Modis Data" at ISRSE35, held in Beijing, China.

  • Dean's List

    International Institute of Information Technology

    Awarded in 3 out of 8 semesters for being among the top 10% of students in the batch (of 186 students).

Test Scores

  • GRE

    Score: 322/340

  • TOEFL

    Score: 112/120

Languages

  • English

    Full professional proficiency

  • Hindi

    Full professional proficiency

  • Punjabi

    Limited working proficiency

Organizations

  • IEEE

    Student Member

    Awarded a scholarship for the oral presentation on "An Unmixing Framework to improve class accuracies using detected High Importance Local Regions" at IGARSS 2013, held in Melbourne, Australia.
