5 Great Tools for an Analyst's Stack
Must have tools for your stack and why.


Whether you're new to the analyst game or a seasoned pro, you are reading this because, like most analysts, you are always looking at trends, peer reviews and objective opinions on technologies to help take your expertise to the next level.

Unfortunately, there are dozens, even hundreds, of technologies out there with varying "barriers to entry," so to speak. These barriers can include, but are not limited to:

  1. Cost of the software
  2. Not being part of the technology stack used at work
  3. Time needed to set up a working environment before dabbling
  4. Finding a practical application for your newly acquired tech

These things can make it tricky to grow your skill set and keep you from reaching the next level of your career & capabilities.

So, I'm sharing with you the top 5 tools that have remained a constant help throughout my career and have been easy to put into practice, with little to no barrier to entry!

Linux / Shell Scripting

Best for: Analysts dealing with big data, data mining and processing tasks.

Linux is a free operating system that comes in many "flavors" (another word for type or distribution). Some of the well-known distributions include CentOS, SuSE, Mint and Ubuntu. If I left your favorite Linux distro out, please shout it out in a comment. Personally, I've used Ubuntu in the past (on virtual machines and AWS). Learning the Linux file system and structure is valuable, particularly since many newer big data technologies sit on top of Linux (e.g. Hadoop, Hive, Spark, Presto), so it's always useful to understand your platform.

But the greater takeaway from becoming familiar with Linux is the shell. Once you set up Linux on your network, you can connect to data sources and network drives and begin leveraging your Linux shell to manage, schedule and manipulate data tasks. Examples include:

  • Scheduling jobs and data processing scripts (crontab -e)
  • Exporting, compressing and manipulating large data files (e.g. zip compressed_file.zip filename.txt; sed 's/word_to_be_replaced/replacement_word/g' input.txt > output.txt <--- the best performing "find and replace" if you ask me)
  • Previewing large data files without having to open them (head -100 data_file.txt) <--- that will show you the first 100 lines of your file
  • Oh, btw, on a Mac, OS X comes with a shell, just like Linux!
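Even without a Linux box handy, the "preview a large file without opening it" idea behind head can be sketched in Python. This is a minimal, self-contained example that uses an in-memory buffer as a stand-in for a real file path:

```python
from io import StringIO
from itertools import islice

# Stand-in for a large file on disk: an in-memory buffer with 100,000 lines
big_file = StringIO("\n".join(f"row {i}" for i in range(100_000)))

# Like `head -5 data_file.txt`: pull only the first 5 lines,
# without ever loading the whole file into memory
first_lines = [line.strip() for line in islice(big_file, 5)]
print(first_lines)  # ['row 0', 'row 1', 'row 2', 'row 3', 'row 4']
```

With a real file, you would replace the buffer with `open("data_file.txt")`; islice reads lazily either way.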

Set Theory & Bayes' Theorem

Best for: Analysts who blend various data sources and analysts interested in probabilities and predicting outcomes

Set Theory: If you ever find yourself stumped by joining tables in SQL, then two or three thirty-minute sessions reviewing the basics of set theory will go a long way. Upon reviewing set theory, you will begin to see the parallels between this math topic and SQL, for example:

  • Elements are similar to field values in records
  • Union ~ combining two or more tables of data together; also applies to the SQL predicate "OR"
  • Intersection ~ like an inner join in SQL; also applies to the SQL predicate "AND"
  • Set difference ~ the "NOT EXISTS" clause (or EXCEPT/MINUS) in SQL
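These parallels are easy to see with Python's built-in sets. A quick illustration with made-up customer IDs from two tables:

```python
# Hypothetical example: customer IDs appearing in two different tables
orders = {101, 102, 103, 104}
returns = {103, 104, 105}

union = orders | returns          # like UNION (or the "OR" predicate)
intersection = orders & returns   # like an INNER JOIN (or the "AND" predicate)
difference = orders - returns     # like NOT EXISTS (orders with no return)

print(sorted(union))         # [101, 102, 103, 104, 105]
print(sorted(intersection))  # [103, 104]
print(sorted(difference))    # [101, 102]
```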

Bayes' Theorem: The knowledge you just gained from set theory is a good segue into studying Bayes' theorem. Bayes' theorem addresses questions of conditional probability, such as: what are the odds that, in a roulette game, the ball will land on black and even? Furthermore, applying its principles can help us eliminate personal bias from critical thinking. In an excerpt from Daniel Kahneman's book, Thinking, Fast and Slow, where we are asked to note our intuitive answer, he provides the following:

"A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:
85% of cabs are Green and 15% are Blue
A witness identified the cab as Blue. The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time
What is the probability that the cab involved in the accident was Blue rather than Green?"

Your guess may have been 80%, based on the witness's reliability. However, applying Bayes' rule to the two sources of information, the answer is that there is only a 41% probability the cab is Blue.
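The calculation is only a few lines in Python, using exactly the numbers from the excerpt:

```python
# Priors: 15% of cabs are Blue, 85% are Green
p_blue, p_green = 0.15, 0.85

# The witness is correct 80% of the time, wrong 20% of the time
p_says_blue_given_blue = 0.80   # correctly calls a Blue cab Blue
p_says_blue_given_green = 0.20  # mistakenly calls a Green cab Blue

# Bayes' rule: P(Blue | witness says Blue)
numerator = p_says_blue_given_blue * p_blue
evidence = numerator + p_says_blue_given_green * p_green
posterior = numerator / evidence
print(round(posterior, 2))  # 0.41
```

The base rate (only 15% of cabs are Blue) drags the answer well below the witness's 80% reliability, which is exactly the bias Kahneman is illustrating.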

SQL (Structured Query Language)

Best for: Analysts using multiple data tables to make decisions

SQL is the syntax used to query most types of databases (e.g. Oracle, DB2, Postgres and SQL Server, to name a few). Much of the SQL you write for one database can easily be used in another. Database vendors tend to differentiate their products by adding advanced, product-specific query capabilities. Often these dialects have their own names, such as T-SQL for Microsoft's SQL Server, PL/SQL for Oracle, PL/pgSQL for Postgres and so on.

I would describe SQL as a powerful and popular query syntax for any analyst dealing with many data tables and sources. It's so powerful that even "NoSQL" and other big data products offer ways to interact with them using SQL (e.g. Query Mongo for MongoDB, Presto for Hadoop).
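A low-barrier way to practice that core, portable SQL is Python's built-in sqlite3 module; the same SELECT below would run essentially unchanged on Postgres, Oracle or SQL Server. A small sketch with made-up sales data:

```python
import sqlite3

# In-memory SQLite database: no server to install, nothing to configure
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("West", 250.0), ("East", 50.0)],
)

# Core SQL (aggregation + grouping) that ports across databases
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 150.0), ('West', 250.0)]
conn.close()
```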

Python

Best for: Analysts who want to mine data, apply mathematical analysis to their data and/or intend to combine data sources from different platforms (e.g. Hadoop & Postgres, Text Files and SQL Server)

Python is a multi-purpose language that has caught much attention in the data science and analytics space over the past eight or so years. Python can be used for many things, such as backend development, application development, data analysis and the web. Therefore, as an analyst, it's important to stay focused when venturing into Python and concentrate on the aspects that will help most in your field.

The easiest way to get started with Python is via Jupyter Notebooks, which let you write and run Python code in your web browser. Setup and configuration of Jupyter is easy in a Linux environment. It may be trickier on OS X and Windows, but you can always turn to Anaconda to get up and running quickly with Jupyter Notebooks. Programming in Python will prepare you to think in terms of feeding inputs into predictive or data mining models as you step through and learn about functions and other aspects of the language.

For example, a function can assign a score to a list of users, with a higher weight going toward users in specific industries.
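A minimal sketch of such a scoring function is below; the industry names, weights and user records are all illustrative assumptions, not a prescription:

```python
# Hypothetical: industries we want to weight more heavily (illustrative only)
TARGET_INDUSTRIES = {"finance", "healthcare"}

def score_users(users, base_weight=1.0, boost_weight=2.5):
    """Return {name: score}, weighting users in target industries more heavily."""
    scores = {}
    for user in users:
        weight = boost_weight if user["industry"] in TARGET_INDUSTRIES else base_weight
        scores[user["name"]] = weight * user["activity"]
    return scores

# Made-up example users
users = [
    {"name": "Ana", "industry": "finance", "activity": 10},
    {"name": "Ben", "industry": "retail", "activity": 10},
]
print(score_users(users))  # {'Ana': 25.0, 'Ben': 10.0}
```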

This is very simplistic, but the point is that you can write very few lines of code and accomplish a lot, especially when you use libraries with many built-in features that you won't have to write from scratch. Some of the libraries I recommend learning about on your Python journey are:

  • SQLAlchemy
  • Pandas (enables you to bring data from various sources into frames/data sets and apply logic similar to that of SQL)
  • NumPy
  • PyHive (provided you have Presto or Hive enabled to speak with your Hadoop instance)
  • Psycopg2 (if using Postgres)
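To give a feel for the pandas item above, here is a small sketch of a SQL-style inner join and aggregation on two made-up tables (the frames and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical tables: users and their orders
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"user_id": [1, 1, 3], "amount": [20.0, 35.0, 15.0]})

# Roughly: SELECT name, SUM(amount) FROM users
#          JOIN orders USING (user_id) GROUP BY name
merged = users.merge(orders, on="user_id", how="inner")
totals = merged.groupby("name")["amount"].sum().to_dict()
print(totals)  # {'Ana': 55.0, 'Cy': 15.0}
```

Ben drops out of the result just as he would in an inner join, since he has no orders.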

Tableau

Best for: Analysts communicating findings to stakeholders, and analysts who love visualizing their data.

If you like creating pivot tables and charts in Excel, you will absolutely love Tableau. Tableau's learning curve is short. With Tableau you can quickly create pivot tables, build new calculations based on existing measures, and derive others from a combination of measures and dimensions. It offers much of the same functionality found in core SQL (such as counting distinct values, aggregating at numerous levels and even joining two or more data sets through related fields).

With Tableau you can also use dashboard real estate wisely via parameters. For example, you can have a single chart switch between two or more attributes on a measurement.

While Tableau is a commercially licensed product, there is also Tableau Public, which connects to multiple sources of data (even Google Sheets) and still enables you to create charts, pivot tables and full dashboards like its paid counterpart. Furthermore, you can publish your work to their server, where it can be shared with everyone.

In summary, knowledge of these skills has helped me succeed and find my work enjoyable. I've also been able to better understand the roles of other technical counterparts, from data engineers and architects to DevOps. That has paid dividends: understanding the capabilities and limitations of my work environment lets me focus on the areas where I am best equipped to succeed. In other words, I'm pretty good at avoiding technologies that others will not adopt or support, and advocating for the ones that make the most sense.

Btw, what stack has worked for you?
