Intro to pandas from SQL perspective

pandas is an open source data manipulation and analysis package/library, built for the Python programming language. The name is derived from "panel data", a common term for multidimensional data sets encountered in statistics and econometrics. It was originally written by Wes McKinney, offering data structures and operations designed for working with structured and time series data. pandas is built on top of NumPy and serves as a strong complement to the existing scientific computing environment and integrate well with many other 3rd party libraries. It's one of, if not, THE most widely used tool(s) for data analysis, manipulation, munging/wrangling, you name it!

pandas key data structure is called the DataFrame. You might think of a DataFrame as a table, similar to a table in a SQL database. If you're new to pandas, but have some familiarity with SQL, this article is meant to provide some examples of performing some common SQL operations using pandas, intended as a quick introduction. Python and pandas installation is beyond the scope of this article (the simplest way is with Anaconda). The examples will be based on the famous DEPT and EMP tables of Oracle's SCOTT sample/demo schema and assume we have DataFrames of the same name and structure.

DEPT and EMP tables of Oracle's SCOTT sample/demo schema

Selecting All Columns

In SQL, selecting all columns is done using a *:

SQL> SELECT *
  2  FROM   dept;


    DEPTNO DNAME          LOC
---------- -------------- -------------
        10 ACCOUNTING     NEW YORK
        20 RESEARCH       DALLAS
        30 SALES          CHICAGO
        40 OPERATIONS     BOSTON

With pandas, you simply call the DataFrame:

No alt text provided for this image

Selecting Specific Columns

In SQL, column selection is done using a comma-separated list of columns you'd like to select:

SQL> SELECT deptno, loc
  2  FROM   dept;


    DEPTNO LOC
---------- -------------
        10 NEW YORK
        20 DALLAS
        30 CHICAGO
        40 BOSTON

With pandas, selection is done by passing a list of column names to your DataFrame:

No alt text provided for this image

Using Arithmetic Operators

In SQL, you can add a calculated column. So if you want to say add a column "Annual Salary" that equals monthly salary multiplied by 12:

SQL> select ename,
  2         sal,
  3         sal*12 "Annual Salary"
  4  FROM   emp;


ENAME             SAL Annual Salary
---------- ---------- -------------
SMITH             800          9600
ALLEN            1600         19200
WARD             1250         15000
JONES            2975         35700
MARTIN           1250         15000
BLAKE            2850         34200
CLARK            2450         29400
SCOTT            3000         36000
KING             5000         60000
TURNER           1500         18000
ADAMS            1100         13200
JAMES             950         11400
FORD             3000         36000
MILLER           1300         15600


14 rows selected.

With pandas, you can add the 'Annual Salary' column to your DataFrame:

No alt text provided for this image

Eliminating Duplicate Rows

We eliminate duplicate rows by using the DISTINCT keyword in the SELECT clause.

SQL> SELECT DISTINCT deptno
  2  FROM   emp;


    DEPTNO
----------
        30
        20
        10

In pandas, we use the unique() function to get unique values of a column.

No alt text provided for this image

Displaying Table Structure

The DESCRIBE command is used to display the structure of a table.

SQL> DESCRIBE dept
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 DEPTNO                                    NOT NULL NUMBER(2)
 DNAME                                              VARCHAR2(14)
 LOC                                                VARCHAR2(13)

pandas DataFrame.info() function is used to print a concise summary of a DataFrame.

No alt text provided for this image

Limiting Rows Selected

Using the WHERE Clause

Filtering in SQL is done via a WHERE clause.

SQL> SELECT ename, job, deptno
  2  FROM   emp
  3  WHERE  job='CLERK';


ENAME      JOB           DEPTNO
---------- --------- ----------
SMITH      CLERK             20
ADAMS      CLERK             20
JAMES      CLERK             30
MILLER     CLERK             10

DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.

No alt text provided for this image

Basically, emp[emp.job == 'CLERK'] is returning all rows in which the condition is met.

Using the Comparison Operators

SQL> SELECT ename, sal, comm
  2  FROM   emp
  3  WHERE  sal<=comm;


ENAME             SAL       COMM
---------- ---------- ----------
MARTIN           1250       1400

The pandas equivalent would be:

No alt text provided for this image

Using the IN Operator

The IN operator is used to test for values in a list.

SQL> SELECT empno, ename, sal, mgr
  2  FROM   emp
  3  WHERE  mgr IN (7902, 7566, 7788);


     EMPNO ENAME             SAL        MGR
---------- ---------- ---------- ----------
      7369 SMITH             800       7902
      7788 SCOTT            3000       7566
      7876 ADAMS            1100       7788
      7902 FORD             3000       7566

pandas isin() method helps in selecting rows having a particular (or multiple) value in a particular column.

No alt text provided for this image

Again, emp[emp.mgr.isin([7902, 7566, 7788]) is basically returning rows having mgr=7902, 7566, or 7788.

Using the IS NULL Operator

Null values are tested with the IS NULL operator.

SQL> SELECT ename, mgr
  2  FROM   emp
  3  WHERE  mgr IS NULL;


ENAME             MGR
---------- ----------
KING

pandas isnull() method is used to check and manage NULL values in a DataFrame.

No alt text provided for this image

Using the AND Operator

AND requires both conditions to be TRUE.

SQL> SELECT empno, ename, job, sal
  2  FROM   emp
  3  WHERE  sal>=1100
  4  AND    job='CLERK';


     EMPNO ENAME      JOB              SAL
---------- ---------- --------- ----------
      7876 ADAMS      CLERK           1100
      7934 MILLER     CLERK           1300

Multiple conditions can be passed to a DataFrame using &.

No alt text provided for this image

Using the OR Operator

OR requires either condition to be TRUE.

SQL> SELECT empno, ename, job, sal
  2  FROM   emp
  3  WHERE  sal>=1100
  4  OR     job='CLERK';


     EMPNO ENAME      JOB              SAL
---------- ---------- --------- ----------
      7369 SMITH      CLERK            800
      7499 ALLEN      SALESMAN        1600
      7521 WARD       SALESMAN        1250
      7566 JONES      MANAGER         2975
      7654 MARTIN     SALESMAN        1250
      7698 BLAKE      MANAGER         2850
      7782 CLARK      MANAGER         2450
      7788 SCOTT      ANALYST         3000
      7839 KING       PRESIDENT       5000
      7844 TURNER     SALESMAN        1500
      7876 ADAMS      CLERK           1100
      7900 JAMES      CLERK            950
      7902 FORD       ANALYST         3000
      7934 MILLER     CLERK           1300


14 rows selected.

Multiple conditions can be passed to a DataFrame using |.

No alt text provided for this image

Retrieving Records with Equijoins

SQL> SELECT emp.empno, emp.ename, emp.deptno,
  2         dept.deptno, dept.loc
  3  FROM   emp, dept
  4  WHERE  emp.deptno=dept.deptno;


     EMPNO ENAME          DEPTNO     DEPTNO LOC
---------- ---------- ---------- ---------- -------------
      7782 CLARK              10         10 NEW YORK
      7839 KING               10         10 NEW YORK
      7934 MILLER             10         10 NEW YORK
      7566 JONES              20         20 DALLAS
      7902 FORD               20         20 DALLAS
      7876 ADAMS              20         20 DALLAS
      7369 SMITH              20         20 DALLAS
      7788 SCOTT              20         20 DALLAS
      7521 WARD               30         30 CHICAGO
      7844 TURNER             30         30 CHICAGO
      7499 ALLEN              30         30 CHICAGO
      7900 JAMES              30         30 CHICAGO
      7698 BLAKE              30         30 CHICAGO
      7654 MARTIN             30         30 CHICAGO


14 rows selected.

JOINs can be performed using merge():

No alt text provided for this image

Using the COUNT Function

COUNT(*) returns the number of rows in a table.

SQL> SELECT COUNT(*)
  2  FROM   emp
  3  WHERE  deptno = 30;


  COUNT(*)
----------
         6

The number of rows of a DataFrame can be obtained with the function len().

No alt text provided for this image


To view or add a comment, sign in

More articles by Arief Rahman

Others also viewed

Explore content categories