Intro to pandas from SQL perspective
pandas is an open source data manipulation and analysis package/library, built for the Python programming language. The name is derived from "panel data", a common term for multidimensional data sets encountered in statistics and econometrics. It was originally written by Wes McKinney, offering data structures and operations designed for working with structured and time series data. pandas is built on top of NumPy and serves as a strong complement to the existing scientific computing environment and integrate well with many other 3rd party libraries. It's one of, if not, THE most widely used tool(s) for data analysis, manipulation, munging/wrangling, you name it!
pandas key data structure is called the DataFrame. You might think of a DataFrame as a table, similar to a table in a SQL database. If you're new to pandas, but have some familiarity with SQL, this article is meant to provide some examples of performing some common SQL operations using pandas, intended as a quick introduction. Python and pandas installation is beyond the scope of this article (the simplest way is with Anaconda). The examples will be based on the famous DEPT and EMP tables of Oracle's SCOTT sample/demo schema and assume we have DataFrames of the same name and structure.
Selecting All Columns
In SQL, selecting all columns is done using a *:
SQL> SELECT * 2 FROM dept; DEPTNO DNAME LOC ---------- -------------- ------------- 10 ACCOUNTING NEW YORK 20 RESEARCH DALLAS 30 SALES CHICAGO 40 OPERATIONS BOSTON
With pandas, you simply call the DataFrame:
Selecting Specific Columns
In SQL, column selection is done using a comma-separated list of columns you'd like to select:
SQL> SELECT deptno, loc 2 FROM dept; DEPTNO LOC ---------- ------------- 10 NEW YORK 20 DALLAS 30 CHICAGO 40 BOSTON
With pandas, selection is done by passing a list of column names to your DataFrame:
Using Arithmetic Operators
In SQL, you can add a calculated column. So if you want to say add a column "Annual Salary" that equals monthly salary multiplied by 12:
SQL> select ename, 2 sal, 3 sal*12 "Annual Salary" 4 FROM emp; ENAME SAL Annual Salary ---------- ---------- ------------- SMITH 800 9600 ALLEN 1600 19200 WARD 1250 15000 JONES 2975 35700 MARTIN 1250 15000 BLAKE 2850 34200 CLARK 2450 29400 SCOTT 3000 36000 KING 5000 60000 TURNER 1500 18000 ADAMS 1100 13200 JAMES 950 11400 FORD 3000 36000 MILLER 1300 15600 14 rows selected.
With pandas, you can add the 'Annual Salary' column to your DataFrame:
Eliminating Duplicate Rows
We eliminate duplicate rows by using the DISTINCT keyword in the SELECT clause.
SQL> SELECT DISTINCT deptno 2 FROM emp; DEPTNO ---------- 30 20 10
In pandas, we use the unique() function to get unique values of a column.
Displaying Table Structure
The DESCRIBE command is used to display the structure of a table.
SQL> DESCRIBE dept Name Null? Type ----------------------------------------- -------- ---------------------------- DEPTNO NOT NULL NUMBER(2) DNAME VARCHAR2(14) LOC VARCHAR2(13)
pandas DataFrame.info() function is used to print a concise summary of a DataFrame.
Limiting Rows Selected
Using the WHERE Clause
Filtering in SQL is done via a WHERE clause.
SQL> SELECT ename, job, deptno 2 FROM emp 3 WHERE job='CLERK'; ENAME JOB DEPTNO ---------- --------- ---------- SMITH CLERK 20 ADAMS CLERK 20 JAMES CLERK 30 MILLER CLERK 10
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
Basically, emp[emp.job == 'CLERK'] is returning all rows in which the condition is met.
Using the Comparison Operators
SQL> SELECT ename, sal, comm 2 FROM emp 3 WHERE sal<=comm; ENAME SAL COMM ---------- ---------- ---------- MARTIN 1250 1400
The pandas equivalent would be:
Using the IN Operator
The IN operator is used to test for values in a list.
SQL> SELECT empno, ename, sal, mgr 2 FROM emp 3 WHERE mgr IN (7902, 7566, 7788); EMPNO ENAME SAL MGR ---------- ---------- ---------- ---------- 7369 SMITH 800 7902 7788 SCOTT 3000 7566 7876 ADAMS 1100 7788 7902 FORD 3000 7566
pandas isin() method helps in selecting rows having a particular (or multiple) value in a particular column.
Again, emp[emp.mgr.isin([7902, 7566, 7788]) is basically returning rows having mgr=7902, 7566, or 7788.
Using the IS NULL Operator
Null values are tested with the IS NULL operator.
SQL> SELECT ename, mgr 2 FROM emp 3 WHERE mgr IS NULL; ENAME MGR ---------- ---------- KING
pandas isnull() method is used to check and manage NULL values in a DataFrame.
Using the AND Operator
AND requires both conditions to be TRUE.
SQL> SELECT empno, ename, job, sal 2 FROM emp 3 WHERE sal>=1100 4 AND job='CLERK'; EMPNO ENAME JOB SAL ---------- ---------- --------- ---------- 7876 ADAMS CLERK 1100 7934 MILLER CLERK 1300
Multiple conditions can be passed to a DataFrame using &.
Using the OR Operator
OR requires either condition to be TRUE.
SQL> SELECT empno, ename, job, sal 2 FROM emp 3 WHERE sal>=1100 4 OR job='CLERK'; EMPNO ENAME JOB SAL ---------- ---------- --------- ---------- 7369 SMITH CLERK 800 7499 ALLEN SALESMAN 1600 7521 WARD SALESMAN 1250 7566 JONES MANAGER 2975 7654 MARTIN SALESMAN 1250 7698 BLAKE MANAGER 2850 7782 CLARK MANAGER 2450 7788 SCOTT ANALYST 3000 7839 KING PRESIDENT 5000 7844 TURNER SALESMAN 1500 7876 ADAMS CLERK 1100 7900 JAMES CLERK 950 7902 FORD ANALYST 3000 7934 MILLER CLERK 1300 14 rows selected.
Multiple conditions can be passed to a DataFrame using |.
Retrieving Records with Equijoins
SQL> SELECT emp.empno, emp.ename, emp.deptno, 2 dept.deptno, dept.loc 3 FROM emp, dept 4 WHERE emp.deptno=dept.deptno; EMPNO ENAME DEPTNO DEPTNO LOC ---------- ---------- ---------- ---------- ------------- 7782 CLARK 10 10 NEW YORK 7839 KING 10 10 NEW YORK 7934 MILLER 10 10 NEW YORK 7566 JONES 20 20 DALLAS 7902 FORD 20 20 DALLAS 7876 ADAMS 20 20 DALLAS 7369 SMITH 20 20 DALLAS 7788 SCOTT 20 20 DALLAS 7521 WARD 30 30 CHICAGO 7844 TURNER 30 30 CHICAGO 7499 ALLEN 30 30 CHICAGO 7900 JAMES 30 30 CHICAGO 7698 BLAKE 30 30 CHICAGO 7654 MARTIN 30 30 CHICAGO 14 rows selected.
JOINs can be performed using merge():
Using the COUNT Function
COUNT(*) returns the number of rows in a table.
SQL> SELECT COUNT(*) 2 FROM emp 3 WHERE deptno = 30; COUNT(*) ---------- 6
The number of rows of a DataFrame can be obtained with the function len().