SAS DATA STEP OR PROC SQL

SAS DATA STEP OR PROC SQL

SAS (Statistical Analysis System) was first formally released in 1972, featuring the DATA step, a crucial component that allowed users to manage and manipulate data through a procedural programming approach. The DATA step quickly became fundamental in SAS, enabling users to efficiently read, transform, and analyze data.

In the 1970s, the concept of relational databases gained traction, leading researchers at IBM to develop SQL (Structured Query Language) to interact with and program these databases. SQL's popularity surged in the 1980s, prompting the American National Standards Institute (ANSI) to establish standards ensuring SQL's consistency across various platforms.

Recognizing the importance of SQL in data management, SAS introduced the PROC SQL procedure with the release of SAS 6 in the 1990s. This integration allowed SAS to cater to both traditional SAS programmers accustomed to the DATA step and those familiar with SQL from the database world. By incorporating PROC SQL, SAS provided a versatile programming environment that supported both procedural and declarative programming paradigms.

Both the SAS DATA step and PROC SQL have their strengths and are often used together to leverage their respective advantages in data processing and manipulation.

Differences between SAS Data Step and PROC SQL

Article content
Differences between SAS Data Step and PROC SQL

Examples

The codes below perform little data manipulations on the dataset Clinical Data from Kaggle

Import the dataset: the code below reads in the Excel dataset "Clinical_Data_Validation" into "dm".

PROC IMPORT DATAFILE='/home/u63259226/linkedin/Clinical_Data_Validation_Cohort.xlsx'
    OUT=dm
    DBMS=XLSX
    REPLACE;
    GETNAMES=YES;
RUN;        


Example 1: Filter the out to generate only patients with negative results and keep relevant columns alone

With Data Step

DATA negative_results;
    SET dm;
    
    /* Filter for negative results */
    IF EGFR = 'Negative' AND KRAS = 'Negative';
    
    /* Keep only relevant variables */
    KEEP 'Patient ID'n Age Cigarette EGFR KRAS;

RUN;        

With PROC SQL

PROC SQL;
    CREATE TABLE negative_results AS
    SELECT 'Patient ID'n, Age, Cigarette, EGFR, KRAS
    FROM dm
    WHERE EGFR = 'Negative' AND KRAS = 'Negative';
QUIT;        

Example 2: Perform some data manipulations using conditionals

With Data step

DATA clinical_data_clean_d;
    SET dm;
    
    /* Create a new variable to classify age groups */
    IF Age < 50 THEN Age_Group = 'Under 50';
    ELSE IF 50 <= Age < 65 THEN Age_Group = '50-64';
    ELSE Age_Group = '65 and Over';

    /*Replace missing EGFR values with 'Unknown' */
    IF MISSING(EGFR) THEN EGFR = 'Unknown';

    /* Create a binary variable for smoking status */
    IF Cigarette = 'Never' THEN Smoker_Status = 0;
    ELSE Smoker_Status = 1;
RUN;        

With PROC SQL

PROC SQL;
    CREATE TABLE clinical_data_clean_p AS
    SELECT *,
        /* Create a new variable to classify age groups */
        CASE
            WHEN Age < 50 THEN 'Under 50'
            WHEN Age BETWEEN 50 AND 64 THEN '50-64'
            ELSE '65 and over'
        END AS Age_Group,
        
        /* Replace missing EGFR values with 'Unknown' */
        CASE
            WHEN MISSING(EGFR) THEN 'Unknown'
            ELSE EGFR
        END AS EGFR_Cleaned,

        /* Create a binary variable for smoking status */
        CASE
            WHEN Cigarette = 'Never' THEN 0
            ELSE 1
        END AS Smoker_Status
    FROM dm;
QUIT;        


In conclusion, the choice between using a DATA step and PROC SQL in SAS depends on the specific task at hand. The DATA step is highly efficient for data manipulation tasks such as filtering, sorting, and merging large datasets, and it offers more flexibility for complex data transformations. On the other hand, PROC SQL is preferred by those familiar with SQL syntax and is particularly useful for tasks involving complex joins, subqueries, and summarizations. However, certain tasks might necessitate the use of a DATA step due to limitations in PROC SQL functionality. Ultimately, the decision is based on the user's familiarity and the nature of the task.

I often use PROC SQL because I was familiar with SQL syntax before learning SAS, making the transition smoother for me. However, there are certain tasks where I have to use DATA steps since PROC SQL can't always be adjusted to meet the requirements.



Proc SQL does it for me all the time when it comes to creating and managing data tables. It is more straightforward and requires less data manipulation unlike data step. All data wrangling functions are inbuilt and it requires less programming steps in achieving the expected data output

To view or add a comment, sign in

More articles by Theresa Oyebode

  • SAS MACRO: THE BASICS

    SAS macros are a feature of the SAS programming language that allow for the creation of reusable pieces of code. They…

    2 Comments
  • Randomization in Clinical Trials

    Clinical trial designs are structured frameworks for testing the efficacy and safety of medical interventions…

  • Clinical Trials Data: A Soon-to-be Pill

    What is Clinical Trials Data? Clinical Trials Data (CT data) comprises any data or information outlined in the protocol…

    2 Comments
  • The Evolution of Clinical Trials: From Traditional Methods to AI-Driven Innovations

    From the Very Start of Clinical Trials The story of Daniel in the Bible has often been cited as one of the earliest…

    4 Comments
  • 10 Career Paths in Clinical Research

    Principal Investigator (PI) Responsibilities: The Principal Investigator is ultimately responsible for the clinical…

    2 Comments
  • Clinical Trials: A Brief Introduction

    What are Clinical Trials? Clinical studies are one of the final stages of the drug development process. After the drug…

    2 Comments

Others also viewed

Explore content categories