BAILEY DEBARMORE - EPICODE

Importing your Data into SAS, Stata, and R

Wed, 24 Mar 2021 18:01:51 GMT

Author: Bailey DeBarmore

In a previous post, I talked you through how to get your data clean in Excel before importing to Stata, SAS, or R [read that post here]. The next step is to import that data and start creating variables you need for analysis and running descriptives. I'll go through how to import your data in this post.

Feel free to post questions in the comments!

Jump to:
Stata
SAS
R

Stata

The best way to import data into Stata is to use the menu and have Stata do it for you. Once you import the data, you'll be able to copy the commands you need and save to your do file for the future. So I'm going to walk you through how to use the click menu rather than sharing syntax that changes from version to version in Stata.

Before we do that, let's talk about setting the working directory. If you are pulling files from multiple locations on your computer, you'll want to save the code (cd) to your do file so you can move around. We'll first use the click menu to produce the code.

Type cd followed by the path name in double quotes.

cd "C:\Users\bailey\EPICODE"

Excel file (.xls, .xlsx)

If your data is in an Excel file, you can import a specific worksheet using File > Import > Excel spreadsheet. When the new window opens, click BROWSE to find your Excel file. Then, fill out the correct options. You'll see a preview of your data. You want to make sure you select 'first row as variable names' if that applies, and decide whether you want to preserve case or not. Since Stata is case-sensitive, if you named your variables like Gender, SchoolID, etc. you will need to always refer to them that way (not gender, schoolid). If you decide you want to do all lowercase or all uppercase, you can select those options in the dropdown.

When you're ready, hit OK and you should see the variable window populate. Final step is to go to File > Save as and save this data set as a Stata dataset (.dta) - you don't want to import it every time you need to work on it.

Link to Stata Manual for import excel

I'm using Stata14 and when I did the click menu steps above to import my data, the resulting syntax was:

import excel "C:\Users\bailey\EPICODE\example-survey.xlsx", sheet("") firstrow
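import excel "<path>\<dataset.xlsx>", sheet("<sheetname>") firstrow

<path> is the C:\Users path where the xlsx file can be found, and <dataset.xlsx> is the Excel file name. You don't type the <>, they just mark the parts you fill in.
<sheetname> is the name of the specific worksheet in my Excel document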
The option firstrow told Stata to import the first row as variable names

If you're importing multiple files, you can copy this syntax into your do file and change the path name/document name, and then save to easily run these commands over and over again. This is useful if you are appending or merging multiple files.

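import excel "<path>\<dataset1.xlsx>", sheet("<sheetname>") firstrow
save "<path>\<dataset1.dta>", replace

import excel "<path>\<dataset2.xlsx>", sheet("<sheetname>") firstrow
save "<path>\<dataset2.dta>", replace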
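If you have several workbooks, the import-then-save pattern above can be wrapped in a loop. This is only a sketch: the file names survey1.xlsx and survey2.xlsx are made up, and the first three lines create them from Stata's built-in auto data so the example runs on its own in your current working directory.

```stata
* Sketch only: survey1.xlsx and survey2.xlsx are hypothetical file names.
* Create two small Excel files so the example is self-contained.
sysuse auto, clear
export excel make price mpg using "survey1.xlsx", firstrow(variables) replace
export excel make price mpg using "survey2.xlsx", firstrow(variables) replace

* Import each workbook (first row = variable names) and save it as .dta
foreach f in survey1 survey2 {
    import excel using "`f'.xlsx", firstrow clear
    save "`f'.dta", replace
}
```

With your own files you would drop the export excel lines and list your file names (without the extension) in the foreach line.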

Delimited file (.csv, .tsv, .txt)

Delimited files have information in columns and rows separated by 'delimiters', which are often a comma (csv), tab (tsv), or comma, tab, space, line breaks, and others (txt). Often you may have a dataset that is a csv but opens in Excel. Or you may have one that opens in a text editor. Either way, when you choose File > Import > Delimited, Stata will automatically identify the delimiter for you. It will also detect if you have variable names in the first row or not. Similar to the Excel import, you can choose whether to preserve the case of your variable names (like SchoolID) or not.

Link to Stata Manual for import delimited

When I import a csv to Stata14, I get the following syntax that I can save in my do file for later. There are no quotation marks and you don't include the <>, those are just here as an example.

import delimited C:\Users\bailey\EPICODE\example-survey.csv

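In general form this is import delimited <path>\<dataset.csv>, where <path> is the C:\Users path where the csv file can be found, and <dataset.csv> is the file name. If your path contains spaces, put the whole path and file name in double quotes.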

If you're importing multiple files, you can copy this syntax into your do file and change the path name/document name, and then save to easily run these commands over and over again. This is useful if you are appending or merging multiple files. See below how I change the working directory with a cd command without ever leaving my do file.

cd "C:\Users\bailey\EPICODE"
import delimited dataset1.csv
save "dataset1.dta", replace

cd "C:\Users\bailey\EPICODE\data"
import delimited dataset2.csv
save "dataset2.dta", replace
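Once each file is saved as a .dta, stacking them is one more command. The sketch below is an illustration rather than part of my survey workflow: it splits Stata's built-in auto data into two temporary files that stand in for dataset1 and dataset2, then appends them. The variable name source is my own choice.

```stata
* Sketch: two temporary files stand in for dataset1.dta and dataset2.dta.
* Run the whole block at once so the tempfile macros stay defined.
sysuse auto, clear
keep in 1/10
tempfile part1
save `part1'

sysuse auto, clear
keep in 11/20
tempfile part2
save `part2'

* Stack the two files; source = 0 for the first file, 1 for the appended one
use `part1', clear
append using `part2', generate(source)
tabulate source
```

Appending adds rows (more respondents). If your files instead hold different variables for the same respondents, you would use merge rather than append.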

SAS XPORT (.xpt)
I don't have a screenshot for importing a SAS XPORT file but it follows the same logic as the Excel and delimited files. Go to File > Import > SAS XPORT, browse for your file, select any options, and press OK. Then File > Save As to save it as a Stata dataset (.dta).

Link to Stata Manual for import sasxport

SAS (.sas7bdat)
If you need to import a SAS data file into Stata and it is a .sas7bdat file (not XPORT), then the best way is to export it from SAS as a .dta file (which is the Stata dataset file type). Just as SAS has a click menu option for import, it also has one for export (see the code below).

Then you can easily open it in Stata by double-clicking the file in your folder or going to File > Open and browsing for it.

PROC EXPORT DATA= WORK.EXAMPLESURVEY
OUTFILE= "C:\Users\bailey\EPICODE\examplesurvey2.dta"
DBMS=STATA REPLACE;
RUN;

If you choose to have SAS save the export code, you'll get code like the one above.

The key elements are

PROC EXPORT statement: what is the name of the SAS dataset you want to export? Where is it located?
- OUTFILE: where should SAS save your exported dataset and what should it call it? Put the path and dataset in double quotes.
- DBMS: Specify the type of file it should export to - here it is Stata (note that in the OUTFILE portion I also did .dta after the new dataset name)

SAS

Usually, the best way to import data into SAS is to use the Import Wizard and have SAS do it for you. During the import process, you can choose to have SAS save the commands in an Editor that you can access and save the code for later. I provide screenshots below on how to use the Import Wizard and the resulting code for you.

Before we get into that, be sure you've run your libname statement which tells SAS where to go look for data. I typically like to have a lib for where it will pull data from, where it will put data, and where it will save files. Type 'libname', then name the lib, and then encase the path in double quotes.

libname data "C:\Users\bailey\EPICODE\data";
libname files "C:\Users\bailey\EPICODE\output";

Excel files (.xls, .xlsx)
Using File > Import and then selecting Excel in the dropdown, you can import an Excel file into SAS. When you go to browse, make sure that in file type you have "All Files" or specifically "xls" or "xlsx" selected so that your file of choice shows up.

In the final screen, if you select a location for SAS to save an editor file, you will get code like the code below that you can reuse later. If you're having trouble with the import wizard (like you're getting an error that says it cannot connect to MS Excel) try running the code provided below.

PROC IMPORT OUT= WORK.examplesurvey
DATAFILE= "C:\Users\bailey\EPICODE\example-survey.xlsx"
DBMS=XLSX REPLACE;
SHEET="auto";
GETNAMES=YES;
RUN;

DATA LIB.examplesurvey;
SET WORK.examplesurvey;
RUN;

The key elements are

PROC IMPORT statement: what will you name your new SAS data set? Do you want it to be in the work library or another lib? Make sure you have already run a libname statement to establish that library in your SAS working session.
- DATAFILE: where should SAS look for your file? Put the path and dataset in double quotes.
- DBMS: Then you specify the file type (XLSX here) after DBMS=. In the example further below where we import a Stata file, instead of XLSX we write STATA. REPLACE tells SAS to replace any dataset already named examplesurvey in the WORK lib.
SHEET statement: The 'auto' option is the default and SAS will import the first sheet in the workbook. You can instead specify a sheet name with quotes.
GETNAMES: If variable names are in the first row, then use YES and SAS will pull those in.

When you're done importing, run a data step to save the data in your lib of choice.

Delimited files

Using File > Import and then selecting CSV in the dropdown, you can import a csv file into SAS. In the final screen, if you select a location for SAS to save an editor file, you will get code like the code below that you can reuse later.

PROC IMPORT OUT= WORK.examplesurvey
DATAFILE= "C:\Users\bailey\EPICODE\example-survey.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

DATA LIB.examplesurvey;
SET WORK.examplesurvey;
RUN;

The key elements are

PROC IMPORT statement: what will you name your new SAS data set? Do you want it to be in the work library or another lib? Make sure you have already run a libname statement to establish that library in your SAS working session.
- DATAFILE: where should SAS look for your file? Put the path and dataset in double quotes.
- DBMS: Then you specify the file type (CSV here) after DBMS=. In the example below where we import a Stata file, instead of CSV we write STATA. REPLACE tells SAS to replace any dataset already named examplesurvey in the WORK lib.
GETNAMES statement: Get variable names from the first row
DATAROW: Data starts in row 2

If you're importing a Delimited (*.*) or Tab Delimited (*.txt) file, then instead of choosing CSV in the drop-down, select those options. They will replace the CSV in the DBMS part of the PROC IMPORT statement (DLM or TAB).

When you're done importing, run a data step to save the data in your lib of choice. In the example above I have LIB.examplesurvey where LIB is just a placeholder for whatever you named your lib. If we're following my libname statements from the beginning of this section, it would be data.examplesurvey.

Stata file (.dta)
Using File > Import and then selecting Stata in the dropdown, you can import a Stata data file into SAS. In the final screen, if you select a location for SAS to save an editor file, you will get code like the code below that you can reuse later.

PROC IMPORT OUT= WORK.EXAMPLESURVEY2
DATAFILE= "C:\Users\baile\Dropbox\EPICODE\survey\example-survey.dta"
DBMS=STATA REPLACE;
RUN;

DATA LIB.examplesurvey2;
SET WORK.examplesurvey2;
RUN;

The key elements are
PROC IMPORT statement: what will you name your new SAS data set? Do you want it to be in the work library or another lib? Make sure you have already run a libname statement to establish that library in your SAS working session.

DATAFILE: where should SAS look for your file? Put the path and dataset in double quotes.
DBMS: Then you specify the file type (Stata here) after DBMS=. In the example above where we import a csv file, instead of STATA we write CSV. REPLACE tells SAS to replace any dataset already named examplesurvey2 in the WORK lib.

When you're done importing, run a data step to save the data in your lib of choice.

R

Just like with SAS and Stata, you can use the click menu to import data if you work in RStudio. Go to File > Import Dataset and select From Text (base), From Excel, From SPSS, From SAS, or From Stata. The code will appear in your console and you can save it in your script.

To set your working directory in R, use 'setwd' and enclose the path in quotes and parentheses. If you have data in R format already (.rds) you can read in using readRDS and save out using saveRDS. Filenames also need to be in quotes.

If you set your working directory (wd) then in the following examples, you would be able to simply refer to the file name rather than the full file path if you are typing out your code. Otherwise you can always specify the full path, especially if you're pulling in data from multiple locations. In addition to the code below, you can use the click menu to set your working directory. Go to Session > Set Working Directory > Choose Directory.

setwd("C:/Users/bailey/EPICODE")
examplesurvey <- readRDS("example-survey.rds")
saveRDS(examplesurvey, file="examplesurvey_final.rds")

To read in data in other formats, you can use read.csv(), read.delim(), and read.table() from base R, read.spss() and read.dta() from the foreign package, and read.xls() from the gdata package. There are also packages like haven that you can use to read SAS files (read_sas()). Haven will also read Stata files (.dta) and other file types.

Excel (.xls, .xlsx)
To import my xlsx file, I used File > Import Dataset > From Excel. Then I browsed for my file, and checked the options in the bottom right - first row as names, correct sheet, etc.

The code R used was from the readxl package, using read_excel, because base R doesn't have a built-in function for reading Excel files. In the code below, I've read in my file example-survey-1.xlsx to R as the object example_survey_1.

library(readxl)
example_survey_1 <- read_excel("C:/Users/bailey/EPICODE/example-survey-1.xlsx")

Delimited files (csv)

To import my csv file, I go to File > Import Dataset > From Text (base). Immediately, the file browser comes up. When I select my file, I can check the options - such as first row as variable names, and look at the preview. The resulting code in my console is:

example.survey <- read.csv("C:/Users/bailey/EPICODE/example-survey.csv")

R simply used the base R function read.csv to read in the file and named the object after the file, swapping the hyphen for a period (example.survey).

SAS (.sas7bdat)
To use the haven package, install it, load it with the library function, and then use read_sas() and the file extension .sas7bdat to read it into R.

You can also use File > Import Dataset > From SAS and R will produce the code for you.

install.packages('haven')
library(haven)
examplesurvey <- read_sas("examplesurvey.sas7bdat")

Stata (.dta)
To use the haven package, install it, load it with the library function, and then use read_dta() and the file extension .dta to read it into R.

You can also use File > Import Dataset > From Stata and R will produce the code for you.

install.packages('haven')
library(haven)
examplesurvey <- read_dta("C:/Users/bailey/EPICODE/examplesurvey.dta")

Cleaning Survey Data for Analysis

Tue, 09 Mar 2021 16:05:50 GMT

Author: Bailey DeBarmore

A task I often help with is getting survey data prepped for data analysis. Typically a client has distributed a written survey via Qualtrics or SurveyMonkey, and has downloaded the survey results to Excel as an xls, xlsx, or csv.

Before importing that data into Stata, SAS, or R [read that post here], there are a few steps you should do first. In this short tutorial post, I'll walk you through those steps. I highly recommend reading through this post in full before touching your data. Get an overview of what you'll need to do and then read through again to let the gears turn on how you'll need to clean your own data.

Feel free to post any questions in the comments!

Let's get started.

The screenshots used in this short tutorial are provided in slideshow form below as well as with associated text in the post.

download your data from the survey platform

If you haven't already, download your data from the survey platform that you used (e.g., Qualtrics, SurveyMonkey). You may have the option to download it WITH LABELS or AS NUMBERS. The photos below show you what that means. Basically, with labels, means that if you asked respondents "How much do you agree with this statement?" and the options were Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree, downloading the data with labels would export those terms while downloading the data with numbers would turn that into 1, 2, 3, 4, 5, or 5, 4, 3, 2, 1.

I recommend downloading WITH LABELS and then recoding yourself.

You may have columns for timestamp (like I've simulated here), duration, IP address, etc. You can go ahead and get rid of those columns if you don't need them.

You'll see that your question number is in the top row (which we reserve for variable names) with extra information, such as the question text, in row 2. What we'll do next is create a codebook where we can save information like that question text for easy reference as we clean the data.

You'll notice that longer text, like your question text, is cut off. Don't bother using 'wrap text'. Instead, select the cell and view the text in the top box like I've indicated in the image below. We'll be getting rid of that second row anyway, and as you re-code things like perhaps topic coding of qualitative responses, you can easily see the text in the top box.

create a codebook

If you haven't done this already, you'll want to make a codebook. You can do this in Excel or a word processor, whichever you prefer. You'll want to include the new variable name, the values, the labels for those values, and the name of the old variable or variables it was created from. In this simple example, you can see that I have the possible responses from my survey under "values" and have assigned each of them a numeric code in column C. I've also copied over the question text so that I can delete it from my data sheet. You will likely want to reference this document while you clean your data, so I suggest making it in a new worksheet or in Word so you can have it side-by-side with your data. You can also screenshot it and paste into your data sheet or print it out. Whatever works for you!

recode with new variables

For each new variable, you're going to insert a new column. You can see that Q1 asked about gender and exported to column C, and I made a new column (column D below), and named that variable (column), gender. Per my codebook (above), cis-male is coded as 0, cis-female as 1, nonbinary as 2, and in this simulated data there were no responses for trans-male or trans-female, but if there were, they would be 3 and 4.

I'll talk about filtering and sorting in the next step, so just keep reading before you start changing your data.

I made a new variable for Q2, called age, and copied the numbers over. If you were going to create age categories from the age variable, you could do that, too.

For race (Q3), respondents could "select all that apply" which is shown in rows 7 and 8. To recode this variable into "race" (column H) I followed the coding in my codebook, where someone that indicated more than one race has a separate category. How you choose to code race and "select all that apply" variables, is up to you and your research question.

For binary variables (variables with yes/no options), you want to name the variable so that you inherently know which value is coded as 1. For example, instead of making a variable for Q4 called "ethnicity", I named it "hispanic" so that it's easy for me (and anyone else using my data) to know that data for that variable is coded as 1 = Hispanic/Latinx and 0 = Not Hispanic/Latinx. If you want to make a combined race/ethnicity variable, you could do that here in Excel, or if you're comfortable in SAS, Stata, or R, you can make it later (which is what I would do).

Note - I made the new variable names bold just so it was easier for you to see.

As we move through the variables, when we have more than 2 or 3 categories it is easier to filter or sort the data rows rather than going row by row assigning different values. If you haven't already, you'll want to get rid of the question text in row 2 because when you go to filter (up at the top - Sort & Filter - Filter), that row will get sorted as well (see second image below).

In the image below you can see I went ahead and made new variables for Q5 (year) and Q6 (phd). For year, which was reported as 1st, 2nd, 3rd, 4th, 5th, etc. I coded it as 1, 2, 3, 4 or 5. You may choose to code it as 1st year or older, in which case you would make a variable called "firstyear" with 1 = 1st year, 0 = 2nd year and above. You could make a cut-off at any value, or a categorical variable such as 0 = 1st and 2nd years, 1 = 3rd and 4th years, 2 = 5th years and above. How you code your variables depends on your research question!

The only program options in this example were MSPH (Masters of Science in Public Health) and PhD (Doctorate of philosophy). I chose to make a variable called phd from Q6, but I easily could've made a variable called msph. If I named the variable msph, which value would be coded as 1? 0?
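If you'd rather make these variables after importing, here is a rough Stata sketch of the same coding decisions. The three rows of data are invented for illustration, and the names program and yearcat are my own placeholders; firstyear and phd follow the naming described above.

```stata
* Sketch with made-up data: three respondents, year in program and degree
clear
input year str4 program
1 "PhD"
2 "MSPH"
5 "PhD"
end

* Binary variables named for the category that is coded 1
generate firstyear = (year == 1) if !missing(year)
generate phd = (program == "PhD") if program != ""

* Three-level category: 0 = years 1-2, 1 = years 3-4, 2 = year 5 and above
generate yearcat = cond(year <= 2, 0, cond(year <= 4, 1, 2)) if !missing(year)

list
```

The if qualifiers keep respondents who skipped the question as missing rather than silently coding them as 0.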

I prefer to select the FILTER option, rather than Sort A to Z from the menu, because it adds these little arrows to my first row (header, variables) and I can sort or select from there. Simply click the down arrow next to the variable name (Q3 below) and you'll have options to either Sort, or Select. You can do either. If you Select, for example if I only select PhD for Q6, it will hide rows where Q6 = MSPH. I can then fill in 1 for the rows under column N (phd), and then deselect PhD in column M, select MSPH, and fill in 0 for the rows that show in column N (phd). It's up to you which option you use and you don't always have to use the same one.

Because this simulated data set is so small, and the codebook is easy to see on one screen, I used a screenshot to paste it onto my data sheet so it was easy to see what values correspond with what labels. In the image below, I clicked the arrow for Q7, sort A to Z, and filled in the 1 through 5 values in the column next to it (the new variable). Then I repeated that for the remaining questions.

ready to analyze the data

Here is my cleaned dataset, ready to import into Stata, SAS, or R. You'll want to select options that import your first row as variable names, and it's up to you if you want to keep the original variables (Q1, Q3, etc) or not. If you don't have many variables, you may opt to keep them so that when you browse your data it's easy to see the labels. If you do have a lot of variables, it would be better to make a copy of this sheet, save as, delete those columns, and then import that smaller dataset. Once you're in your software of choice, put your codebook side-by-side and apply labels to the values so it's easy for you to remember and for someone else to understand your data.
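As one example of applying labels from the codebook, this is roughly what it looks like in Stata. The data rows are made up, the label names genderlbl and yesno are placeholders I chose, and I'm assuming trans-male = 3 and trans-female = 4 in that order.

```stata
* Sketch with made-up data using the gender and hispanic coding from the codebook
clear
input gender hispanic
0 1
1 0
2 0
end

* Define each set of value labels once, then attach it to the variable
label define genderlbl 0 "cis-male" 1 "cis-female" 2 "nonbinary" 3 "trans-male" 4 "trans-female"
label values gender genderlbl

label define yesno 0 "No" 1 "Yes"
label values hispanic yesno

tabulate gender
```

A label set like yesno can be reused for every 0/1 variable in the dataset.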

SAS and Stata have a File > Import function or an Import Wizard that makes importing an xls, xlsx, or csv file easy. They'll produce code in the command window that you can save later to reference. I've included some helpful links below to get you onto this next step.

Ready to import your data? Head over to my blog post on Importing your data into SAS, Stata and R.

Creating a Transition Matrix in Stata

Fri, 11 Sep 2020 16:57:59 GMT

Author: Bailey DeBarmore

Do you want to model trajectories by calculating transition probabilities? You can do this in Stata with just a little extension of your longitudinal data analysis skills.

We'll be using xttrans (built in to Stata) or xttrans2 (a module you can download).

Your data will need to be in LONG format. To reshape from wide to long, use:

reshape long stub, i(id) j(time)

Where stub is the stubname of your variable, i is the variable that uniquely identifies observations in your data set, and j is the time variable.

What's a stubname? If you currently have variables status1, status2, status3 with participant status (dead/alive) at years 1, 2, and 3, your stubname would be status and your time variable (j) that we are creating would be year.

reshape long status, i(id) j(year)

The code above will create a new variable (column) called "year", and our long dataset will have 3 rows for each id (observation) and only one variable for status.

It's normal to have your dataset saved in a wide and long format. You can also go back to wide by typing "reshape wide".
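Here is a tiny made-up example you can paste into Stata to watch the reshape happen. The three participants and their status values are invented for illustration.

```stata
* Sketch with made-up data: 3 participants, status at years 1-3 (1 = alive, 0 = dead)
clear
input id status1 status2 status3
1 1 1 0
2 1 1 1
3 1 0 0
end

* Wide to long: one row per id per year, a single status variable
reshape long status, i(id) j(year)
list, sepby(id)

* Back to wide: Stata remembers the previous reshape
reshape wide
list
```

After reshape long the dataset has 9 rows (3 ids x 3 years), and after reshape wide it is back to 3 rows.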

Helpful Links
Reshape (Stata): https://www.stata.com/manuals13/dreshape.pdf
xt (Stata): https://www.stata.com/manuals13/xt.pdf
xttrans2 (Nicholas Cox): https://ideas.repec.org/c/boc/bocode/s416301.html (help file here)

First, xtset your data - you're telling Stata what variable uniquely identifies the "panels" since your long data form has repeated rows, and what variable tracks time. For this example, it would be "xtset id year" (xttrans needs both the panel variable and the time variable).

Then we can use xttrans for our transition probabilities. Default is to just display probabilities. If you also want the frequency counts, you can use the option "freq". We can also use "by" as a prefix with xttrans.

xtset id year
xttrans status
xttrans status, freq
bysort sex: xttrans status
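To see what xttrans is counting, here is a small made-up long dataset (3 participants, 3 years each). The data are invented for illustration. As far as I know, xttrans needs the time variable set as well as the panel variable, so this sketch sets both.

```stata
* Sketch with made-up long data: status is 1 = alive, 0 = dead
clear
input id year status
1 1 1
1 2 1
1 3 0
2 1 1
2 2 1
2 3 1
3 1 1
3 2 0
3 3 0
end

* Declare the panel (id) and time (year) variables, then tabulate transitions
xtset id year
xttrans status, freq
```

Counting by hand, there are 5 transitions that start in status 1: three stay at 1 and two move to 0, so the transition probabilities out of status 1 are 60% and 40%. The single transition that starts at 0 stays at 0.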

xttrans2, a module written by Nicholas Cox, expands on xttab and xttrans. You can specify additional options (the same options as tabulate, except for row and freq) but you can't use "by".

A big advantage of xttrans2 is that you can save the resulting probability matrix using the option matcell.

Interpreting Multinomial Logistic Regression in Stata

Sun, 14 Apr 2019 17:33:53 GMT

Author: Bailey DeBarmore

You may find yourself running a multinomial logistic regression, but unsure how to interpret your output. I get these questions a lot from students, so I'm here to help demystify your Stata results.

Running the regression

To run a multinomial logistic regression, you'll use the command -mlogit-.

You can see in the code below that the syntax for the command is mlogit, followed by the outcome variable and your covariates, then a comma, and then base(#).

In this example I have a 4-level variable, hypertension (htn). I want the reference category, or the base outcome, to be normal BP, which corresponds to htn=0. So I'll use base(0) in my code.

*Analysis question:
*Estimate the association of diabetes with hypertension stage
*Adjusting for sex and age
*SYNTAX: mlogit <outcome> <covariates>, base(#)

*the option base(#) allows you to tell Stata what level of the outcome variable
*should be the reference - for us we'll make it htn = 0 (normal)
set more off
set line 200

*Run the model

mlogit htn diabetes female age, base(0)

Figure 1. Stata output from running an mlogit command with a 4-level hypertension outcome, with diabetes, female sex, and age (yrs) as covariates. Each box with repeated covariates corresponds to a level of the outcome compared to the reference (which we indicated in base(#)).

Above is the Stata output from running the mlogit command.

You can see that there is a box at the top for htn=0, because we set that as the base outcome. If we had set the base outcome to be htn=2, we would have covariate output for 0, 1, and 3, and where the 2 box is, would be a blank with (base outcome).

Each box corresponds to the estimated log odds of that covariate for one outcome level versus the base outcome. You can see where htn=1, it's estimating P(Elevated BP) vs P(Normal BP) (on the natural log scale).

The Equations

Let's connect this output with the regression equation. When I want to pull estimates, I often enter in the coefficients to an MS Excel spreadsheet, and knowing how the output translates to the equation is important.

Figure 2. How do we bring our regression output back to the statistical equation? You can write out the equation for each outcome versus the reference, here HTN = 0 or normal BP, with beta (coefficient) subscripts that correspond to the level of the outcome.

Usually we write out the equation with just beta-0, beta-1, beta-2, etc. but since we have multiple levels of the outcome, each coefficient will be prefixed by X, which indicates the level of the variable (gray equation). I have it written out for each HTN level.
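In case the figure doesn't display, the general form can be written as below. The notation is my assumption here: I use k for the outcome level (what the figure marks with X), and the second subscript numbers the covariates in the order they appear in the mlogit command.

```latex
\ln\!\left(\frac{P(\text{htn}=k)}{P(\text{htn}=0)}\right)
  = \beta_{k0}
  + \beta_{k1}\,\text{diabetes}
  + \beta_{k2}\,\text{female}
  + \beta_{k3}\,\text{age},
  \qquad k = 1, 2, 3
```

There is one equation per non-reference outcome level, which is why the Stata output repeats the covariate list in each box.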

Let's apply it in an example using elevated BP vs normal.

Figure 3. Here's an example calculation where we plug in the coefficients for HTN = 1, elevated BP, into the equation we wrote out in Figure 2.

Helpful Links

UCLA Institute for Digital Research and Education

Stata Documentation for mlogit

Exponentiate

But we are really interested in the exponentiated coefficients, or the relative risk ratio in this scenario. In other Stata regression commands, we can use an option such as "or" or "eform" to transform our coefficients into the ratio. With -mlogit-, you do something a bit different - you use the option rrr in a statement run right after your regression and Stata will transform the log odds into the relative probability ratios, or the relative risk ratio (RRR).

*Get exponentiated results;
mlogit htn diabetes female age, base(0)
mlogit,rrr

*Is the effect of diabetes on Elevated BP vs normal BP similar for Stage 2 HTN vs normal BP?
test [1]diabetes = [3]diabetes

Figure 4. To get the relative risk ratio (RRR), we run "mlogit, rrr" after running our regression. In this figure you'll see interpretations of the elevated BP vs normal BP in the grey box, and a test for the effect of diabetes in 2 outcome levels (orange box).

The output format when we run -mlogit, rrr- is the same as before, but we have exponentiated betas. If you use a calculator and exponentiate the betas in the original output you'll see they match up.
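You can also have Stata do the calculator step for you. Since my data aren't included with this post, the sketch below simulates a stand-in dataset with the same variable names (htn, diabetes, female, age). The simulated values are arbitrary and the results mean nothing clinically.

```stata
* Sketch: simulated stand-in data, not the data used in the figures
clear
set seed 2019
set obs 1000
generate diabetes = runiform() < 0.3
generate female   = runiform() < 0.5
generate age      = 30 + floor(40*runiform())
* Arbitrary 4-level outcome (0-3), shifted upward for those with diabetes
generate htn = floor(4*runiform()^(1 - 0.3*diabetes))

mlogit htn diabetes female age, base(0)

* Coefficient for diabetes in the htn=1 vs htn=0 equation, then exponentiated
display "log scale:     " [1]_b[diabetes]
display "exponentiated: " exp([1]_b[diabetes])

* The same number appears in the diabetes row of the htn=1 box
mlogit, rrr
```

The [1] in front of _b[diabetes] picks the equation for outcome level 1, the same way the test command refers to it.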

I've interpreted the RRR for elevated BP vs normal BP in the grey box.

You may be interested if the effect of one covariate is the same across levels of the outcome. For example, does the effect of diabetes differ when we look at elevated BP vs normal BP versus stage 2 hypertension vs normal BP?

We can use the test command and indicate the level of the outcome in [ ] 's.

By writing the test statement out that the values are equal to each other, we are testing the null hypothesis that they are equal, or that their difference is zero. The prob > chi2 gives us the probability of observing a more extreme chi2 value, and here our p-value of 0.16 indicates we won't be rejecting the null this time around -> the effect of diabetes in elevated BP vs normal BP versus stage 2 HTN vs normal BP is similar.

Marginal Probabilities

Figure 5. You can estimate the predicted probability of each level of the outcome by diabetes status, holding the other covariates at their means. Each box corresponds to an outcome level.

If you want to estimate the predicted probability of each outcome for those with and without diabetes, you can use the margins command. You run the margins command for each level of the outcome. Be sure that your variable of interest (diabetes in the example) is run in the regression as a factor variable (i.variable). There's usually no need to do this with binary covariates, so you may not have. Just re-run your regression with i.variable (you can even do so 'quietly') and then run margins. Note that if you want to always run your covariates as factor variables (binary or categorical) you can do so. For a 0-1 variable the output will just show 1.variable, or you can tell Stata you want 1 to be the reference with ib1.variable.

With the margins command you can set each covariate to a level, such as female=1 (the average sex part here doesn't mean much) or you can predict atmeans (which is useful for age).
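Put together, the margins steps look something like the sketch below. As before, my data aren't included with this post, so the first few lines simulate an arbitrary stand-in dataset with the same variable names, and the numbers it produces are meaningless.

```stata
* Sketch: simulated stand-in data, not the data used in the figures
clear
set seed 2019
set obs 1000
generate diabetes = runiform() < 0.3
generate female   = runiform() < 0.5
generate age      = 30 + floor(40*runiform())
generate htn = floor(4*runiform()^(1 - 0.3*diabetes))

* Re-run the model with diabetes as a factor variable
quietly mlogit htn i.diabetes female age, base(0)

* Predicted probability of each outcome level, by diabetes status,
* with the other covariates held at their means
forvalues k = 0/3 {
    margins diabetes, atmeans predict(outcome(`k'))
}
```

To fix a covariate at a chosen value instead of its mean, you could swap atmeans for something like at(female=1) and add (mean) age inside at() as needed.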

About the Author

Bailey DeBarmore is a doctoral student at the University of North Carolina at Chapel Hill studying epidemiology. Find her on Twitter @BaileyDeBarmore and blogging for the American Heart Association on the Early Career Voice blog.

Calculating IPW and SMR in SAS and Stata

Sun, 14 Apr 2019 12:15:45 GMT

Author: Bailey DeBarmore

Learning about a method in class, like inverse probability weighting, is different than implementing it in practice.

This post will remind you why we might be interested in propensity scores to control for confounding - specifically inverse probability of treatment weights and SMR - and then show how to do so in SAS and Stata.

If you have corresponding code in R that you'd like to add to this post, please contact me.

A note about weighting versus multivariable regression:
When you use weighting, your effect estimate is interpreted as a marginal effect in the target population. When you adjust for covariates in a regression model, you are interpreting a conditional effect, that is, the effect of the exposure holding the covariates constant (conditional on them).

Conditional estimates are troublesome with time-varying covariates because we run into collider bias and conditioning on mediators, thus weights are preferable. In simpler situations, using weights over multivariable regression can help with convergence issues.

Files to Download: .txt file with SAS and Stata code, as well as a PDF version of this post with code (perfect for students) available to download at the end of the post or at my github

propensity scores

A propensity score is the predicted probability of exposure (or treatment) given covariates. The same approach can also be used to model censoring or missingness.

How do we use propensity scores for confounding?
We can use propensity scores to generate WEIGHTS which, when applied to the final model, make the exposure independent from confounders (Figure 1B), usually by modeling the association between the exposure and confounders (instead of the main analysis where we model the outcome and exposure).

Propensity scores can also control for confounding via covariate adjustment (I discourage you from this option), stratification, and matching, in addition to weighting.

The 2 types of weights I'll be discussing in this post are

SMR (standardized mortality/morbidity ratio) weights
IPTW (inverse probability of treatment weights)

Figure 1. Panel A shows the observed population in our data set, where the relationship between exposure and outcome (orange) is confounded by, well, confounders. In B, we have removed the arrow from confounders to exposure. We can remove the arrow in several ways, including using propensity scores (of various types) to create a pseudopopulation where the exposure and confounder are no longer associated.

Figure 2. Panel A shows the usual multivariable model we run in our analyses - to estimate the association of the exposure with the outcome, controlled for confounders. When we want to use propensity scores, we first create the weights that we will later use in our final model by modeling the association of the confounders with the exposure - so we can remove that arrow.

SMR

When you use SMR weights, you're estimating the average treatment effect in the treated (ATT). In other words, you estimate the effect had the exposed group been exposed, versus the exposed group been unexposed. The pseudopopulation that you create has a covariate (confounder) distribution equal to that observed in the exposed group.

You can also generate your SMR weights where the unexposed group is the target of interest, and model the effect had the unexposed group been unexposed versus the unexposed group been exposed.

Figure 3. In Panel A, the target group is the exposed group, so we use SMR to model had the exposed group been exposed versus unexposed, with the covariate distribution of the exposed group (yellow). In Panel B, we use the covariate distribution of the unexposed group (green), and model had unexposed group been unexposed vs exposed.

Figure 4. Equations to calculate the SMR weights for a given target population, corresponding to the Figure 3 examples. A indicates exposure status (=1 for exposed, =0 for unexposed). The numerator is the probability of exposure given the covariate distribution of the target group, and the denominator is the probability of being assigned to what was observed. You can see that in panel B, it switches when the target group is unexposed.
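To make the weight definitions concrete, here is a minimal Python sketch of the same logic used in the SAS code in the next section: the target group gets a weight of 1 and the other group gets the odds of belonging to the target group. The function name `smr_weight` and the example propensity scores are hypothetical, made up purely for illustration.

```python
def smr_weight(exposed, ps, target="exposed"):
    """Return the SMR weight for one person.

    exposed: 1 if the person was exposed, 0 otherwise
    ps:      propensity score, P(A=1 | covariates), strictly between 0 and 1
    target:  "exposed" or "unexposed" (the group whose covariate
             distribution the pseudopopulation should match)
    """
    if not 0 < ps < 1:
        raise ValueError("propensity score must be strictly between 0 and 1")
    if target == "exposed":
        # Exposed keep weight 1; unexposed are weighted by the odds of exposure
        return 1.0 if exposed == 1 else ps / (1 - ps)
    if target == "unexposed":
        # Unexposed keep weight 1; exposed are weighted by the odds of non-exposure
        return (1 - ps) / ps if exposed == 1 else 1.0
    raise ValueError("target must be 'exposed' or 'unexposed'")


# Hypothetical people: (exposure status, propensity score)
people = [(1, 0.8), (0, 0.8), (1, 0.25), (0, 0.25)]
weights_att = [smr_weight(a, ps) for a, ps in people]
weights_atu = [smr_weight(a, ps, target="unexposed") for a, ps in people]
print(weights_att)
print(weights_atu)
```

In practice the propensity scores would come from your exposure model rather than being typed in by hand.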

coding SMR in SAS

To use SMR, you'll be generating the probabilities for the numerator and denominator, and then creating the weight (numerator/denominator) and applying it in the appropriate statement.

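***********************************************
* Calculating SMR weights where exposure = 0, 1
***********************************************;

%let data=;  /* your dataset name */
%let y=;     /* your outcome variable */
%let x=;     /* your exposure variable */
%let id=;    /* your unique identifier */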

*Estimate the predicted probability given covariates;

proc logistic data=&data desc;
model &x=;
output out=pred p=p1;
run;

*Generate the weights by exposure status, for exposed group = target their weight will be 1;

data ;
set pred;
p0 = 1-p1;
odds = p1/p0;
if &x=1 then wt=1;
else wt=odds;
run;

*Final weighted analysis;
proc logistic data= desc;
weight wt;
model &y = &x;
run;

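***********************************************
* Calculating SMR weights where exposure = categorical
***********************************************;

%let data=;  /* your dataset name */
%let y=;     /* your outcome variable */
%let x=;     /* your exposure variable */
%let id=;    /* your unique identifier */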

*Estimate the predicted probability given covariates;

proc logistic data=&data desc;
model &x= /LINK= glogit;
output out=pred p=p1;
run;

*Generate the weights by exposure status, for exposed group = target their weight will be 1;

data ;
set pred;
p0 = 1-p1;
odds = p1/p0;
if &x=1 then wt=1;
else wt=odds;
run;

*Final weighted analysis;
proc logistic data= desc;
weight wt;
model &y = &x;
run;

What is %let?
So that you can easily adapt my code, I use %let statements at the beginning of my code blocks. After each equals sign, fill in your own values: your dataset name for data, your outcome variable for y, your exposure variable for x, and your unique identifier for id. The code as written will then run with those chosen variables. Note that you do still need to add your confounders to the model statement.

If you don't want to use %let statements, simply go through the code and anywhere you see &, replace both the & and the text with your regular code.

coding SMR in Stata

The approach to calculating SMR is different in Stata. You'll be using the built in teffects command (reference manual [TE] here) and using options to specify SMR versus IPW.

Since using SMR gives us the average treatment effect in the treated, we'll use the option atet (average treatment effect on the treated) instead of ate (average treatment effect) which we'll use for IPTW.

******************************************
* Calculating SMR weights
*****************************************;
* Syntax for teffects statement

*teffects ipw (outcome) (exposure covariates), atet
*where outcome is your outcome variable, exposure is your exposure variable, and covariates is a list of your covariates to generate your weights.

*Example: Binary
*Outcome = lowbirthwt
*Exposure = maternalsmoke
*Covariates = maternalage nonwhite

*Use the teffects statement to fit the exposure model (logit by default), generate your weights, and estimate the weighted effect all in 1 step

teffects ipw (lowbirthwt) (maternalsmoke maternalage nonwhite), atet

*The same syntax works for a continuous outcome. The probit option below fits the exposure model with probit instead of the default logit

*Example: Continuous
*Outcome = birthwt
*Exposure = maternalsmoke
*Covariates = maternalage nonwhite

teffects ipw (birthwt) (maternalsmoke maternalage nonwhite, probit), atet

inverse probability of treatment weights (IPTW)

In contrast to SMR weights, when you use IPTW weights you are estimating the average treatment effect (ATE), that is the treatment effect in a population with covariate distribution equal to the entire observed study population, not just the exposed or unexposed.

Thus, you're modeling the complete counterfactual. In other words, you're estimating the effect had the entire population been exposed versus the entire population been unexposed.

You can use unstabilized or stabilized IPTW in your final model. Choosing one over the other doesn't change your ultimate interpretation but affects the width of your confidence interval (wider when you use unstabilized).

Figure 5. Panel A shows what happens in the pseudopopulation when we use unstabilized weights. We duplicate our N, creating 2 new populations with the covariate distribution of the entire observed population, but had everyone been exposed versus everyone unexposed. In B we apply the covariate distribution within strata of exposure, maintaining the same N. (E with a bar (macron) indicates unexposed).

Figure 6. In Panel A we have the equation for unstabilized IPTW for exposed and unexposed, and in Panel B, the stabilized IPTW for exposed and unexposed. You'll see additional notes that show you that you can also calculate the weight components for unexposed by subtracting the exposed probabilities from 1. You'll see this used in our code below.

Unstabilized

Create a pseudopopulation 2x the size of our observed - one where everyone is exposed and one where everyone is unexposed.

Stabilized

Create a pseudopopulation maintaining the original population size, but we adjust the covariate distribution within each stratum of exposure by upweighting and downweighting people to match the overall covariate distribution.

You can see in Figure 5 versus Figure 3 that we are using different covariate distributions. With IPTW we match the covariate distribution of the entire observed population, whereas with SMR we match the covariate distribution of the exposed or unexposed group (whichever is the target). This is why IPTW generates an average treatment effect while SMR generates an average treatment effect in the treated.
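As a rough illustration of the unstabilized versus stabilized idea, the sketch below computes both weights by hand in Python for a made-up data set with a single binary confounder. The data, the confounder levels, and the variable names (`uw`, `sw`, `ps`) are assumptions for illustration only. The propensity score here is simply the exposed proportion within each confounder level, standing in for a fitted logistic model.

```python
from collections import Counter

# Hypothetical data: (confounder level, exposure) for 10 people
data = [("old", 1)] * 3 + [("old", 0)] * 1 + [("young", 1)] * 2 + [("young", 0)] * 4

n = len(data)
n_by_level = Counter(level for level, _ in data)
exposed_by_level = Counter(level for level, a in data if a == 1)

# Denominator model: P(A=1 | confounder), estimated as the exposed proportion per level
ps = {level: exposed_by_level[level] / n_by_level[level] for level in n_by_level}

# Numerator for stabilized weights: marginal P(A=1)
p_exposed = sum(a for _, a in data) / n

uw, sw = [], []
for level, a in data:
    d = ps[level] if a == 1 else 1 - ps[level]      # P(A = observed | confounder)
    num = p_exposed if a == 1 else 1 - p_exposed    # P(A = observed)
    uw.append(1 / d)      # unstabilized weight
    sw.append(num / d)    # stabilized weight

print("sum of uw:", round(sum(uw), 6))
print("sum of sw:", round(sum(sw), 6))
print("mean of sw:", round(sum(sw) / n, 6))
```

With this toy data the unstabilized weights sum to twice the sample size (the doubled pseudopopulation), while the stabilized weights sum to the original sample size, so their mean is 1.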

coding IPTW in SAS

******************************************
* Calculating IPTW
*****************************************;

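%let data=;  /* your dataset name */
%let y=;     /* your outcome variable */
%let x=;     /* your exposure variable */
%let id=;    /* your unique identifier */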

*Estimate denominator - output a dataset with results of regression called denom, with the resulting probabilities stored in variable d;

proc logistic data=&data desc;
model &x = ;
output out=denom p=d;
run;

*Generate numerator for stabilized weights - output a dataset with results of regression called num, with the resulting probabilities stored in variable n - note that there is nothing on the right side of the equation because the numerator will simply be P(A=a), where a = observed exposure status;

proc logistic data=&data desc;
model &x=;
output out=num p=n;
run;

*Generate stabilized and unstabilized weights by merging the datasets with regression output (merge on the unique identifier in your dataset, &id);

data ;
merge &data denom num;
by &id;
if &x=1 then do;
uw = 1/d;
sw = n/d;
end;

*Remember we can use 1 - P(exposed) for the unexposed weight components;

else if &x=0 then do;
uw=1/(1-d);
sw=(1-n)/(1-d);
end;
run;

*Check the distribution of your IPTW - the mean of the stabilized weights (sw) should be about 1. Is the sum for uw twice the sum of sw? why? is the range of uw greater than sw? why?;

proc means data= mean sum min max;
var uw sw;
run;

*You can check to see if your exposure and covariates are associated in your new pseudopopulation ();

proc logistic data= desc;
weight sw;
model &x=;
run;

*Now you can run your main analyses and apply the weights using the weight statement - use the sw variable for stabilized weights and uw for unstabilized weights. You can use proc genmod, glm, logistic, etc. The example below uses logistic. Note that we now model &y and &x, and we do not need the covariates because the confounder -> x arrow is accounted for by the weights;

proc logistic data= desc;
weight sw;
model &y = &x;
run;

coding IPTW in Stata

The approach to calculating IPTW is different in Stata than SAS. You'll be using the built in teffects command (reference manual [TE] here) and using options to specify IPTW.

Since using IPTW results in the average treatment effect overall, we'll use option ate (average treatment effect) instead of option atet (average treatment effect on the treated) that we use for SMR.

******************************************
* Calculating IPTW
*****************************************;

* Syntax for teffects statement

*teffects ipw (outcome) (exposure covariates), ate
*where outcome is your outcome variable, exposure is your exposure variable, and covariates is a list of your covariates to generate your weights.

*Example: Binary
*Outcome = lowbirthwt
*Exposure = maternalsmoke
*Covariates = maternalage nonwhite

*Use the teffects statement to fit the exposure model (logit by default), generate your weights, and estimate the weighted effect all in 1 step

teffects ipw (lowbirthwt) (maternalsmoke maternalage nonwhite), ate

*The same syntax works for a continuous outcome. The probit option below fits the exposure model with probit instead of the default logit

*Example: Continuous
*Outcome = birthwt
*Exposure = maternalsmoke
*Covariates = maternalage nonwhite

teffects ipw (birthwt) (maternalsmoke maternalage nonwhite, probit), ate

* Syntax to manually create IPTW for binary exposure (treatment)
logistic treatment vars
predict p
gen iptw = 1/p if treatment==1
replace iptw=1/(1-p) if treatment==0

*To calculate stabilized IPTW, run tab treatment and use the proportions for X below
tab treatment
gen siptw = X/p if treatment==1
replace siptw = (1-X)/(1-p) if treatment==0

UPDATE: Feb 22 2021
If your treatment (exposure) is categorical you'll need to change things up a bit. I'm writing this blurb in response to a comment below, and in response to this topic coming up several times in just the past week!

Figure 7. Calculating IPTW for a categorical A.

In your software of choice, you need to calculate the probability of being treated given your covariates, save those probabilities, and then assign each observation the probability that matches its observed treatment status.

In SAS, you can use code like the sample below (thanks, Paul!) which uses a generalized logit model for a treatment variable with multiple categories.

The output data set will have probabilities for every value of treatment. For example, a person with treatment=1 will have values for treatment=0,1,2,3 so you need to do some more coding to tell the computer you only want the predicted value for treatment if treatment=treatment.

PROC LOGISTIC DATA=...;
MODEL treatment = vars / LINK=glogit;
OUTPUT OUT=denom_ipw P=d;
RUN;

In Stata, you can run mlogit to estimate the probability of each treatment level adjusted for your covariates, and then use the postestimation command predict to create a denominator variable for each value of treatment (0, 1, 2, 3). You use the outcome() option to tell Stata which treatment level you want.

mlogit treatment vars
predict p0, outcome(0)
predict p1, outcome(1)
predict p2, outcome(2)
predict p3, outcome(3)

Then create your weight variable, iptw below.

gen iptw=.
replace iptw=1/p0 if treatment==0
replace iptw=1/p1 if treatment==1
replace iptw=1/p2 if treatment==2
replace iptw=1/p3 if treatment==3

If you want to create stabilized weights, you can run a tab to get the proportion in each category, and then calculate your weights. For example, if my groups are distributed as P(A=0)=0.6, P(A=1)=0.14, P(A=2)=0.2, and P(A=3)=0.06, I would make a stabilized IPTW variable, siptw, as below.

gen siptw=.
replace siptw=0.6/p0 if treatment==0
replace siptw=0.14/p1 if treatment==1
replace siptw=0.2/p2 if treatment==2
replace siptw=0.06/p3 if treatment==3
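The "pick the probability that matches the observed treatment" step can be sketched in Python as follows. The function name `categorical_iptw` and the predicted probabilities for the example person are hypothetical. The marginal proportions are the ones from the example above.

```python
def categorical_iptw(treatment, probs, marginal=None):
    """Return the IPTW for one person with a categorical treatment.

    treatment: the observed treatment level for this person
    probs:     dict mapping each level to P(A=level | covariates) for this person
    marginal:  optional dict mapping each level to P(A=level) in the whole sample.
               If given, the stabilized weight is returned.
    """
    denom = probs[treatment]                              # P(A = observed | covariates)
    numer = 1.0 if marginal is None else marginal[treatment]
    return numer / denom


# Marginal distribution of treatment from the example in the text
marginal = {0: 0.6, 1: 0.14, 2: 0.2, 3: 0.06}

# Hypothetical predicted probabilities for one person who received treatment 2
person_probs = {0: 0.5, 1: 0.2, 2: 0.25, 3: 0.05}

iptw = categorical_iptw(2, person_probs)               # unstabilized: 1 / 0.25
siptw = categorical_iptw(2, person_probs, marginal)    # stabilized: 0.2 / 0.25
print(iptw, siptw)
```

This mirrors the four replace statements: only the predicted probability for the treatment level actually received ends up in the denominator.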

Download resources here, or get the most up-to-date versions on my GitHub

Calculating SMR and IPW - SAS - 2021 - EPICODE (txt, 8 kb)
Calculating SMR and IPW in Stata - 2022 (txt, 6 kb)
Coding IPW and SMR in SAS and Stata - PDF for teachers (pdf, 551 kb)

Suggested citation

DeBarmore BM. “Coding IPW and SMR in SAS and Stata”. 2019. Updated 2021. Retrieved from http://www.baileydebarmore.com/epicode/calculating-ipw-and-smr-in-sas

How to Make Publication Quality Graphs in Excel

Sun, 16 Sep 2018 12:42:21 GMT

Learning to clean your data and run the appropriate statistical tests in your program of choice is challenge enough. When it comes to producing graphs for your posters and publications, it can be frustrating to learn how to code to get exactly the graph you want.

Using Microsoft Excel to produce graphs can open up a whole new world for you. Being able to easily reorganize your data from columns to rows and adjust categories and axes in real time can streamline the entire process.

While the default figures leave much to be desired, this post will summarize several ways to strategically change the aesthetics to create publication quality graphics.

For details on how to transform your Excel graphs from lackluster to publication-ready, download the PDF below.

download

summary

Make sure your data is in the right format for inserting a chart
Change font from default Calibri to Arial, and change all font from gray to black
Add X- and Y-axes lines in black
Add axis titles
Change colors and patterns of graph for optimal design

Optional

Add secondary Y-axis if needed
Change units if needed

About the Author

Bailey DeBarmore is a doctoral student at the University of North Carolina at Chapel Hill studying epidemiology. Find her on Twitter @BaileyDeBarmore and blogging for the American Heart Association on the Early Career Voice blog.

Modeling Categorical Predictors in SAS

Sat, 25 Aug 2018 22:04:52 GMT

Trying to figure out how to model a categorical predictor in your regression?

Done this code a million times but can never remember the syntax for the class statement?

Want to generate exponentiated estimates and confidence intervals?

We'll give examples for binary, 3 levels, 4 levels, and stratified.

Binary Predictor

For all of these code snippets, define &data, &outcome, and &var with your data set, outcome variable, and predictor variable.

%let data=;
%let outcome=;
%let var=;

Here we've used proc genmod, and called on logistic regression using the logit link and binomial distribution.

The estimate statement puts a value of 1 in for the predictor variable. The "exp" option has SAS exponentiate the estimate (the log odds ratio) and its 95% CI for us, which gives the OR. You'll want to go to the row labeled Exp(&VAR), over to the estimate and 95% CI column.

proc genmod data=&data descending;
      model &outcome = &var / link=logit dist=binomial type3;
      estimate "&var" &var 1 / exp;
      run;
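
To see what the exp option is doing, here is a small Python sketch that exponentiates a log-odds estimate and its confidence limits by hand. It assumes Wald-type limits (estimate plus or minus 1.96 standard errors). The coefficient and standard error are made-up numbers, and `odds_ratio_with_ci` is a hypothetical helper name.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Exponentiate a log-odds coefficient and its Wald confidence limits."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error on the log-odds scale
or_, lower, upper = odds_ratio_with_ci(beta=0.405, se=0.150)
print(f"OR = {or_:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

Note that the confidence limits are exponentiated after being built on the log-odds scale, which is why the interval is not symmetric around the OR.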

Multi-Level Categorical Predictor

This example is for an age category variable, which can easily have more than 3 or 4 levels. Here, &var is "agecat", and the reference value is the youngest age group, 18-29. You'd need to replace the estimate "OR LABEL" &var... statement with your own labels.

The tricky part here is how many zeros? You can see for each subsequent level, we add a zero, extending the 1 out. You'll see later even when we make the reference value a different level, it's similar.

proc genmod data=&data descending;
      class &var (ref='1') / param=ref;
      model &outcome = &var / link=logit dist=binomial type3;
      estimate "OR 30-39" &var 1 / exp;
      estimate "OR 40-49" &var 0 1 / exp;
      estimate "OR 50-59" &var 0 0 1 / exp;
      estimate "OR 60-69" &var 0 0 0 1 / exp;
      estimate "OR 70-79" &var 0 0 0 0 1 / exp;
      estimate "OR 80+" &var  0 0 0 0 0 1 / exp;
      run;
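
If you have many levels, the "how many zeros" pattern can be generated rather than typed. The Python sketch below is just a convenience for writing out the estimate statements. The function name `estimate_statements` is hypothetical, and it assumes the labels are listed for the non-reference levels in the same order SAS orders them.

```python
def estimate_statements(var, labels):
    """Build one SAS estimate statement per non-reference level.

    var:    the predictor as it appears in the SAS code (e.g. "&var")
    labels: labels for the non-reference levels, in SAS's level order
    """
    lines = []
    for position, label in enumerate(labels):
        # Level in position k gets k zeros followed by a single 1
        coefs = ["0"] * position + ["1"]
        lines.append(f'estimate "OR {label}" {var} {" ".join(coefs)} / exp;')
    return lines


for line in estimate_statements("&var", ["30-39", "40-49", "50-59"]):
    print(line)
```

You can paste the printed lines into your proc genmod step between the model statement and run.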

Different Reference Value

This variable has 3 levels. We're choosing level 2 (private insurance), to be the reference. You can see that when we do our estimate statement, we don't skip over 0 1 for level 2, but instead go sequentially like in the last example.

If instead we made the reference level 3, we would adjust our labels to reflect that, and use the same 1 and 0 1 designation in the estimate statements.

proc genmod data=&data descending;
      class &var (ref='2') / param=ref;
      model &outcome = &var / link=logit dist=binomial;
      estimate "OR Public" &var 1 / exp;
      estimate "OR Other" &var 0 1 / exp;
      run;

Stratified Logistic Regression with Categorical Predictors

What if you want to stratify? First, sort your data set by the strata variable. Then run your analyses as above, but add a -by- statement.

Your output will be for strata=0 and strata=1 (or more if you have more levels). You won't get an overall output, so you'll need to run that separately.

proc sort data=&data out=sort;
      by &strata;
      run;

proc genmod data=sort descending;
      by &strata;
      class &var (ref='1') / param=ref;
      model &outcome = &var / link=logit dist=binomial type3;
      estimate "OR q2" &var  1 / exp;
      estimate "OR q3" &var 0 1/ exp;
      estimate "OR q4" &var  0 0 1 / exp;
      run;

Enjoy! Feel free to bookmark.

About the Author

Bailey DeBarmore is a doctoral student at the University of North Carolina at Chapel Hill studying epidemiology. Find her on Twitter @BaileyDeBarmore and blogging for the American Heart Association on the Early Career Voice blog.

zEpid: a Python library for epidemiology tools

Sun, 08 Jul 2018 01:08:48 GMT

Author: Paul Zivich

Python is a general computer programming language but has recently garnered popularity among data scientists with its versatility, ability to quickly process large data sets, and large library of machine learning models. I taught myself Python two years ago and while there are several Python libraries for epidemiology, I found the libraries were no longer actively maintained, did not interact with pandas (the main data management Python library), or implement causal inference methods (like inverse probability weights). To fill this gap, I created zEpid with the goal of making epidemiologic analyses in Python e-z.

Functional Form Assessment

I have a few features that I especially like and will highlight them here. First is the functional form assessment. I always found functional form assessments tedious to code, and it was difficult to obtain a nice-looking plot from SAS. The code I wrote creates a functional form plot and prints the model results. Below is a fully contained example:

import zepid as ze
import matplotlib.pyplot as plt
df = ze.load_sample_data(timevary=False)
ze.graphics.func_form_plot(df,outcome='dead',var='age0',discrete=True)
plt.show()

Which gives the following output:

Warning: missing observations of model variables are dropped
0 observations were dropped from the functional form assessment
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                   dead   No. Observations:                  547
Model:                            GLM   Df Residuals:                      545
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -239.25
Date:                Tue, 26 Jun 2018   Deviance:                       478.51
Time:                        08:25:47   Pearson chi2:                     553.
No. Iterations:                     5   Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.6271      0.537     -6.760      0.000      -4.679      -2.575
age0           0.0507      0.013      4.012      0.000       0.026       0.075
==============================================================================
AIC: 482.50783872152573
BIC: -2957.4167585984537

Generate Splines

Assessing other functional forms, creating splines, and adding points which correspond to groups of observations are also easily implementable. Since I mentioned splines, zEpid also has easy to use functionality to generate splines. The following line of code will generate a restricted quadratic spline with knots at 30, 40, and 55. Continuing with the functional form plot code from previous, we can generate another functional form plot

df[['rqs0','rqs1']] = ze.spline(df,var='age0',n_knots=3,knots=[30,40,55],restricted=True)
ze.graphics.func_form_plot(df,outcome='dead',var='age0',f_form='age0 + rqs0 + rqs1',discrete=True)
plt.vlines(30,0,0.85,colors='gray',linestyles='--')
plt.vlines(40,0,0.85,colors='gray',linestyles='--')
plt.vlines(55,0,0.85,colors='gray',linestyles='--')
plt.show()
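
For readers curious what a restricted quadratic spline term looks like, here is a minimal pure-Python sketch of one common construction: each term is the squared positive part of (x minus a knot), minus the squared positive part of (x minus the last knot). This is an assumption about the general technique and is not taken from zEpid's source, so its internal parameterization may differ. The function name `rqs_terms` is hypothetical.

```python
def rqs_terms(x, knots):
    """Restricted quadratic spline terms for a single value x.

    Returns len(knots) - 1 terms of the form
    (x - k_j)_+^2 - (x - k_last)_+^2, so that the fitted curve
    becomes linear again beyond the last knot.
    """
    def pos_sq(value, knot):
        return max(value - knot, 0.0) ** 2

    last = knots[-1]
    return [pos_sq(x, k) - pos_sq(x, last) for k in knots[:-1]]


knots = [30, 40, 55]
for age in (25, 35, 45, 60):
    print(age, rqs_terms(age, knots))
```

With three knots this gives two spline terms, matching the two columns (rqs0 and rqs1) in the code above.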

Inverse Probability Weights

Lastly, zEpid has functionalities for inverse probability weights. Currently, inverse probability of treatment weights, inverse probability of censoring weights, and inverse probability of missing weights are implemented. The following block of code can be used to fit a time-fixed IPTW model. Note that we will use statsmodels to obtain the final result. Currently, zEpid only generates the weights to maintain user functionality (i.e. ability to manipulate weights for a marginal structural model).

#Loading necessary packages to fit model
import zepid as ze
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.genmod.families import family,links

#Loading the example data within zEpid
df = ze.load_sample_data(timevary=False)

#Creating polynomial terms
df['cd40sq'] = df['cd40']**2
df['cd40cu'] = df['cd40']**3

#Generating stabilized IPTW for ART as exposure
model = 'male + age0 + cd40 + cd40sq + cd40cu + dvl0'
df['iptw'] = ze.ipw.iptw(df,treatment='art',model_denominator=model,stabilized=True)

#Fitting a GEE model with the statsmodels library to obtain the risk of death by ART exposure (Risk Difference)
ind = sm.cov_struct.Independence()
f = sm.families.family.Binomial(sm.families.links.identity)
linrisk = smf.gee('dead ~ art',df['id'],df,cov_struct=ind,family=f,weights=df['iptw']).fit()
print(linrisk.summary())

Which gives us the following results

                               GEE Regression Results
===================================================================================
Dep. Variable:                        dead   No. Observations:                  547
Model:                                 GEE   No. clusters:                      547
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   1
Family:                           Binomial   Mean cluster size:                 1.0
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Tue, 26 Jun 2018   Scale:                           1.000
Covariance type:                    robust   Time:                         13:56:22
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1817      0.018     10.008      0.000       0.146       0.217
art           -0.0826      0.037     -2.205      0.027      -0.156      -0.009
==============================================================================
Skew:                          1.7574   Kurtosis:                       1.1278
Centered skew:                 0.0000   Centered kurtosis:             -3.0000
==============================================================================
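
As a sanity check on what the art coefficient represents, the sketch below computes a weighted risk difference by hand in plain Python: the weighted risk among the treated minus the weighted risk among the untreated. The records are made-up numbers, not the zEpid sample data, and `weighted_risk` is a hypothetical helper. It only gives a point estimate, with no robust standard error.

```python
def weighted_risk(records, arm):
    """Weighted risk of the outcome within one exposure arm.

    records: iterable of (exposure, outcome, weight) tuples
    arm:     0 or 1, the exposure group to summarize
    """
    num = sum(w * y for a, y, w in records if a == arm)
    den = sum(w for a, y, w in records if a == arm)
    return num / den


# Hypothetical records: (art, dead, iptw)
records = [
    (1, 0, 1.2), (1, 1, 0.8), (1, 0, 1.0),
    (0, 1, 0.9), (0, 1, 1.1), (0, 0, 1.0),
]

risk_difference = weighted_risk(records, 1) - weighted_risk(records, 0)
print(round(risk_difference, 4))
```

For a model with only the exposure on the right-hand side, the weighted difference in risks should correspond to the exposure coefficient from a weighted identity-link model.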

You can visit the following website for a description on fitting a marginal structural model with an inverse probability weighted Kaplan Meier:
zEpid Docs – MSM with IPW-KM

For further description of the above features and others, a guide is available at http://zepid.readthedocs.io/en/latest/

Note: At the time of this blog post, we are on version 0.1.3

Download zEpid

You can download zEpid via GitHub, PyPI, or directly from the command line using

pip install zepid

In the background, zEpid uses:

NumPy
SciPy
pandas
statsmodels
matplotlib
tabulate

If you are interested in conducting analyses in Python, I also recommend the packages:

lifelines (survival analysis tools)
biopython (biological computation tools and search PubMed)
seaborn (improved visualizations)
sas7bdat (read in SAS files)
sklearn (machine learning models)
NetworkX (network analysis)

For an introduction to Python intended for epidemiologists, I have a guide in development at https://github.com/pzivich/Python-for-Epidemiologists

Note that zEpid is distributed under the MIT license.

Paul

About the Author

Paul Zivich is an epidemiology PhD student at University of North Carolina at Chapel Hill. His interests include infectious disease epidemiology and causal inference in the presence of interference.

To request features or ask questions, contact him on GitHub at /pzivich/zepid, on Twitter @zEpidpy, or by email.

Calculate Mean or Average Value in SAS

Thu, 14 Jun 2018 14:23:42 GMT

Author: Bailey DeBarmore

Short post today on how to use the MEAN function in SAS 9.4. Let's get started.

It seems like every time I need to calculate a mean variable in SAS, I find myself looking up which CALL functions deal with missing values in this way, and which in that way.

For example, blood pressure readings are often taken 3 times, and then we average those 3 readings together for a mean value. In some code I ran earlier this morning, I kept getting negative values in my "avg_bp" variable. What's up with that?

The code I had used was:

avg_bp=mean(bp1-bp3)

Looks OK right?

Then I tried:

avg_bp=average(bp1-bp3)

Guess what? There's no average function in SAS. Oops!

Then I tried:

avg_bp=mean(of bp1-bp3)

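and ta da! Perfecto! Without "of", SAS reads bp1-bp3 as subtraction (bp1 minus bp3) and takes the mean of that single difference, which is why I kept getting negative values.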

Syntax for MEAN function

The syntax of the MEAN function varies depending on how you list your variables.
1) If you use a dash list, like I did, you need to include "of".
2) If you separate variables with commas, you do not need to include "of".

SUM versus MEAN

So why wouldn't you just use

avg_bp=SUM(of bp1-bp3)/3

Well, with SUM, if there is a missing value, SUM effectively treats it as a zero. Usually if we only have 2 out of 3 values, when we calculate the average, we would do bp1+bp2 divided by 2, right? If you use SUM and a fixed divisor, you need to be sure that you have NO MISSINGNESS. In contrast, the MEAN function will manage missing values appropriately. That is, if you're missing 1 of 3 values, it will divide by 2 (not 3).
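
The difference is easy to see with numbers. This Python sketch mimics the two approaches, using None to stand in for a SAS missing value. The function names and blood pressure readings are hypothetical, and this is only an imitation of the behavior described above, not SAS itself.

```python
def sas_like_mean(*values):
    """Average of the non-missing values (None stands in for SAS missing)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def sum_over_fixed_n(*values):
    """Sum of the non-missing values divided by the number of readings taken."""
    present = [v for v in values if v is not None]
    return sum(present) / len(values) if present else None


# Hypothetical readings with the third one missing
bp1, bp2, bp3 = 120, 130, None

print(sas_like_mean(bp1, bp2, bp3))      # divides by the 2 readings we have
print(sum_over_fixed_n(bp1, bp2, bp3))   # divides by 3, pulling the average down
```

The fixed divisor quietly drags the average toward zero whenever a reading is missing, which is exactly the problem the MEAN function avoids.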

NB: Don't just exclude all your missing values! That's a post for another day, but briefly, if you exclude observations with missing values you can introduce bias. Persons with all missing values versus 1 versus 2 out of the 3 may be different in some systematic way. You're also reducing your precision by excluding those.

Take Home Message

In summary, to calculate the mean value, be sure to use the MEAN function so that it will manage missing values appropriately, and be sure to use the correct syntax to avoid wonky results (use "of" with a - list, otherwise use a comma separated list, see below).

NEWVAR = mean(of VAR1-VAR3)
NEWVAR = mean(VAR1, VAR2, VAR3)

That's all for today - quick and dirty, but useful. Now every time I look this up, I'll just come back to my own blog post! Just kidding - hopefully writing this post for you guys has solidified the concept.

Bailey

P for trend

Tue, 22 May 2018 17:54:21 GMT

Author: Bailey DeBarmore

While I'm not a big fan of p values, sometimes your coauthors, reviewers, or editors ask for them. In this post I'll show you how to calculate p for trend for ordered categories, like in a Table 1, and for adjusted odds ratios or similar regression.

R users: I don't use R much, but encourage you to search for "prop.trend.test" to learn more about trend tests in R.

Jump to:
Stata
SAS

Stata

Test for Trend using nptrend

If you want to compare mean values across ordered categories, call the nptrend test after tab (for categorical) or tabstat (for continuous). It is an extension of the Wilcoxon Rank Sum test.

Binary and Ordinal Example

tab diabetes agegrp, col
nptrend diabetes, by(agegrp)

You can stratify, too.

sort male
by male: tab diabetes agegrp, col
nptrend diabetes if male==0, by(agegrp)
nptrend diabetes if male==1, by(agegrp)

This code produces proportions of diabetes by age group and then tests for trend by age group. The second block of code does this tabulation and trend test separately for males and females.

Note the difference in tab versus nptrend in the by group syntax: in tab, agegrp is included before the comma with no additional words, but in nptrend (and later in tabstat) you include it after the comma with a by().

Continuous and Ordinal Example

tabstat bmi, by(agegrp) stats(mean sd) format(%9.2f)
nptrend bmi, by(agegrp)

This code produces mean and sd of BMI by age group to 2 decimal places and then produces a test for trend by age group. Note that the syntax of the by grouping is similar in both tabstat and nptrend.

The default when you omit the stats option is to only give you the mean.

Other statistics you can request are mean, count, n, sum, max, min, range, sd, variance, cv, semean, skewness, kurtosis, p1, p5, p10, p25, median, p50, p75, p90, p95, p99, iqr, q.

Note that p50 is the same as median, and q is the same as writing p25 p50 and p75.

Count is the count of nonmissing observations and is the same as n.

CV is the coefficient of variation (sd/mean) and semean is the se of the mean (sd/sq rt n).
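
If you want to check these two by hand, the short Python sketch below computes them from the formulas just given, using made-up BMI values. It assumes the sample standard deviation (n - 1 in the denominator).

```python
import math
import statistics

bmi = [22.5, 27.1, 31.4, 24.8, 29.0]   # hypothetical values

mean = statistics.mean(bmi)
sd = statistics.stdev(bmi)              # sample SD (n - 1 denominator)

cv = sd / mean                          # coefficient of variation
semean = sd / math.sqrt(len(bmi))       # standard error of the mean

print(round(cv, 4), round(semean, 4))
```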

You can stratify here, too.

sort male
by male: tabstat bmi, by(agegrp) stats(mean sd) format(%9.2f)
nptrend bmi if male==0, by(agegrp)
nptrend bmi if male==1, by(agegrp)

ADJUSTED ESTIMATES: Test for Trend using Post-Estimation

After you conduct a regression with a categorical variable, you can test for trend using the post-estimation CONTRAST command.

You will want to indicate your categorical variable using the i. prefix in your regression statement. Then, when you call on CONTRAST (immediately after the regression) you can use a prefix for that variable that indicates the type of trend you want to look at.

Let's look at BMI (continuous) and age group in a linear regression (or ANOVA in this case).

anova bmi agegrp race

regress bmi i.agegrp race

You can run these contrast statements (and others):

1. Difference from reference level

contrast r.agegrp

2. Difference from next level

contrast a.agegrp

3. Difference from previous level

contrast ar.agegrp

4. P-for-trend for linear, quadratic, cubic, quartic, and joint

contrast p.agegrp, noeffects

Using the p. prefix is only meaningful if you have ordinal categories.
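If it helps to see what "linear" and "quadratic" trend contrasts look like, here is a rough Python sketch that builds orthogonal polynomial contrast coefficients for equally weighted levels. The function name is hypothetical and this is not Stata's code; Stata may scale the coefficients differently, which changes the size of the contrast estimate but not the test.

```python
def orthogonal_polynomials(levels, max_degree):
    """Return contrast coefficients for degrees 1..max_degree.

    Built by Gram-Schmidt: each power of the level values is made
    orthogonal to the constant and to all lower-degree contrasts.
    Coefficients are left unscaled.
    """
    k = len(levels)
    basis = [[1.0] * k]  # the constant (intercept) term
    for degree in range(1, max_degree + 1):
        vec = [float(x) ** degree for x in levels]
        for b in basis:
            # Remove the part of vec that lies along the earlier vector b
            coef = sum(v * w for v, w in zip(vec, b)) / sum(w * w for w in b)
            vec = [v - coef * w for v, w in zip(vec, b)]
        basis.append(vec)
    return basis[1:]  # drop the constant

# Four age groups coded 1, 2, 3, 4
linear, quadratic = orthogonal_polynomials([1, 2, 3, 4], max_degree=2)
print("linear:   ", linear)
print("quadratic:", quadratic)
```

The linear contrast weights the group means by how far each level sits from the middle, which is why it only makes sense when the categories have a real order.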

If you're using a non-linear model, you can use the same contrast post-estimation statements after your regression, such as:

logit diabetes i.agegrp race bmi
contrast p.agegrp, noeffects

logistic diabetes i.agegrp race bmi
contrast p.agegrp, noeffects

SAS

Test for Trend using PROC FREQ: Binary and Ordinal

If you have a binary variable and an ordinal variable, you can use PROC FREQ to generate your trend test by requesting the Cochran-Armitage test with the TREND option in the TABLES statement. It will test for trend across the column variable.

Just a refresher for which is the row and which is the column variable.

PROC FREQ data=[data];
TABLES row * col / trend;
run;

You may also want to request confidence limits (CL) and measures (MEASURES) with your trend test.

You can get the same results as the Stata nptrend by specifying SCORES=MODRIDIT in the TABLES statement, after the / .

PROC FREQ data=stroke;
TABLES diabetes * agegrp / trend;
run;

This code will give you a test for trend of diabetes frequency across age groups. The output you're looking for is titled "Cochran-Armitage Trend Test". The one-sided p-value is for a test of trend in a pre-defined direction. The two-sided p-value is for a test of trend when you don't know what direction to expect. (I'm partial to two-sided p-values).

A small p-value means you can reject the null hypothesis of NO TREND.
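For a sense of what the Cochran-Armitage test computes, here is a minimal Python sketch for a 2 x k table. The function name and the counts are made up for illustration, and I assume equally spaced scores (1, 2, 3, ...) for the ordered groups unless you pass your own.

```python
import math

def cochran_armitage(cases, totals, scores=None):
    """Cochran-Armitage trend test for a binary outcome across ordered groups.

    cases:  number with the outcome in each group
    totals: number of people in each group
    scores: numeric score per group (defaults to 1, 2, 3, ...)
    """
    if scores is None:
        scores = list(range(1, len(totals) + 1))
    n_total = sum(totals)
    p_bar = sum(cases) / n_total  # overall proportion with the outcome

    # Score-weighted sum of (observed - expected) cases in each group
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, cases, totals))

    # Variance of that sum under the null hypothesis of no trend
    sum_ns = sum(n * s for s, n in zip(scores, totals))
    sum_ns2 = sum(n * s * s for s, n in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (sum_ns2 - sum_ns ** 2 / n_total)

    z = t_stat / math.sqrt(variance)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Made-up counts: diabetes cases out of 100 people in each of 3 age groups
z, p = cochran_armitage(cases=[10, 20, 30], totals=[100, 100, 100])
print(f"z = {z:.3f}, two-sided p = {p:.5f}")
```

The one-sided p-value is half of the two-sided one when the trend runs in the direction you specified ahead of time.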

Test for Trend using PROC NPAR1WAY: Continuous and Ordinal

If you want to test for trend with a continuous variable across ordinal categories, you can use PROC NPAR1WAY and request the Wilcoxon Rank Sum test.

PROC NPAR1WAY data=stroke WILCOXON;
CLASS agegrp;
VAR bmi;
*exact wilcoxon;
run;

This code would compute p for trend of BMI as a continuous variable across age groups. Note that if you have a small sample size where the normal approximation is unlikely to hold, you should include the "exact wilcoxon" statement (remove the leading asterisk to uncomment it).

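In the output, look for the Normal Approximation two-sided p-value, where a small p-value lets you reject the null hypothesis of no trend. If you used the exact option, look for the two-sided p-value under Exact Test. Keep in mind that when the CLASS variable has more than two levels, the WILCOXON option reports the Kruskal-Wallis test, which tests for any difference among the groups rather than for an ordered trend.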

ADJUSTED ESTIMATES: Test for Trend

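In the output from PROC LOGISTIC, the "Testing Global Null Hypothesis: BETA=0" table tests whether all of the coefficients in the model are jointly zero. When agegrp is the only predictor and is entered as a numeric score (not in a CLASS statement), the Score test in that table corresponds to the Cochran-Armitage test used in PROC FREQ. Once you adjust for other covariates, the p for trend for your adjusted odds ratios comes from the test of the agegrp coefficient itself.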

You can also ask for separate Wald tests of the betas by using the TEST statement.

PROC LOGISTIC data=[data];
MODEL diabetes = agegrp bmi race;
TEST agegrp;
run;

Hopefully you found this post helpful in understanding exactly what your output is giving you. I know I learned a lot just by researching it for you.

Bailey

About the Author