Program for Making Data Management Decisions

The following program creates a subset of the NESARC dataset that limits respondents to those who have had episodes in which they continued to drink to feel an effect and who were between 18 and 39 years old at the time of the survey. This subset is then used to identify respondents who have sought and received treatment for alcohol dependence. The variables include:
Age & Sex of respondents
Responses to questions on the type of treatment sought/received
A variable was created to sum the number of different types of treatments respondents sought. Another variable was derived from that to indicate whether a respondent had received any type or treatment or not.
Response values that did not indicate a yes or no to a question on treatment type were converted to NaN.
Frequency counts and percentage distributions were computed for:

  • Age
  • Age Group
  • Sex
  • Number of treatment types sought
  • Categories of number of treatment types sought
  • Whether treatment of any kind was sought
This dataset and the variable created from it will now be used to evaluate the effectiveness of treatment and the influence of family support on treatment outcomes.


PROGRAM


# -*- coding: utf-8 -*-
"""
Created on Sat Jul  6 10:48:29 2019

@author: gdeal
"""
# program to import nesarc dataset limited to respondents who ever needed to drink 
# more to get the intended effect
# then create variables to assess how these respondents sought treatment
# finally save the dataframe to a csv file

import pandas
import numpy
import sys

# print version information
print("Python version is ", sys.version)

# option to avoid runtime errors/warnings
pandas.set_option('display.float_format',lambda x: '%f'%x)

# read in full nesarc dataset
nesarc_df=pandas.read_csv('nesarc_pds.csv',low_memory=False)

# convert variable names to upper case
nesarc_df.columns=map(str.upper,nesarc_df.columns)

# show number of observations and columns in base dataset
# print(len(nesarc_df))
# print(len(nesarc_df.columns))

nesarc_df['S2BQ1A2'] = pandas.to_numeric(nesarc_df['S2BQ1A2'],errors='coerce')
nesarc_df['S2CQ1'] = pandas.to_numeric(nesarc_df['S2CQ1'],errors='coerce')
nesarc_df['S2CQ2A1'] = pandas.to_numeric(nesarc_df['S2CQ2A1'],errors='coerce')
nesarc_df['S2CQ2A2'] = pandas.to_numeric(nesarc_df['S2CQ2A2'],errors='coerce')
nesarc_df['S2CQ2A3'] = pandas.to_numeric(nesarc_df['S2CQ2A3'],errors='coerce')
nesarc_df['S2CQ2A4'] = pandas.to_numeric(nesarc_df['S2CQ2A4'],errors='coerce')
nesarc_df['S2CQ2A5'] = pandas.to_numeric(nesarc_df['S2CQ2A5'],errors='coerce')
nesarc_df['S2CQ2A6'] = pandas.to_numeric(nesarc_df['S2CQ2A6'],errors='coerce')
nesarc_df['S2CQ2A7'] = pandas.to_numeric(nesarc_df['S2CQ2A7'],errors='coerce')
nesarc_df['S2CQ2A8'] = pandas.to_numeric(nesarc_df['S2CQ2A8'],errors='coerce')
nesarc_df['S2CQ2A9'] = pandas.to_numeric(nesarc_df['S2CQ2A9'],errors='coerce')
nesarc_df['S2CQ2A10'] = pandas.to_numeric(nesarc_df['S2CQ2A9'],errors='coerce')
nesarc_df['S2CQ2A11'] = pandas.to_numeric(nesarc_df['S2CQ2A9'],errors='coerce')
nesarc_df['S2CQ2A12'] = pandas.to_numeric(nesarc_df['S2CQ2A9'],errors='coerce')
nesarc_df['S2CQ2A13'] = pandas.to_numeric(nesarc_df['S2CQ2A9'],errors='coerce')

nesarc_df['AGE'] = pandas.to_numeric(nesarc_df['AGE'],errors='coerce')
nesarc_df['SEX'] = pandas.to_numeric(nesarc_df['SEX'],errors='coerce')
nesarc_df['S1Q1F'] = pandas.to_numeric(nesarc_df['S1Q1F'],errors='coerce')

# create subset dataframe with only respondents who have ever drunk more to 
# feel the effect they wanted and who are 18 years or older but younger than 40. 
# S2BQ1A2 = 1 is condition to keep

df_sub1=nesarc_df[(nesarc_df['S2BQ1A2']==1) & (nesarc_df['AGE']<40)]
print("printing count of respondents with possible alcohol dependence")
print(len(df_sub1))

# make copy of df_sub1 -- not sure why though
df_sub3 = df_sub1.copy()

# criteria for determining if respondent sought or received treatment
# S2CQ1 Ever sought help because of drinking
# S2CQ2A1 Ever went to AA or 12 step
# S2CQ2A2 Ever went to family services
# S2CQ2A3  Ever went to detox
# S2CQ2A4 Ever went to inpatient psych
# S2CQ2A5 Ever went to outpatient 
# S2CQ2A6 Ever went to alcohol rehab
# S2CQ2A7 Ever went to ED
# S2CQ2A8 Ever went to halfway house
# S2CQ2A9 Ever went to crisis center
# S2CQ2A10  Evern went to EAP
# S2CQ2A12 Ever went to private physician
# S2CQ2A13 Ever went to any other program or professional

# replace undefined values with nan for S1Q1F value 9 "Born in the U.S."
df_sub3['S1Q1F']=df_sub3['S1Q1F'].replace(9, numpy.nan)
df_sub3['S2CQ2A1']=df_sub3['S2CQ2A1'].replace(9, numpy.nan)
df_sub3['S2CQ2A2']=df_sub3['S2CQ2A2'].replace(9, numpy.nan)
df_sub3['S2CQ2A3']=df_sub3['S2CQ2A3'].replace(9, numpy.nan)
df_sub3['S2CQ2A4']=df_sub3['S2CQ2A4'].replace(9, numpy.nan)
df_sub3['S2CQ2A5']=df_sub3['S2CQ2A5'].replace(9, numpy.nan)
df_sub3['S2CQ2A6']=df_sub3['S2CQ2A6'].replace(9, numpy.nan)
df_sub3['S2CQ2A7']=df_sub3['S2CQ2A7'].replace(9, numpy.nan)
df_sub3['S2CQ2A8']=df_sub3['S2CQ2A8'].replace(9, numpy.nan)
df_sub3['S2CQ2A9']=df_sub3['S2CQ2A9'].replace(9, numpy.nan)
df_sub3['S2CQ2A10']=df_sub3['S2CQ2A10'].replace(9, numpy.nan)
df_sub3['S2CQ2A11']=df_sub3['S2CQ2A11'].replace(9, numpy.nan)
df_sub3['S2CQ2A12']=df_sub3['S2CQ2A12'].replace(9, numpy.nan)
df_sub3['S2CQ2A13']=df_sub3['S2CQ2A13'].replace(9, numpy.nan)

# create new variables with counter for each criteria - these will be summed
df_sub3["S2CQ1_cntr"] = df_sub3["S2CQ1"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A1_cntr"] = df_sub3["S2CQ2A1"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A2_cntr"] = df_sub3["S2CQ2A2"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A3_cntr"] = df_sub3["S2CQ2A3"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A4_cntr"] = df_sub3["S2CQ2A4"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A5_cntr"] = df_sub3["S2CQ2A5"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A6_cntr"] = df_sub3["S2CQ2A6"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A7_cntr"] = df_sub3["S2CQ2A7"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A8_cntr"] = df_sub3["S2CQ2A8"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A9_cntr"] = df_sub3["S2CQ2A9"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A10_cntr"] = df_sub3["S2CQ2A10"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A11_cntr"] = df_sub3["S2CQ2A11"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A12_cntr"] = df_sub3["S2CQ2A12"].apply(lambda x: 1 if x == 1 else 0)
df_sub3["S2CQ2A13_cntr"] = df_sub3["S2CQ2A13"].apply(lambda x: 1 if x == 1 else 0)

# sum treatmet conditions to get treatment count
df_sub3["treatment_type_count"] = df_sub3["S2CQ2A1_cntr"] + df_sub3["S2CQ2A2_cntr"] \
+  df_sub3["S2CQ2A3_cntr"] + df_sub3["S2CQ2A4_cntr"] + df_sub3["S2CQ2A5_cntr"] \
+  df_sub3["S2CQ2A6_cntr"] + df_sub3["S2CQ2A7_cntr"] + df_sub3["S2CQ2A8_cntr"] \
+  df_sub3["S2CQ2A9_cntr"] + df_sub3["S2CQ2A10_cntr"] + df_sub3["S2CQ2A11_cntr"] \
+  df_sub3["S2CQ2A12_cntr"] + df_sub3["S2CQ2A13_cntr"] + df_sub3["S2CQ1_cntr"]

df_sub3["treatment_flag"] = df_sub3["treatment_type_count"].apply(lambda x: 1 if x > 1 else 0)

# create categorical variable for age groups: 18-21, 22-25, 26-30,31-35,36-39
df_sub3['AGEGROUP'] = pandas.cut(df_sub3.AGE, [17, 21, 25, 30,35,39])
print("Print distribution of AGEGROUP Categorial Variable")
print_df_sub3 = df_sub3['AGEGROUP'].value_counts(sort=False, dropna=True)
print(print_df_sub3)

# print counts and percentages for sex of possible alcolol dependenc subset
print("printing count of sex for subset of respondents with possible alcohol dependence")
count_sex = df_sub3['SEX'].value_counts(sort=False)
print(count_sex)

print("printing percentages for sex for subset of respondents with possible alcohol dependence")
pct_sex = df_sub3['SEX'].value_counts(sort=False,normalize=True)
print(pct_sex)

# print counts and percentages for ages of possible alcolol dependenc subset
print("printing count of ages for subset of respondents with possible alcohol dependence")
count_ages = df_sub3 ['AGE'].value_counts(sort=False)
print(count_ages)

print("printing percentage of ages for subset of respondents with possible alcohol dependence")
pct_ages = df_sub3['AGE'].value_counts(sort=False,normalize=True)
print(pct_ages)

# print counts and percentages for whether respondents who have been to a detox clinic
print("printing frequency count of total treatment types for each respondent")
count_treatment_type = df_sub3["treatment_type_count"].value_counts(sort=False)
print(count_treatment_type)

print("printing frequency count of total treatment types for each respondent")
pct_treatment_type = df_sub3["treatment_type_count"].value_counts(sort=False,normalize=True)
print(pct_treatment_type)

# print counts and percentages for whether respondents have sought or had any type of treatment
print("printing frequency count of total treatment types for each respondent")
count_treatment_flag = df_sub3["treatment_flag"].value_counts(sort=False)
print(count_treatment_flag)

print("printing frequency count of total treatment types for each respondent")
pct_treatment_flag = df_sub3["treatment_flag"].value_counts(sort=False,normalize=True)
print(pct_treatment_flag)

df_sub3['TREATMENT_TYPES'] = pandas.cut(df_sub3.treatment_type_count, [0, 1, 2, 4, 6, 8, 13])
print("Print frequency distribution for TREATMENT_TYPES Categorial Variable")
print_df_sub3 = df_sub3['TREATMENT_TYPES'].value_counts(sort=False, dropna=True)
print(print_df_sub3)

df_sub3['TREATMENT_TYPES'] = pandas.cut(df_sub3.treatment_type_count, [0,1, 2, 4, 6,8,13])
print("Print percentages for TREATMENT_TYPES Categorial Variable")
print_df_sub3 = df_sub3['TREATMENT_TYPES'].value_counts(sort=False, dropna=True, normalize=True)
print(print_df_sub3)

# write working dataframe out to csv file
df_sub3.to_csv('nesarc_alcohol_dependence.csv')

PROGRAM OUTPUT

runfile('C:/Users/gdeal/OneDrive/Documents/Coursera Classes/Data Analysis/refine_nesarc_drinkers.py', wdir='C:/Users/gdeal/OneDrive/Documents/Coursera Classes/Data Analysis')
Python version is  3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)]
printing count of respondents with possible alcohol dependence
2476
Print distribution of AGEGROUP Categorial Variable
(17, 21]    403
(21, 25]    478
(25, 30]    541
(30, 35]    618
(35, 39]    436
Name: AGEGROUP, dtype: int64
printing count of sex for subset of respondents with possible alcohol dependence
2    1037
1    1439
Name: SEX, dtype: int64
printing percentages for sex for subset of respondents with possible alcohol dependence
2   0.418821
1   0.581179
Name: SEX, dtype: float64
printing count of ages for subset of respondents with possible alcohol dependence
18     78
20    102
22    116
24    133
26     92
28     95
30    130
32    137
34    137
36    103
38    113
19    101
21    122
23    123
25    106
27    107
29    117
31    121
33    121
35    102
37    119
39    101
Name: AGE, dtype: int64
printing percentage of ages for subset of respondents with possible alcohol dependence
18   0.031502
20   0.041195
22   0.046850
24   0.053716
26   0.037157
28   0.038368
30   0.052504
32   0.055331
34   0.055331
36   0.041599
38   0.045638
19   0.040792
21   0.049273
23   0.049677
25   0.042811
27   0.043215
29   0.047254
31   0.048869
33   0.048869
35   0.041195
37   0.048061
39   0.040792
Name: AGE, dtype: float64
printing frequency count of total treatment types for each respondent
0     2064
2      100
4       54
6       48
8       13
12       3
14       3
1       43
3       67
5       47
7       24
9        4
11       5
13       1
Name: treatment_type_count, dtype: int64
printing frequency count of total treatment types for each respondent
0    0.833603
2    0.040388
4    0.021809
6    0.019386
8    0.005250
12   0.001212
14   0.001212
1    0.017367
3    0.027060
5    0.018982
7    0.009693
9    0.001616
11   0.002019
13   0.000404
Name: treatment_type_count, dtype: float64
printing frequency count of total treatment types for each respondent
0    2107
1     369
Name: treatment_flag, dtype: int64
printing frequency count of total treatment types for each respondent
0   0.850969
1   0.149031
Name: treatment_flag, dtype: float64
Print frequency distribution for TREATMENT_TYPES Categorical Variable
(0, 1]      43
(1, 2]     100
(2, 4]     121
(4, 6]      95
(6, 8]      37
(8, 13]     13
Name: TREATMENT_TYPES, dtype: int64
Print percentages for TREATMENT_TYPES Categorical Variable
(0, 1]    0.105134
(1, 2]    0.244499
(2, 4]    0.295844
(4, 6]    0.232274
(6, 8]    0.090465
(8, 13]   0.031785

Name: TREATMENT_TYPES, dtype: float64


Comments

Popular posts from this blog

Creating Graphs - Week 4

First Program