SAS, R, Analytics, Big Data and Me: November 2013

Wednesday, November 27, 2013

Stats with R - 7

Source DataSet :- https://d396qusza40orc.cloudfront.net/stats1%2Fdatafiles%2FStats1.13.HW.07.txt
Run a regression model with salary as the outcome variable and years of experience as the predictor variable. What is the 95% confidence interval for the regression coefficient? Type your answer exactly as it appears in R but include only two decimal places (for example, if the 95% confidence interval is -1 to +1 then type -1.00 1.00)

Ans :- 27360.82 38259.76

Question Explanation :-
model1 = lm(data$salary ~ data$years)
confint(model1)
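A minimal sketch of the lm() and confint() calls, run on simulated data so it works without the homework file. The data frame `demo` and its values are made up for illustration; only the column names salary and years follow the question. Note that confint() returns one row per coefficient, so the interval for the regression coefficient is the row for the predictor, not the "(Intercept)" row.

```r
# Simulated stand-in for the homework data (values are invented)
set.seed(42)
demo <- data.frame(years = round(runif(50, 0, 20)))
demo$salary <- 30000 + 5000 * demo$years + rnorm(50, sd = 8000)

model1 <- lm(salary ~ years, data = demo)
ci <- confint(model1)   # 95% intervals by default, one row per coefficient
print(ci)

# Row "years" is the slope; row "(Intercept)" is the intercept
slope_ci <- round(ci["years", ], 2)
print(slope_ci)
```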

Run a regression model with salary as the outcome variable and courses as the predictor variable. What is the 95% confidence interval for the regression coefficient?
Ans :- 59656.54 65782.32

Question Explanation :-
model2 = lm(data$salary ~ data$courses)
confint(model2)

3. Run a multiple regression model with both predictors and compare it with both the model from Question 1 and the model from Question 2. Is the model with both predictors significantly better than:
both single predictor models
the single predictor model based on years of experience
the single predictor model based on courses
none of the above

Question Explanation :-
model3 = lm(data$salary ~ data$years + data$courses)
anova(model1, model3)
anova(model2, model3)
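The model comparison can be sketched as below, again on simulated data (the data frame `demo` and the model names are assumptions for illustration). Calling anova() on two nested lm fits gives an F test of whether the extra predictor significantly reduces the residual sum of squares; the p value sits in the second row of the "Pr(>F)" column.

```r
# Simulated data with two predictors (values are invented)
set.seed(7)
n <- 60
demo <- data.frame(years = runif(n, 0, 20), courses = rpois(n, 4))
demo$salary <- 30000 + 4000 * demo$years + 1500 * demo$courses +
  rnorm(n, sd = 6000)

m_years   <- lm(salary ~ years, data = demo)
m_courses <- lm(salary ~ courses, data = demo)
m_both    <- lm(salary ~ years + courses, data = demo)

# Each call compares a single-predictor model against the two-predictor model
cmp_years   <- anova(m_years, m_both)
cmp_courses <- anova(m_courses, m_both)

p_years   <- cmp_years[2, "Pr(>F)"]
p_courses <- cmp_courses[2, "Pr(>F)"]
print(c(vs_years = p_years, vs_courses = p_courses))
```

A small p value in a comparison means the model with both predictors fits significantly better than that single-predictor model.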

Run a standardized multiple regression model with both predictors. Do the confidence interval values differ from the corresponding unstandardized model?
Ans :- yes
model3.z = lm(scale(data$salary) ~ scale(data$years) + scale(data$courses))
confint(model3.z)
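A short sketch on simulated data (the `demo` data frame is an assumption, not the homework file) of why the interval values differ: scale() puts every variable in standard deviation units, so the coefficients and their confidence limits change, while the t statistics for the slopes stay the same.

```r
# Simulated data (values are invented)
set.seed(3)
n <- 40
demo <- data.frame(years = runif(n, 0, 20), courses = rpois(n, 4))
demo$salary <- 30000 + 4000 * demo$years + 1500 * demo$courses +
  rnorm(n, sd = 6000)

m_raw <- lm(salary ~ years + courses, data = demo)
m_z   <- lm(scale(salary) ~ scale(years) + scale(courses), data = demo)

print(confint(m_raw))  # limits in salary units per year / per course
print(confint(m_z))    # limits in standard deviation units

# Slope t statistics (intercept row dropped) are unaffected by rescaling
t_raw <- summary(m_raw)$coefficients[-1, "t value"]
t_z   <- summary(m_z)$coefficients[-1, "t value"]
print(all.equal(unname(t_raw), unname(t_z)))
```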

What function could you use to take a random subset of the data?
sample

Run the following command in R: set.seed(1). Now take a random subset of the original data so that N=15. Is the correlation coefficient between salary and years of experience in this sample higher or lower than in the whole data set?
Ans :- Lower
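One way to draw the subset, sketched on simulated data (the data frame `full` is made up; the homework would use the real file). sample() picks 15 row indices without replacement, and the same rows are then used for both variables. R 3.6.0 changed the default sampling algorithm, so set.seed(1) does not select the same rows on older and newer versions of R, and the comparison can come out differently.

```r
# Simulated stand-in for the full data set (values are invented)
set.seed(11)
full <- data.frame(years = runif(200, 0, 20))
full$salary <- 30000 + 4000 * full$years + rnorm(200, sd = 20000)

set.seed(1)                        # seed given in the question
rows <- sample(nrow(full), 15)     # 15 distinct row indices
sub  <- full[rows, ]

r_full <- cor(full$salary, full$years)
r_sub  <- cor(sub$salary, sub$years)
print(c(full = r_full, subset = r_sub))
```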

Wednesday, November 13, 2013

Stats with R - 6

Source DataSet :- https://d396qusza40orc.cloudfront.net/stats1%2Fdatafiles%2FStats1.13.HW.06.txt

data <- read.table("Stats1.13.HW.06.txt", header = TRUE)

In a model predicting salary, what is the unstandardized regression coefficient for years, assuming years is the only predictor variable in the model?

5638

data <- read.table("week6.txt", header = TRUE)

summary(model1 <- lm(data$salary ~ data$years))

In a model predicting salary, what is the 95% confidence interval for the unstandardized regression coefficient for years, assuming years is the only predictor variable in the model?
4930 6345

data <- read.table("week6.txt", header = TRUE)
summary(model1 <- lm(data$salary ~ data$years))
confint(model1)

In a model predicting salary, what is the unstandardized regression coefficient for years, assuming years and courses are both included as predictor variables in the model?
4807
summary(model2 <- lm(data$salary ~ data$years + data$courses))

In a model predicting salary, what is the 95% confidence interval for the unstandardized regression coefficient for years, assuming years and courses are both included as predictor variables in the model?
4140 5473
summary(model2 <- lm(data$salary ~ data$years + data$courses))
confint(model2)

What is the predicted difference in salary between Doctors and Lawyers assuming an equal and average number of years and courses?
9204
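No code was recorded for this question. A possible approach, sketched on simulated data: add the profession as a categorical predictor, so its coefficient is the predicted difference from the reference group with years and courses held equal. The column name `profession`, its levels, and all values below are assumptions for illustration, not taken from the homework file.

```r
# Simulated data; 'profession' is a hypothetical column name
set.seed(5)
n <- 90
demo <- data.frame(
  years = runif(n, 0, 20),
  courses = rpois(n, 4),
  profession = factor(rep(c("Doctor", "Lawyer", "Teacher"), each = n / 3))
)
shift <- c(Doctor = 9000, Lawyer = 0, Teacher = -5000)
demo$salary <- 30000 + 4000 * demo$years + 1500 * demo$courses +
  unname(shift[as.character(demo$profession)]) + rnorm(n, sd = 3000)

# Make Lawyer the reference level so the Doctor coefficient is Doctor - Lawyer
demo$profession <- relevel(demo$profession, ref = "Lawyer")
m <- lm(salary ~ years + courses + profession, data = demo)
diff_coef <- unname(coef(m)["professionDoctor"])

# Same difference from predictions at average years and courses
at_mean <- data.frame(
  years = mean(demo$years),
  courses = mean(demo$courses),
  profession = factor(c("Doctor", "Lawyer"), levels = levels(demo$profession))
)
pred <- predict(m, newdata = at_mean)
diff_pred <- unname(pred[1] - pred[2])
print(c(coefficient = diff_coef, from_predictions = diff_pred))
```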

Sunday, November 10, 2013

Stats with R - 5

1. Which of the following are characteristics of experimental research?
a. Random sampling from a population
b. Random assignment to treatment conditions
Both a and b

2. The distribution of household income in the United States, currently, is:
Positively skewed

3. When distributions are skewed, the most accurate measure of central tendency is:
The median

4. Given a distribution of scores, the average of the squared deviation scores is equal to:
The variance
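A small worked example with made-up scores. Dividing the sum of squared deviations by N gives the variance as defined in the question; R's var() divides by N - 1 instead, so the two numbers differ slightly.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)     # made-up scores, mean = 5

dev <- x - mean(x)                  # deviation scores
ss  <- sum(dev^2)                   # sum of squares: 32

var_pop  <- ss / length(x)          # average squared deviation: 4
var_samp <- ss / (length(x) - 1)    # what var(x) returns

print(c(ss = ss, var_pop = var_pop, var_samp = var_samp, var_R = var(x)))
```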

5. Complete the following syllogism: SS is to SD as SP is to:
Correlation
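The link between the sum of cross products (SP) and the correlation can be checked by hand on a few made-up values: r is SP divided by the square root of the product of the two sums of squares.

```r
x <- c(1, 2, 3, 4, 5)   # made-up values
y <- c(2, 1, 4, 3, 5)

sp   <- sum((x - mean(x)) * (y - mean(y)))  # sum of cross products: 8
ss_x <- sum((x - mean(x))^2)                # 10
ss_y <- sum((y - mean(y))^2)                # 10

r_manual <- sp / sqrt(ss_x * ss_y)          # 0.8
print(c(manual = r_manual, cor = cor(x, y)))
```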

6. Pearson’s product moment correlation coefficient (r) is used when X and Y are:
Both continuous variables

7. Which of the following pairs of variables is most likely to be negatively correlated?
Hours watching TV per week and college GPA

8. Systematic measurement error represents:
bias

9. We all know that correlation does not imply causation but correlations are useful because they can be used to assess:
Reliability, Validity, Prediction errors

10. In a regression analysis, which distribution will have the largest standard deviation?
the observed scores on the outcome variable, Y

11. The difference between an observed score and a predicted score in a regression analysis is known as:
Residual

12. In a simple regression analysis with outcome variable Y, the standardized regression coefficient for X will always equal:
The correlation coefficient
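This is easy to confirm on simulated data (the values below are invented): with one predictor, regressing scale(y) on scale(x) gives a slope equal to cor(x, y).

```r
set.seed(2)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Slope from the standardized simple regression
beta <- unname(coef(lm(scale(y) ~ scale(x)))[2])

print(c(beta = beta, r = cor(x, y)))
print(all.equal(beta, cor(x, y)))
```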

13. If the regression line in a scatterplot is horizontal then what is the regression coefficient?
0

14. In a regression analysis, if the residuals are correlated with X then what assumption has most likely been violated?
homoscedasticity assumption

15. When converting from an unstandardized to a standardized multiple regression analysis which of the following values will change?
regression coefficients

16. In multiple regression what is the difference between R and R^2?
R is the correlation between predicted and observed scores whereas R^2 is the percent of the variance in Y that can be explained by the regression model
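A quick check of both definitions on simulated data (variable names and values are made up): R is computed as the correlation between fitted and observed scores, and its square matches the R^2 reported by summary().

```r
set.seed(9)
n  <- 80
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.6 * x1 - 0.4 * x2 + rnorm(n)

m  <- lm(y ~ x1 + x2)

R  <- cor(fitted(m), y)        # multiple correlation
R2 <- summary(m)$r.squared     # proportion of variance in y explained

print(c(R = R, R_squared = R^2, summary_R2 = R2))
```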

17. In the faculty salary example, Ŷ = 46,910 + (1,382)X1 + (502)X2 – (3,484)X3, where X1 = years since graduation, X2 = publications, and X3 = gender (male coded as 0 and female coded as 1). According to this model, the predicted salary for a male faculty member who just graduated (years = 0), with zero publications, is:
Ans : $46,910

18. In the faculty salary example the actual difference in average salary between men and women was NOT = $3,484. $3,484 is:
The predicted difference between male and female faculty who are average in years since they graduated and have an average number of publications
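The faculty salary equation can be written as a small function to see where both answers come from. The function name and argument names are my own; the coefficients are the ones given in the question.

```r
# Coefficients from the faculty salary equation in the question
predict_salary <- function(years, pubs, female) {
  46910 + 1382 * years + 502 * pubs - 3484 * female
}

# Male (female = 0), just graduated, no publications: the intercept
base <- predict_salary(years = 0, pubs = 0, female = 0)

# Male minus female at the same years and publications: always 3484
gap <- predict_salary(10, 5, female = 0) - predict_salary(10, 5, female = 1)

print(c(base = base, gap = gap))
```

The gap is the same at any shared value of years and publications, which is why it is a predicted difference for otherwise equal faculty and not the raw difference in group means.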

19. In multiple regression analysis, the null hypothesis assumes that the unstandardized regression coefficient, B, is zero. The standard error of the regression coefficient depends on:
Sample size, Sum of Squared Residuals, and the number of other predictor variables in the regression model
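For the one-predictor case the dependence can be checked directly on simulated data (values are invented): the standard error of B is the square root of the residual sum of squares divided by its degrees of freedom (n - 2), divided by the sum of squares of X. With k predictors the residual degrees of freedom become n - k - 1, which is where the number of other predictors comes in.

```r
set.seed(4)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

m <- lm(y ~ x)

ss_res <- sum(resid(m)^2)          # sum of squared residuals
ss_x   <- sum((x - mean(x))^2)     # sum of squares of the predictor

se_manual <- sqrt((ss_res / (n - 2)) / ss_x)
se_lm     <- summary(m)$coefficients["x", "Std. Error"]

print(c(manual = se_manual, lm = se_lm))
```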

20. When conducting a null hypothesis significance test, the p value represents:
The probability of the data given the null hypothesis is true

21. Use the R output above to answer the 5 questions below. The R output is from a quick analysis conducted on data collected at Columbia University and demonstrates a slight positive correlation between overall SAT score (sat) and proportion of items recalled on a working memory span task (span1). What is the unstandardized regression coefficient for working memory span in the regression equation predicting SAT?
300.9

22. R output. What is the predicted SAT score for a student who scored .50 on the working memory span task (round to a possible SAT score, for example, 2400 is a possible score, 2399.56 is not)?
2000

23. R output. What percentage of variance in SAT is explained by working memory span?
3

24. R output. What is the standard error of the sampling distribution of unstandardized regression coefficients?
211.2