Pages

Tuesday, May 7, 2013

SAS interview questions

Following are the most frequent and favourite questions asked by SAS interviewers: 

1.What SAS statements would you code to read an external raw data file to a DATA step? 
2.How do you read in the variables that you need? 
3.Are you familiar with special input delimiters? How are they used? 
4.If reading a variable length file with fixed input, how would you prevent SAS from reading the next record if the last variable didn’t have a value? 
5.What is the difference between an informat and a format? Name three informats or formats. 
6.Name and describe three SAS functions that you have used, if any? 
7.How would you code the criteria to restrict the output to be produced? 
8.What is the purpose of the trailing @? The @@? How would you use them? 
9.Under what circumstances would you code a SELECT construct instead of IF statements? 
10.What statement do you code to tell SAS that it is to write to an external file? What statement do you code to write the record to the file? 
11.If reading an external file to produce an external file, what is the shortcut to write that record without coding every single variable on the record? 
12.If you don’t want any SAS data set output from a DATA step, how would you code the DATA statement to prevent SAS from producing one? 
13.What is the one statement to set the criteria of data that can be coded in any step? 
14.Have you ever linked SAS code? If so, describe the link and any required statements used to either process the code or the step itself. 
15.How would you include common or reuse code to be processed along with your statements? 
16.When looking for data contained in a character string of 150 bytes, which function is the best to locate that data: scan, index, or indexc? 
17.If you have a data set that contains 100 variables, but you need only five of those, what is the code to force SAS to use only those variables? 
18.Code a PROC SORT on a data set containing State, District and County as the primary variables, along with several numeric variables. 
19.How would you delete duplicate observations? 
20.How would you delete observations with duplicate keys? 
21.How would you code a merge that will keep only the observations that have matches from both data sets? 
22.How would you code a merge that will write the matches of both to one data set, the non-matches from the left-most data set to a second data set, and the non-matches of the right-most data set to a third data set? 
23.What is the Program Data Vector (PDV)? What are its functions? 
24.Does SAS ‘Translate’ (compile) or does it ‘Interpret’? Explain. 
25.At compile time when a SAS data set is read, what items are created? 
26.Name statements that are recognized at compile time only? 
27.Identify statements whose placement in the DATA step is critical.

Sunday, May 5, 2013

Winners


The reason most people never reach their goals is that they don't define them, learn about them, or even seriously consider them as believable or achievable. Winners can tell you where they are going, what they plan to do along the way, and who will be sharing the adventure with them.

Big Data History


E-commerce, in particular, has exploded data management challenges along three
dimensions: volume, velocity and variety.

On Volume: 
The lower cost of e-channels enables an enterprise to offer its goods or services to more
individuals or trading partners, and up to 10x the quantity of data about an individual
transaction may be collected, thereby increasing the overall volume of data to be
managed.

On Velocity:
E-commerce has also increased point-of-interaction (POI) speed and, consequently, the
pace of data used to support interactions and generated by interactions.

On Variety: 
Through 2003/04, no greater barrier to effective data management will exist than the
variety of incompatible data formats, non-aligned data structures, and inconsistent data
semantics.

Introduction to Big Data



Big Data 

“Big Data is any data that is expensive to manage 
and hard to extract value from.” 


Big Data Now 

“…the necessity of grappling with Big Data, and 
the desirability of unlocking the information hidden 
within it, is now a key theme in all the sciences – 
arguably the key scientific theme of our times.” 




Big Data: Three challenges

Volume
– the size of the data

Velocity 
– the latency of data processing relative to
the growing demand for interactivity

Variety 
– the diversity of sources, formats, quality,
structures




Friday, March 15, 2013

Stats with R - Questions

The correct answer choices are marked in blue:

1.When a distribution is extremely skewed, the best measure of central tendency is typically the:
  • mean
  • median 
  • standard deviation
  • standard error

2.According to the central limit theorem, the shape of the distribution of sample means is always:
  • positively skewed
  • uniform
  • normal 
  • negatively skewed


3.The standard deviation of the distribution of sample means is called:
  • percent error
  • special error
  • standard error 
  • likely error

4.Systematic measurement error represents:
  • bias 
  • chance error
  • outliers
  • covariance

5.What value is expected for the t statistic if the null hypothesis is true?
  • 1
  • 0
  • 2
  • 1.96

6.What happens to the t-distribution as sample size increases?
  • The distribution appears more and more like a normal distribution
  • The distribution appears less and less like a normal distribution
  • The distribution becomes uniform
  • The distribution is unaffected

7.Degrees of freedom (df) for the single sample t-test is equal to:
  • N
  • N + 1
  • N - 1
  • the square root of N

8.In an independent t-test, what is the standard error of the difference?
  • the standard deviation of the distribution of sample means
  • the pooled standard deviation
  • the standard deviation of the distribution of sample mean differences
  • the standard deviation of the sample means

9.How many subjects were included in an independent samples t-test if a researcher reports t(20) = 3.68?
  • 20
  • 19
  • 22 
  • 18

10.In a factorial ANOVA, a significant interaction effect indicates that:
  • the influence of one IV is the same for each level of the other IV
  • the influence of one IV is not the same for each level of the other IV
  • the influence of one IV is different from the influence of the other IV
  • the dependent variable differs depending on the level of an independent variable (IV)

11.In any factorial ANOVA, if FAxB is significant, then which of the following is true?
  • FA is significant
  • FB is significant
  • Both FA and FB are significant
  • None of the above

12. The General Linear Model assumes that the relationships between predictor variables (X’s) and the outcome variable (Y) are:
  • Linear
  • Additive
  • Linear and additive 
  • None of the above

13.In null hypothesis significance testing, if the probability of a Type II error is .20 and alpha = .04, then what is power?
  • .20
  • .40
  • .60
  • .80

14. If the 68% confidence interval for the regression coefficient for predictor X does not contain zero, can you safely assume that the effect of X on an outcome variable Y will be significant with alpha = .05?
  • No

15.In a standard regression analysis, if the unstandardized regression coefficient is 2 and the standard
error of the regression coefficient is 4 then what is the corresponding t-value?
  • t = 0.25
  • t = 0.50 
  • t = 0.75
  • t = 1

16.Pearson’s product moment correlation coefficient (r) is used when X and Y are:
  • Both nominal variables
  • Both categorical variables
  • Both dichotomous variables
  • Both continuous variables

17. In a regression analysis, which distribution will have the largest standard deviation?
  • the residuals
  • the predicted scores on the outcome variable
  • the standardized regression coefficients
  • the observed scores on the outcome variable

18. In multiple regression analysis, the null hypothesis assumes that the unstandardized regression
coefficient, B, is zero. The standard error of the regression coefficient depends on:
  • Sample size
  • Sample size and the Sum of Squared Residuals
  • Sample size, Sum of Squared Residuals, and the number of other predictor variables in the regression model
  • Sample size, Sum of Squared Residuals, the number of other predictor variables in the regression model, and the p-value

19. Which of the following r-values indicates the strongest relationship between two variables?
  • +.65
  • -.89 
  • +.10
  • -.10

20.What is the slope in the following regression equation? Y = 2.69X – 3.92
  • 2.69 
  • -2.69
  • 3.92
  • -3.92

21. When we square the correlation coefficient to produce r2, the result is equal to the:
  • proportion of variance in Y not accounted for by X
  • proportion of variance in Y accounted for by X 
  • sum of squared residuals
  • standard error

22.When testing for mediation, what significance test would you conduct to further strengthen your
argument for (or against) mediation?
  • Tukey’s HSD
  • Fisher’s Exact Test
  • Sobel Test 
  • Bonferroni correction

23.Why is it preferable to conduct a one-way ANOVA to compare multiple sample means rather than a series of independent t-tests?
  • More powerful error term
  • Reduces the likelihood of Type I error
  • Both a and b 
  • None of the above

24.The sphericity assumption assumes:
  • homogeneity of variance
  • homogeneity of covariance
  • both a and b 
  • none of the above


25.Suppose a dependent t-test is conducted to compare scores in an experiment with a pre/post design (like the working memory training example). Suppose two experiments were conducted: A and B. In Experiment A the correlation between pre and post scores was high. In Experiment B the correlation between pre and post scores was low. Which Experiment would have a lower standard error, and why?
  • Experiment A would have a lower standard error because the variance associated with subjects is more systematic.
  • Experiment B would have a lower standard error because the variance associated with subjects is more systematic.
  • Experiment A would have a lower standard error because the variance associated with subjects is less systematic.
  • Experiment B would have a lower standard error because the variance associated with subjects is less systematic.
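Several of the numeric answers above follow from one-line formulas, and can be checked with quick arithmetic. A minimal Python sketch (values taken from questions 9, 13, 15 and 19):

```python
# Q9: an independent samples t-test reports t(20), i.e. df = 20.
# For two independent groups, df = N - 2, so N = df + 2.
df = 20
n_subjects = df + 2  # 22

# Q13: power = 1 - P(Type II error) = 1 - beta.
beta = 0.20
power = 1 - beta  # 0.80

# Q15: t = unstandardized coefficient / its standard error.
b, se = 2, 4
t_value = b / se  # 0.5

# Q19: the strongest relationship is the largest r in absolute value.
r_values = [0.65, -0.89, 0.10, -0.10]
strongest = max(r_values, key=abs)  # -0.89

print(n_subjects, power, t_value, strongest)
```

Note that df = N - 2 applies to the independent samples design of question 9; for the single sample t-test (question 7), df = N - 1.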

Monday, March 4, 2013

OLAP, MOLAP, DOLAP, ROLAP

OLAP - On-Line Analytical Processing.
Designates a category of applications and technologies that allow the collection, storage, manipulation and reproduction of multidimensional data, with the goal of analysis.


MOLAP - Multidimensional OLAP.
This term more specifically designates a Cartesian (cube) data structure. In effect, MOLAP contrasts with ROLAP: in the former, joins between tables are precomputed, which enhances performance; in the latter, joins are computed at request time.
Targeted at groups of users because it's a shared environment. Data is stored in an exclusive server-based format. It performs more complex analysis of data.

DOLAP - Desktop OLAP.
Small OLAP products for local multidimensional analysis. There can be a mini multidimensional database (using Personal Express), or an extraction of a datacube (using Business Objects).
Designed for low-end, single, departmental user. Data is stored in cubes on the desktop. It's like having your own spreadsheet. Since the data is local, end users don't have to worry about performance hits against the server.

ROLAP - Relational OLAP.
Designates one or several star schemas stored in relational databases. This technology permits multidimensional analysis with data stored in relational databases.
Used for large departments or groups because it supports large amounts of data and users.

HOLAP - Hybrid OLAP.
A hybridization of OLAP, which can include any of the above.

Monday, February 11, 2013

Useful Linux commands

Remove the header from a file:
Delete the 1st line:  sed '1d' file-name
or:  tail -n +2 file-name

Delete the 10th row from a file:
sed '10d' file-name

Delete lines 5 to 10:
sed '5,10d' file-name