
Friday, March 15, 2013

Stats with R - Questions

The correct answer to each question is marked with (correct):

1. When a distribution is extremely skewed, the best measure of central tendency is typically the:
  • mean
  • median (correct)
  • standard deviation
  • standard error

2. According to the central limit theorem, the shape of the distribution of sample means is always:
  • positively skewed
  • uniform
  • normal (correct)
  • negatively skewed


3. The standard deviation of the distribution of sample means is called:
  • percent error
  • special error
  • standard error (correct)
  • likely error

4. Systematic measurement error represents:
  • bias (correct)
  • chance error
  • outliers
  • covariance

5. What value is expected for the t statistic if the null hypothesis is true?
  • 1
  • 0 (correct)
  • 2
  • 1.96

6. What happens to the t-distribution as sample size increases?
  • The distribution appears more and more like a normal distribution (correct)
  • The distribution appears less and less like a normal distribution
  • The distribution becomes uniform
  • The distribution is unaffected

7. Degrees of freedom (df) for the single sample t-test is equal to:
  • N
  • N + 1
  • N - 1 (correct)
  • the square root of N

8. In an independent t-test, what is the standard error of the difference?
  • the standard deviation of the distribution of sample means
  • the pooled standard deviation
  • the standard deviation of the distribution of sample mean differences (correct)
  • the standard deviation of the sample means

9. How many subjects were included in an independent samples t-test if a researcher reports t(20) = 3.68?
  • 20
  • 19
  • 22 (correct)
  • 18
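Worked check: for an independent samples t-test, df = N1 + N2 − 2, so N = 20 + 2 = 22.
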
10. In a factorial ANOVA, a significant interaction effect indicates that:
  • the influence of one IV is the same for each level of the other IV
  • the influence of one IV is not the same for each level of the other IV (correct)
  • the influence of one IV is different from the influence of the other IV
  • the dependent variable differs depending on the level of an independent variable (IV)

11. In any factorial ANOVA, if F(A×B) is significant, then which of the following is true?
  • F(A) is significant
  • F(B) is significant
  • Both F(A) and F(B) are significant
  • None of the above (correct)

12. The General Linear Model assumes that the relationships between predictor variables (X’s) and the outcome variable (Y) are:
  • Linear
  • Additive
  • Linear and additive (correct)
  • None of the above

13. In null hypothesis significance testing, if the probability of a Type II error is .20 and alpha = .04, then what is power?
  • .20
  • .40
  • .60
  • .80 (correct)
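Worked check: power = 1 − P(Type II error) = 1 − .20 = .80; alpha does not enter this calculation.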

14. If the 68% confidence interval for the regression coefficient for predictor X does not contain zero, can you safely assume that the effect of X on an outcome variable Y will be significant with alpha = .05?
  • No (correct)
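A 68% confidence interval spans only about ±1 standard error around the estimate, while significance at alpha = .05 requires the estimate to lie roughly ±1.96 standard errors from zero, so a 68% interval that excludes zero is not sufficient.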

15. In a standard regression analysis, if the unstandardized regression coefficient is 2 and the standard error of the regression coefficient is 4, then what is the corresponding t-value?
  • t = 0.25
  • t = 0.50 (correct)
  • t = 0.75
  • t = 1
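Worked check: t = B / SE(B) = 2 / 4 = 0.50.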

16. Pearson’s product moment correlation coefficient (r) is used when X and Y are:
  • Both nominal variables
  • Both categorical variables
  • Both dichotomous variables
  • Both continuous variables (correct)

17. In a regression analysis, which distribution will have the largest standard deviation?
  • the residuals
  • the predicted scores on the outcome variable
  • the standardized regression coefficients
  • the observed scores on the outcome variable (correct)
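Because each observed score is the sum of a predicted score and a residual, and the predicted scores and residuals are uncorrelated, the variance of the observed scores equals the variance of the predicted scores plus the variance of the residuals, so the observed scores have the largest standard deviation.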

18. In multiple regression analysis, the null hypothesis assumes that the unstandardized regression coefficient, B, is zero. The standard error of the regression coefficient depends on:
  • Sample size
  • Sample size and the Sum of Squared Residuals
  • Sample size, Sum of Squared Residuals, and the number of other predictor variables in the regression model (correct)
  • Sample size, Sum of Squared Residuals, the number of other predictor variables in the regression model, and the p-value

19. Which of the following r-values indicates the strongest relationship between two variables?
  • +.65
  • -.89 (correct)
  • +.10
  • -.10

20. What is the slope in the following regression equation? Y = 2.69X – 3.92
  • 2.69 (correct)
  • -2.69
  • 3.92
  • -3.92

21. When we square the correlation coefficient to produce r², the result is equal to the:
  • proportion of variance in Y not accounted for by X
  • proportion of variance in Y accounted for by X (correct)
  • sum of squared residuals
  • standard error

22. When testing for mediation, what significance test would you conduct to further strengthen your argument for (or against) mediation?
  • Tukey’s HSD
  • Fisher’s Exact Test
  • Sobel Test (correct)
  • Bonferroni correction

23. Why is it preferable to conduct a one-way ANOVA to compare multiple sample means rather than a series of independent t-tests?
  • More powerful error term
  • Reduces the likelihood of Type I error
  • Both a and b (correct)
  • None of the above

24. The sphericity assumption assumes:
  • homogeneity of variance
  • homogeneity of covariance
  • both a and b (correct)
  • none of the above


25. Suppose a dependent t-test is conducted to compare scores in an experiment with a pre/post design (like the working memory training example). Suppose two experiments were conducted: A and B. In Experiment A the correlation between pre and post scores was high. In Experiment B the correlation between pre and post scores was low. Which experiment would have a lower standard error, and why?
  • Experiment A would have a lower standard error because the variance associated with subjects is more systematic. (correct)
  • Experiment B would have a lower standard error because the variance associated with subjects is more systematic.
  • Experiment A would have a lower standard error because the variance associated with subjects is less systematic.
  • Experiment B would have a lower standard error because the variance associated with subjects is less systematic.
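Why: the dependent t-test works on difference scores, and Var(post − pre) = Var(post) + Var(pre) − 2·Cov(pre, post). A high pre/post correlation means subject differences are consistent (systematic) across the two measurements and largely cancel out of the differences, leaving less variance and therefore a smaller standard error.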

Monday, March 4, 2013

OLAP, MOLAP, DOLAP, ROLAP

OLAP - On-Line Analytical Processing.
Designates a category of applications and technologies that allow the collection, storage, manipulation and reproduction of multidimensional data, with the goal of analysis.


MOLAP - Multidimensional OLAP.
More specifically, this term designates a multidimensional (cube) data structure. In effect, MOLAP contrasts with ROLAP: in the former, the joins between tables are precomputed, which enhances performance; in the latter, joins are computed at query time.
Targeted at groups of users because it's a shared environment. Data is stored in a proprietary, server-based format. It supports more complex analysis of data.

DOLAP - Desktop OLAP.
Small OLAP products for local multidimensional analysis. There can be a mini multidimensional database (using Personal Express), or extraction of a datacube (using Business Objects).
Designed for the low-end, single, departmental user. Data is stored in cubes on the desktop. It's like having your own spreadsheet. Since the data is local, end users don't have to worry about performance hits against the server.

ROLAP - Relational OLAP.
Designates one or several star schemas stored in relational databases. This technology permits multidimensional analysis with data stored in relational databases.
Used for large departments or groups because it supports large amounts of data and users.

HOLAP - Hybrid OLAP.
A hybridization of OLAP, which can combine any of the above approaches.

Monday, February 11, 2013

Useful Linux commands

Remove the header (first line) from a file:
sed '1d' file-name
tail -n +2 file-name

Delete the 10th line from a file:
sed '10d' file-name

Delete lines 5 to 10:
sed '5,10d' file-name

Note that these commands print the result to standard output; the original file is not modified.

Tuesday, January 22, 2013

Basic Differences Between Proc MEANS and Proc SUMMARY in SAS

Proc SUMMARY and Proc MEANS are essentially the same procedure; both compute descriptive statistics. The main difference concerns the default type of output they produce. Proc MEANS by default prints output to the LISTING window (or another open destination), whereas Proc SUMMARY does not. Including the PRINT option on the Proc SUMMARY statement sends its results to the output window.
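When no printed output is wanted, Proc SUMMARY is typically paired with an OUTPUT statement that saves the statistics to a data set instead. A minimal sketch (the output data set and variable names below are just for illustration):
proc summary data = sashelp.shoes;
  var Sales;
  /* write the mean of Sales to a data set instead of printing it */
  output out = work.shoe_stats mean = avg_sales;
run;
proc print data = work.shoe_stats;
run;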
The second difference is reflected in the omission of the VAR statement. When all variables in the data set are character, the two procedures produce the same output: a simple count of observations. However, when some variables in the data set are numeric, Proc MEANS analyzes all numeric variables not listed in any of the other statements and produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum), while Proc SUMMARY still produces only the count of observations.
The following example, using the SASHELP data set SHOES, illustrates this difference.
proc means data = sashelp.shoes;
run;
proc summary data = sashelp.shoes print;
run;
Inclusion of a VAR statement in both Proc MEANS and Proc SUMMARY produces output that contains exactly the same default statistics.
The following example, again using the SASHELP data set SHOES, illustrates this similarity.
proc means data = sashelp.shoes;
  class product;
  var Returns;
run;
proc summary data = sashelp.shoes print;
  class product;
  var Returns;
run;

Sunday, December 9, 2012

How do I read/write Excel files in SAS?


Reading an Excel file into SAS

Suppose that you have an Excel spreadsheet called auto.xls. The data for this spreadsheet are shown below.
MAKE           MPG  WEIGHT PRICE
AMC Concord    22   2930  4099
AMC Pacer      17   3350  4749
AMC Spirit     22   2640  3799
Buick Century  20   3250  4816
Buick Electra  15   4080  7827
Using the Import Wizard is an easy way to import data into SAS. The Import Wizard can be found on the File drop-down menu. Although the Import Wizard is easy, it can be time-consuming if used repeatedly. The very last screen of the Import Wizard gives you the option to save the statements SAS uses to import the data so that they can be used again. The following is an example that uses common options; a short verification step after the option descriptions shows that the file was imported correctly.
PROC IMPORT OUT= WORK.auto1 DATAFILE= "C:\auto.xls" 
            DBMS=xls REPLACE;
     SHEET="auto1"; 
     GETNAMES=YES;
RUN;
  • The out= option in the proc import tells SAS what the name should be for the newly-created SAS data file and where to store the data set once it is imported. 
  • Next the datafile= option tells SAS where to find the file we want to import. 
  • The dbms= option is used to identify the type of file being imported. 
  • The replace option will overwrite an existing file.
  • To specify which sheet SAS should import use the sheet="sheetname" statement.  The default is for SAS to read the first sheet.  Note that sheet names can only be 31 characters long.
  • The getnames=yes option is the default setting, and SAS will automatically use the first row of data as variable names. If the first row of your sheet does not contain variable names, use the getnames=no option.
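To verify the import, a quick PROC PRINT shows the first few observations of the new data set:
proc print data = work.auto1 (obs=5);
run;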

Writing Excel files out from SAS

It is very easy to write out an Excel file using proc export in SAS.
Here is a sample program that writes out a SAS data set called mydata to an Excel file called mydata.xls in the root of the C: drive.
proc export data=mydata outfile='c:\mydata.xls' dbms = xls replace;
run;

Friday, August 3, 2012

OLTP and OLAP


What is OLTP?
OLTP is an abbreviation of On-Line Transaction Processing. This system is an application that modifies data the instant it receives it and has a large number of concurrent users.

What is OLAP?
OLAP is an abbreviation of On-Line Analytical Processing. This system is an application that collects, manages, processes and presents multidimensional data for analysis and management purposes.

What is the difference between OLTP and OLAP?
Data Source
OLTP: Operational data, taken from the original source of the data.
OLAP: Consolidated data, drawn from various sources.

Process Goal
OLTP: A snapshot of ongoing business processes; performs the fundamental, day-to-day business tasks.
OLAP: Multi-dimensional views of business activities, for planning and decision making.

Queries and Process Scripts
OLTP: Simple, quick-running queries run by users.
OLAP: Complex, long-running queries run by the system to update the aggregated data.

Database Design
OLTP: Normalized, small database. Speed is not an issue because the database is smaller, and normalization does not degrade performance. This adopts an entity-relationship (ER) model and an application-oriented database design.
OLAP: De-normalized, large database. Speed is an issue because the database is larger, and de-normalizing improves performance since there are fewer tables to scan while performing tasks. This adopts a star, snowflake or fact constellation model and a subject-oriented database design.

Describe the foreign key columns in the fact table and dimension tables.
Foreign keys of dimension tables are primary keys of entity tables.
Foreign keys of fact tables are primary keys of dimension tables.

Dimensional Modeling


What is Dimensional Modeling?
The dimensional data model involves two types of tables and is different from the 3rd normal form. The concept uses a Fact table, which contains the measurements of the business, and Dimension tables, which contain the context (the dimensions of calculation) for those measurements.

What is Fact table?
The fact table contains measurements of business processes, and it also contains the foreign keys for the dimension tables. For example, if your business process is “paper production”, then “average production of paper by one machine” or “weekly production of paper” would be considered measurements of the business process.

What is Dimension table?
A dimension table contains textual attributes of the measurements stored in the fact table. A dimension table is a collection of hierarchies, categories and logic that the user can employ to traverse the hierarchy nodes.
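To make the fact/dimension relationship concrete, here is a sketch of a star-schema query, written in SAS Proc SQL to match the rest of this blog. The paper_fact and machine_dim tables and their columns are hypothetical, invented only for illustration:
proc sql;
  /* the measurement (units_produced) comes from the fact table,
     the context (machine_name) comes from the dimension table */
  select d.machine_name,
         sum(f.units_produced) as total_units
  from paper_fact as f
       inner join machine_dim as d
       /* foreign key in the fact table = primary key of the dimension table */
       on f.machine_key = d.machine_key
  group by d.machine_name;
quit;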

What are the Different methods of loading Dimension tables?
There are two different ways to load data into dimension tables.
Conventional (slow):
All the constraints and keys are validated against the data before it is loaded; this way data integrity is maintained.
Direct (fast):
All the constraints and keys are disabled before the data is loaded. Once the data is loaded, it is validated against all the constraints and keys. If data is found invalid or dirty, it is not included in the index, and all future processes skip this data.