SAS, R, Analytics, Big Data and Me

Tuesday, October 1, 2013

Stats with R - 1

Source DataSet :- https://spark-public.s3.amazonaws.com/stats1/datafiles/Stats1.13.HW.02.txt

First I read the text file into R as a data frame named "impact" by running the following command :-

> impact <- read.table("Stats1.13.HW.02.txt",header = T)

This creates a data set with 96 observations and 4 variables in R.

1.How many rows of data are in the data file?
96
Make sure you read the data with the argument: header=TRUE Then: nrow(data) OR dim(data)

2.What is the name of the dependent variable?
command : names(data)
Ans : SR

3.What is the mean of SR across all subjects?
command : mean(data$SR) OR describe(data)

> mean(impact$SR)
[1] 12.65625

4.What is the variance of SR across all subjects?
Command : var(data$SR)

output :
> var(impact$SR)
[1] 6.54375
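As a quick check on what mean() and var() report, here is a minimal sketch on a made-up vector (the numbers are not from the homework file). It assumes only base R and shows that var() is the sample variance, i.e. it divides by n - 1 rather than n.

```r
# Toy scores, made up for illustration (not from the homework file)
x <- c(10, 12, 13, 15)

n <- length(x)
m <- sum(x) / n                  # mean computed by hand
v <- sum((x - m)^2) / (n - 1)    # sample variance: divide by n - 1

v_builtin <- var(x)              # built-in sample variance

print(c(mean_by_hand = m, mean_builtin = mean(x)))
print(c(var_by_hand = v, var_builtin = v_builtin))
```

The hand-computed values and the built-in results should agree, which is a handy sanity check before reporting a variance.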

5.What is the mean of SR for all subjects at pretest?
command :- pre = subset(data, data$time=="pre") THEN mean(pre$SR)

Output :
> pre<- subset(impact,impact[ ,3]== "pre")
> mean(pre$SR)
[1] 12.02083

6.What is the standard deviation of SR for all subjects at posttest?
command :- post = subset(data, data$time=="post") THEN sd(post$SR)

Output :
> sd(post$SR)
[1] 2.449128

7.What is the median of SR for all subjects at posttest?
describe(post$SR) AND median(post$SR)

Output:-
> describe(post$SR)
var n mean sd median trimmed mad min max range skew kurtosis se
1 1 48 13.29 2.45 13.5 13.28 2.22 9 19 10 0.06 -0.49 0.35
> median(post$SR)
[1] 13.5

8.Which group has the highest mean at posttest?
Ans :- DS
Command :- describeBy(post, post$condition)

Output :-

9.Which one best approximates a normal distribution?
Code (one histogram per group and time point) :-
pre.wm = subset(pre, pre$condition=="WM")
post.wm = subset(post, post$condition=="WM")
pre.pe = subset(pre, pre$condition=="PE")
post.pe = subset(post, post$condition=="PE")
pre.ds = subset(pre, pre$condition=="DS")
post.ds = subset(post, post$condition=="DS")
par(mfrow= c(2,3))
hist(pre.wm[,4])
hist(post.wm[,4])
hist(pre.pe[,4])
hist(post.pe[,4])
hist(pre.ds[,4])
hist(post.ds[,4])

10.Which group showed the biggest gains in SR?
Code :-
pre.wm = subset(pre, pre$condition=="WM")
post.wm = subset(post, post$condition=="WM")
pre.pe = subset(pre, pre$condition=="PE")
post.pe = subset(post, post$condition=="PE")
pre.ds = subset(pre, pre$condition=="DS")
post.ds = subset(post, post$condition=="DS")
> mean(post.wm$SR)-mean(pre.wm$SR)
[1] 1.3125
> mean(post.pe$SR)-mean(pre.pe$SR)
[1] 0.0625
> mean(post.ds$SR)-mean(pre.ds$SR)
[1] 2.4375

ANS :- DS
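The same comparison can be done without creating six separate subsets. The sketch below is only an illustration on made-up scores (the data frame "toy" and its values are assumptions, not the homework data); it uses tapply() to get the mean for every condition and time point, then subtracts pretest from posttest.

```r
# Made-up scores, only to show the mechanics (not the homework data)
toy <- data.frame(
  condition = rep(c("WM", "PE", "DS"), each = 4),
  time      = rep(c("pre", "pre", "post", "post"), times = 3),
  SR        = c(10, 12, 12, 13,   11, 13, 12, 12,   9, 11, 13, 14)
)

# Mean SR for every condition x time cell
cell_means <- tapply(toy$SR, list(toy$condition, toy$time), mean)

# Gain = posttest mean minus pretest mean, one value per condition
gains <- cell_means[, "post"] - cell_means[, "pre"]
print(gains)

# Name of the group with the largest gain
best <- names(which.max(gains))
print(best)
```

With the real data, replacing "toy" by "impact" should give the three differences computed above in one step.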

Basics of R

Two packages are used in R for descriptive statistics :
First install the following packages and then load them in R
install.packages("psych")
install.packages("sm")
library(psych)
library(sm)

Reading a text file in R:
Text <- read.table("file.txt",header = TRUE)
Make sure you read the data with the argument header=TRUE. R then treats the 1st line as the column names and starts reading the data from the second line.
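To see what the header argument changes, here is a small sketch that writes a made-up four-line file to a temporary location (the file contents are assumptions for illustration) and reads it back both ways.

```r
# Write a tiny made-up file so the example does not depend on the homework data
path <- tempfile(fileext = ".txt")
writeLines(c("subject condition time SR",
             "1 WM pre 10",
             "2 WM post 12",
             "3 PE pre 11"), path)

with_header    <- read.table(path, header = TRUE)   # first line becomes column names
without_header <- read.table(path, header = FALSE)  # first line is treated as data

print(names(with_header))
print(dim(with_header))      # rows and columns when the header is recognised
print(dim(without_header))   # one extra row: the header line is counted as data
```

Without header=TRUE the column names end up as an ordinary row, so the row count is off by one and the columns get default names.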

Command to find the number of rows in a data file :
nrow(data) OR dim(data)

Output :-
> dim(Text)
[1] 96 4

> nrow(Text)
[1] 96

Subsetting a dataset :

Command :-
New <- subset(impact,impact[ ,3]== "post")

Here we subset on the values of the 3rd column: only the rows whose 3rd column has the value "post" are kept and stored in the New dataset.
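A minimal sketch of the same idea on a made-up data frame (the object "toy" and its values are assumptions, shaped like the "impact" data). It shows three equivalent ways of writing the filter; referring to the column by name is usually easier to read than using its position.

```r
# Made-up rows shaped like the "impact" data (subject, condition, time, SR)
toy <- data.frame(
  subject   = 1:4,
  condition = c("WM", "WM", "PE", "PE"),
  time      = c("pre", "post", "pre", "post"),
  SR        = c(10, 12, 11, 13)
)

by_position <- subset(toy, toy[, 3] == "post")   # refer to the column by index
by_name     <- subset(toy, time == "post")       # same filter, column named directly
by_bracket  <- toy[toy$time == "post", ]         # base indexing without subset()

print(by_name)
```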

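To obtain descriptive statistics of a dataset :
Suppose we have a dataset named "impact". The describe function (from the psych package) gives summary statistics such as n, mean, sd, median, min and max for every variable of that dataset, e.g. describe(impact)

What is the utility of the describeBy function in R ?
The describeBy function (also from psych) gives the same summary statistics separately for each level of a categorical grouping variable, e.g. describeBy(post, post$condition)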
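If the psych package is not available, a rough base R stand-in for the group-wise summary can be built with tapply(). This is only a sketch on made-up posttest scores (the data frame "post_toy" is an assumption), and it reports just n, mean and sd rather than everything describeBy prints.

```r
# Made-up posttest scores, only for illustration
post_toy <- data.frame(
  condition = rep(c("WM", "PE", "DS"), each = 3),
  SR        = c(12, 13, 14,   11, 12, 13,   14, 15, 16)
)

# One summary value per condition for each statistic
group_n    <- tapply(post_toy$SR, post_toy$condition, length)
group_mean <- tapply(post_toy$SR, post_toy$condition, mean)
group_sd   <- tapply(post_toy$SR, post_toy$condition, sd)

# Collect the results in one table, one row per condition
group_stats <- data.frame(
  n    = as.vector(group_n),
  mean = as.vector(group_mean),
  sd   = as.vector(group_sd),
  row.names = names(group_mean)
)
print(group_stats)

# Condition with the highest mean
top_group <- rownames(group_stats)[which.max(group_stats$mean)]
print(top_group)
```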

Monday, September 30, 2013

How to get started with Hadoop

http://bigdatastudio.com/2013/07/26/how-to-get-started-with-hadoop/

http://bigdatastudio.com/2013/07/26/whats-the-easiest-way-to-learn-hadoop/

Wednesday, August 21, 2013

Data Warehousing Fundamentals

What is Data Warehousing?
A data warehouse is the main repository of an organization’s historical data, its corporate memory. It contains the raw material for management’s decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems (Ref:Wikipedia).
Data warehousing is the collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. A data warehouse is a repository of integrated information, available for queries and analysis.

What are fundamental stages of Data Warehousing?
Offline Operational Databases – Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system’s performance.
Offline Data Warehouse – Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.
Real Time Data Warehouse – Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
Integrated Data Warehouse – Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.
(Reference Wikipedia)

Wednesday, July 31, 2013

Basics of R 1

1.What should you type in the R console to install the "car" package?
install.packages("car")

2.Once you have this package installed, what should you type in the R console to load the ”car” package?
library(car)

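3.What should you type in the R console to check what packages you have installed and loaded on your computer?
search()
Note: search() lists the packages that are currently loaded (attached); library() with no arguments lists the packages that are installed.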

4.What should you type to get help about the “data.frame” function?
?data.frame

5.Create two vectors, the first one named "numbers" including all natural numbers from 1 to 10, and the second one named "words" containing the following series:"One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten". From these two vectors, create a dataframe "nw" with each vector as a separate column. What should you type to check the attributes of "nw"?
attributes(nw)

6.What command should you type to get R to return the number “8” from the dataframe "nw"?
nw[8,1]

7.What command should you type to get R to return the word “eight” from the dataframe "nw"?
nw[8,2]
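Questions 5 to 7 can be tried end to end with the sketch below. The stringsAsFactors = FALSE argument is my addition so that the words stay plain character strings on every R version; note that the stored value is "Eight" with a capital E, as in the vector defined in question 5.

```r
numbers <- 1:10
words <- c("One", "Two", "Three", "Four", "Five",
           "Six", "Seven", "Eight", "Nine", "Ten")

# stringsAsFactors = FALSE keeps the words as character strings
# (the default from R 4.0.0; older versions would convert them to factors)
nw <- data.frame(numbers, words, stringsAsFactors = FALSE)

print(attributes(nw))   # names, class and row.names of the data frame
print(nw[8, 1])         # row 8, column 1: the number
print(nw[8, 2])         # row 8, column 2: the word
```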

8.What should you type to create a matrix “a” comprising all natural numbers from 1 to 10, with 2 rows and 5 columns.
a=matrix(1:10,2,5)

9.Create a vector "x" comprising all natural numbers from 1 to 6 and another vector "y" comprising all natural numbers from 5 to 10. What should you type to combine them in a matrix of 2 rows and 6 columns?
rbind(x,y)

10.Create a vector "x" comprising all natural numbers from 1 to 6 and another vector "y" comprising all natural numbers from 5 to 10. What should you type to combine them in a matrix of 6 rows and 2 columns?
cbind(x,y)
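The three matrix answers above can be checked together with this short sketch; the object names by_rows and by_cols are only illustrative. matrix() fills column by column by default, rbind() stacks the vectors as rows and cbind() places them side by side as columns.

```r
a <- matrix(1:10, nrow = 2, ncol = 5)   # filled column by column

x <- 1:6
y <- 5:10

by_rows <- rbind(x, y)   # x and y become the two rows
by_cols <- cbind(x, y)   # x and y become the two columns

print(a)
print(dim(by_rows))
print(dim(by_cols))
```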

Wednesday, May 29, 2013

A Very Short History Of Data Science

http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/?goback=%2Egde_4332669_member_244682548

Wednesday, May 15, 2013

concepts on DLM, DSD

The contents of the SAS data set PERM.JAN_SALES are listed below:

VARIABLE NAME    TYPE
idnum            character variable
sales_date       numeric date value

A comma delimited raw data file needs to be created from the PERM.JAN_SALES data set. The SALES_DATE values need to be in a MMDDYY10 form.
Which one of the following SAS DATA steps correctly creates this raw data file?

A. libname perm 'SAS-data-library';
data _null_;
set perm.jan_sales;
file 'file-specification' dsd = ',';
put idnum sales_date : mmddyy10.;
run;

B. libname perm 'SAS-data-library';
data _null_;
set perm.jan_sales;
file 'file-specification' dlm = ',';
put idnum sales_date : mmddyy10.;
run;

C. libname perm 'SAS-data-library';
data _null_;
set perm.jan_sales;
file 'file-specification';
put idnum sales_date : mmddyy10. dlm = ',';
run;

D. libname perm 'SAS-data-library';
data _null_;
set perm.jan_sales;
file 'file-specification';
put idnum sales_date : mmddyy10. dsd = ',';
run;
The correct answer is: B

Concepts :-

DSD = ',' is invalid because DSD takes no value; its default delimiter is already a comma. If you use DSD alone it would work.

First, in the PUT statement, the $ sign is not needed for the character variable idnum.
Answer A is not correct because it writes dsd = ','. It would be correct if the option were written simply as: DSD
Answers C and D are not correct because they put the dlm and dsd options in the PUT statement. These are options of the FILE (and INFILE) statement.

Options for question 9 of "Stats with R - 1" (Which one best approximates a normal distribution?) :-
WM group at pretest
WM group at posttest
PE group at pretest
PE group at posttest
DS group at pretest
DS group at posttest
Ans :- WM group at posttest (Correct)
