---
title: "In-semester Exam"
output:
  html_document:
    code_folding: show
    css: null
    fig_caption: yes
    number_sections: no
    self_contained: yes
    theme: spacelab
    toc: yes
    toc_depth: 3
    toc_float: yes
subtitle: "Week 8"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, tidy = TRUE)
```

## SID: 

# Instructions

- The exam will run for 2 hours.
- Use the pre-filled R Markdown file from **Canvas** to help arrange your answers. You are welcome to use or ignore the code chunks I have inserted or add additional ones.
- Submit a final HTML file to **Turnitin** in Canvas to be marked.
- There are 3 questions. The first two are worth $40\%$ each, the third is worth $20\%$.
- The exam question sheet is 6 pages long.

\clearpage

# Question 1 ($40\%$)

### Question 1 has 3 parts.

## Part 1

```{r echo=TRUE, eval=FALSE}
melanoma = read.csv('https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/e21/melanoma2021.csv')
View(melanoma)
head(melanoma)
```

a. How many variables and observations are in the dataset?

**Answer:** 

```{r}
```

b. Comment on the class of these variables and how they are stored in R.

**Answer:** 

- X, status, sex, age, year and ulcer are integers, stored as 32-bit integer vectors.
- thickness is a numeric (double) vector, stored with 64-bit precision.
- location is a categorical variable; `read.csv` stores it as a factor, or as a character vector in R 4.0 and later unless `stringsAsFactors = TRUE` is set.

```{r}
```

c. Is there any missing data in this dataset?

**Answer:** 

- There are no missing values in the dataset.

```{r}
# Report whether any cell of the data frame is NA
if (anyNA(melanoma)) {
  print('Missing values present')
} else {
  print('No missing values')
}
```

## Part 2

The researchers would like to understand the relationships between tumour thickness and the other variables.

d. Test to see if tumour thickness is different between people with and without ulceration. Check your assumptions and make a conclusion using a significance threshold of 0.05.

**Answer:** 

- Thickness is most strongly associated with ulcer in the fitted linear model, where ulcer has a p-value of 2.89e-09, well below the 0.05 threshold.

```{r}
melfit <- lm(thickness ~ ulcer + status + sex + age + year + X + location, data = melanoma)
summary(melfit)
```

e. Test to see if tumour thickness is higher in females than males. Check your assumptions and make a conclusion using a significance threshold of 0.05.

**Answer:** 

- It is higher in males than in females.

```{r}
# Treat sex as a categorical variable
melanoma$sex <- factor(melanoma$sex)
# Side-by-side boxplots of thickness for each sex
boxplot(thickness ~ sex, data = melanoma)
```

f. Fit a two-way ANOVA model without an interaction effect to assess if ulceration and sex both have a relationship with tumour thickness when included in the same model. Be sure to comment on the assumptions needed for this test.

**Answer:** 

```{r}
# Two-way ANOVA without interaction: thickness explained by ulcer and sex
av <- aov(thickness ~ factor(ulcer) + factor(sex), data = melanoma)
summary(av)
```

g. Create an interaction plot with sex and ulceration as factors and thickness as the response. From this plot, is there evidence that there is an interaction effect? Why?

**Answer:** 

```{r}
# Mean thickness by sex, with one trace per ulceration status
interaction.plot(x.factor = melanoma$sex,
                 trace.factor = melanoma$ulcer,
                 response = melanoma$thickness)
```

h. Include an interaction term in the two-way ANOVA model to assess if ulceration and sex both have a relationship with tumour thickness. What is your conclusion?

**Answer:** 

```{r}
```

## Part 3

The researchers would like to test if age has an effect on tumour thickness.

i. If there was no relationship between tumour thickness and age, what value should the slope of an appropriate regression model be?

**Answer:** 

- The slope should be zero. In that case the test of the slope would be expected to give a p-value greater than 0.05.

j. Fit a regression model and then assess and comment on the model assumptions.

**Answer:** The regression model assumes that there is a linear relationship between the variables, and that the residuals are independent and normally distributed with constant variance.

```{r}
y <- melanoma$thickness
x <- melanoma$age
lfit <- lm(y ~ x)
summary(lfit)
```

k. 
What would you conclude from the test?

**Answer:** The p-value is 0.0141, which is less than 0.05, so there is evidence of a relationship between age and tumour thickness.

\clearpage

# Question 2 ($40\%$)

## Part 1

```{r echo = TRUE, eval=FALSE}
urine = read.csv('https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/e21/heartDiseaseV1.csv')
head(urine)
summary(urine)
```

From clinical experience the researchers believe that there may be a relationship between sex and whether an individual had heart disease.

a. What types of variables should sex and whether they had heart disease be?

**Answer:** 

- They should both be factors.

b. What is an appropriate statistical test that could be used by the researchers to test this question?

**Answer:** The best statistical test would be a chi-squared test of independence, to check for any statistical dependence.

c. What are the corresponding null and alternate hypotheses?

**Answer:** 

- The null hypothesis is that there is no statistical dependence between the variables, while the alternate hypothesis is that there is dependence.

d. Construct a contingency table using the variables *sex* and *disease*.

**Answer:** 

```{r}
tble <- table(urine$sex, urine$disease)
tble
```

e. Perform the appropriate test.

**Answer:** 

```{r}
p_val <- chisq.test(tble)
p_val
```

f. Using a significance threshold of 0.05 what would you conclude from this test?

**Answer:** 

- There is a dependency between sex and disease.

g. What were the assumptions for this test? Comment on them in the context of the observed data.

**Answer:** 

h. Report and interpret the corresponding odds ratio.

**Answer:** 

```{r}
```

## Part 2

Use logistic regression to model the chance of having heart disease. For the following do not check model assumptions:

i. Fit a logistic regression model that uses all of the variables to model heart disease. Comment on which variables appear to be informative.

**Answer:** 

```{r}
# Logistic regression (binomial GLM) of disease on the predictors
urfit <- glm(disease ~ sex + cp + age + trestbps + chol + fbs + restecg +
               thalach + exang + oldpeak + slope + ca,
             family = binomial, data = urine)
summary(urfit)
```

j. Use backwards step-wise variable selection to fit a simpler model. How many variables are included in the model?

**Answer:** 

```{r}
# Reduced logistic regression with the six retained predictors
urnfit <- glm(disease ~ sex + cp + trestbps + thalach + exang + ca,
              family = binomial, data = urine)
summary(urnfit)
```

k. When the step-wise variable selection was performed, which model fit criterion was used to decide whether a variable should be included or not? Feel free to use an acronym.

**Answer:** 

- I chose the first 6 variables with the lowest p-values, all below the 0.05 significance level.

l. From either one of the logistic regression models, calculate odds ratios and comment on the relationship between heart disease and the maximum heart rate achieved.

**Answer:** 

```{r}
```

\clearpage

# Question 3 ($20\%$)

### Question 3 has 2 parts

## Part 1

```{r echo = TRUE, eval=FALSE}
urine2 = read.csv('https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/e21/UrineDataV1.csv')
head(urine2)
summary(urine2)
```

a. How many variables and observations are in the dataset?

**Answer:** 

- There are 7 variables and 79 observations.

```{r}
```

b. Comment on the class of these variables and how they are stored in R.

**Answer:** 

- crystals, osmo and urea are stored as 32-bit integer vectors.
- gravity, ph, cond and calc are stored as 64-bit double (numeric) vectors.

```{r}
```

c. Use a visualization to check if the dataset has any missing data.

**Answer:** 

```{r}
# Count the missing values in each variable and plot the counts
barplot(colSums(is.na(urine2)), ylab = "Number of missing values")
```

d. In a sentence, explain why you would conclude that the data is either MCAR, MAR or MNAR?

**Answer:** 

e. Perform case deletion.

**Answer:** 

```{r}
```

## Part 2

f. Use hierarchical clustering to cluster the variables. Hint: you may need to use `t()`.

**Answer:** 

```{r}
```

g. Comment on why you did or did not decide to *scale* the data when performing the analysis.

**Answer:** 

h. How does this clustering inform the researchers' primary question?

**Answer:** 

i. 
Use k-means clustering to cluster the observations in the dataset. Use all of the variables except for `crystals` to cluster, and set a seed of `51773` before you cluster.

**Answer:** 

```{r}
```

j. Why is it advisable to set a seed?

**Answer:** 

k. Is there any evidence of a relationship between this k-means clustering and the formation of crystals?

**Answer:** 

```{r}
```

l. How does this clustering inform the researchers' primary question?

**Answer:** 
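
For parts e, f, i and k of Question 3, one possible workflow is sketched below. It is an illustration under stated assumptions, not a model answer: the data frame `sim` is simulated (random values with arbitrary means, only the column names are borrowed from the urine dataset), the three missing values are planted by hand, and the choice of two clusters is an assumption, since the exam does not fix the number of clusters.

```r
# Simulated stand-in for the urine data. The values are random and purely
# illustrative; only the column names follow the exam dataset.
set.seed(51773)
n <- 79
sim <- data.frame(
  crystals = rbinom(n, 1, 0.4),
  gravity  = rnorm(n, 1.018, 0.007),
  ph       = rnorm(n, 6, 0.7),
  osmo     = rnorm(n, 610, 240),
  cond     = rnorm(n, 21, 8),
  urea     = rnorm(n, 265, 130),
  calc     = rnorm(n, 4, 3)
)

# Plant a few missing values so that case deletion has something to remove
sim$osmo[c(3, 17)] <- NA
sim$cond[40] <- NA

# Case deletion: keep only the rows with no missing values
complete <- na.omit(sim)

# Clustering features: every variable except crystals, scaled so that
# variables on large scales (such as osmo) do not dominate the distances
features <- scale(complete[, setdiff(names(complete), "crystals")])

# Hierarchical clustering of the variables: t() puts the variables in rows,
# so dist() measures distances between variables rather than observations
hc <- hclust(dist(t(features)))
plot(hc)

# k-means on the observations. The starting centres are chosen at random,
# so the seed is set first to make the result reproducible.
# centers = 2 is an assumption made for this sketch.
set.seed(51773)
km <- kmeans(features, centers = 2)

# Cross-tabulate cluster membership against crystal formation
tab <- table(cluster = km$cluster, crystals = complete$crystals)
tab
```

With the real data, `sim` would be replaced by `urine2`, and a test of independence on `tab` could then be used to judge whether cluster membership and crystal formation are related.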