# Data Science – Essaylink

SESSION 2 FORMAL EXAMINATIONS – NOVEMBER 2020

EXAMINATION DETAILS:

Unit Code: | COMP2200/COMP6200 |

Unit Name: | Data Science |

Duration of exam: | 3 hours in a 6 hour window |

Total number of questions: | 8 |

Total number of pages: | 5 (incl. this cover sheet) |

Total number of marks: | 100 |

INSTRUCTIONS:

Answer ALL questions in a single word processor file and upload your answers to the provided Turnitin submission page by the due time. You can upload a Word or PDF file.

Collaboration with others in completing this exam is not allowed. The work you submit should be your own. Any evidence of copying or collusion will be referred to the Faculty Discipline Committee. Note that your submissions will be passed through Turnitin to identify copying from the Internet or from other students.

1. (10 marks) You are working as a Data Scientist in a big retail store, say Woolworths, and your task is to optimise various retail processes such as inventory management, product placement, and customised offers. Using the CRISP-DM model, can you explain what you will do in each stage of the data science project life cycle, what your input will be, and what you will deliver at each stage? (Write no more than 500 words in total)

2. The following graph[1] shows the relationship between US spending on science and the number of suicides (by hanging, strangulation, and suffocation). Based on this graph, answer the following questions.

(a) (5 marks) What does the correlation mean in this context? What does the $R^2$ value mean? (Write no more than 200 words in total)

(b) (5 marks) One of your friends, Mr. Citizen, thinks that this correlation is because of the increasing pressure on researchers to continuously produce output. How would you evaluate this explanation? Looking at the numbers in the data displayed, can you determine whether this explanation could account for the effect shown? (Write no more than 200 words in total)

3. (a) (5 marks) For the following data scenarios, which chart should you use to visualise the data? Justify your answers. (Write no more than 200 words in total)

(1) Bureau of Meteorology data having average monthly rainfall in Sydney from 2016 to 2020.

(2) Hospital data having systolic pressure and weight of 2000 patients.

(3) Australian Bureau of Statistics data having yearly household expenses (grocery, transport, education, rent/mortgage, and entertainment) for the Australian population.

(4) Australian Bureau of Statistics Census data showing population density for each suburb across New South Wales.

(5) Bureau of Meteorology weather data having multiple weather conditions in Sydney with features including date, precipitation, max temperature, min temperature, wind speed, and weather (drizzle, rain, sunny, snow, and fog).

(b) (5 marks) You are working on a project that analyses the census data provided by the Australian Bureau of Statistics. Table 1 shows a sample dataset. What data cleaning and normalisation techniques should you apply to this data so that you can apply unsupervised learning methods? (Write no more than 200 words in total)

[1] Data sources for the graph in Question 2: U.S. Office of Management and Budget and Centers for Disease Control & Prevention

Table 1: Sample Census dataset from Australian Bureau of Statistics

| Census Code | Suburb | State | Area sqkm |
|---|---|---|---|
| CED101 | Berowra | NSW | 78644.32 |
| CED101 | wentworthville | New South Wales | 89232.53645 |
| CED101 | north sydney | nsw | 10324.45 |
| CED101 | mt. druitt | | 10583.12 |
| CED105 | st. Kilda | Vic. | 8524.96762 |
| CED105 | South melb. | vic | 45321.87 |
| CED105 | gelong | Victoria | 24534.2534 |
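
One possible cleaning pass over rows like those in Table 1 is sketched below. The alias map `STATE_ALIASES`, the helper `clean_row`, and the choice to represent a missing state as `None` are assumptions made for illustration and are not part of the exam. Misspelled or abbreviated suburb names (for example `gelong` or `South melb.`) would still need a reference list of valid suburbs, which is not shown here.

```python
# Hypothetical alias map: each spelling of a state seen in the sample
# is mapped to a single canonical code.
STATE_ALIASES = {
    "nsw": "NSW",
    "new south wales": "NSW",
    "vic": "VIC",
    "victoria": "VIC",
}

def clean_row(code, suburb, state, area_sqkm):
    """Normalise one raw record into consistent formats."""
    key = state.strip().rstrip(".").lower()
    return {
        "census_code": code.strip().upper(),
        "suburb": suburb.strip().title(),         # consistent capitalisation
        "state": STATE_ALIASES.get(key),          # None marks a missing or unknown state
        "area_sqkm": round(float(area_sqkm), 2),  # consistent numeric precision
    }

# Three rows copied from the sample, with fields kept as raw strings.
rows = [
    ("CED101", "wentworthville", "New South Wales", "89232.53645"),
    ("CED101", "mt. druitt", "", "10583.12"),
    ("CED105", "st. Kilda", "Vic.", "8524.96762"),
]
cleaned = [clean_row(*r) for r in rows]
```

Records whose state comes back as `None` could then be imputed (for instance from other rows sharing the same census code) or dropped, depending on the analysis.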

4. (a) (5 marks) I have data on different laptops from different brands with features for weight (grams), size (cm), RAM (GB), Hard Drive (GB), Processor (Intel core i5, Intel core i7, Intel core i3, AMD Ryzen, AMD Athlon, etc.), and price (Australian Dollars). I want to cluster similar laptops based on their specifications. Discuss your approach to applying a clustering algorithm to this data. What transformations would be needed before you could work with this data and why? (Write no more than 200 words in total)
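
As a rough illustration of the kind of transformation the question is asking about, the sketch below rescales numeric columns and one-hot encodes the processor column so that a distance-based clustering algorithm sees comparable features. The three laptop records and the helper names `standardise` and `one_hot` are hypothetical, and whether price belongs among the clustering features is a modelling decision left open here.

```python
from statistics import mean, pstdev

# Hypothetical records: (weight g, size cm, RAM GB, disk GB, processor, price AUD)
laptops = [
    (1300, 33.8, 8, 256, "Intel core i5", 1200.0),
    (2100, 39.6, 16, 1000, "Intel core i7", 2100.0),
    (1800, 35.6, 4, 500, "AMD Ryzen", 900.0),
]

def standardise(column):
    """Rescale a numeric column to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

def one_hot(column):
    """Turn a categorical column into one 0/1 indicator per category."""
    categories = sorted(set(column))
    return [[1.0 if v == c else 0.0 for c in categories] for v in column]

numeric_cols = [standardise([row[i] for row in laptops]) for i in (0, 1, 2, 3, 5)]
processor_rows = one_hot([row[4] for row in laptops])

# One feature vector per laptop: scaled numerics followed by processor indicators.
features = [
    [col[r] for col in numeric_cols] + processor_rows[r]
    for r in range(len(laptops))
]
```

Without rescaling, a feature measured in grams or gigabytes would dominate Euclidean distances simply because of its larger numeric range.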

(b) (5 marks) You built a regression model to predict baby length based on mother’s height and mother’s age. After fitting the regression model on the training data, the model coefficients for mother’s height and mother’s age are [0.2539, -0.0075] and the intercept is 4.7623. What is your interpretation of these coefficient and intercept values? Can you work out how a change in each variable affects the baby’s length? (Write no more than 200 words in total)
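
A small sketch of how such a fitted model produces predictions, reading the coefficients as 0.2539 and -0.0075 and the intercept as 4.7623. The input values 165 and 30 are hypothetical, and the exam does not state the units of height or length, so none are assumed here.

```python
INTERCEPT = 4.7623
COEF_HEIGHT = 0.2539   # change in predicted length per unit of mother's height
COEF_AGE = -0.0075     # change in predicted length per unit of mother's age

def predict_length(height, age):
    """Linear model: intercept plus a weighted sum of the two predictors."""
    return INTERCEPT + COEF_HEIGHT * height + COEF_AGE * age

# Hypothetical inputs used only to compare predictions.
base = predict_length(165, 30)
taller = predict_length(166, 30)   # height + 1, age unchanged
older = predict_length(165, 31)    # age + 1, height unchanged
```

Holding the other variable fixed, `taller - base` equals the height coefficient and `older - base` equals the age coefficient, which is the usual reading of coefficients in a linear model.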

5. You plan to build a machine learning model to predict whether a patient in a hospital is “healthy” or “not healthy” based on the patient’s medical measurements. The dataset is highly imbalanced: “not healthy” individuals outnumber “healthy” individuals.

(a) (5 marks) To evaluate the performance of a trained model, you can create a confusion matrix comparing the predicted results with the class labels of the testing data. From the confusion matrix, you calculated the accuracy score. Explain why reporting the accuracy score on such a dataset is not indicative of the model’s true performance. What measures should you take to mitigate any inflated results? What other metrics can you derive from the confusion matrix that are truly indicative of the model’s robust performance? (Write no more than 200 words in total)
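
To make the accuracy issue concrete, the sketch below derives several scores from the four cells of a confusion matrix. The counts are hypothetical (a test set of 950 “not healthy” and 50 “healthy” patients, with “healthy” treated as the positive class), and the function name `metrics` is an assumption for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Derive common scores from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity, true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }

# Hypothetical model that predicts "not healthy" for almost every patient.
scores = metrics(tp=5, fp=5, fn=45, tn=945)
```

With these made-up counts the accuracy is high because the majority class is almost always right, while recall and F1 for the minority class stay low, which is the gap the question points at.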

(b) (5 marks) If the training data size is very big (e.g., 1 billion data instances) and the testing dataset has 1000 instances, which model do you prefer to use, a KNN (k-Nearest Neighbors) classifier or a Naïve Bayes classifier? Justify your answer. (Write no more than 200 words in total)

6. There is a robot in an animal shelter which needs to learn to discriminate between Dogs and Cats based on the fur and colour features. You are required to train the robot with classification models on the following dataset (Table 2) and make a prediction on a testing data instance. The feature Fur takes one of two possible values (Coarse and Fine), and Colour also takes one of two possible values (Brown and Black). For notational convenience, you can use X1 and X2 to represent the two features respectively, and Y to represent the prediction target during inference.


Table 2: Animal Data

| Index | Fur | Colour | Class |
|---|---|---|---|
| #1 | Coarse | Brown | Dog |
| #2 | Fine | Black | Cat |
| #3 | Coarse | Black | Cat |
| #4 | Coarse | Black | Dog |
| #5 | Fine | Brown | Cat |

(a) (5 marks) You are required to build a KNN (k-Nearest Neighbors) classification model and predict the class label for the following data instance (#6 in Table 3). You can randomly choose k from its possible value range to consider the k-nearest neighbors. The distance between two data instances is calculated as the number of features having different values. For example, the distance between the 1st and the 2nd data instances is 2 because they differ from each other on both features ‘Fur’ and ‘Colour’. Specify the value of k you will use, and show the details of learning and prediction.

Table 3: Testing Dataset

| Index | Fur | Colour | Class |
|---|---|---|---|
| #6 | Fine | Brown | |
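
The distance described in part (a), the number of features on which two instances differ, can be sketched as follows. The dictionary `training` is a reading of Table 2, and the helper names `distance` and `knn_predict` are hypothetical. Ties in distance are broken by table order here, which is an assumption the exam does not specify.

```python
from collections import Counter

# Table 2 read as index -> ((Fur, Colour), class).
training = {
    1: (("Coarse", "Brown"), "Dog"),
    2: (("Fine", "Black"), "Cat"),
    3: (("Coarse", "Black"), "Cat"),
    4: (("Coarse", "Black"), "Dog"),
    5: (("Fine", "Brown"), "Cat"),
}

def distance(a, b):
    """Number of features on which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(query, k):
    """Majority class among the k training instances closest to the query."""
    # sorted() is stable, so equally distant instances keep their table order.
    ranked = sorted(training.values(), key=lambda item: distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

query = ("Fine", "Brown")  # instance #6 from Table 3
distances_to_query = {i: distance(feat, query) for i, (feat, _) in training.items()}
```

Calling `distance(training[1][0], training[2][0])` gives 2, which agrees with the worked example in the question. An odd k avoids tied votes between the two classes.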

(b) (10 marks) You are required to build a Naïve Bayes classifier from the dataset and predict the class label for the data instance #6, using the Laplacian correction technique if the zero-probability issue occurs. Show the details of learning and prediction.
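
A minimal sketch of add-one (Laplace) smoothing on the Table 2 data, using exact fractions. The helper names `likelihood` and `score` are hypothetical. This version smooths every conditional probability and leaves the class priors unsmoothed, which is one common convention; a course may instead apply the correction only where a zero count occurs.

```python
from fractions import Fraction

# Table 2 read as (Fur, Colour, class).
data = [
    ("Coarse", "Brown", "Dog"),
    ("Fine", "Black", "Cat"),
    ("Coarse", "Black", "Cat"),
    ("Coarse", "Black", "Dog"),
    ("Fine", "Brown", "Cat"),
]

def likelihood(feature_index, value, label, alpha=1, n_values=2):
    """P(feature = value | class = label) with add-alpha smoothing.

    n_values is the number of distinct values the feature can take (2 here).
    """
    in_class = [row for row in data if row[2] == label]
    matches = sum(row[feature_index] == value for row in in_class)
    return Fraction(matches + alpha, len(in_class) + alpha * n_values)

def score(fur, colour, label):
    """Unnormalised posterior: prior times the two smoothed likelihoods."""
    prior = Fraction(sum(row[2] == label for row in data), len(data))
    return prior * likelihood(0, fur, label) * likelihood(1, colour, label)

dog_score = score("Fine", "Brown", "Dog")
cat_score = score("Fine", "Brown", "Cat")
```

No Dog in the table has Fine fur, so the unsmoothed estimate of P(Fur = Fine | Dog) would be zero and would wipe out the whole product; smoothing replaces it with (0 + 1) / (2 + 2).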

7. (a) (5 marks) The linear regression model can be regarded as a simple type of artificial neural network. From the perspective of artificial neural networks, what activation function corresponds to the linear regression model? Specify the mathematical form of the activation function. Is it a good idea to build multi-layer neural network models with this activation function? Justify your answer. (Write no more than 200 words in total)

(b) (10 marks) As the gradient descent method can be used to learn model parameters in neural network models, you can use it to estimate the parameters in a linear regression model. You are required to perform the initial steps of gradient descent on the following dataset (Table 4) to estimate the parameters $w_0$ and $w_1$ for the linear regression model $y = w_0 + w_1 x$. The sum of squared errors is used for the loss function. Concretely, you need to formulate the loss function $L(w_0, w_1)$ and derive its gradient $\nabla L(w_0, w_1) = \left(\frac{\partial L(w_0, w_1)}{\partial w_0}, \frac{\partial L(w_0, w_1)}{\partial w_1}\right)$. Then, pick a pair of values randomly to initialize $w_0$ and $w_1$, and evaluate the gradient at those values. Show the key steps of inference and calculation.

Table 4: 2-Dimensional Data

| Index | X | Y |
|---|---|---|
| #1 | 1 | 1 |
| #2 | 2 | 3 |
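
The sketch below evaluates a sum-of-squared-errors loss and its gradient for the model $y = w_0 + w_1 x$, reading Table 4 as the two points $(1, 1)$ and $(2, 3)$ (an assumption about the table layout). With $L(w_0, w_1) = \sum_i (y_i - w_0 - w_1 x_i)^2$, the partial derivatives are $-2\sum_i (y_i - w_0 - w_1 x_i)$ and $-2\sum_i x_i (y_i - w_0 - w_1 x_i)$. The function names and the finite-difference check are illustrative additions.

```python
# Table 4 read as the two points (x, y) = (1, 1) and (2, 3).
points = [(1.0, 1.0), (2.0, 3.0)]

def loss(w0, w1):
    """Sum of squared errors for y = w0 + w1 * x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in points)

def gradient(w0, w1):
    """Analytic gradient (dL/dw0, dL/dw1) of the loss above."""
    residuals = [(x, y - (w0 + w1 * x)) for x, y in points]
    d_w0 = -2.0 * sum(r for _, r in residuals)
    d_w1 = -2.0 * sum(x * r for x, r in residuals)
    return d_w0, d_w1

def numeric_gradient(w0, w1, h=1e-6):
    """Central finite differences, used only to check the analytic form."""
    d_w0 = (loss(w0 + h, w1) - loss(w0 - h, w1)) / (2 * h)
    d_w1 = (loss(w0, w1 + h) - loss(w0, w1 - h)) / (2 * h)
    return d_w0, d_w1

# Hypothetical initial values; the question lets you pick any pair.
g0, g1 = gradient(0.0, 0.0)
```

Comparing `gradient` with `numeric_gradient` at a few points is a quick way to catch sign or factor-of-two mistakes in a hand derivation.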


(c) (5 marks) Based on the gradient obtained in the above step, update the estimates for $w_0$ and $w_1$. Assume that the learning rate η is 0.5. Show the key steps of inference and calculation.
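
The update rule itself, $w \leftarrow w - \eta \nabla L(w)$, can be sketched in a few lines. The starting point $(0, 0)$ is a hypothetical choice, and the gradient $(-8, -14)$ is what a sum-of-squared-errors loss gives at that point if Table 4 is read as the points $(1, 1)$ and $(2, 3)$; both are assumptions for illustration.

```python
def gradient_step(w, grad, learning_rate=0.5):
    """One gradient descent update: w <- w - learning_rate * grad."""
    return tuple(wi - learning_rate * gi for wi, gi in zip(w, grad))

# Hypothetical start (w0, w1) = (0, 0) with gradient (-8, -14) at that point.
w_new = gradient_step((0.0, 0.0), (-8.0, -14.0))
```

A different starting pair gives a different gradient and therefore a different update, so the numbers above are only one instance of the calculation.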

8. The following dataset (Table 5) describes COVID-19 testing records for 5 people. You want to build a decision tree classification model from the dataset to predict whether a person suffers from COVID-19 according to the two symptoms Cough and Fever. Both features, Cough and Fever, take one of two possible values: yes (having the symptom) and no (not having the symptom). The target attribute COVID-19 also takes one of two possible values: yes (infected) and no (normal). For notational convenience, you can use X1 and X2 to represent the two features respectively, and Y to represent the prediction target.

Table 5: COVID-19 Data

| Index | Cough | Fever | COVID-19 |
|---|---|---|---|
| #1 | no | no | no |
| #2 | yes | yes | yes |
| #3 | no | yes | yes |
| #4 | no | yes | no |
| #5 | yes | no | no |

(a) (10 marks) You are required to build a decision tree with the Gini impurity heuristic. Show the key steps of inference and calculation.
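
The Gini impurity of a set of labels is $1 - \sum_k p_k^2$, and a candidate split is usually scored by the size-weighted impurity of its children. The sketch below computes both for the Table 5 records using exact fractions; the helper names `gini` and `split_gini` are hypothetical, and only the first split is evaluated rather than a full tree.

```python
from collections import Counter
from fractions import Fraction

# Table 5 as (Cough, Fever, COVID-19).
records = [
    ("no", "no", "no"),
    ("yes", "yes", "yes"),
    ("no", "yes", "yes"),
    ("no", "yes", "no"),
    ("yes", "no", "no"),
]

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a non-empty list of class labels."""
    n = len(labels)
    return 1 - sum(Fraction(c, n) ** 2 for c in Counter(labels).values())

def split_gini(feature_index):
    """Size-weighted Gini impurity of the children after splitting on one feature."""
    total = len(records)
    weighted = Fraction(0)
    for value in {row[feature_index] for row in records}:
        child = [row[2] for row in records if row[feature_index] == value]
        weighted += Fraction(len(child), total) * gini(child)
    return weighted

root_impurity = gini([row[2] for row in records])
cough_impurity = split_gini(0)
fever_impurity = split_gini(1)
```

The feature with the lower weighted impurity is the one a Gini-based tree would split on first; the same two functions can then be applied to each child to continue building the tree.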

(b) (5 marks) Which issue might the decision tree model built above suffer from, overfitting or underfitting? Propose two different strategies to mitigate the possible issue with justification. (Write no more than 200 words in total)
