Individual Assignment

For the individual assignment, we will use the same design instructions as in the tutorial. Your objective is to answer the questions we posed during the tutorial, which involve analyzing the dataset, identifying how machine learning can assist with their work, and explaining the challenges and potential benefits of applying this approach to heart disease.

Data analysis

Question: Do you understand what each column in the dataset represents?
- Hint: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

Question: What do the distributions of the labels and the features look like?

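# Assumes `data` is the pandas DataFrame loaded in the tutorial.
import matplotlib.pyplot as plt

# Hint 1: Visualize the label distribution with a pandas function
data.hist(column='label')
plt.savefig('img/hist.pdf')  # the img/ directory must already exist
# Hint 2: Get a numeric summary of the data with a pandas function
print(data.describe())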
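
As a complement to the two hints above, the following minimal sketch looks at the label proportions and at the skewness of each feature. The small DataFrame is a made-up stand-in for the tutorial's `data`, and the `age` column is only an assumed example feature; replace both with the real dataset.

```python
import pandas as pd

# Toy frame standing in for the tutorial's `data` (all values are made up).
data = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "label": [1, 0, 0, 1, 0, 0],
})

# Share of each label value, ordered by label.
label_share = data["label"].value_counts(normalize=True).sort_index()

# Skewness of each numeric feature (values near 0 suggest a symmetric distribution).
feature_skew = data.drop(columns="label").skew()

print(label_share)
print(feature_skew)
```

Label proportions that differ strongly from each other point to class imbalance, which matters later when choosing an evaluation metric.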

Model training

Question: How do we know which model and parameters work better for our problem? What are your expectations?
- Hint 1: Classification vs Regression
- Hint 2: Supervised vs Unsupervised
- Hint 3: Model selection
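
One possible way to approach Hint 3 is sketched below: compare a few candidate classifiers by cross-validated score, then tune a hyperparameter with a grid search. This is only an illustration under assumptions. The data is synthetic (generated with `make_classification`), and the two models and the `max_depth` grid are arbitrary choices, not the ones prescribed by the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (assumption: binary labels).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Candidate models; the selection here is arbitrary.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Compare candidates by their mean 5-fold cross-validated accuracy.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

# Tune one hyperparameter of one model with a grid search.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,
)
search.fit(X, y)
best_depth = search.best_params_["max_depth"]

print(mean_scores)
print(best_depth)
```

Comparing models on cross-validated scores rather than on a single split makes the comparison less dependent on one particular partition of the data.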

Model evaluation

Question: Can you calculate the Precision and Recall of each class based on the confusion matrix? What’s more important when designing a system for predicting if a person has heart disease or not?
- Hint: The confusion matrix for the binary case was covered in the first tutorial, Task 5: Evaluation
- Hint: Definition of precision and recall
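
The definitions of precision and recall can be applied to a confusion matrix as in the sketch below. The counts are invented for illustration, and the layout (rows are true classes, columns are predicted classes) is the convention used by scikit-learn's `confusion_matrix`; check that your own matrix follows the same layout.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = true class, columns = predicted class.
# Class 0 = no heart disease, class 1 = heart disease. The counts are made up.
cm = np.array([[50, 10],
               [5, 35]])

true_positives = np.diag(cm)

# Precision: correct predictions of a class / all predictions of that class.
precision = true_positives / cm.sum(axis=0)

# Recall: correct predictions of a class / all true instances of that class.
recall = true_positives / cm.sum(axis=1)

print(precision)
print(recall)
```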

Question: What do the different columns and rows represent? Are the ‘macro’ and ‘weighted’ averages really needed? What do they show? Which evaluation metric fits our problem better?
- Hint ‘micro’: Calculate metrics globally by counting the total true positives, false negatives and false positives.
- Hint ‘macro’: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- Hint ‘weighted’: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
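
The three averaging modes described above can be compared on a small example. The labels below are hypothetical and deliberately imbalanced; the sketch uses scikit-learn's `precision_score` with its `average` parameter.

```python
from sklearn.metrics import precision_score

# Toy labels with imbalanced classes (made up): six of class 0, three of class 1, one of class 2.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]

# Per-class precision is 1.0, 2/3 and 1/2 for classes 0, 1 and 2.
averages = {
    avg: precision_score(y_true, y_pred, average=avg)
    for avg in ("micro", "macro", "weighted")
}

print(averages)
```

Here ‘macro’ is pulled down by the two small classes, while ‘weighted’ is dominated by the large class, which is why the two can tell different stories on imbalanced data.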

Cross-validation

- Question 1: Do you see a difference in your results? Do you understand how the way you split your data can affect your model's performance?
- Question 2: Why do we see three scores for each metric and not four (i.e., one for each of our classes)? Which evaluation metric fits our problem better?
- Hint: https://scikit-learn.org/stable/modules/cross_validation.html
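
To see how much the split can matter, the sketch below scores one model on several folds and reports the spread between the best and the worst fold. It is a minimal illustration under assumptions: the data is synthetic, and the model and the choice of five stratified folds are arbitrary rather than taken from the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the heart disease data (assumption: binary labels).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
model = LogisticRegression(max_iter=1000)

# Stratified folds keep the label proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    model, X, y, cv=cv, scoring=["precision_macro", "recall_macro"]
)

# One score per fold; the spread shows how much a single split can vary.
fold_recalls = results["test_recall_macro"]
spread = fold_recalls.max() - fold_recalls.min()

print(fold_recalls)
print(spread)
```

Changing `random_state` in `StratifiedKFold` produces different folds, which is a quick way to check how sensitive the reported scores are to the split.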