Think of the difference between supervised and unsupervised learning as having or not having a supervising entity that tells you whether you are making the right decisions. Supervised learning benefits from labeled data, such as flower classifications (e.g., roses, tulips, carnations, and so on). Unsupervised learning, on the other hand, has no labels to learn from, because discovering those categories is itself the aim.
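
To make the distinction concrete, here is a minimal sketch (assuming scikit-learn is installed; the petal measurements and labels below are invented for illustration) contrasting a supervised fit, which receives labels, with an unsupervised fit, which receives only features:

Python
# Minimal illustration of supervised vs. unsupervised training.
# NOTE: the petal measurements and labels are invented for illustration.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1.4, 0.2], [4.7, 1.4], [5.1, 1.9], [1.3, 0.2]]  # petal length/width
y = [0, 1, 1, 0]                                      # known flower labels

# Supervised: the labels y "supervise" the fit.
clf = LogisticRegression().fit(X, y)

# Unsupervised: only the features are given; groupings are discovered.
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)
print(clf.predict([[1.5, 0.3]]), km.labels_)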

Read more on these topics here: Supervised and Unsupervised Learning.

Using the provided dataset representing the Titanic disaster, create both an unsupervised clustering algorithm to describe the data and a simple supervised classification model to predict who might survive. Implement both algorithms in Python.

Submit 2 Python files with roughly 50–80 lines of code each and 1 MS Word document (or Jupyter Notebook).

Each code file must include a header with at least the following information: your name, the date, the course, and a description of the code.
Code must be well commented and written in your own words. Explain your decisions and what the code is doing, and give a rationale for why you selected each algorithm.
Code should adhere to best-practice coding standards.

 


MS Word/Jupyter Notebook Document Content

 

The accompanying document must explain the theoretical decisions and rationale for both algorithms.

 

Unsupervised Clustering Algorithm: K-Means Rationale

 

 

1. Chosen Algorithm: K-Means Clustering

 

I selected the K-Means algorithm because the objective of the unsupervised task is to describe the inherent structure of the Titanic passenger data, specifically by identifying distinct, homogeneous groups (clusters) of people based on their travel characteristics. K-Means is computationally efficient, widely understood, and effective for partitioning datasets where cluster centers (means) are representative of the groups.
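
Concretely, K-Means searches for centroids that minimize the within-cluster sum of squared distances, the quantity scikit-learn reports as inertia_. Here is a minimal sketch of that objective, computed by hand on invented 2-D points:

Python
# Sketch of the K-Means objective: total squared distance of each point
# to its assigned cluster centroid (values below are invented).
import numpy as np

points    = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]])
labels    = np.array([0, 0, 1, 1])            # cluster assignment per point
centroids = np.array([[1.25, 1.9], [8.1, 7.95]])

inertia = sum(np.sum((points[labels == c] - centroids[c]) ** 2)
              for c in range(len(centroids)))
print(inertia)  # K-Means iteratively updates labels/centroids to shrink this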

 

2. Feature Selection

 

The features selected were 'Pclass', 'Age', 'Fare', 'SibSp', and 'Parch'. These are all quantitative (or ordinal) and directly relate to socioeconomic status, age, and family structure on the ship. These factors are assumed to create natural groupings in travel behavior and circumstance.
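
Before clustering, it is worth verifying that these columns are numeric and seeing how much data is missing. A quick sanity check, assuming the standard Kaggle Titanic column names:

Python
# Quick sanity check on the selected features (assumes 'titanic.csv'
# uses the standard Kaggle column names).
import pandas as pd

df = pd.read_csv('titanic.csv')
features = ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']
print(df[features].dtypes)        # all should be numeric
print(df[features].isna().sum())  # typically only 'Age' has gaps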

 

3. Preprocessing Rationale

 

Imputation: The Age column often has missing values, and K-Means cannot compute distances over missing entries, so they must be handled; replacing them with the mean is a simple, effective method.

Standardization: K-Means calculates distances between points. Without scaling (standardization), the Fare feature, which has a much larger range than Pclass or SibSp, would unduly influence the clustering results. Standardization ensures all features contribute equally.
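
This effect is easy to demonstrate: before scaling, the distance between two passengers is dominated by their fare difference; after scaling, both features contribute comparably. A small illustration with invented values:

Python
# Why scaling matters for distance-based clustering (invented values).
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two passengers: small class difference, large fare difference.
raw = np.array([[1.0, 71.3], [3.0, 7.9]])     # columns: Pclass, Fare
print(np.linalg.norm(raw[0] - raw[1]))        # ~63.4, almost entirely Fare

scaled = StandardScaler().fit_transform(raw)
print(np.linalg.norm(scaled[0] - scaled[1]))  # both features now contribute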

Python
# File: supervised_titanic.py
# Author: [Your Name]
# Date: October 5, 2025
# Course: [Your Course Name]
# Description: Implements Logistic Regression to predict survival on the Titanic.
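
The body of this supervised script is not reproduced here. As a minimal sketch (not the original file) of how it might continue, assuming the standard Kaggle columns, the same preprocessing as the clustering script (mean imputation of Age, standardization), and a held-out test set to measure accuracy:

Python
# Hypothetical continuation of supervised_titanic.py (a sketch, not the
# original file): Logistic Regression predicting the binary 'Survived' label.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv('titanic.csv')
features = ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']
X = df[features].copy()
y = df['Survived']  # the label that supervises training

X[['Age']] = SimpleImputer(strategy='mean').fit_transform(X[['Age']])
X_scaled = StandardScaler().fit_transform(X)

# Hold out data the model never saw, to estimate real predictive skill.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")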

Sample Answer

Unsupervised Learning (K-Means Clustering) Python Code

 

This code uses K-Means Clustering to identify natural groupings, or segments, among the passengers based on numerical features such as age, fare, and class.

Python
# File: unsupervised_titanic.py
# Author: [Your Name]
# Date: October 5, 2025
# Course: [Your Course Name]
# Description: Applies K-Means Clustering to the Titanic dataset to discover passenger segments.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt

# --- 1. Load Data ---
# Rationale: Load the dataset. Assumes the file is named 'titanic.csv'.
df = pd.read_csv('titanic.csv')

# --- 2. Feature Selection ---
# Rationale: Select numerical features relevant for grouping passengers.
features = ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']
X = df[features].copy()

# --- 3. Data Preprocessing for Clustering ---
# Rationale: K-Means is sensitive to scale and missing data.
# Impute missing 'Age' values with the mean.
imputer = SimpleImputer(strategy='mean')
X['Age'] = imputer.fit_transform(X[['Age']])

# Standardize the features (scaling to mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 4. Determine Optimal Clusters (Elbow Method) ---
# Rationale: The Elbow Method helps choose 'k' where the cost reduction slows dramatically.
inertia = []
K_range = range(1, 11)
for k in K_range:
    # Initialize and train the K-Means model
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Method results (Optional but good practice)
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Within-cluster sum of squares)')
# plt.show() # Uncomment to view the plot

# --- 5. Apply K-Means (Choosing k=3 for demonstration) ---
# Rationale: Based on the typical elbow plot structure, k=3 or k=4 is often a good starting point.
optimal_k = 3
kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df['Cluster'] = kmeans_final.fit_predict(X_scaled)

# --- 6. Analyze and Describe Clusters ---
# Rationale: Describe the characteristics of the discovered clusters.
print(f"\n--- Unsupervised Clustering Results (k={optimal_k}) ---")
print(df.groupby('Cluster')[features].mean())
print("\nDescription: These clusters represent natural segments of passengers based on their features (e.g., Cluster 0 might be low Pclass/low Fare, Cluster 2 high Pclass/high Fare).")