Lab 2
Last updated: 2024-09-20
Created: 2024-09-20
Table of Contents
- Loading the dataset
- Fallback for dataset loading
- TASK 1: Visualize sample face images
- TASK 2: Dimensionality reduction and visualization
- Task 3: Analyzing PCA's Role in Face Recognition Through Image Reconstruction
- [Optional] Task 4: Evaluating the Impact of Dimensionality Reduction on Face Recognition Performance
- [Optional] Task 5: Carry out PCA without scikit-learn
Lab 2: Dimensionality Reduction in Face Recognition
Data forms the foundation of any machine learning algorithm; without it, data science cannot happen. A dataset can contain a huge number of features, some of which are redundant or uninformative. Such redundancy complicates modeling, and the high dimensionality also makes it difficult to interpret and visualize the data. This is where dimensionality reduction comes into play.
In this lab you will engage in a series of tasks:
- TASK 1: Visualize sample face images
- TASK 2: Dimensionality reduction and visualization
- TASK 3: Analyzing PCA's Role in Face Recognition Through Image Reconstruction
- [Optional] Task 4: Evaluating the Impact of Dimensionality Reduction on Face Recognition Performance
- [Optional] Task 5: Carry out PCA without scikit-learn
Please note that areas in your notebook requiring code implementation are marked with # TODO.
Loading the dataset
You will use the Labeled Faces in the Wild (LFW) dataset, which is a widely-used benchmark dataset for studying face recognition problems. The dataset contains face images of various individuals collected from the internet, with variations in pose, lighting, and expression.
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_lfw_people
# Load the Labeled Faces in the Wild (LFW) dataset with a specific configuration
# Ensure at least 70 images per person, and resize images for quicker processing
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
# Extracting the features (images) and labels (person IDs)
X = lfw_people.data # face images
y = lfw_people.target # labels for each image
target_names = lfw_people.target_names # names corresponding to each label
n_samples, h, w = lfw_people.images.shape # dimensions of the images
# Summarize the loaded dataset
print(f"Dataset contains {n_samples} samples, each of dimension {h}x{w}.")
print(f"Number of categories: {len(target_names)}")
print(f"Categories: {target_names}")
Dataset contains 1217 samples, each of dimension 50x37.
Number of categories: 6
Categories: ['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush' 'Gerhard Schroeder' 'Tony Blair']
X.shape
(1217, 1850)
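The classes in LFW are quite unbalanced, so it can be useful to see how many images each person contributes. A small optional check, using the y and target_names loaded above:
import numpy as np
# Count how many images each person has in the loaded subset
for name, count in zip(target_names, np.bincount(y)):
    print(f"{name}: {count} images")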
Fallback for dataset loading
If you encounter issues loading the dataset directly via the API, you can download it from the e-learning platform and then load it from a local file:
import numpy as np
## Alternative method to load the dataset if the direct fetch fails
# X = np.load('./lfw_people_X.npy') # face images
# y = np.load('./lfw_people_y.npy') # labels for each image
# # Note: the two lines below still rely on the lfw_people object from the fetch above;
# # if the fetch fails entirely, target_names and the image shape must also come from local copies.
# target_names = lfw_people.target_names # names corresponding to each label
# n_samples, h, w = lfw_people.images.shape # dimensions of the images
# # Summarize the loaded dataset
# print(f"Dataset contains {n_samples} samples, each of dimension {h}x{w}.")
# print(f"Number of categories: {len(target_names)}")
# print(f"Categories: {target_names}")
TASK 1: Visualize sample face images
To get a sense of the data we are working with, let's visualize the first 10 face images along with their corresponding labels.
Task Requirements:
- Figure Creation: Set up a figure that can comfortably fit 10 images. A recommended layout for this would be 2 rows by 5 columns.
- Image Iteration: Go through the first 10 images in the dataset, displaying each one on the figure with its associated label.
- Image Display: Render each image in grayscale and use the individual's name from the dataset as the title for its subplot.
Implementation Tips:
- The figsize parameter of matplotlib.pyplot.figure() allows you to specify the overall size of the figure. Adjust this to ensure that all subplots fit well within the display area. Check out the documentation for matplotlib.pyplot.figure for more guidance.
- To efficiently create a figure with a set of subplots, matplotlib.pyplot.subplots() can be a handy function. It not only initializes the figure but also returns an array of axes objects that can be used to plot individual subplots. Check out the documentation for matplotlib.pyplot.subplots for more guidance.
# TODO
import matplotlib.pyplot as plt
# Set up the figure size to accommodate 10 images (2 rows x 5 columns)
plt.figure(figsize=(12, 8))
# Loop through the first 10 images in the dataset
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X[i].reshape((h, w)), cmap='gray')
    plt.title(target_names[y[i]])
plt.show()
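As mentioned in the implementation tips, matplotlib.pyplot.subplots() offers an equivalent way to lay out the grid; a minimal variant of the cell above:
# Equivalent layout using plt.subplots(), as suggested in the implementation tips
fig, axes = plt.subplots(2, 5, figsize=(12, 8))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(X[i].reshape((h, w)), cmap='gray')
    ax.set_title(target_names[y[i]])
    ax.set_xticks([])
    ax.set_yticks([])
plt.tight_layout()
plt.show()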
TASK 2: Dimensionality reduction and visualization
In this section, we aim to apply dimensionality reduction techniques learned from the previous class to our dataset.
Data preparation
Let's first create a pandas DataFrame named df that contains our features X and the target y. Then, we'll filter this DataFrame to retain only the data points corresponding to specific targets of interest, in this case, targets 0 and 5.
import pandas as pd
# Create a DataFrame with the features and the target
df = pd.DataFrame(X)
df['label'] = y
# Filter the DataFrame to include only the data points with label 0 or 5
df = df[df['label'].isin([0, 5])]
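A quick optional check on the filtered DataFrame helps confirm the two selected classes and their sizes:
# Quick sanity check: how many samples remain for each selected person?
print(df['label'].value_counts())
print("Filtered feature matrix shape:", df.drop(columns=['label']).shape)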
You are required to perform dimensionality reduction on the prepared data and visualize the data points. Complete the code to achieve the following objectives:
Dimensionality Reduction:
- Extract the feature data from the DataFrame df (i.e., omitting the 'label' column).
- Implement the following algorithms to reduce the feature data to two dimensions: (1) Principal Component Analysis (PCA), (2) Multidimensional Scaling (MDS), (3) Locally Linear Embedding (LLE), (4) Laplacian Eigenmaps and (5) t-Distributed Stochastic Neighbor Embedding (t-SNE).
- [Optional] To make computation more efficient for methods like t-SNE, you can first reduce the dimensionality using PCA (see the sketch after the t-SNE cell below).
Data Visualization:
- Display the data points in a two-dimensional space, as transformed by each dimensionality reduction method. Use different colors to distinguish between the classes (label 0 and label 5).
# TODO
from sklearn.decomposition import PCA
import seaborn as sns
# PCA
dfX = df.drop(columns=['label'], inplace=False)
# TODO: Apply PCA to the filtered DataFrame dfX to reduce its dimensionality to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(dfX)
# TODO: Add two new columns to the DataFrame df, named dim1 and dim2, which correspond to the first and second principal components of the PCA-transformed data X_pca
df['dim1'] = X_pca[:, 0]
df['dim2'] = X_pca[:, 1]
# TODO: Use the seaborn scatterplot function to visualize the data points in the reduced dimensional space, with different colors representing different classes (label 0 and label 5)
sns.scatterplot(data=df,
x="dim1",
y="dim2",
hue="label",
palette="deep")
plt.title("PCA Visualization")
plt.show()
from sklearn import manifold
# TODO: Apply MDS to the filtered DataFrame dfX to reduce its dimensionality to 2 components (use random_state=42 for reproducibility)
mds = manifold.MDS(n_components=2, random_state=42)
X_mds = mds.fit_transform(dfX)
# TODO: Add two new columns to the DataFrame df, named dim1 and dim2, which correspond to the first and second dimensions of the MDS-transformed data X_mds
df['dim1'] = X_mds[:, 0]
df['dim2'] = X_mds[:, 1]
# TODO: Use the seaborn scatterplot function to visualize the data points in the reduced dimensional space, with different colors representing different classes (label 0 and label 5)
sns.scatterplot(data=df,
x="dim1",
y="dim2",
hue="label",
palette="deep")
plt.title("MDS Visualization")
plt.show()
# TODO: Apply LLE to the filtered DataFrame dfX to reduce its dimensionality to 2 components (use random_state=42 for reproducibility)
lle = manifold.LocallyLinearEmbedding(n_components=2, random_state=42)
X_lle = lle.fit_transform(dfX)
# TODO: Add two new columns to the DataFrame df, named dim1 and dim2, which correspond to the first and second dimensions of the LLE-transformed data X_lle
df['dim1'] = X_lle[:, 0]
df['dim2'] = X_lle[:, 1]
# TODO: Use the seaborn scatterplot function to visualize the data points in the reduced dimensional space, with different colors representing different classes (label 0 and label 5)
sns.scatterplot(data=df,
x="dim1",
y="dim2",
hue="label",
palette="deep")
plt.title("LLE Visualization")
plt.show()
# TODO: Apply Laplacian Eigenmaps to the filtered DataFrame dfX to reduce its dimensionality to 2 components (use random_state=42 for reproducibility)
le = manifold.SpectralEmbedding(n_components=2, random_state=42)
X_le = le.fit_transform(dfX)
# TODO: Add two new columns to the DataFrame df, named dim1 and dim2, which correspond to the first and second dimensions of the Laplacian Eigenmaps-transformed data X_le
df['dim1'] = X_le[:, 0]
df['dim2'] = X_le[:, 1]
# TODO: Use the seaborn scatterplot function to visualize the data points in the reduced dimensional space, with different colors representing different classes (label 0 and label 5)
sns.scatterplot(data=df,
x="dim1",
y="dim2",
hue="label",
palette="deep")
plt.title("Laplacian Eigenmaps Visualization")
plt.show()
# TODO: Apply t-SNE to the filtered DataFrame dfX to reduce its dimensionality to 2 components (use random_state=42 for reproducibility)
tsne = manifold.TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(dfX)
# TODO: Add two new columns to the DataFrame df, named dim1 and dim2, which correspond to the first and second dimensions of the t-SNE-transformed data X_tsne
df['dim1'] = X_tsne[:, 0]
df['dim2'] = X_tsne[:, 1]
# TODO: Use the seaborn scatterplot function to visualize the data points in the reduced dimensional space, with different colors representing different classes (label 0 and label 5)
sns.scatterplot(data=df,
x="dim1",
y="dim2",
hue="label",
palette="deep")
plt.title("t-SNE Visualization")
plt.show()
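As noted in the optional point above, t-SNE is often run on a PCA-compressed version of the data to speed it up. A minimal sketch of that variant; the choice of 50 PCA components here is arbitrary:
# Optional: compress with PCA first, then run t-SNE on the compressed data
pca_50 = PCA(n_components=50, random_state=42)   # 50 is an arbitrary choice
X_pca_50 = pca_50.fit_transform(dfX)
tsne_on_pca = manifold.TSNE(n_components=2, random_state=42)
X_tsne_pca = tsne_on_pca.fit_transform(X_pca_50)
df['dim1'] = X_tsne_pca[:, 0]
df['dim2'] = X_tsne_pca[:, 1]
sns.scatterplot(data=df, x="dim1", y="dim2", hue="label", palette="deep")
plt.title("t-SNE on PCA-compressed data")
plt.show()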
Task 3: Analyzing PCA's Role in Face Recognition Through Image Reconstruction
In this section, we will investigate how PCA can be used for dimensionality reduction in face recognition and its impact on image reconstruction quality.
Your objective is to use PCA for reducing the dimensionality of a dataset of images and then to reconstruct these images from the reduced-dimensional space.
Basically, PCA projects the (centered) data as $$Z=WX.$$
Reconstruction inverts this projection: $$X=W^{-1}Z.$$
Because the rows of $W$ are orthonormal (they are the eigenvectors of the covariance matrix!), the inverse is simply the transpose, so the reconstruction becomes $\hat{X}=W^{T}Z$. This is exact when all components are kept; when only the leading components are retained, $W^{T}Z$ is an approximation of the original image.
Specifically, the inverse_transform method of the PCA object will be utilized to revert the PCA-transformed data back to its original space.
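To connect the formula to the API: for a fitted PCA, pca.components_ plays the role of $W$ (one eigenvector per row) and pca.mean_ is the centering term, so inverse_transform amounts to a matrix product plus the mean. A tiny self-contained check on random toy data:
import numpy as np
from sklearn.decomposition import PCA
# Tiny self-contained check: inverse_transform is just Z @ W plus the mean
rng = np.random.RandomState(0)
A = rng.rand(20, 6)                      # toy data: 20 samples, 6 features
p = PCA(n_components=3).fit(A)
Z = p.transform(A)                       # Z = (A - mean) @ W^T
A_manual = Z @ p.components_ + p.mean_   # reconstruction written out by hand
print(np.allclose(p.inverse_transform(Z), A_manual))  # expected: True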
Dimensionality reduction via PCA:
- First, determine the number of principal components to retain, then fit PCA on your data and transform it. Try changing the number of components and see what happens to the reconstructed images (the sketch after this list shows one way to guide that choice).
Image reconstruction:
- Use the inverse_transform method of the PCA object to revert the PCA-transformed data back to its original space, then reshape it to obtain the reconstructed images. You can refer to the documentation for sklearn.decomposition.PCA.inverse_transform().
Visualize reconstructed images:
- Compare the original images with their reconstructions by visualizing them side by side.
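One common way to guide the choice of n_components is to look at the cumulative explained variance ratio. A minimal sketch, assuming X as loaded earlier:
# Cumulative explained variance as a guide for picking n_components
pca_full = PCA().fit(X)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(cum_var)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance ratio")
plt.title("Explained variance vs. number of components")
plt.show()
print("Components needed for 95% of the variance:", int(np.argmax(cum_var >= 0.95)) + 1)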
n_components = 515 # try to change the number of n_components, see what happens
pca = PCA(n_components=n_components).fit(X)
X_pca = pca.transform(X)
# TODO: Reconstruct the images from the reduced-dimensional space(first create X_inv_proj, then reshape it to the original image dimensions)
X_inv_proj = pca.inverse_transform(X_pca)
X_proj_img = np.reshape(X_inv_proj, (n_samples, h, w))
# Pick ~10 evenly spaced images so the originals and reconstructions correspond
chosen_images = X[::n_samples // 10]
chosen_reconstructed = X_proj_img[::n_samples // 10]
# TODO: Visualize
# Visualization setup for displaying original and reconstructed images
n_row, n_col = 2, 5 # Define the layout for the subplot
# Loop through a subset of images for comparison
plt.figure(figsize=(2 * n_col, 2.4 * n_row))
for i in range(n_row * n_col // 2):
    # Plot the original images
    plt.subplot(n_row, n_col, i + 1)
    plt.imshow(chosen_images[i].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Original\nImage %d" % (i + 1))
    plt.xticks(()) # Remove x-axis ticks
    plt.yticks(()) # Remove y-axis ticks
    # Plot the reconstructed images
    plt.subplot(n_row, n_col, i + 1 + n_col)
    plt.imshow(chosen_reconstructed[i].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Reconstructed\nImage %d" % (i + 1))
    plt.xticks(()) # Remove x-axis ticks
    plt.yticks(()) # Remove y-axis ticks
plt.tight_layout() # Adjust subplots to fit into the figure neatly
plt.show() # Display the figure
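If you would like a quantitative view in addition to the visual comparison, you can measure the mean squared reconstruction error for a few component counts; a minimal sketch (the component counts are chosen arbitrarily):
# Mean squared reconstruction error for a few component counts (values chosen arbitrarily)
for k in [10, 50, 150, 515]:
    pca_k = PCA(n_components=k).fit(X)
    X_rec = pca_k.inverse_transform(pca_k.transform(X))
    print(f"n_components={k:4d}  reconstruction MSE={np.mean((X - X_rec) ** 2):.2f}")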
[Optional] Task 4: Evaluating the Impact of Dimensionality Reduction on Face Recognition Performance
In this optional task, you will assess how dimensionality reduction affects the efficiency of a face recognition classifier. Follow these steps to conduct your evaluation:
- Apply dimensionality reduction: Utilize dimensionality reduction methods to transform the high-dimensional face data into a more manageable, lower-dimensional representation. For this task, consider using PCA.
n_components = 150 # try to change the number of n_components, see what happens
pca = PCA(n_components=n_components).fit(X)
X_pca = pca.transform(X)
- Data Splitting: Properly partition your dataset into training and testing sets to effectively assess the model's performance.
from sklearn.model_selection import train_test_split
# Split the original high-dimensional data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Split the reduced-dimensionality data into training and testing sets
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.3, random_state=42)
- Classifier training & performance comparison: Implement and train a Support Vector Machine (SVM) classifier (or any classifier of your choice) on both the original high-dimensional and reduced-dimensionality data to see the impact of dimensionality reduction on learning. Evaluate the classifier's performance on each data set by comparing accuracy and running time.
Example code is provided for training SVM with original high-dimensional data, focusing on accuracy and running time. Extend this to include training on reduced-dimensional data and report accuracy and running time. By completing these two tasks, you'll gain insights into the trade-offs between computational efficiency and classifier performance brought about by dimensionality reduction in face recognition tasks.
# Example: training SVM with high-dimensional data
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import time
# Start the timer to measure the time taken for training and prediction
start = time.time()
# Initialize and train the SVM classifier with a linear kernel on the original high-dimensional data
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(X_test)
# Stop the timer after the classification process is complete
end = time.time()
# Calculate and print the performance metrics: accuracy and time taken
print(f"Accuracy (Original): {accuracy_score(y_test, y_pred)}, Time elapsed: {end - start}s")
Accuracy (Original): 0.8278688524590164, Time elapsed: 1.0597460269927979s
# TODO
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import time
start = time.time()
clf = SVC(kernel='linear')
clf.fit(X_train_pca, y_train_pca)
y_pred = clf.predict(X_test_pca)
end = time.time()
# Performance metrics
print(f"Accuracy (PCA): {accuracy_score(y_test, y_pred)}, Time elapsed: {end - start}s")
Accuracy (PCA): 0.7950819672131147, Time elapsed: 0.08615994453430176s
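To see the trade-off more clearly, you can repeat the PCA + SVM pipeline for several values of n_components, reusing X, y and the imports from the cells above. A minimal sketch; the list of component counts is arbitrary:
# Repeat the PCA + SVM pipeline for several component counts (values chosen arbitrarily)
for k in [25, 50, 100, 150, 300]:
    X_k = PCA(n_components=k).fit_transform(X)
    Xk_train, Xk_test, yk_train, yk_test = train_test_split(
        X_k, y, test_size=0.3, random_state=42)
    start = time.time()
    clf_k = SVC(kernel='linear')
    clf_k.fit(Xk_train, yk_train)
    acc = accuracy_score(yk_test, clf_k.predict(Xk_test))
    print(f"n_components={k:4d}  accuracy={acc:.3f}  time={time.time() - start:.2f}s")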
By comparing the classification results of the original data and the PCA-reduced data, we can draw the following conclusions:
Impact of Dimensionality Reduction on Classification Performance: If the classification accuracy on the PCA-reduced data is close to (or even exceeds) that on the original data, it indicates that enough information has been retained for effective classification despite the reduced dimensionality.
Efficiency Improvement via Dimensionality Reduction: Dimensionality reduction is especially useful for high-dimensional data, as it reduces computational cost (compare the training times above) and can help mitigate overfitting.
[Optional] Task 5: Carry out PCA without scikit-learn
In this section, we will implement PCA from scratch without using scikit-learn. This will help you understand the underlying mathematics and concepts behind PCA.
To manually implement the PCA algorithm, you need to understand the basic steps of PCA:
- Centering:
  - Calculate the mean of each feature.
  - Subtract this mean from each feature to center the data around zero.
  - This ensures that PCA deals with variance and covariance without being affected by the absolute values of features.
- Covariance Matrix:
  - Compute the covariance matrix of the centered features.
  - You can use np.cov() to compute the covariance matrix.
- Eigenvalues and Eigenvectors:
  - Solve for the eigenvalues and corresponding eigenvectors of the covariance matrix.
  - You can utilize np.linalg.eigh() to compute them.
- Select Principal Components:
  - Order the eigenvalues from highest to lowest to prioritize the most significant directions.
  - Select the top k eigenvectors based on the sorted eigenvalues; you can use np.argsort().
  - These selected eigenvectors define your new feature space, capturing the most variance.
- Project Data:
  - Project the centered data onto the new feature space to get the reduced-dimension data.
  - You can use np.dot() to compute the product of two arrays.
Let's do it!
def pca(X, n_components):
    # 1. Center the data (subtract the per-feature mean), as described in the steps above
    # TODO:
    mean = np.mean(X, axis=0)
    X_centered = X - mean
    # 2. Calculate the covariance matrix
    # TODO:
    covariance_matrix = np.cov(X_centered.T)
    # 3. Calculate the eigenvectors and eigenvalues
    # TODO:
    eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
    # 4. Choose the top n_components eigenvectors (largest eigenvalues first)
    # TODO:
    idx = np.argsort(eigenvalues)[::-1]
    selected_eigenvectors = eigenvectors[:, idx[:n_components]]
    # 5. Project the centered data into the new space
    # TODO:
    X_reduced = np.dot(X_centered, selected_eigenvectors)
    return X_reduced, selected_eigenvectors, mean
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X = lfw_people.data
n_samples, h, w = lfw_people.images.shape
n_components = 150
X_pca, eigenvectors, mean = pca(X, n_components)
This code reduces the dataset X to n_components principal components. You can adjust the number of principal components as needed.
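As an optional sanity check, you can compare the amount of variance captured by the manual projection with scikit-learn's PCA; the two should agree closely, even though individual eigenvectors are only defined up to sign:
# Optional check against sklearn: the variance captured by the top-k subspace should agree closely
sk_pca = PCA(n_components=n_components).fit(X)
manual_var = X_pca.var(axis=0, ddof=1).sum()       # variance captured by the manual projection
sklearn_var = sk_pca.explained_variance_.sum()     # variance captured by sklearn's PCA
print(f"Manual PCA captured variance:  {manual_var:.2f}")
print(f"sklearn PCA captured variance: {sklearn_var:.2f}")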
Now, let's compare the original images with their reconstructions from the manually implemented PCA.
def reconstruct(X_reduced, eigenvectors, mean):
    # Map the reduced data back to the original feature space (W^T Z), then add the mean back
    X_reconstructed = np.dot(X_reduced, eigenvectors.T)
    X_reconstructed = X_reconstructed + mean
    return X_reconstructed
X_reconstructed = reconstruct(X_pca, eigenvectors, mean)
# Visualization setup for displaying original and reconstructed images
n_row, n_col = 2, 5 # Define the layout for the subplot
# Loop through a subset of images for comparison
plt.figure(figsize=(2 * n_col, 2.4 * n_row))
for i in range(n_row * n_col // 2):
    # Plot the original images
    plt.subplot(n_row, n_col, i + 1)
    plt.imshow(X[i].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Original\nImage %d" % (i + 1))
    plt.xticks(()) # Remove x-axis ticks
    plt.yticks(()) # Remove y-axis ticks
    # Plot the reconstructed images
    plt.subplot(n_row, n_col, i + 1 + n_col)
    plt.imshow(X_reconstructed[i].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Reconstructed\nImage %d" % (i + 1))
    plt.xticks(()) # Remove x-axis ticks
    plt.yticks(()) # Remove y-axis ticks
plt.tight_layout() # Adjust subplots to fit into the figure neatly
plt.show() # Display the figure