Image Classification With PyTorch: A Practical Guide

Nov 17, 2025 by Alex Braham 53 views

Hey guys! Ever wondered how computers can recognize images like cats, dogs, or even different types of flowers? That's the magic of image classification! And today, we're diving into how you can build your own image classifier using PyTorch, a super popular and flexible deep learning framework.

What is Image Classification?

Image classification is the task of assigning a label to an image based on its visual content. For example, if you show a picture of a cat to an image classification model, it should predict "cat." Sounds simple, right? But behind the scenes, there's a lot of cool math and algorithms working together.

Why is Image Classification Important?

Image classification is used in tons of real-world applications:

Self-driving cars: Recognizing traffic signs, pedestrians, and other vehicles.
Medical imaging: Detecting diseases from X-rays, MRIs, and CT scans.
E-commerce: Identifying products in images for visual search.
Security: Facial recognition for unlocking devices or identifying suspects.
Agriculture: Identifying plant diseases or monitoring crop health.

Basically, if you can see it, a computer can (potentially) be trained to recognize it too!

Setting Up Your Environment

Before we start coding, let's make sure you have everything you need. You'll need Python installed, along with a few key libraries. I recommend using Anaconda to manage your Python environment. If you don't have it installed, grab it from the Anaconda website. It's a lifesaver for managing packages and dependencies.

Create a new environment:

Open your terminal or Anaconda Prompt and run:
```
conda create -n pytorch_env python=3.8
conda activate pytorch_env
```
This creates a new environment named pytorch_env with Python 3.8. You can choose a different Python version if you prefer.
Install PyTorch:

Now, let's install PyTorch. Head over to the PyTorch website and select your operating system, package manager (conda), Python version, and CUDA version (if you have a compatible NVIDIA GPU). The website will give you a command to run. For example, it might look something like this:
```
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
```
Important: If you don't have a GPU, you can still use PyTorch on your CPU. Just omit the cudatoolkit part from the command.
Install other libraries:

We'll also need a few other libraries for data manipulation and visualization:
```
pip install numpy matplotlib scikit-learn
```
- Numpy is for numerical operations.
- Matplotlib is for plotting graphs and images.
- Scikit-learn is for various machine learning tools.

Preparing Your Data

Data is the fuel that drives our image classification engine. We'll use a popular dataset called CIFAR-10, which contains 60,000 32x32 color images in 10 different classes (like airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck).

Downloading CIFAR-10

PyTorch's torchvision package makes it super easy to download and load the CIFAR-10 dataset. Here's how:

import torch
import torchvision
import torchvision.transforms as transforms

# Define transformations to apply to the data
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# Download the training set
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

# Download the test set
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

# Define the classes
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Let's break down what's happening here:

transforms.Compose allows us to chain multiple transformations together. In this case, we're converting the images to PyTorch tensors (transforms.ToTensor()) and normalizing the pixel values (transforms.Normalize()). Normalization helps the model learn faster and better.
torchvision.datasets.CIFAR10 downloads the CIFAR-10 dataset and applies the specified transformations.
torch.utils.data.DataLoader creates an iterator that feeds data to the model in batches. batch_size determines how many images are processed in each iteration. shuffle=True shuffles the data to prevent the model from learning the order of the data.

Visualizing the Data

It's always a good idea to take a peek at the data to make sure everything looks right. Let's display a few images from the training set:

import matplotlib.pyplot as plt
import numpy as np

# Function to show an image
def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()


# Get some random training images
dataiter = iter(trainloader)
images, labels = dataiter.next()

# Show images
imshow(torchvision.utils.make_grid(images))
# Print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))

This code snippet will display a grid of images along with their corresponding labels. If you see cats, dogs, and airplanes, you're on the right track!

Building the Model

Now comes the fun part: defining our image classification model. We'll use a convolutional neural network (CNN), which is a type of neural network that's particularly well-suited for image processing.

Defining the CNN Architecture

Here's a simple CNN architecture that we can use:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

Let's break down this code:

nn.Conv2d defines a convolutional layer. The first argument is the number of input channels (3 for RGB images), the second argument is the number of output channels (number of filters), and the third argument is the kernel size (size of the filter). These convolutional layers are the backbone of the whole operation. They work by sliding the convolutional filter window across the image extracting features from each patch until the whole image is accounted for.
nn.MaxPool2d defines a max pooling layer, which reduces the spatial dimensions of the feature maps. This helps to reduce the number of parameters and make the model more robust to variations in the input. Max pooling layers are used to reduce the dimensionality and computational cost, while also making the model more robust to variations in the input.
nn.Linear defines a fully connected layer. The first argument is the number of input features, and the second argument is the number of output features. The fully connected layers learn complex, non-linear relationships between the features extracted by the convolutional layers.
The forward method defines how the input data flows through the network. It applies the convolutional layers, pooling layers, and fully connected layers in sequence, using ReLU activation functions after each convolutional and fully connected layer.
ReLU (Rectified Linear Unit) is a common activation function that introduces non-linearity into the model, allowing it to learn more complex patterns.
torch.flatten flattens the output of the convolutional layers into a 1D tensor, which can then be fed into the fully connected layers.

Defining the Loss Function and Optimizer

We need to define a loss function to measure how well our model is performing, and an optimizer to update the model's parameters based on the loss. We'll use cross-entropy loss and stochastic gradient descent (SGD) with momentum:

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

nn.CrossEntropyLoss is a common loss function for multi-class classification problems.
optim.SGD is a popular optimization algorithm that updates the model's parameters in the direction of the negative gradient of the loss function. lr is the learning rate, which controls the step size of the updates. momentum helps to accelerate learning by accumulating the gradients over time.

Training the Model

Now we're ready to train our model! This involves iterating over the training data, feeding the data to the model, calculating the loss, and updating the model's parameters.

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

Here's what's happening in the training loop:

We iterate over the training data for a specified number of epochs. An epoch is one complete pass through the entire training dataset.
For each batch of data, we zero the parameter gradients using optimizer.zero_grad(). This is important because PyTorch accumulates gradients by default.
We feed the input data to the model using outputs = net(inputs). This performs a forward pass through the network and produces the model's predictions.
We calculate the loss using loss = criterion(outputs, labels). This measures the difference between the model's predictions and the true labels.
We perform a backward pass through the network using loss.backward(). This calculates the gradients of the loss function with respect to the model's parameters.
We update the model's parameters using optimizer.step(). This applies the gradients to the model's parameters, moving them in the direction that reduces the loss.
We print the running loss every 2000 mini-batches to monitor the training progress. It's super useful because if this running loss isn't decreasing, you'll know that there's some debugging to do.

Evaluating the Model

Now that we've trained our model, let's see how well it performs on the test set. We'll calculate the overall accuracy and the accuracy for each class.

Calculating Overall Accuracy

correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))

Here's how we calculate the overall accuracy:

We iterate over the test data.
For each batch of data, we feed the input data to the model and get the predicted labels.
We compare the predicted labels to the true labels and count the number of correct predictions.
We calculate the overall accuracy by dividing the number of correct predictions by the total number of test images.

Calculating Class-Specific Accuracy

class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels).squeeze()
        for i in range(4):
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1


for i in range(10):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))

This code calculates the accuracy for each class in the CIFAR-10 dataset. This will give you a more granular view on how each class performed individually.

Conclusion

And there you have it! You've built your own image classifier using PyTorch. Of course, this is a very basic example, but it demonstrates the fundamental concepts of image classification and how to implement them using PyTorch.

Further Exploration

Want to take your image classification skills to the next level? Here are a few ideas:

Experiment with different CNN architectures: Try adding more layers, using different activation functions, or exploring more advanced architectures like ResNet or DenseNet.
Use data augmentation: Apply random transformations to the training data (like rotations, flips, and zooms) to increase the diversity of the data and improve the model's generalization ability.
Fine-tune a pre-trained model: Use a model that's been pre-trained on a large dataset like ImageNet as a starting point and fine-tune it on your specific task. This can significantly improve performance and reduce training time.
Try different datasets: Explore other image classification datasets like MNIST, Fashion-MNIST, or your own custom dataset.

Keep experimenting, keep learning, and have fun building awesome image classification models!