Computer Vision • Deep Learning

Video Object Segmentation

Custom deep learning pipeline for real-time video object segmentation, trained on the DAVIS 2017 dataset. Integrates a ResNet50 backbone, ConvLSTM for temporal modeling, and CBAM attention.

Overview

This project tackled Video Object Segmentation (VOS), where the goal is to segment and track objects across video sequences. CNNs applied frame by frame treat each frame independently, so their masks often lack temporal consistency. To address this, I designed a custom deep learning pipeline that combines a ResNet50 backbone, ConvLSTM layers for temporal modeling, and CBAM attention for adaptive focus.

Technical Highlights

Dataset: DAVIS 2017 (semi-supervised + supervised splits)
Backbone: ResNet50 pre-trained on ImageNet
Temporal Model: ConvLSTM for motion awareness
Attention: CBAM (Convolutional Block Attention Module)
Frameworks: TensorFlow/Keras, OpenCV
Training: Adam optimizer, binary cross-entropy (BCE) loss, 50 epochs
Hardware: NVIDIA RTX 3060 GPU (12 GB)
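
To illustrate the training objective listed above, here is a minimal NumPy sketch of per-pixel binary cross-entropy between a predicted probability mask and a ground-truth mask. The function name `bce_loss`, the clipping constant `eps`, and the toy 2x2 masks are assumptions made for illustration. The project itself was built with TensorFlow/Keras, whose own loss implementation would be used in practice.

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean per-pixel binary cross-entropy (illustrative, not the project's code)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip predictions so log() never receives exactly 0 or 1.
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    per_pixel = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_pixel.mean())

# Toy 2x2 ground-truth mask and two hypothetical predictions.
target = np.array([[1, 0], [0, 1]])
confident = np.array([[0.9, 0.1], [0.1, 0.9]])
uncertain = np.array([[0.6, 0.4], [0.4, 0.6]])

loss_confident = bce_loss(target, confident)
loss_uncertain = bce_loss(target, uncertain)
print(loss_confident, loss_uncertain)
```

The confident prediction yields the lower loss, which is the behavior the optimizer exploits: pixels predicted with the wrong label, or the right label at low probability, contribute more to the gradient.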

Results

Achieved temporally smooth segmentation masks across sequences, with an IoU improvement of +12% over baseline CNNs. Visual comparisons showed reduced flickering and more consistent object boundaries.
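
For reference, IoU (intersection over union, also called the Jaccard index) between a predicted mask and a ground-truth mask can be computed as in the sketch below. The helper names `mask_iou` and `sequence_iou`, the empty-mask convention, and the toy 4x4 frames are assumptions for illustration, not the evaluation code used in this project.

```python
import numpy as np

def mask_iou(pred, target):
    """IoU between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Convention assumed here: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

def sequence_iou(pred_masks, target_masks):
    """Mean per-frame IoU over one video sequence."""
    return float(np.mean([mask_iou(p, t) for p, t in zip(pred_masks, target_masks)]))

# Toy 4x4 frames: the target covers 4 pixels, the prediction covers 6,
# and the two overlap on 4 pixels.
target = np.zeros((4, 4), dtype=bool)
target[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True

frame_score = mask_iou(pred, target)
seq_score = sequence_iou([pred, target], [target, target])
print(frame_score, seq_score)
```

Averaging per-frame scores over a sequence, as `sequence_iou` does, is one common way to summarize segmentation quality for a whole video.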

Challenges & Solutions

Initial models struggled to maintain object continuity during fast motion. Adding ConvLSTM layers let the network “remember” motion patterns across frames, and integrating CBAM attention further improved performance by focusing on salient regions.
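
The idea behind CBAM can be sketched in a few lines of NumPy: a channel gate built from average- and max-pooled descriptors passed through a shared MLP, followed by a spatial gate built from a convolution over channel-wise average and max maps. This is a simplified illustration, not the project's Keras implementation. The random weights stand in for learned parameters, biases are omitted, and the shapes, the reduction ratio of 4, and all function names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Reweight channels of feat (H, W, C) with a shared two-layer MLP."""
    avg_desc = feat.mean(axis=(0, 1))  # (C,) global average pooling
    max_desc = feat.max(axis=(0, 1))   # (C,) global max pooling

    def mlp(v):
        return np.maximum(v @ w1, 0.0) @ w2  # ReLU hidden layer, no biases

    gate = sigmoid(mlp(avg_desc) + mlp(max_desc))  # (C,), values in (0, 1)
    return feat * gate

def spatial_attention(feat, kernel):
    """Reweight locations of feat (H, W, C) using a (k, k, 2) kernel, k odd."""
    pooled = np.stack([feat.mean(axis=2), feat.max(axis=2)], axis=2)  # (H, W, 2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(pooled, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    h, w = feat.shape[:2]
    logits = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            logits[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return feat * sigmoid(logits)[:, :, None]

def cbam(feat, w1, w2, kernel):
    """Channel attention first, then spatial attention."""
    return spatial_attention(channel_attention(feat, w1, w2), kernel)

# Toy feature map and random stand-in weights (all values hypothetical).
rng = np.random.default_rng(0)
H, W, C, reduction = 8, 8, 16, 4
feat = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C, C // reduction)) * 0.1
w2 = rng.standard_normal((C // reduction, C)) * 0.1
kernel = rng.standard_normal((7, 7, 2)) * 0.1

refined = cbam(feat, w1, w2, kernel)
print(refined.shape)
```

Because both gates lie strictly between 0 and 1, the block only rescales existing activations. It keeps the feature map's shape, which is what lets it slot between backbone stages without changing the layers around it.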