Saliency cut: an automatic approach for video object segmentation based on saliency energy minimization
Video object segmentation (VOS) is an important task in computer vision with many applications, e.g., video editing, object tracking, and object-based encoding. Unlike image object segmentation, video object segmentation must account for both the spatial and the temporal coherence of the object. Despite extensive previous work, the problem remains challenging. The foreground object in a video usually draws more attention from human viewers, i.e., it is salient. In this thesis we approach the problem from the perspective of saliency, where saliency denotes the subset of visual information selected by a visual system (human or machine). We present a novel unsupervised method for video object segmentation that considers both low-level vision cues and high-level motion cues. In our model, video object segmentation is formulated as a unified energy minimization problem and solved in polynomial time with the min-cut algorithm. Specifically, our energy function comprises a unary term and a pairwise interaction term: the unary term measures region saliency, and the interaction term smooths the mutual effects between object saliency and motion saliency. Object saliency is computed in the spatial domain from each individual frame using multi-scale context features, e.g., color histogram, gradient, and graph-based manifold ranking. Motion saliency is computed in the temporal domain by extracting phase information from the video. In the experimental part of this thesis, the proposed method is evaluated on several benchmark datasets. On the MSRA 1000 dataset, the results show that our spatial object saliency detection outperforms state-of-the-art methods. Moreover, our temporal motion saliency detector achieves better performance than existing motion detection approaches on both the UCF Sports action analysis dataset and the Weizmann dataset.
Finally, we report encouraging empirical results and a quantitative evaluation of our approach on two benchmark video object segmentation datasets.
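
To make the energy minimization step concrete, the following is a minimal sketch of binary labeling by s-t min-cut. It is not the energy function of the thesis. It assumes one saliency score in [0, 1] per region, a unary cost of `1 - saliency` for labeling a region foreground and `saliency` for labeling it background, and a generic Potts smoothness penalty `lam` between neighboring regions in place of the object/motion interaction term described above. The function name `min_cut_labels`, the region graph, and all numeric values are hypothetical.

```python
from collections import deque

EPS = 1e-12  # residual capacities below this are treated as zero


def min_cut_labels(saliency, edges, lam=0.3):
    """Label regions 1 (foreground) or 0 (background) via s-t min-cut.

    Assumed energy (illustrative only):
      E(l) = sum_i U_i(l_i) + lam * sum_{(i,j)} [l_i != l_j]
      with U_i(1) = 1 - saliency[i] and U_i(0) = saliency[i].
    """
    n = len(saliency)
    s, t = n, n + 1                      # source = foreground, sink = background
    cap = [dict() for _ in range(n + 2)]  # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)        # reverse edge for the residual graph

    for i, sal in enumerate(saliency):
        add_edge(s, i, sal)              # cut when i is background: cost sal
        add_edge(i, t, 1.0 - sal)        # cut when i is foreground: cost 1 - sal
    for i, j in edges:
        add_edge(i, j, lam)              # cut when i and j get different labels
        add_edge(j, i, lam)

    def augmenting_path():
        """Breadth-first search for a shortest s-t path (Edmonds-Karp)."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > EPS and v not in parent:
                    parent[v] = u
                    if v == t:
                        return parent
                    queue.append(v)
        return None

    while True:
        parent = augmenting_path()
        if parent is None:
            break
        # Find the bottleneck capacity along the path, then push that flow.
        flow, v = float("inf"), t
        while parent[v] is not None:
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    # Regions still reachable from the source lie on the foreground side.
    reachable, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > EPS and v not in reachable:
                reachable.add(v)
                queue.append(v)
    return [1 if i in reachable else 0 for i in range(n)]


# Hypothetical chain of five regions; region 1 has a noisy, low saliency score.
scores = [0.9, 0.4, 0.8, 0.1, 0.2]
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(min_cut_labels(scores, chain, lam=0.0))  # [1, 0, 1, 0, 0]
print(min_cut_labels(scores, chain, lam=0.3))  # [1, 1, 1, 0, 0]
```

With `lam=0.0` each region is labeled independently by thresholding its score, while with `lam=0.3` the pairwise term pulls the noisy region 1 into the foreground along with its neighbors. Edmonds-Karp is used here only because it is short and runs in polynomial time. A practical system would more likely rely on a dedicated max-flow library.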