Aug. 2012 - Jul. 2016, Department of Automation, Tsinghua University
Jul. 2015 - Sep. 2015, Department of Electrical & Systems Engineering, Washington University in St. Louis
Visual tracking faces the dilemma of locating a target both accurately and efficiently while deciding online whether and how to adapt the appearance model, or even restart tracking. In this paper, we propose a deep reinforcement learning with iterative shift (DRL-IS) method for single object tracking, where an actor-critic network is introduced to predict the iterative shifts of object bounding boxes and to evaluate the shifts in order to decide whether to update the object model or re-initialize tracking. Since an object is located by an iterative shift process rather than by online classification of many sampled locations, the proposed method is robust to large deformations and abrupt motion, and it is computationally efficient because finding a target takes at most 10 shifts. In offline training, the critic network guides the learning of joint decisions on motion estimation and tracking status in an end-to-end manner.
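The iterative shift idea can be illustrated with a minimal sketch. This is not the DRL-IS implementation: `predict_shift` is a hypothetical stand-in for the actor network, the `(cx, cy, w, h)` box layout and the `tol` stopping threshold are assumptions, and the critic's update / re-initialize decisions are omitted. Only the cap of 10 shifts comes from the description above.

```python
from typing import Callable, Tuple

# Assumed box layout: (center x, center y, width, height).
Box = Tuple[float, float, float, float]


def iterative_shift(
    box: Box,
    predict_shift: Callable[[Box], Box],
    max_shifts: int = 10,
    tol: float = 0.5,
) -> Tuple[Box, int]:
    """Refine `box` by repeatedly applying predicted shifts.

    Stops once every component of the latest shift is smaller than `tol`
    (an assumed stopping rule) or after `max_shifts` iterations.
    """
    steps = 0
    for _ in range(max_shifts):
        dx, dy, dw, dh = predict_shift(box)
        box = (box[0] + dx, box[1] + dy, box[2] + dw, box[3] + dh)
        steps += 1
        if max(abs(dx), abs(dy), abs(dw), abs(dh)) < tol:
            break
    return box, steps


def make_toy_actor(target: Box) -> Callable[[Box], Box]:
    """Hypothetical actor: moves the box halfway toward a known target.

    A real actor would predict the shift from image features instead.
    """
    def predict(box: Box) -> Box:
        return tuple(0.5 * (t - b) for t, b in zip(target, box))
    return predict


target = (50.0, 40.0, 20.0, 30.0)
start = (10.0, 10.0, 20.0, 30.0)
final_box, n_steps = iterative_shift(start, make_toy_actor(target))
print(final_box, n_steps)
```

In this toy setting the box converges to within a pixel of the target in fewer than 10 shifts, which mirrors why a shift-based search can be cheaper than scoring many sampled candidate locations per frame.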
Most existing multi-object tracking methods employ the tracking-by-detection strategy, which first detects objects in each frame and then associates them across frames. However, the performance of these methods relies heavily on the detection results, which are usually unsatisfactory in many real applications, especially in crowded scenes. To address this, we develop a deep prediction-decision network in our collaborative deep reinforcement learning (C-DRL) method, which simultaneously detects and predicts objects under a unified network via deep reinforcement learning. Specifically, we consider each object as an agent and track it via the prediction network, and we seek the optimal tracked results by exploiting the collaborative interactions of different agents and environments via the decision network, so that the influence of occlusions and noisy detection results can be alleviated.
While a variety of deep hashing methods have been proposed in recent years, most of them struggle to obtain optimal binary codes in a truly end-to-end manner because of the non-smooth sign activations. Unlike existing methods, which usually employ a general relaxation framework to accommodate gradient-based algorithms, our approach formulates the non-smooth part of the hashing network as sampling with a stochastic policy, so that the retrieval performance degradation caused by the relaxation can be avoided. Specifically, our method directly generates the binary codes and maximizes their rewards for similarity preservation, so the network can be trained directly via policy gradient. Hence, the differentiation challenge for discrete optimization is naturally addressed, which leads to more effective gradients and optimal binary hash codes.
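The policy-gradient treatment of binary codes can be sketched with a toy REINFORCE loop. This is an illustration under stated assumptions, not the paper's method: there is no hashing network (the logits are free parameters), the reward is a hypothetical stand-in that counts bits agreeing with a fixed `target` code rather than a similarity-preservation reward, and the batch-mean baseline and all hyperparameters are assumed choices.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_binary_policy(
    target: List[int],
    steps: int = 300,
    batch: int = 64,
    lr: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Learn per-bit logits of a Bernoulli policy with REINFORCE.

    Each bit is sampled (no sign() relaxation), scored by a toy reward,
    and the logits follow the score-function gradient estimate.
    """
    rng = random.Random(seed)
    n = len(target)
    logits = [0.0] * n
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        samples, rewards = [], []
        for _ in range(batch):
            bits = [1 if rng.random() < p else 0 for p in probs]
            code = [2 * b - 1 for b in bits]  # map {0, 1} -> {-1, +1}
            # Toy reward (assumption): fraction of bits matching `target`.
            reward = sum(c == t for c, t in zip(code, target)) / n
            samples.append(bits)
            rewards.append(reward)
        baseline = sum(rewards) / batch  # variance-reducing baseline
        for i in range(n):
            # d/dz log Bernoulli(b; sigmoid(z)) = b - sigmoid(z)
            grad = sum(
                (r - baseline) * (bits[i] - probs[i])
                for bits, r in zip(samples, rewards)
            ) / batch
            logits[i] += lr * grad  # gradient ascent on expected reward
    return logits


target_code = [1, -1, -1, 1]
logits = train_binary_policy(target_code)
greedy_code = [1 if z > 0 else -1 for z in logits]
print(greedy_code)
```

The point of the sketch is that the gradient flows through the log-probability of the sampled bits, so the discrete codes themselves never need to be differentiated.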
Unlike most existing person re-identification methods, which identify whether two pedestrian images are from the same person, our approach aims to obtain the maximal number of correct matches for the whole camera network. Different from recently proposed camera network based re-identification methods, which only consider consistency information in the matching stage to obtain a globally optimal association, we exploit such consistent-aware information under a deep learning framework where both feature representation and image matching are automatically learned. Specifically, we reach the globally optimal solution and balance the performance between different cameras by optimizing the similarity and data association iteratively under consistency constraints.
Unlike most existing person re-identification methods, which only use RGB images, our approach recognizes people from RGB-D images so that more information, such as anthropometric measures and body shapes, can be exploited for re-identification. To exploit useful information from depth images, we use a deep network to extract efficient anthropometric features from processed depth images, which also have three channels. Moreover, we design a multi-modal fusion layer that combines the features extracted from both depth images and RGB images through a uniform latent variable which is robust to noise, and we jointly optimize the fusion layer with the two CNNs.
1. Liangliang Ren, Xin Yuan, Jiwen Lu, Ming Yang, and Jie Zhou, Deep Reinforcement Learning with Iterative Shift for Visual Tracking, European Conference on Computer Vision, Munich, Sep, 2018.
2. Liangliang Ren, Jiwen Lu, Zifeng Wang, and Jie Zhou, Collaborative Deep Reinforcement Learning for Multi-Object Tracking, European Conference on Computer Vision, Munich, Sep, 2018.
3. Xin Yuan, Liangliang Ren, Jiwen Lu, and Jie Zhou, Towards Optimal Deep Hashing via Policy Gradient, European Conference on Computer Vision, Munich, Sep, 2018.
4. Liangliang Ren, Jiwen Lu, Jianjiang Feng, and Jie Zhou, Multi-Modal Uniform Deep Learning for RGB-D Person Re-identification, Pattern Recognition, 2017.
5. Ji Lin, Liangliang Ren, Jiwen Lu, Jianjiang Feng, and Jie Zhou, Consistent-Aware Deep Learning for Person Re-identification in a Camera Network, IEEE Conference on Computer Vision and Pattern Recognition, 2017.