3D-CNN

As intelligent devices have become essential to daily life, more intuitive and convenient interfaces are required. Among the candidates, hand gesture recognition (HGR) is robust to surrounding noise and requires little physical space, so it is widely adopted as a control interface in head-mounted displays and in-vehicle infotainment. For HGR, the 3D-CNN, which takes a stack of consecutive frames spanning the input duration, has attracted attention. Unlike traditional 2D-CNN and LSTM architectures, a 3D-CNN slides its kernels along the temporal direction as well as the spatial directions, so it considers spatio-temporal coherence simultaneously. However, a 3D-CNN requires far more MAC operations and a much larger memory footprint than a conventional 2D-CNN. For example, HGR with 3D-ResNet at 112×112 resolution and a 16-frame input duration requires 11.4 G MAC operations and 81.5 MB of memory (leading to 51.5 GB/s of bandwidth), which are 13.04× and 3.44× those of the conventional ResNet-18, respectively. Moreover, a single layer requires up to 13.5 MB of weights and 1.53 MB of input/output memory, which cannot be held in the on-chip SRAM of a mobile device at once. As a result, real-time 3D-CNN inference has not been feasible on mobile devices; e.g., an implementation on Qualcomm's Snapdragon 865 showed an inference latency of 2.5 s. To meet the requirements of HGR as the next-generation human-computer interface for mobile or IoT devices, a dedicated ASIC processor for 3D-CNN should be researched.
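To make the cost argument concrete, the following is a minimal sketch of how the MAC count and weight size of a single convolution layer grow when the kernel gains a temporal dimension. The layer shapes are hypothetical and chosen only for illustration; they are not taken from the 3D-ResNet configuration above. The sketch also assumes stride 1 and "same" padding along time, so the output keeps all 16 frames. The last lines show one configuration (512 input and output channels, a 3×3×3 kernel, 16-bit weights) that happens to reproduce a 13.5 MB (MiB) weight size; whether this is the layer the figure above refers to is an assumption.

```python
def conv2d_macs(c_in, c_out, k_h, k_w, out_h, out_w):
    """MACs of one 2D convolution layer applied to a single frame."""
    return c_in * c_out * k_h * k_w * out_h * out_w


def conv3d_macs(c_in, c_out, k_t, k_h, k_w, out_t, out_h, out_w):
    """MACs of one 3D convolution layer applied to a whole clip."""
    return c_in * c_out * k_t * k_h * k_w * out_t * out_h * out_w


def conv3d_weight_bytes(c_in, c_out, k_t, k_h, k_w, bytes_per_weight):
    """Weight storage of one 3D convolution layer (bias ignored)."""
    return c_in * c_out * k_t * k_h * k_w * bytes_per_weight


# Hypothetical layer: 64 -> 64 channels, 56x56 output, 16-frame clip.
FRAMES = 16
macs_2d_per_frame = conv2d_macs(64, 64, 3, 3, 56, 56)
macs_2d_clip = macs_2d_per_frame * FRAMES          # 2D layer run on every frame
macs_3d_clip = conv3d_macs(64, 64, 3, 3, 3, FRAMES, 56, 56)

# With stride 1 along time, the 3D layer costs k_t times the per-frame 2D work.
temporal_overhead = macs_3d_clip / macs_2d_clip

# Assumed configuration: 512 -> 512 channels, 3x3x3 kernel, 2 bytes per weight.
weight_mib = conv3d_weight_bytes(512, 512, 3, 3, 3, 2) / 2**20

if __name__ == "__main__":
    print(f"2D MACs per clip: {macs_2d_clip:,}")
    print(f"3D MACs per clip: {macs_3d_clip:,}")
    print(f"3D / 2D ratio:    {temporal_overhead:.1f}x")
    print(f"Weights (MiB):    {weight_mib:.2f}")
```

For a single layer the overhead equals the temporal kernel depth; the larger network-level ratios quoted above also depend on details such as temporal stride and layer layout that this sketch does not model.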

Super Resolution

CNN_SR

Videos at 4K (3840×2160) resolution and 60 frames per second (fps) have become common, and users have grown accustomed to such high-quality media, so lower resolutions or frame rates feel uncomfortable. However, video streaming services mostly deliver FHD (1920×1080) because of limited communication bandwidth, even when the edge device has a 4K or higher-resolution display. The gap between communication bandwidth and display resolution is only crudely filled by simple pixel interpolation, which dissatisfies users who expect seamless high-quality video. As a solid solution to this resolution gap, super resolution (SR) using a deep neural network (DNN), shown in Fig. 1(a), is widely used. Most SR networks are so heavy that they are accelerated on high-performance GPUs, as in NVIDIA's RTX Video Super Resolution or AMD's Radeon Super Resolution. However, such power-hungry GPUs (> 100 W) are not suitable for battery-limited mobile devices.
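As a rough illustration of the resolution gap and of the simple-interpolation baseline that SR is meant to replace, the sketch below compares the raw pixel rates of FHD and 4K at 60 fps and implements plain bilinear upscaling on a small grayscale image stored as nested lists. The function names and the pixel-centre sampling convention are assumptions made for this example; real video scalers differ in filter choice and edge handling.

```python
def pixels_per_second(width, height, fps):
    """Raw pixel rate of an uncompressed video stream."""
    return width * height * fps


def bilinear_upscale(img, scale):
    """Upscale a 2D grayscale image (list of rows) by an integer factor."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * scale):
        # Map the output pixel centre back into source coordinates, clamped.
        sy = min(max((y + 0.5) / scale - 0.5, 0.0), h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * scale):
            sx = min(max((x + 0.5) / scale - 0.5, 0.0), w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            # Blend the four nearest source pixels.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


fhd_rate = pixels_per_second(1920, 1080, 60)
uhd_rate = pixels_per_second(3840, 2160, 60)
pixel_ratio = uhd_rate // fhd_rate  # pixels the display needs per pixel received

if __name__ == "__main__":
    print(f"FHD: {fhd_rate:,} px/s, 4K: {uhd_rate:,} px/s, ratio {pixel_ratio}x")
    for row in bilinear_upscale([[0, 100], [100, 0]], 2):
        print(row)
```

Bilinear interpolation only blends neighbouring pixels, so it cannot restore detail lost at the lower resolution; this is the limitation that motivates DNN-based SR in the paragraph above.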