As intelligent devices have recently become essential to daily life, more intuitive and convenient interfaces are required. Among the candidate interfaces, hand gesture recognition (HGR) is robust to surrounding noise and takes less physical space; thus, HGR is widely adopted as a control interface in head-mounted displays and in-vehicle infotainment. As an algorithm for HGR, the 3D-CNN, which takes a stack of multiple consecutive frames as input according to the input duration, has been highlighted. Compared to traditional 2D-CNN and LSTM architectures, a 3D-CNN moves its kernels not only in the spatial directions but also in the temporal direction, so that it considers spatio-temporal coherence simultaneously. However, a 3D-CNN involves far more MAC operations and a larger memory footprint than a conventional 2D-CNN. For example, HGR with 3D-ResNet at 112×112 resolution and a 16-frame input duration requires 11.4 G MAC operations and 81.5 MB of memory (leading to 51.5 GB/s of memory bandwidth), which are 13.04× and 3.44× those of the conventional ResNet-18, respectively. Moreover, a single layer requires up to 13.5 MB of weights and 1.53 MB of input/output memory, which cannot be stored in the on-chip SRAM of a mobile device at once. Therefore, real-time 3D-CNN inference has not been feasible on mobile devices; e.g., an implementation on Qualcomm’s Snapdragon 865 showed an inference latency of 2.5 s. To fulfill the requirements of HGR as the next-generation human-computer interface for mobile or IoT devices, a dedicated ASIC processor for 3D-CNNs should be researched.
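To illustrate why moving the kernel along the temporal axis inflates the MAC count, the following minimal sketch compares a single 2D convolution layer with its 3D counterpart. The layer shape used here (64 input and output channels, a 3×3 spatial kernel, a temporal kernel depth of 3, a 56×56 output map, 16 output frames) is a hypothetical example chosen for illustration and is not taken from the 3D-ResNet or ResNet-18 configurations cited above; the function names are likewise assumptions.

```python
def conv2d_macs(c_in, c_out, k, h_out, w_out):
    """MACs of one 2D convolution layer (bias ignored).

    Each output element accumulates c_in * k * k products,
    and there are c_out * h_out * w_out output elements.
    """
    return c_in * k * k * c_out * h_out * w_out


def conv3d_macs(c_in, c_out, k, k_t, t_out, h_out, w_out):
    """MACs of one 3D convolution layer (bias ignored).

    The kernel gains a temporal depth k_t, and the output gains
    a temporal length t_out, so both factors multiply the 2D cost.
    """
    return c_in * k_t * k * k * c_out * t_out * h_out * w_out


# Hypothetical layer shape, for illustration only (not from the cited networks).
C_IN, C_OUT, K, K_T = 64, 64, 3, 3
T_OUT, H_OUT, W_OUT = 16, 56, 56

macs_2d = conv2d_macs(C_IN, C_OUT, K, H_OUT, W_OUT)
macs_3d = conv3d_macs(C_IN, C_OUT, K, K_T, T_OUT, H_OUT, W_OUT)

print(f"2D layer MACs: {macs_2d:,}")
print(f"3D layer MACs: {macs_3d:,}")
print(f"3D / 2D ratio: {macs_3d / macs_2d:.1f}x")
```

Under these assumptions, the per-layer overhead is the product of the temporal kernel depth and the number of output frames. Whole-network ratios such as the 13.04× quoted above also depend on temporal striding and the layer mix, which this single-layer sketch does not model.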