We propose a deep action recognition model based on a convolutional architecture with multiplicative interactions. The model generates feature representations that are sensitive to both the temporal dynamics and the static appearance of a video. We show that both kinds of information arise from the intrinsic properties of products of adjacent frames, which distinguishes our approach from recent convolutional methods. Our model also remedies the difficulty energy-based methods have in scaling to more realistic datasets with larger images, because convolution is dramatically more efficient than dense matrix multiplication in terms of both memory requirements and statistical efficiency. Experimental results show that the model outperforms baseline methods on the UCF101 dataset and achieves competitive performance on the KTH dataset. They also suggest that, to model actions reliably, static appearance should be captured in addition to motion information. The research described in this paper is supported by the Science Research Foundation of the Hunan Provincial Education Department under grant number 12B023.
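The following is a minimal sketch, not the authors' implementation, of the core idea stated above: multiplicative interactions between adjacent frames computed convolutionally rather than with dense matrix multiplication. All names (`conv_multiplicative_features`, `filters_x`, `filters_y`, `n_factors`) and the random toy data are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def conv_multiplicative_features(frame_t, frame_t1, filters_x, filters_y):
    """Factor responses for a pair of adjacent frames (illustrative sketch).

    frame_t, frame_t1 : (1, 1, H, W) grayscale frames at times t and t+1
    filters_x, filters_y : (n_factors, 1, k, k) convolutional filter banks
    Returns element-wise products of the two filtered frames; the products
    encode the transformation (motion) between the frames, while the
    individual filter responses retain static appearance information.
    """
    pad = filters_x.shape[-1] // 2
    fx = F.conv2d(frame_t, filters_x, padding=pad)
    fy = F.conv2d(frame_t1, filters_y, padding=pad)
    return fx * fy  # multiplicative interaction, one map per factor

# Toy usage on random frames
torch.manual_seed(0)
n_factors, k = 16, 9
filters_x = 0.01 * torch.randn(n_factors, 1, k, k)
filters_y = 0.01 * torch.randn(n_factors, 1, k, k)
frame_t = torch.rand(1, 1, 64, 64)
frame_t1 = torch.rand(1, 1, 64, 64)

products = conv_multiplicative_features(frame_t, frame_t1, filters_x, filters_y)
print(products.shape)  # torch.Size([1, 16, 64, 64])
```

Because the filters are applied convolutionally, memory grows with the filter size rather than the full image dimension, which is the scaling advantage over dense energy-based formulations noted in the abstract.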