Recognition of Indonesian Dynamic Visemes Using a Convolutional Neural Network

Aris Nasuha, Tri Arief Sardjono, Mauridhi Hery Purnomo


There has been very little research on automatic lip reading in the Indonesian language, especially research based on dynamic visemes. To improve the accuracy of a recognition process for certain problems, choosing suitable classifiers or combining several methods may be required. This study aims to classify five dynamic visemes of the Indonesian language using a CNN (Convolutional Neural Network) and to compare the results with an MLP (Multilayer Perceptron). Parameters expected to improve recognition accuracy were varied to obtain the best result. The data consist of videos of 28 subjects pronouncing daily Indonesian words, recorded in frontal view. The best result, a validation accuracy of 96.44%, was obtained with the CNN classifier using three convolution layers.
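The abstract specifies only that the best-performing CNN used three convolution layers for a five-class viseme problem. Purely as an illustration, such a classifier could be sketched in Keras as follows; the input size, filter counts, kernel sizes, pooling, dropout rate, and optimizer are assumptions for the sketch, not the authors' reported settings:

```python
# Hypothetical sketch: a CNN with three convolution layers classifying
# five dynamic viseme classes. All hyperparameters here are illustrative
# assumptions, not the configuration reported in the paper.
from tensorflow.keras import layers, models

def build_viseme_cnn(input_shape=(64, 64, 1), num_classes=5):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),   # conv layer 1
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),   # conv layer 2
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),  # conv layer 3
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),  # regularization against overfitting
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_viseme_cnn()
```

Training would then proceed with `model.fit` on mouth-region frames labeled with the five viseme classes; validation accuracy, as reported in the study, is the metric computed on the held-out split.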


dynamic visemes, Indonesian language, Convolutional Neural Network






Copyright (c) 2018 Jurnal Nasional Teknik Elektro dan Teknologi Informasi (JNTETI)
