ISSN: 3048-6815

A Hybrid Deep Learning Architecture for Enhanced Sentiment Analysis of Multimodal Social Media Data: Leveraging Contextual Embeddings and Attention Mechanisms

Abstract

This paper introduces a novel hybrid deep learning architecture designed to enhance sentiment analysis of multimodal social media data. Social media sentiment is often expressed through a combination of textual, visual, and sometimes auditory content, which necessitates approaches that can effectively integrate and interpret these diverse modalities. Our architecture leverages contextual embeddings derived from pre-trained language models such as BERT and RoBERTa for textual analysis, alongside convolutional neural networks (CNNs) for visual feature extraction. Crucially, we incorporate attention mechanisms to dynamically weight the importance of different textual and visual features, allowing the model to focus on the most salient information for sentiment prediction. Furthermore, we introduce a fusion module that combines the modality-specific representations through a gated mechanism, enabling adaptive control over the contribution of each modality. The proposed architecture is evaluated on a benchmark multimodal sentiment analysis dataset and demonstrates significant improvements in accuracy, F1-score, and area under the ROC curve (AUC) over state-of-the-art methods. The results highlight the effectiveness of our hybrid approach in capturing nuanced sentiment expressed through the complex interplay of textual and visual cues in social media. We also provide an ablation study that analyzes the contribution of each component of the proposed architecture. The paper concludes with a discussion of limitations and directions for future research, including the integration of audio data and the mitigation of biases in multimodal sentiment datasets.
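
Illustrative sketch (not taken from the paper): the abstract describes attention-weighted textual features, CNN-derived visual features, and a gated fusion module that adaptively blends the modalities. The minimal PyTorch sketch below shows one plausible reading of that design. It assumes 768-dimensional BERT-style token embeddings, 2048-dimensional pooled CNN (e.g., ResNet-50) image features, additive attention pooling over tokens, and a sigmoid gate mixing the two modality vectors; every class name, dimension, and layer choice here is an assumption for illustration, not the authors' released implementation.

# Hypothetical sketch of the gated multimodal fusion outlined in the abstract.
# Dimensions and layer choices are illustrative assumptions only.
import torch
import torch.nn as nn


class GatedMultimodalFusion(nn.Module):
    """Fuse text and image representations with attention pooling and a learned gate."""

    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=256, num_classes=3):
        super().__init__()
        # Project modality-specific features into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Additive attention that scores each text token for pooling.
        self.text_attn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )
        # Gate deciding how much each modality contributes to the fused vector.
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_tokens, image_feats):
        # text_tokens: (batch, seq_len, text_dim), e.g. BERT last hidden states
        # image_feats: (batch, image_dim), e.g. pooled CNN features
        t = self.text_proj(text_tokens)                        # (B, L, H)
        weights = torch.softmax(self.text_attn(t), dim=1)      # (B, L, 1)
        text_vec = (weights * t).sum(dim=1)                    # attention-pooled text
        img_vec = self.image_proj(image_feats)                 # (B, H)
        z = self.gate(torch.cat([text_vec, img_vec], dim=-1))  # (B, H) in [0, 1]
        fused = z * text_vec + (1 - z) * img_vec               # gated combination
        return self.classifier(fused)


# Example usage with random tensors standing in for BERT and CNN outputs.
model = GatedMultimodalFusion()
logits = model(torch.randn(4, 32, 768), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 3])

In this sketch the sigmoid gate z lets the network learn, per example, how much weight to place on the textual versus the visual representation, which corresponds to the "adaptive control over the contribution of each modality" that the abstract attributes to its fusion module.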


How to Cite

Manoj Kumar Chaturvedi (2025). A Hybrid Deep Learning Architecture for Enhanced Sentiment Analysis of Multimodal Social Media Data: Leveraging Contextual Embeddings and Attention Mechanisms. JANOLI International Journal of Artificial Intelligence and its Applications, Issue 3.