Neurosymbolic visual reasoning with scene graph enrichment
Date
2024-02-26
Embargo Date
2025-02-28
Author
Khan, Muhammad Jaleed
Abstract
Visual reasoning is a critical component of artificial intelligence that aims to understand,
interpret, and reason about complex visual content. It has an interdisciplinary nature
incorporating visual feature extraction and image generation from computer vision, linguistic feature extraction and language generation from natural language processing,
and graph-based representation and semantic enrichment from knowledge representation and reasoning. Data-centric visual reasoning techniques often face limitations in
intuitively interpreting visual content due to the limited expressiveness and generalisability of scene representations. We propose a knowledge-enhanced neurosymbolic visual
reasoning framework based on scene graph enrichment. This framework employs deep
learning techniques for object detection and relationship prediction in visual content to
generate scene graph representations, which are then refined and semantically enriched
using common sense knowledge extracted from a heterogeneous knowledge graph. The
enriched scene graphs are used in downstream visual reasoning tasks, including image
captioning, visual question answering and image generation. A comprehensive experimental analysis on standard datasets and evaluation benchmarks demonstrates considerable improvement over existing state-of-the-art methods in terms of relationship
recall rate, image captioning quality, question answering accuracy and image generation
realism. The encouraging results validate the effectiveness of leveraging heterogeneous
common sense knowledge for enhanced scene understanding and visual reasoning.
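
The enrichment step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the `SceneGraph` class, the `enrich` function, the `TOY_KG` dictionary, and the `max_facts_per_object` parameter are all hypothetical names introduced here, and the toy dictionary stands in for the heterogeneous knowledge graph. The sketch assumes that a detector and a relationship predictor have already produced (subject, predicate, object) triples, and it only shows how common sense facts about detected objects might be attached to the graph.

```python
from dataclasses import dataclass, field

# A relation is a (subject, predicate, object) triple.
Triple = tuple[str, str, str]


@dataclass
class SceneGraph:
    """Toy scene graph: a set of object labels and a set of triples."""
    objects: set[str] = field(default_factory=set)
    relations: set[Triple] = field(default_factory=set)

    def add_relation(self, subj: str, pred: str, obj: str) -> None:
        # Adding a triple also registers both endpoints as nodes.
        self.objects.update((subj, obj))
        self.relations.add((subj, pred, obj))


# Hypothetical stand-in for a common sense knowledge graph:
# each object label maps to a list of facts about it.
TOY_KG: dict[str, list[Triple]] = {
    "horse": [("horse", "IsA", "animal"), ("horse", "UsedFor", "riding")],
    "helmet": [("helmet", "UsedFor", "protection")],
}


def enrich(graph: SceneGraph,
           kg: dict[str, list[Triple]],
           max_facts_per_object: int = 2) -> SceneGraph:
    """Return a copy of `graph` extended with facts about its objects."""
    enriched = SceneGraph(set(graph.objects), set(graph.relations))
    # Iterate over the original objects only, so that newly added
    # concepts are not expanded again (single-hop enrichment).
    for label in sorted(graph.objects):
        for subj, pred, obj in kg.get(label, [])[:max_facts_per_object]:
            enriched.add_relation(subj, pred, obj)
    return enriched


# Triples as they might come out of detection and relationship prediction.
detected = SceneGraph()
detected.add_relation("person", "riding", "horse")
detected.add_relation("person", "wearing", "helmet")

enriched = enrich(detected, TOY_KG)
```

In this sketch the enriched graph keeps the two detected relations and gains three knowledge-derived ones, which a downstream captioning or question answering model could then consume. The refinement of predicted relationships mentioned in the abstract is not modelled here.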