===Image captioning===
Over the past few years, there has been an increased interest in [[automatic image annotation|automatically generating captions]] for images, as part of a broader endeavor to investigate the interface between vision and language. A case of data-to-text generation, image captioning (or automatic image description) involves taking an image, analyzing its visual content, and generating a textual description (typically a sentence) that verbalizes the most prominent aspects of the image.

An image captioning system involves two sub-tasks. In image analysis, features and attributes of an image are detected and labelled, and these outputs are then mapped to linguistic structures. Recent research uses deep learning approaches based on features from a pre-trained [[convolutional neural network]] such as AlexNet, VGG or Caffe, where caption generators take an activation layer of the pre-trained network as their input features. Text generation, the second task, is performed using a wide range of techniques. For example, in the Midge system, input images are represented as triples consisting of object/stuff detections, action/[[pose (computer vision)|pose]] detections and spatial relations. These are subsequently mapped to <noun, verb, preposition> triples and realized using a tree substitution grammar.<ref name=":0" />

A common method in image captioning is to use a vision model (such as a [[Residual neural network|ResNet]]) to encode an image into a vector, then use a language model (such as an [[Recurrent neural network|RNN]]) to decode the vector into a caption.<ref>{{Cite journal |last=Vinyals |first=Oriol |last2=Toshev |first2=Alexander |last3=Bengio |first3=Samy |last4=Erhan |first4=Dumitru |date=2015 |title=Show and Tell: A Neural Image Caption Generator |url=https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html |pages=3156–3164}}</ref><ref>{{Cite journal |last=Karpathy |first=Andrej |last2=Fei-Fei |first2=Li |date=2015 |title=Deep Visual-Semantic Alignments for Generating Image Descriptions |url=https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Karpathy_Deep_Visual-Semantic_Alignments_2015_CVPR_paper.html |pages=3128–3137}}</ref>

Despite these advances, challenges and opportunities remain in image captioning research. While the recent introduction of large datasets such as Flickr30K and MS COCO has enabled the training of more complex models such as neural networks, it has been argued that research in image captioning could benefit from larger and more diversified datasets. Designing automatic measures that can mimic human judgments in evaluating the suitability of image descriptions is another need in the area. Other open challenges include visual [[question answering|question-answering]] (VQA),<ref>{{cite conference |last1=Kodali |first1=Venkat |last2=Berleant |first2=Daniel |title=Recent, Rapid Advancement in Visual Question Answering Architecture: a Review |book-title=Proceedings of the 22nd IEEE International Conference on EIT |pages=133–146 |year=2022 |arxiv=2203.01322 }}</ref> as well as the construction and evaluation of multilingual repositories for image description.<ref name=":0" />
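A minimal sketch of this encoder–decoder approach, assuming a PyTorch setting with a pre-trained ResNet-18 encoder and a GRU decoder (the layer sizes, vocabulary size and the choice of GRU are illustrative assumptions, not the configuration of any particular published system), could look like the following:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from torchvision import models


class CaptionDecoder(nn.Module):
    """Toy RNN decoder that turns an image feature vector into caption logits."""

    def __init__(self, feature_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # image vector -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_vec, captions):
        h0 = torch.tanh(self.init_h(image_vec)).unsqueeze(0)  # (1, batch, hidden)
        emb = self.embed(captions)                            # (batch, seq, embed)
        hidden, _ = self.rnn(emb, h0)
        return self.out(hidden)                               # per-step vocabulary logits


# Encoder: a pre-trained ResNet-18 with its classification head removed,
# so it maps each image to a 512-dimensional feature vector.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())

decoder = CaptionDecoder(feature_dim=512, vocab_size=10000)

images = torch.randn(2, 3, 224, 224)         # dummy batch of images
captions = torch.randint(0, 10000, (2, 12))  # dummy tokenized captions
logits = decoder(encoder(images), captions)  # shape: (2, 12, 10000)
</syntaxhighlight>

In such a setup the decoder is trained to predict each caption word given the image vector and the preceding words; at inference time it is run one token at a time, feeding each predicted word back in until an end-of-sentence token is produced.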