===Image captioning===
Over the past few years, there has been an increased interest in [[automatic image annotation|automatically generating captions]] for images, as part of a broader endeavor to investigate the interface between vision and language. A case of data-to-text generation, image captioning (or automatic image description) involves taking an image, analyzing its visual content, and generating a textual description (typically a sentence) that verbalizes the most prominent aspects of the image.

An image captioning system involves two sub-tasks. In image analysis, features and attributes of an image are detected and labelled, and these outputs are then mapped to linguistic structures. Recent research uses deep learning approaches based on features from a pre-trained [[convolutional neural network]] such as AlexNet, VGG or Caffe, where caption generators take an activation layer of the pre-trained network as their input features. Text generation, the second task, is performed using a wide range of techniques. For example, in the Midge system, input images are represented as triples consisting of object/stuff detections, action/[[pose (computer vision)|pose]] detections and spatial relations. These are subsequently mapped to <noun, verb, preposition> triples and realized using a tree substitution grammar.<ref name=":0" />

A common method in image captioning is to use a vision model (such as a [[Residual neural network|ResNet]]) to encode an image into a vector, then use a language model (such as an [[Recurrent neural network|RNN]]) to decode the vector into a caption.<ref>{{Cite journal |last=Vinyals |first=Oriol |last2=Toshev |first2=Alexander |last3=Bengio |first3=Samy |last4=Erhan |first4=Dumitru |date=2015 |title=Show and Tell: A Neural Image Caption Generator |url=https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html |pages=3156–3164}}</ref><ref>{{Cite journal |last=Karpathy |first=Andrej |last2=Fei-Fei |first2=Li |date=2015 |title=Deep Visual-Semantic Alignments for Generating Image Descriptions |url=https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Karpathy_Deep_Visual-Semantic_Alignments_2015_CVPR_paper.html |pages=3128–3137}}</ref>

Despite these advances, challenges and opportunities remain in image captioning research. While the recent introduction of large datasets such as Flickr30K and MS COCO has enabled the training of more complex models such as neural networks, it has been argued that research in image captioning could benefit from larger and more diversified datasets. Designing automatic measures that can mimic human judgments in evaluating the suitability of image descriptions is another need in the area. Other open challenges include visual [[question answering|question-answering]] (VQA),<ref>{{cite conference |last1=Kodali |first1=Venkat |last2=Berleant |first2=Daniel |title=Recent, Rapid Advancement in Visual Question Answering Architecture: a Review |book-title=Proceedings of the 22nd IEEE International Conference on EIT |pages=133–146 |year=2022 |arxiv=2203.01322 }}</ref> as well as the construction and evaluation of multilingual repositories for image description.<ref name=":0" />
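A minimal sketch of this encoder–decoder approach, assuming a PyTorch setting with a pre-trained ResNet-18 encoder and a GRU decoder (the layer sizes, vocabulary size and the choice of GRU are illustrative assumptions, not the configuration of any particular published system), could look like the following:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from torchvision import models


class CaptionDecoder(nn.Module):
    """Toy RNN decoder that turns an image feature vector into caption logits."""

    def __init__(self, feature_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # image vector -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_vec, captions):
        h0 = torch.tanh(self.init_h(image_vec)).unsqueeze(0)  # (1, batch, hidden)
        emb = self.embed(captions)                            # (batch, seq, embed)
        hidden, _ = self.rnn(emb, h0)
        return self.out(hidden)                               # per-step vocabulary logits


# Encoder: a pre-trained ResNet-18 with its classification head removed,
# so it maps each image to a 512-dimensional feature vector.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())

decoder = CaptionDecoder(feature_dim=512, vocab_size=10000)

images = torch.randn(2, 3, 224, 224)         # dummy batch of images
captions = torch.randint(0, 10000, (2, 12))  # dummy tokenized captions
logits = decoder(encoder(images), captions)  # shape: (2, 12, 10000)
</syntaxhighlight>

In such a setup the decoder is trained to predict each caption word given the image vector and the preceding words; at inference time it is run one token at a time, feeding each predicted word back in until an end-of-sentence token is produced.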