W3AI - Document Parsing by Donut


Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape (batch_size, seq_len, hidden_size); the decoder then autoregressively generates text, conditioned on the encoder's output.
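The data flow described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not Donut's implementation: `encode`, `greedy_decode`, and `toy_scores` are hypothetical names, a fixed random projection of image patches stands in for the Swin Transformer, and an arbitrary scoring function stands in for the BART decoder. Only the shapes and the generate-one-token-at-a-time loop are meant to carry over.

```python
import random


def encode(images, patch=2, hidden_size=4, seed=0):
    """Toy stand-in for the vision encoder (assumption, not Swin).

    Splits each image into patch x patch tiles and projects every tile to a
    vector, giving nested lists of shape (batch_size, seq_len, hidden_size).
    """
    rng = random.Random(seed)
    proj = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
            for _ in range(patch * patch)]
    batch = []
    for img in images:
        seq = []
        for r in range(0, len(img), patch):
            for c in range(0, len(img[0]), patch):
                flat = [img[r + i][c + j]
                        for i in range(patch) for j in range(patch)]
                seq.append([sum(flat[k] * proj[k][d] for k in range(len(flat)))
                            for d in range(hidden_size)])
        batch.append(seq)
    return batch


def toy_scores(encoder_states, tokens, vocab_size=5):
    """Hypothetical scorer standing in for the text decoder (not BART).

    Returns one score per vocabulary id, depending on the encoder output and
    on how many tokens have been generated so far.
    """
    pooled = sum(sum(vec) for vec in encoder_states)
    target = (pooled + len(tokens)) % vocab_size
    return [-abs(v - target) for v in range(vocab_size)]


def greedy_decode(encoder_states, next_token_scores, bos_id, eos_id,
                  max_new_tokens=8):
    """Autoregressive greedy decoding conditioned on the encoder output."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores(encoder_states, tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)  # the new token becomes decoder input next step
        if next_id == eos_id:
            break
    return tokens


# One 4x4 grayscale "image" -> 4 patches -> 4 embeddings of size 4.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
states = encode([image])
print(len(states), len(states[0]), len(states[0][0]))
print(greedy_decode(states[0], toy_scores, bos_id=0, eos_id=4))
```

In the real model, the decoder attends to the full sequence of encoder embeddings through cross-attention rather than a pooled sum, and the generated token ids are mapped back to text by a tokenizer.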

mit

Image-to-Text

PyTorch

English

by @AIOZNetwork


Last updated: a month ago
