Microsoft recently released Phi-3-vision-128k-instruct, a powerful multimodal model in its Phi-3 family that has achieved impressive results on public benchmarks. This blog post explores how to use Phi-3-vision-128k-instruct for tasks such as Optical Character Recognition, Image Captioning, Table Parsing, Figure Understanding, Reading Comprehension on Scanned Documents, and Set-of-Mark Prompting. The post also provides a code snippet for running the model locally with transformers and bitsandbytes.
The code snippet demonstrates how to load the model with bitsandbytes quantization so that it fits in consumer-grade GPU memory. It then processes prompts paired with images to transcribe text, caption images, parse tables, interpret figures, and answer questions about scanned documents. The model’s capabilities are showcased through examples in which it accurately transcribes text, describes images, extracts table content, interprets figures, and summarizes book content.
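As a rough illustration of what such a snippet looks like, here is a minimal sketch (not the article's exact code) that loads the public Hugging Face checkpoint microsoft/Phi-3-vision-128k-instruct in 4-bit with bitsandbytes and asks it to transcribe the text in an image; the local image path is a placeholder you would replace with your own file.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

model_id = "microsoft/Phi-3-vision-128k-instruct"

# 4-bit NF4 quantization so the ~4B-parameter model fits on a consumer-grade GPU.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="cuda",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Build a chat-style prompt; "<|image_1|>" marks where the image is injected.
messages = [{"role": "user", "content": "<|image_1|>\nTranscribe the text in this image."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open("scanned_page.png")  # placeholder: any local image of a document
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

output_ids = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Drop the prompt tokens and decode only the newly generated answer.
answer_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```

Swapping the prompt for "Describe this image", "Convert this table to markdown", or a question about the document covers the captioning, table-parsing, and reading-comprehension use cases the article walks through.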
Phi-3-Vision is highlighted as a powerful model for working with images and text, excelling at tasks like document parsing, table structure understanding, and OCR in the wild. Despite having only about 4 billion parameters, it stands out for its efficiency compared to larger models. Its compact size makes it suitable for deployment on edge devices or local consumer-grade GPUs, especially after quantization. Overall, Phi-3-Vision proves to be a versatile tool for a range of data science tasks and shows promise for further fine-tuning on specialized tasks.
Source link: https://towardsdatascience.com/6-real-world-uses-of-microsofts-newest-phi-3-vision-language-model-8ebbfa317fe8?source=rss——llm-5