Hi Everyone,
In this edition of The Weekly Kaitchup:
NVLM-D-72B: A 72B parameter VLM by NVIDIA
Whisper Large V3 Turbo: As Good as Large V2 but 6x Faster
LFM: The New Mysterious Models by Liquid AI
I’m now formatting the first chapter of my book. This chapter will be released on October 15th. If you purchased the book, you will receive the chapters at the email address you provided.
The 50% discount for the presale is available until next Friday, October 11th.
You can also get the book for free by subscribing to The Kaitchup Pro:
NVLM-D-72B: A 72B parameter VLM by NVIDIA
Two weeks ago, NVIDIA introduced NVLM:
NVLM: Open Frontier-Class Multimodal LLMs
NVLM-1.0 is a new family of multimodal large language models (LLMs) designed to address the limitations in current multimodal LLM architectures. NVLM-1.0 has three distinct model architectures: NVLM-D, a decoder-only model; NVLM-X, a cross-attention-based model; and NVLM-H, a hybrid model. They are all trained on the same carefully curated dataset, allowing for direct comparison between the architectures. NVLM-X is designed to be more computationally efficient for handling high-resolution images, while NVLM-D is good at multimodal reasoning and achieves better accuracy in tasks like OCR. Building on the strengths of both models, NVLM-H introduces a hybrid architecture that improves multimodal reasoning while maintaining computational efficiency with high-resolution image inputs.
To further boost the accuracy of high-resolution image processing, NVLM-1.0 exploits a tile-tagging mechanism, which dynamically tiles image inputs and tags them with corresponding text-based identifiers before feeding them into the model. This technique improves performance on OCR-related tasks and multimodal reasoning.
NVLM-1.0 models are trained on a blend of multimodal and high-quality text-only data, which helps preserve text performance while improving vision-language capabilities. Notably, the curated datasets contain a wide variety of tasks, with a focus on the quality and diversity of the data.
NVIDIA has only released NVLM-D, a 72B-parameter VLM. It seems to significantly outperform Llama 3.2 90B on public benchmarks, but they didn't publish comparisons against Qwen2-VL 72B, which I suspect remains superior.
Hugging Face: nvidia/NVLM-D-72B (CC-BY-NC)
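If you want to try it, the checkpoint can be loaded through Transformers' remote-code path. Below is a minimal, untested sketch of the loading step only; the image preprocessing and chat interface are defined by the model's own custom code, so check the model card for the full usage example.

```python
# Minimal sketch (untested): loading NVLM-D-72B with Transformers.
# The multimodal preprocessing and chat interface come from the model's
# custom code; see the model card on the Hugging Face Hub for the full example.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "nvidia/NVLM-D-72B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~145 GB of weights in bf16, so this needs multiple GPUs
    device_map="auto",           # shards the model across available GPUs (requires accelerate)
    trust_remote_code=True,      # NVLM-D ships its own modeling code
).eval()
```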
Whisper Large V3 Turbo: As Good as Large V2 but 6x Faster
OpenAI rarely releases open-source models, but it makes an exception for Whisper, its advanced speech-to-text model that supports multiple languages.
The latest Whisper model, large-v3-turbo, is an optimized smaller version of Whisper large-v3, reducing the number of decoder layers from 32 to just 4. While large-v3 has 1.54 billion parameters, turbo is more lightweight at 809 million parameters.
Inspired by Hugging Face's Distil-Whisper, this new model boosts transcription speed with minimal accuracy loss. Unlike Distil-Whisper, turbo underwent two additional fine-tuning epochs using the same multilingual transcription data. While turbo performs comparably to large-v2 across most languages, it shows slightly larger accuracy degradation in languages like Thai and Cantonese. However, it performs better on cleaner datasets, such as FLEURS, compared to Common Voice.
Turbo is indeed significantly worse than large-v3, but it is faster than the base version of the original Whisper model while performing much better.
You can find the model and code to use it on the Hugging Face Hub:
Hugging Face: openai/whisper-large-v3-turbo (Apache 2.0 license)
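For reference, here is a minimal sketch of how you could transcribe a file with turbo through the transformers automatic-speech-recognition pipeline; the audio path is a placeholder.

```python
# Minimal sketch: transcription with Whisper large-v3-turbo via the
# transformers automatic-speech-recognition pipeline.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,  # fp16 on GPU; remove this argument on CPU
    device="cuda:0",            # or "cpu"
)

# "audio.mp3" is a placeholder for your own audio file
result = asr("audio.mp3", return_timestamps=True)  # timestamps also help with audio longer than 30 s
print(result["text"])
```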
LFM: The New Mysterious Models by Liquid AI
Liquid AI has introduced the Liquid Foundation Models (LFMs), a new generation of AI models with top performance while maintaining efficient memory usage and reduced inference time.
The first generation of LFMs includes three language models: a 1.3B model and a 3.1B model, both aimed at resource-constrained and edge deployments, and a 40.3B Mixture of Experts (MoE) model for more complex tasks. They seem to use a new neural architecture: LFMs are not transformer-based models.
According to their blog post, LFMs use custom computational units organized with weight and feature sharing to adapt computation based on input data. The key principles seem to involve token-mixing, channel-mixing, and featurization. We don't know much more yet, but they are planning to publish blog posts with more details.
The LFMs are particularly memory-efficient, especially for long-context tasks, such as document analysis and Retrieval-Augmented Generation (RAG).
source: Liquid Foundation Models: Our First Series of Generative AI Models
GPU Cost Tracker
This section keeps track, week after week, of the cost of GPUs. It only covers consumer GPUs, from mid-range (e.g., RTX 4060) to high-end (e.g., RTX 4090).
While consumer GPUs have much less memory than GPUs dedicated to AI, they are more cost-effective, by far, for inference with small batches and fine-tuning LLMs with up to ~35B parameters using PEFT methods.
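As an illustration, here is a minimal QLoRA-style sketch (a 4-bit base model plus a LoRA adapter) with Transformers, bitsandbytes, and PEFT; the model name and hyperparameters are placeholders, not recommendations.

```python
# Minimal QLoRA-style sketch: a 4-bit base model with a LoRA adapter, the kind
# of PEFT setup that fits on a 16 GB to 24 GB consumer GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-7B"  # placeholder; use the model you actually want to fine-tune

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # 4-bit weights keep memory usage low
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapter is trained
```

You can then pass this model to your usual trainer (e.g., TRL's SFTTrainer) for fine-tuning.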
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
GPU Selection of the Week:
RTX 4090 (24 GB): GIGABYTE GeForce RTX 4090 Gaming OC 24G Graphics Card
RTX 4080 SUPER (16 GB): GIGABYTE GeForce RTX 4080 Super WINDFORCE V2
RTX 4070 Ti SUPER (16 GB): MSI Gaming RTX 4070 Ti Super 16G Ventus
RTX 4060 Ti (16 GB): MSI Gaming GeForce RTX 4060 Ti 16GB GDDR6 Extreme Clock
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I briefly reviewed:
⭐Instruction Following without Instruction Tuning
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Making Text Embedders Few-Shot Learners
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (group subscriptions come with a 20% discount, or 30% for Pro subscribers):
Have a nice weekend!