At Microsoft Ignite, NVIDIA and Microsoft revealed several new tools that help Windows developers quickly build and optimize AI-driven applications on RTX AI PCs, lowering the barrier so that devs no longer need to be Linux-proficient just to get started on an AI career.
As a quick refresher, James is a digital avatar powered by NVIDIA NIM microservices, NVIDIA ACE, and ElevenLabs digital human technologies. To drive avatars like it, Team Green developed multimodal small language models that process both text and imagery and excel in role-playing scenarios with rapid responses.
In this context, the upcoming NVIDIA Nemovision-4B-Instruct model uses the latest NVIDIA VILA and NeMo frameworks for distillation, pruning, and quantization, enabling it to run efficiently on RTX GPUs while retaining high accuracy. With it, digital avatars can interpret visual imagery from both the real world and the screen, then deliver relevant, context-aware responses.
NVIDIA is also launching the Mistral NeMo Minitron 128k Instruct family, a series of large-context small language models designed for efficient digital human interactions. Available in 8B-, 4B-, and 2B-parameter versions, the family covers nearly every tier of RTX GPU with a balance of speed, memory usage, and accuracy.
These models stand out for accessibility: their 128k-token context window lets them process extensive inputs in a single pass, removing the need to segment and reassemble data. Distributed in the GGUF format, they run efficiently on low-power devices while remaining compatible with runtimes across various programming languages.
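For devs who want a feel for what that looks like in practice, here is a minimal sketch using the community llama-cpp-python bindings, one of several GGUF-compatible runtimes (the model filename below is hypothetical; use the actual GGUF file for your GPU tier once published):

```python
# Minimal sketch: running a GGUF build of a 128k-context model with
# llama-cpp-python. The filename is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-nemo-minitron-8b-128k-instruct.Q4_K_M.gguf",  # hypothetical
    n_ctx=131072,     # the full 128k-token context window
    n_gpu_layers=-1,  # offload all layers to the RTX GPU
)

# One pass over a long document -- no manual segmentation or reassembly.
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the document above."}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
```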
Lastly, TensorRT Model Optimizer (ModelOpt) has finally been updated to streamline model optimization for ONNX Runtime deployment on Windows. With this, devs can now optimize models into an ONNX checkpoint for seamless deployment in ONNX Runtime environments, utilizing GPU execution providers such as CUDA, TensorRT, and DirectML.
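Once a model has been exported to ONNX, loading it and picking a GPU execution provider follows ONNX Runtime's standard provider-priority pattern, sketched below (the checkpoint path is hypothetical):

```python
# Minimal sketch: serving an optimized ONNX checkpoint with ONNX Runtime.
# Providers are tried in order; ONNX Runtime falls back to the next one
# (and ultimately to CPU) if a provider is unavailable on the machine.
import onnxruntime as ort

session = ort.InferenceSession(
    "model_int4.onnx",  # hypothetical path to the ModelOpt-optimized checkpoint
    providers=[
        "TensorrtExecutionProvider",  # TensorRT, the fastest path on RTX GPUs
        "CUDAExecutionProvider",      # generic CUDA fallback
        "DmlExecutionProvider",       # DirectML, the Windows-native option
        "CPUExecutionProvider",
    ],
)
print("Active providers:", session.get_providers())
```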
Specifically, ModelOpt incorporates advanced quantization techniques such as INT4 Activation-Aware Weight Quantization (AWQ), which cuts memory usage and boosts throughput on RTX GPUs compared to other tools like Olive. These updates can shrink a model's memory footprint by up to 2.6x relative to FP16 during deployment, delivering faster throughput with minimal accuracy loss and making AI applications scalable across a broader range of PCs.
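To put that 2.6x figure in perspective, a quick back-of-the-envelope calculation on weight memory alone (illustrative numbers, not a benchmark):

```python
# Rough weight-memory math behind the "up to 2.6x" claim (illustrative only;
# real footprints also include activations, KV cache, and runtime overhead).
params = 8e9                # an 8B-parameter model
fp16_gb = params * 2 / 1e9  # 2 bytes per FP16 weight -> ~16 GB
int4_gb = fp16_gb / 2.6     # claimed best-case shrink -> ~6.2 GB
print(f"FP16 weights: {fp16_gb:.1f} GB, after INT4 AWQ: {int4_gb:.1f} GB")
```

That difference is roughly what moves an 8B-class model from workstation-only territory into the VRAM budget of mainstream RTX cards.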