Phi-4 Multimodal: Microsoft's AI Leap in Speech, Image, Text • Daily CyberSecurity

By Do Son, February 27, 2025

In addition to its ongoing collaboration with OpenAI on artificial intelligence models, Microsoft continues to enhance its Phi series of lightweight language models. The recently unveiled Phi-4-multimodal adds multimodal capabilities, supporting speech, image, and text processing. It is available through Azure AI Foundry, Hugging Face, and the NVIDIA API Catalog.

Compared to its predecessor, Phi-4, this new iteration significantly strengthens multimodal processing capabilities, refining speech recognition, visual analysis, and text inference to enhance multitasking AI performance on edge devices.

By adopting a multimodal approach, Phi-4-multimodal eliminates the inefficiencies of previous models, which required converting speech into text before processing and relied on separate vision models for image analysis. This prior method often led to noticeable execution delays and increased memory and resource consumption on devices.
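The difference between the two approaches can be illustrated with a small sketch. It does not call any real model: the stage names and model identifiers below are hypothetical placeholders, chosen only to show how a cascaded design chains several models while a unified design uses one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str   # what the stage does
    model: str  # hypothetical model identifier, not a real checkpoint name


def cascaded_pipeline(has_audio: bool, has_image: bool) -> list[Stage]:
    """Older approach: each modality is handled by its own model."""
    stages = []
    if has_audio:
        # Speech must be transcribed before the language model can read it.
        stages.append(Stage("speech-to-text", "asr-model"))
    if has_image:
        # Images are analysed by a separate vision model.
        stages.append(Stage("image analysis", "vision-model"))
    stages.append(Stage("text processing", "language-model"))
    return stages


def unified_pipeline(has_audio: bool, has_image: bool) -> list[Stage]:
    """Unified approach: one network receives all modalities directly."""
    return [Stage("joint speech/image/text processing", "multimodal-model")]


def models_loaded(stages: list[Stage]) -> int:
    """Number of distinct models that must be resident in memory."""
    return len({stage.model for stage in stages})


if __name__ == "__main__":
    for build in (cascaded_pipeline, unified_pipeline):
        stages = build(has_audio=True, has_image=True)
        print(build.__name__, "->", models_loaded(stages), "model(s),",
              len(stages), "stage(s)")
```

Each extra stage in the cascaded version is another model to keep in memory and another hand-off to wait for, which is the overhead the article attributes to the earlier design.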

The newly introduced Phi-4-multimodal integrates speech, image, and text processing within a single unified neural network, avoiding the hand-offs between separate models. It has 5.6 billion parameters, supports a context window of 128,000 tokens, and was further tuned with preference optimization and reinforcement learning from human feedback (RLHF), with an emphasis on security.

The model supports over 20 languages, including English, Chinese, Japanese, Korean, German, and French. Its speech recognition covers English, Chinese, Spanish, and Japanese, while its image processing is currently limited to English-language comprehension.

Alongside Phi-4-multimodal, Microsoft has also introduced Phi-4-mini, a more compact model with 3.8 billion parameters that focuses primarily on text processing. It excels in code generation, mathematical reasoning, and long-form content comprehension, supporting 128,000 tokens of context. Despite its smaller scale, Phi-4-mini is designed to deliver superior inference capabilities and improved instruction adherence compared to other models of similar size.
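To put the two parameter counts in context for edge devices, the following is a rough back-of-the-envelope sketch of how much memory the weights alone would occupy at common numeric precisions. The bytes-per-parameter figures are standard for these formats, but the precisions themselves are assumptions for illustration: the article does not say which formats the models ship in, and the estimate ignores activations, the key-value cache for long contexts, and runtime overhead.

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts as reported in the article.
MODELS = {"Phi-4-multimodal": 5.6e9, "Phi-4-mini": 3.8e9}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, params in MODELS.items():
        for precision in BYTES_PER_PARAM:
            size = weight_memory_gb(params, precision)
            print(f"{name:18s} {precision:5s} ~{size:.1f} GB")
```

Under these assumptions, the 16-bit weights come to roughly 11.2 GB for Phi-4-multimodal and 7.6 GB for Phi-4-mini, which suggests why lower-precision formats matter when targeting memory-constrained devices.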

Written by

@DdoS · Security Researcher

Do Son

Do Son is the Founder and Editor of SecurityOnline.info. Working in cybersecurity since 2013, he reports on vulnerabilities, malware, and emerging threats, providing timely analysis to help organizations and individuals stay ahead of evolving risks.
