Recent advancements in large vision-language models (VLMs), such as CLIP, have revolutionized the field with their ability to generalize across diverse visual tasks. However, their computational overhead and rigidity pose challenges for real-world deployment in resource-constrained environments that require real-time processing. This research introduces a novel model distillation framework that addresses these challenges by combining category expansion with learned image augmentation to transfer the capabilities of large-scale VLMs into compact student models, all without relying on human-labeled data. By harnessing the text encoder of the VLM, our method broadens the set of task-relevant categories, enabling the student model to represent a richer collection of visual concepts. Simultaneously, we learn an image-augmentation policy that produces augmentations aligned with these expanded categories. The similarities between augmented images and the expanded categories are then distilled into the student model through a trainable projection head. Extensive evaluation on small-scale datasets demonstrates that our approach achieves competitive or superior performance compared to existing distillation and self-supervised techniques. This presentation will focus on the methodology behind this label-free distillation framework, emphasizing how linguistic guidance from VLMs improves the efficiency and effectiveness of knowledge transfer to student models in low-data scenarios.
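
To make the distillation step concrete, the following is a minimal sketch, not the authors' implementation: it assumes the open-source CLIP package as the teacher, a hypothetical ResNet-18 student with a linear projection head, placeholder expanded category prompts, and input images that have already been transformed by the learned augmentation policy. The category-expansion and augmentation-learning procedures themselves are method-specific and omitted.

```python
# Sketch of label-free similarity distillation from a VLM teacher to a compact student.
# Assumptions (not from the abstract): OpenAI's CLIP package (github.com/openai/CLIP),
# a ResNet-18 student, placeholder expanded prompts, and pre-augmented input batches.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher, preprocess = clip.load("ViT-B/32", device=device)
teacher.eval()

# Expanded, task-relevant categories obtained via the VLM text encoder
# (placeholder prompts standing in for the actual expansion procedure).
expanded_prompts = ["a photo of a tabby cat", "a photo of a siamese cat",
                    "a photo of a golden retriever", "a photo of a beagle"]
text_tokens = clip.tokenize(expanded_prompts).to(device)
with torch.no_grad():
    text_feats = F.normalize(teacher.encode_text(text_tokens), dim=-1)

# Compact student backbone plus a trainable projection head that maps
# student features into the teacher's text-embedding space.
student = torch.hub.load("pytorch/vision", "resnet18", weights=None)
student.fc = nn.Identity()                       # expose 512-d features
projection_head = nn.Linear(512, text_feats.shape[-1])
student, projection_head = student.to(device), projection_head.to(device)
optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(projection_head.parameters()), lr=1e-4)

def distillation_step(images, tau=0.07):
    """One step: match the student's image-text similarities to the teacher's.
    `images` is a batch already processed by the learned augmentation policy."""
    images = images.to(device)
    with torch.no_grad():
        t_img = F.normalize(teacher.encode_image(images), dim=-1)
        teacher_logits = t_img @ text_feats.t() / tau            # teacher similarities
    s_img = F.normalize(projection_head(student(images)), dim=-1)
    student_logits = s_img @ text_feats.t().float() / tau        # student similarities
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits.float(), dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the student never sees human labels: the supervisory signal is the teacher's soft similarity distribution over the expanded categories, and only the student backbone and projection head receive gradients.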