Machine Learning (ML) models are now integral to many critical systems, from self-driving cars to aviation, where their reliability and safety are crucial. Validating that these models perform their intended functions without failure is essential to prevent catastrophic outcomes. This thesis introduces novel tools and approaches, inspired by software testing, for specifying and fuzz-testing ML models for functional correctness. By leveraging fuzzing and metamorphic testing techniques, we address the challenges of generating test inputs and defining test oracles for ML models. We begin with sequential decision-making problems, developing techniques to test action policies for reliability. Our PI-fuzz framework identifies bugs by generating diverse test states and applying test oracles based on metamorphic relations. We then formalize metamorphic relations as hyperproperties and show that they generalize across diverse domains and ML models. Building on this formalization, we develop NOMOS, a declarative, domain-agnostic specification language for expressing and testing these hyperproperties, and show that NOMOS effectively identifies property violations across a variety of ML domains. Finally, we extend NOMOS to support code translation models and evaluate several state-of-the-art models against a range of hyperproperties, uncovering numerous violations. Overall, this work contributes a comprehensive framework for assessing the reliability and safety of ML models across applications.
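
To give a concrete flavor of the metamorphic test oracles discussed above, the sketch below pairs a simple fuzzing loop with one such relation for an action policy: a state transformation assumed to preserve the optimal action should not change the policy's decision. All names (`toy_policy`, `irrelevant_noise`) are hypothetical illustrations under that assumption, not the actual PI-fuzz or NOMOS interface.

```python
# Minimal sketch of a metamorphic test oracle for an action policy.
# The policy, the transformation, and all names are illustrative
# assumptions, not the actual PI-fuzz or NOMOS API.
import random

def toy_policy(state):
    # Stand-in for an ML action policy: decides from the first two
    # state features only; the third feature is (by assumption) unused.
    return 0 if state[0] + state[1] > 0 else 1

def irrelevant_noise(state):
    # Metamorphic transformation assumed to preserve the optimal action:
    # perturb only the feature the policy should not depend on.
    return (state[0], state[1], state[2] + random.uniform(-1.0, 1.0))

def metamorphic_oracle(policy, state, transform):
    # The relation: policy(state) == policy(transform(state)).
    # A mismatch is flagged as a potential bug.
    return policy(state) == policy(transform(state))

if __name__ == "__main__":
    random.seed(0)
    # Fuzzing loop: generate diverse random states and check the relation.
    violations = [
        s for s in (
            tuple(random.uniform(-1, 1) for _ in range(3))
            for _ in range(1000)
        )
        if not metamorphic_oracle(toy_policy, s, irrelevant_noise)
    ]
    print(f"{len(violations)} metamorphic violations found")
```

In this toy setting the relation holds by construction; in practice, the same structure (input generator plus relation-based oracle) is what allows property violations to be surfaced without a ground-truth label for any individual input.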