Our analysis centers on engaging LLMs in two specific types of cognitive tasks: first, syntactically-rich (semantically-poor) tasks such as recognizing formal grammars, and second, semantically-rich (syntactically-poor) tasks such as answering factual knowledge questions about real-world entities. Using carefully designed experimental frameworks, we attempt to answer the following foundational questions:
(a) how can we estimate what latent skills and knowledge a (pre-trained) LLM possesses?
(b) (how) can we distinguish whether an LLM has learnt its training data by rote or with understanding?
(c) what is the minimum amount of training data (and cost) needed for a (pre-trained) LLM to acquire a new skill or new knowledge?
(d) when solving a task, is training on task examples better than, worse than, or comparable to providing them as in-context demonstrations?
I will present some initial empirical results from experimenting with a number of large open-source language models and argue that our findings have important implications for the privacy of training data (including the potential for memorization), the reliability of generated outputs (including the potential for hallucinations), and the robustness of LLM-based applications (including our podcast assistant for science communication).