This is a research project a friend and I did as part of our Data Science class. The goal was to evaluate the common LLMs (as of late 2024) and determine which performed best at four tasks: random string identification, random string generation, multi-step problem solving, and complex calculations. Our findings:

- **Random string generation:** google/gemma-2-27b-it emerged as the most specialized model, outperforming the others on this task.
- **Complex calculations:** meta-llama/llama-3.1-70b-instruct demonstrated exceptional capability, highlighting its strength in mathematical problem solving.
- **Random string identification:** x-ai/grok-beta stood out as the most specialized model, showing the strongest ability to distinguish random from non-random strings.
- **Multi-step problems:** google/gemini-flash-1.5 excelled, demonstrating its capacity to break complex tasks down into manageable steps.
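To give a feel for what the random string identification task involves (this is just an illustrative baseline, not the scoring method used in our project), character-level Shannon entropy is one crude signal of how "random" a string looks:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Estimate bits of entropy per character from character frequencies."""
    counts = Counter(s)
    n = len(s)
    # Sum of -p * log2(p) over observed characters.
    return 0.0 - sum((c / n) * math.log2(c / n) for c in counts.values())

# A repetitive string scores low; a varied string scores higher.
print(shannon_entropy("aaaaaaaa"))  # 0.0 bits per character
print(shannon_entropy("abcdefgh"))  # 3.0 bits per character
```

A model (or a human) judging randomness is doing something far more nuanced than this, but low-entropy strings like `"aaaaaaaa"` are the easy cases any identifier should catch.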
For more details and graphs, check out our paper! (PDF)