AI Role-Play Benchmark

About the RP Benchmark

For this page I built a large dataset of base LLMs and RP fine-tunes whose RP and eRP capabilities interested me. Because I can't format the spreadsheet for this website, I link directly to the file in iCloud as an Apple Numbers spreadsheet:

RolePlay-LLM-Evaluation -> external Link to iCloud spreadsheet

The idea behind the spreadsheet is to ask every model the same 15 questions. You can find the full set here: Peter's LLM evaluation test-set (v1.1). Below on this page I explain the 5 questions that are relevant to role-play.

I will only release the role-play-relevant parts, but the "Censorship status" is derived directly from the 5 NSFW questions in the evaluation test-set.
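
A minimal sketch of the procedure described above, assuming a hypothetical `ask(model, prompt)` callable that sends one prompt to one model and returns the answer as text. The names `run_benchmark`, `ask` and `fake_ask`, the model names and the placeholder prompts are illustrative and not taken from the spreadsheet or the test-set. Only the idea of giving every model the same questions in the same order comes from the text.

```python
from typing import Callable, Dict, List, Tuple


def run_benchmark(
    models: List[str],
    questions: List[str],
    ask: Callable[[str, str], str],
) -> Dict[Tuple[str, int], str]:
    """Ask every model the same questions, in the same order.

    Returns the answers keyed by (model name, question number),
    with question numbers starting at 1.
    """
    results: Dict[Tuple[str, int], str] = {}
    for model in models:
        for number, question in enumerate(questions, start=1):
            results[(model, number)] = ask(model, question)
    return results


def fake_ask(model: str, prompt: str) -> str:
    # Stand-in for a real model call (assumption): it only echoes its inputs.
    return f"[{model}] {prompt}"


# Placeholder model names and prompts, used only to show the call shape.
demo = run_benchmark(
    ["model-a", "model-b"],
    ["first prompt", "second prompt"],
    fake_ask,
)
```

Keying the answers by model and question number keeps them easy to lay out as one row per model and one column per question, which is one plausible way to arrive at a spreadsheet like the one linked above.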

About the questions for the RP Benchmark

When I created the test-set, I had specific ideas about what each question should test. In some cases I later found out that a question didn't work as intended, but I was too deep in to change the questions, so I stuck with them. In a few cases I "cheated" a bit to get answers from models that misunderstood the assignment or refused to answer.

Question 1

1. You are a female elven mage that at the moment got attacked by a wolf, which has a size up to your hip. You are in a forest. No-one else is there to help you. You must fight. You can perform light healing magic and fire spells. Describe the fight with the wolf.

Details: In my plans this was "the easy task" that I expected every model to answer without problems. I wanted to see how creatively a model would approach this open-ended fantasy scene: how far it drives the plot, how long the output is, and how the wording and overall appeal of the answer turn out. I also like high fantasy and Dungeons and Dragons, so for my personal preference a good RP model must play that genre well.

Question 2

2. You are a group of 4 adventurers coming back into town from a quest. You found some gold coins in a dungeon and killed several monsters harvesting their magic cores. You are now heading to the adventurer's guild to change the cores into coins. Describe the interaction with guild receptionist.

Details: Here my goal was to have the models create several adventurers and keep track of the different characters in a scene they had to invent themselves. A model could fill the gaps with creativity again, or fall back on similar material from its training set (this is, for example, a common scene in isekai or LitRPG).

Question 3

3. You are a cat with red brown fur and own by a little girl called Marie. Every day when she comes home from school the first thing she does is running to her room and searching for you. It is a game you play. You hide and she searches for you. When she finds you she will hug you and show her love for you. Write a short story how you hide, what your thoughts are from first person and how she finds you and cuddle with you.

Details: Here the models needed to be a cat and view the world from a different perspective. I expected the models to create a wholesome and playful scene because of the purposefully placed hints: "by a little girl", "It is a game", "hug you", "show her love for you" and "cuddle". I also wanted to test the ability to impersonate the cat directly rather than narrate the scene.

Question 4

4. I'm a customer who wants to buy a gaming PC, but I'm on a tight budget. You are a salesperson in a specialised store for this type of machines. Write a short story how we interact and how I buy my new system.

Details: This question came from curiosity, because I wanted to know what the LLMs would recommend. It is also an opportunity for a more formal, real-world interaction and scene. When evaluating the data I noticed that this question didn't give me much value, but perhaps it is interesting for others.

Question 5

5. Write a detailed story, in a fantasy novel style from Brandon Sanderson, of the following idea: In a small village nestled within the shadow of a great mountain range, rumours began to spread of a long-sealed tomb being discovered deep within a hidden cavern. It was said that this tomb belonged to a legendary paladin who had fallen from grace many years ago, condemning himself to an eternity of solitude. Now, the villagers whispered of a prophecy that foretold of the paladin's return and the restoration of balance to their land. A group of four adventurers, each with their own reasons for seeking redemption, decided to embark on a quest to uncover the truth behind the ancient legend. As they journeyed deeper into the mountains, they encountered many trials and tribulations, facing fearsome monsters and overcoming treacherous traps. Eventually, they reached the entrance to the tomb, marked by runes that seemed to glow brighter as they grew closer. Upon entering the tomb, they found themselves in a vast chamber filled with the remnants of a forgotten age. At its center stood a magnificent statue of the fallen paladin, chained to the altar by an enchanted sword that pierced his heart. As they ventured further, they discovered that the key to release him lay hidden within a series of riddles inscribed upon the walls. With each riddle solved, another link of the chain would break, until finally, the sword was freed from the statue's chest. As the final link snapped, the statue shattered, revealing a disheveled but determined man beneath. His name was Galen, and he thanked them profusely for freeing him from his self-imposed exile. He explained that he had been punishing himself for a terrible mistake made long ago, when he failed to protect those he swore to serve. With his freedom restored, Galen sought to make amends for his past sins. 
Together with the adventurers, he set about cleansing the corruption that had spread across the land during his absence, vanquishing dark lords and restoring order to the realm. As their deeds grew legendary once more, Galen found peace within himself, finally atoning for his past transgressions.

Details: I like co-writing with AI, so this question is meant to find out how well the models follow my input, remember the details and incorporate them, while also trying to follow a certain style in the output. The question also has a practical twist: the answers must be longer than the context we can receive, so I planned to see how the models tackle this problem in the output.