Tokens, Context, APIs, and Models - Beginner's Guide

This is an overly simplified guide meant for beginners. I wish I had this when I started, so I created it here to help others.

What are Tokens?

An AI model doesn't process characters (letters) or words directly. It takes tokens as input. Everything you write is broken down into these tokens, which do not map to characters, syllables, or words in a 1:1 ratio.

Token Examples

Let's look at some examples:

  • Megumin = Meg + umin = 7 characters = 2 tokens
  • EXPLOSION! = EXP + LOS + ION + ! = 10 characters = 4 tokens
  • I love SillyTavern. = I + _love + _Sil + ly + T + av + ern + . = 19 characters = 8 tokens

Note: Spaces are combined with adjacent words, shown here with underscores
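
As a rough illustration, here is a toy greedy tokenizer over a made-up vocabulary. The vocabulary and matching rule are invented for this example; real tokenizers (BPE and friends) learn their vocabularies from data, and real token boundaries differ per model.

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Illustrative only -- real tokenizers learn their vocabulary from data.
TOY_VOCAB = {"Meg", "umin", "EXP", "LOS", "ION", "!", "I", " love",
             " Sil", "ly", "T", "av", "ern", "."}

def toy_tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):  # longest match first
            piece = text[i:i + length]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                i += length
                break
        else:  # no vocabulary match: fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("Megumin"))                    # ['Meg', 'umin']
print(len(toy_tokenize("I love SillyTavern.")))   # 8
```

Note how "Megumin" (7 characters) comes out as 2 tokens, matching the example above.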

Word Order Affects Tokenization

If Yoda said that sentence:

  • SillyTavern I love. = S + illy + T + av + ern + _I + _love + . = 19 characters = 8 tokens

Notice how the token cuts are different, though the total count happens to be the same in this example.

Important Notes

  • Different models use different tokenization methods
  • Tokenization varies by language and special characters
  • SillyTavern has a built-in token counter tool
  • Your software must support the model you want to run

What are Tokens Used For?

Tokens are the language that AI models understand. The API must break down all your inputs into tokens before the model can process them.

Context Size and Token Limits

Context Size

The context size is the maximum number of tokens you want SillyTavern to send through the API to the model. This is a value you can set, but it affects VRAM/RAM usage and performance.

Token Limit

The token limit is always a combination of input tokens plus output tokens. Generally, the token limit is defined when the model is created (trained), representing the maximum the model can receive and process.
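
A practical consequence of "input plus output": the space left for the model's reply is whatever the prompt hasn't already used. A minimal sketch (the function name is made up for illustration):

```python
def max_reply_tokens(context_size, prompt_tokens, reserve=0):
    """How many tokens remain for the model's reply.

    context_size: total token limit (input + output)
    prompt_tokens: tokens already used by the prompt/context
    reserve: optional safety margin
    """
    return max(0, context_size - prompt_tokens - reserve)

print(max_reply_tokens(4096, 3500))   # 596 tokens left for the reply
print(max_reply_tokens(2048, 2100))   # 0 -- the prompt already overflows
```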

Common Sizes

Context sizes and token limits are counted in powers of two:

  • 1024 (1k)
  • 2048 (2k)
  • 4096 (4k)
  • 8192 (8k)
  • 16384 (16k)
  • 32768 (32k)
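
The doubling pattern is simple arithmetic, as this one-liner sketch shows:

```python
# Each common context size doubles the previous one: 1024 * 2**n
sizes = [1024 * 2**n for n in range(6)]
print(sizes)  # [1024, 2048, 4096, 8192, 16384, 32768]
```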

RoPE and Degradation

RoPE (Rotary Position Embedding)

When running a local AI model, you can use RoPE scaling to extend the context beyond the model's original limit. This works up to roughly 2x-4x the original context length, but may lead to:

  • Nonsense outputs
  • Repetition loops
  • Performance degradation
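
As a sketch of the linear-scaling flavor of RoPE extension (heavily simplified; real implementations apply these angles to paired dimensions inside the attention layers, and the function name here is made up):

```python
def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    """Rotation angles RoPE assigns to one token position (sketch).

    With linear scaling ("position interpolation"), every position is
    divided by `scale`, squeezing a longer context back into the
    position range the model was trained on. scale=2.0 roughly
    doubles the usable context.
    """
    pos = position / scale
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

# Position 8192 with 2x scaling gets the same angles as position 4096
# did during training -- which is why it works, and also why precision
# between nearby positions degrades:
print(rope_angles(8192, scale=2.0) == rope_angles(4096))  # True
```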

Sweet Spot

There's an agreed-upon sweet spot where speed and model performance are optimized—often said to be 16k tokens.

Recommendations

  • For 7B or 8B models: Don't exceed 32k tokens to avoid significant degradation
  • Test limits yourself, as results are model-dependent

Understanding Context

Warning: This is highly simplified and doesn't cover technical details like chat completion vs. text completion

What is Context?

Context is everything that gets sent to the API for the model. SillyTavern combines all your settings, inputs, prompts, and chat history intelligently, then sends this context with every new prompt.

Important: The model itself has no memory or personality—it receives input and generates output fresh each time, requiring all information to be sent every time.
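
To illustrate the "no memory" point, here is a sketch of how a client must resend the entire conversation with every single request. The field names are generic placeholders, not tied to any specific API:

```python
# Illustrative only: what "the model has no memory" means in practice.
history = []

def build_request(system_prompt, user_message):
    """Assemble the complete context sent with every single request."""
    history.append({"role": "user", "content": user_message})
    return {
        "messages": [{"role": "system", "content": system_prompt}] + history,
    }

req1 = build_request("You are Megumin.", "Hi!")
history.append({"role": "assistant", "content": "EXPLOSION!"})
req2 = build_request("You are Megumin.", "How are you?")

print(len(req1["messages"]))  # 2: system prompt + first user message
print(len(req2["messages"]))  # 4: the whole history is resent every turn
```

This is why long chats consume more and more tokens per request: nothing is "remembered" server-side in the model itself.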

Context Structure

Think of the prompt as running from UP (the first thing in the context, least influential) to DOWN (the last thing in the context, closest to the reply being generated, and therefore most influential).

Token Types in SillyTavern

Permanent Tokens

  • System Message (optional)
  • Character name (sent at start of every character message)
  • Character description box
  • Character personality box
  • Scenario box

Temporary Tokens

  • First message box
  • Example messages box (can be configured as permanent)

Highly Configurable

  • Character's notes
  • Author's notes
  • Persona

Context Evolution Example

Initial chat setup:

System Message
Character description
Character personality
Scenario
Persona
Example messages
First message

After first exchange:

System Message
Character description
Character personality
Scenario
Persona
Example messages
First message
First user message
First AI response

After more messages (within context size):

System Message
Character description
Character personality
Scenario
Persona
Example messages
First message
Chat History (oldest to newest)
Last user message
Last AI response

When context exceeds size limit:

System Message
Character description
Character personality
Scenario
Persona
Chat History (most recent 5 messages)
Last user message
Last AI response

Context Management

  • Temporary tokens (example messages, first message) are removed first
  • Then chat history is shortened from oldest to newest
  • Longer chats mean character behavior is defined more by chat history than original description
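
The removal order above can be sketched like this (illustrative pseudologic, not SillyTavern's actual code; all names and the 1-token-per-word counter are made up):

```python
def trim_context(permanent, temporary, history, user_msg, budget, count_tokens):
    """Drop temporary blocks first, then the oldest chat history,
    until everything fits within `budget` tokens."""
    def total(parts):
        return sum(count_tokens(p) for p in parts)

    temp = list(temporary)
    hist = list(history)
    while total(permanent + temp + hist + [user_msg]) > budget and temp:
        temp.pop(0)                      # temporary tokens go first
    while total(permanent + temp + hist + [user_msg]) > budget and hist:
        hist.pop(0)                      # then history, oldest first
    return permanent + temp + hist + [user_msg]

def words(s):
    return len(s.split())                # crude 1-token-per-word counter

ctx = trim_context(["sys desc"], ["example msgs", "first msg"],
                   ["old old old", "recent msg"], "hi",
                   budget=6, count_tokens=words)
print(ctx)  # ['sys desc', 'recent msg', 'hi']
```

In the demo, both temporary blocks and then the oldest history message are dropped before everything fits the budget.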

@Depth Settings

To keep important information influential, use @Depth settings. Setting something to @Depth 4 (a good default) inserts it four messages from the bottom of the chat history, keeping it "DOWN" in the context and therefore more influential.

Example with Author's Note @Depth 4:

System Message
Character description
Character personality
Scenario
Persona
Example messages
First message
Chat History (recent messages)
<<< Author's Note (at depth 4)
Chat History (last few messages)
Last user message
Last AI response
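
The placement rule can be sketched as a simple list insertion (illustrative; not SillyTavern's actual implementation):

```python
def insert_at_depth(messages, note, depth=4):
    """Insert `note` so that `depth` messages sit below it, keeping it
    near the bottom ("DOWN") of the context even as the chat grows."""
    cut = max(0, len(messages) - depth)
    return messages[:cut] + [note] + messages[cut:]

chat = ["msg1", "msg2", "msg3", "msg4", "msg5", "msg6"]
print(insert_at_depth(chat, "<<< Author's Note", depth=4))
# ['msg1', 'msg2', "<<< Author's Note", 'msg3', 'msg4', 'msg5', 'msg6']
```

As new messages arrive, the note "floats" with the bottom of the history instead of being pushed further and further up.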

Understanding Model Parameters (B = Billion)

When running local models, you have many choices: servers, APIs, types, models, Bs (billions), Qs (quantizations), etc.

Different servers are optimized for different model types:

  • KoboldCPP: CPU/GPU split and GGUF format
  • TabbyAPI: Fast EXL2 format
  • Oobabooga: All-in-one solution
  • LMStudio: Integrated model download
  • Ollama: General purpose
  • ComfyUI: Specialized for image creation

Model File Types

  • PyTorch: Mainly used for development and training, rarely for local inference directly
  • Safetensors: Raw unquantized weights in a safer container than PyTorch files (no arbitrary code execution); widely supported
  • EXL2: Fast if run fully from VRAM
  • GGML: Predecessor of GGUF, can run on CPU+RAM (obsolete)
  • GGUF: Can run split across GPU/CPU (VRAM/RAM), the default format for llama.cpp

What Does "B" Mean?

The B represents the number of parameters in the model, in billions. Higher B doesn't necessarily mean more intelligent; it means the model "knows" more things.

Think of it like encyclopedias:

  • 1B parameters = encyclopedia with 1 billion entries
  • 7B parameters = encyclopedia with 7 billion entries
  • 20B parameters = encyclopedia with 20 billion entries

Quantization (Q)

What is Quantization?

Models can be stored in full floating point (FP32 or FP16), which makes them extremely large. Quantization optimizes models by reducing precision without significantly degrading performance, saving memory and processing costs.
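
As a toy illustration of the idea, here is simple absmax rounding onto a small integer grid. Real quantization schemes (Qx_K, IQ, etc.) are considerably more sophisticated, but the precision-for-memory trade is the same:

```python
def quantize(weights, bits=4):
    """Toy absmax quantization: map floats onto a small integer grid,
    trading precision for memory."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / levels
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

w = [0.12, -0.48, 0.30, 0.07]
q, s = quantize(w, bits=4)
approx = dequantize(q, s)
print(q)       # [2, -7, 4, 1] -- each value now fits in 4 bits
print(approx)  # close to the originals, but not exact
```

Each 4-bit integer replaces a 16- or 32-bit float, so the file shrinks to roughly a quarter or an eighth of its size, at the cost of small rounding errors.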

Quantization Naming Convention

  • fxx = float (ignore f16, too large)
  • Qx = quantization
  • IQx = iMatrix quant (not instruct/INST)

Quantization Levels

  • Qx_0: Legacy format (ignore)
  • Qx_K: Standard quantization
  • Qx_K_L: Large
  • Qx_K_M: Medium
  • Qx_K_S: Small
  • IQx_M: Smaller but often comparable performance to Qx_K_M
  • IQx_S: Smaller but often comparable performance to Qx_K_S

Quantization Guidelines

  • Never use Q or IQ below 2: Models degrade too much
  • Q8 vs Q6: Often indistinguishable, so Q6 is recommended
  • Q4_K_M: Good balanced default (size, performance, quality)

The Golden Rule: B vs Q

A higher B with lower Q (above 2) is always better than lower B with higher Q.

Hardware Requirements

VRAM/RAM Rule of Thumb

Your available VRAM/RAM minus about 2GB (for context and overhead) = the largest model file you should load (at low context sizes)
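
One way to read this rule of thumb in code (a rough sketch; the function names are made up, and bits-per-weight values are approximate, e.g. Q4_K_M is closer to ~4.8 bits than 4):

```python
def file_size_gb(params_billion, bits_per_weight):
    """Rough model file size: parameters * bits / 8 bits-per-byte, in GB."""
    return params_billion * bits_per_weight / 8

def fits(vram_gb, params_billion, bits_per_weight, overhead_gb=2.0):
    """Rule of thumb: the model file must fit in VRAM minus ~2 GB
    of overhead for context and runtime buffers."""
    return file_size_gb(params_billion, bits_per_weight) <= vram_gb - overhead_gb

print(round(file_size_gb(8, 4.8), 1))   # ~4.8 GB for an 8B Q4_K_M
print(fits(8, 8, 4.8))                  # True: fits an 8 GB card
print(fits(8, 8, 16))                   # False: FP16 would need ~16 GB
```

This matches the 8GB example above: an 8B model at Q4_K_M lands around 5GB, leaving headroom on an 8GB card.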

Example Configurations

8GB VRAM GPU

  • 7B model with Q5_K
  • 8B model with Q4_K_M

16GB VRAM GPU

  • 22B models (amazing step above 7-13B)
  • Consider GGUF 20B on 8GB VRAM + RAM split (still bearable, much more fun)

Model Size Recommendations

  • Everything over 20B: Amazing quality
  • 70B-72B: Too much for most gaming PCs, will be slow from RAM
  • 100B+: Amazing but require rented hardware
  • Warning: Can spoil your fun—you'll be disappointed going back to lower B models
  • Advice: Stay within your long-term available range

Smaller Models

  • 1B or smaller than 7B: For embedded applications on small devices, not suitable for roleplay on gaming PCs

Mixture of Experts (MoE) Models

What Does "8x7B" Mean?

These models combine several expert networks; for each token, a router selects 2 of the experts at every layer. The idea is a higher total parameter count while paying roughly the compute cost of a much smaller model.

Example: Mistral AI's Mixtral models use this approach. See: https://mistral.ai/news/mixtral-of-experts/
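
The parameter arithmetic can be sketched naively (note the naive total overcounts: real MoE models share the attention layers between experts, so Mixtral 8x7B is roughly 47B total with ~13B active per token, not the 56B/14B this sketch yields):

```python
def moe_params(num_experts, expert_b, active_experts, shared_b=0.0):
    """Naive MoE parameter sketch, in billions of parameters.

    Returns (total stored, active per token). Real models share
    layers between experts, so actual totals are lower.
    """
    total = num_experts * expert_b + shared_b
    active = active_experts * expert_b + shared_b
    return total, active

total, active = moe_params(num_experts=8, expert_b=7, active_experts=2)
print(total, active)  # 56 14 -- naive upper bound: stored vs. used per token
```

The point of the architecture shows up in the ratio: most parameters sit on disk/RAM, but only a fraction does work on any given token.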

Personal note: I don't use these models and can't vouch for their effectiveness.

Hugging Face Model Variations

Model Modifications

  • Quants: Quantization as described above
  • Merges: Mixing models to combine strengths or emphasize characteristics
  • Finetunes: Extending model knowledge by adding datasets (e.g., roleplay datasets)

Common Model Tags

  • Uncensored, Abliterated, NSFW: Models that actively engage/push in certain directions
  • eRP: Finetuned for erotic roleplay, or allowing adult content in fantasy settings
  • RolePlay, RP: Finetuned for general roleplay
  • Adventure: Finetuned for NetHack-like adventure games
  • LLaMA 2/3
  • Mistral
  • Mixtral
  • Falcon
  • Qwen
  • Command-R
  • DeepSeek

Some models have copyright restrictions or unclear legal status. For personal use, this generally falls into a legal gray area. However, for professional or commercial use, check the specific model's license and stick to properly licensed base models.

Note: I'm not a lawyer—make your own informed decisions.

Conclusion

If you just want to roleplay, you should now have all the information needed to make informed decisions about tokens, context, and model selection.

Key Takeaways:

  • Higher B with reasonable Q beats lower B with higher Q
  • Stay within your hardware's long-term capabilities
  • 16k context is often the sweet spot
  • Q4_K_M is a good balanced default
  • Test different models to find what works for your use case

For character creation guidance, that will be covered in a separate guide.