Tokens, Context, APIs, and Models - Beginner's Guide

This is an overly simplified guide meant for beginners. I wish I had this when I started, so I created it here to help others.

What are Tokens?

An AI model doesn't process characters (letters) or words directly. It takes tokens as input. Everything you write is broken down into these tokens, which do not map to characters, syllables, or words in a 1:1 ratio.

Token Examples

Let's look at some examples:

  • Megumin = Meg + umin = 7 characters = 2 tokens
  • EXPLOSION! = EXP + LOS + ION + ! = 10 characters = 4 tokens
  • I love SillyTavern. = I + _love + _Sil + ly + T + av + ern + . = 19 characters = 8 tokens

Note: Spaces are combined with adjacent words, shown here with underscores
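
As a rough illustration, here is a toy greedy tokenizer over a made-up vocabulary. The vocabulary and matching rule are invented for this example; real tokenizers (BPE and friends) learn their vocabularies from data, and real token boundaries differ per model.

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Illustrative only -- real tokenizers learn their vocabulary from data.
TOY_VOCAB = {"Meg", "umin", "EXP", "LOS", "ION", "!", "I", " love",
             " Sil", "ly", "T", "av", "ern", "."}

def toy_tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):  # longest match first
            piece = text[i:i + length]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                i += length
                break
        else:  # no vocabulary match: fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("Megumin"))                    # ['Meg', 'umin']
print(len(toy_tokenize("I love SillyTavern.")))   # 8
```

Note how "Megumin" (7 characters) comes out as 2 tokens, matching the example above.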

Word Order Affects Tokenization

If Yoda said that sentence:

  • SillyTavern I love. = S + illy + T + av + ern + _I + _love + . = 19 characters = 8 tokens

Notice how the token cuts are different, though the total count happens to be the same in this example.

Important Notes

  • Different models use different tokenization methods
  • Tokenization varies by language and special characters
  • SillyTavern has a built-in token counter tool
  • Your software must support the model you want to run

What are Tokens Used For?

Tokens are the language that AI models understand. The API must break down all your inputs into tokens before the model can process them.

Context Size and Token Limits

Context Size

The context size is the maximum number of tokens you want SillyTavern to send through the API to the model. This is a value you can set, but it affects VRAM/RAM usage and performance.

Token Limit

The token limit is always a combination of input tokens plus output tokens. Generally, the token limit is defined when the model is created (trained), representing the maximum the model can receive and process.
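
A practical consequence of "input plus output": the space left for the model's reply is whatever the prompt hasn't already used. A minimal sketch (the function name is made up for illustration):

```python
def max_reply_tokens(context_size, prompt_tokens, reserve=0):
    """How many tokens remain for the model's reply.

    context_size: total token limit (input + output)
    prompt_tokens: tokens already used by the prompt/context
    reserve: optional safety margin
    """
    return max(0, context_size - prompt_tokens - reserve)

print(max_reply_tokens(4096, 3500))   # 596 tokens left for the reply
print(max_reply_tokens(2048, 2100))   # 0 -- the prompt already overflows
```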

Common Sizes

Context sizes and token limits are counted in powers of two:

  • 1024 (1k)
  • 2048 (2k)
  • 4096 (4k)
  • 8192 (8k)
  • 16384 (16k)
  • 32768 (32k)
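
The doubling pattern is simple arithmetic, as this one-liner sketch shows:

```python
# Each common context size doubles the previous one: 1024 * 2**n
sizes = [1024 * 2**n for n in range(6)]
print(sizes)  # [1024, 2048, 4096, 8192, 16384, 32768]
```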

RoPE and Degradation

RoPE (Rotary Position Embedding)

When running a local AI model, you can use RoPE scaling to extend the context beyond the model's original limit. This works up to roughly 2x-4x the original context length, but may lead to:

  • Nonsense outputs
  • Repetition loops
  • Performance degradation
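
As a sketch of the linear-scaling flavor of RoPE extension (heavily simplified; real implementations apply these angles to paired dimensions inside the attention layers, and the function name here is made up):

```python
def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    """Rotation angles RoPE assigns to one token position (sketch).

    With linear scaling ("position interpolation"), every position is
    divided by `scale`, squeezing a longer context back into the
    position range the model was trained on. scale=2.0 roughly
    doubles the usable context.
    """
    pos = position / scale
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

# Position 8192 with 2x scaling gets the same angles as position 4096
# did during training -- which is why it works, and also why precision
# between nearby positions degrades:
print(rope_angles(8192, scale=2.0) == rope_angles(4096))  # True
```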

Sweet Spot

There's an agreed-upon sweet spot where speed and model performance are optimized—often said to be 16k tokens.

Recommendations

  • For 7B or 8B models: Don't exceed 32k tokens to avoid significant degradation
  • Test limits yourself, as results are model-dependent

Understanding Context

Warning: This is highly simplified and doesn't cover technical details like chat completion vs. text completion

What is Context?

Context is everything that gets sent to the API for the model. SillyTavern combines all your settings, inputs, prompts, and chat history intelligently, then sends this context with every new prompt.

Important: The model itself has no memory or personality—it receives input and generates output fresh each time, requiring all information to be sent every time.
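
To illustrate the "no memory" point, here is a sketch of how a client must resend the entire conversation with every single request. The field names are generic placeholders, not tied to any specific API:

```python
# Illustrative only: what "the model has no memory" means in practice.
history = []

def build_request(system_prompt, user_message):
    """Assemble the complete context sent with every single request."""
    history.append({"role": "user", "content": user_message})
    return {
        "messages": [{"role": "system", "content": system_prompt}] + history,
    }

req1 = build_request("You are Megumin.", "Hi!")
history.append({"role": "assistant", "content": "EXPLOSION!"})
req2 = build_request("You are Megumin.", "How are you?")

print(len(req1["messages"]))  # 2: system prompt + first user message
print(len(req2["messages"]))  # 4: the whole history is resent every turn
```

This is why long chats consume more and more tokens per request: nothing is "remembered" server-side in the model itself.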

Context Structure

Think of the prompt as running from UP (the first thing in the context, least influential) to DOWN (the last thing in the context, closest to the reply being generated, and therefore most influential).

Token Types in SillyTavern

Permanent Tokens

  • System Message (optional)
  • Character name (sent at start of every character message)
  • Character description box
  • Character personality box
  • Scenario box

Temporary Tokens

  • First message box
  • Example messages box (can be configured as permanent)

Highly Configurable

  • Character's notes
  • Author's notes
  • Persona

Context Evolution Example

Initial chat setup:

System Message
Character description
Character personality
Scenario
Persona
Example messages
First message

After first exchange:

System Message
Character description
Character personality
Scenario
Persona
Example messages
First message
First user message
First AI response

After more messages (within context size):

System Message
Character description
Character personality
Scenario
Persona
Example messages
First message
Chat History (oldest to newest)
Last user message
Last AI response

When context exceeds size limit:

System Message
Character description
Character personality
Scenario
Persona
Chat History (most recent 5 messages)
Last user message
Last AI response

Context Management

  • Temporary tokens (example messages, first message) are removed first
  • Then chat history is shortened from oldest to newest
  • Longer chats mean character behavior is defined more by chat history than original description
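
The removal order above can be sketched like this (illustrative pseudologic, not SillyTavern's actual code; all names and the 1-token-per-word counter are made up):

```python
def trim_context(permanent, temporary, history, user_msg, budget, count_tokens):
    """Drop temporary blocks first, then the oldest chat history,
    until everything fits within `budget` tokens."""
    def total(parts):
        return sum(count_tokens(p) for p in parts)

    temp = list(temporary)
    hist = list(history)
    while total(permanent + temp + hist + [user_msg]) > budget and temp:
        temp.pop(0)                      # temporary tokens go first
    while total(permanent + temp + hist + [user_msg]) > budget and hist:
        hist.pop(0)                      # then history, oldest first
    return permanent + temp + hist + [user_msg]

def words(s):
    return len(s.split())                # crude 1-token-per-word counter

ctx = trim_context(["sys desc"], ["example msgs", "first msg"],
                   ["old old old", "recent msg"], "hi",
                   budget=6, count_tokens=words)
print(ctx)  # ['sys desc', 'recent msg', 'hi']
```

In the demo, both temporary blocks and then the oldest history message are dropped before everything fits the budget.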

@Depth Settings

To keep important information influential, use @Depth settings. Setting something to @Depth 4 (a good default) inserts it four messages from the bottom of the chat history, keeping it "DOWN" in the context and therefore more influential.

Example with Author's Note @Depth 4:

System Message
Character description
Character personality
Scenario
Persona
Example messages
First message
Chat History (recent messages)
<<< Author's Note (at depth 4)
Chat History (last few messages)
Last user message
Last AI response
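
The placement rule can be sketched as a simple list insertion (illustrative; not SillyTavern's actual implementation):

```python
def insert_at_depth(messages, note, depth=4):
    """Insert `note` so that `depth` messages sit below it, keeping it
    near the bottom ("DOWN") of the context even as the chat grows."""
    cut = max(0, len(messages) - depth)
    return messages[:cut] + [note] + messages[cut:]

chat = ["msg1", "msg2", "msg3", "msg4", "msg5", "msg6"]
print(insert_at_depth(chat, "<<< Author's Note", depth=4))
# ['msg1', 'msg2', "<<< Author's Note", 'msg3', 'msg4', 'msg5', 'msg6']
```

As new messages arrive, the note "floats" with the bottom of the history instead of being pushed further and further up.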

Understanding Model Parameters (B = Billion)

When running local models, you have many choices: servers, APIs, types, models, Bs (billions), Qs (quantizations), etc.

Different servers are optimized for different model types:

  • KoboldCPP: CPU/GPU split and GGUF format
  • TabbyAPI: Fast EXL2 format
  • Oobabooga: All-in-one solution
  • LMStudio: Integrated model download
  • Ollama: General purpose
  • ComfyUI: Specialized for image creation

Model File Types

  • PyTorch: Mainly used for development and training, rarely for local inference directly
  • Safetensors: Raw unquantized weights in a safer container than PyTorch files (no arbitrary code execution); widely supported
  • EXL2: Fast if run fully from VRAM
  • GGML: Predecessor of GGUF, can run on CPU+RAM (obsolete)
  • GGUF: Can run split across GPU/CPU (VRAM/RAM), the default format for llama.cpp

What Does "B" Mean?

The B represents the number of parameters in the model, in billions. Higher B doesn't necessarily mean more intelligent; it means the model "knows" more things.

Think of it like encyclopedias:

  • 1B parameters = encyclopedia with 1 billion entries
  • 7B parameters = encyclopedia with 7 billion entries
  • 20B parameters = encyclopedia with 20 billion entries

Quantization (Q)

What is Quantization?

Models can be stored in full floating point (FP32 or FP16), which makes them extremely large. Quantization optimizes models by reducing precision without significantly degrading performance, saving memory and processing costs.
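
As a toy illustration of the idea, here is simple absmax rounding onto a small integer grid. Real quantization schemes (Qx_K, IQ, etc.) are considerably more sophisticated, but the precision-for-memory trade is the same:

```python
def quantize(weights, bits=4):
    """Toy absmax quantization: map floats onto a small integer grid,
    trading precision for memory."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / levels
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

w = [0.12, -0.48, 0.30, 0.07]
q, s = quantize(w, bits=4)
approx = dequantize(q, s)
print(q)       # [2, -7, 4, 1] -- each value now fits in 4 bits
print(approx)  # close to the originals, but not exact
```

Each 4-bit integer replaces a 16- or 32-bit float, so the file shrinks to roughly a quarter or an eighth of its size, at the cost of small rounding errors.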

Quantization Naming Convention

  • fxx = float (ignore f16, too large)
  • Qx = quantization
  • IQx = iMatrix quant (not instruct/INST)

Quantization Levels

  • Qx_0: Legacy format (ignore)
  • Qx_K: Standard quantization
  • Qx_K_L: Large
  • Qx_K_M: Medium
  • Qx_K_S: Small
  • IQx_M: Smaller but often comparable performance to Qx_K_M
  • IQx_S: Smaller but often comparable performance to Qx_K_S

Quantization Guidelines

  • Never use Q or IQ below 2: Models degrade too much
  • Q8 vs Q6: Often indistinguishable, so Q6 is recommended
  • Q4_K_M: Good balanced default (size, performance, quality)

The Golden Rule: B vs Q

A higher B with lower Q (above 2) is always better than lower B with higher Q.

Hardware Requirements

VRAM/RAM Rule of Thumb

Your available VRAM/RAM minus about 2GB (for context and overhead) = the largest model file you should load (at low context sizes)
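
One way to read this rule of thumb in code (a rough sketch; the function names are made up, and bits-per-weight values are approximate, e.g. Q4_K_M is closer to ~4.8 bits than 4):

```python
def file_size_gb(params_billion, bits_per_weight):
    """Rough model file size: parameters * bits / 8 bits-per-byte, in GB."""
    return params_billion * bits_per_weight / 8

def fits(vram_gb, params_billion, bits_per_weight, overhead_gb=2.0):
    """Rule of thumb: the model file must fit in VRAM minus ~2 GB
    of overhead for context and runtime buffers."""
    return file_size_gb(params_billion, bits_per_weight) <= vram_gb - overhead_gb

print(round(file_size_gb(8, 4.8), 1))   # ~4.8 GB for an 8B Q4_K_M
print(fits(8, 8, 4.8))                  # True: fits an 8 GB card
print(fits(8, 8, 16))                   # False: FP16 would need ~16 GB
```

This matches the 8GB example above: an 8B model at Q4_K_M lands around 5GB, leaving headroom on an 8GB card.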

Example Configurations

8GB VRAM GPU

  • 7B model with Q5_K
  • 8B model with Q4_K_M

16GB VRAM GPU

  • 22B models (amazing step above 7-13B)
  • Consider GGUF 20B on 8GB VRAM + RAM split (still bearable, much more fun)

Model Size Recommendations

  • Everything over 20B: Amazing quality
  • 70B-72B: Too much for most gaming PCs, will be slow from RAM
  • 100B+: Amazing but require rented hardware
  • Warning: Can spoil your fun—you'll be disappointed going back to lower B models
  • Advice: Stay within your long-term available range

Smaller Models

  • 1B or smaller than 7B: For embedded applications on small devices, not suitable for roleplay on gaming PCs

Mixture of Experts (MoE) Models

What Does "8x7B" Mean?

These models combine several expert networks; for each token, a router selects 2 of the experts at every layer. The idea is a higher total parameter count while paying roughly the compute cost of a much smaller model.

Example: Mistral AI's Mixtral models use this approach. See: https://mistral.ai/news/mixtral-of-experts/
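
The parameter arithmetic can be sketched naively (note the naive total overcounts: real MoE models share the attention layers between experts, so Mixtral 8x7B is roughly 47B total with ~13B active per token, not the 56B/14B this sketch yields):

```python
def moe_params(num_experts, expert_b, active_experts, shared_b=0.0):
    """Naive MoE parameter sketch, in billions of parameters.

    Returns (total stored, active per token). Real models share
    layers between experts, so actual totals are lower.
    """
    total = num_experts * expert_b + shared_b
    active = active_experts * expert_b + shared_b
    return total, active

total, active = moe_params(num_experts=8, expert_b=7, active_experts=2)
print(total, active)  # 56 14 -- naive upper bound: stored vs. used per token
```

The point of the architecture shows up in the ratio: most parameters sit on disk/RAM, but only a fraction does work on any given token.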

Personal note: I don't use these models and can't vouch for their effectiveness.

Hugging Face Model Variations

Model Modifications

  • Quants: Quantization as described above
  • Merges: Mixing models to combine strengths or emphasize characteristics
  • Finetunes: Extending model knowledge by adding datasets (e.g., roleplay datasets)

Common Model Tags

  • Uncensored, Abliterated, NSFW: Models that actively engage/push in certain directions
  • eRP: Finetuned for erotic roleplay, or allowing adult content in fantasy settings
  • RolePlay, RP: Finetuned for general roleplay
  • Adventure: Finetuned for NetHack-like adventure games
  • LLaMA 2/3
  • Mistral
  • Mixtral
  • Falcon
  • Qwen
  • Command-R
  • DeepSeek

Some models have copyright restrictions or unclear legal status. For personal use, this generally falls into a legal gray area. However, for professional or commercial use, check the specific model's license and stick to properly licensed base models.

Note: I'm not a lawyer—make your own informed decisions.

Conclusion

If you just want to roleplay, you should now have all the information needed to make informed decisions about tokens, context, and model selection.

Key Takeaways:

  • Higher B with reasonable Q beats lower B with higher Q
  • Stay within your hardware's long-term capabilities
  • 16k context is often the sweet spot
  • Q4_K_M is a good balanced default
  • Test different models to find what works for your use case

For character creation guidance, that will be covered in a separate guide.