LLM & Agent Fundamentals

Note: maybe the title should just be "LLM," since agent fundamentals are covered in 02-quick-start.mdx, and 02 should be retitled "Agent Fundamentals" or "Introduction to Agent Engineering."

Note: we never introduce what an agent is or how agent engineering differs from traditional software engineering until Chapter 10, which is too late. We should introduce the concept of an agent in 02-quick-start, since we're already building one there: elaborate on agents and agent engineering, cover the elements of an agent, and then explain that the following tutorials cover each of those elements in detail.

To build effective agents, you must first understand the raw material: the Large Language Model (LLM).

Many engineers approach LLMs like they approach databases or APIs—as black boxes with predictable inputs and outputs. This leads to frustration. The model "hallucinates." It "forgets" context. It gives different answers to the same question. These feel like bugs, but they're actually features of how the system works.

The engineers who build great agents are the ones who understand the engine. Once you internalize how LLMs actually work, the behavior becomes predictable—and you can engineer around the limitations while exploiting the strengths.

This chapter gives you that understanding. We'll start with tokens—what they are and why they matter—then reveal the deceptively simple mechanism that powers all LLMs, before exploring capabilities, limitations, and how to choose models.


What is a Token?

A token is the atomic unit that LLMs read and write. Not characters. Not words. Tokens.

Think of tokenization as the model's way of "chunking" text into digestible pieces:

| Text | Tokens | Count |
|---|---|---|
| "Hello" | ["Hello"] | 1 |
| "Hello, world!" | ["Hello", ",", " world", "!"] | 4 |
| "understanding" | ["under", "standing"] | 2 |
| "antidisestablishmentarianism" | ["anti", "dis", "establishment", "arian", "ism"] | 5 |

Common words stay whole. Rare or complex words get split into sub-word pieces. Punctuation and spaces are their own tokens.

Token Intuition

💡 Quick heuristic: 1 token ≈ 0.75 English words. A typical chat message is 50-200 tokens. A page of text is ~500 tokens. A novel is ~100,000 tokens.

💡 Tip: paste any text into OpenAI's tokenizer (https://platform.openai.com/tokenizer) to see exactly how it splits into tokens.

Why Tokenization?

Why not just use characters or words? Tokenization is a carefully engineered trade-off:

Characters would require models to learn spelling from scratch for every language. The sequence "c-a-t" meaning a furry animal would have no obvious connection to "c-a-t-s" being plural. Sequences would be extremely long.

Words would create an impossibly large vocabulary. Every misspelling, every technical term, every new slang would be "unknown." What about "ChatGPT-4o-mini"? Is that one word or five?

Tokens hit the sweet spot: a vocabulary of ~100,000 tokens that can represent any text efficiently. Common patterns get single tokens; rare patterns combine existing pieces. It's compression that preserves meaning.
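The "common patterns stay whole, rare patterns combine pieces" idea can be sketched with a greedy longest-match tokenizer. This is a deliberate toy: real tokenizers (such as BPE) learn their vocabulary from data, while the hand-picked `VOCAB` here exists only to reproduce the examples above.

```python
# Toy sub-word tokenizer: greedily match the longest known piece.
# A simplification of how real tokenizers behave -- the vocabulary
# is hand-picked for illustration, not learned from data.
VOCAB = {"hello", "under", "standing", "anti", "dis",
         "establishment", "arian", "ism", ",", "!", " world"}

def toy_tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Find the longest vocabulary entry starting at position i
        for j in range(len(text), i, -1):
            if text[i:j].lower() in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("understanding"))  # ['under', 'standing']
print(toy_tokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']
```

Notice that "understanding" splits into two reusable pieces, and punctuation and the leading space of " world" are tokens in their own right, matching the table above.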

Why Tokens Matter to You

As an agent builder, tokens are your currency:

| Dimension | Why It Matters |
|---|---|
| Cost | You pay per token, both input and output. A $0.01/1K-token model costs $1 for 100K tokens. |
| Limits | Context windows are measured in tokens. Exceed the limit, and content gets truncated. |
| Speed | Generation is measured in tokens/second. A 100-token response at 50 TPS takes 2 seconds. |

Understanding tokens is understanding the economics of LLMs.


How LLMs Generate Text

Now here's the key insight: every LLM does exactly one thing—predict the next token.

Given a sequence of tokens, the model outputs a probability distribution over what token should come next:

Input: ["The", " capital", " of", " France", " is"]
 
Output probabilities:
  " Paris"     → 0.92
  " Lyon"      → 0.02
  " the"       → 0.01
  " a"         → 0.01
  ...          → 0.04

The model samples from this distribution (or just picks the highest-probability token), appends the result to the sequence, and repeats.

This loop continues until the model produces a special stop token or hits a length limit. That's it. That's the entire mechanism.
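The entire mechanism fits in a short loop. In this sketch, a hard-coded lookup table stands in for the neural network: a real model computes a next-token distribution for any context, while `FAKE_MODEL` only knows the two contexts needed to make the loop concrete.

```python
# A stand-in "model": maps a context to next-token probabilities.
# A real LLM computes such a distribution for *any* context; here we
# hard-code two entries just to make the generation loop concrete.
FAKE_MODEL = {
    ("The", " capital", " of", " France", " is"):
        {" Paris": 0.92, " Lyon": 0.02, " the": 0.01, " a": 0.01},
    ("The", " capital", " of", " France", " is", " Paris"):
        {"<stop>": 1.0},
}

def generate(tokens: list[str], max_new_tokens: int = 10) -> list[str]:
    for _ in range(max_new_tokens):
        probs = FAKE_MODEL[tuple(tokens)]
        # Greedy decoding: pick the highest-probability token.
        # (Sampling from the distribution instead is what temperature controls.)
        next_token = max(probs, key=probs.get)
        if next_token == "<stop>":
            break
        tokens = tokens + [next_token]
    return tokens

print(generate(["The", " capital", " of", " France", " is"]))
# ['The', ' capital', ' of', ' France', ' is', ' Paris']
```

Predict, append, repeat until a stop token or a length limit: that really is all the loop does.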

This sounds almost trivially simple. Predict the next word, repeat. A child could understand it.

So why does this produce essays, code, poetry, and reasoning?


Why Next-Token Prediction Works

Ilya Sutskever, co-founder of OpenAI, offers a thought experiment that reframes everything:

Imagine an LLM trained on the complete text of a detective novel. At the very end, someone asks: "Who is the murderer?"

If the model correctly predicts the answer, what does that imply?

It implies the model has—in some functional sense—understood the plot. It tracked characters, motives, alibis, and red herrings. It followed the logic of the narrative. To predict correctly, the model had to build an internal representation of the story's underlying reality.

"Predicting the next token well means that you understand the underlying reality that led to the creation of that token... In order to compress those statistics, you need to understand what is it about the world that creates this set of statistics." — Ilya Sutskever

This is the profound insight: prediction at scale requires compression, and compression requires understanding. The model isn't memorizing patterns—it's building internal representations of concepts, relationships, and causality. When you train on enough text, "predicting the next word" becomes indistinguishable from "understanding the world that generated those words."


The Token Lifecycle

Now let's see how this works in practice. When you interact with an LLM, your message is tokenized, combined with the system prompt and conversation history, and fed through the model, which generates its response one token at a time.

The context window is the model's working memory—the maximum number of tokens it can "see" at once. Everything must fit: your system prompt, conversation history, and the current message. Modern models range from 128K tokens (GPT-4o) to over 1 million (Gemini 2.5 Pro).

Why Prompting Techniques Work

Here's an insight that will make you a better prompt engineer:

Next-token prediction is a search through a vast space of possible continuations. When you give the model a prompt, you're placing it at a starting point in this space. Each generated token constrains where the next token can go.

Consider what happens when you ask the model to "think step by step":

  1. Each intermediate step shrinks the output space. Instead of jumping directly to an answer (which could land anywhere), the model first generates reasoning tokens that constrain the path.

  2. Correct reasoning narrows to correct answers. If the model writes "First, I need to calculate 17 × 20 = 340..." it has committed to a reasoning path. The final answer is more likely to be correct because intermediate tokens constrain it.

  3. This is why reflection works. Asking a model to "check your work" gives it more tokens to course-correct—more chances to navigate toward the right region of output space.

This is the fundamental insight behind chain-of-thought prompting, self-reflection, and many techniques we'll cover in the Prompt Engineering chapter. You're not asking the model to "explain itself"—you're engineering a path through output space that leads to correct answers.


A Mental Model: The LLM as a Worker

Now that you understand the mechanism, let's build a practical mental model.

It's helpful to think of an LLM as a specialized knowledge worker. Like a human consultant, it has:

| Attribute | LLM Equivalent | Implication |
|---|---|---|
| Reasoning | Ability to follow logic, make inferences, solve problems | Varies by model; some are better at math, coding, or creative tasks |
| Knowledge | Information learned from training data | Vast but frozen at a cutoff date; may have gaps |
| Working Memory | Context window | Finite; everything must fit or be summarized |
| Tools | Function calling, code execution, web browsing | Can augment capabilities beyond pure text generation |

This analogy breaks down in important ways—LLMs don't "understand" like humans do, and they can fail in unexpected ways. But it's a useful starting point for intuition about what you can ask of them and what limitations to expect.


What LLMs Can Do: Capabilities

Modalities

Today's frontier models are far more than text generators. They are Large Multimodal Models (LMMs), capable of processing and generating across multiple modalities:

| Modality | Capability | Agent Use Case |
|---|---|---|
| Text | Read, write, reason | Core interface for logic, planning, and instruction |
| Vision | Understand images, screenshots, documents | UI automation, document analysis, visual verification |
| Audio | Transcribe, understand tone, generate speech | Voice agents, meeting summarization |
| Video | Analyze motion, scenes, temporal context | Content moderation, robotics |

Multimodal Agents

Vision is particularly powerful for agents. An agent that can "see" a screenshot can navigate a UI, read error messages, or verify that a task was completed correctly. We'll build vision-enabled agents later in this tutorial.

Reasoning Modes: Standard vs. Extended Thinking

Not all models think the same way. A critical distinction for agent builders:

Standard Models (GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash)

  • Generate output tokens directly, one at a time
  • Fast, cheap, and good for most tasks
  • May struggle with complex multi-step reasoning

Extended Thinking Models (o1, o3, Gemini 2.5 Pro with "thinking")

  • Spend additional compute generating internal "thinking" tokens before the final answer
  • Slower and more expensive, but dramatically better at math, coding, and complex logic
  • The model essentially "shows its work" internally
TODO: 17 * 23 is not a great example for extended thinking. Replace with a better one.

When to use which:

  • Standard: Chat, summarization, simple tool calls, low-latency requirements
  • Extended Thinking: Complex reasoning, code generation, mathematical proofs, planning multi-step agent workflows

Cost Implication

Extended thinking models can use 10-100x more tokens internally. A single reasoning query might cost $0.50 instead of $0.01. Budget accordingly.


What LLMs Can't Do: The Four Limitations

Every engineer must respect these boundaries. They're not bugs—they're fundamental to how the technology works.

1. Probabilistic Generation

Despite the power of next-token prediction, LLMs are fundamentally probabilistic. They don't "know" answers; they generate plausible answers.

  • Non-determinism: The same prompt can yield different outputs (controlled by the temperature parameter—lower = more deterministic, higher = more creative)
  • Confidence ≠ Correctness: A model can sound absolutely certain while being completely wrong

2. Hallucination

The model will confidently invent facts when it doesn't know the answer. It's not lying—it's pattern-matching to what sounds correct based on training data.

Mitigation: Ground the model with external data (RAG), use tools for factual lookups, and always verify critical information.

3. Knowledge Cutoff

The model's knowledge is frozen at training time. It doesn't know about yesterday's news, your company's internal docs, or the API changes released last week.

Mitigation: Inject current information via RAG or tools. For time-sensitive tasks, always provide context in the prompt.

4. Sycophancy

Models are fine-tuned to be helpful, which can manifest as excessive agreeableness. If you push back on an answer, the model may capitulate even when it was originally correct.

Mitigation: Be aware of this in multi-turn conversations. Don't assume the model's second answer is better than its first just because you challenged it.

5. Mimicry

LLMs imitate what they see. If your prompt has typos, the output will have typos. If your examples are verbose, the output will be verbose.

This becomes dangerous in agent loops. Each action the agent takes becomes part of the context for the next action. After reviewing 5 resumes the same way, the agent falls into a rhythm—and by resume #15, it's on autopilot, pattern-matching against its own outputs rather than actually reading.

Mitigation: Break repetitive tasks into batches. Vary the phrasing and order of inputs. The more uniform your context, the more brittle your agent.

You can try the following exchange to see these traits in action:

  1. Ask the agent for today's date.
  2. Ask who the current president of the USA is.
  3. "Correct" it, insisting that both the date and the president are wrong.
  4. Finally, ask it to explain why it was wrong.

(This is just one example; we may find a better demonstration. Most modern LLMs receive the current date in their system prompt and can't be tricked about it, so another example might work better.)

Choosing a Model

The Landscape

The AI model landscape evolves monthly, but the major players have distinct strengths:

| Provider | Flagship Model | Best At | Context Window |
|---|---|---|---|
| Google | Gemini 2.5 Pro | Reasoning, multimodal, long context | 1M+ tokens |
| OpenAI | GPT-4o, o1, o3 | General purpose, coding, vision | 128K-200K tokens |
| Anthropic | Claude 3.5 Sonnet | Long-form writing, coding, instruction-following | 200K tokens |
| Meta | Llama 3.3 | Open weights, self-hosting, customization | 128K tokens |

Why Gemini for This Tutorial?

We'll use Google Gemini paired with the Google AI Agent Developer Kit (ADK) for several reasons:

  • Massive context window: 1M+ tokens means we can stuff entire codebases into context
  • Native multimodal: Vision, audio, and video are first-class citizens
  • Extended thinking: Gemini 2.5 Pro supports deep reasoning when needed
  • ADK integration: The framework handles orchestration, tool calling, and memory out of the box

That said, the concepts in this tutorial are framework-agnostic. Every technique we teach—prompt engineering, tool calling, agent loops—works with any provider.

Benchmarks: Use With Caution

Public benchmarks give you a starting point, but they're not gospel:

| Benchmark | What It Measures |
|---|---|
| LMSYS Chatbot Arena | Human preference in head-to-head comparisons |
| SWE-bench | Real-world coding ability (fixing GitHub issues) |
| GAIA | General AI Assistant tasks (tool use, multi-step reasoning) |
| MMLU | Academic knowledge across 57 subjects |

The Golden Rule:

Always evaluate on YOUR specific task.

A model that tops SWE-bench might struggle with your particular codebase. A model that scores poorly on MMLU might be perfect for your customer service bot.

Practical approach:

  1. Build a small evaluation dataset (20-50 examples) of your actual use case
  2. Test 2-3 candidate models
  3. Measure what matters to you: accuracy, latency, cost, tone
  4. Re-evaluate when new models drop (quarterly)
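A minimal version of such an evaluation harness fits in a few lines. Here `ask_model` is a placeholder for whatever client call you actually use (Gemini, OpenAI, etc.), and `fake_model` is a stub that exists only so the sketch runs; the substring check is the crudest possible grader, which you'd replace with task-specific scoring.

```python
# Minimal evaluation-harness sketch. In practice, EVAL_SET would hold
# 20-50 real examples from your use case, and ask_model would call a
# real model API.
EVAL_SET = [
    {"prompt": "2 + 2 = ?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def evaluate(ask_model, dataset) -> float:
    # Crude grading: count an answer as correct if it contains the
    # expected string. Replace with scoring that fits your task.
    correct = sum(1 for ex in dataset if ex["expected"] in ask_model(ex["prompt"]))
    return correct / len(dataset)

# Stubbed "model" so the sketch is runnable without an API key
def fake_model(prompt: str) -> str:
    return {"2 + 2 = ?": "The answer is 4.",
            "Capital of France?": "Paris"}[prompt]

print(f"accuracy: {evaluate(fake_model, EVAL_SET):.0%}")  # accuracy: 100%
```

Running the same `evaluate` against two or three candidate models gives you a like-for-like comparison on your data, which no public leaderboard can.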

Engineering Concepts for Building with LLMs

Beyond understanding how LLMs work, you'll need to master a few engineering concepts that directly impact agent performance and user experience.

TTFT: Time to First Token

What it is: The latency between sending a request and receiving the first token of the response.

Why it matters: TTFT is the primary driver of perceived responsiveness. A 500ms TTFT feels instant; a 3-second TTFT feels broken—even if both responses take the same total time to complete.

How it affects agents: In conversational agents, high TTFT creates awkward pauses that break the flow. In agentic loops where the model calls tools repeatedly, TTFT compounds: 10 sequential tool calls with 500ms TTFT adds 5 seconds before any tool even executes. When designing agents, you'll often choose faster models (like Gemini 2.0 Flash) for routine steps and reserve slower, more capable models for complex reasoning.

TPS: Tokens Per Second

What it is: The rate at which the model generates output tokens after the first token arrives. Also called throughput.

Why it matters: TPS determines how long the user waits for the complete response. A 500-token response at 100 TPS takes 5 seconds; at 20 TPS, it takes 25 seconds.

How it affects agents: For agents that process long outputs (code generation, report writing, data analysis), low TPS creates bottlenecks. But here's the nuance: if you're not streaming to a user and only care about the final result, TPS matters less than total latency. For background agents, batch throughput (requests per minute) often matters more than per-request TPS.
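TTFT and TPS combine into the total time a user (or a calling agent) waits. A back-of-the-envelope sketch, using the same numbers as the text above:

```python
def total_latency_s(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Total wait = time to first token + time to generate the rest."""
    return ttft_s + output_tokens / tps

# The examples from the text (assuming 500ms TTFT):
print(total_latency_s(0.5, 500, 100))  # 5.5  -> 500-token answer at 100 TPS
print(total_latency_s(0.5, 500, 20))   # 25.5 -> same answer at 20 TPS

# TTFT compounds across sequential tool calls: 10 calls at 500ms TTFT
# adds 5 seconds before counting any generation or tool-execution time.
print(10 * 0.5)  # 5.0
```

This is why agent designers mix models: a fast model for the many routine steps, a slow capable one for the few steps that need it.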

Streaming

What it is: Receiving tokens incrementally as they're generated, rather than waiting for the complete response.

Why it matters: Streaming transforms perceived latency. Instead of staring at a blank screen for 10 seconds, users see text appearing in real-time—even though the total time is identical.

How it affects agents:

  • User-facing agents: Always stream. It's non-negotiable for good UX.
  • Tool-calling agents: More nuanced. You can stream the reasoning but need to wait for the complete tool call JSON before execution. Most frameworks handle this for you.
  • Background agents: Streaming adds complexity for no benefit. Use non-streaming calls.

# Streaming example with Google GenAI
from google import genai
 
client = genai.Client()
 
# Non-streaming: wait for complete response
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Explain quantum computing"
)
 
# Streaming: process tokens as they arrive
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="Explain quantum computing"
):
    print(chunk.text, end="", flush=True)

Temperature

What it is: A parameter (typically 0-2) that controls randomness in token selection. At temperature 0, the model always picks the highest-probability token. At higher temperatures, it samples more randomly from the distribution.

Why it matters: Temperature trades off creativity vs. consistency.

| Temperature | Behavior | Use Case |
|---|---|---|
| 0 | Deterministic, always picks top token | Code generation, factual Q&A, structured output |
| 0.3-0.7 | Balanced, some variation | General conversation, writing assistance |
| 1.0+ | Creative, more random sampling | Brainstorming, creative writing, generating alternatives |

How it affects agents: For most agent tasks, use temperature 0. Agents need predictable, consistent behavior—especially when parsing structured outputs or making decisions. Save higher temperatures for creative subtasks like generating marketing copy or brainstorming solutions.

Reproducibility Warning

Even at temperature 0, outputs aren't perfectly deterministic due to floating-point arithmetic and batching. If you need exact reproducibility, you'll need to cache responses or use seed parameters (where supported).
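Mechanically, temperature divides the model's raw scores (logits) before the softmax, sharpening or flattening the resulting distribution. A minimal sketch with made-up logits (this mirrors the standard formulation; real inference stacks add details like top-p filtering):

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token
        return max(logits, key=logits.get)
    # Divide logits by temperature, then softmax
    # (subtracting the max for numerical stability)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    # Higher temperature -> flatter distribution -> more random picks
    return random.choices(list(exps), weights=list(exps.values()))[0]

logits = {" Paris": 5.0, " Lyon": 2.0, " the": 1.0}
print(sample_with_temperature(logits, 0))    # always " Paris"
print(sample_with_temperature(logits, 1.5))  # usually " Paris", sometimes others
```

At temperature 0 the sampler collapses to `max()`; as temperature grows, the gap between " Paris" and the alternatives shrinks and low-probability tokens start appearing.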

Context Window Economics

What it is: The context window is the model's working memory—the maximum tokens it can process at once. But it's not just a limit; it's a resource with real costs.

Why it matters: Larger context = higher cost and latency. Most APIs charge per input token, so stuffing 100k tokens of "context" costs real money—even if most of it is irrelevant.

| Context Usage | Cost Impact | Latency Impact |
|---|---|---|
| 1K tokens | Baseline | Fast |
| 10K tokens | ~10x input cost | Slight increase |
| 100K tokens | ~100x input cost | Noticeable prefill time |
| 1M tokens | Expensive | Significant prefill latency |

How it affects agents: The temptation is to dump everything into context—entire codebases, full conversation histories, multiple documents. This works for demos but kills production economics. In Chapter 12 (Context Engineering), we'll learn techniques like summarization, RAG, and intelligent context selection to maximize the value of every token.
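To make the economics concrete, here's a sketch of per-request input cost. The price is illustrative, not any provider's actual rate:

```python
def input_cost_usd(context_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of sending `context_tokens` as input at a per-million-token rate."""
    return context_tokens / 1_000_000 * usd_per_million_tokens

PRICE = 2.50  # illustrative $/1M input tokens -- check your provider's pricing

for ctx in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> ${input_cost_usd(ctx, PRICE):.4f} per request")

# Agents re-send the whole context on every step, so a 50-step loop
# over a 100K-token context pays this input cost 50 times.
print(50 * input_cost_usd(100_000, PRICE))  # 12.5
```

The per-request numbers look small; the multiplication by steps and by users is what kills production economics.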

Prefill vs. Generation

Prefill is processing the input context (fast, parallel). Generation is producing output tokens (slower, sequential). A 100K context adds seconds of prefill time before the first token even starts generating.


Hands-On: Explore in Google AI Studio

Theory is useful, but nothing beats hands-on experimentation. Let's build intuition by playing with a real LLM.

Go to Google AI Studio (free, requires Google account).

Exercise 1: Feel the Temperature

  1. Select Gemini 2.0 Flash as your model
  2. In the prompt box, enter: Give me a creative name for a coffee shop
  3. Click Run and note the response
  4. Open the Run settings panel on the right
  5. Set Temperature to 0 → Run again → Note how the answer is consistent
  6. Set Temperature to 1.5 → Run 5 times → Notice the variation

What you should observe: At temperature 0, you get the same (or very similar) answer every time. At 1.5, responses vary wildly—some creative, some nonsensical.

Exercise 2: Watch Streaming vs. Non-Streaming

  1. Enter a longer prompt: Write a 200-word story about a robot learning to paint
  2. Watch the response stream in token by token
  3. Notice the TTFT (how long before text starts appearing)
  4. Notice the TPS (how fast text flows after it starts)

What you should observe: The first token takes a moment (TTFT), then tokens flow smoothly (TPS). This is why streaming feels responsive even for long responses.

Exercise 3: Context Window Intuition

  1. Start a new chat
  2. Have a short conversation (3-4 exchanges)
  3. Look at the token counter (shows input/output tokens)
  4. Now paste a long document (a Wikipedia article, or the content of this chapter)
  5. Watch the token count jump

What you should observe: A few sentences = tens of tokens. A document = thousands. You'll quickly see how context fills up.

Exercise 4: Compare Models

  1. Ask a reasoning question: What's 17 × 23? Think step by step.
  2. Run with Gemini 2.0 Flash
  3. Switch to Gemini 2.5 Pro and run again
  4. Compare: quality of reasoning, latency, token usage

What you should observe: Pro gives more detailed reasoning but takes longer. Flash is snappier but may be terser. Both should get the right answer for this simple problem—try harder questions to see capability differences.

Experimentation Mindset

There's no substitute for hands-on experimentation. Throughout this tutorial, we'll build things in code—but AI Studio is your playground for quick tests, prompt iteration, and building intuition. Bookmark it.


Summary

You now have the mental model:

  1. Tokens are the atoms—LLMs read and write tokens, not words. Tokenization is compression that preserves meaning.
  2. LLMs predict the next token—and prediction at scale requires understanding
  3. Prompting techniques work by guiding the model through output space toward correct answers
  4. Capabilities are impressive: multimodal input, reasoning modes, tool use
  5. Limitations are fundamental: probabilistic, hallucinates, frozen knowledge, sycophantic
  6. Engineering concepts matter: TTFT, TPS, streaming, temperature, and context economics directly impact agent UX and cost

In the next chapter, we'll put this knowledge to work with Prompt Engineering—the primary way we "program" these probabilistic machines.