Your AI service just returned an empty response. No error code. No stack trace. The model simply decided not to answer. Traditional debugging assumes deterministic behavior — same input, same output. AI systems break that assumption, and your logging strategy needs to catch up.
Understanding the Shift to Observability
This guide focuses on implementing observability for AI systems using large language models (LLMs) and agent workflows. You'll learn how to instrument non-deterministic components, trace decision paths, and validate outputs when traditional debugging methods fail. This is relevant for chatbots, code analysis tools, automated security assessments, and any service where an LLM makes decisions.
Key Concepts
Observability vs. Logging
Logging captures what happened. Observability explains why. In AI systems, you need to trace the decision chain: which prompt template fired, what context the model received, how it interpreted instructions, and where validation failed.
Span-Based Tracing
A span represents one operation in your system. For AI workflows, each span captures a discrete step: prompt construction, model invocation, response parsing, and validation. Spans nest to show parent-child relationships.
Structured Validation
AI outputs are probabilistic. Validate structure (did we get JSON?), schema (does it match expected fields?), and semantic meaning (is the content reasonable?). Log each validation step separately.
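The three layers can be sketched as one function that reports each check separately. This is a minimal illustration using only the standard library; the `EXPECTED_FIELDS` schema, the `ALLOWED_SEVERITIES` set, and the `validate_output` name are assumptions invented for the example, not part of any library.

```python
import json
from typing import Any

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_FIELDS = {"severity": str, "summary": str}
ALLOWED_SEVERITIES = {"low", "medium", "high"}


def validate_output(raw: str) -> dict[str, Any]:
    """Run the three checks in order and report each result separately."""
    result: dict[str, Any] = {
        "json_valid": False,
        "schema_valid": False,
        "semantic_valid": False,
        "errors": [],
    }

    # 1. Structure: is the response parseable JSON at all?
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        result["errors"].append(f"json: {exc.msg}")
        return result
    result["json_valid"] = True

    # 2. Schema: is it an object with the expected fields and types?
    if not isinstance(parsed, dict):
        result["errors"].append("schema: top-level value is not an object")
        return result
    bad = [
        name for name, typ in EXPECTED_FIELDS.items()
        if not isinstance(parsed.get(name), typ)
    ]
    if bad:
        result["errors"].append(f"schema: missing or mistyped fields {bad}")
        return result
    result["schema_valid"] = True

    # 3. Semantics: is the content plausible for this (assumed) task?
    if parsed["severity"] not in ALLOWED_SEVERITIES:
        result["errors"].append("semantic: unknown severity")
    elif not parsed["summary"].strip():
        result["errors"].append("semantic: empty summary")
    else:
        result["semantic_valid"] = True
    return result
```

Because each flag is recorded on its own, a dashboard can distinguish "the model stopped returning JSON" from "the JSON is fine but the content drifted".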
Instrumentation Points
Prompt Construction
Capture the exact prompt sent to the model, including:
- Template variables and their values
- System messages and role instructions
- Context window contents
- Token counts
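One way to capture these fields is to build the prompt through a single function that returns a record alongside the rendered text. The sketch below is an assumption-laden illustration: `PromptRecord`, `build_prompt`, and `approx_token_count` are hypothetical names, and the token count is a whitespace word count standing in for a real tokenizer.

```python
from dataclasses import dataclass
from string import Formatter


def approx_token_count(text: str) -> int:
    """Rough stand-in: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


@dataclass
class PromptRecord:
    template_id: str
    variables: dict[str, str]
    system_message: str
    prompt: str
    token_count: int


def build_prompt(template_id: str, template: str, system_message: str,
                 variables: dict[str, str]) -> PromptRecord:
    """Render the template and keep everything needed to reproduce it."""
    # Fail loudly on a missing variable instead of sending a broken prompt.
    needed = {name for _, name, _, _ in Formatter().parse(template) if name}
    missing = needed - set(variables)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")

    prompt = template.format(**variables)
    return PromptRecord(
        template_id=template_id,
        variables=dict(variables),  # copy, so later mutation cannot alter the record
        system_message=system_message,
        prompt=prompt,
        token_count=approx_token_count(system_message) + approx_token_count(prompt),
    )
```

The record can then be attached to a span or log line as-is, which keeps the logged prompt identical to the one that was sent.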
Model Invocation
Record request parameters:
- Model version (e.g., gpt-4-0613, claude-3-opus-20240229)
- Temperature, top_p, max_tokens
- Request timestamp and latency
- Response metadata (finish_reason, token usage)
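A thin wrapper around the model call can collect all of these in one place. The following is a sketch under stated assumptions: `call_model` is any callable that returns a response-shaped dict, `fake_model` is a stand-in for a real client, and `InvocationRecord` and `invoke_and_record` are hypothetical names.

```python
import time
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class InvocationRecord:
    model_version: str
    temperature: float
    top_p: float
    max_tokens: int
    started_at: float       # Unix timestamp when the request was sent
    latency_ms: float
    finish_reason: str
    prompt_tokens: int
    completion_tokens: int


def invoke_and_record(call_model: Callable[..., dict], prompt: str, *, model: str,
                      temperature: float = 0.0, top_p: float = 1.0,
                      max_tokens: int = 256) -> tuple[dict, dict]:
    """Call the model and return (response, flat attribute dict)."""
    started_at = time.time()
    t0 = time.perf_counter()  # monotonic clock for measuring latency
    response = call_model(prompt, model=model, temperature=temperature,
                          top_p=top_p, max_tokens=max_tokens)
    latency_ms = (time.perf_counter() - t0) * 1000.0

    usage = response.get("usage", {})
    record = InvocationRecord(
        # Prefer the version the provider reports over the one requested.
        model_version=response.get("model", model),
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        started_at=started_at,
        latency_ms=latency_ms,
        finish_reason=response.get("finish_reason", "unknown"),
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
    )
    return response, asdict(record)


def fake_model(prompt: str, **params) -> dict:
    """Stand-in for a real client; returns a response-shaped dict."""
    return {
        "model": "gpt-4-0613",
        "content": "ok",
        "finish_reason": "stop",
        "usage": {"prompt_tokens": len(prompt.split()), "completion_tokens": 1},
    }
```

Calling `invoke_and_record(fake_model, "hello there", model="gpt-4")` returns the response plus a flat dict whose keys map directly onto span attributes.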
Output Processing
Track transformation steps:
- Raw model response
- Parsing attempts and failures
- Validation results
- Final processed output
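The transformation steps above can be recorded as a list of parsing attempts kept next to the raw response. This sketch assumes one common failure mode, a model wrapping its JSON in a Markdown code fence; `parse_with_history` and the fallback order are illustrative choices, not a standard.

```python
import json
import re

# Three backticks, built indirectly so this example can itself sit in a fenced block.
TICKS = "`" * 3
# Matches a response wrapped in a Markdown code fence, with an optional "json" tag.
FENCE_RE = re.compile(
    r"^\s*" + TICKS + r"(?:json)?\s*\n(.*?)\n\s*" + TICKS + r"\s*$",
    re.DOTALL,
)


def parse_with_history(raw: str) -> dict:
    """Try progressively looser parsers, recording every attempt."""
    record = {"raw": raw, "attempts": [], "parsed": None}

    candidates = [("strict_json", raw)]
    fenced = FENCE_RE.match(raw)
    if fenced:
        candidates.append(("strip_code_fence", fenced.group(1)))

    for name, text in candidates:
        try:
            record["parsed"] = json.loads(text)
            record["attempts"].append({"parser": name, "ok": True})
            break
        except json.JSONDecodeError as exc:
            record["attempts"].append({"parser": name, "ok": False, "error": exc.msg})
    return record
```

Keeping the failed attempts, rather than only the final result, shows how often the fallback parser is rescuing responses, which is an early signal that a prompt or model version has changed behavior.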
Trace Context Propagation
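Distributed traces require context propagation. When your AI service calls external APIs or spawns sub-agents, pass the trace identifier and the calling span's identifier along with each request, typically in HTTP headers. OpenTelemetry's default propagator uses the W3C Trace Context format, which carries both in the traceparent header.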
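To make the header format concrete, here is a standard-library sketch that builds and parses a W3C `traceparent` value (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The function names are hypothetical, and in a real service the OpenTelemetry propagators would normally do this for you; the sketch only shows what travels on the wire.

```python
import re
import secrets

# version - trace-id - parent-id - trace-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random trace ID and span ID, version 00."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # no valid context, so start a new trace
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

A sub-agent that receives `traceparent` and sends `child_traceparent(...)` on its own outgoing requests keeps every hop attached to the same trace.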
Implementation Guidance
Install Dependencies
OpenTelemetry provides vendor-neutral instrumentation. Install the SDK and exporters:
pip install opentelemetry-api opentelemetry-sdk
pip install opentelemetry-exporter-otlp
Configure Tracing
Initialize a tracer at application startup:
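from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export over OTLP; unless configured otherwise, the exporter targets a local collector (localhost:4317).
exporter = OTLPSpanExporter()
provider = TracerProvider()
# Batch spans in the background so exporting does not block request handling.
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)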
Instrument AI Workflows
Wrap each decision point in a span:
tracer = trace.get_tracer(__name__)

# template, variables, template_name, count_tokens, and client come from your application.
with tracer.start_as_current_span("prompt_construction") as span:
    prompt = template.format(**variables)
    span.set_attribute("prompt.template", template_name)
    span.set_attribute("prompt.token_count", count_tokens(prompt))

with tracer.start_as_current_span("model_invocation") as span:
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": prompt}],
    )
    # Log the version the API reports rather than the one requested.
    span.set_attribute("model.version", response.model)
    # finish_reason lives on each choice, not on the response object.
    span.set_attribute("response.finish_reason", response.choices[0].finish_reason)
Validate and Log Failures
AI systems often fail silently: the call succeeds, but the content is empty, malformed, or wrong. Validate every output:
import json

with tracer.start_as_current_span("output_validation") as span:
    # With the OpenAI client, the text is on the first choice's message; it can be None.
    raw = response.choices[0].message.content or ""
    try:
        parsed = json.loads(raw)
        span.set_attribute("validation.json_valid", True)
    except json.JSONDecodeError as e:
        span.set_attribute("validation.json_valid", False)
        span.record_exception(e)
        # Log the raw response for analysis
        span.set_attribute("response.raw", raw)
Make Components Independently Observable
Don't instrument only the top-level workflow. Each component — retrieval, ranking, generation, validation — needs its own span. This lets you identify which stage introduced the problem.
Consider a Retrieval-Augmented Generation (RAG) pipeline. Instrument:
- Document retrieval (which chunks were selected?)
- Relevance scoring (what were the scores?)
- Context assembly (how was context prioritized?)
- Generation (what prompt was actually sent?)
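The four stages above can each emit their own record, as in this toy pipeline. Everything here is an assumption made for illustration: retrieval is naive word overlap, context is packed greedily under a character budget, and `StageRecord` and `run_rag` are hypothetical names. In a real system each record would become the attributes of a separate span.

```python
from dataclasses import dataclass, field


@dataclass
class StageRecord:
    name: str
    attributes: dict = field(default_factory=dict)


def run_rag(query: str, corpus: dict[str, str], top_k: int = 2,
            max_context_chars: int = 200) -> tuple[str, list[StageRecord]]:
    """Toy RAG pipeline that records the intermediate state of every stage."""
    stages: list[StageRecord] = []

    # Retrieval and scoring: toy relevance = number of words shared with the query.
    q_words = set(query.lower().split())
    scores = {doc_id: len(q_words & set(text.lower().split()))
              for doc_id, text in corpus.items()}
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    selected = [d for d in ranked[:top_k] if scores[d] > 0]
    stages.append(StageRecord("retrieval", {
        "retrieval_method": "word_overlap",
        "chunk_count": len(selected),
        "source_ids": selected,
    }))
    stages.append(StageRecord("relevance_scoring", {"scores": scores}))

    # Context assembly: add chunks in rank order until the budget is spent.
    included, dropped, used = [], [], 0
    for d in selected:
        if used + len(corpus[d]) <= max_context_chars:
            included.append(d)
            used += len(corpus[d])
        else:
            dropped.append(d)
    stages.append(StageRecord("context_assembly", {
        "included": included, "dropped": dropped, "context_chars": used,
    }))

    # Generation: record the prompt exactly as it would be sent.
    context = "\n".join(corpus[d] for d in included)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    stages.append(StageRecord("generation", {"prompt": prompt}))
    return prompt, stages
```

With the `dropped` list recorded, an answer that ignores a relevant document can be traced to the context budget rather than blamed on the model.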
Common Pitfalls
Logging Only Inputs and Outputs
You need the intermediate states. When a model hallucinates, you need to see what context it received, not just the final bad output.
Ignoring Model Version Changes
Providers periodically repoint unversioned aliases to newer snapshots, so gpt-4 today may not be the gpt-4 you call next month. Always log the exact version string returned in the response metadata.
Treating All Failures as Errors
Some AI behaviors aren't errors — they're edge cases. A model refusing to answer an unsafe prompt is working correctly. Tag these cases separately from true failures.
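One way to tag these cases is a small classifier that runs before any error counter is incremented. This sketch assumes OpenAI-style `finish_reason` values (`stop`, `length`, `content_filter`); the `Outcome` categories and the phrase list in `REFUSAL_MARKERS` are deliberately naive placeholders that you would replace with your own.

```python
from enum import Enum
from typing import Optional


class Outcome(str, Enum):
    OK = "ok"
    REFUSAL = "refusal"      # expected behavior, tracked separately from errors
    TRUNCATED = "truncated"  # ran out of tokens before finishing
    EMPTY = "empty"          # nothing came back at all


# Assumed, deliberately naive phrase list; tune it against your own traffic.
REFUSAL_MARKERS = ("i can't help with", "i cannot help with", "i'm unable to assist")


def classify_outcome(content: Optional[str], finish_reason: str) -> Outcome:
    """Tag a response so dashboards can separate refusals from real failures."""
    text = (content or "").strip()
    if finish_reason == "content_filter":
        return Outcome.REFUSAL
    if any(marker in text.lower() for marker in REFUSAL_MARKERS):
        return Outcome.REFUSAL
    if not text:
        return Outcome.EMPTY
    if finish_reason == "length":
        return Outcome.TRUNCATED
    return Outcome.OK
```

Setting the result as a span attribute, for example `response.outcome`, lets an alert fire on `empty` and `truncated` while refusals are reviewed on their own schedule.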
Not Capturing Token Usage
Token consumption affects cost and latency. Track input tokens, output tokens, and total per request. This helps you identify expensive prompts.
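As a rough sketch of how that tracking might look, the function below aggregates per-request token counts by prompt template and sorts the most expensive first. The record shape (`template_id`, `input_tokens`, `output_tokens`) and the `summarize_token_usage` name are assumptions for illustration.

```python
from collections import defaultdict


def summarize_token_usage(requests: list[dict]) -> list[dict]:
    """Aggregate token usage per template, most expensive (by mean) first."""
    totals = defaultdict(lambda: {"requests": 0, "input_tokens": 0, "output_tokens": 0})
    for r in requests:
        t = totals[r["template_id"]]
        t["requests"] += 1
        t["input_tokens"] += r["input_tokens"]
        t["output_tokens"] += r["output_tokens"]

    rows = []
    for template_id, t in totals.items():
        total = t["input_tokens"] + t["output_tokens"]
        rows.append({
            "template_id": template_id,
            **t,
            "total_tokens": total,
            "mean_total": total / t["requests"],
        })
    return sorted(rows, key=lambda row: row["mean_total"], reverse=True)
```

Sorting by the mean rather than the sum keeps a rarely used but very large prompt from hiding behind a cheap, high-volume one.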
Insufficient Cardinality in Attributes
Generic tags like "model=gpt4" don't help. Use specific values: model version, temperature, prompt template ID, user segment. High-cardinality data lets you slice traces by any dimension.
Quick Reference
| Component | Required Attributes | Optional Attributes |
|---|---|---|
| Prompt Construction | template_id, token_count | variables, system_message |
| Model Invocation | model_version, temperature, latency_ms | max_tokens, top_p, finish_reason |
| Response Parsing | parse_success, format_type | error_message, retry_count |
| Validation | schema_valid, semantic_valid | validation_errors, confidence_score |
| Context Retrieval | chunk_count, retrieval_method | relevance_scores, source_ids |
Trace Sampling
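Don't sample AI traces the same way as HTTP requests. Keep 100% of failed traces and a representative percentage of successes. Because a failure is known only once the trace completes, this calls for tail-based sampling rather than a decision made at the start of the request. You can't debug what you didn't capture.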
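The keep-or-drop rule can be sketched as a single function applied once a trace has finished and its outcome is known. The `keep_trace` name and the default 10% success rate are assumptions; hashing the trace ID makes the decision deterministic, so every service that sees the same trace reaches the same verdict.

```python
import hashlib


def keep_trace(trace_id: str, failed: bool, success_rate: float = 0.1) -> bool:
    """Keep every failed trace and a fixed fraction of successful ones."""
    if failed:
        return True
    # Map the trace ID to a stable, roughly uniform value in [0, 1).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate
```

What counts as "failed" is up to you: a validation failure or an unexpected `finish_reason` is often a better trigger than an exception alone.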
Retention
Keep traces for at least 30 days. AI issues often surface gradually as usage patterns change.
Alert Thresholds
- Validation failure rate > 5%
- Average latency > 2x baseline
- Token usage > 150% of expected
- finish_reason != "stop" for > 10% of requests
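These thresholds could be evaluated over a window of recent requests roughly as follows. The record fields, the `evaluate_alerts` name, and the reading of "token usage" as the average total tokens per request are assumptions for this sketch; the numeric thresholds are the ones listed above.

```python
def evaluate_alerts(window: list[dict], baseline_latency_ms: float,
                    expected_tokens: float) -> list[str]:
    """Return the names of the thresholds breached by this window of requests."""
    n = len(window)
    if n == 0:
        return []

    failure_rate = sum(not r["validation_ok"] for r in window) / n
    avg_latency = sum(r["latency_ms"] for r in window) / n
    avg_tokens = sum(r["total_tokens"] for r in window) / n
    non_stop_rate = sum(r["finish_reason"] != "stop" for r in window) / n

    alerts = []
    if failure_rate > 0.05:
        alerts.append("validation_failure_rate")
    if avg_latency > 2 * baseline_latency_ms:
        alerts.append("latency")
    if avg_tokens > 1.5 * expected_tokens:
        alerts.append("token_usage")
    if non_stop_rate > 0.10:
        alerts.append("finish_reason")
    return alerts
```

Treat the numbers as starting points: a workload that legitimately hits `max_tokens` often will need a looser `finish_reason` threshold.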
Start with these instrumentation points. When you encounter a new failure mode, add spans to capture it. Observability is iterative — you build it as you learn how your system actually behaves.



