Argus Observability Architecture
Version: 1.0
Last Updated: January 29, 2026
Overview
Argus implements a comprehensive observability stack covering:

1. LLM Tracing - Langfuse for AI model observability
2. Infrastructure Metrics - Prometheus + Grafana
3. Application Logging - Structured logging with correlation
4. Distributed Tracing - Request tracking across services
Architecture Diagram
```text
┌─────────────────────────────────────────────────────────────────────────────┐
│                         ARGUS OBSERVABILITY STACK                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                             USER REQUESTS                             │  │
│  │                                                                       │  │
│  │   Dashboard ──────▶ API Gateway ──────▶ Backend Services              │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                  │                                          │
│             ┌────────────────────┼────────────────────┐                     │
│             ▼                    ▼                    ▼                     │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐           │
│  │     LANGFUSE     │  │    PROMETHEUS    │  │     LOGGING      │           │
│  │                  │  │                  │  │                  │           │
│  │ • LLM Traces     │  │ • Metrics        │  │ • Structlog      │           │
│  │ • Token Usage    │  │ • Alerts         │  │ • JSON Format    │           │
│  │ • Cost Tracking  │  │ • SLOs           │  │ • Correlation    │           │
│  │ • Model Perf     │  │ • Dashboards     │  │ • Request IDs    │           │
│  └────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘           │
│           │                     │                     │                     │
│           ▼                     ▼                     ▼                     │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                          GRAFANA DASHBOARDS                           │  │
│  │                                                                       │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐   │  │
│  │  │  AI Intel   │  │   Cognee    │  │   Browser   │  │   System    │   │  │
│  │  │  Dashboard  │  │  Pipeline   │  │    Pool     │  │  Overview   │   │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘   │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
1. LLM Observability (Langfuse)
Overview
Langfuse provides comprehensive LLM observability, including:

- Token usage and cost tracking per request
- Model performance metrics (latency, errors)
- Trace visualization for multi-step AI workflows
- User/session attribution
Integration Points
| Component | Integration | File |
| --- | --- | --- |
| Chat API | CallbackHandler | src/api/chat.py |
| Orchestrator | Trace spans | src/orchestrator/langfuse_integration.py |
| Agents | Per-agent traces | src/agents/base.py |
| Cognee Worker | LLM calls | data-layer/cognee-worker/src/worker.py |
Configuration
```python
# src/orchestrator/langfuse_integration.py
import os

from langfuse.callback import CallbackHandler


def get_langfuse_handler(
    user_id: str,
    session_id: str,
    trace_name: str,
    tags: list[str],
    metadata: dict,
) -> CallbackHandler:
    """Create Langfuse callback handler for LangChain/LangGraph."""
    return CallbackHandler(
        public_key=os.environ.get("LANGFUSE_PUBLIC_KEY"),
        secret_key=os.environ.get("LANGFUSE_SECRET_KEY"),
        host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
        user_id=user_id,
        session_id=session_id,
        trace_name=trace_name,
        tags=tags,
        metadata=metadata,
    )
```
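The handler is meant to be passed to a LangChain/LangGraph call via `config={"callbacks": [...]}`. A small gating helper keeps tracing optional when `LANGFUSE_ENABLED` is off; the helper name and wiring below are illustrative sketches, not the Argus implementation:

```python
import os


def maybe_langfuse_callbacks(build_handler):
    """Return a LangChain callbacks list, or [] when LANGFUSE_ENABLED is not "true".

    build_handler is a zero-argument factory, e.g.
    lambda: get_langfuse_handler(user_id, session_id, "chat", ["chat"], {}).
    The handler is only constructed when tracing is actually enabled.
    """
    if os.environ.get("LANGFUSE_ENABLED", "false").lower() != "true":
        return []
    return [build_handler()]


# Usage sketch with a LangGraph runnable:
# result = graph.invoke(state, config={"callbacks": maybe_langfuse_callbacks(factory)})
```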
Environment Variables
```shell
LANGFUSE_ENABLED=true
LANGFUSE_PUBLIC_KEY=pk-lf-xxx
LANGFUSE_SECRET_KEY=sk-lf-xxx
LANGFUSE_HOST=https://cloud.langfuse.com
```
Kubernetes Secrets
```yaml
# data-layer/kubernetes/monitoring/langfuse-secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: langfuse-credentials
type: Opaque
data:
  LANGFUSE_PUBLIC_KEY: <base64>
  LANGFUSE_SECRET_KEY: <base64>
```
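The `<base64>` placeholders hold base64-encoded key material. A sketch of producing those values, using the placeholder keys from this document rather than real credentials:

```shell
# Kubernetes Secret `data:` values must be base64-encoded.
printf '%s' 'pk-lf-xxx' | base64   # value for LANGFUSE_PUBLIC_KEY
printf '%s' 'sk-lf-xxx' | base64   # value for LANGFUSE_SECRET_KEY

# Alternatively, let kubectl do the encoding:
# kubectl create secret generic langfuse-credentials \
#   --from-literal=LANGFUSE_PUBLIC_KEY=pk-lf-xxx \
#   --from-literal=LANGFUSE_SECRET_KEY=sk-lf-xxx
```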
2. Infrastructure Metrics (Prometheus)
Stack Components
| Component | Purpose | Configuration |
| --- | --- | --- |
| Prometheus | Metrics collection | kube-prometheus-stack-values.yaml |
| Grafana | Visualization | grafana-dashboards-configmap.yaml |
| AlertManager | Alert routing | alerting-rules.yaml |
| ServiceMonitors | Endpoint scraping | servicemonitors-*.yaml |
ServiceMonitors
Argus Data Layer:
```yaml
# data-layer/kubernetes/monitoring/servicemonitors-argus-data.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argus-data-metrics
spec:
  selector:
    matchLabels:
      app: cognee-worker
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
Browser Pool:
```yaml
# data-layer/kubernetes/monitoring/servicemonitors-browser-pool.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: browser-pool-metrics
spec:
  selector:
    matchLabels:
      app: selenium-grid
  endpoints:
    - port: web
      interval: 30s
```
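A ServiceMonitor only discovers targets whose Service labels match `spec.selector.matchLabels` and whose port *name* matches `endpoints[].port`; a mismatch here is the usual reason metrics silently fail to appear. A minimal matching Service for the browser-pool case might look like the following sketch (name, port numbers, and namespace defaults are illustrative, not taken from the Argus manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: selenium-grid
  labels:
    app: selenium-grid     # must match the ServiceMonitor's matchLabels
spec:
  selector:
    app: selenium-grid
  ports:
    - name: web            # must match the endpoint's `port: web`
      port: 4444
      targetPort: 4444
```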
Key Metrics
| Metric | Type | Description |
| --- | --- | --- |
| cognee_ecl_operations_total | Counter | Cognee ECL pipeline operations |
| cognee_search_latency_seconds | Histogram | Knowledge search latency |
| healing_suggestions_total | Counter | Self-healing suggestions generated |
| llm_tokens_total | Counter | LLM token usage by model |
| browser_pool_active_sessions | Gauge | Active browser sessions |
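Example PromQL queries over these metrics. These are illustrative; label names such as `model` and the `_bucket` suffix assume standard Prometheus client-library conventions for counters and histograms:

```promql
# Cognee ECL throughput (operations/sec) over the last 5 minutes
sum(rate(cognee_ecl_operations_total[5m]))

# p95 knowledge-search latency from the histogram's bucket series
histogram_quantile(0.95, sum(rate(cognee_search_latency_seconds_bucket[5m])) by (le))

# Token burn per model, mirroring the AI Intelligence dashboard panel
sum(rate(llm_tokens_total[5m])) by (model)
```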
3. Grafana Dashboards
Dashboard Inventory
| Dashboard | Purpose | Panels |
| --- | --- | --- |
| AI Intelligence | LLM performance, costs | Token usage, latency, errors |
| Cognee Pipeline | Knowledge layer health | ECL metrics, search latency |
| Browser Pool | Browser automation | Session count, execution time |
| System Overview | Infrastructure health | CPU, memory, network |
AI Intelligence Dashboard
```json
{
  "title": "AI Intelligence Dashboard",
  "panels": [
    {
      "title": "LLM Token Usage",
      "type": "timeseries",
      "targets": [
        { "expr": "sum(rate(llm_tokens_total[5m])) by (model)" }
      ]
    },
    {
      "title": "LLM Latency P95",
      "type": "stat",
      "targets": [
        { "expr": "histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))" }
      ]
    },
    {
      "title": "Healing Success Rate",
      "type": "gauge",
      "targets": [
        { "expr": "sum(healing_suggestions_accepted) / sum(healing_suggestions_total) * 100" }
      ]
    }
  ]
}
```
4. Alerting Rules
Critical Alerts
```yaml
# data-layer/kubernetes/monitoring/alerting-rules.yaml
groups:
  - name: argus-critical
    rules:
      - alert: HighLLMErrorRate
        expr: rate(llm_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High LLM error rate detected"
      - alert: CogneeUnhealthy
        expr: up{job="cognee-worker"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Cognee worker is down"
      - alert: BrowserPoolExhausted
        expr: browser_pool_available_sessions == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No available browser sessions"
```
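As a sanity check on the HighLLMErrorRate threshold: `rate(llm_errors_total[5m])` is the per-second increase of the counter over a 5-minute window, so the alert fires when errors exceed roughly 0.1/sec (30 errors in 5 minutes), sustained for a further 5 minutes. A quick arithmetic sketch (the helper name is illustrative and only approximates what PromQL's `rate()` computes):

```python
def per_second_rate(count_start: float, count_end: float, window_seconds: float = 300.0) -> float:
    """Approximate rate() for a monotonically increasing counter over one window."""
    return (count_end - count_start) / window_seconds


# 40 new errors over 5 minutes is about 0.133 errors/sec, breaching the 0.1 threshold
breaches = per_second_rate(1000, 1040) > 0.1
```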
5. Secure Access (Cloudflare Tunnel)
Architecture
```text
┌──────────────┐       ┌──────────────────┐       ┌──────────────┐
│   External   │       │    Cloudflare    │       │     K8s      │
│     User     │──────▶│      Tunnel      │──────▶│   Grafana    │
│              │ HTTPS │   (cloudflared)  │       │   Service    │
└──────────────┘       └──────────────────┘       └──────────────┘
```
Configuration
```yaml
# data-layer/kubernetes/monitoring/cloudflare-tunnel-config.yaml
tunnel: argus-monitoring
credentials-file: /etc/cloudflared/credentials.json
ingress:
  - hostname: grafana.argus.example.com
    service: http://kube-prometheus-stack-grafana:80
  - hostname: prometheus.argus.example.com
    service: http://kube-prometheus-stack-prometheus:9090
  - service: http_status:404
```
6. Logging Standards
```python
import structlog

logger = structlog.get_logger()

# Standard log format
logger.info(
    "Chat message processed",
    thread_id=thread_id,
    user_id=user_id,
    model=model_id,
    tokens_used=token_count,
    latency_ms=latency,
)
```
Log Fields
| Field | Type | Description |
| --- | --- | --- |
| thread_id | string | Conversation thread identifier |
| user_id | string | Authenticated user ID |
| org_id | string | Organization ID (multi-tenant) |
| model | string | LLM model used |
| tokens_used | int | Token count for request |
| latency_ms | int | Request latency in milliseconds |
| trace_id | string | Langfuse trace ID |
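Correlation across log lines hinges on binding fields like trace_id once per request rather than passing them to every call site. structlog supports this through its contextvars integration; the following is a stdlib-only sketch of the same pattern (names are illustrative):

```python
import contextvars
import json

# Request-scoped correlation id; structlog.contextvars provides the same idea.
request_id_var = contextvars.ContextVar("request_id", default=None)


def log_json(event: str, **fields) -> str:
    """Emit one JSON log line that automatically carries the bound correlation id."""
    record = {"event": event, "request_id": request_id_var.get(), **fields}
    return json.dumps(record, sort_keys=True)
```

In practice the binding would happen once in request middleware, so every subsequent log call in that request carries the same identifier without threading it through function signatures.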
7. Health Endpoints
API Health Checks
| Endpoint | Purpose | Response |
| --- | --- | --- |
| /health | Basic liveness | {"status": "ok"} |
| /health/ready | Readiness probe | {"ready": true} |
| /api/v1/health/data-layer | Infrastructure health | Component statuses |
Example Response
```json
{
  "status": "healthy",
  "components": {
    "cognee": {"status": "healthy", "version": "0.5.1"},
    "falkordb": {"status": "healthy"},
    "supabase": {"status": "healthy"},
    "selenium_grid": {"status": "healthy", "nodes": 3},
    "prometheus": {"status": "healthy"},
    "grafana": {"status": "healthy", "version": "12.3.1"}
  }
}
```
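The top-level status in a response like the one above can be derived mechanically from the component checks. A minimal sketch (the function name and "degraded" label are assumptions for illustration, not the Argus implementation):

```python
def aggregate_health(components: dict) -> dict:
    """Roll per-component health checks up into one response body.

    Overall status is "healthy" only when every component reports healthy;
    otherwise it is marked "degraded".
    """
    healthy = all(c.get("status") == "healthy" for c in components.values())
    return {"status": "healthy" if healthy else "degraded", "components": components}
```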
8. Deployment
Enable Monitoring Stack
```shell
# Apply Prometheus stack
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -f data-layer/kubernetes/monitoring/kube-prometheus-stack-values.yaml

# Apply custom configs
kubectl apply -f data-layer/kubernetes/monitoring/

# Verify deployment
kubectl get servicemonitors
kubectl get prometheusrules
```
Environment Variables (Railway)
```shell
# Set in Railway dashboard or CLI
railway variables set LANGFUSE_ENABLED=true
railway variables set LANGFUSE_PUBLIC_KEY=pk-lf-xxx
railway variables set LANGFUSE_SECRET_KEY=sk-lf-xxx
```
9. Troubleshooting
Common Issues
| Issue | Symptom | Resolution |
| --- | --- | --- |
| Missing Langfuse traces | No traces in dashboard | Check LANGFUSE_ENABLED=true |
| Metrics not appearing | Empty Grafana panels | Verify ServiceMonitor labels match |
| Alerts not firing | No notifications | Check AlertManager config |
Diagnostic Commands
```shell
# Check Prometheus targets
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090
# Visit http://localhost:9090/targets

# Check Langfuse connectivity (health endpoint is a GET)
curl https://cloud.langfuse.com/api/public/health

# View Cognee worker logs
kubectl logs -l app=cognee-worker -f
```
10. Future Enhancements
- OpenTelemetry Integration - Unified tracing standard
- Custom Langfuse Dashboards - LLM-specific analytics
- Anomaly Detection - ML-based alert generation
- Cost Allocation - Per-tenant AI cost tracking
- SLO Monitoring - Service level objective tracking