Argus Observability Architecture
Version: 1.0
Last Updated: January 29, 2026
Overview
Argus implements a comprehensive observability stack covering:

1. LLM Tracing - Langfuse for AI model observability
2. Infrastructure Metrics - Prometheus + Grafana
3. Application Logging - Structured logging with correlation
4. Distributed Tracing - Request tracking across services
Architecture Diagram
```text
┌─────────────────────────────────────────────────────────────────────────────┐
│                         ARGUS OBSERVABILITY STACK                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                             USER REQUESTS                             │  │
│  │                                                                       │  │
│  │   Dashboard ──────▶ API Gateway ──────▶ Backend Services              │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                  │                                          │
│             ┌────────────────────┼────────────────────┐                     │
│             ▼                    ▼                    ▼                     │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐           │
│  │     LANGFUSE     │  │    PROMETHEUS    │  │     LOGGING      │           │
│  │                  │  │                  │  │                  │           │
│  │ • LLM Traces     │  │ • Metrics        │  │ • Structlog      │           │
│  │ • Token Usage    │  │ • Alerts         │  │ • JSON Format    │           │
│  │ • Cost Tracking  │  │ • SLOs           │  │ • Correlation    │           │
│  │ • Model Perf     │  │ • Dashboards     │  │ • Request IDs    │           │
│  └────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘           │
│           │                     │                     │                     │
│           ▼                     ▼                     ▼                     │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                          GRAFANA DASHBOARDS                           │  │
│  │                                                                       │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐   │  │
│  │  │  AI Intel   │  │   Cognee    │  │   Browser   │  │   System    │   │  │
│  │  │  Dashboard  │  │  Pipeline   │  │    Pool     │  │  Overview   │   │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘   │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
1. LLM Observability (Langfuse)
Overview
Langfuse provides comprehensive LLM observability, including:

- Token usage and cost tracking per request
- Model performance metrics (latency, errors)
- Trace visualization for multi-step AI workflows
- User/session attribution
Integration Points
| Component | Integration | File |
| --- | --- | --- |
| Chat API | CallbackHandler | src/api/chat.py |
| Orchestrator | Trace spans | src/orchestrator/langfuse_integration.py |
| Agents | Per-agent traces | src/agents/base.py |
| Cognee Worker | LLM calls | data-layer/cognee-worker/src/worker.py |
Configuration
```python
# src/orchestrator/langfuse_integration.py
import os

from langfuse.callback import CallbackHandler


def get_langfuse_handler(
    user_id: str,
    session_id: str,
    trace_name: str,
    tags: list[str],
    metadata: dict,
) -> CallbackHandler:
    """Create Langfuse callback handler for LangChain/LangGraph."""
    return CallbackHandler(
        public_key=os.environ.get("LANGFUSE_PUBLIC_KEY"),
        secret_key=os.environ.get("LANGFUSE_SECRET_KEY"),
        host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
        user_id=user_id,
        session_id=session_id,
        trace_name=trace_name,
        tags=tags,
        metadata=metadata,
    )
```
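The handler is meant to be passed to a LangChain/LangGraph call via `config={"callbacks": [...]}`. A small gating helper keeps tracing optional when `LANGFUSE_ENABLED` is off; the helper name and wiring below are illustrative sketches, not the Argus implementation:

```python
import os


def maybe_langfuse_callbacks(build_handler):
    """Return a LangChain callbacks list, or [] when LANGFUSE_ENABLED is not "true".

    build_handler is a zero-argument factory, e.g.
    lambda: get_langfuse_handler(user_id, session_id, "chat", ["chat"], {}).
    The handler is only constructed when tracing is actually enabled.
    """
    if os.environ.get("LANGFUSE_ENABLED", "false").lower() != "true":
        return []
    return [build_handler()]


# Usage sketch with a LangGraph runnable:
# result = graph.invoke(state, config={"callbacks": maybe_langfuse_callbacks(factory)})
```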
Environment Variables
```shell
LANGFUSE_ENABLED=true
LANGFUSE_PUBLIC_KEY=pk-lf-xxx
LANGFUSE_SECRET_KEY=sk-lf-xxx
LANGFUSE_HOST=https://cloud.langfuse.com
```
Kubernetes Secrets
```yaml
# data-layer/kubernetes/monitoring/langfuse-secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: langfuse-credentials
type: Opaque
data:
  LANGFUSE_PUBLIC_KEY: <base64>
  LANGFUSE_SECRET_KEY: <base64>
```
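The `<base64>` placeholders hold base64-encoded key material. A sketch of producing those values, using the placeholder keys from this document rather than real credentials:

```shell
# Kubernetes Secret `data:` values must be base64-encoded.
printf '%s' 'pk-lf-xxx' | base64   # value for LANGFUSE_PUBLIC_KEY
printf '%s' 'sk-lf-xxx' | base64   # value for LANGFUSE_SECRET_KEY

# Alternatively, let kubectl do the encoding:
# kubectl create secret generic langfuse-credentials \
#   --from-literal=LANGFUSE_PUBLIC_KEY=pk-lf-xxx \
#   --from-literal=LANGFUSE_SECRET_KEY=sk-lf-xxx
```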
2. Infrastructure Metrics (Prometheus)
Stack Components
| Component | Purpose | Configuration |
| --- | --- | --- |
| Prometheus | Metrics collection | kube-prometheus-stack-values.yaml |
| Grafana | Visualization | grafana-dashboards-configmap.yaml |
| AlertManager | Alert routing | alerting-rules.yaml |
| ServiceMonitors | Endpoint scraping | servicemonitors-*.yaml |
ServiceMonitors
Argus Data Layer:
```yaml
# data-layer/kubernetes/monitoring/servicemonitors-argus-data.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argus-data-metrics
spec:
  selector:
    matchLabels:
      app: cognee-worker
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
Browser Pool:
```yaml
# data-layer/kubernetes/monitoring/servicemonitors-browser-pool.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: browser-pool-metrics
spec:
  selector:
    matchLabels:
      app: selenium-grid
  endpoints:
    - port: web
      interval: 30s
```
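A ServiceMonitor only discovers targets whose Service labels match `spec.selector.matchLabels` and whose port *name* matches `endpoints[].port`; a mismatch here is the usual reason metrics silently fail to appear. A minimal matching Service for the browser-pool case might look like the following sketch (name, port numbers, and namespace defaults are illustrative, not taken from the Argus manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: selenium-grid
  labels:
    app: selenium-grid     # must match the ServiceMonitor's matchLabels
spec:
  selector:
    app: selenium-grid
  ports:
    - name: web            # must match the endpoint's `port: web`
      port: 4444
      targetPort: 4444
```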
Key Metrics
| Metric | Type | Description |
| --- | --- | --- |
| cognee_ecl_operations_total | Counter | Cognee ECL pipeline operations |
| cognee_search_latency_seconds | Histogram | Knowledge search latency |
| healing_suggestions_total | Counter | Self-healing suggestions generated |
| llm_tokens_total | Counter | LLM token usage by model |
| browser_pool_active_sessions | Gauge | Active browser sessions |
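Example PromQL queries over these metrics. These are illustrative; label names such as `model` and the `_bucket` suffix assume standard Prometheus client-library conventions for counters and histograms:

```promql
# Cognee ECL throughput (operations/sec) over the last 5 minutes
sum(rate(cognee_ecl_operations_total[5m]))

# p95 knowledge-search latency from the histogram's bucket series
histogram_quantile(0.95, sum(rate(cognee_search_latency_seconds_bucket[5m])) by (le))

# Token burn per model, mirroring the AI Intelligence dashboard panel
sum(rate(llm_tokens_total[5m])) by (model)
```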
3. Grafana Dashboards
Dashboard Inventory
| Dashboard | Purpose | Panels |
| --- | --- | --- |
| AI Intelligence | LLM performance, costs | Token usage, latency, errors |
| Cognee Pipeline | Knowledge layer health | ECL metrics, search latency |
| Browser Pool | Browser automation | Session count, execution time |
| System Overview | Infrastructure health | CPU, memory, network |
AI Intelligence Dashboard
```json
{
  "title": "AI Intelligence Dashboard",
  "panels": [
    {
      "title": "LLM Token Usage",
      "type": "timeseries",
      "targets": [
        { "expr": "sum(rate(llm_tokens_total[5m])) by (model)" }
      ]
    },
    {
      "title": "LLM Latency P95",
      "type": "stat",
      "targets": [
        { "expr": "histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))" }
      ]
    },
    {
      "title": "Healing Success Rate",
      "type": "gauge",
      "targets": [
        { "expr": "sum(healing_suggestions_accepted) / sum(healing_suggestions_total) * 100" }
      ]
    }
  ]
}
```
4. Alerting Rules
Critical Alerts
```yaml
# data-layer/kubernetes/monitoring/alerting-rules.yaml
groups:
  - name: argus-critical
    rules:
      - alert: HighLLMErrorRate
        expr: rate(llm_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High LLM error rate detected"
      - alert: CogneeUnhealthy
        expr: up{job="cognee-worker"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Cognee worker is down"
      - alert: BrowserPoolExhausted
        expr: browser_pool_available_sessions == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No available browser sessions"
```
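As a sanity check on the HighLLMErrorRate threshold: `rate(llm_errors_total[5m])` is the per-second increase of the counter over a 5-minute window, so the alert fires when errors exceed roughly 0.1/sec (30 errors in 5 minutes), sustained for a further 5 minutes. A quick arithmetic sketch (the helper name is illustrative and only approximates what PromQL's `rate()` computes):

```python
def per_second_rate(count_start: float, count_end: float, window_seconds: float = 300.0) -> float:
    """Approximate rate() for a monotonically increasing counter over one window."""
    return (count_end - count_start) / window_seconds


# 40 new errors over 5 minutes is about 0.133 errors/sec, breaching the 0.1 threshold
breaches = per_second_rate(1000, 1040) > 0.1
```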
5. Secure Access (Cloudflare Tunnel)
Architecture
```text
┌──────────────┐       ┌──────────────────┐       ┌──────────────┐
│   External   │       │    Cloudflare    │       │     K8s      │
│     User     │──────▶│      Tunnel      │──────▶│   Grafana    │
│              │ HTTPS │   (cloudflared)  │       │   Service    │
└──────────────┘       └──────────────────┘       └──────────────┘
```
Configuration
```yaml
# data-layer/kubernetes/monitoring/cloudflare-tunnel-config.yaml
tunnel: argus-monitoring
credentials-file: /etc/cloudflared/credentials.json
ingress:
  - hostname: grafana.argus.example.com
    service: http://kube-prometheus-stack-grafana:80
  - hostname: prometheus.argus.example.com
    service: http://kube-prometheus-stack-prometheus:9090
  - service: http_status:404
```
6. Logging Standards
```python
import structlog

logger = structlog.get_logger()

# Standard log format
logger.info(
    "Chat message processed",
    thread_id=thread_id,
    user_id=user_id,
    model=model_id,
    tokens_used=token_count,
    latency_ms=latency,
)
```
Log Fields
| Field | Type | Description |
| --- | --- | --- |
| thread_id | string | Conversation thread identifier |
| user_id | string | Authenticated user ID |
| org_id | string | Organization ID (multi-tenant) |
| model | string | LLM model used |
| tokens_used | int | Token count for request |
| latency_ms | int | Request latency in milliseconds |
| trace_id | string | Langfuse trace ID |
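Correlation across log lines hinges on binding fields like trace_id once per request rather than passing them to every call site. structlog supports this through its contextvars integration; the following is a stdlib-only sketch of the same pattern (names are illustrative):

```python
import contextvars
import json

# Request-scoped correlation id; structlog.contextvars provides the same idea.
request_id_var = contextvars.ContextVar("request_id", default=None)


def log_json(event: str, **fields) -> str:
    """Emit one JSON log line that automatically carries the bound correlation id."""
    record = {"event": event, "request_id": request_id_var.get(), **fields}
    return json.dumps(record, sort_keys=True)
```

In practice the binding would happen once in request middleware, so every subsequent log call in that request carries the same identifier without threading it through function signatures.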
7. Health Endpoints
API Health Checks
| Endpoint | Purpose | Response |
| --- | --- | --- |
| /health | Basic liveness | {"status": "ok"} |
| /health/ready | Readiness probe | {"ready": true} |
| /api/v1/health/data-layer | Infrastructure health | Component statuses |
Example Response
```json
{
  "status": "healthy",
  "components": {
    "cognee": {"status": "healthy", "version": "0.5.1"},
    "falkordb": {"status": "healthy"},
    "supabase": {"status": "healthy"},
    "selenium_grid": {"status": "healthy", "nodes": 3},
    "prometheus": {"status": "healthy"},
    "grafana": {"status": "healthy", "version": "12.3.1"}
  }
}
```
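The top-level status in a response like the one above can be derived mechanically from the component checks. A minimal sketch (the function name and "degraded" label are assumptions for illustration, not the Argus implementation):

```python
def aggregate_health(components: dict) -> dict:
    """Roll per-component health checks up into one response body.

    Overall status is "healthy" only when every component reports healthy;
    otherwise it is marked "degraded".
    """
    healthy = all(c.get("status") == "healthy" for c in components.values())
    return {"status": "healthy" if healthy else "degraded", "components": components}
```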
8. Deployment
Enable Monitoring Stack
```shell
# Apply Prometheus stack
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -f data-layer/kubernetes/monitoring/kube-prometheus-stack-values.yaml

# Apply custom configs
kubectl apply -f data-layer/kubernetes/monitoring/

# Verify deployment
kubectl get servicemonitors
kubectl get prometheusrules
```
Environment Variables (Railway)
```shell
# Set in Railway dashboard or CLI
railway variables set LANGFUSE_ENABLED=true
railway variables set LANGFUSE_PUBLIC_KEY=pk-lf-xxx
railway variables set LANGFUSE_SECRET_KEY=sk-lf-xxx
```
9. Troubleshooting
Common Issues
| Issue | Symptom | Resolution |
| --- | --- | --- |
| Missing Langfuse traces | No traces in dashboard | Check LANGFUSE_ENABLED=true |
| Metrics not appearing | Empty Grafana panels | Verify ServiceMonitor labels match |
| Alerts not firing | No notifications | Check AlertManager config |
Diagnostic Commands
```shell
# Check Prometheus targets
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090
# Visit http://localhost:9090/targets

# Check Langfuse connectivity (health endpoint is a GET)
curl https://cloud.langfuse.com/api/public/health

# View Cognee worker logs
kubectl logs -l app=cognee-worker -f
```
10. Future Enhancements
- OpenTelemetry Integration - Unified tracing standard
- Custom Langfuse Dashboards - LLM-specific analytics
- Anomaly Detection - ML-based alert generation
- Cost Allocation - Per-tenant AI cost tracking
- SLO Monitoring - Service level objective tracking