Kubernetes Infrastructure¶
Version: 2.0.0
Last Updated: 2026-01-27T19:45:00Z
Document Status: ✅ FULLY IMPLEMENTED - All Phases Complete
Source Files: data-layer/kubernetes/*.yaml (18 manifests), data-layer/cognee-worker/ (4 files)
Implementation Status¶
All phases of the data layer deployment plan have been implemented and validated:
| Phase | Description | Status | Evidence |
|---|---|---|---|
| Phase 1 | Cognee Worker K8s Manifest | ✅ Complete | cognee-worker.yaml (321 lines) with HPA, PDB, probes |
| Phase 2 | Network Policies | ✅ Complete | network-policies.yaml (291 lines) zero-trust |
| Phase 3 | Secrets Configuration | ✅ Complete | secrets.yaml template with External Secrets guidance |
| Phase 4 | Deploy Script | ✅ Complete | deploy.sh (264 lines) with --generate-secrets |
| Phase 5 | Backend Integration | ✅ Complete | src/api/server.py:2357-2399, src/api/tests.py:437-440 |
| Phase 6 | Environment Variables | ✅ Complete | ConfigMap in cognee-worker.yaml:6-69 |
Additional Implementation (Beyond Original Plan)¶
| Component | Status | Files |
|---|---|---|
| KEDA Autoscaling | ✅ Complete | keda-cognee-scaler.yaml - Kafka lag-based scaling |
| Flink Platform | ✅ Complete | flink-operator.yaml, flink-cluster.yaml, flink-platform/ |
| Cognee Worker Source | ✅ Complete | data-layer/cognee-worker/src/worker.py (710 lines) |
| Multi-tenant Isolation | ✅ Complete | Dataset naming: org_{id}_project_{id}_{type} |
| Neo4j Aura Integration | ✅ Complete | Cold start retry (5 attempts, 15s delay) |
| Terraform Alternative | ✅ Complete | terraform/confluent-cloud/main.tf |
Architecture Overview¶
The Argus data layer is a Kubernetes-based streaming and knowledge-graph processing stack deployed on Vultr Kubernetes Engine (VKE).
graph TB
subgraph "External Services"
Supabase["Supabase PostgreSQL<br/>pgvector + Real-time"]
Neo4j["Neo4j Aura<br/>Knowledge Graph"]
Anthropic["Anthropic API<br/>Claude LLM"]
Redpanda-SL["Redpanda Serverless<br/>SASL_SSL"]
end
subgraph "Vultr Kubernetes Engine - argus-data namespace"
subgraph "Stateful Components"
Redpanda["Redpanda<br/>StatefulSet (1-3 replicas)<br/>Port 9092, 8081, 8082"]
FalkorDB["FalkorDB<br/>StatefulSet (1 replica)<br/>Graph DB: 6379"]
Valkey["Valkey<br/>StatefulSet (1 replica)<br/>Cache: 6379"]
end
subgraph "Workers & Processors"
Cognee["Cognee Worker<br/>Deployment (1-5 replicas)<br/>KEDA + HPA Scaling"]
Flink["Flink Cluster<br/>JobManager + TaskManagers<br/>Stream Processing"]
end
subgraph "Control Plane"
KEDA["KEDA ScaledObject<br/>Kafka Lag Monitoring"]
NP["NetworkPolicies<br/>Zero-trust enforcement"]
RQ["ResourceQuota<br/>Namespace Limits"]
end
end
Cognee -->|Consumes| Redpanda
Cognee -->|Writes graphs| FalkorDB
Cognee -->|Caches| Valkey
Cognee -->|External APIs| Anthropic
Cognee -->|Sync state| Supabase
Cognee -->|Graph DB| Neo4j
Flink -->|Consumes| Redpanda
Redpanda -->|Backup| Redpanda-SL
Component Summary¶
| Component | Type | Replicas | Storage | Purpose |
|---|---|---|---|---|
| Redpanda | StatefulSet | 1-3 | 40Gi | Kafka-compatible event streaming |
| FalkorDB | StatefulSet | 1 | 40Gi | Redis-based graph database |
| Valkey | StatefulSet | 1 | 40Gi | Redis successor for caching |
| Cognee Worker | Deployment | 1-5 | 5Gi (ephemeral) | Knowledge graph builder |
| Flink | FlinkDeployment | 1 JM + 2 TM | Checkpoints only | Stream processing |
Namespace & Resource Management¶
Namespace Configuration¶
File: data-layer/kubernetes/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: argus-data
labels:
name: argus-data
environment: production
Resource Quota¶
apiVersion: v1
kind: ResourceQuota
metadata:
name: argus-data-quota
namespace: argus-data
spec:
hard:
requests.cpu: "20"
requests.memory: 40Gi
limits.cpu: "40"
limits.memory: 80Gi
persistentvolumeclaims: "10"
requests.storage: 500Gi
Limit Range¶
apiVersion: v1
kind: LimitRange
metadata:
name: argus-data-limits
namespace: argus-data
spec:
limits:
- default:
cpu: "1"
memory: 1Gi
defaultRequest:
cpu: 100m
memory: 128Mi
min:
cpu: 50m
memory: 64Mi
max:
cpu: "8"
memory: 16Gi
type: Container
Stateful Components¶
Redpanda (Event Streaming)¶
File: data-layer/kubernetes/redpanda-values.yaml
Deployment: Helm Chart (redpanda/redpanda)
statefulset:
replicas: 1 # Increase to 3+ for HA
resources:
cpu:
cores: 1
memory:
container:
max: 2Gi
redpanda:
memory: 1Gi
reserveMemory: 200Mi
storage:
persistentVolume:
enabled: true
size: 40Gi
storageClass: vultr-block-storage-hdd
auth:
sasl:
enabled: true
secretRef: redpanda-superusers
mechanism: SCRAM-SHA-512
users:
- name: admin
mechanism: SCRAM-SHA-512
- name: argus-service
mechanism: SCRAM-SHA-512
Topics Created:
| Topic | Purpose | Partitions |
|---|---|---|
| argus.codebase.ingested | Source code events | 6 |
| argus.codebase.analyzed | Analysis results | 6 |
| argus.test.created | New test creation | 6 |
| argus.test.executed | Test execution results | 6 |
| argus.test.failed | Test failures | 6 |
| argus.healing.requested | Self-healing requests | 6 |
| argus.healing.completed | Healing completion | 6 |
| argus.dlq | Dead letter queue | 3 |
Ports:
- 9092: Kafka protocol (SASL_PLAINTEXT)
- 8081: Schema Registry (HTTP Basic)
- 8082: HTTP Proxy (HTTP Basic)
- 9644: Admin API
- 33145: Internal RPC
FalkorDB (Graph Database)¶
File: data-layer/kubernetes/falkordb.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: falkordb
namespace: argus-data
spec:
serviceName: falkordb-headless
replicas: 1
template:
spec:
containers:
- name: falkordb
image: falkordb/falkordb:v4.4.1
ports:
- containerPort: 6379
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 1Gi
env:
- name: REDIS_ARGS
value: "--requirepass $(FALKORDB_PASSWORD) --maxmemory 1gb --appendonly yes"
volumeMounts:
- name: data
mountPath: /data
- name: redis-exporter
image: oliver006/redis_exporter:v1.66.0
ports:
- containerPort: 9121
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: vultr-block-storage-hdd
resources:
requests:
storage: 40Gi
Features:
- AOF persistence enabled (everysec fsync)
- Prometheus metrics via redis_exporter
- Password authentication
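Because FalkorDB speaks the Redis protocol, any Redis client can issue graph commands. A hedged example using redis-py; the graph name is illustrative, and the host and password placeholder follow the manifests and secrets described below:

```python
import redis

# In-cluster DNS name from the headless Service; password from falkordb-auth.
r = redis.Redis(
    host="falkordb-headless.argus-data.svc.cluster.local",
    port=6379,
    password="<falkordb-password>",
)

# GRAPH.QUERY is FalkorDB's Cypher entry point. The graph name "demo"
# is illustrative, not a graph the Cognee worker necessarily creates.
result = r.execute_command(
    "GRAPH.QUERY", "demo", "CREATE (:Service {name:'cognee-worker'})"
)
print(result)
```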
Valkey (Cache Store)¶
File: data-layer/kubernetes/valkey.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: valkey
namespace: argus-data
spec:
serviceName: valkey-headless
replicas: 1
template:
spec:
containers:
- name: valkey
image: valkey/valkey:8.0-alpine
ports:
- containerPort: 6379
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 512Mi
args:
- --requirepass
- $(VALKEY_PASSWORD)
- --maxmemory
- 1536mb
- --maxmemory-policy
- allkeys-lru
- --appendonly
- yes
- name: valkey-exporter
image: oliver006/redis_exporter:v1.66.0
ports:
- containerPort: 9121
Features:
- LRU eviction policy (1.5GB max)
- AOF persistence enabled
- Prometheus metrics via redis_exporter
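Valkey is wire-compatible with Redis, so the worker can cache through redis-py. A minimal sketch; the key names are illustrative:

```python
import redis

# Host from the headless Service; password from valkey-auth.
cache = redis.Redis(
    host="valkey-headless.argus-data.svc.cluster.local",
    port=6379,
    password="<valkey-password>",
)

# With allkeys-lru, any key may be evicted under memory pressure, so the
# cache is best-effort; a TTL bounds staleness of cached analysis results.
cache.set("analysis:org_abc123:file_hash", "cached-result", ex=3600)
value = cache.get("analysis:org_abc123:file_hash")
```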
Worker Components¶
Cognee Worker (Knowledge Graph Builder)¶
File: data-layer/kubernetes/cognee-worker.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: cognee-worker
namespace: argus-data
spec:
replicas: 1
template:
spec:
containers:
- name: cognee-worker
image: ghcr.io/samuelvinay91/cognee-worker:latest
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 1Gi
env:
- name: KAFKA_BOOTSTRAP_SERVERS
value: "redpanda-0.redpanda.argus-data.svc.cluster.local:9092"
- name: KAFKA_CONSUMER_GROUP
value: "argus-cognee-workers"
- name: LLM_PROVIDER
value: "anthropic"
- name: LLM_MODEL
value: "claude-sonnet-4-5-20250929"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 90
periodSeconds: 30
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
volumeMounts:
- name: cognee-cache
mountPath: /app/data
- name: cognee-logs
mountPath: /app/logs
volumes:
- name: cognee-cache
emptyDir:
sizeLimit: 5Gi
- name: cognee-logs
emptyDir:
sizeLimit: 1Gi
Pod Disruption Budget:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: cognee-worker-pdb
spec:
minAvailable: 1
selector:
matchLabels:
app.kubernetes.io/name: cognee-worker
Cognee Worker Implementation Details¶
Source File: data-layer/cognee-worker/src/worker.py (710 lines)
The Cognee Worker is an event-driven knowledge graph builder with multi-tenant isolation.
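At a high level, the worker loops consume, cognify, publish, commit. A condensed sketch of that loop, assuming aiokafka as the Kafka client; process_event stands in for the real cognee.add()/cognee.cognify() pipeline in worker.py:

```python
import json

from aiokafka import AIOKafkaConsumer, AIOKafkaProducer


async def process_event(event: dict) -> None:
    """Stand-in for the real cognee.add()/cognee.cognify() pipeline.

    Tenant-scoped dataset naming is covered in the next section.
    """
    ...


async def run(bootstrap: str) -> None:
    consumer = AIOKafkaConsumer(
        "argus.codebase.ingested",
        bootstrap_servers=bootstrap,
        group_id="argus-cognee-workers",
        enable_auto_commit=False,  # commit only after the event is handled
    )
    producer = AIOKafkaProducer(bootstrap_servers=bootstrap)
    await consumer.start()
    await producer.start()
    try:
        async for msg in consumer:
            event = json.loads(msg.value)
            try:
                await process_event(event)
                await producer.send_and_wait("argus.codebase.analyzed", msg.value)
            except Exception as exc:
                # Failed events go to the DLQ with error context attached.
                event["error"] = str(exc)
                await producer.send_and_wait("argus.dlq", json.dumps(event).encode())
            await consumer.commit()  # at-least-once semantics
    finally:
        await consumer.stop()
        await producer.stop()
```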
Multi-Tenant Dataset Isolation¶
def _get_dataset_name(self, org_id: str, project_id: str, dataset_type: str) -> str:
    """Generate tenant-scoped dataset name.

    Returns: Dataset name like 'org_abc123_project_xyz789_codebase'
    """
    return f"org_{org_id}_project_{project_id}_{dataset_type}"
Dataset Types:
| Type | Purpose |
|---|---|
| codebase | Source code analysis and knowledge extraction |
| tests | Test execution data and patterns |
| failures | Failure pattern learning for self-healing |
Neo4j Aura Cold Start Handling¶
import asyncio

from neo4j.exceptions import ServiceUnavailable

async def _test_neo4j_connection(self):
    """Verify Neo4j connectivity, tolerating Aura cold starts.

    Neo4j Aura Free tier auto-pauses after 3 days of inactivity and can
    take 30-60 seconds to wake up on the first connection.
    """
    max_retries = 5
    retry_delay = 15  # seconds
    for attempt in range(1, max_retries + 1):
        try:
            async with driver.session() as session:
                await session.run("RETURN 1 AS test")
            return  # Success
        except ServiceUnavailable:
            if attempt < max_retries:
                await asyncio.sleep(retry_delay)
            else:
                raise RuntimeError("Failed to connect to Neo4j Aura")
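For context, the driver referenced above is the async Neo4j driver; a minimal sketch of its construction, with the URI and credentials as placeholders that would come from argus-data-secrets:

```python
from neo4j import AsyncGraphDatabase

# Placeholder URI/credentials; real values come from argus-data-secrets.
# The neo4j+s:// scheme enables the TLS that Aura requires.
driver = AsyncGraphDatabase.driver(
    "neo4j+s://<instance-id>.databases.neo4j.io",
    auth=("neo4j", "<neo4j-password>"),
)
```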
Event Processing Flow¶
sequenceDiagram
participant Redpanda
participant Worker as Cognee Worker
participant Neo4j as Neo4j Aura
participant DLQ as Dead Letter Queue
Redpanda->>Worker: argus.codebase.ingested
Worker->>Worker: Extract tenant context (org_id, project_id)
Worker->>Worker: Generate dataset name
Worker->>Neo4j: cognee.add() + cognee.cognify()
alt Success
Worker->>Redpanda: argus.codebase.analyzed
else Failure
Worker->>DLQ: argus.dlq (with error context)
end
Worker->>Worker: Commit Kafka offset
Health Endpoints¶
| Endpoint | Purpose | Response |
|---|---|---|
| GET /health | Liveness probe | {"status": "healthy"} |
| GET /ready | Readiness probe | {"status": "ready"} or 503 |
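The worker serves these on port 8080 to back the probes in the Deployment above. A minimal sketch of an equivalent server, assuming aiohttp; the actual implementation lives in worker.py:

```python
from aiohttp import web

# Readiness flag; the real worker would flip this once its Kafka and
# Neo4j connections are established (an assumption, not shown above).
READY = False

async def health(_request: web.Request) -> web.Response:
    return web.json_response({"status": "healthy"})

async def ready(_request: web.Request) -> web.Response:
    if READY:
        return web.json_response({"status": "ready"})
    return web.json_response({"status": "not ready"}, status=503)

app = web.Application()
app.router.add_get("/health", health)
app.router.add_get("/ready", ready)

if __name__ == "__main__":
    web.run_app(app, port=8080)  # matches the probe port in the manifest
```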
KEDA Autoscaling¶
File: data-layer/kubernetes/keda-cognee-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: cognee-worker-scaledobject
namespace: argus-data
spec:
scaleTargetRef:
name: cognee-worker
minReplicaCount: 1
maxReplicaCount: 5
pollingInterval: 15
cooldownPeriod: 300
triggers:
- type: kafka
metadata:
bootstrapServers: redpanda-0.redpanda.argus-data.svc.cluster.local:9092
consumerGroup: argus-cognee-workers
topic: argus.codebase.ingested
lagThreshold: "10"
activationLagThreshold: "5"
- type: kafka
metadata:
bootstrapServers: redpanda-0.redpanda.argus-data.svc.cluster.local:9092
consumerGroup: argus-cognee-workers
topic: argus.test.created
lagThreshold: "5"
activationLagThreshold: "2"
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 60
selectPolicy: Min
restoreToOriginalReplicaCount: true
fallback:
failureThreshold: 3
replicas: 2
Scaling Triggers:
- Kafka lag on argus.codebase.ingested > 10 messages
- Kafka lag on argus.test.created > 5 messages
- CPU utilization > 70%
- Memory utilization > 80%
Apache Flink (Stream Processing)¶
File: data-layer/kubernetes/flink-cluster.yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
name: argus-flink
namespace: argus-data
spec:
image: flink:1.20-java17
flinkVersion: v1_20
flinkConfiguration:
taskmanager.numberOfTaskSlots: "2"
state.backend: hashmap
state.checkpoints.dir: file:///tmp/flink-checkpoints
state.savepoints.dir: file:///tmp/flink-savepoints
execution.checkpointing.interval: "60000"
execution.checkpointing.mode: EXACTLY_ONCE
kubernetes.cluster-id: argus-flink
high-availability: kubernetes
high-availability.storageDir: file:///tmp/flink-ha
serviceAccount: flink
jobManager:
resource:
memory: "1024m"
cpu: 0.5
replicas: 1
taskManager:
resource:
memory: "2048m"
cpu: 1
replicas: 2
Features:
- EXACTLY_ONCE checkpointing (60s interval)
- Kubernetes-based high availability
- hashmap state backend (upgrade to RocksDB for production)
Network Policies¶
File: data-layer/kubernetes/network-policies.yaml
Default Deny Egress¶
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-egress
namespace: argus-data
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- namespaceSelector: {}
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
Cognee Worker Policy¶
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: cognee-worker-policy
namespace: argus-data
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: cognee-worker
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: argus-data
ports:
- protocol: TCP
port: 8080
egress:
# Internal services
- to:
- podSelector:
matchLabels:
app.kubernetes.io/name: redpanda
ports:
- protocol: TCP
port: 9092
- to:
- podSelector:
matchLabels:
app.kubernetes.io/name: falkordb
ports:
- protocol: TCP
port: 6379
- to:
- podSelector:
matchLabels:
app.kubernetes.io/name: valkey
ports:
- protocol: TCP
port: 6379
# External services (non-RFC1918)
- to:
- ipBlock:
cidr: 0.0.0.0/0
except:
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
ports:
- protocol: TCP
port: 443 # HTTPS (Anthropic, Cohere, Supabase)
- protocol: TCP
port: 5432 # PostgreSQL (Supabase)
- protocol: TCP
port: 7687 # Bolt (Neo4j Aura)
Network Policy Matrix¶
| Source | Destination | Ports | Policy |
|---|---|---|---|
| Any pod | Namespace internal | All | allow-namespace-internal |
| Any pod | External DNS | 53 | default-deny-egress |
| cognee-worker | redpanda | 9092 | cognee-worker-policy |
| cognee-worker | falkordb | 6379 | cognee-worker-policy |
| cognee-worker | valkey | 6379 | cognee-worker-policy |
| cognee-worker | External APIs | 443, 5432, 7687 | cognee-worker-policy |
| redpanda pods | redpanda pods | 33145, 9092, 9644 | redpanda-policy |
Secret Management¶
File: data-layer/kubernetes/secrets.yaml
Secrets Structure¶
| Secret Name | Keys | Purpose |
|---|---|---|
| argus-data-secrets | database-url, falkordb-password, valkey-password, redpanda-password, anthropic-api-key, cohere-api-key, neo4j-, supabase- | Global credentials |
| redpanda-superusers | users.txt (username:password:mechanism) | Redpanda SASL users |
| falkordb-auth | password | FalkorDB authentication |
| valkey-auth | password | Valkey authentication |
| keda-kafka-secrets | sasl, username, password | KEDA Kafka authentication |
| redpanda-credentials | bootstrap_servers, sasl_username, sasl_password | Flink Redpanda connection |
| flink-r2-credentials | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY | Flink R2 checkpoint storage |
Secret Injection Example¶
env:
- name: KAFKA_SASL_PASSWORD
valueFrom:
secretKeyRef:
name: argus-data-secrets
key: redpanda-password
- name: LLM_API_KEY
valueFrom:
secretKeyRef:
name: argus-data-secrets
key: anthropic-api-key
envFrom:
- configMapRef:
name: cognee-worker-config
Service Discovery¶
Internal Services¶
| Service | Type | Endpoints |
|---|---|---|
redpanda | ClusterIP | Port 9092 |
redpanda-headless | ClusterIP (None) | redpanda-0.redpanda.argus-data.svc.cluster.local |
falkordb | ClusterIP | Port 6379 |
falkordb-headless | ClusterIP (None) | falkordb-0.falkordb-headless.argus-data.svc.cluster.local |
valkey | ClusterIP | Port 6379 |
valkey-headless | ClusterIP (None) | valkey-0.valkey-headless.argus-data.svc.cluster.local |
flink-webui | ClusterIP | Port 8081 |
DNS Resolution¶
# Cognee Worker Configuration
KAFKA_BOOTSTRAP_SERVERS=redpanda-0.redpanda.argus-data.svc.cluster.local:9092
FALKORDB_HOST=falkordb-headless.argus-data.svc.cluster.local
VALKEY_HOST=valkey-headless.argus-data.svc.cluster.local
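The worker reads these through Pydantic settings (data-layer/cognee-worker/src/config.py). A minimal sketch of the pattern; the field names here are assumptions, not the real config.py schema:

```python
from pydantic_settings import BaseSettings  # Pydantic v2; v1 uses pydantic.BaseSettings


class WorkerSettings(BaseSettings):
    # Each field maps (case-insensitively) to the env var of the same
    # name, e.g. KAFKA_BOOTSTRAP_SERVERS. Defaults mirror the DNS names above.
    kafka_bootstrap_servers: str
    falkordb_host: str = "falkordb-headless.argus-data.svc.cluster.local"
    valkey_host: str = "valkey-headless.argus-data.svc.cluster.local"


settings = WorkerSettings()
```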
Backend Integration¶
The FastAPI backend integrates with the data layer through the Event Gateway service.
Event Gateway Lifecycle¶
File: src/api/server.py:2357-2399
@app.on_event("startup")
async def startup_event():
# ... other startup tasks ...
from src.services.event_gateway import get_event_gateway
event_gateway = get_event_gateway()
await event_gateway.start() # Line 2359
@app.on_event("shutdown")
async def shutdown_event():
from src.services.event_gateway import get_event_gateway
event_gateway = get_event_gateway()
await event_gateway.stop() # Line 2399
Event Emission Points¶
| Location | Event Type | Trigger |
|---|---|---|
| src/api/server.py:964-970 | TEST_EXECUTED / TEST_FAILED | After test run completion |
| src/api/tests.py:437-440 | TEST_CREATED | After new test creation |
Example Event Emission (src/api/tests.py:437-440):
from src.services.event_gateway import EventType, get_event_gateway
event_gateway = get_event_gateway()
if event_gateway.is_running:
await event_gateway.publish(
EventType.TEST_CREATED,
{"test_id": test_id, "project_id": project_id, ...}
)
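The EventGateway itself is not reproduced in this section. A hypothetical sketch of its publish path; the class shape, topic mapping, and names are assumptions rather than the real src/services/event_gateway.py API:

```python
import json
from enum import Enum

from aiokafka import AIOKafkaProducer


class EventType(str, Enum):
    # Assumed event-type -> topic mapping, following the topic table above.
    TEST_CREATED = "argus.test.created"
    TEST_EXECUTED = "argus.test.executed"
    TEST_FAILED = "argus.test.failed"


class EventGateway:
    def __init__(self, brokers: str):
        self._producer = AIOKafkaProducer(bootstrap_servers=brokers)
        self.is_running = False

    async def start(self) -> None:
        await self._producer.start()
        self.is_running = True

    async def stop(self) -> None:
        await self._producer.stop()
        self.is_running = False

    async def publish(self, event_type: EventType, payload: dict) -> None:
        # Serialize and publish to the topic mapped from the event type.
        await self._producer.send_and_wait(
            event_type.value, json.dumps(payload).encode("utf-8")
        )
```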
Required Environment Variables¶
For backend to connect to the data layer:
# Redpanda/Kafka Connection
REDPANDA_BROKERS=redpanda.argus-data.svc.cluster.local:9092
REDPANDA_SASL_USERNAME=argus-service
REDPANDA_SASL_PASSWORD=<from-secrets>
# Optional: External Redpanda Serverless
REDPANDA_BROKERS=<serverless-endpoint>:9092
KAFKA_SECURITY_PROTOCOL=SASL_SSL
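These variables map directly onto Kafka client settings. A hedged aiokafka example covering both cases; the SCRAM mechanism for the serverless endpoint is an assumption:

```python
import os

from aiokafka import AIOKafkaProducer
from aiokafka.helpers import create_ssl_context

protocol = os.environ.get("KAFKA_SECURITY_PROTOCOL", "SASL_PLAINTEXT")
producer = AIOKafkaProducer(
    bootstrap_servers=os.environ["REDPANDA_BROKERS"],
    security_protocol=protocol,
    # In-cluster superusers use SCRAM-SHA-512 (see redpanda-values.yaml);
    # Redpanda Serverless may require SCRAM-SHA-256 instead (assumption).
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username=os.environ["REDPANDA_SASL_USERNAME"],
    sasl_plain_password=os.environ["REDPANDA_SASL_PASSWORD"],
    ssl_context=create_ssl_context() if protocol == "SASL_SSL" else None,
)
```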
Deployment Sequence¶
Full Stack Deployment¶
# 1. Create namespace and quotas
kubectl apply -f namespace.yaml
# 2. Create secrets
kubectl apply -f secrets.yaml
# 3. Deploy storage layer (parallel)
kubectl apply -f falkordb.yaml &
kubectl apply -f valkey.yaml &
wait
# 4. Deploy Redpanda via Helm
helm repo add redpanda https://charts.redpanda.com
helm install redpanda redpanda/redpanda \
-n argus-data \
-f redpanda-values.yaml \
--wait
# 5. Apply network policies
kubectl apply -f network-policies.yaml
# 6. Create services
kubectl apply -f services.yaml
# 7. Create Kafka topics
kubectl exec -n argus-data redpanda-0 -- rpk topic create \
argus.codebase.ingested argus.codebase.analyzed \
argus.test.created argus.test.executed argus.test.failed \
argus.healing.requested argus.healing.completed argus.dlq \
--partitions 6 --replicas 1
# 8. Deploy Cognee worker
kubectl apply -f cognee-worker.yaml
kubectl apply -f keda-cognee-scaler.yaml
# 9. Deploy Flink (optional)
kubectl apply -f flink-operator.yaml
kubectl apply -f flink-cluster.yaml
Minimal Deployment (External Services)¶
# Uses Redpanda Serverless + Supabase PostgreSQL externally
./deploy-minimal.sh
# Components deployed:
# - Namespace + Network Policies
# - Cognee Worker (connects to external Redpanda Serverless)
# - Flink Cluster (optional)
Resource Allocation¶
Per-Component Resources¶
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit | Storage |
|---|---|---|---|---|---|
| Redpanda | 1000m | - | 1.5-2Gi | - | 40Gi |
| FalkorDB | 100m | 500m | 256Mi | 1Gi | 40Gi |
| FalkorDB-Exporter | 50m | 100m | 64Mi | 128Mi | - |
| Valkey | 50m | 200m | 128Mi | 512Mi | 40Gi |
| Valkey-Exporter | 50m | 100m | 64Mi | 128Mi | - |
| Cognee Worker | 100m | 500m | 256Mi | 1Gi | 5Gi (ephemeral) |
| Flink JobManager | 500m | - | 1024Mi | - | - |
| Flink TaskManager | 1000m | - | 2048Mi | - | - |
Namespace Totals¶
| Resource | Requested | Limit | Quota |
|---|---|---|---|
| CPU | 3.85 cores | - | 20-40 cores |
| Memory | ~7.5Gi | - | 40-80Gi |
| Storage | 160Gi | - | 500Gi |
Monitoring & Observability¶
Prometheus Metrics¶
Deployed Exporters:
- redis_exporter (FalkorDB): Port 9121
- redis_exporter (Valkey): Port 9121
- Flink metrics: Port 9999
Pod Annotations (the conventional Prometheus scrape annotations for the exporter port above; shown as an illustrative pattern):
prometheus.io/scrape: "true"
prometheus.io/port: "9121"
prometheus.io/path: "/metrics"
Log Aggregation¶
# View Cognee logs
kubectl logs -n argus-data -l app.kubernetes.io/name=cognee-worker -f
# View Flink logs
kubectl logs -n argus-data -l app=argus-flink,component=jobmanager -f
# Check Redpanda health
kubectl exec -n argus-data redpanda-0 -- rpk cluster health
Security Configuration¶
Credential Rotation Required
The secrets.yaml template contains placeholder values that must be replaced before deployment. If any real credentials were committed to the repository, rotate them immediately:
- Anthropic API key
- Neo4j Aura credentials
- Cohere API key
- Supabase service key
Recommended: Use External Secrets Operator or Sealed Secrets for production deployments.
Pod Security¶
securityContext:
runAsUser: 1000
runAsGroup: 1000
runAsNonRoot: true
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containerSecurityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: false # Required for persistence
Recommendations¶
- External Secrets Operator: Replace static secrets with dynamic syncing
- Mutual TLS: Enable TLS for internal service communication
- RBAC: Limit service account permissions
- Image Scanning: Scan container images before deployment
- Audit Logging: Enable Kubernetes audit logs
Troubleshooting¶
Cognee Worker Not Scaling¶
# Check KEDA status
kubectl describe scaledobject cognee-worker-scaledobject -n argus-data
# View Kafka consumer lag
kubectl exec -n argus-data redpanda-0 -- \
rpk group describe argus-cognee-workers \
-X user=admin -X pass=<password>
# Manual scaling (overrides KEDA)
kubectl scale deployment cognee-worker --replicas=3 -n argus-data
Redpanda SASL Authentication Fails¶
# List SASL users
kubectl get secret redpanda-superusers -n argus-data \
-o jsonpath='{.data.users\.txt}' | base64 -d
# Test SASL connection
kubectl exec -n argus-data redpanda-0 -- \
rpk cluster info \
-X user=admin \
-X pass=<password> \
-X sasl.mechanism=SCRAM-SHA-512
FalkorDB Data Loss¶
# Check persistence
kubectl get pvc -n argus-data
# Verify AOF is enabled
kubectl exec -n argus-data falkordb-0 -- redis-cli CONFIG GET appendonly
# Manual snapshot
kubectl exec -n argus-data falkordb-0 -- \
redis-cli -a <password> BGSAVE
File Manifest¶
data-layer/
├── kubernetes/
│ ├── namespace.yaml # Namespace, ResourceQuota, LimitRange
│ ├── secrets.yaml # 6 secrets (credentials, auth, API keys)
│ ├── redpanda-values.yaml # Helm values for Redpanda
│ ├── falkordb.yaml # FalkorDB StatefulSet + Service
│ ├── valkey.yaml # Valkey StatefulSet + Service
│ ├── cognee-worker.yaml # Cognee Deployment + HPA + PDB (321 lines)
│ ├── keda-cognee-scaler.yaml # KEDA ScaledObject + TriggerAuthentication
│ ├── network-policies.yaml # Zero-trust network policies (291 lines)
│ ├── services.yaml # ClusterIP services
│ ├── flink-cluster.yaml # Flink FlinkDeployment + ServiceAccount + RBAC
│ ├── flink-operator.yaml # Flink Kubernetes Operator
│ ├── flink-platform/
│ │ ├── keda-autoscaler.yaml
│ │ ├── checkpoint-config.yaml
│ │ ├── self-healing-operator.yaml
│ │ ├── monitoring.yaml
│ │ └── deploy-platform.sh
│ ├── flink-jobs/
│ │ └── test-analytics.yaml
│ ├── deploy.sh # Full stack deployment (264 lines)
│ ├── deploy-minimal.sh # Minimal deployment
│ └── deploy-flink.sh # Flink + Cloudflare R2 deployment
│
├── cognee-worker/ # Worker Implementation
│ ├── src/
│ │ ├── __init__.py
│ │ ├── config.py # Pydantic settings for worker config
│ │ └── worker.py # Main worker implementation (710 lines)
│ ├── scripts/
│ │ └── init_neo4j_schema.py # Neo4j schema initialization
│ ├── Dockerfile
│ ├── requirements.txt
│ └── README.md
│
├── schemas/
│ └── neo4j-multitenant-schema.cypher # Multi-tenant Cypher schema
│
├── terraform/
│ └── confluent-cloud/
│ └── main.tf # Confluent Cloud alternative
│
└── docker/
├── Dockerfile.cognee-worker
└── docker-compose.data-layer.yml
Backend Integration Files¶
src/
├── api/
│ ├── server.py # Lines 2357-2399: Event gateway lifecycle
│ └── tests.py # Lines 437-440: TEST_CREATED emission
│
└── services/
└── event_gateway.py # EventGateway class for Kafka publishing
Last Updated: January 27, 2026 - v2.0.0