AI Infrastructure Engineer

Alice

Software Engineering, Other Engineering, Data Science

Vietnam

Posted on May 1, 2026

AI Infrastructure Engineer

  • Operations
  • Vietnam
  • Full-time

Description

Alice is building its internal AI infrastructure layer from the ground up. We have real agents running in production, a growing base of employees using AI in their daily work, and a clear architectural direction. What we don't have yet is a dedicated engineer to own it.

You'll be the first. Your job is to close the gap between "working prototype" and "production platform": owning the AWS/Kubernetes/Terraform foundation that hosts our agents, the GitOps and CI/CD pipelines that ship them, and the reliability layer (observability, cost controls, audit trails, evals) that makes it safe to run AI at scale in a trust & safety company.

This is an infrastructure-first role with deep AI fluency: not a prompt engineer, not a wrapper-framework operator, not a no-code builder. You should be equally comfortable writing a Terraform module, debugging a Kubernetes pod, and tracing an agent's tool-call chain.

We don't operate with a predefined backlog here; you will be responsible for identifying high-impact needs and bringing them to life. The ideal candidate has a track record of deploying agentic systems that have held up under real-world usage, balances a focus on infrastructure with a deep concern for user experience, and recognizes that the primary hurdle in AI integration is rarely the model itself.

Responsibilities:

Platform & Infrastructure

  • Architect, build, and run the AWS/Kubernetes platform that hosts Alice's internal AI agents and tools; drive AWS Well-Architected pillars (operational excellence, security, reliability, performance, cost, sustainability).
  • Own Infrastructure-as-Code: Terraform modules, standards, and reviews for Bedrock, agent runtimes, vector DBs, and supporting services.

AI Systems

  • Design and ship production-grade agents and multi-agent pipelines using the Anthropic Agent SDK, Claude Code, AWS Bedrock, and MCP — not wrapper frameworks.
  • Own the full agent lifecycle: scoping → prototyping → eval → deploy → monitor → iterate.
  • Integrate agentic workflows into internal and product systems via APIs, databases, webhooks, Slack, and email.

Reliability, Observability, Cost

  • Build first-class observability across apps and infra: OpenTelemetry, Prometheus, plus LLM-specific tracing (Langfuse or equivalent), token/cost metrics, and eval pipelines.
  • Define SLOs/SLIs and error budgets for AI services: latency, model fallback chains, eval regression gates, agent success rates. Lead incident readiness, response, and post-mortems.
  • Drive FinOps: model routing by cost, cache hit rates, batch vs. real-time tradeoffs, budget alarms, per-team chargeback visibility.
  • Implement guardrails: prompt-injection defenses, PII redaction, model allowlists, human-in-the-loop checkpoints, audit trails.

Org Impact

  • Identify high-leverage workflows across the organization and translate them into scalable agentic automations.
  • Partner with R&D, Delivery, security, and external vendors to deliver platform capabilities.

Requirements

Must-Have

  • 3-5 years in software engineering, shipping and operating production-grade systems.
  • 2+ years hands-on AWS, Kubernetes, and Terraform in production — not familiarity, ownership.
  • 1-2 years hands-on building and deploying LLM-powered or agentic systems in production.
  • Proficiency in Python: async patterns, REST APIs, cloud-native architecture.
  • Production experience with native agentic SDKs (Anthropic Agent SDK, Claude Code) and MCP: tool-calling patterns, server configuration, memory systems, vector DBs.
  • Hands-on AWS Bedrock for model access, IAM-based auth, and enterprise deployment patterns.
  • Production CI/CD ownership (GitHub Actions, Argo CD, or equivalent) and observability stack experience (OpenTelemetry + Prometheus, plus LLM tracing).
  • Proven ownership: design → implement → release → operate → improve, independently and within a team.
  • Strong debugging instincts across multi-step agent chains and distributed infrastructure.
  • Clear written and verbal communication in English; comfortable with internal and external stakeholders.
  • Startup mindset: move fast, own decisions end-to-end, comfortable with ambiguity.

Nice to Have

  • Background in trust & safety, content moderation, or compliance-sensitive environments.
  • FinOps experience at scale (cost attribution, budget enforcement, optimization playbooks).
  • Experience building lightweight internal dashboards or UI layers for agentic workflows.
  • LLM evaluation framework experience (Braintrust, Langfuse evals, custom harnesses).

About Alice

Alice is a trust, safety, and security company built for the AI era. We safeguard the communicative technologies people use to create, collaborate, and interact—whether with each other or with machines.

In a world where AI has fundamentally changed the nature of risk, Alice provides end-to-end coverage across the entire AI lifecycle. We support frontier model labs, enterprises, and UGC platforms with a comprehensive suite of solutions: from model hardening evaluations and pre-deployment red-teaming to runtime guardrails and ongoing drift detection.