AI Infrastructure Engineer
Alice
Software Engineering, Other Engineering, Data Science
Vietnam
- Operations
- Full-time
Description
Alice is building its internal AI infrastructure layer from the ground up. We have real agents running in production, a growing base of employees using AI in their daily work, and a clear architectural direction. What we don't have yet is a dedicated engineer to own it.
You'll be the first. Your job is to close the gap between "working prototype" and "production platform" - owning the AWS/Kubernetes/Terraform foundation that hosts our agents, the GitOps and CI/CD pipelines that ship them, and the reliability layer (observability, cost controls, audit trails, evals) that makes it safe to run AI at scale in a trust & safety company.
This is an infrastructure-first role with deep AI fluency - not a prompt engineer, not a wrapper-framework operator, not a no-code builder. You should be equally comfortable writing a Terraform module, debugging a Kubernetes pod, and tracing an agent's tool-call chain.
We don't operate with a predefined backlog here; you will be responsible for identifying high-impact needs and bringing them to life. The ideal candidate has a track record of deploying agentic systems that have held up under real-world usage, balances infrastructure rigor with a deep concern for user experience, and recognizes that the primary hurdle in AI integration is rarely the model itself.
Responsibilities:
Platform & Infrastructure
- Architect, build, and run the AWS/Kubernetes platform that hosts Alice's internal AI agents and tools; drive AWS Well-Architected pillars (operational excellence, security, reliability, performance, cost, sustainability).
- Own Infrastructure-as-Code: Terraform modules, standards, and reviews for Bedrock, agent runtimes, vector DBs, and supporting services.
AI Systems
- Design and ship production-grade agents and multi-agent pipelines using the Anthropic Agent SDK, Claude Code, AWS Bedrock, and MCP — not wrapper frameworks.
- Own the full agent lifecycle: scoping → prototyping → eval → deploy → monitor → iterate.
- Integrate agentic workflows into internal and product systems via APIs, databases, webhooks, Slack, and email.
Reliability, Observability, Cost
- Build first-class observability across apps and infra: OpenTelemetry, Prometheus, plus LLM-specific tracing (Langfuse or equivalent), token/cost metrics, and eval pipelines.
- Define SLOs/SLIs and error budgets for AI services: latency, model fallback chains, eval regression gates, agent success rates.
- Lead incident readiness, response, and post-mortems.
- Drive FinOps: model routing by cost, cache hit rates, batch vs. realtime tradeoffs, budget alarms, per-team chargeback visibility.
- Implement guardrails: prompt-injection defenses, PII redaction, model allowlists, human-in-the-loop checkpoints, audit trails.
Org Impact
- Identify high-leverage workflows across the organization and translate them into scalable agentic automations.
- Partner with R&D, Delivery, security, and external vendors to deliver platform capabilities.
Requirements
Must-have
- 3-5 years in software engineering, shipping and operating production-grade systems.
- 2+ years of hands-on AWS, Kubernetes, and Terraform in production: not just familiarity, but ownership.
- 1-2 years hands-on building and deploying LLM-powered or agentic systems in production.
- Proficiency in Python: async patterns, REST APIs, cloud-native architecture.
- Production experience with native agentic SDKs (Anthropic Agent SDK, Claude Code) and MCP - tool-calling patterns, server configuration, memory systems, vector DBs.
- Hands-on AWS Bedrock for model access, IAM-based auth, and enterprise deployment patterns.
- Production CI/CD ownership (GitHub Actions, Argo CD, or equivalent) and observability stack experience (OpenTelemetry + Prometheus, plus LLM tracing).
- Proven ownership: design → implement → release → operate → improve, independently and within a team.
- Strong debugging instincts across multi-step agent chains and distributed infrastructure.
- Clear written and verbal communication in English; comfortable with internal and external stakeholders.
- Startup mindset: move fast, own decisions end-to-end, comfortable with ambiguity.
Nice to Have
- Background in trust & safety, content moderation, or compliance-sensitive environments.
- FinOps experience at scale (cost attribution, budget enforcement, optimization playbooks).
- Experience building lightweight internal dashboards or UI layers for agentic workflows.
- LLM evaluation framework experience (Braintrust, Langfuse evals, custom harnesses).
About Alice
Alice is a trust, safety, and security company built for the AI era. We safeguard the communicative technologies people use to create, collaborate, and interact—whether with each other or with machines.
In a world where AI has fundamentally changed the nature of risk, Alice provides end-to-end coverage across the entire AI lifecycle. We support frontier model labs, enterprises, and UGC platforms with a comprehensive suite of solutions: from model hardening evaluations and pre-deployment red-teaming to runtime guardrails and ongoing drift detection.