Large Scale Dataset Engineer
Lightricks
Lightricks, an AI-first company, is revolutionizing how visual content is created. With a mission to bridge the gap between imagination and creation, Lightricks is dedicated to bringing cutting-edge technology to the creative and business spaces. Our AI photo and video generation models, which power our apps and platforms including Facetune, Photoleap, Videoleap, and LTX Studio, allow creators and brands to leverage the latest research breakthroughs, offering endless control over their creative potential. Our influencer marketing platform, Popular Pays, provides creators the ability to monetize their work and offers brands opportunities to scale their content through tailored creator partnerships.
The Core Generative AI team at Lightricks Research is a unified group of researchers and engineers dedicated to developing the generative foundation models that serve LTX Studio, our AI-based video creation platform. Our focus is on creating a controllable video generative model by merging cutting-edge algorithms with exceptional engineering. This involves enhancing the machine learning components of our internal training framework, which is crucial for developing advanced models. We specialize in both the research and the engineering that enable efficient, scalable training and inference, allowing us to deliver state-of-the-art AI-generated video models.
About the Role
As a Large Scale Dataset ML Engineer, you will play a key role in improving training efficiency by increasing both the quantity and quality of training data. This role demands excellent engineering skills for designing, implementing, and optimizing advanced data pipelines, alongside implementing robust machine learning and computer vision algorithms for data processing. Your expertise in optimizing the performance of distributed systems, understanding statistics, and eliminating bugs will be crucial, as our video training sets consist of extensive data volumes processed across numerous virtual machines.
This role is designed for individuals who are not only technically proficient but also deeply passionate about pushing the boundaries of AI and machine learning through innovative engineering and collaborative research.
What you will be doing
- Own and lead engineering projects focused on data acquisition, processing, clustering, evaluation and filtering.
- Design algorithms for balancing and filtering training sets.
- Develop high-performance and scalable distributed systems capable of handling petabytes of data.
- Collaborate with researchers and product stakeholders to iteratively improve training sets based on model performance.
Your skills and experience
- 6+ years of experience with small- to large-scale ML experiments and multi-modal ML pipelines.
- Strong software engineering skills, proficient in Python and experienced with Kubernetes Infrastructure-as-Code.
- Experience with data processing and distributed systems.
- Ability to develop and fine-tune computer vision and ML models for data evaluation and filtering.
- Understanding of relevant topics in statistics and clustering.
- Enjoys delving into system implementations to enhance performance and maintainability.
- Background in PyTorch, JAX, TensorFlow, or similar technologies is a plus.