Pretraining Data Engineer - Remote Role at Poolside
About the Role
We're hiring a Pretraining Data Engineer to join our team at Poolside. In this remote role, you will help build and scale our Model Factory, which is essential for training our foundation models. Your primary mission will be to architect and maintain high-performance pipelines that transform trillions of raw tokens into the high-quality datasets our models require.
What You'll Do
- Build and maintain high-performance pipelines for processing trillions of tokens.
- Deliver diverse, high-quality datasets for pretraining foundation models.
- Collaborate closely with teams such as Pretraining, Posttraining, Evals, and Product to ensure alignment on model quality.
- Engineer ingestion, deduplication, and streaming systems that handle petabyte-scale data.
- Bridge the gap between raw web crawls and GPU clusters, influencing model performance through superior data modeling and distributed pipeline optimization.
Requirements
- Strong background in building production-grade, distributed data systems for machine learning.
- Experience with orchestration tools like Slurm, Airflow, or Dagster.
- Familiarity with CI/CD practices and with observability and reliability tools such as Grafana and Prometheus.
- Proficiency in infrastructure tools including Git, Docker, Kubernetes, and cloud managed services.
- Expert-level knowledge of Python and ability to write clean, maintainable code.
- Strong algorithmic foundations and proficiency with libraries like Polars, Dask, or PySpark.
Nice to Have
- Experience building trillion-token-scale, state-of-the-art (SOTA) pretraining datasets.
- Experience translating research to production at scale.
- Prior experience pretraining large language models (LLMs).
What We Offer
- Fully remote work with flexible hours.
- 37 days of vacation and holidays per year.
- Health insurance allowance for you and your dependents.
- Company-provided equipment and home office allowances.
- A diverse and inclusive people-first culture.
This Pretraining Data Engineer role at Poolside offers a unique opportunity to work on cutting-edge AI projects in a fully remote environment with generous benefits.