Senior Data Scientist - Synthetic Data Generation (Remote)
About the Role
We are looking for a Senior Data Scientist - Synthetic Data Generation to join our innovative team at Poolside. This remote role focuses on enhancing the quality of datasets used for training our models, with a particular emphasis on generating synthetic data at scale. Your expertise will be crucial in defining high-quality data needs that align with our model capabilities and use cases.
What You'll Do
- Lead initiatives to improve the quality of pretraining datasets by leveraging your experience in synthetic data generation.
- Design and implement complex data pipelines that generate large volumes of diverse data while optimizing resource usage.
- Collaborate closely with teams such as Pretraining, Posttraining, Evals, and Product to ensure alignment on model quality.
- Continuously measure and refine dataset quality through quantitative data ablation experiments.
- Stay updated with the latest research in synthetic data generation and large language models (LLMs).
Requirements
- Strong background in machine learning and engineering.
- Experience with LLMs, including understanding their learning processes and scaling laws.
- Proficiency in designing cost-efficient pipelines for generating synthetic datasets.
- Excellent programming skills in Python.
- Experience working with large-scale GPU clusters and distributed data pipelines.
- Strong obsession with data quality and experience in building trillion-scale pretraining datasets.
Nice to Have
- Research experience with publications in applied deep learning or LLMs.
- Familiarity with concepts like data curation, deduplication, and tokenization.
What We Offer
- Fully remote work with flexible hours.
- 37 days of vacation and holidays per year.
- Health insurance allowance for you and your dependents.
- Company-provided equipment and home office allowances.
- A culture that prioritizes wellbeing and continuous learning.
- Frequent team get-togethers to foster collaboration.
This Senior Data Scientist role at Poolside offers a unique opportunity to work on synthetic data generation in a fully remote environment. With a strong focus on AI and machine learning, you'll be part of a team that values innovation and quality.
Who Will Succeed Here
- Proficiency in Python for data manipulation and machine learning, with hands-on experience in libraries such as Pandas, NumPy, and TensorFlow, applied to synthetic data generation.
- Strong understanding of data pipelines and experience with tools like Apache Airflow or Luigi for efficient data flow and processing in a remote work environment.
- A growth mindset with 3-5 years of experience in data science, particularly in synthetic data generation, and the ability to adapt to new technologies and methodologies such as large language models (LLMs).