Poolside · 31.01.26
AI SCORE 8.5

Senior Data Scientist - Synthetic Data Generation (Remote)

$120K–$150K/year

About the Role

We are looking for a Senior Data Scientist - Synthetic Data Generation to join our innovative team at Poolside. This remote role focuses on enhancing the quality of datasets used for training our models, with a particular emphasis on generating synthetic data at scale. Your expertise will be crucial in defining high-quality data needs that align with our model capabilities and use cases.

What You'll Do

  • Lead initiatives to improve the quality of pretraining datasets by leveraging your experience in synthetic data generation.
  • Design and implement complex data pipelines that generate large volumes of diverse data while optimizing resource usage.
  • Collaborate closely with teams such as Pretraining, Posttraining, Evals, and Product to ensure alignment on model quality.
  • Continuously measure and refine dataset quality through quantitative data ablation experiments.
  • Stay updated with the latest research in synthetic data generation and large language models (LLMs).

Requirements

  • Strong background in machine learning and engineering.
  • Experience with LLMs, including understanding their learning processes and scaling laws.
  • Proficiency in designing cost-efficient pipelines for generating synthetic datasets.
  • Excellent programming skills in Python.
  • Experience working with large-scale GPU clusters and distributed data pipelines.
  • Strong obsession with data quality and experience in building trillion-scale pretraining datasets.

Nice to Have

  • Research experience with publications in applied deep learning or LLMs.
  • Familiarity with concepts like data curation, deduplication, and tokenization.

What We Offer

  • Fully remote work with flexible hours.
  • 37 days of vacation and holidays per year.
  • Health insurance allowance for you and your dependents.
  • Company-provided equipment and home office allowances.
  • A culture that prioritizes wellbeing and continuous learning.
  • Frequent team get-togethers to foster collaboration.

Why This Job (8.5 of 10)

This Senior Data Scientist role at Poolside offers a unique opportunity to work on synthetic data generation in a fully remote environment. With a strong focus on AI and machine learning, you'll be part of a team that values innovation and quality.

About Poolside

Explore exciting Poolside careers in 2026 with a variety of remote, hybrid, and office roles available. Our platform offers advanced filters to refine your job search, application tracking to keep your submissions organized, and valuable company insights to help you stand out. Discover your next career opportunity at Poolside today and take the next step toward a rewarding future in the industry.

Industry: Tech
Location: Remote

Who Will Succeed Here

Proficiency in Python for data manipulation and machine learning, with hands-on experience in libraries such as Pandas, NumPy, and TensorFlow, enabling effective synthetic data generation.

Strong understanding of data pipelines and experience with tools like Apache Airflow or Luigi, ensuring efficient data flow and processing in a remote work environment.

A growth mindset and 3-5 years of experience in data science, particularly in synthetic data generation, with the ability to adapt to new technologies and methodologies such as large language models (LLMs).

Learning Resources

Python for Data Science Handbook (guide)

Career Path

  • Senior Data Scientist - Synthetic Data Generation (now)
  • Lead Data Scientist (1-2 years)
  • Director of Data Science (3-5 years)

Market Overview

Market Size 2024: $30B
Annual Growth: 22.5%
AI Adoption: 75%
Investment: +150%
Labour Demand: +45%
Avg Salary: $130K

Skills & Requirements

Required: Python, Machine Learning, LLM
Growing in Demand: Deep Learning, Data Visualization, Cloud Computing (AWS/Azure)
Declining: R Programming, Excel-based Data Analysis

Domain Trends

Rise of Synthetic Data
The synthetic data generation market is projected to grow by 30% annually as organizations seek to enhance AI training datasets without privacy concerns.
Increased Demand for LLMs
Over 60% of companies are integrating Large Language Models (LLMs) into their data science workflows, driving demand for expertise in this area.
Shift to GPU-accelerated Computing
Usage of GPU clusters for data processing is increasing by 40% as companies require faster model training times, leading to a surge in demand for professionals skilled in GPU computing.

All job postings are automatically gathered by algorithms. We do not review or verify listings. Be careful when applying, and do not sign in with iCloud or Google services.