onsite
Member of Technical Staff - Web Crawl Engineer - REFLECTION
Software Engineer
Lead the design and operation of large‑scale web crawling pipelines built with Python, distributed frameworks, and cloud infrastructure, ensuring high‑quality, fresh, and diverse data for AI models.
About the role
Key Responsibilities
- Design, implement, and maintain scalable web crawling pipelines that ingest billions of pages daily.
- Optimize crawler performance and reliability across distributed clusters on AWS.
- Collaborate with data scientists to define data quality metrics and ensure coverage of niche domains.
- Integrate extracted data into downstream storage and indexing systems (e.g., S3, Elasticsearch).
- Monitor system health, troubleshoot failures, and continuously improve throughput and reduce latency.
Requirements
- 5+ years of experience building production‑grade web crawlers or large‑scale data ingestion systems.
- Proficiency in Python and distributed processing frameworks (e.g., Apache Beam, Spark).
- Strong background in cloud infrastructure, especially AWS services such as EC2, S3, and EMR.
- Experience with data storage, indexing, and search technologies (e.g., Elasticsearch, Solr).
- Excellent problem‑solving skills and a passion for data quality and scalability.