Milestone 3 (Steam Engine): Building a Scalable and Modular Computing Infrastructure

In this milestone, the focus is on developing advanced training workflows and implementing TensorFlow-based data management, both to enhance the project's machine learning capabilities and to keep its growing complexity manageable.

Key dates:

  • Due date: Oct 5th

Template Repository

Submission Instructions:

  • Please see Ed

Objectives:

  • Integrate Distributed Computing and Cloud Storage: Implement distributed computing solutions using tools like Dask, aligned with cloud storage systems that support the scale and complexity of the project (a Dask sketch follows this list).

  • Utilize TensorFlow for Data Management: Implement tf.data pipelines and TFRecords to enhance data ingestion and management within the machine learning components of the project (a TFRecord round-trip sketch follows this list).

  • Develop Advanced Training Workflows: Implement and optimize complex training workflows, including experiment tracking (W&B), multi-GPU training, and serverless training (Vertex AI). These should align with the machine learning components and training requirements of the project (a tracked multi-GPU sketch follows this list).

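As a rough illustration of the first objective, here is a minimal Dask sketch that runs a distributed transformation directly against cloud storage. The bucket name "my-project-bucket", the file layout, and the "label" column are placeholders, and reading gs:// paths assumes gcsfs is installed.

```python
import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":
    # Local cluster for development; point Client at a remote scheduler
    # address instead to scale out across machines.
    client = Client(n_workers=4, threads_per_worker=2)

    # Dask reads cloud paths through fsspec/gcsfs; the bucket and CSV
    # layout here are hypothetical.
    df = dd.read_csv("gs://my-project-bucket/raw/*.csv")

    # Lazy transformation: nothing executes until a write or .compute().
    cleaned = df.dropna(subset=["label"])

    # Persist the transformed partitions back to cloud storage as Parquet.
    cleaned.to_parquet("gs://my-project-bucket/processed/")

    client.close()
```

The same code scales from a laptop to a cluster by changing only the Client target, which is one reason Dask pairs well with bucket-backed storage.
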
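For the second objective, a self-contained sketch of the TFRecord round trip: serialize examples to a record file, then read them back through a tf.data input pipeline. The "image"/"label" feature keys, file name, and dummy payloads are illustrative, not prescribed.

```python
import tensorflow as tf

def serialize_example(image_bytes, label):
    """Pack one (image, label) pair into a tf.train.Example proto."""
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Write a few dummy records to disk.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for label in range(3):
        writer.write(serialize_example(b"\x00" * 16, label))

def parse_example(record):
    """Inverse of serialize_example: proto bytes -> tensors."""
    schema = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(record, schema)

# tf.data pipeline: parse in parallel, shuffle, batch, and prefetch.
dataset = (
    tf.data.TFRecordDataset("train.tfrecord")
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=1000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```
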
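For the third objective, a sketch of a tracked multi-GPU run using tf.distribute.MirroredStrategy with Weights & Biases logging. The "milestone3-demo" project name, the toy model, and the dummy tensors are placeholders; wandb.init, wandb.log, and MirroredStrategy are the real APIs.

```python
import tensorflow as tf
import wandb

# Start a tracked run; the project name is a placeholder.
wandb.init(project="milestone3-demo", config={"lr": 1e-3, "epochs": 2})

# MirroredStrategy replicates the model across all visible GPUs and
# aggregates gradients; it falls back to a single device if none exist.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(16,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(wandb.config.lr),
                  loss="mse")

# Dummy tensors stand in for the project's tf.data pipeline.
x = tf.random.normal((256, 16))
y = tf.random.normal((256, 1))

for epoch in range(wandb.config.epochs):
    history = model.fit(x, y, batch_size=32, epochs=1, verbose=0)
    wandb.log({"epoch": epoch, "loss": history.history["loss"][0]})

wandb.finish()
```

For the serverless piece, a script like this can be packaged into a container and submitted as a Vertex AI custom job (e.g., via gcloud ai custom-jobs create); the exact flags depend on your project's region, machine type, and image.
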
Deliverables:

  • Data Pipeline Implementation: A comprehensive data pipeline with extraction, transformation, and versioning capabilities, including examples of versioned datasets (see the versioned-dataset sketch at the end of this section).

  • Distributed Computing and Storage Integration: Evidence of the implemented distributed computing system and cloud storage solutions, complete with documentation on their configuration and utilization.

  • Machine Learning Workflow Implementation: Detailed implementation of the advanced training workflows, with evidence of successful training runs, experiment tracking, and utilization of multi-GPU and serverless training.
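
As one possible way to demonstrate the versioned-dataset deliverable, the sketch below reads a dataset pinned to a specific Git tag through DVC's Python API. The repo URL, file path, and tag are placeholders, and it assumes the data was previously versioned with dvc add and pushed to a DVC remote.

```python
import dvc.api

# Resolve the remote-storage URL of data/raw.csv as of tag "v1.0".
# Repo, path, and rev are hypothetical.
url = dvc.api.get_url(
    path="data/raw.csv",
    repo="https://github.com/example/project-repo",
    rev="v1.0",
)
print(url)

# Stream the file contents for that exact version.
with dvc.api.open(
    "data/raw.csv",
    repo="https://github.com/example/project-repo",
    rev="v1.0",
) as f:
    print(f.readline())
```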