Milestone 3 (steam engine): Building a Scalable and Modular Computing Infrastructure
In this milestone, the focus is on developing advanced training workflows. TensorFlow-based data management will also be implemented to enhance machine learning capabilities and keep the project's growing complexity manageable.
Key dates:
- Due date: Oct 5th
Template Repository
Submission Instructions:
- Please see Ed
Objectives:
Integrate Distributed Computing and Cloud Storage: Implement distributed computing solutions using tools like Dask, and integrate them with cloud storage systems that support the scale and complexity of the project (a brief Dask sketch follows this list).
Utilize TensorFlow for Data Management: Implement tf.data and TFRecords to enhance data ingestion and management within the machine learning components of the project (see the tf.data/TFRecord sketch below).
Develop Advanced Training Workflows: Implement and optimize complex training workflows, including experiment tracking (W&B), multi-GPU training, and serverless training (Vertex AI). These should align with the machine learning components and training requirements of the project (see the W&B, multi-GPU, and Vertex AI sketches below).
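As a rough illustration of the distributed computing objective, the sketch below runs a simple transformation on data read directly from cloud storage with Dask. It assumes the dask, distributed, and gcsfs packages are installed; the bucket path gs://my-project-data/raw/ and the "label" column are hypothetical placeholders for your own data.

```python
import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":
    # Start a local distributed scheduler and worker pool; in a real deployment
    # this could instead connect to a multi-node cluster.
    client = Client(n_workers=4, threads_per_worker=2)

    # Read a partitioned CSV dataset directly from cloud storage (via gcsfs).
    df = dd.read_csv("gs://my-project-data/raw/*.csv")

    # Define a lazy transformation, then trigger distributed execution.
    label_counts = df.groupby("label").size().compute()
    print(label_counts)

    client.close()
```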
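For the tf.data / TFRecord objective, a minimal sketch might look like the following. It assumes TensorFlow 2.x; the file name train.tfrecord and the image/label feature keys are illustrative placeholders, and the toy byte strings stand in for real serialized examples.

```python
import tensorflow as tf

def serialize_example(image_bytes, label):
    # Pack one (image, label) pair into a tf.train.Example record.
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Write a few toy records to a TFRecord file.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for image_bytes, label in [(b"\x00" * 16, 0), (b"\xff" * 16, 1)]:
        writer.write(serialize_example(image_bytes, label))

# Read the records back with tf.data: parse, batch, and prefetch for training.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

dataset = (
    tf.data.TFRecordDataset("train.tfrecord")
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```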
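For experiment tracking and multi-GPU training, one possible pattern combines W&B logging with TensorFlow's MirroredStrategy, as sketched below. The project name "my-project" is a placeholder, the WandbMetricsLogger callback assumes a recent wandb release, and MNIST stands in for your real dataset.

```python
import tensorflow as tf
import wandb
from wandb.integration.keras import WandbMetricsLogger

# Start a W&B run; hyperparameters are tracked through the run config.
wandb.init(project="my-project", config={"epochs": 5, "batch_size": 64})

# MirroredStrategy replicates the model across all GPUs visible on this machine.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Toy data: MNIST flattened to vectors, standing in for the project's dataset.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

model.fit(
    x_train,
    y_train,
    epochs=wandb.config.epochs,
    batch_size=wandb.config.batch_size,
    callbacks=[WandbMetricsLogger()],  # stream per-epoch metrics to the W&B run
)
wandb.finish()
```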
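For the serverless-training piece, a custom job could be submitted to Vertex AI with the google-cloud-aiplatform SDK roughly as follows. The project ID, region, staging bucket, trainer script path, machine/accelerator settings, and container image are all hypothetical placeholders; check Google's current list of prebuilt training containers for a valid image URI before running.

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-gcp-project",
    location="us-central1",
    staging_bucket="gs://my-project-staging",
)

# Package a local training script and run it as a serverless custom training job.
job = aiplatform.CustomTrainingJob(
    display_name="milestone3-training",
    script_path="trainer/task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest",
    requirements=["wandb"],
)

job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```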
Deliverables:
Data Pipeline Implementation: A comprehensive data pipeline with extraction, transformation, and versioning capabilities, including examples of versioned datasets.
Distributed Computing and Storage Integration: Evidence of the implemented distributed computing system and cloud storage solutions, complete with documentation on their configuration and utilization.
Machine Learning Workflow Implementation: Detailed implementation of the advanced training workflows, with evidence of successful training runs, experiment tracking, and utilization of multi-GPU and serverless training.