Milestone 2 (the wheel): MLOps Infrastructure & Advanced Training Workflows - Building Atomic Containers, Versioned Data Pipelines, and Scalable Computing Solutions

In this milestone, the focus is on developing a robust and scalable MLOps infrastructure. Teams will build atomic containers for the project's components, design data pipelines with version control using tools such as Delta Lake, and integrate distributed computing solutions alongside cloud storage.

Key dates:

  • Due date: Oct 18th

Objectives:

  • Build Atomic Containers for Components: Create a standalone container for each application and service in the project, so that every component can be built and run independently of the others (see the container sketch after this list).
  • Construct Data Pipelines with Versioning: Design and implement a robust data pipeline covering extraction, transformation, and versioning, using tools such as Delta Lake, Pachyderm, or DVC. This enables efficient data handling and version control within the project (see the versioning sketch after this list).
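
As a concrete illustration of the container objective, the sketch below builds one component's image and runs it on its own using the Docker SDK for Python (pip install docker). The directory, image tag, and command are hypothetical placeholders, not part of the assignment.

    # Minimal smoke test: build one component's image and run it as a standalone
    # container. Assumes Docker is installed; all names below are illustrative.
    import docker

    client = docker.from_env()

    # Build the component's image from its own directory and Dockerfile.
    image, build_logs = client.images.build(
        path="src/preprocessing",          # hypothetical component directory
        tag="team-app/preprocessing:dev",  # hypothetical image tag
    )

    # Run the container by itself; an atomic container should work with no other
    # services running, taking its inputs via env vars or mounted volumes.
    output = client.containers.run(
        image="team-app/preprocessing:dev",
        command="python cli.py --help",    # hypothetical entrypoint check
        remove=True,                       # clean up the container afterwards
    )
    print(output.decode())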

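For the versioning objective, here is a minimal sketch using the deltalake Python package (pip install deltalake pandas); DVC or Pachyderm would serve the same role with their own commands. The table path and column names are made up for illustration.

    # Write two versions of a dataset to a Delta table, then read an old version back.
    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    table_path = "data/processed/reviews"  # hypothetical local or cloud path

    # Version 0: initial extract-and-transform output.
    v0 = pd.DataFrame({"id": [1, 2], "label": ["cat", "dog"]})
    write_deltalake(table_path, v0)

    # Version 1: append newly transformed records.
    v1 = pd.DataFrame({"id": [3], "label": ["cat"]})
    write_deltalake(table_path, v1, mode="append")

    # Time travel: load the table exactly as it looked at version 0,
    # which is what makes training runs reproducible.
    dt = DeltaTable(table_path, version=0)
    print(dt.to_pandas())                    # only ids 1 and 2
    print(DeltaTable(table_path).version())  # latest version number
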
Deliverables:

  • Containerized Components: Fully functional atomic containers for each individual component, ready for integration into the project architecture.
  • Virtual Environment Setup: Documented, working virtual machines and environments set up to build and run the containerized components.

This milestone also covers advanced training workflows and TensorFlow-based data management, which strengthen the project's machine learning capabilities and keep its growing complexity manageable.

Additional objectives:

  • Integrate Distributed Computing and Cloud Storage: Implement distributed computing with tools such as Dask, connected to cloud storage systems that can handle the scale and complexity of the project (see the Dask sketch after this list).

  • Utilize TensorFlow for Data Management: Use tf.data pipelines and TFRecord files to improve data ingestion and management in the machine learning components of the project (see the tf.data sketch after this list).

  • Develop Advanced Training Workflows: Implement and optimize complex training workflows, including experiment tracking with Weights & Biases, multi-GPU training, serverless training on Vertex AI, and pipeline orchestration with Kubeflow, all aligned with the project's machine learning components and training requirements (see the experiment-tracking sketch after this list).
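
For the distributed-computing objective, this is a minimal sketch assuming Dask with the gcsfs extra installed; the bucket path and column name are placeholders, and an S3 bucket with s3fs would work the same way.

    # Process a dataset larger than one machine's memory with Dask, reading
    # Parquet files straight from cloud storage. Names are illustrative only.
    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()  # local cluster; point this at a real scheduler address later

    # Reads are lazy: nothing is loaded until .compute() is called.
    df = dd.read_parquet("gs://my-project-bucket/processed/*.parquet")

    # The groupby is planned as a task graph and executed in parallel across workers.
    counts = df.groupby("label").size().compute()
    print(counts)

    client.close()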
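
For the tf.data/TFRecord objective, the sketch below serializes a few records to a TFRecord file and reads them back through a parsed, shuffled, batched, and prefetched input pipeline; feature names and shapes are placeholders.

    # Write a tiny TFRecord file, then consume it with a tf.data pipeline.
    import tensorflow as tf

    # Writing: one tf.train.Example per record.
    with tf.io.TFRecordWriter("train.tfrecord") as writer:
        for label in [0, 1, 0]:
            image_bytes = tf.io.serialize_tensor(tf.zeros([8, 8, 3])).numpy()
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())

    # Reading: parse each record, then shuffle, batch, and prefetch.
    feature_spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse(record):
        parsed = tf.io.parse_single_example(record, feature_spec)
        image = tf.io.parse_tensor(parsed["image"], out_type=tf.float32)
        return image, parsed["label"]

    dataset = (tf.data.TFRecordDataset("train.tfrecord")
               .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
               .shuffle(buffer_size=100)
               .batch(2)
               .prefetch(tf.data.AUTOTUNE))

    for images, labels in dataset:
        print(images.shape, labels.numpy())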
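
For the experiment-tracking part of the training-workflow objective, here is a minimal Weights & Biases sketch (pip install wandb, then wandb login); the project name, config, and metrics are placeholders, and a multi-GPU or Vertex AI / Kubeflow run would log through the same calls.

    # Log one toy "training run" to Weights & Biases.
    import random
    import wandb

    run = wandb.init(
        project="milestone-2-demo",  # hypothetical W&B project name
        config={"lr": 1e-3, "batch_size": 32, "epochs": 3},
    )

    for epoch in range(run.config.epochs):
        # In a real workflow these values come from the actual training loop.
        train_loss = 1.0 / (epoch + 1) + random.random() * 0.05
        val_acc = 0.6 + 0.1 * epoch
        wandb.log({"epoch": epoch, "train_loss": train_loss, "val_acc": val_acc})

    run.finish()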

Additional deliverables:

  • Data Pipeline Implementation: A comprehensive data pipeline with extraction, transformation, and versioning capabilities, including examples of versioned datasets.

  • Distributed Computing and Storage Integration: Evidence of the implemented distributed computing system and cloud storage solutions, complete with documentation on their configuration and utilization.

  • Machine Learning Workflow Implementation: Detailed implementation of the advanced training workflows, with evidence of successful training runs, experiment tracking, and utilization of multi-GPU and serverless training.