Designed and implemented the core coordinator for a distributed MapReduce framework, with end-to-end functionality:
(1) Implemented worker node registration and heartbeat detection via the Register/Heartbeat RPCs, establishing a timeout-based failure detection mechanism;
(2) Implemented a task queue based on VecDeque and HashMap, supporting FIFO scheduling of tasks across jobs;
(3) Implemented the SubmitJob/PollJob RPC interfaces for job submission and status queries, validating each request and passing application parameters through as raw bytes;
(4) Developed GetTask dynamic task distribution and FinishTask status update systems to drive Map/Reduce phase transitions;
(5) Designed and implemented a three-tier fault tolerance strategy: redistributed Map tasks when a Worker fails (Reduce outputs are persisted and therefore not retried), enabled task-level automatic retry via the FailTask RPC (retry=true), and immediately marked the job as failed on I/O or function exceptions (failed=true).
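
As a rough illustration of how the heartbeat timeout, the FIFO queue, and task redistribution described above could fit together, here is a minimal in-memory sketch. It is not the actual coordinator: the `Coordinator` and `Task` types, the method names, and the `u64` identifiers are all hypothetical, the RPC layer is omitted, and time is passed in explicitly so the logic is easy to test. It only requeues tasks that were in flight on a timed-out worker and does not model re-running completed Map tasks or the Map/Reduce phase transition.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

// Hypothetical identifiers; the real framework may use different types.
type WorkerId = u64;
type TaskId = u64;

#[derive(Debug, Clone, PartialEq, Eq)]
struct Task {
    id: TaskId,
    job_id: u64,
}

struct Coordinator {
    pending: VecDeque<Task>,                    // FIFO queue of unassigned tasks
    running: HashMap<TaskId, (WorkerId, Task)>, // in-flight tasks and their owners
    last_seen: HashMap<WorkerId, Instant>,      // last heartbeat per worker
    timeout: Duration,                          // silence longer than this means "dead"
}

impl Coordinator {
    fn new(timeout: Duration) -> Self {
        Coordinator {
            pending: VecDeque::new(),
            running: HashMap::new(),
            last_seen: HashMap::new(),
            timeout,
        }
    }

    /// Stands in for both Register and Heartbeat: records when the worker was last seen.
    fn heartbeat(&mut self, worker: WorkerId, now: Instant) {
        self.last_seen.insert(worker, now);
    }

    /// Appends a task to the back of the queue.
    fn submit(&mut self, task: Task) {
        self.pending.push_back(task);
    }

    /// Stands in for GetTask: hands the oldest pending task to `worker`.
    fn get_task(&mut self, worker: WorkerId) -> Option<Task> {
        let task = self.pending.pop_front()?;
        self.running.insert(task.id, (worker, task.clone()));
        Some(task)
    }

    /// Stands in for FinishTask: returns false if the task was not in flight.
    fn finish_task(&mut self, task_id: TaskId) -> bool {
        self.running.remove(&task_id).is_some()
    }

    /// Removes workers whose last heartbeat is older than the timeout and
    /// requeues their in-flight tasks. Returns the ids of the removed workers.
    fn reap_dead_workers(&mut self, now: Instant) -> Vec<WorkerId> {
        let mut dead = Vec::new();
        for (&id, &seen) in &self.last_seen {
            if now.duration_since(seen) > self.timeout {
                dead.push(id);
            }
        }
        dead.sort_unstable();

        let mut lost = Vec::new();
        for (&task_id, (owner, _)) in &self.running {
            if dead.contains(owner) {
                lost.push(task_id);
            }
        }
        lost.sort_unstable(); // HashMap order is unspecified; sort for a stable requeue order

        for id in &dead {
            self.last_seen.remove(id);
        }
        for task_id in lost {
            if let Some((_, task)) = self.running.remove(&task_id) {
                self.pending.push_back(task);
            }
        }
        dead
    }
}
```

In this sketch a periodic timer would call `reap_dead_workers(Instant::now())`, and any task it requeues is handed out again by the next `get_task` call. Requeued tasks go to the back of the queue here; pushing them to the front instead would be an equally reasonable choice.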