I've seen this test fail with:
```
- Mock #1.
Expected range of matching incoming requests: == 2
Number of matched incoming requests: 1
```
This is because we pop the wrong task_complete events and then the test
exits. I think this is because the MCP events are now buffered after
https://github.com/openai/codex/pull/8874.
So:
1. clear the buffer before we do any user message sending
2. additionally listen for task start before task complete
3. use the ID from task start to find the correct task complete event.