Conversation
BlueCrescent
left a comment
Overall LGTM.
Should we also explicitly allow seeding for the "model_initialized" component?
It will probably inherit the random state from the model_raw component, but it seems a bit risky to me to assume that (also in the future) no other interaction with the random state happens between these two components (though, probably, only interactions that are asymmetrical between the ranks would be problematic). In particular, since we cannot guarantee the order in which the components are built, something like a dataloader component might even re-seed the random state.
le1nux
left a comment
I checked the seeding (not the test) and, from my understanding, the changes do not provide the expected results (which is also what @BlueCrescent was hinting at).
When we seed the raw model, the model weights are indeed deterministic at instantiation time. However, we also have the model weight initialization, which runs afterwards and would simply override the weights / seeding.
Additionally, passing device_mesh to the model couples two components that should normally not know anything about each other.
I think we have to integrate the seeding into the weight initializer component, and we can also pass in the device_mesh there.
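For illustration, a minimal sketch of what integrating the seeding into the initializer component could look like (the class and method names are made up for this example, not the actual modalities components):

    from typing import Optional

    import torch
    import torch.nn as nn
    from torch.distributed.device_mesh import DeviceMesh


    class SeededWeightInitializer:
        def __init__(self, std: float, seed: Optional[int] = None, device_mesh: Optional[DeviceMesh] = None):
            self.std = std
            self.seed = seed
            # the device_mesh could be used here, e.g., to derive a per-PP-stage seed offset
            self.device_mesh = device_mesh

        def initialize_in_place(self, model: nn.Module) -> None:
            if self.seed is not None:
                # seeding happens right before the weights are written, so no other
                # component can override it between model construction and init
                torch.manual_seed(self.seed)
            with torch.no_grad():
                for param in model.parameters():
                    param.normal_(mean=0.0, std=self.std)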
Yes, that makes sense. I moved the seeding to the model initialization component.
See #426 (comment)
le1nux
left a comment
Generally in a good state.
I left a couple of comments.
My main concern is the global setting of the seed. A generator object might be preferable.
| """NNModel class to define a base model.""" | ||
|
|
||
| def __init__(self, seed: int = None, weight_decay_groups: Optional[WeightDecayGroups] = None): | ||
| def __init__(self, seed: Optional[int] = None, weight_decay_groups: Optional[WeightDecayGroups] = None): |
Suggested change:
-   def __init__(self, seed: Optional[int] = None, weight_decay_groups: Optional[WeightDecayGroups] = None):
+   def __init__(self, seed: int | None = None, weight_decay_groups: Optional[WeightDecayGroups] = None):
Do we even want to allow setting the seed here?
Could torch.manual_seed below have side effects with the new weight init implementation?
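To make the side-effect concern concrete, a tiny self-contained demo (plain torch, nothing modalities-specific, seed values chosen arbitrarily): a manual_seed call inside one component resets the global RNG and thereby changes the random draws of every component built afterwards.

    import torch

    torch.manual_seed(999)            # seeding done by some unrelated component
    before = torch.rand(1).item()

    torch.manual_seed(999)
    torch.manual_seed(1234)           # side effect: e.g. a model __init__ re-seeding globally
    after = torch.rand(1).item()

    print(before == after)            # False -> the later draw changed although its own seed did not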
    std: Annotated[float, Field(strict=True, ge=0.0)] | str  # can be float or "auto"
    hidden_dim: Optional[Annotated[int, Field(strict=True, gt=0)]] = None
    num_layers: Optional[Annotated[int, Field(strict=True, gt=0)]] = None
    seed: Optional[int] = None
Suggested change:
-   seed: Optional[int] = None
+   seed: int | None = None
Also, the other instances and the docstring annotations should be adapted accordingly.
        return initialization


    @staticmethod
    def _set_seed(seed: Optional[int]):
This sets the seed globally. I think an even more robust way would be to use a local torch RNG object.
Could this be integrated?
Something like:

    g = torch.Generator()
    g.manual_seed(1234)

Otherwise, we might run into side effects later on.
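A hedged sketch of how such a generator could be threaded into the weight init (assuming the initializer writes the weights via in-place tensor ops; the function name is illustrative):

    import torch
    import torch.nn as nn


    def init_weights_with_generator(model: nn.Module, std: float, seed: int) -> None:
        # a local Generator keeps the seeding confined to this function instead of
        # mutating torch's global RNG state
        g = torch.Generator()
        g.manual_seed(seed)
        with torch.no_grad():
            for param in model.parameters():
                # Tensor.normal_ accepts a generator; for CUDA parameters a generator
                # created on the matching device would be needed
                param.normal_(mean=0.0, std=std, generator=g)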
    if seed is not None and has_parallelism_method(
        device_mesh=device_mesh, parallelism_method=ParallelismDegrees.PP
    ):
        seed += get_parallel_rank(device_mesh=device_mesh, parallelism_method=ParallelismDegrees.PP)
This also means that, depending on the parallelization method and the number of parallelism degrees, we get differently initialized layers even if the seed is the same.
Example: DP with seed = 1 will yield a differently initialized model than DP+PP with seed = 1.
One way to fix this is to always use the same seed, but have each PP stage skip the number of random values consumed by the previous stages.
However, I think this would be overkill, and I would just emit a warning when the weights are initialized and the parallelization methods are other than FSDP.
What do you think?
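A sketch of what such a warning could look like, reusing has_parallelism_method and ParallelismDegrees from the diff above (their import path is omitted here; the exact wording and placement are assumptions):

    import warnings

    def _warn_if_seeded_init_depends_on_parallelism(seed, device_mesh) -> None:
        # only PP is checked here, mirroring the seed-offset logic above; other
        # non-FSDP methods would need analogous checks
        if seed is not None and has_parallelism_method(
            device_mesh=device_mesh, parallelism_method=ParallelismDegrees.PP
        ):
            warnings.warn(
                "Seeded weight init with pipeline parallelism: each PP stage offsets the seed "
                "by its rank, so the resulting weights differ from a run with the same seed "
                "but a different parallelism configuration (e.g. pure DP/FSDP)."
            )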
We could also give the hint that, for full reproducibility, a Distributed Checkpoint should be written with FSDP directly after the weight init.
Maybe we could even have an entry point for that in main.
Something like:

    modalities create_init_cp model_config.yaml

For some unit tests, this functionality would be nice to have anyway, I think.
Any thoughts?
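Purely as an illustration of that idea, a hypothetical click-based sub-command (the command name follows the comment above; the config handling and the checkpointing steps are placeholders, not the actual modalities CLI):

    import click
    import yaml


    @click.command(name="create_init_cp")
    @click.argument("config_file_path", type=click.Path(exists=True))
    def entry_point_create_init_cp(config_file_path: str):
        """Create a checkpoint of the freshly initialized (seeded) model weights."""
        with open(config_file_path, "r", encoding="utf-8") as f:
            config = yaml.safe_load(f)
        # 1) build the raw model and the weight initializer from the config
        # 2) run the (seeded) weight initialization once
        # 3) save the weights, e.g. via torch.distributed.checkpoint, so unit tests
        #    and reproducibility checks can start from exactly this state
        click.echo(f"Would write an init checkpoint for: {config_file_path} ({len(config)} top-level keys)")


    if __name__ == "__main__":
        entry_point_create_init_cp()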
    std: float | str,
    parameter_name_regexes: list[str],
    hidden_dim: Optional[int] = None,
    seed: Optional[int] = None,
Suggested change:
-   seed: Optional[int] = None,
+   seed: int | None = None,
The same applies to various other cases below.
What does this PR do?
This PR gives a unique model seed to each PP rank, such that the parameters are initialized differently across ranks.
General Changes
Breaking Changes
Checklist before submitting final PR
- Tests pass (python tests/tests.py)
- CHANGELOG_DEV.md updated