Contributing to iMML¶
(adapted from Scikit-learn and mvlearn)
Submitting a bug report or a feature request¶
We track bugs and feature requests using GitHub issues. If you encounter a problem or have an idea for a new feature, feel free to open an issue.
If you run into any trouble while using this package, we encourage you to submit an issue through our Issue Tracker Bug Tracker. Suggestions for enhancements or pull requests are also welcome.
Before posting, ensure your submission aligns with these guidelines:
Check for duplicates: Look for similar issues or pull requests already submitted.
Bug reports: Follow our tips in How to make a good bug report to provide all necessary details.
Adhere to guidelines: Ensure your code complies with the general Guidelines and matches the API of iMML Objects.
How to make a good bug report¶
When reporting an issue on GitHub Github, please include the following details to help us assist you effectively:
Minimal code example: Provide a concise code snippet to replicate the issue (see this for more details). If it is longer than around 50 lines, use a gist gist or a public repository.
Key details: If providing code is not practical, specify the methods or modules involved and the data shapes you are working with.
Errors and tracebacks: Include the full error traceback if applicable.
Environment information: Include your operating system, Python version, and the version of this package. Run this snippet to collect the details:
import platform; print(platform.platform()); import sys; print("Python", sys.version); import imml; print("imml", imml.version)
Formatting: Use appropriate code blocks for examples and errors. Refer to the guide on Creating and highlighting code blocks.
Contributing code¶
We recommend the following workflow for contributing code:
Use the ‘Fork’ button in the GitHub interface to copy the project into your account. This creates a copy of the code under your GitHub user account. For more details on how to fork a repository see this guide.
Clone your fork locally:
git clone git@github.com:YourLogin/imml.git cd imml
Create a
featurebranch for your changes:git checkout -b my-feature
Avoid working on the
mainbranch directly.Develop the feature on your feature branch and commit your changes:
git add modified_files git commit
Then push the changes to your GitHub account with:
git push -u origin my-feature
Pull Request Checklist¶
Before submitting a pull request, ensure:
Follow the coding-guidelines.
Descriptive title: Use a meaningful title summarizing your contribution.
Documentation: Add informative docstrings, including examples when necessary.
At least one paragraph of narrative documentation with links to references in the literature and the example.
Tests: Provide unit tests to validate functionality and type correctness.
Local Testing: Ensure all tests pass locally using
pytest. Install dependencies:pip install imml[tests]
then run
pytest
or you can run pytest on a single test file by
pytest path/to/test.py
Guidelines¶
Coding Guidelines¶
Consistently formatted code improves readability and maintainability.
Docstring Guidelines¶
Properly formatted docstrings are essential for documentation generation. Follow the conventions outlined in numpydoc. Refer to the example.py provided by numpydoc.
API of iMML Objects¶
Estimators¶
The core components of iMML are the estimators, designed to train on datasets. These objects follow the conventions
established by Scikit-learn and mvlearn. Estimators inherit from sklearn.base.BaseEstimator and adhere to its
established guidelines.
To ensure compatibility, developers should align with Scikit-learn's standards whenever possible, including using
validation checks like check_Xs in imml.utils, to confirm the suitability of the input data.
Instantiation¶
An estimator's __init__ method defines its configuration by accepting constants that influence the behavior
of its methods. These constants should not include actual data or any values derived from it, as data handling is
left to the fit method. Key points for implementing the __init__ method:
All parameters must be keyword arguments with default values.
Each parameter should be assigned as an instance attribute.
Input validation should not occur during initialization.
Randomness control. For stochastic estimators, include a random_state parameter to ensure reproducibility. The same seed (random_state) should always yield identical outputs for the same data.The random_state parameter can accept: - An int, to produce consistent results across runs. - None, for non-deterministic results.
A correct implementation of __init__ looks like:
def __init__(self, param1=1, param2=2, random_state=None):
self.param1 = param1
self.param2 = param2
self.random_state = random_state
Fitting¶
Estimators must provide a fit(Xs, y=None) method to process data. This method is invoked as:
estimator.fit(Xs, y)
or
estimator.fit(Xs)
Parameters:
- Xs: A list of (pd.DataFrame or np.ndarray) data matrices, with each matrix representing a different modality.
Xs length: n_mods
Xs[i] shape: (n_samples, n_features_i)
y: Array of labels, shape (n_samples,).
kwargs: Optional parameters.
The samples across modalities in Xs and y are matched. Note that data matrices in Xs must have the same number of samples (rows) but the number of features (columns) may differ. If a value (feature or modality) is missing, it should be represented as np.nan.
The fit method should return the instance itself (self) to support chaining operations.
Transformers¶
A transformer modify data using the transform method. An estimator may also be a transformer that learns the
transformation parameters. The transformer object implements the transform method, i.e.
Xs_transformed = transformer.transform(Xs)
This is typically called after fitting the transformer. Alternatively, the transform method combines both steps:
Xs_transformed = transformer.fit_transform(Xs, y)
Transformers in iMML should be designed to work seamlessly with lists of both pandas.DataFrame and
numpy.ndarray. The input type should dictate the output type. For instance:
If the input is a list of pandas.DataFrame, the transformer should return a list of
pandas.DataFrameor a singlepandas.DataFrame.If the input is a list of numpy.ndarray, the transformer should return a list of
numpy.ndarrayor a singlenumpy.ndarray.
Predictors¶
A predictor generate predictions from the input data via the predict method:
y_predicted = predictor.predict(Xs)
Like transformers, predictors can combine fitting and prediction using the fit_predict method:
y_predicted = predictor.fit_predict(Xs, y)
Deep Learning¶
Currently, repositories offering deep learning methods lack a unified convention. To address this, iMML adopts the Lightning AI library, which provides a structured and flexible framework for implementing deep learning models. By standardizing deep learning methods in iMML using the Lightning library, we ensure that all implementations are robust, reproducible, and easy to extend.
Dataset defition¶
In this framework, datasets are defined by creating a class that inherits from torch.utils.data.Dataset . This class should accept input parameters such as Xs (multi-modal dataset) and transform (optional data preprocessing transformations). Below is a basic example:
import torch
class CustomDataset(torch.utils.data.Dataset):
def __init__(self, X, transform=None):
self.X = X
self.transform = transform
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
sample = self.X[idx]
if self.transform:
sample = self.transform(sample)
return sample
Estimator defition¶
Estimators are defined by creating a class that inherits from lightning.LightningModule. In this class, you must override specific methods to customize training, validation, and testing logic. For detailed guidance, refer to the official documentation Lightning Module. Here is an example of an estimator class:
import torch
class CustomEstimator(LightningModule):
def __init__(self, model, optimizer, loss_fn):
super().__init__()
self.model = model
self.optimizer = optimizer
self.loss_fn = loss_fn
def forward(self, X):
return self.model(X)
def training_step(self, batch, batch_idx):
X, y = batch
y_pred = self(X)
loss = self.loss_fn(y_pred, y)
self.log("train_loss", loss)
return loss
def configure_optimizers(self):
return self.optimizer
Training¶
The defined LightningModule is used as an argument to the Trainer class provided by the Lightning AI library. The Trainer handles the training process, including logging, checkpointing, and scaling across devices. Refer to the Trainer documentation for further details. Here is an example of how to train the model:
from lightning import Trainer
# Instantiate dataset, dataloaders, model, optimizer, and loss function
dataset = CustomDataset(X, transform=your_transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
model = CustomModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = torch.nn.CrossEntropyLoss()
# Define the estimator
estimator = CustomEstimator(model, optimizer, loss_fn)
# Train the estimator using the Trainer
trainer = Trainer(max_epochs=10)
trainer.fit(estimator, dataloader)