@@ -391,8 +403,9 @@ The following workflows have common variables with the above illustrated workflo
The following workflows define a search space containing a set of possible neural network architectures that `Distributed_Auto_ML` can use to automatically find the best combinations of neural architectures within the search space.
*Single_Handwritten_Digit_Classification:* trains a simple deep CNN on the MNIST dataset using the PyTorch library. This example allows searching for two types of neural architectures defined in the Handwritten_Digit_Classification_Search_Space.json file.
*Multiple_Objective_Handwritten_Digit_Classification:* trains a simple deep CNN on the MNIST dataset using the PyTorch library. This example allows optimizing multiple objectives, such as accuracy, number of parameters, and memory access cost (MAC).
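To give an intuition of what such a search space expresses, the sketch below shows how candidate architectures and hyperparameter ranges could be declared and sampled in Python. The dictionary layout, field names, and value ranges are assumptions for illustration only and do not reflect the actual schema of the Handwritten_Digit_Classification_Search_Space.json file.

[source,python]
----
# Hypothetical illustration of a neural-architecture search space and of how an
# AutoML driver could sample one candidate from it (schema is assumed, not the
# real Handwritten_Digit_Classification_Search_Space.json).
import random
import torch.nn as nn

SEARCH_SPACE = {
    "architecture": ["simple_cnn", "deeper_cnn"],   # two candidate architecture types
    "conv_channels": [16, 32, 64],                  # width of the first conv layer
    "dropout": (0.1, 0.5),                          # continuous range
    "learning_rate": (1e-4, 1e-2),                  # continuous range
}

def sample_configuration(space):
    """Draw one candidate configuration from the search space."""
    return {
        "architecture": random.choice(space["architecture"]),
        "conv_channels": random.choice(space["conv_channels"]),
        "dropout": random.uniform(*space["dropout"]),
        "learning_rate": random.uniform(*space["learning_rate"]),
    }

def build_model(cfg):
    """Build a small MNIST CNN from a sampled configuration."""
    c = cfg["conv_channels"]
    layers = [nn.Conv2d(1, c, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
    if cfg["architecture"] == "deeper_cnn":
        layers += [nn.Conv2d(c, c * 2, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        flat = c * 2 * 7 * 7
    else:
        flat = c * 14 * 14
    layers += [nn.Flatten(), nn.Dropout(cfg["dropout"]), nn.Linear(flat, 10)]
    return nn.Sequential(*layers)

model = build_model(sample_configuration(SEARCH_SPACE))
----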
=== Distributed Training
...
...
@@ -1559,37 +1572,47 @@ AutoFeat currently supports the following encoding methods:
- Label: converts each value in a categorical feature into an integer value between 0 and n-1, where n is the number of distinct categories of the variable.
- Binary: stores categories as binary bitstrings.
- OneHot: creates a new feature for each category in the categorical variable and replaces it with either 1 (presence of the feature) or 0 (absence of the feature). The number of new features depends on the number of categories in the categorical variable.
- Dummy: transforms the categorical variable into a set of binary variables (also known as dummy variables). Dummy encoding is a small improvement over one-hot encoding, since it uses n-1 features to represent n categories.
- BaseN: encodes the categories into arrays of their base-n representation. A base of 1 is equivalent to one-hot encoding and a base of 2 is equivalent to binary encoding.
- Target: replaces a categorical value with the mean of the target variable.
- Hash: maps each category to an integer within a pre-determined range n_components. n_components is the number of dimensions, in other words, the number of bits used to represent the feature. We use 8 bits by default.
NOTE: Most of these methods are implemented using the Python link:https://contrib.scikit-learn.org/category_encoders/[Category Encoders] library. Examples can be found in the https://www.kaggle.com/code/discdiver/category-encoders-examples/notebook[Category Encoders Examples] notebook.
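As a rough illustration of how the methods listed above can be applied directly with the Category Encoders library (outside of AutoFeat), the sketch below encodes a toy dataframe; the column names and values are invented for the example.

[source,python]
----
# Minimal sketch of a few of the encoding methods above, applied with the
# Category Encoders library on made-up data.
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red"],
    "price": [10.0, 12.5, 9.0, 14.0, 11.0],
})
target = pd.Series([0, 1, 0, 1, 1])  # binary label used by the Target encoder

# OneHot: one new 0/1 feature per category of "color".
onehot = ce.OneHotEncoder(cols=["color"]).fit_transform(df)

# BaseN: base=2 is equivalent to binary encoding, base=1 to one-hot encoding.
base2 = ce.BaseNEncoder(cols=["color"], base=2).fit_transform(df)

# Hash: maps each category into n_components bits (8 by default).
hashed = ce.HashingEncoder(cols=["color"], n_components=8).fit_transform(df)

# Target: replaces each category by the mean of the target for that category.
target_enc = ce.TargetEncoder(cols=["color"]).fit_transform(df, target)
----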
As we already mentioned, the performance of ML algorithms depends on how categorical variables are encoded. The results produced by the model vary depending on the encoding technique used. Thus, the hardest part of categorical encoding can sometimes be finding the right encoding method.
There are numerous research papers and studies dedicated to the analysis of the performance of categorical encoding approaches applied to different datasets. Based on the common factors shared by the datasets using the same encoding method, we have implemented an algorithm for finding the best suited method for your data.
To access the AutoFeat page, please follow the steps below:
. Open the link:https://try.activeeon.com/studio[Studio Portal].
. Create a new workflow.
. Drag and drop the `Import_Data_And_Automate_Feature_Engineering` task from the *machine-learning* bucket in ProActive Machine Learning.
It is possible to change a column's information. These changes can include:
...
...
@@ -1625,12 +1648,12 @@ It is possible to change a column information. These changes can include:
- _Category Type_: Categorical variables can be divided into two categories: *Ordinal*, where the categories have an inherent order, and *Nominal*, where the categories do not have any inherent order.
- _Label_: Check this checkbox to select the label column.
- _Label Column_: Only one column can be selected as the label column.
- _Coding Method_: The encoding method used for converting the categorical data values into numerical values. The value is set to *Auto* by default, in which case the best-suited method for encoding the categorical feature is automatically identified. The data scientist still has the ability to override every decision and select another encoding method from the drop-down menu. Different methods are supported by AutoFeat, such as *Label*, *OneHot*, *Dummy*, *Binary*, *Base N*, *Hash* and *Target*. Some of these methods require specifying additional encoding parameters, which vary depending on the selected method (e.g., the base and the number of components for BaseN and Hash, respectively, and the target column for the Target encoding method). Default values are used if none are specified by the user.
[[_Edit_column_names_and_types]]
image::AutoFeat_edit_column_names_and_types_encoding_parameters.png["Edit column names and types",align=center]
It is also possible to perform the following actions on the dataset:
...
...
@@ -1638,7 +1661,7 @@ It is also possible to perform the following actions on the dataset:
- *Restore*, to restore the original version of the dataset loaded from the external source.
- *Delete Column*, to delete a column from the dataset.
- *Preview Encoded Data*, to display the encoding results in a new tab.
- *Cancel and Quit*, to discard any changes the user may have made and finish the workflow execution.
Once the encoding parameters are set, the user can proceed to display the encoded dataset by clicking on *Preview Encoded Data*. They can also check and compare different encoding methods and/or parameters based on the obtained results.
...
...
@@ -1651,6 +1674,15 @@ The user can also download the results as a csv file by clicking on the *Downloa
[[_Encoded_data]]
image::AutoFeat_encoded_data.png[align=center]
=== ML Pipeline Example
You can connect different tasks in a single workflow to get the full pipeline from data preprocessing to model training and deployment. Each task will propagate the acquired variables to its child tasks.
The following workflow example `Vehicle_Type_Using_Model_Explainability` uses the `Import_Data_And_Automate_Feature_Engineering` task to prepare the data. It is available on the `machine_learning_workflows` bucket.
This workflow predicts the vehicle type based on silhouette measurements, and applies ELI5 and Kernel Explainer to understand the model's global behavior or specific predictions.
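As a rough, hedged illustration of this explainability step (not the workflow's actual code), the sketch below shows how ELI5 and SHAP's KernelExplainer are typically combined with a scikit-learn classifier; the dataset and model are stand-ins.

[source,python]
----
# Illustrative sketch: global explanation with ELI5, local explanation with
# SHAP's KernelExplainer, on a stand-in model and dataset.
import eli5
import shap
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Global behavior: ELI5 reports the overall importance of each feature.
print(eli5.format_as_text(eli5.explain_weights(model)))

# Specific predictions: KernelExplainer attributes one prediction to its features.
explainer = shap.KernelExplainer(model.predict_proba, X[:50])  # background sample
shap_values = explainer.shap_values(X[0:1])                    # explain the first row
----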
== ProActive Analytics
*ProActive Analytics* is a dashboard that provides an overview of executed workflows
...
...
@@ -1683,7 +1715,7 @@ More advanced search options (_highlighted in advanced search hints_) could be u
@@ -4498,7 +4534,7 @@ NOTE: PyTorch is used to build the model architecture based on https://github.co
| Boolean (default=True)
|===
NOTE: The default parameters of the YOLO network were set for the COCO dataset (https://cocodataset.org/#home). If you'd like to use another dataset, you probably need to change the default parameters.
*ProActive Service Automation (PSA)* allows automating the deployment of services, together with their life-cycle management. Services are instantiated by workflows (executed as a Job by the Scheduler), and related workflows allow moving instances from one state to another.
At any point in time, each Service Instance has a specific State (RUNNING, ERROR, FINISHED, etc.).
Attached to each Service Instance, the PSA service stores information such as:
Service Instance Id, Service Id, Service Instance State, the ordered list of Jobs executed for the Service, a set of variables with their values (a map that includes for instance the service endpoint), etc.
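For illustration only, the sketch below models the kind of record PSA keeps per Service Instance as a plain Python data structure; the field names and example values are hypothetical and do not correspond to the PSA API.

[source,python]
----
# Hypothetical sketch of a per-Service-Instance record; field names and values
# are illustrative only, not the PSA data model or API.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ServiceInstance:
    instance_id: int
    service_id: str
    state: str                                                # e.g. RUNNING, ERROR, FINISHED
    job_ids: List[str] = field(default_factory=list)          # ordered Jobs executed for the Service
    variables: Dict[str, str] = field(default_factory=dict)   # e.g. includes the service endpoint

instance = ServiceInstance(
    instance_id=42,
    service_id="My_Service",
    state="RUNNING",
    job_ids=["job-1001"],
    variables={"endpoint": "http://host:8080"},
)
----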
The link:https://try.activeeon.com/tutorials/basic_service_creation/basic_service_creation.html[basic service creation tutorial, window="_blank"] and link:https://try.activeeon.com/tutorials/advanced_service_creation/advanced_service_creation.html[advanced service creation tutorial, window="_blank"] on link:https://try.activeeon.com[try.activeeon.com, window="_blank"]
The link:https://try.activeeon.com/tutorials/clearwater/clearwater.html[Create your own service tutorial, window="_blank"] on link:https://try.activeeon.com[try.activeeon.com, window="_blank"]
@@ -492,7 +492,7 @@ The service requires the following variables as input:
=== Storm
This service allows deploying, through the ProActive Service Automation (PSA) Portal, a cluster of the Apache Storm stream processing system (https://storm.apache.org).
The service is started using the following variables.
@@ -318,6 +362,8 @@ However, it implements a different instance management strategy that reduces the
2. The nodes share the same networking infrastructure through a common Virtual Private Cloud (VPC).
The infrastructure supports networking autoconfiguration if no parameter is supplied.
WARNING: A node source using the empty policy will not benefit from this management strategy. Deployments with the empty policy do not use the shared instance template and networking configuration.
===== Pre-Requisites
The configuration of the AWS Autoscaling infrastructure is subject to several requirements.
...
...
@@ -372,7 +418,7 @@ The configuration form exposes the following fields:
- *defaultVpcId:* This parameter can be filled with the ID of the VPC used to run the instances hosting nodes.
If specified, this parameter has to refer to an existing VPC in the region and comply with the VPC ID format.
If left blank, the connector will first try to get the default VPC ID in the specified region if one is set; otherwise, it will trigger networking autoconfiguration.
- *defaultSubNetId:* The administrator can define which subnet has to be attached to the instances supporting nodes.
If specified, this parameter has to refer to an existing subnet in the region, attached to the specified VPC, and has to comply with the subnet ID format.
...
...
@@ -384,7 +430,7 @@ WARNING: Please do not trigger networking autoconfiguration if you operate ProAc
Otherwise, a new and distinct VPC will be used to operate the nodes created by the NodeSource, preventing their communication with the Resource Manager.
- *defaultSecurityGroup:* This parameter receives the ID of the security group to spawn instances into.
If this parameter does not meet the requirements regarding the provided VPC and subnet, a new security group will be generated by default and will be re-used if the same deployment scenario is repeated.
This parameter is mandatory and has to comply with the AWS security group ID format.
- *region:* The administrator specifies here the AWS region to allocate the cluster into.
...
...
@@ -813,15 +859,23 @@ You can opt to place this file in `$PROACTIVE_HOME/config/authentication/azure.c
|Linux https://github.com/Azure/azure-libraries-for-java/blob/master/azure-mgmt-compute/src/main/java/com/microsoft/azure/management/compute/KnownLinuxVirtualMachineImage.java[++[++link to source++]++] |Windows https://github.com/Azure/azure-libraries-for-java/blob/master/azure-mgmt-compute/src/main/java/com/microsoft/azure/management/compute/KnownWindowsVirtualMachineImage.java[++[++link to source++]++]
|UBUNTU_SERVER_16_04_LTS +
UBUNTU_SERVER_18_04_LTS +
DEBIAN_9 +*+ _default value_ +*+ +
DEBIAN_10 +
CENTOS_8_1 +
OPENSUSE_LEAP_15_1 +
SLES_15_SP1 +
REDHAT_RHEL_8_2 +
ORACLE_LINUX_8_1
|WINDOWS_DESKTOP_10_20H1_PRO +
WINDOWS_SERVER_2019_DATACENTER +
WINDOWS_SERVER_2019_DATACENTER_WITH_CONTAINERS +
WINDOWS_SERVER_2016_DATACENTER +
WINDOWS_SERVER_2012_R2_DATACENTER
|===
...
...
@@ -890,13 +944,14 @@ Note that this user custom script will be run as root/admin user.
The following fields are the optional parameters of the Azure Billing Configuration section. The aim of this section is to configure the automatic cloud cost estimator, which considers all the Azure resources related to your reservation (virtual machines, disks, etc.). This mechanism relies on the Azure Resource Usage and RateCard APIs (https://docs.microsoft.com/en-us/azure/cost-management-billing/manage/usage-rate-card-overview).
- *enableBilling:* Enable billing information (_true_/_false_). If _true_, the following parameters will be considered. The default value is _false_.
- *resourceUsageRefreshFreqInMin:* Period, in minutes, at which resource usage information is retrieved. The default value is _30_.
- *rateCardRefreshFreqInMin:* Period, in minutes, at which the rate card is retrieved. The default value is _30_.
- *offerId:* The Offer ID parameter consists of the "MS-AZR-" prefix, plus the Offer ID number. The default value is _MS-AZR-0003p_ (Pay-As-You-Go offer).
- *currency:* The currency in which the resource rates need to be provided. The default value is _USD_.
- *locale:* The culture in which the resource metadata needs to be localized. The default value is _en-US_.
- *regionInfo:* The 2-letter ISO code of the country where the offer was purchased. The default value is _US_.
- *maxBudget:* Your maximum budget for the Azure resources related to the node source. Also used to compute your global cost as a percentage of your budget. The default value is _50_.
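The budget arithmetic implied by *maxBudget* is straightforward; the hedged sketch below only illustrates expressing an accumulated cost as a percentage of the configured budget, and the function name and values are invented for the example.

[source,python]
----
# Rough sketch of the budget arithmetic implied by maxBudget: accumulated Azure
# cost for the node source expressed as a percentage of the configured budget.
def budget_usage_percent(accumulated_cost: float, max_budget: float = 50.0) -> float:
    """Return the global cost as a percentage of the max budget."""
    return 100.0 * accumulated_cost / max_budget

print(budget_usage_percent(12.5))  # 25.0 -> a quarter of the default budget of 50
----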
As you can see, these parameters provide a lot of flexibility to configure your infrastructure. When creating your Azure Scale Set node source, the infrastructure should be coupled with a Dynamic Policy. This Policy will additionally define scalability parameters such as limits on the number of deployed nodes or the minimum idle time before a node can be deleted (to optimize node utilization).