- Overview of SoftNeuro
- What is tuning
- Routine implementation example
- SIMD implementation
- Use of various processors
Overview of SoftNeuro
Details will be explained later, but first, please watch the introductory video that summarizes the features of SoftNeuro.
As the usage scenes in the introductory video suggest, SoftNeuro is software that performs neural network inference.
When using deep learning to make a computer perform some task, roughly two steps are required: “training” and “inference”.
“Training” is the process of creating a neural network to perform the desired inference.
By selecting a network model whose structure suits the target task and providing a large amount of training data, a neural network capable of making appropriate inferences is created.
“Inference” is the process of actually using the neural network created in the “training” step.
The trained network and an inference program are deployed to the target environment, where they determine what unknown input data represents. SoftNeuro is software for performing this inference processing at high speed.
SoftNeuro’s defining feature is that it runs “fast, everywhere”.
SoftNeuro uses a proprietary file format for storing network models, and converters are provided for importing models from various training frameworks.
Once a model is converted to this format, inference with the trained network can be executed immediately in any environment where SoftNeuro runs.
Furthermore, SoftNeuro achieves high-speed inference in any environment through a proprietary speed-up feature called tuning.
In the next section, we will introduce the tuning features of SoftNeuro.
What is tuning
SoftNeuro achieves high-speed inference across a wide variety of environments through a feature called tuning, for which Morpho has obtained a patent1.
1: Japanese Patent No. 6574004; related rights pending in various countries.
Tuning is the function that selects the implementation best suited to the execution environment when running inference on a model.
A neural network is essentially composed of many processing layers. As in common training and inference frameworks, SoftNeuro also implements processing layer by layer.
For each layer, however, there is not just one implementation but several. Morpho calls each of these implementations a routine. (Specific types of routines are described later.)
Multiple routines are provided so that processing can run at high speed under the given inference conditions, such as the available hardware resources and frameworks.
By selecting the fastest routine under the current execution conditions and setting it for each layer, high-speed inference becomes possible in any environment.
For example, let’s say you have a model consisting of layers 1 and 2, each implementing three types of routines, as shown in the table below.
SoftNeuro measures the speed of each routine in the execution environment and determines which routine each layer should adopt for the fastest processing.
Once appropriate routines are set for each layer, all subsequent inference runs use this fastest configuration.
Given speed measurements like those in this example, environment A adopts the AVX and OpenCL implementations for each layer, while environment B adopts the CUDA implementations. (More precisely, SoftNeuro searches for the combination of settings that maximizes the speed of inference for the entire model, taking into account the type conversions that are automatically inserted.)
Routine implementation example
By preparing various routines for each layer, the tuning function can choose a fast implementation for the execution environment from a wider range of options.
From here, we introduce how these high-speed routines are actually implemented.
SIMD implementation
SIMD (Single Instruction, Multiple Data) refers to a mechanism that executes one instruction on multiple data elements in parallel.
Each architecture has its own SIMD instruction set, such as AVX on x64 and NEON on ARM, and SoftNeuro includes routines that use these instructions.
Specifically, there are routine implementations using architecture-specific instructions such as AVX2, AVX-512, and NEON. These routines can only be used on the corresponding architecture, but they process data faster than equivalent scalar instructions.
SIMD instructions are effective because neural network inference involves a large number of floating-point operations that can be performed in parallel.
For example, the Add (element-wise addition) layer executes addition instructions on multiple values in parallel, as shown in the figure above.
Since such arithmetic processing is frequently performed in neural network inference, using SIMD instructions can be expected to speed up inference.
Use of various processors
Inference environments often contain processors other than the CPU, such as GPUs and AI accelerators built into the chipset.
In order to effectively use these hardware resources, SoftNeuro also provides routine implementations using CUDA, OpenCL, HNN, etc.
As with the SIMD implementations above, these routines can only be used where the corresponding hardware is available, but they are often fast.
When an AI accelerator is used in particular, processing can be far faster than CPU processing in the same environment; in some cases inference has been sped up more than tenfold.
A neural network’s weight values are set by training. During inference, various operations are performed on the input data using these weights; with an appropriate combination of weight values, the network can make more accurate inferences.
Weights are usually stored in float32 format, and inference is performed in that format as well.
Quantization is a technique that converts the weights and input data to types with smaller data sizes (float16 or qint8) before executing inference.
Quantized inference reduces the amount of data handled in each operation, making computation lighter, so the entire inference process can be expected to speed up.
However, since the amount of information is reduced compared to the original inference processing, the accuracy of the inference results may deteriorate.
SoftNeuro lets you apply quantization selectively according to usage conditions, making it possible to balance accuracy and speed.
In addition to these, there are also routines that “produce the same result with a different algorithm”.
A specific example is the winograd routine in the conv2 layer. The “cache-efficient version” implementations developed by Morpho also fall into this category.
These algorithms are faster than the standard implementations, but they can only be executed under certain layer settings, such as specific filter sizes.
Tuning also selects, from among these multiple algorithms, one that is both executable and fast in the given configuration.
In this article, we introduced the speed-up techniques for inference processing used in our product SoftNeuro, the world’s fastest inference engine.
Beyond these, SoftNeuro includes various other speed-up mechanisms, such as model optimization through Layer Fusion.