Research team supervised by NAAI member, Prof. Pingyi Fan from Tsinghua University, releases the first foundation model for multi-modal industrial signals




arXiv: https://arxiv.org/abs/2507.16696

GitHub: https://github.com/jianganbai/FISHER

With the rapid advancement of the Internet of Things (IoT) and artificial intelligence (AI) technologies, AI-based supervisory control and data acquisition (SCADA) systems have been widely deployed in modern manufacturing. However, efficiently analyzing industrial signals remains an urgent challenge, due to the unique complexity of signal mechanisms and health management tasks. In response, a research team supervised by NAAI member Prof. Pingyi Fan from Tsinghua University has released FISHER, the first foundation model for multi-modal industrial signal comprehensive representation. The model weights and inference code are now open-source for community use.

Research Background

Nowadays, an increasing number of sensors are being deployed to monitor the working conditions of industrial machines. However, efficiently analyzing the industrial signals collected from these sensors remains challenging, due to the heterogeneity of the signals. The research team summarizes this challenge as the M5 problem: Multi-modal, Multi-sampling-rate, Multi-scale, Multitask and Minimal fault.

As a result, existing approaches focus only on small sub-problems, with models trained specifically for each one. While these specialized models achieve superior performance within their respective domains, they fail to exploit the potential synergies among different modalities and the potential gains from scaling up. Moreover, they add extra burden during model development and deployment, since each sub-problem must be handled by a dedicated model.

Research Motivations

Although these signals are heterogeneous in appearance, their internal patterns imply exploitable similarities:

 Identical semantic information: Different industrial signals are perceptions of the same mechanical event.

Similar generation principles: Sound and vibration, the two most common modalities, are essentially different observational forms of the same mechanical oscillation.

Similar analysis methods: Spectral analysis is widely employed for analyzing various industrial signals, indicating that these signals can be modeled in a unified manner.

Similar malfunction patterns: Since machines are assembled from components, their failure patterns are often comparable.

Shared features for multi-tasking: A dense representation extracted by a powerful foundation model is sufficient to handle multi-tasking.

 Based on these similarities and motivated by the scaling law, the research team boldly scales up a single model to uniformly characterize heterogeneous signals, consequently proposing FISHER, short for Foundation model for multi-modal Industrial Signal compreHEnsive Representation.

 

Overview of FISHER


[Figure: Overview of FISHER]

FISHER is the first foundation model for multi-modal industrial signals. FISHER emphasizes the importance of sub-band information and treats sub-bands as the building blocks of the overall information, which enables it to handle spectrograms of variable shape and bandwidth. The detailed introduction is as follows:

Sub-band Modeling

In FISHER, the input signal is represented by a Short-Time Fourier Transform (STFT) with a fixed-duration window and hop size, regardless of the modality. In contrast to common sound pre-trained models that adopt the log-mel spectrogram as the signal representation, FISHER reverts to the STFT because:

 Malfunctions often appear in high frequencies, which would be diluted in mel scale.

The harmonic relationships of characteristic frequencies are essential, which would be smoothed in mel scale.

On the one hand, the information gain of a higher sampling rate lies in the additional sub-bands, as depicted in the following figure. Since all sensors employ anti-aliasing filtering to prevent aliasing, the spectrogram contains no information about frequencies above half the sampling rate (the Nyquist frequency). On the other hand, the sampling rates of common large-scale datasets, e.g. 16 kHz, 32 kHz, 44.1 kHz and 48 kHz, are integer multiples of a common base frequency, making sub-bands a natural modeling unit for data with multiple sampling rates.

 

[Figure: The information gain of a higher sampling rate lies in the additional sub-bands]

Therefore, the research team takes the sub-band as the unit of modeling and builds up the information of the whole spectrogram by concatenating sub-band information, just like stacking building blocks. The higher the sampling rate, the more informative the signal representation.
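The fixed-duration window and sub-band split described above can be sketched as follows. Note that the window duration, hop duration, and sub-band width used here are illustrative assumptions, not FISHER's published hyper-parameters; the point is only that a fixed-duration window keeps the frequency resolution in Hz constant across sampling rates, so a higher sampling rate simply contributes more sub-bands.

```python
import numpy as np

def subband_spectrogram(signal, sr, win_s=0.025, hop_s=0.010, band_hz=1000):
    """STFT magnitude with a fixed-DURATION window (so the frequency
    resolution in Hz is identical at every sampling rate), split into
    equal-width sub-bands.  win_s / hop_s / band_hz are illustrative."""
    nperseg = int(sr * win_s)                        # window length in samples
    hop = int(sr * hop_s)
    window = np.hanning(nperseg)
    frames = np.array([signal[i:i + nperseg] * window
                       for i in range(0, len(signal) - nperseg + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T     # (freq_bins, n_frames)
    bins_per_band = int(round(band_hz * win_s))      # bins spanning band_hz
    n_bands = spec.shape[0] // bins_per_band
    # A higher sampling rate contributes more sub-bands ("building blocks").
    return [spec[b * bins_per_band:(b + 1) * bins_per_band]
            for b in range(n_bands)]
```

With these settings, doubling the sampling rate doubles the number of sub-bands while each sub-band keeps the same shape, which is exactly what makes the sub-band a natural unit for multi-sampling-rate data.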

Model Architecture

FISHER comprises a ViT encoder and a CNN decoder, and is trained using a teacher-student self-distillation framework, where the student is guided by the representations of the teacher, and the teacher is an exponential moving average (EMA) version of the student. 

For the student branch, the patch sequence is masked by inverse block masking with a large mask ratio of 80%, and the masked patches are discarded. After encoding by the student encoder, the outputs of the unmasked patches are merged with the masked parts at their original spatial locations. The student decoder takes in the merged sequence and outputs the student patch representations and the [CLS] token. For the teacher branch, the teacher encoder processes the unmasked patch sequence, and the embeddings of all its layers are averaged to derive the distillation target.

The self-distillation process is supervised at both the [CLS] and patch levels. During inference, only the student encoder is employed, and its [CLS] representations are concatenated to form the overall representation of the signal.
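A minimal sketch of two ingredients of this training scheme, inverse block masking and the EMA teacher update, is given below. The grid size, momentum value, and the exact block-sampling scheme are illustrative assumptions; the actual FISHER implementation may differ in detail.

```python
import numpy as np

def inverse_block_mask(n_time, n_freq, mask_ratio=0.8, rng=None):
    """Inverse block masking: keep one contiguous rectangle of patches
    visible and mask everything else, so roughly `mask_ratio` of all
    patches are masked (simplified sketch)."""
    rng = np.random.default_rng(rng)
    keep = 1.0 - mask_ratio
    h = max(1, int(round(np.sqrt(keep) * n_time)))
    w = min(n_freq, max(1, int(round(keep * n_time * n_freq / h))))
    top = rng.integers(0, n_time - h + 1)
    left = rng.integers(0, n_freq - w + 1)
    mask = np.ones((n_time, n_freq), dtype=bool)   # True = masked patch
    mask[top:top + h, left:left + w] = False       # visible block
    return mask

def ema_update(teacher, student, momentum=0.999):
    """Teacher parameters follow an exponential moving average of the
    student's.  Parameters are modeled here as a dict of numpy arrays;
    the momentum value is a typical choice, not FISHER's exact setting."""
    for name, s_w in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_w
    return teacher
```

Because the teacher is a slowly moving average of the student, it provides stable distillation targets without requiring its own gradient updates.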

The research team has currently released three versions of FISHER, namely FISHER-tiny (5.5M), FISHER-mini (10M) and FISHER-small (22M). All models are pre-trained on a combined dataset totaling 17k hours.

 

RMIS Benchmark

 

 

[Figure: Overview of the RMIS benchmark]

To evaluate the comprehensive representation capability for industrial signals, the research team developed the RMIS benchmark, which comprises 5 anomaly detection datasets and 13 fault diagnosis datasets covering 4 modalities. The RMIS benchmark currently supports two typical health management tasks, i.e. anomaly detection and fault diagnosis. To evaluate inherent and generalization capabilities, the model is not fine-tuned on any downstream dataset; instead, k-nearest neighbor (KNN) inference is performed directly on the frozen representations.
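This evaluation protocol can be sketched as a simple KNN vote over embeddings from the frozen encoder. The cosine metric and k=5 below are illustrative choices, not necessarily those used in the RMIS benchmark.

```python
import numpy as np

def knn_predict(train_emb, train_labels, test_emb, k=5):
    """Classify test embeddings by a majority vote among the k nearest
    training embeddings (cosine similarity).  No fine-tuning is involved:
    the encoder producing the embeddings stays frozen."""
    # L2-normalize so the dot product equals cosine similarity
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    b = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sim = b @ a.T                             # (n_test, n_train)
    nn = np.argsort(-sim, axis=1)[:, :k]      # indices of k nearest neighbors
    votes = train_labels[nn]                  # (n_test, k) neighbor labels
    return np.array([np.bincount(v).argmax() for v in votes])
```

Because KNN has no trainable parameters, the resulting score directly reflects the quality of the representation itself rather than that of a downstream classifier.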

Experiments

The research team first compares common SSL models on the RMIS benchmark, where speech models generally perform worse than audio models due to the domain mismatch. The research team therefore employs 5 top audio pre-trained models as baselines, with scales ranging from 5M to 1.2B parameters.

Results on RMIS Benchmark

 

[Table: Results on the RMIS benchmark]

On the RMIS benchmark, the three versions of FISHER surpass all baselines by 3.91%, 4.34% and 5.03%, showcasing versatile and outstanding capabilities despite their much smaller model sizes. On anomaly detection tasks, FISHER is generally the second-best model, only slightly behind BEATs, the best-performing model in previous DCASE challenges. On fault diagnosis tasks, FISHER outperforms all baselines by a wide margin, where even FISHER-tiny is 9.24% higher than the best baseline. This is mainly attributed to FISHER's ability to utilize the full bandwidth of the original signal, whereas the baselines must downsample the signal, which results in information loss. Meanwhile, the largest version of FISHER comprises only 22M parameters, much smaller than common baselines with around 90M parameters.

 


 

Efficient Scaling

[Figure: Scaling curves of FISHER and baseline models]

As presented in the above figure, the scaling curve of FISHER stays consistently above the curves of all baseline models, demonstrating the superiority of FISHER's pre-training scheme for signal representation.

It is worth noting that directly scaling up pre-training has encountered bottlenecks. For all baselines, performance grows steadily as the model size scales from tiny to base (around 100M), yet unexpectedly drops as scaling continues, which seems to contradict the scaling law. The research team believes this is due to the limited quality of signal data for pre-training large-scale models: industrial signals are often extremely stationary and invariant. Such data are only sufficient for training models of limited scale, and the inflection point probably sits around 100M parameters.

 Therefore, greater emphasis should be accorded to data preparation when scaling up the model, and it is necessary to carry out data cleaning on a broader scale and with finer granularity. Moreover, given the success of FISHER, Test-Time Scaling (TTS) is considered a potential breakthrough point for health management tasks.

 

Multiple Split Ratios

[Table: Area under the multi-split curve under variable split ratios]

In the RMIS benchmark, 12 of the 13 fault diagnosis datasets do not provide an official split. For these datasets, the research team plots the performance curve under variable split ratios ranging from 0.05 to 0.95, and then estimates the area under the multi-split curve using the trapezoidal rule. As presented in the above table, FISHER still holds a significant advantage under multiple split ratios, achieving the largest area under the multi-split curve.
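As an illustration of this metric, the area under a multi-split curve can be computed with the trapezoidal rule as below. The accuracy values are placeholders for illustration, not results from the paper.

```python
import numpy as np

# Split ratios from 0.05 to 0.95 and placeholder accuracies at each ratio.
ratios = np.linspace(0.05, 0.95, 19)
acc = 0.80 + 0.10 * ratios                       # hypothetical linear curve

# Trapezoidal rule: sum over intervals of width * mean of the two endpoints,
# normalized by the ratio range so the score reads like an average accuracy.
area = np.sum(np.diff(ratios) * (acc[1:] + acc[:-1]) / 2.0)
score = area / (ratios[-1] - ratios[0])
```

Averaging over the full range of split ratios rewards models whose representations remain strong even when very little labeled data is available for the KNN reference set.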