We introduce an approach to building a custom model from ready-made self-supervised models by associating them rather than by training and fine-tuning. We demonstrate it with the example of a humanoid robot that looks into a mirror and learns to detect the 3D pose of its own body from the image it perceives. To build our model, we first extract features from the visual input and from the postures of the robot’s body using models prepared before the robot’s operation. We then map their corresponding latent spaces through the robot’s sample-efficient self-exploration in front of the mirror. In this way, the robot builds the desired 3D pose detector, whose quality is perfect on the acquired samples immediately rather than being attained gradually. The mapping, which associates pairs of feature vectors, is implemented in the same way as the key–value mechanism of the well-known transformer models. Finally, deploying our model for imitation on a simulated robot allows us to study, tune, and systematically evaluate its hyperparameters without involving a human counterpart, advancing our previous research.
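The key–value association described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `KeyValueAssociator`, the sharpness parameter `beta`, the feature dimensions, and the random vectors standing in for visual and posture features are all assumptions made for illustration. Keys play the role of visual feature vectors, values the role of posture feature vectors, and a query is answered by a softmax-weighted mixture of the stored values.

```python
import numpy as np


class KeyValueAssociator:
    """Hypothetical associative memory: softmax attention over stored pairs."""

    def __init__(self, beta=50.0):
        self.beta = beta      # assumed sharpness of the softmax
        self.keys = []        # unit-normalised key vectors (e.g. visual features)
        self.values = []      # associated value vectors (e.g. posture features)

    def add(self, key, value):
        # Storing a pair is the only "learning" step; no gradient training.
        key = np.asarray(key, dtype=float)
        self.keys.append(key / np.linalg.norm(key))
        self.values.append(np.asarray(value, dtype=float))

    def query(self, q):
        # Cosine similarity of the query to every stored key.
        q = np.asarray(q, dtype=float)
        q = q / np.linalg.norm(q)
        K = np.stack(self.keys)
        V = np.stack(self.values)
        scores = self.beta * (K @ q)
        scores -= scores.max()            # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum()
        return weights @ V                # attention-weighted mix of values


# Random stand-ins for features from the two pretrained models (assumption).
rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 64))
posture = rng.normal(size=(5, 8))

memory = KeyValueAssociator(beta=50.0)
for k, v in zip(visual, posture):
    memory.add(k, v)

# Querying with an already stored key recalls its value almost exactly.
recalled = memory.query(visual[2])
```

With a sufficiently sharp softmax, a query that matches a stored key returns essentially the stored value, which is one way to read the claim that quality on the acquired samples is available immediately; queries between stored keys return a blend of the nearby values.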