Collect data materials and build datasets
(1) Construction of the dairy cow target detection dataset. The dataset for this study was derived from video of 493 Holstein cows captured by a dairy farm camera in Baotou City, Inner Mongolia Autonomous Region, in 2020. The video comprises 3840 segments of 45 minutes each in MPEG4 format, with a frame height of 1080 pixels, a frame width of 1920 pixels and a bit rate of 1639 kb/s. To prevent the model from over-fitting due to the similarity of cow shapes in adjacent frames, the captured video was converted to images by extracting one frame out of every eighty. Images that were blurred or contained no cows in the region to be recognized were eliminated, leaving a total of 24210 images after screening. All images were shuffled, and 19730 were randomly selected as the training set, 2220 as the validation set and 2260 as the test set.

(2) Construction of the dairy cow identification dataset. Each cow was cropped out of the detection images, yielding 72020 cow target images after screening. The images covered 493 cows and were classified by ID. We randomly chose 293 cows (38110 images) as the training set and 200 cows (33910 images) as the test set.
The dataset for each cow in the training and test sets then needed to be divided further. Each cow was photographed over a 360° range, and with every 60° treated as one category, each cow could be divided into six categories representing six different shooting angles. Finally, these 493 × 6 = 2958 categories of pictures were renamed.
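The shuffle-and-split procedure above can be sketched as follows; the function name, the seed and the path naming are illustrative, only the 19730/2220/2260 counts come from the text.

```python
import random

def split_dataset(image_paths, n_train, n_val, seed=0):
    """Shuffle the screened images and split them into train/val/test.
    A minimal sketch of the random split described in the text; the
    seed and function signature are assumptions, not the authors' code."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test

# Example with the counts reported in the text (24210 screened images):
images = [f"frame_{i:05d}.jpg" for i in range(24210)]
train, val, test = split_dataset(images, 19730, 2220)
```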
Dairy cow target detection methods
The initial input image was resized to 640×640×3. After the backbone layer, three feature maps of different scales were generated through the head. We chose Adam as the optimizer; the initial learning rate was 0.001, the momentum factor was 0.9, the weight decay was 0.0005 and the batch size was 32. Since this paper only needed to detect dairy cow targets, only one class had to be classified. Before training began, 9 anchors arranged from small to large were obtained through the k-means algorithm (Wang et al., 2022). Each ground-truth box was matched against them, and its width-to-width and height-to-height ratios were calculated. The maximum ratio was compared with a set threshold; if it was less than this threshold, the anchor was considered a positive sample. The RepVGG structure was introduced to improve model performance through multi-branch design. An Aux-Head was added to work with the Lead-Head for model optimization. When anchors were matched with ground-truths, three positive samples were assigned to the Aux-Head and five positive samples to the Lead-Head, with a loss weight ratio of 1:4. The final six-dimensional prediction values were obtained, representing the border coordinates (x, y, w, h), the border confidence and the class probability.
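The shape-based anchor matching described above can be sketched as below. The threshold value of 4.0 is a common default in YOLO-style detectors and is an assumption, since the text does not state the value.

```python
def match_anchors(gt_wh, anchors, thr=4.0):
    """For one ground-truth (width, height), compute the width-to-width
    and height-to-height ratios against each anchor, take the maximum
    ratio, and mark the anchor as a positive sample when that ratio is
    below the threshold. Sketch of the matching rule in the text;
    thr=4.0 is an assumed default."""
    gw, gh = gt_wh
    matches = []
    for aw, ah in anchors:
        r_w = max(gw / aw, aw / gw)   # width-to-width ratio (>= 1)
        r_h = max(gh / ah, ah / gh)   # height-to-height ratio (>= 1)
        matches.append(max(r_w, r_h) < thr)
    return matches
```

An anchor of a very different shape from the ground-truth box therefore fails the test and is not used as a positive sample.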
Cow identification methods
The Reid technique, or person re-identification, is a sub-problem of image retrieval that uses distance calculation in metric learning to search for targets in an image gallery (Hermans et al., 2017). In this paper, the target-detected images were used as the query, and we randomly selected 200 cows, representative of the entire dataset, as the gallery; the features of the query images were then compared by metric calculation with those of the gallery images to find the IDs of the detected cows in the gallery.
Backbone
The dairy cow identification model extracted features through an improved Resnet50 network; because its inputs were the cropped outputs of the target detection model, no further target localization was required. The high-resolution feature information from the shallow layers was of little use to the recognition model and might even act as noise that degraded the final recognition effect. Therefore, feature extraction at different scales was carried out only for the last two stages, and the stride of the 1/32 stage was set to 1 so that the resolution of the last two stages remained consistent. Feature fusion refers to merging feature information from different scales into one set of features through a concat operation, which better combines the information reflected at each scale. The feature maps obtained after the last convolution layer of the fourth and fifth stages were adaptively fused to make better use of the semantic information of the last two scales, as shown in Fig 1.
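A minimal sketch of the adaptive fusion of the stage-4 and stage-5 feature maps (which share the same spatial resolution because the 1/32 stride is set to 1). Modeling the learnable fusion weights as a softmax over two scalars, and concatenating along the channel axis, are assumptions; the paper does not give the exact form.

```python
import numpy as np

def adaptive_fuse(f4, f5, w=None):
    """Fuse two same-resolution feature maps (C4,H,W) and (C5,H,W) by
    weighting each with a softmax over two learnable logits, then
    concatenating along the channel axis. A sketch under assumptions,
    not the authors' implementation."""
    if w is None:
        w = np.array([0.0, 0.0])               # learnable logits in a real model
    a = np.exp(w) / np.exp(w).sum()             # softmax -> fusion weights
    fused = np.concatenate([a[0] * f4, a[1] * f5], axis=0)  # channel concat
    return fused
```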
The positive and negative sample pairs
The construction of positive and negative sample pairs strongly affects recognition accuracy. We set the batch size to 16 and NUM_INSTANCE to 4, which meant randomly selecting 16 images from the prepared identification dataset for each mini-batch; these 16 images covered 4 classes, with 4 images per class. One of the 16 images was randomly selected as a template sample, forming positive sample pairs with images of the same ID and negative sample pairs with images of different IDs. After data augmentation such as flipping, rotation, cropping and affine transformation, the distances between positive and negative samples were calculated, and the maximum positive-pair distance and the minimum negative-pair distance were taken as the hard examples. These were finally fed into the triplet loss for training.
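The hard-example selection above is the standard batch-hard mining strategy; it can be sketched as follows, given a pairwise distance matrix over the mini-batch. This is an illustrative sketch, not the authors' exact code.

```python
import numpy as np

def batch_hard(dist, labels):
    """Within one mini-batch (here 16 images = 4 IDs x NUM_INSTANCE 4),
    take for each anchor the farthest same-ID sample (hardest positive)
    and the closest different-ID sample (hardest negative).
    `dist` is an (n, n) pairwise distance matrix."""
    labels = np.asarray(labels)
    pos_mask = labels[:, None] == labels[None, :]
    np.fill_diagonal(pos_mask, False)            # an image is not its own pair
    neg_mask = labels[:, None] != labels[None, :]
    d_p = np.where(pos_mask, dist, -np.inf).max(axis=1)   # hardest positive
    d_n = np.where(neg_mask, dist, np.inf).min(axis=1)    # hardest negative
    return d_p, d_n
```

The returned d_p and d_n are exactly the quantities fed into the triplet loss.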
Loss and distance calculation formula
This paper used a triplet loss different from that of traditional Reid and proposed a joint training method with multiple losses. Center loss can reduce the intra-class gap, so adding it reduced the gap between features of cows with the same ID and the same shooting angle. The data selected for the triplet loss might not be uniformly distributed, which would make the training process unstable, slow convergence and encourage over-fitting. We therefore added a label smoothing loss at the fully connected layer, which effectively alleviates these problems: it prevents the model from focusing only on the loss at the correct label position, makes it also consider the loss at incorrect label positions, and increases the generalization ability of the model. The triplet loss is shown in formula (1). As described in the section on positive and negative sample pairs, the hard positive distance is taken as dp and the hard negative distance as dn, the margin is set to 0.25 and the triplet loss is calculated. The center loss is shown in formula (2); it simultaneously learns the deep feature center of each cow class and reduces the distance between each sample and its class center. The label smoothing loss is shown in formula (3). The last layer of Resnet50, which outputs the ID prediction logits of an image, is a fully connected layer with a hidden size equal to the number of cows N (N = 493). Given an image, we denote y as the ground-truth ID label and pi as the ID prediction logit of class i. The final loss is shown in formula (4); after experimental verification, b was set to 0.34.
L_{tri} = \max(d_p - d_n + \alpha, 0) ....(1)

L_{C} = \frac{1}{2} \sum_{j=1}^{B} \lVert f_{t_j} - c_{y_j} \rVert_2^2 ....(2)

L_{LS} = \sum_{i=1}^{N} -q_i \log(p_i), \quad q_i = \begin{cases} 1 - \frac{N-1}{N}\varepsilon, & i = y \\ \frac{\varepsilon}{N}, & i \neq y \end{cases} ....(3)

L = L_{tri} + L_{LS} + b \cdot L_{C} ....(4)

where
d_p, d_n = feature distances of the positive pair and the negative pair;
α = margin, set to 0.25;
y_j = label of the j-th image in a mini-batch;
c_{y_j} = the y_j-th class center of the deep features;
f_{t_j} = feature of the j-th image;
B = batch size;
y = ground-truth ID label;
p_i = ID prediction logit of class i;
ε = label smoothing factor.
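The joint loss described above can be sketched as follows. The smoothing factor eps = 0.1 is a common default and the exact way the three terms are combined (b weighting the center loss) is a reconstruction from the text, not a confirmed detail.

```python
import numpy as np

def triplet_loss(d_p, d_n, margin=0.25):
    """Formula (1): hinge on the hardest positive/negative distances."""
    return np.maximum(d_p - d_n + margin, 0.0).mean()

def label_smoothing_ce(logits, y, eps=0.1):
    """Formula (3): cross-entropy against a softened one-hot target over
    N classes. eps=0.1 is an assumed default, not given in the paper."""
    n = logits.shape[-1]
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    q = np.full(n, eps / n)
    q[y] += 1.0 - eps
    return -(q * logp).sum()

def joint_loss(d_p, d_n, logits, y, feat, center, b=0.34):
    """Formula (4): triplet + label-smoothed ID loss + b * center loss,
    with b = 0.34 as reported in the text (combination reconstructed)."""
    l_c = 0.5 * ((feat - center) ** 2).sum()   # formula (2), single sample
    return triplet_loss(d_p, d_n) + label_smoothing_ce(logits, y) + b * l_c
```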
A jointly optimized calculation of euclidean distance and cosine similarity distance was proposed. The euclidean distance measures the absolute distance between features and is directly related to the location coordinates of the points, whereas the cosine distance measures the angle between the spatial vectors and is more concerned with the difference in direction. Because we could not determine in advance which measurement method was more effective, we proposed a combined-optimization distance formula. After features were extracted from the target and gallery images by the backbone, the results were passed in as xi and yi, and by combining the two distances adaptively the formula replaced a heavy, complex traversal search for the optimal hyper-parameter values. Formula (5) is the euclidean distance and formula (6) is the cosine similarity. In formula (7), α and β represent the weights of the euclidean and cosine distances in the joint optimization formula, that is, the impact of the two distance measures on the recognition effect. α and β were first initialized, then normalized and put into the network as parameters to be optimized. The final printout of the weights showed that α was 0.32 and β was 0.68.
d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} ....(5)

\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}} ....(6)

Distance = \alpha \cdot d + \beta \cdot (1 - \cos\theta) ....(7)

where
d = euclidean distance between the two vectors;
x_i = the i-th dimension of the detected cow's feature vector;
y_i = the i-th dimension of the gallery cow's feature vector;
cos θ = cosine similarity of the two vectors;
Distance = the joint optimization distance;
α = euclidean distance weight;
β = cosine distance weight.
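A minimal sketch of the joint distance, using the learned weights reported in the text (α = 0.32, β = 0.68). Taking 1 − cos θ as the cosine distance term is a standard convention and an assumption here.

```python
import numpy as np

def joint_distance(x, y, alpha=0.32, beta=0.68):
    """Formulas (5)-(7): weighted combination of euclidean distance and
    cosine distance (1 - cosine similarity) between two feature vectors.
    alpha/beta are the weights reported in the text; in the paper they
    are learned by the network rather than fixed."""
    d_euc = np.sqrt(((x - y) ** 2).sum())                          # formula (5)
    cos = (x * y).sum() / (np.linalg.norm(x) * np.linalg.norm(y))  # formula (6)
    return alpha * d_euc + beta * (1.0 - cos)                      # formula (7)
```

Identical feature vectors give a joint distance of zero, and larger values indicate less similar cows.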
Evaluation indicator
The metrics evaluated in this experiment are Rank-k and mAP. Taking Rank-1 as an example, it counts, over all query images, whether the first result returned from the gallery has the same ID as the query, i.e. the accuracy of the first retrieved target; Rank-5 and Rank-10 are obtained by the same calculation. mAP denotes mean average precision, a common evaluation metric in multi-target detection and multi-label image classification. It sums and averages the average precision (AP) over all categories, i.e. it represents the accuracy of all retrieval results. The calculation is shown in formula (8):
AP_c = \frac{\sum_{k} Precision_c(k)}{Images_c}, \quad mAP = \frac{1}{C} \sum_{c=1}^{C} AP_c ....(8)

where
Precision_c(k) = precision for a single category at the k-th correct retrieval;
Images_c = number of images containing targets of that category;
AP_c = average precision for a single category;
C = total number of categories.
There are 493 dairy cows from 6 different angles, a total of 2958 categories.
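The two metrics can be sketched for a single query as follows; `ranked_ids` is the list of gallery IDs sorted by ascending distance to the query (variable names are illustrative).

```python
def rank_k(ranked_ids, query_id, k):
    """Rank-k for one query: 1 if a correct gallery ID appears among the
    top-k returned results, else 0."""
    return int(query_id in ranked_ids[:k])

def average_precision(ranked_ids, query_id):
    """AP for one query, in the spirit of formula (8): precision is
    accumulated at each correct hit and divided by the number of
    relevant gallery images."""
    hits, prec_sum = 0, 0.0
    for i, gid in enumerate(ranked_ids, start=1):
        if gid == query_id:
            hits += 1
            prec_sum += hits / i
    return prec_sum / hits if hits else 0.0
```

mAP is then the mean of these per-query AP values over the whole query set.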
Model lightweight methods
In this paper, the iterative magnitude pruning algorithm based on the lottery ticket hypothesis was used to filter the optimal sub-network from the dairy cow identification model. The experimental steps are as follows:
- The network was initialized with pretrained weights obtained by training the Resnet50 network, then subjected to k steps of gradient descent; the resulting weights W0 were saved, where k was 0.1%-7% of the total number of training iterations.
- The network was trained on the dairy cow dataset until convergence, giving the weights WT(1).
- A mask m(1) was created, a fixed pruning rate was set, and the model WT(1) was pruned unstructuredly.
- The sparse network weights obtained by pruning were reset to W0, the weights obtained from the k gradient descent steps on the original network.
- Steps 2~4 were repeated, noting the final accuracy of each pruned model after convergence; pruning ended once the model's accuracy began to decline, and the result with the highest pruning accuracy was taken.
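One round of the unstructured magnitude pruning used in step 3 can be sketched as below; the function operates on a flat weight array and a boolean mask, which is a simplification of pruning a full Resnet50.

```python
import numpy as np

def magnitude_prune(weights, mask, rate):
    """One pruning round for the lottery-ticket procedure above: among
    the still-active weights, zero out the `rate` fraction with the
    smallest magnitude and update the mask. A sketch, not the authors'
    implementation."""
    alive = weights[mask]
    k = int(rate * alive.size)
    if k > 0:
        thresh = np.sort(np.abs(alive))[k - 1]   # k-th smallest magnitude
        mask = mask & (np.abs(weights) > thresh)
    return weights * mask, mask
```

In the full procedure, the surviving weights would then be reset to W0 and retrained, and the round repeated until accuracy starts to decline.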