Add EdgeTAM: On-Device Track Anything Model
parent
cb9289101e
commit
ef9b2a06fe
@@ -0,0 +1,9 @@
On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains remarkable performance compared with previous methods, making it a foundation model for the video segmentation task. In this paper, we aim at making SAM 2 much more efficient, so that it even runs on mobile devices while maintaining comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2, because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also a latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries.
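As a concrete illustration of this idea, here is a minimal PyTorch sketch (not the official EdgeTAM code; module and parameter names are assumptions) of Perceiver-style memory compression, where a fixed set of learnable queries cross-attends to a dense frame-level memory map so that downstream attention operates on far fewer tokens:

```python
import torch
import torch.nn as nn

class PerceiverCompressor(nn.Module):
    """Compress a dense frame-level memory map into a fixed number of
    latent tokens via cross-attention (illustrative sketch only)."""

    def __init__(self, dim: int = 256, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Fixed set of learnable queries, shared across all frames.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        # memory: (B, H*W, dim) densely stored frame-level features.
        B = memory.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, Q, dim)
        out, _ = self.cross_attn(q, memory, memory)       # queries attend to memory
        return self.norm(out)                             # (B, Q, dim), Q << H*W

# Example: a 64x64 memory map (4096 tokens) is reduced to 64 tokens.
compressor = PerceiverCompressor()
mem = torch.randn(1, 64 * 64, 256)
print(compressor(mem).shape)  # torch.Size([1, 64, 256])
```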
Given that video segmentation is a dense prediction task, we find that preserving the spatial structure of the memories is essential, so the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves performance without inference overhead. EdgeTAM performs competitively on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max. SAM 2 extends SAM to handle both image and video inputs with a memory bank mechanism, and is trained with a new large-scale multi-grained video tracking dataset (SA-V). Despite achieving astonishing performance compared to previous video object segmentation (VOS) models and allowing more diverse user prompts, SAM 2, as a server-side foundation model, is not efficient for on-device inference, on either CPU or NPU (throughout the paper, we use iPhone and iPhone 15 Pro Max interchangeably for simplicity). Previous works that optimize SAM for better efficiency only consider squeezing its image encoder, since the mask decoder is extremely lightweight; this is not enough for SAM 2. Specifically, SAM 2 encodes past frames with a memory encoder, and these frame-level memories, together with object-level pointers (obtained from the mask decoder), serve as the memory bank.
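As a rough sketch of this memory bank design (a hypothetical structure based only on the description above; the capacity and tensor shapes are illustrative), each processed frame contributes a dense spatial memory from the memory encoder plus a compact object pointer from the mask decoder:

```python
from collections import deque
from dataclasses import dataclass
import torch

@dataclass
class FrameMemory:
    spatial_features: torch.Tensor  # memory-encoder output: (H*W, dim)
    object_pointer: torch.Tensor    # mask-decoder object pointer: (dim,)

class MemoryBank:
    """FIFO bank over recent frames (illustrative sketch; the real
    capacity and eviction policy may differ)."""

    def __init__(self, max_frames: int = 7):
        self.frames = deque(maxlen=max_frames)

    def add(self, spatial_features: torch.Tensor, object_pointer: torch.Tensor):
        self.frames.append(FrameMemory(spatial_features, object_pointer))

    def gather(self):
        # Concatenate all stored memories for cross-attention with the
        # current frame's features.
        spatial = torch.cat([f.spatial_features for f in self.frames], dim=0)
        pointers = torch.stack([f.object_pointer for f in self.frames], dim=0)
        return spatial, pointers
```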
These memories are then fused with the features of the current frame through memory attention blocks. Because the memories are densely encoded, this leads to a huge matrix multiplication during the cross-attention between current-frame features and memory features. Therefore, despite containing relatively fewer parameters than the image encoder, the memory attention is not affordable for on-device inference. This hypothesis is further supported by Fig. 2: reducing the number of memory attention blocks almost linearly cuts down the overall decoding latency, and within each memory attention block, removing the cross-attention gives the largest speed-up. To make such a video-based tracking model run on device, EdgeTAM exploits the redundancy in videos. To do this in practice, we propose to compress the raw frame-level memories before performing memory attention. We start with naïve spatial pooling and observe a significant performance degradation, especially when using low-capacity backbones.
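To see why compression helps, consider a back-of-the-envelope sketch (our own illustrative numbers, not figures from the paper): the cross-attention score matrix scales with the number of memory tokens, so pooling each stored frame's memory map shrinks the dominant matrix multiplication proportionally. Below is the naïve spatial-pooling baseline described above, in PyTorch:

```python
import torch
import torch.nn.functional as F

def pool_frame_memories(memories: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Naive baseline: average-pool each frame's dense memory map.

    memories: (T, dim, H, W) memories for T stored frames.
    Returns:  (T, dim, H/factor, W/factor).
    """
    return F.avg_pool2d(memories, kernel_size=factor)

# Rough cost of memory cross-attention, dominated by the
# (current tokens) x (memory tokens) score matrix:
T, H, W, dim = 7, 64, 64, 256
n_query = H * W                    # current-frame tokens
n_memory = T * H * W               # dense memory tokens
dense_flops = 2 * n_query * n_memory * dim
pooled_flops = 2 * n_query * (n_memory // 16) * dim  # after 4x4 pooling
print(f"dense: {dense_flops / 1e9:.1f} GFLOPs, pooled: {pooled_flops / 1e9:.1f} GFLOPs")
```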
However, naïvely incorporating a Perceiver also results in a severe drop in performance. We hypothesize that, as a dense prediction task, video segmentation requires preserving the spatial structure of the memory bank, which a naïve Perceiver discards. Given these observations, we propose a novel lightweight module, named 2D Spatial Perceiver, that compresses frame-level memory feature maps while preserving their 2D spatial structure. Specifically, we split the learnable queries into two groups. One group functions like the original Perceiver: each query performs global attention over the input features and outputs a single vector as a frame-level summarization. In the other group, the queries have 2D priors, i.e., each query is responsible only for compressing a non-overlapping local patch, so the output maintains the spatial structure while reducing the total number of tokens. In addition to the architectural improvement, we further propose a distillation pipeline that transfers the knowledge of the powerful teacher SAM 2 to our student model, improving accuracy at no inference overhead.
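A minimal sketch of how such a two-group query design could be implemented (an assumption-laden illustration, not the official EdgeTAM code): global queries attend over the entire memory map, while a shared patch query attends within each non-overlapping window, so the patch outputs keep a 2D grid layout at reduced resolution.

```python
import torch
import torch.nn as nn

class SpatialPerceiver2D(nn.Module):
    """Illustrative sketch: global queries summarize the whole memory
    map; a shared patch query compresses each non-overlapping window,
    preserving the 2D grid at reduced resolution."""

    def __init__(self, dim=256, num_global=16, patch=4, num_heads=8):
        super().__init__()
        self.patch = patch
        self.global_q = nn.Parameter(torch.randn(num_global, dim) * 0.02)
        self.patch_q = nn.Parameter(torch.randn(1, dim) * 0.02)  # shared across windows
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # mem: (B, C, H, W) with H, W divisible by self.patch.
        B, C, H, W = mem.shape
        p, nh, nw = self.patch, H // self.patch, W // self.patch

        # Group 1: global attention over all H*W memory tokens.
        tokens = mem.flatten(2).transpose(1, 2)            # (B, H*W, C)
        gq = self.global_q.unsqueeze(0).expand(B, -1, -1)
        global_out, _ = self.attn(gq, tokens, tokens)      # (B, num_global, C)

        # Group 2: each query only sees its own p x p window, so the
        # nh x nw output grid preserves the spatial structure.
        windows = (mem.reshape(B, C, nh, p, nw, p)
                      .permute(0, 2, 4, 3, 5, 1)           # (B, nh, nw, p, p, C)
                      .reshape(B * nh * nw, p * p, C))
        pq = self.patch_q.unsqueeze(0).expand(B * nh * nw, -1, -1)
        patch_out, _ = self.attn(pq, windows, windows)     # (B*nh*nw, 1, C)
        patch_out = patch_out.reshape(B, nh * nw, C)       # row-major grid order

        # e.g. 64x64 map, p=4: 4096 dense tokens -> 16 global + 256 patch tokens.
        return torch.cat([global_out, patch_out], dim=1)
```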
We find that in both stages, aligning the features from the image encoders of the original SAM 2 and our efficient variant benefits performance. Besides, in the second stage we also align the feature output of the memory attention between the teacher SAM 2 and our student model, so that in addition to the image encoder, the memory-related modules receive supervision signals from the SAM 2 teacher as well; this improves SA-V val and test by 1.3 and 3.3, respectively. Putting it all together, we propose EdgeTAM (Track Anything Model for Edge devices), which adopts a 2D Spatial Perceiver for efficiency and knowledge distillation for accuracy. Through a comprehensive benchmark, we reveal that the latency bottleneck lies in the memory attention module. Given this latency analysis, we propose a 2D Spatial Perceiver that significantly cuts down the computational cost of memory attention with comparable performance, and that can be integrated with any SAM 2 variant. We experiment with a distillation pipeline that performs feature-wise alignment with the original SAM 2 in both the image and video segmentation stages, and observe performance improvements without any extra cost during inference.
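A minimal sketch of such feature-wise alignment (the model interface, loss choice, and weighting here are our assumptions; the paper's actual pipeline may differ): the student's image-encoder and memory-attention outputs are regressed toward those of the frozen teacher, on top of the ordinary task loss.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(student_feats: torch.Tensor,
                           teacher_feats: torch.Tensor) -> torch.Tensor:
    """MSE between student and teacher feature maps (teacher is frozen)."""
    return F.mse_loss(student_feats, teacher_feats.detach())

def distillation_step(student, teacher, frames, task_loss_fn, alpha=1.0, beta=1.0):
    # Hypothetical interface: both models are assumed to expose
    # image-encoder features and memory-attention outputs alongside
    # their predictions.
    with torch.no_grad():
        t_img, t_mem, _ = teacher(frames)
    s_img, s_mem, s_pred = student(frames)

    loss = task_loss_fn(s_pred)                                 # segmentation loss
    loss = loss + alpha * feature_alignment_loss(s_img, t_img)  # both stages
    loss = loss + beta * feature_alignment_loss(s_mem, t_mem)   # second stage only
    return loss
```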