Researchers Present VideoINR: A Video Implicit Neural Representation Training Model for Continuous Spatiotemporal Super-Resolution

This article is written as a summary by Marktechpost Staff based on the paper 'VideoINR: Learning Video Implicit Neural Representation for
Continuous Space-Time Super-Resolution'. All credit for this research goes to the researchers of this project. Check out the paper, GitHub, and project.


Humans perceive the visual world as continuous data. Recorded videos, on the other hand, are stored at limited spatial resolutions and frame rates, because recording and storing large amounts of video data over long periods is expensive. As a result, current computer vision systems must process low-resolution, low-frame-rate video.

Many researchers are investigating strategies for converting low-resolution video to high-resolution video in both space and time, since delivering video in high-resolution, high-frame-rate formats is essential for the best user experience.

By converting pixel information into learned features inside a neural network, machine learning-based image synthesis systems such as autoencoders and generative adversarial networks (GANs) can achieve better image compression efficiency than traditional pixel-based codecs.


Given a low-resolution, low-frame-rate video as input, many approaches use space-time video super-resolution (STVSR) to increase spatial resolution and frame rate simultaneously. These systems convert video content into learned features and then render them at fixed resolutions. STVSR systems also support upsampling and frame interpolation, producing more detail and higher frame rates than the original footage. Researchers are also working on techniques that perform space-time super-resolution in one stage rather than two. However, these methods can only super-resolve at a fixed space and time scale ratio.

Researchers from Picsart AI Research (PAIR), USTC, UC San Diego, UIUC, UT Austin, and the University of Oregon propose Video Implicit Neural Representation (VideoINR) as a continuous video representation. It allows video frames to be sampled and interpolated at any frame rate and spatial resolution.

Their study builds on the recent development of implicit functions for 3D shape and image representation, such as the Local Implicit Image Function (LIIF), which learns a continuous image representation on top of ConvNet features. In an image, interpolation in space can rely on the gradients between neighboring pixels. Across the frames of a low-frame-rate video, such pixel gradients are difficult to compute. To interpolate in time, the network must capture the motion of pixels and objects, which is difficult to model using 2D or 3D convolutions.


In the STVSR task, two low-resolution frames are concatenated and passed to an encoder, which produces a feature map with spatial dimensions. VideoINR then serves as a continuous video representation over this feature map. It first defines a spatial implicit neural representation over a continuous spatial feature domain, from which a high-resolution image feature is sampled according to the query coordinates.
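The idea of sampling a feature map at arbitrary query coordinates can be illustrated with a minimal sketch. This is not the authors' implementation: in VideoINR the spatial representation is learned, whereas the sketch below substitutes plain bilinear interpolation for it, and the names `make_query_grid`, `sample_bilinear`, and `upsample` are hypothetical helpers introduced only for illustration.

```python
def make_query_grid(out_h, out_w):
    """Return normalized pixel-center coordinates (y, x) in (0, 1) for any output size."""
    return [[((i + 0.5) / out_h, (j + 0.5) / out_w) for j in range(out_w)]
            for i in range(out_h)]


def sample_bilinear(feat, y, x):
    """Sample a 2D feature map at normalized coordinates (y, x).

    Assumption: bilinear interpolation stands in for the learned implicit function.
    """
    h, w = len(feat), len(feat[0])
    # Map normalized coordinates to continuous cell-center indices, clamped to the border.
    fy = min(max(y * h - 0.5, 0.0), h - 1.0)
    fx = min(max(x * w - 0.5, 0.0), w - 1.0)
    y0, x0 = int(fy), int(fx)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = fy - y0, fx - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def upsample(feat, out_h, out_w):
    """Query the feature map on an out_h x out_w grid; the size is not tied to a fixed ratio."""
    grid = make_query_grid(out_h, out_w)
    return [[sample_bilinear(feat, y, x) for (y, x) in row] for row in grid]


low_res = [[0.0, 1.0],
           [2.0, 3.0]]
high_res = upsample(low_res, 3, 5)  # a non-integer scale ratio works the same way
for row in high_res:
    print([round(v, 2) for v in row])
```

Because the output grid is defined only by its query coordinates, the same function serves any target resolution, which is the property the continuous representation is meant to provide.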

Rather than using convolutions for temporal interpolation, VideoINR learns a temporal implicit neural representation that first outputs a motion flow field, taking the high-resolution feature and the sampling time as inputs. This flow field is used to warp the high-resolution feature, which is then decoded into the target video frame. An encoder generates a feature map from the input frames, which VideoINR can then decode at any spatial resolution and frame rate. Since all operations are differentiable, feature-level motion can be learned end-to-end without any supervision beyond the reconstruction error.
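The warping step can be sketched as follows, under strong simplifying assumptions. In VideoINR the flow field is predicted by a learned function of the feature and the sampling time. Here a hand-written constant-velocity function, `toy_flow`, takes its place, and all function names are hypothetical.

```python
def bilinear(feat, fy, fx):
    """Bilinearly sample a 2D grid at continuous index (fy, fx), clamped to the border."""
    h, w = len(feat), len(feat[0])
    fy = min(max(fy, 0.0), h - 1.0)
    fx = min(max(fx, 0.0), w - 1.0)
    y0, x0 = int(fy), int(fx)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = fy - y0, fx - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def toy_flow(y, x, t, velocity=(0.0, 2.0)):
    """Hypothetical stand-in for the learned temporal function.

    Content moves by velocity * t, so the frame at time t reads from
    (position - velocity * t). Returns that backward offset (dy, dx).
    """
    return (-velocity[0] * t, -velocity[1] * t)


def warp_to_time(feat, t):
    """Backward-warp a feature map to time t using the flow field."""
    out = []
    for y in range(len(feat)):
        row = []
        for x in range(len(feat[0])):
            dy, dx = toy_flow(y, x, t)
            row.append(bilinear(feat, y + dy, x + dx))
        out.append(row)
    return out


frame0 = [[0.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0, 0.0]]
print(warp_to_time(frame0, 0.5)[1])   # bright value shifted one pixel to the right
print(warp_to_time(frame0, 0.25)[1])  # half-pixel shift, split between two neighbors
```

Since `t` is an ordinary float, any intermediate time can be queried, and every step is built from interpolation arithmetic that remains differentiable when written in an autodiff framework.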

The researchers ran their experiments on the Vid4, GoPro, and Adobe240 datasets. Their results show that VideoINR can represent video at arbitrary spatial and temporal resolutions within the scales of the training distribution, and that it can also extrapolate to out-of-distribution frame rates and spatial resolutions. Because the learned function is continuous, a specific region and time range can be decoded as needed instead of decoding the entire video every time.
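Decoding only a chosen region and time can be pictured as building a smaller set of space-time queries. The sketch below assumes a hypothetical continuous function `toy_video(t, y, x)` in place of the trained model, and the normalized `region` tuple is an illustrative convention rather than anything taken from the paper.

```python
def query_coords(region, out_h, out_w, times):
    """Build (t, y, x) queries covering a normalized region at the given times.

    region = (y_min, y_max, x_min, x_max), each value in [0, 1].
    """
    y_min, y_max, x_min, x_max = region
    coords = []
    for t in times:
        for i in range(out_h):
            y = y_min + (i + 0.5) / out_h * (y_max - y_min)
            for j in range(out_w):
                x = x_min + (j + 0.5) / out_w * (x_max - x_min)
                coords.append((t, y, x))
    return coords


def toy_video(t, y, x):
    """Hypothetical continuous video: a smooth function of time and space."""
    return 0.5 * x + 0.25 * y + 0.25 * t


def decode(region, out_h, out_w, times, video_fn=toy_video):
    """Evaluate the continuous function only at the requested queries."""
    return [video_fn(t, y, x) for (t, y, x) in query_coords(region, out_h, out_w, times)]


# Whole frame at 9 evenly spaced timestamps versus a single crop at one arbitrary time.
full = decode((0.0, 1.0, 0.0, 1.0), 64, 64, [i / 8 for i in range(9)])
crop = decode((0.25, 0.5, 0.25, 0.5), 16, 16, [0.3])
print(len(full), len(crop))
```

The cost scales with the number of queries, so a small crop at a single timestamp needs far fewer evaluations than the full clip.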

On in-distribution spatial and temporal scales, VideoINR performs competitively with state-of-the-art STVSR approaches, and it significantly outperforms other methods on out-of-distribution scales.

The researchers also found that VideoINR did not work well in certain situations. In most of these scenarios, very large movements have to be handled, which remains a challenge for video interpolation. The team intends to work on resolving this issue in future work.
