Jason Corso, Co-Founder and CEO of Voxel51, took to the stage at AutoSens in Detroit (2019) to discuss Computer Vision and Machine Learning for Large Scale Video Corpus Maintenance and Curation. We caught up with Jason after Detroit to find out a little more about him, his research, and his presentation.
Your research interests include computer vision, robot perception, data science, and medical imaging. In which area have you seen the most developments in the past five years, and in which area do you expect to make the most progress in the next five years?
No one would dispute that highly parameterized deep learning models have revolutionized the way we model and process data in disciplines like those I work in. We’ve seen performance move from “good research output” to “potentially commercially viable”. However, there are significant hurdles yet to overcome. For general methodology, I think we will see growth in domain adaptation as well as multimodal learning across application verticals. For a specific application vertical, I think we will see the most growth in robot perception, as online and active methods are energized by the new advances.
You have been quoted as saying “The most exciting use of AI for me focuses around a better collective use of our available resources.” Would you care to expand on that?
The energy we see around AI and its potential is simply incredible. People of all ages and walks of life seem to be aware of this energy. I cannot deny it is great. Yet, at the same time, as we see new potential generated from AI, I believe it is critically important to understand the societal context in which we are operating. Questions about the impacts of our new AI methods on society (on mobility, on energy use, on job creation) need to be asked by everyone, including those of us creating the technology. It is our responsibility to engage in discussion around these questions as we enhance the technology.
You are Co-Founder and CEO of the computer vision startup Voxel51. What does Voxel51 do?
Voxel51 creates an ecosystem that supports next-generation video understanding capabilities. The core of this ecosystem is our video platform, which supports developers and scientists, both novices and experts in computer vision. For novice developers and scientists who want to leverage recent advances in video understanding, our platform allows them to easily add robust, cutting-edge video and image understanding capabilities to their products and work. For experts in computer vision, our platform helps in three ways: it lets them scale their video and image understanding methods to meet the intense demands of their customers by deploying their own custom methods on the platform at scale with little effort; it lets them gain additional customers by deploying on our platform and exposing their methods to the full ecosystem; and it lets them enrich their video analytics capabilities by using our state-of-the-art methods. The second part of our ecosystem is Scoop, our advanced and easy-to-use video and image dataset analysis tool. It helps machine learning engineers, data scientists, and anyone working with large video and image datasets stop writing custom data management scripts and instead focus on the important questions about their data, turning weeks or months of dataset curation into hours or days. Both are available via our website and have free tiers to allow interested parties to explore them. https://voxel51.com
In your presentation at AutoSens you talked about managing large video-based datasets for automotive. Could you outline what you think is most important in terms of best practices for curating these datasets?
There are two parts to this answer. First is the part about actually curating datasets. The two most critical aspects of curating useful datasets for video machine learning, computer vision, and data science are breadth of coverage in the application domain and balance across the semantic categories of focus. These two aspects underscore why we created Scoop at Voxel51. Instead of requiring laborious manual sifting through data and custom scripting to process it, Scoop lets you quickly and visually understand these two issues by providing easy-to-understand statistics across the datasets.
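As a rough illustration of the two curation criteria mentioned above (coverage and balance), the following Python sketch computes simple statistics over a list of labels. It is not how Scoop works; the function names `coverage_gaps` and `balance_report`, the example categories, and the choice of normalized entropy as a balance score are all assumptions made for illustration.

```python
import math
from collections import Counter


def coverage_gaps(labels, expected_categories):
    """Return the expected categories that never appear in the labels (hypothetical helper)."""
    return sorted(set(expected_categories) - set(labels))


def balance_report(labels):
    """Return (shares, balance) for a non-empty list of category labels.

    shares:  fraction of samples per category.
    balance: entropy of the shares divided by log(number of categories),
             so 1.0 means perfectly even and values near 0 mean one
             category dominates. This score is an illustrative assumption.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {category: n / total for category, n in counts.items()}
    if len(counts) == 1:
        # A single category is treated as trivially balanced.
        return shares, 1.0
    entropy = -sum(p * math.log(p) for p in shares.values())
    return shares, entropy / math.log(len(counts))


if __name__ == "__main__":
    # Made-up labels, one per annotated object in a driving dataset.
    labels = ["car"] * 8 + ["pedestrian"] * 1 + ["truck"] * 1
    print(coverage_gaps(labels, ["car", "pedestrian", "truck", "cyclist"]))
    shares, balance = balance_report(labels)
    print(shares, round(balance, 3))
```

A low balance score or a non-empty list of coverage gaps would point to categories that need more data before training.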
Second is the part about how the datasets are used. An unfair advantage in the way I have tackled computer vision problems in my research career, and in the way Voxel51 tackles them, is that we focus directly on video. Video is everywhere, with some estimates suggesting there will be tens of billions of operational cameras worldwide within a few years. Yet the most common way to process this video is as a disconnected set of images. Doing so discards the temporal continuity inherent in the video. In my research work and at Voxel51, we directly leverage the spatiotemporal video volume to yield more robust and computationally sound methods. We see performance improvements of more than 10% in raw accuracy when doing so on typical problems.
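To make the idea of temporal continuity concrete, here is a deliberately simple Python sketch. It does not implement the spatiotemporal methods described above. It swaps in a much simpler technique, a sliding-window majority vote over per-frame labels, to show how neighboring frames can correct an isolated per-frame error. The function name `smooth_labels` and the example labels are hypothetical.

```python
from collections import Counter


def smooth_labels(frame_labels, window=5):
    """Replace each frame's label with the most common label in a
    window of neighboring frames (sliding-window majority vote).

    Ties go to the label that appears first in the window.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(frame_labels)):
        start = max(0, i - half)
        end = min(len(frame_labels), i + half + 1)
        neighborhood = frame_labels[start:end]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


if __name__ == "__main__":
    # Made-up per-frame predictions with one isolated error at frame 2.
    per_frame = ["car", "car", "truck", "car", "car"]
    print(smooth_labels(per_frame, window=3))
```

Treating each frame independently would keep the spurious "truck" label. Looking at adjacent frames removes it, which is the intuition behind using the video as a whole instead of a set of unrelated images.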