
Show HN: I turned my face rec system into a video codec

518 points | 2 years ago | vertigo.ai

Before the pandemic, my tiny startup was doing quite well selling Edge AI systems, based on our own lightweight AI inference engine, with object detection and face recognition for smart city and smart retail & food service applications.

When the real world shut down, there was suddenly nothing to monitor on streets and in restaurants, so I set out to evolve our real-time face recognition system into a video codec for high-quality face-to-face online interactions, as I was not satisfied with the quality of Zoom and friends. I got it to work, and the first release for iOS was just approved on Apple's App Store: https://apps.apple.com/app/vertigo-focus/id1540073203

The way it works is that you create a meeting URL, which you can share out-of-band, for instance via Slack or text message. You can also share it as a QR code which the app can scan to join a call. You then place your device on a surface in front of you so that the front camera can see you, and it will recognize your face and assign you to your own session, which is broadcast to the meeting channel. If more than one person is in view, both of you will be broadcast, but with separate session IDs, as if you were on separate cameras. Other meeting participants will show up on your screen and you can start talking. It is optimized for eye contact, meaning that the eyes will actually make it through to the other side as more than just dark pixel clouds, so things should feel a bit more personal than the standard Zoom/Teams/Google Meet call.

Because it uses face rec, you can ONLY show your face, and if you disappear from view your audio will stop after a while, to avoid situations like when you need to go to the restroom but forget to mute. This also solves dick-pics etc.

The codec is not based on H26[45]; it is pure AI that runs on the GPU. One neural network compresses the video in real time, and another decompresses it on the receiving end. Finding a tight network architecture that could do this in real time with acceptable quality was a major part of the effort. Several quality settings are possible, but right now it is set fairly high and at 20 FPS maxes out around 700 kbit/s, though it typically uses about half that. I've demonstrated good results down to around 200 kbit/s, so in theory it should work over satellite links or even Bluetooth. The protocol is UDP with no congestion control but with (Wirehair) FEC to protect against mild packet loss; future versions will detect packet loss and adapt to the available bandwidth.
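To give a rough idea of the shape of such a codec, here is a toy PyTorch sketch. This is not my production architecture (the real networks, plus quantization and entropy coding of the latent, are where the effort went); it just shows the general pattern of a learned encoder/decoder pair:

    import torch
    import torch.nn as nn

    # Sender side: shrink the frame to a small latent that would be quantized,
    # entropy-coded, and shipped over the wire.
    encoder = nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 256x256 -> 128x128
        nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2, padding=1),  # -> 64x64
        nn.ReLU(),
        nn.Conv2d(64, 8, 4, stride=2, padding=1),   # -> 8 x 32 x 32 latent
    )

    # Receiver side: a mirror network reconstructs the frame from the latent.
    decoder = nn.Sequential(
        nn.ConvTranspose2d(8, 64, 4, stride=2, padding=1),
        nn.ReLU(),
        nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
        nn.ReLU(),
        nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        nn.Sigmoid(),                                # pixels back in [0, 1]
    )

    frame = torch.rand(1, 3, 256, 256)               # one camera frame
    recon = decoder(encoder(frame))                  # round trip, (1, 3, 256, 256)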

The audio just uses Opus and may click a little bit; I blame AudioEngine, or the fact that the last time I wrote audio code was for the game I published for the Amiga in 1994.

If you don't have a friend around or multiple devices to play with, there is an "echo test" server mode that allows you to be in a meeting with yourself. Traffic will be peer-to-peer if possible, but otherwise you will be relayed through my tiny Raspberry Pi server, so YMMV. I plan to try switching to something like fly.io soon to improve scalability.

There is also a macOS version coming very soon, and the underlying AI engine also runs on Windows and Linux. Android support is planned.

Please take a look and let me know what you think.

akrymski2 years ago

Well done for putting it out there!

We worked on this about 3 years ago, plus background removal (realtime alpha-matting is still not really done well by anyone), portrait re-lighting (G-Meet is now doing this), and even eye contact (adjusting eye position to create the illusion of looking directly into the camera).

Some findings:

- Competing with dedicated H264/5 chips is very hard, especially when it comes to energy efficiency, which is something that mobile users ultimately care a lot about. Super-resolution on H264 is probably the way to go.

- It's hard to turn this into a business (corporates that pay for Zoom don't seem to care much).

PS Also a big fan of super-tiny AI models (binarized NNs, frequency-domain NNs, etc) for edge applications. Happy to chat!

jacobgorm2 years ago

Thanks!

Wrt the speed, I worked very long and hard on finding the right NN architecture to do this without too much overhead.

My concern wrt super-resolution H264 is that you are going to have to encode and decode the full image anyway, so the cost should be very similar to doing encode-decode with network transmission in the middle. I've tried various DCT and DWT approaches but have not yet found them to be a win; I'd be happy to learn what you guys found out.

I have sent you an invite to connect on LinkedIn; I am https://www.linkedin.com/in/jacob-gorm-hansen-85b724/ if anybody else wants to connect there.

jamra2 years ago

Are you sending occasional key frames and then just some points for a GAN to generate movement over the wire?

gnramires2 years ago

Awesome!

A thought: now that neural compression is becoming widespread, it could be a good idea to put some kind of indicator or watermark stating the compression is neural (learned/function approximation in general). I think this would avoid liabilities and criticism around the fact that some weird things may appear (incorrect detail generation), maybe giving a wrong semantic idea. It may also be a good idea to put a mean squared error term in your objective function to help preserve general meaning.
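Concretely, that suggestion could look something like this (an illustrative PyTorch sketch; codec_loss and the perceptual term are hypothetical names, not the app's actual training code):

    import torch.nn.functional as F

    # Hypothetical combined objective: whatever perceptual / face-weighted loss
    # the codec already trains with, plus a plain pixel-space MSE term that
    # anchors reconstructions to the true image and discourages invented detail.
    def codec_loss(recon, target, perceptual_loss, mse_weight=0.1):
        return perceptual_loss(recon, target) + mse_weight * F.mse_loss(recon, target)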

danuker2 years ago

> incorrect detail generation

Absolutely. Reminds me of Xerox number mangling:

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

sitkack2 years ago

Or when AI resolution enhancement inserts Ryan Gosling's face.

https://petapixel.com/2020/08/17/gigapixel-ai-accidentally-a...

gnramires2 years ago

Interesting, and it may also indicate a way to address this issue with learning.

For example, you could train a network to give semantic image descriptions of significant features in the image, and maybe also transcribe text. Then you can include semantic preservation in the objective, or some kind of graceful degradation when semantic preservation isn't achieved.
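Short of a full captioning model, a crude approximation of that idea is to reuse a frozen classifier's features as a stand-in for "semantics" (an illustrative sketch only; the network choice is arbitrary):

    import torch
    import torch.nn.functional as F
    from torchvision.models import resnet18

    # Frozen pretrained network whose features stand in for image "semantics".
    semantic_net = resnet18(weights="IMAGENET1K_V1").eval()
    for p in semantic_net.parameters():
        p.requires_grad = False

    # Penalize reconstructions whose high-level features drift away from those
    # of the original frame, even when they look plausible pixel-wise.
    def semantic_penalty(recon, target):
        sim = F.cosine_similarity(
            semantic_net(recon).flatten(1),
            semantic_net(target).flatten(1),
        )
        return (1 - sim).mean()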

newaccount742 years ago

> put some kind of indicator or watermark stating the compression is neural

That ship has sailed.

Smartphone cameras and laptop webcams already use machine learning algos to improve low light performance and noise. The result is that the images already contain details that are generated.

And it's impossible to turn off.

mchusma2 years ago

You lost me at "patent pending". This idea has been obvious and in progress for a while now with lots of prior work. The issue is more the standards. Please don't sell this to a patent troll who will harass the industry for 20 years.

sounds2 years ago

Seriously, this hits all the wrong points before sliding into "don't sell this to a patent troll."

"We, the armchair internet, have come to shout at you, innovative dude, that your ideas are worthless and that your interesting product is just fodder for patent trolls," ignores a few minor things -

• The video codec industry is absolutely crawling with overlapping patents enforced by shell companies whose only purpose is to extract license fees from every screen that ever existed. The USPTO seems to not care.

• This video codec was produced cleanly, and does not appear to overlap with any of the existing codecs.

I'd say give this smart person some credit.

latexr2 years ago

Please don’t strawman. I see no evidence of shouting or calling the idea worthless in the root comment. Quite the contrary: if the poster had found it worthless, they wouldn’t be worried about it being sold to patent trolls.

Furthermore, according to the App Store what’s patent-pending is the algorithm that boosts eye contact, so comments on the codec don’t apply.

sounds2 years ago

Not a strawman - I think I'm seeing a video codec. You're seeing a "boosts eye contact" video filter?

Can you clarify your statement? Maybe I missed what you meant.

riskable2 years ago

> You lost me at "patent pending".

I had the same thought: As soon as I saw, "patent pending" I stopped reading. Not worth my time to learn about something that's going to be locked away. Talk to me in 18 years.

Patents on software and algorithms (aka "math") shouldn't even exist.

Kleto2 years ago

Nvidia already showed this last year I think.

So whatever he is trying to patent, big companies have probably already patented something similar.

lmc2 years ago

> This idea has been obvious and in progress for a while now with lots of prior work.

Such as?

latexr2 years ago

Apple’s Eye Contact feature for FaceTime, introduced in 2019: https://www.fastcompany.com/90372724/welcome-to-post-reality...

atleta2 years ago

E.g.: "Neural Image Compression and Explanation" https://arxiv.org/abs/1908.08988

atleta2 years ago

Yep, it's indeed a trivial idea. I've probably seen it mentioned in explanations of convnets: they compress the images more and more as you go deeper in the network (i.e. they extract the features).

Compression+decompression is basically what an autoencoder does. See https://en.wikipedia.org/wiki/Autoencoder .

bambax2 years ago

> Because it uses face rec, you can ONLY show your face, and if you disappear from view your audio will stop after a while, to avoid situations like when you need to go to the restroom but forget to mute.

Of course, the real killer app for Zoom calls is the opposite of this: some kind of deep fakery that makes it seem we're there when we're not.

Yet as it is, this is a fantastic idea. It's surprising video codecs deal so little with the nature of the images (AFAIK) and try to be generalists.

As this demonstrates, not all pixels are created equal.

pvillano2 years ago

OBS allows you to create a virtual webcam, where you can use the live compositor to switch between, say, a looping video of you sitting still and your actual webcam. Smoke and mirrors could include a low frame rate and heavy compression to hide the transition.

TranquilMarmot2 years ago

Somebody tried this and had pretty good success

https://youtu.be/b-VCzLiyFxc

punnerud2 years ago

If you have a helmet with a camera pointing at you, this could work.

Then it could look like you are at home in the meeting, but you are actually out walking in the forest.

willy_k2 years ago

Well, if you just want to trick people into thinking you're present when you aren't, you can just use a video file as your camera feed. There are a bunch of tools to do this. The main issue would be if you want to speak using this setup, but I guess one solution would be to use a model to lip-sync your non-speaking video on the fly, which seems to be under discussion here[0].

[0] https://github.com/Rudrabha/Wav2Lip/issues/358

lostgame2 years ago

>> Well if you just want to trick people into thinking you’re present when you aren’t, you can just use a video file as your camera feed.

I’m immediately reminded of the classic Simpsons episode where Homer plays a poorly-edited five-second loop of himself, Carl, and Lenny, clearly filmed 10-15 years earlier, over the security camera feed.

If faking webcam presence ends up being a thing, it’ll just be another bit of the future the Simpsons nailed.

telesilla2 years ago

OBS virtual camera is another easy way. Load a video of you paying attention, and have a hotkey assigned to a scene with your live cam if needed.

billiam2 years ago

>It's surprising video codecs deal so little with the nature of the images (AFAIK) and try to be generalists.

This. A small number of pixels matter, the rest don't.

gfodor2 years ago

I have a prototype thing that creates a virtual webcam and drives an avatar off of your microphone. It's not photorealistic, but it works.

bambax2 years ago

Please share!

cupofpython2 years ago

>and if you disappear from view your audio will stop after a while

I sure hope this "feature" can be turned off in the settings

kordlessagain2 years ago

Wouldn't this be an "auto mute"? I haven't tried it, but maybe it unmutes when it sees your face again.

jacobgorm2 years ago

Yeah, that is a feature, but I have plans for adding a button that disables it. I like not having to worry about muting my mic when I leave the camera, but there are times when you want the opposite.

Nowado2 years ago

Now that's mixing up user and client.

kzrdude2 years ago

Does it have any interesting "nonlinear" effects where it can show an entirely different face (the wrong face) based on misidentification or even adversary input?

These kinds of failure modes for "AI" are the most interesting to me.

It seems extremely smart, but don't you think that to succeed in a mass-market product (think MS Teams) it would need to be a combined solution? One where it can do this for faces, efficiently, but also continues to work in a predictable way if I want to show an item, a page from a book, my cat, or my kid to other people in the call?

jacobgorm2 years ago

It will not show another person's face, it does not (yet ;) possess that level of understanding of the image contents.

Yes, I agree I have to work more on the business side of things; I'm definitely on the lookout for a business-savvy co-founder, hints at potential companies to partner with, etc.!

noduerme2 years ago

Also, what if you want to share dicks? It would be creepy if they all had eyes. There should be a mode for that where you drop into Jitsi space.

hyperdimension2 years ago

Good lord. If it does, don't ever mention it to anyone on chatroulette. It's bad enough without that horror.

withinboredom2 years ago

I bit my tongue off at one point in my life (jumped from a high height and my knee hit my chin as I was screaming). The fact that it captures most of the details of the scar where it was reattached is phenomenal. Majorly impressed.

There’s some weird banding/rainbow effects around my glasses and the background (not on my face), but that’s the only major artifact that stood out to me.

jacobgorm2 years ago

Thanks!

Glasses are sometimes a little bit of a problem, I don't have enough of those in my training sets.

fao_2 years ago

I mean the obvious question here is... how many BIPOC (Black, Indigenous, People of Colour) do you have in your training sets?

samhw2 years ago

Nah, it's "how many Black, Indigenous, People of Colour do you have who wear glasses and have facial scars from having fallen from a great height while screaming?". If you can't find enough preëxisting BIPOCWWGAHFSFHFFAGHWS people, I suppose you're limited to finding other BIPOC people, giving them glasses, and throwing them from a great height. (Manufacturing them the other way around might be too offensive.)

fao_2 years ago

+1

moritonal2 years ago

Firstly, this is pretty awesome, love it. I have a few questions:

* I applaud the work to have it run on tiny-bandwidths, how hard would it be to up the frame-rate to 60?

* How well does "framing" work? Are you able to add flexible amounts of padding around the head or is very focussed on a face taking up the whole canvas?

* How much does it "cheat"? Is it sending only feature maps, so if I have a spot on my chin, does it lose it in translation?

* How did you build the face-recogniser? Is it bespoke or a library?

* Is there a testing framework? Does it work on diverse faces?

jacobgorm2 years ago

Thanks!

Wrt upping the frame rate, the main problem is that the phone may run a bit hot. Newer iPhones/iPads should be able to handle it just fine, but the older ones based on, say, the A10 might have trouble keeping up, especially with multiple remote parties connected.

* The framing depends on a transformation derived from the face landmarks, and the amount of padding is somewhat flexible (see the alignment sketch at the end of this comment). Distance from the camera seems to impact this, so it could be that my landmarks model needs some tweaking to work better when you are sitting very close to the camera.

* This is closer to being a general video codec than a face-generating GAN, so there is not a lot of "cheating" in that respect. It is optimized for transmission of faces, but other images will pass through if you let them (which I currently don't).

* I built the AI engine and the face recognizer etc. from scratch, though with the help of a former co-founder who was originally the one training our models (in PyTorch). The vertigo.ai home page has some demo videos. We initially targeted Raspberry Pi-style devices, NVIDIA Jetsons, etc., but have since ported to iOS and macOS. Our initial customers were startups, mostly in the US, and a large Danish university that uses us for auditorium head counting.

* It empirically does seem to work on diverse faces, both in real life and when testing on, for example, the "Coded Bias" trailer. Ideally I would like to test more systematically on something like Facebook/Meta's "Casual Conversations" dataset.
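To illustrate the landmark-derived framing mentioned above: this is roughly the standard alignment recipe in NumPy/OpenCV (not my actual Metal implementation; the eye positions are assumed to come from the landmarks model):

    import numpy as np
    import cv2

    # Derive a crop transform from two eye landmarks so the face lands at a
    # fixed position and scale in the output, with padding around the head.
    def align_face(frame, left_eye, right_eye, out_size=256):
        l = np.asarray(left_eye, dtype=np.float64)
        r = np.asarray(right_eye, dtype=np.float64)
        dx, dy = r - l
        angle = np.degrees(np.arctan2(dy, dx))         # head roll to undo
        scale = (0.30 * out_size) / np.hypot(dx, dy)   # desired eye distance sets padding
        center = ((l[0] + r[0]) / 2.0, (l[1] + r[1]) / 2.0)
        M = cv2.getRotationMatrix2D(center, angle, scale)
        # Shift so the eye midpoint lands at a fixed spot in the crop.
        M[0, 2] += out_size / 2.0 - center[0]
        M[1, 2] += out_size * 0.4 - center[1]
        return cv2.warpAffine(frame, M, (out_size, out_size))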

samstave2 years ago

>Danish university that uses us for auditorium head counting.

Just wait until you find out the Chinese have the same, but they train theirs for locating Uyghurs...

Yeah, these technologies are amazing, but also terrifying when viewed through the OBEY lens.

Jolter2 years ago

What are you saying, that the Danish are selling the technology on to the Chinese? As if the Chinese government didn’t already have massively deployed facial recognition tech?

throwaway143562 years ago

For poor hardware, a face generator with a set of mouth and eye states seems a good fallback. It would make a huge difference when both hardware and bandwidth are bad.

espadrine2 years ago

Eventually, a neural-net approach to video codecs is inevitable, as encoding high-level semantics is much denser. I wonder about a few things though:

• How much of the 8.7MB of the app are the weights?

• Did you measure the energy consumption difference between this and H265? Especially considering Apple has hardware acceleration for this.

• Do you plan for a Web port as well?

• Is the performance envelope tied to CoreML, or has the Android version already been confirmed to have the same performance even without a Tensor chip?

• Do you have plans to address the enterprise market? How many participants could you scale to?

(I don’t think any of this would be a fundamental issue, but it could help frame the target market. Maybe phone conversations are not as marketable because of the limited battery, but daily team meetings with ~10 people could have adoption if a few features were added.)

jacobgorm2 years ago

* The weights are currently around 6 MiB uncompressed, but most of the networks can be sparsified to some extent (a pruning sketch follows at the end of this comment), so that could be reduced somewhat. I also have a very fast sparse inference engine, but it is currently not in use, as the main win is on CPU, whereas I am mostly using GPUs for the NNs at the moment since they draw less power.

* I did not measure it methodically, but I am always careful not to overheat the device when testing (Xcode lets you track this). My main testing device is an iPhone 11, and battery drain does not seem to be an issue compared with e.g. Zoom or FaceTime. Where H265 currently wins is when you want to run at higher resolutions, but H265 is not available everywhere, for example on slightly older iPads, and on Windows no license is included unless the customer pays separately.

* A WebGPU port would be nice, but I am currently waiting for the APIs to stabilize. If I can find some funding this will be a priority.

* I am not using CoreML; I write my own Metal compute shaders, though I do use the "neural" parts of the Apple GPU through other APIs (MPS). I also have support for D3D and OpenCL, but have only tested the latter on teensy Mali GPUs, which at the time did not show impressive performance. On Android my approach would be to target Vulkan now that OpenCL is deprecated; I believe I have most of the plumbing in place, and speculate that things would work on modern mid-to-high-end devices.

* When not cutting code, I am working on a plan for enterprise markets. Personally I have found the macOS version really useful for pair-programming-style scenarios, so that could be what I go after.

(The reason the macOS version is still only in beta is that I hit a bug in AVFoundation where capturing from the web camera seems to burn lots of CPU for absolutely no reason, and I don't want people to come away with the impression that it is my app doing that.)
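As for the sparsification mentioned in the first bullet, generic magnitude pruning in PyTorch looks like this (an illustrative sketch, not our actual pipeline):

    import torch
    import torch.nn.utils.prune as prune

    # Zero out the 50% smallest-magnitude weights of a layer, then make the
    # sparsity permanent so the tensor can be stored and shipped in sparse form.
    layer = torch.nn.Conv2d(32, 64, 3)
    prune.l1_unstructured(layer, name="weight", amount=0.5)
    prune.remove(layer, "weight")                    # bake the mask into the weights
    sparsity = (layer.weight == 0).float().mean()    # ~0.5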

Cadwhisker2 years ago

This looks very similar to NVIDIA's method. Vertigo's website says this is patented; does NVIDIA have prior art here?

https://developer.nvidia.com/ai-video-compression

jacobgorm2 years ago

NVIDIA's solution seems very expensive, using GANs. They synthesize more of the face, where ours is closer to learning a better general image codec by training on faces. I don't think they can run it on current generation edge/mobile devices. Also, our patent to the best of my knowledge does not overlap with theirs.

syrusakbary2 years ago

WOW. This is amazing. I really believe your project can be game changing for the video-call industry.

Have you considered entering the YC program? I think it could be an awesome match. There are many startups I know that may want to make use of your service, and even fly.io is part of the YC family!

Also, have you thought about open-sourcing it? (perhaps using a dual license could work great for an enterprise offering)

jacobgorm2 years ago

Thanks!

I tried entering YC in the fall 2021 batch, and got to the top 10%. I believe my main problem wrt YC is that I currently lack a co-founder, so I did not apply in the Spring as this was still the case.

I am seriously thinking about open source, I believe for instance WebRTC found a good model with dual-licensing, where you have something like AGPL with the option of buying exceptions.

I have had multiple advisors telling me not to, though; they fear it would scare away potential investors ;)

syrusakbary2 years ago

I entered YC as a solo-founder with Wasmer, so I think it might just be a circumstantial thing (they receive a lot of applications so it's always hard to judge who should enter with the limited time and data they have to make a decision). I would really encourage you to apply again!

In any case I'd love to help you on both aspects (YC application and OSS), I believe your idea has really great potential. Please ping me to syrus@wasmer.io and we can schedule some time to chat!

dicknuckle2 years ago

Do what your heart feels is right.

nicr_222 years ago

Nice work, but you might find it's not super unique - video codec people have been thinking about how to apply face recognition ML tech to this use case for 5+ years.

For instance, have you seen https://developer.nvidia.com/maxine ? They released some pretty nice demos 2 years ago.

jacobgorm2 years ago

Their approach is more heavy-weight as it uses GANs (IIRC) to dream up a reconstruction of your face. They need GPU VMs in the cloud, whereas mine runs on device.

samstave2 years ago

> whereas mine runs on device

$$$

This IS the killer feature.

Now make a face-recognition Pi (as you stated you tried), or a cheap Android phone with whatever GPU hardware best serves your needs, and you have solved some complex surveillance problems.

Ventito2 years ago

I'm not sure if the original post is the same as this: https://developer.nvidia.com/ai-video-compression

As Maxine can do more, like live translation.

bsenftner2 years ago

Very cool work. I'd love to sit down and talk with you, jacobgorm. I spent 7 years in FR after failing my startup working on Personalized Advertising, which was based on Automated Actor Replacement in Filmed Media. The VC/Angel world wanted the startup to pursue deep fake pornography, but I refused, and ultimately went bankrupt. However, I managed to globally patent the actor replacement technology, create an automated digital double of real people pipeline, as well as get really deep into face ML/DL. That's how I ended up the principal engineer for the #3 FR company in the world for 7 years. I have since left FR, and am CTO of an AI-integrated 3D media production pipeline (I have a long games industry history). From the information in your post, it sounds like we are both on similar trajectories. It would be interesting and potentially synergistic if we met.

jacobgorm2 years ago

I'd love to talk. Could you ping me at jacob@vertigo.ai?

noduerme2 years ago

This is an amazing idea and I can't wait to try it. Just one question...

>> This also solves dick-pics etc.

Is this a problem on zoom meetings, for people other than Jeffrey Toobin?

Jolter2 years ago

Maybe not on Zoom but I’ll bet it is on FaceTime and Messenger video chats.

nonrandomstring2 years ago

Well done on creating a useful codec. One specifically optimised for face data seems apropos the emerging demand for more intimate remote interaction. Many teachers, therapists, doctors and social workers who conduct remote sessions rely on clear non-verbal signalling and need to read affect.

But the story has a deeper meaning for me (because of the book I am writing). You switched from street face surveillance (an arguably highly unethical application) to more intimate videoconferencing (a more humanistic and socially beneficial end).

May I ask you in all sincerity, what if any ethical considerations played a part in your change of direction?

From the title, I expected to read at least some subtext that you had turned your back on mass surveillance to find a "better" use for your work. But you express no value judgements, and only really mention that the pandemic took away potential targets.

jacobgorm2 years ago

The ethical considerations played a large part of the pivot. I would rather help people communicate than surveil them.

That said, we built embeddable/edge face rec because we could, and I believe our partners who used it in the real, pre-pandemic world found some very innocuous uses for it. In one case we replaced a system running all the faces through Rekognition with one running purely on devices and not storing any long-term data, which I think was an ethics win overall.

UncleEntity2 years ago

> You switched from street face surveillance (an arguably highly unethical application) to more intimate videoconferencing (a more humanistic and socially beneficial end).

Or that makes it easier to identify individuals as they give consent to have their face ‘fingerprinted’ as part of the app’s EULA.

If I were going to sell a mass-surveillance solution, I'd certainly want the ability to identify individuals without having to scrape all of Facebook or whatever. As much as people hate on Apple, they do make it so carrying around one of their phones doesn't make it easy for someone to identify you.

I, for one, would think twice about installing an app from someone who “pivoted away” from their Orwellian surveillance unicorn dreams.

jacobgorm2 years ago

Hi, we don't collect any data from the app, and have filled in the privacy etc. statements on the App Store accordingly.

Ideally I would like to collect faces to train the compression on, in which case we would have to consult with lawyers to come up with an EULA allowing us to do this. The advantage compared to training on broadly available datasets would be more realistic shot noise, low-light images, and so on. I don't see any other valid business purpose for collecting people's faces.
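For illustration, the kind of shot noise that real captures would provide can also be roughly simulated on existing training images (a sketch, not our actual augmentation code):

    import torch

    # Treat pixel intensities in [0, 1] as photon counts and resample from a
    # Poisson distribution; fewer photons means a noisier, lower-light capture.
    def add_shot_noise(img, photons=50.0):
        return (torch.poisson(img * photons) / photons).clamp(0.0, 1.0)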

We've been sitting on the face recognition tech since 2018, so if we'd wanted to become Clearview.ai we probably would have a long time ago.

UncleEntity2 years ago

> We've been sitting on the face recognition tech since 2018, so if we'd wanted to become Clearview.ai we probably would have a long time ago.

It says right at the beginning of the post you were doing quite well until the pandemic shut down businesses.

I try not to be overly critical (I really do), but this is one of those cases where I just can't help myself: I see no reason individual businesses should be running facial recognition on their customers, and I am kind of wary of someone who would enable that. And cities adding it to their collection of public cameras is beyond wrong.

IDK, somewhere we, as a society, decided 1984 was an instruction manual and not a warning…

jacobgorm2 years ago

+1

iforgotpassword2 years ago

This looks awesome. Looking forward to a non-Apple version to try it out myself. Great idea to solve some of the issues of video conferences. I assume it also upscales people when they move away further from the camera, as some sort of welcome side effect. So you only need to be close to the camera once and then make yourself more comfortable a little farther away, or even with suboptimal lighting.

One thing that struck me as odd on the page is that the H265 still looks considerably worse than H264, despite being the better codec and being larger. What's up with that?

jacobgorm2 years ago

In general, H265 is better than H264, but when you really squeeze it, it seems to fall over its own feet. This is measured against ffmpeg's x265 implementation; the hardware-accelerated version on the iPhone will just refuse to go down to that bitrate.

londons_explore2 years ago

How do you deal with network weights versioning?

I assume the version that does the compressing and decompressing needs to match? And if you release an update and half the users install it, this is a problem?

Do you have some mechanism to dynamically download and update weights to ensure that all users in a call at least have a common version of the network? Or will you just globally require all users to update before joining a call? (which in turn means every time you release an update, all calls must end, which isn't very enterprise-friendly)

jacobgorm2 years ago

Good question. The protocol is versioned, so you will be prompted to upgrade if/when I change the network doing the encoding/decoding. Downloading new weights on the fly should not be a problem (Apple would allow it as long as I don't change the code), but in many cases when evolving the protocol I've had to make changes to other parts of the code too, so not sure if it will be worthwhile.

londons_explore2 years ago

Thinking about it... the obvious solution is to make every version support the current protocol and the previous protocol. You might sometimes have to downgrade a call if someone joins with an older client.

Then anyone can join any call unless they are 2 or more versions behind.
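A sketch of that scheme (hypothetical, not the app's actual handshake): each client advertises the set of protocol versions it speaks, and the call runs at the highest version everyone has in common.

    def pick_call_version(participants):
        # Each participant advertises a set of versions, e.g. {4, 5}.
        common = set.intersection(*participants)
        return max(common) if common else None

    pick_call_version([{4, 5}, {4, 5}, {3, 4}])   # -> 4: one older client downgrades the call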

londons_explore2 years ago

One benefit of lower bandwidth is you have the possibility of reducing the glass-to-glass latency, since network queues will be less crowded.

But with this you have the downside of more milliseconds spent compressing and decompressing the frames.

Do you have any indication which effects dominate in typical 4G/wifi environments?

jacobgorm2 years ago

I don't really have a good answer, except that I don't do any frame batching that would cause delays, and that my UDP protocol ships the frames (and audio) out as soon as they are ready, so as to reduce latency to a minimum.

parentheses2 years ago

What’s funny is I’ve had this very idea but no AI skills or extra time to pull it off. Bravo!!

capableweb2 years ago

Yet another example that having ideas is worth close to 0, while being able to execute on your own or others ideas is worth > 0 :)

mromanuk2 years ago

Yes. It's weird and forced when people tell you that ideas are super powerful. There is a good book, "Made to Stick", which is really cool; there's a section where they look at JFK's "get to the moon and back safely" as the driving force behind the whole thing. Thousands or millions before him had the same idea; even the USSR was planning a trip, and the voyage is portrayed in a highly popular 1902 film. What set the 1969 moon landing (by the USA) apart was the execution.

jacobgorm2 years ago

That is where I started ~5 years ago :-) Thanks!

losvedir2 years ago

Really? Impressive! Can I ask how you went about learning it all then? Any books or online courses you can recommend?

jacobgorm2 years ago

I learned by joining an early AI startup with some co-founders who knew about old-school AI (but didn't believe in backprop!), and then reading absolutely every ML paper I could find, following AI hotshots on Twitter, reading the darknet source code, and experimenting with PyTorch.

Eventually two of us left to start Vertigo.ai, and found a customer who would fund a fast object detector to run on an $18 NanoPi. That was a fun challenge and forced me to think about how to make the AI run fast and with a relatively low footprint.

Today fast.ai might be a good starting point; I definitely recommend going with PyTorch, cloning cool projects from GitHub, and going from there.

lyind2 years ago

The future alternative to watching TV: run an AI writer/director/actor pipeline

daenz2 years ago

Very cool, but where is the video demo?

ant6n2 years ago

Very interesting! I’m getting more like 400-600 kbit/s, maybe too much beard and long hair.

The face boxing seems very aggressive; I feel like I'm trapped in the virtual prison of some '90s Doctor Who episode.

jacobgorm2 years ago

Try moving a bit further away from the camera, and placing the device on a steady surface. It might help.

thinkski2 years ago

Very cool. Out of curiosity, why is the H.265 size slightly more than the H.264 size? How does the compute complexity for encoding and decoding compare with those two codecs?

jacobgorm2 years ago

I got these results by using Vertigo's bitrate as the target, and squeezing the other ones until they got as close as possible to that. H265 is in general better than H264, but when you put the thumbscrews on it seems to get itself into a bit of trouble.

Wrt encoding/decoding complexity, this is the major bottleneck, because you have the GPU competing with custom ASICs. I have a version of the codec that works in combination with H265 and still gets largish bandwidth gains, so if all you wanted was insane hi-res, hi-bitrate transmission, that might be the way to go near-term.

ricochet112 years ago

This is really smart, one of those ideas that seems so obvious but I'd never have thought of it. I think the content moderation angle is pretty interesting to expand upon: a lot of livestream platforms have "if x is detected, stop the stream", but this idea of making it impossible to show x in the first place would be much cheaper, if it can be extended to the relevant domains.

mateo12 years ago

I believe that algorithms reconstructing your face or parts of it (i.e. facial expressions) are already in use; it's just not advertised.

sam_bristow2 years ago

This reminds me of an idea I had about 15 years ago but never pursued. The concept was using basic object detection as a first pass for a standard video codec, to guide where it should spend its data budget. So, for example, a TV news broadcast could put more detail into the host's eyes and mouth, while a parking garage camera would be trained to get clearer number plates.

andai2 years ago

Tried it out on my phone; the app crashes every time I press "go to room". I'm using an iPhone SE from 2016, so maybe my phone is too old for the GPU features? Alas! I was looking forward to trying it.

As a side note, the UI looks like a toy or joke app. I'm not sure what market you're going for (it seems like a general-purpose video chat app?) but you might want to reconsider the aesthetic.

jacobgorm2 years ago

Could you post the exact specs? I've tested on iPhone 6s until recently and it used to work, but it could be I am doing something silly.

andai2 years ago

I believe the 6s and (original) SE had the exact same internals. Here is a comparison: https://www.gsmarena.com/compare.php3?idPhone1=7969&idPhone2...

I haven't upgraded to iOS 15 though, I'm on 14.4, maybe that's the issue?

jacobgorm2 years ago

I've gotten a repro of that issue on an A9 iPad (which happened to be on iOS 14.x too, but that was deemed not relevant); it was something silly to do with Metal language versions and was easily fixed. A new build that also fixes the UI issues others have reported is on its way out; please email me at jacob@vertigo.ai if you want beta access via TestFlight. It should be GA within 1-2 days.

londons_explore2 years ago

Can we have a video demo rather than a still image?

savolai2 years ago

Hi. I’m on an iPhone XS. Could you help me understand what the buttons on the right do?

https://imgur.com/a/r8SbYfp

Also not sure: is the URL so that this can be used in a browser by the other party, or (I am assuming) is it just for accessing in the app on the other end too?

Thanks. Inspiring!

jacobgorm2 years ago

Sorry about that; I've gotten that bug reported by a few people today, and I'm not sure what is causing SwiftUI to scale things like that. Looking into it. I think it has more to do with display/font settings than the exact phone model.

Until I can get a fix out, a workaround is to rotate the device to landscape mode; you should be able to read the text on the buttons that way.

The text on the buttons to the right of the QR code should say: "Clear", "Copy URL", "Copy QR".

jacobgorm2 years ago

I believe that issue has been fixed now, thanks for reporting it. Just waiting for app store review...

CTDOCodebases2 years ago

> Because it uses face rec, you can ONLY show your face, and if you disappear from view your audio will stop after a while, to avoid situations like when you need to go the the restroom but forget to mute. This also solves dick-pics etc.

Who would have thought anti dick pic technology would become a product feature but here we are in 2022.

robertlagrant2 years ago

Stock ticker: NOTHOTDOG

daniel_iversen2 years ago

Cool, but I just tried it: if I send the "room" (or whatever) URL from the front page by copying and pasting it into an iMessage to my wife, when she clicks it, it says the page does not exist. Just as an FYI. Same if I recreate the URL, use your "copy" buttons, or even have her scan the QR code.

jacobgorm2 years ago

That is a known and very annoying problem; I think killing and restarting iMessage might help.

I am registering the correct URL handler for the app, but it seems to not always work immediately.

ge962 years ago

Curious: if you were to compare this with standard WebRTC and, say, TensorFlow.js face landmark detection running against the MediaStream, what would the difference be?

edit: oh, the "recognize your face" part and the compression, I see (referring to the NVIDIA link someone else posted). Wow.

TruthWillHurt2 years ago

This is awesome. Reminds me of what NVIDIA did with facial puppetry to reduce bandwidth, but while theirs was just a POC, you're amazing for making it actually available.

Looking forward to Linux port!

wfme2 years ago

This is super cool. The UI on the app could do with a bit of work: it's easy to use, but the text in the buttons next to the QR code wraps and gets cut off.

The actual video works well. Kudos!

jacobgorm2 years ago

I agree the UX needs more work.

swayvil2 years ago

Neural compression has philosophical implications.

The map may not serve understanding. It may serve economics.

The cheap videophones make everybody look like Keanu Reeves.

tr33house2 years ago

Really cool idea. Unfortunately I can't try it because the app crashes for me. I'm guessing it's because of the HN crowd.

Semaphor2 years ago

From another comment:

> The rooms are hosted on a Raspberry Pi 3 lying on the floor of my office.

Hah, probably :D

bsnal2 years ago

Do you have any benchmarks? Because this looks like it will be unusable on the average $300 Windows laptop.

jacobgorm2 years ago

I currently have (the prototype port of) it running on my T14-2 (11th-gen i5) and my 2018 Mac mini without problems, and it's confirmed to work on Macs back to 2015, but older laptops may suffer. The main problem the cheap laptops (and even my quite expensive T14) have is that the cameras are absolutely horrible compared to anything Apple would ship.

chayesfss2 years ago

Didn't take a look yet but will; very cool. Where are the meeting rooms hosted?

jacobgorm2 years ago

The rooms are hosted on a Raspberry Pi 3 lying on the floor of my office. With the traffic it seems to be getting I will move it to a cloud server soonish.

It should do peer-to-peer for the majority of connections though, the server just does the initial hand-shake.

vletal2 years ago

What about battery life on iPhones? Is the hardware utilised efficiently?

jacobgorm2 years ago

Most of the heavy lifting is done on the GPU, which runs at a lower clock rate and uses less energy than the CPU. The neural nets used are fairly light-weight. I think the power consumption is similar to a Zoom call.

bobthehomeless2 years ago

Makes me wonder if this kind of algorithm would work with anything, not just faces, using something like general object recognition or DeepDream as a starting point.

quanto2 years ago

Great work!

Also, where can I learn about your edge AI smart camera system?

jacobgorm2 years ago

https://vertigo.ai/sensoros/ has some info, but you probably read through that already. There is a public github repo, but I think it needs some explanation to be useful. You are welcome to ping me at jacob@vertigo.ai.

julienfr1122 years ago

This is amazing!

jacobgorm2 years ago

Glad to hear it :)

mutant2 years ago

I'm iOS-less; is there a video demo?

ReactiveJelly2 years ago

> I uses real time AI

Should be "It uses"

Only works on iOS?

timeimp2 years ago

impressive-very-nice.gif

This is just a phenomenal idea - I hope your patent is approved too!

jacobgorm2 years ago

Thanks! :)

poniko2 years ago

Impressive work, well done!

jacobgorm2 years ago

Thank you! :)

st3ve4456782 years ago

Does it leverage middle out compression?

jacobgorm2 years ago

We're working on that part.

ochronus2 years ago

Wow! Congrats, amazing work!

jacobgorm2 years ago

Thanks!

sydthrowaway2 years ago

gamechanger

samstave2 years ago

This codec needs to be implemented in Tesla Cars.

Full Stop.

jacobgorm2 years ago

If anyone can hook me up with the right people I'd love to do it :-). Not sure if it's legal to use it that way, but doing video calls with my app while driving with the phone in a holder on the dashboard already works really well.

varispeed2 years ago

> is a new from-the-ground-up patent pending

What is the invention? The models are just complex mathematical formulas and these cannot be patented.

jacobgorm2 years ago

The patent that I filed is not around the models, but in how it boosts the parts of the face most relevant to face-to-face conversations.

I am not a super fan of patents, but for background please consider that Asger Jensen and I could have patented VM live migration in 2002 and chose not to for idealistic reasons, just to see VMware do it.

quickthrower22 years ago

It would be interesting if a foundation with a charter could own a patent, to prevent later trolling.

jazzyjackson2 years ago

If that were the case, Google would not own a patent on PageRank and we wouldn't have to bother with open source audio/video codecs.

jhgb2 years ago

> Google would not own a patent on PageRank

Well, in my country they actually don't. YMMV as per one's location.

kleer0012 years ago

I love the idea and I hope it succeeds.

Only one small bit of cosmetic feedback. Maybe think about hiring a face model to work your demo. I personally don't care, but I think it might improve your optics. And yes, I understand it's a tech demo. But booth babes were a thing (are they still a thing?) for good reason, grab those eyeballs, yo.

samstave2 years ago

Babes are always a thing. Stop objectifying Babes.

kleer0012 years ago

Which is it?