BinarySplit t1_jdh9zu6 wrote
GPT-4 is potentially missing a vital feature to take this one step further: Visual Grounding - the ability to say where inside an image a specific element is, e.g. if the model wants to click a button, what X,Y position on the screen does that translate to?
Other MLLMs have it though, e.g. One-For-All. I guess it's only a matter of time before we can get MLLMs to provide a layer of automation over desktop applications...
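Roughly, the grounding-to-click loop could look like this (purely a sketch: `ground()` stands in for whatever grounding model you'd plug in, and pyautogui just issues the click):

```python
# Minimal sketch of visual grounding driving UI automation.
# `ground()` is a placeholder for any grounding model that maps
# (screenshot, text query) -> a bounding box in pixel coordinates.
import pyautogui

def ground(image, query):
    """Hypothetical grounding call; returns (left, top, right, bottom)."""
    raise NotImplementedError("plug in a real grounding model here")

screenshot = pyautogui.screenshot()                 # PIL Image of the full screen
left, top, right, bottom = ground(screenshot, "the blue 'Submit' button")
x, y = (left + right) // 2, (top + bottom) // 2     # aim for the centre of the box
pyautogui.click(x, y)
```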
ThirdMover t1_jdhvx8i wrote
>GPT-4 is potentially missing a vital feature to take this one step further: Visual Grounding - the ability to say where inside an image a specific element is, e.g. if the model wants to click a button, what X,Y position on the screen does that translate to?
You could just ask it to move a cursor around until it's on the specified element. I'd be shocked if GPT-4 couldn't do that.
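Something like this feedback loop, roughly (a sketch only; `ask_model()` stands in for the multimodal API call and is assumed to return a pixel offset or a stop signal):

```python
# Sketch of the "nudge the cursor until it's on target" idea.
import pyautogui

def ask_model(image, target):
    """Hypothetical: returns (dx, dy) in pixels, or None once the cursor is on target."""
    raise NotImplementedError

target = "the Save button"
for _ in range(20):                                  # hard cap on round trips
    step = ask_model(pyautogui.screenshot(), target)
    if step is None:
        break
    pyautogui.moveRel(*step, duration=0.1)           # small relative nudge
pyautogui.click()
```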
MjrK t1_jdiflsw wrote
I'm confident that someone can fine-tune an end-to-end vision transformer that can extract user interface elements from photos and enumerate interaction options.
Seems like such an obviously useful tool, and ViT-22B should be able to handle it, or many other computer vision tools on Hugging Face... I would've assumed some grad student somewhere is already hacking away at that.
But then also, compute costs are a b****, although generating a training data set should be somewhat easy.
Free research paper idea, I guess.
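As a rough sketch of what that detector could look like behind the Hugging Face object-detection pipeline (the checkpoint name is made up; the real work is fine-tuning on screenshots labelled with buttons, text fields, links, etc.):

```python
# Hypothetical fine-tuned UI-element detector behind the standard
# object-detection pipeline interface.
from transformers import pipeline
from PIL import Image

detector = pipeline("object-detection", model="your-org/ui-element-detector")  # hypothetical checkpoint
screenshot = Image.open("screenshot.png")

for det in detector(screenshot):
    box = det["box"]            # {'xmin': ..., 'ymin': ..., 'xmax': ..., 'ymax': ...}
    print(det["label"], round(det["score"], 2), box)
```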
modcowboy t1_jdkz6of wrote
Probably would be easier for the LLM to interact with the website directly through the browser's inspect tool (i.e. the DOM) than to train machine vision for it.
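Something along these lines, as a sketch (`choose_element()` is a stand-in for the LLM call; the selectors are only illustrative):

```python
# Sketch of the DOM route: hand the LLM a list of interactive elements
# scraped with Selenium and let it pick which one to click.
from selenium import webdriver
from selenium.webdriver.common.by import By

def choose_element(task, descriptions):
    """Hypothetical: ask an LLM which element index best matches `task`."""
    raise NotImplementedError

driver = webdriver.Chrome()
driver.get("https://example.com")

clickable = driver.find_elements(By.CSS_SELECTOR, "a, button, input, [role='button']")
descriptions = [f"{i}: <{el.tag_name}> {el.text[:60]!r}" for i, el in enumerate(clickable)]

idx = choose_element("open the pricing page", descriptions)
clickable[idx].click()
```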
MjrK t1_jdm4ola wrote
For many (perhaps these days, most) use cases, absolutely! The advantage of vision in some others might be interacting more directly with the browser itself, as well as other applications, and multi-tasking... perhaps similar to the way we use PCs and mobile devices to accomplish more complex tasks.
plocco-tocco t1_jdj9is4 wrote
It would be quite expensive to do, though. You'd have to run inference very fast on multiple images of your screen; I don't know if it's even feasible.
ThirdMover t1_jdjf69i wrote
I am not sure. Exactly how does inference scale with the complexity of the input? The output would be very short, just enough tokens for the "move cursor to" command.
plocco-tocco t1_jdjx7qz wrote
The complexity of the input wouldn't change in this case since it's just a screen grab of the display. It's just that you'd need to run inference at a certain frame rate to be able to detect the cursor, which isn't that cheap with GPT-4. Now, I'm not sure what the latency or cost would be; I'd need to get access to the API to answer it.
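A quick back-of-envelope with made-up numbers shows why the frame rate is the painful part (none of these constants are real GPT-4 figures; they're placeholders):

```python
# Placeholder cost model: tokens per screenshot * price * frames * time.
TOKENS_PER_IMAGE = 1000       # assumed tokens consumed per screenshot
PRICE_PER_1K_TOKENS = 0.03    # assumed $ per 1k input tokens
FPS = 2                       # screenshots sent per second
SECONDS = 60

cost = TOKENS_PER_IMAGE / 1000 * PRICE_PER_1K_TOKENS * FPS * SECONDS
print(f"~${cost:.2f} per minute at {FPS} fps")   # ~$3.60/min with these guesses
```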
thePaddyMK t1_jdlr6bp wrote
There is a paper that operates a website to generate traces of data to sidestep tools like Selenium: https://mediatum.ub.tum.de/doc/1701445/1701445.pdf
It's only a simple NN, though, no LLM behind it.
MassiveIndependence8 t1_jdl9oq9 wrote
You're actually suggesting putting every single frame into GPT-4? It'll cost you a fortune after 5 seconds of running it. Plus the latency is super high; it might take you an hour to process 5 seconds' worth of images.
ThirdMover t1_jdlabwm wrote
What do you mean by "frame"? How many images do you think GPT-4 would need to get a cursor where it needs to go? I'd estimate four or five should be plenty.
SkinnyJoshPeck t1_jdhis65 wrote
I imagine you could interpolate, given access to more info about the image post-GPT analysis. I.e., I'd like to think it has some boundary defined for the objects it identifies in the image, as part of metadata or something in the API.
Single_Blueberry t1_jdhtc58 wrote
What would keep us from just telling it the screen resolution and origin and asking for coordinates?
Or asking for coordinates in fractional image dimensions.
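The fractional-coordinates version is trivial on our side (a sketch; pyautogui is just for illustration):

```python
# Map the model's [0, 1] fractional coordinates to pixels ourselves,
# so it never has to reason about the actual resolution.
import pyautogui

def fraction_to_pixels(fx, fy):
    width, height = pyautogui.size()      # current screen resolution
    return int(fx * width), int(fy * height)

# e.g. the model says the button is at (0.72, 0.18) of the image
x, y = fraction_to_pixels(0.72, 0.18)
pyautogui.click(x, y)
```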
MassiveIndependence8 t1_jdl9s3u wrote
The problem is that it can't do math and spatial reasoning that well.
Single_Blueberry t1_jdnyc2d wrote
Hmm I don't know. It's pretty bad at getting dead-on accurate results, but in many cases the relative error of the result is pretty low.
acutelychronicpanic t1_jdhksvy wrote
Let it move a "mouse" and feed it the next screen at some time interval, in a loop. Probably not the best way to do it, but that seems to be how humans do it.
__ingeniare__ t1_jdhxcds wrote
I would think image segmentation for UI to identify clickable elements and the like is a very solvable task.
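Even a dumb classical-CV heuristic already gets you candidate boxes (a trained segmentation model would replace this, but the output shape is the same):

```python
# Crude sketch: edges + contours as a stand-in for a real UI segmenter.
import cv2

img = cv2.imread("screenshot.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)

contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w > 40 and h > 15:                 # keep roughly button-sized regions
        print("candidate element at", (x, y, w, h))
```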
RustaceanNation t1_jdiekiv wrote
Google's Spotlight paper is intended for this use case.
Qzx1 t1_jdv429m wrote
Source?
shitasspetfuckers t1_jed796l wrote
> Google's Spotlight paper
https://ai.googleblog.com/2023/02/a-vision-language-approach-for.html
ThatInternetGuy t1_jdhpq8y wrote
It's getting there.
DisasterEquivalent t1_jdk10wf wrote
I mean, most apps have accessibility tags for all the objects you can interact with (it's standard in UIKit). The accessibility tags have hooks in them you can use for automation, so you should be able to just have it find the correct element there without much searching.
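Something like this via Appium, which rides on exactly those accessibility hooks (a sketch only; the session options and the identifier are assumptions and depend on the app and Appium version):

```python
# Sketch: drive a native app through its accessibility identifiers
# instead of vision.
from appium import webdriver
from appium.options.ios import XCUITestOptions
from appium.webdriver.common.appiumby import AppiumBy

options = XCUITestOptions()
options.bundle_id = "com.example.myapp"   # assumed target app

driver = webdriver.Remote("http://localhost:4723", options=options)

# An accessibilityIdentifier/label set in UIKit is addressable directly:
button = driver.find_element(AppiumBy.ACCESSIBILITY_ID, "submit_button")
button.click()
```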
eliminating_coasts t1_jdhkkw3 wrote
You could in principle send it four images that align at a corner where the cursor is, if it can work out how the images fit together.
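The cropping side of that is easy (a sketch; pyautogui/PIL just for illustration):

```python
# Crop the screenshot into four quadrants that all share the cursor
# position as a corner, so the model can reason per direction.
import pyautogui

shot = pyautogui.screenshot()             # PIL Image
w, h = shot.size
cx, cy = pyautogui.position()             # current cursor position

quadrants = {
    "top_left":     shot.crop((0, 0, cx, cy)),
    "top_right":    shot.crop((cx, 0, w, cy)),
    "bottom_left":  shot.crop((0, cy, cx, h)),
    "bottom_right": shot.crop((cx, cy, w, h)),
}
for name, img in quadrants.items():
    img.save(f"{name}.png")               # these are what you'd send to the model
```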
CommunismDoesntWork t1_jdia6kb wrote
It can do this just fine
Runthescript t1_jdknxkl wrote
Are you trying to break CAPTCHA? 'Cause this is definitely how we break CAPTCHA.
Suspicious-Box- t1_jdzj7wr wrote
Just needs training for that. It's amazing, but what could it do with camera vision into the world and a robot body? Would it need specific training, or could it brute-force its way to moving a limb? The model would need to be able to improve itself in real time, though.
morebikesthanbrains t1_jdii4y7 wrote
But what about the black-box approach? Just feed it enough data, train it, and it should figure out what to do?