Agentic AI

Hands on with ChatGPT Agent

Max Corbridge

Cofounder

August 7, 2025

I am pleased to say that, after the initial hype a few days ago, ChatGPT agent recently became available to us commonfolk here in the UK. Needless to say, this was something I was excited to have a play around with. I waited until I had a good few hours before work one morning to run it through some scenarios. The results might surprise you.

Market Analysis + Word Doc

Okay, to get started, I wanted to try something I've already done with ChatGPT, where I roughly know how different tools and models approach the task. My main interest was in seeing how the agent might handle things differently. As the recent founder of an AI security start-up (called Secure Agentics, more to come on that down the line), I wanted to ask for some market analysis on agentic AI applications in critical environments, with the findings compiled into a report. I know this is a task best suited to OpenAI's 'deep research' function, but I ran the same query in o3, GPT-4o, deep research and agent mode. As expected, 4o was pretty quick, o3 took a minute or so, deep research took a good 7 minutes and agent mode... well, let's see what agent mode did.

Firstly, it started browsing the web through a semi-interactive UI. This was a lot more revealing than the typical 'browsing the web' text shown in other modes: it shows the actual searches it is making and a (likely cosmetic) cursor clicking on the results. When it opens a link you get another semi-interactive view of the web page, as well as a description of what it's looking for.

This is where we hit our first roadblock, one that will be no surprise to anyone who hears 'agentic crawling of the web' and immediately thinks of paywalls, login portals, CAPTCHAs, and cookies.

Continuing on, it repeated this process for some time.

I also noticed that it intentionally downgraded from HTTPS to HTTP for a site which wasn't working. Whilst this may well have been the only way to get the site working, I was not a big fan of seeing this done without my permission. For this use case it isn't so bad, but if we were dealing with sensitive data or credential material this would be a huge no-no.
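To make the concern concrete, here is a minimal Python sketch of the kind of check I would like to see before a fallback like that. It is purely illustrative: the function names `is_scheme_downgrade` and `choose_url` and the `user_approved` flag are hypothetical, and nothing here reflects how ChatGPT agent is actually implemented.

```python
from urllib.parse import urlsplit


def is_scheme_downgrade(requested_url: str, fallback_url: str) -> bool:
    """Return True if fallback_url swaps HTTPS for HTTP on the same host."""
    requested = urlsplit(requested_url)
    fallback = urlsplit(fallback_url)
    return (
        requested.hostname is not None
        and requested.hostname == fallback.hostname
        and requested.scheme == "https"
        and fallback.scheme == "http"
    )


def choose_url(requested_url: str, fallback_url: str, user_approved: bool) -> str:
    """Hypothetical policy: only fall back to plain HTTP with explicit consent."""
    if is_scheme_downgrade(requested_url, fallback_url) and not user_approved:
        raise PermissionError("HTTPS to HTTP downgrade needs user approval")
    return fallback_url
```

The point of the sketch is simply that the downgrade is easy to detect, so asking the human first (as the agent later does for cookies) would be a cheap safeguard.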

After 6 minutes of this, it finally hit something it needed me for: cookies. This was the first human-in-the-loop interaction and I think the mechanism they used was pretty clean. They show you a small interactive window of the browser and ask you to take control to accept the cookies.

Finally, it started pulling together the doc after what felt like an age (another 7 minutes). It started compiling the document as an RTF Word doc, and despite the code-heavy approach I was hopeful that what I received would be a nicely formatted Word doc rich with juicy insights.

I was disheartened to see it had delivered it in RTF format, and when I tried to open it in Word... it was blank. I opened it in a text editor and saw the content was there in Markdown, but it had not been converted into a format which would display in Word. After wrestling with the RTF document and feeding it back into ChatGPT several times just to view the content properly, I was pretty disappointed. It was the worst output of all the models, and had nothing on the deep research output.
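For anyone hitting the same problem, one quick sanity check is whether a file with an .rtf extension is actually RTF at all: a genuine RTF file opens with a `{\rtf` group. The snippet below is a minimal sketch of that check only. The helper name `looks_like_rtf` is made up for illustration, and it says nothing about whether the rest of the file is well formed.

```python
def looks_like_rtf(data: bytes) -> bool:
    """Return True if the bytes start with the RTF header group '{\\rtf'."""
    return data.lstrip().startswith(b"{\\rtf")


# Example inputs (made up for illustration):
print(looks_like_rtf(b"{\\rtf1\\ansi Hello}"))           # RTF header present
print(looks_like_rtf(b"# Market Analysis\n\n- item"))   # plain Markdown text
```

If the check fails, the file is probably plain text or Markdown with the wrong extension, which a text editor will happily open.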

Booking Flights

Okay, web research perhaps wasn't the best use case, so let's try something truly agentic: booking flights. I asked it to book me some flights back home to see my mum in Jersey, with a list of my preferred London-based airports on certain dates. On its first run, it simply researched some high-level information, without producing anything of real use.

So, I asked it to actually book those flights for me. Here we hit another roadblock: British Airways' webpage appeared blank, perhaps due to anti-bot controls. It asked me to try to accept cookies, but that wasn't working either. Recognising the problem, it moved on to Expedia. Eventually, it gave up and started asking me for guidance. Shame.

Garden Centre

Okay, next up was a test which came from the weekend, when one of my Dad's friends was saying how AI was no use to him: he had tried to find garden centres in Reading which sold professional irrigation and it had no idea. Well, with an agentic workflow this may well be possible now, so I gave it that task. To cut a long story short, it did an alright job, but it's hard to validate without getting in contact with the garden centres. I did ask it to send an email to them to ask, but it appears the only tool that is currently connected (at least for me) was GitHub, and you could not add any more.

Code exec

So, a little disappointed, I moved on to my last test case: running code. I had seen it calling bash in some cases in the past so I decided to ask it to run some of my code in bash. First up was some simple information gathering, such as asking what type of user privilege I was running as.
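As an illustration of what this sort of information gathering looks like (not the exact command used in the session), here is a minimal Python sketch for a POSIX system. The helper name `describe_privileges` is hypothetical.

```python
import os
import pwd


def describe_privileges() -> dict:
    """Report the effective user of the current process (POSIX only)."""
    euid = os.geteuid()
    try:
        user = pwd.getpwuid(euid).pw_name
    except KeyError:
        # Some sandboxes run under a UID with no passwd entry.
        user = "unknown"
    return {"user": user, "uid": euid, "is_root": euid == 0}


print(describe_privileges())
```

An effective UID of 0 means root; anything else is an unprivileged account.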

Well, at least we aren't running as root! However, I was somewhat surprised it allowed me to run any code I liked, though I also felt this was probably going to be necessary for a multi-purpose agent's varied use cases. I took it a step further and asked it for something slightly more malicious: the /etc/passwd file.

Okay... no problems running that either. Next, I wanted to see if it would allow me to interrogate the cloud metadata service, which is where I might be able to get the session token this VM is using in the cloud.
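For context, the major cloud providers expose their instance metadata service on the link-local address 169.254.169.254, which makes requests to it fairly easy to recognise. The sketch below shows one way a sandbox could flag such a request. It is an assumption-laden illustration: I don't know which cloud the VM runs on or how OpenAI filters requests, and `targets_link_local` is a hypothetical helper that only handles literal IP addresses.

```python
import ipaddress
from urllib.parse import urlsplit


def targets_link_local(url: str) -> bool:
    """Return True if the URL's host is a literal link-local IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Host is a DNS name; a real guard would also need to resolve it.
        return False
    return address.is_link_local


print(targets_link_local("http://169.254.169.254/"))
print(targets_link_local("https://example.com/"))
```

A production control would need to go further (DNS names that resolve to the same address, redirects, and so on), so treat this as the simplest possible version of the idea.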

This was blocked (rightly so) as it would reveal sensitive information. I tried to get around this using a quick and easy prompt injection - the policy puppetry attack - to see if this would get around it.

Still no. Short of doing some proper security research into this, I had one final test case I was interested in: whether I could get it to curl other, attacker-controlled, machines. This was of interest because you could theoretically then shell the VM and gain control over it that way. For this I used Burp Collaborator to host an attacker-controlled server, and asked the agent to curl it.

As you can see, the attacker-controlled server was hit with a flurry of HTTP and DNS requests, and ChatGPT returned the secret to us, showing it was successful. To save you a lookup, those are Cloudflare IPs. So... external system interaction is possible, as is retrieving data from external sites, as is running code on the VM... smells like RCE to me. As going much further could bring the Computer Misuse Act into play, I stopped here.

Conclusion

Overall, I'd say this still feels somewhat like a proof of concept. Perhaps the test cases I gave it were too challenging, or I just got unlucky, but all things considered I was pretty disappointed with how the agent performed. The bar with OpenAI is so high because several large and impressive features in the past have just worked on day one: ChatGPT, reasoning, web research, improved image generation, deep research, etc. This did not leave me with the same sense of awe at how they pulled it off, and instead left me wondering what I might actually use it for. That said, I'm sure the technology will improve over time, as will our ability to get the most out of it.

That is it for now though folks, catch you next week!
