When eight Google scientists wrote the landmark machine learning research paper “Attention Is All You Need,” the world rejoiced. The paper introduced the transformer architecture. The experiments that followed led to the creation of large language models, and the world has not been the same since.

It was 2:35 a.m. on a Sunday in Chennai. Sameer was in the middle of one such experiment: fine-tuning a local version of deepseek-r1:32b. He had been thorough, and he was confident in his design.

He had been building an application on top of the fine-tuned LLM for seven days, with no luck yet. He had put it through countless iterations and trials, and nothing satisfactory had come out of them. This time, he felt, was different. He might just have created a very powerful tool, one that took keywords as input and generated an image involving the objects named in the keywords.

For example, if you entered “Books, boy, table” as keywords, the tool would generate the image of a boy who had dozed off on a pile of books at the study table. If you entered “Water, glass, man,” it would generate the image of a thirsty man drinking a glass of water.

Finally, the development was complete, and Sameer was happy with the results. There were no errors, just four warning messages, which he anxiously LLMed his way through before convincing himself that they were not critical and could safely be ignored.

With a sigh of relief, Sameer started providing the tool with further inputs. “Cat, milk, tumbler.” The newly built application promptly generated the image of a cat trying to drink milk from a tumbler. To his surprise, that was not all that was generated. There was an extension to the output.

Elsewhere, 25 kilometres away, Raju’s cat spotted a tumbler full of milk in the kitchen and ran towards it. She must have misjudged the speed and distance. Before she could stop herself, she slipped and hit the tumbler, spilling the milk on the kitchen floor. Hearing the sound, Raju’s wife came running and scolded the cat for her bad behaviour.

Sameer could view all these real-life and real-time actions on his laptop display, captioned “Elsewhere, 25 kilometres away.” It was 5:15 in the morning. He was left stunned and speechless. This time, hesitantly, he typed, “Boy, cycle, dad.” The tool generated a picture of a dad teaching his son to cycle.

Elsewhere, 7 kilometres away, Raghu ran hard to keep up with his son’s cycling speed. It was 6 in the morning by then. Had Raghu not caught up in time, Sonu would have lost his balance, fallen on the road, and hurt himself. “It is enough for today. Let’s go home. I am hungry,” Raghu instructed Sonu. Sonu was in no mood to stop. He continued to cycle as fast as he could, leaving his dad far behind him.

It was so real that Sameer could feel his own presence with Raghu and little Sonu in that scene. The tool stopped playing the real-life, real-time video and again waited for its user to input more keywords. Sameer typed in “Marriage, celebration.” Immediately, an image came up. It was an image of two people getting married, surrounded by their relatives.

Elsewhere, 200 kilometres away, it was a big day for Himanshu and Jyothi. It had taken them almost four years to convince their parents to agree to the marriage. Finally, both sides had given in to the children’s wishes, and the wedding was happening “now.”

Sameer wanted to try something different this time. He typed in “Friend, car, fun.” The tool generated an image of two friends driving a car and ending up in a severe accident. “Hey, stop, stop!” Sameer shouted at the top of his voice, frantically pressing the space bar on his keyboard to pause it. “I don’t want the accident to happen. Please stop. Two men are going to be badly hurt.” But once triggered, the tool was not to be stopped midway. It had developed an intelligence of its own through countless hours of fine-tuning on historical data points.

Sameer stared helplessly at his laptop display, which had the caption “Elsewhere, 0 kilometres away.” Just then, his doorbell rang!

Disclaimer: Views expressed above are the author's own.
