January 2025 - Sean Foley

January 7, 2025

Nvidia Personal AI Super Computer

At CES, Nvidia just announced at Project Digits which is branded as your “personal AI Super Computer.” What makes this interesting:

128GB of memory is enough to run 70B models. That can open up some new experimentation options.
You can link a couple of these together to run even larger models. This is the technique used for data center shizzle, so bringing this to the desktop is cool.
The Nvidia stack. Having access to the Blackwell architecture is sweet, but the secret sauce is the software stack, specifically CUDA. This is really Nvidia’s moat that gives them the competitive advantage. Build against this, and you can run on any of Nvidia’s stuff from the edge on up to hyperscale data centers.

If you are a software engineer, IMO it’s worth investing some $ in this type of hardware vs. spinning up cloud instances to learn. Why? There are things you can do locally on your own network that allow you to experiment/learn faster than in the cloud. For instance, video feeds are very high bandwidth that are easier to experiment with locally than pushing that feed to the cloud (and all the security that goes with exposing outside your firewall.)

Some related posts….
https://www.seanfoley.blog/visual-programming-with-tokens/
https://www.seanfoley.blog/musings-on-all-the-ai-buzz/

January 3, 2025January 3, 2025

Visual Programming with Tokens

I bought a few Nvidia Jetson devices to use around the house and experiment with. I went with these vs. a discrete GPU + desktop machine because of power consumption: A desktop machine + GPU will use 300W+, and these Jetson edge devices use 10-50W.

I normally experiment with machine learning & AI shizzle using either a Jupyter notebook or a python IDE. But in this demo, I decided to check out Agent Studio. You fire up the container image, open up the link in your browser, and start dragging/dropping shizzle on to the canvas. Seriously rapid experimentation.

The video source is an RSTP video feed from my security camera.
The video output also produces an RSTP video feed. Not shown in the demo but I also experimented with overlaying the LLM output (tokens) with the video source to produce an LLM augmented video feed showing what it “saw”
This feeds into a multimodal LLM with a prompt of “describe the image concisely.”
I wire up the output to a text-to-speech model. Since video/LLM is operating in a constant loop, I also experiment with wiring up a deduping node.

This demo allowed my to get an idea of how these bits perform. I was interested in tokens/sec, memory utilization, and CPU/GPU utilization. Next, I plan to build out an Agentic AI solution architecture for my home security.

January 2, 2025January 2, 2025

Musings on all the AI Buzz…

Some think we’re in an AI bubble where AI is being over-hyped. Massive data centers are being built specifically for AI hardware (aka NVIDIA). Power consumption is so extreme that Big Tech wants to go nuclear.

I’ve used the foundational models. I have also used LLMs on the edge. I watched the GTC March 2024 Keynote with NVIDIA CEO Jensen Huang.

Here are the things that stuck with me:

Gen-ai is good at extracting context from unstructured data. This is a game changer.
There is a ton of investment $$$ flowing into this space. This will impact how we build software and what/how users interact with computers. This is very much like the previous Dot Com boom where the Internet/Ecommerce would fundamental change commerce… it was just the huge investments were too early for the “other” technical changes needed to make ecommerce ubiquitous: mobile phones and cheap networking.
Data centers and hyper-scale are needed for some workflows, but you can’t overcome physics and the round-trip latency of data transfer, which makes true interactive multimodal interactions challenging.
AI at the Edge is where I think the future is headed. We will want to use our phone or tablet with multi-modal gen-ai to assist us with “things.” This requires low-latency. For example, the speech-to-text models running on your phone performing real-time transcription of a voice mail is immensely useful to avoid spam calls.
According to Jensen Huang’s keynote, the future of programming is tokens. You feed tokens around to different “AI’s” that are specialized in a particular domain. This is the basis for Agentic AI.

I’ve seen shifts in tech and the software industry over my career. Historically, it has been about higher and higher level of abstractions:

Assembly abstracted machine op codes. C/C++ abstracted assembly.
Managed languages abstracted unmanaged languages which conversely abstracted the host CPU and hardware architecture (write once/run anywhere).
Operating systems abstracted the computer hardware
Databases abstracted the file system
Sockets abstracted the network.

But in each one of those abstractions, you still wrote code to sequence all that shizzle. It was deterministic logic based on discrete math. Now, in this new gen-ai world you decompose your domain into tokens. And within this domain, you have sub-domains that are specialized/optimized AIs. Each one of these gen-ai domains produces a probabilistic result based on linear algebra and statistics. Basically, it is a very smart guesstimate.

So now, tokens are abstracting the programming languages. And the holy grail everyone is chasing is Artificial General Intelligence (AGI) where the AI can do its own planning/reasoning. This abstracts out the programming language because the computer can figure out its own shizzle.

If you are a software engineer, you will absolutely need to add these tools to your toolkit. And I am not talking about just using a gen-ai API LLM wrapper. You really should dig behind that LLM API wrapper and:

Learn how machine learning models are created and build your own “hello world” model. You could build the Not Hotdog App and become the next tech unicorn.
Learn about distillation techniques to reduce model size.
Learn about how Low-Rank Adaptation of Large Language Models (LoRA) can be used for fine tuning LLMs.
Learn/experiment with how quantized LMMs behave.
How to use Ollama to load your own LLM and how the various settings influence model behavior and performance.