Evaluating GenAI Platforms with Scoring Framework

I’m checking this out. As we need to select the “best” platform for our genAI applications, the permutations become truly daunting. Adrian provides a framework and code to do so - scoring and promoting the highest rated candidates. Fascinating!

Adrian Cockcroft

3w Edited

I created a new repo/tool today to evaluate and collect the rapidly changing tooling configurations that everyone is trying to figure out (using statistical experimental design) I used Claude/Gastown to both make it and operate it and have some initial comparison data on opus/sonnet and Python/TS/Go etc. for a small test. I’d be happy for some github stars if people think it could be useful. https://lnkd.in/gHYmbUXj - Edit: a few more hours on Monday and it’s coming along well. Interactive html dashboards, six languages and a small and large application. (Spoiler, Go wins the over all comparison…)

GitHub - adrianco/retort: Platform Evolution Engine. Distill the best from the combinatorial mess. github.com

To view or add a comment, sign in

More Relevant Posts

Bauplan

3,140 followers
3w
Report this post
The next generation of data infrastructure won't be built for analysts. It'll be built for agents. And honestly, most data platforms aren't ready for that yet. They were designed for humans- people who slow down, double-check, and ask for sign-off before anything hits production. Agents work differently. They move fast, they iterate, and they need infrastructure that can keep up without breaking things. That means Python-native, isolated by default, atomic merges, instant rollbacks. Not manual guardrails. On April 21 at 9am PT, we're showing what that looks like in practice- live, in #Python, with our friends at dltHub. Looking forward to seeing you there :)) Register: https://lnkd.in/eqVQ5C5n

Bauplan + dltHub: Python Based Data Infra for AI Agents · Luma luma.com
Like Comment
To view or add a comment, sign in
Zhongyue (John) Yang
2w
Report this post
Excited to share ParaLLeM developed by Galen Wei, a Beckman Scholar in our group mentored by Shone Ran!!! ParaLLeM is a Python library for building and running agentic LLM workflows at a large scale with 50% reduced token cost. Galen adopts a developer-first design: you keep orchestration logic in plain Python, then switch between sync and batch execution in one line. ParaLLeM is built around the Batch API, supports OpenAI, Anthropic, and Google models, and includes features such as structured output, function calling, web search, multimodal input, detailed logging, and reproducible traces. It is designed to help teams move from prototyping to scalable, production-grade workflows without rewriting their code, while batch mode can also reduce token cost by up to 50% for the same application. Github: https://lnkd.in/ekmu9AU7 Web: https://parallem.org/

Build agentic workflows at scale. parallem.org
Like Comment
To view or add a comment, sign in
Ahmad Gayibov
5d
Report this post
I built 4 AI agents from scratch. No LangChain. No AutoGen. Just the Anthropic SDK and Python. Agent 01: answered a Kubernetes question. Cost $0.002. Agent 02: searched the web in real time. Cost $0.18 — 90x more. Here's why that matters. Agent 03: connected to my live Kubernetes cluster and ran kubectl commands autonomously. Agent 04: created a full K8s stack from one text description. The difference between a chatbot and an agent is one while loop. Full breakdown in the article 👇

1 Comment
Like Comment
To view or add a comment, sign in
Grzegorz Ozanski
3w
Report this post
Spent some time recently improving test stability in a small Python project. What looked like “random failures” at first turned out to be a mix of: - leftover state between runs - timing issues - and assumptions that were true locally but not always in CI Nothing unusual — but a good reminder how fragile systems can become when small things accumulate. The interesting part wasn’t fixing individual tests, but understanding *why* the system behaved differently depending on the context. In the end, the most valuable changes were: - making state explicit instead of implicit - isolating test cases properly - and improving observability (logs > guesswork) Not a big project, but a good example of how reliability issues are often less about tools and more about system behavior. Curious how others approach this — do you treat flaky tests as isolated problems, or as signals of deeper issues?
Like Comment
To view or add a comment, sign in
Dominikus Nold
4w
Report this post
I've been running a public AI-assisted delivery pipeline for Python for months now - research, OpenSpec planning, TDD, evidence recording, pre-commit gates, PR review, release. The interesting part: the tool enforces the same workflow it uses itself. Full walkthrough coming soon 👇
1 Comment
Like Comment
To view or add a comment, sign in
Hila Ramati
4w
Report this post
One misconfiguration in a GitHub Actions workflow. Fast forward to Friday evening and we’re tracking six throwaway GitHub accounts, 500+ malicious PRs, and an attacker who spent three weeks iterating on payloads before anyone noticed. Always a pleasure digging into these with Rami McCarthy, Benjamin Read and Scott Piper 🔍

Rami McCarthy

Principal Security Researcher at Wiz
4w

Spent yesterday digging into the prt-scan campaign that hit GitHub last week. The public reporting focused on the final wave, but the real story starts three weeks earlier. Turns out all there were actually six accounts that trace back to one operator. The first wave was just 10 PRs testing injection vectors. By the end, they were pushing 475+ in 26 hours with AI-generated payloads that adapted to each repo's tech stack. The payloads got smarter, but not smart enough. The LLM kept hallucinating files like pip.py that don't exist in any standard Python project. Confidently wrong. Success rate was under 10%, but at 500+ attempts that still meant dozens of compromises. Volume is the only strategy. Swipe through for a preview, or check out the whole blog for the deep dive: https://lnkd.in/dvC8JWhU
Like Comment
To view or add a comment, sign in
Azhagan S
1w
Report this post
I’ve spent the last few weeks moving beyond just analyzing data to actually building the systems that create it. It’s been a massive learning curve, but I’ve finally started to wrap my head around: 𝗠𝗼𝗱𝗲𝗹𝘀 & 𝗔𝗱𝗺𝗶𝗻: How the database actually structured. 𝗩𝗶𝗲𝘄𝘀 & 𝗧𝗲𝗺𝗽𝗹𝗮𝘁𝗲𝘀: Passing variables between the backend and the UI. 𝗣𝗲𝗿𝗺𝗶𝘀𝘀𝗶𝗼𝗻𝘀: Managing users and groups. 𝗕𝗼𝗼𝘁𝘀𝘁𝗿𝗮𝗽: Keeping things clean and responsive. Huge shoutout to my mentor Rathan Kumar for the guidance. Check out the live site - https://lnkd.in/g-jkXR5x On to the next build! #Django #Python #DataEngineering #Learning #FullStack
Like Comment
To view or add a comment, sign in
Jamil Hanouneh
2w
Report this post
I had the chance to delve into the Noise2Void codebase from zero and I must say I absolutely adored the way it was organized. The project has the workflow decoupled into very specific parts such as data handling, masking, and model training, thus making it easier for the reader to grasp the contents of the code. What I found to be the most remarkable part is the trick that the code employs through the use of custom functions and training logic to hide pixels and let the network predict from surrounding context. This is an excellent project that exhibits how well clean Python structure and TensorFlow-based modeling can work together in a smart self-supervised learning environment. Another thing I found useful was the codebase with its logic being modular, which gives the pipeline a practical and well-engineered feeling instead of being too complicated. Repo: https://lnkd.in/eF7nZXxX
Like Comment
To view or add a comment, sign in
Rohith Reddy
2w
Report this post
The "Shadow" Fix: Python Version Compatibility **Hook:** Building for the "Latest & Greatest" is easy. Building for the "Real World" is where the engineering gets messy. **Body:** While finalizing my Enterprise RAG pipeline, I hit a silent production-breaker: A `TypeError` buried deep in a third-party dependency. The culprit? The `llama-parse` library uses Python 3.10+ type union syntax (`|`), but the production environment was locked to Python 3.9. Result: Immediate crash on boot. Instead of demanding a system-wide upgrade—which isn’t always possible in locked-down enterprise environments—I implemented a **Graceful Fallback Logic**: ✅ **Dynamic Imports**: Wrapped the cloud-parser initialization in a guarded `try-except` block. ✅ **Smart Routing**: If the Python environment is incompatible, the system automatically redirects to a local, high-fidelity `PyMuPDF` parser. ✅ **System Resilience**: The app stays online, the UI remains responsive, and 99% of RAG functionality remains available without a single user noticing a failure. Real Engineering isn't just about using the best tools—it’s about writing code that doesn't break when the environment isn't perfect. #Python #SoftwareEngineering #RAG #AIEngineering #SystemDesign #Resilience
Like Comment
To view or add a comment, sign in
Rizwan Wangde
1w
Report this post
Build a production AI agent in 10 lines of Python. Strands Agents SDK: ```python from strands import Agent from strands.models.bedrock import BedrockModel from strands_tools import calculator, web_search agent = Agent( model=BedrockModel("anthropic.claude-sonnet-4-20250514-v1:0"), tools=[calculator, web_search] ) response = agent("What's the GDP per capita of the top 5 economies?") ``` That's it. Tool calling, conversation management, streaming, multi-turn context is all handled. Why Strands over LangChain for AWS: - Built for Bedrock integration (not retrofitted) - Works with any Bedrock model - ToolSimulator for agent testing (just released) - Strands Evals for evaluation pipelines - Open-source, runs locally For production: pair with Bedrock AgentCore for managed deployment, guardrails, and observability. The stack: Strands (build) → ToolSimulator (test) → Strands Evals (evaluate) → AgentCore (deploy) Full lifecycle. Open-source foundation. Managed deployment when ready. Start: https://strandsagents.com #AWS #AIAgents #Python #Strands #Bedrock

Strands Agents — Open Source AI Agent SDK for Python & TypeScript strandsagents.com
Like Comment
To view or add a comment, sign in

1,721 followers

242 Posts

View Profile Connect

Evaluating GenAI Platforms with Scoring Framework

More Relevant Posts

Explore related topics

Explore content categories