Detailed Evaluation Criteria & Rubric
This hackathon is strictly focused on technical execution and the mastery of agentic workflows. We are evaluating your ability to build robust, scalable, and secure AI systems, not your business model or market viability.
The judges will evaluate all project submissions based on the following technical rubric:
1. OpenClaw Implementation & Use Case Impact (40%)
This is the core of the hackathon. We want to see how deeply and effectively you utilize the OpenClaw framework to solve a real-world problem within the HealthTech, AgroTech, or FinTech verticals. Judges will look for:
- Advanced State Management: How well does your application maintain memory and context across complex multi-step workflows?
- Tool-Calling Mastery: Seamless integration and execution of external tools, APIs, and functions by the agent.
- Reasoning Loops: The ability of your agentic system to plan, execute, evaluate, and iterate autonomously.
- Vertical Impact: Does the technical solution actually address a meaningful friction point in your chosen industry (Health, Agro, or Finance)?
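As a rough illustration of the "plan, execute, evaluate, and iterate" cycle named under Reasoning Loops, here is a minimal sketch in plain Python. It does not use the OpenClaw API: `run_reasoning_loop` and the three callables passed to it are hypothetical names chosen for this example, and a real agent would replace the toy functions with model calls and tool invocations.

```python
from typing import Callable

def run_reasoning_loop(
    plan: Callable[[str, list[str]], str],
    execute: Callable[[str], str],
    evaluate: Callable[[str, str], bool],
    goal: str,
    max_iterations: int = 5,
) -> tuple[bool, list[str]]:
    """Plan, execute, evaluate, and iterate until the goal is met or the budget runs out."""
    history: list[str] = []
    for _ in range(max_iterations):
        step = plan(goal, history)       # choose the next action from the goal and past results
        result = execute(step)           # carry it out (a tool call in a real agent)
        history.append(f"{step} -> {result}")
        if evaluate(goal, result):       # stop as soon as the result satisfies the goal
            return True, history
    return False, history                # iteration budget exhausted without success

# Toy stand-ins (hypothetical): keep doubling a number until it reaches the goal.
state = {"value": 1}

def toy_plan(goal: str, history: list[str]) -> str:
    return "double"

def toy_execute(step: str) -> str:
    state["value"] *= 2
    return str(state["value"])

def toy_evaluate(goal: str, result: str) -> bool:
    return int(result) >= int(goal)

done, trace = run_reasoning_loop(toy_plan, toy_execute, toy_evaluate, goal="8")
```

The `max_iterations` cap is a deliberate design choice: an autonomous loop without a budget can run indefinitely when the evaluation step never passes.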
2. OpenClaw Setup & Environment Configuration (25%)
Judges will evaluate how well the team leveraged OpenClaw's native capabilities to build a robust, well-configured agent, not just a chatbot with a system prompt.
- Memory & Context Persistence: Does the agent have memory enabled and use it intelligently? Does it retain relevant information across conversations (patient history, crop status, previous transactions)? Or does every interaction start from scratch?
- Soul Definition: How well did the team define the agent's personality, role, and boundaries? Does the soul reflect the chosen vertical (a FinTech agent shouldn't sound like an AgroTech one)? Does it include clear instructions on what the agent can and cannot do?
- Skills & Tools: Did the team create custom skills or only rely on defaults? Are the skills well-structured with clear descriptions so the agent knows when to invoke them? Is the integration with external APIs or tools functional and coherent with the use case?
- Heartbeats & Cronjobs: Is the agent proactive or purely reactive? Did the team configure scheduled tasks that bring real value to the use case (irrigation alerts, medication reminders, periodic financial reports)? Or do the heartbeats feel forced, as if they were added just to check a box?
3. Security & Guardrails (25%)
Real-world AI applications, especially in fields like finance and healthcare, require absolute safety and reliability. Judges will look for:
- Data Privacy: Secure handling of user inputs, simulated sensitive data, and API keys within your environment.
- Prompt Injection Defense: Implementation of strict guardrails to prevent jailbreaking, adversarial attacks, or unauthorized tool usage.
- Hallucination Mitigation: Mechanisms to catch and correct the agent if it begins to hallucinate or drift from its intended task.
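To make two of the criteria above more concrete (preventing unauthorized tool usage and keeping API keys out of logs and replies), here is a minimal, framework-independent sketch. Everything in it is an assumption for illustration: the tool names, the `authorize_tool_call` and `redact_secrets` helpers, and the key pattern are hypothetical and are not part of OpenClaw. This is one possible layer of defense, not a complete guardrail.

```python
import re

# Hypothetical tool names for a FinTech agent; not taken from any real framework.
ALLOWED_TOOLS = {"get_balance", "list_transactions"}
CONFIRMATION_REQUIRED = {"transfer_funds"}

# Rough pattern for strings that look like API keys (an assumption; tune per provider).
SECRET_PATTERN = re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{8,}\b")

def redact_secrets(text: str) -> str:
    """Replace anything that looks like a credential before logging or replying."""
    return SECRET_PATTERN.sub("[REDACTED]", text)

def authorize_tool_call(tool_name: str, user_confirmed: bool = False) -> str:
    """Return 'allow', 'ask_user', or 'deny' for a tool call requested by the model."""
    if tool_name in ALLOWED_TOOLS:
        return "allow"
    if tool_name in CONFIRMATION_REQUIRED:
        # Critical actions run only after an explicit confirmation from the user.
        return "allow" if user_confirmed else "ask_user"
    # Anything not listed is denied by default, including tools named in injected text.
    return "deny"
```

The deny-by-default branch is the important part: a prompt injection that persuades the model to request an unlisted tool still fails at this check, because the decision is made in code rather than in the prompt.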
4. Communication & Platform Integration (10%)
Judges will evaluate how effectively the team connected their agent to real communication channels and how well it behaves within those platforms.
- Channel Integration: Which platforms did the team connect (WhatsApp, Discord, Slack, Telegram, iMessage)? Is the agent accessible where its target users actually are? A HealthTech agent on WhatsApp makes more sense than one only available through a local UI.
- Multi-Platform Consistency: If the agent is deployed across multiple channels, does it maintain a consistent experience? Do memory and context carry over between platforms, or does switching channels break the flow?
- Workspace & Team Readiness: Did the team go beyond personal messaging and integrate into collaborative environments (Slack workspaces, Discord servers, Microsoft Teams)? Is the agent ready to serve a team or organization, not just a single user?
- Conversational Reliability: Does the agent communicate clearly and concisely within the platform's constraints? Does it handle errors gracefully, confirm before executing critical actions, and provide progress feedback on longer tasks? Does it recover well when it doesn't understand a request?
- Platform-Native Behavior: Does the agent take advantage of platform-specific features (Discord slash commands, WhatsApp buttons/lists, Slack threads, rich message formatting)? Or does it behave the same generic way regardless of where it's running?
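For teams estimating where to spend their time, the four category weights in this rubric (40%, 25%, 25%, 10%) can be combined as a simple weighted sum. The sketch below assumes each category is scored on a 0 to 10 scale and that scores are combined linearly; the rubric does not specify either, so treat both as assumptions. The dictionary keys and the `weighted_total` helper are hypothetical names.

```python
# Category weights taken from the rubric headings; the key names are illustrative.
WEIGHTS = {
    "implementation_and_impact": 0.40,
    "setup_and_configuration": 0.25,
    "security_and_guardrails": 0.25,
    "communication_and_integration": 0.10,
}

def weighted_total(scores: dict[str, float]) -> float:
    """Combine per-category scores (assumed 0-10 scale) into one weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

example_scores = {
    "implementation_and_impact": 8,
    "setup_and_configuration": 6,
    "security_and_guardrails": 9,
    "communication_and_integration": 5,
}
total = weighted_total(example_scores)  # 3.2 + 1.5 + 2.25 + 0.5, about 7.45
```

Under these assumptions, one extra point in the first category moves the total four times as much as one extra point in Communication & Platform Integration.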