How is the app organized?

This use case evaluates AI technologies' ability to give a clear and useful overview of a codebase. A useful overview is:

  • Easy to understand

    • Incremental - Able to start simple and then give increasingly complex explanations

    • Possibly visual

  • Accurate

  • Able to explain:

    • How the app is structured:

      • What is the file/module pattern?

    • What is the overall architecture?

    • What technologies are being used to solve key challenges?


Evaluation Dimensions

Codebase

  • Metadata about the repo being tested, not the tool itself (a sizing sketch follows the LOC and NOF lists below).

Lines of code (LOC):

  • S - 0–5,000

  • M - 5,001–50,000

  • L - 50,001–250,000

  • XL - 250,001+

Number of files (NOF):

  • S - 1–100

  • M - 101–500

  • L - 501–2,000

  • XL - 2,001+
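
As a rough illustration, a small script can bucket a repo into these size classes. A minimal sketch (Node.js/TypeScript; the thresholds are the ones above, and the naive line count includes comments, blank lines, and any text file it finds):

```typescript
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

// Walk a repo and count files and lines (naive: counts every file,
// including comments and blank lines; skips node_modules and .git).
function countRepo(dir: string): { files: number; lines: number } {
  let files = 0;
  let lines = 0;
  for (const entry of readdirSync(dir)) {
    if (entry === "node_modules" || entry === ".git") continue;
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      const sub = countRepo(path);
      files += sub.files;
      lines += sub.lines;
    } else {
      files += 1;
      lines += readFileSync(path, "utf8").split("\n").length;
    }
  }
  return { files, lines };
}

// Map raw counts onto the S/M/L/XL buckets defined above.
function locBucket(lines: number): string {
  if (lines <= 5_000) return "S";
  if (lines <= 50_000) return "M";
  if (lines <= 250_000) return "L";
  return "XL";
}

function nofBucket(files: number): string {
  if (files <= 100) return "S";
  if (files <= 500) return "M";
  if (files <= 2_000) return "L";
  return "XL";
}

const { files, lines } = countRepo(process.argv[2] ?? ".");
console.log(`LOC: ${lines} (${locBucket(lines)}), files: ${files} (${nofBucket(files)})`);
```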

Popularity (1-5) (npm weekly downloads / GitHub stars; a lookup sketch follows this list):

  • 5 - 5M+ / 50,000+ (React, Express)

  • 4 - 1M–5M / 20,000–49,999 (Angular, Vue)

  • 3 - 100K–1M / 5,000–19,999 (Svelte, Vite)

  • 2 - 10K–100K / 1,000–4,999 (SolidJS, Remix)

  • 1 - <10K / < 1,000 (niche or custom/legacy frameworks)
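
Both counts can be pulled from public endpoints. A minimal sketch (assuming the public npm downloads API and the unauthenticated GitHub REST API; the thresholds encode the table above, with stars used as the fallback signal):

```typescript
// Weekly downloads from the public npm downloads endpoint.
async function weeklyDownloads(pkg: string): Promise<number> {
  const res = await fetch(`https://api.npmjs.org/downloads/point/last-week/${pkg}`);
  const body = (await res.json()) as { downloads?: number };
  return body.downloads ?? 0;
}

// Star count from the GitHub REST API (unauthenticated, rate-limited).
async function githubStars(ownerRepo: string): Promise<number> {
  const res = await fetch(`https://api.github.com/repos/${ownerRepo}`);
  const body = (await res.json()) as { stargazers_count?: number };
  return body.stargazers_count ?? 0;
}

// Map downloads/stars onto the 1-5 popularity scale above.
function popularityScore(downloads: number, stars: number): 1 | 2 | 3 | 4 | 5 {
  if (downloads >= 5_000_000 || stars >= 50_000) return 5;
  if (downloads >= 1_000_000 || stars >= 20_000) return 4;
  if (downloads >= 100_000 || stars >= 5_000) return 3;
  if (downloads >= 10_000 || stars >= 1_000) return 2;
  return 1;
}

// Example: score React.
const score = popularityScore(await weeklyDownloads("react"), await githubStars("facebook/react"));
console.log(`Popularity: ${score}`);
```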

Result Evaluation

Rubric score (1-5) (Non-specific):

  • 5 - Excellent: Outstanding result. Fully meets or exceeds expectations.

  • 4 - Good: Very solid result. Minor issues or gaps. Mostly meets expectations.

  • 3 - Fair / Mixed: Acceptable but with noticeable flaws. Partially meets expectations.

  • 2 - Poor: Low quality or off-target. Major issues. Barely meets intent.

  • 1 - Unusable: Completely fails to meet expectations. Not usable.

How well the actual output produced by the tool/prompt combo performed.

Accuracy (1–5): Is the output factually and functionally correct based on your knowledge of the repo? (a scoring sketch follows this list)

  • 5 - 100% of the sentences are correct

  • 4 - >= 85% of the sentences are correct

  • 3 - >= 60% of the sentences are correct

  • 2 - >= 45% of the sentences are correct

  • 1 - < 45% of the sentences are correct
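
To apply this scale, count the sentences in the output, verify each against the repo, and map the fraction onto the rubric. A minimal sketch of that mapping (thresholds taken directly from the list above):

```typescript
// Map the fraction of verified-correct sentences onto the 1-5 accuracy scale.
function accuracyScore(correct: number, total: number): 1 | 2 | 3 | 4 | 5 {
  const fraction = correct / total;
  if (fraction === 1) return 5; // 100% of sentences correct
  if (fraction >= 0.85) return 4;
  if (fraction >= 0.6) return 3;
  if (fraction >= 0.45) return 2;
  return 1;
}

console.log(accuracyScore(17, 20)); // 0.85 -> 4
```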

Quality. To make quality more objective, we broke it down into different criteria (an aggregation sketch follows the list):

  • Explains business logic (not just API) (Yes / No): Context beyond code.

  • Has code examples (Yes / No)

  • References actual files/functions (Yes / No)

  • Image/Diagram Support (Yes / No): Can it produce diagrams?

  • Includes API endpoint documentation (Yes / No / Doesn't apply): For frontend/backend collaboration when relevant.

  • Organization & Structure (1–5):

    • 5 - Table of contents, anchor links, headings, sidebars, search

    • 4 - Mostly easy to scan and jump around

    • 3 - Navigation exists, but sections aren’t well labeled

    • 2 - Few navigational aids

    • 1 - Wall of text or scattered files with no clear way to find things

  • Depth of Explanation (1–5):

    • 5 - Explains purpose, reasoning, edge cases, and usage thoroughly

    • 4 - Explains most parts well, but lacks detail in a few areas

    • 3 - Gives a surface-level overview, but lacks deep insight

    • 2 - Very shallow; barely explains beyond the code

    • 1 - No real explanation, just pasted code or vague comments

  • Digestibility (1–5): Is the content concise and easy for your audience to understand (e.g., does it include a “Terms & Definitions” section)?

  • Formatting Consistency (1–5):

    • 5 - Fully consistent formatting; clear hierarchy; code blocks well styled

    • 4 - Mostly clean, with a few inconsistencies

    • 3 - Some formatting issues, but readable

    • 2 - Cluttered or inconsistent formatting

    • 1 - Messy, broken Markdown, or unreadable layout

  • Multi-Page Output (Yes / No): Can it create full-length documentation, or just fragments?
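
Pulled together, these criteria can be captured as one record per tool/prompt run, which makes runs comparable side by side. A hedged sketch (the record shape and the unweighted roll-up are assumptions; this doc doesn't prescribe a storage format):

```typescript
// One quality record per tool/prompt run, mirroring the criteria above.
interface QualityRecord {
  explainsBusinessLogic: boolean;
  hasCodeExamples: boolean;
  referencesActualFiles: boolean;
  imageDiagramSupport: boolean;
  apiEndpointDocs: boolean | "n/a"; // Yes / No / Doesn't apply
  organizationAndStructure: 1 | 2 | 3 | 4 | 5;
  depthOfExplanation: 1 | 2 | 3 | 4 | 5;
  digestibility: 1 | 2 | 3 | 4 | 5;
  formattingConsistency: 1 | 2 | 3 | 4 | 5;
  multiPageOutput: boolean;
}

// Simple roll-up: average the 1-5 criteria (weighting is a judgment call).
function qualityAverage(q: QualityRecord): number {
  const scores = [
    q.organizationAndStructure,
    q.depthOfExplanation,
    q.digestibility,
    q.formattingConsistency,
  ];
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```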


Process

How difficult it is to use the tool effectively.

  • Ease of Setup (1–5): How complex is the setup? Installs, config, permissions, etc.

  • Prompt Simplicity (1–5): Are you able to use short, natural prompts, or does it need detailed scaffolding?

  • Run Cost ($ / tokens / subscription): How expensive (financially or time-wise) is each run? (a cost sketch follows this list)

  • Repeatability (1–5): Can you easily reuse the approach on other repos? Is it consistent?
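
For token-metered tools, Run Cost can be estimated directly from usage. A minimal sketch (the per-token prices are hypothetical placeholders, not real rates; substitute the provider's current pricing):

```typescript
// Estimate the dollar cost of one run from token counts.
// Prices are hypothetical placeholders per 1M tokens; substitute real rates.
const PRICE_PER_MILLION = { input: 3.0, output: 15.0 };

function runCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * PRICE_PER_MILLION.input +
    (outputTokens / 1_000_000) * PRICE_PER_MILLION.output
  );
}

// Example: a large-repo overview that reads 200K tokens and writes 5K tokens.
console.log(runCost(200_000, 5_000)); // estimated dollars for this run
```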

Technologies and Approaches

  • IDE (Integrated Development Environment):
    A software suite that brings together essential tools (such as a code editor, debugger, and compiler) into one interface, enabling developers to write, test, and debug code more efficiently. It's a productivity environment for building software.

  • AI Agent:
    A self-contained AI entity capable of perceiving its environment, making decisions, and taking actions to perform specific tasks. It follows a sense-think-act loop and is typically focused on executing a single or bounded set of goals.

  • Agentic AI:
    A more advanced, autonomous AI system that not only performs tasks but also sets goals, plans across multiple steps, adapts dynamically, and may coordinate multiple AI agents or tools. It's like a conductor that directs simpler agents or tools to accomplish complex, evolving objectives.

  • Autonomous AI Tool (Black-box Tool):
    A self-contained, goal-oriented AI application that performs complex tasks (e.g., generating tutorials from code or summarizing entire repos) with no direct control over the internal AI agent or workflow.
    Users interact by providing minimal input—like a GitHub repo URL or a topic—and the tool autonomously handles planning and execution.

Technology:

  • DeepWiki (from Cognition, the makers of Devin)

  • CodeToTutorial

  • VSCode

    • Copilot

      • GPT-4.1

      • Claude Sonnet 4

      • Gemini 2.5 Pro

    • Amazon Q

      • Claude Sonnet 4

  • Cursor

    • GPT-4.1

    • Claude Sonnet 4

  • Windsurf

    • GPT-4.1

    • Claude Sonnet 4

    • Gemini 2.5

Approaches:

  • Prompting for a markdown file

  • Prompting for a docx

  • Prompting for docx with images

  • Prompting for markdown with Mermaid diagrams

Prompting for:

  1. Architecture and File Structure

    • Describe the overall architecture of the application (e.g., monolith, microservices, layered).

    • Explain the folder/module structure.

    • Identify key technologies used and their roles in solving core problems.

EXTRAS:

Prompt:

Generate a Markdown document that explains how this application works, both from a business and technical perspective. Follow this structure:

  1. Overview

    • Describe what the application does from a business/user perspective.

    • Summarize the primary use cases and who the end users are.

  2. Key Business Processes

    • List the main user flows or automated system processes.

    • Explain the steps involved in completing typical tasks within the app.

  3. Architecture and File Structure

    • Describe the overall architecture of the application (e.g., monolith, microservices, layered).

    • Explain the folder/module structure.

    • Identify key technologies used and their roles in solving core problems.

  4. Flowchart

    • Generate a flowchart (in Mermaid.js syntax) that visualizes a typical user or system process.

    • Include the Mermaid code directly in the markdown.

Requirements:

  • Write progressively: start with a high-level overview and move into more technical detail.

  • Ensure all explanations are accurate and based on the codebase, not assumptions.

  • Do not include unnecessary commentary.

  • Use clear and concise language.

  • Reference function and component names where relevant.

  • Include file paths when pointing out architecture or business logic locations.
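
For reference, the flowchart requested in step 4 of the prompt might come back looking like the snippet below (a hypothetical login flow, not tied to any specific repo):

```mermaid
flowchart TD
    A[User opens app] --> B{Has session?}
    B -- Yes --> C[Load dashboard]
    B -- No --> D[Show login form]
    D --> E[POST /api/login]
    E -- Success --> C
    E -- Failure --> D
```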

Why Agent over Ask mode?

Use Agent Mode if:

  • You want the agent to read and analyze the full codebase (or a large portion of it).

  • You're looking for a deep, contextual, multi-file overview.

  • You're starting a new documentation or exploration task.

Use Ask Mode if:

  • You only want a quick, isolated answer based on the currently open file.

  • You're testing or iterating on a smaller piece of the codebase.

  • You don’t need the AI to understand the overall project structure.

If I use Cursor or Copilot with the same model, will it give me different results?

Yes: you will get different results from Cursor and Copilot, even if you're using the same model (like GPT-4.1 or Claude Sonnet 4). The difference isn't the model itself; it's how each tool structures the context and prompt that it sends to the model.
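
As a purely hypothetical illustration of that difference (neither tool publishes its exact prompt assembly; the function names and templates below are invented):

```typescript
// Invented example: two tools wrapping the SAME user question for the
// same model with different context, yielding different completions.
const question = "How is the app organized?";

// Tool A might inline a file tree plus a few retrieved code snippets.
function toolAPrompt(fileTree: string, snippets: string[]): string {
  return `You are a coding assistant.\nRepo layout:\n${fileTree}\n\nRelevant code:\n${snippets.join("\n---\n")}\n\nQuestion: ${question}`;
}

// Tool B might summarize the repo first and send only the summary.
function toolBPrompt(repoSummary: string): string {
  return `Answer using this project summary only.\n\nSummary:\n${repoSummary}\n\nQuestion: ${question}`;
}
// Same model + same question, but different context => different answers.
```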