How I Built a Searchable Shopify Help Library for Study in NotebookLM with Python and ChatGPT5

When I decided to deepen my knowledge of Shopify, my goal was not to open a new store from scratch. I wanted to create a study system to eventually master the Shopify Admin for managing an existing store. I had used Shopify before, but only on a surface level. My aim was to get proficient quickly and have a personal, AI-friendly reference I could study without having to rely on the Shopify Help site. That meant finding a way to capture and organize the official help documentation into something clean, searchable, and ready for my preferred study buddy, NotebookLM.

Starting with the Entire Sitemap

The project began by downloading Shopify’s public sitemap.xml. This file contains the URLs of nearly every public help article on their site. Using CLI tools, I extracted over 2,200 URLs from the sitemap. That was far too many for my specific goal, so I narrowed the list to 738 URLs that focused strictly on Shopify Admin topics. I filtered out store setup guides, design themes, and irrelevant content, keeping only the material that would help me run an existing store efficiently.
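The extract-and-filter step can be sketched in a few lines of Python. The miniature sitemap below stands in for Shopify's real sitemap.xml, and the `/manual/` marker and `/themes/` exclusion are assumptions about how Admin articles are distinguished, not the exact filters I used:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sitemap standing in for Shopify's real sitemap.xml.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://help.shopify.com/en/manual/orders</loc></url>
  <url><loc>https://help.shopify.com/en/manual/products</loc></url>
  <url><loc>https://help.shopify.com/en/themes/customizing</loc></url>
</urlset>"""

# Sitemap files live in this XML namespace, so lookups must be qualified.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(xml_text: str) -> list[str]:
    """Pull every <loc> entry out of a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

def filter_admin(urls: list[str]) -> list[str]:
    """Keep Admin-focused articles; the markers here are illustrative guesses."""
    drop = ("/themes/",)
    return [u for u in urls if "/manual/" in u and not any(d in u for d in drop)]

admin_urls = filter_admin(extract_urls(SAMPLE))
print(admin_urls)  # the two /manual/ articles survive; the themes page is dropped
```

The same two-function shape scales from three URLs to 2,200: parse everything first, then filter against your actual goal.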

This reduced set of URLs was the foundation for my learning resource.

The Objective

I wanted to create a dozen organized PDFs, each representing a logical category of Shopify Admin functions. The PDFs needed to be stripped of all layout clutter so that NotebookLM could parse them without confusion. This meant removing navigation menus, images, and icons, leaving only the clean, readable text.

The output had to be:

  • Organized into topic-level files
  • Clean so AI tools could parse easily
  • Complete in terms of essential Shopify Admin knowledge

First Attempts and Early Roadblocks

I started with a Python script that used Playwright to open each page and save it as a PDF. To avoid Shopify’s bot detection, I used a saved session state file from my browser session. This got me past the verification wall and allowed the script to run without manual intervention.
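A minimal sketch of that capture step looks like the following. The `state.json` filename and output layout are assumptions; the session file itself would have been exported earlier with Playwright's `context.storage_state(path=...)` from a logged-in browser session:

```python
from pathlib import Path

def pdf_name(url: str) -> str:
    """Derive a filesystem-safe PDF name from an article URL (illustrative helper)."""
    slug = url.rstrip("/").split("/")[-1] or "index"
    return f"{slug}.pdf"

def save_article_pdf(url: str, out_dir: str = "pdfs", state_file: str = "state.json") -> Path:
    """Open one help article with a saved session state and print it to PDF."""
    # Imported lazily so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    target = out / pdf_name(url)
    with sync_playwright() as p:
        # page.pdf() is only supported in headless Chromium.
        browser = p.chromium.launch(headless=True)
        # Reusing the stored cookies/session gets past the verification wall.
        context = browser.new_context(storage_state=state_file)
        page = context.new_page()
        page.goto(url, wait_until="networkidle")
        page.pdf(path=str(target), format="A4")
        browser.close()
    return target
```

Looping this over the 738 filtered URLs is then just a matter of iterating the list and letting the saved session carry each request through.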

The first run gave me PDFs for every URL, but each one looked exactly like the live website. They contained dynamic menus, floating icons, sidebars, and other distractions. For my purposes, this was unusable. NotebookLM would treat these visual elements as part of the main text, breaking the reading flow.

Another problem was that the script generated hundreds of individual PDFs, one per article. I needed each category merged into a single document.

Solving the Problems

To fix the clutter problem, I added a “full reader mode” transformation step to the script. This step completely changed how the page was processed before printing. It:

  • Removed all images, SVGs, videos, canvases, iframes, and figures
  • Cleared background images
  • Stripped away promotional boxes, related article links, and comment sections
  • Flattened hyperlinks so that only the text remained
  • Blocked the browser from loading image, media, and font resources to speed up processing

I also applied custom CSS to hide headers, footers, navigation bars, and other non-core article content.
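In Playwright terms, those three layers (resource blocking, injected CSS, and injected JavaScript) can be sketched like this. The CSS selectors are illustrative guesses, not Shopify's real class names:

```python
# JavaScript injected into each page to strip non-text elements before printing.
CLEANUP_JS = """
() => {
  // Remove all images, SVGs, videos, canvases, iframes, and figures.
  document.querySelectorAll('img, svg, video, canvas, iframe, figure')
          .forEach(el => el.remove());
  // Clear background images.
  document.querySelectorAll('*').forEach(el => { el.style.backgroundImage = 'none'; });
  // Flatten hyperlinks so that only the text remains.
  document.querySelectorAll('a').forEach(a => a.replaceWith(...a.childNodes));
}
"""

# CSS hiding the chrome around the article body (selectors are assumptions).
HIDE_CSS = "header, footer, nav, aside, .promo, .related-articles { display: none !important; }"

# Resource types Playwright can abort to speed up page loads.
BLOCKED = {"image", "media", "font"}

def apply_reader_mode(page):
    """Apply reader-mode stripping to a Playwright page before calling page.pdf().

    Note: the route handler must be registered before page.goto() for the
    blocking to take effect on the initial load.
    """
    page.route("**/*", lambda route: route.abort()
               if route.request.resource_type in BLOCKED else route.continue_())
    page.add_style_tag(content=HIDE_CSS)
    page.evaluate(CLEANUP_JS)
```

Because the cleanup runs in the live DOM just before printing, the PDF captures the stripped-down page rather than the page as served.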

To fix the fragmentation problem, I modified the script to merge all PDFs from a given category into one file. Each category was represented by a .txt file containing its URLs. The script would read the list, generate clean PDFs for each URL, then merge them into a single category PDF.
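The read-then-merge flow can be sketched with a small pair of functions. The URL-list format (one URL per line, `#` for comments) and the use of the pypdf library are my assumptions about a reasonable implementation, not a transcript of the actual script:

```python
from pathlib import Path

def read_category_urls(list_file: str) -> list[str]:
    """Read one category's .txt file: one article URL per line, blanks and
    #-comments ignored."""
    lines = Path(list_file).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]

def merge_category(pdf_paths: list[str], out_file: str) -> None:
    """Merge one category's per-article PDFs into a single file."""
    # Lazy import: pip install pypdf
    from pypdf import PdfWriter

    writer = PdfWriter()
    for path in pdf_paths:
        writer.append(path)  # appends every page of the source PDF, in order
    with open(out_file, "wb") as f:
        writer.write(f)
```

Because the URLs in each .txt file are already in reading order, the merged category PDF keeps a sensible chapter-like flow for free.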

Testing Before Full Deployment

Before running all 12 categories, I tested the updated script on just one: analytics_urls. This allowed me to quickly see if the reader mode cleanup was working as intended. The output was exactly what I hoped for: a clean, text-only PDF that was still structured enough for easy reading.

Once I was satisfied, I removed the restriction and processed all 12 categories.

The Final Output

The end result was 12 clean, well-organized PDFs covering the major functional areas of the Shopify Admin. The documents were stripped of layout noise, had consistent formatting, and contained only the content that mattered. They are now ready to be ingested into NotebookLM, where AI can quiz me and summarize topics, and the documents themselves serve as a searchable personal reference.

Why This Approach Works

There are plenty of tools that can save web pages as PDFs, but they often include all the visual clutter that makes reading difficult for both humans and AI. By using Playwright in combination with custom JavaScript and CSS cleanup, I was able to control exactly what was included in the final output. Besides, who on Earth has time to do that by hand for 738 individual web pages? Not I.

The iterative testing process also made a difference. By focusing on one category first, I avoided wasting hours generating unusable PDFs for the entire set.

Lessons Learned

  • Start broad, then filter. Pull everything from the sitemap, then refine your list to match your actual learning goals.
  • Test in small batches. Run a single category before committing to the whole project.
  • Go beyond print CSS. Real reader mode stripping is essential for creating AI-friendly content.
  • Organize before capture. Grouping URLs by category made merging straightforward and the final library more useful.

Future Possibilities

This workflow is not limited to Shopify. It could be adapted for any SaaS platform with public documentation. Adding a change-detection feature could make it possible to update the library automatically when new articles are published or old ones are revised.
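One simple way to get that change detection would be to keep a snapshot of the sitemap's `<lastmod>` dates between runs and diff against it. This is a hypothetical sketch, not something the current script does; the snapshot filename is an assumption:

```python
import json
from pathlib import Path

def diff_sitemap(current: dict[str, str], snapshot_file: str = "sitemap_snapshot.json"):
    """Compare {url: lastmod} against the previous run; return (new, changed) URLs.

    `current` would come from parsing <loc>/<lastmod> pairs out of sitemap.xml.
    """
    snap_path = Path(snapshot_file)
    previous = json.loads(snap_path.read_text()) if snap_path.exists() else {}
    new = [u for u in current if u not in previous]
    changed = [u for u in current if u in previous and previous[u] != current[u]]
    snap_path.write_text(json.dumps(current, indent=2))  # save for the next run
    return new, changed
```

A scheduled run could then re-capture only the new and changed URLs and re-merge the affected category PDFs, instead of reprocessing all 738 pages.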

For now, I have exactly what I set out to build: a clean, structured Shopify Admin reference library that I can study and query with NotebookLM. It turns a scattered, visually noisy (albeit super helpful out of the box) help center into a concise, focused learning resource that I can use to learn the way I learn best.