# Webový Odtlačok — Glossary of Checks

26 detections of technologies, content, and infrastructure

---

## F1 — CMS / Content Management System

**What is it:** Detection of the CMS platform from meta tags, HTML comments, URL structure, cookies, and platform-specific scripts. Recognizes CMSs and site builders such as WordPress, Joomla, Drupal, Ghost, Wix, Squarespace, Webflow, TYPO3, HubSpot, Stranka.sk, Webnode, and Blogger, as well as application frameworks such as Nette, Laravel, and Django.

**Why it matters:** The CMS is the foundation of web infrastructure — it determines security risks, performance, SEO capabilities, and maintenance costs. WordPress has different vulnerabilities than Webflow, and an e-shop on Shoptet requires different optimization than WooCommerce.

**Real-world example:** A site running WordPress 6.x has access to thousands of plugins but requires regular updates. A hosted Webflow site needs no server or plugin updates but is less flexible. Fingerprint can often detect the CMS even when the operator has removed visible markers such as the generator meta tag.
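
As a rough illustration of marker-based detection, the Python sketch below checks the generator meta tag first and then falls back to well-known asset paths. The `CMS_SIGNATURES` table and the `detect_cms` function are hypothetical names for this example, and the patterns cover only three platforms; they are not the signatures Fingerprint actually uses.

```python
import re
from typing import Optional

# Illustrative signature table (an assumption): CMS name -> patterns searched in the HTML.
CMS_SIGNATURES = {
    "WordPress": [r"/wp-content/", r"/wp-includes/"],
    "Joomla": [r"/media/jui/", r"/components/com_"],
    "Drupal": [r"/sites/default/files/", r"Drupal\.settings"],
}

# Matches <meta name="generator" content="..."> and captures the content value.
GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def detect_cms(html: str) -> Optional[str]:
    """Guess the CMS from the generator tag, then from asset paths."""
    match = GENERATOR_RE.search(html)
    if match:
        generator = match.group(1).lower()
        for name in CMS_SIGNATURES:
            if name.lower() in generator:
                return name
    # Fallback: the generator tag is often removed, asset paths rarely are.
    for name, patterns in CMS_SIGNATURES.items():
        if any(re.search(pattern, html) for pattern in patterns):
            return name
    return None
```

The fallback step is what allows detection when the visible marker has been removed: a theme stylesheet under `/wp-content/` still points to WordPress.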


---

## F2 — E-commerce Platform

**What is it:** Identification of the e-commerce solution — Shoptet, PrestaShop, WooCommerce, Magento, OpenCart, Webareal, Shopify, Shoper, Upgates. Detection is performed via specific URL patterns, cart scripts, payment integrations, and meta tags.

**Why it matters:** The e-commerce platform directly affects conversion rate, product page loading speed, product SEO, and marketplace integrations. Each platform has specific limitations and optimization options.

**Real-world example:** Shoptet has native integration with Heureka.sk and Zbozi.cz, while WooCommerce requires plugins. Magento can handle millions of products but is more demanding on hosting. Fingerprint identifies both the platform and its version.


---

## F3 — JS / CSS Frameworks + CDN

**What is it:** Detection of JavaScript frameworks (jQuery, React, Vue.js, Angular, Alpine.js, HTMX, Turbo, Stimulus, Svelte) with versions, CSS frameworks (Bootstrap, Tailwind, Bulma, Foundation), and CDN providers (Cloudflare, CloudFront, Akamai, Fastly, jsDelivr).

**Why it matters:** The tech stack indicates how modern, performant, and maintainable a website is. A current React 18 and Next.js stack is usually easier to optimize and maintain than years of accumulated jQuery code. The CDN provider affects latency and availability for end users.

**Real-world example:** A site using React 18 + Next.js + Tailwind CSS behind the Cloudflare CDN is likely modern and fast. A site with jQuery 1.x + Bootstrap 3 and no CDN relies on end-of-life libraries and is likely slower. Fingerprint also reveals versions, which helps identify security risks.
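
One simple way to recover library versions is to read them from script URLs. The sketch below assumes two common URL shapes (a version in the file name, or an `@version` segment as used by npm-style CDNs); `LIBRARY_PATTERNS` and `detect_libraries` are hypothetical names, and real detection would need many more patterns.

```python
import re

# Illustrative patterns (assumptions): library name -> regex capturing the version.
LIBRARY_PATTERNS = {
    "jQuery": re.compile(r"jquery[.-](\d+\.\d+(?:\.\d+)?)(?:\.min)?\.js", re.I),
    "Bootstrap": re.compile(r"bootstrap[@/](\d+\.\d+(?:\.\d+)?)", re.I),
}


def detect_libraries(script_urls):
    """Return {library: version} for every script URL that matches a pattern."""
    found = {}
    for url in script_urls:
        for name, pattern in LIBRARY_PATTERNS.items():
            match = pattern.search(url)
            if match:
                found[name] = match.group(1)
    return found
```

For example, `https://code.jquery.com/jquery-1.12.4.min.js` yields jQuery `1.12.4`, which can then be compared against known end-of-life versions.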

### Sources
- [HTTP Archive — Web Technology Report](https://httparchive.org/reports) — HTTP Archive

---

## F4 — Analytics & Marketing

**What is it:** Detection of analytics and marketing tools: Google Analytics (GA4, UA), GTM, Facebook Pixel, Hotjar, Heureka, Sklik, Criteo, Google Ads, Smartsupp, Biano, Luigi's Box, CookieYes, and more.

**Why it matters:** Analytics tools indicate a company's digital maturity. A site with no analytics tool collects no visitor data. Remarketing pixels indicate active online marketing. A consent manager such as CookieYes suggests an effort toward GDPR compliance, though it does not prove it.

**Real-world example:** An e-shop with GA4 + GTM + Facebook Pixel + Heureka tracking has sophisticated analytics. A blog without any analytics has no visibility into traffic. Fingerprint also detects duplicate or conflicting tracking codes.
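
A minimal sketch of ID-based detection, assuming the usual ID shapes (`G-…` for GA4, `UA-…` for Universal Analytics, `GTM-…` for Tag Manager). The function names are hypothetical, and the patterns can produce false positives on unrelated strings. Because a standard snippet repeats the same ID several times, the sketch flags a tool only when more than one distinct ID is present.

```python
import re

# Assumed ID shapes; loose on purpose, so false positives are possible.
TRACKER_PATTERNS = {
    "GA4": re.compile(r"\bG-[A-Z0-9]{6,12}\b"),
    "Universal Analytics": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "GTM": re.compile(r"\bGTM-[A-Z0-9]{4,8}\b"),
}


def find_tracking_ids(html):
    """Return {tool: set of distinct IDs found in the HTML}."""
    return {tool: set(p.findall(html)) for tool, p in TRACKER_PATTERNS.items()}


def tools_with_multiple_ids(html):
    """List tools that appear with more than one distinct ID (possible conflict)."""
    ids_by_tool = find_tracking_ids(html)
    return sorted(tool for tool, ids in ids_by_tool.items() if len(ids) > 1)
```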

### Sources
- [Google Tag Manager](https://tagmanager.google.com/) — Google

---

## F5 — Payment Gateways

**What is it:** Identification of payment gateways and methods: GoPay, Stripe, PayPal, Comgate, TatraPay, SporoPay, CardPay, cash on delivery, and bank transfer. Detection via JavaScript SDKs, checkout URL patterns, and form elements.

**Why it matters:** Payment methods directly affect e-shop conversion rates. Customers expect card payments, bank transfers, and cash on delivery. Missing payment methods result in lost orders.

**Real-world example:** An e-shop with GoPay (card + bank transfer) + cash on delivery covers the payment methods most Slovak customers expect. A site with only PayPal loses customers who don't have a PayPal account. Stripe is a common choice for international payments.
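
SDK-based detection can be sketched by looking for the script hosts that gateways load their JavaScript from. The example covers only Stripe.js (`js.stripe.com`) and the PayPal JavaScript SDK (`www.paypal.com/sdk/js`); the table and function names are hypothetical, and the other gateways listed above would need their own patterns.

```python
import re

# Script URLs used by two gateways' JavaScript SDKs; other gateways are omitted.
GATEWAY_SCRIPT_PATTERNS = {
    "Stripe": re.compile(r"https://js\.stripe\.com/", re.I),
    "PayPal": re.compile(r"https://www\.paypal\.com/sdk/js", re.I),
}


def detect_payment_gateways(html):
    """Return a sorted list of gateways whose SDK script URL appears in the HTML."""
    return sorted(
        name for name, pattern in GATEWAY_SCRIPT_PATTERNS.items()
        if pattern.search(html)
    )
```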

### Sources
- [GoPay — Payment Gateway](https://www.gopay.com/) — GoPay

---

## F6 — Fonts

**What is it:** Detection of fonts in use — Google Fonts (with family extraction), Adobe Fonts (Typekit), Font Awesome, Custom WOFF/WOFF2. Analysis of the number of font families and their impact on performance.

**Why it matters:** Web fonts are often among the largest resources that delay text rendering. Each font family typically adds 50-200 KB to the download, depending on the number of weights and styles. Too many fonts slow down LCP (Largest Contentful Paint) and degrade Core Web Vitals.

**Real-world example:** A site with 1-2 Google Fonts families usually loads well. A site with 6+ different fonts and Font Awesome icons can have a first render that is 500ms+ slower. Self-hosted WOFF2 fonts are often faster than the Google Fonts CDN because they avoid connections to additional origins.
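
Family extraction from a Google Fonts stylesheet URL might look like the sketch below. It assumes the two public URL styles: the `css2` endpoint, which repeats the `family` parameter, and the legacy `css` endpoint, which separates families with `|`. The function name is hypothetical.

```python
from urllib.parse import unquote_plus, urlsplit


def google_font_families(css_url):
    """Return the font family names requested by a Google Fonts CSS URL."""
    families = []
    for part in urlsplit(css_url).query.split("&"):
        key, _, value = part.partition("=")
        if key != "family":
            continue
        # Legacy URLs pack several families into one parameter, separated by "|".
        for spec in unquote_plus(value).split("|"):
            # Drop the weight/style suffix after ":" (e.g. ":wght@400;700").
            name = spec.split(":")[0].strip()
            if name and name not in families:
                families.append(name)
    return families
```

Counting the returned list gives the number of families, which is the figure the performance guidance above refers to.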

### Sources
- [Google Fonts](https://fonts.google.com/) — Google

---

## F7 — CDN Provider

**What is it:** CDN provider identification: Cloudflare, Fastly, Akamai, CloudFront, Google CDN. Detection via HTTP headers (cf-ray for Cloudflare, x-amz-cf-id for CloudFront, x-cache for several providers), DNS records, and certificates.

**Why it matters:** A CDN dramatically reduces latency for end users. A site without a CDN serves content from a single server, resulting in higher latency for distant visitors. Cloudflare also provides DDoS protection and WAF.

**Real-world example:** A site behind Cloudflare can serve cached content with a TTFB under 100ms even to visitors on other continents. A site on shared hosting without a CDN can have a TTFB of 500ms+ for international visitors. A CDN also reduces load on the origin server.
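
Header-based identification can be sketched as a lookup over response header names. The example uses only the headers named above and treats `x-cache` as an ambiguous hint, since several caches and CDNs send it; DNS and certificate checks are left out. `PROVIDER_HEADERS` and `detect_cdn` are hypothetical names.

```python
from typing import Mapping, Optional

# Headers that point to one provider; a deliberately short, illustrative list.
PROVIDER_HEADERS = {
    "cf-ray": "Cloudflare",
    "x-amz-cf-id": "CloudFront",
}


def detect_cdn(headers: Mapping[str, str]) -> Optional[str]:
    """Guess the CDN provider from HTTP response header names."""
    # Header names are case-insensitive, so compare in lower case.
    lowered = {name.lower() for name in headers}
    for header, provider in PROVIDER_HEADERS.items():
        if header in lowered:
            return provider
    if "x-cache" in lowered:
        return "unidentified cache or CDN"
    return None
```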

### Sources
- [Cloudflare — How CDN Works](https://www.cloudflare.com/learning/cdn/what-is-a-cdn/) — Cloudflare

---

## F8 — Hosting / Server Info

**What is it:** Web server and reverse proxy detection: Nginx, Apache, LiteSpeed, IIS, Tomcat + version + release year. Reverse proxy: Varnish, BIG-IP, HAProxy, Envoy, Traefik. Identification via the Server header and product-specific headers.

**Why it matters:** Server software and its version affect performance and security. An outdated Apache version may contain known vulnerabilities. LiteSpeed is often faster than Apache for PHP sites. A reverse proxy often indicates more advanced infrastructure.

**Real-world example:** A site on Nginx 1.25 + Varnish cache has enterprise-grade infrastructure. A site on Apache 2.2 (EOL since 2018) is a security risk. Fingerprint reveals exact versions, which helps with security audits.
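
A minimal sketch of reading the product and version from a `Server` header and comparing the version against a minimum. The function names are hypothetical, and the sketch assumes the common `Product/Version (comment)` shape; many servers hide the version or the header entirely.

```python
def parse_server_header(value):
    """Split a Server header such as 'Apache/2.4.57 (Debian)' into (name, version)."""
    tokens = value.split()
    if not tokens:
        return None, None
    # Only the first product token is used; comments in parentheses are ignored.
    name, _, version = tokens[0].partition("/")
    return name, (version or None)


def version_tuple(version):
    """Turn '2.2.34' into (2, 2, 34); non-numeric parts are dropped."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def is_older_than(version, minimum):
    """True when `version` sorts below `minimum` in numeric order."""
    return version_tuple(version) < version_tuple(minimum)
```

With these helpers, `Apache/2.2.34` parses to `("Apache", "2.2.34")`, and `is_older_than("2.2.34", "2.4")` flags it as belonging to the retired 2.2 line.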

### Sources
- [Netcraft — Web Server Survey](https://www.netcraft.com/) — Netcraft

---

## F9 — Website Type Classification

**What is it:** Heuristic website classification based on detections — e-shop, marketplace, blog, forum, social network, aggregator, news portal, wiki, portfolio, catalog, booking, SaaS, streaming. Uses a combination of CMS, e-commerce platforms, and content.

**Why it matters:** The website type determines relevant metrics and benchmarks. An e-shop is evaluated differently than a blog — conversion rate vs. time on page. Classification enables comparison with relevant competitors in the same category.

**Real-world example:** A site with WooCommerce + product pages + a cart is classified as an e-shop. A site with WordPress + articles without products is a blog. A SaaS site has a login page, pricing, and documentation.
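
The three examples above can be written as a small rule set. This is only a sketch: the signal names (`ecommerce_platform`, `product_pages`, and so on) are hypothetical flags that earlier detections would have to supply, and the rule order is an assumption.

```python
def classify_site(signals):
    """Classify a site from a set of detected signals (hypothetical flag names)."""
    # E-commerce platform + product pages + cart -> e-shop.
    if {"ecommerce_platform", "product_pages", "cart"} <= signals:
        return "e-shop"
    # Login page + pricing + documentation -> SaaS.
    if {"login_page", "pricing_page", "documentation"} <= signals:
        return "SaaS"
    # Articles without products -> blog.
    if "articles" in signals and "product_pages" not in signals:
        return "blog"
    return "unknown"
```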

### Sources
- [Schema.org — WebSite Type](https://schema.org/WebSite) — Schema.org

---

## F10 — Social Networks

**What is it:** Detection of social network links — Facebook, Instagram, Twitter/X, LinkedIn, YouTube, TikTok, Pinterest. URL extraction from footer links, meta tags (og:see_also), and JSON-LD.

**Why it matters:** Social media presence indicates a company's digital maturity and marketing strategy. A LinkedIn profile suggests B2B focus, TikTok suggests a younger target audience. Absence of social media may signal an inactive business.

**Real-world example:** A company with Facebook + Instagram + LinkedIn + YouTube has a comprehensive social presence. An e-shop with only a Facebook page is using a minimum of channels. Fingerprint extracts the exact URL for each platform.
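
Link extraction can be sketched by matching `href` hosts against a table of platform domains. The table and function names are hypothetical, the regex only sees absolute links in double or single quotes, and regional subdomains (for example `sk.linkedin.com`) are not handled.

```python
import re
from urllib.parse import urlsplit

# Illustrative host table; subdomains other than "www." are not covered.
SOCIAL_DOMAINS = {
    "facebook.com": "Facebook",
    "instagram.com": "Instagram",
    "twitter.com": "Twitter/X",
    "x.com": "Twitter/X",
    "linkedin.com": "LinkedIn",
    "youtube.com": "YouTube",
    "tiktok.com": "TikTok",
    "pinterest.com": "Pinterest",
}

HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.I)


def extract_social_links(html):
    """Return {platform: first matching URL} for social links found in the HTML."""
    links = {}
    for url in HREF_RE.findall(html):
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        platform = SOCIAL_DOMAINS.get(host)
        if platform and platform not in links:
            links[platform] = url
    return links
```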

### Sources
- [Open Graph Protocol](https://ogp.me/) — Open Graph

---

## C1 — Visible Text Extraction

**What is it:** Removal of HTML tags, scripts, styles, and invisible elements — a clean text representation of the page. Used as input for keyword extraction, embeddings, and AI analysis.

**Why it matters:** Clean text is the foundation for all content analysis. AI models and search engines work with text, not HTML code. Quality extraction filters out navigational noise and preserves only content-relevant text.

**Real-world example:** From an e-shop HTML page, extraction removes the menu, footer, cookie banner and retains the product description, specifications, and reviews. This clean text is then used to generate embeddings and extract keywords.
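
A minimal sketch of this kind of extraction using Python's standard `html.parser`. The set of skipped tags is an assumption chosen to match the example (scripts, styles, navigation, footer); cookie banners and CSS-hidden elements would need extra rules, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose content is dropped entirely (an illustrative choice).
SKIPPED_TAGS = {"script", "style", "noscript", "template", "nav", "footer"}


class VisibleTextParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside any skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            words = data.split()   # also collapses runs of whitespace
            if words:
                self._chunks.append(" ".join(words))

    def text(self):
        return " ".join(self._chunks)


def extract_visible_text(html):
    """Return the page text with markup and skipped sections removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```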

### Sources
- [Google Search Essentials — Crawling](https://developers.google.com/search/docs/essentials) — Google

---

## C2 — Word Count

**What is it:** A basic content length metric for the analyzed page. Counts words in the extracted visible text after removing HTML tags and scripts.

**Why it matters:** Content length correlates with information depth and SEO performance. As a rule of thumb, pages with fewer than 300 words are often treated as 'thin content', although Google publishes no word-count threshold. AI models tend to prefer more comprehensive sources when generating responses.

**Real-world example:** A product page with 50 words doesn't have enough information for SEO or AI. An article with 1500+ words has a greater chance of ranking in Google and being cited in AI responses. The optimal length depends on the page type.
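
Word counting on the extracted text can be sketched with a single regular expression. The tokenization rule (letters and digits, with inner apostrophes and hyphens kept inside a word) is an assumption, and the 300-word default mirrors the rule of thumb above; the function names are hypothetical.

```python
import re

# A word is a run of letters/digits, optionally joined by "'" or "-".
WORD_RE = re.compile(r"[^\W_]+(?:['-][^\W_]+)*")


def count_words(text):
    """Count words in already-extracted visible text."""
    return len(WORD_RE.findall(text))


def is_thin_content(text, threshold=300):
    """True when the text has fewer words than the threshold."""
    return count_words(text) < threshold
```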

### Sources
- [Creating Helpful Content](https://developers.google.com/search/docs/fundamentals/creating-helpful-content) — Google

---

## KW1 — Keywords — Extraction

**What is it:** Automatic keyword extraction from URL paths, H1, title, meta description, breadcrumbs, category tree, and headings. Each keyword is scored as `weight × log2(frequency + 1) × log2(product_count + 2)`.

**Why it matters:** Keywords define the thematic focus of a website and are the foundation for both SEO and AI visibility. Automatic extraction reveals what topics a site actually focuses on — often different from what the owner believes.

**Real-world example:** An electronics e-shop has the strongest keywords 'mobile phone', 'notebook', 'tablet'. However, if 'sale' appears as the strongest word in the extraction, the site communicates discounts rather than products.
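
The scoring formula from this section translates directly into code. In the sketch below, the function names and the sample weights are hypothetical; how weights are assigned to sources (H1, title, URL path, and so on) is not specified here.

```python
import math


def keyword_score(weight, frequency, product_count):
    """score = weight * log2(frequency + 1) * log2(product_count + 2)"""
    return weight * math.log2(frequency + 1) * math.log2(product_count + 2)


def rank_keywords(candidates):
    """Sort (keyword, weight, frequency, product_count) tuples by descending score."""
    scored = [(kw, keyword_score(w, f, p)) for kw, w, f, p in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The `+ 1` and `+ 2` offsets keep both logarithms positive: a keyword seen once on a site with no products still scores exactly its weight, because `log2(2) = 1` in both factors.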

### Sources
- [Google SEO Starter Guide](https://developers.google.com/search/docs/fundamentals/seo-starter-guide) — Google

---

## KW2 — Keywords — Categorization

**What is it:** Classification of extracted keywords into categories — product, service, location, brand. Helps understand the thematic structure of a website and identify content gaps.

**Why it matters:** Keyword categorization shows whether a site covers all important aspects. An e-shop should have strong product keywords, a local business should have local ones. Gaps in categories indicate missing content.

**Real-world example:** A restaurant in Bratislava has strong product keywords ('pizza', 'pasta') but is missing local ones ('Bratislava', 'Old Town'). This means weak local SEO visibility and a low chance of appearing in AI responses to local queries.

### Sources
- [Google SEO Starter Guide](https://developers.google.com/search/docs/fundamentals/seo-starter-guide) — Google

---

## SM1 — Sitemap Existence

**What is it:** Check whether the site has an accessible sitemap.xml or sitemap index at standard URLs (/sitemap.xml, /sitemap_index.xml). Verification of HTTP status and XML format validity.

**Why it matters:** A sitemap is a map of the website for search engines and AI crawlers. Without a sitemap, crawlers must discover pages through links, which is slower and less reliable. Both Google and AI bots use sitemaps for efficient indexing.

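**Real-world example:** An e-shop with 10,000 products and no sitemap risks Google missing a large share of its product pages, possibly as much as 30-50%, especially pages with few internal links. A site with an up-to-date sitemap gives crawlers a complete URL list, so new pages are often indexed within about 48 hours of publication.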
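
The existence check can be sketched in two parts: building the candidate URLs and deciding whether a response looks like a sitemap. Fetching is left to the caller so the example stays offline; the function names are hypothetical, and the check only inspects the HTTP status and the XML root element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
STANDARD_PATHS = ("/sitemap.xml", "/sitemap_index.xml")


def candidate_sitemap_urls(site_url):
    """Return the standard sitemap locations for a site."""
    return [urljoin(site_url, path) for path in STANDARD_PATHS]


def sitemap_kind(status_code, body):
    """Return 'urlset', 'sitemapindex', or None for a fetched response (body as bytes)."""
    if status_code != 200:
        return None
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None
    if root.tag == "{%s}urlset" % SITEMAP_NS:
        return "urlset"
    if root.tag == "{%s}sitemapindex" % SITEMAP_NS:
        return "sitemapindex"
    # Well-formed XML or XHTML that is not a sitemap, e.g. a custom 404 page.
    return None
```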

### Sources
- [Sitemaps — Google Search Central](https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview) — Google

---

## SM2 — URL Count in Sitemap

**What is it:** Counting URLs in the sitemap — the basis for tier recommendation (FREE=1, BASIC=20, PRO=50+ URLs). Analysis of URL distribution across subdomains and sections.

**Why it matters:** The URL count determines the website's scope and the recommended audit tier. A small site with 5 URLs only needs a basic audit, while a large e-shop with thousands of products needs the PRO tier for a complete analysis.

**Real-world example:** A personal blog with 10 articles falls into the BASIC tier. An e-shop with 500 product pages needs the PRO tier to analyze all URLs. The number of URLs in the sitemap vs. the actual page count reveals indexing issues.
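
Counting URLs and mapping the count to a tier might look like the sketch below. The tier boundaries are an assumption: the text gives FREE=1, BASIC=20, and PRO=50+ but does not say where counts between 21 and 49 fall, so this example treats everything above 20 as PRO. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_bytes):
    """Return all <loc> values from a urlset sitemap."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


def urls_per_host(urls):
    """Count URLs per hostname to show the distribution across subdomains."""
    return Counter(urlsplit(url).hostname for url in urls)


def recommend_tier(url_count):
    """Map a URL count to a tier. ASSUMPTION: counts of 21-49 are treated as PRO."""
    if url_count <= 1:
        return "FREE"
    if url_count <= 20:
        return "BASIC"
    return "PRO"
```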

### Sources
- [Sitemaps — Google Search Central](https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview) — Google

---

## SM3 — Sitemap Validity

**What is it:** Verification of sitemap XML format, URL correctness, and accessibility of linked pages. Checks the lastmod, changefreq, and priority elements.

**Why it matters:** An invalid sitemap can cause crawlers to ignore it. Incorrect URLs, missing namespaces, or invalid dates lead to indexing errors. Up-to-date lastmod dates help crawlers re-crawl efficiently.

**Real-world example:** A sitemap with URLs pointing to 404 pages signals a neglected website. A sitemap without lastmod dates doesn't allow crawlers to distinguish new from old content. A valid sitemap with current dates speeds up indexing.
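
The format checks can be sketched offline against the sitemaps.org protocol: absolute `loc` URLs, W3C Datetime `lastmod` values, the allowed `changefreq` keywords, and a `priority` between 0.0 and 1.0. The function name and message wording are hypothetical, and the accessibility check (requesting each URL) is not included.

```python
import re
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}
# W3C Datetime: YYYY, YYYY-MM, YYYY-MM-DD, or a full timestamp with a time zone.
LASTMOD_RE = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?$"
)


def sitemap_problems(xml_bytes):
    """Return a list of human-readable problems found in a urlset sitemap."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return ["not well-formed XML: %s" % exc]
    problems = []
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        if not loc.startswith(("http://", "https://")):
            problems.append("loc is not an absolute URL: %r" % loc)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        if lastmod is not None and not LASTMOD_RE.match(lastmod.strip()):
            problems.append("invalid lastmod for %r: %r" % (loc, lastmod))
        changefreq = url.findtext("sm:changefreq", default=None, namespaces=NS)
        if changefreq is not None and changefreq.strip() not in CHANGEFREQ_VALUES:
            problems.append("invalid changefreq for %r: %r" % (loc, changefreq))
        priority = url.findtext("sm:priority", default=None, namespaces=NS)
        if priority is not None:
            try:
                in_range = 0.0 <= float(priority) <= 1.0
            except ValueError:
                in_range = False
            if not in_range:
                problems.append("invalid priority for %r: %r" % (loc, priority))
    return problems
```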

### Sources
- [Sitemaps XML Format](https://www.sitemaps.org/protocol.html) — sitemaps.org

---

## SSL1 — SSL Certificate — Existence

**What is it:** Verification that the domain uses HTTPS with a valid SSL/TLS certificate. Checks HTTP to HTTPS redirect and certificate validity for the given domain.

**Why it matters:** HTTPS has been a Google ranking signal since 2014. Since 2018, Chrome has marked all HTTP sites as 'Not Secure', and Firefox displays a similar warning. SSL/TLS is essential for user trust and data protection in transit.

**Real-world example:** A site without HTTPS is labeled 'Not Secure' in the address bar, which deters visitors. An e-shop without HTTPS cannot meet PCI DSS requirements for transmitting card data, so it cannot safely accept card payments on its own pages. Every modern website should have a valid SSL/TLS certificate.

### Sources
- [HTTPS as a Ranking Signal](https://developers.google.com/search/blog/2014/08/https-as-ranking-signal) — Google

---

## SSL2 — SSL Certificate — Issuer

**What is it:** Identification of the SSL certificate issuer — Let's Encrypt, DigiCert, Sectigo, GlobalSign, GeoTrust, and others. Certificate type: DV (Domain Validation), OV (Organization Validation), EV (Extended Validation).

**Why it matters:** The certificate type indicates the level of identity verification. DV (Let's Encrypt) only verifies domain ownership. OV and EV also verify the organization. For e-shops and financial services, an OV/EV certificate signals trustworthiness.

**Real-world example:** A bank with an EV certificate (DigiCert) has the highest level of identity verification. A blog with a Let's Encrypt DV certificate has basic, domain-only validation. Both encrypt traffic equally well; EV adds a verified organization identity, which matters for sensitive transactions.
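
Reading the issuer can be sketched against the dictionary shape that Python's `ssl.SSLSocket.getpeercert()` returns, where `issuer` and `subject` are tuples of relative distinguished names. The DV check is a simplified heuristic and an assumption of this example: a subject without `organizationName` is treated as DV. Telling OV from EV reliably requires certificate policy OIDs, which this sketch does not read.

```python
def _first_field(cert, section, key):
    """Return the first value of `key` in cert[section], or None."""
    for rdn in cert.get(section, ()):
        for name, value in rdn:
            if name == key:
                return value
    return None


def issuer_organization(cert):
    """Return the issuing organization, e.g. "Let's Encrypt"."""
    return _first_field(cert, "issuer", "organizationName")


def validation_hint(cert):
    """Heuristic: no organization in the subject suggests a DV certificate."""
    if _first_field(cert, "subject", "organizationName") is None:
        return "DV"
    return "OV or EV"
```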

### Sources
- [Let's Encrypt — How It Works](https://letsencrypt.org/how-it-works/) — Let's Encrypt

---

## SSL3 — SSL Certificate — Validity

**What is it:** Check of the SSL certificate expiration date and the number of days until expiry. Warning for certificates approaching expiration (less than 30 days).

**Why it matters:** An expired SSL certificate causes the browser to block access to the site with an error page. Automatic renewal (Let's Encrypt, Cloudflare) eliminates this risk. Manually managed certificates require monitoring.

**Real-world example:** A certificate with 340 days of validity is fine. A certificate with 5 days until expiration requires immediate renewal. Let's Encrypt certificates are valid for 90 days by default and are typically renewed automatically after about 60 days, while commercial certificates are usually renewed annually.
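
The days-until-expiry calculation can be sketched with the standard library. The input is the `notAfter` string in the format Python's `ssl` module reports (for example `Jun 30 12:00:00 2030 GMT`); the function names are hypothetical, and the 30-day default mirrors the warning threshold above.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Return whole days between `now` (default: current UTC time) and expiry."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days


def needs_renewal(not_after, now=None, warn_days=30):
    """True when fewer than `warn_days` days remain."""
    return days_until_expiry(not_after, now) < warn_days
```

Passing `now` explicitly keeps the function testable; in production the default current time is what matters.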

### Sources
- [SSL Labs — SSL Server Test](https://www.ssllabs.com/ssltest/) — Qualys

---

## EMB1 — Vector Embeddings

**What is it:** Generation of 1024-dimensional vector embeddings from extracted text using the BGE-M3 model via OpenRouter. Vectors are stored in a pgvector database for semantic search.

**Why it matters:** Vector embeddings enable semantic comparison of websites — not by keywords, but by content meaning. Two sites with different words but the same focus will have similar vectors.

**Real-world example:** An electronics e-shop and a tech blog about gadgets will have similar embeddings, even though they use different terminology. The cosine similarity between their vectors will be high (>0.8), signaling content relatedness.
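
Cosine similarity itself is a short computation. The sketch below uses plain Python lists so it runs without dependencies; real embeddings would have 1024 dimensions, and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

The value depends only on direction, not magnitude, so a long page and a short page on the same topic can still score close to 1.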

### Sources
- [BGE-M3 — Multi-Lingual Multi-Granularity Embedding Model](https://huggingface.co/BAAI/bge-m3) — BAAI

---

## EMB2 — Competitor Similarity

**What is it:** Cosine similarity search in the embeddings database — finding the most content-similar websites in the Be1st.ai database. Result: TOP N closest domains with similarity percentage.

**Why it matters:** Automatically finding similar websites reveals competitors the owner may not have known about. It also helps benchmark the site against actual competition rather than subjective estimates.

**Real-world example:** A Slovak clothing e-shop gets a list of the 5 closest sites from the database — e.g., ZOOT.sk (92%), About You (88%), Answear.sk (85%). The owner thus discovers who they are actually competing with for customers online.
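
The TOP N lookup can be sketched in memory as a ranking by cosine similarity. This is an illustration only: the function names and the `*.example` domains are hypothetical, and a vector database would normally run this search inside the database rather than in application code.

```python
import heapq
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_similar(query_vector, domain_vectors, n=5):
    """Return the n closest domains as (domain, similarity in percent) pairs."""
    scored = (
        (domain, round(_cosine(query_vector, vector) * 100, 1))
        for domain, vector in domain_vectors.items()
    )
    return heapq.nlargest(n, scored, key=lambda item: item[1])
```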

### Sources
- [pgvector — Open-Source Vector Similarity Search for Postgres](https://github.com/pgvector/pgvector) — pgvector

---

