Static Unicode security scanner for developers and CI teams reviewing untrusted source code.
It is built for security engineers, maintainers, Go developers, and DevOps teams who need a fast, local, deterministic check before code lands in CI, a release, or a dependency tree. Instead of trying to be a general SAST platform, it focuses narrowly on Unicode-based deception: hidden characters, misleading script mixing, payload-like sequences, and nearby decode-or-execute patterns. Decoder and dynamic-execution markers are supporting context by default; the primary signal is the hidden Unicode itself and explicit payload correlations. The differentiator is simple: it makes invisible evidence readable and keeps the output precise enough for code review and CI decisions.
~> ghostscan --verbose ./testdata/invisible/single.txt
########
### ###
## ##
## ## ## ##
# ## ## ##
# ##
## ##### ##
## ###
## ##
## ### #####
## ##
### #
###########
ghostscan v0.2.0
Finding: Invisible unicode character
Evidence: <U+200B ZERO WIDTH SPACE>
RuleID: unicode/invisible
Severity: HIGH
File: /Users/johnsmith/ghostscan/testdata/invisible/single.txt
Line: 1
Column: 2
Count: 1 suspicious runes
Category: invisible unicode
Context:
A<U+200B ZERO WIDTH SPACE>B
Fingerprint: /Users/johnsmith/ghostscan/testdata/invisible/single.txt:unicode/invisible:1:2
8:57PM INF scanned 1 files (6 B) in 123µs
8:57PM INF skipped 0 files (none)

- Visible evidence for invisible content: Renders hidden Unicode as strings like `<U+200B ZERO WIDTH SPACE>`.
- Focused Unicode threat coverage: Detects invisible characters, private-use Unicode, bidi controls, directional marks, mixed-script tokens, and combining marks.
- Payload-aware heuristics: Flags long hidden sequences, dense suspicious regions, and explicit payload-plus-decoder correlations while keeping standalone decoder noise out of default results.
- Context-aware severity: Uses bounded content-based file shape checks, conservative file-role hints, local finding region checks, and decoder proximity to reduce low-value invisible-character noise without downgrading bidi controls, long suspicious runs, or build and release contexts.
- Noise reduction for asset contexts: Suppresses obvious private-use glyph mappings in font-like SVG assets so icon fonts do not dominate the report.
- Safe repository traversal: Skips symlinks, binary files, oversize files, and common dependency or build directories.
- CI-friendly behavior: Uses deterministic ordering, human or JSON output, and exit codes `0`, `1`, and `2`.
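To make the first bullet concrete, here is a minimal sketch of how hidden runes could be rendered as visible markers. It is not ghostscan's implementation: the `names` table and the `renderVisible` helper are hypothetical, and the table covers only a handful of code points for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// names is a tiny illustrative table (an assumption for this sketch);
// a real scanner would cover far more code points.
var names = map[rune]string{
	0x200B: "ZERO WIDTH SPACE",
	0x200C: "ZERO WIDTH NON-JOINER",
	0x200D: "ZERO WIDTH JOINER",
	0x202E: "RIGHT-TO-LEFT OVERRIDE",
	0xFEFF: "ZERO WIDTH NO-BREAK SPACE",
}

// renderVisible replaces each listed rune with a readable <U+XXXX NAME> marker
// and copies every other rune through unchanged.
func renderVisible(s string) string {
	var b strings.Builder
	for _, r := range s {
		if name, ok := names[r]; ok {
			fmt.Fprintf(&b, "<U+%04X %s>", r, name)
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(renderVisible("A\u200bB")) // A<U+200B ZERO WIDTH SPACE>B
}
```

The output has the same shape as the `Context:` line in the sample report above.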
# Pre-built release binary
# Download the archive for your platform from:
# https://github.com/jcouture/ghostscan/releases
# Then extract it and place `ghostscan` on your PATH
# From source
git clone https://github.com/jcouture/ghostscan.git
cd ghostscan
go mod download
go run . --version
# Build a local binary
make build
./bin/ghostscan --help
# Go install
go install github.com/jcouture/ghostscan@latest
ghostscan --version

Requirements: Go `1.26.2` is pinned in `go.mod` and `mise.toml` for source builds. Pre-built release archives are produced for Linux, macOS, and Windows.
You should see `ghostscan dev (commit none)` from a plain source build, or a real tag and commit in a release build.
Projects that want structured findings without invoking the CLI can import the public engine package directly:
import (
	"context"
	"fmt"

	"github.com/jcouture/ghostscan/engine"
)

// Scan an in-memory blob and walk the structured findings.
scanner := engine.New(engine.Options{})
result, err := scanner.ScanBytesDetailed(context.Background(), "blob.js", data)
if err != nil {
	return err
}
for _, item := range result.Findings {
	fmt.Printf("%+v\n", item) // consume structured findings
}

The public engine supports:

- `ScanFile` for local files
- `ScanBytes` for in-memory blobs
- `ScanString` for string content
- deterministic `Finding` ordering through `engine.SortFindings`
The CLI remains the owner of repository walking, excludes, size limits, output formatting, and exit codes.
ghostscan [flags] [path]
path is optional; keep flags in front
Flags:
--exclude strings glob to skip; repeat as needed
--format string output format: human or json (default "human")
--max-file-size int skip files larger than this many bytes (0 = default)
-n, --no-color no ANSI paint
--no-default-excludes drop built-in excludes
--silent skip the banner
--verbose detailed finding blocks
-v, --version print version and exit
Flags must come before the optional positional path. For example, use `ghostscan --silent .`, not `ghostscan . --silent`.
# Scan the current repository
ghostscan .
# Scan a specific directory
ghostscan ./testdata/mixed
# Scan a single file
ghostscan ./testdata/invisible/single.txt
# CI-friendly output
ghostscan --silent --no-color .
# Show detailed findings
ghostscan --silent --no-color --verbose ./testdata/mixed/correlated_decoder_near_payload.js
# Add repeatable exclude globs
ghostscan --exclude "**/*.min.js" --exclude "vendor/**" .
# Disable built-in excludes and use only an explicit glob
ghostscan --no-default-excludes --exclude "**/*.gen.js" .
# Enforce a smaller max file size
ghostscan --max-file-size 1048576 .
# Emit machine-readable JSON
ghostscan --format json ./testdata/invisible/single.txt

ghostscan prints a human-readable terminal report by default and emits a single JSON document when `--format json` is selected.
In human verbose mode, each finding includes:
- file path
- line and column
- evidence with invisible Unicode rendered visibly
- local context
- rule ID
- severity (`LOW`, `MEDIUM`, `HIGH`, or `CRITICAL`)
- fingerprint
Standalone decoder and dynamic-execution markers are kept internal by default. When they appear near a hidden payload sequence, ghostscan emits a single correlated finding instead of separate decoder findings.
Verbose mode also reports exclusions during traversal:
SKIP dist/app.min.js (matched exclude: "**/*.min.js")
SKIP vendor (matched exclude: "vendor/**")
JSON output always writes one document to stdout with `tool`, `scan`, `summary`, `findings`, `skipped_files`, and `errors` keys, without ANSI color or human log lines. In JSON mode, fatal execution errors are emitted as a structured report with `errors` populated and exit code `2`.
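A consumer that only relies on the documented top-level keys might look like the sketch below. The shape of the nested values is not specified here, so the code keeps each section as raw JSON. The `parseReport` helper and the sample document are hypothetical; only the six key names come from the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// topLevelKeys lists the keys named in the README for the JSON report.
var topLevelKeys = []string{"tool", "scan", "summary", "findings", "skipped_files", "errors"}

// parseReport decodes a report without assuming the shape of nested values.
// It returns the raw top-level sections and any documented keys that are missing.
func parseReport(doc []byte) (map[string]json.RawMessage, []string, error) {
	var report map[string]json.RawMessage
	if err := json.Unmarshal(doc, &report); err != nil {
		return nil, nil, err
	}
	var missing []string
	for _, key := range topLevelKeys {
		if _, ok := report[key]; !ok {
			missing = append(missing, key)
		}
	}
	return report, missing, nil
}

func main() {
	// Hand-written placeholder document (an assumption, not real ghostscan output).
	doc := []byte(`{"tool":{},"scan":{},"summary":{},"findings":[],"skipped_files":[],"errors":[]}`)
	report, missing, err := parseReport(doc)
	if err != nil {
		fmt.Println("invalid JSON:", err)
		return
	}
	fmt.Println("sections:", len(report), "missing:", len(missing))
}
```

Keeping sections as `json.RawMessage` lets a caller decode only the parts it needs once it has checked the actual schema.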
Exit codes:
| Exit code | Description |
|---|---|
| 0 | scan completed and found no suspicious patterns |
| 1 | scan completed and found suspicious patterns |
| 2 | execution failed because of invalid input or another runtime error |
The current scanner behavior is intentionally narrow and real:
- Recursively scans a file or directory path.
- Parses flags only before the optional file or directory path.
- Does not follow symlinks.
- Treats files containing a NUL byte or recognized binary magic signature as binary and skips them.
- Uses a default max file size of `5 MiB`.
- Matches excludes against the full normalized relative path with `/` separators.
- Supports repeatable `--exclude` globs with `**` matching zero or more path segments and `filepath.Match` semantics for other segments.
- Applies built-in excludes by default: `.git/**`, `node_modules/**`, `vendor/**`, `dist/**`, `build/**`, `target/**`, `out/**`, and `coverage/**`.
- `--no-default-excludes` disables the built-in exclude set completely.
- Never executes scanned code or fetches network resources.
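The exclude semantics in the list above can be illustrated with a short sketch: `**` consumes zero or more whole path segments, and every other segment is compared with `filepath.Match`. The `matchGlob` function is a hypothetical reimplementation written for this example, not ghostscan's own matcher, and it assumes the path is already normalized to `/` separators.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// matchGlob reports whether relPath matches pattern. "**" matches zero or more
// whole path segments; any other segment uses filepath.Match semantics.
func matchGlob(pattern, relPath string) bool {
	return matchSegments(strings.Split(pattern, "/"), strings.Split(relPath, "/"))
}

func matchSegments(pat, segs []string) bool {
	if len(pat) == 0 {
		return len(segs) == 0
	}
	if pat[0] == "**" {
		// Let "**" absorb zero, one, two, ... leading segments.
		for i := 0; i <= len(segs); i++ {
			if matchSegments(pat[1:], segs[i:]) {
				return true
			}
		}
		return false
	}
	if len(segs) == 0 {
		return false
	}
	ok, err := filepath.Match(pat[0], segs[0])
	if err != nil || !ok {
		return false
	}
	return matchSegments(pat[1:], segs[1:])
}

func main() {
	fmt.Println(matchGlob("**/*.min.js", "dist/app.min.js")) // true
	fmt.Println(matchGlob("vendor/**", "vendor"))            // true
	fmt.Println(matchGlob("vendor/**", "src/vendor/a.go"))   // false
}
```

Under this reading, `vendor/**` matches the `vendor` directory itself because `**` may match zero segments. That agrees with the `SKIP vendor` line in the verbose output shown earlier.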
Every finding is assigned one of four severity levels: LOW, MEDIUM, HIGH, or CRITICAL. Severity is deterministic and context-aware — the same finding in a different file shape or placement can receive a different severity.
| Severity | Meaning |
|---|---|
| `LOW` | Suspicious but likely benign. Isolated invisible characters, non-leading single U+FEFF, and short accidental zero-width runs in prose, comments, whitespace, and data-like text. Safe to review at lower priority. |
| `MEDIUM` | Warrants investigation. Short invisible runs in executable source strings or unknown regions, private-use characters in data or prose, directional controls, and combining marks in tokens. |
| `HIGH` | Likely intentional obfuscation. Invisible characters inside identifiers, medium-length suspicious runs, private-use characters in code, bidi control characters, mixed-script tokens, and payload sequences. |
| `CRITICAL` | Strong attack signal. Long invisible or private-use runs (16+ characters), payload sequences with long runs, and any finding correlated with a nearby decode or dynamic-execution pattern. |
Severity is derived from five inputs, all computed from file content and local context:
- Sequence length: how many suspicious runes appear in the finding. Isolated characters (1) are treated differently from short runs (2–5), medium runs (6–15), long runs (16–63), and very long runs (64+). Longer sequences receive higher severity regardless of context.
- File shape: the file is classified as `code_like`, `data_like`, `prose_like`, or `unknown` based on bounded content analysis (first 64 KiB / 400 non-empty lines). Code-like files with brackets, operators, and keywords produce higher severity for the same finding than prose-like files with natural language.
- File role hints: conservative path and filename hints distinguish locale data, ordinary test source, and build or release paths. These hints are advisory only. They never suppress bidi controls, payloads, correlations, long suspicious runs, or `testdata` and fixture inputs.
- Finding region: the immediate context around each finding is classified as whitespace-like, string-like, comment-like, token-like, prose-like, or unknown. An invisible character inside an identifier (`token_like`) is more severe than one inside a comment or whitespace region.
- Decoder proximity: if a decode or dynamic-execution marker (`eval(`, `Buffer.from(`, `atob(`, etc.) appears within 5 lines of a finding, severity is escalated by one level. Markers within 20 lines escalate findings that are already `HIGH`.
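Two of the inputs above have exact thresholds, so they can be shown in code: the sequence-length buckets and the decoder-proximity escalation. The sketch below is one reading of those rules, not ghostscan's scoring code. The names `Severity`, `runBucket`, and `escalate` are hypothetical, and the sketch does not model how the other inputs or per-rule logic combine with them.

```go
package main

import "fmt"

// Severity mirrors the four documented levels, ordered from lowest to highest.
type Severity int

const (
	Low Severity = iota
	Medium
	High
	Critical
)

func (s Severity) String() string {
	return [...]string{"LOW", "MEDIUM", "HIGH", "CRITICAL"}[s]
}

// runBucket names the sequence-length bucket for n suspicious runes.
func runBucket(n int) string {
	switch {
	case n <= 1:
		return "isolated"
	case n <= 5:
		return "short"
	case n <= 15:
		return "medium"
	case n <= 63:
		return "long"
	default:
		return "very long"
	}
}

// escalate applies the decoder-proximity rule to a base severity.
// distance is the number of lines between the finding and the nearest marker;
// a negative distance means no marker was found.
func escalate(base Severity, distance int) Severity {
	switch {
	case distance < 0:
		return base
	case distance <= 5 && base < Critical:
		return base + 1 // within 5 lines: one level up, capped at CRITICAL
	case distance <= 20 && base == High:
		return Critical // within 20 lines: only HIGH findings are escalated
	default:
		return base
	}
}

func main() {
	fmt.Println(runBucket(1), runBucket(16), runBucket(64))
	fmt.Println(escalate(Medium, 3), escalate(High, 12), escalate(Low, 12))
}
```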
| Rule | Base severity logic |
|---|---|
| `unicode/bidi` | Always HIGH. Bidi controls are never downgraded by context, comments, prose, or path hints. |
| `unicode/invisible` | Ranges from LOW to CRITICAL depending on sequence length, file shape, file role, and region. A file-start BOM is suppressed. A single non-leading U+FEFF is still reported but defaults to LOW; isolated characters in identifiers are HIGH; long runs are CRITICAL. |
| `unicode/private-use` | CRITICAL for long runs, HIGH for short/medium runs and code-like token regions, MEDIUM in prose or data contexts. |
| `unicode/payload` | HIGH for normal sequences, CRITICAL for long runs. |
| `unicode/correlation` | Always CRITICAL. A payload near a decoder is the strongest signal. |
| `unicode/mixed-script` | HIGH. Mixing Latin with Cyrillic or Greek in identifiers is a known attack vector. |
| `unicode/directional-control` | MEDIUM. Directional marks are less dangerous than full bidi overrides. |
| `unicode/combining-mark` | MEDIUM. Combining marks in token-like text are unusual but not as high-signal as invisible characters. |
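As an illustration of the `unicode/mixed-script` row, the sketch below flags a token that mixes Latin letters with Cyrillic or Greek ones, using the script tables in Go's standard `unicode` package. The `isMixedScript` helper is hypothetical, and ghostscan's actual rule may apply further conditions that are not described here.

```go
package main

import (
	"fmt"
	"unicode"
)

// isMixedScript reports whether token contains Latin letters together with
// Cyrillic or Greek letters. Digits, underscores, and other runes are ignored.
func isMixedScript(token string) bool {
	var latin, other bool
	for _, r := range token {
		switch {
		case unicode.Is(unicode.Latin, r):
			latin = true
		case unicode.Is(unicode.Cyrillic, r), unicode.Is(unicode.Greek, r):
			other = true
		}
	}
	return latin && other
}

func main() {
	fmt.Println(isMixedScript("paypal"))      // false
	fmt.Println(isMixedScript("p\u0430ypal")) // true: U+0430 is CYRILLIC SMALL LETTER A
}
```

The second token looks identical to the first in most fonts, which is why this pattern is treated as high-signal in identifiers.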
U+FEFF at byte offset 0 is treated as a normal file BOM and is not reported. Everywhere else, U+FEFF is still detected.
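The BOM rule above amounts to skipping exactly one position. A minimal sketch, using a hypothetical helper named `reportableBOMs`, returns the byte offsets of every U+FEFF except one at offset 0:

```go
package main

import "fmt"

const bom = '\uFEFF'

// reportableBOMs returns the byte offsets of every U+FEFF in content,
// skipping one that sits at byte offset 0 (a normal file BOM).
func reportableBOMs(content string) []int {
	var offsets []int
	for i, r := range content { // i is the byte offset of each rune
		if r == bom && i != 0 {
			offsets = append(offsets, i)
		}
	}
	return offsets
}

func main() {
	fmt.Println(reportableBOMs("\uFEFFA\uFEFFB")) // [4]
	fmt.Println(reportableBOMs("A\uFEFFB"))       // [1]
}
```

U+FEFF takes three bytes in UTF-8, so in the first example the second occurrence starts at byte offset 4.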
ghostscan treats isolated and very short invisible-character findings differently from payload-like runs:
- isolated invisible characters default to `LOW` unless they appear inside a token-like region or are elevated by nearby decode/execute markers
- short runs in prose-like, comment-like, whitespace-like, and data-like contexts default to `LOW`
- low-signal invisible findings may be suppressed in ordinary test source only when they appear in benign string, comment, whitespace, or prose contexts with no nearby decode, execution, shell, or build markers
- build, release, packaging, CI, shell, and parser-sensitive fixture inputs are not softened by test-like path hints alone
- short runs in code-like strings or unknown regions stay visible and usually land at `MEDIUM`
- token-like invisible findings remain `HIGH`
- long invisible runs and payload findings stay strong regardless of surrounding file shape
I downloaded ghostscan on macOS and it is blocked by Gatekeeper. What should I do?
Remove the quarantine attribute from the binary:
xattr -d com.apple.quarantine ghostscan

Does ghostscan run or decode the code it scans?
No. It only performs static checks on file contents.
Can I scan a single file instead of a whole repository?
Yes. Pass the file path directly to ghostscan.
See LICENSE for details.