Version: 1.0.0-RC2

Status: Production Ready (Hybrid Architecture)
SenTient is a next-generation Entity Reconciliation and Relation Extraction engine designed to bridge the gap between messy, unstructured text and structured Knowledge Graphs (Wikidata/Wikibase).
The core philosophy is a Hybrid Orchestration System that combines three distinct technological lineages into a single "Funnel" pipeline to achieve high performance and accuracy:
3000) is decoupled from the Java Core (Jetty on Port 3333) and communicates via a REST-like Command Pattern API.

The system operates on a "Funnel" logic: broad and fast at the top, narrow and precise at the bottom. The unit of work is the SmartCell object, which acts as the immutable contract across all layers.
index_solr + core_java (Clustering).

popularity_score (< 100) or those matching stop words are immediately pruned.

nlp_falcon (Python 3.9+, Flask, SBERT).

falcon_extended_en.txt) and N-Gram generation to clean the signal.

sentient_properties_v1 ElasticSearch index to infer the most likely Wikidata Property (Predicate), boosting or penalizing entity candidates accordingly.

all-MiniLM-L6-v2). It then calculates Cosine Similarity between the input vector and candidate description vectors fetched from the sentient_entities_fallback Elastic index.

core_java (Java 17, Jetty 10, Butterfly Framework).

ReconcileCommand), which launches a non-blocking LongRunningProcess managed by the ProcessManager (utilizing a ThreadPoolExecutorAdapter). The Frontend polls for progress.

DuckDBStore.insertBatch() to persist heavy vectors to disk and flags the in-memory Cell as RECONCILED (lightweight state).

AbstractOperation Command Pattern, ensuring that state can be restored by re-applying the History log upon server crash.

The QA Strategy relies on three pillars to statistically prove system improvement over time.
Scrutinizers are "Linting Rules for Data" located in config/qa/scrutinizer_rules.yaml. They run in the Java Core before export.
MATCHED cell has a null QID).

WARNING.

Accuracy is measured against ground truth using industry-standard datasets:
The evaluate_falcon_api.py script runs the full pipeline.
| Metric | Target (v1.0) | Acceptable Range |
|---|---|---|
| Precision | 0.85 | > 0.80 |
| Recall | 0.82 | > 0.75 |
| F-Score | 0.83 | > 0.78 |
| Latency (p95) | 200ms | < 500ms |
Deployment Rule: If Precision drops by > 2% after a model update (e.g., SBERT or Solr FST index), the deployment is rejected.
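
The metrics and the deployment rule above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of `evaluate_falcon_api.py`: the names `Scores`, `score`, and `deployment_allowed` are hypothetical, the F-Score is assumed to be the balanced F1, and the "2%" drop is assumed to mean two absolute precision points (0.02) rather than a relative change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f_score: float


def score(true_positives: int, false_positives: int, false_negatives: int) -> Scores:
    """Compute precision, recall, and balanced F-score from match counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f_score)


def deployment_allowed(baseline_precision: float,
                       candidate_precision: float,
                       max_drop: float = 0.02) -> bool:
    """Reject when precision falls by more than max_drop.

    Assumption: the drop is measured in absolute precision points.
    Rounding keeps float noise from rejecting a drop of exactly 0.02.
    """
    drop = round(baseline_precision - candidate_precision, 6)
    return drop <= max_drop
```

For example, `deployment_allowed(0.85, 0.84)` passes the gate, while `deployment_allowed(0.85, 0.80)` does not. If the rule is meant as a relative change, the comparison would divide the drop by the baseline instead.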
The SmartCell is the immutable data contract defined in schemas/data/smart_cell.json.
| Logical Field | JSON Type | Java Type | Python Type | Description |
|---|---|---|---|---|
| raw_value | String | String | str | Original user input (never modified) |
| status | Enum (String) | Recon.Judgment | str | Current lifecycle state (NEW, PENDING, MATCHED, etc.) |
| consensus_score | Float | float (transient) | float | Final calculated confidence (0.0 to 1.0) |
| match | Candidate Obj | ReconCandidate | dict | The single winning entity (if reconciled) |
| vector | Array<Float> | double[] | np.ndarray | SBERT embedding payload |
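
To make the "immutable contract" idea concrete, here is a rough Python sketch of the fields in the table. It does not reproduce `schemas/data/smart_cell.json`, whose contents are not shown here. The helper `with_match` and the `qid` key in the candidate dict are hypothetical, and the vector is held as a tuple rather than `np.ndarray` only to keep the example dependency-free.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class SmartCell:
    """Illustrative mirror of the logical fields listed in the table."""
    raw_value: str                                 # original user input, never modified
    status: str = "NEW"                            # lifecycle state
    consensus_score: float = 0.0                   # confidence in [0.0, 1.0]
    match: Optional[dict] = None                   # winning candidate, if any
    vector: Optional[Tuple[float, ...]] = None     # embedding payload

    def __post_init__(self) -> None:
        if not 0.0 <= self.consensus_score <= 1.0:
            raise ValueError("consensus_score must be between 0.0 and 1.0")


def with_match(cell: SmartCell, candidate: dict, score: float) -> SmartCell:
    """Return a new cell marked MATCHED; the input cell is left untouched."""
    return replace(cell, status="MATCHED", match=candidate, consensus_score=score)


cell = SmartCell(raw_value="Douglas Adams")
matched = with_match(cell, {"qid": "Q42"}, 0.93)   # "qid" is a hypothetical key
```

Because the dataclass is frozen, each layer would produce a new `SmartCell` instead of mutating the one it received, and `raw_value` carries through every copy unchanged.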
features)

The Candidate object contains a features object used for UI visualization and debugging. The Frontend renders a stacked bar chart based on these weights:
- tapioca_popularity (Solr Log-Likelihood).
- falcon_context (Cosine Similarity from SBERT).
- levenshtein_distance (Normalized string distance from Java Core).

All services are bound strictly to 127.0.0.1 for security.
| Service | Port | Protocol | Timeout |
|---|---|---|---|
| Java Core (Orchestrator) | 3333 | HTTP/1.1 | - |
| Falcon (Python) | 5005 | HTTP/1.1 | 120s (Throttled) |
| Solr (Tapioca) | 8983 | HTTP/2 | 500ms (Strict) |
| ElasticSearch | 9200 | HTTP/TCP | - |
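
As a sketch only, the table above could be mirrored in client code as a small registry. The `Service` class, the `SERVICES` keys, and the plain `http://` scheme are assumptions made for illustration; the actual layout of the configuration files is not shown in this document.

```python
from dataclasses import dataclass
from typing import Optional

HOST = "127.0.0.1"  # every service is bound to loopback


@dataclass(frozen=True)
class Service:
    port: int
    timeout_seconds: Optional[float]  # None where the table lists no timeout

    @property
    def base_url(self) -> str:
        # Assumption: plain HTTP on loopback for all services.
        return f"http://{HOST}:{self.port}"


# Ports and timeouts copied from the table; the keys are hypothetical names.
SERVICES = {
    "core_java": Service(port=3333, timeout_seconds=None),
    "falcon": Service(port=5005, timeout_seconds=120.0),
    "solr": Service(port=8983, timeout_seconds=0.5),
    "elasticsearch": Service(port=9200, timeout_seconds=None),
}
```

Keeping the strict 500ms Solr timeout next to the much looser Falcon timeout in one place makes the contrast between the fast top of the funnel and the slower semantic layer easy to see.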
The central configuration file is config/orchestration/environment.json; the remaining configuration files live elsewhere in the config/ directory.