Markdown Converter
Agent skill for markdown-converter
This guide shows you how to create and register custom prompts for extracting specific information from scientific papers using the RAG pipeline.
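NERxiv uses a PROMPT_REGISTRY to manage different extraction tasks. Each entry is a PromptRegistryEntry that contains a retriever_query, used to select the relevant chunks of the paper, and a prompt, the Prompt or StructuredPrompt object passed to the LLM.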
Anatomy of Prompt
A Prompt consists of several components:
    from nerxiv.prompts.prompts import Prompt, Example

    prompt = Prompt(
        expert="Condensed Matter Physics",
        main_instruction="identify all mentions of computational methods",
        secondary_instructions=[
            "Look for abbreviations like DFT, DMFT, QMC",
            "Include full names of methods mentioned",
            "Ignore methods used only as references",
        ],
        constraints=[
            "Return only method names, one per line",
            "No additional explanation or thinking block",
        ],
        examples=[
            Example(
                input="We use DFT+DMFT to calculate the electronic structure.",
                output="DFT+DMFT",
            ),
            Example(
                input="The results are compared with Quantum Monte Carlo simulations.",
                output="Quantum Monte Carlo",
            ),
        ],
    )
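To make the role of each component concrete, here is a minimal sketch of how such fields could be assembled into a single prompt string. SketchPrompt, SketchExample, and the template wording are hypothetical stand-ins written for illustration; they are not NERxiv's Prompt class, whose actual template may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SketchExample:
    """Hypothetical stand-in for an input/output example pair."""
    input: str
    output: str


@dataclass
class SketchPrompt:
    """Hypothetical stand-in; NOT the NERxiv Prompt implementation."""
    expert: str
    main_instruction: str
    secondary_instructions: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    examples: list[SketchExample] = field(default_factory=list)

    def build(self, text: str) -> str:
        """Join all components into one prompt string (assumed layout)."""
        lines = [
            f"You are an expert in {self.expert}.",
            f"Your task is to {self.main_instruction}.",
        ]
        lines += [f"- {item}" for item in self.secondary_instructions]
        if self.constraints:
            lines.append("Constraints:")
            lines += [f"- {item}" for item in self.constraints]
        for ex in self.examples:
            lines.append(f"Example input: {ex.input}")
            lines.append(f"Example output: {ex.output}")
        lines.append(f"Text: {text}")
        return "\n".join(lines)


sketch = SketchPrompt(
    expert="Condensed Matter Physics",
    main_instruction="identify all mentions of computational methods",
    constraints=["Return only method names, one per line"],
    examples=[SketchExample("We use DFT+DMFT.", "DFT+DMFT")],
)
built = sketch.build(text="We employ QMC.")
print(built)
```

Printing the assembled string like this is a quick way to see what the model would receive under the assumed layout: the role, the task, the rules, the worked examples, and finally the text to analyze.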
Let's create a prompt to extract author affiliations.
Create a new file my_prompts.py:
    from nerxiv.prompts.prompts import Prompt, PromptRegistryEntry, Example

    # Define the prompt
    affiliation_prompt = Prompt(
        expert="Scientific Text Analysis",
        main_instruction="extract all institutional affiliations of the authors",
        secondary_instructions=[
            "Look for university names, research institutes, and companies",
            "Include department names if mentioned",
            "Look near author names or in footnotes",
        ],
        constraints=[
            "Return each affiliation on a separate line",
            "Use the full institution name",
            "Do not include author names",
        ],
        examples=[
            Example(
                input="John Doe¹ and Jane Smith² — ¹MIT, Cambridge, MA — ²Stanford University",
                output="MIT, Cambridge, MA\nStanford University",
            ),
            Example(
                input="Authors from the Department of Physics, University of Tokyo",
                output="Department of Physics, University of Tokyo",
            ),
        ],
    )

    # Define the registry entry
    AFFILIATION_ENTRY = PromptRegistryEntry(
        retriever_query="Find sections mentioning authors, affiliations, institutions, or university names",
        prompt=affiliation_prompt,
    )
Add your prompt to the registry:
    from nerxiv.prompts import PROMPT_REGISTRY

    # Import the entry defined in my_prompts.py
    from my_prompts import AFFILIATION_ENTRY

    # Add to registry
    PROMPT_REGISTRY["affiliations"] = AFFILIATION_ENTRY
Then run the extraction from the command line, passing the registry key as the query:
    nerxiv prompt \
        --file-path paper.hdf5 \
        --query affiliations \
        --model llama3.1:70b
StructuredPrompt
For structured output (JSON), use StructuredPrompt instead:
    from nerxiv.prompts import PROMPT_REGISTRY
    from nerxiv.prompts.prompts import StructuredPrompt, PromptRegistryEntry, Example
    from pydantic import BaseModel, Field


    class Affiliations(BaseModel):
        items: list[str] = Field([], description="A list of affiliations of the authors")


    # Create structured prompt
    affiliation_prompt = StructuredPrompt(
        expert="Scientific Text Analysis",
        output_schema=Affiliations,
        target_fields=["items"],
        constraints=[
            "Return each affiliation as an element of the list `items`",
            "Use the full institution name",
            "Do not include author names",
        ],
        examples=[
            Example(
                input="John Doe¹ and Jane Smith² — ¹MIT, Cambridge, MA — ²Stanford University",
                output='```json\n{\n\t"affiliations": {\n\t\t"items": ["MIT, Cambridge, MA", "Stanford University"]\n\t}\n}\n```',
            ),
            Example(
                input="Authors from the Department of Physics, University of Tokyo",
                output='```json\n{\n\t"affiliations": {\n\t\t"items": ["Department of Physics, University of Tokyo"]\n\t}\n}\n```',
            ),
        ],
    )

    # Register it
    PROMPT_REGISTRY["affiliations"] = PromptRegistryEntry(
        retriever_query="Find sections mentioning authors, affiliations, institutions, or university names",
        prompt=affiliation_prompt,
    )
A StructuredPrompt needs less free text than a Prompt (see Anatomy of Prompt): in the example above, output_schema and target_fields take the place of main_instruction and secondary_instructions.
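Because the example outputs above wrap the JSON in a fenced block, downstream code usually has to unwrap and parse the answer before using it. The following is a minimal sketch using only the standard library; parse_items is a hypothetical helper, not part of NERxiv, and it assumes the answer follows the "affiliations" / "items" layout shown in the examples.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so the snippet is easy to embed


def parse_items(raw: str, key: str = "affiliations") -> list[str]:
    """Extract the `items` list from a possibly fenced JSON answer."""
    # Prefer the content of a fenced block; otherwise try the raw answer.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, flags=re.DOTALL)
    payload = match.group(1) if match else raw
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []  # malformed answer: treat as nothing extracted
    section = data.get(key) if isinstance(data, dict) else None
    items = section.get("items", []) if isinstance(section, dict) else []
    if not isinstance(items, list):
        return []
    # Keep only string entries, mirroring the list[str] schema.
    return [item for item in items if isinstance(item, str)]


answer = (
    FENCE
    + 'json\n{"affiliations": {"items": ["MIT, Cambridge, MA", "Stanford University"]}}\n'
    + FENCE
)
parsed = parse_items(answer)
print(parsed)
```

Returning an empty list on malformed input is one possible design choice; raising an error instead may be preferable when silent failures would be hard to notice.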
Bad:
    main_instruction="get the methods"
Good:
    main_instruction="identify all computational and experimental methods used in the study",
    secondary_instructions=[
        "Include both acronyms (e.g., DFT) and full names",
        "Distinguish between primary methods used and methods mentioned for comparison",
        "Look in the methods section, introduction, and results",
    ]
Include edge cases:
    examples=[
        # Simple case
        Example(
            input="We use DFT for electronic structure calculations.",
            output="DFT",
        ),
        # Multiple methods
        Example(
            input="The material is studied using DFT, DMFT, and Quantum Monte Carlo.",
            output="DFT\nDMFT\nQuantum Monte Carlo",
        ),
        # Method mentioned but not used
        Example(
            input="Our DFT results differ from previous DMFT studies on similar systems.",
            output="DFT",
        ),
        # Abbreviation and full name
        Example(
            input="We employ density functional theory (DFT) for the calculations.",
            output="DFT | density functional theory",
        ),
    ]
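The same edge cases can double as a small regression suite when you iterate on a prompt. The sketch below is one way to do that; failing_cases and fake_extract are hypothetical names, and fake_extract is a naive keyword matcher standing in for a real model call so the snippet runs without an LLM.

```python
from typing import Callable

# (input text, expected output) pairs taken from the examples above
CASES = [
    ("We use DFT for electronic structure calculations.", "DFT"),
    (
        "The material is studied using DFT, DMFT, and Quantum Monte Carlo.",
        "DFT\nDMFT\nQuantum Monte Carlo",
    ),
    ("Our DFT results differ from previous DMFT studies on similar systems.", "DFT"),
]


def failing_cases(
    extract: Callable[[str], str], cases: list[tuple[str, str]]
) -> list[tuple[str, str, str]]:
    """Return (text, expected, got) for every case the extractor gets wrong."""
    failures = []
    for text, expected in cases:
        got = extract(text).strip()
        if got != expected:
            failures.append((text, expected, got))
    return failures


def fake_extract(text: str) -> str:
    """Hypothetical stand-in for an LLM call: lists every known name found."""
    known = ["DFT", "DMFT", "Quantum Monte Carlo"]
    return "\n".join(name for name in known if name in text)


failures = failing_cases(fake_extract, CASES)
for text, expected, got in failures:
    print(f"FAILED: {text!r}\n  expected {expected!r}\n  got      {got!r}")
```

The naive matcher passes the simple and multiple-method cases but fails the "mentioned but not used" case, which is the kind of behavior such edge-case examples are meant to expose.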
Guide the output format:
    constraints=[
        "Return only the extracted information, no explanations",
        "Use pipe | to separate alternative names for the same entity",
        "Return 'None' if no relevant information is found",
        "Do not include thinking process or reasoning",
    ]
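A fixed output format pays off when the answer is post-processed. As a sketch, assuming the model follows the constraints above (one entity per line, pipes between alternative names, 'None' for no result), the answer could be parsed as follows; parse_answer is a hypothetical helper, not a NERxiv function.

```python
def parse_answer(answer: str) -> list[list[str]]:
    """Turn a line-based answer into a list of entities with their aliases."""
    entities = []
    for line in answer.splitlines():
        line = line.strip()
        # Skip blank lines and the explicit "nothing found" marker.
        if not line or line.lower() == "none":
            continue
        # Alternative names for the same entity are separated by pipes.
        aliases = [part.strip() for part in line.split("|") if part.strip()]
        if aliases:
            entities.append(aliases)
    return entities


parsed = parse_answer("DFT | density functional theory\nQuantum Monte Carlo")
print(parsed)
```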
Make it specific:
    # Too broad
    retriever_query="Find relevant information"

    # Better
    retriever_query="Identify paragraphs describing computational methods, software packages, and simulation parameters"
Test your prompt on sample text:
    from nerxiv.rag import LLMGenerator

    # Sample text
    text = """
    The calculations were performed using VASP version 6.3. The plane-wave cutoff
    energy was set to 520 eV, and the Brillouin zone was sampled with a 6x6x6
    Monkhorst-Pack k-point grid. All calculations were run on a workstation with
    2x Intel Xeon CPUs and 128 GB RAM.
    """

    # Generate answer
    # Note: computational_prompt is not defined in this guide; it stands for a
    # Prompt you have created beforehand (see Anatomy of Prompt).
    generator = LLMGenerator(model="llama3.1:8b", text=text, temperature=0.2)
    prompt_text = computational_prompt.build(text=text)
    answer = generator.generate(prompt=prompt_text)

    print("Extracted computational details:")
    print(answer)
If your prompt doesn't work well, first check which text the retriever passes to the model:
    from nerxiv.chunker import Chunker
    from nerxiv.prompts import PROMPT_REGISTRY
    from nerxiv.rag import CustomRetriever

    # paper_text is the full text of the paper, loaded beforehand
    chunker = Chunker(text=paper_text)
    chunks = chunker.chunk_text()

    retriever = CustomRetriever(
        query=PROMPT_REGISTRY["your_query"].retriever_query
    )
    top_text = retriever.get_relevant_chunks(chunks, n_top_chunks=5)

    print("Retrieved text:")
    print(top_text)
If the retrieved text doesn't contain what you need, adjust the chunking and retriever parameters.
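To decide which of the two to adjust, it can help to check where the expected information actually sits. The sketch below is a hypothetical helper, not part of NERxiv, and it assumes that chunks are available as plain strings and that the retrieved text is a single string; it classifies each expected term as retrieved, present in some chunk but not retrieved, or absent from all chunks.

```python
def locate_terms(chunks: list[str], retrieved: str, terms: list[str]) -> dict[str, str]:
    """Classify each expected term by where it appears (case-insensitive)."""
    report = {}
    retrieved_lower = retrieved.lower()
    for term in terms:
        needle = term.lower()
        if needle in retrieved_lower:
            report[term] = "retrieved"
        elif any(needle in chunk.lower() for chunk in chunks):
            report[term] = "in a chunk, but not retrieved"
        else:
            report[term] = "not in any chunk"
    return report


sample_chunks = [
    "The calculations were performed using VASP version 6.3.",
    "This work was supported by a research grant.",
    "The plane-wave cutoff energy was set to 520 eV.",
]
sample_retrieved = sample_chunks[0]
report = locate_terms(sample_chunks, sample_retrieved, ["VASP", "520 eV", "k-point"])
for term, status in report.items():
    print(f"{term}: {status}")
```

A term that is in a chunk but not retrieved points toward the retriever query or the number of top chunks, while a term found in no chunk suggests that it was split across chunk boundaries or is missing from the extracted text.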
You can also adjust the model temperature:
    # Very deterministic
    nerxiv prompt --file-path paper.hdf5 --query your_query -llmo temperature=0.1

    # More creative
    nerxiv prompt --file-path paper.hdf5 --query your_query -llmo temperature=0.5
If the model output format is inconsistent, add more examples showing the exact format you want.
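One way to quantify how consistent the output is, before and after adding examples or changing the temperature, is to repeat the same generation several times and measure how often the most common answer appears. This is a sketch under stated assumptions: agreement is a hypothetical helper, and the generate argument stands for any zero-argument callable that wraps your model call; here a canned list of answers replaces the LLM so the snippet runs on its own.

```python
from collections import Counter
from typing import Callable


def agreement(generate: Callable[[], str], n_runs: int = 5) -> tuple[str, float]:
    """Return the most common answer and the fraction of runs that produced it."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    # Strip surrounding whitespace so trivially different answers count as equal.
    answers = Counter(generate().strip() for _ in range(n_runs))
    top_answer, count = answers.most_common(1)[0]
    return top_answer, count / n_runs


# Canned answers standing in for repeated model calls (hypothetical data).
canned = iter(["DFT", "DFT", "dft", "DFT", "DFT "])
top, share = agreement(lambda: next(canned), n_runs=5)
print(f"Most common answer: {top!r} ({share:.0%} of runs)")
```

A low share suggests that the format is under-specified; comparing the share across prompt revisions gives a rough signal of whether a change helped.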
To share prompts with others:
    # custom_prompts.py
    """
    Custom prompts for NERxiv

    Usage:
        from custom_prompts import register_custom_prompts
        register_custom_prompts()

        # Then use normally
        nerxiv prompt --file-path paper.hdf5 --query my_custom_query
    """
    from nerxiv.prompts import PROMPT_REGISTRY
    from nerxiv.prompts.prompts import Prompt, PromptRegistryEntry, Example


    def register_custom_prompts():
        """Register all custom prompts to the global registry"""
        # Add your prompts here
        PROMPT_REGISTRY["custom_query"] = PromptRegistryEntry(
            retriever_query="...",
            prompt=Prompt(...)
        )

        print(f"Registered {len(PROMPT_REGISTRY)} prompts")


    # Auto-register on import
    register_custom_prompts()
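When several people share prompt modules, two of them may pick the same registry key, and a plain assignment replaces the earlier entry without warning if the registry behaves like an ordinary dict, as the assignments in this guide suggest. The sketch below shows one possible guard; register is a hypothetical helper, and a plain dict stands in for PROMPT_REGISTRY so the snippet runs without NERxiv installed.

```python
def register(registry: dict, name: str, entry: object, overwrite: bool = False) -> None:
    """Add an entry under `name`, refusing to replace an existing one by default."""
    if name in registry and not overwrite:
        raise KeyError(
            f"Prompt '{name}' is already registered; pass overwrite=True to replace it"
        )
    registry[name] = entry


# A plain dict standing in for the shared registry (assumption for illustration).
demo_registry: dict = {}
register(demo_registry, "custom_query", "first entry")

try:
    register(demo_registry, "custom_query", "second entry")
    collision_detected = False
except KeyError:
    collision_detected = True

# An explicit overwrite is still possible when it is intended.
register(demo_registry, "custom_query", "second entry", overwrite=True)
print(collision_detected, demo_registry)
```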